Storme-bit 44989a2b8b documentation updated for model inference settings

2026-04-18 06:41:50 -07:00

Inference Service

Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC (192.168.0.79)
Port: 3001

Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.

Dependencies

express — HTTP API
ollama — Ollama client (used by the Ollama provider, kept as fallback)
dotenv — environment variable loading
@nexusai/shared — shared utilities

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |

INFERENCE_URL points to llama-server directly (port 8080), not to this service. The orchestration service uses INFERENCE_SERVICE_URL to reach this service on port 3001.
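
As a rough illustration, the variables and defaults above could be resolved at startup along these lines. This is a minimal sketch: `loadConfig` is a hypothetical helper name, not something taken from the service's source, and it assumes dotenv (or the shell) has already populated `process.env`.

```javascript
// Hypothetical helper: resolve the service configuration from an env-like
// object, applying the documented defaults for missing or empty values.
function loadConfig(env = process.env) {
  return {
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || "llamacpp",
    // URL of the inference runtime (llama-server), not of this service.
    inferenceUrl: env.INFERENCE_URL || "http://localhost:8080",
    defaultModel: env.DEFAULT_MODEL || "local-model",
  };
}
```

Using `||` rather than `??` means an empty entry such as `PORT=` in .env also falls back to the default; whether the real service behaves that way is an assumption.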

Provider Architecture

The active provider is selected at startup via INFERENCE_PROVIDER and loaded from src/providers/. Both providers expose identical function signatures.

Supported Providers

| Provider | Value | Runtime |
|---|---|---|
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API), current default |
| Ollama | `ollama` | Ollama via the `ollama` npm package, available as fallback |

Switching providers requires only a .env change — no code modifications:

INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080

The provider loader throws immediately on an unknown value, preventing silent misconfiguration.
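
The fail-fast selection described above can be sketched as follows. The two registry entries are hypothetical stand-ins for src/providers/llamacpp.js and src/providers/ollama.js, and the `complete` function name is an assumption; the real modules talk to their runtimes and may export different names.

```javascript
// Hypothetical stand-ins for the real provider modules. Both expose the
// same function signature, so callers never need to know which is active.
const providers = new Map([
  ["llamacpp", { name: "llamacpp", complete: async (prompt) => `[llamacpp] ${prompt}` }],
  ["ollama", { name: "ollama", complete: async (prompt) => `[ollama] ${prompt}` }],
]);

// Select the active provider by name, throwing immediately on an unknown
// value so a typo in .env cannot silently fall through to a default.
function loadProvider(name) {
  const provider = providers.get(name);
  if (!provider) {
    const known = [...providers.keys()].join(", ");
    throw new Error(`Unknown INFERENCE_PROVIDER "${name}" (expected one of: ${known})`);
  }
  return provider;
}
```

At startup this would be called once, for example as `loadProvider(process.env.INFERENCE_PROVIDER || "llamacpp")`, and the result re-exported for the route handlers.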

LM Studio compatibility note: LM Studio exposes an OpenAI-compatible /v1/chat/completions endpoint with the same request shape as llama.cpp. A future lmstudio.js provider would be nearly identical to llamacpp.js — only the BASE_URL would differ. No architectural changes required.

Internal Structure

src/
├── providers/
│   ├── ollama.js      # Ollama provider
│   └── llamacpp.js    # llama.cpp provider (OpenAI-compatible REST)
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions

llama.cpp Provider

Uses the OpenAI-compatible REST API exposed by llama-server.
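
For orientation, a non-streaming call against that API might look like the sketch below. The request and response shapes follow the standard OpenAI chat completions format; `buildChatRequest` and `complete` are hypothetical names, not the provider's actual exports, and global `fetch` assumes Node 18 or newer.

```javascript
// Build the URL and fetch() options for a non-streaming chat completion.
// Hypothetical helper; `params` holds API-level fields such as temperature.
function buildChatRequest(baseUrl, model, messages, params = {}) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, stream: false, ...params }),
    },
  };
}

// Send the request and return the assistant's text.
async function complete(baseUrl, model, messages, params) {
  const { url, init } = buildChatRequest(baseUrl, model, messages, params);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`llama-server responded with HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```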

Starting llama-server

llama-server must be started manually on the main PC before the inference service can handle requests:

.\llama-gpu\llama-server.exe `
  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
  -ngl 99 `
  --reasoning off `
  --host 0.0.0.0 `
  --port 8080 `
  -c 64000

| Flag | Description |
|---|---|
| `-ngl 99` | Offload as many layers as possible to GPU |
| `--reasoning off` | Disables thinking delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows LAN connections |
| `-c 64000` | Context window size in tokens |

-c 64000 is intentionally large. Because NexusAI's memory architecture handles context injection, a 6–8K context is often sufficient, so -c can be lowered if VRAM pressure builds.

Model Naming

The model name in requests must match the name reported by llama-server exactly, including the .gguf extension. To see the reported name, query the /v1/models endpoint:

Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"

Set DEFAULT_MODEL in .env to the exact reported name.
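
The same check can be done from Node. The sketch below assumes the OpenAI-style list shape for /v1/models (a `data` array whose entries carry an `id`); `listModelIds` and `assertModelAvailable` are hypothetical helpers, not part of the service.

```javascript
// Extract the model ids from an OpenAI-style /v1/models response body.
function listModelIds(body) {
  return (body?.data ?? []).map((model) => model.id);
}

// Hypothetical startup check: fail early if the configured name does not
// match a reported name exactly. `fetchImpl` is injectable for testing.
async function assertModelAvailable(baseUrl, modelName, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/v1/models`);
  if (!res.ok) throw new Error(`GET /v1/models failed with HTTP ${res.status}`);
  const ids = listModelIds(await res.json());
  if (!ids.includes(modelName)) {
    throw new Error(`Model "${modelName}" not reported by llama-server (found: ${ids.join(", ")})`);
  }
  return ids;
}
```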

Inference Parameters

All parameters are resolved in resolveOptions(), which falls back to INFERENCE_DEFAULTS from @nexusai/shared for any option not provided in the request. In normal usage, orchestration reads these from settings.json and forwards them on every request.

| NexusAI option | API field | Default | Description |
|---|---|---|---|
| `temperature` | `temperature` | 0.7 | Response randomness (0 = deterministic) |
| `maxTokens` | `max_tokens` | 1024 | Max tokens to generate |
| `topP` | `top_p` | 0.9 | Nucleus sampling probability mass |
| `topK` | `top_k` | 40 | Top-K token candidates per step |
| `repeatPenalty` | `repeat_penalty` | 1.1 | Penalty for recently used tokens |
| `seed` | `seed` | null | null = random; integer for reproducible output |
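
The fallback and field mapping can be sketched as below. This is not the actual resolveOptions() source: `INFERENCE_DEFAULTS` here is a local stand-in populated from the table, and omitting a null seed from the request body is an assumption about how "random" is expressed to the runtime.

```javascript
// Local stand-in for INFERENCE_DEFAULTS from @nexusai/shared (table values).
const INFERENCE_DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

// Sketch of resolveOptions(): request values win, defaults fill the gaps,
// and the result uses the API field names expected by the runtime.
function resolveOptions(options = {}) {
  const merged = { ...INFERENCE_DEFAULTS };
  for (const key of Object.keys(INFERENCE_DEFAULTS)) {
    // Compare against undefined so that a legitimate 0 is not discarded.
    if (options[key] !== undefined) merged[key] = options[key];
  }
  const body = {
    temperature: merged.temperature,
    max_tokens: merged.maxTokens,
    top_p: merged.topP,
    top_k: merged.topK,
    repeat_penalty: merged.repeatPenalty,
  };
  // Assumption: a null seed is left out so the runtime picks a random one.
  if (merged.seed !== null) body.seed = merged.seed;
  return body;
}
```

Checking against `undefined` instead of using `||` matters for `temperature: 0`, which is a valid request for deterministic output and must not be replaced by the 0.7 default.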

Streaming Response Format

The llama.cpp provider yields chunks in this shape:

{ response: "token text", done: false }
// final chunk:
{ response: "", done: true, model: "model-name.gguf", tokenCount: 42 }

The inference route re-emits as SSE:

data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
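
A minimal sketch of that re-emission, assuming the provider hands the route an async iterable of chunks in the shape shown earlier. `toSseEvents` and `streamToResponse` are hypothetical names, and `res` is any Node/Express response object.

```javascript
// Convert one provider chunk into the SSE frames the route writes out.
// Each frame is a "data:" line terminated by a blank line, per the SSE format.
function toSseEvents(chunk) {
  if (!chunk.done) {
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  const final = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
  return [`data: ${JSON.stringify(final)}\n\n`, "data: [DONE]\n\n"];
}

// Hypothetical route helper: drain the provider's chunks into the response.
async function streamToResponse(chunks, res) {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  for await (const chunk of chunks) {
    for (const event of toSseEvents(chunk)) res.write(event);
  }
  res.end();
}
```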

model and tokenCount are captured from the llama.cpp chunk whose finish_reason is stop and are emitted on the done event.
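
On the provider side, the mapping from an OpenAI-style streaming chunk to the shape above might look like this sketch. `toProviderChunk` is a hypothetical name, and reading the count from `usage.completion_tokens` is an assumption; the real provider may take it from a different field of the final chunk.

```javascript
// Map one parsed streaming chunk from /v1/chat/completions to the provider
// chunk shape. Token chunks carry delta.content; the last one has
// finish_reason "stop".
function toProviderChunk(apiChunk) {
  const choice = apiChunk.choices?.[0] ?? {};
  if (choice.finish_reason === "stop") {
    return {
      response: "",
      done: true,
      model: apiChunk.model,
      // Assumption: the token count is reported under usage.completion_tokens.
      tokenCount: apiChunk.usage?.completion_tokens ?? null,
    };
  }
  return { response: choice.delta?.content ?? "", done: false };
}
```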

For all HTTP endpoints, see api-routes.md.
