Storme-bit 045da0d7f4 updated documentation

2026-04-13 03:42:14 -07:00

Inference Service

Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC (192.168.0.79)
Port: 3001

Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.

Dependencies

express — HTTP API
ollama — Ollama client (used by the Ollama provider, kept as fallback)
dotenv — environment variable loading
@nexusai/shared — shared utilities

Environment Variables

Variable	Required	Default	Description
PORT	No	3001	Port to listen on
INFERENCE_PROVIDER	No	llamacpp	Active inference provider (`ollama` or `llamacpp`)
INFERENCE_URL	No	http://localhost:8080	URL of the inference runtime
DEFAULT_MODEL	No	local-model	Default model name passed to the provider

INFERENCE_URL points to llama-server directly (port 8080), not to this service itself. The orchestration service uses INFERENCE_SERVICE_URL to reach this service on port 3001.
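
As a rough illustration of how these variables and their defaults could be resolved, here is a minimal sketch. `loadConfig` is a hypothetical helper, not taken from the service's source; it accepts an env-like object so it can be exercised without touching `process.env`.

```javascript
// Hypothetical helper (not from the service's source): resolve settings from
// an env-like object, falling back to the defaults listed in the table above.
function loadConfig(env = process.env) {
  return {
    // `||` rather than `??` so that empty strings also fall back to defaults.
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || 'llamacpp',
    inferenceUrl: env.INFERENCE_URL || 'http://localhost:8080',
    defaultModel: env.DEFAULT_MODEL || 'local-model',
  };
}
```

Calling `loadConfig({})` yields the documented defaults, while `loadConfig({ PORT: '4000' })` overrides only the port.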

Provider Architecture

The inference service uses a provider pattern to abstract the underlying LLM runtime. The active provider is selected at startup via INFERENCE_PROVIDER and loaded from src/providers/. Both providers expose identical function signatures, so the rest of the service is unaware of which backend is active.

Supported Providers

Provider	Value	Runtime
llama.cpp	`llamacpp`	llama.cpp server (OpenAI-compatible API) — current default
Ollama	`ollama`	Ollama via the `ollama` npm package — available as fallback

Switching providers requires only a .env change — no code modifications needed:

INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080

Provider Validation

The provider loader validates INFERENCE_PROVIDER at startup and throws immediately if the value is unknown, which prevents silent misconfiguration:

Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
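
The validation step could look roughly like the sketch below. `VALID_PROVIDERS` and `resolveProvider` are illustrative names and the actual loader in src/infer.js may be structured differently; only the error message format is taken from the example above.

```javascript
// Assumed list of accepted values, matching the Supported Providers table.
const VALID_PROVIDERS = ['ollama', 'llamacpp'];

// Hypothetical validation helper: returns the provider name or throws.
function resolveProvider(name = 'llamacpp') {
  if (!VALID_PROVIDERS.includes(name)) {
    throw new Error(
      `Unknown inference provider: "${name}". Valid options: ${VALID_PROVIDERS.join(', ')}`
    );
  }
  return name;
}

// A real loader would presumably go on to load the matching module from
// src/providers/ using the validated name (module layout assumed).
```

Failing at startup rather than on the first request means a typo in .env surfaces as soon as the service boots.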

llama.cpp Provider

The llama.cpp provider uses the OpenAI-compatible REST API exposed by llama-server.

Starting llama-server

llama-server must be started manually on the main PC before the inference service can handle requests. It loads a single model at startup:

.\llama-gpu\llama-server.exe `
  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
  -ngl 99 `
  --reasoning off `
  --host 0.0.0.0 `
  --port 8080 `
  -c 64000

Key flags:

Flag	Description
`-m`	Path to the `.gguf` model file
`-ngl 99`	Offload as many layers as possible to GPU
`--reasoning off`	Disables thinking/reasoning delay on Gemma 4 models
`--host 0.0.0.0`	Allows connections from other machines on the LAN
`--port 8080`	Port for the llama-server HTTP API
`-c 64000`	Context window size in tokens

-c 64000 is intentionally large. Monitor VRAM usage — if pressure builds, reduce this value. The NexusAI memory architecture handles context injection so a smaller window (6–8K) is often sufficient.

Model Naming

The model name sent in API requests must match the name as reported by llama-server — including the .gguf extension. The reported name can be verified with:

Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"

Set DEFAULT_MODEL in .env to the exact reported name:

DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
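
The same check can be scripted. The sketch below assumes the /v1/models response follows the OpenAI list shape (`{ data: [{ id: ... }] }`); `modelIds` and `isKnownModel` are hypothetical helper names.

```javascript
// Extract model ids from an OpenAI-style /v1/models payload.
// Assumed shape: { data: [{ id: "name.gguf" }, ...] }.
function modelIds(payload) {
  return (payload?.data ?? []).map((m) => m.id);
}

// True only when the configured name matches a reported id exactly,
// including the .gguf extension.
function isKnownModel(payload, name) {
  return modelIds(payload).includes(name);
}

// Usage against a running llama-server (needs Node 18+ for global fetch):
//   const res = await fetch('http://192.168.0.79:8080/v1/models');
//   console.log(modelIds(await res.json()));
```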

Inference Parameters

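The llamacpp provider maps NexusAI options to request fields of the OpenAI-compatible API. `top_k` and `repeat_penalty` are llama.cpp extensions rather than standard OpenAI fields: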

NexusAI option	API field	Default
`temperature`	`temperature`	0.7
`maxTokens`	`max_tokens`	1024
`topP`	`top_p`	0.9
`topK`	`top_k`	40
`repeatPenalty`	`repeat_penalty`	1.1
`seed`	`seed`	null (random)
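
A minimal sketch of this mapping follows. `toApiParams` is a hypothetical name, and leaving `seed` out of the request when it is null is an assumption about how "random" is expressed; the provider's real code may differ.

```javascript
// Defaults taken from the table above.
const DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
};

// Hypothetical mapper from NexusAI option names to API field names.
function toApiParams(options = {}) {
  // `??` keeps explicit zeros (e.g. temperature: 0) and only replaces
  // null/undefined values with the default.
  const pick = (key) => options[key] ?? DEFAULTS[key];
  const params = {
    temperature: pick('temperature'),
    max_tokens: pick('maxTokens'),
    top_p: pick('topP'),
    top_k: pick('topK'),
    repeat_penalty: pick('repeatPenalty'),
  };
  // Assumption: a null seed means "random", so the field is omitted.
  if (options.seed !== null && options.seed !== undefined) {
    params.seed = options.seed;
  }
  return params;
}
```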

Internal Structure

src/
├── providers/
│   ├── ollama.js      # Ollama provider — uses ollama npm package
│   └── llamacpp.js    # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions

Streaming Response Format

The llama.cpp provider yields chunks in this shape:

{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }

The inference route re-emits these as SSE events:

data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]

model and tokenCount are captured from the final llama.cpp chunk (the one with finish_reason: stop), with tokenCount taken from usage.completion_tokens. Both are emitted on the done event so the orchestration layer can forward them to the client.
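
The re-emission step can be sketched as a pure function from one provider chunk to SSE frames. `toSseFrames` is a hypothetical name; the blank line terminating each frame follows the Server-Sent Events format, and the route handler's actual code may differ.

```javascript
// Hypothetical helper: turn one provider chunk into the SSE frames to write.
function toSseFrames(chunk) {
  if (!chunk.done) {
    // Intermediate chunk: forward only the token text.
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  // Final chunk: emit the done event with metadata, then the [DONE] sentinel.
  const { model, tokenCount } = chunk;
  return [
    `data: ${JSON.stringify({ done: true, model, tokenCount })}\n\n`,
    'data: [DONE]\n\n',
  ];
}
```

In an Express handler, each returned frame would typically be passed to `res.write()` after setting `Content-Type: text/event-stream`.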

Endpoints

Health

Method	Path	Description
GET	/health	Service health check — reports active provider and model

Inference

Method	Path	Description
POST	/complete	Standard completion — returns full response when done
POST	/complete/stream	Streaming completion via Server-Sent Events

POST /complete

Request body:

{
  "prompt": "What is the capital of France?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
  "maxTokens": 1024
}

model is optional — falls back to DEFAULT_MODEL if omitted.
maxTokens is optional — defaults to 1024.
temperature is optional — defaults to 0.7.

Response:

{
  "text": "The capital of France is Paris.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
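
A caller such as the orchestration service might wrap this endpoint as in the sketch below. `complete` is a hypothetical client helper; the error handling assumes the service signals failure with a non-2xx status, which this document does not specify. The `fetchImpl` parameter exists only so the function can be exercised without a running server.

```javascript
// Hypothetical client helper for POST /complete (needs Node 18+ for global fetch).
async function complete(baseUrl, body, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/complete`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  // Assumption: failures are reported through a non-2xx HTTP status.
  if (!res.ok) {
    throw new Error(`Inference request failed with status ${res.status}`);
  }
  return res.json();
}

// Usage:
//   const out = await complete('http://192.168.0.79:3001', {
//     prompt: 'What is the capital of France?',
//   });
//   console.log(out.text);
```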

POST /complete/stream

Same request body as /complete.

Response is a stream of Server-Sent Events:

data: {"response":"The"}
data: {"response":" capital of France is Paris."}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
data: [DONE]

Clients should accumulate response fields to build the full response string. The done event carries model and tokenCount for display in the UI.
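
The accumulation logic can be sketched as follows. `collectStream` is a hypothetical helper that works on SSE text that has already been received in full; a real client reading from the network would also need to buffer partial lines between reads.

```javascript
// Hypothetical helper: rebuild the full text and final metadata from SSE text.
function collectStream(sseText) {
  let text = '';
  let meta = null;
  for (const line of sseText.split('\n')) {
    if (!line.startsWith('data: ')) continue; // skip blank separator lines
    const payload = line.slice('data: '.length).trim();
    if (payload === '[DONE]') break; // end-of-stream sentinel
    const event = JSON.parse(payload);
    if (event.done) {
      // The done event carries metadata rather than token text.
      meta = { model: event.model, tokenCount: event.tokenCount };
    } else if (typeof event.response === 'string') {
      text += event.response;
    }
  }
  return { text, meta };
}
```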
