nexusAI/docs/services/inference-service.md
2026-04-13 03:42:14 -07:00


Inference Service

Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC (192.168.0.79)
Port: 3001

Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.

Dependencies

  • express — HTTP API
  • ollama — Ollama client (used by the Ollama provider, kept as fallback)
  • dotenv — environment variable loading
  • @nexusai/shared — shared utilities

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (ollama or llamacpp) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |

INFERENCE_URL points to llama-server directly (port 8080), not to this service itself. The orchestration service uses INFERENCE_SERVICE_URL to reach this service on port 3001.
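
The defaults in the table above could be resolved at startup roughly as follows. This is a minimal sketch, not the service's actual code: loadConfig is a hypothetical helper name, and treating empty strings as unset is an assumption.

```javascript
// Hypothetical config loader; the real service may structure this differently.
// Defaults mirror the Environment Variables table.
function loadConfig(env = process.env) {
  return {
    // Environment values are strings, so the port is converted explicitly.
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || "llamacpp",
    // Points at llama-server (port 8080), not at this service (port 3001).
    inferenceUrl: env.INFERENCE_URL || "http://localhost:8080",
    defaultModel: env.DEFAULT_MODEL || "local-model",
  };
}
```

Passing the environment in as an argument keeps the function easy to exercise without touching process.env, for example loadConfig({ PORT: "4000" }).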

Provider Architecture

The inference service uses a provider pattern to abstract the underlying LLM runtime. The active provider is selected at startup via INFERENCE_PROVIDER and loaded from src/providers/. Both providers expose identical function signatures, so the rest of the service is unaware of which backend is active.

Supported Providers

| Provider | Value | Runtime |
| --- | --- | --- |
| llama.cpp | llamacpp | llama.cpp server (OpenAI-compatible API); current default |
| Ollama | ollama | Ollama via the ollama npm package; available as fallback |

Switching providers requires only a .env change — no code modifications needed:

INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080

Provider Validation

The provider loader validates INFERENCE_PROVIDER at startup and throws immediately if the value is unknown, which prevents silent misconfiguration:

Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
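
A validation step of this kind might look like the sketch below. The names VALID_PROVIDERS and resolveProvider are hypothetical, and the returned module path is an assumption based on the src/providers/ layout; only the error message format is taken from the example above.

```javascript
// Hypothetical validation helper; names are illustrative only.
const VALID_PROVIDERS = ["ollama", "llamacpp"];

function resolveProvider(name) {
  if (!VALID_PROVIDERS.includes(name)) {
    // Fail fast at startup instead of falling back to some default backend.
    throw new Error(
      `Unknown inference provider: "${name}". Valid options: ${VALID_PROVIDERS.join(", ")}`
    );
  }
  // Assumed module path, relative to src/infer.js.
  return `./providers/${name}.js`;
}
```

Checking against a fixed list before loading anything also means an arbitrary string from .env is never used to build a module path.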

llama.cpp Provider

The llama.cpp provider uses the OpenAI-compatible REST API exposed by llama-server.

Starting llama-server

llama-server must be started manually on the main PC before the inference service can handle requests. It loads a single model at startup:

.\llama-gpu\llama-server.exe `
  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
  -ngl 99 `
  --reasoning off `
  --host 0.0.0.0 `
  --port 8080 `
  -c 64000

Key flags:

| Flag | Description |
| --- | --- |
| -m | Path to the .gguf model file |
| -ngl 99 | Offload as many layers as possible to GPU |
| --reasoning off | Disables thinking/reasoning delay on Gemma 4 models |
| --host 0.0.0.0 | Allows connections from other machines on the LAN |
| --port 8080 | Port for the llama-server HTTP API |
| -c 64000 | Context window size in tokens |

-c 64000 is intentionally large. Monitor VRAM usage and reduce this value if memory pressure builds. The NexusAI memory architecture handles context injection, so a smaller window (6-8K tokens) is often sufficient.

Model Naming

The model name sent in API requests must match the name as reported by llama-server — including the .gguf extension. The reported name can be verified with:

Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"

Set DEFAULT_MODEL in .env to the exact reported name:

DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
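
The same check can be scripted from Node. The sketch below assumes the /v1/models response follows the OpenAI list shape ({ data: [{ id }] }) and that Node 18+ is available for the global fetch; listModelIds and checkDefaultModel are hypothetical helper names.

```javascript
// Assumes the OpenAI-style list shape: { data: [{ id: "..." }, ...] }.
function listModelIds(body) {
  return (body?.data ?? []).map((m) => m.id);
}

// Hypothetical startup check: is DEFAULT_MODEL one of the reported names?
async function checkDefaultModel(baseUrl, defaultModel) {
  const res = await fetch(`${baseUrl}/v1/models`);
  if (!res.ok) throw new Error(`llama-server returned HTTP ${res.status}`);
  const ids = listModelIds(await res.json());
  // The comparison is exact, so a missing .gguf extension counts as a mismatch.
  return { ok: ids.includes(defaultModel), ids };
}
```

For example, checkDefaultModel("http://192.168.0.79:8080", process.env.DEFAULT_MODEL) resolves to an object whose ids field lists what llama-server actually reports.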

Inference Parameters

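The llamacpp provider maps NexusAI options to request fields on the OpenAI-compatible API. Note that top_k and repeat_penalty are llama.cpp extensions rather than standard OpenAI fields: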

| NexusAI option | API field | Default |
| --- | --- | --- |
| temperature | temperature | 0.7 |
| maxTokens | max_tokens | 1024 |
| topP | top_p | 0.9 |
| topK | top_k | 40 |
| repeatPenalty | repeat_penalty | 1.1 |
| seed | seed | null (random) |
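
The mapping in the table can be sketched as a small pure function. This is illustrative only: toApiParams and DEFAULTS are hypothetical names, and omitting seed from the request when it is null (so the server picks a random one) is an assumption about how the provider behaves.

```javascript
// Defaults taken from the Inference Parameters table.
const DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

// Hypothetical mapper from NexusAI option names to API field names.
function toApiParams(options = {}) {
  // Drop undefined values so they cannot overwrite a default.
  const given = Object.fromEntries(
    Object.entries(options).filter(([, value]) => value !== undefined)
  );
  const o = { ...DEFAULTS, ...given };
  const params = {
    temperature: o.temperature,
    max_tokens: o.maxTokens,
    top_p: o.topP,
    top_k: o.topK,
    repeat_penalty: o.repeatPenalty,
  };
  // Assumption: a null seed is left out of the request entirely.
  if (o.seed !== null) params.seed = o.seed;
  return params;
}
```

Calling toApiParams({ maxTokens: 256 }) overrides only max_tokens and leaves the other fields at their table defaults.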

Internal Structure

src/
├── providers/
│   ├── ollama.js      # Ollama provider — uses ollama npm package
│   └── llamacpp.js    # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions

Streaming Response Format

The llama.cpp provider yields chunks in this shape:

{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }

The inference route re-emits these as SSE events:

data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]

model and tokenCount are captured from the final llama.cpp chunk (the one with finish_reason: stop), with tokenCount taken from usage.completion_tokens. Both are emitted on the done event so the orchestration layer can forward them to the client.
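
The re-emission step could be sketched as below. toSseEvents is a hypothetical helper, not the route's actual code; it only illustrates how one provider chunk maps to one or two SSE frames in the format shown above.

```javascript
// Hypothetical helper: turn one provider chunk into SSE frames.
// Each frame ends with a blank line, as the SSE format requires.
function toSseEvents(chunk) {
  if (!chunk.done) {
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  // Final chunk: forward the metadata, then the terminating sentinel.
  const final = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
  return [`data: ${JSON.stringify(final)}\n\n`, "data: [DONE]\n\n"];
}
```

In an Express handler, each returned string would typically be passed to res.write after the Content-Type header has been set to text/event-stream.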

Endpoints

Health

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Service health check; reports active provider and model |

Inference

| Method | Path | Description |
| --- | --- | --- |
| POST | /complete | Standard completion; returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |

POST /complete

Request body:

{
  "prompt": "What is the capital of France?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
  "maxTokens": 1024
}

model is optional — falls back to DEFAULT_MODEL if omitted.
maxTokens is optional — defaults to 1024.
temperature is optional — defaults to 0.7.

Response:

{
  "text": "The capital of France is Paris.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
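
A caller such as the orchestration service might issue this request as sketched below. buildCompleteRequest and complete are hypothetical names, the non-empty prompt check is an assumption rather than documented service behaviour, and the global fetch requires Node 18+.

```javascript
// Hypothetical client helper: build the URL and fetch options for /complete.
function buildCompleteRequest(baseUrl, { prompt, model, temperature, maxTokens }) {
  // Assumption: the service expects a non-empty prompt string.
  if (typeof prompt !== "string" || prompt.length === 0) {
    throw new TypeError("prompt must be a non-empty string");
  }
  return {
    url: `${baseUrl}/complete`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // JSON.stringify drops undefined fields, so omitted options
      // fall back to the service defaults.
      body: JSON.stringify({ prompt, model, temperature, maxTokens }),
    },
  };
}

async function complete(baseUrl, request) {
  const { url, init } = buildCompleteRequest(baseUrl, request);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Inference service returned HTTP ${res.status}`);
  return res.json(); // { text, model, done, evalCount, promptEvalCount }
}
```

For example, complete("http://192.168.0.79:3001", { prompt: "What is the capital of France?" }) sends only the prompt and relies on DEFAULT_MODEL and the default sampling values.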

POST /complete/stream

Same request body as /complete.

Response is a stream of Server-Sent Events:

data: {"response":"The"}
data: {"response":" capital of France is Paris."}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
data: [DONE]

Clients should accumulate response fields to build the full response string. The done event carries model and tokenCount for display in the UI.
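
The accumulation described above can be sketched as follows. accumulateSse is a hypothetical helper that works on a complete SSE payload held in a string; a real client reading the stream incrementally would also need to buffer partial lines between network chunks.

```javascript
// Hypothetical helper: fold a full SSE payload into text plus metadata.
function accumulateSse(sseText) {
  let text = "";
  let meta = null;
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank separator lines
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const event = JSON.parse(payload);
    if (event.done) {
      meta = { model: event.model, tokenCount: event.tokenCount };
    } else if (typeof event.response === "string") {
      text += event.response;
    }
  }
  // Spreading null adds no keys, so meta is optional in the result.
  return { text, ...meta };
}
```

Applied to the example stream above, the text field holds the concatenated response fields, and model and tokenCount come from the done event.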