Inference Service
Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC (192.168.0.79)
Port: 3001
Purpose
Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.
Dependencies
- express: HTTP API
- ollama: Ollama client (used by the Ollama provider, kept as fallback)
- dotenv: environment variable loading
- @nexusai/shared: shared utilities
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (ollama or llamacpp) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
INFERENCE_URL points to llama-server directly (port 8080), not to this service itself. The orchestration service uses INFERENCE_SERVICE_URL to reach this service on port 3001.
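As a rough sketch of how these variables and their defaults could be resolved at startup: the `loadConfig` helper below is hypothetical and not taken from the service's source. It only mirrors the defaults listed in the table above.

```javascript
// Hypothetical helper (not from the service source): resolves the documented
// environment variables, falling back to the defaults from the table above.
// Empty strings are treated as "unset" so a blank .env entry keeps the default.
function loadConfig(env = process.env) {
  return {
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || "llamacpp",
    inferenceUrl: env.INFERENCE_URL || "http://localhost:8080",
    defaultModel: env.DEFAULT_MODEL || "local-model",
  };
}

// Example: only PORT is overridden, everything else uses its default.
const config = loadConfig({ PORT: "3005" });
console.log(config);
```

Passing the environment in as an argument keeps the helper easy to test without touching `process.env`.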
Provider Architecture
The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via INFERENCE_PROVIDER
and loaded from src/providers/. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.
Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| llama.cpp | llamacpp | llama.cpp server (OpenAI-compatible API), current default |
| Ollama | ollama | Ollama via the ollama npm package, available as fallback |
Switching providers requires only a .env change — no code modifications needed:
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
Provider Validation
The provider loader validates INFERENCE_PROVIDER at startup and throws immediately
if an unknown value is set, which prevents silent misconfiguration:
Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
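The validation step can be sketched roughly as follows. This is an illustration only: `resolveProviderName` and `VALID_PROVIDERS` are hypothetical names, and the part of the real loader that loads the matching module from src/providers/ is left out.

```javascript
// Sketch of the validation step only (hypothetical names). The real loader
// also loads the selected module from src/providers/; that is omitted here.
const VALID_PROVIDERS = ["ollama", "llamacpp"];

function resolveProviderName(value = "llamacpp") {
  if (!VALID_PROVIDERS.includes(value)) {
    // Fail fast at startup instead of failing later on the first request.
    throw new Error(
      `Unknown inference provider: "${value}". Valid options: ${VALID_PROVIDERS.join(", ")}`
    );
  }
  return value;
}

console.log(resolveProviderName("ollama"));
```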
llama.cpp Provider
The llama.cpp provider uses the OpenAI-compatible REST API exposed by llama-server.
Starting llama-server
llama-server must be started manually on the main PC before the inference service
can handle requests. It loads a single model at startup:
.\llama-gpu\llama-server.exe `
-m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
-ngl 99 `
--reasoning off `
--host 0.0.0.0 `
--port 8080 `
-c 64000
Key flags:
| Flag | Description |
|---|---|
| -m | Path to the .gguf model file |
| -ngl 99 | Offload as many layers as possible to GPU |
| --reasoning off | Disables thinking/reasoning delay on Gemma 4 models |
| --host 0.0.0.0 | Allows connections from other machines on the LAN |
| --port 8080 | Port for the llama-server HTTP API |
| -c 64000 | Context window size in tokens |
-c 64000 is intentionally large. Monitor VRAM usage; if pressure builds, reduce this value. The NexusAI memory architecture handles context injection, so a smaller window (6–8K tokens) is often sufficient.
Model Naming
The model name sent in API requests must match the name as reported by
llama-server — including the .gguf extension. The reported name can be
verified with:
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
Set DEFAULT_MODEL in .env to the exact reported name:
DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
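The same check can be done from Node.js. The sketch below assumes Node 18 or newer (for the global `fetch`) and assumes the endpoint returns the OpenAI-style list shape `{ data: [{ id: ... }] }`. Both helper names are hypothetical.

```javascript
// Assumes the OpenAI-style response shape: { data: [{ id: "..." }, ...] }.
function extractModelIds(payload) {
  return (payload?.data ?? []).map((entry) => entry.id);
}

// Hypothetical helper: fetches the reported model names from llama-server.
// Requires Node 18+ for the global fetch API.
async function listModels(baseUrl = "http://192.168.0.79:8080") {
  const res = await fetch(`${baseUrl}/v1/models`);
  if (!res.ok) {
    throw new Error(`llama-server returned HTTP ${res.status}`);
  }
  return extractModelIds(await res.json());
}

// Usage (needs a running llama-server):
// listModels().then((ids) => console.log(ids));
```

Whatever name is printed here is the value to copy into DEFAULT_MODEL.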
Inference Parameters
The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
| NexusAI option | API field | Default |
|---|---|---|
| temperature | temperature | 0.7 |
| maxTokens | max_tokens | 1024 |
| topP | top_p | 0.9 |
| topK | top_k | 40 |
| repeatPenalty | repeat_penalty | 1.1 |
| seed | seed | null (random) |
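A minimal sketch of this mapping, assuming the defaults are applied only when an option is missing. The function name `toLlamaCppParams` is hypothetical, and whether the real provider sends a null seed or omits the field is not stated here.

```javascript
// Hypothetical mapping function; option names, API fields, and defaults
// follow the table above. `??` is used so an explicit 0 is not replaced.
function toLlamaCppParams(options = {}) {
  return {
    temperature: options.temperature ?? 0.7,
    max_tokens: options.maxTokens ?? 1024,
    top_p: options.topP ?? 0.9,
    top_k: options.topK ?? 40,
    repeat_penalty: options.repeatPenalty ?? 1.1,
    seed: options.seed ?? null, // null means "let the runtime pick a seed"
  };
}

console.log(toLlamaCppParams({ temperature: 0, maxTokens: 256 }));
```

Using `??` rather than `||` matters for temperature: a caller asking for deterministic output with `temperature: 0` would otherwise silently get 0.7.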
Internal Structure
src/
├── providers/
│ ├── ollama.js # Ollama provider — uses ollama npm package
│ └── llamacpp.js # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│ └── inference.js # /complete and /complete/stream route handlers
├── infer.js # Provider loader — selects and re-exports active provider
└── index.js # Express app + route definitions
Streaming Response Format
The llama.cpp provider yields chunks in this shape:
{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
The inference route re-emits these as SSE events:
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
model and tokenCount are captured from the final llama.cpp chunk (the one with
finish_reason: stop), with tokenCount taken from usage.completion_tokens. Both are
emitted on the done event so the orchestration layer can forward them to the client.
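The re-emission step could look roughly like this. `toSseFrames` is a hypothetical helper, not the actual route handler; it only reproduces the event shapes shown above.

```javascript
// Hypothetical helper: converts one provider chunk into SSE text.
// Token chunks carry only `response`; the final chunk becomes a done event
// followed by the [DONE] sentinel.
function toSseFrames(chunk) {
  if (!chunk.done) {
    return `data: ${JSON.stringify({ response: chunk.response })}\n\n`;
  }
  const doneEvent = {
    done: true,
    model: chunk.model,
    tokenCount: chunk.tokenCount,
  };
  return `data: ${JSON.stringify(doneEvent)}\n\ndata: [DONE]\n\n`;
}

process.stdout.write(toSseFrames({ response: "token text", done: false }));
```

In an Express handler the returned string would typically be passed to `res.write()` after setting the `Content-Type: text/event-stream` header.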
Endpoints
Health
| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports active provider and model |
Inference
| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |
POST /complete
Request body:
{
"prompt": "What is the capital of France?",
"model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"temperature": 0.7,
"maxTokens": 1024
}
model is optional — falls back to DEFAULT_MODEL if omitted.
maxTokens is optional — defaults to 1024.
temperature is optional — defaults to 0.7.
Response:
{
"text": "The capital of France is Paris.",
"model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"done": true,
"evalCount": 8,
"promptEvalCount": 41
}
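For illustration, a client call to this endpoint might look like the sketch below. It assumes Node 18 or newer for the global `fetch`. The helper names are hypothetical, and the base URL simply combines the host and port listed at the top of this page.

```javascript
// Hypothetical helper: builds the fetch options for POST /complete.
// Optional fields (model, temperature, maxTokens) are only sent if provided.
function buildCompleteRequest(prompt, options = {}) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, ...options }),
  };
}

// Hypothetical client: sends the request and returns the documented fields.
async function complete(prompt, options = {}, baseUrl = "http://192.168.0.79:3001") {
  const res = await fetch(`${baseUrl}/complete`, buildCompleteRequest(prompt, options));
  if (!res.ok) {
    throw new Error(`Inference service returned HTTP ${res.status}`);
  }
  const { text, model, evalCount, promptEvalCount } = await res.json();
  return { text, model, evalCount, promptEvalCount };
}

// Usage (needs a running inference service):
// complete("What is the capital of France?", { maxTokens: 64 })
//   .then((result) => console.log(result.text));
```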
POST /complete/stream
Same request body as /complete.
Response is a stream of Server-Sent Events:
data: {"response":"The"}
data: {"response":" capital of France is Paris."}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
data: [DONE]
Clients should accumulate response fields to build the full response string.
The done event carries model and tokenCount for display in the UI.
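A minimal sketch of the accumulation logic, written as a pure function over the complete SSE text so it is easy to follow. `accumulateSse` is a hypothetical name. A real streaming client would also need to buffer partial lines that arrive split across network chunks.

```javascript
// Hypothetical parser: takes the raw SSE text of a finished stream and returns
// the accumulated response plus the metadata carried by the done event.
function accumulateSse(raw) {
  let text = "";
  let meta = null;
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank separator lines
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const event = JSON.parse(payload);
    if (event.done) {
      meta = { model: event.model, tokenCount: event.tokenCount };
    } else {
      text += event.response ?? "";
    }
  }
  return { text, meta };
}

const sample =
  'data: {"response":"The"}\n\n' +
  'data: {"response":" capital of France is Paris."}\n\n' +
  'data: {"done":true,"model":"example.gguf","tokenCount":8}\n\n' +
  "data: [DONE]\n\n";
console.log(accumulateSse(sample));
```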