Inference Service
Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC (192.168.0.79)
Port: 3001
Purpose
Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.
Dependencies
- express: HTTP API
- ollama: Ollama client (used by the Ollama provider, kept as fallback)
- dotenv: environment variable loading
- @nexusai/shared: shared utilities
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active provider (ollama or llamacpp) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
INFERENCE_URL points to llama-server directly (port 8080), not to this service. The orchestration service uses INFERENCE_SERVICE_URL to reach this service on port 3001.
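The defaults in the table can be applied in one place at startup. The following is a minimal sketch, assuming a hypothetical resolveConfig helper and illustrative field names (neither is confirmed by the package), that reads from an env-like object such as process.env:

```javascript
// Hypothetical helper: applies the documented defaults to an env-like object.
// The function name and the returned field names are illustrative assumptions.
function resolveConfig(env) {
  return {
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || 'llamacpp',
    inferenceUrl: env.INFERENCE_URL || 'http://localhost:8080',
    defaultModel: env.DEFAULT_MODEL || 'local-model',
  };
}

// Example: only the model name is overridden; everything else falls back.
const config = resolveConfig({ DEFAULT_MODEL: 'model-name.gguf' });
console.log(config);
```

In the real service, dotenv would populate process.env before a helper like this runs.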
Provider Architecture
The active provider is selected at startup via INFERENCE_PROVIDER and
loaded from src/providers/. Both providers expose identical function
signatures.
Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| llama.cpp | llamacpp | llama.cpp server (OpenAI-compatible API), current default |
| Ollama | ollama | Ollama via the ollama npm package, available as fallback |
Switching providers requires only a .env change — no code modifications:
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
The provider loader throws immediately on an unknown value, preventing silent misconfiguration.
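The fail-fast selection described above can be sketched as follows. This is an illustration only: the stub objects stand in for src/providers/llamacpp.js and src/providers/ollama.js, and the loadProvider and complete names are assumptions, not the package's confirmed exports.

```javascript
// Stub providers standing in for the real modules in src/providers/.
// The `complete` function name is an assumption made for this sketch.
const providers = {
  llamacpp: { name: 'llamacpp', complete: async (prompt) => `llamacpp:${prompt}` },
  ollama: { name: 'ollama', complete: async (prompt) => `ollama:${prompt}` },
};

// Select the provider once at startup and fail fast on an unknown value.
function loadProvider(name) {
  const key = name || 'llamacpp'; // documented default
  if (!Object.prototype.hasOwnProperty.call(providers, key)) {
    const known = Object.keys(providers).join(', ');
    throw new Error(`Unknown INFERENCE_PROVIDER "${key}". Expected one of: ${known}`);
  }
  return providers[key];
}

// In the service, the argument would come from process.env.INFERENCE_PROVIDER.
const active = loadProvider('llamacpp');
console.log(`Active provider: ${active.name}`);
```

Because both stubs expose the same function names, callers never need to branch on which provider is active.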
LM Studio compatibility note: LM Studio exposes an OpenAI-compatible
/v1/chat/completions endpoint with the same request shape as llama.cpp. A future lmstudio.js provider would be nearly identical to llamacpp.js; only the BASE_URL would differ. No architectural changes required.
Internal Structure
src/
├── providers/
│ ├── ollama.js # Ollama provider
│ └── llamacpp.js # llama.cpp provider (OpenAI-compatible REST)
├── routes/
│ └── inference.js # /complete and /complete/stream route handlers
├── infer.js # Provider loader — selects and re-exports active provider
└── index.js # Express app + route definitions
llama.cpp Provider
Uses the OpenAI-compatible REST API exposed by llama-server.
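For orientation, a request to the OpenAI-compatible chat endpoint might be assembled as below. buildChatRequest is a hypothetical helper written for this sketch, not the provider's actual code, and the message content is illustrative.

```javascript
// Builds the URL and fetch() init for an OpenAI-compatible chat completion.
// buildChatRequest is a hypothetical helper, not taken from llamacpp.js.
function buildChatRequest(baseUrl, model, messages, stream = false) {
  return {
    // Strip trailing slashes so the path is joined cleanly.
    url: `${baseUrl.replace(/\/+$/, '')}/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, messages, stream }),
    },
  };
}

const request = buildChatRequest(
  'http://localhost:8080',
  'local-model',
  [{ role: 'user', content: 'Hello' }]
);
console.log(request.url);

// With llama-server running (Node 18+ provides a global fetch):
//   const res = await fetch(request.url, request.init);
//   const data = await res.json();
//   console.log(data.choices[0].message.content);
```

Only the base URL is specific to the runtime here, which is why another OpenAI-compatible backend could reuse the same request shape.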
Starting llama-server
Must be started manually on the main PC before the inference service can handle requests:
.\llama-gpu\llama-server.exe `
-m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
-ngl 99 `
--reasoning off `
--host 0.0.0.0 `
--port 8080 `
-c 64000
| Flag | Description |
|---|---|
| -ngl 99 | Offload as many layers as possible to GPU |
| --reasoning off | Disables thinking delay on Gemma 4 models |
| --host 0.0.0.0 | Allows LAN connections |
| -c 64000 | Context window size in tokens |
-c 64000 is intentionally large. NexusAI's memory architecture handles context injection, so a 6–8K window is often sufficient and the value can be lowered if VRAM pressure builds.
Model Naming
The model name in requests must match the name reported by llama-server
including the .gguf extension:
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
Set DEFAULT_MODEL in .env to the exact reported name.
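The same lookup can be done from Node. The sketch below assumes the endpoint returns the usual OpenAI-style list shape ({ data: [{ id }] }); verify this against your llama-server build. extractModelIds is a hypothetical helper and the sample payload is illustrative, not captured from a real server.

```javascript
// Extracts model ids from an OpenAI-style /v1/models response body.
// Assumes the { data: [{ id }] } list shape; returns [] for anything else.
function extractModelIds(payload) {
  if (!payload || !Array.isArray(payload.data)) return [];
  return payload.data.map((m) => m.id).filter((id) => typeof id === 'string');
}

// Illustrative payload, not captured from a real server.
const sample = { object: 'list', data: [{ id: 'model-name.gguf', object: 'model' }] };
console.log(extractModelIds(sample));

// Against a running server (Node 18+):
//   const res = await fetch('http://192.168.0.79:8080/v1/models');
//   console.log(extractModelIds(await res.json()));
```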
Inference Parameters
All parameters are resolved in resolveOptions(), which falls back to
INFERENCE_DEFAULTS from @nexusai/shared for any value not provided in the
request. In normal usage, orchestration reads these from settings.json and
forwards them on every request.
| NexusAI option | API field | Default | Description |
|---|---|---|---|
| temperature | temperature | 0.7 | Response randomness (0 = deterministic) |
| maxTokens | max_tokens | 1024 | Max tokens to generate |
| topP | top_p | 0.9 | Nucleus sampling probability mass |
| topK | top_k | 40 | Top-K token candidates per step |
| repeatPenalty | repeat_penalty | 1.1 | Penalty for recently used tokens |
| seed | seed | null | null = random; integer for reproducible output |
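The fallback-and-rename step can be sketched as below. This is not the actual resolveOptions() implementation: INFERENCE_DEFAULTS is redeclared locally as a stand-in for the @nexusai/shared export, and omitting seed when it is null is an assumption of this sketch.

```javascript
// Local stand-in for INFERENCE_DEFAULTS from @nexusai/shared,
// using the default values listed in the table above.
const INFERENCE_DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

// Merge request options over the defaults, then map to the API field names.
function resolveOptions(requested = {}) {
  const merged = { ...INFERENCE_DEFAULTS };
  for (const key of Object.keys(INFERENCE_DEFAULTS)) {
    // Override only when the caller actually supplied a value,
    // so an explicit 0 (e.g. temperature) is respected.
    if (requested[key] !== undefined) merged[key] = requested[key];
  }
  const body = {
    temperature: merged.temperature,
    max_tokens: merged.maxTokens,
    top_p: merged.topP,
    top_k: merged.topK,
    repeat_penalty: merged.repeatPenalty,
  };
  // Assumption: a null seed is left out so the runtime chooses a random one.
  if (merged.seed !== null) body.seed = merged.seed;
  return body;
}

console.log(resolveOptions({ temperature: 0, seed: 42 }));
```

Checking against undefined rather than using `||` matters here, because a temperature of 0 is a valid request and must not be replaced by the default.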
Streaming Response Format
The llama.cpp provider yields chunks in this shape:
{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
The inference route re-emits as SSE:
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
model and tokenCount are captured from the llama.cpp finish_reason: stop
chunk and emitted on the done event.
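The chunk-to-SSE mapping described above can be sketched as follows. toSseFrames, pipeToSse, and fakeProviderStream are hypothetical names for illustration; the real route in src/routes/inference.js may be structured differently.

```javascript
// Converts one provider chunk into the SSE frames described above.
// toSseFrames is a hypothetical helper, not the route's actual code.
function toSseFrames(chunk) {
  if (!chunk.done) {
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  const final = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
  // The final chunk produces the done event followed by the [DONE] sentinel.
  return [`data: ${JSON.stringify(final)}\n\n`, 'data: [DONE]\n\n'];
}

// Stand-in for the provider's stream, yielding the documented chunk shapes.
async function* fakeProviderStream() {
  yield { response: 'Hello', done: false };
  yield { response: ' world', done: false };
  yield { response: '', done: true, model: 'model-name.gguf', tokenCount: 2 };
}

// In an Express handler, `write` would be res.write, called after setting
// the Content-Type header to text/event-stream.
async function pipeToSse(stream, write) {
  for await (const chunk of stream) {
    for (const frame of toSseFrames(chunk)) write(frame);
  }
}

pipeToSse(fakeProviderStream(), (frame) => process.stdout.write(frame));
```

Keeping the frame formatting in a pure function makes it easy to test without starting an HTTP server.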
For all HTTP endpoints, see api-routes.md.