Inference Service
Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC
Port: 3001
Purpose
Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.
Dependencies
- express — HTTP API
- ollama — Ollama client (used by the Ollama provider)
- dotenv — environment variable loading
- @nexusai/shared — shared utilities
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
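Putting the table together, a complete `.env` that spells out every default (all entries optional, values shown are the documented defaults):

```
PORT=3001
INFERENCE_PROVIDER=ollama
INFERENCE_URL=http://localhost:11434
DEFAULT_MODEL=llama3.2
```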
Provider Architecture
The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via INFERENCE_PROVIDER
and loaded from src/providers/. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.
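The loader pattern described above can be sketched as follows. This is an illustrative sketch, not the actual contents of src/infer.js: the provider objects here are stubs, and the real providers live in src/providers/.

```javascript
// Illustrative sketch of the provider-loader pattern.
// Real providers live in src/providers/; these stubs only show the shape.
const providers = {
  ollama: {
    name: 'ollama',
    // Every provider exposes the same signature, so callers
    // never need to know which backend is active.
    complete: async ({ prompt, model }) => ({ text: `[ollama] ${prompt}`, model, done: true }),
  },
  llamacpp: {
    name: 'llamacpp',
    complete: async ({ prompt, model }) => ({ text: `[llamacpp] ${prompt}`, model, done: true }),
  },
};

// Select the active provider once at startup from the environment.
function loadProvider(env = process.env) {
  const key = env.INFERENCE_PROVIDER || 'ollama';
  const provider = providers[key];
  if (!provider) {
    throw new Error(`Unknown INFERENCE_PROVIDER: ${key}`);
  }
  return provider;
}

const active = loadProvider({ INFERENCE_PROVIDER: 'llamacpp' });
console.log(active.name); // llamacpp
```

Because the loader resolves the provider exactly once at startup, every route handler can import the active provider without any backend-specific branching.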
Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| Ollama | ollama | Ollama via the ollama npm package |
| llama.cpp | llamacpp | llama.cpp server (OpenAI-compatible API) |
Switching providers requires only a .env change — no code modifications needed.
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
Internal Structure
src/
├── providers/
│   ├── ollama.js      # Ollama provider — uses ollama npm package
│   └── llamacpp.js    # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions
Endpoints
Health
| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports active provider and model |
Inference
| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |
POST /complete
Request body:
{
"prompt": "What is the capital of France?",
"model": "companion:latest",
"temperature": 0.7,
"maxTokens": 1024
}
model is optional — falls back to DEFAULT_MODEL if omitted.
maxTokens is optional — defaults to 1024.
temperature is optional — defaults to 0.7.
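The defaulting rules above can be sketched as a small helper. This is a hypothetical illustration of how the route might normalize a request body; the name `withDefaults` is not from the actual source, which lives in src/routes/inference.js.

```javascript
// Hypothetical sketch: apply the documented defaults to a /complete body.
// DEFAULT_MODEL comes from the environment; 'llama3.2' is its documented default.
function withDefaults(body, env = process.env) {
  return {
    prompt: body.prompt,
    model: body.model ?? env.DEFAULT_MODEL ?? 'llama3.2',
    temperature: body.temperature ?? 0.7,   // ?? keeps an explicit 0
    maxTokens: body.maxTokens ?? 1024,
  };
}

console.log(withDefaults({ prompt: 'What is the capital of France?' }, {}));
```

Note the use of `??` rather than `||`, so an explicit `temperature: 0` is passed through instead of being replaced by the default.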
Response:
{
"text": "The capital of France is Paris.",
"model": "companion:latest",
"done": true,
"evalCount": 8,
"promptEvalCount": 41
}
| Field | Description |
|---|---|
| text | The model's response |
| model | Model name as reported by the provider |
| done | Whether generation completed normally |
| evalCount | Number of tokens generated |
| promptEvalCount | Number of tokens in the prompt |
POST /complete/stream
Same request body as /complete (maxTokens not applicable for streaming).
Response is a stream of Server-Sent Events. Each event contains a partial
response chunk as JSON. The stream closes with a final data: [DONE] event.
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
data: [DONE]
Clients should read the response field from each chunk and accumulate
them to build the full response string.
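The accumulation step can be sketched client-side as below. This is an illustrative helper (not part of the package) that parses `data:` lines from the stream and concatenates the `response` fields until the `[DONE]` sentinel:

```javascript
// Sketch of a client accumulating the SSE stream into one string.
function accumulateSSE(lines) {
  let full = '';
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;    // skip non-data lines
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') break;             // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    full += chunk.response;                      // append the partial text
  }
  return full;
}

const events = [
  'data: {"model":"companion:latest","response":"The","done":false}',
  'data: {"model":"companion:latest","response":" capital","done":false}',
  'data: {"model":"companion:latest","response":" of France is Paris.","done":false}',
  'data: [DONE]',
];
console.log(accumulateSSE(events)); // The capital of France is Paris.
```

A real client would read these lines incrementally off the HTTP response rather than from an array, but the parsing logic is the same.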