Inference Service
Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC
Port: 3001
Purpose
Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.
Dependencies
- express — HTTP API
- ollama — Ollama client (used by the Ollama provider)
- dotenv — environment variable loading
- @nexusai/shared — shared utilities
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
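Putting the table together, a complete `.env` that spells out every default (all entries optional, values shown are the documented defaults):

```
PORT=3001
INFERENCE_PROVIDER=ollama
INFERENCE_URL=http://localhost:11434
DEFAULT_MODEL=llama3.2
```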
Provider Architecture
The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via INFERENCE_PROVIDER
and loaded from src/providers/. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.
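The loader pattern described above can be sketched as follows. This is an illustrative sketch, not the actual contents of src/infer.js: the provider objects here are stubs, and the real providers live in src/providers/.

```javascript
// Illustrative sketch of the provider-loader pattern.
// Real providers live in src/providers/; these stubs only show the shape.
const providers = {
  ollama: {
    name: 'ollama',
    // Every provider exposes the same signature, so callers
    // never need to know which backend is active.
    complete: async ({ prompt, model }) => ({ text: `[ollama] ${prompt}`, model, done: true }),
  },
  llamacpp: {
    name: 'llamacpp',
    complete: async ({ prompt, model }) => ({ text: `[llamacpp] ${prompt}`, model, done: true }),
  },
};

// Select the active provider once at startup from the environment.
function loadProvider(env = process.env) {
  const key = env.INFERENCE_PROVIDER || 'ollama';
  const provider = providers[key];
  if (!provider) {
    throw new Error(`Unknown INFERENCE_PROVIDER: ${key}`);
  }
  return provider;
}

const active = loadProvider({ INFERENCE_PROVIDER: 'llamacpp' });
console.log(active.name); // llamacpp
```

Because the loader resolves the provider exactly once at startup, every route handler can import the active provider without any backend-specific branching.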
Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| Ollama | ollama | Ollama via the ollama npm package |
| llama.cpp | llamacpp | llama.cpp server (OpenAI-compatible API) |
Switching providers requires only a .env change — no code modifications needed.
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
Internal Structure
src/
├── providers/
│   ├── ollama.js      # Ollama provider — uses ollama npm package
│   └── llamacpp.js    # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions
Endpoints
Health
| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports active provider and model |
Inference
| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |
POST /complete
Request body:
{
"prompt": "What is the capital of France?",
"model": "companion:latest",
"temperature": 0.7,
"maxTokens": 1024
}
model is optional — falls back to DEFAULT_MODEL if omitted.
maxTokens is optional — defaults to 1024.
temperature is optional — defaults to 0.7.
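The defaulting rules above can be sketched as a small helper. This is a hypothetical illustration of how the route might normalize a request body; the name `withDefaults` is not from the actual source, which lives in src/routes/inference.js.

```javascript
// Hypothetical sketch: apply the documented defaults to a /complete body.
// DEFAULT_MODEL comes from the environment; 'llama3.2' is its documented default.
function withDefaults(body, env = process.env) {
  return {
    prompt: body.prompt,
    model: body.model ?? env.DEFAULT_MODEL ?? 'llama3.2',
    temperature: body.temperature ?? 0.7,   // ?? keeps an explicit 0
    maxTokens: body.maxTokens ?? 1024,
  };
}

console.log(withDefaults({ prompt: 'What is the capital of France?' }, {}));
```

Note the use of `??` rather than `||`, so an explicit `temperature: 0` is passed through instead of being replaced by the default.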
Response:
{
"text": "The capital of France is Paris.",
"model": "companion:latest",
"done": true,
"evalCount": 8,
"promptEvalCount": 41
}
| Field | Description |
|---|---|
| text | The model's response |
| model | Model name as reported by the provider |
| done | Whether generation completed normally |
| evalCount | Number of tokens generated |
| promptEvalCount | Number of tokens in the prompt |
POST /complete/stream
Same request body as /complete (maxTokens not applicable for streaming).
Response is a stream of Server-Sent Events. Each event contains a partial
response chunk as JSON. The stream closes with a final data: [DONE] event.
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
data: [DONE]
Clients should read the response field from each chunk and accumulate
them to build the full response string.
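The accumulation step can be sketched client-side as below. This is an illustrative helper (not part of the package) that parses `data:` lines from the stream and concatenates the `response` fields until the `[DONE]` sentinel:

```javascript
// Sketch of a client accumulating the SSE stream into one string.
function accumulateSSE(lines) {
  let full = '';
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;    // skip non-data lines
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') break;             // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    full += chunk.response;                      // append the partial text
  }
  return full;
}

const events = [
  'data: {"model":"companion:latest","response":"The","done":false}',
  'data: {"model":"companion:latest","response":" capital","done":false}',
  'data: {"model":"companion:latest","response":" of France is Paris.","done":false}',
  'data: [DONE]',
];
console.log(accumulateSSE(events)); // The capital of France is Paris.
```

A real client would read these lines incrementally off the HTTP response rather than from an array, but the parsing logic is the same.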