nexusAI/docs/services/inference-service.md

Inference Service

Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC
Port: 3001

Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.

Dependencies

  • express — HTTP API
  • ollama — Ollama client (used by the Ollama provider)
  • dotenv — environment variable loading
  • @nexusai/shared — shared utilities

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
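
The variables above can be resolved into a single config object at startup. The sketch below is illustrative only — `resolveConfig` and its field names are assumptions, not the actual source — but it shows how each variable falls back to its documented default:

```javascript
// Hypothetical config resolver; field names and the defaults below come
// from the environment-variable table, not from the real codebase.
function resolveConfig(env) {
  return {
    port: Number(env.PORT ?? 3001),
    provider: env.INFERENCE_PROVIDER ?? 'ollama',
    inferenceUrl: env.INFERENCE_URL ?? 'http://localhost:11434',
    defaultModel: env.DEFAULT_MODEL ?? 'llama3.2',
  };
}

// Typically called once at startup:
const config = resolveConfig(process.env);
```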

Provider Architecture

The inference service uses a provider pattern to abstract the underlying LLM runtime. The active provider is selected at startup via INFERENCE_PROVIDER and loaded from src/providers/. Both providers expose identical function signatures, so the rest of the service is unaware of which backend is active.
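
A provider loader in this style might look like the following sketch. The function name and the mapping are assumptions for illustration; only the provider values (`ollama`, `llamacpp`) and the `src/providers/` layout come from this document:

```javascript
// Hypothetical loader in the style of infer.js: map the configured
// provider name to its module path, rejecting unknown values at startup.
function providerPath(name) {
  const supported = {
    ollama: './providers/ollama.js',
    llamacpp: './providers/llamacpp.js',
  };
  const file = supported[name];
  if (!file) throw new Error(`Unknown INFERENCE_PROVIDER: ${name}`);
  return file;
}

// At startup the service could then do something like:
//   const provider = require(providerPath(process.env.INFERENCE_PROVIDER ?? 'ollama'));
// and call the provider's functions without knowing which backend is active.
```

Failing fast on an unrecognized value keeps a misconfigured `.env` from silently falling through to the wrong backend.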

Supported Providers

| Provider | Value | Runtime |
| --- | --- | --- |
| Ollama | ollama | Ollama via the ollama npm package |
| llama.cpp | llamacpp | llama.cpp server (OpenAI-compatible API) |

Switching providers requires only a .env change — no code modifications needed:

```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```

Internal Structure

```
src/
├── providers/
│   ├── ollama.js     # Ollama provider — uses ollama npm package
│   └── llamacpp.js   # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js  # /complete and /complete/stream route handlers
├── infer.js          # Provider loader — selects and re-exports active provider
└── index.js          # Express app + route definitions
```

Endpoints

Health

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Service health check — reports active provider and model |

Inference

| Method | Path | Description |
| --- | --- | --- |
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |

POST /complete

Request body:

```json
{
  "prompt": "What is the capital of France?",
  "model": "companion:latest",
  "temperature": 0.7,
  "maxTokens": 1024
}
```

  • model — optional; falls back to DEFAULT_MODEL if omitted.
  • temperature — optional; defaults to 0.7.
  • maxTokens — optional; defaults to 1024.
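
The fallback behavior can be sketched as a small helper that fills in the documented defaults before a request is sent to the provider. `normalizeCompleteRequest` is a hypothetical name, and the `llama3.2` fallback assumes the table's DEFAULT_MODEL value:

```javascript
// Hypothetical request normalizer: applies the documented defaults for
// model, temperature, and maxTokens when the caller omits them.
function normalizeCompleteRequest(body, defaultModel = 'llama3.2') {
  return {
    prompt: body.prompt,
    model: body.model ?? defaultModel,
    temperature: body.temperature ?? 0.7,
    maxTokens: body.maxTokens ?? 1024,
  };
}
```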

Response:

```json
{
  "text": "The capital of France is Paris.",
  "model": "companion:latest",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
```

| Field | Description |
| --- | --- |
| text | The model's response |
| model | Model name as reported by the provider |
| done | Whether generation completed normally |
| evalCount | Number of tokens generated |
| promptEvalCount | Number of tokens in the prompt |

POST /complete/stream

Same request body as /complete (maxTokens not applicable for streaming).

Response is a stream of Server-Sent Events. Each event contains a partial response chunk as JSON. The stream closes with a final data: [DONE] event.

```
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
data: [DONE]
```

Clients should read the response field from each chunk and accumulate them to build the full response string.
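
That accumulation step can be sketched as a small parser over the raw event text. This is a minimal client-side illustration (the function name is an assumption), assuming the `data:`-prefixed line format and `[DONE]` terminator shown above:

```javascript
// Minimal sketch: collect the `response` field from each SSE data line,
// skipping blanks and stopping at the [DONE] sentinel.
function accumulateSSE(rawEvents) {
  let text = '';
  for (const line of rawEvents.split('\n')) {
    const payload = line.replace(/^data:\s*/, '').trim();
    if (!payload) continue;          // ignore blank separator lines
    if (payload === '[DONE]') break; // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    if (chunk.response) text += chunk.response;
  }
  return text;
}
```

A real client would apply the same logic incrementally as chunks arrive on the wire rather than to a fully buffered string.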