# Inference Service

**Package:** `@nexusai/inference-service`
**Location:** `packages/inference-service`
**Deployed on:** Main PC
**Port:** 3001

## Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.

## Dependencies

- `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `PORT` | No | `3001` | Port to listen on |
| `INFERENCE_PROVIDER` | No | `ollama` | Active inference provider (`ollama`, `llamacpp`) |
| `INFERENCE_URL` | No | `http://localhost:11434` | URL of the inference runtime |
| `DEFAULT_MODEL` | No | `llama3.2` | Default model name passed to the provider |

## Provider Architecture

The inference service uses a provider pattern to abstract the underlying LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER` and loaded from `src/providers/`. Both providers expose identical function signatures, so the rest of the service is unaware of which backend is active.

### Supported Providers

| Provider | Value | Runtime |
|---|---|---|
| Ollama | `ollama` | Ollama via the `ollama` npm package |
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |

Switching providers requires only a `.env` change — no code modifications needed.
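As a rough sketch of the selection mechanism, the loader might resolve the active provider like this. This is a hypothetical rendering (the function and map names are illustrative, not taken from the actual source):

```javascript
// Hypothetical sketch of the provider-selection logic; the real loader
// in src/infer.js may differ, but the mechanism described above is this.
const PROVIDER_MODULES = {
  ollama: './providers/ollama',
  llamacpp: './providers/llamacpp',
};

// Resolve which provider module to load, defaulting to ollama.
function resolveProvider(env) {
  const name = env.INFERENCE_PROVIDER || 'ollama';
  const modulePath = PROVIDER_MODULES[name];
  if (!modulePath) {
    throw new Error(`Unknown INFERENCE_PROVIDER: ${name}`);
  }
  return modulePath;
}

// In the real service, the loader would then re-export the active provider:
//   module.exports = require(resolveProvider(process.env));
```

Because both provider modules expose identical signatures, adding a new backend only means adding one file under `src/providers/` and registering it in the selection step.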
```bash
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```

## Internal Structure

```
src/
├── providers/
│   ├── ollama.js     # Ollama provider — uses ollama npm package
│   └── llamacpp.js   # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js  # /complete and /complete/stream route handlers
├── infer.js          # Provider loader — selects and re-exports active provider
└── index.js          # Express app + route definitions
```

## Endpoints

### Health

| Method | Path | Description |
|---|---|---|
| GET | `/health` | Service health check — reports active provider and model |

### Inference

| Method | Path | Description |
|---|---|---|
| POST | `/complete` | Standard completion — returns full response when done |
| POST | `/complete/stream` | Streaming completion via Server-Sent Events |

---

**POST /complete**

Request body:

```json
{
  "prompt": "What is the capital of France?",
  "model": "companion:latest",
  "temperature": 0.7,
  "maxTokens": 1024
}
```

`model` is optional — falls back to `DEFAULT_MODEL` if omitted. `maxTokens` is optional — defaults to 1024. `temperature` is optional — defaults to 0.7.

Response:

```json
{
  "text": "The capital of France is Paris.",
  "model": "companion:latest",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
```

| Field | Description |
|---|---|
| `text` | The model's response |
| `model` | Model name as reported by the provider |
| `done` | Whether generation completed normally |
| `evalCount` | Number of tokens generated |
| `promptEvalCount` | Number of tokens in the prompt |

---

**POST /complete/stream**

Same request body as `/complete` (`maxTokens` is not applicable for streaming).

Response is a stream of Server-Sent Events. Each event contains a partial response chunk as JSON. The stream closes with a final `data: [DONE]` event.
```
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
data: [DONE]
```

Clients should read the `response` field from each chunk and accumulate them to build the full response string.
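The accumulation described above can be sketched as a small Node client. This is an illustrative sketch, not part of the service: it assumes Node 18+ (built-in `fetch` and web streams), the service running on `localhost:3001`, and the one-`data:`-line-per-event framing shown above; the helper names are made up for the example.

```javascript
// Parse one SSE line from /complete/stream. Returns null for non-data
// lines, { done: true } for the terminal [DONE] marker, or the chunk object.
function parseSSELine(line) {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length);
  if (payload === '[DONE]') return { done: true };
  return JSON.parse(payload);
}

// Stream a completion and accumulate the `response` fields into one string.
// onChunk (optional) is called with each partial chunk as it arrives.
async function streamComplete(prompt, onChunk) {
  const res = await fetch('http://localhost:3001/complete/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  let full = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep any partial line for the next read
    for (const line of lines) {
      const event = parseSSELine(line.trim());
      if (!event || !event.response) continue; // skip blanks and [DONE]
      full += event.response;
      if (onChunk) onChunk(event.response);
    }
  }
  return full;
}
```

Buffering on newline boundaries matters here: a network read can end mid-line, so the trailing partial line is carried over rather than parsed immediately.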