diff --git a/docs/services/inference-service.md b/docs/services/inference-service.md
index 4a8e961..2f96a09 100644
--- a/docs/services/inference-service.md
+++ b/docs/services/inference-service.md
@@ -7,14 +7,15 @@
 ## Purpose
 
-Thin adapter layer around the local LLM runtime (Ollama). Receives
-assembled context packages from the orchestration service and returns
-model responses.
+Thin adapter layer around the local LLM runtime. Receives assembled context
+packages from the orchestration service and returns model responses. Uses a
+provider pattern to abstract the underlying runtime, making it straightforward
+to switch inference backends without changes to the rest of the system.
 
 ## Dependencies
 
 - `express` — HTTP API
-- `ollama` — Ollama client
+- `ollama` — Ollama client (used by the Ollama provider)
 - `dotenv` — environment variable loading
 - `@nexusai/shared` — shared utilities
 
 ## Environment Variables
@@ -23,13 +24,102 @@ model responses.
 | Variable | Required | Default | Description |
 |---|---|---|---|
 | PORT | No | 3001 | Port to listen on |
-| OLLAMA_URL | No | http://localhost:11434 | Ollama instance URL |
-| DEFAULT_MODEL | No | llama3 | Default model to use for inference |
+| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
+| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
+| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
+
+## Provider Architecture
+
+The inference service uses a provider pattern to abstract the underlying
+LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER`
+and loaded from `src/providers/`. Both providers expose identical function
+signatures, so the rest of the service is unaware of which backend is active.
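The selection logic described above can be sketched as follows. This is a minimal illustration, not the actual `src/infer.js`: the in-memory objects stand in for the real modules under `src/providers/`, and the exported interface shown (`complete`) is an assumption.

```javascript
// Minimal sketch of the provider-loader idea (hypothetical; the real
// src/infer.js may differ). The two objects below stand in for the
// modules under src/providers/; both expose the same interface, so
// callers never depend on a specific backend.
const providers = {
  ollama: {
    name: 'ollama',
    complete: async ({ prompt, model }) => ({ text: '...', model, done: true }),
  },
  llamacpp: {
    name: 'llamacpp',
    complete: async ({ prompt, model }) => ({ text: '...', model, done: true }),
  },
};

function loadProvider(name = process.env.INFERENCE_PROVIDER || 'ollama') {
  const provider = providers[name];
  if (!provider) {
    throw new Error(`Unknown inference provider: ${name}`);
  }
  return provider;
}
```

Because the loader resolves the backend once at startup, route handlers can call `provider.complete(...)` without ever branching on the configured runtime.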
+
+### Supported Providers
+
+| Provider | Value | Runtime |
+|---|---|---|
+| Ollama | `ollama` | Ollama via the `ollama` npm package |
+| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |
+
+Switching providers requires only a `.env` change — no code modifications needed:
+
+```
+INFERENCE_PROVIDER=llamacpp
+INFERENCE_URL=http://localhost:8080
+```
+
+## Internal Structure
+
+```
+src/
+├── providers/
+│   ├── ollama.js      # Ollama provider — uses ollama npm package
+│   └── llamacpp.js    # llama.cpp provider — uses OpenAI-compatible REST API
+├── routes/
+│   └── inference.js   # /complete and /complete/stream route handlers
+├── infer.js           # Provider loader — selects and re-exports active provider
+└── index.js           # Express app + route definitions
+```
 
 ## Endpoints
 
+### Health
+
 | Method | Path | Description |
 |---|---|---|
-| GET | /health | Service health check |
+| GET | /health | Service health check — reports active provider and model |
 
-> Further endpoints will be documented as the service is built out.
\ No newline at end of file
+### Inference
+
+| Method | Path | Description |
+|---|---|---|
+| POST | /complete | Standard completion — returns full response when done |
+| POST | /complete/stream | Streaming completion via Server-Sent Events |
+
+---
+
+**POST /complete**
+
+Request body:
+
+```json
+{
+  "prompt": "What is the capital of France?",
+  "model": "companion:latest",
+  "temperature": 0.7,
+  "maxTokens": 1024
+}
+```
+
+`model` is optional — falls back to `DEFAULT_MODEL` if omitted.
+`maxTokens` is optional — defaults to 1024.
+`temperature` is optional — defaults to 0.7.
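The defaulting rules for these optional fields can be made concrete with a short client-side sketch. `buildCompleteRequest` is a hypothetical helper, not part of the service; the service applies the same fallbacks on its side, so a client may also simply omit the fields.

```javascript
// Hypothetical helper that mirrors the documented defaults for the
// optional /complete fields. The service applies the same fallbacks
// server-side, so clients may also just omit these fields.
const DEFAULT_MODEL = 'llama3.2'; // mirrors the DEFAULT_MODEL env default

function buildCompleteRequest({ prompt, model, temperature, maxTokens }) {
  if (!prompt) throw new Error('prompt is required');
  return {
    prompt,
    model: model ?? DEFAULT_MODEL,   // optional; server falls back to DEFAULT_MODEL
    temperature: temperature ?? 0.7, // optional; defaults to 0.7
    maxTokens: maxTokens ?? 1024,    // optional; defaults to 1024
  };
}

// Usage (illustrative): POST the body to the service on its default port.
// await fetch('http://localhost:3001/complete', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildCompleteRequest({ prompt: 'Hello' })),
// });
```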
+
+Response:
+
+```json
+{
+  "text": "The capital of France is Paris.",
+  "model": "companion:latest",
+  "done": true,
+  "evalCount": 8,
+  "promptEvalCount": 41
+}
+```
+
+| Field | Description |
+|---|---|
+| `text` | The model's response |
+| `model` | Model name as reported by the provider |
+| `done` | Whether generation completed normally |
+| `evalCount` | Number of tokens generated |
+| `promptEvalCount` | Number of tokens in the prompt |
+
+---
+
+**POST /complete/stream**
+
+Same request body as `/complete` (`maxTokens` not applicable for streaming).
+
+Response is a stream of Server-Sent Events. Each event contains a partial
+response chunk as JSON. The stream closes with a final `data: [DONE]` event:
+
+```
+data: {"model":"companion:latest","response":"The","done":false}
+data: {"model":"companion:latest","response":" capital","done":false}
+data: {"model":"companion:latest","response":" of France is Paris.","done":false}
+data: [DONE]
+```
+
+Clients should read the `response` field from each chunk and accumulate
+them to build the full response string.
\ No newline at end of file
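The accumulation step can be sketched as a small client-side helper. This is illustrative only: it assumes the event framing shown above, with one `data:` line per chunk and `[DONE]` as the terminal sentinel.

```javascript
// Sketch of client-side accumulation for /complete/stream output:
// parse each `data:` line, stop at the [DONE] sentinel, and concatenate
// the `response` fields into the full text.
function accumulateStream(rawEvents) {
  let text = '';
  for (const line of rawEvents.split('\n')) {
    if (!line.startsWith('data: ')) continue;  // skip blanks and other fields
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') break;           // terminal sentinel, not JSON
    const chunk = JSON.parse(payload);
    text += chunk.response;
  }
  return text;
}
```

Fed the example events above, the helper yields the complete sentence; in a real client the same logic would run incrementally as each event arrives.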