updated inference service documentation

This commit is contained in:
Storme-bit
2026-04-05 04:22:49 -07:00
parent a449d570ea
commit 4b3f6455f9


@@ -7,14 +7,15 @@
## Purpose
-Thin adapter layer around the local LLM runtime (Ollama). Receives
-assembled context packages from the orchestration service and returns
-model responses.
+Thin adapter layer around the local LLM runtime. Receives assembled context
+packages from the orchestration service and returns model responses. Uses a
+provider pattern to abstract the underlying runtime, making it straightforward
+to switch inference backends without changes to the rest of the system.
## Dependencies
- `express` — HTTP API
-- `ollama` — Ollama client
+- `ollama` — Ollama client (used by the Ollama provider)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities
@@ -23,13 +24,102 @@ model responses.
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
-| OLLAMA_URL | No | http://localhost:11434 | Ollama instance URL |
-| DEFAULT_MODEL | No | llama3 | Default model to use for inference |
+| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
+| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
+| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
## Provider Architecture
The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER`
and loaded from `src/providers/`. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.
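A minimal sketch of what that shared provider shape could look like (the function names `complete` and `completeStream` are illustrative assumptions, not the service's documented exports):

```javascript
// Hypothetical outline of a provider module. Both providers would export
// the same functions, so callers never need to know which backend is active.
async function complete({ prompt, model, temperature = 0.7, maxTokens = 1024 }) {
  // A real provider calls its runtime here and normalizes the reply
  // into the common response shape used by the routes.
  return { text: '', model, done: true, evalCount: 0, promptEvalCount: 0 };
}

async function* completeStream({ prompt, model, temperature = 0.7 }) {
  // A real provider yields partial chunks as the runtime streams them.
  yield { model, response: '', done: false };
}

module.exports = { complete, completeStream };
```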
### Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| Ollama | `ollama` | Ollama via the `ollama` npm package |
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |
Switching providers requires only a `.env` change — no code modifications needed.
```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```
## Internal Structure
```
src/
├── providers/
│   ├── ollama.js     # Ollama provider — uses ollama npm package
│   └── llamacpp.js   # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js  # /complete and /complete/stream route handlers
├── infer.js          # Provider loader — selects and re-exports active provider
└── index.js          # Express app + route definitions
```
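As a sketch, the loader in `src/infer.js` might look like the following (the `providerPath` helper and its details are assumptions based on the description above, not the file's actual contents):

```javascript
// Hypothetical sketch of src/infer.js: validate the configured provider
// name, then load and re-export that module so the rest of the service
// has a single import point.
const SUPPORTED = ['ollama', 'llamacpp'];

function providerPath(name) {
  // Fail fast at startup on a typo in .env rather than at first request.
  if (!SUPPORTED.includes(name)) {
    throw new Error(`Unknown INFERENCE_PROVIDER: ${name}`);
  }
  return `./providers/${name}`;
}

const active = providerPath(process.env.INFERENCE_PROVIDER || 'ollama');
// In the real file: module.exports = require(active);
```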
## Endpoints
### Health
| Method | Path | Description |
|---|---|---|
-| GET | /health | Service health check |
+| GET | /health | Service health check — reports active provider and model |
-> Further endpoints will be documented as the service is built out.
### Inference
| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |
---
**POST /complete**
Request body:
```json
{
"prompt": "What is the capital of France?",
"model": "companion:latest",
"temperature": 0.7,
"maxTokens": 1024
}
```
`model` is optional — falls back to `DEFAULT_MODEL` if omitted.
`maxTokens` is optional — defaults to 1024.
`temperature` is optional — defaults to 0.7.
Response:
```json
{
"text": "The capital of France is Paris.",
"model": "companion:latest",
"done": true,
"evalCount": 8,
"promptEvalCount": 41
}
```
| Field | Description |
|---|---|
| `text` | The model's response |
| `model` | Model name as reported by the provider |
| `done` | Whether generation completed normally |
| `evalCount` | Number of tokens generated |
| `promptEvalCount` | Number of tokens in the prompt |
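As an illustration, a Node 18+ client could call the endpoint like this (the base URL assumes the default `PORT` of 3001, and the helper name is hypothetical):

```javascript
// Hypothetical client for POST /complete using Node's built-in fetch.
async function requestCompletion(prompt, options = {}) {
  const res = await fetch('http://localhost:3001/complete', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, ...options }),
  });
  if (!res.ok) throw new Error(`Inference request failed: ${res.status}`);
  return res.json(); // { text, model, done, evalCount, promptEvalCount }
}

// Usage:
// requestCompletion('What is the capital of France?', { temperature: 0.2 })
//   .then(({ text }) => console.log(text));
```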
---
**POST /complete/stream**
Same request body as `/complete` (`maxTokens` not applicable for streaming).
Response is a stream of Server-Sent Events. Each event contains a partial
response chunk as JSON. The stream closes with a final `data: [DONE]` event.
```
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
data: [DONE]
```
Clients should read the `response` field from each chunk and accumulate
them to build the full response string.
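That accumulation can be sketched as follows, assuming the stream has already been buffered into text and each event arrives as a single `data: <json>` line ending with the `data: [DONE]` sentinel shown above:

```javascript
// Parse buffered SSE text and accumulate the `response` fields into
// the full completion string.
function accumulate(sseText) {
  let full = '';
  for (const line of sseText.split('\n')) {
    if (!line.startsWith('data: ')) continue;       // skip blanks/comments
    const payload = line.slice('data: '.length).trim();
    if (payload === '[DONE]') break;                // end-of-stream sentinel
    full += JSON.parse(payload).response;
  }
  return full;
}
```

A production client would read `res.body` incrementally and handle events split across network chunks; this sketch only covers already-buffered text.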