updated inference service documentation

This commit is contained in:
Storme-bit
2026-04-05 04:22:49 -07:00
parent a449d570ea
commit 4b3f6455f9


@@ -7,14 +7,15 @@
## Purpose
-Thin adapter layer around the local LLM runtime (Ollama). Receives
-assembled context packages from the orchestration service and returns
-model responses.
+Thin adapter layer around the local LLM runtime. Receives assembled context
+packages from the orchestration service and returns model responses. Uses a
+provider pattern to abstract the underlying runtime, making it straightforward
+to switch inference backends without changes to the rest of the system.
## Dependencies
- `express` — HTTP API
-- `ollama` — Ollama client
+- `ollama` — Ollama client (used by the Ollama provider)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities
@@ -23,13 +24,102 @@ model responses.
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
-| OLLAMA_URL | No | http://localhost:11434 | Ollama instance URL |
-| DEFAULT_MODEL | No | llama3 | Default model to use for inference |
+| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
+| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
+| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
## Provider Architecture
The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER`
and loaded from `src/providers/`. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.
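A minimal sketch of what that shared provider shape could look like (the function names `complete` and `completeStream` are illustrative assumptions, not the service's documented exports):

```javascript
// Hypothetical outline of a provider module. Both providers would export
// the same functions, so callers never need to know which backend is active.
async function complete({ prompt, model, temperature = 0.7, maxTokens = 1024 }) {
  // A real provider calls its runtime here and normalizes the reply
  // into the common response shape used by the routes.
  return { text: '', model, done: true, evalCount: 0, promptEvalCount: 0 };
}

async function* completeStream({ prompt, model, temperature = 0.7 }) {
  // A real provider yields partial chunks as the runtime streams them.
  yield { model, response: '', done: false };
}

module.exports = { complete, completeStream };
```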
### Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| Ollama | `ollama` | Ollama via the `ollama` npm package |
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |
Switching providers requires only a `.env` change — no code modifications needed.
```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```
## Internal Structure
```
src/
├── providers/
│   ├── ollama.js     # Ollama provider — uses ollama npm package
│   └── llamacpp.js   # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js  # /complete and /complete/stream route handlers
├── infer.js          # Provider loader — selects and re-exports active provider
└── index.js          # Express app + route definitions
```
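As a sketch, the loader in `src/infer.js` might look like the following (the `providerPath` helper and its details are assumptions based on the description above, not the file's actual contents):

```javascript
// Hypothetical sketch of src/infer.js: validate the configured provider
// name, then load and re-export that module so the rest of the service
// has a single import point.
const SUPPORTED = ['ollama', 'llamacpp'];

function providerPath(name) {
  // Fail fast at startup on a typo in .env rather than at first request.
  if (!SUPPORTED.includes(name)) {
    throw new Error(`Unknown INFERENCE_PROVIDER: ${name}`);
  }
  return `./providers/${name}`;
}

const active = providerPath(process.env.INFERENCE_PROVIDER || 'ollama');
// In the real file: module.exports = require(active);
```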
## Endpoints
### Health
| Method | Path | Description |
|---|---|---|
-| GET | /health | Service health check |
+| GET | /health | Service health check — reports active provider and model |
-> Further endpoints will be documented as the service is built out.
### Inference
| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |
---
**POST /complete**
Request body:
```json
{
"prompt": "What is the capital of France?",
"model": "companion:latest",
"temperature": 0.7,
"maxTokens": 1024
}
```
`model` is optional — falls back to `DEFAULT_MODEL` if omitted.
`maxTokens` is optional — defaults to 1024.
`temperature` is optional — defaults to 0.7.
Response:
```json
{
"text": "The capital of France is Paris.",
"model": "companion:latest",
"done": true,
"evalCount": 8,
"promptEvalCount": 41
}
```
| Field | Description |
|---|---|
| `text` | The model's response |
| `model` | Model name as reported by the provider |
| `done` | Whether generation completed normally |
| `evalCount` | Number of tokens generated |
| `promptEvalCount` | Number of tokens in the prompt |
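As an illustration, a Node 18+ client could call the endpoint like this (the base URL assumes the default `PORT` of 3001, and the helper name is hypothetical):

```javascript
// Hypothetical client for POST /complete using Node's built-in fetch.
async function requestCompletion(prompt, options = {}) {
  const res = await fetch('http://localhost:3001/complete', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, ...options }),
  });
  if (!res.ok) throw new Error(`Inference request failed: ${res.status}`);
  return res.json(); // { text, model, done, evalCount, promptEvalCount }
}

// Usage:
// requestCompletion('What is the capital of France?', { temperature: 0.2 })
//   .then(({ text }) => console.log(text));
```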
---
**POST /complete/stream**
Same request body as `/complete` (`maxTokens` not applicable for streaming).
Response is a stream of Server-Sent Events. Each event contains a partial
response chunk as JSON. The stream closes with a final `data: [DONE]` event.
```
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
data: [DONE]
```
Clients should read the `response` field from each chunk and accumulate
them to build the full response string.
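That accumulation can be sketched as follows, assuming the stream has already been buffered into text and each event arrives as a single `data: <json>` line ending with the `data: [DONE]` sentinel shown above:

```javascript
// Parse buffered SSE text and accumulate the `response` fields into
// the full completion string.
function accumulate(sseText) {
  let full = '';
  for (const line of sseText.split('\n')) {
    if (!line.startsWith('data: ')) continue;       // skip blanks/comments
    const payload = line.slice('data: '.length).trim();
    if (payload === '[DONE]') break;                // end-of-stream sentinel
    full += JSON.parse(payload).response;
  }
  return full;
}
```

A production client would read `res.body` incrementally and handle events split across network chunks; this sketch only covers already-buffered text.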