## Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context
packages from the orchestration service and returns model responses. Uses a
provider pattern to abstract the underlying runtime, making it straightforward
to switch inference backends without changes to the rest of the system.

||||
## Dependencies

- `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | ollama | Active inference provider (`ollama`, `llamacpp`) |
| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |

## Provider Architecture

The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER`
and loaded from `src/providers/`. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.

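The selection step could be sketched as below. This is a hypothetical illustration, not the actual `src/infer.js`: the provider object shapes and the `loadProvider` name are assumptions; only the selection rule (`INFERENCE_PROVIDER` with an `ollama` default) comes from the configuration table above.

```javascript
// Each provider module exposes the same function signatures;
// trivial stubs stand in for the real implementations here.
const providers = {
  ollama: { name: "ollama", complete: async ({ prompt }) => ({ text: "", done: true }) },
  llamacpp: { name: "llamacpp", complete: async ({ prompt }) => ({ text: "", done: true }) },
};

function loadProvider(env = process.env) {
  const key = env.INFERENCE_PROVIDER || "ollama"; // default matches the table above
  const provider = providers[key];
  if (!provider) {
    throw new Error(`Unknown INFERENCE_PROVIDER: ${key}`);
  }
  return provider; // the rest of the service only ever calls provider.complete(...)
}
```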
### Supported Providers

| Provider | Value | Runtime |
|---|---|---|
| Ollama | `ollama` | Ollama via the `ollama` npm package |
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |

Switching providers requires only a `.env` change — no code modifications needed:

```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```

## Internal Structure

```
src/
├── providers/
│   ├── ollama.js      # Ollama provider — uses ollama npm package
│   └── llamacpp.js    # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions
```

## Endpoints

### Health

| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports active provider and model |

### Inference

| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |

---

**POST /complete**

Request body:

```json
{
  "prompt": "What is the capital of France?",
  "model": "companion:latest",
  "temperature": 0.7,
  "maxTokens": 1024
}
```

- `model` — optional; falls back to `DEFAULT_MODEL` if omitted.
- `temperature` — optional; defaults to 0.7.
- `maxTokens` — optional; defaults to 1024.

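The defaulting rules could be applied in one small normalizer, sketched below. The function name and validation behavior are assumptions for illustration; only the three defaults themselves come from the documentation.

```javascript
// Hypothetical sketch of how a /complete handler might apply the
// documented defaults before calling the active provider.
const DEFAULT_MODEL = "llama3.2"; // would come from process.env in the real service

function normalizeCompleteRequest(body) {
  if (!body || typeof body.prompt !== "string") {
    throw new Error("prompt is required");
  }
  return {
    prompt: body.prompt,
    model: body.model ?? DEFAULT_MODEL,   // optional, falls back to DEFAULT_MODEL
    temperature: body.temperature ?? 0.7, // optional, defaults to 0.7
    maxTokens: body.maxTokens ?? 1024,    // optional, defaults to 1024
  };
}
```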

Response:

```json
{
  "text": "The capital of France is Paris.",
  "model": "companion:latest",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
```

| Field | Description |
|---|---|
| `text` | The model's response |
| `model` | Model name as reported by the provider |
| `done` | Whether generation completed normally |
| `evalCount` | Number of tokens generated |
| `promptEvalCount` | Number of tokens in the prompt |

---

**POST /complete/stream**

Same request body as `/complete` (`maxTokens` is not applicable for streaming).

Response is a stream of Server-Sent Events. Each event contains a partial
response chunk as JSON. The stream closes with a final `data: [DONE]` event:

```
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
data: [DONE]
```

Clients should read the `response` field from each chunk and concatenate
them to build the full response string.
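That accumulation step could look like the sketch below. The helper name is an assumption; only the `data:` line framing and the `[DONE]` sentinel come from the wire format documented above.

```javascript
// Hypothetical client helper: parse raw SSE text from /complete/stream and
// concatenate the "response" fields into the full completion string.
function accumulateStream(rawSse) {
  let text = "";
  for (const line of rawSse.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank lines between events
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break;          // final sentinel event, stop reading
    text += JSON.parse(payload).response;     // append this partial chunk
  }
  return text;
}
```

A real client would feed decoded chunks from `fetch` or `EventSource` into the same logic incrementally rather than buffering the whole stream first.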