nexusAI/docs/services/inference-service.md
2026-04-13 03:42:14 -07:00

# Inference Service
**Package:** `@nexusai/inference-service`
**Location:** `packages/inference-service`
**Deployed on:** Main PC (192.168.0.79)
**Port:** 3001
## Purpose
Thin adapter layer around the local LLM runtime. Receives assembled context
packages from the orchestration service and returns model responses. Uses a
provider pattern to abstract the underlying runtime, making it straightforward
to switch inference backends without changes to the rest of the system.
## Dependencies
- `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
> service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
> reach this service on port 3001.
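
The defaults in the table above could be resolved at startup roughly as follows. This is a minimal sketch, not the service's actual code: `loadConfig` and the returned field names are hypothetical, and only the variable names and default values come from the table.

```javascript
// Minimal sketch: resolve service settings from environment variables,
// falling back to the documented defaults.
// loadConfig and its field names are hypothetical (not taken from the codebase).
function loadConfig(env = process.env) {
  return {
    // `||` also covers variables that are set but empty
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || "llamacpp",
    inferenceUrl: env.INFERENCE_URL || "http://localhost:8080",
    defaultModel: env.DEFAULT_MODEL || "local-model",
  };
}

// Example: override only the port, keep every other default.
const exampleConfig = loadConfig({ PORT: "3005" });
console.log(exampleConfig);
```

Passing the environment in as an argument keeps the function easy to test without touching `process.env`.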
## Provider Architecture
The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER`
and loaded from `src/providers/`. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.
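
To illustrate what "identical function signatures" means in practice, here is a small sketch with two stand-in providers. The method names `complete` and `stream`, and the helper `describeInterface`, are assumptions made for the example; the real provider modules may expose different names.

```javascript
// Two stand-in providers exposing the same functions.
// The bodies are stubs; no inference runtime is contacted.
const ollamaStub = {
  async complete(prompt, options = {}) {
    return { text: `[ollama] ${prompt}`, done: true };
  },
  async *stream(prompt, options = {}) {
    yield { response: `[ollama] ${prompt}`, done: false };
    yield { response: "", done: true };
  },
};

const llamacppStub = {
  async complete(prompt, options = {}) {
    return { text: `[llamacpp] ${prompt}`, done: true };
  },
  async *stream(prompt, options = {}) {
    yield { response: `[llamacpp] ${prompt}`, done: false };
    yield { response: "", done: true };
  },
};

// Summarise a provider as "name/arity" pairs so two providers can be compared.
function describeInterface(provider) {
  return Object.keys(provider)
    .filter((key) => typeof provider[key] === "function")
    .sort()
    .map((key) => `${key}/${provider[key].length}`);
}

console.log(describeInterface(ollamaStub), describeInterface(llamacppStub));
```

A check like this could run in a unit test to catch a provider that drifts from the shared shape.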
### Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |
Switching providers requires only a `.env` change — no code modifications needed:
```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```
### Provider Validation
The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately
if an unknown value is set, which prevents silent misconfiguration:
```
Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
```
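
A validation step that produces the error above might look like the following sketch. `VALID_PROVIDERS` and `resolveProviderName` are hypothetical names; only the accepted values, the default, and the message text are taken from this document.

```javascript
// Sketch of fail-fast provider validation (names are illustrative).
const VALID_PROVIDERS = ["ollama", "llamacpp"];

function resolveProviderName(value = "llamacpp") {
  if (!VALID_PROVIDERS.includes(value)) {
    // Throwing at startup surfaces a typo in .env before any request is served
    throw new Error(
      `Unknown inference provider: "${value}". Valid options: ${VALID_PROVIDERS.join(", ")}`
    );
  }
  return value;
}

console.log(resolveProviderName("ollama"));
```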
## llama.cpp Provider
The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.
### Starting llama-server
`llama-server` must be started manually on the main PC before the inference service
can handle requests. It loads a single model at startup:
```powershell
.\llama-gpu\llama-server.exe `
-m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
-ngl 99 `
--reasoning off `
--host 0.0.0.0 `
--port 8080 `
-c 64000
```
Key flags:
| Flag | Description |
|---|---|
| `-m` | Path to the `.gguf` model file |
| `-ngl 99` | Offload as many layers as possible to GPU |
| `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows connections from other machines on the LAN |
| `--port 8080` | Port for the llama-server HTTP API |
| `-c 64000` | Context window size in tokens |
> `-c 64000` is intentionally large. Monitor VRAM usage and reduce this value
> if memory pressure builds. The NexusAI memory architecture handles context
> injection, so a smaller window (6-8K) is often sufficient.
### Model Naming
The model name sent in API requests must match the name as reported by
`llama-server` — including the `.gguf` extension. The reported name can be
verified with:
```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```
Set `DEFAULT_MODEL` in `.env` to the exact reported name:
```
DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
```
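
The same check can be done from Node. The sketch below assumes the response follows the OpenAI-style list shape (`{ data: [{ id: ... }] }`) and that Node 18+ is available for the global `fetch`; `extractModelIds` and `listModelIds` are hypothetical helper names.

```javascript
// Pull the model names out of a /v1/models response body.
// Assumes the OpenAI-style shape: { data: [{ id: "..." }, ...] }.
function extractModelIds(body) {
  const entries = Array.isArray(body?.data) ? body.data : [];
  return entries.map((entry) => entry.id);
}

// Hypothetical helper: query llama-server and return the reported names.
// Not invoked here, since it needs a running llama-server.
async function listModelIds(baseUrl = "http://192.168.0.79:8080") {
  const res = await fetch(`${baseUrl}/v1/models`);
  if (!res.ok) throw new Error(`llama-server responded with HTTP ${res.status}`);
  return extractModelIds(await res.json());
}

// Example with a hand-written body in the assumed shape:
console.log(
  extractModelIds({ data: [{ id: "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf" }] })
);
```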
### Inference Parameters
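The llamacpp provider maps NexusAI options to request fields accepted by the `llama-server` OpenAI-compatible API (`top_k` and `repeat_penalty` are llama.cpp extensions rather than standard OpenAI fields):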
| NexusAI option | API field | Default |
|---|---|---|
| `temperature` | `temperature` | 0.7 |
| `maxTokens` | `max_tokens` | 1024 |
| `topP` | `top_p` | 0.9 |
| `topK` | `top_k` | 40 |
| `repeatPenalty` | `repeat_penalty` | 1.1 |
| `seed` | `seed` | null (random) |
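
The mapping in the table can be sketched as a small pure function. `toApiParams` is a hypothetical name, and omitting the `seed` field when it is `null` is an assumption about how "random" is expressed to the runtime.

```javascript
// Sketch: translate NexusAI options into request fields, applying the
// defaults from the table. toApiParams is an illustrative name.
function toApiParams(options = {}) {
  const params = {
    // `??` keeps explicit zero values such as temperature: 0
    temperature: options.temperature ?? 0.7,
    max_tokens: options.maxTokens ?? 1024,
    top_p: options.topP ?? 0.9,
    top_k: options.topK ?? 40,
    repeat_penalty: options.repeatPenalty ?? 1.1,
  };
  // Assumption: a missing or null seed is left out so the runtime picks one
  if (options.seed !== null && options.seed !== undefined) {
    params.seed = options.seed;
  }
  return params;
}

console.log(toApiParams({ maxTokens: 256, seed: 7 }));
```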
## Internal Structure
```
src/
├── providers/
│   ├── ollama.js      # Ollama provider (uses the ollama npm package)
│   └── llamacpp.js    # llama.cpp provider (uses the OpenAI-compatible REST API)
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader: selects and re-exports the active provider
└── index.js           # Express app + route definitions
```
## Streaming Response Format
The llama.cpp provider yields chunks in this shape:
```js
{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
```
The inference route re-emits these as SSE events:
```
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
```
`model` and `tokenCount` are captured from the final llama.cpp chunk (the one
with `finish_reason: stop`; `tokenCount` is read from `usage.completion_tokens`)
and emitted on the done event so the orchestration layer can forward them to
the client.
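
The chunk-to-SSE translation described above could be written along these lines. This is a sketch under the assumption that each event is framed as a single `data:` line followed by a blank line; `toSseFrames` is a hypothetical name, not the route handler's actual code.

```javascript
// Sketch: turn one provider chunk into the SSE frame(s) to write to the response.
function toSseFrames(chunk) {
  if (!chunk.done) {
    // Token chunk: forward only the text
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  // Final chunk: emit the metadata event, then the [DONE] sentinel
  const donePayload = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
  return [`data: ${JSON.stringify(donePayload)}\n\n`, "data: [DONE]\n\n"];
}

for (const chunk of [
  { response: "token text", done: false },
  { response: "", done: true, model: "model-name.gguf", tokenCount: 42 },
]) {
  process.stdout.write(toSseFrames(chunk).join(""));
}
```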
## Endpoints
### Health
| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports active provider and model |
### Inference
| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |
---
**POST /complete**
Request body:
```json
{
  "prompt": "What is the capital of France?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
  "maxTokens": 1024
}
```
- `model` is optional and falls back to `DEFAULT_MODEL` if omitted.
- `maxTokens` is optional and defaults to 1024.
- `temperature` is optional and defaults to 0.7.

Response:
```json
{
  "text": "The capital of France is Paris.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
```
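
A caller such as the orchestration service might issue this request as sketched below. The helper names are hypothetical, the base URL is assembled from the host and port listed at the top of this page, and Node 18+ is assumed for the global `fetch`.

```javascript
// Sketch: build the fetch arguments for POST /complete.
function buildCompleteRequest(prompt, options = {}, baseUrl = "http://192.168.0.79:3001") {
  return {
    url: `${baseUrl}/complete`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Optional fields (model, temperature, maxTokens) are sent only when provided
      body: JSON.stringify({ prompt, ...options }),
    },
  };
}

// Hypothetical wrapper that sends the request and returns the `text` field.
// Not invoked here, since it needs the inference service to be running.
async function requestCompletion(prompt, options) {
  const { url, init } = buildCompleteRequest(prompt, options);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Inference service responded with HTTP ${res.status}`);
  const { text } = await res.json();
  return text;
}

console.log(buildCompleteRequest("What is the capital of France?", { maxTokens: 64 }));
```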
---
**POST /complete/stream**
Same request body as `/complete`.
Response is a stream of Server-Sent Events:
```
data: {"response":"The"}
data: {"response":" capital of France is Paris."}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
data: [DONE]
```
Clients should accumulate `response` fields to build the full response string.
The `done` event carries `model` and `tokenCount` for display in the UI.
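
The accumulation step can be sketched as follows. `accumulateSse` is a hypothetical helper that works on already-received text; a real streaming client would also need to buffer lines that are split across network chunks.

```javascript
// Sketch: rebuild the full response from the text of an SSE stream.
function accumulateSse(streamText) {
  const result = { text: "", model: null, tokenCount: null };
  for (const line of streamText.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank separator lines
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const event = JSON.parse(payload);
    if (typeof event.response === "string") result.text += event.response;
    if (event.done) {
      result.model = event.model ?? null;
      result.tokenCount = event.tokenCount ?? null;
    }
  }
  return result;
}

const sampleStream = [
  'data: {"response":"The"}',
  'data: {"response":" capital of France is Paris."}',
  'data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}',
  "data: [DONE]",
].join("\n");

console.log(accumulateSse(sampleStream));
```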