nexusAI/docs/services/inference-service.md
2026-04-17 03:46:17 -07:00

# Inference Service
**Package:** `@nexusai/inference-service`
**Location:** `packages/inference-service`
**Deployed on:** Main PC (192.168.0.79)
**Port:** 3001
## Purpose
Thin adapter layer around the local LLM runtime. Receives assembled context
packages from the orchestration service and returns model responses. Uses a
provider pattern to abstract the underlying runtime, making it straightforward
to switch inference backends without changes to the rest of the system.
## Dependencies
- `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
> service. The orchestration service uses `INFERENCE_SERVICE_URL` to reach
> this service on port 3001.
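
The defaults in the table can be resolved in one place at startup. The following is a minimal sketch, assuming a hypothetical `loadConfig` helper; the service's actual startup code may read these variables differently.

```js
// Hypothetical helper: resolves the documented variables and their defaults.
// Empty strings are treated as "not set" by using || rather than ??.
function loadConfig(env = process.env) {
  return {
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || "llamacpp",
    inferenceUrl: env.INFERENCE_URL || "http://localhost:8080",
    defaultModel: env.DEFAULT_MODEL || "local-model",
  };
}
```

Calling `loadConfig({})` yields the defaults from the table, which makes the fallback behaviour easy to check without touching the real environment.
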
## Provider Architecture
The active provider is selected at startup via `INFERENCE_PROVIDER` and
loaded from `src/providers/`. Both providers expose identical function
signatures.
### Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |
Switching providers requires only a `.env` change — no code modifications:
```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```
The provider loader throws immediately on an unknown value, preventing silent
misconfiguration.
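
The fail-fast selection described above could look roughly like the sketch below. The `providers` registry and `selectProvider` function are hypothetical names used for illustration, and the provider objects are stand-ins for the real modules in `src/providers/`.

```js
// Hypothetical registry. In the real service these would be the modules
// loaded from src/providers/ollama.js and src/providers/llamacpp.js.
const providers = {
  llamacpp: { name: "llamacpp" },
  ollama: { name: "ollama" },
};

// Returns the provider for the given INFERENCE_PROVIDER value,
// or throws at startup if the value is not recognised.
function selectProvider(value = "llamacpp") {
  if (!Object.prototype.hasOwnProperty.call(providers, value)) {
    const known = Object.keys(providers).join(", ");
    throw new Error(`Unknown INFERENCE_PROVIDER "${value}". Expected one of: ${known}`);
  }
  return providers[value];
}
```

Throwing during module load means a typo such as `INFERENCE_PROVIDER=llama-cpp` stops the process before it accepts any requests.
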
## Internal Structure
```
src/
├── providers/
│ ├── ollama.js # Ollama provider
│ └── llamacpp.js # llama.cpp provider (OpenAI-compatible REST)
├── routes/
│ └── inference.js # /complete and /complete/stream route handlers
├── infer.js # Provider loader — selects and re-exports active provider
└── index.js # Express app + route definitions
```
## llama.cpp Provider
Uses the OpenAI-compatible REST API exposed by `llama-server`.
### Starting llama-server
`llama-server` must be started manually on the main PC before the inference
service can handle requests:
```powershell
.\llama-gpu\llama-server.exe `
-m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
-ngl 99 `
--reasoning off `
--host 0.0.0.0 `
--port 8080 `
-c 64000
```
| Flag | Description |
|---|---|
| `-ngl 99` | Offload as many layers as possible to GPU |
| `--reasoning off` | Disables thinking delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows LAN connections |
| `-c 64000` | Context window size in tokens |
> `-c 64000` is intentionally large. NexusAI's memory architecture handles
> context injection, so a smaller context window is often sufficient if VRAM
> pressure builds.
### Model Naming
The model name in requests must match the name reported by `llama-server`
exactly, including the `.gguf` extension. To list the reported names:
```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```
Set `DEFAULT_MODEL` in `.env` to the exact reported name.
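
The same lookup can be done from Node. This is a sketch, assuming the OpenAI-style `/v1/models` response shape (`{ data: [{ id: ... }] }`) and Node 18+ for the global `fetch`; `listModelIds` and `fetchModelIds` are hypothetical helper names.

```js
// Extracts the model ids from an OpenAI-style /v1/models payload.
// Returns an empty array if the payload has no data list.
function listModelIds(payload) {
  return (payload?.data ?? []).map((model) => model.id);
}

// Hypothetical usage against a running llama-server (not called here).
async function fetchModelIds(baseUrl) {
  const res = await fetch(`${baseUrl}/v1/models`);
  if (!res.ok) throw new Error(`llama-server responded with HTTP ${res.status}`);
  return listModelIds(await res.json());
}
```

For example, `fetchModelIds("http://192.168.0.79:8080")` resolves to the list of ids, any of which can be copied into `DEFAULT_MODEL`.
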
### Inference Parameters
| NexusAI option | API field | Default |
|---|---|---|
| `temperature` | `temperature` | 0.7 |
| `maxTokens` | `max_tokens` | 1024 |
| `topP` | `top_p` | 0.9 |
| `topK` | `top_k` | 40 |
| `repeatPenalty` | `repeat_penalty` | 1.1 |
| `seed` | `seed` | null (random) |
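
The mapping in the table can be expressed as a small translation function. This is an illustrative sketch: `toApiParams` is a hypothetical name, and omitting `seed` when it is `null` is an assumption about how "random" is conveyed to the runtime.

```js
// Defaults taken from the table above.
const DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

// Translates NexusAI option names into the request body field names.
function toApiParams(options = {}) {
  const o = { ...DEFAULTS, ...options };
  const params = {
    temperature: o.temperature,
    max_tokens: o.maxTokens,
    top_p: o.topP,
    top_k: o.topK,
    repeat_penalty: o.repeatPenalty,
  };
  // Assumption: a null seed means "random", so the field is left out.
  if (o.seed !== null && o.seed !== undefined) params.seed = o.seed;
  return params;
}
```

Callers pass only the options they want to override, for example `toApiParams({ maxTokens: 256, seed: 7 })`.
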
## Streaming Response Format
The llama.cpp provider yields chunks in this shape:
```js
{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
```
The inference route re-emits these chunks as Server-Sent Events (SSE):
```
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
```
`model` and `tokenCount` are captured from the llama.cpp chunk whose
`finish_reason` is `stop` and are emitted on the done event.
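
The re-emission step can be sketched as a pure function from a provider chunk to SSE text. `toSseEvents` is a hypothetical name, not the route's actual code. The sketch ends each event with a blank line, as SSE framing requires, and follows the done event with the `[DONE]` sentinel.

```js
// Converts one provider chunk into SSE text in the format documented above.
function toSseEvents(chunk) {
  if (!chunk.done) {
    // Intermediate chunks carry only the token text.
    return `data: ${JSON.stringify({ response: chunk.response })}\n\n`;
  }
  // The final chunk carries the metadata, followed by the [DONE] sentinel.
  const donePayload = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
  return `data: ${JSON.stringify(donePayload)}\n\ndata: [DONE]\n\n`;
}
```
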
For all HTTP endpoints, see `api-routes.md`.