updated documentation

Storme-bit
2026-04-13 03:42:14 -07:00
parent 5f024093d1
commit 045da0d7f4
5 changed files with 464 additions and 112 deletions

@@ -2,7 +2,7 @@
**Package:** `@nexusai/inference-service`
**Location:** `packages/inference-service`
**Deployed on:** Main PC
**Deployed on:** Main PC (192.168.0.79)
**Port:** 3001
## Purpose
@@ -15,7 +15,7 @@ to switch inference backends without changes to the rest of the system.
## Dependencies
- `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider)
- `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities
@@ -24,9 +24,13 @@ to switch inference backends without changes to the rest of the system.
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
> service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
> reach this service on port 3001.
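
As a rough illustration of how the defaults in the table could be resolved, here is a minimal sketch. The `loadConfig` helper and the shape of the object it returns are assumptions made for this example, not the service's actual code; only the variable names and default values come from the table.

```javascript
// Hypothetical helper: resolves settings from an env-like object using the
// defaults listed in the table (llamacpp, port 8080 runtime, local-model).
function loadConfig(env = process.env) {
  return {
    // Port this service listens on (not the llama-server port).
    port: Number(env.PORT || 3001),
    // Active provider name; validated elsewhere.
    provider: env.INFERENCE_PROVIDER || "llamacpp",
    // URL of the inference runtime (llama-server), not of this service.
    inferenceUrl: env.INFERENCE_URL || "http://localhost:8080",
    // Model name passed to the provider when a request omits one.
    defaultModel: env.DEFAULT_MODEL || "local-model",
  };
}
```

Using `||` rather than `??` treats an empty string in `.env` as unset, which is a design choice of this sketch.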
## Provider Architecture
@@ -39,14 +43,87 @@ signatures, so the rest of the service is unaware of which backend is active.
| Provider | Value | Runtime |
|---|---|---|
| Ollama | `ollama` | Ollama via the `ollama` npm package |
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |
Switching providers requires only a `.env` change — no code modifications needed.
Switching providers requires only a `.env` change — no code modifications needed:
```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```
### Provider Validation
The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately
if an unknown value is set, which prevents silent misconfiguration:
```
Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
```
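
The startup check described above might look roughly like the following. This is a sketch under assumptions: `resolveProviderName` and `VALID_PROVIDERS` are illustrative names, and the real loader in `infer.js` may be structured differently. The error text mirrors the message shown above.

```javascript
// Provider values documented in the Provider Architecture table.
const VALID_PROVIDERS = ["ollama", "llamacpp"];

// Hypothetical helper: returns the provider name or throws on an unknown value.
function resolveProviderName(env = process.env) {
  const name = env.INFERENCE_PROVIDER || "llamacpp";
  if (!VALID_PROVIDERS.includes(name)) {
    // Fail fast at startup instead of on the first request.
    throw new Error(
      `Unknown inference provider: "${name}". Valid options: ${VALID_PROVIDERS.join(", ")}`
    );
  }
  return name;
}
```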
## llama.cpp Provider
The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.
### Starting llama-server
`llama-server` must be started manually on the main PC before the inference service
can handle requests. It loads a single model at startup:
```powershell
.\llama-gpu\llama-server.exe `
-m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
-ngl 99 `
--reasoning off `
--host 0.0.0.0 `
--port 8080 `
-c 64000
```
Key flags:
| Flag | Description |
|---|---|
| `-m` | Path to the `.gguf` model file |
| `-ngl 99` | Number of layers to offload to the GPU; 99 is high enough to offload every layer |
| `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows connections from other machines on the LAN |
| `--port 8080` | Port for the llama-server HTTP API |
| `-c 64000` | Context window size in tokens |
> `-c 64000` is intentionally large. Monitor VRAM usage and reduce this value
> if pressure builds. The NexusAI memory architecture handles context injection,
> so a smaller window (6-8K) is often sufficient.
### Model Naming
The model name sent in API requests must match the name as reported by
`llama-server` — including the `.gguf` extension. The reported name can be
verified with:
```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```
Set `DEFAULT_MODEL` in `.env` to the exact reported name:
```
DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
```
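
The same lookup can be done from Node. The sketch below assumes the response follows the OpenAI-style list shape (`{ data: [{ id: "..." }] }`); `extractModelIds` and `listModels` are hypothetical helper names, and `baseUrl` stands for whatever `INFERENCE_URL` points at.

```javascript
// Assumes an OpenAI-style model list: { data: [{ id: "..." }, ...] }.
function extractModelIds(payload) {
  const entries = Array.isArray(payload?.data) ? payload.data : [];
  return entries
    .map((entry) => entry.id)
    .filter((id) => typeof id === "string");
}

// Hypothetical usage: fetches the list from llama-server (Node 18+ global fetch).
async function listModels(baseUrl) {
  const res = await fetch(`${baseUrl}/v1/models`);
  if (!res.ok) {
    throw new Error(`llama-server responded with HTTP ${res.status}`);
  }
  return extractModelIds(await res.json());
}
```

One of the returned ids is the string to copy into `DEFAULT_MODEL`.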
### Inference Parameters
The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
| NexusAI option | API field | Default |
|---|---|---|
| `temperature` | `temperature` | 0.7 |
| `maxTokens` | `max_tokens` | 1024 |
| `topP` | `top_p` | 0.9 |
| `topK` | `top_k` | 40 |
| `repeatPenalty` | `repeat_penalty` | 1.1 |
| `seed` | `seed` | null (random) |
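
The mapping in the table could be expressed as a small function like the one below. This is an illustrative sketch, not the provider's actual code: `toLlamaCppParams` is a hypothetical name, and omitting `seed` when it is null is an assumption about how "random" is requested.

```javascript
// Defaults taken from the table above.
const PARAM_DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
};

// Hypothetical helper: converts NexusAI options to OpenAI-compatible fields.
function toLlamaCppParams(options = {}) {
  const params = {
    temperature: options.temperature ?? PARAM_DEFAULTS.temperature,
    max_tokens: options.maxTokens ?? PARAM_DEFAULTS.maxTokens,
    top_p: options.topP ?? PARAM_DEFAULTS.topP,
    top_k: options.topK ?? PARAM_DEFAULTS.topK,
    repeat_penalty: options.repeatPenalty ?? PARAM_DEFAULTS.repeatPenalty,
  };
  // Assumption: a null or missing seed is left out so the server picks one.
  if (options.seed !== null && options.seed !== undefined) {
    params.seed = options.seed;
  }
  return params;
}
```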
## Internal Structure
```
src/
├── providers/
│ ├── ollama.js # Ollama provider — uses ollama npm package
@@ -55,6 +132,27 @@ src/
│ └── inference.js # /complete and /complete/stream route handlers
├── infer.js # Provider loader — selects and re-exports active provider
└── index.js # Express app + route definitions
```
## Streaming Response Format
The llama.cpp provider yields chunks in this shape:
```js
{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
```
The inference route re-emits these as SSE events:
```
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
```
`model` and `tokenCount` are captured from the final llama.cpp chunk (the one
with `finish_reason: stop`), with `tokenCount` taken from
`usage.completion_tokens`. Both are emitted on the done event so the
orchestration layer can forward them to the client.
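
To make the chunk-to-SSE translation concrete, here is a minimal sketch of how a provider chunk could be turned into SSE frames. `toSseFrames` is a hypothetical helper; the actual route handler in `inference.js` may write to the response differently.

```javascript
// Hypothetical helper: converts one provider chunk into SSE frame strings.
function toSseFrames(chunk) {
  if (!chunk.done) {
    // Token chunks carry only the partial text.
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  // The final chunk becomes a done event followed by the [DONE] sentinel.
  const doneEvent = {
    done: true,
    model: chunk.model,
    tokenCount: chunk.tokenCount,
  };
  return [`data: ${JSON.stringify(doneEvent)}\n\n`, "data: [DONE]\n\n"];
}
```

Each frame ends with a blank line, which is how SSE separates events.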
## Endpoints
@@ -79,7 +177,7 @@ Request body:
```json
{
"prompt": "What is the capital of France?",
"model": "companion:latest",
"model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"temperature": 0.7,
"maxTokens": 1024
}
@@ -93,33 +191,26 @@ Response:
```json
{
"text": "The capital of France is Paris.",
"model": "companion:latest",
"model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"done": true,
"evalCount": 8,
"promptEvalCount": 41
}
```
| Field | Description |
|---|---|
| `text` | The model's response |
| `model` | Model name as reported by the provider |
| `done` | Whether generation completed normally |
| `evalCount` | Number of tokens generated |
| `promptEvalCount` | Number of tokens in the prompt |
---
**POST /complete/stream**
Same request body as `/complete` (`maxTokens` not applicable for streaming).
Same request body as `/complete`.
Response is a stream of Server-Sent Events. Each event contains a partial
response chunk as JSON. The stream closes with a final `data: [DONE]` event.
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
Response is a stream of Server-Sent Events:
```
data: {"response":"The"}
data: {"response":" capital of France is Paris."}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
data: [DONE]
```
Clients should read the `response` field from each chunk and accumulate
them to build the full response string.
Clients should accumulate `response` fields to build the full response string.
The `done` event carries `model` and `tokenCount` for display in the UI.
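
A client-side sketch of that accumulation is shown below. It parses an already-received block of SSE text; `readSseTranscript` is a hypothetical name, and a real streaming client would also need to buffer partial lines that are split across network reads.

```javascript
// Hypothetical helper: accumulates `response` text and captures done metadata.
function readSseTranscript(text) {
  let full = "";
  let meta = null;
  for (const line of text.split(/\r?\n/)) {
    if (!line.startsWith("data: ")) continue; // skip blank separator lines
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const event = JSON.parse(payload);
    if (typeof event.response === "string") full += event.response;
    if (event.done) {
      meta = { model: event.model, tokenCount: event.tokenCount };
    }
  }
  return { text: full, meta };
}
```

Applied to the example stream above, `text` holds the full sentence and `meta` holds the values the UI can display.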