updated documentation

2026-04-13 03:42:14 -07:00
parent 5f024093d1
commit 045da0d7f4
5 changed files with 464 additions and 112 deletions
--- a/docs/services/inference-service.md
+++ b/docs/services/inference-service.md
@@ -2,7 +2,7 @@

 **Package:** `@nexusai/inference-service`  
 **Location:** `packages/inference-service`  
-**Deployed on:** Main PC  
+**Deployed on:** Main PC (192.168.0.79)  
 **Port:** 3001

 ## Purpose
@@ -15,7 +15,7 @@ to switch inference backends without changes to the rest of the system.
 ## Dependencies

 - `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider)
+- `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
 - `dotenv` — environment variable loading
 - `@nexusai/shared` — shared utilities

@@ -24,9 +24,13 @@ to switch inference backends without changes to the rest of the system.
 | Variable | Required | Default | Description |
 |---|---|---|---|
 | PORT | No | 3001 | Port to listen on |
-| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
-| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
-| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
+| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
+| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
+| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
+
+> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
+> service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
+> reach this service on port 3001.

 ## Provider Architecture

@@ -39,14 +43,87 @@ signatures, so the rest of the service is unaware of which backend is active.

 | Provider | Value | Runtime |
 |---|---|---|
-| Ollama | `ollama` | Ollama via the `ollama` npm package |
-| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |
+| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
+| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |

-Switching providers requires only a `.env` change — no code modifications needed.
+Switching providers requires only a `.env` change — no code modifications needed:
+```
 INFERENCE_PROVIDER=llamacpp
 INFERENCE_URL=http://localhost:8080
+```
+
+### Provider Validation
+
+The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately
+if an unknown value is set, which prevents silent misconfiguration:
+```
+Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
+```
+
+## llama.cpp Provider
+
+The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.
+
+### Starting llama-server
+
+`llama-server` must be started manually on the main PC before the inference service
+can handle requests. It loads a single model at startup:
+
+```powershell
+.\llama-gpu\llama-server.exe `
+  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
+  -ngl 99 `
+  --reasoning off `
+  --host 0.0.0.0 `
+  --port 8080 `
+  -c 64000
+```
+
+Key flags:
+
+| Flag | Description |
+|---|---|
+| `-m` | Path to the `.gguf` model file |
+| `-ngl 99` | Offload up to 99 layers to the GPU (in practice, every layer of the model) |
+| `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
+| `--host 0.0.0.0` | Allows connections from other machines on the LAN |
+| `--port 8080` | Port for the llama-server HTTP API |
+| `-c 64000` | Context window size in tokens |
+
+> `-c 64000` is intentionally large. Monitor VRAM usage — if pressure builds,
+> reduce this value. The NexusAI memory architecture handles context injection
+> so a smaller window (6–8K) is often sufficient.
+
+### Model Naming
+
+The model name sent in API requests must match the name as reported by
+`llama-server` — including the `.gguf` extension. The reported name can be
+verified with:
+
+```powershell
+Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
+```
+
+Set `DEFAULT_MODEL` in `.env` to the exact reported name:
+```
+DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
+```
+
+### Inference Parameters
+
+The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
+
+| NexusAI option | API field | Default |
+|---|---|---|
+| `temperature` | `temperature` | 0.7 |
+| `maxTokens` | `max_tokens` | 1024 |
+| `topP` | `top_p` | 0.9 |
+| `topK` | `top_k` | 40 |
+| `repeatPenalty` | `repeat_penalty` | 1.1 |
+| `seed` | `seed` | null (random) |

 ## Internal Structure
+```
 src/
 ├── providers/
 │   ├── ollama.js      # Ollama provider — uses ollama npm package
@@ -55,6 +132,27 @@ src/
 │   └── inference.js   # /complete and /complete/stream route handlers
 ├── infer.js           # Provider loader — selects and re-exports active provider
 └── index.js           # Express app + route definitions
+```
+
+## Streaming Response Format
+
+The llama.cpp provider yields chunks in this shape:
+```js
+{ response: "token text", done: false }
+// final chunk:
+{ response: "", done: true, model: "model-name.gguf", tokenCount: 42 }
+```
+
+The inference route re-emits these as SSE events:
+```
+data: {"response":"token text"}
+data: {"done":true,"model":"model-name.gguf","tokenCount":42}
+data: [DONE]
+```
+
+`model` and `tokenCount` are captured from the llama.cpp chunk whose `finish_reason`
+is `stop` (`tokenCount` comes from `usage.completion_tokens`) and emitted on the
+done event so the orchestration layer can forward them to the client.

 ## Endpoints

@@ -79,7 +177,7 @@ Request body:
 ```json
 {
  "prompt": "What is the capital of France?",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
  "maxTokens": 1024
 }
@@ -93,33 +191,26 @@ Response:
 ```json
 {
  "text": "The capital of France is Paris.",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
 }
 ```

-| Field | Description |
-|---|---|
-| `text` | The model's response |
-| `model` | Model name as reported by the provider |
-| `done` | Whether generation completed normally |
-| `evalCount` | Number of tokens generated |
-| `promptEvalCount` | Number of tokens in the prompt |
-
 ---

 **POST /complete/stream**

-Same request body as `/complete` (`maxTokens` not applicable for streaming).
+Same request body as `/complete`.

-Response is a stream of Server-Sent Events. Each event contains a partial
-response chunk as JSON. The stream closes with a final `data: [DONE]` event.
-data: {"model":"companion:latest","response":"The","done":false}
-data: {"model":"companion:latest","response":" capital","done":false}
-data: {"model":"companion:latest","response":" of France is Paris.","done":false}
+Response is a stream of Server-Sent Events:
+```
+data: {"response":"The"}
+data: {"response":" capital of France is Paris."}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
 data: [DONE]
+```

-Clients should read the `response` field from each chunk and accumulate
-them to build the full response string.
+Clients should accumulate `response` fields to build the full response string.
+The `done` event carries `model` and `tokenCount` for display in the UI.
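
The client-side accumulation described for `/complete/stream` could look roughly like the sketch below. It assumes the event shapes shown in the diff (`response` events, one `done` event, then `[DONE]`). `accumulateSse` and `sample` are hypothetical names used only for illustration and are not part of the service.

```javascript
// Hypothetical helper: walks SSE text line by line, concatenates every
// `response` field, and keeps the metadata carried by the `done` event.
function accumulateSse(sseText) {
  let text = "";
  let meta = null;
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank or non-data lines
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const event = JSON.parse(payload);
    if (event.done) {
      // The done event carries model and tokenCount rather than text.
      meta = { model: event.model, tokenCount: event.tokenCount };
    } else if (typeof event.response === "string") {
      text += event.response;
    }
  }
  return { text, meta };
}

// Sample input copied from the documented stream format.
const sample = [
  'data: {"response":"The"}',
  'data: {"response":" capital of France is Paris."}',
  'data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}',
  "data: [DONE]",
].join("\n");

const result = accumulateSse(sample);
console.log(result.text);
console.log(result.meta);
```

This sketch parses a complete string for simplicity. A client reading from the network would also need to buffer partial lines, since an event can be split across reads.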