update documentation

2026-04-17 03:46:17 -07:00
parent 27e3c98304
commit 5145b9a7db
13 changed files with 822 additions and 794 deletions
--- a/docs/services/inference-service.md
+++ b/docs/services/inference-service.md
@@ -24,20 +24,19 @@ to switch inference backends without changes to the rest of the system.
 | Variable | Required | Default | Description |
 |---|---|---|---|
 | PORT | No | 3001 | Port to listen on |
-| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
+| INFERENCE_PROVIDER | No | llamacpp | Active provider (`ollama` or `llamacpp`) |
 | INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
 | DEFAULT_MODEL | No | local-model | Default model name passed to the provider |

 > `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
-> service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
-> reach this service on port 3001.
+> service. The orchestration service uses `INFERENCE_SERVICE_URL` to reach
+> this service on port 3001.

 ## Provider Architecture

-The inference service uses a provider pattern to abstract the underlying
-LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER`
-and loaded from `src/providers/`. Both providers expose identical function
-signatures, so the rest of the service is unaware of which backend is active.
+The active provider is selected at startup via `INFERENCE_PROVIDER` and
+loaded from `src/providers/`. Both providers expose identical function
+signatures.

 ### Supported Providers

@@ -46,28 +45,36 @@ signatures, so the rest of the service is unaware of which backend is active.
 | llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
 | Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |

-Switching providers requires only a `.env` change — no code modifications needed:
+Switching providers requires only a `.env` change — no code modifications:
 ```
 INFERENCE_PROVIDER=llamacpp
 INFERENCE_URL=http://localhost:8080
 ```

-### Provider Validation
+The provider loader validates `INFERENCE_PROVIDER` at startup and throws
+immediately on an unknown value, preventing silent misconfiguration.
+
+## Internal Structure

-The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately
-if an unknown value is set — prevents silent misconfiguration:
 ```
-Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
+src/
+├── providers/
+│   ├── ollama.js      # Ollama provider
+│   └── llamacpp.js    # llama.cpp provider (OpenAI-compatible REST)
+├── routes/
+│   └── inference.js   # /complete and /complete/stream route handlers
+├── infer.js           # Provider loader — selects and re-exports active provider
+└── index.js           # Express app + route definitions
 ```

 ## llama.cpp Provider

-The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.
+This provider uses the OpenAI-compatible REST API exposed by `llama-server`.

 ### Starting llama-server

-`llama-server` must be started manually on the main PC before the inference service
-can handle requests. It loads a single model at startup:
+`llama-server` must be started manually on the main PC before the inference
+service can handle requests:

 ```powershell
 .\llama-gpu\llama-server.exe `
@@ -79,40 +86,29 @@ can handle requests. It loads a single model at startup:
  -c 64000
 ```

-Key flags:
-
 | Flag | Description |
 |---|---|
-| `-m` | Path to the `.gguf` model file |
 | `-ngl 99` | Offload as many layers as possible to GPU |
-| `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
-| `--host 0.0.0.0` | Allows connections from other machines on the LAN |
-| `--port 8080` | Port for the llama-server HTTP API |
+| `--reasoning off` | Disables thinking delay on Gemma 4 models |
+| `--host 0.0.0.0` | Allows LAN connections |
 | `-c 64000` | Context window size in tokens |

-> `-c 64000` is intentionally large. Monitor VRAM usage — if pressure builds,
-> reduce this value. The NexusAI memory architecture handles context injection
-> so a smaller window (6–8K) is often sufficient.
+> `-c 64000` is intentionally large. If VRAM pressure builds, reduce it: NexusAI's
+> memory architecture handles context injection, so 6–8K is often sufficient.

 ### Model Naming

-The model name sent in API requests must match the name as reported by
-`llama-server` — including the `.gguf` extension. The reported name can be
-verified with:
+The model name in requests must match the name reported by `llama-server`,
+including the `.gguf` extension. Check the reported name with:

 ```powershell
 Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
 ```

-Set `DEFAULT_MODEL` in `.env` to the exact reported name:
-```
-DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
-```
+Set `DEFAULT_MODEL` in `.env` to the exact reported name.

 ### Inference Parameters

-The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
-
 | NexusAI option | API field | Default |
 |---|---|---|
 | `temperature` | `temperature` | 0.7 |
@@ -122,18 +118,6 @@ The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
 | `repeatPenalty` | `repeat_penalty` | 1.1 |
 | `seed` | `seed` | null (random) |

-## Internal Structure
-```
-src/
-├── providers/
-│   ├── ollama.js      # Ollama provider — uses ollama npm package
-│   └── llamacpp.js    # llama.cpp provider — uses OpenAI-compatible REST API
-├── routes/
-│   └── inference.js   # /complete and /complete/stream route handlers
-├── infer.js           # Provider loader — selects and re-exports active provider
-└── index.js           # Express app + route definitions
-```
-
 ## Streaming Response Format

 The llama.cpp provider yields chunks in this shape:
@@ -143,7 +127,7 @@ The llama.cpp provider yields chunks in this shape:
 { response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
 ```

-The inference route re-emits these as SSE events:
+The inference route re-emits these chunks as SSE events:
 ```
 data: {"response":"token text"}
 data: {"done":true,"model":"model-name.gguf","tokenCount":42}
@@ -151,66 +135,6 @@ data: [DONE]
 ```

 `model` and `tokenCount` are captured from the llama.cpp `finish_reason: stop`
-chunk (`usage.completion_tokens`) and emitted on the done event so the
-orchestration layer can forward them to the client.
+chunk and emitted on the done event.

-## Endpoints
-
-### Health
-
-| Method | Path | Description |
-|---|---|---|
-| GET | /health | Service health check — reports active provider and model |
-
-### Inference
-
-| Method | Path | Description |
-|---|---|---|
-| POST | /complete | Standard completion — returns full response when done |
-| POST | /complete/stream | Streaming completion via Server-Sent Events |
-
---
-
-**POST /complete**
-
-Request body:
-```json
-{
-  "prompt": "What is the capital of France?",
-  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
-  "temperature": 0.7,
-  "maxTokens": 1024
-}
-```
-
-`model` is optional — falls back to `DEFAULT_MODEL` if omitted.  
-`maxTokens` is optional — defaults to 1024.  
-`temperature` is optional — defaults to 0.7.
-
-Response:
-```json
-{
-  "text": "The capital of France is Paris.",
-  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
-  "done": true,
-  "evalCount": 8,
-  "promptEvalCount": 41
-}
-```
-
---
-
-**POST /complete/stream**
-
-Same request body as `/complete`.
-
-Response is a stream of Server-Sent Events:
-```
-data: {"response":"The"}
-data: {"response":" capital of France is Paris."}
-data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
-data: [DONE]
-```
-
-Clients should accumulate `response` fields to build the full response string.
-The `done` event carries `model` and `tokenCount` for display in the UI.
+For all HTTP endpoints, see `api-routes.md`.
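
Review note: this commit removes the sentence telling clients to accumulate `response` fields, so a consumer-side sketch may help readers of the streaming section. The example below is a minimal illustration, not the project's client code. It assumes the SSE lines have already been split into an array of strings; `collectSseCompletion` and the `sample` data are hypothetical names introduced only for this sketch. The event shapes (`response`, `done`, `model`, `tokenCount`, and the `[DONE]` sentinel) are taken from the documented stream format.

```javascript
// Hypothetical helper (not part of the repository): folds the documented
// SSE lines into a single result object.
function collectSseCompletion(lines) {
  const result = { text: '', model: null, tokenCount: null, done: false };

  for (const line of lines) {
    // Skip blank separator lines and anything that is not a data field.
    if (!line.startsWith('data:')) continue;

    const payload = line.slice('data:'.length).trim();

    // "[DONE]" is the end-of-stream sentinel shown in the doc; stop reading.
    if (payload === '[DONE]') break;

    const event = JSON.parse(payload);

    // Token events carry a `response` string; append it to the full text.
    if (typeof event.response === 'string') {
      result.text += event.response;
    }

    // The done event carries `model` and `tokenCount`.
    if (event.done === true) {
      result.done = true;
      result.model = event.model ?? null;
      result.tokenCount = event.tokenCount ?? null;
    }
  }

  return result;
}

// Sample lines modeled on the stream format in the doc.
const sample = [
  'data: {"response":"The"}',
  'data: {"response":" capital of France is Paris."}',
  'data: {"done":true,"model":"model-name.gguf","tokenCount":8}',
  'data: [DONE]',
];

console.log(collectSseCompletion(sample));
```

In a real client the lines would arrive incrementally from the `/complete/stream` response body, so the same per-line logic would sit inside a stream reader that buffers partial lines before parsing.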