update documentation
@@ -24,20 +24,19 @@ to switch inference backends without changes to the rest of the system.
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
| INFERENCE_PROVIDER | No | llamacpp | Active provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |

> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
> service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
> reach this service on port 3001.
> service. The orchestration service uses `INFERENCE_SERVICE_URL` to reach
> this service on port 3001.

## Provider Architecture

The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER`
and loaded from `src/providers/`. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.
The active provider is selected at startup via `INFERENCE_PROVIDER` and
loaded from `src/providers/`. Both providers expose identical function
signatures.

### Supported Providers
@@ -46,28 +45,36 @@ signatures, so the rest of the service is unaware of which backend is active.
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |

Switching providers requires only a `.env` change — no code modifications needed:
Switching providers requires only a `.env` change — no code modifications:
```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```

### Provider Validation
The provider loader throws immediately on an unknown value, preventing silent
misconfiguration.
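
As a rough illustration of this fail-fast validation, the following is a minimal sketch of how a loader might check `INFERENCE_PROVIDER`. The names `resolveProvider` and `VALID_PROVIDERS` are hypothetical and are not taken from `src/infer.js`; the default value and the error text follow what this document shows.

```javascript
// Hypothetical sketch only: the real src/infer.js may be structured differently.
const VALID_PROVIDERS = ["ollama", "llamacpp"];

// Returns the validated provider name, or throws on an unknown value.
function resolveProvider(env = process.env) {
  // The configuration table lists llamacpp as the default.
  const name = env.INFERENCE_PROVIDER || "llamacpp";
  if (!VALID_PROVIDERS.includes(name)) {
    throw new Error(
      `Unknown inference provider: "${name}". Valid options: ${VALID_PROVIDERS.join(", ")}`
    );
  }
  // A real loader would presumably load the matching module from
  // src/providers/ at this point; that step is omitted here.
  return name;
}
```

Calling this once at startup, before any route is registered, would surface a typo in `.env` immediately instead of on the first request.
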

## Internal Structure

The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately
if an unknown value is set — prevents silent misconfiguration:
```
Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
src/
├── providers/
│   ├── ollama.js      # Ollama provider
│   └── llamacpp.js    # llama.cpp provider (OpenAI-compatible REST)
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions
```

## llama.cpp Provider

The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.
Uses the OpenAI-compatible REST API exposed by `llama-server`.

### Starting llama-server

`llama-server` must be started manually on the main PC before the inference service
can handle requests. It loads a single model at startup:
Must be started manually on the main PC before the inference service can
handle requests:

```powershell
.\llama-gpu\llama-server.exe `
@@ -79,40 +86,29 @@ can handle requests. It loads a single model at startup:
-c 64000
```

Key flags:

| Flag | Description |
|---|---|
| `-m` | Path to the `.gguf` model file |
| `-ngl 99` | Offload as many layers as possible to GPU |
| `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows connections from other machines on the LAN |
| `--port 8080` | Port for the llama-server HTTP API |
| `--reasoning off` | Disables thinking delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows LAN connections |
| `-c 64000` | Context window size in tokens |

> `-c 64000` is intentionally large. Monitor VRAM usage — if pressure builds,
> reduce this value. The NexusAI memory architecture handles context injection
> so a smaller window (6–8K) is often sufficient.
> `-c 64000` is intentionally large. NexusAI's memory architecture handles
> context injection so 6–8K is often sufficient if VRAM pressure builds.

### Model Naming

The model name sent in API requests must match the name as reported by
`llama-server` — including the `.gguf` extension. The reported name can be
verified with:
The model name in requests must match the name reported by `llama-server`
including the `.gguf` extension:

```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```

Set `DEFAULT_MODEL` in `.env` to the exact reported name:
```
DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
```
Set `DEFAULT_MODEL` in `.env` to the exact reported name.
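
The exact-match requirement can be checked programmatically. The sketch below assumes that `/v1/models` returns the usual OpenAI-style list shape (`{ data: [{ id: ... }] }`); that shape and the helper name `isReportedModel` are assumptions, not something this document confirms.

```javascript
// Hypothetical helper: checks a configured model name against an already
// parsed /v1/models response. Assumes the shape { data: [{ id: "..." }] }.
function isReportedModel(modelsResponse, defaultModel) {
  const ids = (modelsResponse.data || []).map((model) => model.id);
  // Exact string comparison, so a missing ".gguf" suffix counts as a mismatch.
  return ids.includes(defaultModel);
}
```

With the example name above, passing `"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini"` without the extension would return `false`, which is the mismatch this section warns about.
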

### Inference Parameters

The llamacpp provider maps NexusAI options to OpenAI-compatible fields:

| NexusAI option | API field | Default |
|---|---|---|
| `temperature` | `temperature` | 0.7 |
@@ -122,18 +118,6 @@ The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
| `repeatPenalty` | `repeat_penalty` | 1.1 |
| `seed` | `seed` | null (random) |
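
To make the mapping concrete, here is a minimal sketch of such a translation step. It covers only the three rows visible in this diff (`temperature`, `repeatPenalty`, `seed`); the rows hidden by the hunk boundary are left out rather than guessed, and the function name `toApiFields` is hypothetical.

```javascript
// Hypothetical helper: maps NexusAI option names to API field names
// using the defaults listed in the table. Not the provider's actual code.
function toApiFields(options = {}) {
  return {
    // ?? keeps an explicit 0, which || would silently replace.
    temperature: options.temperature ?? 0.7,
    repeat_penalty: options.repeatPenalty ?? 1.1,
    // The table lists null as the default, described there as "random".
    seed: options.seed ?? null,
  };
}
```

Using `??` rather than `||` is a deliberate choice in this sketch: a caller asking for `temperature: 0` gets exactly that instead of the default.
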

## Internal Structure
```
src/
├── providers/
│   ├── ollama.js      # Ollama provider — uses ollama npm package
│   └── llamacpp.js    # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions
```

## Streaming Response Format

The llama.cpp provider yields chunks in this shape:
@@ -143,7 +127,7 @@ The llama.cpp provider yields chunks in this shape:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
```

The inference route re-emits these as SSE events:
The inference route re-emits as SSE:
```
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
@@ -151,66 +135,6 @@ data: [DONE]
```

`model` and `tokenCount` are captured from the llama.cpp `finish_reason: stop`
chunk (`usage.completion_tokens`) and emitted on the done event so the
orchestration layer can forward them to the client.
chunk and emitted on the done event.

## Endpoints

### Health

| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports active provider and model |

### Inference

| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |

---

**POST /complete**

Request body:
```json
{
  "prompt": "What is the capital of France?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
  "maxTokens": 1024
}
```

`model` is optional — falls back to `DEFAULT_MODEL` if omitted.
`maxTokens` is optional — defaults to 1024.
`temperature` is optional — defaults to 0.7.

Response:
```json
{
  "text": "The capital of France is Paris.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
```

---

**POST /complete/stream**

Same request body as `/complete`.

Response is a stream of Server-Sent Events:
```
data: {"response":"The"}
data: {"response":" capital of France is Paris."}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
data: [DONE]
```

Clients should accumulate `response` fields to build the full response string.
The `done` event carries `model` and `tokenCount` for display in the UI.
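
The accumulation described above could look roughly like the sketch below. It parses an already received SSE body as a single string, which keeps the example self-contained; a real client would more likely process the stream incrementally. The function name `collectStream` is hypothetical.

```javascript
// Hypothetical client-side helper: rebuilds the full text and the
// done-event metadata from an SSE body in the format shown above.
function collectStream(sseText) {
  let text = "";
  let meta = null;
  for (const line of sseText.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank separator lines
    const data = line.slice("data: ".length).trim();
    if (data === "[DONE]") break; // end-of-stream sentinel
    const event = JSON.parse(data);
    if (event.done) {
      // The done event carries model and tokenCount.
      meta = { model: event.model, tokenCount: event.tokenCount };
    } else if (typeof event.response === "string") {
      text += event.response; // accumulate token text in arrival order
    }
  }
  return { text, meta };
}
```

Fed the four example events above, this sketch yields the text `The capital of France is Paris.` together with the model name and a `tokenCount` of 8.
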
For all HTTP endpoints, see `api-routes.md`.