# Inference Service

**Package:** `@nexusai/inference-service`
**Location:** `packages/inference-service`
**Deployed on:** Main PC (192.168.0.79)
**Port:** 3001

## Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context
packages from the orchestration service and returns model responses. Uses a
provider pattern to abstract the underlying runtime, making it straightforward
to switch inference backends without changes to the rest of the system.

## Dependencies

- `express`: HTTP API
- `ollama`: Ollama client (used by the Ollama provider, kept as fallback)
- `dotenv`: environment variable loading
- `@nexusai/shared`: shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |

> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
> service. The orchestration service uses `INFERENCE_SERVICE_URL` to reach
> this service on port 3001.

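As a rough illustration of how the defaults in the table could be applied, the sketch below resolves the four variables from an env-like object. The helper name `loadConfig` and the shape of the returned object are assumptions made for this example; the actual service may read `process.env` differently.

```js
// Hypothetical helper (not taken from the service source): resolves the
// documented environment variables, falling back to the documented defaults.
function loadConfig(env = process.env) {
  return {
    // `||` also covers variables that are set but empty
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || "llamacpp",
    inferenceUrl: env.INFERENCE_URL || "http://localhost:8080",
    defaultModel: env.DEFAULT_MODEL || "local-model",
  };
}
```

Passing the environment in as an argument keeps the function easy to exercise with a plain object, for example `loadConfig({ PORT: "4000" })`.
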
## Provider Architecture

The active provider is selected at startup via `INFERENCE_PROVIDER` and
loaded from `src/providers/`. Both providers expose identical function
signatures.

### Supported Providers

| Provider | Value | Runtime |
|---|---|---|
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API), **current default** |
| Ollama | `ollama` | Ollama via the `ollama` npm package, available as fallback |

Switching providers requires only a `.env` change, with no code modifications:
```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```

The provider loader throws immediately on an unknown value, preventing silent
misconfiguration.

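The fail-fast behaviour of the loader can be sketched as follows. This is a minimal illustration, not the contents of `src/infer.js`: the function name `resolveProviderPath`, the `SUPPORTED_PROVIDERS` list, and the error wording are all assumptions.

```js
// Illustrative sketch only; the real loader in src/infer.js may differ.
const SUPPORTED_PROVIDERS = ["llamacpp", "ollama"];

// Hypothetical helper: validates the provider name and returns the module
// path that the loader would then require() or import().
function resolveProviderPath(name = process.env.INFERENCE_PROVIDER || "llamacpp") {
  if (!SUPPORTED_PROVIDERS.includes(name)) {
    // Throwing at startup surfaces a typo in .env immediately
    throw new Error(
      `Unknown INFERENCE_PROVIDER "${name}". Expected one of: ${SUPPORTED_PROVIDERS.join(", ")}`
    );
  }
  return `./providers/${name}.js`;
}
```

Validating against an explicit list, rather than attempting to load whatever file name is given, keeps the error message specific and avoids loading an unintended module.
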
> **LM Studio compatibility note:** LM Studio exposes an OpenAI-compatible
> `/v1/chat/completions` endpoint with the same request shape as llama.cpp.
> A future `lmstudio.js` provider would be nearly identical to `llamacpp.js`;
> only the `BASE_URL` would differ. No architectural changes required.

## Internal Structure

```
src/
├── providers/
│   ├── ollama.js       # Ollama provider
│   └── llamacpp.js     # llama.cpp provider (OpenAI-compatible REST)
├── routes/
│   └── inference.js    # /complete and /complete/stream route handlers
├── infer.js            # Provider loader: selects and re-exports active provider
└── index.js            # Express app + route definitions
```

## llama.cpp Provider

Uses the OpenAI-compatible REST API exposed by `llama-server`.

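To make the request shape concrete, here is a minimal sketch of building a non-streaming call to the OpenAI-compatible `/v1/chat/completions` endpoint. The helper name `buildChatRequest`, its parameters, and the `messages` layout are assumptions for illustration; the actual `llamacpp.js` provider may assemble its requests differently.

```js
// Hypothetical request builder for an OpenAI-compatible chat endpoint.
// Returns the URL and fetch() options without performing the request.
function buildChatRequest(baseUrl, model, messages, params = {}) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // `params` carries sampling fields such as temperature or max_tokens
      body: JSON.stringify({ model, messages, stream: false, ...params }),
    },
  };
}
```

A caller would pass the result to `fetch(url, init)` and, for an OpenAI-style response, read the generated text from `choices[0].message.content`.
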
### Starting llama-server

`llama-server` must be started manually on the main PC before the inference
service can handle requests:

```powershell
.\llama-gpu\llama-server.exe `
  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
  -ngl 99 `
  --reasoning off `
  --host 0.0.0.0 `
  --port 8080 `
  -c 64000
```

| Flag | Description |
|---|---|
| `-ngl 99` | Offload as many layers as possible to GPU |
| `--reasoning off` | Disables thinking delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows LAN connections |
| `-c 64000` | Context window size in tokens |

> `-c 64000` is intentionally large. NexusAI's memory architecture handles
> context injection, so a 6–8K context is often sufficient and `-c` can be
> lowered if VRAM pressure builds.

### Model Naming

The model name in requests must match the name reported by `llama-server`,
including the `.gguf` extension. To see the reported name, query the server:

```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```

Set `DEFAULT_MODEL` in `.env` to the exact reported name.

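The same check could be done from Node at startup. The sketch below assumes the OpenAI-style list response (`{ data: [{ id: ... }] }`) from `/v1/models` and a runtime with a global `fetch` (Node 18 or later). The helper names `listModelIds` and `assertModelAvailable` are hypothetical and not part of the service.

```js
// Hypothetical helpers; `fetchImpl` is injectable so the logic can be
// exercised without a running llama-server.
async function listModelIds(baseUrl, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/v1/models`);
  if (!res.ok) throw new Error(`GET /v1/models failed with status ${res.status}`);
  const body = await res.json();
  // OpenAI-style list responses carry the models under `data`, each with an `id`
  return (body.data ?? []).map((m) => m.id);
}

// Throws if `model` is not among the names the server reports.
async function assertModelAvailable(baseUrl, model, fetchImpl = fetch) {
  const ids = await listModelIds(baseUrl, fetchImpl);
  if (!ids.includes(model)) {
    throw new Error(`Model "${model}" not reported by server. Available: ${ids.join(", ")}`);
  }
  return model;
}
```

Comparing the names exactly, without trimming the `.gguf` suffix, mirrors the requirement that `DEFAULT_MODEL` match the reported name character for character.
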
### Inference Parameters

All parameters are resolved in `resolveOptions()`, which falls back to
`INFERENCE_DEFAULTS` from `@nexusai/shared` for any value not provided in the
request. In normal usage, orchestration reads these from `settings.json` and
forwards them on every request.

| NexusAI option | API field | Default | Description |
|---|---|---|---|
| `temperature` | `temperature` | 0.7 | Response randomness (0 = deterministic) |
| `maxTokens` | `max_tokens` | 1024 | Max tokens to generate |
| `topP` | `top_p` | 0.9 | Nucleus sampling probability mass |
| `topK` | `top_k` | 40 | Top-K token candidates per step |
| `repeatPenalty` | `repeat_penalty` | 1.1 | Penalty for recently used tokens |
| `seed` | `seed` | null | `null` for a random seed; an integer for reproducible output |

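The mapping in the table can be sketched as a small function. This is an illustration under stated assumptions, not the service's actual `resolveOptions()`: the `INFERENCE_DEFAULTS` object here is a local stand-in with values copied from the table, `FIELD_MAP` is a hypothetical name, and omitting a `null` seed from the request is a guess at how "random" is expressed.

```js
// Local stand-in for INFERENCE_DEFAULTS from @nexusai/shared (values from the table).
const INFERENCE_DEFAULTS = {
  temperature: 0.7, maxTokens: 1024, topP: 0.9, topK: 40, repeatPenalty: 1.1, seed: null,
};

// NexusAI option name -> API field name (hypothetical constant).
const FIELD_MAP = {
  temperature: "temperature", maxTokens: "max_tokens", topP: "top_p",
  topK: "top_k", repeatPenalty: "repeat_penalty", seed: "seed",
};

// Sketch of the resolution step: request value first, then the default.
function resolveOptions(options = {}) {
  const resolved = {};
  for (const [key, field] of Object.entries(FIELD_MAP)) {
    // `??` keeps explicit zeros such as temperature: 0
    const value = options[key] ?? INFERENCE_DEFAULTS[key];
    // Assumption: a null seed is left out so the server chooses one
    if (value !== null && value !== undefined) resolved[field] = value;
  }
  return resolved;
}
```

Using `??` rather than `||` matters here, since `temperature: 0` is a valid request value that should not be replaced by the default.
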
## Streaming Response Format

The llama.cpp provider yields chunks in this shape:
```js
{ response: "token text", done: false }
// final chunk:
{ response: "", done: true, model: "model-name.gguf", tokenCount: 42 }
```

The inference route re-emits them as SSE:
```
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
```

`model` and `tokenCount` are captured from the llama.cpp `finish_reason: stop`
chunk and emitted on the done event.

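The translation from provider chunks to SSE lines can be sketched as below. The function names `toSseData` and `streamToSse` are hypothetical, and terminating each event with a blank line follows the SSE wire format rather than anything confirmed from the route source.

```js
// Formats one provider chunk as an SSE event, following the shapes shown above.
function toSseData(chunk) {
  const payload = chunk.done
    ? { done: true, model: chunk.model, tokenCount: chunk.tokenCount }
    : { response: chunk.response };
  // Each SSE event ends with a blank line
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Writes every chunk from an async iterable to `res` (for example an Express
// response), then the [DONE] sentinel, and closes the stream.
async function streamToSse(chunks, res) {
  for await (const chunk of chunks) {
    res.write(toSseData(chunk));
  }
  res.write("data: [DONE]\n\n");
  res.end();
}
```

In an Express handler, the response would typically have `Content-Type: text/event-stream` set before the first write.
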
For all HTTP endpoints, see `api-routes.md`.