# Inference Service

**Package:** `@nexusai/inference-service`  
**Location:** `packages/inference-service`  
**Deployed on:** Main PC (192.168.0.79)  
**Port:** 3001

## Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.

## Dependencies

- `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |

> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
> service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
> reach this service on port 3001.

## Provider Architecture

The inference service uses a provider pattern to abstract the underlying LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER` and loaded from `src/providers/`. Both providers expose identical function signatures, so the rest of the service is unaware of which backend is active.

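
The startup selection described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual loader: the `providers` registry, the stub `complete` functions, and the `loadProvider` name are all hypothetical, and the real providers would be modules under `src/providers/`.

```javascript
// Hypothetical provider registry. In the real service each entry would be a
// module under src/providers/; these stubs only illustrate the shared shape.
const providers = {
  ollama: {
    name: "ollama",
    // Assumed signature: (prompt, options) => Promise<{ text, model }>
    complete: async (prompt, options = {}) => ({
      text: `[ollama stub] ${prompt}`,
      model: options.model,
    }),
  },
  llamacpp: {
    name: "llamacpp",
    // Same assumed signature, so callers never branch on the backend.
    complete: async (prompt, options = {}) => ({
      text: `[llamacpp stub] ${prompt}`,
      model: options.model,
    }),
  },
};

// Select the provider once at startup and fail fast on unknown values.
function loadProvider(name = process.env.INFERENCE_PROVIDER || "llamacpp") {
  if (!Object.prototype.hasOwnProperty.call(providers, name)) {
    const valid = Object.keys(providers).join(", ");
    throw new Error(`Unknown inference provider: "${name}". Valid options: ${valid}`);
  }
  return providers[name];
}
```

Because both entries share one signature, a caller can hold the result of `loadProvider()` and call `complete(prompt, options)` without knowing which runtime answers.
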
### Supported Providers

| Provider | Value | Runtime |
|---|---|---|
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |

Switching providers requires only a `.env` change — no code modifications needed:

```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```

### Provider Validation

The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately if an unknown value is set — prevents silent misconfiguration:

```
Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
```

## llama.cpp Provider

The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.

### Starting llama-server

`llama-server` must be started manually on the main PC before the inference service can handle requests. It loads a single model at startup:

```powershell
.\llama-gpu\llama-server.exe `
  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
  -ngl 99 `
  --reasoning off `
  --host 0.0.0.0 `
  --port 8080 `
  -c 64000
```

Key flags:

| Flag | Description |
|---|---|
| `-m` | Path to the `.gguf` model file |
| `-ngl 99` | Offload as many layers as possible to GPU |
| `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows connections from other machines on the LAN |
| `--port 8080` | Port for the llama-server HTTP API |
| `-c 64000` | Context window size in tokens |

> `-c 64000` is intentionally large. Monitor VRAM usage — if pressure builds,
> reduce this value. The NexusAI memory architecture handles context injection
> so a smaller window (6–8K) is often sufficient.

### Model Naming

The model name sent in API requests must match the name as reported by `llama-server` — including the `.gguf` extension.
The reported name can be verified with:

```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```

Set `DEFAULT_MODEL` in `.env` to the exact reported name:

```
DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
```

### Inference Parameters

The llamacpp provider maps NexusAI options to OpenAI-compatible fields:

| NexusAI option | API field | Default |
|---|---|---|
| `temperature` | `temperature` | 0.7 |
| `maxTokens` | `max_tokens` | 1024 |
| `topP` | `top_p` | 0.9 |
| `topK` | `top_k` | 40 |
| `repeatPenalty` | `repeat_penalty` | 1.1 |
| `seed` | `seed` | null (random) |

## Internal Structure

```
src/
├── providers/
│   ├── ollama.js       # Ollama provider — uses ollama npm package
│   └── llamacpp.js     # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js    # /complete and /complete/stream route handlers
├── infer.js            # Provider loader — selects and re-exports active provider
└── index.js            # Express app + route definitions
```

## Streaming Response Format

The llama.cpp provider yields chunks in this shape:

```js
{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
```

The inference route re-emits these as SSE events:

```
data: {"response":"token text"}

data: {"done":true,"model":"model-name.gguf","tokenCount":42}

data: [DONE]
```

`model` and `tokenCount` are captured from the llama.cpp `finish_reason: stop` chunk (`usage.completion_tokens`) and emitted on the done event so the orchestration layer can forward them to the client.

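
To make the chunk-to-SSE mapping concrete, here is a small sketch of how a provider chunk might be turned into the `data:` lines shown above. The function names `toSseEvents` and `streamAsSse` are hypothetical and are not taken from `routes/inference.js`; the sketch only assumes the chunk shape documented in this section.

```javascript
// Convert one provider chunk into the SSE frames the route would write.
// Each frame ends with a blank line, as the SSE wire format requires.
function toSseEvents(chunk) {
  if (!chunk.done) {
    // Intermediate chunk: forward only the token text.
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  // Final chunk: emit the metadata event, then the [DONE] sentinel.
  const doneEvent = {
    done: true,
    model: chunk.model,
    tokenCount: chunk.tokenCount,
  };
  return [`data: ${JSON.stringify(doneEvent)}\n\n`, "data: [DONE]\n\n"];
}

// Adapt any async iterable of provider chunks into a stream of SSE frames.
async function* streamAsSse(chunks) {
  for await (const chunk of chunks) {
    for (const frame of toSseEvents(chunk)) {
      yield frame;
    }
  }
}
```

In an Express handler, each yielded frame could be passed to `res.write(frame)` after setting `Content-Type: text/event-stream`, with `res.end()` called once the generator finishes.
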
## Endpoints

### Health

| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports active provider and model |

### Inference

| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |

---

**POST /complete**

Request body:

```json
{
  "prompt": "What is the capital of France?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
  "maxTokens": 1024
}
```

- `model` is optional — falls back to `DEFAULT_MODEL` if omitted.
- `maxTokens` is optional — defaults to 1024.
- `temperature` is optional — defaults to 0.7.

Response:

```json
{
  "text": "The capital of France is Paris.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
```

---

**POST /complete/stream**

Same request body as `/complete`. Response is a stream of Server-Sent Events:

```
data: {"response":"The"}

data: {"response":" capital of France is Paris."}

data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}

data: [DONE]
```

Clients should accumulate `response` fields to build the full response string. The `done` event carries `model` and `tokenCount` for display in the UI.
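
The accumulation step can be sketched on the client side as below. This is an illustrative helper, not part of the service: `accumulateSse` is a hypothetical name, and it assumes the SSE text has already been read in full (a real client would feed it buffered lines as they arrive and handle frames split across network reads).

```javascript
// Parse buffered SSE text and rebuild the full completion.
// Returns the concatenated text plus the metadata from the done event.
function accumulateSse(sseText) {
  let text = "";
  let meta = null;

  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data: ")) continue; // skip blank separator lines
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel

    const event = JSON.parse(payload);
    if (event.done) {
      // Final event: keep model and tokenCount for display.
      meta = { model: event.model, tokenCount: event.tokenCount };
    } else if (typeof event.response === "string") {
      text += event.response;
    }
  }
  return { text, meta };
}

// Usage with the example stream from this section.
const example = [
  'data: {"response":"The"}',
  "",
  'data: {"response":" capital of France is Paris."}',
  "",
  'data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}',
  "",
  "data: [DONE]",
  "",
].join("\n");

const result = accumulateSse(example);
console.log(result.text); // The capital of France is Paris.
```
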