# Inference Service

**Package:** `@nexusai/inference-service`
**Location:** `packages/inference-service`
**Deployed on:** Main PC (192.168.0.79)
**Port:** 3001

## Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.

## Dependencies

- `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |

> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
> service. The orchestration service uses `INFERENCE_SERVICE_URL` to reach
> this service on port 3001.

## Provider Architecture

The active provider is selected at startup via `INFERENCE_PROVIDER` and loaded from `src/providers/`. Both providers expose identical function signatures.

### Supported Providers

| Provider | Value | Runtime |
|---|---|---|
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |

Switching providers requires only a `.env` change — no code modifications:

```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```

The provider loader throws immediately on an unknown value, preventing silent misconfiguration.
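
The fail-fast selection described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual contents of `src/infer.js`: the `PROVIDERS` map, the `loadProvider` function, and the stub `complete` functions are hypothetical names, and the real providers are loaded from `src/providers/`.

```javascript
// Hypothetical stand-ins for src/providers/ollama.js and src/providers/llamacpp.js.
// The real modules expose identical function signatures; these stubs only mimic that.
const PROVIDERS = {
  ollama: { name: 'ollama', complete: async (prompt) => `ollama:${prompt}` },
  llamacpp: { name: 'llamacpp', complete: async (prompt) => `llamacpp:${prompt}` },
};

// Pick the provider named by INFERENCE_PROVIDER (documented default: llamacpp).
// An unknown value throws at startup instead of silently falling back.
function loadProvider(env = process.env) {
  const key = env.INFERENCE_PROVIDER || 'llamacpp';
  if (!Object.prototype.hasOwnProperty.call(PROVIDERS, key)) {
    const known = Object.keys(PROVIDERS).join(', ');
    throw new Error(`Unknown INFERENCE_PROVIDER "${key}". Expected one of: ${known}`);
  }
  return PROVIDERS[key];
}
```

Calling `loadProvider()` once while the module is being loaded means a typo in `.env` stops the service before it accepts any request, which is the behaviour the loader is described as having.
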
> **LM Studio compatibility note:** LM Studio exposes an OpenAI-compatible
> `/v1/chat/completions` endpoint with the same request shape as llama.cpp.
> A future `lmstudio.js` provider would be nearly identical to `llamacpp.js` —
> only the `BASE_URL` would differ. No architectural changes required.

## Internal Structure

```
src/
├── providers/
│   ├── ollama.js       # Ollama provider
│   └── llamacpp.js     # llama.cpp provider (OpenAI-compatible REST)
├── routes/
│   └── inference.js    # /complete and /complete/stream route handlers
├── infer.js            # Provider loader — selects and re-exports active provider
└── index.js            # Express app + route definitions
```

## llama.cpp Provider

Uses the OpenAI-compatible REST API exposed by `llama-server`.

### Starting llama-server

`llama-server` must be started manually on the main PC before the inference service can handle requests:

```powershell
.\llama-gpu\llama-server.exe `
  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
  -ngl 99 `
  --reasoning off `
  --host 0.0.0.0 `
  --port 8080 `
  -c 64000
```

| Flag | Description |
|---|---|
| `-ngl 99` | Offload as many layers as possible to GPU |
| `--reasoning off` | Disables thinking delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows LAN connections |
| `-c 64000` | Context window size in tokens |

> `-c 64000` is intentionally large. Because NexusAI's memory architecture
> handles context injection, 6–8K is often sufficient, so the value can be
> lowered if VRAM pressure builds.

### Model Naming

The model name in requests must match the name reported by `llama-server`, including the `.gguf` extension:

```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```

Set `DEFAULT_MODEL` in `.env` to the exact reported name.

### Inference Parameters

All parameters are resolved in `resolveOptions()`, falling back to `INFERENCE_DEFAULTS` from `@nexusai/shared` when a value is not provided in the request. In normal usage, orchestration reads these from `settings.json` and forwards them on every request.
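
A minimal sketch of that fallback and renaming step, assuming the option names, API fields, and defaults listed in this section's parameter table. The local `INFERENCE_DEFAULTS` object and the `FIELD_MAP` helper are stand-ins written for illustration; the shape of the real `INFERENCE_DEFAULTS` export and the body of the real `resolveOptions()` are not shown in this document. Leaving a `null` seed out of the request is likewise an assumption.

```javascript
// Local stand-in for INFERENCE_DEFAULTS from @nexusai/shared (assumed shape;
// values are the documented defaults).
const INFERENCE_DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

// Hypothetical map: NexusAI option name -> field name in the API request body.
const FIELD_MAP = {
  temperature: 'temperature',
  maxTokens: 'max_tokens',
  topP: 'top_p',
  topK: 'top_k',
  repeatPenalty: 'repeat_penalty',
  seed: 'seed',
};

function resolveOptions(options = {}) {
  const body = {};
  for (const [option, field] of Object.entries(FIELD_MAP)) {
    // ?? falls back only on null/undefined, so an explicit temperature of 0 survives.
    const value = options[option] ?? INFERENCE_DEFAULTS[option];
    // Assumption: a null seed means "random", so the field is omitted entirely.
    if (value !== null) body[field] = value;
  }
  return body;
}
```

For example, `resolveOptions({ temperature: 0, maxTokens: 256 })` keeps the two supplied values, fills `top_p`, `top_k`, and `repeat_penalty` from the defaults, and omits `seed`.
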
| NexusAI option | API field | Default | Description |
|---|---|---|---|
| `temperature` | `temperature` | 0.7 | Response randomness (0 = deterministic) |
| `maxTokens` | `max_tokens` | 1024 | Max tokens to generate |
| `topP` | `top_p` | 0.9 | Nucleus sampling probability mass |
| `topK` | `top_k` | 40 | Top-K token candidates per step |
| `repeatPenalty` | `repeat_penalty` | 1.1 | Penalty for recently used tokens |
| `seed` | `seed` | null | null = random; integer for reproducible output |

## Streaming Response Format

The llama.cpp provider yields chunks in this shape:

```js
{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
```

The inference route re-emits as SSE:

```
data: {"response":"token text"}

data: {"done":true,"model":"model-name.gguf","tokenCount":42}

data: [DONE]
```

`model` and `tokenCount` are captured from the llama.cpp `finish_reason: stop` chunk and emitted on the done event.

For all HTTP endpoints, see `api-routes.md`.
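
The mapping from provider chunks to SSE frames described under Streaming Response Format can be sketched as a small pure function. `toSseFrames` is a hypothetical helper name, not a function confirmed to exist in `src/routes/inference.js`; it only assumes the chunk and frame shapes documented in that section.

```javascript
// Hypothetical helper: convert one provider chunk into the SSE frame(s) to send.
// Each frame is terminated by a blank line, as the SSE wire format requires.
function toSseFrames(chunk) {
  if (!chunk.done) {
    // Intermediate chunk: forward only the token text.
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  // Final chunk: emit the done event with its metadata, then the [DONE] sentinel.
  const { model, tokenCount } = chunk;
  return [
    `data: ${JSON.stringify({ done: true, model, tokenCount })}\n\n`,
    'data: [DONE]\n\n',
  ];
}
```

In an Express handler, each returned frame would typically be passed to `res.write()` after the response's `Content-Type` header has been set to `text/event-stream`, with `res.end()` called once the `[DONE]` frame has been written.
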