nexusAI/docs/services/inference-service.md
2026-04-17 03:46:17 -07:00

Inference Service

Package: @nexusai/inference-service
Location: packages/inference-service
Deployed on: Main PC (192.168.0.79)
Port: 3001

Purpose

Thin adapter layer around the local LLM runtime. Receives assembled context packages from the orchestration service and returns model responses. Uses a provider pattern to abstract the underlying runtime, making it straightforward to switch inference backends without changes to the rest of the system.

Dependencies

  • express — HTTP API
  • ollama — Ollama client (used by the Ollama provider, kept as fallback)
  • dotenv — environment variable loading
  • @nexusai/shared — shared utilities

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | llamacpp | Active provider (ollama or llamacpp) |
| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |

INFERENCE_URL points to llama-server directly (port 8080), not to this service. The orchestration service uses INFERENCE_SERVICE_URL to reach this service on port 3001.
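
As a rough illustration of how these variables and their defaults could be resolved at startup, the sketch below reads them from an environment object. The `loadConfig` helper and the shape of the object it returns are hypothetical and not taken from the service's source; only the variable names and default values come from the table above.

```javascript
// Hypothetical helper: resolves the documented variables with their defaults.
// In the real service, dotenv would populate process.env before this runs.
function loadConfig(env = process.env) {
  return {
    port: Number(env.PORT ?? 3001),
    provider: env.INFERENCE_PROVIDER ?? "llamacpp",
    // Points at llama-server itself, not at this service.
    inferenceUrl: env.INFERENCE_URL ?? "http://localhost:8080",
    defaultModel: env.DEFAULT_MODEL ?? "local-model",
  };
}

// Example: override only the provider and keep every other default.
const config = loadConfig({ INFERENCE_PROVIDER: "ollama" });
console.log(config);
```

Passing the environment in as an argument keeps the helper easy to exercise without touching the real process environment.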

Provider Architecture

The active provider is selected at startup via INFERENCE_PROVIDER and loaded from src/providers/. Both providers expose identical function signatures.

Supported Providers

| Provider | Value | Runtime |
| --- | --- | --- |
| llama.cpp | llamacpp | llama.cpp server (OpenAI-compatible API); current default |
| Ollama | ollama | Ollama via the ollama npm package; available as fallback |

Switching providers requires only a .env change — no code modifications:

```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```

The provider loader throws immediately on an unknown value, preventing silent misconfiguration.
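
The fail-fast selection described above can be sketched as follows. This is a minimal illustration and not the contents of src/infer.js: the inline provider objects stand in for the modules under src/providers/, and the `complete` method name and the `loadProvider` helper are assumptions made for the example.

```javascript
// Illustrative stand-ins for src/providers/llamacpp.js and src/providers/ollama.js.
// The method name `complete` is an assumption made for this sketch.
const providers = {
  llamacpp: { name: "llamacpp", complete: async (prompt) => `llamacpp:${prompt}` },
  ollama: { name: "ollama", complete: async (prompt) => `ollama:${prompt}` },
};

// Select the provider once at startup and fail fast on anything unrecognised.
function loadProvider(value = "llamacpp") {
  if (!Object.hasOwn(providers, value)) {
    const known = Object.keys(providers).join(", ");
    throw new Error(`Unknown INFERENCE_PROVIDER "${value}" (expected one of: ${known})`);
  }
  return providers[value];
}

const active = loadProvider("llamacpp");
console.log(`Active provider: ${active.name}`);
```

Because both stand-ins expose the same method, callers can use `active.complete(...)` without knowing which runtime is behind it.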

Internal Structure

```
src/
├── providers/
│   ├── ollama.js      # Ollama provider
│   └── llamacpp.js    # llama.cpp provider (OpenAI-compatible REST)
├── routes/
│   └── inference.js   # /complete and /complete/stream route handlers
├── infer.js           # Provider loader — selects and re-exports active provider
└── index.js           # Express app + route definitions
```

llama.cpp Provider

Uses the OpenAI-compatible REST API exposed by llama-server.

Starting llama-server

llama-server must be started manually on the main PC before the inference service can handle requests:

```powershell
.\llama-gpu\llama-server.exe `
  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
  -ngl 99 `
  --reasoning off `
  --host 0.0.0.0 `
  --port 8080 `
  -c 64000
```

| Flag | Description |
| --- | --- |
| -ngl 99 | Offload as many layers as possible to GPU |
| --reasoning off | Disables thinking delay on Gemma 4 models |
| --host 0.0.0.0 | Allows LAN connections |
| -c 64000 | Context window size in tokens |

-c 64000 is intentionally large. Because NexusAI's memory architecture handles context injection, a smaller context window is often sufficient and can be used if VRAM pressure builds.

Model Naming

The model name in requests must match the name reported by llama-server exactly, including the .gguf extension. To see the reported name, query the models endpoint:

```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```

Set DEFAULT_MODEL in .env to the exact reported name.
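
The same lookup can be done from Node. The sketch below assumes the response follows the OpenAI-style models list layout (a `data` array whose entries carry an `id`), which is worth checking against the actual output of the command above. The helper names `listModelIds` and `fetchModelIds` are hypothetical, and the sample payload is made up for the offline example.

```javascript
// Extracts model ids from an OpenAI-style /v1/models response body.
function listModelIds(body) {
  return (body?.data ?? []).map((entry) => entry.id);
}

// Fetches the model list from llama-server (needs Node 18+ for the global fetch).
async function fetchModelIds(baseUrl) {
  const res = await fetch(new URL("/v1/models", baseUrl));
  if (!res.ok) throw new Error(`llama-server responded with HTTP ${res.status}`);
  return listModelIds(await res.json());
}

// Offline example with a made-up payload in the assumed shape.
const sample = { object: "list", data: [{ id: "example-model.gguf", object: "model" }] };
console.log(listModelIds(sample));
```

Against a running server, `fetchModelIds("http://192.168.0.79:8080")` would resolve to the names that DEFAULT_MODEL has to match.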

Inference Parameters

| NexusAI option | API field | Default |
| --- | --- | --- |
| temperature | temperature | 0.7 |
| maxTokens | max_tokens | 1024 |
| topP | top_p | 0.9 |
| topK | top_k | 40 |
| repeatPenalty | repeat_penalty | 1.1 |
| seed | seed | null (random) |
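
One way to picture this mapping is a small translation step from NexusAI options to request fields. The sketch below is illustrative only: `toApiParams` is a hypothetical helper, and omitting a null seed from the request (so the runtime picks one) is an assumption about how "null (random)" is realised, not something the source confirms.

```javascript
// Option names, API field names and defaults copied from the parameter table.
const PARAMS = [
  ["temperature", "temperature", 0.7],
  ["maxTokens", "max_tokens", 1024],
  ["topP", "top_p", 0.9],
  ["topK", "top_k", 40],
  ["repeatPenalty", "repeat_penalty", 1.1],
  ["seed", "seed", null],
];

// Hypothetical helper: translates NexusAI options into API request fields.
function toApiParams(options = {}) {
  const body = {};
  for (const [option, field, fallback] of PARAMS) {
    const value = options[option] ?? fallback;
    // Assumption: a null seed is left out so the runtime chooses a random one.
    if (value !== null) body[field] = value;
  }
  return body;
}

console.log(toApiParams({ temperature: 0.2, maxTokens: 256 }));
```

Using `??` rather than `||` means an explicit `0` (for example `topK: 0`) is passed through instead of being replaced by the default.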

Streaming Response Format

The llama.cpp provider yields chunks in this shape:

```javascript
{ response: "token text", done: false }
// final chunk:
{ response: "", done: true, model: "model-name.gguf", tokenCount: 42 }
```

The inference route re-emits these chunks as server-sent events (SSE):

```
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
```

model and tokenCount are captured from the llama.cpp chunk whose finish_reason is stop, and are emitted on the done event.
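
The re-emission step can be sketched as a small formatter plus a loop over the provider's chunks. This is not the code in src/routes/inference.js: `toSseEvent`, `streamAsSse` and the stand-in generator are hypothetical names for illustration. Each event here ends with a blank line, as the SSE format requires, which the compact listing above does not show.

```javascript
// Formats one provider chunk as an SSE event, following the shapes shown above.
function toSseEvent(chunk) {
  const payload = chunk.done
    ? { done: true, model: chunk.model, tokenCount: chunk.tokenCount }
    : { response: chunk.response };
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Consumes an async iterable of chunks and writes SSE events, ending with [DONE].
async function streamAsSse(chunks, write) {
  for await (const chunk of chunks) write(toSseEvent(chunk));
  write("data: [DONE]\n\n");
}

// Stand-in generator used in place of the real provider for this example.
async function* fakeProvider() {
  yield { response: "token text", done: false };
  yield { response: "", done: true, model: "model-name.gguf", tokenCount: 42 };
}

const events = [];
const finished = streamAsSse(fakeProvider(), (event) => events.push(event));
finished.then(() => process.stdout.write(events.join("")));
```

In an Express handler, the `write` callback would typically be `(event) => res.write(event)` after setting the `Content-Type: text/event-stream` header.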

For all HTTP endpoints, see api-routes.md.