# Inference Service
**Package:** `@nexusai/inference-service`
**Location:** `packages/inference-service`
**Deployed on:** Main PC
**Port:** 3001
## Purpose
Thin adapter layer around the local LLM runtime. Receives assembled context
packages from the orchestration service and returns model responses. Uses a
provider pattern to abstract the underlying runtime, making it straightforward
to switch inference backends without changes to the rest of the system.
## Dependencies
- `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
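The defaults in the table above can be applied at startup roughly as follows. This is an illustrative sketch, not the service's actual code; the `loadConfig` helper and field names are hypothetical, but the variable names and defaults match the table:

```javascript
// Illustrative config loading with the defaults documented above.
// loadConfig is a hypothetical helper; only the env var names and
// default values are taken from the table.
function loadConfig(env) {
  return {
    port: Number(env.PORT || 3001),
    provider: env.INFERENCE_PROVIDER || 'ollama',
    inferenceUrl: env.INFERENCE_URL || 'http://localhost:11434',
    defaultModel: env.DEFAULT_MODEL || 'llama3.2',
  };
}

const config = loadConfig(process.env);
```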
## Provider Architecture
The inference service uses a provider pattern to abstract the underlying
LLM runtime. The active provider is selected at startup via `INFERENCE_PROVIDER`
and loaded from `src/providers/`. Both providers expose identical function
signatures, so the rest of the service is unaware of which backend is active.
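A minimal sketch of the selection logic described above. The actual `src/infer.js` is not shown in this document, so the function name here is illustrative; only the provider values and the fail-at-startup behavior follow from the text:

```javascript
// Hypothetical provider-loader sketch. The real src/infer.js may differ;
// this only illustrates selecting a provider module by INFERENCE_PROVIDER.
const SUPPORTED_PROVIDERS = ['ollama', 'llamacpp'];

function resolveProviderPath(name) {
  // Fail fast on unknown values so misconfiguration surfaces at startup,
  // not on the first request.
  if (!SUPPORTED_PROVIDERS.includes(name)) {
    throw new Error(`Unknown INFERENCE_PROVIDER: ${name}`);
  }
  return `./providers/${name}.js`;
}

// Both providers expose identical function signatures, so callers never
// branch on which backend is active.
const providerPath = resolveProviderPath(process.env.INFERENCE_PROVIDER || 'ollama');
```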
### Supported Providers
| Provider | Value | Runtime |
|---|---|---|
| Ollama | `ollama` | Ollama via the `ollama` npm package |
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |
Switching providers requires only a `.env` change — no code modifications needed.
```
INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080
```
## Internal Structure
```
src/
├── providers/
│   ├── ollama.js     # Ollama provider — uses ollama npm package
│   └── llamacpp.js   # llama.cpp provider — uses OpenAI-compatible REST API
├── routes/
│   └── inference.js  # /complete and /complete/stream route handlers
├── infer.js          # Provider loader — selects and re-exports active provider
└── index.js          # Express app + route definitions
```
## Endpoints
### Health
| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports active provider and model |
### Inference
| Method | Path | Description |
|---|---|---|
| POST | /complete | Standard completion — returns full response when done |
| POST | /complete/stream | Streaming completion via Server-Sent Events |
---
**POST /complete**
Request body:
```json
{
  "prompt": "What is the capital of France?",
  "model": "companion:latest",
  "temperature": 0.7,
  "maxTokens": 1024
}
```
- `model` is optional — falls back to `DEFAULT_MODEL` if omitted.
- `maxTokens` is optional — defaults to 1024.
- `temperature` is optional — defaults to 0.7.
Response:
```json
{
  "text": "The capital of France is Paris.",
  "model": "companion:latest",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
}
```
| Field | Description |
|---|---|
| `text` | The model's response |
| `model` | Model name as reported by the provider |
| `done` | Whether generation completed normally |
| `evalCount` | Number of tokens generated |
| `promptEvalCount` | Number of tokens in the prompt |
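A client call to this endpoint might be built as below. This is a sketch under the request/response shapes documented above; `buildCompleteRequest` is a hypothetical helper, and the service URL assumes the documented port 3001 on localhost:

```javascript
// Hypothetical helper that builds a /complete request. Defaults mirror the
// documented behavior; an omitted model is dropped from the JSON body so the
// service falls back to DEFAULT_MODEL server-side.
function buildCompleteRequest(baseUrl, { prompt, model, temperature = 0.7, maxTokens = 1024 }) {
  return {
    url: `${baseUrl}/complete`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // JSON.stringify omits undefined fields, so model only appears if set.
      body: JSON.stringify({ prompt, model, temperature, maxTokens }),
    },
  };
}

// Usage (assumes the service is running locally on port 3001):
// const { url, options } = buildCompleteRequest('http://localhost:3001',
//   { prompt: 'What is the capital of France?' });
// const { text } = await (await fetch(url, options)).json();
```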
---
**POST /complete/stream**
Same request body as `/complete` (`maxTokens` not applicable for streaming).
Response is a stream of Server-Sent Events. Each event contains a partial
response chunk as JSON. The stream closes with a final `data: [DONE]` event.
```
data: {"model":"companion:latest","response":"The","done":false}
data: {"model":"companion:latest","response":" capital","done":false}
data: {"model":"companion:latest","response":" of France is Paris.","done":false}
data: [DONE]
```
Clients should read the `response` field from each chunk and accumulate
them to build the full response string.
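The accumulation step can be sketched as a small parser over the raw SSE lines. `accumulateSse` is a hypothetical client-side helper, not part of the service; it only relies on the event format shown above:

```javascript
// Sketch of client-side accumulation for /complete/stream output.
// Takes raw SSE lines ("data: ..." strings) and concatenates the
// `response` field of each chunk until the [DONE] terminator.
function accumulateSse(lines) {
  let text = '';
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;  // skip blanks/comments
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') break;           // end-of-stream marker
    text += JSON.parse(payload).response;
  }
  return text;
}
```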