CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

See the root CLAUDE.md for overall architecture, service roles, and deployment layout.

Running This Service

npm run inference                          # From repo root
npm -w packages/inference-service run dev  # With --watch

Default port: 3001. Set INFERENCE_PROVIDER to select the backend.

Provider Pattern

src/infer.js reads INFERENCE_PROVIDER at startup and loads one of two providers:

`INFERENCE_PROVIDER`	Module	Backend
`ollama` (default)	`src/providers/ollama.js`	Ollama npm client → `/api/generate`
`llamacpp`	`src/providers/llamacpp.js`	Raw fetch → `/v1/chat/completions` (OpenAI-compatible)

An unrecognised INFERENCE_PROVIDER value throws immediately at startup, so a misconfiguration fails fast rather than at request time.

Both providers export the same interface: complete(prompt, options) and completeStream(prompt, options).
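The selection step can be sketched roughly as below. This is an illustration of the fail-fast lookup, not the code in src/infer.js: `PROVIDER_MODULES` and `resolveProviderPath` are hypothetical names, and the real module may load providers differently.

```javascript
// Hypothetical lookup table mirroring the provider table above.
const PROVIDER_MODULES = {
  ollama: 'src/providers/ollama.js',
  llamacpp: 'src/providers/llamacpp.js',
};

// Resolve the provider name and module path, defaulting to 'ollama'.
// Throws for unknown names so a bad INFERENCE_PROVIDER surfaces at startup.
function resolveProviderPath(env = process.env) {
  const name = env.INFERENCE_PROVIDER || 'ollama';
  if (!Object.hasOwn(PROVIDER_MODULES, name)) {
    const known = Object.keys(PROVIDER_MODULES).join(', ');
    throw new Error(`Unknown INFERENCE_PROVIDER "${name}" (expected one of: ${known})`);
  }
  return { name, modulePath: PROVIDER_MODULES[name] };
}
```

Calling a function like this once during startup, before the server begins listening, is what keeps the failure out of the request path.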

Environment Variables

Variable	Default	Description
`PORT`	`3001`	Port to listen on
`INFERENCE_PROVIDER`	`ollama`	`ollama` or `llamacpp`
`INFERENCE_URL`	`http://localhost:11434` (Ollama) / `http://localhost:8080` (llama.cpp)	Backend URL
`DEFAULT_MODEL`	Provider-specific	Model name passed to backend

INFERENCE_URL defaults differ per provider: Ollama falls back to http://localhost:11434 (the Ollama default) and llama.cpp falls back to http://localhost:8080 (the llama-server default).
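A minimal sketch of how these defaults could be resolved, using the values from the table above. `DEFAULT_URLS` and `resolveConfig` are hypothetical names, and the assumption that an explicit INFERENCE_URL overrides the per-provider default is not confirmed here. DEFAULT_MODEL is left out because its provider-specific values are not listed in this file.

```javascript
// Per-provider backend URLs, taken from the environment variable table.
const DEFAULT_URLS = {
  ollama: 'http://localhost:11434',
  llamacpp: 'http://localhost:8080',
};

// Hypothetical helper: build the service config from an env-like object.
function resolveConfig(env = process.env) {
  const provider = env.INFERENCE_PROVIDER || 'ollama';
  return {
    // Fall back to 3001 when PORT is unset or not numeric.
    port: Number(env.PORT) || 3001,
    provider,
    // Assumption: an explicit INFERENCE_URL wins over the provider default.
    url: env.INFERENCE_URL || DEFAULT_URLS[provider],
  };
}
```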

Options Resolution

Both providers use resolveOptions(options) to merge caller-supplied options with INFERENCE_DEFAULTS from shared constants. Any option not supplied by the caller falls back to the constant.
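The merge can be pictured with the sketch below. The numeric values are placeholders, and only the MAX_TOKENS key is named in this file; the other constant names are assumed for illustration. The real resolveOptions and INFERENCE_DEFAULTS live in the shared code and may differ.

```javascript
// Placeholder defaults. Only MAX_TOKENS is a confirmed key; the rest are assumed.
const INFERENCE_DEFAULTS = {
  TEMPERATURE: 0.7,
  MAX_TOKENS: 512,
  TOP_P: 0.9,
  TOP_K: 40,
  REPEAT_PENALTY: 1.1,
};

// Use the caller's value when supplied, otherwise the shared constant.
// `??` only falls back on null/undefined, so temperature: 0 is preserved.
function resolveOptions(options = {}) {
  return {
    temperature: options.temperature ?? INFERENCE_DEFAULTS.TEMPERATURE,
    maxTokens: options.maxTokens ?? INFERENCE_DEFAULTS.MAX_TOKENS,
    topP: options.topP ?? INFERENCE_DEFAULTS.TOP_P,
    topK: options.topK ?? INFERENCE_DEFAULTS.TOP_K,
    repeatPenalty: options.repeatPenalty ?? INFERENCE_DEFAULTS.REPEAT_PENALTY,
  };
}
```

Whether the real helper uses `??` or `||` matters for callers that send 0: with `||`, a temperature of 0 would be replaced by the default.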

Streaming Chunk Format

The two providers yield differently shaped chunks — the route in src/routes/inference.js normalises them:

Ollama yields raw Ollama generate chunks: { response, done, model, eval_count, prompt_eval_count, ... }

llama.cpp yields:

Per-token: { response: delta, done: false }
Final: { response: '', done: true, model, tokenCount }, where tokenCount is the sum of completion_tokens and prompt_tokens from the usage chunk

The route streams text from chunk.response and captures metadata when chunk.done is true. Token count is not captured for Ollama streaming: Ollama's done chunk carries eval_count and prompt_eval_count, but the route reads only chunk.tokenCount, a field that only the llama.cpp provider sets. As a result, Ollama streaming calls always report tokenCount: 0 to the client.
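One possible way to close this gap is to read the metadata from either chunk shape. The sketch below is not what the route currently does; `extractMetadata` is a hypothetical helper. It assumes the Ollama total should mirror the llama.cpp one, that is, generated tokens plus prompt tokens.

```javascript
// Hypothetical helper: reduce either provider's final chunk to { model, tokenCount }.
function extractMetadata(chunk) {
  if (typeof chunk.tokenCount === 'number') {
    // llama.cpp provider shape: the count is already summed.
    return { model: chunk.model, tokenCount: chunk.tokenCount };
  }
  // Ollama shape: add generated and prompt token counts, treating missing fields as 0.
  const tokenCount = (chunk.eval_count ?? 0) + (chunk.prompt_eval_count ?? 0);
  return { model: chunk.model, tokenCount };
}
```

An alternative with the same effect would be to have the Ollama provider attach tokenCount to its final chunk, which keeps the route unaware of provider-specific fields.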

Known Issue: `maxTokens` Missing from Streaming Route

POST /complete correctly destructures maxTokens from the request body and passes it through. POST /complete/stream does not — it omits maxTokens from its destructuring, so streaming completions always use INFERENCE_DEFAULTS.MAX_TOKENS regardless of what the caller sends. This means /chat/stream has a different effective token ceiling than /chat.
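A sketch of one way to keep the two routes from drifting apart: parse the body in a single shared helper that both handlers call. `parseCompletionBody` is a hypothetical name, and placing model inside the options object is an assumption about how the providers receive it.

```javascript
// Hypothetical shared parser for POST /complete and POST /complete/stream.
// Destructuring every documented field in one place means maxTokens cannot be
// dropped by one route while the other still forwards it.
function parseCompletionBody(body = {}) {
  const { prompt, model, temperature, maxTokens, topP, topK, repeatPenalty } = body;
  return {
    prompt,
    // Fields the caller omitted stay undefined so defaults can apply later.
    options: { model, temperature, maxTokens, topP, topK, repeatPenalty },
  };
}
```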

SSE Format (route → caller)

data: {"response":"Hello"}        ← per token
data: {"response":" world"}
data: {"done":true,"model":"...","tokenCount":42}  ← final metadata
data: [DONE]                       ← sentinel
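On the caller side, the format above can be consumed along these lines. This is a minimal sketch that assumes the whole response body has already been buffered into a string; a real client would more likely process lines as they arrive. `parseSseStream` is a hypothetical name, and the arrow annotations in the listing above are not part of the wire format.

```javascript
// Parse a buffered SSE payload into the concatenated text and final metadata.
function parseSseStream(raw) {
  let text = '';
  let metadata = null;
  for (const line of raw.split('\n')) {
    // Skip blank separator lines and any non-data fields.
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice('data: '.length).trim();
    // The sentinel marks the end of the stream; nothing follows it.
    if (payload === '[DONE]') break;
    const event = JSON.parse(payload);
    if (event.done) {
      metadata = { model: event.model, tokenCount: event.tokenCount };
    } else if (typeof event.response === 'string') {
      text += event.response;
    }
  }
  return { text, metadata };
}
```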

API Endpoints

Method	Path	Notes
GET	`/health`	Returns `{ service, status, provider, model }`
POST	`/complete`	Body: `{ prompt, model?, temperature?, maxTokens?, topP?, topK?, repeatPenalty? }`
POST	`/complete/stream`	Same body as `/complete` except `maxTokens` is silently ignored
