CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
See the root CLAUDE.md for overall architecture, service roles, and deployment layout.
Running This Service
npm run inference # From repo root
npm -w packages/inference-service run dev # With --watch
Default port: 3001. Set INFERENCE_PROVIDER to select the backend.
Provider Pattern
src/infer.js reads INFERENCE_PROVIDER at startup and loads one of two providers:
| INFERENCE_PROVIDER | Module | Backend |
|---|---|---|
| ollama (default) | src/providers/ollama.js | Ollama npm client → /api/generate |
| llamacpp | src/providers/llamacpp.js | Raw fetch → /v1/chat/completions (OpenAI-compatible) |
An unknown provider throws immediately at startup — fail-fast, not at request time.
Both providers export the same interface: complete(prompt, options) and completeStream(prompt, options).
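The fail-fast selection described above could look roughly like the following. This is a minimal sketch, not the contents of src/infer.js: the selectProvider helper, the makeStubProvider factory, and the stub return values are hypothetical. Only the provider names, the ollama default, and the complete/completeStream interface come from this document.

```javascript
// Hypothetical stand-ins for the two provider modules. Each exposes the
// shared interface: complete(prompt, options) and completeStream(prompt, options).
function makeStubProvider(label) {
  return {
    async complete(prompt, options) {
      return { response: `[${label}] ${prompt}`, done: true };
    },
    async *completeStream(prompt, options) {
      yield { response: `[${label}] ${prompt}`, done: false };
      yield { response: '', done: true };
    },
  };
}

const providers = new Map([
  ['ollama', makeStubProvider('ollama')],
  ['llamacpp', makeStubProvider('llamacpp')],
]);

// Resolve the provider once, at startup. An unknown name throws here,
// before any request is served.
function selectProvider(name = 'ollama') {
  const provider = providers.get(name);
  if (!provider) {
    const known = [...providers.keys()].join(', ');
    throw new Error(`Unknown INFERENCE_PROVIDER "${name}" (expected one of: ${known})`);
  }
  return provider;
}

// At startup this would be called as:
// const provider = selectProvider(process.env.INFERENCE_PROVIDER || 'ollama');
```

Using a Map rather than a plain object lookup avoids accidentally matching inherited property names such as toString.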
Environment Variables
| Variable | Default | Description |
|---|---|---|
| PORT | 3001 | Port to listen on |
| INFERENCE_PROVIDER | ollama | ollama or llamacpp |
| INFERENCE_URL | http://localhost:11434 (Ollama) / http://localhost:8080 (llama.cpp) | Backend URL |
| DEFAULT_MODEL | Provider-specific | Model name passed to backend |
INFERENCE_URL defaults differ per provider — Ollama uses the Ollama default URL, llama.cpp uses the llama-server default.
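As an illustration of how the per-provider URL default might be resolved, here is a small sketch. The resolveConfig helper and the shape of its return value are assumptions made for this example. The default values are the ones listed in the table above, and the provider-specific DEFAULT_MODEL fallback is left out because this document does not state it.

```javascript
// Default backend URLs per provider, as listed in the environment variable table.
const DEFAULT_URLS = {
  ollama: 'http://localhost:11434',
  llamacpp: 'http://localhost:8080',
};

// Hypothetical helper: build the effective config from an env-like object
// (for example process.env).
function resolveConfig(env) {
  const provider = env.INFERENCE_PROVIDER || 'ollama';
  return {
    port: Number(env.PORT) || 3001,
    provider,
    // An explicit INFERENCE_URL wins; otherwise fall back per provider.
    inferenceUrl: env.INFERENCE_URL || DEFAULT_URLS[provider],
    // Provider-specific model default is not documented here, so it is
    // passed through as-is (possibly undefined).
    defaultModel: env.DEFAULT_MODEL,
  };
}
```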
Options Resolution
Both providers use resolveOptions(options) to merge caller-supplied options with INFERENCE_DEFAULTS from shared constants. Any option not supplied by the caller falls back to the constant.
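The merge behaviour can be sketched as follows. The numeric values in INFERENCE_DEFAULTS below are placeholders, not the real shared constants, and the key names other than MAX_TOKENS are assumed. The actual resolveOptions may also cover more options (topK, repeatPenalty) or merge differently.

```javascript
// Placeholder defaults for illustration only. The real INFERENCE_DEFAULTS live
// in the shared constants; only MAX_TOKENS is named in this document.
const INFERENCE_DEFAULTS = {
  TEMPERATURE: 0.7,
  MAX_TOKENS: 512,
  TOP_P: 0.9,
};

// Merge caller-supplied options with the defaults. The ?? operator only falls
// back on null/undefined, so an explicit temperature of 0 is preserved.
function resolveOptions(options = {}) {
  return {
    temperature: options.temperature ?? INFERENCE_DEFAULTS.TEMPERATURE,
    maxTokens: options.maxTokens ?? INFERENCE_DEFAULTS.MAX_TOKENS,
    topP: options.topP ?? INFERENCE_DEFAULTS.TOP_P,
  };
}
```

Nullish coalescing is used here instead of || so that falsy but valid values such as 0 are not silently replaced by the default.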
Streaming Chunk Format
The two providers yield differently shaped chunks — the route in src/routes/inference.js normalises them:
Ollama yields raw Ollama generate chunks: { response, done, model, eval_count, prompt_eval_count, ... }
llama.cpp yields:
- Per-token: { response: delta, done: false }
- Final: { response: '', done: true, model, tokenCount }, where tokenCount is the sum of completion_tokens + prompt_tokens from the usage chunk
The route checks chunk.response to stream text and chunk.done to capture metadata. For Ollama streaming, token count is not captured — the done chunk from Ollama contains eval_count/prompt_eval_count but the route only reads chunk.tokenCount (a llama.cpp field). Ollama streaming calls always report tokenCount: 0 to the client.
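One possible way to close this gap is sketched below, under the assumption that the route should report prompt plus completion tokens for both providers (mirroring the llama.cpp sum). The tokenCountFromDoneChunk helper is hypothetical and is not part of src/routes/inference.js.

```javascript
// Hypothetical helper: derive a token count from a final chunk of either shape.
function tokenCountFromDoneChunk(chunk) {
  // llama.cpp provider shape: the count is already summed.
  if (typeof chunk.tokenCount === 'number') return chunk.tokenCount;
  // Ollama generate shape: the two counts are reported separately.
  return (chunk.eval_count ?? 0) + (chunk.prompt_eval_count ?? 0);
}
```

With a helper like this, the route could call it on the done chunk instead of reading chunk.tokenCount directly.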
Known Issue: maxTokens Missing from Streaming Route
POST /complete correctly destructures maxTokens from the request body and passes it through. POST /complete/stream does not: it omits maxTokens from its destructuring, so streaming completions always use INFERENCE_DEFAULTS.MAX_TOKENS regardless of what the caller sends. This means /complete/stream has a different effective token ceiling than /complete.
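A fix would amount to destructuring the same fields in both routes. The sketch below shows the idea with a hypothetical parseCompletionBody helper that both handlers could share. The field names come from the request body listed under API Endpoints, and the actual route code may be organised differently.

```javascript
// Hypothetical shared helper: pull every supported field out of the request
// body, including maxTokens, so neither route can drop one by accident.
function parseCompletionBody(body = {}) {
  const { prompt, model, temperature, maxTokens, topP, topK, repeatPenalty } = body;
  return {
    prompt,
    model,
    options: { temperature, maxTokens, topP, topK, repeatPenalty },
  };
}
```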
SSE Format (route → caller)
data: {"response":"Hello"} ← per token
data: {"response":" world"}
data: {"done":true,"model":"...","tokenCount":42} ← final metadata
data: [DONE] ← sentinel
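For callers consuming this stream, a parser might look like the sketch below. It assumes the arrows in the listing above are annotations rather than part of the wire format, and that each event is a single data: line. The parseSseEvents name and its return shape are made up for this example, and a real client would parse incrementally rather than from one buffered string.

```javascript
// Parse a buffered SSE payload in the format shown above into the
// concatenated text, the final metadata event, and a completion flag.
function parseSseEvents(raw) {
  let text = '';
  let meta = null;
  let finished = false;
  for (const line of raw.split('\n')) {
    if (!line.startsWith('data: ')) continue; // skip blank separators
    const payload = line.slice('data: '.length).trim();
    if (payload === '[DONE]') {
      finished = true; // sentinel: nothing follows
      break;
    }
    const event = JSON.parse(payload);
    if (event.done) {
      meta = event; // final metadata: model, tokenCount
    } else if (typeof event.response === 'string') {
      text += event.response; // per-token text
    }
  }
  return { text, meta, finished };
}
```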
API Endpoints
| Method | Path | Notes |
|---|---|---|
| GET | /health | Returns { service, status, provider, model } |
| POST | /complete | Body: { prompt, model?, temperature?, maxTokens?, topP?, topK?, repeatPenalty? } |
| POST | /complete/stream | Same body as /complete except maxTokens is silently ignored |
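A request to POST /complete could be assembled as in the sketch below. It assumes the body is sent as JSON and that the service listens on the default port. The buildCompleteRequest helper is hypothetical, and the response shape is not documented here, so the example makes no claim about it.

```javascript
// Hypothetical helper: build the URL and fetch options for POST /complete.
function buildCompleteRequest(baseUrl, body) {
  return {
    url: new URL('/complete', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' }, // assumes a JSON body
      body: JSON.stringify(body),
    },
  };
}

// Usage (requires the service to be running, and Node 18+ for global fetch):
// const { url, init } = buildCompleteRequest('http://localhost:3001', { prompt: 'Hello', maxTokens: 64 });
// const res = await fetch(url, init);
```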