nexusAI/packages/inference-service/CLAUDE.md
2026-04-27 20:17:05 -07:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
See the root [CLAUDE.md](../../CLAUDE.md) for overall architecture, service roles, and deployment layout.
## Running This Service
```bash
npm run inference # From repo root
npm -w packages/inference-service run dev # With --watch
```
Default port: **3001**. Set `INFERENCE_PROVIDER` to select the backend.
## Provider Pattern
`src/infer.js` reads `INFERENCE_PROVIDER` at startup and loads one of two providers:
| `INFERENCE_PROVIDER` | Module | Backend |
|---|---|---|
| `ollama` (default) | `src/providers/ollama.js` | Ollama npm client → `/api/generate` |
| `llamacpp` | `src/providers/llamacpp.js` | Raw fetch → `/v1/chat/completions` (OpenAI-compatible) |
An unknown `INFERENCE_PROVIDER` value throws at startup, so a misconfiguration fails fast instead of surfacing at request time.
Both providers export the same interface: `complete(prompt, options)` and `completeStream(prompt, options)`.
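
The fail-fast selection described above could look roughly like the following. This is a minimal sketch, not the contents of `src/infer.js`: `makeStubProvider`, `PROVIDERS`, and `selectProvider` are hypothetical names, and the stub objects stand in for the real provider modules.

```javascript
// Hypothetical stand-ins for src/providers/ollama.js and src/providers/llamacpp.js.
// Both expose the same interface: complete() and completeStream().
const makeStubProvider = (label) => ({
  async complete(prompt, options) {
    return { response: `[${label}] ${prompt}`, done: true };
  },
  async *completeStream(prompt, options) {
    yield { response: `[${label}] ${prompt}`, done: false };
    yield { response: '', done: true };
  },
});

const PROVIDERS = new Map([
  ['ollama', makeStubProvider('ollama')],
  ['llamacpp', makeStubProvider('llamacpp')],
]);

// Resolve the provider once, at startup. An unrecognised name throws here,
// before any request is served.
function selectProvider(name = 'ollama') {
  const provider = PROVIDERS.get(name);
  if (!provider) {
    const known = [...PROVIDERS.keys()].join(', ');
    throw new Error(`Unknown INFERENCE_PROVIDER "${name}" (expected one of: ${known})`);
  }
  return provider;
}

// Usage at module load time:
// const provider = selectProvider(process.env.INFERENCE_PROVIDER || 'ollama');
```

Because both entries share one interface, the route code can call `provider.complete(...)` or `provider.completeStream(...)` without knowing which backend is active.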
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `PORT` | `3001` | Port to listen on |
| `INFERENCE_PROVIDER` | `ollama` | `ollama` or `llamacpp` |
| `INFERENCE_URL` | `http://localhost:11434` (Ollama) / `http://localhost:8080` (llama.cpp) | Backend URL |
| `DEFAULT_MODEL` | Provider-specific | Model name passed to backend |
`INFERENCE_URL` defaults differ per provider: Ollama uses its standard local address (port 11434), and llama.cpp uses the llama-server default (port 8080).
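
As an illustration of how these defaults interact, the sketch below derives the effective settings from an env-like object. `resolveConfig` and `DEFAULT_URLS` are hypothetical names, not identifiers from this package. The provider-specific `DEFAULT_MODEL` fallback is not documented here, so the sketch leaves it undefined rather than guessing.

```javascript
// Default backend URLs per provider, taken from the table above.
const DEFAULT_URLS = {
  ollama: 'http://localhost:11434',
  llamacpp: 'http://localhost:8080',
};

// Hypothetical helper: compute effective settings from an env-like object.
function resolveConfig(env) {
  const provider = env.INFERENCE_PROVIDER || 'ollama';
  if (!Object.hasOwn(DEFAULT_URLS, provider)) {
    throw new Error(`Unknown INFERENCE_PROVIDER "${provider}"`);
  }
  return {
    port: Number(env.PORT) || 3001,
    provider,
    // An explicit INFERENCE_URL wins; otherwise use the provider's default.
    url: env.INFERENCE_URL || DEFAULT_URLS[provider],
    // The real provider-specific default model is not listed in this file.
    model: env.DEFAULT_MODEL,
  };
}

// Usage: const config = resolveConfig(process.env);
```

Note that switching `INFERENCE_PROVIDER` to `llamacpp` without setting `INFERENCE_URL` also changes the backend address, which is easy to miss when debugging connection errors.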
## Options Resolution
Both providers use `resolveOptions(options)` to merge caller-supplied options with `INFERENCE_DEFAULTS` from shared constants. Any option not supplied by the caller falls back to the constant.
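
A minimal sketch of that merge, under stated assumptions: the numeric defaults below are placeholders, and `MAX_TOKENS` is the only key name this document confirms. The real `resolveOptions` and `INFERENCE_DEFAULTS` may be shaped differently.

```javascript
// Placeholder values. The real INFERENCE_DEFAULTS live in the shared constants;
// TEMPERATURE and TOP_P are assumed key names, MAX_TOKENS appears in this document.
const INFERENCE_DEFAULTS = {
  TEMPERATURE: 0.7,
  MAX_TOKENS: 1024,
  TOP_P: 0.9,
};

// Fill in any option the caller did not supply.
// `??` only falls back on null/undefined, so an explicit 0 is preserved.
function resolveOptions(options = {}) {
  return {
    temperature: options.temperature ?? INFERENCE_DEFAULTS.TEMPERATURE,
    maxTokens: options.maxTokens ?? INFERENCE_DEFAULTS.MAX_TOKENS,
    topP: options.topP ?? INFERENCE_DEFAULTS.TOP_P,
  };
}
```

The sketch uses `??` rather than `||` so that a caller passing `temperature: 0` gets deterministic sampling instead of the default. Whether the real implementation makes the same choice is not stated here.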
## Streaming Chunk Format
The two providers yield differently shaped chunks — the route in `src/routes/inference.js` normalises them:
**Ollama** yields raw Ollama generate chunks: `{ response, done, model, eval_count, prompt_eval_count, ... }`
**llama.cpp** yields:
- Per-token: `{ response: delta, done: false }`
- Final: `{ response: '', done: true, model, tokenCount }`, where `tokenCount` is `completion_tokens + prompt_tokens` from the usage chunk
The route checks `chunk.response` to stream text and `chunk.done` to capture metadata. For Ollama streaming, **token count is not captured** — the done chunk from Ollama contains `eval_count`/`prompt_eval_count` but the route only reads `chunk.tokenCount` (a llama.cpp field). Ollama streaming calls always report `tokenCount: 0` to the client.
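
One possible way to close this gap is to read the count from whichever shape the final chunk has. The helper below is a hypothetical sketch, not code from `src/routes/inference.js`. For Ollama it sums `eval_count` and `prompt_eval_count`, mirroring how the llama.cpp provider sums completion and prompt tokens.

```javascript
// Hypothetical helper: derive a token count from either provider's final chunk.
function tokenCountFromDoneChunk(chunk) {
  // llama.cpp provider shape: tokenCount is already computed.
  if (typeof chunk.tokenCount === 'number') {
    return chunk.tokenCount;
  }
  // Ollama generate shape: generated tokens plus prompt tokens.
  const generated = chunk.eval_count ?? 0;
  const prompt = chunk.prompt_eval_count ?? 0;
  return generated + prompt;
}
```

With a helper like this, the route could report a non-zero `tokenCount` for Ollama streams without changing the SSE payload shape seen by callers.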
## Known Issue: `maxTokens` Missing from Streaming Route
`POST /complete` correctly destructures `maxTokens` from the request body and passes it through. `POST /complete/stream` does **not** — it omits `maxTokens` from its destructuring, so streaming completions always use `INFERENCE_DEFAULTS.MAX_TOKENS` regardless of what the caller sends. This means `/chat/stream` has a different effective token ceiling than `/chat`.
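
A sketch of one way to prevent this kind of drift: have both routes share a single body parser so the destructured fields cannot diverge. `parseCompletionBody` is a hypothetical helper, and placing `model` inside the options object is an assumption about how the providers receive it.

```javascript
// Hypothetical helper shared by POST /complete and POST /complete/stream.
// Every documented body field is destructured in exactly one place.
function parseCompletionBody(body = {}) {
  const { prompt, model, temperature, maxTokens, topP, topK, repeatPenalty } = body;
  return {
    prompt,
    // Assumption: model travels with the sampling options.
    options: { model, temperature, maxTokens, topP, topK, repeatPenalty },
  };
}

// Usage inside either route handler (req.body is the parsed JSON body):
// const { prompt, options } = parseCompletionBody(req.body);
```

Fields the caller omits stay `undefined`, so a later defaults merge can still fill them in.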
## SSE Format (route → caller)
```
data: {"response":"Hello"} ← per token
data: {"response":" world"}
data: {"done":true,"model":"...","tokenCount":42} ← final metadata
data: [DONE] ← sentinel
```
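
For callers consuming this stream, the sketch below parses already-received SSE text into the concatenated response and the final metadata. `parseSse` is a hypothetical helper. It assumes complete lines, so a real client reading from the network would need to buffer partial chunks before calling it.

```javascript
// Parse SSE text in the format shown above.
// Returns the joined token text and the final metadata object (or null).
function parseSse(text) {
  const tokens = [];
  let meta = null;
  for (const line of text.split('\n')) {
    if (!line.startsWith('data: ')) continue; // skip blank separator lines
    const payload = line.slice('data: '.length).trim();
    if (payload === '[DONE]') break; // sentinel: nothing follows
    const event = JSON.parse(payload);
    if (event.done) {
      meta = event; // final metadata: model, tokenCount
    } else if (typeof event.response === 'string') {
      tokens.push(event.response);
    }
  }
  return { text: tokens.join(''), meta };
}
```

The `[DONE]` check runs before `JSON.parse` because the sentinel is not valid JSON.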
## API Endpoints
| Method | Path | Notes |
|---|---|---|
| GET | `/health` | Returns `{ service, status, provider, model }` |
| POST | `/complete` | Body: `{ prompt, model?, temperature?, maxTokens?, topP?, topK?, repeatPenalty? }` |
| POST | `/complete/stream` | Same body as `/complete` except `maxTokens` is silently ignored |