minor clean up

2026-04-27 20:17:05 -07:00
parent 055683424d
commit b58a4e4692
13 changed files with 171 additions and 18 deletions
--- a/packages/inference-service/CLAUDE.md
+++ b/packages/inference-service/CLAUDE.md
@@ -0,0 +1,75 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+See the root [CLAUDE.md](../../CLAUDE.md) for overall architecture, service roles, and deployment layout.
+
+## Running This Service
+
+```bash
+npm run inference                          # From repo root
+npm -w packages/inference-service run dev  # With --watch
+```
+
+Default port: **3001**. Set `INFERENCE_PROVIDER` to select the backend.
+
+## Provider Pattern
+
+`src/infer.js` reads `INFERENCE_PROVIDER` at startup and loads one of two providers:
+
+| `INFERENCE_PROVIDER` | Module | Backend |
+|---|---|---|
+| `ollama` (default) | `src/providers/ollama.js` | Ollama npm client → `/api/generate` |
+| `llamacpp` | `src/providers/llamacpp.js` | Raw fetch → `/v1/chat/completions` (OpenAI-compatible) |
+
+An unknown provider throws immediately at startup — fail-fast, not at request time.
+
+Both providers export the same interface: `complete(prompt, options)` and `completeStream(prompt, options)`.
+
+## Environment Variables
+
+| Variable | Default | Description |
+|---|---|---|
+| `PORT` | `3001` | Port to listen on |
+| `INFERENCE_PROVIDER` | `ollama` | `ollama` or `llamacpp` |
+| `INFERENCE_URL` | `http://localhost:11434` (Ollama) / `http://localhost:8080` (llama.cpp) | Backend URL |
+| `DEFAULT_MODEL` | Provider-specific | Model name passed to backend |
+
+`INFERENCE_URL` defaults differ per provider — Ollama uses the Ollama default URL, llama.cpp uses the llama-server default.
+
+## Options Resolution
+
+Both providers use `resolveOptions(options)` to merge caller-supplied options with `INFERENCE_DEFAULTS` from shared constants. Any option not supplied by the caller falls back to the constant.
+
+## Streaming Chunk Format
+
+The two providers yield differently shaped chunks — the route in `src/routes/inference.js` normalises them:
+
+**Ollama** yields raw Ollama generate chunks per token (`{ response, done, model, ... }`) and a normalised final chunk: `{ response: '', done: true, model, tokenCount }`, where `tokenCount` is `eval_count + prompt_eval_count` (each treated as 0 when absent).
+
+**llama.cpp** yields:
+- Per-token: `{ response: delta, done: false }`
+- Final: `{ response: '', done: true, model, tokenCount }`, where `tokenCount` is `completion_tokens + prompt_tokens` from the usage chunk
+
+The route checks `chunk.response` to stream text and `chunk.done` to capture metadata. On the done chunk it reads only `chunk.tokenCount`, so the Ollama provider folds `eval_count` and `prompt_eval_count` into that field rather than yielding its raw done chunk.
+
+## `maxTokens` Handling
+
+Both `POST /complete` and `POST /complete/stream` destructure `maxTokens` from the request body and pass it through to the provider. When the caller omits it, `resolveOptions` falls back to `INFERENCE_DEFAULTS.MAX_TOKENS`, so both routes share the same effective token ceiling.
+
+## SSE Format (route → caller)
+
+```
+data: {"response":"Hello"}        ← per token
+data: {"response":" world"}
+data: {"done":true,"model":"...","tokenCount":42}  ← final metadata
+data: [DONE]                       ← sentinel
+```
+
+## API Endpoints
+
+| Method | Path | Notes |
+|---|---|---|
+| GET | `/health` | Returns `{ service, status, provider, model }` |
+| POST | `/complete` | Body: `{ prompt, model?, temperature?, maxTokens?, topP?, topK?, repeatPenalty? }` |
+| POST | `/complete/stream` | Same body as `/complete`; responds with the SSE format above |
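
Note on the SSE format documented in CLAUDE.md: a caller has to separate per-token events, the final metadata event, and the `[DONE]` sentinel. The following is a minimal sketch of that parsing over an already-buffered response body. `parseSseBody`, the sample payload, and the model name `example-model` are hypothetical and exist only for illustration; a real client would read the stream incrementally.

```javascript
// Hypothetical helper (not part of the service): parse a buffered SSE body
// in the shape shown under "SSE Format (route → caller)".
function parseSseBody(body) {
  let text = '';
  let meta = null;
  let finished = false;

  for (const line of body.split('\n')) {
    if (!line.startsWith('data: ')) continue;   // skip blank separator lines
    const payload = line.slice('data: '.length);

    if (payload === '[DONE]') {                 // sentinel: stop reading
      finished = true;
      break;
    }

    const event = JSON.parse(payload);
    if (event.done) {
      // Final metadata event
      meta = { model: event.model, tokenCount: event.tokenCount };
    } else if (typeof event.response === 'string') {
      // Per-token event: append the text fragment
      text += event.response;
    }
  }

  return { text, meta, finished };
}

// Sample body mirroring the documented format (model name is made up).
const sampleSseBody = [
  'data: {"response":"Hello"}',
  '',
  'data: {"response":" world"}',
  '',
  'data: {"done":true,"model":"example-model","tokenCount":42}',
  '',
  'data: [DONE]',
  '',
].join('\n');

console.log(parseSseBody(sampleSseBody));
```

Treating `[DONE]` as a hard stop keeps the parser from trying to `JSON.parse` the sentinel, which is not valid JSON.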
--- a/packages/inference-service/src/index.js
+++ b/packages/inference-service/src/index.js
@@ -4,7 +4,7 @@ const {getEnv, PORTS, OLLAMA, logger} = require('@nexusai/shared');
 const inferenceRouter = require('./routes/inference');

 const app = express();
-app.use(express.json());
+app.use(express.json({ limit: '8mb' }));  // prompts include full context window

 const PORT      = getEnv('PORT', PORTS.INFERENCE);
 const PROVIDER  = getEnv('INFERENCE_PROVIDER',   'ollama');
--- a/packages/inference-service/src/providers/ollama.js
+++ b/packages/inference-service/src/providers/ollama.js
@@ -57,7 +57,16 @@ async function* completeStream(prompt, options = {} ) {
    });

    for await (const chunk of stream) {
-        yield chunk;
+        if (chunk.done) {
+            yield {
+                response:   '',
+                done:       true,
+                model:      chunk.model,
+                tokenCount: (chunk.eval_count ?? 0) + (chunk.prompt_eval_count ?? 0),
+            };
+        } else {
+            yield chunk;
+        }
    }
 }

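
The done-chunk mapping added in this hunk can be exercised in isolation. The sketch below lifts it into a pure function so the pass-through and fallback behaviour are easy to check; the name `normaliseOllamaChunk` and the sample chunks are hypothetical and not taken from the repository.

```javascript
// Hypothetical stand-alone version of the mapping added to completeStream.
// The function name and the sample values are illustrative only.
function normaliseOllamaChunk(chunk) {
  if (!chunk.done) return chunk;   // per-token chunks pass through unchanged

  return {
    response: '',
    done: true,
    model: chunk.model,
    // ?? rather than || so only null/undefined fall back to 0
    tokenCount: (chunk.eval_count ?? 0) + (chunk.prompt_eval_count ?? 0),
  };
}

const tokenChunk = { model: 'example-model', response: 'Hi', done: false };
const doneChunk = {
  model: 'example-model',
  response: '',
  done: true,
  eval_count: 12,
  prompt_eval_count: 30,
};
const sparseDoneChunk = { model: 'example-model', response: '', done: true };

console.log(normaliseOllamaChunk(tokenChunk) === tokenChunk);   // true
console.log(normaliseOllamaChunk(doneChunk).tokenCount);        // 42
console.log(normaliseOllamaChunk(sparseDoneChunk).tokenCount);  // 0
```

With this shape, the route can read `chunk.tokenCount` without knowing which provider produced the chunk.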
--- a/packages/inference-service/src/routes/inference.js
+++ b/packages/inference-service/src/routes/inference.js
@@ -23,7 +23,7 @@ router.post('/complete', async (req, res) => {

 // Streaming completion endpoint - sends partial responses as they arrive
 router.post('/complete/stream', async (req, res) => {
-    const { prompt, model, temperature, topP, topK, repeatPenalty } = req.body;
+    const { prompt, model, temperature, maxTokens, topP, topK, repeatPenalty } = req.body;

    if (!prompt) return res.status(400).json({ error: 'prompt is required' });

@@ -35,7 +35,7 @@ router.post('/complete/stream', async (req, res) => {
        let lastModel = model;
        let tokenCount = 0;

-        for await (const chunk of completeStream(prompt, { model, temperature, topP, topK, repeatPenalty })) {
+        for await (const chunk of completeStream(prompt, { model, temperature, maxTokens, topP, topK, repeatPenalty })) {
            if (chunk.response) {
                res.write(`data: ${JSON.stringify({ response: chunk.response })}\n\n`);
            }
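
The rest of the streaming loop is outside this hunk, so the following is only a sketch of how provider chunks might become the SSE frames documented in CLAUDE.md. It is a synchronous stand-in: `toSseFrames` is a hypothetical name, an array replaces the async generator returned by `completeStream`, and the handling of the done chunk is an assumption based on the documented final metadata event.

```javascript
// Hypothetical, synchronous stand-in for the streaming route's loop.
// `chunks` replaces the async generator from completeStream; the returned
// array replaces successive res.write() calls.
function toSseFrames(chunks, requestedModel) {
  const frames = [];
  let lastModel = requestedModel;
  let tokenCount = 0;

  for (const chunk of chunks) {
    if (chunk.response) {
      // Same frame shape as the res.write() call in the route
      frames.push(`data: ${JSON.stringify({ response: chunk.response })}\n\n`);
    }
    if (chunk.done) {
      // Assumption: metadata is taken from the provider's final chunk
      lastModel = chunk.model ?? lastModel;
      tokenCount = chunk.tokenCount ?? 0;
    }
  }

  // Final metadata event, then the sentinel, per the documented SSE format
  frames.push(`data: ${JSON.stringify({ done: true, model: lastModel, tokenCount })}\n\n`);
  frames.push('data: [DONE]\n\n');
  return frames;
}

const sampleChunks = [
  { response: 'Hello', done: false },
  { response: ' world', done: false },
  { response: '', done: true, model: 'example-model', tokenCount: 42 },
];

const sampleFrames = toSseFrames(sampleChunks, undefined);
console.log(sampleFrames.join(''));
```

The empty `response` on the final chunk is falsy, so it produces no text frame and only contributes metadata.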