nexusAI/packages/orchestration-service/CLAUDE.md
Storme-bit 5ad01c6ad8 clean up
2026-04-27 00:14:51 -07:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
See the root [CLAUDE.md](../../CLAUDE.md) for overall architecture, service roles, and the end-to-end chat flow.
## Running This Service
```bash
npm run orchestration # From repo root (node src/index.js)
npm -w packages/orchestration-service run dev # With --watch
```
Default port: **4000**. Depends on memory-service, embedding-service, inference-service, and Qdrant.
## Context Assembly (`src/chat/index.js`)
`assembleContext(externalId, userMessage)` is the core function that builds the inference prompt. Order of operations:
1. Resolve session by `externalId` (creates it if missing — every chat call is self-healing).
2. If session has a `project_id`, load the project and fetch all sibling sessions (via `getProjectSessions`, hardcoded `limit=200`).
3. Fetch `recentEpisodeLimit` recent episodes from memory-service.
4. Embed the user message; search Qdrant EPISODES with `scoreThreshold`:
   - No project: `must: [sessionId == this session]`
   - Project: `should: [sessionId == s1, sessionId == s2, ...]` across all project sessions
   - Deduplicate hits against recent episode IDs before including them.
5. Embed and search Qdrant ENTITIES; filter by `projectId` if applicable.
6. Build prompt in this fixed order: **system prompt → entities → semantic episodes → recent episodes → user message → "Assistant:"**
The ordering prioritizes established facts (entities) and relevant past context (semantic) over pure recency.
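The filter selection, deduplication, and prompt ordering above can be sketched as follows. This is a minimal illustration, not the code in `src/chat/index.js`: the helper names, the plain-string shape of entities and episodes, and the `User:` prefix are all assumptions. The filter objects use Qdrant's standard `must` / `should` condition syntax.

```javascript
// Hypothetical helpers; names and data shapes are illustrative only.

// Episode search filter: a lone session uses `must`,
// a project uses `should` across all of its sessions.
function buildEpisodeFilter(sessionId, projectSessionIds = []) {
  const bySession = (id) => ({ key: 'sessionId', match: { value: id } });
  if (projectSessionIds.length === 0) {
    return { must: [bySession(sessionId)] };
  }
  return { should: projectSessionIds.map(bySession) };
}

// Drop semantic hits that already appear among the recent episodes.
function dedupeAgainstRecent(semanticHits, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((e) => e.id));
  return semanticHits.filter((hit) => !recentIds.has(hit.id));
}

// Assemble the prompt in the fixed order:
// system prompt, entities, semantic episodes, recent episodes, user message, "Assistant:".
function buildPrompt({ systemPrompt, entities, semantic, recent, userMessage }) {
  return [
    systemPrompt,
    ...entities,
    ...semantic,
    ...recent,
    `User: ${userMessage}`, // the "User:" prefix is an assumption
    'Assistant:',
  ].join('\n');
}
```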
## System Prompt Resolution
Priority from highest to lowest:
1. `project.system_prompt` (stored on the project row in memory-service)
2. `settings.systemPrompt` (saved in `data/settings.json`)
3. `ORCHESTRATION.SYSTEM_PROMPT` (shared constants fallback)
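A minimal sketch of this fallback chain, assuming that `null` or `undefined` means "not set" (how an empty string is treated is not stated here, so this sketch lets it through). The function name and argument shapes are hypothetical.

```javascript
// Hypothetical resolver: project prompt, then saved settings, then the shared default.
function resolveSystemPrompt(project, settings, defaultPrompt) {
  return project?.system_prompt ?? settings?.systemPrompt ?? defaultPrompt;
}
```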
## Settings (`src/config/settings.js`)
Settings are loaded from `data/settings.json` and merged with defaults on every `GET /settings` call. `PATCH /settings` validates each field individually against these constraints:
| Field | Constraint |
|---|---|
| `recentEpisodeLimit` | integer, 1–20 |
| `semanticLimit` | integer, 1–20 |
| `scoreThreshold` | number, 0–1 |
| `temperature` | number, 0–2 |
| `repeatPenalty` | number, 1–2 |
| `topP` | number, 0–1 |
| `topK` | integer, 1–100 |
| `modelsFolderPath` | path must exist and be readable |
| `systemPrompt` | string (trimmed); `null` reverts to shared default |
`data/settings.json` is created on first save. Parent directories are created if missing.
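Per-field validation of a `PATCH /settings` body could look roughly like the sketch below. It assumes the numeric bounds read as 1 to 20, 0 to 1, 0 to 2, 1 to 2, and 1 to 100; the function name, the rule table, and the `{ clean, errors }` return shape are hypothetical. `modelsFolderPath` is left out because checking it requires filesystem access.

```javascript
// Bounds are an assumption based on the constraint table; verify against src/config/settings.js.
const NUMERIC_RULES = {
  recentEpisodeLimit: { integer: true, min: 1, max: 20 },
  semanticLimit: { integer: true, min: 1, max: 20 },
  scoreThreshold: { integer: false, min: 0, max: 1 },
  temperature: { integer: false, min: 0, max: 2 },
  repeatPenalty: { integer: false, min: 1, max: 2 },
  topP: { integer: false, min: 0, max: 1 },
  topK: { integer: true, min: 1, max: 100 },
};

// Hypothetical validator: returns accepted values and per-field error messages.
function validateSettingsPatch(patch) {
  const clean = {};
  const errors = {};
  for (const [field, value] of Object.entries(patch)) {
    const rule = NUMERIC_RULES[field];
    if (rule) {
      const isNumber = typeof value === 'number' && Number.isFinite(value);
      const isIntOk = !rule.integer || Number.isInteger(value);
      if (isNumber && isIntOk && value >= rule.min && value <= rule.max) {
        clean[field] = value;
      } else {
        const kind = rule.integer ? 'an integer' : 'a number';
        errors[field] = `must be ${kind} between ${rule.min} and ${rule.max}`;
      }
    } else if (field === 'systemPrompt') {
      if (value === null) clean[field] = null; // null reverts to the shared default
      else if (typeof value === 'string') clean[field] = value.trim();
      else errors[field] = 'must be a string or null';
    } else {
      errors[field] = 'not handled in this sketch';
    }
  }
  return { clean, errors };
}
```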
## Streaming SSE (`src/chat/index.js` — `chatStream`)
The route sets SSE headers and delegates to `chatStream`, which:
1. Calls `inference.completeStream()` → receives a raw HTTP Response with a readable body.
2. Reads the body in chunks, buffers across chunk boundaries, splits on `\n\n`.
3. For each event line starting with `data: `, parses the JSON and calls `onChunk(data.response)`.
4. The `[DONE]` sentinel (used by some llama-server versions) is explicitly ignored.
5. After stream ends, saves the assembled full response as an episode (same as non-streaming).
If a chunk fails to parse, the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved and a warning is logged.
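The buffering and parsing steps above can be sketched as a small stateful parser. This is an illustration under assumptions, not the actual `chatStream` code: `createSseParser`, `feed`, and `text` are hypothetical names, and the caller is assumed to decode each body chunk to a string before feeding it in.

```javascript
// Hypothetical SSE parser: buffers across chunk boundaries and splits events on "\n\n".
function createSseParser(onChunk, log = console) {
  let buffer = '';
  let fullText = '';
  return {
    feed(text) {
      buffer += text;
      const events = buffer.split('\n\n');
      buffer = events.pop(); // the last piece may be incomplete; keep it for the next chunk
      for (const event of events) {
        for (const line of event.split('\n')) {
          if (!line.startsWith('data: ')) continue;
          const payload = line.slice('data: '.length).trim();
          if (payload === '[DONE]') continue; // sentinel from some llama-server versions
          let data;
          try {
            data = JSON.parse(payload);
          } catch (err) {
            log.warn(`Skipping unparseable SSE chunk: ${err.message}`);
            continue; // a bad chunk does not end the stream
          }
          if (typeof data?.response === 'string') {
            fullText += data.response;
            onChunk(data.response);
          }
        }
      }
    },
    // Accumulated response, e.g. for saving as an episode once the stream ends.
    text: () => fullText,
  };
}
```

After the stream ends, an empty `text()` would correspond to the "no text accumulated" case in which the episode is not saved.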
## Fire-and-Forget Tasks
After every successful chat turn:
- **Summarization** (`services/summarization.js` → `triggerSummary`): checks token threshold → recency guard → calls Ollama → POSTs to memory-service. Only runs if `SUMMARIES.THRESHOLD_TOKENS` is exceeded AND at least `SUMMARIES.MIN_EPISODES_SINCE` new episodes have occurred since the last summary.
- **Auto-naming** (`chat/index.js` → `autoNameSession`): only fires on the first message of a session. Uses temp 0.3, `maxTokens=20`, prompts for a ≤5-word title.
Both tasks catch all errors and log warnings without surfacing to the client.
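One common way to get this behavior is a small wrapper that starts the task, never awaits it in the request path, and turns any failure into a logged warning. The sketch below is an assumption about the pattern, not the service's actual code; `fireAndForget` and its parameters are hypothetical.

```javascript
// Hypothetical wrapper: run a background task and swallow its errors as warnings.
function fireAndForget(label, task, log = console) {
  // Promise.resolve().then(task) also captures synchronous throws from task().
  return Promise.resolve()
    .then(task)
    .catch((err) => {
      log.warn(`[${label}] background task failed: ${err?.message ?? err}`);
    });
}
```

A chat handler would call it without `await`, for example `fireAndForget('summary', () => triggerSummary(sessionId))`, so the client response is never delayed or failed by the background work. The argument passed to `triggerSummary` here is illustrative.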
## Summarization Recency Guard
`src/services/summarization.js` reads the `episode_range` field of the latest existing summary (format: `"<startId>-<endId>"`). It counts SQLite episodes with `id > endId`; if fewer than `SUMMARIES.MIN_EPISODES_SINCE`, it skips. This prevents rapid re-summarization on high-traffic sessions.
When the existing summary's token count exceeds `SUMMARIES.MAX_SUMMARY_TOKENS`, it is treated as "expired" — a fresh summary is generated instead of an incremental update.
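The guard and expiry rules can be expressed as a small decision function. This sketch assumes numeric episode IDs, a hypothetical `tokenCount` field on the summary, and that a session with no prior summary gets a fresh one; the order of the two checks is also an assumption. The token-threshold check that runs earlier in `triggerSummary` is not modeled.

```javascript
// Parse the end ID from an episode_range string such as "10-25".
function parseEndId(episodeRange) {
  const match = /^(\d+)-(\d+)$/.exec(episodeRange ?? '');
  return match ? Number(match[2]) : null;
}

// Hypothetical decision helper: returns 'skip', 'incremental', or 'fresh'.
function decideSummaryAction({ latestSummary, episodeIds, minEpisodesSince, maxSummaryTokens }) {
  if (!latestSummary) return 'fresh'; // assumption: nothing to update yet
  const endId = parseEndId(latestSummary.episode_range);
  const newer = endId === null
    ? episodeIds.length
    : episodeIds.filter((id) => id > endId).length;
  if (newer < minEpisodesSince) return 'skip'; // recency guard
  // An oversized summary is treated as expired and regenerated from scratch.
  return latestSummary.tokenCount > maxSummaryTokens ? 'fresh' : 'incremental';
}
```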
## Qdrant Calls (Direct, Not Via Memory-Service)
`src/services/qdrant.js` makes REST calls to Qdrant directly at `QDRANT_URL`. This bypasses memory-service for semantic search performance. Orchestration fetches episode/entity content from memory-service by ID *after* getting vector search results from Qdrant.
`searchEntities` checks `projectId !== null && projectId !== undefined` before applying the filter — a session with no project skips the filter entirely and searches globally.
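The explicit null/undefined check matters because a plain truthiness test would also drop a falsy but valid ID such as `0`. A minimal sketch of that filter construction, with a hypothetical helper name and Qdrant's standard `must` condition syntax:

```javascript
// Hypothetical helper: only build a filter when a project ID is actually present.
function buildEntityFilter(projectId) {
  if (projectId !== null && projectId !== undefined) {
    return { must: [{ key: 'projectId', match: { value: projectId } }] };
  }
  return undefined; // no project: search entities globally
}
```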
## Models Endpoint
`GET /models` scans `modelsFolderPath` for `.gguf` files and optionally reads a `models.json` manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.
`GET /models/props` proxies `/props` from llama-server and returns `{contextWindow, modelAlias}`. Returns 503 if llama-server is unreachable.
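The mapping from scanned files to the `GET /models` response can be sketched as a pure function. Everything about the shapes here is an assumption: the input list of `{ name, sizeBytes }`, the output field names, the manifest fields `label` and `description`, and the choice of 1024³ bytes per GB are illustrative, and the directory read itself is left to the caller.

```javascript
// Hypothetical mapper: keep .gguf files and enrich them from an optional manifest.
function describeModels(files, manifest = {}) {
  return files
    .filter((file) => file.name.endsWith('.gguf'))
    .map((file) => ({
      filename: file.name,
      label: manifest[file.name]?.label ?? file.name, // manifest is keyed by filename
      description: manifest[file.name]?.description ?? null,
      sizeGB: Number((file.sizeBytes / 1024 ** 3).toFixed(2)), // GB definition is an assumption
    }));
}
```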
## Health Check
`GET /health/services` runs parallel fetch calls to all four dependent services with a 3-second `AbortSignal.timeout` each. Results are returned as an array — the endpoint never returns a non-2xx itself regardless of downstream status.
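A minimal sketch of this pattern uses `Promise.allSettled` so that one failing dependency never rejects the whole check. The function name, the injectable `fetchImpl` parameter, and the `{ name, ok, error }` result shape are assumptions for illustration; the default relies on the global `fetch` and `AbortSignal.timeout` available in recent Node versions.

```javascript
// Hypothetical health probe: query every service in parallel with a per-request timeout.
async function checkServices(services, fetchImpl = fetch, timeoutMs = 3000) {
  const results = await Promise.allSettled(
    services.map(({ url }) => fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) })),
  );
  // allSettled never rejects, so the caller can always answer with a 2xx and this array.
  return results.map((result, i) => ({
    name: services[i].name,
    ok: result.status === 'fulfilled' && Boolean(result.value.ok),
    error: result.status === 'rejected' ? String(result.reason?.message ?? result.reason) : null,
  }));
}
```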
## Background Model (qwen2.5:3b)
Used for entity extraction and summarization via Ollama on Mini PC 1. Uses **ChatML
format** (`<|im_start|>` / `<|im_end|>`) — not Phi3 format. Use `format: 'json'`
only for structured extraction, never for free-text summarization.
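For reference, a ChatML-formatted prompt and a matching Ollama request body might be built as sketched below. The helper names are hypothetical, and sending a pre-templated prompt with `raw: true` to Ollama's generate endpoint is an assumption about how the service talks to Ollama, not something stated here.

```javascript
// Wrap a system and user message in ChatML markers, ending with an open assistant turn.
function toChatML(system, user) {
  return (
    `<|im_start|>system\n${system}<|im_end|>\n` +
    `<|im_start|>user\n${user}<|im_end|>\n` +
    '<|im_start|>assistant\n'
  );
}

// Hypothetical request builder for the background model.
function buildOllamaRequest({ system, user, structured = false }) {
  const body = {
    model: 'qwen2.5:3b',
    prompt: toChatML(system, user),
    raw: true, // assumption: the prompt is already templated, so skip Ollama's template
    stream: false,
  };
  if (structured) body.format = 'json'; // structured extraction only, never free-text summaries
  return body;
}
```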
## API Endpoints Quick Reference
| Method | Path | Notes |
|---|---|---|
| GET | `/health` | Returns service URLs |
| GET | `/health/services` | Parallel status of all dependencies |
| POST | `/chat` | Blocking completion |
| POST | `/chat/stream` | SSE streaming |
| GET/PATCH | `/settings` | Persistent settings |
| GET | `/models` | `.gguf` file scan |
| GET | `/models/props` | llama-server model info |
| GET | `/sessions` | Delegates to memory-service |
| GET | `/sessions/:sessionId/history` | Paginated episodes by external ID |
| PATCH | `/sessions/:sessionId` | `name` and/or `projectId` |
| DELETE | `/sessions/:sessionId` | |
| GET | `/episodes` | Delegates; supports `q` for FTS |
| DELETE | `/episodes/:id` | Delegates |
| GET/POST/PATCH/DELETE | `/projects` and `/projects/:id` | Delegates |
| POST | `/summaries/project/:projectId/generate` | On-demand; 422 if no data |
| GET | `/summaries/project/:projectId/overview` | |
| GET | `/summaries/session/:sessionId` | Resolves external ID first |
| GET | `/summaries/project/:projectId` | |