nexusAI/packages/orchestration-service/CLAUDE.md
2026-04-27 07:03:46 -07:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
See the root [CLAUDE.md](../../CLAUDE.md) for overall architecture, service roles, and the end-to-end chat flow.
## Running This Service
```bash
npm run orchestration # From repo root (node src/index.js)
npm -w packages/orchestration-service run dev # With --watch
```
Default port: **4000**. Depends on memory-service, embedding-service, inference-service, and Qdrant.
## Context Assembly (`src/chat/index.js`)
`assembleContext(externalId, userMessage)` is the core function that builds the inference prompt. Order of operations:
1. Resolve session by `externalId` (creates it if missing — every chat call is self-healing).
2. If session has a `project_id`, load the project and fetch all sibling sessions (via `getProjectSessions`, hardcoded `limit=200`).
3. Fetch `recentEpisodeLimit` recent episodes from memory-service.
4. Embed the user message; search Qdrant EPISODES with `scoreThreshold`:
   - No project: `must: [sessionId == this session]`
   - Project: `should: [sessionId == s1, sessionId == s2, ...]` across all project sessions
   - Dedup against recent episode IDs before including.
5. Run **fused episode retrieval** via `getFusedEpisodes` — Qdrant semantic search and FTS5 keyword search run in parallel, both filtered against `recentIds`, then merged via Reciprocal Rank Fusion (RRF). If `keywordWeight` is `0`, the FTS call is skipped. Returns top `semanticLimit` episodes by fused score.
6. Embed and search Qdrant ENTITIES (filtered by `projectId` if in a project). Returns entity IDs alongside payload — the Qdrant point ID equals the SQLite entity ID.
7. Expand matched entities into a 1-hop graph neighborhood via `POST /graph/neighbors` on the memory-service. Returns `{ nodes, edges }` — the full entity objects plus connecting relationships. Falls back to flat entity list (no edges) if the graph call fails.
8. Build prompt in this fixed order: **system prompt → graph context → fused episodes → recent episodes → user message → "Assistant:"**
The ordering prioritizes established facts (graph context) and relevant past context (semantic) over pure recency.
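
The fixed ordering in step 8 can be sketched as a pure function. This is a minimal illustration, not the code in `src/chat/index.js`: the `buildPrompt` name, the shape of its argument, the blank-line separators, and the `User:` label are all assumptions.

```javascript
// Hypothetical sketch of the step-8 ordering; names, labels, and separators are assumed.
function buildPrompt({ systemPrompt, graphContext, fusedEpisodes, recentEpisodes, userMessage }) {
  const sections = [
    systemPrompt,
    graphContext,
    fusedEpisodes.join("\n"),
    recentEpisodes.join("\n"),
    `User: ${userMessage}`,
    "Assistant:",
  ];
  // Drop empty sections so a session without graph or episode context still yields a clean prompt.
  return sections.filter((s) => s && s.trim().length > 0).join("\n\n");
}
```

Keeping the order in a single array makes it easy to see, and to test, that graph context always precedes fused episodes, which in turn precede recent episodes.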
## Graph Context Format
`formatGraphContext(nodes, edges)` in `src/chat/index.js` formats the neighborhood as:
```
- Alice (person): software engineer working on NexusAI
  → works_on NexusAI (project)
  → knows Bob (person)
- NexusAI (project): AI assistant framework
- Bob (person): Alice's colleague
```
Each node shows its notes on the first line. Outbound edges are indented below with `→ label target (type)`. Nodes with only inbound edges (neighbors pulled in by traversal) appear without connection lines.
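
A formatter with this behavior might look like the sketch below. The node fields (`id`, `name`, `type`, `notes`) and edge fields (`fromId`, `toId`, `label`) are assumptions for illustration; the real payload from `/graph/neighbors` may use different names.

```javascript
// Hypothetical sketch; assumes nodes are { id, name, type, notes }
// and edges are { fromId, toId, label }. The real field names may differ.
function formatGraphContext(nodes, edges) {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const lines = [];
  for (const node of nodes) {
    lines.push(`- ${node.name} (${node.type}): ${node.notes}`);
    for (const edge of edges) {
      const target = byId.get(edge.toId);
      // Only outbound edges are listed; skip edges whose target is outside the neighborhood.
      if (edge.fromId !== node.id || !target) continue;
      lines.push(`  → ${edge.label} ${target.name} (${target.type})`);
    }
  }
  return lines.join("\n");
}
```

Nodes that only receive edges fall through the inner loop without adding lines, which gives the "no connection lines" behavior described above.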
## System Prompt Resolution
Priority from highest to lowest:
1. `project.system_prompt` (stored on the project row in memory-service)
2. `settings.systemPrompt` (saved in `data/settings.json`)
3. `ORCHESTRATION.SYSTEM_PROMPT` (shared constants fallback)
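
The priority chain can be expressed with nullish coalescing. This is a sketch under assumptions: the `resolveSystemPrompt` helper and its argument shapes are hypothetical, and only the field names come from the list above.

```javascript
// Hypothetical helper; argument shapes are assumed from the field names in the list above.
function resolveSystemPrompt(project, settings, sharedDefault) {
  // `??` falls through on null/undefined only, so an unset level defers to the next one.
  return project?.system_prompt ?? settings?.systemPrompt ?? sharedDefault;
}
```

For example, `resolveSystemPrompt(null, { systemPrompt: "S" }, "D")` returns `"S"`, because a session without a project has no project-level prompt to apply.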
## Settings (`src/config/settings.js`)
Settings are loaded from `data/settings.json` merged with defaults at every `GET /settings` call. `PATCH /settings` validates each field individually with specific constraints:
| Field | Constraint |
|---|---|
| `recentEpisodeLimit` | integer, 1–20 |
| `semanticLimit` | integer, 1–20 |
| `scoreThreshold` | number, 0–1 |
| `temperature` | number, 0–2 |
| `repeatPenalty` | number, 1–2 |
| `topP` | number, 0–1 |
| `topK` | integer, 1–100 |
| `modelsFolderPath` | path must exist and be readable |
| `systemPrompt` | string (trimmed); `null` reverts to shared default |
`data/settings.json` is created on first save. Parent directories are created if missing.
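
Per-field validation of this kind could be sketched as a small rule table plus one loop. Everything here is illustrative: `RULES`, `validatePatch`, the error strings, and the bounds (read as inclusive ranges, for three of the fields only) are assumptions rather than the contents of `src/config/settings.js`.

```javascript
// Hypothetical rule table; the bounds are illustrative assumptions, treated as inclusive.
const RULES = {
  temperature: { min: 0, max: 2, integer: false },
  topP: { min: 0, max: 1, integer: false },
  topK: { min: 1, max: 100, integer: true },
};

// Validates each patched field on its own and collects per-field errors.
function validatePatch(patch) {
  const accepted = {};
  const errors = {};
  for (const [field, value] of Object.entries(patch)) {
    const rule = RULES[field];
    if (!rule) {
      errors[field] = "unknown field";
    } else if (typeof value !== "number" || !Number.isFinite(value)) {
      errors[field] = "must be a number";
    } else if (rule.integer && !Number.isInteger(value)) {
      errors[field] = "must be an integer";
    } else if (value < rule.min || value > rule.max) {
      errors[field] = `must be between ${rule.min} and ${rule.max}`;
    } else {
      accepted[field] = value;
    }
  }
  return { accepted, errors };
}
```

Returning both `accepted` and `errors` lets the route decide whether to reject the whole request or apply only the valid fields; which of the two the service does is not stated here.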
## Streaming SSE (`src/chat/index.js` — `chatStream`)
The route sets SSE headers and delegates to `chatStream`, which:
1. Calls `inference.completeStream()` → receives a raw HTTP Response with a readable body.
2. Reads the body in chunks, buffers across chunk boundaries, splits on `\n\n`.
3. For each event line starting with `data: `, parses the JSON and calls `onChunk(data.response)`.
4. The `[DONE]` sentinel (used by some llama-server versions) is explicitly ignored.
5. After stream ends, saves the assembled full response as an episode (same as non-streaming).
If a chunk parse fails, the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved and a warning is logged.
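
The buffering and parsing steps can be sketched as a small stateful parser. This is an illustration only: `createSseParser` is a hypothetical name, and it takes already-decoded text, whereas the real code reads from the response body.

```javascript
// Hypothetical sketch of the buffering logic; `createSseParser` is not a name from the codebase.
function createSseParser(onChunk) {
  let buffer = "";
  return {
    // Feed decoded text from the response body; an event may span several chunks.
    push(text) {
      buffer += text;
      const events = buffer.split("\n\n");
      buffer = events.pop(); // keep the trailing partial event for the next chunk
      for (const event of events) {
        for (const line of event.split("\n")) {
          if (!line.startsWith("data: ")) continue;
          const payload = line.slice("data: ".length);
          if (payload === "[DONE]") continue; // sentinel, ignored
          try {
            const data = JSON.parse(payload);
            if (typeof data.response === "string") onChunk(data.response);
          } catch (err) {
            console.warn("SSE chunk parse failed:", err.message); // log and keep streaming
          }
        }
      }
    },
  };
}
```

When reading from a `ReadableStream`, decoding each chunk with `new TextDecoder().decode(chunk, { stream: true })` before calling `push` avoids splitting multi-byte characters across chunk boundaries.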
## Fire-and-Forget Tasks
After every successful chat turn:
- **Summarization** (`triggerSummary` in `services/summarization.js`): checks token threshold → recency guard → calls Ollama → POSTs to memory-service. Only runs if `SUMMARIES.THRESHOLD_TOKENS` is exceeded AND at least `SUMMARIES.MIN_EPISODES_SINCE` new episodes have occurred since the last summary.
- **Auto-naming** (`autoNameSession` in `chat/index.js`): only fires on the first message of a session. Uses temperature 0.3 and `maxTokens=20`, and prompts for a ≤5-word title.
Both tasks catch all errors and log warnings without surfacing to the client.
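
One way to get this "catch everything, warn, never surface" behavior is a small wrapper around each task. The `fireAndForget` helper and its `logger` parameter are hypothetical; the real tasks may handle errors inline instead.

```javascript
// Hypothetical wrapper; the task functions and the logger are stand-ins.
function fireAndForget(label, task, logger = console) {
  // Run the task off the request path; any failure becomes a warning.
  // The returned promise never rejects, so callers can safely ignore it.
  return Promise.resolve()
    .then(task)
    .catch((err) => {
      logger.warn(`[${label}] background task failed: ${err.message}`);
    });
}
```

A route handler would call `fireAndForget("summary", () => triggerSummary(session))` without `await`, so the chat response is sent before the task finishes. The `triggerSummary(session)` call signature is assumed.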
## Summarization Recency Guard
`src/services/summarization.js` reads the `episode_range` field of the latest existing summary (format: `"<startId>-<endId>"`). It counts SQLite episodes with `id > endId`; if fewer than `SUMMARIES.MIN_EPISODES_SINCE`, it skips. This prevents rapid re-summarization on high-traffic sessions.
When the existing summary's token count exceeds `SUMMARIES.MAX_SUMMARY_TOKENS`, it is treated as "expired" — a fresh summary is generated instead of an incremental update.
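
The guard and the expiry rule can be sketched as one decision function. This is a simplified illustration: `decideSummaryAction`, the `tokenCount` field, and passing episode IDs as an array (instead of counting rows in SQLite) are assumptions, and the earlier token-threshold check is left out.

```javascript
// Hypothetical sketch; thresholds are passed in instead of being read from shared constants.
function parseEpisodeRange(range) {
  const match = /^(\d+)-(\d+)$/.exec(range ?? "");
  return match ? { startId: Number(match[1]), endId: Number(match[2]) } : null;
}

function decideSummaryAction({ latestSummary, episodeIds, minEpisodesSince, maxSummaryTokens }) {
  if (!latestSummary) return "fresh"; // nothing to build on
  const range = parseEpisodeRange(latestSummary.episode_range);
  if (!range) return "fresh"; // assumption: an unreadable range is treated like no summary
  const newer = episodeIds.filter((id) => id > range.endId).length;
  if (newer < minEpisodesSince) return "skip"; // recency guard
  // An oversized summary is "expired": regenerate instead of updating incrementally.
  return latestSummary.tokenCount > maxSummaryTokens ? "fresh" : "incremental";
}
```

With `episode_range` of `"1-10"` and `minEpisodesSince` of 5, episodes 11 and 12 alone result in `"skip"`, while episodes 11 through 15 allow a new summary.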
## Qdrant Calls (Direct, Not Via Memory-Service)
`src/services/qdrant.js` makes REST calls to Qdrant directly at `QDRANT_URL`, bypassing memory-service to keep semantic search fast. Orchestration fetches episode/entity content from memory-service by ID *after* getting vector search results from Qdrant.
`searchEntities` checks `projectId !== null && projectId !== undefined` before applying the filter — a session with no project skips the filter entirely and searches globally.
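
The scoping rules translate into Qdrant filter objects along these lines. The builder functions are hypothetical; the `must`/`should` clauses with `key` and `match.value` follow Qdrant's filter syntax, and the payload keys `sessionId` and `projectId` are taken from the descriptions in this file.

```javascript
// Hypothetical filter builders; payload keys follow the descriptions in this file.
function buildEpisodeFilter(sessionId, projectSessionIds) {
  const match = (id) => ({ key: "sessionId", match: { value: id } });
  if (projectSessionIds && projectSessionIds.length > 0) {
    return { should: projectSessionIds.map(match) }; // any session in the project
  }
  return { must: [match(sessionId)] }; // only this session
}

function buildEntityFilter(projectId) {
  // No project means no filter at all, so the search is global.
  if (projectId === null || projectId === undefined) return undefined;
  return { must: [{ key: "projectId", match: { value: projectId } }] };
}
```

The explicit `null`/`undefined` check matters because a truthiness test would also drop a legitimate project ID of `0`.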
## Retrieval Fusion (`src/chat/index.js`)
Three functions handle fusion — all pure or lightly async, all non-critical:
- **`getFTSResults(userMessage, { limit, sessionIds })`** — calls `memory.searchEpisodes`; returns `[]` and logs a warning on failure
- **`fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit })`** — pure RRF implementation. Key guard: FTS-only episodes are only added to the scores Map if `contrib > 0` (prevents score-0 bleed-through when `keywordWeight: 0`)
- **`getFusedEpisodes(userMessage, session, recentIds, projectSessionIds, settings)`** — orchestrates both paths in `Promise.all`, applies `recentIds` filter to FTS results, calls fusion. Short-circuits FTS call entirely if `keywordWeight === 0`
FTS is scoped to `projectSessionIds` if in a project, otherwise `[session.id]` — mirrors Qdrant scoping exactly.
> For RRF formula, weight semantics, and enabling keyword search, see `docs/services/retrieval-fusion.md`.
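
For orientation, a weighted RRF merge with the `contrib > 0` guard might look like the sketch below. It is not the project's implementation: the smoothing constant `RRF_K = 60`, the `weight / (k + rank)` form, and the `fusedScore` field are assumptions, and the authoritative formula is the one in `docs/services/retrieval-fusion.md`.

```javascript
// Hypothetical RRF sketch; episodes are assumed to be { id, ... } lists ordered best-first.
const RRF_K = 60; // conventional smoothing constant, assumed here

function fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit }) {
  const scores = new Map(); // id -> { episode, score }
  semanticEps.forEach((episode, i) => {
    scores.set(episode.id, { episode, score: semanticWeight / (RRF_K + i + 1) }); // rank is 1-based
  });
  keywordEps.forEach((episode, i) => {
    const contrib = keywordWeight / (RRF_K + i + 1);
    const existing = scores.get(episode.id);
    if (existing) existing.score += contrib;
    // Guard: a keyword-only hit is added only if it actually contributes score.
    else if (contrib > 0) scores.set(episode.id, { episode, score: contrib });
  });
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(({ episode, score }) => ({ ...episode, fusedScore: score }));
}
```

An episode found by both searches accumulates two contributions and outranks one found by only a single search at a similar position, which is the main effect fusion is meant to produce.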
## Graph Service Client (`src/services/graph.js`)
Thin HTTP client for memory-service graph endpoints. One function:
- **`getNeighbors(entityIds[])`** — POSTs to `memory-service/graph/neighbors` with the entity IDs from Qdrant entity search. Returns `{ nodes, edges }`. Throws on non-2xx — caller wraps in try/catch with graceful fallback.
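
The client and its caller-side fallback can be sketched as follows. The `baseUrl` option, the injectable `fetchImpl`, the `{ entityIds }` request body, and the `expandEntities` wrapper are assumptions for illustration; only the endpoint path and the `{ nodes, edges }` return shape come from this file.

```javascript
// Hypothetical client; `baseUrl` and the request body shape ({ entityIds }) are assumptions.
async function getNeighbors(entityIds, { baseUrl, fetchImpl = fetch }) {
  const res = await fetchImpl(`${baseUrl}/graph/neighbors`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ entityIds }),
  });
  if (!res.ok) throw new Error(`graph/neighbors returned ${res.status}`);
  return res.json(); // expected shape: { nodes, edges }
}

// Caller-side fallback: on failure, keep the matched entities as nodes with no edges.
async function expandEntities(entities, options) {
  try {
    return await getNeighbors(entities.map((e) => e.id), options);
  } catch (err) {
    console.warn("graph expansion failed, using flat entity list:", err.message);
    return { nodes: entities, edges: [] };
  }
}
```

Passing `fetchImpl` explicitly keeps the client testable without a running memory-service.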
## Models Endpoint
`GET /models` scans `modelsFolderPath` for `.gguf` files and optionally reads a `models.json` manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.
`GET /models/props` proxies `/props` from llama-server and returns `{contextWindow, modelAlias}`. Returns 503 if llama-server is unreachable.
## Health Check
`GET /health/services` runs parallel fetch calls to all four dependencies (memory-service, embedding-service, inference-service, and Qdrant), each with a 3-second `AbortSignal.timeout`. Results are returned as an array, and the endpoint itself always responds with a 2xx status regardless of downstream status.
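
A parallel probe that never throws can be built on `Promise.allSettled`. In this sketch the `checkServices` name, the shape of the `services` map, the result fields, and the `/health` path on each dependency are assumptions; Qdrant in particular may expose a different health path.

```javascript
// Hypothetical sketch; the service map and the `/health` path on each dependency are assumptions.
async function checkServices(services, fetchImpl = fetch) {
  const entries = Object.entries(services);
  const results = await Promise.allSettled(
    entries.map(([, url]) => fetchImpl(`${url}/health`, { signal: AbortSignal.timeout(3000) }))
  );
  // One entry per dependency; a rejected or non-2xx probe is reported, never thrown.
  return results.map((result, i) => ({
    name: entries[i][0],
    ok: result.status === "fulfilled" && result.value.ok,
    error: result.status === "rejected" ? String(result.reason?.message ?? result.reason) : null,
  }));
}
```

Because `Promise.allSettled` resolves even when individual probes fail or time out, the route can always serialize the array and respond successfully.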
## Background Model (qwen2.5:3b)
Used for entity/relationship extraction and summarization via Ollama on Mini PC 1. Uses **ChatML format** (`<|im_start|>` / `<|im_end|>`) — not Phi3 format. Use `format: 'json'` only for structured extraction, never for free-text summarization.
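
For reference, a ChatML prompt wraps each turn in `<|im_start|>` and `<|im_end|>` markers and leaves the assistant turn open. The `toChatML` helper below is a hypothetical illustration of that layout, not a function from this codebase.

```javascript
// Hypothetical helper showing the ChatML turn layout; the function name is an assumption.
function toChatML(messages) {
  const turns = messages.map(({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>`);
  // Leave the assistant turn open so the model writes the reply.
  return `${turns.join("\n")}\n<|im_start|>assistant\n`;
}
```

A typical call passes a `system` message with the extraction or summarization instructions followed by a `user` message with the text to process.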
## API Endpoints Quick Reference
| Method | Path | Notes |
|---|---|---|
| GET | `/health` | Returns service URLs |
| GET | `/health/services` | Parallel status of all dependencies |
| POST | `/chat` | Blocking completion |
| POST | `/chat/stream` | SSE streaming |
| GET/PATCH | `/settings` | Persistent settings |
| GET | `/models` | `.gguf` file scan |
| GET | `/models/props` | llama-server model info |
| GET | `/sessions` | Delegates to memory-service |
| GET | `/sessions/:sessionId/history` | Paginated episodes by external ID |
| PATCH | `/sessions/:sessionId` | `name` and/or `projectId` |
| DELETE | `/sessions/:sessionId` | |
| GET | `/episodes` | Delegates; supports `q` for FTS |
| DELETE | `/episodes/:id` | Delegates |
| GET/POST/PATCH/DELETE | `/projects` and `/projects/:id` | Delegates |
| POST | `/summaries/project/:projectId/generate` | On-demand; 422 if no data |
| GET | `/summaries/project/:projectId/overview` | |
| GET | `/summaries/session/:sessionId` | Resolves external ID first |
| GET | `/summaries/project/:projectId` | |