# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

See the root [CLAUDE.md](../../CLAUDE.md) for overall architecture, service roles, and the end-to-end chat flow.

## Running This Service

```bash
npm run orchestration                            # From repo root (node src/index.js)
npm -w packages/orchestration-service run dev    # With --watch
```

Default port: **4000**. Depends on memory-service, embedding-service, inference-service, and Qdrant.

## Context Assembly (`src/chat/index.js`)

`assembleContext(externalId, userMessage)` is the core function that builds the inference prompt. Order of operations:

1. Resolve session by `externalId` (creates it if missing, so every chat call is self-healing).
2. If session has a `project_id`, load the project and fetch all sibling sessions (via `getProjectSessions`, hardcoded `limit=200`).
3. Fetch `recentEpisodeLimit` recent episodes from memory-service.
4. Embed the user message; search Qdrant EPISODES with `scoreThreshold`:
   - No project: `must: [sessionId == this session]`
   - Project: `should: [sessionId == s1, sessionId == s2, ...]` across all project sessions
   - Dedup against recent episode IDs before including.
5. Run **fused episode retrieval** via `getFusedEpisodes`: Qdrant semantic search and FTS5 keyword search run in parallel, both filtered against `recentIds`, then merged via Reciprocal Rank Fusion (RRF). If `keywordWeight` is `0`, the FTS call is skipped. Returns top `semanticLimit` episodes by fused score.
6. Embed and search Qdrant ENTITIES (filtered by `projectId` if in a project). Returns entity IDs alongside payload; the Qdrant point ID equals the SQLite entity ID.
7. Expand matched entities into a 1-hop graph neighborhood via `POST /graph/neighbors` on the memory-service. Returns `{ nodes, edges }`: the full entity objects plus connecting relationships. Falls back to flat entity list (no edges) if the graph call fails.
8. Build prompt in this fixed order: **system prompt → graph context → fused episodes → recent episodes → user message → "Assistant:"**

The ordering prioritizes established facts (graph context) and relevant past context (semantic) over pure recency.

## Graph Context Format

`formatGraphContext(nodes, edges)` in `src/chat/index.js` formats the neighborhood as:

```
- Alice (person): software engineer working on NexusAI
  → works_on NexusAI (project)
  → knows Bob (person)
- NexusAI (project): AI assistant framework
- Bob (person): Alice's colleague
```

Each node shows its notes on the first line. Outbound edges are indented below with `→ label target (type)`. Nodes with only inbound edges (neighbors pulled in by traversal) appear without connection lines.

## System Prompt Resolution

Priority from highest to lowest:

1. `project.system_prompt` (stored on the project row in memory-service)
2. `settings.systemPrompt` (saved in `data/settings.json`)
3. `ORCHESTRATION.SYSTEM_PROMPT` (shared constants fallback)

## Settings (`src/config/settings.js`)

Settings are loaded from `data/settings.json` merged with defaults at every `GET /settings` call. `PATCH /settings` validates each field individually with specific constraints:

| Field | Constraint |
|---|---|
| `recentEpisodeLimit` | integer, 1–20 |
| `semanticLimit` | integer, 1–20 |
| `scoreThreshold` | number, 0–1 |
| `temperature` | number, 0–2 |
| `repeatPenalty` | number, 1–2 |
| `topP` | number, 0–1 |
| `topK` | integer, 1–100 |
| `modelsFolderPath` | path must exist and be readable |
| `systemPrompt` | string (trimmed); `null` reverts to shared default |

`data/settings.json` is created on first save. Parent directories are created if missing.

## Streaming SSE (`src/chat/index.js`, `chatStream`)

The route sets SSE headers and delegates to `chatStream`, which:

1. Calls `inference.completeStream()` → receives a raw HTTP Response with a readable body.
2. Reads the body in chunks, buffers across chunk boundaries, splits on `\n\n`.
3. For each event line starting with `data: `, parses the JSON and calls `onChunk(data.response)`.
4. The `[DONE]` sentinel (used by some llama-server versions) is explicitly ignored.
5. After stream ends, saves the assembled full response as an episode (same as non-streaming).

If a chunk parse fails, the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved (logged as warning).

## Fire-and-Forget Tasks

After every successful chat turn:

- **Summarization** (`services/summarization.js` → `triggerSummary`): checks token threshold → recency guard → calls Ollama → POSTs to memory-service. Only runs if `SUMMARIES.THRESHOLD_TOKENS` is exceeded AND at least `SUMMARIES.MIN_EPISODES_SINCE` new episodes have occurred since the last summary.
- **Auto-naming** (`chat/index.js` → `autoNameSession`): only fires on the first message of a session. Uses temp 0.3, `maxTokens=20`, prompts for a ≤5-word title.

Both tasks catch all errors and log warnings without surfacing to the client.

## Summarization Recency Guard

`src/services/summarization.js` reads the `episode_range` field of the latest existing summary (format: `"-"`). It counts SQLite episodes with `id > endId`; if fewer than `SUMMARIES.MIN_EPISODES_SINCE`, it skips. This prevents rapid re-summarization on high-traffic sessions.

When the existing summary's token count exceeds `SUMMARIES.MAX_SUMMARY_TOKENS`, it is treated as "expired": a fresh summary is generated instead of an incremental update.

## Qdrant Calls (Direct, Not Via Memory-Service)

`src/services/qdrant.js` makes REST calls to Qdrant directly at `QDRANT_URL`. This bypasses memory-service for semantic search performance. Orchestration fetches episode/entity content from memory-service by ID *after* getting vector search results from Qdrant.

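
A minimal sketch of what such a direct call might look like, using the session scoping rules from Context Assembly (`must` for a single session, `should` across project sessions). The function names, the collection argument, and the `sessionId` payload key are assumptions for illustration; the real `src/services/qdrant.js` may be structured differently.

```javascript
// Hypothetical helper: builds the JSON body for Qdrant's
// POST /collections/{collection}/points/search endpoint.
// The payload key `sessionId` is an assumption, not confirmed by the source.
function buildEpisodeSearchBody(vector, { sessionId, projectSessionIds = [], limit, scoreThreshold }) {
  const matchSession = (id) => ({ key: 'sessionId', match: { value: id } });

  // In a project: match any sibling session (`should` acts as OR).
  // Otherwise: pin the search to this one session (`must`).
  const filter = projectSessionIds.length > 0
    ? { should: projectSessionIds.map(matchSession) }
    : { must: [matchSession(sessionId)] };

  return {
    vector,
    limit,
    score_threshold: scoreThreshold,
    with_payload: true,
    filter,
  };
}

// Hypothetical thin wrapper around the REST call (needs Node 18+ for global fetch).
async function searchEpisodes(qdrantUrl, collection, body) {
  const res = await fetch(`${qdrantUrl}/collections/${collection}/points/search`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Qdrant search failed: ${res.status}`);
  const { result } = await res.json();
  return result; // array of { id, score, payload }
}
```
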
`searchEntities` checks `projectId !== null && projectId !== undefined` before applying the filter, so a session with no project skips the filter entirely and searches globally.

## Retrieval Fusion (`src/chat/index.js`)

Three functions handle fusion, all pure or lightly async, all non-critical:

- **`getFTSResults(userMessage, { limit, sessionIds })`**: calls `memory.searchEpisodes`; returns `[]` and logs a warning on failure.
- **`fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit })`**: pure RRF implementation. Key guard: FTS-only episodes are only added to the scores Map if `contrib > 0` (prevents score-0 bleed-through when `keywordWeight: 0`).
- **`getFusedEpisodes(userMessage, session, recentIds, projectSessionIds, settings)`**: orchestrates both paths in `Promise.all`, applies `recentIds` filter to FTS results, calls fusion. Short-circuits the FTS call entirely if `keywordWeight === 0`.

FTS is scoped to `projectSessionIds` if in a project, otherwise `[session.id]`, mirroring Qdrant scoping exactly.

> For RRF formula, weight semantics, and enabling keyword search, see `docs/services/retrieval-fusion.md`.

## Graph Service Client (`src/services/graph.js`)

Thin HTTP client for memory-service graph endpoints. One function:

- **`getNeighbors(entityIds[])`**: POSTs to `memory-service/graph/neighbors` with the entity IDs from Qdrant entity search. Returns `{ nodes, edges }`. Throws on non-2xx; the caller wraps it in try/catch with a graceful fallback.

## Models Endpoint

`GET /models` scans `modelsFolderPath` for `.gguf` files and optionally reads a `models.json` manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.

`GET /models/props` proxies `/props` from llama-server and returns `{contextWindow, modelAlias}`. Returns 503 if llama-server is unreachable.

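
The folder scan behind `GET /models` could be sketched roughly as below. This is an illustration under assumptions, not the service's actual code: the function name `scanModels`, the response fields (`filename`, `label`, `description`, `sizeGB`), and the choice of 1024³ bytes per GB are all hypothetical.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical sketch: list .gguf files in `folder`, enriched from an optional
// models.json manifest keyed by filename. Field names are assumptions.
// If `folder` cannot be read, readdirSync throws; a route handler would
// presumably map that to the 500 response described above.
function scanModels(folder) {
  const manifestPath = path.join(folder, 'models.json');
  let manifest = {};
  if (fs.existsSync(manifestPath)) {
    manifest = JSON.parse(fs.readFileSync(manifestPath, 'utf8'));
  }

  return fs.readdirSync(folder)
    .filter((name) => name.toLowerCase().endsWith('.gguf'))
    .sort()
    .map((name) => {
      const { size } = fs.statSync(path.join(folder, name));
      const meta = manifest[name] ?? {};
      return {
        filename: name,
        label: meta.label ?? name,          // fall back to the filename
        description: meta.description ?? null,
        sizeGB: Number((size / 1024 ** 3).toFixed(2)), // assumes binary GB
      };
    });
}
```
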
## Health Check

`GET /health/services` runs parallel fetch calls to all four dependent services with a 3-second `AbortSignal.timeout` each. Results are returned as an array; the endpoint never returns a non-2xx itself regardless of downstream status.

## Background Model (qwen2.5:3b)

Used for entity/relationship extraction and summarization via Ollama on Mini PC 1. Uses **ChatML format** (`<|im_start|>` / `<|im_end|>`), not Phi3 format. Use `format: 'json'` only for structured extraction, never for free-text summarization.

## API Endpoints Quick Reference

| Method | Path | Notes |
|---|---|---|
| GET | `/health` | Returns service URLs |
| GET | `/health/services` | Parallel status of all dependencies |
| POST | `/chat` | Blocking completion |
| POST | `/chat/stream` | SSE streaming |
| GET/PATCH | `/settings` | Persistent settings |
| GET | `/models` | `.gguf` file scan |
| GET | `/models/props` | llama-server model info |
| GET | `/sessions` | Delegates to memory-service |
| GET | `/sessions/:sessionId/history` | Paginated episodes by external ID |
| PATCH | `/sessions/:sessionId` | `name` and/or `projectId` |
| DELETE | `/sessions/:sessionId` | |
| GET | `/episodes` | Delegates; supports `q` for FTS |
| DELETE | `/episodes/:id` | Delegates |
| GET/POST/PATCH/DELETE | `/projects` and `/projects/:id` | Delegates |
| POST | `/summaries/project/:projectId/generate` | On-demand; 422 if no data |
| GET | `/summaries/project/:projectId/overview` | |
| GET | `/summaries/session/:sessionId` | Resolves external ID first |
| GET | `/summaries/project/:projectId` | |
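
For the `/health/services` row above, the parallel probe with a 3-second timeout per dependency might be sketched like this. The function name, the `{ name, url }` input shape, and the result fields are assumptions for illustration; `fetchImpl` is injectable only so the sketch can be exercised without a network.

```javascript
// Hypothetical sketch of a parallel dependency probe (Node 18+ for global
// fetch and AbortSignal.timeout). Names and result shape are assumptions.
async function checkServices(services, fetchImpl = fetch) {
  const results = await Promise.allSettled(
    services.map(async ({ url }) =>
      // Each request gets its own 3-second abort signal.
      fetchImpl(url, { signal: AbortSignal.timeout(3000) }),
    ),
  );

  // One entry per service; failures are reported in the array, never thrown,
  // so a route built on this can always answer with a 2xx.
  return results.map((result, i) => ({
    name: services[i].name,
    ok: result.status === 'fulfilled' && Boolean(result.value.ok),
    error: result.status === 'rejected' ? String(result.reason) : null,
  }));
}
```
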