# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
See the root CLAUDE.md for overall architecture, service roles, and the end-to-end chat flow.
## Running This Service

```bash
npm run orchestration                          # From repo root (node src/index.js)
npm -w packages/orchestration-service run dev  # With --watch
```
Default port: 4000. Depends on memory-service, embedding-service, inference-service, and Qdrant.
## Context Assembly (`src/chat/index.js`)

`assembleContext(externalId, userMessage)` is the core function that builds the inference prompt. Order of operations:

- Resolve the session by `externalId` (creates it if missing, so every chat call is self-healing).
- If the session has a `project_id`, load the project and fetch all sibling sessions (via `getProjectSessions`, hardcoded `limit=200`).
- Fetch `recentEpisodeLimit` recent episodes from memory-service.
- Embed the user message and search Qdrant `EPISODES` with `scoreThreshold`:
  - No project: `must: [sessionId == this session]`
  - Project: `should: [sessionId == s1, sessionId == s2, ...]` across all project sessions
  - Dedup against recent episode IDs before including.
- Run fused episode retrieval via `getFusedEpisodes`: Qdrant semantic search and FTS5 keyword search run in parallel, both filtered against `recentIds`, then merged via Reciprocal Rank Fusion (RRF). If `keywordWeight` is `0`, the FTS call is skipped. Returns the top `semanticLimit` episodes by fused score.
- Embed and search Qdrant `ENTITIES` (filtered by `projectId` if in a project). Returns entity IDs alongside the payload; the Qdrant point ID equals the SQLite entity ID.
- Expand matched entities into a 1-hop graph neighborhood via `POST /graph/neighbors` on memory-service. Returns `{ nodes, edges }`: the full entity objects plus connecting relationships. Falls back to a flat entity list (no edges) if the graph call fails.
- Build the prompt in this fixed order: system prompt → graph context → fused episodes → recent episodes → user message → `"Assistant:"`.
The ordering prioritizes established facts (graph context) and relevant past context (semantic) over pure recency.
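A minimal sketch of the fixed prompt order described above. The `buildPrompt` name, its argument object, the `{ role, content }` episode shape, the `User:` prefix, and the blank-line separators are all assumptions for illustration, not the actual code in `src/chat/index.js`.

```javascript
// Hypothetical helper: joins prompt sections in the documented order.
// Episode shape { role, content }, the "User:" prefix and "\n\n" separators are assumptions.
function buildPrompt({ systemPrompt, graphContext, fusedEpisodes = [], recentEpisodes = [], userMessage }) {
  const renderEpisodes = (episodes) =>
    episodes.map((ep) => `${ep.role}: ${ep.content}`).join("\n");

  const sections = [
    systemPrompt,
    graphContext,                   // established facts first
    renderEpisodes(fusedEpisodes),  // relevant past context (semantic + keyword)
    renderEpisodes(recentEpisodes), // pure recency, closest to the new turn
    `User: ${userMessage}`,
    "Assistant:",
  ];
  // Drop empty sections, e.g. no graph context for a brand-new session.
  return sections.filter((s) => typeof s === "string" && s.length > 0).join("\n\n");
}
```

Keeping the order in one array makes the priority explicit: moving a section means moving one array element.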
## Graph Context Format

`formatGraphContext(nodes, edges)` in `src/chat/index.js` formats the neighborhood as:

```
- Alice (person): software engineer working on NexusAI
  → works_on NexusAI (project)
  → knows Bob (person)
- NexusAI (project): AI assistant framework
- Bob (person): Alice's colleague
```

Each node shows its notes on the first line. Outbound edges are indented below as `→ label target (type)`. Nodes with only inbound edges (neighbors pulled in by traversal) appear without connection lines.
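A sketch that reproduces the layout above. The node fields (`id`, `name`, `type`, `notes`) and edge fields (`sourceId`, `targetId`, `label`) are assumptions about the `/graph/neighbors` payload; the real function may use different names.

```javascript
// Sketch of the documented graph-context layout.
// Field names on nodes and edges are assumptions, not the real payload schema.
function formatGraphContext(nodes, edges) {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const lines = [];
  for (const node of nodes) {
    lines.push(`- ${node.name} (${node.type}): ${node.notes ?? ""}`.trimEnd());
    // Only outbound edges are listed under a node, so inbound-only neighbors get no lines.
    for (const edge of edges) {
      if (edge.sourceId !== node.id) continue;
      const target = byId.get(edge.targetId);
      if (target) lines.push(`  → ${edge.label} ${target.name} (${target.type})`);
    }
  }
  return lines.join("\n");
}
```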
## System Prompt Resolution

Priority from highest to lowest:

1. `project.system_prompt` (stored on the project row in memory-service)
2. `settings.systemPrompt` (saved in `data/settings.json`)
3. `ORCHESTRATION.SYSTEM_PROMPT` (shared constants fallback)
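A hedged sketch of this priority chain. The `resolveSystemPrompt` name is hypothetical, and treating `null`, `undefined`, or whitespace-only strings as "not set" is an assumption about how the real code decides to fall through.

```javascript
// Hypothetical resolver for the three-level system prompt priority.
// Assumption: null, undefined, or whitespace-only values mean "not set".
function resolveSystemPrompt(project, settings, defaultPrompt) {
  const candidates = [project?.system_prompt, settings?.systemPrompt];
  for (const value of candidates) {
    if (typeof value === "string" && value.trim() !== "") return value;
  }
  return defaultPrompt; // ORCHESTRATION.SYSTEM_PROMPT in the real service
}
```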
## Settings (`src/config/settings.js`)

Settings are loaded from `data/settings.json` and merged with defaults on every `GET /settings` call. `PATCH /settings` validates each field individually against these constraints:

| Field | Constraint |
|---|---|
| `recentEpisodeLimit` | integer, 1–20 |
| `semanticLimit` | integer, 1–20 |
| `scoreThreshold` | number, 0–1 |
| `temperature` | number, 0–2 |
| `repeatPenalty` | number, 1–2 |
| `topP` | number, 0–1 |
| `topK` | integer, 1–100 |
| `modelsFolderPath` | path must exist and be readable |
| `systemPrompt` | string (trimmed); `null` reverts to the shared default |

`data/settings.json` is created on first save; parent directories are created if missing.
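A minimal sketch of per-field validation driven by the constraint table above. The `validateSettingsPatch` name, the `{ clean, errors }` return shape, and silently ignoring unknown keys are assumptions; the path check uses Node's `fs.accessSync` with `fs.constants.R_OK`, which throws when the path is missing or unreadable.

```javascript
const fs = require("node:fs");

// Numeric constraints copied from the settings table.
const NUMERIC_RULES = {
  recentEpisodeLimit: { integer: true, min: 1, max: 20 },
  semanticLimit: { integer: true, min: 1, max: 20 },
  scoreThreshold: { integer: false, min: 0, max: 1 },
  temperature: { integer: false, min: 0, max: 2 },
  repeatPenalty: { integer: false, min: 1, max: 2 },
  topP: { integer: false, min: 0, max: 1 },
  topK: { integer: true, min: 1, max: 100 },
};

// Hypothetical validator: returns accepted values plus per-field error messages.
// Unknown keys are ignored here (an assumption about the real route).
function validateSettingsPatch(patch) {
  const clean = {};
  const errors = [];
  for (const [key, value] of Object.entries(patch)) {
    const rule = NUMERIC_RULES[key];
    if (rule) {
      const okType = rule.integer
        ? Number.isInteger(value)
        : typeof value === "number" && Number.isFinite(value);
      if (!okType || value < rule.min || value > rule.max) {
        errors.push(`${key} must be ${rule.integer ? "an integer" : "a number"} in ${rule.min}-${rule.max}`);
      } else {
        clean[key] = value;
      }
    } else if (key === "systemPrompt") {
      if (value === null) clean.systemPrompt = null; // null reverts to the shared default
      else if (typeof value === "string") clean.systemPrompt = value.trim();
      else errors.push("systemPrompt must be a string or null");
    } else if (key === "modelsFolderPath") {
      try {
        fs.accessSync(value, fs.constants.R_OK); // throws if missing or unreadable
        clean.modelsFolderPath = value;
      } catch {
        errors.push("modelsFolderPath must exist and be readable");
      }
    }
  }
  return { clean, errors };
}
```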
## Streaming SSE (`src/chat/index.js`, `chatStream`)

The route sets SSE headers and delegates to `chatStream`, which:

- Calls `inference.completeStream()` and receives a raw HTTP `Response` with a readable body.
- Reads the body in chunks, buffers across chunk boundaries, and splits on `\n\n`.
- For each event line starting with `data:`, parses the JSON and calls `onChunk(data.response)`.
- Explicitly ignores the `[DONE]` sentinel (used by some llama-server versions).
- After the stream ends, saves the assembled full response as an episode (same as non-streaming).

If a chunk fails to parse, the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved (a warning is logged).
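A synchronous sketch of the buffering and parsing steps above, written as a hypothetical `createSSEParser(onChunk)` factory that is fed already-decoded text. It is an illustration of the technique, not the real `chatStream` code; in particular, this sketch discards any incomplete trailing event when the stream ends.

```javascript
// Hypothetical SSE parser: buffers across chunk boundaries, splits events on "\n\n",
// forwards data.response for each "data:" line, and skips the [DONE] sentinel.
function createSSEParser(onChunk) {
  let buffer = "";
  return function push(text) {
    buffer += text;
    const events = buffer.split("\n\n");
    buffer = events.pop(); // last piece may be an incomplete event; keep it for the next chunk
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (!line.startsWith("data:")) continue;
        const payload = line.slice("data:".length).trim();
        if (payload === "[DONE]") continue;
        try {
          const data = JSON.parse(payload);
          if (typeof data.response === "string") onChunk(data.response);
        } catch (err) {
          // Log and keep going, matching the documented behaviour.
          console.warn(`SSE chunk parse failed: ${err.message}`);
        }
      }
    }
  };
}
```

To use it with a fetch `Response`, iterate the body and pass each chunk through a `TextDecoder` with `decoder.decode(chunk, { stream: true })`, so multi-byte characters split across network chunks decode correctly.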
## Fire-and-Forget Tasks

After every successful chat turn:

- Summarization (`services/summarization.js` → `triggerSummary`): checks the token threshold → recency guard → calls Ollama → POSTs to memory-service. Only runs if `SUMMARIES.THRESHOLD_TOKENS` is exceeded AND at least `SUMMARIES.MIN_EPISODES_SINCE` new episodes have occurred since the last summary.
- Auto-naming (`chat/index.js` → `autoNameSession`): only fires on the first message of a session. Uses temperature 0.3 and `maxTokens=20`, and prompts for a title of at most five words.
Both tasks catch all errors and log warnings without surfacing to the client.
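One way to express this pattern is a small wrapper; `fireAndForget` is a hypothetical helper, not a function the service is known to export. Wrapping the call in `Promise.resolve().then(...)` means both synchronous throws and rejected promises end up in the same `.catch`, which logs a warning instead of reaching the client.

```javascript
// Hypothetical helper: starts a background task without awaiting it and
// turns any failure (sync throw or rejected promise) into a logged warning.
function fireAndForget(label, task) {
  Promise.resolve()
    .then(task) // runs on the microtask queue, after the caller continues
    .catch((err) => console.warn(`[${label}] background task failed: ${err.message}`));
}
```

A chat handler would call it once for summarization and, on the first message only, once for auto-naming; the HTTP response never waits on either.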
## Summarization Recency Guard

`src/services/summarization.js` reads the `episode_range` field of the latest existing summary (format: `"<startId>-<endId>"`). It counts SQLite episodes with `id > endId`; if there are fewer than `SUMMARIES.MIN_EPISODES_SINCE`, it skips. This prevents rapid re-summarization on high-traffic sessions.

When the existing summary's token count exceeds `SUMMARIES.MAX_SUMMARY_TOKENS`, it is treated as "expired": a fresh summary is generated instead of an incremental update.
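A sketch of the guard and the expiry rule as a pure decision function. `planSummary` is hypothetical, the `episodeIds` array stands in for the SQLite count query, `token_count` is an assumed field name, and returning `"fresh"` when no summary exists yet is an assumption.

```javascript
// Hypothetical decision helper for the recency guard and expiry rule.
// episodeIds replaces the real SQLite query; token_count is an assumed field name.
function planSummary(latestSummary, episodeIds, { minEpisodesSince, maxSummaryTokens }) {
  if (!latestSummary) return { action: "fresh", newCount: episodeIds.length }; // assumption: first summary
  // episode_range is "<startId>-<endId>"; only the end ID matters for the guard.
  const endId = Number(latestSummary.episode_range.split("-").pop());
  const newCount = episodeIds.filter((id) => id > endId).length;
  if (newCount < minEpisodesSince) return { action: "skip", newCount };
  // An oversized summary is "expired": regenerate instead of extending it.
  if (latestSummary.token_count > maxSummaryTokens) return { action: "fresh", newCount };
  return { action: "incremental", newCount };
}
```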
## Qdrant Calls (Direct, Not Via Memory-Service)

`src/services/qdrant.js` makes REST calls directly to Qdrant at `QDRANT_URL`, bypassing memory-service for semantic search performance. After getting vector search results from Qdrant, orchestration fetches episode/entity content from memory-service by ID.

`searchEntities` checks `projectId !== null && projectId !== undefined` before applying the filter; a session with no project skips the filter entirely and searches globally.
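A sketch of how the session and project scoping could map onto Qdrant's filter JSON (`must` / `should` clauses with `key` and `match.value`) and the body of `POST /collections/{collection}/points/search`. The payload keys `sessionId` and `projectId` and the helper names are assumptions. The explicit `null`/`undefined` check means a falsy but valid ID such as `0` still gets filtered, which may be one reason the real code avoids a plain truthiness test.

```javascript
// Hypothetical filter builders; payload keys "sessionId" and "projectId" are assumptions.
function episodeFilter(sessionId, projectSessionIds = []) {
  if (projectSessionIds.length === 0) {
    // No project: only this session's episodes.
    return { must: [{ key: "sessionId", match: { value: sessionId } }] };
  }
  // Project: any sibling session may match.
  return {
    should: projectSessionIds.map((id) => ({ key: "sessionId", match: { value: id } })),
  };
}

function entityFilter(projectId) {
  // Explicit null/undefined check: a falsy ID such as 0 still produces a filter.
  if (projectId !== null && projectId !== undefined) {
    return { must: [{ key: "projectId", match: { value: projectId } }] };
  }
  return undefined; // no project: global entity search
}

// Request body for POST {QDRANT_URL}/collections/{collection}/points/search.
function searchBody(vector, { limit, scoreThreshold, filter }) {
  const body = { vector, limit, with_payload: true };
  if (scoreThreshold !== undefined) body.score_threshold = scoreThreshold;
  if (filter) body.filter = filter;
  return body;
}
```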
## Retrieval Fusion (`src/chat/index.js`)

Three functions handle fusion; all are pure or lightly async, and all are non-critical:

- `getFTSResults(userMessage, { limit, sessionIds })`: calls `memory.searchEpisodes`; returns `[]` and logs a warning on failure.
- `fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit })`: pure RRF implementation. Key guard: FTS-only episodes are added to the scores `Map` only if `contrib > 0` (prevents score-0 bleed-through when `keywordWeight: 0`).
- `getFusedEpisodes(userMessage, session, recentIds, projectSessionIds, settings)`: orchestrates both paths in `Promise.all`, applies the `recentIds` filter to FTS results, and calls fusion. Short-circuits the FTS call entirely if `keywordWeight === 0`.

FTS is scoped to `projectSessionIds` if in a project, otherwise `[session.id]`, mirroring Qdrant scoping exactly.

For the RRF formula, weight semantics, and enabling keyword search, see `docs/services/retrieval-fusion.md`.
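A sketch of weighted RRF with the `contrib > 0` guard, where each episode scores the sum over result lists of `weight / (k + rank)` with 1-based ranks. The constant `k = 60` is an assumption (a value commonly used for RRF); the real constant and weight semantics are in `docs/services/retrieval-fusion.md`. Episodes are assumed to carry an `id` field.

```javascript
// Weighted Reciprocal Rank Fusion sketch. RRF_K = 60 is an assumed constant.
const RRF_K = 60;

function fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit, k = RRF_K }) {
  const scores = new Map(); // episode id -> { episode, score }

  // Ranks are 1-based: the top hit contributes weight / (k + 1).
  semanticEps.forEach((ep, i) => {
    scores.set(ep.id, { episode: ep, score: semanticWeight / (k + i + 1) });
  });

  keywordEps.forEach((ep, i) => {
    const contrib = keywordWeight / (k + i + 1);
    const entry = scores.get(ep.id);
    if (entry) {
      entry.score += contrib; // found by both paths: contributions add up
    } else if (contrib > 0) {
      // Guard: with keywordWeight 0, FTS-only hits must not enter at score 0.
      scores.set(ep.id, { episode: ep, score: contrib });
    }
  });

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.episode);
}
```

Because RRF uses ranks rather than raw scores, cosine similarities from Qdrant and FTS5 relevance values never need to be normalised onto a common scale.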
## Graph Service Client (`src/services/graph.js`)

Thin HTTP client for memory-service graph endpoints. One function:

- `getNeighbors(entityIds[])`: POSTs to memory-service `/graph/neighbors` with the entity IDs from Qdrant entity search. Returns `{ nodes, edges }`. Throws on non-2xx; the caller wraps it in try/catch with a graceful fallback.
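A minimal sketch of the client and its caller-side fallback, using the global `fetch` available in Node 18+. The explicit `baseUrl` parameter, the `{ entityIds }` request body key, and the `neighborhoodOrFallback` helper are assumptions for illustration.

```javascript
// Hypothetical client: baseUrl and the { entityIds } body key are assumptions.
async function getNeighbors(baseUrl, entityIds) {
  const res = await fetch(`${baseUrl}/graph/neighbors`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ entityIds }),
  });
  if (!res.ok) throw new Error(`POST /graph/neighbors failed with ${res.status}`);
  return res.json(); // { nodes, edges }
}

// Caller side: degrade to the flat entity list (no edges) on any failure.
async function neighborhoodOrFallback(baseUrl, entities) {
  try {
    return await getNeighbors(baseUrl, entities.map((e) => e.id));
  } catch (err) {
    console.warn(`graph expansion failed, using flat entities: ${err.message}`);
    return { nodes: entities, edges: [] };
  }
}
```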
## Models Endpoint

`GET /models` scans `modelsFolderPath` for `.gguf` files and optionally reads a `models.json` manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.
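A sketch of the folder scan using Node's synchronous `fs` APIs. The `scanModels` name, the output field names, the manifest entry shape `{ label, description }`, the case-insensitive extension match, and reporting GB as bytes / 1024³ are assumptions. A missing or unreadable folder makes `readdirSync` throw, which a route handler would turn into the documented 500.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical scanner: lists .gguf files and merges optional models.json metadata.
function scanModels(folder) {
  const manifestPath = path.join(folder, "models.json");
  const manifest = fs.existsSync(manifestPath)
    ? JSON.parse(fs.readFileSync(manifestPath, "utf8")) // keyed by filename
    : {};

  return fs
    .readdirSync(folder) // throws if the folder is missing or unreadable
    .filter((name) => name.toLowerCase().endsWith(".gguf"))
    .map((name) => {
      const bytes = fs.statSync(path.join(folder, name)).size;
      const meta = manifest[name] ?? {};
      return {
        filename: name,
        label: meta.label ?? name,
        description: meta.description ?? null,
        sizeGB: Number((bytes / 1024 ** 3).toFixed(2)), // assumption: binary GB
      };
    });
}
```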
`GET /models/props` proxies `/props` from llama-server and returns `{ contextWindow, modelAlias }`. Returns 503 if llama-server is unreachable.
## Health Check

`GET /health/services` runs parallel `fetch` calls to all four dependent services, each with a 3-second `AbortSignal.timeout`. Results are returned as an array; the endpoint itself never returns a non-2xx status, regardless of downstream status.
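A sketch of the parallel probe. `checkServices` and the `{ name, url }` input list are hypothetical (the real service URLs and health paths are not listed here). Each probe catches its own error, so `Promise.all` never rejects; that is what lets the endpoint always answer 2xx. `AbortSignal.timeout(ms)` aborts the `fetch` with a `TimeoutError`.

```javascript
// Hypothetical health probe: one fetch per dependency, each with its own timeout.
async function checkServices(services, timeoutMs = 3000) {
  return Promise.all(
    services.map(async ({ name, url }) => {
      try {
        const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
        return { name, ok: res.ok, status: res.status };
      } catch (err) {
        // Per-service failure is reported, never thrown, so Promise.all cannot reject.
        return { name, ok: false, error: err.name === "TimeoutError" ? "timeout" : err.message };
      }
    })
  );
}
```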
## Background Model (`qwen2.5:3b`)

Used for entity/relationship extraction and summarization via Ollama on Mini PC 1. Uses ChatML format (`<|im_start|>` / `<|im_end|>`), not Phi3 format. Use `format: 'json'` only for structured extraction, never for free-text summarization.
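A sketch of a ChatML prompt and a matching Ollama `/api/generate` request body. `model`, `prompt`, `raw`, `stream`, `format`, and `options` are real `/api/generate` fields. Sending `raw: true` (so Ollama applies no template of its own) and the helper names are assumptions about how this service calls Ollama.

```javascript
// ChatML prompt for qwen2.5:3b; the trailing assistant header cues the reply.
function toChatML(system, user) {
  return (
    `<|im_start|>system\n${system}<|im_end|>\n` +
    `<|im_start|>user\n${user}<|im_end|>\n` +
    `<|im_start|>assistant\n`
  );
}

// Hypothetical body builder for POST /api/generate. raw: true is an assumption.
function ollamaGenerateBody(system, user, { json = false, temperature } = {}) {
  const body = {
    model: "qwen2.5:3b",
    prompt: toChatML(system, user),
    raw: true,     // prompt is already ChatML; skip Ollama's template
    stream: false,
  };
  if (json) body.format = "json"; // structured extraction only, never summaries
  if (temperature !== undefined) body.options = { temperature };
  return body;
}
```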
## API Endpoints Quick Reference

| Method | Path | Notes |
|---|---|---|
| GET | `/health` | Returns service URLs |
| GET | `/health/services` | Parallel status of all dependencies |
| POST | `/chat` | Blocking completion |
| POST | `/chat/stream` | SSE streaming |
| GET/PATCH | `/settings` | Persistent settings |
| GET | `/models` | `.gguf` file scan |
| GET | `/models/props` | llama-server model info |
| GET | `/sessions` | Delegates to memory-service |
| GET | `/sessions/:sessionId/history` | Paginated episodes by external ID |
| PATCH | `/sessions/:sessionId` | `name` and/or `projectId` |
| DELETE | `/sessions/:sessionId` | |
| GET | `/episodes` | Delegates; supports `q` for FTS |
| DELETE | `/episodes/:id` | Delegates |
| GET/POST/PATCH/DELETE | `/projects` and `/projects/:id` | Delegates |
| POST | `/summaries/project/:projectId/generate` | On-demand; 422 if no data |
| GET | `/summaries/project/:projectId/overview` | |
| GET | `/summaries/session/:sessionId` | Resolves external ID first |
| GET | `/summaries/project/:projectId` | |