nexusAI/packages/orchestration-service/CLAUDE.md
2026-04-27 03:10:39 -07:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

See the root CLAUDE.md for overall architecture, service roles, and the end-to-end chat flow.

Running This Service

npm run orchestration             # From repo root (node src/index.js)
npm -w packages/orchestration-service run dev   # With --watch

Default port: 4000. Depends on memory-service, embedding-service, inference-service, and Qdrant.

Context Assembly (src/chat/index.js)

assembleContext(externalId, userMessage) is the core function that builds the inference prompt. Order of operations:

  1. Resolve session by externalId (creates it if missing — every chat call is self-healing).
  2. If session has a project_id, load the project and fetch all sibling sessions (via getProjectSessions, hardcoded limit=200).
  3. Fetch recentEpisodeLimit recent episodes from memory-service.
  4. Embed the user message; search Qdrant EPISODES with scoreThreshold:
    • No project: must: [sessionId == this session]
    • Project: should: [sessionId == s1, sessionId == s2, ...] across all project sessions
    • Dedup against recent episode IDs before including.
  5. Embed and search Qdrant ENTITIES (filtered by projectId if in a project). Returns entity IDs alongside payload — the Qdrant point ID equals the SQLite entity ID.
  6. Expand matched entities into a 1-hop graph neighborhood via POST /graph/neighbors on the memory-service. Returns { nodes, edges } — the full entity objects plus connecting relationships. Falls back to flat entity list (no edges) if the graph call fails.
  7. Build prompt in this fixed order: system prompt → graph context → semantic episodes → recent episodes → user message → "Assistant:"

The ordering prioritizes established facts (graph context) and relevant past context (semantic) over pure recency.
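
The fixed ordering and the dedup step can be sketched as below. This is a minimal illustration, not the code in src/chat/index.js: the helper names, the section labels ("Known facts:", etc.), and the episode shape ({ id, role, content }) are all assumptions.

```javascript
// Hypothetical helpers; names, labels, and episode shape are assumptions.

// Drop semantic hits that already appear in the recent-episode window.
function dedupeAgainstRecent(semanticEpisodes, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((ep) => ep.id));
  return semanticEpisodes.filter((ep) => !recentIds.has(ep.id));
}

function formatEpisodes(episodes) {
  return episodes.map((ep) => `${ep.role}: ${ep.content}`).join("\n");
}

// Joins sections in the documented order:
// system prompt, graph context, semantic episodes, recent episodes, user message, "Assistant:".
function buildPrompt({ systemPrompt, graphContext, semanticEpisodes, recentEpisodes, userMessage }) {
  const relevant = dedupeAgainstRecent(semanticEpisodes, recentEpisodes);
  const sections = [
    systemPrompt,
    graphContext && `Known facts:\n${graphContext}`,
    relevant.length > 0 && `Relevant past context:\n${formatEpisodes(relevant)}`,
    recentEpisodes.length > 0 && `Recent conversation:\n${formatEpisodes(recentEpisodes)}`,
    `User: ${userMessage}`,
    "Assistant:",
  ];
  // Empty sections are skipped so the prompt never contains blank headers.
  return sections.filter(Boolean).join("\n\n");
}
```

Skipping empty sections keeps the order stable even when a session has no graph context or no semantic hits yet.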

Graph Context Format

formatGraphContext(nodes, edges) in src/chat/index.js formats the neighborhood as:

- Alice (person): software engineer working on NexusAI
  → works_on NexusAI (project)
  → knows Bob (person)
- NexusAI (project): AI assistant framework
- Bob (person): Alice's colleague

Each node shows its notes on the first line. Outbound edges are indented below with → label target (type). Nodes with only inbound edges (neighbors pulled in by traversal) appear without connection lines.
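
A minimal sketch of a formatter that produces this layout. The field names (node.id, node.name, node.type, node.notes, edge.source, edge.target, edge.label) are assumptions about the { nodes, edges } payload; check them against the actual response from /graph/neighbors.

```javascript
// Sketch only: node = { id, name, type, notes }, edge = { source, target, label } (assumed shapes).
function formatGraphContextSketch(nodes, edges) {
  const byId = new Map(nodes.map((node) => [node.id, node]));
  const lines = [];
  for (const node of nodes) {
    lines.push(`- ${node.name} (${node.type}): ${node.notes}`);
    for (const edge of edges) {
      const target = byId.get(edge.target);
      // Only outbound edges whose target is part of the neighborhood are printed.
      if (edge.source !== node.id || !target) continue;
      lines.push(`  → ${edge.label} ${target.name} (${target.type})`);
    }
  }
  return lines.join("\n");
}
```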

System Prompt Resolution

Priority from highest to lowest:

  1. project.system_prompt (stored on the project row in memory-service)
  2. settings.systemPrompt (saved in data/settings.json)
  3. ORCHESTRATION.SYSTEM_PROMPT (shared constants fallback)
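
The priority chain can be expressed as a first-match lookup. In this sketch the fallback string is a placeholder, and treating an empty or whitespace-only prompt as "not set" is an assumption rather than documented behavior.

```javascript
// Placeholder value; the real constant lives in the shared constants.
const ORCHESTRATION = { SYSTEM_PROMPT: "You are a helpful assistant." };

// Returns the first usable prompt: project, then saved settings, then the shared default.
function resolveSystemPrompt(project, settings) {
  const candidates = [
    project?.system_prompt,
    settings?.systemPrompt,
    ORCHESTRATION.SYSTEM_PROMPT,
  ];
  // Assumption: null, undefined, and blank strings all fall through to the next level.
  return candidates.find((value) => typeof value === "string" && value.trim() !== "");
}
```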

Settings (src/config/settings.js)

Settings are loaded from data/settings.json and merged with defaults on every GET /settings call. PATCH /settings validates each field individually against the constraints below:

| Field | Constraint |
| --- | --- |
| recentEpisodeLimit | integer, 1-20 |
| semanticLimit | integer, 1-20 |
| scoreThreshold | number, 0-1 |
| temperature | number, 0-2 |
| repeatPenalty | number, 1-2 |
| topP | number, 0-1 |
| topK | integer, 1-100 |
| modelsFolderPath | path must exist and be readable |
| systemPrompt | string (trimmed); null reverts to shared default |

data/settings.json is created on first save. Parent directories are created if missing.
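
Per-field validation and the defaults merge might look like the sketch below. It assumes the numeric constraints are inclusive ranges (for example 1 to 20 for recentEpisodeLimit and 1 to 100 for topK); the rule table, function names, and error wording are illustrative and should be checked against src/config/settings.js.

```javascript
// Assumed inclusive bounds; verify against the real validator before relying on them.
const NUMERIC_RULES = {
  recentEpisodeLimit: { integer: true, min: 1, max: 20 },
  semanticLimit: { integer: true, min: 1, max: 20 },
  scoreThreshold: { integer: false, min: 0, max: 1 },
  temperature: { integer: false, min: 0, max: 2 },
  repeatPenalty: { integer: false, min: 1, max: 2 },
  topP: { integer: false, min: 0, max: 1 },
  topK: { integer: true, min: 1, max: 100 },
};

// Returns one error message per invalid numeric field; an empty array means the patch is acceptable.
function validateNumericPatch(patch) {
  const errors = [];
  for (const [field, value] of Object.entries(patch)) {
    const rule = NUMERIC_RULES[field];
    if (!rule) continue; // modelsFolderPath and systemPrompt are out of scope for this sketch
    const isNumber = typeof value === "number" && Number.isFinite(value);
    const wrongKind = rule.integer && !Number.isInteger(value);
    if (!isNumber || wrongKind || value < rule.min || value > rule.max) {
      const kind = rule.integer ? "an integer" : "a number";
      errors.push(`${field} must be ${kind} between ${rule.min} and ${rule.max}`);
    }
  }
  return errors;
}

// Stored values win over defaults; keys missing from the file keep their default.
function mergeWithDefaults(defaults, stored) {
  return { ...defaults, ...stored };
}
```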

Streaming SSE (src/chat/index.js → chatStream)

The route sets SSE headers and delegates to chatStream, which:

  1. Calls inference.completeStream() → receives a raw HTTP Response with a readable body.
  2. Reads the body in chunks, buffers across chunk boundaries, splits on \n\n.
  3. For each event line starting with data: , parses the JSON and calls onChunk(data.response).
  4. The [DONE] sentinel (used by some llama-server versions) is explicitly ignored.
  5. After stream ends, saves the assembled full response as an episode (same as non-streaming).

If parsing a chunk fails, the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved and a warning is logged.
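
The buffering and parsing steps can be sketched as an incremental parser. createSseParser is a hypothetical name, and the sketch assumes each event carries a JSON object with a string response field, as described above; it takes already-decoded text, so the caller would decode the body chunks (for example with TextDecoder) before pushing them.

```javascript
// Hypothetical incremental SSE parser; `onChunk` receives each `response` string.
function createSseParser(onChunk, log = console.warn) {
  let buffer = "";
  let fullText = "";

  function handleEvent(rawEvent) {
    for (const line of rawEvent.split("\n")) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice("data: ".length).trim();
      if (payload === "[DONE]") continue; // sentinel, carries no text
      let data;
      try {
        data = JSON.parse(payload);
      } catch (err) {
        log(`Skipping unparseable SSE chunk: ${err.message}`);
        continue; // a bad chunk must not end the stream
      }
      if (data && typeof data.response === "string") {
        fullText += data.response;
        onChunk(data.response);
      }
    }
  }

  return {
    // Feed decoded text; only complete events (terminated by a blank line) are handled.
    push(text) {
      buffer += text;
      const events = buffer.split("\n\n");
      buffer = events.pop(); // keep the trailing partial event for the next chunk
      events.forEach(handleEvent);
    },
    // Full response accumulated so far; an empty string means nothing should be saved.
    text: () => fullText,
  };
}
```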

Fire-and-Forget Tasks

After every successful chat turn:

  • Summarization (services/summarization.js → triggerSummary): checks token threshold → recency guard → calls Ollama → POSTs to memory-service. Only runs if SUMMARIES.THRESHOLD_TOKENS is exceeded AND at least SUMMARIES.MIN_EPISODES_SINCE new episodes have occurred since the last summary.
  • Auto-naming (chat/index.js → autoNameSession): only fires on the first message of a session. Uses temp 0.3, maxTokens=20, prompts for a ≤5-word title.

Both tasks catch all errors and log warnings without surfacing to the client.
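
One way to get this behavior is a small wrapper that starts the task and swallows any failure. fireAndForget is a hypothetical helper, not necessarily how the service structures it; the usage lines in the comment assume both task functions return promises.

```javascript
// Hypothetical wrapper: runs a background task and never lets it reject.
// The returned promise always resolves, so callers may ignore it safely.
function fireAndForget(label, task, log = console.warn) {
  return Promise.resolve()
    .then(task) // synchronous throws inside `task` are also turned into rejections here
    .catch((err) => {
      log(`[${label}] background task failed: ${err.message}`);
    });
}

// Usage inside a chat handler (argument lists are illustrative):
//   fireAndForget("summarization", () => triggerSummary(sessionId));
//   if (isFirstMessage) fireAndForget("auto-naming", () => autoNameSession(sessionId, userMessage));
```

Because the wrapper is not awaited, the chat response is sent without waiting for either task.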

Summarization Recency Guard

src/services/summarization.js reads the episode_range field of the latest existing summary (format: "<startId>-<endId>"). It counts SQLite episodes with id > endId; if fewer than SUMMARIES.MIN_EPISODES_SINCE, it skips. This prevents rapid re-summarization on high-traffic sessions.

When the existing summary's token count exceeds SUMMARIES.MAX_SUMMARY_TOKENS, it is treated as "expired" — a fresh summary is generated instead of an incremental update.
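
The guard logic can be sketched as a pure decision function. The constant values, the tokenCount field, and the "skip" / "incremental" / "fresh" return values are assumptions for illustration; only the episode_range format and the SUMMARIES constant names come from this document.

```javascript
// Placeholder values; the real ones live in the shared SUMMARIES constants.
const SUMMARIES = { THRESHOLD_TOKENS: 2000, MIN_EPISODES_SINCE: 5, MAX_SUMMARY_TOKENS: 800 };

// Parses "<startId>-<endId>"; returns null when the field is missing or malformed.
function parseEpisodeRange(range) {
  const match = /^(\d+)-(\d+)$/.exec(range ?? "");
  return match ? { startId: Number(match[1]), endId: Number(match[2]) } : null;
}

// episodeIds: ids of the session's episodes; latestSummary may be null.
function decideSummary({ totalTokens, episodeIds, latestSummary }) {
  if (totalTokens <= SUMMARIES.THRESHOLD_TOKENS) return "skip"; // token threshold not exceeded
  if (!latestSummary) return "fresh"; // nothing to update incrementally
  const range = parseEpisodeRange(latestSummary.episode_range);
  const newCount = range
    ? episodeIds.filter((id) => id > range.endId).length
    : episodeIds.length;
  if (newCount < SUMMARIES.MIN_EPISODES_SINCE) return "skip"; // recency guard
  // An oversized summary is treated as expired and regenerated from scratch.
  return latestSummary.tokenCount > SUMMARIES.MAX_SUMMARY_TOKENS ? "fresh" : "incremental";
}
```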

Qdrant Calls (Direct, Not Via Memory-Service)

src/services/qdrant.js makes REST calls to Qdrant directly at QDRANT_URL. This bypasses memory-service for semantic search performance. Orchestration fetches episode/entity content from memory-service by ID after getting vector search results from Qdrant.

searchEntities checks projectId !== null && projectId !== undefined before applying the filter — a session with no project skips the filter entirely and searches globally.
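
The filter shapes can be sketched as follows, using Qdrant's must / should filter clauses and the score_threshold search parameter. The helper names are hypothetical, and the payload keys sessionId and projectId are taken from the descriptions in this document.

```javascript
// Hypothetical helpers that build Qdrant search filters and request bodies.
const matchField = (key, value) => ({ key, match: { value } });

// Session scope: one `must` clause. Project scope: one `should` clause per sibling session.
function buildEpisodeFilter(sessionId, projectSessionIds = null) {
  if (projectSessionIds && projectSessionIds.length > 0) {
    return { should: projectSessionIds.map((id) => matchField("sessionId", id)) };
  }
  return { must: [matchField("sessionId", sessionId)] };
}

// Returns undefined when there is no project, so the entity search runs unfiltered.
function buildEntityFilter(projectId) {
  if (projectId !== null && projectId !== undefined) {
    return { must: [matchField("projectId", projectId)] };
  }
  return undefined;
}

// Body for a Qdrant points search request; `filter` is omitted entirely when not set.
function buildSearchBody(vector, { limit, scoreThreshold, filter }) {
  const body = { vector, limit, score_threshold: scoreThreshold, with_payload: true };
  if (filter) body.filter = filter;
  return body;
}
```

The explicit null and undefined comparison matters because a plain truthiness check would also skip the filter for a projectId of 0.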

Graph Service Client (src/services/graph.js)

Thin HTTP client for memory-service graph endpoints. One function:

  • getNeighbors(entityIds[]) — POSTs to memory-service/graph/neighbors with the entity IDs from Qdrant entity search. Returns { nodes, edges }. Throws on non-2xx — caller wraps in try/catch with graceful fallback.
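
The caller-side fallback might look like this sketch. expandEntities is a hypothetical name, and getNeighbors is injected as a parameter so the example does not depend on the real HTTP client; the only assumption about entities is that each has an id.

```javascript
// Hypothetical caller: expands entities into a neighborhood, or degrades to a flat list.
async function expandEntities(entities, getNeighbors, log = console.warn) {
  if (entities.length === 0) return { nodes: [], edges: [] };
  try {
    return await getNeighbors(entities.map((entity) => entity.id));
  } catch (err) {
    // Graph call failed (for example a non-2xx response): keep the matched entities, drop edges.
    log(`Graph expansion failed, using flat entity list: ${err.message}`);
    return { nodes: entities, edges: [] };
  }
}
```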

Models Endpoint

GET /models scans modelsFolderPath for .gguf files and optionally reads a models.json manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.

GET /models/props proxies /props from llama-server and returns {contextWindow, modelAlias}. Returns 503 if llama-server is unreachable.

Health Check

GET /health/services runs parallel fetch calls to all four dependencies (memory-service, embedding-service, inference-service, and Qdrant), each with a 3-second AbortSignal.timeout. Results are returned as an array; the endpoint itself always responds with 2xx, regardless of downstream status.
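
A minimal sketch of the parallel check, assuming Node 18+ (global fetch and AbortSignal.timeout). checkServices, the result fields, and the injectable fetchImpl parameter are illustrative, not the service's actual signature.

```javascript
// Hypothetical helper: probes every service in parallel and never rejects.
// `services` maps a display name to a health URL; `fetchImpl` is injectable for tests.
async function checkServices(services, fetchImpl = fetch, timeoutMs = 3000) {
  return Promise.all(
    Object.entries(services).map(async ([name, url]) => {
      try {
        const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
        return { name, ok: res.ok, status: res.status };
      } catch (err) {
        // Network failures and timeouts become data instead of errors.
        return { name, ok: false, error: err.name };
      }
    })
  );
}
```

Catching inside each mapped promise is what lets the route return a normal response even when every dependency is down.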

Background Model (qwen2.5:3b)

Used for entity/relationship extraction and summarization via Ollama on Mini PC 1. Uses ChatML format (<|im_start|> / <|im_end|>) — not Phi3 format. Use format: 'json' only for structured extraction, never for free-text summarization.
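
For reference, a ChatML prompt and a matching Ollama request body could be built as below. Sending a hand-built prompt with raw: true to Ollama's generate endpoint is an assumption about how the service calls the model; the helper names are hypothetical.

```javascript
// Builds a single-turn ChatML prompt that ends with an open assistant turn.
function toChatML(systemPrompt, userMessage) {
  return [
    `<|im_start|>system\n${systemPrompt}<|im_end|>`,
    `<|im_start|>user\n${userMessage}<|im_end|>`,
    "<|im_start|>assistant\n",
  ].join("\n");
}

// Request body for Ollama's generate endpoint.
// `format: "json"` is added only for structured extraction, never for free-text summaries.
function buildOllamaRequest(prompt, { structured }) {
  const body = { model: "qwen2.5:3b", prompt, raw: true, stream: false };
  if (structured) body.format = "json";
  return body;
}
```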

API Endpoints Quick Reference

| Method | Path | Notes |
| --- | --- | --- |
| GET | /health | Returns service URLs |
| GET | /health/services | Parallel status of all dependencies |
| POST | /chat | Blocking completion |
| POST | /chat/stream | SSE streaming |
| GET/PATCH | /settings | Persistent settings |
| GET | /models | .gguf file scan |
| GET | /models/props | llama-server model info |
| GET | /sessions | Delegates to memory-service |
| GET | /sessions/:sessionId/history | Paginated episodes by external ID |
| PATCH | /sessions/:sessionId | name and/or projectId |
| DELETE | /sessions/:sessionId | |
| GET | /episodes | Delegates; supports q for FTS |
| DELETE | /episodes/:id | Delegates |
| GET/POST/PATCH/DELETE | /projects and /projects/:id | Delegates |
| POST | /summaries/project/:projectId/generate | On-demand; 422 if no data |
| GET | /summaries/project/:projectId/overview | |
| GET | /summaries/session/:sessionId | Resolves external ID first |
| GET | /summaries/project/:projectId | |