storme/nexusAI

Storme-bit 055683424d retrieval fusion

2026-04-27 07:03:46 -07:00

9.2 KiB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

See the root CLAUDE.md for overall architecture, service roles, and the end-to-end chat flow.

Running This Service

npm run orchestration             # From repo root (node src/index.js)
npm -w packages/orchestration-service run dev   # With --watch

Default port: 4000. Depends on memory-service, embedding-service, inference-service, and Qdrant.

Context Assembly (`src/chat/index.js`)

assembleContext(externalId, userMessage) is the core function that builds the inference prompt. Order of operations:

1. Resolve session by externalId (creates it if missing, so every chat call is self-healing).
2. If session has a project_id, load the project and fetch all sibling sessions (via getProjectSessions, hardcoded limit=200).
3. Fetch recentEpisodeLimit recent episodes from memory-service.
4. Embed the user message; search Qdrant EPISODES with scoreThreshold:
   - No project: must: [sessionId == this session]
   - Project: should: [sessionId == s1, sessionId == s2, ...] across all project sessions
   - Dedup against recent episode IDs before including.
5. Run fused episode retrieval via getFusedEpisodes: Qdrant semantic search and FTS5 keyword search run in parallel, both filtered against recentIds, then merged via Reciprocal Rank Fusion (RRF). If keywordWeight is 0, the FTS call is skipped. Returns top semanticLimit episodes by fused score.
6. Embed and search Qdrant ENTITIES (filtered by projectId if in a project). Returns entity IDs alongside payload; the Qdrant point ID equals the SQLite entity ID.
7. Expand matched entities into a 1-hop graph neighborhood via POST /graph/neighbors on the memory-service. Returns { nodes, edges }: the full entity objects plus connecting relationships. Falls back to flat entity list (no edges) if the graph call fails.
8. Build prompt in this fixed order: system prompt → graph context → fused episodes → recent episodes → user message → "Assistant:"

The ordering prioritizes established facts (graph context) and relevant past context (semantic) over pure recency.
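
A minimal sketch of the episode scoping and prompt ordering described above. The helper names `buildEpisodeFilter` and `buildPrompt`, the `sessionId` payload key, and the `User:` prefix are assumptions for illustration; the actual code in src/chat/index.js and src/services/qdrant.js may differ.

```javascript
// Hypothetical helper: scope the Qdrant EPISODES search to one session,
// or to every session in the project when the session belongs to one.
function buildEpisodeFilter(sessionId, projectSessionIds = []) {
  const match = (id) => ({ key: "sessionId", match: { value: id } });
  if (projectSessionIds.length === 0) {
    return { must: [match(sessionId)] };
  }
  return { should: projectSessionIds.map(match) };
}

// Hypothetical helper: join the prompt sections in the fixed order,
// skipping sections that are empty.
function buildPrompt({ systemPrompt, graphContext, fusedEpisodes, recentEpisodes, userMessage }) {
  return [
    systemPrompt,
    graphContext,
    fusedEpisodes.join("\n"),
    recentEpisodes.join("\n"),
    `User: ${userMessage}`, // the "User:" prefix is an assumption
    "Assistant:",
  ]
    .filter((section) => section && section.length > 0)
    .join("\n\n");
}
```

With `must`, a point has to match the single session; with `should`, matching any one of the project's sessions is enough.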

Graph Context Format

formatGraphContext(nodes, edges) in src/chat/index.js formats the neighborhood as:

- Alice (person): software engineer working on NexusAI
  → works_on NexusAI (project)
  → knows Bob (person)
- NexusAI (project): AI assistant framework
- Bob (person): Alice's colleague

Each node shows its notes on the first line. Outbound edges are indented below with → label target (type). Nodes with only inbound edges (neighbors pulled in by traversal) appear without connection lines.
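
The format above could be produced by something like the following sketch. The node fields (`id`, `name`, `type`, `notes`) and edge fields (`sourceId`, `targetId`, `label`) are assumed shapes, not confirmed against the memory-service response, and the function is named `formatGraphContextSketch` to keep it distinct from the real one.

```javascript
// Sketch only: assumes nodes look like { id, name, type, notes }
// and edges look like { sourceId, targetId, label }.
function formatGraphContextSketch(nodes, edges) {
  const byId = new Map(nodes.map((node) => [node.id, node]));
  const lines = [];
  for (const node of nodes) {
    lines.push(`- ${node.name} (${node.type}): ${node.notes}`);
    // Only outbound edges are listed under a node.
    for (const edge of edges) {
      if (edge.sourceId !== node.id) continue;
      const target = byId.get(edge.targetId);
      if (!target) continue; // edge points outside the neighborhood
      lines.push(`  → ${edge.label} ${target.name} (${target.type})`);
    }
  }
  return lines.join("\n");
}
```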

System Prompt Resolution

Priority from highest to lowest:

1. project.system_prompt (stored on the project row in memory-service)
2. settings.systemPrompt (saved in data/settings.json)
3. ORCHESTRATION.SYSTEM_PROMPT (shared constants fallback)
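
The priority chain can be expressed as a single fallback expression. This sketch assumes an unset prompt is stored as `null` or `undefined`; whether the real code also treats an empty string as unset is not stated here. The function name is hypothetical.

```javascript
// Sketch: the first defined value wins, from highest to lowest priority.
function resolveSystemPrompt(project, settings, sharedDefault) {
  return project?.system_prompt ?? settings?.systemPrompt ?? sharedDefault;
}
```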

Settings (`src/config/settings.js`)

On every GET /settings call, settings are loaded from data/settings.json and merged with defaults. PATCH /settings validates each field individually against these constraints:

| Field | Constraint |
| --- | --- |
| `recentEpisodeLimit` | integer, 1–20 |
| `semanticLimit` | integer, 1–20 |
| `scoreThreshold` | number, 0–1 |
| `temperature` | number, 0–2 |
| `repeatPenalty` | number, 1–2 |
| `topP` | number, 0–1 |
| `topK` | integer, 1–100 |
| `modelsFolderPath` | path must exist and be readable |
| `systemPrompt` | string (trimmed); `null` reverts to shared default |

data/settings.json is created on first save. Parent directories are created if missing.
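
A table-driven sketch of the numeric checks. The `RULES` object, the function name, the error strings, and the choice of inclusive bounds are assumptions; only the field names and ranges come from the table above.

```javascript
// Ranges copied from the constraints table; bounds assumed inclusive.
const RULES = {
  recentEpisodeLimit: { integer: true, min: 1, max: 20 },
  semanticLimit: { integer: true, min: 1, max: 20 },
  scoreThreshold: { min: 0, max: 1 },
  temperature: { min: 0, max: 2 },
  repeatPenalty: { min: 1, max: 2 },
  topP: { min: 0, max: 1 },
  topK: { integer: true, min: 1, max: 100 },
};

// Returns null when the value is valid, otherwise an error message.
function validateNumericSetting(field, value) {
  const rule = RULES[field];
  if (!rule) return `unknown field: ${field}`;
  if (typeof value !== "number" || !Number.isFinite(value)) {
    return `${field} must be a number`;
  }
  if (rule.integer && !Number.isInteger(value)) {
    return `${field} must be an integer`;
  }
  if (value < rule.min || value > rule.max) {
    return `${field} must be between ${rule.min} and ${rule.max}`;
  }
  return null;
}
```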

Streaming SSE (`src/chat/index.js` — `chatStream`)

The route sets SSE headers and delegates to chatStream, which:

1. Calls inference.completeStream(), which returns a raw HTTP Response with a readable body.
2. Reads the body in chunks, buffers across chunk boundaries, and splits on \n\n.
3. For each event line starting with `data: `, parses the JSON and calls onChunk(data.response).
4. Explicitly ignores the [DONE] sentinel (used by some llama-server versions).
5. After the stream ends, saves the assembled full response as an episode (same as non-streaming).

If a chunk fails to parse, the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved and a warning is logged.
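
The buffering step is the part that is easiest to get wrong, so here is a sketch of it as a pure function. `parseSseChunk` is a hypothetical name; the real chatStream reads from the Response body and calls onChunk instead of returning an array.

```javascript
// Sketch: feed in the leftover buffer plus a new chunk, get back the
// decoded text pieces and whatever incomplete event remains.
function parseSseChunk(buffer, chunk) {
  const parts = (buffer + chunk).split("\n\n");
  const rest = parts.pop(); // trailing piece may be an incomplete event
  const texts = [];
  for (const event of parts) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice("data: ".length);
      if (payload === "[DONE]") continue; // sentinel, not JSON
      try {
        const data = JSON.parse(payload);
        if (typeof data.response === "string") texts.push(data.response);
      } catch (err) {
        // A bad chunk is logged and skipped; the stream continues.
        console.warn("skipping unparseable SSE chunk:", err.message);
      }
    }
  }
  return { texts, rest };
}
```

The caller carries `rest` into the next call, so an event split across two network chunks is still parsed exactly once.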

Fire-and-Forget Tasks

After every successful chat turn:

Summarization (services/summarization.js → triggerSummary): checks token threshold → recency guard → calls Ollama → POSTs to memory-service. Only runs if SUMMARIES.THRESHOLD_TOKENS is exceeded AND at least SUMMARIES.MIN_EPISODES_SINCE new episodes have occurred since the last summary.
Auto-naming (chat/index.js → autoNameSession): fires only on the first message of a session. Uses temperature 0.3 and maxTokens=20, and prompts for a title of at most five words.

Both tasks catch all errors and log warnings without surfacing them to the client.
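
One way to get this behavior is a small wrapper that starts the task without awaiting it. `fireAndForget` is a hypothetical helper, shown only to illustrate the pattern; the real tasks may handle their errors internally instead.

```javascript
// Sketch: run a background task without awaiting it. Synchronous throws
// and async rejections both end up in the catch handler, so nothing
// propagates to the request that triggered the task.
function fireAndForget(label, task) {
  Promise.resolve()
    .then(task)
    .catch((err) => console.warn(`[${label}] background task failed: ${err.message}`));
}
```

A call site would look like `fireAndForget("summarization", () => triggerSummary(session))`, where the argument passed to triggerSummary is assumed.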

Summarization Recency Guard

src/services/summarization.js reads the episode_range field of the latest existing summary (format: "<startId>-<endId>") and counts SQLite episodes with id > endId. If there are fewer than SUMMARIES.MIN_EPISODES_SINCE, summarization is skipped. This prevents rapid re-summarization on high-traffic sessions.

When the existing summary's token count exceeds SUMMARIES.MAX_SUMMARY_TOKENS, it is treated as "expired" — a fresh summary is generated instead of an incremental update.
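
A sketch of the guard logic, assuming numeric episode IDs and an in-memory ID list in place of the SQLite count. Both function names are hypothetical.

```javascript
// Parse "<startId>-<endId>"; returns null when the field is missing or malformed.
function parseEpisodeRange(range) {
  const match = /^(\d+)-(\d+)$/.exec(range ?? "");
  return match ? { startId: Number(match[1]), endId: Number(match[2]) } : null;
}

// Skip when fewer than minEpisodesSince episodes are newer than the last summary.
function shouldSkipSummary(latestSummary, episodeIds, minEpisodesSince) {
  const range = parseEpisodeRange(latestSummary?.episode_range);
  if (!range) return false; // no usable prior summary, nothing to guard against
  const newer = episodeIds.filter((id) => id > range.endId).length;
  return newer < minEpisodesSince;
}
```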

Qdrant Calls (Direct, Not Via Memory-Service)

src/services/qdrant.js makes REST calls to Qdrant directly at QDRANT_URL. This bypasses memory-service for semantic search performance. Orchestration fetches episode/entity content from memory-service by ID after getting vector search results from Qdrant.

searchEntities checks projectId !== null && projectId !== undefined before applying the filter — a session with no project skips the filter entirely and searches globally.
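
The explicit null/undefined check matters because a truthiness test would also drop a project ID of 0. A sketch, with the `projectId` payload key and the function name assumed:

```javascript
// Sketch: return a Qdrant filter only when the session belongs to a project.
function buildEntityFilter(projectId) {
  if (projectId !== null && projectId !== undefined) {
    return { must: [{ key: "projectId", match: { value: projectId } }] };
  }
  return undefined; // no filter: search entities globally
}
```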

Retrieval Fusion (`src/chat/index.js`)

Three functions handle fusion — all pure or lightly async, all non-critical:

getFTSResults(userMessage, { limit, sessionIds }) — calls memory.searchEpisodes; returns [] and logs a warning on failure
fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit }) — pure RRF implementation. Key guard: FTS-only episodes are only added to the scores Map if contrib > 0 (prevents score-0 bleed-through when keywordWeight: 0)
getFusedEpisodes(userMessage, session, recentIds, projectSessionIds, settings) — orchestrates both paths in Promise.all, applies recentIds filter to FTS results, calls fusion. Short-circuits FTS call entirely if keywordWeight === 0

FTS is scoped to projectSessionIds if in a project, otherwise [session.id] — mirrors Qdrant scoping exactly.

For the RRF formula, weight semantics, and how to enable keyword search, see docs/services/retrieval-fusion.md.
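
For orientation, a weighted RRF merge can be sketched as below. The constant `RRF_K = 60` is the conventional default and is an assumption here, as is the exact scoring expression; docs/services/retrieval-fusion.md is the authority on what the service actually computes. Episodes are assumed to carry an `id` field.

```javascript
const RRF_K = 60; // assumed smoothing constant, not confirmed for this service

// Sketch: each list contributes weight / (RRF_K + rank), with rank starting at 1.
function fuseEpisodeResultsSketch(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit }) {
  const scores = new Map(); // episode id -> { episode, score }
  semanticEps.forEach((episode, index) => {
    scores.set(episode.id, { episode, score: semanticWeight / (RRF_K + index + 1) });
  });
  keywordEps.forEach((episode, index) => {
    const contrib = keywordWeight / (RRF_K + index + 1);
    const existing = scores.get(episode.id);
    if (existing) {
      existing.score += contrib;
    } else if (contrib > 0) {
      // Guard: FTS-only episodes are added only when they carry real score.
      scores.set(episode.id, { episode, score: contrib });
    }
  });
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.episode);
}
```

An episode found by both searches accumulates both contributions, which is why it can outrank the top hit of either list alone.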

Graph Service Client (`src/services/graph.js`)

Thin HTTP client for memory-service graph endpoints. One function:

getNeighbors(entityIds[]) — POSTs to memory-service/graph/neighbors with the entity IDs from Qdrant entity search. Returns { nodes, edges }. Throws on non-2xx — caller wraps in try/catch with graceful fallback.
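
A sketch of the client and the caller's fallback. The `{ entityIds }` request body, the `baseUrl` and `fetchFn` options, and the `expandEntities` wrapper are assumptions made so the example is self-contained; the function is suffixed `Sketch` to keep it distinct from the real client.

```javascript
// Sketch of the thin client: throws on any non-2xx response.
async function getNeighborsSketch(entityIds, { baseUrl, fetchFn = fetch } = {}) {
  const res = await fetchFn(`${baseUrl}/graph/neighbors`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ entityIds }), // body shape is assumed
  });
  if (!res.ok) throw new Error(`graph/neighbors returned ${res.status}`);
  return res.json();
}

// Sketch of the caller: fall back to a flat entity list with no edges.
async function expandEntities(entities, options) {
  try {
    return await getNeighborsSketch(entities.map((entity) => entity.id), options);
  } catch (err) {
    console.warn("graph expansion failed, using flat entity list:", err.message);
    return { nodes: entities, edges: [] };
  }
}
```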

Models Endpoint

GET /models scans modelsFolderPath for .gguf files and optionally reads a models.json manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.

GET /models/props proxies /props from llama-server and returns {contextWindow, modelAlias}. Returns 503 if llama-server is unreachable.

Health Check

GET /health/services runs parallel fetch calls to all four dependent services with a 3-second AbortSignal.timeout each. Results are returned as an array — the endpoint never returns a non-2xx itself regardless of downstream status.
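
The pattern can be sketched with Promise.allSettled, which never rejects, so one slow or dead dependency cannot fail the whole check. The service list shape, the `ok` result field, and the injectable `fetchFn` are assumptions for illustration.

```javascript
// Sketch: probe every service in parallel with a 3-second timeout each.
async function checkServices(services, fetchFn = fetch) {
  const results = await Promise.allSettled(
    services.map(({ url }) => fetchFn(url, { signal: AbortSignal.timeout(3000) }))
  );
  // Rejections (timeouts, refused connections) are reported as not ok.
  return results.map((result, index) => ({
    name: services[index].name,
    ok: result.status === "fulfilled" && Boolean(result.value.ok),
  }));
}
```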

Background Model (qwen2.5:3b)

Used for entity/relationship extraction and summarization via Ollama on Mini PC 1. Uses ChatML format (<|im_start|> / <|im_end|>) — not Phi3 format. Use format: 'json' only for structured extraction, never for free-text summarization.
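
For reference, a ChatML prompt wraps each turn in <|im_start|> / <|im_end|> markers. The sketch below builds one, plus a request body that sets format: 'json' only for structured extraction. The helper names and the `structured` flag are hypothetical, and the body fields beyond `format` are a guess at what the service sends to Ollama.

```javascript
// Sketch: single-turn ChatML prompt, left open for the assistant's reply.
function toChatML(systemPrompt, userMessage) {
  return (
    `<|im_start|>system\n${systemPrompt}<|im_end|>\n` +
    `<|im_start|>user\n${userMessage}<|im_end|>\n` +
    `<|im_start|>assistant\n`
  );
}

// Sketch: request JSON output for extraction, never for free-text summaries.
function buildGenerateBody(prompt, { structured = false } = {}) {
  const body = { model: "qwen2.5:3b", prompt, stream: false };
  if (structured) body.format = "json";
  return body;
}
```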

API Endpoints Quick Reference

| Method | Path | Notes |
| --- | --- | --- |
| GET | `/health` | Returns service URLs |
| GET | `/health/services` | Parallel status of all dependencies |
| POST | `/chat` | Blocking completion |
| POST | `/chat/stream` | SSE streaming |
| GET/PATCH | `/settings` | Persistent settings |
| GET | `/models` | `.gguf` file scan |
| GET | `/models/props` | llama-server model info |
| GET | `/sessions` | Delegates to memory-service |
| GET | `/sessions/:sessionId/history` | Paginated episodes by external ID |
| PATCH | `/sessions/:sessionId` | `name` and/or `projectId` |
| DELETE | `/sessions/:sessionId` | |
| GET | `/episodes` | Delegates; supports `q` for FTS |
| DELETE | `/episodes/:id` | Delegates |
| GET/POST/PATCH/DELETE | `/projects` and `/projects/:id` | Delegates |
| POST | `/summaries/project/:projectId/generate` | On-demand; 422 if no data |
| GET | `/summaries/project/:projectId/overview` | |
| GET | `/summaries/session/:sessionId` | Resolves external ID first |
| GET | `/summaries/project/:projectId` | |
