storme/nexusAI


Storme-bit 1a97b19280 roadmap phase 1 complete

2026-04-27 03:10:39 -07:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

See the root CLAUDE.md for overall architecture, service roles, and the end-to-end chat flow.

Running This Service

npm run orchestration             # From repo root (node src/index.js)
npm -w packages/orchestration-service run dev   # With --watch

Default port: 4000. Depends on memory-service, embedding-service, inference-service, and Qdrant.

Context Assembly (`src/chat/index.js`)

assembleContext(externalId, userMessage) is the core function that builds the inference prompt. Order of operations:

1. Resolve the session by externalId, creating it if missing, so every chat call is self-healing.
2. If the session has a project_id, load the project and fetch all sibling sessions (via getProjectSessions, hardcoded limit=200).
3. Fetch recentEpisodeLimit recent episodes from memory-service.
4. Embed the user message and search Qdrant EPISODES with scoreThreshold:
   - No project: must: [sessionId == this session]
   - Project: should: [sessionId == s1, sessionId == s2, ...] across all project sessions
   - Dedup against recent episode IDs before including.
5. Embed and search Qdrant ENTITIES (filtered by projectId if in a project). Returns entity IDs alongside payload; the Qdrant point ID equals the SQLite entity ID.
6. Expand matched entities into a 1-hop graph neighborhood via POST /graph/neighbors on the memory-service. Returns { nodes, edges }: the full entity objects plus connecting relationships. Falls back to a flat entity list (no edges) if the graph call fails.
7. Build the prompt in this fixed order: system prompt → graph context → semantic episodes → recent episodes → user message → "Assistant:"

The ordering prioritizes established facts (graph context) and relevant past context (semantic) over pure recency.
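The episode filter, dedup, and prompt ordering described above can be sketched as follows. This is a minimal illustration, not the code in src/chat/index.js: the payload key `sessionId` comes from the steps above, while the function names, the `User:` prefix, and the blank-line separator are assumptions.

```javascript
// Sketch only. Function names, the "User:" prefix, and the separator are hypothetical.

// Qdrant filter: `must` for a standalone session, `should` across project sessions.
function buildEpisodeFilter(sessionId, projectSessionIds = []) {
  const bySession = (id) => ({ key: 'sessionId', match: { value: id } });
  if (projectSessionIds.length === 0) {
    return { must: [bySession(sessionId)] };
  }
  return { should: projectSessionIds.map(bySession) };
}

// Drop semantic hits that are already present in the recent-episode list.
function dedupeAgainstRecent(semanticHits, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((e) => e.id));
  return semanticHits.filter((hit) => !recentIds.has(hit.id));
}

// Fixed section order; empty sections are skipped.
function buildPrompt({ systemPrompt, graphContext, semantic, recent, userMessage }) {
  return [systemPrompt, graphContext, semantic, recent, `User: ${userMessage}`, 'Assistant:']
    .filter((section) => typeof section === 'string' && section.length > 0)
    .join('\n\n');
}
```

Keeping the filter builder separate from the search call makes the project and non-project cases easy to compare side by side.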

Graph Context Format

formatGraphContext(nodes, edges) in src/chat/index.js formats the neighborhood as:

- Alice (person): software engineer working on NexusAI
  → works_on NexusAI (project)
  → knows Bob (person)
- NexusAI (project): AI assistant framework
- Bob (person): Alice's colleague

Each node shows its notes on the first line. Outbound edges are indented below with → label target (type). Nodes with only inbound edges (neighbors pulled in by traversal) appear without connection lines.
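A formatter producing the layout above might look like the sketch below. The field names (`id`, `name`, `type`, `notes`, `sourceId`, `targetId`, `label`) are assumptions for illustration; the real node and edge shapes returned by memory-service may differ.

```javascript
// Sketch only: node/edge field names are hypothetical.
function formatGraphContext(nodes, edges) {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const lines = [];
  for (const node of nodes) {
    lines.push(`- ${node.name} (${node.type}): ${node.notes}`);
    for (const edge of edges) {
      if (edge.sourceId !== node.id) continue; // only outbound edges are listed
      const target = byId.get(edge.targetId);
      if (!target) continue; // skip edges that point outside the neighborhood
      lines.push(`  → ${edge.label} ${target.name} (${target.type})`);
    }
  }
  return lines.join('\n');
}
```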

System Prompt Resolution

Priority from highest to lowest:

1. project.system_prompt (stored on the project row in memory-service)
2. settings.systemPrompt (saved in data/settings.json)
3. ORCHESTRATION.SYSTEM_PROMPT (shared constants fallback)
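The priority chain can be expressed in one expression. In this sketch the function name is hypothetical, and it assumes that only `null` or `undefined` counts as "unset"; whether the service also treats an empty string as unset is not stated here.

```javascript
// Sketch only: resolveSystemPrompt is a hypothetical name.
// Assumes null/undefined means "not set"; empty strings are passed through.
function resolveSystemPrompt(project, settings, sharedDefault) {
  return project?.system_prompt ?? settings?.systemPrompt ?? sharedDefault;
}
```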

Settings (`src/config/settings.js`)

On every GET /settings call, settings are loaded from data/settings.json and merged with the defaults. PATCH /settings validates each field individually against these constraints:

| Field | Constraint |
| --- | --- |
| `recentEpisodeLimit` | integer, 1–20 |
| `semanticLimit` | integer, 1–20 |
| `scoreThreshold` | number, 0–1 |
| `temperature` | number, 0–2 |
| `repeatPenalty` | number, 1–2 |
| `topP` | number, 0–1 |
| `topK` | integer, 1–100 |
| `modelsFolderPath` | path must exist and be readable |
| `systemPrompt` | string (trimmed); `null` reverts to shared default |

data/settings.json is created on first save. Parent directories are created if missing.
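As an illustration of per-field validation, the sketch below encodes the numeric constraints from the table as data and checks a PATCH body against them. The names `NUMERIC_RULES` and `validateSettingsPatch` are hypothetical, unknown fields are silently ignored as an assumption, and the `modelsFolderPath` check is left out because it needs filesystem access.

```javascript
// Sketch only: names and the "ignore unknown fields" behavior are assumptions.
const NUMERIC_RULES = {
  recentEpisodeLimit: { min: 1, max: 20, integer: true },
  semanticLimit: { min: 1, max: 20, integer: true },
  scoreThreshold: { min: 0, max: 1 },
  temperature: { min: 0, max: 2 },
  repeatPenalty: { min: 1, max: 2 },
  topP: { min: 0, max: 1 },
  topK: { min: 1, max: 100, integer: true },
};

function validateSettingsPatch(patch) {
  const clean = {};
  const errors = [];
  for (const [field, value] of Object.entries(patch)) {
    const rule = NUMERIC_RULES[field];
    if (rule) {
      const ok = typeof value === 'number' && Number.isFinite(value)
        && value >= rule.min && value <= rule.max
        && (!rule.integer || Number.isInteger(value));
      if (ok) clean[field] = value;
      else errors.push(`${field} must be ${rule.integer ? 'an integer' : 'a number'} between ${rule.min} and ${rule.max}`);
    } else if (field === 'systemPrompt') {
      if (value === null) clean.systemPrompt = null; // null reverts to the shared default
      else if (typeof value === 'string') clean.systemPrompt = value.trim();
      else errors.push('systemPrompt must be a string or null');
    }
    // modelsFolderPath (exists and readable) would be checked against the filesystem here.
  }
  return { clean, errors };
}
```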

Streaming SSE (`src/chat/index.js` — `chatStream`)

The route sets SSE headers and delegates to chatStream, which:

1. Calls inference.completeStream() → receives a raw HTTP Response with a readable body.
2. Reads the body in chunks, buffers across chunk boundaries, splits on \n\n.
3. For each event line starting with data: , parses the JSON and calls onChunk(data.response).
4. The [DONE] sentinel (used by some llama-server versions) is explicitly ignored.
5. After stream ends, saves the assembled full response as an episode (same as non-streaming).

If a chunk parse fails the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved (logged as warning).
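The buffering and parsing steps can be sketched as a small incremental parser. This is an illustration under assumptions rather than the body of chatStream: `createSseParser` and its `push`/`text` methods are hypothetical names, and the caller is assumed to decode each body chunk to a string before pushing it.

```javascript
// Sketch only: createSseParser, push, and text are hypothetical names.
function createSseParser(onChunk) {
  let buffer = '';
  let fullText = '';
  return {
    push(text) {
      buffer += text;
      const events = buffer.split('\n\n');
      buffer = events.pop(); // the last piece may be an incomplete event
      for (const event of events) {
        for (const line of event.split('\n')) {
          if (!line.startsWith('data: ')) continue;
          const payload = line.slice('data: '.length).trim();
          if (payload === '[DONE]') continue; // sentinel, not JSON
          try {
            const data = JSON.parse(payload);
            if (typeof data.response === 'string') {
              fullText += data.response;
              onChunk(data.response);
            }
          } catch (err) {
            // A bad chunk is logged and the stream continues.
            console.warn(`skipping unparseable SSE chunk: ${err.message}`);
          }
        }
      }
    },
    text: () => fullText,
  };
}
```

After the stream ends, `text()` returns the accumulated response; an empty string would correspond to the "no text accumulated" case where no episode is saved.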

Fire-and-Forget Tasks

After every successful chat turn:

- Summarization (services/summarization.js → triggerSummary): checks token threshold → recency guard → calls Ollama → POSTs to memory-service. Only runs if SUMMARIES.THRESHOLD_TOKENS is exceeded AND at least SUMMARIES.MIN_EPISODES_SINCE new episodes have occurred since the last summary.
- Auto-naming (chat/index.js → autoNameSession): only fires on the first message of a session. Uses temp 0.3, maxTokens=20, prompts for a ≤5-word title.

Both tasks catch all errors and log warnings without surfacing to the client.
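One way to get this behavior is a small wrapper that starts the task without awaiting it and swallows any failure into a warning. The helper name `fireAndForget` is an assumption; the repository may inline this pattern instead.

```javascript
// Sketch only: fireAndForget is a hypothetical helper.
function fireAndForget(label, task) {
  // Promise.resolve().then(task) also captures synchronous throws from task.
  Promise.resolve()
    .then(task)
    .catch((err) => console.warn(`[${label}] background task failed: ${err.message}`));
}
```

A call such as `fireAndForget('summarization', () => triggerSummary(sessionId))` returns immediately, so the chat response is never delayed or failed by the background work.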

Summarization Recency Guard

src/services/summarization.js reads the episode_range field of the latest existing summary (format: "<startId>-<endId>"). It counts SQLite episodes with id > endId; if fewer than SUMMARIES.MIN_EPISODES_SINCE, it skips. This prevents rapid re-summarization on high-traffic sessions.

When the existing summary's token count exceeds SUMMARIES.MAX_SUMMARY_TOKENS, it is treated as "expired" — a fresh summary is generated instead of an incremental update.
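The guard and the expiry check can be sketched as pure functions. The `episode_range` format is taken from the description above; the function names, the `token_count` field, and the choice to let the guard pass when no summary exists are assumptions.

```javascript
// Sketch only: function names and the token_count field are hypothetical.
function parseEpisodeRange(range) {
  const match = /^(\d+)-(\d+)$/.exec(range ?? '');
  return match ? { startId: Number(match[1]), endId: Number(match[2]) } : null;
}

// True when enough episodes newer than the last summarized one exist.
function passesRecencyGuard(latestSummary, episodeIds, minEpisodesSince) {
  const range = parseEpisodeRange(latestSummary?.episode_range);
  if (!range) return true; // assumption: no prior summary means nothing to guard against
  const newer = episodeIds.filter((id) => id > range.endId).length;
  return newer >= minEpisodesSince;
}

// An oversized summary is regenerated from scratch instead of updated incrementally.
function isExpired(latestSummary, maxSummaryTokens) {
  return latestSummary.token_count > maxSummaryTokens;
}
```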

Qdrant Calls (Direct, Not Via Memory-Service)

src/services/qdrant.js makes REST calls to Qdrant directly at QDRANT_URL. This bypasses memory-service for semantic search performance. Orchestration fetches episode/entity content from memory-service by ID after getting vector search results from Qdrant.

searchEntities checks projectId !== null && projectId !== undefined before applying the filter — a session with no project skips the filter entirely and searches globally.
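The explicit null/undefined check matters because a plain truthiness test would also skip the filter for a project ID of `0`. A sketch of building the Qdrant search body is shown below; `buildEntitySearchBody` is a hypothetical name, and the payload key `projectId` is inferred from the description above.

```javascript
// Sketch only: buildEntitySearchBody is a hypothetical helper.
function buildEntitySearchBody(vector, { limit, scoreThreshold, projectId }) {
  const body = { vector, limit, score_threshold: scoreThreshold, with_payload: true };
  if (projectId !== null && projectId !== undefined) {
    body.filter = { must: [{ key: 'projectId', match: { value: projectId } }] };
  }
  return body; // with no filter, the search runs globally
}
```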

Graph Service Client (`src/services/graph.js`)

Thin HTTP client for memory-service graph endpoints. One function:

getNeighbors(entityIds[]) — POSTs to memory-service/graph/neighbors with the entity IDs from Qdrant entity search. Returns { nodes, edges }. Throws on non-2xx — caller wraps in try/catch with graceful fallback.
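The throw-then-fallback contract can be sketched like this. The request body shape (`{ entityIds }`), the `baseUrl` and `fetchImpl` options, and the `expandEntities` wrapper are assumptions made for illustration; `fetchImpl` defaults to the global `fetch` available in Node 18 and later.

```javascript
// Sketch only: option names, body shape, and expandEntities are hypothetical.
async function getNeighbors(entityIds, { baseUrl, fetchImpl = fetch }) {
  const res = await fetchImpl(`${baseUrl}/graph/neighbors`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ entityIds }),
  });
  if (!res.ok) throw new Error(`graph/neighbors returned ${res.status}`);
  return res.json(); // expected shape: { nodes, edges }
}

// Caller side: fall back to a flat entity list with no edges on any failure.
async function expandEntities(entities, options) {
  try {
    return await getNeighbors(entities.map((e) => e.id), options);
  } catch (err) {
    console.warn(`graph expansion failed, using flat entity list: ${err.message}`);
    return { nodes: entities, edges: [] };
  }
}
```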

Models Endpoint

GET /models scans modelsFolderPath for .gguf files and optionally reads a models.json manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.

GET /models/props proxies /props from llama-server and returns {contextWindow, modelAlias}. Returns 503 if llama-server is unreachable.
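The merge of scanned files with the optional manifest might look like the sketch below. It takes an already-read directory listing so it stays free of filesystem calls. The output field names, the case-insensitive extension match, and the use of 1024³ bytes per GB are assumptions; the service may use different names or decimal gigabytes.

```javascript
// Sketch only: field names and the GB divisor (1024 ** 3) are assumptions.
function buildModelList(files, manifest = {}) {
  return files
    .filter((f) => f.name.toLowerCase().endsWith('.gguf'))
    .map((f) => ({
      filename: f.name,
      sizeGB: Number((f.sizeBytes / 1024 ** 3).toFixed(2)),
      label: manifest[f.name]?.label ?? f.name, // manifest is keyed by filename
      description: manifest[f.name]?.description ?? '',
    }));
}
```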

Health Check

GET /health/services runs parallel fetch calls to all four dependent services with a 3-second AbortSignal.timeout each. Results are returned as an array — the endpoint never returns a non-2xx itself regardless of downstream status.
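A sketch of the parallel check follows. Each probe catches its own failure, which is why the aggregate never rejects and the endpoint can always answer with a 2xx. The function name, the result fields, and the injectable `fetchImpl` are assumptions; `AbortSignal.timeout` is available in Node 17.3 and later.

```javascript
// Sketch only: checkServices and the result shape are hypothetical.
async function checkServices(services, fetchImpl = fetch) {
  return Promise.all(
    services.map(async ({ name, url }) => {
      try {
        const res = await fetchImpl(url, { signal: AbortSignal.timeout(3000) });
        return { name, ok: res.ok, status: res.status };
      } catch (err) {
        // Timeouts and connection errors become a "down" entry instead of a rejection.
        return { name, ok: false, error: err.message };
      }
    }),
  );
}
```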

Background Model (qwen2.5:3b)

Used for entity/relationship extraction and summarization via Ollama on Mini PC 1. Uses ChatML format (<|im_start|> / <|im_end|>) — not Phi3 format. Use format: 'json' only for structured extraction, never for free-text summarization.
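For reference, a ChatML prompt and the `format` toggle can be sketched as below. The helper names are hypothetical, and whether the service builds the ChatML string itself or lets Ollama apply the model template is not stated here; `model`, `prompt`, `stream`, and `format` are fields of Ollama's generate request.

```javascript
// Sketch only: toChatML and buildOllamaRequest are hypothetical helpers.
function toChatML(systemPrompt, userMessage) {
  return [
    `<|im_start|>system\n${systemPrompt}<|im_end|>`,
    `<|im_start|>user\n${userMessage}<|im_end|>`,
    '<|im_start|>assistant\n',
  ].join('\n');
}

function buildOllamaRequest(prompt, { structured }) {
  const body = { model: 'qwen2.5:3b', prompt, stream: false };
  if (structured) body.format = 'json'; // extraction only, never free-text summaries
  return body;
}
```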

API Endpoints Quick Reference

| Method | Path | Notes |
| --- | --- | --- |
| GET | `/health` | Returns service URLs |
| GET | `/health/services` | Parallel status of all dependencies |
| POST | `/chat` | Blocking completion |
| POST | `/chat/stream` | SSE streaming |
| GET/PATCH | `/settings` | Persistent settings |
| GET | `/models` | `.gguf` file scan |
| GET | `/models/props` | llama-server model info |
| GET | `/sessions` | Delegates to memory-service |
| GET | `/sessions/:sessionId/history` | Paginated episodes by external ID |
| PATCH | `/sessions/:sessionId` | `name` and/or `projectId` |
| DELETE | `/sessions/:sessionId` | |
| GET | `/episodes` | Delegates; supports `q` for FTS |
| DELETE | `/episodes/:id` | Delegates |
| GET/POST/PATCH/DELETE | `/projects` and `/projects/:id` | Delegates |
| POST | `/summaries/project/:projectId/generate` | On-demand; 422 if no data |
| GET | `/summaries/project/:projectId/overview` | |
| GET | `/summaries/session/:sessionId` | Resolves external ID first |
| GET | `/summaries/project/:projectId` | |
