clean up

2026-04-27 00:14:51 -07:00
parent aac0923351
commit 5ad01c6ad8
2 changed files with 232 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,108 @@
 # CLAUDE.md
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 ## Development Commands
 ```bash
 # Start individual services
 npm run memory           # Memory Service (port 3002)
 npm run embedding        # Embedding Service (port 3003)
 npm run inference        # Inference Service (port 3001)
 npm run orchestration    # Orchestration Service (port 4000)
 npm run mini1            # Start memory + embedding concurrently
 # Per-service dev mode (with --watch)
 npm -w packages/<service-name> run dev
 # Chat client
 npm -w packages/chat-client run dev      # Vite dev server (port 5173)
 npm -w packages/chat-client run build    # Production build
 ```
 No test framework or linter is configured.
 ## Architecture Overview
 NexusAI is a **modular AI assistant** with persistent, project-scoped memory. It's a Node.js monorepo (`npm workspaces`) with 4 independent backend services, 1 React frontend, and 1 shared package.
 ### Services
 | Package | Port | Role |
 |---|---|---|
 | `orchestration-service` | 4000 | Central gateway; coordinates all others |
 | `memory-service` | 3002 | SQLite + Qdrant hybrid memory |
 | `embedding-service` | 3003 | Text embeddings via Ollama (`nomic-embed-text`, 768-dim) |
 | `inference-service` | 3001 | LLM inference (Ollama or llama.cpp) |
 | `chat-client` | 5173 | React/Vite frontend |
 | `shared` | — | Constants, env helpers, logger, formatters |
 All inter-service communication is **REST HTTP only** — no message queues or WebSockets.
 ### Chat Request Flow
 1. Client POSTs to orchestration `/chat/stream`
 2. Orchestration resolves the session and fetches **recent episodes** (SQLite)
 3. Embedding computed for user message (embedding-service), then used to search **semantic episodes** (Qdrant vector search) + **entities** (Qdrant, scoped by project)
 4. Prompt assembled: system message → entities → semantic memories → recent episodes → user message
 5. Inference streams response (inference-service)
 6. Episode stored in SQLite + Qdrant (fire-and-forget embedding)
 7. Entity extraction triggered async (qwen2.5:3b via inference-service)
 8. Auto-summarization checked (threshold: 200+ tokens, re-triggers every 5 episodes)
 9. Auto-naming on first message (temp 0.3, 20 tokens max)
 ### Memory Model
 **Dual store — neither works alone:**
 - **SQLite** (`better-sqlite3`, synchronous) — Full content: sessions, episodes, entities, relationships, projects, summaries, FTS5 index
 - **Qdrant** — Vector embeddings for semantic search; IDs used to fetch full content from SQLite afterward
 Orchestration queries Qdrant directly (bypasses memory-service) for performance, then fetches full episode content from memory-service by ID.
 **Project-scoped isolation:** Sessions grouped into projects; Qdrant queries use `should` filter on session IDs to enforce memory boundaries. Non-project sessions share a common pool.
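The session-ID filter can be sketched as a small builder. This is an illustration, not the repository's code: `buildSessionFilter` is a hypothetical name, and the payload key `sessionId` is assumed from the filter description in the orchestration service's context assembly notes. The `must` / `should` / `match.value` shapes follow Qdrant's REST filter syntax.

```javascript
// Hypothetical helper: builds a Qdrant filter that confines an episode
// search to one session, or to every session of a project.
function buildSessionFilter(sessionId, projectSessionIds = []) {
  if (projectSessionIds.length === 0) {
    // No project: only this session's vectors may match.
    return { must: [{ key: 'sessionId', match: { value: sessionId } }] };
  }
  // Project: a point matches if it belongs to any sibling session.
  return {
    should: projectSessionIds.map((id) => ({
      key: 'sessionId',
      match: { value: id },
    })),
  };
}
```

With only `should` clauses present, Qdrant requires at least one of them to match, which is what gives the project-wide OR behaviour.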
 ### Key File Locations
 **Orchestration** (`packages/orchestration-service/src/`):
 - `chat/index.js` — Core prompt building and memory assembly
 - `routes/` — HTTP endpoints: chat, sessions, projects, episodes, models, settings, summaries
 - `services/` — Thin HTTP clients for memory, embedding, inference, and direct Qdrant access
 - `config/settings.js` — Loads/saves `data/settings.json` (user-tunable: model params, thresholds, system prompt)
 **Memory** (`packages/memory-service/src/`):
 - `db/schema.js` — SQLite table definitions (source of truth for data model)
 - `episodic/` — Episode CRUD
 - `semantic/` — Qdrant operations
 - `entities/` — Entity extraction + CRUD
 - `summarization/` — Project summary generation
 **Shared** (`packages/shared/src/`):
 - `config/constants.js` — All tunables (ports, thresholds, model names, vector size)
 - `config/env.js` — `getEnv()` helper with fallback to constants
 - `utils.js` — `parseRow()`, `formatEpisodeText()`, `logger`
 **Frontend** (`packages/chat-client/src/`):
 - `App.jsx` — View router and top-level state (views: home, chat, all-chats, all-projects, project, memory, summaries, settings)
 - `hooks/` — `useChat`, `useSession`, `useModels`, `useProjects`, `useSettings`, `useContextMenu`
 - `api/orchestration.js` — Fetch wrapper for all API calls
 - Vite proxy points to `192.168.0.205:4000` (Mini PC 2 / orchestration)
 ### Configuration
 Each service uses `.env` via `dotenv`, falling back to `packages/shared/src/config/constants.js`. The orchestration service also serves `data/settings.json` to the frontend via `/settings` — this is the single source of truth for user-facing inference parameters and system prompt.
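A minimal sketch of the env-then-constants fallback, under stated assumptions: the real `getEnv()` signature is not shown in this file, so `getEnvWithFallback`, the `DEFAULTS` object, and its keys are hypothetical stand-ins.

```javascript
// Hypothetical defaults standing in for packages/shared/src/config/constants.js.
const DEFAULTS = { ORCHESTRATION_PORT: '4000', MEMORY_PORT: '3002' };

// Returns the environment value when it is set and non-empty,
// otherwise the shared default for that name.
function getEnvWithFallback(name, env = process.env) {
  const value = env[name];
  return value !== undefined && value !== '' ? value : DEFAULTS[name];
}
```

Passing `env` explicitly keeps the helper easy to exercise without touching the real process environment.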
 ### Deployment
 Home lab across 3 nodes, managed with Docker Compose:
 - **Main PC** — RTX A4000 (inference via llama.cpp)
 - **Mini PC 1** — memory + embedding services, Qdrant, Ollama
 - **Mini PC 2** — orchestration + chat client, Caddy reverse proxy + Authelia SSO
 Docker Compose files: `docker-compose.mini1.yml`, `docker-compose.mini2.yml`. All services expose `/health`. Deployment docs: `docs/deployment/homelab.md`.
 ## Key Development Principles
 - **Layer-by-layer validation** — always build and test backend → orchestration → frontend in sequence, curl-testing each layer before proceeding
 - **New orchestration routes require changes in four places**: route file, `orchestration-service/src/index.js`, Caddyfile on Mini PC 2 (`192.168.0.205`), and `vite.config.js` in the chat client
 - **All services read settings on every request** — no restart required for config changes
 - **Backend-first development** — data layer → service endpoints → orchestration proxy → frontend
--- a/packages/orchestration-service/CLAUDE.md
+++ b/packages/orchestration-service/CLAUDE.md
@@ -0,0 +1,124 @@
 # CLAUDE.md
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 See the root [CLAUDE.md](../../CLAUDE.md) for overall architecture, service roles, and the end-to-end chat flow.
 ## Running This Service
 ```bash
 npm run orchestration             # From repo root (node src/index.js)
 npm -w packages/orchestration-service run dev   # With --watch
 ```
 Default port: **4000**. Depends on memory-service, embedding-service, inference-service, and Qdrant.
 ## Context Assembly (`src/chat/index.js`)
 `assembleContext(externalId, userMessage)` is the core function that builds the inference prompt. Order of operations:
 1. Resolve session by `externalId` (creates it if missing — every chat call is self-healing).
 2. If session has a `project_id`, load the project and fetch all sibling sessions (via `getProjectSessions`, hardcoded `limit=200`).
 3. Fetch `recentEpisodeLimit` recent episodes from memory-service.
 4. Embed the user message; search Qdrant EPISODES with `scoreThreshold`:
   - No project: `must: [sessionId == this session]`
   - Project: `should: [sessionId == s1, sessionId == s2, ...]` across all project sessions
   - Dedup against recent episode IDs before including.
 5. Embed and search Qdrant ENTITIES; filter by `projectId` if applicable.
 6. Build prompt in this fixed order: **system prompt → entities → semantic episodes → recent episodes → user message → "Assistant:"**
 The ordering prioritizes established facts (entities) and relevant past context (semantic) over pure recency.
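The fixed ordering and the dedup step might look roughly like this. It is a sketch only: `buildPrompt`, the `{ id, text }` episode shape, and the section labels are assumptions for illustration, not the actual `assembleContext` internals.

```javascript
// Hypothetical sketch of the fixed prompt ordering.
function buildPrompt({ systemPrompt, entities, semanticEpisodes, recentEpisodes, userMessage }) {
  const recentIds = new Set(recentEpisodes.map((e) => e.id));
  // Drop semantic hits that already appear among the recent episodes.
  const semantic = semanticEpisodes.filter((e) => !recentIds.has(e.id));
  const sections = [
    systemPrompt,
    ...entities.map((e) => `Known fact: ${e.text}`),
    ...semantic.map((e) => `Relevant memory: ${e.text}`),
    ...recentEpisodes.map((e) => `Recent: ${e.text}`),
    `User: ${userMessage}`,
    'Assistant:', // trailing cue so the model continues as the assistant
  ];
  return sections.join('\n');
}
```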
 ## System Prompt Resolution
 Priority from highest to lowest:
 1. `project.system_prompt` (stored on the project row in memory-service)
 2. `settings.systemPrompt` (saved in `data/settings.json`)
 3. `ORCHESTRATION.SYSTEM_PROMPT` (shared constants fallback)
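The priority chain above maps naturally onto nullish coalescing. A minimal sketch, assuming `resolveSystemPrompt` as a hypothetical name and a placeholder string in place of the real shared constant:

```javascript
// Placeholder for ORCHESTRATION.SYSTEM_PROMPT from the shared constants.
const SHARED_DEFAULT_PROMPT = 'You are a helpful assistant.';

// Returns the first prompt that is neither null nor undefined,
// checking project, then saved settings, then the shared default.
function resolveSystemPrompt(project, settings) {
  return project?.system_prompt ?? settings?.systemPrompt ?? SHARED_DEFAULT_PROMPT;
}
```

Note that `??` treats an empty string as a set value; whether the real code does the same is not stated here.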
 ## Settings (`src/config/settings.js`)
 Settings are loaded from `data/settings.json` merged with defaults at every `GET /settings` call. `PATCH /settings` validates each field individually with specific constraints:
 | Field | Constraint |
 |---|---|
 | `recentEpisodeLimit` | integer, 1–20 |
 | `semanticLimit` | integer, 1–20 |
 | `scoreThreshold` | number, 0–1 |
 | `temperature` | number, 0–2 |
 | `repeatPenalty` | number, 1–2 |
 | `topP` | number, 0–1 |
 | `topK` | integer, 1–100 |
 | `modelsFolderPath` | path must exist and be readable |
 | `systemPrompt` | string (trimmed); `null` reverts to shared default |
 `data/settings.json` is created on first save. Parent directories are created if missing.
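The numeric rows of the table could be checked with a rule map like the one below. This is a sketch, not the service's validator: `RULES` and `validateNumericSettings` are hypothetical names, and the `modelsFolderPath` and `systemPrompt` checks are left out.

```javascript
// Bounds copied from the constraint table above.
const RULES = {
  recentEpisodeLimit: { integer: true, min: 1, max: 20 },
  semanticLimit: { integer: true, min: 1, max: 20 },
  scoreThreshold: { min: 0, max: 1 },
  temperature: { min: 0, max: 2 },
  repeatPenalty: { min: 1, max: 2 },
  topP: { min: 0, max: 1 },
  topK: { integer: true, min: 1, max: 100 },
};

// Returns the names of the fields in `patch` that violate their rule.
function validateNumericSettings(patch) {
  const errors = [];
  for (const [field, value] of Object.entries(patch)) {
    const rule = RULES[field];
    if (!rule) continue; // non-numeric fields are out of scope for this sketch
    const ok =
      typeof value === 'number' &&
      Number.isFinite(value) &&
      (!rule.integer || Number.isInteger(value)) &&
      value >= rule.min &&
      value <= rule.max;
    if (!ok) errors.push(field);
  }
  return errors;
}
```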
 ## Streaming SSE (`src/chat/index.js` — `chatStream`)
 The route sets SSE headers and delegates to `chatStream`, which:
 1. Calls `inference.completeStream()` → receives a raw HTTP Response with a readable body.
 2. Reads the body in chunks, buffers across chunk boundaries, splits on `\n\n`.
 3. For each event line starting with `data: `, parses the JSON and calls `onChunk(data.response)`.
 4. The `[DONE]` sentinel (used by some llama-server versions) is explicitly ignored.
 5. After stream ends, saves the assembled full response as an episode (same as non-streaming).
 If a chunk fails to parse, the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved and a warning is logged.
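The buffering behaviour described in this section can be sketched as an incremental parser. `createSseParser` is a hypothetical name, and the sketch assumes the body has already been decoded to text; the `data.response` field comes from the steps above.

```javascript
// Hypothetical incremental SSE parser: feed it decoded text chunks.
function createSseParser(onChunk) {
  let buffer = '';
  return function feed(text) {
    buffer += text;
    const events = buffer.split('\n\n');
    buffer = events.pop(); // last piece may be incomplete; keep it for the next chunk
    for (const event of events) {
      for (const line of event.split('\n')) {
        if (!line.startsWith('data: ')) continue;
        const payload = line.slice('data: '.length);
        if (payload === '[DONE]') continue; // sentinel carries no text
        try {
          const data = JSON.parse(payload);
          if (typeof data.response === 'string') onChunk(data.response);
        } catch (err) {
          // A bad chunk is logged and skipped; the stream keeps going.
          console.warn('Skipping unparseable SSE chunk:', err.message);
        }
      }
    }
  };
}
```

Holding back the last split segment is what lets an event that straddles two network chunks be parsed once it is complete.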
 ## Fire-and-Forget Tasks
 After every successful chat turn:
 - **Summarization** (`services/summarization.js` → `triggerSummary`): checks token threshold → recency guard → calls Ollama → POSTs to memory-service. Only runs if `SUMMARIES.THRESHOLD_TOKENS` is exceeded AND at least `SUMMARIES.MIN_EPISODES_SINCE` new episodes have occurred since the last summary.
 - **Auto-naming** (`chat/index.js` → `autoNameSession`): only fires on the first message of a session. Uses temp 0.3, `maxTokens=20`, prompts for a ≤5-word title.
 Both tasks catch all errors and log warnings without surfacing to the client.
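A minimal sketch of the fire-and-forget pattern, assuming a hypothetical `fireAndForget` wrapper (the actual tasks may handle this inline):

```javascript
// Runs a task without awaiting it; failures are logged, never rethrown.
function fireAndForget(label, task) {
  Promise.resolve()
    .then(task) // also catches synchronous throws inside `task`
    .catch((err) => console.warn(`${label} failed:`, err?.message ?? err));
}
```

Starting the chain from `Promise.resolve()` means a task that throws before returning a promise is still caught, so the chat response path is not affected.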
 ## Summarization Recency Guard
 `src/services/summarization.js` reads the `episode_range` field of the latest existing summary (format: `"<startId>-<endId>"`). It counts SQLite episodes with `id > endId`; if fewer than `SUMMARIES.MIN_EPISODES_SINCE`, it skips. This prevents rapid re-summarization on high-traffic sessions.
 When the existing summary's token count exceeds `SUMMARIES.MAX_SUMMARY_TOKENS`, it is treated as "expired" — a fresh summary is generated instead of an incremental update.
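The threshold check and recency guard together can be sketched as a pure decision function. Names and parameter shapes here are hypothetical, episode IDs are assumed to be integers, and running when no earlier summary exists is an assumption rather than something this document states.

```javascript
// Extracts <endId> from an "<startId>-<endId>" range, or null if absent/malformed.
function parseEpisodeRangeEnd(episodeRange) {
  const match = /^(\d+)-(\d+)$/.exec(episodeRange ?? '');
  return match ? Number(match[2]) : null;
}

// Decides whether a new summary should be generated.
function shouldSummarize({ totalTokens, thresholdTokens, latestRange, episodeIds, minEpisodesSince }) {
  if (totalTokens <= thresholdTokens) return false; // threshold must be exceeded
  const endId = parseEpisodeRangeEnd(latestRange);
  if (endId === null) return true; // assumption: no prior summary means run
  const newer = episodeIds.filter((id) => id > endId).length;
  return newer >= minEpisodesSince; // skip when too few new episodes
}
```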
 ## Qdrant Calls (Direct, Not Via Memory-Service)
 `src/services/qdrant.js` makes REST calls to Qdrant directly at `QDRANT_URL`. This bypasses memory-service for semantic search performance. Orchestration fetches episode/entity content from memory-service by ID *after* getting vector search results from Qdrant.
 `searchEntities` checks `projectId !== null && projectId !== undefined` before applying the filter — a session with no project skips the filter entirely and searches globally.
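The explicit null/undefined check matters because a truthiness test would also drop a falsy but valid ID such as `0`. A sketch, with `buildEntityFilter` and the payload key `projectId` as assumptions:

```javascript
// Returns a Qdrant filter for the project, or undefined to search globally.
function buildEntityFilter(projectId) {
  if (projectId !== null && projectId !== undefined) {
    return { must: [{ key: 'projectId', match: { value: projectId } }] };
  }
  return undefined; // no project: omit the filter entirely
}
```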
 ## Models Endpoint
 `GET /models` scans `modelsFolderPath` for `.gguf` files and optionally reads a `models.json` manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.
 `GET /models/props` proxies `/props` from llama-server and returns `{contextWindow, modelAlias}`. Returns 503 if llama-server is unreachable.
 ## Health Check
 `GET /health/services` runs parallel fetch calls to all four dependent services with a 3-second `AbortSignal.timeout` each. Results are returned as an array, and the endpoint itself always responds with a 2xx status, whatever the downstream status.
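One way to get this behaviour is `Promise.allSettled`, which never rejects. The sketch below assumes Node 18+ (global `fetch` and `AbortSignal.timeout`); `checkServices`, the `{ name, url }` shape, and the `'up'` / `'down'` labels are hypothetical.

```javascript
// Probes every service in parallel; a failure or timeout marks it 'down'
// instead of failing the whole check.
async function checkServices(services, fetchImpl = fetch) {
  const results = await Promise.allSettled(
    services.map(({ url }) =>
      fetchImpl(`${url}/health`, { signal: AbortSignal.timeout(3000) })
    )
  );
  return results.map((result, i) => ({
    name: services[i].name,
    status: result.status === 'fulfilled' && result.value.ok ? 'up' : 'down',
  }));
}
```

The injectable `fetchImpl` parameter exists only so the function can be tried out without live services.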
 ## Background Model (qwen2.5:3b)
 Used for entity extraction and summarization via Ollama on Mini PC 1. Uses **ChatML format** (`<|im_start|>` / `<|im_end|>`), not Phi3 format. Use `format: 'json'` only for structured extraction, never for free-text summarization.
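For reference, a ChatML prompt can be assembled as follows. `toChatML` is a hypothetical helper, not a function from this repository; the token layout is the standard ChatML turn format.

```javascript
// Renders { role, content } messages as ChatML and leaves an open
// assistant turn for the model to complete.
function toChatML(messages) {
  const turns = messages.map(
    ({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>`
  );
  return `${turns.join('\n')}\n<|im_start|>assistant\n`;
}
```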
 ## API Endpoints Quick Reference
 | Method | Path | Notes |
 |---|---|---|
 | GET | `/health` | Returns service URLs |
 | GET | `/health/services` | Parallel status of all dependencies |
 | POST | `/chat` | Blocking completion |
 | POST | `/chat/stream` | SSE streaming |
 | GET/PATCH | `/settings` | Persistent settings |
 | GET | `/models` | `.gguf` file scan |
 | GET | `/models/props` | llama-server model info |
 | GET | `/sessions` | Delegates to memory-service |
 | GET | `/sessions/:sessionId/history` | Paginated episodes by external ID |
 | PATCH | `/sessions/:sessionId` | `name` and/or `projectId` |
 | DELETE | `/sessions/:sessionId` | |
 | GET | `/episodes` | Delegates; supports `q` for FTS |
 | DELETE | `/episodes/:id` | Delegates |
 | GET/POST/PATCH/DELETE | `/projects` and `/projects/:id` | Delegates |
 | POST | `/summaries/project/:projectId/generate` | On-demand; 422 if no data |
 | GET | `/summaries/project/:projectId/overview` | |
 | GET | `/summaries/session/:sessionId` | Resolves external ID first |
 | GET | `/summaries/project/:projectId` | |