# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Commands

```bash
# Start individual services
npm run memory          # Memory Service (port 3002)
npm run embedding       # Embedding Service (port 3003)
npm run inference       # Inference Service (port 3001)
npm run orchestration   # Orchestration Service (port 4000)
npm run mini1           # Start memory + embedding concurrently

# Per-service dev mode (with --watch)
npm -w packages/<service-name> run dev

# Chat client
npm -w packages/chat-client run dev    # Vite dev server (port 5173)
npm -w packages/chat-client run build  # Production build
```

No test framework or linter is configured.

## Architecture Overview

NexusAI is a **modular AI assistant** with persistent, project-scoped memory. It's a Node.js monorepo (`npm workspaces`) with 4 independent backend services, 1 React frontend, and 1 shared package.

### Services

| Package | Port | Role |
|---|---|---|
| `orchestration-service` | 4000 | Central gateway; coordinates all others |
| `memory-service` | 3002 | SQLite + Qdrant hybrid memory |
| `embedding-service` | 3003 | Text embeddings via Ollama (`nomic-embed-text`, 768-dim) |
| `inference-service` | 3001 | LLM inference (Ollama or llama.cpp) |
| `chat-client` | 5173 | React/Vite frontend |
| `shared` | — | Constants, env helpers, logger, formatters |

All inter-service communication is **REST HTTP only** — no message queues or WebSockets.

### Chat Request Flow

1. Client POSTs to orchestration `/chat/stream`
2. Orchestration resolves session, fetches **recent episodes** (SQLite) + **semantic episodes** (Qdrant vector search) + **entities** (Qdrant, scoped by project)
3. Embedding computed for user message (embedding-service)
4. Prompt assembled: system message → entities → semantic memories → recent episodes → user message
5. Inference streams response (inference-service)
6. Episode stored in SQLite + Qdrant (fire-and-forget embedding)
7. Entity extraction triggered async (qwen2.5:3b via inference-service)
8. Auto-summarization checked (threshold: 200+ tokens, re-triggers every 5 episodes)
9. Auto-naming on first message (temp 0.3, 20 tokens max)

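The prompt ordering in step 4 can be sketched as follows. This is a minimal illustration, not the code in `chat/index.js`: the function name `assemblePrompt`, the section titles, and the message shape are all assumptions made for the example.

```javascript
// Hypothetical helper: field names and message shapes are assumptions,
// not the actual implementation in chat/index.js.
function assemblePrompt({ systemMessage, entities, semanticMemories, recentEpisodes, userMessage }) {
  // Render a titled bullet list, or null when there is nothing to show.
  const section = (title, items) =>
    items.length ? `${title}:\n${items.map((item) => `- ${item}`).join("\n")}` : null;

  // Order follows step 4: system message, entities, semantic memories,
  // recent episodes, and the user message last.
  const context = [
    section("Known entities", entities),
    section("Relevant memories", semanticMemories),
    section("Recent conversation", recentEpisodes),
  ].filter(Boolean);

  return [
    { role: "system", content: [systemMessage, ...context].join("\n\n") },
    { role: "user", content: userMessage },
  ];
}

const messages = assemblePrompt({
  systemMessage: "You are NexusAI.",
  entities: ["Alice (person)"],
  semanticMemories: ["User prefers concise answers."],
  recentEpisodes: [],
  userMessage: "What did we decide yesterday?",
});
console.log(messages[0].content);
```

Empty sections are dropped rather than rendered as blank headings, so a first message in a new session produces a prompt containing only the system message and the user message.
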
### Memory Model

**Dual store — neither works alone:**
- **SQLite** (`better-sqlite3`, synchronous) — Full content: sessions, episodes, entities, relationships, projects, summaries, FTS5 index
- **Qdrant** — Vector embeddings for semantic search; IDs used to fetch full content from SQLite afterward

Orchestration queries Qdrant directly (bypasses memory-service) for performance, then fetches full episode content from memory-service by ID.

**Project-scoped isolation:** Sessions are grouped into projects; Qdrant queries use a `should` filter on session IDs to enforce memory boundaries. Non-project sessions share a common pool.

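A project-scoped search body might look like the sketch below. The `should` / `key` / `match` structure is Qdrant's standard filter syntax, but the payload key `session_id` and both helper names are assumptions; check the actual payload schema before relying on them. The example vector has three dimensions for brevity, whereas real embeddings here are 768-dimensional.

```javascript
// Sketch only: the payload key "session_id" is an assumed name.
function buildSessionFilter(sessionIds) {
  // Each `should` clause matches one session; a point passes if any clause matches.
  return {
    should: sessionIds.map((id) => ({ key: "session_id", match: { value: id } })),
  };
}

// Request body for a Qdrant points search, limited to one project's sessions.
function buildSearchBody(vector, sessionIds, limit = 5) {
  return { vector, limit, with_payload: true, filter: buildSessionFilter(sessionIds) };
}

const body = buildSearchBody([0.1, 0.2, 0.3], ["sess-a", "sess-b"]);
console.log(JSON.stringify(body.filter));
```
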
### Key File Locations

**Orchestration** (`packages/orchestration-service/src/`):
- `chat/index.js` — Core prompt building and memory assembly
- `routes/` — HTTP endpoints: chat, sessions, projects, episodes, models, settings, summaries
- `services/` — Thin HTTP clients for memory, embedding, inference, and direct Qdrant access
- `config/settings.js` — Loads/saves `data/settings.json` (user-tunable: model params, thresholds, system prompt)

**Memory** (`packages/memory-service/src/`):
- `db/schema.js` — SQLite table definitions (source of truth for data model)
- `episodic/` — Episode CRUD
- `semantic/` — Qdrant operations
- `entities/` — Entity extraction + CRUD
- `summarization/` — Project summary generation

**Shared** (`packages/shared/src/`):
- `config/constants.js` — All tunables (ports, thresholds, model names, vector size)
- `config/env.js` — `getEnv()` helper with fallback to constants
- `utils.js` — `parseRow()`, `formatEpisodeText()`, `logger`

**Frontend** (`packages/chat-client/src/`):
- `App.jsx` — View router and top-level state (views: home, chat, all-chats, all-projects, project, memory, summaries, settings)
- `hooks/` — `useChat`, `useSession`, `useModels`, `useProjects`, `useSettings`, `useContextMenu`
- `api/orchestration.js` — Fetch wrapper for all API calls
- Vite proxy points to `192.168.0.205:4000` (Mini PC 2 / orchestration)

### Configuration

Each service reads `.env` via `dotenv`, falling back to `packages/shared/src/config/constants.js`. The orchestration service also serves `data/settings.json` to the frontend via `/settings` — this is the single source of truth for user-facing inference parameters and the system prompt.

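The environment-then-constants fallback could be sketched like this. The constant names, the error behavior, and the optional `env` parameter are assumptions for illustration; the real `getEnv()` in `config/env.js` may have a different signature.

```javascript
// Assumed shape of the shared constants; the real file may differ.
const constants = { MEMORY_PORT: 3002, EMBEDDING_PORT: 3003 };

// Hypothetical getEnv: prefer the environment, then fall back to constants.
function getEnv(name, env = process.env) {
  const value = env[name];
  if (value !== undefined && value !== "") return value;
  if (name in constants) return constants[name];
  throw new Error(`Missing configuration value: ${name}`);
}

console.log(getEnv("MEMORY_PORT", {}));                        // falls back to the constant
console.log(getEnv("MEMORY_PORT", { MEMORY_PORT: "4002" }));   // environment wins
```

Note that values taken from the environment are strings while constants may be numbers, so callers that need a port number would have to convert explicitly.
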
### Deployment

Home lab across 3 nodes, managed with Docker Compose:
- **Main PC** — RTX A4000 (inference via llama.cpp)
- **Mini PC 1** — memory + embedding services, Qdrant, Ollama
- **Mini PC 2** — orchestration + chat client, Caddy reverse proxy + Authelia SSO

Docker Compose files: `docker-compose.mini1.yml`, `docker-compose.mini2.yml`. All services expose `/health`. Deployment docs: `docs/deployment/homelab.md`.

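Since every service exposes `/health`, a quick check across services might look like the sketch below. Ports come from the services table; the single `host` argument is a simplification (in the home lab the services run on different nodes), and the helper names are hypothetical. It assumes Node 18 or later for the global `fetch`.

```javascript
// Ports come from the services table; the host and service keys are assumptions.
const SERVICE_PORTS = { orchestration: 4000, memory: 3002, embedding: 3003, inference: 3001 };

// Build the /health URL for each backend service on a given host.
function healthUrls(host = "localhost") {
  return Object.entries(SERVICE_PORTS).map(([name, port]) => ({
    name,
    url: `http://${host}:${port}/health`,
  }));
}

// Resolve to an object like { memory: true, inference: false, ... }.
async function checkHealth(host) {
  const results = await Promise.all(
    healthUrls(host).map(async ({ name, url }) => {
      try {
        const res = await fetch(url, { signal: AbortSignal.timeout(2000) });
        return [name, res.ok];
      } catch {
        // Connection refused or timeout counts as unhealthy.
        return [name, false];
      }
    })
  );
  return Object.fromEntries(results);
}
```

Calling `checkHealth("localhost").then(console.log)` prints one boolean per service, which fits the layer-by-layer curl-testing workflow without replacing it.
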
## Key Development Principles

- **Layer-by-layer validation** — always build and test backend → orchestration → frontend in sequence, curl-testing each layer before proceeding
- **New orchestration routes require changes in four places**: route file, `orchestration-service/src/index.js`, Caddyfile on Mini PC 2 (`192.168.0.205`), and `vite.config.js` in the chat client
- **All services read settings on every request** — no restart required for config changes
- **Backend-first development** — data layer → service endpoints → orchestration proxy → frontend
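
Reading settings on every request, rather than caching them at startup, could be sketched as below. The default keys and the `loadSettings` name are assumptions; the real loader lives in `config/settings.js` and may differ.

```javascript
const fs = require("node:fs");

// Assumed defaults; the real keys in data/settings.json may differ.
const DEFAULTS = { temperature: 0.7, systemPrompt: "" };

// Read the file on each call so edits apply to the very next request.
function loadSettings(path = "data/settings.json") {
  try {
    return { ...DEFAULTS, ...JSON.parse(fs.readFileSync(path, "utf8")) };
  } catch {
    // Missing or invalid file: fall back to defaults.
    return { ...DEFAULTS };
  }
}
```

A route handler would call `loadSettings()` at the top of each request. The trade-off is one small synchronous file read per request in exchange for never needing a restart after a settings change.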