# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Commands

```bash
# Start individual services
npm run memory           # Memory Service (port 3002)
npm run embedding        # Embedding Service (port 3003)
npm run inference        # Inference Service (port 3001)
npm run orchestration    # Orchestration Service (port 4000)
npm run mini1            # Start memory + embedding concurrently

# Per-service dev mode (with --watch)
npm -w packages/<service-name> run dev

# Chat client
npm -w packages/chat-client run dev      # Vite dev server (port 5173)
npm -w packages/chat-client run build    # Production build
```

No test framework or linter is configured.

## Architecture Overview

NexusAI is a modular AI assistant with persistent, project-scoped memory. It's a Node.js monorepo (npm workspaces) with 4 independent backend services, 1 React frontend, and 1 shared package.

### Services

| Package | Port | Role |
|---|---|---|
| `orchestration-service` | 4000 | Central gateway; coordinates all other services |
| `memory-service` | 3002 | SQLite + Qdrant hybrid memory |
| `embedding-service` | 3003 | Text embeddings via Ollama (`nomic-embed-text`, 768-dim) |
| `inference-service` | 3001 | LLM inference (Ollama or llama.cpp) |
| `chat-client` | 5173 | React/Vite frontend |
| `shared` | n/a | Constants, env helpers, logger, formatters |

All inter-service communication is REST HTTP only — no message queues or WebSockets.
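Because every hop is plain HTTP, a service client can be a thin wrapper around `fetch`. The following is a minimal sketch, assuming JSON request and response bodies and Node 18+ (global `fetch`); `createServiceClient` and `SERVICE_URLS` are hypothetical names, not the repo's actual `services/` modules, and the localhost URLs simply reuse the ports from the table.

```javascript
// Hypothetical base URLs built from the ports in the Services table (localhost assumed).
const SERVICE_URLS = {
  orchestration: "http://localhost:4000",
  memory: "http://localhost:3002",
  embedding: "http://localhost:3003",
  inference: "http://localhost:3001",
};

// Minimal JSON-over-HTTP client; fetchImpl is injectable so it can be stubbed in tests.
function createServiceClient(baseUrl, fetchImpl = globalThis.fetch) {
  async function request(method, path, body) {
    const res = await fetchImpl(`${baseUrl}${path}`, {
      method,
      headers: body === undefined ? {} : { "Content-Type": "application/json" },
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    if (!res.ok) {
      // Surface non-2xx responses as errors instead of returning partial data.
      throw new Error(`${method} ${baseUrl}${path} failed with HTTP ${res.status}`);
    }
    return res.json();
  }
  return {
    get: (path) => request("GET", path),
    post: (path, body) => request("POST", path, body),
  };
}
```

Keeping `fetchImpl` injectable lets each layer be exercised in isolation, which fits the layer-by-layer validation approach this repo follows.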

### Chat Request Flow

1. Client POSTs to orchestration `/chat/stream`.
2. Orchestration resolves the session and computes an embedding for the user message (embedding-service).
3. Orchestration fetches recent episodes (SQLite), semantically similar episodes (Qdrant vector search on that embedding), and entities (Qdrant, scoped by project).
4. Prompt assembled: system message → entities → semantic memories → recent episodes → user message.
5. Inference streams the response (inference-service).
6. Episode stored in SQLite and Qdrant (the embedding is fire-and-forget).
7. Entity extraction triggered asynchronously (`qwen2.5:3b` via inference-service).
8. Auto-summarization checked (threshold: 200+ tokens; re-triggers every 5 episodes).
9. Auto-naming on the first message (temperature 0.3, max 20 tokens).
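The prompt assembly order described above can be sketched as a pure function. This assumes an OpenAI-style `{ role, content }` message array and plain-string memory items; `buildPrompt` and the section titles are hypothetical, and the real `chat/index.js` may format sections differently (for example via `formatEpisodeText()`, or by replaying recent episodes as user/assistant turns).

```javascript
// Hypothetical sketch: assemble messages in the documented order
// system → entities → semantic memories → recent episodes → user message.
function buildPrompt({ systemPrompt, entities = [], semantic = [], recent = [], userMessage }) {
  const messages = [{ role: "system", content: systemPrompt }];

  // Each memory section becomes one extra system message; empty sections are skipped.
  const sections = [
    ["Known entities", entities],
    ["Relevant memories", semantic],
    ["Recent conversation", recent],
  ];
  for (const [title, items] of sections) {
    if (items.length > 0) {
      messages.push({ role: "system", content: `${title}:\n${items.map((t) => `- ${t}`).join("\n")}` });
    }
  }

  // The user's message always goes last so it is the freshest context for the model.
  messages.push({ role: "user", content: userMessage });
  return messages;
}
```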

### Memory Model

Dual store (neither works alone):

- **SQLite** (better-sqlite3, synchronous): full content for sessions, episodes, entities, relationships, projects, and summaries, plus an FTS5 full-text index.
- **Qdrant**: vector embeddings for semantic search; the returned IDs are used to fetch full content from SQLite afterward.

Orchestration queries Qdrant directly (bypasses memory-service) for performance, then fetches full episode content from memory-service by ID.
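The two-step lookup (vector search in Qdrant, then full content by ID from memory-service) has to keep Qdrant's relevance order, because a fetch by a list of IDs may return rows in any order. A minimal sketch, assuming each Qdrant hit carries `id` and `score` (as Qdrant search results do) and a hypothetical `fetchEpisodesByIds` call that resolves to episode objects with an `id` field:

```javascript
// Hypothetical sketch: merge Qdrant hits with full episode rows, preserving score order.
async function hydrateHits(hits, fetchEpisodesByIds) {
  if (hits.length === 0) return [];
  const episodes = await fetchEpisodesByIds(hits.map((h) => h.id));
  // Qdrant point IDs may be integers or UUID strings; normalize to strings for lookup.
  const byId = new Map(episodes.map((e) => [String(e.id), e]));

  return hits
    .map((h) => {
      const episode = byId.get(String(h.id));
      // Drop hits whose episode no longer exists in SQLite (e.g. deleted but still in Qdrant).
      return episode ? { ...episode, score: h.score } : null;
    })
    .filter(Boolean);
}
```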

**Project-scoped isolation:** sessions are grouped into projects, and Qdrant queries use a `should` filter on session IDs to enforce memory boundaries. Sessions that belong to no project share a common pool.
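A Qdrant `should` clause matches a point when at least one of its conditions holds, so one condition per session ID limits the search to those sessions. A minimal sketch of a request body for Qdrant's REST search endpoint (`POST /collections/{collection}/points/search`), assuming episode points store their session under a payload key named `session_id`; that key name and `buildProjectSearch` are hypothetical. Qdrant's `match: { any: [...] }` condition is a more compact alternative for the same OR-over-values check.

```javascript
// Hypothetical sketch: build a Qdrant search body scoped to one project's sessions.
// "session_id" is an assumed payload key; use whatever key the episodes are stored with.
function buildProjectSearch(vector, sessionIds, limit = 5) {
  if (sessionIds.length === 0) {
    // No sessions means nothing is in scope; skip the search instead of sending an empty filter.
    return null;
  }
  return {
    vector,
    limit,
    with_payload: true,
    filter: {
      // "should" acts as logical OR: a point matches if its session_id equals any listed ID.
      should: sessionIds.map((id) => ({ key: "session_id", match: { value: id } })),
    },
  };
}
```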

### Key File Locations

Orchestration (`packages/orchestration-service/src/`):

- `chat/index.js`: core prompt building and memory assembly
- `routes/`: HTTP endpoints (chat, sessions, projects, episodes, models, settings, summaries)
- `services/`: thin HTTP clients for memory, embedding, inference, and direct Qdrant access
- `config/settings.js`: loads and saves `data/settings.json` (user-tunable model params, thresholds, system prompt)
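Since settings are re-read on every request, a loader in the spirit of `config/settings.js` can simply read the JSON file on each call and merge it over defaults. A minimal CommonJS sketch, assuming `data/settings.json` holds a flat JSON object; `loadSettings` and `saveSettings` are hypothetical names, and the defaults are passed in rather than guessed.

```javascript
const fs = require("node:fs");

// Hypothetical sketch: re-read settings on every call so edits apply without a restart.
function loadSettings(filePath, defaults = {}) {
  try {
    const saved = JSON.parse(fs.readFileSync(filePath, "utf8"));
    // Saved values override defaults; keys missing from the file fall back to defaults.
    return { ...defaults, ...saved };
  } catch (err) {
    // A missing file means "use defaults"; malformed JSON and other I/O errors still surface.
    if (err.code === "ENOENT") return { ...defaults };
    throw err;
  }
}

// Write the full settings object back as pretty-printed JSON.
function saveSettings(filePath, settings) {
  fs.writeFileSync(filePath, JSON.stringify(settings, null, 2));
}
```

Reading a small JSON file per request is cheap, and it trades a little I/O for never having to restart a service after a settings change.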

Memory (`packages/memory-service/src/`):

- `db/schema.js`: SQLite table definitions (source of truth for the data model)
- `episodic/`: episode CRUD
- `semantic/`: Qdrant operations
- `entities/`: entity extraction + CRUD
- `summarization/`: project summary generation
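The auto-summarization check from the chat flow (a 200+ token threshold, then a re-trigger every 5 episodes) can be read as a small predicate. This is one interpretation, assuming the check looks at the total token count so far and the number of episodes since the last summary; `shouldSummarize` and its fields are hypothetical, and the real thresholds are user-tunable through settings.

```javascript
// Hypothetical sketch of the auto-summarization trigger.
// Defaults mirror the documented values: 200+ tokens, then every 5 episodes.
function shouldSummarize(
  { totalTokens, hasSummary, episodesSinceSummary },
  { tokenThreshold = 200, episodeInterval = 5 } = {}
) {
  // Below the token threshold there is not enough conversation to summarize.
  if (totalTokens < tokenThreshold) return false;
  // First summary: trigger as soon as the threshold is crossed.
  if (!hasSummary) return true;
  // Afterwards, refresh once enough new episodes have accumulated.
  return episodesSinceSummary >= episodeInterval;
}
```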

Shared (`packages/shared/src/`):

- `config/constants.js`: all tunables (ports, thresholds, model names, vector size)
- `config/env.js`: `getEnv()` helper with fallback to constants
- `utils.js`: `parseRow()`, `formatEpisodeText()`, logger

Frontend (`packages/chat-client/src/`):

- `App.jsx`: view router and top-level state (views: home, chat, all-chats, all-projects, project, memory, summaries, settings)
- `hooks/`: useChat, useSession, useModels, useProjects, useSettings, useContextMenu
- `api/orchestration.js`: fetch wrapper for all API calls
- The Vite dev proxy points to `192.168.0.205:4000` (orchestration on Mini PC 2)
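The dev-server proxy lives in `vite.config.js`, which is also one of the four places a new orchestration route must be registered. This sketch shows only the proxy portion and leaves out the rest of the real config (such as the React plugin); the path prefixes mirror the orchestration route files and the `/settings` and `/chat/stream` endpoints, but mounting each family under its own top-level prefix is an assumption.

```javascript
// vite.config.js (proxy portion only; path prefixes are assumed, not taken from the repo)
import { defineConfig } from "vite";

const ORCHESTRATION = "http://192.168.0.205:4000"; // Mini PC 2

export default defineConfig({
  server: {
    port: 5173,
    proxy: {
      // Each key is a path prefix forwarded to the orchestration service.
      "/chat": ORCHESTRATION,
      "/sessions": ORCHESTRATION,
      "/projects": ORCHESTRATION,
      "/episodes": ORCHESTRATION,
      "/models": ORCHESTRATION,
      "/settings": ORCHESTRATION,
      "/summaries": ORCHESTRATION,
    },
  },
});
```

With a layout like this, a new route family needs one more prefix entry here in addition to the route file, `index.js`, and the Caddyfile.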

## Configuration

Each service loads `.env` via dotenv and falls back to the defaults in `packages/shared/src/config/constants.js`. The orchestration service also serves `data/settings.json` to the frontend via `/settings`; this file is the single source of truth for user-facing inference parameters and the system prompt.
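The env-then-constants fallback can be sketched as follows. This is an assumption about how `getEnv()` in `config/env.js` behaves, not its actual source; the numeric coercion is a hypothetical design choice so that ports and thresholds stay numbers.

```javascript
// Hypothetical sketch of getEnv(): prefer the environment (populated from .env by dotenv),
// otherwise fall back to the value from constants.js.
function getEnv(name, fallback) {
  const raw = process.env[name];
  if (raw === undefined || raw === "") return fallback;
  // Keep numeric settings (ports, thresholds) numeric when the fallback is a number.
  if (typeof fallback === "number") {
    const parsed = Number(raw);
    return Number.isNaN(parsed) ? fallback : parsed;
  }
  return raw;
}
```

For example, `getEnv("MEMORY_PORT", 3002)` would return 3002 unless a (hypothetical) `MEMORY_PORT` variable is set.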

## Deployment

Home lab across three nodes, managed with Docker Compose:

- **Main PC**: RTX A4000 (inference via llama.cpp)
- **Mini PC 1**: memory + embedding services, Qdrant, Ollama
- **Mini PC 2**: orchestration + chat client, Caddy reverse proxy + Authelia SSO

Docker Compose files: `docker-compose.mini1.yml`, `docker-compose.mini2.yml`. All services expose `/health`. Deployment docs: `docs/deployment/homelab.md`.
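Since every service exposes `/health`, a quick reachability check across the nodes can be scripted. A minimal sketch, assuming Node 18+ (global `fetch` and `AbortSignal.timeout`) and treating any 2xx response as healthy; `checkHealth` is hypothetical, and the caller supplies the base URLs because only the orchestration address appears in this document.

```javascript
// Hypothetical sketch: poll /health on each service and report which respond with 2xx.
async function checkHealth(services, { fetchImpl = globalThis.fetch, timeoutMs = 3000 } = {}) {
  const results = await Promise.allSettled(
    Object.entries(services).map(async ([name, baseUrl]) => {
      const res = await fetchImpl(`${baseUrl}/health`, { signal: AbortSignal.timeout(timeoutMs) });
      return { name, ok: res.ok, status: res.status };
    })
  );
  // allSettled keeps input order, so results[i] corresponds to the i-th service.
  const names = Object.keys(services);
  return results.map((r, i) =>
    r.status === "fulfilled" ? r.value : { name: names[i], ok: false, error: String(r.reason) }
  );
}
```

For example, `checkHealth({ orchestration: "http://192.168.0.205:4000" })` resolves to one status entry per service; unreachable hosts show up as `ok: false` with an error string instead of rejecting the whole check.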

## Key Development Principles

- **Layer-by-layer validation:** always build and test backend → orchestration → frontend in sequence, curl-testing each layer before proceeding.
- **New orchestration routes** require changes in four places: the route file, `orchestration-service/src/index.js`, the Caddyfile on Mini PC 2 (192.168.0.205), and `vite.config.js` in the chat client.
- **Settings are read on every request** by all services, so config changes take effect without a restart.
- **Backend-first development:** data layer → service endpoints → orchestration proxy → frontend.
