# Architecture Overview

NexusAI is a modular, memory-centric AI assistant designed for persistent, context-aware conversations. It separates concerns across independent services that can be evolved and deployed separately.

## Core Design Principles

- **Decoupled layers** — memory, inference, and orchestration are independent of each other
- **Hybrid retrieval** — semantic similarity (Qdrant) combined with structured storage (SQLite) for flexible, ranked context assembly
- **Project-scoped memory** — sessions can be grouped into projects with shared or isolated memory pools
- **Home lab first** — services are distributed across nodes according to available hardware

## Memory Model

Memory is split between SQLite and Qdrant, which always work as a pair:

- **SQLite** — episodic interactions, entities, relationships, summaries, sessions, projects
- **Qdrant** — vector embeddings for semantic similarity search

When recalling memory, Qdrant returns IDs and similarity scores, which are used to fetch full content from SQLite. Neither store works in isolation.

Episode embeddings carry a `{ sessionId, createdAt }` payload in Qdrant, enabling per-session and per-project filtering at search time. See `memory-isolation.md` for how project-scoped retrieval works.

## Hardware Layout

| Node | Address | Role |
|---|---|---|
| Main PC | 192.168.0.79 | Primary inference — RTX A4000 16GB |
| Mini PC 1 | 192.168.0.81 | Memory service, Embedding service, Qdrant, Ollama |
| Mini PC 2 | 192.168.0.205 | Orchestration service, Chat Client, Caddy, Authelia, Gitea |

## Service Communication

All services expose a REST HTTP API. The orchestration service is the single entry point — clients never talk directly to memory or inference services.

```
Client (browser)
  └─► Caddy (HTTPS + Authelia SSO)
        └─► Orchestration (:4000) — Mini PC 2
              ├─► Memory Service (:3002) — Mini PC 1
              │     ├─► SQLite (local file)
              │     └─► Qdrant (:6333) — Mini PC 1
              ├─► Embedding Service (:3003) — Mini PC 1
              │     └─► Ollama (:11434) — Mini PC 1
              ├─► Inference Service (:3001) — Main PC
              │     └─► llama-server (:8080) — Main PC
              └─► Qdrant (:6333) — Mini PC 1 (direct — semantic search)
```

Note: Orchestration queries Qdrant directly for semantic search (bypassing the memory service) but always fetches full episode content from the memory service by ID after the vector search.

## Technology Choices

| Concern | Choice | Reason |
|---|---|---|
| Language | Node.js (CommonJS) | Familiar stack, async I/O suits service architecture |
| Package management | npm workspaces | Monorepo with shared code, no publishing needed |
| Vector store | Qdrant | Mature, Docker-native, excellent Node.js client |
| Relational store | SQLite (better-sqlite3) | Zero-ops, fast, sufficient for single-user scale |
| LLM inference | llama.cpp (`llama-server`) | Maximum GPU utilisation on RTX A4000, OpenAI-compatible API |
| Embeddings | Ollama (`nomic-embed-text`) | Co-located with memory service on Mini PC 1, 768-dim Cosine |
| Reverse proxy | Caddy + Authelia | Automatic HTTPS, SSO/MFA for all exposed services |
| Version control | Gitea (self-hosted) | Code stays on local network |

## Current State

The core four-service architecture is complete and operational. Key capabilities:

- **Hybrid memory retrieval** — recent episodes + semantic search combined into every prompt
- **Entity layer + Knowledge graph** — automatic extraction of named entities and relationships from conversations via qwen2.5:3b. Entities and relationships are stored in SQLite with `mention_count` tracking.
  A graph traversal layer expands Qdrant entity search hits into a 1-hop neighborhood subgraph, injecting structured connected knowledge into every prompt
- **Projects** — sessions grouped with shared or isolated memory pools
- **Auto-naming** — sessions named automatically from first exchange via inference
- **Project-scoped semantic search** — Qdrant filtered by project session IDs
- **Chat client** — view-based UI with sidebar navigation, project views, session management
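
The hybrid retrieval capability listed above, together with the two-step recall described under Memory Model (vector hits supply IDs and scores, full content is then fetched by ID), might look roughly like the sketch below. It is not NexusAI's actual code: the function name `assembleContext`, the `{ id, score }` hit shape, the ordering rule (recent episodes first, then semantic hits by score), and the `Map` standing in for the SQLite lookup are all assumptions made for illustration.

```javascript
'use strict';

// Hypothetical sketch: merge recent episodes with semantic hits, then hydrate by ID.
// `episodeStore` is a plain Map standing in for the SQLite lookup (assumption).
function assembleContext(recentIds, semanticHits, episodeStore, limit = 5) {
  const seen = new Set();
  const ordered = [];

  // Recent episodes come first, in the order given.
  for (const id of recentIds) {
    if (!seen.has(id)) {
      seen.add(id);
      ordered.push({ id, source: 'recent', score: null });
    }
  }

  // Semantic hits follow, highest similarity first; IDs already included are skipped.
  const ranked = [...semanticHits].sort((a, b) => b.score - a.score);
  for (const hit of ranked) {
    if (!seen.has(hit.id)) {
      seen.add(hit.id);
      ordered.push({ id: hit.id, source: 'semantic', score: hit.score });
    }
  }

  // Hydrate: attach full content by ID and drop IDs that have no stored row.
  return ordered
    .filter((entry) => episodeStore.has(entry.id))
    .slice(0, limit)
    .map((entry) => ({ ...entry, content: episodeStore.get(entry.id).content }));
}

// Made-up sample data for illustration only.
const episodeStore = new Map([
  ['ep-1', { content: 'Discussed Caddy setup' }],
  ['ep-2', { content: 'Compared embedding models' }],
  ['ep-3', { content: 'Planned project layout' }],
  ['ep-4', { content: 'Latest exchange' }],
]);
const semanticHits = [
  { id: 'ep-1', score: 0.71 },
  { id: 'ep-3', score: 0.92 },
  { id: 'ep-9', score: 0.88 }, // no stored row, so it is dropped during hydration
  { id: 'ep-2', score: 0.8 },
];
const context = assembleContext(['ep-4', 'ep-3'], semanticHits, episodeStore);
console.log(context.map((entry) => entry.id));
```

Dropping hits that have no stored row reflects the point that neither store works in isolation: a vector hit without matching content contributes nothing to the prompt.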
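
The 1-hop neighborhood expansion mentioned in the knowledge graph bullet can be illustrated with a small in-memory sketch. The `expandOneHop` name and the `{ source, target, type }` relationship shape are hypothetical; the real schema and traversal code are not shown in this document.

```javascript
'use strict';

// Hypothetical sketch: expand seed entity IDs (e.g. from an entity search)
// into a 1-hop subgraph. The relationship shape is an assumption.
function expandOneHop(seedIds, relationships) {
  const seeds = new Set(seedIds);
  const nodes = new Set(seedIds);
  const edges = [];

  for (const rel of relationships) {
    // Keep every edge that touches a seed; its other endpoint joins the subgraph.
    if (seeds.has(rel.source) || seeds.has(rel.target)) {
      edges.push(rel);
      nodes.add(rel.source);
      nodes.add(rel.target);
    }
  }

  return { nodes: [...nodes], edges };
}

// Made-up sample relationships for illustration only.
const relationships = [
  { source: 'alice', target: 'nexusai', type: 'works_on' },
  { source: 'nexusai', target: 'qdrant', type: 'uses' },
  { source: 'qdrant', target: 'docker', type: 'runs_in' },
];
const subgraph = expandOneHop(['nexusai'], relationships);
console.log(subgraph.nodes, subgraph.edges.length);
```

Only edges touching a seed are kept, so `docker` (two hops from `nexusai`) stays out of the subgraph. Limiting expansion to one hop keeps the injected context small.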