Architecture Overview
NexusAI is a modular, memory-centric AI assistant designed for persistent, context-aware conversations. It separates concerns across independent services that can be evolved and deployed separately.
Core Design Principles
- Decoupled layers — memory, inference, and orchestration are independent of each other
- Hybrid retrieval — semantic similarity (Qdrant) combined with structured storage (SQLite) for flexible, ranked context assembly
- Project-scoped memory — sessions can be grouped into projects with shared or isolated memory pools
- Home lab first — services are distributed across nodes according to available hardware
Memory Model
Memory is split between SQLite and Qdrant, which always work as a pair:
- SQLite — episodic interactions, entities, relationships, summaries, sessions, projects
- Qdrant — vector embeddings for semantic similarity search
When recalling memory, Qdrant returns IDs and similarity scores, which are used to fetch full content from SQLite. Neither store works in isolation.
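The two-step recall described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: `hydrateHits` and `fetchEpisodeById` are hypothetical names, and a `Map` stands in for the SQLite lookup so the example runs without a database.

```javascript
// Hypothetical helper: join vector-search hits with full rows from the relational store.
// `hits` mimics a vector search result: [{ id, score }, ...].
// `fetchEpisodeById` stands in for a SQLite lookup and returns undefined when no row exists.
function hydrateHits(hits, fetchEpisodeById) {
  return hits
    .map((hit) => {
      const episode = fetchEpisodeById(hit.id);
      // Skip hits whose row is missing: a vector without content is not usable.
      return episode ? { ...episode, score: hit.score } : null;
    })
    .filter((entry) => entry !== null)
    .sort((a, b) => b.score - a.score); // highest similarity first
}

// In-memory stand-in for the SQLite episodes table (assumption for this demo).
const episodes = new Map([
  ['ep-1', { id: 'ep-1', content: 'User asked about backups.' }],
  ['ep-2', { id: 'ep-2', content: 'User prefers dark mode.' }],
]);

const recalled = hydrateHits(
  [
    { id: 'ep-2', score: 0.71 },
    { id: 'ep-1', score: 0.93 },
    { id: 'ep-9', score: 0.5 }, // orphaned vector with no matching row, dropped
  ],
  (id) => episodes.get(id)
);

console.log(recalled.map((e) => `${e.id} (${e.score})`));
```

Dropping hits that have no matching row is one possible policy. It reflects the point that a vector without its stored content cannot contribute to a prompt.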
Episode embeddings carry a { sessionId, createdAt } payload in Qdrant. This enables per-session filtering at search time and, by matching against the session IDs that belong to a project, per-project filtering as well. See memory-isolation.md for how project-scoped retrieval works.
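As a rough sketch of how the `sessionId` payload could drive filtering, the helper below builds a Qdrant filter object from a list of session IDs. The function name `buildSessionFilter` is hypothetical, and the real logic is documented in memory-isolation.md. The `must`, `match.value`, and `match.any` keys follow Qdrant's filter syntax.

```javascript
// Hypothetical helper: build a Qdrant filter that restricts a search to given sessions.
// The payload key `sessionId` comes from the text above; the rest is illustrative.
function buildSessionFilter(sessionIds) {
  if (!Array.isArray(sessionIds) || sessionIds.length === 0) {
    // Refuse to build an empty filter so a project with no sessions
    // cannot silently fall back to an unfiltered (global) search.
    throw new Error('buildSessionFilter requires at least one session ID');
  }
  if (sessionIds.length === 1) {
    // Per-session search: exact match on one value.
    return { must: [{ key: 'sessionId', match: { value: sessionIds[0] } }] };
  }
  // Per-project search: match any session that belongs to the project.
  return { must: [{ key: 'sessionId', match: { any: sessionIds } }] };
}

const sessionFilter = buildSessionFilter(['s-1']);
const projectFilter = buildSessionFilter(['s-1', 's-2', 's-3']);
console.log(JSON.stringify(projectFilter));
```

The resulting object would be passed as the `filter` field of a Qdrant search request alongside the query vector.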
Hardware Layout
| Node | Address | Role |
|---|---|---|
| Main PC | 192.168.0.79 | Primary inference — RTX A4000 16GB |
| Mini PC 1 | 192.168.0.81 | Memory service, Embedding service, Qdrant, Ollama |
| Mini PC 2 | 192.168.0.205 | Orchestration service, Chat Client, Caddy, Authelia, Gitea |
Service Communication
All services expose a REST HTTP API. The orchestration service is the single entry point — clients never talk directly to memory or inference services.
Client (browser)
  └─► Caddy (HTTPS + Authelia SSO)
        └─► Orchestration (:4000) — Mini PC 2
              ├─► Memory Service (:3002) — Mini PC 1
              │     ├─► SQLite (local file)
              │     └─► Qdrant (:6333) — Mini PC 1
              ├─► Embedding Service (:3003) — Mini PC 1
              │     └─► Ollama (:11434) — Mini PC 1
              ├─► Inference Service (:3001) — Main PC
              │     └─► llama-server (:8080) — Main PC
              └─► Qdrant (:6333) — Mini PC 1 (direct — semantic search)
Note: Orchestration queries Qdrant directly for semantic search (bypassing the memory service) but always fetches full episode content from the memory service by ID after the vector search.
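The topology above could be captured in a small endpoint registry. The sketch below is illustrative only: the hosts and ports are taken from the hardware table and the diagram, while the `SERVICES` object, the `serviceUrl` helper, the plain `http` scheme, and the `/health` path are assumptions made for the example.

```javascript
// Hosts and ports as listed in this document; the registry shape is hypothetical.
const SERVICES = {
  orchestration: { host: '192.168.0.205', port: 4000 }, // Mini PC 2
  memory: { host: '192.168.0.81', port: 3002 }, // Mini PC 1
  embedding: { host: '192.168.0.81', port: 3003 }, // Mini PC 1
  qdrant: { host: '192.168.0.81', port: 6333 }, // Mini PC 1
  ollama: { host: '192.168.0.81', port: 11434 }, // Mini PC 1
  inference: { host: '192.168.0.79', port: 3001 }, // Main PC
  llamaServer: { host: '192.168.0.79', port: 8080 }, // Main PC
};

// Resolve a service name and path to an absolute URL.
// Assumes plain HTTP between services on the local network.
function serviceUrl(name, path = '/') {
  if (!Object.hasOwn(SERVICES, name)) {
    throw new Error(`Unknown service: ${name}`);
  }
  const { host, port } = SERVICES[name];
  return new URL(path, `http://${host}:${port}`).toString();
}

// `/health` is a hypothetical route used only to show the URL shape.
console.log(serviceUrl('memory', '/health'));
console.log(serviceUrl('qdrant'));
```

Keeping the addresses in one place means that moving a service to another node only requires changing a single entry.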
Technology Choices
| Concern | Choice | Reason |
|---|---|---|
| Language | Node.js (CommonJS) | Familiar stack, async I/O suits service architecture |
| Package management | npm workspaces | Monorepo with shared code, no publishing needed |
| Vector store | Qdrant | Mature, Docker-native, excellent Node.js client |
| Relational store | SQLite (better-sqlite3) | Zero-ops, fast, sufficient for single-user scale |
| LLM inference | llama.cpp (llama-server) | Maximum GPU utilisation on RTX A4000, OpenAI-compatible API |
| Embeddings | Ollama (nomic-embed-text) | Co-located with memory service on Mini PC 1, 768-dim Cosine |
| Reverse proxy | Caddy + Authelia | Automatic HTTPS, SSO/MFA for all exposed services |
| Version control | Gitea (self-hosted) | Code stays on local network |
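Because llama-server exposes an OpenAI-compatible API, a request from the inference service might look like the sketch below. The helper name `buildChatRequest`, the default sampling values, and the plain `http` scheme are assumptions. The host and port come from the tables above, and `/v1/chat/completions` is the standard OpenAI-style chat route that llama-server provides.

```javascript
// llama-server address from this document (Main PC, port 8080); http scheme is assumed.
const LLAMA_SERVER_URL = 'http://192.168.0.79:8080';

// Hypothetical helper: build the URL and fetch() options for a chat completion.
// Nothing is sent here, so the example runs without a live server.
function buildChatRequest(messages, { temperature = 0.7, maxTokens = 512 } = {}) {
  return {
    url: `${LLAMA_SERVER_URL}/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        messages, // OpenAI-style [{ role, content }, ...]
        temperature,
        max_tokens: maxTokens,
        stream: false,
      }),
    },
  };
}

const chatRequest = buildChatRequest(
  [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Summarise our last conversation.' },
  ],
  { maxTokens: 256 }
);
console.log(chatRequest.url);
```

With a running server, the request could be sent from an async function using `const res = await fetch(chatRequest.url, chatRequest.init)`, which is available globally in Node.js 18 and later.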
Current State
The core four-service architecture is complete and operational. Key capabilities:
- Hybrid memory retrieval — recent episodes + semantic search combined into every prompt
- Projects — sessions grouped with shared or isolated memory pools
- Auto-naming — sessions named automatically from first exchange via inference
- Project-scoped semantic search — Qdrant filtered by project session IDs
- Chat client — view-based UI with sidebar navigation, project views, session management
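The hybrid retrieval capability listed above, which combines recent episodes with semantic search results, could be approximated as below. This is a sketch under stated assumptions: `assembleContext` is a hypothetical name, `createdAt` is treated as a numeric timestamp, and the merge policy (recent first, deduplicate by ID, then chronological order) is one plausible choice rather than the project's confirmed behaviour.

```javascript
// Hypothetical helper: merge recent and semantically similar episodes into one context list.
// Episodes are assumed to look like { id, createdAt, content } with a numeric createdAt.
function assembleContext(recentEpisodes, semanticEpisodes, limit = 10) {
  const seen = new Set();
  const merged = [];
  // Recent episodes are considered first so conversational continuity
  // is preserved when the limit is tight.
  for (const episode of [...recentEpisodes, ...semanticEpisodes]) {
    if (seen.has(episode.id)) continue; // same episode found by both paths
    seen.add(episode.id);
    merged.push(episode);
    if (merged.length === limit) break;
  }
  // Present the selected episodes oldest first, as they would appear in a transcript.
  return merged.sort((a, b) => a.createdAt - b.createdAt);
}

const recent = [
  { id: 'a', createdAt: 3, content: 'Latest question' },
  { id: 'b', createdAt: 4, content: 'Latest answer' },
];
const semantic = [
  { id: 'b', createdAt: 4, content: 'Latest answer' }, // duplicate of a recent episode
  { id: 'c', createdAt: 1, content: 'Older but relevant note' },
];

const context = assembleContext(recent, semantic);
console.log(context.map((e) => e.id));
```

Deduplicating by ID matters because an episode from the current session can appear in both the recent list and the semantic results.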