Architecture Overview

NexusAI is a modular, memory-centric AI assistant designed for persistent, context-aware conversations. It separates concerns across independent services that can be evolved and deployed separately.

Core Design Principles

Decoupled layers — memory, inference, and orchestration are independent of each other
Hybrid retrieval — semantic similarity (Qdrant) combined with structured storage (SQLite) for flexible, ranked context assembly
Project-scoped memory — sessions can be grouped into projects with shared or isolated memory pools
Home lab first — services are distributed across nodes according to available hardware
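
As a rough illustration of the hybrid retrieval principle, the sketch below merges a list of recent episodes with a list of semantic hits into one de-duplicated context list. The function name `assembleContext`, the field names (`id`, `text`, `score`), and the ordering policy (recent episodes first, then semantic hits by descending score) are assumptions made for this example, not the actual NexusAI implementation.

```javascript
// Hypothetical helper: combine recent episodes and semantic hits into one
// de-duplicated context list. The ordering policy here is an assumption.
function assembleContext(recent, semantic, limit = 6) {
  const seen = new Set();
  const context = [];

  // Recent episodes go first, in the order they were supplied.
  for (const ep of recent) {
    if (!seen.has(ep.id)) {
      seen.add(ep.id);
      context.push({ id: ep.id, text: ep.text, source: "recent" });
    }
  }

  // Semantic hits follow, best score first, skipping anything already included.
  const ranked = [...semantic].sort((a, b) => b.score - a.score);
  for (const hit of ranked) {
    if (!seen.has(hit.id)) {
      seen.add(hit.id);
      context.push({ id: hit.id, text: hit.text, source: "semantic" });
    }
  }

  return context.slice(0, limit);
}
```

De-duplicating by ID matters because a recent episode can also be a strong semantic match, and including it twice would waste prompt space.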

Memory Model

Memory is split between SQLite and Qdrant, which always work as a pair:

SQLite — episodic interactions, entities, relationships, summaries, sessions, projects
Qdrant — vector embeddings for semantic similarity search

When recalling memory, Qdrant returns IDs and similarity scores, which are used to fetch full content from SQLite. Neither store works in isolation.
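
The two-step recall described above can be sketched as a small join between vector hits and relational rows. This is a minimal sketch under stated assumptions: `hydrateHits` and `fetchEpisodesByIds` are hypothetical names, the hit shape `{ id, score }` is a simplification of what a vector search returns, and the SQLite lookup is passed in as a function so the example stays self-contained.

```javascript
// hits: simplified vector search results, assumed shape [{ id, score }].
// fetchEpisodesByIds: hypothetical function that loads full rows by ID
// (in the real system this would be backed by SQLite).
function hydrateHits(hits, fetchEpisodesByIds) {
  const rows = fetchEpisodesByIds(hits.map((h) => h.id));
  const rowsById = new Map(rows.map((row) => [row.id, row]));

  return hits
    .filter((h) => rowsById.has(h.id)) // drop vectors that have no matching row
    .map((h) => ({ ...rowsById.get(h.id), score: h.score }))
    .sort((a, b) => b.score - a.score); // highest similarity first
}
```

Filtering out hits without a matching row keeps the result consistent if the two stores ever drift apart, for example after a row is deleted but its vector is not.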

Episode embeddings carry a { sessionId, createdAt } payload in Qdrant, which enables per-session filtering at search time. Per-project filtering works the same way, by matching the session IDs that belong to the project. See memory-isolation.md for how project-scoped retrieval works.
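
To make the payload filtering concrete, here is a sketch that builds a Qdrant filter object on the `sessionId` payload key. The helper name `buildSessionFilter` is hypothetical. The filter shape follows Qdrant's documented `must` / `match` conditions (`value` for a single match, `any` for a list), but how NexusAI actually constructs its filters is not shown in this document, so treat it as an illustration only.

```javascript
// Hypothetical helper: build a Qdrant payload filter restricting a search
// to one session, or to all sessions of a project.
function buildSessionFilter(sessionIds) {
  if (!Array.isArray(sessionIds) || sessionIds.length === 0) {
    throw new Error("At least one session ID is required");
  }
  // A single ID uses an exact match; several IDs use a match-any condition.
  const match =
    sessionIds.length === 1 ? { value: sessionIds[0] } : { any: sessionIds };
  return { must: [{ key: "sessionId", match }] };
}
```

A project-scoped search would pass every session ID in the project, while a session-scoped search would pass just one.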

Hardware Layout

Node	Address	Role
Main PC	192.168.0.79	Primary inference — RTX A4000 16GB
Mini PC 1	192.168.0.81	Memory service, Embedding service, Qdrant, Ollama
Mini PC 2	192.168.0.205	Orchestration service, Chat Client, Caddy, Authelia, Gitea

Service Communication

All services expose a REST HTTP API. The orchestration service is the single entry point — clients never talk directly to memory or inference services.

Client (browser)
└─► Caddy (HTTPS + Authelia SSO)
    └─► Orchestration (:4000) — Mini PC 2
        ├─► Memory Service (:3002) — Mini PC 1
        │     ├─► SQLite (local file)
        │     └─► Qdrant (:6333) — Mini PC 1
        ├─► Embedding Service (:3003) — Mini PC 1
        │     └─► Ollama (:11434) — Mini PC 1
        ├─► Inference Service (:3001) — Main PC
        │     └─► llama-server (:8080) — Main PC
        └─► Qdrant (:6333) — Mini PC 1 (direct — semantic search)

Note: Orchestration queries Qdrant directly for semantic search (bypassing the memory service) but always fetches full episode content from the memory service by ID after the vector search.
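
The addresses from the Hardware Layout table and the ports from the diagram can be collected into a single lookup, sketched below. The object names (`NODES`, `SERVICES`), the `serviceUrl` helper, and the plain `http://` scheme for internal traffic are assumptions for this example. Only the IP addresses and port numbers come from this document.

```javascript
// Node addresses as listed in the Hardware Layout table.
const NODES = {
  mainPc: "192.168.0.79",
  miniPc1: "192.168.0.81",
  miniPc2: "192.168.0.205",
};

// Service placement and ports as shown in the communication diagram.
const SERVICES = {
  orchestration: { node: "miniPc2", port: 4000 },
  memory: { node: "miniPc1", port: 3002 },
  embedding: { node: "miniPc1", port: 3003 },
  inference: { node: "mainPc", port: 3001 },
  qdrant: { node: "miniPc1", port: 6333 },
  ollama: { node: "miniPc1", port: 11434 },
  llamaServer: { node: "mainPc", port: 8080 },
};

// Hypothetical helper: resolve a service name to its base URL.
// Assumes plain HTTP between services on the local network.
function serviceUrl(name) {
  const svc = SERVICES[name];
  if (!svc) throw new Error(`Unknown service: ${name}`);
  return `http://${NODES[svc.node]}:${svc.port}`;
}
```

Keeping placement in one table like this means moving a service to another node is a one-line change.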

Technology Choices

Concern	Choice	Reason
Language	Node.js (CommonJS)	Familiar stack, async I/O suits service architecture
Package management	npm workspaces	Monorepo with shared code, no publishing needed
Vector store	Qdrant	Mature, Docker-native, excellent Node.js client
Relational store	SQLite (better-sqlite3)	Zero-ops, fast, sufficient for single-user scale
LLM inference	llama.cpp (`llama-server`)	Maximum GPU utilisation on RTX A4000, OpenAI-compatible API
Embeddings	Ollama (`nomic-embed-text`)	Co-located with memory service on Mini PC 1; 768-dimensional vectors, cosine distance
Reverse proxy	Caddy + Authelia	Automatic HTTPS, SSO/MFA for all exposed services
Version control	Gitea (self-hosted)	Code stays on local network
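
The embeddings row names cosine as the comparison metric. In NexusAI that comparison is presumably done by Qdrant during search, but a plain JavaScript version shows what the score means. This is a generic sketch of cosine similarity, and `cosineSimilarity` is a name chosen for this example.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; returns 0 if either vector has zero length.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the score depends only on direction, two embeddings of different magnitude that point the same way still score as identical.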

Current State

The core four-service architecture is complete and operational. Key capabilities:

Hybrid memory retrieval — recent episodes + semantic search combined into every prompt
Entity layer + Knowledge graph — automatic extraction of named entities and relationships from conversations via qwen2.5:3b. Entities and relationships are stored in SQLite with mention_count tracking. A graph traversal layer expands Qdrant entity search hits into a 1-hop neighborhood subgraph, injecting structured connected knowledge into every prompt
Projects — sessions grouped with shared or isolated memory pools
Auto-naming — sessions named automatically from first exchange via inference
Project-scoped semantic search — Qdrant filtered by project session IDs
Chat client — view-based UI with sidebar navigation, project views, session management
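
The 1-hop neighborhood expansion mentioned in the knowledge graph item can be sketched as follows. This is an illustration under assumptions: the function name `expandOneHop` and the relationship fields `from`, `to`, and `type` are hypothetical, and the real traversal layer reads relationships from SQLite rather than from an in-memory array.

```javascript
// seedEntityIds: entity IDs returned by an entity search (the "hits").
// relationships: assumed shape [{ from, to, type }].
// Returns the seeds plus every entity one edge away, and the connecting edges.
function expandOneHop(seedEntityIds, relationships) {
  const seeds = new Set(seedEntityIds);
  const nodes = new Set(seedEntityIds);
  const edges = [];

  for (const rel of relationships) {
    // Keep an edge only if it touches a seed entity on either end.
    if (seeds.has(rel.from) || seeds.has(rel.to)) {
      edges.push(rel);
      nodes.add(rel.from);
      nodes.add(rel.to);
    }
  }

  return { nodes: [...nodes], edges };
}
```

Limiting the expansion to one hop keeps the injected subgraph small: edges between two non-seed entities are never followed.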
