nexusAI/docs/architecture/overview.md
2026-04-27 03:10:39 -07:00

Architecture Overview

NexusAI is a modular, memory-centric AI assistant designed for persistent, context-aware conversations. It separates concerns across independent services that can be evolved and deployed separately.

Core Design Principles

  • Decoupled layers — memory, inference, and orchestration are independent of each other
  • Hybrid retrieval — semantic similarity (Qdrant) combined with structured storage (SQLite) for flexible, ranked context assembly
  • Project-scoped memory — sessions can be grouped into projects with shared or isolated memory pools
  • Home lab first — services are distributed across nodes according to available hardware
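As a rough illustration of the hybrid retrieval principle, the sketch below merges a list of recent episodes with a list of semantic hits into one de-duplicated, ranked context list. The function name `assembleContext`, the field names, and the ranking policy (recent episodes first, then by similarity score) are assumptions made for this example, not the project's actual implementation.

```javascript
// Hypothetical helper: merge recent episodes with semantic hits into one
// ranked, de-duplicated context list. Field names are assumptions.
function assembleContext(recentEpisodes, semanticHits, limit = 5) {
  const byId = new Map();

  // Recent episodes enter with no similarity score yet.
  for (const ep of recentEpisodes) {
    byId.set(ep.id, { id: ep.id, score: 0, recent: true });
  }

  // A semantic hit either adds a new entry or attaches its score to an
  // episode that is already present because it is recent.
  for (const hit of semanticHits) {
    const existing = byId.get(hit.id);
    if (existing) {
      existing.score = Math.max(existing.score, hit.score);
    } else {
      byId.set(hit.id, { id: hit.id, score: hit.score, recent: false });
    }
  }

  // Assumed ranking policy: recent episodes first, then descending score.
  return [...byId.values()]
    .sort((a, b) => (b.recent - a.recent) || (b.score - a.score))
    .slice(0, limit);
}

const context = assembleContext(
  [{ id: 'e1' }, { id: 'e2' }],
  [{ id: 'e2', score: 0.9 }, { id: 'e3', score: 0.8 }, { id: 'e4', score: 0.95 }],
  3
);
console.log(context.map((c) => c.id));
```

An episode that is both recent and semantically relevant appears once and keeps its similarity score, so it is not counted twice against the context limit.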

Memory Model

Memory is split between SQLite and Qdrant, which always work as a pair:

  • SQLite — episodic interactions, entities, relationships, summaries, sessions, projects
  • Qdrant — vector embeddings for semantic similarity search

When recalling memory, Qdrant returns IDs and similarity scores, which are used to fetch full content from SQLite. Neither store works in isolation.
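The two-step recall can be sketched as a join between vector hits and stored rows. To keep the example runnable without a database, a `Map` stands in for the SQLite episodes table. The names `episodeRows`, `vectorHits`, and `hydrateHits`, and the row shape, are hypothetical.

```javascript
// Stand-in for the SQLite episodes table, keyed by episode ID (assumption).
const episodeRows = new Map([
  ['ep-1', { id: 'ep-1', content: 'User asked about Caddy config' }],
  ['ep-2', { id: 'ep-2', content: 'Discussed Qdrant payload filters' }],
]);

// Shaped like a vector search result: IDs and scores only, no content.
const vectorHits = [
  { id: 'ep-2', score: 0.91 },
  { id: 'ep-9', score: 0.88 }, // vector with no matching row
  { id: 'ep-1', score: 0.74 },
];

// Attach full content to each hit, preserving the similarity ranking and
// skipping hits whose row cannot be found.
function hydrateHits(hits, rows) {
  return hits
    .filter((hit) => rows.has(hit.id))
    .map((hit) => ({ ...rows.get(hit.id), score: hit.score }));
}

const recalled = hydrateHits(vectorHits, episodeRows);
console.log(recalled);
```

Dropping hits with no matching row is one possible policy for the case where the two stores disagree; logging or repairing the orphaned vector would be equally reasonable.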

Episode embeddings carry a { sessionId, createdAt } payload in Qdrant. This enables per-session filtering at search time and, by matching on the set of session IDs that belong to a project, per-project filtering. See memory-isolation.md for how project-scoped retrieval works.
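A payload filter over `sessionId` might look like the sketch below. It uses Qdrant's `must` clause with the `match.value` and `match.any` conditions. The helper name `buildSessionFilter` is hypothetical, and the way NexusAI actually constructs its filters is described in memory-isolation.md, not here.

```javascript
// Hypothetical helper: build a Qdrant payload filter that restricts a
// search to one session or to a set of sessions (for example, all
// sessions in a project).
function buildSessionFilter(sessionIds) {
  if (!Array.isArray(sessionIds) || sessionIds.length === 0) {
    throw new Error('sessionIds must be a non-empty array');
  }

  // A single session uses an exact match; several sessions use "any".
  const match = sessionIds.length === 1
    ? { value: sessionIds[0] }
    : { any: sessionIds };

  return { must: [{ key: 'sessionId', match }] };
}

console.log(JSON.stringify(buildSessionFilter(['s-1', 's-2'])));
```

The resulting object would be passed as the `filter` field of a search request, so only points whose payload matches are scored.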

Hardware Layout

| Node | Address | Role |
|---|---|---|
| Main PC | 192.168.0.79 | Primary inference (RTX A4000 16GB) |
| Mini PC 1 | 192.168.0.81 | Memory service, Embedding service, Qdrant, Ollama |
| Mini PC 2 | 192.168.0.205 | Orchestration service, Chat Client, Caddy, Authelia, Gitea |

Service Communication

All services expose a REST HTTP API. The orchestration service is the single entry point — clients never talk directly to memory or inference services.

Client (browser)
└─► Caddy (HTTPS + Authelia SSO)
    └─► Orchestration (:4000) — Mini PC 2
        ├─► Memory Service (:3002) — Mini PC 1
        │     ├─► SQLite (local file)
        │     └─► Qdrant (:6333) — Mini PC 1
        ├─► Embedding Service (:3003) — Mini PC 1
        │     └─► Ollama (:11434) — Mini PC 1
        ├─► Inference Service (:3001) — Main PC
        │     └─► llama-server (:8080) — Main PC
        └─► Qdrant (:6333) — Mini PC 1 (direct — semantic search)

Note: Orchestration queries Qdrant directly for semantic search (bypassing the memory service) but always fetches full episode content from the memory service by ID after the vector search.
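The topology above can be captured as a small lookup table. The addresses and ports are taken from this document. The object layout, the `baseUrl` helper, and the use of plain `http` between nodes on the local network are assumptions for illustration only.

```javascript
// Node addresses as listed in the hardware layout.
const NODES = {
  mainPc: '192.168.0.79',
  miniPc1: '192.168.0.81',
  miniPc2: '192.168.0.205',
};

// Service placement and ports as shown in the service diagram.
const SERVICES = {
  orchestration: { node: 'miniPc2', port: 4000 },
  memory: { node: 'miniPc1', port: 3002 },
  embedding: { node: 'miniPc1', port: 3003 },
  inference: { node: 'mainPc', port: 3001 },
  qdrant: { node: 'miniPc1', port: 6333 },
  ollama: { node: 'miniPc1', port: 11434 },
  llamaServer: { node: 'mainPc', port: 8080 },
};

// Hypothetical helper: resolve a service name to its base URL.
// The http scheme is an assumption; the document does not state it.
function baseUrl(service) {
  const entry = SERVICES[service];
  if (!entry) throw new Error(`Unknown service: ${service}`);
  return `http://${NODES[entry.node]}:${entry.port}`;
}

console.log(baseUrl('memory'));
console.log(baseUrl('inference'));
```

Keeping placement in one table means that moving a service to another node changes a single entry, not every caller.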

Technology Choices

| Concern | Choice | Reason |
|---|---|---|
| Language | Node.js (CommonJS) | Familiar stack, async I/O suits service architecture |
| Package management | npm workspaces | Monorepo with shared code, no publishing needed |
| Vector store | Qdrant | Mature, Docker-native, excellent Node.js client |
| Relational store | SQLite (better-sqlite3) | Zero-ops, fast, sufficient for single-user scale |
| LLM inference | llama.cpp (llama-server) | Maximum GPU utilisation on RTX A4000, OpenAI-compatible API |
| Embeddings | Ollama (nomic-embed-text) | Co-located with memory service on Mini PC 1, 768-dim Cosine |
| Reverse proxy | Caddy + Authelia | Automatic HTTPS, SSO/MFA for all exposed services |
| Version control | Gitea (self-hosted) | Code stays on local network |

Current State

The core four-service architecture is complete and operational. Key capabilities:

  • Hybrid memory retrieval — recent episodes + semantic search combined into every prompt
  • Entity layer + Knowledge graph — automatic extraction of named entities and relationships from conversations via qwen2.5:3b. Entities and relationships are stored in SQLite with mention_count tracking. A graph traversal layer expands Qdrant entity search hits into a 1-hop neighborhood subgraph, injecting structured connected knowledge into every prompt
  • Projects — sessions grouped with shared or isolated memory pools
  • Auto-naming — sessions named automatically from first exchange via inference
  • Project-scoped semantic search — Qdrant filtered by project session IDs
  • Chat client — view-based UI with sidebar navigation, project views, session management
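The 1-hop expansion mentioned under the entity layer can be illustrated with a minimal sketch. Given the entity IDs returned by a search and a list of relationships, it collects every edge that touches a seed entity and the entities at the other end. The function name `expandOneHop` and the `{ source, target, type }` edge shape are assumptions, not the project's schema.

```javascript
// Hypothetical sketch: expand seed entity IDs into their 1-hop
// neighborhood, given relationships shaped as { source, target, type }.
function expandOneHop(seedIds, relationships) {
  const seeds = new Set(seedIds);
  const nodes = new Set(seedIds);
  const edges = [];

  for (const rel of relationships) {
    // Keep an edge only if at least one endpoint is a seed entity.
    if (seeds.has(rel.source) || seeds.has(rel.target)) {
      edges.push(rel);
      nodes.add(rel.source);
      nodes.add(rel.target);
    }
  }

  return { nodes: [...nodes], edges };
}

const relationships = [
  { source: 'A', target: 'B', type: 'uses' },
  { source: 'B', target: 'C', type: 'runs_on' },
  { source: 'C', target: 'D', type: 'hosts' },
];

const subgraph = expandOneHop(['A'], relationships);
console.log(subgraph);
```

Because membership is tested against the original seeds and not against the growing node set, the expansion stops at one hop: with seed A, the edge from B to C is not followed.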