diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md
index 25e54aa..f7d3d89 100644
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -74,6 +74,7 @@ service by ID after the vector search.
 The core four-service architecture is complete and operational. Key capabilities:
 
 - **Hybrid memory retrieval** — recent episodes + semantic search combined into every prompt
+- **Entity layer** — automatic extraction of named entities from conversations via qwen2.5:3b, stored in SQLite and Qdrant, injected into every prompt as structured knowledge
 - **Projects** — sessions grouped with shared or isolated memory pools
 - **Auto-naming** — sessions named automatically from first exchange via inference
 - **Project-scoped semantic search** — Qdrant filtered by project session IDs
diff --git a/docs/services/memory-service.md b/docs/services/memory-service.md
index ccbc5ab..bf88ebd 100644
--- a/docs/services/memory-service.md
+++ b/docs/services/memory-service.md
@@ -28,6 +28,8 @@ service to generate and store a vector in Qdrant.
 | SQLITE_PATH | Yes | — | Path to SQLite database file |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant instance URL |
 | EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
+| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for entity extraction |
+| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for entity extraction |
 
 ## Internal Structure
 
@@ -42,7 +44,8 @@ src/
 ├── semantic/
 │   └── index.js          # Qdrant collection management, upsert, search, delete
 ├── entities/
-│   └── index.js          # Entity + relationship CRUD
+│   ├── index.js          # Entity + relationship CRUD
+│   └── extraction.js     # Automatic entity extraction via qwen2.5:3b on Ollama
 └── index.js              # Express app + all route definitions
 ```
 
@@ -143,6 +146,32 @@ constraints to prevent duplicates:
 - `UNIQUE(from_id, to_id, label)` on relationships
 - `ON DELETE CASCADE` on relationship foreign keys
 
+### Automatic Entity Extraction
+
+After each episode is saved, `extraction.js` automatically extracts named
+entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1).
+This runs **fire-and-forget** — the episode is already saved and returned
+before extraction begins.
+
+**Entity types extracted:** `person`, `place`, `project`, `technology`,
+`concept`, `organization`
+
+The extraction prompt uses ChatML format (native to qwen2.5) and primes the
+response by ending with `[` to steer the model directly into JSON array output.
+A list of already-known entities is injected into the prompt so the model
+reuses existing `(name, type)` pairs rather than creating duplicates with
+different types.
+
+After extraction, each entity is:
+1. Upserted into SQLite via `upsertEntity` — notes are only written if
+   the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents
+   overwriting existing notes with speculative updates)
+2. Embedded via the embedding service and upserted into the `entities`
+   Qdrant collection with `{ name, type, notes }` as payload
+
+The Qdrant payload stores enough information to reconstruct entity context
+at retrieval time without a SQLite roundtrip.
+
 ## Project Delete Behaviour
 
 Deleting a project runs as a transaction — it first nulls out `project_id`
diff --git a/docs/services/orchestration-service.md b/docs/services/orchestration-service.md
index 3807e7e..56a4f98 100644
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -76,17 +76,22 @@ difference is how the inference response is delivered to the client.
    recent episodes. Non-critical — if it fails, pipeline continues with
    recency-only context.
 
-5. **Prompt assembly** — combine system prompt, semantic episodes, recent
-   episodes, and user message.
+5. **Entity search** — reuse the embedded user message vector to query the
+   `entities` Qdrant collection (score threshold 0.6, limit 5). Returns
+   entity payloads (`name`, `type`, `notes`) directly — no SQLite roundtrip
+   needed. Non-critical — if it fails, pipeline continues without entity context.
 
-6. **Inference** — send to inference service. `/chat` awaits full response;
+6. **Prompt assembly** — combine system prompt, entity context, semantic
+   episodes, recent episodes, and user message.
+
+7. **Inference** — send to inference service. `/chat` awaits full response;
    `/chat/stream` pipes SSE chunks to the client.
 
-7. **Episode write** — write the exchange back to memory. Fire-and-forget
+8. **Episode write** — write the exchange back to memory. Fire-and-forget
    for `/chat`; awaited for `/chat/stream` to ensure the full text is
    accumulated before saving.
 
-8. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
+9. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
    inference call with a naming prompt (max 20 tokens, temperature 0.3) and
    write the result back as `session.name`. Fully fire-and-forget.
 
@@ -95,6 +100,10 @@ difference is how the inference response is delivered to the client.
 
 ```
 [System prompt]
+Here is what you know about entities relevant to this conversation:
+- {name} ({type}): {notes}
+... (up to 5 entity results)
+---
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
@@ -110,8 +119,9 @@ User: {current message}
 Assistant:
 ```
 
-Semantic episodes appear before recent episodes so the model sees
-long-range context before the immediate conversation flow.
+Entity context appears first — before episodic memory — because structured
+facts about known entities are the most stable and reliable context. Semantic
+episodes follow, then recent episodes as the immediate conversation flow.
 
 ## SSE Stream Format
 
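The entity block that this change adds to the prompt format could be rendered roughly as follows. This is a minimal sketch, not the orchestration service's actual code: the function name `formatEntityContext`, the choice to omit the section when the entity search returns nothing, and the handling of missing `notes` are all assumptions. Only the header text, the `- {name} ({type}): {notes}` line shape, the limit of 5, and the `---` separator come from the documented format.

```javascript
// Hypothetical helper: renders the entity section of the prompt.
// Header, bullet shape, and `---` separator follow the documented prompt
// format; the function name and edge-case behaviour are assumptions.
function formatEntityContext(entities, limit = 5) {
  if (!Array.isArray(entities) || entities.length === 0) {
    // Assumed: skip the section entirely when there is no entity context,
    // matching the "pipeline continues without entity context" behaviour.
    return '';
  }

  const lines = entities
    .slice(0, limit) // at most `limit` entity results
    .map(({ name, type, notes }) =>
      // Assumed: an entity without notes renders with an empty notes field.
      `- ${name} (${type}): ${notes ?? ''}`.trimEnd()
    );

  return [
    'Here is what you know about entities relevant to this conversation:',
    ...lines,
    '---',
  ].join('\n');
}

// Example usage with a payload shaped like the Qdrant entity payload.
const block = formatEntityContext([
  { name: 'Qdrant', type: 'technology', notes: 'vector store used for semantic search' },
]);
console.log(block);
```

Keeping this formatting in one small pure function would make the prompt layout easy to unit-test separately from the Qdrant query and the inference call.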