documentation updates for entity extraction and summarization

This commit is contained in:
Storme-bit
2026-04-21 03:50:38 -07:00
parent 32365e67f4
commit acda21317b
6 changed files with 540 additions and 107 deletions

View File

@@ -38,7 +38,8 @@ src/
├── db/
│ ├── index.js # SQLite connection + initialization + migrations
│ ├── schema.js # Table definitions, indexes, FTS5, triggers
── projects.js # Project CRUD functions
── projects.js # Project CRUD functions
│ └── summaries.js # Summary CRUD functions
├── episodic/
│ └── index.js # Session + episode CRUD, FTS search, embedding write path
├── semantic/
@@ -51,7 +52,7 @@ src/
## SQLite Schema
Six core tables:
Seven core tables:
- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
- **episodes** — individual exchanges (user message + AI response) tied to a session
@@ -100,12 +101,9 @@ that weren't touched.
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
```
This means saving just `{ notes: "..." }` or `{ system_prompt: "..." }` won't
touch any other field.
## Qdrant / Semantic Layer
Three Qdrant collections are initialized on service startup:
Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
| Collection | Purpose |
|---|---|
@@ -117,9 +115,13 @@ All collections use **768-dimension vectors** with **Cosine similarity**,
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
defined in `@nexusai/shared` — not hardcoded here.
Each collection exposes three operations in `src/semantic/index.js`:
upsert, search (with optional Qdrant filter), and delete. The `wait: true`
flag is used on all writes.
`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
collection that doesn't already exist at startup — all three collections are
guaranteed to exist before any requests are handled, avoiding race conditions
between the first entity embed and an entity search.
Each collection exposes upsert, search (with optional Qdrant filter), and
delete operations. The `wait: true` flag is used on all writes.
## Embedding Write Path
@@ -133,8 +135,7 @@ When a new episode is created:
This step is **fire-and-forget** — if embedding fails, the episode is still
saved and searchable via FTS. The error is logged but not surfaced.
> The Qdrant payload stores `sessionId` (the internal integer ID). This is
> used for per-session and per-project filtering during semantic search. See
> The Qdrant payload stores `sessionId` (the internal integer ID). See
> `memory-isolation.md` for how project-level filtering works.
## Entity Layer
@@ -146,34 +147,20 @@ constraints to prevent duplicates:
- `UNIQUE(from_id, to_id, label)` on relationships
- `ON DELETE CASCADE` on relationship foreign keys
### Automatic Entity Extraction
After each episode is saved, `extraction.js` automatically extracts named
entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1).
This runs **fire-and-forget** — the episode is already saved and returned
before extraction begins.
entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
**Entity types extracted:** `person`, `place`, `project`, `technology`,
`concept`, `organization`
> For full details on the extraction pipeline, prompt format, constrained
> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
The extraction prompt uses ChatML format (native to qwen2.5) and primes the
response by ending with `[` to steer the model directly into JSON array output.
A list of already-known entities is injected into the prompt so the model
reuses existing `(name, type)` pairs rather than creating duplicates with
different types.
## Summaries Layer
After extraction, each entity is:
1. Upserted into SQLite via `upsertEntity` — notes are only written if
the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents
overwriting existing notes with speculative updates)
2. Embedded via the embedding service and upserted into the `entities`
Qdrant collection with `{ name, type, notes, projectId }` as payload —
`projectId` scopes entities to their project for isolated retrieval
Session summaries are generated by `orchestration-service/src/services/summarization.js`
after each episode write and stored here via `POST /summaries`. The memory
service is responsible only for CRUD — generation logic lives in orchestration.
`extractAndStoreEntities` receives `projectId` from `createEpisode`, which
receives it from the episode route, which receives it from orchestration's
`createEpisode` call. This ensures entities are tagged with the correct
project scope at extraction time.
> For full details on trigger conditions, prompt format, cumulative updates,
> and ChatML token stripping, see `summarization.md`.
## Project Delete Behaviour