documentation updates for entity extraction and summarization

2026-04-21 03:50:38 -07:00
parent 32365e67f4
commit acda21317b
6 changed files with 540 additions and 107 deletions
--- a/docs/reference/API-routes.md
+++ b/docs/reference/API-routes.md
@@ -120,6 +120,38 @@ all projects use isolated memory. Returns `201` with the created project object.

 Only provided fields are updated — omitted fields are not touched.

+### Summaries
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /summaries/session/:sessionId | Get all summaries for a session (by external UUID) |
+| GET | /summaries/project/:projectId | Get all summaries for a project |
+
+**GET /summaries/session/:sessionId** — resolves the external UUID to an
+internal session ID, then fetches summaries from the memory service.
+Returns an array of summary objects ordered by `created_at` ascending.
+
+**GET /summaries/project/:projectId** — proxies directly to the memory
+service project summaries endpoint.
+
+**Summary object shape:**
+```json
+{
+  "id": 8,
+  "session_id": 72,
+  "project_id": null,
+  "content": "The user asked about...",
+  "token_count": 579,
+  "episode_range": "246-251",
+  "created_at": 1776766518,
+  "updated_at": 1776766518
+}
+```
+
+> **Proxy requirement:** `/summaries` must be added to both the Caddyfile
+> reverse proxy and the Vite dev proxy config alongside the other route
+> prefixes. See `orchestration-service.md` for the Caddy block pattern.
+
 ### Models

 | Method | Path | Description |
@@ -269,6 +301,29 @@ Both fields are optional. Only provided fields are updated.

 Same request/response shape as orchestration `/projects` above.

+### Summaries
+
+| Method | Path | Description |
+|---|---|---|
+| POST | /summaries | Create a new summary |
+| GET | /sessions/:id/summaries | Get all summaries for a session (internal ID) |
+| GET | /projects/:id/summaries | Get all summaries for a project |
+| PATCH | /summaries/:id | Update a summary (content, tokenCount, episodeRange) |
+| DELETE | /summaries/:id | Delete a summary |
+
+**POST /summaries — body:**
+```json
+{
+  "sessionId": 72,
+  "content": "The user discussed...",
+  "tokenCount": 579,
+  "episodeRange": "246-251"
+}
+```
+`content` is required. Either `sessionId` or `projectId` is required.
+
+**PATCH /summaries/:id — body:** any subset of `content`, `tokenCount`, `episodeRange`.
+
 ### Entities

 | Method | Path | Description |
--- a/docs/services/entity-extraction.md
+++ b/docs/services/entity-extraction.md
@@ -0,0 +1,178 @@
+# Memory Service
+
+**Package:** `@nexusai/memory-service`  
+**Location:** `packages/memory-service`  
+**Deployed on:** Mini PC 1 (192.168.0.81)  
+**Port:** 3002
+
+## Purpose
+
+Responsible for all reading and writing of long-term memory. Acts as the
+sole interface to both SQLite and Qdrant — no other service accesses these
+stores directly. On episode creation, automatically calls the embedding
+service to generate and store a vector in Qdrant.
+
+## Dependencies
+
+- `express` — HTTP API
+- `better-sqlite3` — SQLite driver
+- `@qdrant/js-client-rest` — Qdrant vector store client
+- `dotenv` — environment variable loading
+- `@nexusai/shared` — shared utilities and constants
+
+## Environment Variables
+
+| Variable | Required | Default | Description |
+|---|---|---|---|
+| PORT | No | 3002 | Port to listen on |
+| SQLITE_PATH | Yes | — | Path to SQLite database file |
+| QDRANT_URL | No | http://localhost:6333 | Qdrant instance URL |
+| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
+| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for entity extraction |
+| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for entity extraction |
+
+## Internal Structure
+
+```
+src/
+├── db/
+│   ├── index.js       # SQLite connection + initialization + migrations
+│   ├── schema.js      # Table definitions, indexes, FTS5, triggers
+│   ├── projects.js    # Project CRUD functions
+│   └── summaries.js   # Summary CRUD functions
+├── episodic/
+│   └── index.js       # Session + episode CRUD, FTS search, embedding write path
+├── semantic/
+│   └── index.js       # Qdrant collection management, upsert, search, delete
+├── entities/
+│   ├── index.js       # Entity + relationship CRUD
+│   └── extraction.js  # Automatic entity extraction via qwen2.5:3b on Ollama
+└── index.js           # Express app + all route definitions
+```
+
+## SQLite Schema
+
+Seven core tables:
+
+- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
+- **episodes** — individual exchanges (user message + AI response) tied to a session
+- **entities** — named things the system learns about (people, places, concepts)
+- **relationships** — directional labeled links between entities
+- **summaries** — condensed episode groups for efficient context retrieval
+- **projects** — named groupings of sessions with `name`, `description`, `colour`, `icon`, `isolated`, `notes`, `system_prompt`
+
+### Migrations
+
+Schema changes that cannot use `CREATE TABLE IF NOT EXISTS` are applied as
+idempotent migrations in `db/index.js` at startup:
+
+```js
+try { db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`); } catch {}
+try { db.exec(`ALTER TABLE sessions ADD COLUMN project_id INTEGER REFERENCES projects(id)`); } catch {}
+try { db.exec(`CREATE INDEX IF NOT EXISTS idx_sessions_project ON sessions(project_id)`); } catch {}
+try { db.exec(`ALTER TABLE projects ADD COLUMN isolated INTEGER NOT NULL DEFAULT 0`); } catch {}
+try { db.exec(`ALTER TABLE projects ADD COLUMN notes TEXT`); } catch {}
+try { db.exec(`ALTER TABLE projects ADD COLUMN system_prompt TEXT`); } catch {}
+```
+
+New migrations are always appended here. Never modify the schema file for
+existing tables: `CREATE TABLE IF NOT EXISTS` skips a table that already exists, and `ALTER TABLE` has no `IF NOT EXISTS` form.
+
+### FTS5 Full-Text Search
+
+An `episodes_fts` virtual table enables keyword search across all episodes.
+Three triggers (`episodes_fts_insert`, `episodes_fts_update`, `episodes_fts_delete`)
+keep the FTS index automatically in sync with the episodes table.
+
+### SQLite Configuration
+
+- `journal_mode = WAL` — non-blocking reads during writes
+- `foreign_keys = ON` — enforces referential integrity and cascade deletes
+- PRAGMAs set via `db.pragma()`, not `db.exec()`
+
+### Dynamic Updates
+
+Both `updateSession` and `updateProject` build their `SET` clause dynamically
+from only the fields passed, which prevents partial updates from overwriting
+fields that weren't touched.
+
+`updateProject` allowlist:
+```js
+const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
+```
+
+## Qdrant / Semantic Layer
+
+Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
+
+| Collection | Purpose |
+|---|---|
+| `episodes` | Embeddings for individual conversation exchanges |
+| `entities` | Embeddings for named entities |
+| `summaries` | Embeddings for condensed episode summaries |
+
+All collections use **768-dimension vectors** with **Cosine similarity**,
+matching `nomic-embed-text` via Ollama. Vector size and distance metric are
+defined in `@nexusai/shared` — not hardcoded here.
+
+`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
+collection that doesn't already exist at startup — all three collections are
+guaranteed to exist before any requests are handled, avoiding race conditions
+between the first entity embed and an entity search.
+
+Each collection exposes upsert, search (with optional Qdrant filter), and
+delete operations. The `wait: true` flag is used on all writes.
+
+## Embedding Write Path
+
+When a new episode is created:
+
+1. Episode saved to SQLite synchronously — response returned immediately
+2. User message + AI response combined: `User: ...\nAssistant: ...`
+3. Text sent to embedding service (`POST /embed`)
+4. Vector upserted into `episodes` Qdrant collection with payload `{ sessionId, createdAt }`
+
+This step is **fire-and-forget** — if embedding fails, the episode is still
+saved and searchable via FTS. The error is logged but not surfaced.
+
+> The Qdrant payload stores `sessionId` (the internal integer ID). See
+> `memory-isolation.md` for how project-level filtering works.
+
+## Entity Layer
+
+Entities and relationships use upsert semantics with composite unique
+constraints to prevent duplicates:
+
+- `UNIQUE(name, type)` on entities
+- `UNIQUE(from_id, to_id, label)` on relationships
+- `ON DELETE CASCADE` on relationship foreign keys
+
+After each episode is saved, `extraction.js` automatically extracts named
+entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
+
+> For full details on the extraction pipeline, prompt format, constrained
+> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
+
+## Summaries Layer
+
+Session summaries are generated by `orchestration-service/src/services/summarization.js`
+after each episode write and stored here via `POST /summaries`. The memory
+service is responsible only for CRUD — generation logic lives in orchestration.
+
+> For full details on trigger conditions, prompt format, cumulative updates,
+> and ChatML token stripping, see `summarization.md`.
+
+## Project Delete Behaviour
+
+Deleting a project runs as a transaction — it first nulls out `project_id`
+on all assigned sessions, then deletes the project. This avoids a foreign
+key constraint failure since `sessions.project_id` has no `ON DELETE` rule:
+
+```js
+const doDelete = db.transaction(() => {
+  db.prepare(`UPDATE sessions SET project_id = NULL WHERE project_id = ?`).run(id);
+  db.prepare(`DELETE FROM projects WHERE id = ?`).run(id);
+});
+```
+
+For all HTTP endpoints, see `api-routes.md`.
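The Dynamic Updates section of this file describes building the `SET` clause from only the fields passed. A minimal sketch of that idea follows, assuming a hypothetical `buildUpdate` helper; the real `updateSession` and `updateProject` are not shown in this commit and may be structured differently. Only the allowlist itself is taken from the doc.

```javascript
// Allowlist copied from the doc; everything else here is illustrative.
const PROJECT_FIELDS = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];

// Hypothetical helper: returns the SQL and bind parameters for a partial update,
// or null when no allowlisted field was passed.
function buildUpdate(table, id, fields, allowed) {
    // Keep only allowlisted keys that were actually provided (undefined means omitted).
    const keys = allowed.filter((key) => fields[key] !== undefined);
    if (keys.length === 0) return null;
    const setClause = keys.map((key) => `${key} = ?`).join(', ');
    return {
        sql: `UPDATE ${table} SET ${setClause} WHERE id = ?`,
        params: [...keys.map((key) => fields[key]), id],
    };
}

// `bogus` is not on the allowlist, so it never reaches the SQL string.
const stmt = buildUpdate('projects', 3, { notes: 'hello', bogus: 1 }, PROJECT_FIELDS);
console.log(stmt.sql);    // UPDATE projects SET notes = ? WHERE id = ?
console.log(stmt.params); // [ 'hello', 3 ]
```

With `better-sqlite3`, the result could be executed as `db.prepare(stmt.sql).run(...stmt.params)`. Because column names come only from the allowlist and values are bound as parameters, request bodies cannot inject column names or SQL.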
--- a/docs/services/memory-service.md
+++ b/docs/services/memory-service.md
@@ -38,7 +38,8 @@ src/
 ├── db/
 │   ├── index.js       # SQLite connection + initialization + migrations
 │   ├── schema.js      # Table definitions, indexes, FTS5, triggers
-│   └── projects.js    # Project CRUD functions
+│   ├── projects.js    # Project CRUD functions
+│   └── summaries.js   # Summary CRUD functions
 ├── episodic/
 │   └── index.js       # Session + episode CRUD, FTS search, embedding write path
 ├── semantic/
@@ -51,7 +52,7 @@ src/

 ## SQLite Schema

-Six core tables:
+Seven core tables:

 - **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
 - **episodes** — individual exchanges (user message + AI response) tied to a session
@@ -100,12 +101,9 @@ that weren't touched.
 const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
 ```

-This means saving just `{ notes: "..." }` or `{ system_prompt: "..." }` won't
-touch any other field.
-
 ## Qdrant / Semantic Layer

-Three Qdrant collections are initialized on service startup:
+Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:

 | Collection | Purpose |
 |---|---|
@@ -117,9 +115,13 @@ All collections use **768-dimension vectors** with **Cosine similarity**,
 matching `nomic-embed-text` via Ollama. Vector size and distance metric are
 defined in `@nexusai/shared` — not hardcoded here.

-Each collection exposes three operations in `src/semantic/index.js`:
-upsert, search (with optional Qdrant filter), and delete. The `wait: true`
-flag is used on all writes.
+`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
+collection that doesn't already exist at startup — all three collections are
+guaranteed to exist before any requests are handled, avoiding race conditions
+between the first entity embed and an entity search.
+
+Each collection exposes upsert, search (with optional Qdrant filter), and
+delete operations. The `wait: true` flag is used on all writes.

 ## Embedding Write Path

@@ -133,8 +135,7 @@ When a new episode is created:
 This step is **fire-and-forget** — if embedding fails, the episode is still
 saved and searchable via FTS. The error is logged but not surfaced.

-> The Qdrant payload stores `sessionId` (the internal integer ID). This is
-> used for per-session and per-project filtering during semantic search. See
+> The Qdrant payload stores `sessionId` (the internal integer ID). See
 > `memory-isolation.md` for how project-level filtering works.

 ## Entity Layer
@@ -146,34 +147,20 @@ constraints to prevent duplicates:
 - `UNIQUE(from_id, to_id, label)` on relationships
 - `ON DELETE CASCADE` on relationship foreign keys

-### Automatic Entity Extraction
-
 After each episode is saved, `extraction.js` automatically extracts named
-entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1).
-This runs **fire-and-forget** — the episode is already saved and returned
-before extraction begins.
+entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.

-**Entity types extracted:** `person`, `place`, `project`, `technology`,
-`concept`, `organization`
+> For full details on the extraction pipeline, prompt format, constrained
+> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.

-The extraction prompt uses ChatML format (native to qwen2.5) and primes the
-response by ending with `[` to steer the model directly into JSON array output.
-A list of already-known entities is injected into the prompt so the model
-reuses existing `(name, type)` pairs rather than creating duplicates with
-different types.
+## Summaries Layer

-After extraction, each entity is:
-1. Upserted into SQLite via `upsertEntity` — notes are only written if
-   the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents
-   overwriting existing notes with speculative updates)
-2. Embedded via the embedding service and upserted into the `entities`
-   Qdrant collection with `{ name, type, notes, projectId }` as payload —
-   `projectId` scopes entities to their project for isolated retrieval
+Session summaries are generated by `orchestration-service/src/services/summarization.js`
+after each episode write and stored here via `POST /summaries`. The memory
+service is responsible only for CRUD — generation logic lives in orchestration.

-`extractAndStoreEntities` receives `projectId` from `createEpisode`, which
-receives it from the episode route, which receives it from orchestration's
-`createEpisode` call. This ensures entities are tagged with the correct
-project scope at extraction time.
+> For full details on trigger conditions, prompt format, cumulative updates,
+> and ChatML token stripping, see `summarization.md`.

 ## Project Delete Behaviour

--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -30,31 +30,33 @@ or inference services — all traffic flows through orchestration.
 | LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
-| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
+| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
+| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |

 ## Internal Structure

 ```
 src/
 ├── services/
-│   ├── memory.js      # HTTP client for memory service
-│   ├── inference.js   # HTTP client for inference service
-│   ├── embedding.js   # HTTP client for embedding service
-│   └── qdrant.js      # HTTP client for Qdrant (direct vector search)
+│   ├── memory.js         # HTTP client for memory service
+│   ├── inference.js      # HTTP client for inference service
+│   ├── embedding.js      # HTTP client for embedding service
+│   ├── qdrant.js         # HTTP client for Qdrant (direct vector search)
+│   └── summarization.js  # Session summarisation — triggers after each episode
 ├── chat/
-│   └── index.js       # Core pipeline — context assembly, isolation, auto-naming
+│   └── index.js          # Core pipeline — context assembly, isolation, auto-naming
 ├── config/
-│   └── settings.js    # Settings load/save — reads/writes data/settings.json
+│   └── settings.js       # Settings load/save — reads/writes data/settings.json
 ├── routes/
-│   ├── chat.js        # POST /chat and POST /chat/stream
-│   ├── sessions.js    # Session CRUD proxy
-│   ├── projects.js    # Project CRUD proxy — passes req.body straight through
-│   ├── episodes.js    # Episode list and delete proxy
-│   ├── settings.js    # GET /settings and PATCH /settings
-│   ├── health.js      # GET /health — pings all four services
-│   └── models.js      # GET /models — scans .gguf files live, merges with models.json
-                       # GET /models/props — context window + loaded model from llama-server
-└── index.js           # Express app entry point
+│   ├── chat.js           # POST /chat and POST /chat/stream
+│   ├── sessions.js       # Session CRUD proxy
+│   ├── projects.js       # Project CRUD proxy
+│   ├── episodes.js       # Episode list and delete proxy
+│   ├── summaries.js      # GET /summaries/session/:id and /summaries/project/:id
+│   ├── settings.js       # GET /settings and PATCH /settings
+│   ├── health.js         # GET /health/services — pings all four services
+│   └── models.js         # GET /models and GET /models/props
+└── index.js              # Express app entry point
 ```

 The `services/` layer wraps all downstream HTTP calls in named functions.
@@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart
 | `topK` | 40 | Top-K token candidates per step |
 | `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |

-Defaults are defined in `config/settings.js` and fall back to constants in
-`@nexusai/shared`. Values saved in `settings.json` take precedence.
-
 ## Chat Pipeline

 Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client.
 ### Steps

 1. **Session resolution** — look up session by `externalId`. Auto-create if
-   not found. Clients generate a UUID for new conversations — no pre-creation
-   step needed.
+   not found.

 2. **Project context resolution** — if the session has a `project_id`, fetch
   the project and all its session IDs. Used to scope semantic search. The
   project's `system_prompt` is also read at this step if set.

 3. **System prompt resolution** — three-tier hierarchy:
-   - `project.system_prompt` — if the session is in a project and it's set (highest priority)
+   - `project.system_prompt` — highest priority
   - `settings.systemPrompt` — global setting from `settings.json`
-   - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort)
+   - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)

-4. **Recent episode retrieval** — fetch the most recent episodes for the
-   session (`recentEpisodeLimit`, default 5).
+4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).

-5. **Semantic search** — embed the user message, query Qdrant for the top
-   most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
-   against recent episodes. Non-critical — if it fails, pipeline continues with
-   recency-only context.
+5. **Semantic search** — embed user message, query Qdrant for similar past
+   episodes. Deduplicated against recent episodes. Non-critical.

-6. **Entity search** — query the `entities` Qdrant collection filtered by
+6. **Entity search** — query `entities` Qdrant collection filtered by
   `projectId`. Non-project sessions receive no entity context. Non-critical.

-7. **Prompt assembly** — combine resolved system prompt, entity context,
-   semantic episodes, recent episodes, and user message.
+7. **Prompt assembly** — combine system prompt, entity context, semantic
+   episodes, recent episodes, and user message.

-8. **Inference** — send to inference service with settings-derived parameters
-   (temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
+8. **Inference** — send to inference service. `/chat` awaits full response;
   `/chat/stream` pipes SSE chunks to the client.

-9. **Episode write** — write the exchange back to memory with `projectId`.
-   Fire-and-forget for `/chat`; awaited for `/chat/stream`.
+9. **Episode write** — write exchange back to memory with `projectId`.

-10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
-    inference call with a naming prompt (max 20 tokens, temperature 0.3) and
-    write the result back as `session.name`. Fully fire-and-forget.
+10. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` called
+    fire-and-forget. See `summarization.md` for full details.
+
+11. **Auto-naming** — on first message with no session name, fires a secondary
+    inference call (max 20 tokens, temperature 0.3) to generate a session name.

 ### Prompt Structure

@@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client.

 Here is what you know about entities relevant to this conversation:
 - {name} ({type}): {notes}
-... (up to 5 entity results)
 ---
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to semanticLimit semantic episodes)
 ---
 Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to recentEpisodeLimit recent episodes)
 --- End of recent memories ---

 User: {current message}
 Assistant:
 ```

-Entity context appears first — before episodic memory — because structured
-facts about known entities are the most stable and reliable context. Semantic
-episodes follow, then recent episodes as the immediate conversation flow.
+## Summarisation
+
+After each episode write, `triggerSummary` is called fire-and-forget. It
+checks token thresholds and episode counts before generating, then stores
+the result in the memory service.
+
+> For full details on trigger conditions, prompt format, cumulative updates,
+> ChatML token stripping, and episode range tracking, see `summarization.md`.

 ## SSE Stream Format

@@ -168,46 +165,36 @@ data: {"text":"Hello"}
 data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
 ```

-The `[DONE]` sentinel is consumed internally and not forwarded. The stream
-is terminated by `res.end()` after the done event.
+The `[DONE]` sentinel is consumed internally and not forwarded.

 ## Models Route

-`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
-(read from settings). Merges results with a `models.json` file in the same
-folder for richer metadata (label, description). Returns file size in GB.
+`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
+with `models.json` for metadata. Returns file size in GB.

-`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
-Returns `{ contextWindow, modelAlias }`. `n_ctx` is at
-`data.default_generation_settings.n_ctx` in the llama-server response.
-Returns `503` if llama-server is unreachable.
+`GET /models/props` fetches directly from llama-server. Returns
+`{ contextWindow, modelAlias }`. Returns `503` if unreachable.

 ## Sessions Route Behaviour

-`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
-The validation guard only rejects requests where neither is provided:
-
-```js
-if (!name?.trim() && projectId === undefined) {
-  return res.status(400).json({ error: 'name or projectId is required' });
-}
-```
-
-This allows `useChat` to write project assignment separately from rename
-operations.
+`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both. It
+rejects a request only when neither is provided, so `useChat` can write
+project assignment separately from rename operations.

 ## Caddy Configuration

-Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
+Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
+**Any new top-level route must be added here AND in `vite.config.js`.**

 ```
-handle /chat*     { reverse_proxy localhost:4000 }
-handle /sessions* { reverse_proxy localhost:4000 }
-handle /models*   { reverse_proxy localhost:4000 }
-handle /projects* { reverse_proxy localhost:4000 }
-handle /episodes* { reverse_proxy localhost:4000 }
-handle /settings* { reverse_proxy localhost:4000 }
-handle /health*   { reverse_proxy localhost:4000 }
+handle /chat*      { reverse_proxy localhost:4000 }
+handle /sessions*  { reverse_proxy localhost:4000 }
+handle /models*    { reverse_proxy localhost:4000 }
+handle /projects*  { reverse_proxy localhost:4000 }
+handle /episodes*  { reverse_proxy localhost:4000 }
+handle /settings*  { reverse_proxy localhost:4000 }
+handle /summaries* { reverse_proxy localhost:4000 }
+handle /health*    { reverse_proxy localhost:4000 }
 ```

 After updating: `caddy reload --config /path/to/Caddyfile`
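The Caddy section above says every top-level route must also be added in `vite.config.js`. A minimal sketch of a matching dev proxy map follows. The `buildProxy` helper is hypothetical, and the target URL assumes the dev proxy points at the same orchestration port (4000) as the Caddy block; the project's actual Vite config is not part of this commit.

```javascript
// Assumption: same orchestration port as the Caddy handle blocks.
const ORCHESTRATION_TARGET = 'http://localhost:4000';

// Same prefixes as the Caddyfile, including the new /summaries route.
const ROUTE_PREFIXES = [
    '/chat', '/sessions', '/models', '/projects',
    '/episodes', '/settings', '/summaries', '/health',
];

// Hypothetical helper: one entry per prefix, in the object shape
// that Vite's `server.proxy` option accepts.
function buildProxy(prefixes, target) {
    return Object.fromEntries(
        prefixes.map((prefix) => [prefix, { target, changeOrigin: true }]),
    );
}

const proxy = buildProxy(ROUTE_PREFIXES, ORCHESTRATION_TARGET);
console.log(Object.keys(proxy).length); // 8
// In vite.config.js this object would be passed as: defineConfig({ server: { proxy } })
```

Keeping the prefixes in one list makes it harder to add a route to Caddy and forget the dev proxy, which is the failure the note above warns about.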
--- a/docs/services/shared.md
+++ b/docs/services/shared.md
@@ -165,10 +165,16 @@ Orchestration pipeline defaults. Used as fallback values in
 | `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
 | `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
 | `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
+| `ENTITIES_LIMIT` | `5` | Max entity search results to inject into prompt |
+| `ENTITIES_THRESHOLD` | `0.55` | Minimum similarity score for entity results |
 | `TEMPERATURE` | `0.7` | Default inference temperature |
 | `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
 | `SYSTEM_PROMPT` | *(see below)* | Default system prompt |

+> `ENTITIES_THRESHOLD` is set to `0.55` — lower than `SCORE_THRESHOLD` because
+> entity notes generated by a 3B model tend to embed with lower cosine similarity
+> than full episode text. Tune upward if irrelevant entities appear in context.
+
 > `repeatPenalty`, `topP`, and `topK` defaults are sourced from
 > `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
 > since those constants already define the canonical values.
@@ -178,6 +184,25 @@ Default system prompt:
 > of past conversations with the user. Use them to provide consistent,
 > personalised responses."

+#### `SUMMARIES`
+
+Controls the automatic session summarisation system in `orchestration-service/src/services/summarization.js`.
+
+| Key | Value | Description |
+|---|---|---|
+| `THRESHOLD_TOKENS` | `200` | Minimum total session tokens before summarisation is considered |
+| `MAX_SUMMARY_TOKENS` | `800` | If existing summary exceeds this length (chars), create a new row instead of updating |
+| `MIN_EPISODES_SINCE` | `5` | Minimum new episodes since last summary before re-summarising |
+
+These can be overridden per-deployment via environment variables in the
+orchestration service `.env`:
+
+```
+SUMMARY_THRESHOLD_TOKENS=200
+SUMMARY_MAX_TOKENS=800
+SUMMARY_MIN_EPISODES=5
+```
+
 #### `SQLITE`

 | Key | Value | Description |
--- a/docs/services/summarization.md
+++ b/docs/services/summarization.md
@@ -0,0 +1,201 @@
+# Summarization
+
+Session summarization generates rolling plain-text summaries of conversation
+history, giving the model a condensed view of past context without consuming
+the full context window with raw episodes.
+
+**Location:** `packages/orchestration-service/src/services/summarization.js`  
+**Triggered by:** `chat/index.js` after every episode write (fire-and-forget)  
+**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)
+
+---
+
+## Trigger Conditions
+
+`triggerSummary(session, allEpisodes)` calls `maybeSummarize` fire-and-forget.
+`maybeSummarize` proceeds only when both conditions are met:
+
+1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
+2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have
+   accumulated since the last summary
+
+The token threshold is intentionally low — it ensures summaries start
+generating early in a session's life rather than only after very long
+conversations.
+
+---
+
+## Summary Rows and Cumulative Updates
+
+Each session can have multiple summary rows in the `summaries` table.
+The update strategy depends on the size of the most recent summary:
+
+| Condition | Action |
+|---|---|
+| No existing summary | Generate fresh summary from all episodes |
+| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
+| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |
+
+This produces a chain of summary rows over time. Each row's `episode_range`
+covers only the episodes summarised in that specific pass (e.g. `259-263`),
+not all episodes in the session.
+
+---
+
+## Ollama Request
+
+```js
+{
+    model: EXTRACTION_MODEL,   // qwen2.5:3b (set via EXTRACTION_MODEL env var)
+    prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
+    stream: false,
+    // No format: 'json' — free-text output required for summaries
+    options: {
+        temperature: 0.2,
+        num_predict: 500,
+    },
+}
+```
+
+`temperature: 0.2` is slightly higher than extraction (0.1) because summaries
+benefit from some fluency. `num_predict: 500` leaves room for 5 thorough
+sentences while capping output that runs past the summary.
+
+---
+
+## Prompt Format
+
+ChatML format — native to qwen2.5:
+
+```
+<|im_start|>user
+Summarize the conversation below in 3-5 sentences.
+Write in third person. Do not quote directly — paraphrase only.
+Do not include greetings, sign-offs, or filler. Output only the summary text.
+
+Conversation:
+{context}
+<|im_end|>
+<|im_start|>assistant
+```
+
+For cumulative updates, the instruction and context change:
+
+```
+<|im_start|>user
+Update the summary below to incorporate the new exchanges.
+Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
+Do not include greetings, sign-offs, or filler. Output only the updated summary text.
+
+Previous summary:
+{existingSummary}
+
+New exchanges:
+{context}
+<|im_end|>
+<|im_start|>assistant
+```
+
+### Input truncation
+
+Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the
+most recent exchanges (sliced from the end). This keeps Qwen focused and
+prevents the prompt from exceeding its effective context window.
+
+---
+
+## ChatML Token Stripping
+
+Qwen occasionally echoes ChatML tokens back into its response. The raw output
+is cleaned before saving:
+
+```js
+const raw = data.response?.trim() ?? '';
+const content = raw
+    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
+    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
+    .trim();
+return content;
+```
+
+Without this, leaked tokens get stored in the summary and then injected
+back into the next summarisation prompt — causing the model to append a new
+summary after the old one rather than replacing it.
+
+---
+
+## Episode Range Tracking
+
+Each summary row stores `episode_range` as `"firstId-lastId"` covering only
+the episodes summarised in that pass:
+
+```js
+const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b);
+const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
+```
+
+This makes SummaryView cards meaningful — "Episodes 259-263" tells you
+exactly which exchanges that summary covers, rather than always showing
+the full session range.
+
+---
+
+## Summary Storage
+
+Summaries are written directly to the memory service from orchestration:
+
+```js
+// Create new row (JSON Content-Type so the body is parsed as JSON)
+await fetch(`${MEMORY_URL}/summaries`, {
+    method: 'POST', headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
+});
+
+// Update existing row
+await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
+    method: 'PATCH', headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify({ content, tokenCount, episodeRange }),
+});
+```
+
+`session.id` here is the internal SQLite integer ID — not the external UUID.
+It is available directly on the `session` object passed from `chat/index.js`.
+
+---
+
+## Client-Side Indicator
+
+The chat client shows a "Summarising…" spinner in the `ChatWindow` header
+and on the InfoPanel's Session Memory button while summarisation may be
+in progress.
+
+Since summarisation is fire-and-forget with no completion signal back to
+the client, the indicator is timer-based: it activates when the stream
+finishes and clears after 8 seconds.
+
+```js
+// In App.jsx, watching the streaming state from useChat:
+useEffect(() => {
+    if (prevStreaming.current && !streaming) {
+        setSummarising(true);
+        const t = setTimeout(() => setSummarising(false), 8000);
+        return () => clearTimeout(t);
+    }
+    prevStreaming.current = streaming;
+}, [streaming]);
+```
+
+---
+
+## Environment Variables
+
+Set in `packages/orchestration-service/src/.env`:
+
+| Variable | Default | Description |
+|---|---|---|
+| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
+| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
+| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
+| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
+| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
+| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |
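The thresholds in the table above combine with the trigger conditions and the update strategy described earlier in this file. The sketch below shows one way that decision could be expressed. `intFromEnv` and `decideSummaryAction` are hypothetical names, not functions from `summarization.js`, and the inputs (`totalTokens`, `episodesSinceLast`, `latestSummaryLength`) are assumed to be computed by the caller. The doc does not say what happens when a summary is exactly at the limit; this sketch treats that case as an update.

```javascript
// Documented defaults for the three summarisation thresholds.
const DEFAULTS = { THRESHOLD_TOKENS: 200, MAX_SUMMARY_TOKENS: 800, MIN_EPISODES_SINCE: 5 };

// Hypothetical helper: read an integer env var, falling back when unset or invalid.
function intFromEnv(name, fallback, env = process.env) {
    const parsed = Number.parseInt(env[name] ?? '', 10);
    return Number.isNaN(parsed) ? fallback : parsed;
}

const SUMMARIES = {
    THRESHOLD_TOKENS: intFromEnv('SUMMARY_THRESHOLD_TOKENS', DEFAULTS.THRESHOLD_TOKENS),
    MAX_SUMMARY_TOKENS: intFromEnv('SUMMARY_MAX_TOKENS', DEFAULTS.MAX_SUMMARY_TOKENS),
    MIN_EPISODES_SINCE: intFromEnv('SUMMARY_MIN_EPISODES', DEFAULTS.MIN_EPISODES_SINCE),
};

// Hypothetical decision helper: returns 'skip', 'create', or 'update'.
// latestSummaryLength is null when the session has no summary yet.
function decideSummaryAction({ totalTokens, episodesSinceLast, latestSummaryLength }, limits = SUMMARIES) {
    // Condition 1: total session tokens must exceed the threshold.
    if (totalTokens <= limits.THRESHOLD_TOKENS) return 'skip';
    // Condition 2: enough new episodes since the last summary.
    if (episodesSinceLast < limits.MIN_EPISODES_SINCE) return 'skip';
    // No summary yet: generate a fresh one.
    if (latestSummaryLength === null) return 'create';
    // Over the limit starts a new row; otherwise the latest row is updated.
    return latestSummaryLength > limits.MAX_SUMMARY_TOKENS ? 'create' : 'update';
}

// Explicit limits are passed here so the results do not depend on the environment.
console.log(decideSummaryAction({ totalTokens: 950, episodesSinceLast: 6, latestSummaryLength: 420 }, DEFAULTS)); // update
console.log(decideSummaryAction({ totalTokens: 950, episodesSinceLast: 2, latestSummaryLength: 420 }, DEFAULTS)); // skip
```

Separating the decision from the Ollama call keeps the threshold logic testable without a running model.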