documentation updates for entity extraction and summarization

2026-04-21 03:50:38 -07:00
parent 32365e67f4
commit acda21317b
6 changed files with 540 additions and 107 deletions
--- a/docs/reference/API-routes.md
+++ b/docs/reference/API-routes.md
@@ -120,6 +120,38 @@ all projects use isolated memory. Returns `201` with the created project object.
 Only provided fields are updated — omitted fields are not touched.
 ### Summaries
 | Method | Path | Description |
 |---|---|---|
 | GET | /summaries/session/:sessionId | Get all summaries for a session (by external UUID) |
 | GET | /summaries/project/:projectId | Get all summaries for a project |
 **GET /summaries/session/:sessionId** — resolves the external UUID to an
 internal session ID, then fetches summaries from the memory service.
 Returns an array of summary objects ordered by `created_at` ascending.
 **GET /summaries/project/:projectId** — proxies directly to the memory
 service project summaries endpoint.
 **Summary object shape:**
 ```json
 {
  "id": 8,
  "session_id": 72,
  "project_id": null,
  "content": "The user asked about...",
  "token_count": 579,
  "episode_range": "246-251",
  "created_at": 1776766518,
  "updated_at": 1776766518
 }
 ```
 > **Proxy requirement:** `/summaries` must be added to both the Caddyfile
 > reverse proxy and the Vite dev proxy config alongside the other route
 > prefixes. See `orchestration-service.md` for the Caddy block pattern.
 ### Models
 | Method | Path | Description |
@@ -269,6 +301,29 @@ Both fields are optional. Only provided fields are updated.
 Same request/response shape as orchestration `/projects` above.
 ### Summaries
 | Method | Path | Description |
 |---|---|---|
 | POST | /summaries | Create a new summary |
 | GET | /sessions/:id/summaries | Get all summaries for a session (internal ID) |
 | GET | /projects/:id/summaries | Get all summaries for a project |
 | PATCH | /summaries/:id | Update a summary (content, tokenCount, episodeRange) |
 | DELETE | /summaries/:id | Delete a summary |
 **POST /summaries — body:**
 ```json
 {
  "sessionId": 72,
  "content": "The user discussed...",
  "tokenCount": 579,
  "episodeRange": "246-251"
 }
 ```
 `content` is required. Either `sessionId` or `projectId` is required.
 **PATCH /summaries/:id — body:** any subset of `content`, `tokenCount`, `episodeRange`.
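
The body rules for `POST /summaries` can be expressed as a small check. This is a minimal sketch, not the memory service's actual handler: `validateSummaryBody` is a hypothetical helper, and rejecting whitespace-only `content` is an assumption beyond what the text states.

```js
// Hypothetical validator mirroring the documented rules for POST /summaries.
// Returns an error message string, or null when the body is acceptable.
function validateSummaryBody(body) {
    const { sessionId, projectId, content } = body ?? {};

    // `content` is required (assumed here to mean a non-blank string).
    if (typeof content !== 'string' || content.trim() === '') {
        return 'content is required';
    }

    // At least one owner must be supplied: a session or a project.
    if (sessionId == null && projectId == null) {
        return 'sessionId or projectId is required';
    }

    return null;
}
```

Returning a message rather than throwing would let a route handler map any non-null result to a `400` response.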
 ### Entities
 | Method | Path | Description |
--- a/docs/services/entity-extraction.md
+++ b/docs/services/entity-extraction.md
@@ -0,0 +1,178 @@
 # Memory Service
 **Package:** `@nexusai/memory-service`  
 **Location:** `packages/memory-service`  
 **Deployed on:** Mini PC 1 (192.168.0.81)  
 **Port:** 3002
 ## Purpose
 Responsible for all reading and writing of long-term memory. Acts as the
 sole interface to both SQLite and Qdrant — no other service accesses these
 stores directly. On episode creation, automatically calls the embedding
 service to generate and store a vector in Qdrant.
 ## Dependencies
 - `express` — HTTP API
 - `better-sqlite3` — SQLite driver
 - `@qdrant/js-client-rest` — Qdrant vector store client
 - `dotenv` — environment variable loading
 - `@nexusai/shared` — shared utilities and constants
 ## Environment Variables
 | Variable | Required | Default | Description |
 |---|---|---|---|
 | PORT | No | 3002 | Port to listen on |
 | SQLITE_PATH | Yes | — | Path to SQLite database file |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant instance URL |
 | EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
 | EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for entity extraction |
 | EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for entity extraction |
 ## Internal Structure
 ```
 src/
 ├── db/
 │   ├── index.js       # SQLite connection + initialization + migrations
 │   ├── schema.js      # Table definitions, indexes, FTS5, triggers
 │   ├── projects.js    # Project CRUD functions
 │   └── summaries.js   # Summary CRUD functions
 ├── episodic/
 │   └── index.js       # Session + episode CRUD, FTS search, embedding write path
 ├── semantic/
 │   └── index.js       # Qdrant collection management, upsert, search, delete
 ├── entities/
 │   ├── index.js       # Entity + relationship CRUD
 │   └── extraction.js  # Automatic entity extraction via qwen2.5:3b on Ollama
 └── index.js           # Express app + all route definitions
 ```
 ## SQLite Schema
 Seven core tables:
 - **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
 - **episodes** — individual exchanges (user message + AI response) tied to a session
 - **entities** — named things the system learns about (people, places, concepts)
 - **relationships** — directional labeled links between entities
 - **summaries** — condensed episode groups for efficient context retrieval
 - **projects** — named groupings of sessions with `name`, `description`, `colour`, `icon`, `isolated`, `notes`, `system_prompt`
 ### Migrations
 Schema changes that cannot use `CREATE TABLE IF NOT EXISTS` are applied as
 idempotent migrations in `db/index.js` at startup:
 ```js
 try { db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`); } catch {}
 try { db.exec(`ALTER TABLE sessions ADD COLUMN project_id INTEGER REFERENCES projects(id)`); } catch {}
 try { db.exec(`CREATE INDEX IF NOT EXISTS idx_sessions_project ON sessions(project_id)`); } catch {}
 try { db.exec(`ALTER TABLE projects ADD COLUMN isolated INTEGER NOT NULL DEFAULT 0`); } catch {}
 try { db.exec(`ALTER TABLE projects ADD COLUMN notes TEXT`); } catch {}
 try { db.exec(`ALTER TABLE projects ADD COLUMN system_prompt TEXT`); } catch {}
 ```
 New migrations are always appended here — never modify the schema file for
 existing tables since `ALTER TABLE` cannot use `IF NOT EXISTS`.
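
The bare `catch {}` in the migrations above also hides unrelated failures such as a missing table. One possible variant, sketched under assumptions, swallows only the "duplicate column name" error that SQLite raises when an `ADD COLUMN` is re-applied. `runMigrations` is a hypothetical helper, and `db` is any object with an `exec(sql)` method, as `better-sqlite3` provides.

```js
// Sketch only: applies each statement, ignoring "already applied" errors.
// Returns the statements that actually ran on this startup.
function runMigrations(db, statements) {
    const applied = [];
    for (const sql of statements) {
        try {
            db.exec(sql);
            applied.push(sql);
        } catch (err) {
            // A re-applied ADD COLUMN fails with "duplicate column name: <col>".
            // Treat that as already migrated; rethrow anything else.
            if (!/duplicate column name/i.test(String(err && err.message))) {
                throw err;
            }
        }
    }
    return applied;
}
```

With this shape, a typo in a table name surfaces at startup instead of being silently ignored.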
 ### FTS5 Full-Text Search
 An `episodes_fts` virtual table enables keyword search across all episodes.
 Three triggers (`episodes_fts_insert`, `episodes_fts_update`, `episodes_fts_delete`)
 keep the FTS index automatically in sync with the episodes table.
 ### SQLite Configuration
 - `journal_mode = WAL` — non-blocking reads during writes
 - `foreign_keys = ON` — enforces referential integrity and cascade deletes
 - PRAGMAs set via `db.pragma()`, not `db.exec()`
 ### Dynamic Updates
 Both `updateSession` and `updateProject` build their `SET` clause dynamically
 from only the fields passed — prevents partial updates from overwriting fields
 that weren't touched.
 `updateProject` allowlist:
 ```js
 const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
 ```
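
The dynamic `SET` clause described above could be built roughly as follows. This is a sketch, not the service's code: `buildUpdate` and `PROJECT_FIELDS` are illustrative names, and returning `null` for an empty update is an assumption.

```js
// Allowlist copied from the documentation for updateProject.
const PROJECT_FIELDS = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];

// Builds a parameterised UPDATE touching only the allowed fields that were passed.
// Returns null when nothing valid was provided.
function buildUpdate(table, id, fields, allowed) {
    const keys = Object.keys(fields).filter(
        (key) => allowed.includes(key) && fields[key] !== undefined
    );
    if (keys.length === 0) return null;

    // Column names come only from the allowlist; values are bound as parameters.
    const setClause = keys.map((key) => `${key} = ?`).join(', ');
    return {
        sql: `UPDATE ${table} SET ${setClause} WHERE id = ?`,
        params: [...keys.map((key) => fields[key]), id],
    };
}
```

For example, `buildUpdate('projects', 3, { notes: 'hi' }, PROJECT_FIELDS)` produces a statement that sets only `notes`, leaving every other column untouched.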
 ## Qdrant / Semantic Layer
 Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
 | Collection | Purpose |
 |---|---|
 | `episodes` | Embeddings for individual conversation exchanges |
 | `entities` | Embeddings for named entities |
 | `summaries` | Embeddings for condensed episode summaries |
 All collections use **768-dimension vectors** with **Cosine similarity**,
 matching `nomic-embed-text` via Ollama. Vector size and distance metric are
 defined in `@nexusai/shared` — not hardcoded here.
 `initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
 collection that doesn't already exist at startup — all three collections are
 guaranteed to exist before any requests are handled, avoiding race conditions
 between the first entity embed and an entity search.
 Each collection exposes upsert, search (with optional Qdrant filter), and
 delete operations. The `wait: true` flag is used on all writes.
 ## Embedding Write Path
 When a new episode is created:
 1. Episode saved to SQLite synchronously — response returned immediately
 2. User message + AI response combined: `User: ...\nAssistant: ...`
 3. Text sent to embedding service (`POST /embed`)
 4. Vector upserted into `episodes` Qdrant collection with payload `{ sessionId, createdAt }`
 This step is **fire-and-forget** — if embedding fails, the episode is still
 saved and searchable via FTS. The error is logged but not surfaced.
 > The Qdrant payload stores `sessionId` (the internal integer ID). See
 > `memory-isolation.md` for how project-level filtering works.
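
Steps 2 and 4 of the write path can be illustrated with a small builder. This is only a sketch: `buildEpisodeEmbedding` and the episode field names `userMessage`, `aiResponse`, `sessionId`, and `createdAt` are assumptions, not the service's actual identifiers.

```js
// Produces the text sent to the embedding service and the Qdrant payload.
function buildEpisodeEmbedding(episode) {
    return {
        // Combined text in the documented "User: ...\nAssistant: ..." form.
        text: `User: ${episode.userMessage}\nAssistant: ${episode.aiResponse}`,
        // Payload stored alongside the vector; sessionId is the internal integer ID.
        payload: { sessionId: episode.sessionId, createdAt: episode.createdAt },
    };
}
```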
 ## Entity Layer
 Entities and relationships use upsert semantics with composite unique
 constraints to prevent duplicates:
 - `UNIQUE(name, type)` on entities
 - `UNIQUE(from_id, to_id, label)` on relationships
 - `ON DELETE CASCADE` on relationship foreign keys
 After each episode is saved, `extraction.js` automatically extracts named
 entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
 > For full details on the extraction pipeline, prompt format, constrained
 > decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
 ## Summaries Layer
 Session summaries are generated by `orchestration-service/src/services/summarization.js`
 after each episode write and stored here via `POST /summaries`. The memory
 service is responsible only for CRUD — generation logic lives in orchestration.
 > For full details on trigger conditions, prompt format, cumulative updates,
 > and ChatML token stripping, see `summarization.md`.
 ## Project Delete Behaviour
 Deleting a project runs as a transaction — it first nulls out `project_id`
 on all assigned sessions, then deletes the project. This avoids a foreign
 key constraint failure since `sessions.project_id` has no `ON DELETE` rule:
 ```js
 const doDelete = db.transaction(() => {
  db.prepare(`UPDATE sessions SET project_id = NULL WHERE project_id = ?`).run(id);
  db.prepare(`DELETE FROM projects WHERE id = ?`).run(id);
 });
 ```
 For all HTTP endpoints, see `API-routes.md`.
--- a/docs/services/memory-service.md
+++ b/docs/services/memory-service.md
@@ -38,7 +38,8 @@ src/
 ├── db/
 │   ├── index.js       # SQLite connection + initialization + migrations
 │   ├── schema.js      # Table definitions, indexes, FTS5, triggers
-│   └── projects.js    # Project CRUD functions
+│   ├── projects.js    # Project CRUD functions
 │   └── summaries.js   # Summary CRUD functions
 ├── episodic/
 │   └── index.js       # Session + episode CRUD, FTS search, embedding write path
 ├── semantic/
@@ -51,7 +52,7 @@ src/
 ## SQLite Schema
-Six core tables:
+Seven core tables:
 - **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
 - **episodes** — individual exchanges (user message + AI response) tied to a session
@@ -100,12 +101,9 @@ that weren't touched.
 const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
 ```
 This means saving just `{ notes: "..." }` or `{ system_prompt: "..." }` won't
 touch any other field.
 ## Qdrant / Semantic Layer
-Three Qdrant collections are initialized on service startup:
+Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
 | Collection | Purpose |
 |---|---|
@@ -117,9 +115,13 @@ All collections use **768-dimension vectors** with **Cosine similarity**,
 matching `nomic-embed-text` via Ollama. Vector size and distance metric are
 defined in `@nexusai/shared` — not hardcoded here.
-Each collection exposes three operations in `src/semantic/index.js`:
+`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
-upsert, search (with optional Qdrant filter), and delete. The `wait: true`
+collection that doesn't already exist at startup — all three collections are
-flag is used on all writes.
+guaranteed to exist before any requests are handled, avoiding race conditions
 between the first entity embed and an entity search.
 Each collection exposes upsert, search (with optional Qdrant filter), and
 delete operations. The `wait: true` flag is used on all writes.
 ## Embedding Write Path
@@ -133,8 +135,7 @@ When a new episode is created:
 This step is **fire-and-forget** — if embedding fails, the episode is still
 saved and searchable via FTS. The error is logged but not surfaced.
-> The Qdrant payload stores `sessionId` (the internal integer ID). This is
+> The Qdrant payload stores `sessionId` (the internal integer ID). See
 > used for per-session and per-project filtering during semantic search. See
 > `memory-isolation.md` for how project-level filtering works.
 ## Entity Layer
@@ -146,34 +147,20 @@ constraints to prevent duplicates:
 - `UNIQUE(from_id, to_id, label)` on relationships
 - `ON DELETE CASCADE` on relationship foreign keys
 ### Automatic Entity Extraction
 After each episode is saved, `extraction.js` automatically extracts named
-entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1).
+entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
 This runs **fire-and-forget** — the episode is already saved and returned
 before extraction begins.
-**Entity types extracted:** `person`, `place`, `project`, `technology`,
+> For full details on the extraction pipeline, prompt format, constrained
-`concept`, `organization`
+> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
-The extraction prompt uses ChatML format (native to qwen2.5) and primes the
+## Summaries Layer
 response by ending with `[` to steer the model directly into JSON array output.
 A list of already-known entities is injected into the prompt so the model
 reuses existing `(name, type)` pairs rather than creating duplicates with
 different types.
-After extraction, each entity is:
+Session summaries are generated by `orchestration-service/src/services/summarization.js`
-1. Upserted into SQLite via `upsertEntity` — notes are only written if
+after each episode write and stored here via `POST /summaries`. The memory
-   the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents
+service is responsible only for CRUD — generation logic lives in orchestration.
   overwriting existing notes with speculative updates)
 2. Embedded via the embedding service and upserted into the `entities`
   Qdrant collection with `{ name, type, notes, projectId }` as payload —
   `projectId` scopes entities to their project for isolated retrieval
-`extractAndStoreEntities` receives `projectId` from `createEpisode`, which
+> For full details on trigger conditions, prompt format, cumulative updates,
-receives it from the episode route, which receives it from orchestration's
+> and ChatML token stripping, see `summarization.md`.
 `createEpisode` call. This ensures entities are tagged with the correct
 project scope at extraction time.
 ## Project Delete Behaviour
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -30,7 +30,8 @@ or inference services — all traffic flows through orchestration.
 | LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
-| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
+| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
 | EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |
 ## Internal Structure
@@ -40,7 +41,8 @@ src/
 │   ├── memory.js         # HTTP client for memory service
 │   ├── inference.js      # HTTP client for inference service
 │   ├── embedding.js      # HTTP client for embedding service
-│   └── qdrant.js      # HTTP client for Qdrant (direct vector search)
+│   ├── qdrant.js         # HTTP client for Qdrant (direct vector search)
 │   └── summarization.js  # Session summarisation — triggers after each episode
 ├── chat/
 │   └── index.js          # Core pipeline — context assembly, isolation, auto-naming
 ├── config/
@@ -48,12 +50,12 @@ src/
 ├── routes/
 │   ├── chat.js           # POST /chat and POST /chat/stream
 │   ├── sessions.js       # Session CRUD proxy
-│   ├── projects.js    # Project CRUD proxy — passes req.body straight through
+│   ├── projects.js       # Project CRUD proxy
 │   ├── episodes.js       # Episode list and delete proxy
 │   ├── summaries.js      # GET /summaries/session/:id and /summaries/project/:id
 │   ├── settings.js       # GET /settings and PATCH /settings
-│   ├── health.js      # GET /health — pings all four services
+│   ├── health.js         # GET /health/services — pings all four services
-│   └── models.js      # GET /models — scans .gguf files live, merges with models.json
+│   └── models.js         # GET /models and GET /models/props
                       # GET /models/props — context window + loaded model from llama-server
 └── index.js              # Express app entry point
 ```
@@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart
 | `topK` | 40 | Top-K token candidates per step |
 | `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |
 Defaults are defined in `config/settings.js` and fall back to constants in
 `@nexusai/shared`. Values saved in `settings.json` take precedence.
 ## Chat Pipeline
 Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client.
 ### Steps
 1. **Session resolution** — look up session by `externalId`. Auto-create if
-   not found. Clients generate a UUID for new conversations — no pre-creation
+   not found.
   step needed.
 2. **Project context resolution** — if the session has a `project_id`, fetch
   the project and all its session IDs. Used to scope semantic search. The
   project's `system_prompt` is also read at this step if set.
 3. **System prompt resolution** — three-tier hierarchy:
-   - `project.system_prompt` — if the session is in a project and it's set (highest priority)
+   - `project.system_prompt` — highest priority
   - `settings.systemPrompt` — global setting from `settings.json`
-   - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort)
+   - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)
-4. **Recent episode retrieval** — fetch the most recent episodes for the
+4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).
   session (`recentEpisodeLimit`, default 5).
-5. **Semantic search** — embed the user message, query Qdrant for the top
+5. **Semantic search** — embed user message, query Qdrant for similar past
-   most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
+   episodes. Deduplicated against recent episodes. Non-critical.
   against recent episodes. Non-critical — if it fails, pipeline continues with
   recency-only context.
-6. **Entity search** — query the `entities` Qdrant collection filtered by
+6. **Entity search** — query `entities` Qdrant collection filtered by
   `projectId`. Non-project sessions receive no entity context. Non-critical.
-7. **Prompt assembly** — combine resolved system prompt, entity context,
+7. **Prompt assembly** — combine system prompt, entity context, semantic
-   semantic episodes, recent episodes, and user message.
+   episodes, recent episodes, and user message.
-8. **Inference** — send to inference service with settings-derived parameters
+8. **Inference** — send to inference service. `/chat` awaits full response;
   (temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
   `/chat/stream` pipes SSE chunks to the client.
-9. **Episode write** — write the exchange back to memory with `projectId`.
+9. **Episode write** — write exchange back to memory with `projectId`.
   Fire-and-forget for `/chat`; awaited for `/chat/stream`.
-10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
+10. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` called
-    inference call with a naming prompt (max 20 tokens, temperature 0.3) and
+    fire-and-forget. See `summarization.md` for full details.
-    write the result back as `session.name`. Fully fire-and-forget.
+
 11. **Auto-naming** — on first message with no session name, fires a secondary
    inference call (max 20 tokens, temperature 0.3) to generate a session name.
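
The three-tier system prompt hierarchy from step 3 can be sketched as a simple fallback chain. `resolveSystemPrompt` and `isSet` are hypothetical names, and treating blank strings as "not set" is an assumption; the source only says that `null` reverts to the constant.

```js
// Treats null, undefined, and blank strings as "not set" (an assumption).
function isSet(value) {
    return typeof value === 'string' && value.trim() !== '';
}

// Project prompt wins, then the global setting, then the hardcoded fallback.
function resolveSystemPrompt(project, settings, fallbackPrompt) {
    if (isSet(project?.system_prompt)) return project.system_prompt;
    if (isSet(settings?.systemPrompt)) return settings.systemPrompt;
    return fallbackPrompt;
}
```

A session without a project passes `null` for `project` and falls straight through to the global setting.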
 ### Prompt Structure
@@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client.
 Here is what you know about entities relevant to this conversation:
 - {name} ({type}): {notes}
 ... (up to 5 entity results)
 ---
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
 ... (up to semanticLimit semantic episodes)
 ---
 Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
 ... (up to recentEpisodeLimit recent episodes)
 --- End of recent memories ---
 User: {current message}
 Assistant:
 ```
-Entity context appears first — before episodic memory — because structured
+## Summarisation
-facts about known entities are the most stable and reliable context. Semantic
+
-episodes follow, then recent episodes as the immediate conversation flow.
+After each episode write, `triggerSummary` is called fire-and-forget. It
 checks token thresholds and episode counts before generating, then stores
 the result in the memory service.
 > For full details on trigger conditions, prompt format, cumulative updates,
 > ChatML token stripping, and episode range tracking, see `summarization.md`.
 ## SSE Stream Format
@@ -168,37 +165,26 @@ data: {"text":"Hello"}
 data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
 ```
-The `[DONE]` sentinel is consumed internally and not forwarded. The stream
+The `[DONE]` sentinel is consumed internally and not forwarded.
 is terminated by `res.end()` after the done event.
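
A client reading this stream might decode each `data:` line along these lines. This is a sketch under assumptions: `parseSseLine` is a hypothetical helper, and silently ignoring non-data lines and malformed JSON is a design choice made here, not documented behaviour.

```js
// Decodes one SSE line into an event object, or null if it carries no event.
function parseSseLine(line) {
    if (!line.startsWith('data:')) return null;

    const raw = line.slice('data:'.length).trim();
    // The [DONE] sentinel is not forwarded by orchestration; skip it defensively anyway.
    if (raw === '' || raw === '[DONE]') return null;

    try {
        return JSON.parse(raw);
    } catch {
        return null; // Malformed chunk: ignore rather than abort the stream.
    }
}
```

A text chunk yields an object with a `text` property, while the final event carries `done`, `model`, and `tokenCount`.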
 ## Models Route
-`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
+`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
-(read from settings). Merges results with a `models.json` file in the same
+with `models.json` for metadata. Returns file size in GB.
 folder for richer metadata (label, description). Returns file size in GB.
-`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
+`GET /models/props` fetches directly from llama-server. Returns
-Returns `{ contextWindow, modelAlias }`. `n_ctx` is at
+`{ contextWindow, modelAlias }`. Returns `503` if unreachable.
 `data.default_generation_settings.n_ctx` in the llama-server response.
 Returns `503` if llama-server is unreachable.
 ## Sessions Route Behaviour
-`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
+`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both.
-The validation guard only rejects requests where neither is provided:
+Rejects only when neither is provided — allows `useChat` to write project
-
+assignment separately from rename operations.
 ```js
 if (!name?.trim() && projectId === undefined) {
  return res.status(400).json({ error: 'name or projectId is required' });
 }
 ```
 This allows `useChat` to write project assignment separately from rename
 operations.
 ## Caddy Configuration
-Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
+Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
 **Any new top-level route must be added here AND in `vite.config.js`.**
 ```
 handle /chat*      { reverse_proxy localhost:4000 }
@@ -207,6 +193,7 @@ handle /models*   { reverse_proxy localhost:4000 }
 handle /projects*  { reverse_proxy localhost:4000 }
 handle /episodes*  { reverse_proxy localhost:4000 }
 handle /settings*  { reverse_proxy localhost:4000 }
 handle /summaries* { reverse_proxy localhost:4000 }
 handle /health*    { reverse_proxy localhost:4000 }
 ```
--- a/docs/services/shared.md
+++ b/docs/services/shared.md
@@ -165,10 +165,16 @@ Orchestration pipeline defaults. Used as fallback values in
 | `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
 | `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
 | `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
 | `ENTITIES_LIMIT` | `5` | Max entity search results to inject into prompt |
 | `ENTITIES_THRESHOLD` | `0.55` | Minimum similarity score for entity results |
 | `TEMPERATURE` | `0.7` | Default inference temperature |
 | `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
 | `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
 > `ENTITIES_THRESHOLD` is set to `0.55` — lower than `SCORE_THRESHOLD` because
 > entity notes generated by a 3B model tend to embed with lower cosine similarity
 > than full episode text. Tune upward if irrelevant entities appear in context.
 > `repeatPenalty`, `topP`, and `topK` defaults are sourced from
 > `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
 > since those constants already define the canonical values.
@@ -178,6 +184,25 @@ Default system prompt:
 > of past conversations with the user. Use them to provide consistent,
 > personalised responses."
 #### `SUMMARIES`
 Controls the automatic session summarisation system in `orchestration-service/src/services/summarization.js`.
 | Key | Value | Description |
 |---|---|---|
 | `THRESHOLD_TOKENS` | `200` | Minimum total session tokens before summarisation is considered |
 | `MAX_SUMMARY_TOKENS` | `800` | If existing summary exceeds this length (chars), create a new row instead of updating |
 | `MIN_EPISODES_SINCE` | `5` | Minimum new episodes since last summary before re-summarising |
 These can be overridden per-deployment via environment variables in the
 orchestration service `.env`:
 ```
 SUMMARY_THRESHOLD_TOKENS=200
 SUMMARY_MAX_TOKENS=800
 SUMMARY_MIN_EPISODES=5
 ```
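
Reading these overrides with the documented defaults as fallbacks might look like the sketch below. `loadSummaryConfig` and `readInt` are hypothetical names, and ignoring non-numeric or negative values is an assumption.

```js
// Defaults taken from the SUMMARIES table.
const SUMMARY_DEFAULTS = { THRESHOLD_TOKENS: 200, MAX_SUMMARY_TOKENS: 800, MIN_EPISODES_SINCE: 5 };

// Parses a non-negative integer, falling back when the value is missing or invalid.
function readInt(value, fallback) {
    const parsed = Number.parseInt(value, 10);
    return Number.isInteger(parsed) && parsed >= 0 ? parsed : fallback;
}

// `env` would typically be process.env; it is passed in to keep the function pure.
function loadSummaryConfig(env) {
    return {
        THRESHOLD_TOKENS: readInt(env.SUMMARY_THRESHOLD_TOKENS, SUMMARY_DEFAULTS.THRESHOLD_TOKENS),
        MAX_SUMMARY_TOKENS: readInt(env.SUMMARY_MAX_TOKENS, SUMMARY_DEFAULTS.MAX_SUMMARY_TOKENS),
        MIN_EPISODES_SINCE: readInt(env.SUMMARY_MIN_EPISODES, SUMMARY_DEFAULTS.MIN_EPISODES_SINCE),
    };
}
```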
 #### `SQLITE`
 | Key | Value | Description |
--- a/docs/services/summarization.md
+++ b/docs/services/summarization.md
@@ -0,0 +1,201 @@
 # Summarization
 Session summarization generates rolling plain-text summaries of conversation
 history, giving the model a condensed view of past context without consuming
 the full context window with raw episodes.
 **Location:** `packages/orchestration-service/src/services/summarization.js`  
 **Triggered by:** `chat/index.js` after every episode write (fire-and-forget)  
 **Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)
 ---
 ## Trigger Conditions
 `triggerSummary(session, allEpisodes)` calls `maybeSummarize` fire-and-forget.
 `maybeSummarize` proceeds only when both conditions are met:
 1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
 2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have
   accumulated since the last summary
 The token threshold is intentionally low — it ensures summaries start
 generating early in a session's life rather than only after very long
 conversations.
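
The two trigger conditions combine as a single boolean check. The sketch below is illustrative: `shouldSummarize`, `totalTokens`, and `episodesSinceLast` are assumed names, and how the real `maybeSummarize` computes those inputs is not shown here.

```js
// Returns true only when both documented conditions hold.
function shouldSummarize(
    { totalTokens, episodesSinceLast },
    limits = { thresholdTokens: 200, minEpisodesSince: 5 }
) {
    // Condition 1: token count must exceed the threshold (strictly greater).
    const enoughTokens = totalTokens > limits.thresholdTokens;
    // Condition 2: at least the minimum number of new episodes.
    const enoughEpisodes = episodesSinceLast >= limits.minEpisodesSince;
    return enoughTokens && enoughEpisodes;
}
```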
 ---
 ## Summary Rows and Cumulative Updates
 Each session can have multiple summary rows in the `summaries` table.
 The update strategy depends on the size of the most recent summary:
 | Condition | Action |
 |---|---|
 | No existing summary | Generate fresh summary from all episodes |
 | Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
 | Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |
 This produces a chain of summary rows over time. Each row's `episode_range`
 covers only the episodes summarised in that specific pass (e.g. `259-263`),
 not all episodes in the session.
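
The decision table above maps onto a three-way branch. This sketch assumes the latest summary is an object with a `content` string and that its length is compared in characters, as the `SUMMARIES` description suggests; `chooseSummaryAction` and the returned labels are hypothetical.

```js
// Picks the update strategy from the most recent summary row (or null if none).
function chooseSummaryAction(latestSummary, maxSummaryLength = 800) {
    if (!latestSummary) return 'create-fresh';

    // Over the limit: start a new row. Otherwise fold new episodes into the latest row.
    return latestSummary.content.length > maxSummaryLength
        ? 'create-new-row'
        : 'update-latest';
}
```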
 ---
 ## Ollama Request
 ```js
 {
    model: EXTRACTION_MODEL,   // qwen2.5:3b (set via EXTRACTION_MODEL env var)
    prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
    stream: false,
    // No format: 'json' — free-text output required for summaries
    options: {
        temperature: 0.2,
        num_predict: 500,
    },
 }
 ```
 `temperature: 0.2` is slightly higher than extraction (0.1) — summaries
 benefit from some fluency. `num_predict: 500` gives room for 5 thorough
 sentences without risk of runoff.
 ---
 ## Prompt Format
 ChatML format — native to qwen2.5:
 ```
 <|im_start|>user
 Summarize the conversation below in 3-5 sentences.
 Write in third person. Do not quote directly — paraphrase only.
 Do not include greetings, sign-offs, or filler. Output only the summary text.
 Conversation:
 {context}
 <|im_end|>
 <|im_start|>assistant
 ```
 For cumulative updates, the instruction and context change:
 ```
 <|im_start|>user
 Update the summary below to incorporate the new exchanges.
 Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
 Do not include greetings, sign-offs, or filler. Output only the updated summary text.
 Previous summary:
 {existingSummary}
 New exchanges:
 {context}
 <|im_end|>
 <|im_start|>assistant
 ```
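
The two prompt shapes differ only in their labelled sections, so a builder could share the ChatML wrapper. This is a sketch, not the real `buildSummaryPrompt`: the function names, the `instructions` parameter, and the trailing newline after the assistant tag are assumptions.

```js
// Wraps user text in ChatML and leaves the assistant turn open for generation.
function wrapChatMl(userText) {
    return `<|im_start|>user\n${userText}\n<|im_end|>\n<|im_start|>assistant\n`;
}

// `instructions` is the instruction paragraph; `context` is the rendered episode text.
// When `existingSummary` is given, the cumulative-update layout is used.
function buildSummaryPromptSketch(instructions, context, existingSummary = null) {
    const body = existingSummary
        ? `${instructions}\nPrevious summary:\n${existingSummary}\nNew exchanges:\n${context}`
        : `${instructions}\nConversation:\n${context}`;
    return wrapChatMl(body);
}
```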
 ### Input truncation
 Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the
 most recent exchanges (sliced from the end). This keeps Qwen focused and
 prevents the prompt from exceeding its effective context window.
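
Keeping the most recent text amounts to slicing from the end of the string. A minimal sketch, assuming the context is already a single string; `truncateContext` is a hypothetical name.

```js
const MAX_CHARS = 3000;

// Keeps at most `maxChars` characters, taken from the end (most recent exchanges).
function truncateContext(context, maxChars = MAX_CHARS) {
    if (maxChars <= 0) return '';
    return context.length > maxChars ? context.slice(-maxChars) : context;
}
```

Note that a character-based cut can start mid-exchange; the sketch does not try to align to episode boundaries.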
 ---
 ## ChatML Token Stripping
 Qwen occasionally echoes ChatML tokens back into its response. The raw output
 is cleaned before saving:
 ```js
 const raw = data.response?.trim() ?? '';
 const content = raw
    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
    .trim();
 return content;
 ```
 Without this, leaked tokens get stored in the summary and then injected
 back into the next summarisation prompt — causing the model to append a new
 summary after the old one rather than replacing it.
 ---
 ## Episode Range Tracking
 Each summary row stores `episode_range` as `"firstId-lastId"` covering only
 the episodes summarised in that pass:
 ```js
 const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b);
 const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
 ```
 This makes SummaryView cards meaningful — "Episodes 259-263" tells you
 exactly which exchanges that summary covers, rather than always showing
 the full session range.
 ---
 ## Summary Storage
 Summaries are written directly to the memory service from orchestration:
 ```js
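 // Create new row
 await fetch(`${MEMORY_URL}/summaries`, {
    method: 'POST',
    // JSON content type so the Express JSON body parser accepts the payload
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
 });
 // Update existing row
 await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
    method: 'PATCH',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ content, tokenCount, episodeRange }),
 });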
 ```
 `session.id` here is the internal SQLite integer ID — not the external UUID.
 It is available directly on the `session` object passed from `chat/index.js`.
 ---
 ## Client-Side Indicator
 The chat client shows a "Summarising…" spinner in the `ChatWindow` header
 and on the InfoPanel's Session Memory button while summarisation may be
 in progress.
 Since summarisation is fire-and-forget with no completion signal back to
 the client, the indicator is timer-based: it activates when the stream
 finishes and clears after 8 seconds.
 ```js
 // In App.jsx, watching the streaming state from useChat:
 useEffect(() => {
    // Record the new value first so the ref is not left stale by the early return
    const wasStreaming = prevStreaming.current;
    prevStreaming.current = streaming;
    if (wasStreaming && !streaming) {
        setSummarising(true);
        const t = setTimeout(() => setSummarising(false), 8000);
        return () => clearTimeout(t);
    }
 }, [streaming]);
 ```
 ---
 ## Environment Variables
 Set in `packages/orchestration-service/src/.env`:
 | Variable | Default | Description |
 |---|---|---|
 | `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
 | `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
 | `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
 | `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
 | `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
 | `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |