documentation updates for entity extraction and summarization

2026-04-21 03:50:38 -07:00
parent 32365e67f4
commit acda21317b
6 changed files with 540 additions and 107 deletions
--- a/docs/services/summarization.md
+++ b/docs/services/summarization.md
@@ -0,0 +1,201 @@
+# Summarization
+
+Session summarization generates rolling plain-text summaries of conversation
+history, giving the model a condensed view of past context without consuming
+the full context window with raw episodes.
+
+**Location:** `packages/orchestration-service/src/services/summarization.js`  
+**Triggered by:** `chat/index.js` after every episode write (fire-and-forget)  
+**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)
+
+---
+
+## Trigger Conditions
+
+`triggerSummary(session, allEpisodes)` calls `maybeSummarize` without awaiting
+it (fire-and-forget). `maybeSummarize` proceeds only when both conditions are met:
+
+1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
+2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have
+   accumulated since the last summary
+
+The token threshold is intentionally low — it ensures summaries start
+generating early in a session's life rather than only after very long
+conversations.
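+
+A minimal sketch of the two-part gate described above. The function name
+`shouldSummarize` and its parameters are hypothetical; the `SUMMARIES` keys
+and defaults come from the list above, and the real `maybeSummarize` may
+derive its inputs differently.
+
+```js
+// Hypothetical config object mirroring the documented defaults.
+const SUMMARIES = { THRESHOLD_TOKENS: 200, MIN_EPISODES_SINCE: 5 };
+
+// Returns true only when both trigger conditions hold:
+// the token total exceeds the threshold AND enough new episodes exist.
+function shouldSummarize(totalTokens, episodesSinceLastSummary) {
+    return totalTokens > SUMMARIES.THRESHOLD_TOKENS
+        && episodesSinceLastSummary >= SUMMARIES.MIN_EPISODES_SINCE;
+}
+
+shouldSummarize(150, 10); // false: token threshold not exceeded
+shouldSummarize(450, 3);  // false: too few new episodes
+shouldSummarize(450, 5);  // true: both conditions met
+```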
+
+---
+
+## Summary Rows and Cumulative Updates
+
+Each session can have multiple summary rows in the `summaries` table.
+The update strategy depends on the size of the most recent summary:
+
+| Condition | Action |
+|---|---|
+| No existing summary | Generate fresh summary from all episodes |
+| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
+| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |
+
+This produces a chain of summary rows over time. Each row's `episode_range`
+covers only the episodes summarised in that specific pass (e.g. `259-263`),
+not all episodes in the session.
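+
+The table above can be read as a small decision function. This is a sketch
+only: `chooseSummaryAction` and its return labels are hypothetical, the
+`tokenCount` field is assumed to be the stored token count of the latest row,
+and the table does not say what happens when the count equals the limit
+(here that case is treated as "new row").
+
+```js
+// Hypothetical helper mapping the latest summary row to an action.
+// `latest` is the most recent summary row, or null when none exists.
+function chooseSummaryAction(latest, maxSummaryTokens = 800) {
+    if (!latest) return 'fresh';            // no existing summary
+    return latest.tokenCount < maxSummaryTokens
+        ? 'update'                          // fold new episodes into the row
+        : 'new-row';                        // start another summary row
+}
+
+chooseSummaryAction(null);                 // 'fresh'
+chooseSummaryAction({ tokenCount: 120 });  // 'update'
+chooseSummaryAction({ tokenCount: 950 });  // 'new-row'
+```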
+
+---
+
+## Ollama Request
+
+```js
+{
+    model: EXTRACTION_MODEL,   // qwen2.5:3b (set via EXTRACTION_MODEL env var)
+    prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
+    stream: false,
+    // No format: 'json' — free-text output required for summaries
+    options: {
+        temperature: 0.2,
+        num_predict: 500,
+    },
+}
+```
+
+`temperature: 0.2` is slightly higher than extraction (0.1), since summaries
+benefit from some fluency. `num_predict: 500` caps generation at 500 tokens:
+enough room for 5 thorough sentences while bounding runaway output.
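+
+For orientation, a sketch of how a request body like the one above might be
+sent. It assumes Ollama's standard non-streaming `/api/generate` endpoint,
+which returns the generated text in a `response` field; the function name,
+the option names, and the injectable `fetchImpl` are illustrative, not taken
+from `summarization.js`.
+
+```js
+// Sends a non-streaming generate request and returns the trimmed text.
+// All parameter names here are assumptions made for the sketch.
+async function generateSummary(prompt, {
+    baseUrl = 'http://localhost:11434',
+    model = 'qwen2.5:3b',
+    fetchImpl = fetch,          // global fetch (Node 18+); injectable for tests
+} = {}) {
+    const res = await fetchImpl(`${baseUrl}/api/generate`, {
+        method: 'POST',
+        headers: { 'Content-Type': 'application/json' },
+        body: JSON.stringify({
+            model,
+            prompt,
+            stream: false,
+            options: { temperature: 0.2, num_predict: 500 },
+        }),
+    });
+    if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);
+    const data = await res.json();
+    return data.response?.trim() ?? '';
+}
+```
+
+Because the caller is fire-and-forget, a thrown error here would need to be
+caught by the caller rather than surfaced to the chat response.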
+
+---
+
+## Prompt Format
+
+ChatML format — native to qwen2.5:
+
+```
+<|im_start|>user
+Summarize the conversation below in 3-5 sentences.
+Write in third person. Do not quote directly — paraphrase only.
+Do not include greetings, sign-offs, or filler. Output only the summary text.
+
+Conversation:
+{context}
+<|im_end|>
+<|im_start|>assistant
+```
+
+For cumulative updates, the instruction and context change:
+
+```
+<|im_start|>user
+Update the summary below to incorporate the new exchanges.
+Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
+Do not include greetings, sign-offs, or filler. Output only the updated summary text.
+
+Previous summary:
+{existingSummary}
+
+New exchanges:
+{context}
+<|im_end|>
+<|im_start|>assistant
+```
+
+### Input truncation
+
+Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the
+most recent exchanges (sliced from the end). This keeps Qwen focused and
+prevents the prompt from exceeding its effective context window.
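+
+Slicing from the end can be sketched as follows. The helper name
+`truncateContext` is hypothetical; only `MAX_CHARS = 3000` and the
+keep-the-tail behaviour come from the description above.
+
+```js
+const MAX_CHARS = 3000;
+
+// Keeps only the last `maxChars` characters so the newest exchanges survive.
+function truncateContext(context, maxChars = MAX_CHARS) {
+    return context.length > maxChars ? context.slice(-maxChars) : context;
+}
+
+truncateContext('abcdef', 4); // 'cdef'
+truncateContext('abc', 4);    // 'abc' (already short enough)
+```
+
+Note that a character-based cut can land mid-exchange, so the first retained
+line may be a partial message.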
+
+---
+
+## ChatML Token Stripping
+
+Qwen occasionally echoes ChatML tokens back into its response. The raw output
+is cleaned before saving:
+
+```js
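+// Ollama returns the generated text in `response`; fall back to an empty string.
+const raw = data.response?.trim() ?? '';
+const content = raw
+    // Pass 1: drop any complete <|im_start|>...<|im_end|> block echoed back.
+    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
+    // Pass 2: drop any stray, unpaired ChatML tokens.
+    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
+    .trim();
+return content;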
+```
+
+Without this, leaked tokens get stored in the summary and then injected
+back into the next summarisation prompt — causing the model to append a new
+summary after the old one rather than replacing it.
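+
+To illustrate the cleaning step, the same two replacements can be wrapped in
+a helper and run on sample strings. The name `stripChatMLTokens` and the
+sample inputs are hypothetical.
+
+```js
+// Same two regex passes as above, wrapped for reuse.
+function stripChatMLTokens(raw) {
+    return raw
+        .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
+        .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
+        .trim();
+}
+
+stripChatMLTokens('The user asked about backups.<|im_end|>');
+// → 'The user asked about backups.'
+
+stripChatMLTokens('<|im_start|>user\nHi<|im_end|>\nThe user said hello.');
+// → 'The user said hello.'
+```
+
+The first pass removes everything between a paired start and end token, so
+any text the model emits inside such a pair is discarded along with it.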
+
+---
+
+## Episode Range Tracking
+
+Each summary row stores `episode_range` as `"firstId-lastId"` covering only
+the episodes summarised in that pass:
+
+```js
+const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b);
+const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
+```
+
+This makes SummaryView cards meaningful — "Episodes 259-263" tells you
+exactly which exchanges that summary covers, rather than always showing
+the full session range.
+
+---
+
+## Summary Storage
+
+Summaries are written directly to the memory service from orchestration:
+
+```js
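+// Create new row
+// (fetch sends a string body as text/plain unless Content-Type is set)
+await fetch(`${MEMORY_URL}/summaries`, {
+    method: 'POST',
+    headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
+});
+
+// Update existing row
+await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
+    method: 'PATCH',
+    headers: { 'Content-Type': 'application/json' },
+    body: JSON.stringify({ content, tokenCount, episodeRange }),
+});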
+```
+
+`session.id` here is the internal SQLite integer ID — not the external UUID.
+It is available directly on the `session` object passed from `chat/index.js`.
+
+---
+
+## Client-Side Indicator
+
+The chat client shows a "Summarising…" spinner in the `ChatWindow` header
+and on the InfoPanel's Session Memory button while summarisation may be
+in progress.
+
+Since summarisation is fire-and-forget with no completion signal back to
+the client, the indicator is timer-based: it activates when the stream
+finishes and clears after 8 seconds.
+
+```js
+// In App.jsx, watching the streaming state from useChat:
+useEffect(() => {
+    if (prevStreaming.current && !streaming) {
+        setSummarising(true);
+        const t = setTimeout(() => setSummarising(false), 8000);
+        return () => clearTimeout(t);
+    }
+    prevStreaming.current = streaming;
+}, [streaming]);
+```
+
+---
+
+## Environment Variables
+
+Set in `packages/orchestration-service/src/.env`:
+
+| Variable | Default | Description |
+|---|---|---|
+| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
+| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
+| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
+| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
+| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
+| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |
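+
+A sketch of how these variables might be read with their documented
+defaults. The loader name, the returned key names, and the integer parsing
+are assumptions; the actual service may load its configuration differently.
+
+```js
+// Hypothetical loader: reads the variables above from an env-like object,
+// falling back to the documented defaults when a value is missing or invalid.
+function loadSummaryConfig(env = process.env) {
+    const int = (value, fallback) => {
+        const n = Number.parseInt(value ?? '', 10);
+        return Number.isNaN(n) ? fallback : n;
+    };
+    return {
+        extractionUrl: env.EXTRACTION_URL ?? 'http://localhost:11434',
+        extractionModel: env.EXTRACTION_MODEL ?? 'qwen2.5:3b',
+        memoryServiceUrl: env.MEMORY_SERVICE_URL ?? 'http://localhost:3002',
+        thresholdTokens: int(env.SUMMARY_THRESHOLD_TOKENS, 200),
+        maxSummaryTokens: int(env.SUMMARY_MAX_TOKENS, 800),
+        minEpisodesSince: int(env.SUMMARY_MIN_EPISODES, 5),
+    };
+}
+
+loadSummaryConfig({ SUMMARY_MAX_TOKENS: '1200' }).maxSummaryTokens; // 1200
+loadSummaryConfig({}).thresholdTokens;                              // 200
+```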