# Summarization

Session summarization generates rolling plain-text summaries of conversation history, giving the model a condensed view of past context without consuming the full context window with raw episodes.

**Location:** `packages/orchestration-service/src/services/summarization.js`

**Triggered by:** `chat/index.js` after every episode write (fire-and-forget)

**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)

---

## Trigger Conditions

`triggerSummary(session, allEpisodes)` calls `maybeSummarize` without awaiting it (fire-and-forget). `maybeSummarize` proceeds only when both conditions are met:

1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have accumulated since the last summary

The token threshold is intentionally low: it ensures summaries start generating early in a session's life rather than only after very long conversations.

---

## Summary Rows and Cumulative Updates

Each session can have multiple summary rows in the `summaries` table. The update strategy depends on the size of the most recent summary:

| Condition | Action |
|---|---|
| No existing summary | Generate fresh summary from all episodes |
| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |

This produces a chain of summary rows over time. Each row's `episode_range` covers only the episodes summarised in that specific pass (e.g. `259-263`), not all episodes in the session.

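
The trigger conditions and the update table can be read together as one decision function. The following is a minimal sketch under stated assumptions, not the code in `summarization.js`: the function name `decideSummaryAction`, its argument shapes, the `tokenCount` field on the latest summary, and the handling of a summary exactly at the limit are all hypothetical. The thresholds use the documented defaults.

```javascript
// Hypothetical sketch of the decision logic; names and shapes are assumptions.
const SUMMARIES = { THRESHOLD_TOKENS: 200, MIN_EPISODES_SINCE: 5 };
const MAX_SUMMARY_TOKENS = 800;

/**
 * @param {number} sessionTokens    total token count of the session
 * @param {number} newEpisodeCount  episodes accumulated since the last summary
 * @param {{tokenCount: number}|null} latestSummary  most recent summary row, if any
 * @returns {'skip'|'create'|'update'}
 */
function decideSummaryAction(sessionTokens, newEpisodeCount, latestSummary) {
  // Both trigger conditions must hold before any model call is made.
  if (sessionTokens <= SUMMARIES.THRESHOLD_TOKENS) return 'skip';
  if (newEpisodeCount < SUMMARIES.MIN_EPISODES_SINCE) return 'skip';

  // No summary yet: start a fresh row.
  if (!latestSummary) return 'create';

  // Under the limit: fold new episodes into the existing row.
  // At or over the limit: start a new row (the "at" case is an assumption).
  return latestSummary.tokenCount < MAX_SUMMARY_TOKENS ? 'update' : 'create';
}

console.log(decideSummaryAction(500, 6, { tokenCount: 300 }));
```

Keeping the decision separate from the Ollama call, as in this sketch, would make the thresholds easy to unit-test without a running model.
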

---

## Ollama Request

```js
{
  model: EXTRACTION_MODEL, // qwen2.5:3b (set via EXTRACTION_MODEL env var)
  prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
  stream: false,
  // No format: 'json' here; free-text output is required for summaries
  options: {
    temperature: 0.2,
    num_predict: 500,
  },
}
```

`temperature: 0.2` is slightly higher than extraction (0.1) because summaries benefit from some fluency. `num_predict: 500` gives room for 5 thorough sentences without risk of runoff.

---

## Prompt Format

The prompt uses ChatML, the chat format native to qwen2.5:

```
<|im_start|>user
Summarize the conversation below in 3-5 sentences. Write in third person. Do not quote directly — paraphrase only. Do not include greetings, sign-offs, or filler. Output only the summary text.

Conversation:
{context}
<|im_end|>
<|im_start|>assistant
```

For cumulative updates, the instruction and context change:

```
<|im_start|>user
Update the summary below to incorporate the new exchanges. Write 3-5 sentences in third person. Do not quote directly — paraphrase only. Do not include greetings, sign-offs, or filler. Output only the updated summary text.

Previous summary:
{existingSummary}

New exchanges:
{context}
<|im_end|>
<|im_start|>assistant
```

### Input truncation

Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the most recent exchanges (sliced from the end). This keeps Qwen focused and prevents the prompt from exceeding its effective context window.

---

## ChatML Token Stripping

Qwen occasionally echoes ChatML tokens back into its response. The raw output is cleaned before saving:

```js
const raw = data.response?.trim() ?? '';
const content = raw
  // Drop any complete echoed block, including the text between the markers
  .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
  // Then drop any stray markers left on their own
  .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
  .trim();
return content;
```

Without this, leaked tokens get stored in the summary and then injected back into the next summarisation prompt, which causes the model to append a new summary after the old one rather than replacing it.

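
To make the two guards around the model call concrete (tail truncation on the way in, token stripping on the way out), here is a small self-contained sketch. The helper names `truncateContext` and `stripChatMlTokens` are hypothetical and the sample strings are invented for illustration; only `MAX_CHARS` and the two regular expressions come from the description above.

```javascript
// Hypothetical helpers; only MAX_CHARS and the regexes come from the document.
const MAX_CHARS = 3000;

// Slice from the end so the most recent exchanges survive truncation.
function truncateContext(context, maxChars = MAX_CHARS) {
  return context.length > maxChars ? context.slice(-maxChars) : context;
}

// Two passes: remove complete echoed blocks, then any stray markers.
function stripChatMlTokens(raw) {
  return raw
    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
    .trim();
}

// Invented sample: the model echoed the user block before its summary.
const echoed = '<|im_start|>user\nSummarize...<|im_end|>\nThe user asked about backups.';
console.log(stripChatMlTokens(echoed)); // "The user asked about backups."

// Invented sample: a lone trailing end marker.
console.log(stripChatMlTokens('The user asked about backups.<|im_end|>'));

// A 3001-character context loses its oldest character, not its newest.
const longContext = 'a'.repeat(2990) + 'RECENT-TAIL';
console.log(truncateContext(longContext).length); // 3000
```

Note that the first pass deletes everything between a start and end marker, so it assumes the real summary text sits outside any echoed block.
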

---

## Episode Range Tracking

Each summary row stores `episode_range` as `"firstId-lastId"` covering only the episodes summarised in that pass:

```js
// Sort numerically so the range reads lowest-to-highest regardless of input order
const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a, b) => a - b);
const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
```

This makes SummaryView cards meaningful: "Episodes 259-263" tells you exactly which exchanges that summary covers, rather than always showing the full session range.

---

## Summary Storage

Summaries are written directly to the memory service from orchestration:

```js
// Create new row
await fetch(`${MEMORY_URL}/summaries`, {
  method: 'POST',
  body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
});

// Update existing row
await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
  method: 'PATCH',
  body: JSON.stringify({ content, tokenCount, episodeRange }),
});
```

`session.id` here is the internal SQLite integer ID, not the external UUID. It is available directly on the `session` object passed from `chat/index.js`.

---

## Client-Side Indicator

The chat client shows a "Summarising…" spinner in the `ChatWindow` header and on the InfoPanel's Session Memory button while summarisation may be in progress. Since summarisation is fire-and-forget with no completion signal back to the client, the indicator is timer-based: it activates when the stream finishes and clears after 8 seconds.

```js
// In App.jsx, watching the streaming state from useChat:
useEffect(() => {
  // Streaming just went from true to false: the episode write has been triggered
  if (prevStreaming.current && !streaming) {
    setSummarising(true);
    const t = setTimeout(() => setSummarising(false), 8000);
    // Cancel the pending timer if the effect re-runs or the component unmounts
    return () => clearTimeout(t);
  }
  prevStreaming.current = streaming;
}, [streaming]);
```

---

## Environment Variables

Set in `packages/orchestration-service/src/.env`:

| Variable | Default | Description |
|---|---|---|
| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |
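
As an illustration of how these variables and defaults might be resolved at startup, here is a minimal sketch. The loader `loadSummaryConfig` and the helper `intFromEnv` are hypothetical names, and the fallback-on-invalid-number behaviour is an assumption; the service may read its configuration differently.

```javascript
// Hypothetical config loader; variable names and defaults come from the table above.
function intFromEnv(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  // Missing or non-numeric values fall back to the documented default (assumption).
  return Number.isNaN(parsed) ? fallback : parsed;
}

function loadSummaryConfig(env = process.env) {
  return {
    EXTRACTION_URL: env.EXTRACTION_URL ?? 'http://localhost:11434',
    EXTRACTION_MODEL: env.EXTRACTION_MODEL ?? 'qwen2.5:3b',
    MEMORY_SERVICE_URL: env.MEMORY_SERVICE_URL ?? 'http://localhost:3002',
    SUMMARY_THRESHOLD_TOKENS: intFromEnv(env, 'SUMMARY_THRESHOLD_TOKENS', 200),
    SUMMARY_MAX_TOKENS: intFromEnv(env, 'SUMMARY_MAX_TOKENS', 800),
    SUMMARY_MIN_EPISODES: intFromEnv(env, 'SUMMARY_MIN_EPISODES', 5),
  };
}

// Example: override one value, keep the documented defaults for the rest.
console.log(loadSummaryConfig({ SUMMARY_MIN_EPISODES: '8' }));
```
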