Summarization
Session summarization generates rolling plain-text summaries of conversation history, giving the model a condensed view of past context without consuming the full context window with raw episodes.
Location: packages/orchestration-service/src/services/summarization.js
Triggered by: chat/index.js after every episode write (fire-and-forget)
Model: qwen2.5:3b via Ollama on Mini PC 1 (192.168.0.81)
Trigger Conditions
triggerSummary(session, allEpisodes) calls maybeSummarize without awaiting the result (fire-and-forget).
maybeSummarize proceeds only when both conditions are met:
- Total session token count exceeds SUMMARIES.THRESHOLD_TOKENS (default 200)
- At least SUMMARIES.MIN_EPISODES_SINCE (default 5) new episodes have accumulated since the last summary
The token threshold is intentionally low — it ensures summaries start generating early in a session's life rather than only after very long conversations.
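The two-part gate described above can be sketched as a small pure function. This is an illustration only, not the actual maybeSummarize implementation: the name shouldSummarize, the episode fields id and tokenCount, and the lastSummarizedId parameter are assumptions made for the example.

```javascript
// Hypothetical config object mirroring the documented defaults.
const SUMMARIES = { THRESHOLD_TOKENS: 200, MIN_EPISODES_SINCE: 5 };

// Assumed shapes: each episode has a numeric `id` and `tokenCount`;
// `lastSummarizedId` is the highest episode id covered by the latest
// summary, or null when the session has no summary yet.
function shouldSummarize(episodes, lastSummarizedId, cfg = SUMMARIES) {
  const totalTokens = episodes.reduce((sum, ep) => sum + ep.tokenCount, 0);
  const newEpisodes = lastSummarizedId == null
    ? episodes.length
    : episodes.filter(ep => ep.id > lastSummarizedId).length;
  // Both conditions must hold before a summary pass is started.
  return totalTokens > cfg.THRESHOLD_TOKENS && newEpisodes >= cfg.MIN_EPISODES_SINCE;
}
```

With the defaults, six episodes of 50 tokens each pass the gate on a session with no summary, while the same session fails it if only three episodes are newer than the last summary.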
Summary Rows and Cumulative Updates
Each session can have multiple summary rows in the summaries table.
The update strategy depends on the size of the most recent summary:
| Condition | Action |
|---|---|
| No existing summary | Generate fresh summary from all episodes |
| Latest summary under MAX_SUMMARY_TOKENS | Update: summarise new episodes with existing summary as context |
| Latest summary over MAX_SUMMARY_TOKENS | Create new row: treat as fresh summarisation |
This produces a chain of summary rows over time. Each row's episode_range
covers only the episodes summarised in that specific pass (e.g. 259-263),
not all episodes in the session.
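The decision in the table can be expressed as a three-way branch. The sketch below is a hypothetical helper, not code from summarization.js: the name chooseSummaryAction and the returned labels are invented for illustration, and the handling of a summary that sits exactly at the limit is an assumption, since the table only covers "under" and "over".

```javascript
// `latest` is the most recent summary row for the session, or null.
// `tokenCount` matches the field stored with each summary row.
function chooseSummaryAction(latest, maxSummaryTokens = 800) {
  if (!latest) return 'fresh'; // no existing summary: summarise all episodes
  // Assumption: a summary exactly at the limit is treated like "over".
  return latest.tokenCount < maxSummaryTokens ? 'update' : 'new-row';
}
```

An 'update' result would map to the PATCH call and the other two results to the POST call described under Summary Storage.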
Ollama Request
{
  model: EXTRACTION_MODEL, // qwen2.5:3b (set via EXTRACTION_MODEL env var)
  prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
  stream: false,
  // No format: 'json' — free-text output required for summaries
  options: {
    temperature: 0.2,
    num_predict: 500,
  },
}
temperature: 0.2 is slightly higher than the value used for extraction (0.1),
because summaries benefit from some fluency. num_predict: 500 leaves room for
five thorough sentences while capping runaway generation.
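For orientation, here is a minimal sketch of how such a request could be sent to Ollama's /api/generate endpoint and how the text could be read back from the response field. The function names buildGenerateBody and generateSummary, the baseUrl parameter (standing in for EXTRACTION_URL), and the injectable fetchImpl are assumptions for the example; error handling in the real service may differ.

```javascript
// Builds the JSON body using the options documented above.
function buildGenerateBody(prompt, model = 'qwen2.5:3b') {
  return {
    model,
    prompt,
    stream: false, // one JSON response instead of a token stream
    options: { temperature: 0.2, num_predict: 500 },
  };
}

// Hypothetical wrapper around the HTTP call. `fetchImpl` defaults to the
// global fetch (Node 18+) and can be swapped for a stub in tests.
async function generateSummary(baseUrl, prompt, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildGenerateBody(prompt)),
  });
  if (!res.ok) throw new Error(`Ollama request failed with status ${res.status}`);
  const data = await res.json();
  return data.response?.trim() ?? '';
}
```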
Prompt Format
ChatML format — native to qwen2.5:
<|im_start|>user
Summarize the conversation below in 3-5 sentences.
Write in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the summary text.
Conversation:
{context}
<|im_end|>
<|im_start|>assistant
For cumulative updates, the instruction and context change:
<|im_start|>user
Update the summary below to incorporate the new exchanges.
Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the updated summary text.
Previous summary:
{existingSummary}
New exchanges:
{context}
<|im_end|>
<|im_start|>assistant
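The two templates differ only in the instruction text and in whether a previous summary is included, so they can share one builder. The sketch below is hypothetical and is not the real buildSummaryPrompt: it takes an already-flattened context string, reproduces only the first instruction line of each template (the remaining instruction lines are omitted for brevity), and assumes the prompt ends with a newline after the open assistant turn.

```javascript
// Wraps user content in a single-turn ChatML prompt that ends with an
// open assistant turn for the model to complete.
function wrapChatML(userContent) {
  return `<|im_start|>user\n${userContent}\n<|im_end|>\n<|im_start|>assistant\n`;
}

// Chooses the fresh or cumulative template depending on whether an
// existing summary is supplied.
function buildPrompt(context, existingSummary = null) {
  const lines = existingSummary
    ? [
        'Update the summary below to incorporate the new exchanges.',
        'Previous summary:',
        existingSummary,
        'New exchanges:',
        context,
      ]
    : ['Summarize the conversation below in 3-5 sentences.', 'Conversation:', context];
  return wrapChatML(lines.join('\n'));
}
```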
Input truncation
Episode context is truncated to MAX_CHARS = 3000 characters, keeping the
most recent exchanges (sliced from the end). This keeps Qwen focused and
prevents the prompt from exceeding its effective context window.
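Slicing from the end can be done in one line. This is a minimal sketch; the function name truncateContext is an assumption, and only the MAX_CHARS value comes from the text above.

```javascript
const MAX_CHARS = 3000;

// Keeps at most the last `maxChars` characters, so the most recent
// exchanges survive and the oldest text is dropped first.
function truncateContext(text, maxChars = MAX_CHARS) {
  return text.slice(Math.max(0, text.length - maxChars));
}
```

Because the cut is made on a character boundary, the retained context may begin in the middle of an exchange.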
ChatML Token Stripping
Qwen occasionally echoes ChatML tokens back into its response. The raw output is cleaned before saving:
const raw = data.response?.trim() ?? '';
const content = raw
  .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
  .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
  .trim();
return content;
Without this, leaked tokens get stored in the summary and then injected back into the next summarisation prompt — causing the model to append a new summary after the old one rather than replacing it.
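To make the effect of the two replace passes concrete, the sketch below wraps the same expressions in a function (the name stripChatMLTokens is an assumption) and runs them on two invented sample strings.

```javascript
// Same two-step cleanup as the snippet above, wrapped for demonstration.
function stripChatMLTokens(raw) {
  return raw
    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '') // drop complete echoed turns
    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '') // drop stray tokens
    .trim();
}

// A stray end token is removed and the text is kept.
console.log(stripChatMLTokens('The user asked about backups.<|im_end|>'));
// → The user asked about backups.

// An echoed turn is removed together with its contents.
console.log(stripChatMLTokens('<|im_start|>user\nleaked<|im_end|>\nThe summary.'));
// → The summary.
```

One consequence of the first pass: if the model wraps its entire answer in a start/end pair, the whole answer is removed and the result is an empty string, so callers may want to check for empty content before saving.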
Episode Range Tracking
Each summary row stores episode_range as "firstId-lastId" covering only
the episodes summarised in that pass:
const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b);
const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
This makes SummaryView cards meaningful — "Episodes 259-263" tells you exactly which exchanges that summary covers, rather than always showing the full session range.
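A consumer such as SummaryView needs to turn the stored string back into numbers. The helpers below are a hypothetical sketch of both directions; the names formatEpisodeRange and parseEpisodeRange and the null return for an empty list are assumptions, and the parser assumes non-negative integer IDs.

```javascript
// Builds "firstId-lastId" from the ids summarised in one pass.
function formatEpisodeRange(ids) {
  if (ids.length === 0) return null; // nothing was summarised
  const sorted = [...ids].sort((a, b) => a - b); // copy so the input is not mutated
  return `${sorted[0]}-${sorted[sorted.length - 1]}`;
}

// Splits "firstId-lastId" back into numeric bounds.
function parseEpisodeRange(range) {
  const [first, last] = range.split('-').map(Number);
  return { first, last };
}
```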
Summary Storage
Summaries are written directly to the memory service from orchestration:
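// Create new row
await fetch(`${MEMORY_URL}/summaries`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' }, // so the body is parsed as JSON
  body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
});
// Update existing row
await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' }, // so the body is parsed as JSON
  body: JSON.stringify({ content, tokenCount, episodeRange }),
});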
session.id here is the internal SQLite integer ID — not the external UUID.
It is available directly on the session object passed from chat/index.js.
Client-Side Indicator
The chat client shows a "Summarising…" spinner in the ChatWindow header
and on the InfoPanel's Session Memory button while summarisation may be
in progress.
Since summarisation is fire-and-forget with no completion signal back to the client, the indicator is timer-based: it activates when the stream finishes and clears after 8 seconds.
// In App.jsx, watching the streaming state from useChat:
useEffect(() => {
  if (prevStreaming.current && !streaming) {
    setSummarising(true);
    const t = setTimeout(() => setSummarising(false), 8000);
    return () => clearTimeout(t);
  }
  prevStreaming.current = streaming;
}, [streaming]);
Environment Variables
Set in packages/orchestration-service/src/.env:
| Variable | Default | Description |
|---|---|---|
| EXTRACTION_URL | http://localhost:11434 | Ollama instance URL |
| EXTRACTION_MODEL | qwen2.5:3b | Model for summarisation |
| MEMORY_SERVICE_URL | http://localhost:3002 | Memory service URL |
| SUMMARY_THRESHOLD_TOKENS | 200 | Token threshold before summarisation triggers |
| SUMMARY_MAX_TOKENS | 800 | Max summary length before a new row is created |
| SUMMARY_MIN_EPISODES | 5 | Min new episodes since last summary before re-summarising |
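One way these variables could be read with their defaults is sketched below. The loader name loadSummaryConfig and the property names on the returned object are assumptions for illustration; only the variable names and default values come from the table.

```javascript
// Reads the summarisation settings from an environment-like object,
// falling back to the documented defaults when a variable is unset
// or is not a valid integer.
function loadSummaryConfig(env = process.env) {
  const int = (value, fallback) => {
    const n = Number.parseInt(value, 10);
    return Number.isNaN(n) ? fallback : n;
  };
  return {
    extractionUrl: env.EXTRACTION_URL ?? 'http://localhost:11434',
    extractionModel: env.EXTRACTION_MODEL ?? 'qwen2.5:3b',
    memoryServiceUrl: env.MEMORY_SERVICE_URL ?? 'http://localhost:3002',
    thresholdTokens: int(env.SUMMARY_THRESHOLD_TOKENS, 200),
    maxSummaryTokens: int(env.SUMMARY_MAX_TOKENS, 800),
    minEpisodesSince: int(env.SUMMARY_MIN_EPISODES, 5),
  };
}
```

Passing the environment as a parameter keeps the function easy to test, for example loadSummaryConfig({ SUMMARY_MAX_TOKENS: '1200' }).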