Summarization

Session summarization generates rolling plain-text summaries of conversation history, giving the model a condensed view of past context without consuming the full context window with raw episodes.

Location: packages/orchestration-service/src/services/summarization.js
Triggered by: chat/index.js after every episode write (fire-and-forget)
Model: qwen2.5:3b via Ollama on Mini PC 1 (192.168.0.81)

Trigger Conditions

triggerSummary(session, allEpisodes) calls maybeSummarize without awaiting it (fire-and-forget), so the chat response does not wait on summarisation. maybeSummarize proceeds only when both conditions are met:

Total session token count exceeds SUMMARIES.THRESHOLD_TOKENS (default 200)
At least SUMMARIES.MIN_EPISODES_SINCE (default 5) new episodes have accumulated since the last summary

The token threshold is intentionally low — it ensures summaries start generating early in a session's life rather than only after very long conversations.
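
The two conditions can be sketched as a single gate. This is a minimal illustration, not the service's actual code: the function name shouldSummarize, the episodesAtLastSummary parameter, and the per-episode tokenCount field are assumptions; only the two thresholds and their defaults come from this page.

```javascript
// Defaults mirror SUMMARIES.THRESHOLD_TOKENS and SUMMARIES.MIN_EPISODES_SINCE.
const SUMMARIES = { THRESHOLD_TOKENS: 200, MIN_EPISODES_SINCE: 5 };

// Assumption: each episode exposes a numeric tokenCount, and the caller knows
// how many episodes existed when the last summary was written (0 if none).
function shouldSummarize(allEpisodes, episodesAtLastSummary, config = SUMMARIES) {
    const totalTokens = allEpisodes.reduce((sum, ep) => sum + (ep.tokenCount ?? 0), 0);
    const newEpisodes = allEpisodes.length - episodesAtLastSummary;

    // Both conditions must hold: enough total tokens AND enough new episodes.
    return totalTokens > config.THRESHOLD_TOKENS
        && newEpisodes >= config.MIN_EPISODES_SINCE;
}
```

With the defaults, five episodes of 50 tokens each pass the gate (250 tokens, 5 new episodes), while four episodes of 100 tokens each do not, because only 4 new episodes have accumulated.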

Summary Rows and Cumulative Updates

Each session can have multiple summary rows in the summaries table. The update strategy depends on the size of the most recent summary:

Condition	Action
No existing summary	Generate fresh summary from all episodes
Latest summary under `MAX_SUMMARY_TOKENS`	Update: summarise new episodes with existing summary as context
Latest summary over `MAX_SUMMARY_TOKENS`	Create new row: treat as fresh summarisation

This produces a chain of summary rows over time. Each row's episode_range covers only the episodes summarised in that specific pass (e.g. 259-263), not all episodes in the session.
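
The table above can be read as a small decision function. The sketch below is hypothetical: chooseSummaryAction and the shape of the latest summary row (content, tokenCount) are assumptions, and a summary exactly at the limit is treated as "over", which this page does not specify.

```javascript
// Decide whether to create a new summary row or update the latest one.
// Assumption: a summary row looks like { content, tokenCount }.
function chooseSummaryAction(latestSummary, maxSummaryTokens = 800) {
    // No existing summary: generate a fresh one.
    if (!latestSummary) {
        return { action: 'create', existingSummary: null };
    }
    // Latest summary still has room: update it, passing its text as context.
    if (latestSummary.tokenCount < maxSummaryTokens) {
        return { action: 'update', existingSummary: latestSummary.content };
    }
    // Latest summary is full: start a new row as a fresh summarisation.
    return { action: 'create', existingSummary: null };
}
```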

Ollama Request

{
    model: EXTRACTION_MODEL,   // qwen2.5:3b (set via EXTRACTION_MODEL env var)
    prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
    stream: false,
    // No format: 'json' — free-text output required for summaries
    options: {
        temperature: 0.2,
        num_predict: 500,
    },
}

A temperature of 0.2 is slightly higher than the 0.1 used for extraction, because summaries benefit from some fluency. num_predict: 500 caps the response at 500 tokens, which leaves room for five thorough sentences while stopping runaway generation.
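
For orientation, here is one way the request might be sent. It is a sketch under assumptions: it targets Ollama's /api/generate endpoint (this page does not name the endpoint), and requestSummary with its injectable fetchImpl parameter is a hypothetical helper, not a function from summarization.js.

```javascript
// Send a non-streaming generate request to Ollama and return the raw text.
// fetchImpl is injectable so the function can be exercised without a network.
async function requestSummary(prompt, {
    baseUrl = 'http://localhost:11434',
    model = 'qwen2.5:3b',
    fetchImpl = globalThis.fetch,
} = {}) {
    const res = await fetchImpl(`${baseUrl}/api/generate`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            model,
            prompt,
            stream: false,
            options: { temperature: 0.2, num_predict: 500 },
        }),
    });
    if (!res.ok) {
        throw new Error(`Ollama request failed with status ${res.status}`);
    }
    const data = await res.json();
    // With stream: false, the generated text arrives in a single `response` field.
    return data.response?.trim() ?? '';
}
```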

Prompt Format

ChatML format — native to qwen2.5:

<|im_start|>user
Summarize the conversation below in 3-5 sentences.
Write in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the summary text.

Conversation:
{context}
<|im_end|>
<|im_start|>assistant

For cumulative updates, the instruction and context change:

<|im_start|>user
Update the summary below to incorporate the new exchanges.
Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the updated summary text.

Previous summary:
{existingSummary}

New exchanges:
{context}
<|im_end|>
<|im_start|>assistant

Input Truncation

Episode context is truncated to MAX_CHARS = 3000 characters, keeping the most recent exchanges (sliced from the end). This keeps Qwen focused and prevents the prompt from exceeding its effective context window.
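
The truncation and the two templates above might fit together roughly as follows. This is a sketch, not the real buildSummaryPrompt: truncateContext, wrapChatML, and assemblePrompt are hypothetical names, the episode context is assumed to be a pre-formatted string, and the instruction wording is passed in rather than hard-coded.

```javascript
const MAX_CHARS = 3000;

// Keep only the last maxChars characters so the most recent exchanges survive.
function truncateContext(context, maxChars = MAX_CHARS) {
    if (maxChars <= 0) return '';
    return context.length > maxChars ? context.slice(-maxChars) : context;
}

// Wrap one user turn in ChatML and leave the assistant turn open for generation.
function wrapChatML(userContent) {
    return `<|im_start|>user\n${userContent}\n<|im_end|>\n<|im_start|>assistant\n`;
}

// Pick the fresh or cumulative layout depending on whether a summary exists.
function assemblePrompt(instructions, context, existingSummary = null) {
    const recent = truncateContext(context);
    const sections = existingSummary
        ? [instructions, '', 'Previous summary:', existingSummary, '', 'New exchanges:', recent]
        : [instructions, '', 'Conversation:', recent];
    return wrapChatML(sections.join('\n'));
}
```

Slicing from the end means that when a session outgrows the limit, the oldest exchanges are dropped first; in the cumulative case those are assumed to be represented already by the previous summary.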

ChatML Token Stripping

Qwen occasionally echoes ChatML tokens back into its response. The raw output is cleaned before saving:

const raw = data.response?.trim() ?? '';
const content = raw
    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
    .trim();
return content;

Without this, leaked tokens get stored in the summary and then injected back into the next summarisation prompt — causing the model to append a new summary after the old one rather than replacing it.
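
To make the two passes concrete, the sketch below wraps the same regular expressions in a hypothetical cleanSummary function and runs it on an invented response in which the model echoed part of the prompt. The sample string is illustrative, not captured output.

```javascript
// Same two-pass cleaning as above, wrapped for reuse (function name is hypothetical).
function cleanSummary(raw) {
    return (raw ?? '')
        // Pass 1: drop any complete echoed turn, markers and contents together.
        .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
        // Pass 2: drop stray markers that were not part of a complete turn.
        .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
        .trim();
}

// Invented example: an echoed user turn followed by the real summary and a stray end marker.
const leaked = '<|im_start|>user\nSummarize the conversation below.<|im_end|>\n'
    + 'The user discussed backups.<|im_end|>';
const cleaned = cleanSummary(leaked);
// cleaned === 'The user discussed backups.'
```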

Episode Range Tracking

Each summary row stores episode_range as "firstId-lastId" covering only the episodes summarised in that pass:

const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b);
const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;

This makes SummaryView cards meaningful — "Episodes 259-263" tells you exactly which exchanges that summary covers, rather than always showing the full session range.

Summary Storage

Summaries are written directly to the memory service from orchestration:

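// Create new row
await fetch(`${MEMORY_URL}/summaries`, {
    method: 'POST',
    // Declare the JSON body so the memory service's body parser picks it up
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
});

// Update existing row
await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
    method: 'PATCH',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ content, tokenCount, episodeRange }),
});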

session.id here is the internal SQLite integer ID — not the external UUID. It is available directly on the session object passed from chat/index.js.

Client-Side Indicator

The chat client shows a "Summarising…" spinner in the ChatWindow header and on the InfoPanel's Session Memory button while summarisation may be in progress.

Since summarisation is fire-and-forget with no completion signal back to the client, the indicator is timer-based: it activates when the stream finishes and clears after 8 seconds.

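// In App.jsx, watching the streaming state from useChat:
useEffect(() => {
    // Record the new value first so the ref is updated on every path
    const wasStreaming = prevStreaming.current;
    prevStreaming.current = streaming;

    // A true -> false transition means the stream just finished
    if (wasStreaming && !streaming) {
        setSummarising(true);
        const t = setTimeout(() => setSummarising(false), 8000);
        return () => clearTimeout(t);
    }
}, [streaming]);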

Environment Variables

Set in packages/orchestration-service/src/.env:

Variable	Default	Description
`EXTRACTION_URL`	`http://localhost:11434`	Ollama instance URL
`EXTRACTION_MODEL`	`qwen2.5:3b`	Model for summarisation
`MEMORY_SERVICE_URL`	`http://localhost:3002`	Memory service URL
`SUMMARY_THRESHOLD_TOKENS`	`200`	Token threshold before summarisation triggers
`SUMMARY_MAX_TOKENS`	`800`	Max summary length before a new row is created
`SUMMARY_MIN_EPISODES`	`5`	Min new episodes since last summary before re-summarising
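
A sketch of how these variables could be read with their defaults. The loader below is hypothetical, and the mapping from environment variable names to the THRESHOLD_TOKENS, MAX_SUMMARY_TOKENS, and MIN_EPISODES_SINCE keys is an assumption inferred from the names used elsewhere on this page.

```javascript
// Parse an integer env var, falling back to the default when unset or invalid.
function intFromEnv(env, name, fallback) {
    const parsed = Number.parseInt(env[name] ?? '', 10);
    return Number.isNaN(parsed) ? fallback : parsed;
}

// Build the summarisation config from an env-like object (defaults to process.env).
function loadSummaryConfig(env = process.env) {
    return {
        EXTRACTION_URL: env.EXTRACTION_URL ?? 'http://localhost:11434',
        EXTRACTION_MODEL: env.EXTRACTION_MODEL ?? 'qwen2.5:3b',
        MEMORY_SERVICE_URL: env.MEMORY_SERVICE_URL ?? 'http://localhost:3002',
        THRESHOLD_TOKENS: intFromEnv(env, 'SUMMARY_THRESHOLD_TOKENS', 200),
        MAX_SUMMARY_TOKENS: intFromEnv(env, 'SUMMARY_MAX_TOKENS', 800),
        MIN_EPISODES_SINCE: intFromEnv(env, 'SUMMARY_MIN_EPISODES', 5),
    };
}
```

Parsing the numeric values up front avoids comparing token counts against strings later, since every environment variable arrives as a string.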
