documentation updates for entity extraction and summarization
docs/services/summarization.md
@@ -0,0 +1,201 @@
# Summarization

Session summarization generates rolling plain-text summaries of conversation
history, giving the model a condensed view of past context without consuming
the full context window with raw episodes.

**Location:** `packages/orchestration-service/src/services/summarization.js`
**Triggered by:** `chat/index.js` after every episode write (fire-and-forget)
**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)

---

## Trigger Conditions

`triggerSummary(session, allEpisodes)` calls `maybeSummarize` without awaiting
it (fire-and-forget). `maybeSummarize` proceeds only when both conditions are met:

1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have
   accumulated since the last summary

The token threshold is intentionally low so that summaries start generating
early in a session's life rather than only after very long conversations.

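The two conditions above can be sketched as a small predicate. This is a minimal illustration, not the service's actual code: the function name `shouldSummarize`, the `tokenCount` field on each episode, and the `lastSummarizedId` argument are assumptions made for the example. Only the two `SUMMARIES` defaults come from this page.

```js
// Defaults documented on this page.
const SUMMARIES = { THRESHOLD_TOKENS: 200, MIN_EPISODES_SINCE: 5 };

// Hypothetical gate: returns true only when both trigger conditions hold.
// Assumes each episode carries an `id` and a `tokenCount` (illustrative fields).
function shouldSummarize(allEpisodes, lastSummarizedId = null) {
  const totalTokens = allEpisodes.reduce((sum, ep) => sum + ep.tokenCount, 0);
  if (totalTokens <= SUMMARIES.THRESHOLD_TOKENS) return false;

  // Episodes written after the last summarised one (all of them if none yet).
  const newEpisodes = lastSummarizedId === null
    ? allEpisodes
    : allEpisodes.filter(ep => ep.id > lastSummarizedId);
  return newEpisodes.length >= SUMMARIES.MIN_EPISODES_SINCE;
}
```

With the defaults, a session of six 50-token episodes qualifies for its first summary, but not for a second one until five more episodes have been written.
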
---

## Summary Rows and Cumulative Updates

Each session can have multiple summary rows in the `summaries` table.
The update strategy depends on the size of the most recent summary relative
to `MAX_SUMMARY_TOKENS` (default 800, set via `SUMMARY_MAX_TOKENS`):

| Condition | Action |
|---|---|
| No existing summary | Generate fresh summary from all episodes |
| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |

This produces a chain of summary rows over time. Each row's `episode_range`
covers only the episodes summarised in that specific pass (e.g. `259-263`),
not all episodes in the session.

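The decision table above could be expressed as a single helper. This is a sketch under assumptions: `chooseSummaryAction` and the shape of the `latest` row (`tokenCount`, `content`) are hypothetical, and the behaviour when the summary is exactly at the limit is not stated on this page, so the example arbitrarily treats it as "over".

```js
// Default taken from SUMMARY_MAX_TOKENS in the environment table.
const MAX_SUMMARY_TOKENS = 800;

// Hypothetical decision helper mirroring the table above.
// `latest` is the most recent summary row, or null when none exists.
function chooseSummaryAction(latest) {
  if (!latest) return { action: 'create', existingSummary: null };
  if (latest.tokenCount < MAX_SUMMARY_TOKENS) {
    // Fold the new episodes into the existing row.
    return { action: 'update', existingSummary: latest.content };
  }
  // The latest row is full: start a new row with a fresh summarisation.
  return { action: 'create', existingSummary: null };
}
```
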
---

## Ollama Request

```js
{
  model: EXTRACTION_MODEL, // qwen2.5:3b (set via EXTRACTION_MODEL env var)
  prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
  stream: false,
  // No format: 'json'; free-text output is required for summaries
  options: {
    temperature: 0.2,
    num_predict: 500,
  },
}
```

`temperature: 0.2` is slightly higher than extraction (0.1) because summaries
benefit from some fluency. `num_predict: 500` leaves room for five thorough
sentences while capping runaway output.

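For orientation, here is one way the request body above might be sent to Ollama's `/api/generate` endpoint, which returns the generated text in a `response` field when `stream` is `false`. The helper names `buildGenerateBody` and `requestSummary` are hypothetical, the error handling is an assumption, and the global `fetch` requires Node 18 or later.

```js
// Hypothetical helper: assembles the body shown above for a given model and prompt.
function buildGenerateBody(model, prompt) {
  return {
    model,
    prompt,
    stream: false, // one JSON object instead of a token stream
    options: { temperature: 0.2, num_predict: 500 },
  };
}

// Hypothetical sender; `fetchImpl` is injectable so it can be stubbed in tests.
async function requestSummary(baseUrl, model, prompt, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildGenerateBody(model, prompt)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.response ?? '';
}
```
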
---

## Prompt Format

The prompt uses ChatML, the native chat format of qwen2.5:

```
<|im_start|>user
Summarize the conversation below in 3-5 sentences.
Write in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the summary text.

Conversation:
{context}
<|im_end|>
<|im_start|>assistant
```

For cumulative updates, the instruction and context change:

```
<|im_start|>user
Update the summary below to incorporate the new exchanges.
Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the updated summary text.

Previous summary:
{existingSummary}

New exchanges:
{context}
<|im_end|>
<|im_start|>assistant
```

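The two templates differ only in their instruction text and in whether a previous summary is included, so they can be rendered by one function. The sketch below is illustrative: `renderSummaryPrompt` and its argument shape are assumptions, and the real `buildSummaryPrompt` takes the episode list rather than pre-formatted text. The instruction text is passed in rather than repeated here.

```js
// Hypothetical renderer for the two templates above.
// `instructions` is the instruction text, `context` the formatted episode text,
// and `existingSummary` is supplied only for cumulative updates.
function renderSummaryPrompt({ instructions, context, existingSummary = null }) {
  const sections = existingSummary
    ? ['Previous summary:', existingSummary, '', 'New exchanges:', context]
    : ['Conversation:', context];
  const userTurn = [instructions, '', ...sections].join('\n');
  // The prompt ends with an open assistant turn for the model to complete.
  return `<|im_start|>user\n${userTurn}\n<|im_end|>\n<|im_start|>assistant\n`;
}
```
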
### Input truncation

Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the
most recent exchanges (sliced from the end). This keeps Qwen focused and
prevents the prompt from exceeding its effective context window.

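Slicing from the end can be done with a negative index. A minimal sketch, assuming the context is already a single string (`truncateContext` is a hypothetical name):

```js
const MAX_CHARS = 3000;

// Keeps only the last `maxChars` characters, i.e. the most recent exchanges.
// Shorter inputs are returned unchanged.
function truncateContext(context, maxChars = MAX_CHARS) {
  return context.length > maxChars ? context.slice(-maxChars) : context;
}
```

A character-based cut can land in the middle of a message, so the first exchange in the truncated context may be incomplete.
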
---

## ChatML Token Stripping

Qwen occasionally echoes ChatML tokens back into its response. The raw output
is cleaned before saving:

```js
const raw = data.response?.trim() ?? '';
const content = raw
  .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
  .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
  .trim();
return content;
```

Without this, leaked tokens get stored in the summary and then injected
back into the next summarisation prompt, which causes the model to append a
new summary after the old one rather than replacing it.

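To see what the two passes do, the same expressions can be wrapped in a function and run on sample output. The wrapper name `stripChatMlTokens` and the sample strings are made up for illustration.

```js
// Same two regex passes as above, wrapped for experimentation.
function stripChatMlTokens(raw) {
  return raw
    .trim()
    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '') // whole echoed turns
    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '') // stray tokens
    .trim();
}

// A summary followed by an echoed user turn: only the summary survives.
const leaked =
  'The user set up nightly backups.<|im_end|>\n' +
  '<|im_start|>user\nSummarize the conversation below.<|im_end|>';
console.log(stripChatMlTokens(leaked)); // "The user set up nightly backups."

// A response wrapped in a complete start/end pair is removed entirely.
console.log(JSON.stringify(stripChatMlTokens('<|im_start|>assistant\nA summary.<|im_end|>'))); // ""
```

The second case is worth keeping in mind: because the first pattern deletes everything between a paired start and end token, a response that is itself wrapped in a complete pair comes back empty.
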
---

## Episode Range Tracking

Each summary row stores `episode_range` as `"firstId-lastId"` covering only
the episodes summarised in that pass:

```js
const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a, b) => a - b);
const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
```

This makes SummaryView cards meaningful: "Episodes 259-263" tells you
exactly which exchanges that summary covers, rather than always showing
the full session range.

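As a worked example, the same computation on an unordered batch of episodes yields the range quoted above. The function name `formatEpisodeRange` and the empty-input guard are assumptions added for the sketch. Without a guard, an empty array would produce the string `"undefined-undefined"`.

```js
// Hypothetical wrapper around the range computation shown above.
function formatEpisodeRange(episodes) {
  if (episodes.length === 0) return null; // guard is an assumption, not from the source
  // Numeric comparator: the default sort would compare ids as strings.
  const ids = episodes.map(ep => ep.id).sort((a, b) => a - b);
  return `${ids.at(0)}-${ids.at(-1)}`;
}

const batch = [{ id: 261 }, { id: 259 }, { id: 263 }, { id: 260 }, { id: 262 }];
console.log(formatEpisodeRange(batch)); // "259-263"
```
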
---

## Summary Storage

Summaries are written directly to the memory service from orchestration:

```js
// Create new row
await fetch(`${MEMORY_URL}/summaries`, {
  method: 'POST',
  body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
});

// Update existing row
await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
  method: 'PATCH',
  body: JSON.stringify({ content, tokenCount, episodeRange }),
});
```

`session.id` here is the internal SQLite integer ID, not the external UUID.
It is available directly on the `session` object passed from `chat/index.js`.

---

## Client-Side Indicator

The chat client shows a "Summarising…" spinner in the `ChatWindow` header
and on the InfoPanel's Session Memory button while summarisation may be
in progress.

Since summarisation is fire-and-forget with no completion signal back to
the client, the indicator is timer-based: it activates when the stream
finishes and clears after 8 seconds.

```js
// In App.jsx, watching the streaming state from useChat:
useEffect(() => {
  // Record the new state first so the ref is updated on every run,
  // including the run that starts the timer.
  const wasStreaming = prevStreaming.current;
  prevStreaming.current = streaming;

  if (wasStreaming && !streaming) {
    setSummarising(true);
    const t = setTimeout(() => setSummarising(false), 8000);
    return () => clearTimeout(t);
  }
}, [streaming]);
```

---

## Environment Variables

Set in `packages/orchestration-service/src/.env`:

| Variable | Default | Description |
|---|---|---|
| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |

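A loader for these variables might look like the sketch below. It is a guess at the shape rather than the service's actual configuration code: `loadSummaryConfig` is a hypothetical name, and the mapping from the `SUMMARY_*` variables to the `SUMMARIES.*` keys used earlier on this page is inferred from the names and defaults.

```js
// Hypothetical loader: reads the documented variables and falls back to the
// defaults in the table above. Pass a plain object to test without process.env.
function loadSummaryConfig(env = process.env) {
  // Parses an integer setting, using the fallback when unset or not numeric.
  const int = (value, fallback) => {
    const n = Number.parseInt(value ?? '', 10);
    return Number.isNaN(n) ? fallback : n;
  };
  return {
    EXTRACTION_URL: env.EXTRACTION_URL ?? 'http://localhost:11434',
    EXTRACTION_MODEL: env.EXTRACTION_MODEL ?? 'qwen2.5:3b',
    MEMORY_URL: env.MEMORY_SERVICE_URL ?? 'http://localhost:3002',
    SUMMARIES: {
      THRESHOLD_TOKENS: int(env.SUMMARY_THRESHOLD_TOKENS, 200),
      MAX_SUMMARY_TOKENS: int(env.SUMMARY_MAX_TOKENS, 800),
      MIN_EPISODES_SINCE: int(env.SUMMARY_MIN_EPISODES, 5),
    },
  };
}
```
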