documentation updates for entity extraction and summarization

Storme-bit
2026-04-21 03:50:38 -07:00
parent 32365e67f4
commit acda21317b
6 changed files with 540 additions and 107 deletions


@@ -0,0 +1,178 @@
# Memory Service
**Package:** `@nexusai/memory-service`
**Location:** `packages/memory-service`
**Deployed on:** Mini PC 1 (192.168.0.81)
**Port:** 3002
## Purpose
Responsible for all reading and writing of long-term memory. Acts as the
sole interface to both SQLite and Qdrant — no other service accesses these
stores directly. On episode creation, automatically calls the embedding
service to generate and store a vector in Qdrant.
## Dependencies
- `express` — HTTP API
- `better-sqlite3` — SQLite driver
- `@qdrant/js-client-rest` — Qdrant vector store client
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities and constants
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3002 | Port to listen on |
| SQLITE_PATH | Yes | — | Path to SQLite database file |
| QDRANT_URL | No | http://localhost:6333 | Qdrant instance URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for entity extraction |
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for entity extraction |
## Internal Structure
```
src/
├── db/
│ ├── index.js # SQLite connection + initialization + migrations
│ ├── schema.js # Table definitions, indexes, FTS5, triggers
│ ├── projects.js # Project CRUD functions
│ └── summaries.js # Summary CRUD functions
├── episodic/
│ └── index.js # Session + episode CRUD, FTS search, embedding write path
├── semantic/
│ └── index.js # Qdrant collection management, upsert, search, delete
├── entities/
│ ├── index.js # Entity + relationship CRUD
│ └── extraction.js # Automatic entity extraction via qwen2.5:3b on Ollama
└── index.js # Express app + all route definitions
```
## SQLite Schema
Seven core tables:
- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
- **episodes** — individual exchanges (user message + AI response) tied to a session
- **entities** — named things the system learns about (people, places, concepts)
- **relationships** — directional labeled links between entities
- **summaries** — condensed episode groups for efficient context retrieval
- **projects** — named groupings of sessions with `name`, `description`, `colour`, `icon`, `isolated`, `notes`, `system_prompt`
### Migrations
Schema changes that cannot use `CREATE TABLE IF NOT EXISTS` are applied as
idempotent migrations in `db/index.js` at startup:
```js
try { db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`); } catch {}
try { db.exec(`ALTER TABLE sessions ADD COLUMN project_id INTEGER REFERENCES projects(id)`); } catch {}
try { db.exec(`CREATE INDEX IF NOT EXISTS idx_sessions_project ON sessions(project_id)`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN isolated INTEGER NOT NULL DEFAULT 0`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN notes TEXT`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN system_prompt TEXT`); } catch {}
```
New migrations are always appended here — never modify the schema file for
existing tables since `ALTER TABLE` cannot use `IF NOT EXISTS`.
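
A bare `catch {}` also swallows errors unrelated to "column already exists". As a hedged alternative, the sketch below checks for the column first. `addColumnIfMissing` is a hypothetical helper, not something the service defines, and it assumes `db` is a `better-sqlite3` database whose `db.pragma('table_info(...)')` call returns one row per column with a `name` field.

```javascript
// Hypothetical helper (not part of the service): add a column only when absent.
// Table and column names are interpolated into SQL, so pass trusted constants only.
function addColumnIfMissing(db, table, column, definition) {
  const columns = db.pragma(`table_info(${table})`).map((row) => row.name);
  if (columns.includes(column)) return false; // already migrated, nothing to do
  db.exec(`ALTER TABLE ${table} ADD COLUMN ${column} ${definition}`);
  return true;
}
```

A call such as `addColumnIfMissing(db, 'projects', 'notes', 'TEXT')` would then be safe to run on every startup, and any other SQLite error would still surface.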
### FTS5 Full-Text Search
An `episodes_fts` virtual table enables keyword search across all episodes.
Three triggers (`episodes_fts_insert`, `episodes_fts_update`, `episodes_fts_delete`)
keep the FTS index automatically in sync with the episodes table.
### SQLite Configuration
- `journal_mode = WAL` — non-blocking reads during writes
- `foreign_keys = ON` — enforces referential integrity and cascade deletes
- PRAGMAs set via `db.pragma()`, not `db.exec()`
### Dynamic Updates
Both `updateSession` and `updateProject` build their `SET` clause dynamically
from only the fields passed, so a partial update never overwrites fields
that weren't touched.
`updateProject` allowlist:
```js
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
```
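
To make the dynamic `SET` clause concrete, here is a minimal sketch of how an allowlist can drive the statement. `buildUpdate` is a hypothetical name used for illustration; the real `updateProject` may be structured differently.

```javascript
// Illustrative only: not the service's actual implementation.
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];

function buildUpdate(table, id, fields) {
  // Keep only allowlisted keys that the caller actually supplied.
  const keys = allowed.filter((key) => fields[key] !== undefined);
  if (keys.length === 0) return null; // nothing to update
  const setClause = keys.map((key) => `${key} = ?`).join(', ');
  return {
    sql: `UPDATE ${table} SET ${setClause} WHERE id = ?`,
    params: [...keys.map((key) => fields[key]), id],
  };
}
```

With `better-sqlite3` the result could be executed as `db.prepare(sql).run(...params)`. Because column names come only from the allowlist, unknown keys in the request body never reach the SQL string.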
## Qdrant / Semantic Layer
Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
| Collection | Purpose |
|---|---|
| `episodes` | Embeddings for individual conversation exchanges |
| `entities` | Embeddings for named entities |
| `summaries` | Embeddings for condensed episode summaries |
All collections use **768-dimension vectors** with **Cosine similarity**,
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
defined in `@nexusai/shared` — not hardcoded here.
`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
collection that doesn't already exist at startup — all three collections are
guaranteed to exist before any requests are handled, avoiding race conditions
between the first entity embed and an entity search.
Each collection exposes upsert, search (with optional Qdrant filter), and
delete operations. The `wait: true` flag is used on all writes.
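
The startup check described above can be sketched as follows. This is an assumption-laden outline, not the contents of `semantic/index.js`: the constant names stand in for whatever `@nexusai/shared` exports, and `client` is expected to be a `QdrantClient` from `@qdrant/js-client-rest`.

```javascript
// Sketch only. COLLECTIONS, VECTOR_SIZE and DISTANCE are stand-ins for the shared
// constants mentioned in the text; their real names may differ.
const COLLECTIONS = { EPISODES: 'episodes', ENTITIES: 'entities', SUMMARIES: 'summaries' };
const VECTOR_SIZE = 768;
const DISTANCE = 'Cosine';

async function initCollections(client) {
  // Ask Qdrant which collections already exist.
  const { collections } = await client.getCollections();
  const existing = new Set(collections.map((c) => c.name));
  const created = [];
  for (const name of Object.values(COLLECTIONS)) {
    if (existing.has(name)) continue; // leave existing collections untouched
    await client.createCollection(name, {
      vectors: { size: VECTOR_SIZE, distance: DISTANCE },
    });
    created.push(name);
  }
  return created; // names created during this startup
}
```

Awaiting this function before `app.listen(...)` is one way to get the "exists before any request" guarantee.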
## Embedding Write Path
When a new episode is created:
1. Episode saved to SQLite synchronously — response returned immediately
2. User message + AI response combined: `User: ...\nAssistant: ...`
3. Text sent to embedding service (`POST /embed`)
4. Vector upserted into `episodes` Qdrant collection with payload `{ sessionId, createdAt }`
This step is **fire-and-forget** — if embedding fails, the episode is still
saved and searchable via FTS. The error is logged but not surfaced.
> The Qdrant payload stores `sessionId` (the internal integer ID). See
> `memory-isolation.md` for how project-level filtering works.
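
The write path above can be outlined roughly like this. Everything here is a sketch under assumptions: `embed` and `upsert` are injected stand-ins for the embedding-service call and the Qdrant upsert, and the episode field names (`userMessage`, `aiResponse`) are hypothetical.

```javascript
// Combine both sides of the exchange into the text that gets embedded.
function buildEmbeddingText(userMessage, aiResponse) {
  return `User: ${userMessage}\nAssistant: ${aiResponse}`;
}

// Fire-and-forget: the caller does not await this; the episode is already in SQLite.
function embedEpisodeInBackground(episode, { embed, upsert, log = console.error }) {
  const text = buildEmbeddingText(episode.userMessage, episode.aiResponse);
  return embed(text)
    .then((vector) =>
      upsert('episodes', {
        id: episode.id,
        vector,
        payload: { sessionId: episode.sessionId, createdAt: episode.createdAt },
      })
    )
    // Failures are logged and swallowed so they never affect the HTTP response.
    .catch((err) => log('episode embedding failed:', err.message));
}
```

The trailing `.catch` is what makes the step safe to leave un-awaited: a rejected promise would otherwise surface as an unhandled rejection.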
## Entity Layer
Entities and relationships use upsert semantics with composite unique
constraints to prevent duplicates:
- `UNIQUE(name, type)` on entities
- `UNIQUE(from_id, to_id, label)` on relationships
- `ON DELETE CASCADE` on relationship foreign keys
After each episode is saved, `extraction.js` automatically extracts named
entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
> For full details on the extraction pipeline, prompt format, constrained
> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
## Summaries Layer
Session summaries are generated by `orchestration-service/src/services/summarization.js`
after each episode write and stored here via `POST /summaries`. The memory
service is responsible only for CRUD — generation logic lives in orchestration.
> For full details on trigger conditions, prompt format, cumulative updates,
> and ChatML token stripping, see `summarization.md`.
## Project Delete Behaviour
Deleting a project runs as a transaction — it first nulls out `project_id`
on all assigned sessions, then deletes the project. This avoids a foreign
key constraint failure since `sessions.project_id` has no `ON DELETE` rule:
```js
const doDelete = db.transaction(() => {
db.prepare(`UPDATE sessions SET project_id = NULL WHERE project_id = ?`).run(id);
db.prepare(`DELETE FROM projects WHERE id = ?`).run(id);
});
```
For all HTTP endpoints, see `api-routes.md`.


@@ -38,7 +38,8 @@ src/
├── db/
│ ├── index.js # SQLite connection + initialization + migrations
│ ├── schema.js # Table definitions, indexes, FTS5, triggers
│ └── projects.js # Project CRUD functions
│ ├── projects.js # Project CRUD functions
│ └── summaries.js # Summary CRUD functions
├── episodic/
│ └── index.js # Session + episode CRUD, FTS search, embedding write path
├── semantic/
@@ -51,7 +52,7 @@ src/
## SQLite Schema
Six core tables:
Seven core tables:
- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
- **episodes** — individual exchanges (user message + AI response) tied to a session
@@ -100,12 +101,9 @@ that weren't touched.
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
```
This means saving just `{ notes: "..." }` or `{ system_prompt: "..." }` won't
touch any other field.
## Qdrant / Semantic Layer
Three Qdrant collections are initialized on service startup:
Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
| Collection | Purpose |
|---|---|
@@ -117,9 +115,13 @@ All collections use **768-dimension vectors** with **Cosine similarity**,
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
defined in `@nexusai/shared` — not hardcoded here.
Each collection exposes three operations in `src/semantic/index.js`:
upsert, search (with optional Qdrant filter), and delete. The `wait: true`
flag is used on all writes.
`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
collection that doesn't already exist at startup — all three collections are
guaranteed to exist before any requests are handled, avoiding race conditions
between the first entity embed and an entity search.
Each collection exposes upsert, search (with optional Qdrant filter), and
delete operations. The `wait: true` flag is used on all writes.
## Embedding Write Path
@@ -133,8 +135,7 @@ When a new episode is created:
This step is **fire-and-forget** — if embedding fails, the episode is still
saved and searchable via FTS. The error is logged but not surfaced.
> The Qdrant payload stores `sessionId` (the internal integer ID). This is
> used for per-session and per-project filtering during semantic search. See
> The Qdrant payload stores `sessionId` (the internal integer ID). See
> `memory-isolation.md` for how project-level filtering works.
## Entity Layer
@@ -146,34 +147,20 @@ constraints to prevent duplicates:
- `UNIQUE(from_id, to_id, label)` on relationships
- `ON DELETE CASCADE` on relationship foreign keys
### Automatic Entity Extraction
After each episode is saved, `extraction.js` automatically extracts named
entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1).
This runs **fire-and-forget** — the episode is already saved and returned
before extraction begins.
entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
**Entity types extracted:** `person`, `place`, `project`, `technology`,
`concept`, `organization`
> For full details on the extraction pipeline, prompt format, constrained
> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
The extraction prompt uses ChatML format (native to qwen2.5) and primes the
response by ending with `[` to steer the model directly into JSON array output.
A list of already-known entities is injected into the prompt so the model
reuses existing `(name, type)` pairs rather than creating duplicates with
different types.
## Summaries Layer
After extraction, each entity is:
1. Upserted into SQLite via `upsertEntity` — notes are only written if
the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents
overwriting existing notes with speculative updates)
2. Embedded via the embedding service and upserted into the `entities`
Qdrant collection with `{ name, type, notes, projectId }` as payload —
`projectId` scopes entities to their project for isolated retrieval
Session summaries are generated by `orchestration-service/src/services/summarization.js`
after each episode write and stored here via `POST /summaries`. The memory
service is responsible only for CRUD — generation logic lives in orchestration.
`extractAndStoreEntities` receives `projectId` from `createEpisode`, which
receives it from the episode route, which receives it from orchestration's
`createEpisode` call. This ensures entities are tagged with the correct
project scope at extraction time.
> For full details on trigger conditions, prompt format, cumulative updates,
> and ChatML token stripping, see `summarization.md`.
## Project Delete Behaviour


@@ -30,31 +30,33 @@ or inference services — all traffic flows through orchestration.
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |
## Internal Structure
```
src/
├── services/
│ ├── memory.js # HTTP client for memory service
│ ├── inference.js # HTTP client for inference service
│ ├── embedding.js # HTTP client for embedding service
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
│ ├── memory.js # HTTP client for memory service
│ ├── inference.js # HTTP client for inference service
│ ├── embedding.js # HTTP client for embedding service
│ ├── qdrant.js # HTTP client for Qdrant (direct vector search)
│ └── summarization.js # Session summarisation — triggers after each episode
├── chat/
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
├── config/
│ └── settings.js # Settings load/save — reads/writes data/settings.json
│ └── settings.js # Settings load/save — reads/writes data/settings.json
├── routes/
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy — passes req.body straight through
│ ├── episodes.js # Episode list and delete proxy
│ ├── settings.js # GET /settings and PATCH /settings
│ ├── health.js # GET /health — pings all four services
│ └── models.js # GET /models — scans .gguf files live, merges with models.json
# GET /models/props — context window + loaded model from llama-server
└── index.js # Express app entry point
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy
│ ├── episodes.js # Episode list and delete proxy
│ ├── summaries.js # GET /summaries/session/:id and /summaries/project/:id
│ ├── settings.js # GET /settings and PATCH /settings
│ ├── health.js # GET /health/services — pings all four services
│ └── models.js # GET /models and GET /models/props
└── index.js # Express app entry point
```
The `services/` layer wraps all downstream HTTP calls in named functions.
@@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart
| `topK` | 40 | Top-K token candidates per step |
| `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |
Defaults are defined in `config/settings.js` and fall back to constants in
`@nexusai/shared`. Values saved in `settings.json` take precedence.
## Chat Pipeline
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client.
### Steps
1. **Session resolution** — look up session by `externalId`. Auto-create if
not found. Clients generate a UUID for new conversations — no pre-creation
step needed.
not found.
2. **Project context resolution** — if the session has a `project_id`, fetch
the project and all its session IDs. Used to scope semantic search. The
project's `system_prompt` is also read at this step if set.
3. **System prompt resolution** — three-tier hierarchy:
- `project.system_prompt` — if the session is in a project and it's set (highest priority)
- `project.system_prompt` — highest priority
- `settings.systemPrompt` — global setting from `settings.json`
- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort)
- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)
4. **Recent episode retrieval** — fetch the most recent episodes for the
session (`recentEpisodeLimit`, default 5).
4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).
5. **Semantic search** — embed the user message, query Qdrant for the top
most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
against recent episodes. Non-critical — if it fails, pipeline continues with
recency-only context.
5. **Semantic search** — embed user message, query Qdrant for similar past
episodes. Deduplicated against recent episodes. Non-critical.
6. **Entity search** — query the `entities` Qdrant collection filtered by
6. **Entity search** — query `entities` Qdrant collection filtered by
`projectId`. Non-project sessions receive no entity context. Non-critical.
7. **Prompt assembly** — combine resolved system prompt, entity context,
semantic episodes, recent episodes, and user message.
7. **Prompt assembly** — combine system prompt, entity context, semantic
episodes, recent episodes, and user message.
8. **Inference** — send to inference service with settings-derived parameters
(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
8. **Inference** — send to inference service. `/chat` awaits full response;
`/chat/stream` pipes SSE chunks to the client.
9. **Episode write** — write the exchange back to memory with `projectId`.
Fire-and-forget for `/chat`; awaited for `/chat/stream`.
9. **Episode write** — write exchange back to memory with `projectId`.
10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
inference call with a naming prompt (max 20 tokens, temperature 0.3) and
write the result back as `session.name`. Fully fire-and-forget.
10. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` called
fire-and-forget. See `summarization.md` for full details.
11. **Auto-naming** — on first message with no session name, fires a secondary
inference call (max 20 tokens, temperature 0.3) to generate a session name.
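
The three-tier system prompt lookup in step 3 can be sketched as a single expression. `resolveSystemPrompt` is a hypothetical helper, and the constant's text below is a placeholder rather than the real default prompt.

```javascript
// Placeholder value; the real constant is defined in @nexusai/shared.
const ORCHESTRATION = { SYSTEM_PROMPT: 'placeholder default prompt' };

// Hypothetical helper illustrating the priority order, not the pipeline's code.
function resolveSystemPrompt(project, settings) {
  // `??` skips only null/undefined. Whether an empty string counts as "set"
  // is an assumption here; the source does not say.
  return project?.system_prompt ?? settings?.systemPrompt ?? ORCHESTRATION.SYSTEM_PROMPT;
}
```

A session outside any project would call this with `project` as `null` and fall through to the global setting, then to the constant.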
### Prompt Structure
@@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client.
Here is what you know about entities relevant to this conversation:
- {name} ({type}): {notes}
... (up to 5 entity results)
---
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to semanticLimit semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to recentEpisodeLimit recent episodes)
--- End of recent memories ---
User: {current message}
Assistant:
```
Entity context appears first — before episodic memory — because structured
facts about known entities are the most stable and reliable context. Semantic
episodes follow, then recent episodes as the immediate conversation flow.
## Summarisation
After each episode write, `triggerSummary` is called fire-and-forget. It
checks token thresholds and episode counts before generating, then stores
the result in the memory service.
> For full details on trigger conditions, prompt format, cumulative updates,
> ChatML token stripping, and episode range tracking, see `summarization.md`.
## SSE Stream Format
@@ -168,46 +165,36 @@ data: {"text":"Hello"}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
```
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
is terminated by `res.end()` after the done event.
The `[DONE]` sentinel is consumed internally and not forwarded.
## Models Route
`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
(read from settings). Merges results with a `models.json` file in the same
folder for richer metadata (label, description). Returns file size in GB.
`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
with `models.json` for metadata. Returns file size in GB.
`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
Returns `{ contextWindow, modelAlias }`. `n_ctx` is at
`data.default_generation_settings.n_ctx` in the llama-server response.
Returns `503` if llama-server is unreachable.
`GET /models/props` fetches directly from llama-server. Returns
`{ contextWindow, modelAlias }`. Returns `503` if unreachable.
## Sessions Route Behaviour
`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
The validation guard only rejects requests where neither is provided:
```js
if (!name?.trim() && projectId === undefined) {
return res.status(400).json({ error: 'name or projectId is required' });
}
```
This allows `useChat` to write project assignment separately from rename
operations.
`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both.
Rejects only when neither is provided — allows `useChat` to write project
assignment separately from rename operations.
## Caddy Configuration
Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
**Any new top-level route must be added here AND in `vite.config.js`.**
```
handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /health* { reverse_proxy localhost:4000 }
handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /summaries* { reverse_proxy localhost:4000 }
handle /health* { reverse_proxy localhost:4000 }
```
After updating: `caddy reload --config /path/to/Caddyfile`


@@ -165,10 +165,16 @@ Orchestration pipeline defaults. Used as fallback values in
| `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
| `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
| `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
| `ENTITIES_LIMIT` | `5` | Max entity search results to inject into prompt |
| `ENTITIES_THRESHOLD` | `0.55` | Minimum similarity score for entity results |
| `TEMPERATURE` | `0.7` | Default inference temperature |
| `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
| `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
> `ENTITIES_THRESHOLD` is set to `0.55` — lower than `SCORE_THRESHOLD` because
> entity notes generated by a 3B model tend to embed with lower cosine similarity
> than full episode text. Tune upward if irrelevant entities appear in context.
> `repeatPenalty`, `topP`, and `topK` defaults are sourced from
> `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
> since those constants already define the canonical values.
@@ -178,6 +184,25 @@ Default system prompt:
> of past conversations with the user. Use them to provide consistent,
> personalised responses."
#### `SUMMARIES`
Controls the automatic session summarisation system in `orchestration-service/src/services/summarization.js`.
| Key | Value | Description |
|---|---|---|
| `THRESHOLD_TOKENS` | `200` | Minimum total session tokens before summarisation is considered |
| `MAX_SUMMARY_TOKENS` | `800` | If existing summary exceeds this length (chars), create a new row instead of updating |
| `MIN_EPISODES_SINCE` | `5` | Minimum new episodes since last summary before re-summarising |
These can be overridden per-deployment via environment variables in the
orchestration service `.env`:
```
SUMMARY_THRESHOLD_TOKENS=200
SUMMARY_MAX_TOKENS=800
SUMMARY_MIN_EPISODES=5
```
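
One possible way to read these overrides is sketched below. `loadSummaryConfig` is a hypothetical name; the defaults simply mirror the `SUMMARIES` table above, and how the real service parses the variables is not shown in the source.

```javascript
// Sketch: read the three overrides, falling back to the documented defaults.
function loadSummaryConfig(env = process.env) {
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed; // unset or invalid uses the default
  };
  return {
    THRESHOLD_TOKENS: toInt(env.SUMMARY_THRESHOLD_TOKENS, 200),
    MAX_SUMMARY_TOKENS: toInt(env.SUMMARY_MAX_TOKENS, 800),
    MIN_EPISODES_SINCE: toInt(env.SUMMARY_MIN_EPISODES, 5),
  };
}
```

Checking for `NaN` rather than using `||` keeps an explicit `0` as a valid override.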
#### `SQLITE`
| Key | Value | Description |


@@ -0,0 +1,201 @@
# Summarization
Session summarization generates rolling plain-text summaries of conversation
history, giving the model a condensed view of past context without consuming
the full context window with raw episodes.
**Location:** `packages/orchestration-service/src/services/summarization.js`
**Triggered by:** `chat/index.js` after every episode write (fire-and-forget)
**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)
---
## Trigger Conditions
`triggerSummary(session, allEpisodes)` calls `maybeSummarize` fire-and-forget.
`maybeSummarize` proceeds only when both conditions are met:
1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have
accumulated since the last summary
The token threshold is intentionally low — it ensures summaries start
generating early in a session's life rather than only after very long
conversations.
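
The two conditions reduce to a small predicate. The sketch below is illustrative: `shouldSummarize` and its input names are hypothetical, and it assumes that for a session with no summary yet, every episode counts as "new".

```javascript
// Hypothetical predicate; the real maybeSummarize may compute its inputs differently.
function shouldSummarize(
  { totalTokens, episodesSinceLastSummary },
  limits = { THRESHOLD_TOKENS: 200, MIN_EPISODES_SINCE: 5 }
) {
  const enoughTokens = totalTokens > limits.THRESHOLD_TOKENS; // "exceeds"
  const enoughEpisodes = episodesSinceLastSummary >= limits.MIN_EPISODES_SINCE; // "at least"
  return enoughTokens && enoughEpisodes;
}
```

With the defaults, a session at 450 tokens and 3 new episodes is skipped, while the same session at 5 new episodes proceeds.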
---
## Summary Rows and Cumulative Updates
Each session can have multiple summary rows in the `summaries` table.
The update strategy depends on the size of the most recent summary:
| Condition | Action |
|---|---|
| No existing summary | Generate fresh summary from all episodes |
| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |
This produces a chain of summary rows over time. Each row's `episode_range`
covers only the episodes summarised in that specific pass (e.g. `259-263`),
not all episodes in the session.
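
The table above maps onto a three-way decision. The sketch below is a guess at its shape, not the actual code: `chooseSummaryAction` and the returned labels are hypothetical, size is measured as the character length of `content`, and a summary exactly at the limit is treated as still updatable because the source does not specify that case.

```javascript
// Hypothetical decision helper mirroring the table's three rows.
function chooseSummaryAction(latestSummary, maxLength = 800) {
  if (!latestSummary) return 'generate-fresh'; // no existing summary
  // Assumption: size is content length in characters; equality counts as "under".
  return latestSummary.content.length <= maxLength ? 'update-latest' : 'create-new-row';
}
```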
---
## Ollama Request
```js
{
model: EXTRACTION_MODEL, // qwen2.5:3b (set via EXTRACTION_MODEL env var)
prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
stream: false,
// No format: 'json' — free-text output required for summaries
options: {
temperature: 0.2,
num_predict: 500,
},
}
```
`temperature: 0.2` is slightly higher than extraction (0.1) — summaries
benefit from some fluency. `num_predict: 500` gives room for 5 thorough
sentences without risk of runoff.
---
## Prompt Format
ChatML format — native to qwen2.5:
```
<|im_start|>user
Summarize the conversation below in 3-5 sentences.
Write in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the summary text.
Conversation:
{context}
<|im_end|>
<|im_start|>assistant
```
For cumulative updates, the instruction and context change:
```
<|im_start|>user
Update the summary below to incorporate the new exchanges.
Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the updated summary text.
Previous summary:
{existingSummary}
New exchanges:
{context}
<|im_end|>
<|im_start|>assistant
```
### Input truncation
Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the
most recent exchanges (sliced from the end). This keeps Qwen focused and
prevents the prompt from exceeding its effective context window.
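
Tail truncation of this kind can be sketched in a few lines. `buildContext` and the episode field names are hypothetical; only the `MAX_CHARS = 3000` limit and the "slice from the end" behaviour come from the text above.

```javascript
const MAX_CHARS = 3000;

// Hypothetical helper: join episodes oldest-first, then keep only the tail.
function buildContext(episodes, maxChars = MAX_CHARS) {
  const joined = episodes
    .map((ep) => `User: ${ep.userMessage}\nAssistant: ${ep.aiResponse}`)
    .join('\n\n');
  // slice(-n) keeps the last n characters, i.e. the most recent exchanges.
  return joined.length > maxChars ? joined.slice(-maxChars) : joined;
}
```

Note that a character-based cut can land in the middle of an older exchange, so the first retained line may be a fragment.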
---
## ChatML Token Stripping
Qwen occasionally echoes ChatML tokens back into its response. The raw output
is cleaned before saving:
```js
const raw = data.response?.trim() ?? '';
const content = raw
.replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
.replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
.trim();
return content;
```
Without this, leaked tokens get stored in the summary and then injected
back into the next summarisation prompt — causing the model to append a new
summary after the old one rather than replacing it.
---
## Episode Range Tracking
Each summary row stores `episode_range` as `"firstId-lastId"` covering only
the episodes summarised in that pass:
```js
const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b);
const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
```
This makes SummaryView cards meaningful — "Episodes 259-263" tells you
exactly which exchanges that summary covers, rather than always showing
the full session range.
---
## Summary Storage
Summaries are written directly to the memory service from orchestration:
```js
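// Create new row
await fetch(`${MEMORY_URL}/summaries`, {
  method: 'POST',
  // Assumption: the memory service parses bodies with express.json(),
  // which only reads requests sent with a JSON content type.
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
});
// Update existing row
await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ content, tokenCount, episodeRange }),
});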
```
`session.id` here is the internal SQLite integer ID — not the external UUID.
It is available directly on the `session` object passed from `chat/index.js`.
---
## Client-Side Indicator
The chat client shows a "Summarising…" spinner in the `ChatWindow` header
and on the InfoPanel's Session Memory button while summarisation may be
in progress.
Since summarisation is fire-and-forget with no completion signal back to
the client, the indicator is timer-based: it activates when the stream
finishes and clears after 8 seconds.
```js
// In App.jsx, watching the streaming state from useChat:
useEffect(() => {
if (prevStreaming.current && !streaming) {
setSummarising(true);
const t = setTimeout(() => setSummarising(false), 8000);
return () => clearTimeout(t);
}
prevStreaming.current = streaming;
}, [streaming]);
```
---
## Environment Variables
Set in `packages/orchestration-service/src/.env`:
| Variable | Default | Description |
|---|---|---|
| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |