documentation updates for entity extraction and summarization

Storme-bit
2026-04-21 03:50:38 -07:00
parent 32365e67f4
commit acda21317b
6 changed files with 540 additions and 107 deletions


@@ -120,6 +120,38 @@ all projects use isolated memory. Returns `201` with the created project object.
Only provided fields are updated — omitted fields are not touched.
### Summaries
| Method | Path | Description |
|---|---|---|
| GET | /summaries/session/:sessionId | Get all summaries for a session (by external UUID) |
| GET | /summaries/project/:projectId | Get all summaries for a project |
**GET /summaries/session/:sessionId** — resolves the external UUID to an
internal session ID, then fetches summaries from the memory service.
Returns an array of summary objects ordered by `created_at` ascending.
**GET /summaries/project/:projectId** — proxies directly to the memory
service project summaries endpoint.
**Summary object shape:**
```json
{
"id": 8,
"session_id": 72,
"project_id": null,
"content": "The user asked about...",
"token_count": 579,
"episode_range": "246-251",
"created_at": 1776766518,
"updated_at": 1776766518
}
```
> **Proxy requirement:** `/summaries` must be added to both the Caddyfile
> reverse proxy and the Vite dev proxy config alongside the other route
> prefixes. See `orchestration-service.md` for the Caddy block pattern.
### Models
| Method | Path | Description |
@@ -269,6 +301,29 @@ Both fields are optional. Only provided fields are updated.
Same request/response shape as orchestration `/projects` above.
### Summaries
| Method | Path | Description |
|---|---|---|
| POST | /summaries | Create a new summary |
| GET | /sessions/:id/summaries | Get all summaries for a session (internal ID) |
| GET | /projects/:id/summaries | Get all summaries for a project |
| PATCH | /summaries/:id | Update a summary (content, tokenCount, episodeRange) |
| DELETE | /summaries/:id | Delete a summary |
**POST /summaries — body:**
```json
{
"sessionId": 72,
"content": "The user discussed...",
"tokenCount": 579,
"episodeRange": "246-251"
}
```
`content` is required. Either `sessionId` or `projectId` is required.
**PATCH /summaries/:id — body:** any subset of `content`, `tokenCount`, `episodeRange`.
### Entities
| Method | Path | Description |


@@ -0,0 +1,178 @@
# Memory Service
**Package:** `@nexusai/memory-service`
**Location:** `packages/memory-service`
**Deployed on:** Mini PC 1 (192.168.0.81)
**Port:** 3002
## Purpose
Responsible for all reading and writing of long-term memory. Acts as the
sole interface to both SQLite and Qdrant — no other service accesses these
stores directly. On episode creation, automatically calls the embedding
service to generate and store a vector in Qdrant.
## Dependencies
- `express` — HTTP API
- `better-sqlite3` — SQLite driver
- `@qdrant/js-client-rest` — Qdrant vector store client
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities and constants
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3002 | Port to listen on |
| SQLITE_PATH | Yes | — | Path to SQLite database file |
| QDRANT_URL | No | http://localhost:6333 | Qdrant instance URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for entity extraction |
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for entity extraction |
## Internal Structure
```
src/
├── db/
│ ├── index.js # SQLite connection + initialization + migrations
│ ├── schema.js # Table definitions, indexes, FTS5, triggers
│ ├── projects.js # Project CRUD functions
│ └── summaries.js # Summary CRUD functions
├── episodic/
│ └── index.js # Session + episode CRUD, FTS search, embedding write path
├── semantic/
│ └── index.js # Qdrant collection management, upsert, search, delete
├── entities/
│ ├── index.js # Entity + relationship CRUD
│ └── extraction.js # Automatic entity extraction via qwen2.5:3b on Ollama
└── index.js # Express app + all route definitions
```
## SQLite Schema
Seven core tables:
- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
- **episodes** — individual exchanges (user message + AI response) tied to a session
- **entities** — named things the system learns about (people, places, concepts)
- **relationships** — directional labeled links between entities
- **summaries** — condensed episode groups for efficient context retrieval
- **projects** — named groupings of sessions with `name`, `description`, `colour`, `icon`, `isolated`, `notes`, `system_prompt`
### Migrations
Schema changes that cannot use `CREATE TABLE IF NOT EXISTS` are applied as
idempotent migrations in `db/index.js` at startup:
```js
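// Each statement throws once its column already exists; the empty catch turns
// a re-run into a no-op. Note that it also hides unrelated errors.
try { db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`); } catch {}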
try { db.exec(`ALTER TABLE sessions ADD COLUMN project_id INTEGER REFERENCES projects(id)`); } catch {}
try { db.exec(`CREATE INDEX IF NOT EXISTS idx_sessions_project ON sessions(project_id)`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN isolated INTEGER NOT NULL DEFAULT 0`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN notes TEXT`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN system_prompt TEXT`); } catch {}
```
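Append new migrations here instead of editing the schema file for existing
tables. `CREATE TABLE IF NOT EXISTS` leaves an existing table untouched, and
SQLite's `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS` form, which is
why each statement above is wrapped in `try`/`catch`.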
### FTS5 Full-Text Search
An `episodes_fts` virtual table enables keyword search across all episodes.
Three triggers (`episodes_fts_insert`, `episodes_fts_update`, `episodes_fts_delete`)
keep the FTS index automatically in sync with the episodes table.
### SQLite Configuration
- `journal_mode = WAL` — non-blocking reads during writes
- `foreign_keys = ON` — enforces referential integrity and cascade deletes
- PRAGMAs set via `db.pragma()`, not `db.exec()`
### Dynamic Updates
Both `updateSession` and `updateProject` build their `SET` clause dynamically
from only the fields passed — prevents partial updates from overwriting fields
that weren't touched.
`updateProject` allowlist:
```js
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
```
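The dynamic `SET` clause described above could be built along these lines. This is a minimal sketch, not the service's actual code: `buildUpdate` and `PROJECT_FIELDS` are hypothetical names, and treating `undefined` as "field omitted" is an assumption. Column names come only from the allowlist and values are bound as parameters, so unknown keys in the request body are ignored.

```js
// Hypothetical helper: builds an UPDATE statement from only the allowlisted
// fields that are actually present on `fields`.
const PROJECT_FIELDS = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];

function buildUpdate(table, id, fields, allowed) {
  // Keep allowlisted keys that were explicitly provided (undefined = omitted).
  const keys = allowed.filter((key) => fields[key] !== undefined);
  if (keys.length === 0) return null; // nothing to update

  // Column names come from the allowlist, never from user input;
  // values are always bound as parameters.
  const setClause = keys.map((key) => `${key} = ?`).join(', ');
  return {
    sql: `UPDATE ${table} SET ${setClause} WHERE id = ?`,
    params: [...keys.map((key) => fields[key]), id],
  };
}

// `unknownKey` is not in the allowlist, so it is silently dropped.
const update = buildUpdate('projects', 7, { notes: 'Kickoff done', unknownKey: 'x' }, PROJECT_FIELDS);
console.log(update.sql);    // UPDATE projects SET notes = ? WHERE id = ?
console.log(update.params); // [ 'Kickoff done', 7 ]
```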
## Qdrant / Semantic Layer
Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
| Collection | Purpose |
|---|---|
| `episodes` | Embeddings for individual conversation exchanges |
| `entities` | Embeddings for named entities |
| `summaries` | Embeddings for condensed episode summaries |
All collections use **768-dimension vectors** with **Cosine similarity**,
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
defined in `@nexusai/shared` — not hardcoded here.
`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
collection that doesn't already exist at startup — all three collections are
guaranteed to exist before any requests are handled, avoiding race conditions
between the first entity embed and an entity search.
Each collection exposes upsert, search (with optional Qdrant filter), and
delete operations. The `wait: true` flag is used on all writes.
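The startup behaviour described above might look roughly like the sketch below. It assumes `client` is a `QdrantClient` from `@qdrant/js-client-rest` (using its `getCollections` and `createCollection` methods); the local `COLLECTIONS` and `VECTOR_CONFIG` objects are stand-ins for the constants the real service imports from `@nexusai/shared`, and the return value is added here only to make the behaviour observable. Awaiting this function before `app.listen()` is what would provide the "exists before any request" guarantee.

```js
// Stand-ins for the shared constants (assumed shape, not the real module).
const COLLECTIONS = { EPISODES: 'episodes', ENTITIES: 'entities', SUMMARIES: 'summaries' };
const VECTOR_CONFIG = { size: 768, distance: 'Cosine' };

// Creates every collection that does not exist yet and returns the names
// that were created on this run.
async function initCollections(client) {
  const { collections } = await client.getCollections();
  const existing = new Set(collections.map((c) => c.name));
  const created = [];
  for (const name of Object.values(COLLECTIONS)) {
    if (existing.has(name)) continue; // already present, leave untouched
    await client.createCollection(name, { vectors: VECTOR_CONFIG });
    created.push(name);
  }
  return created;
}
```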
## Embedding Write Path
When a new episode is created:
1. Episode saved to SQLite synchronously — response returned immediately
2. User message + AI response combined: `User: ...\nAssistant: ...`
3. Text sent to embedding service (`POST /embed`)
4. Vector upserted into `episodes` Qdrant collection with payload `{ sessionId, createdAt }`
Steps 2 to 4 are **fire-and-forget**: if embedding fails, the episode is still
saved and searchable via FTS. The error is logged but not surfaced.
> The Qdrant payload stores `sessionId` (the internal integer ID). See
> `memory-isolation.md` for how project-level filtering works.
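
The write path above can be sketched with injected dependencies. Everything named here is illustrative: `embed(text)` and `upsert(point)` are hypothetical stand-ins for the `POST /embed` call and the Qdrant write, and the episode field names (`userMessage`, `aiResponse`, `sessionId`, `createdAt`) are assumptions. The caller does not await the returned promise, and the trailing `catch` is what keeps a failed embedding from surfacing.

```js
// Combines one exchange into the text that gets embedded.
function buildEmbeddingText(userMessage, aiResponse) {
  return `User: ${userMessage}\nAssistant: ${aiResponse}`;
}

// Runs after the SQLite insert has already succeeded. Errors are logged,
// never rethrown, so the episode stays saved and FTS-searchable.
function embedEpisodeInBackground(episode, { embed, upsert, log = console.error }) {
  return embed(buildEmbeddingText(episode.userMessage, episode.aiResponse))
    .then((vector) => upsert({
      id: episode.id,
      vector,
      payload: { sessionId: episode.sessionId, createdAt: episode.createdAt },
    }))
    .catch((err) => log(`embedding failed for episode ${episode.id}: ${err.message}`));
}
```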
## Entity Layer
Entities and relationships use upsert semantics with composite unique
constraints to prevent duplicates:
- `UNIQUE(name, type)` on entities
- `UNIQUE(from_id, to_id, label)` on relationships
- `ON DELETE CASCADE` on relationship foreign keys
After each episode is saved, `extraction.js` automatically extracts named
entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
> For full details on the extraction pipeline, prompt format, constrained
> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
## Summaries Layer
Session summaries are generated by `orchestration-service/src/services/summarization.js`
after each episode write and stored here via `POST /summaries`. The memory
service is responsible only for CRUD — generation logic lives in orchestration.
> For full details on trigger conditions, prompt format, cumulative updates,
> and ChatML token stripping, see `summarization.md`.
## Project Delete Behaviour
Deleting a project runs as a transaction — it first nulls out `project_id`
on all assigned sessions, then deletes the project. This avoids a foreign
key constraint failure since `sessions.project_id` has no `ON DELETE` rule:
```js
const doDelete = db.transaction(() => {
db.prepare(`UPDATE sessions SET project_id = NULL WHERE project_id = ?`).run(id);
db.prepare(`DELETE FROM projects WHERE id = ?`).run(id);
});
```
For all HTTP endpoints, see `api-routes.md`.


@@ -38,7 +38,8 @@ src/
 ├── db/
 │ ├── index.js # SQLite connection + initialization + migrations
 │ ├── schema.js # Table definitions, indexes, FTS5, triggers
-│ └── projects.js # Project CRUD functions
+│ ├── projects.js # Project CRUD functions
+│ └── summaries.js # Summary CRUD functions
 ├── episodic/
 │ └── index.js # Session + episode CRUD, FTS search, embedding write path
 ├── semantic/
@@ -51,7 +52,7 @@ src/
 ## SQLite Schema
-Six core tables:
+Seven core tables:
 - **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
 - **episodes** — individual exchanges (user message + AI response) tied to a session
@@ -100,12 +101,9 @@ that weren't touched.
 const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
``` ```
-This means saving just `{ notes: "..." }` or `{ system_prompt: "..." }` won't
-touch any other field.
 ## Qdrant / Semantic Layer
-Three Qdrant collections are initialized on service startup:
+Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
 | Collection | Purpose |
 |---|---|
@@ -117,9 +115,13 @@ All collections use **768-dimension vectors** with **Cosine similarity**,
 matching `nomic-embed-text` via Ollama. Vector size and distance metric are
 defined in `@nexusai/shared` — not hardcoded here.
-Each collection exposes three operations in `src/semantic/index.js`:
-upsert, search (with optional Qdrant filter), and delete. The `wait: true`
-flag is used on all writes.
+`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
+collection that doesn't already exist at startup — all three collections are
+guaranteed to exist before any requests are handled, avoiding race conditions
+between the first entity embed and an entity search.
+Each collection exposes upsert, search (with optional Qdrant filter), and
+delete operations. The `wait: true` flag is used on all writes.
 ## Embedding Write Path
@@ -133,8 +135,7 @@ When a new episode is created:
 This step is **fire-and-forget** — if embedding fails, the episode is still
 saved and searchable via FTS. The error is logged but not surfaced.
-> The Qdrant payload stores `sessionId` (the internal integer ID). This is
-> used for per-session and per-project filtering during semantic search. See
+> The Qdrant payload stores `sessionId` (the internal integer ID). See
 > `memory-isolation.md` for how project-level filtering works.
 ## Entity Layer
@@ -146,34 +147,20 @@ constraints to prevent duplicates:
 - `UNIQUE(from_id, to_id, label)` on relationships
 - `ON DELETE CASCADE` on relationship foreign keys
-### Automatic Entity Extraction
 After each episode is saved, `extraction.js` automatically extracts named
-entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1).
-This runs **fire-and-forget** — the episode is already saved and returned
-before extraction begins.
-**Entity types extracted:** `person`, `place`, `project`, `technology`,
-`concept`, `organization`
-The extraction prompt uses ChatML format (native to qwen2.5) and primes the
-response by ending with `[` to steer the model directly into JSON array output.
-A list of already-known entities is injected into the prompt so the model
-reuses existing `(name, type)` pairs rather than creating duplicates with
-different types.
-After extraction, each entity is:
-1. Upserted into SQLite via `upsertEntity` — notes are only written if
-the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents
-overwriting existing notes with speculative updates)
-2. Embedded via the embedding service and upserted into the `entities`
-Qdrant collection with `{ name, type, notes, projectId }` as payload —
-`projectId` scopes entities to their project for isolated retrieval
-`extractAndStoreEntities` receives `projectId` from `createEpisode`, which
-receives it from the episode route, which receives it from orchestration's
-`createEpisode` call. This ensures entities are tagged with the correct
-project scope at extraction time.
+entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
+> For full details on the extraction pipeline, prompt format, constrained
+> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
+## Summaries Layer
+Session summaries are generated by `orchestration-service/src/services/summarization.js`
+after each episode write and stored here via `POST /summaries`. The memory
+service is responsible only for CRUD — generation logic lives in orchestration.
+> For full details on trigger conditions, prompt format, cumulative updates,
+> and ChatML token stripping, see `summarization.md`.
 ## Project Delete Behaviour


@@ -30,7 +30,8 @@ or inference services — all traffic flows through orchestration.
 | LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
-| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
+| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
+| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |
 ## Internal Structure
@@ -40,7 +41,8 @@ src/
 │ ├── memory.js # HTTP client for memory service
 │ ├── inference.js # HTTP client for inference service
 │ ├── embedding.js # HTTP client for embedding service
-│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
+│ ├── qdrant.js # HTTP client for Qdrant (direct vector search)
+│ └── summarization.js # Session summarisation — triggers after each episode
 ├── chat/
 │ └── index.js # Core pipeline — context assembly, isolation, auto-naming
 ├── config/
@@ -48,12 +50,12 @@ src/
 ├── routes/
 │ ├── chat.js # POST /chat and POST /chat/stream
 │ ├── sessions.js # Session CRUD proxy
-│ ├── projects.js # Project CRUD proxy — passes req.body straight through
+│ ├── projects.js # Project CRUD proxy
 │ ├── episodes.js # Episode list and delete proxy
+│ ├── summaries.js # GET /summaries/session/:id and /summaries/project/:id
 │ ├── settings.js # GET /settings and PATCH /settings
-│ ├── health.js # GET /health — pings all four services
-│ └── models.js # GET /models — scans .gguf files live, merges with models.json
-# GET /models/props — context window + loaded model from llama-server
+│ ├── health.js # GET /health/services — pings all four services
+│ └── models.js # GET /models and GET /models/props
 └── index.js # Express app entry point
``` ```
@@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart
 | `topK` | 40 | Top-K token candidates per step |
 | `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |
-Defaults are defined in `config/settings.js` and fall back to constants in
-`@nexusai/shared`. Values saved in `settings.json` take precedence.
 ## Chat Pipeline
 Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client.
 ### Steps
 1. **Session resolution** — look up session by `externalId`. Auto-create if
-not found. Clients generate a UUID for new conversations — no pre-creation
-step needed.
+not found.
 2. **Project context resolution** — if the session has a `project_id`, fetch
 the project and all its session IDs. Used to scope semantic search. The
 project's `system_prompt` is also read at this step if set.
 3. **System prompt resolution** — three-tier hierarchy:
-- `project.system_prompt` — if the session is in a project and it's set (highest priority)
+- `project.system_prompt` — highest priority
 - `settings.systemPrompt` — global setting from `settings.json`
-- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort)
+- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)
-4. **Recent episode retrieval** — fetch the most recent episodes for the
-session (`recentEpisodeLimit`, default 5).
+4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).
-5. **Semantic search** — embed the user message, query Qdrant for the top
-most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
-against recent episodes. Non-critical — if it fails, pipeline continues with
-recency-only context.
+5. **Semantic search** — embed user message, query Qdrant for similar past
+episodes. Deduplicated against recent episodes. Non-critical.
-6. **Entity search** — query the `entities` Qdrant collection filtered by
+6. **Entity search** — query `entities` Qdrant collection filtered by
 `projectId`. Non-project sessions receive no entity context. Non-critical.
-7. **Prompt assembly** — combine resolved system prompt, entity context,
-semantic episodes, recent episodes, and user message.
+7. **Prompt assembly** — combine system prompt, entity context, semantic
+episodes, recent episodes, and user message.
-8. **Inference** — send to inference service with settings-derived parameters
-(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
+8. **Inference** — send to inference service. `/chat` awaits full response;
 `/chat/stream` pipes SSE chunks to the client.
-9. **Episode write** — write the exchange back to memory with `projectId`.
-Fire-and-forget for `/chat`; awaited for `/chat/stream`.
+9. **Episode write** — write exchange back to memory with `projectId`.
-10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
-inference call with a naming prompt (max 20 tokens, temperature 0.3) and
-write the result back as `session.name`. Fully fire-and-forget.
+10. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` called
+fire-and-forget. See `summarization.md` for full details.
+11. **Auto-naming** — on first message with no session name, fires a secondary
+inference call (max 20 tokens, temperature 0.3) to generate a session name.
 ### Prompt Structure
@@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client.
 Here is what you know about entities relevant to this conversation:
 - {name} ({type}): {notes}
-... (up to 5 entity results)
 ---
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to semanticLimit semantic episodes)
 ---
 Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to recentEpisodeLimit recent episodes)
 --- End of recent memories ---
 User: {current message}
 Assistant:
``` ```
-Entity context appears first — before episodic memory — because structured
-facts about known entities are the most stable and reliable context. Semantic
-episodes follow, then recent episodes as the immediate conversation flow.
+## Summarisation
+After each episode write, `triggerSummary` is called fire-and-forget. It
+checks token thresholds and episode counts before generating, then stores
+the result in the memory service.
+> For full details on trigger conditions, prompt format, cumulative updates,
+> ChatML token stripping, and episode range tracking, see `summarization.md`.
 ## SSE Stream Format
@@ -168,37 +165,26 @@ data: {"text":"Hello"}
 data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
``` ```
-The `[DONE]` sentinel is consumed internally and not forwarded. The stream
-is terminated by `res.end()` after the done event.
+The `[DONE]` sentinel is consumed internally and not forwarded.
 ## Models Route
-`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
-(read from settings). Merges results with a `models.json` file in the same
-folder for richer metadata (label, description). Returns file size in GB.
+`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
+with `models.json` for metadata. Returns file size in GB.
-`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
-Returns `{ contextWindow, modelAlias }`. `n_ctx` is at
-`data.default_generation_settings.n_ctx` in the llama-server response.
-Returns `503` if llama-server is unreachable.
+`GET /models/props` fetches directly from llama-server. Returns
+`{ contextWindow, modelAlias }`. Returns `503` if unreachable.
 ## Sessions Route Behaviour
-`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
-The validation guard only rejects requests where neither is provided:
-```js
-if (!name?.trim() && projectId === undefined) {
-return res.status(400).json({ error: 'name or projectId is required' });
-}
-```
-This allows `useChat` to write project assignment separately from rename
-operations.
+`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both.
+Rejects only when neither is provided — allows `useChat` to write project
+assignment separately from rename operations.
 ## Caddy Configuration
-Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
+Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
+**Any new top-level route must be added here AND in `vite.config.js`.**
``` ```
 handle /chat* { reverse_proxy localhost:4000 }
@@ -207,6 +193,7 @@ handle /models* { reverse_proxy localhost:4000 }
 handle /projects* { reverse_proxy localhost:4000 }
 handle /episodes* { reverse_proxy localhost:4000 }
 handle /settings* { reverse_proxy localhost:4000 }
+handle /summaries* { reverse_proxy localhost:4000 }
 handle /health* { reverse_proxy localhost:4000 }
``` ```


@@ -165,10 +165,16 @@ Orchestration pipeline defaults. Used as fallback values in
| `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
| `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
| `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
| `ENTITIES_LIMIT` | `5` | Max entity search results to inject into prompt |
| `ENTITIES_THRESHOLD` | `0.55` | Minimum similarity score for entity results |
| `TEMPERATURE` | `0.7` | Default inference temperature |
| `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
| `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
> `ENTITIES_THRESHOLD` is set to `0.55` — lower than `SCORE_THRESHOLD` because
> entity notes generated by a 3B model tend to embed with lower cosine similarity
> than full episode text. Tune upward if irrelevant entities appear in context.
> `repeatPenalty`, `topP`, and `topK` defaults are sourced from
> `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
> since those constants already define the canonical values.
@@ -178,6 +184,25 @@ Default system prompt:
> of past conversations with the user. Use them to provide consistent,
> personalised responses."
#### `SUMMARIES`
Controls the automatic session summarisation system in `orchestration-service/src/services/summarization.js`.
| Key | Value | Description |
|---|---|---|
| `THRESHOLD_TOKENS` | `200` | Minimum total session tokens before summarisation is considered |
| `MAX_SUMMARY_TOKENS` | `800` | If existing summary exceeds this length (chars), create a new row instead of updating |
| `MIN_EPISODES_SINCE` | `5` | Minimum new episodes since last summary before re-summarising |
These can be overridden per-deployment via environment variables in the
orchestration service `.env`:
```
SUMMARY_THRESHOLD_TOKENS=200
SUMMARY_MAX_TOKENS=800
SUMMARY_MIN_EPISODES=5
```
#### `SQLITE`
| Key | Value | Description |


@@ -0,0 +1,201 @@
# Summarization
Session summarization generates rolling plain-text summaries of conversation
history, giving the model a condensed view of past context without consuming
the full context window with raw episodes.
**Location:** `packages/orchestration-service/src/services/summarization.js`
**Triggered by:** `chat/index.js` after every episode write (fire-and-forget)
**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)
---
## Trigger Conditions
`triggerSummary(session, allEpisodes)` calls `maybeSummarize` fire-and-forget.
`maybeSummarize` proceeds only when both conditions are met:
1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have
accumulated since the last summary
The token threshold is intentionally low — it ensures summaries start
generating early in a session's life rather than only after very long
conversations.
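The two conditions can be expressed as a small predicate. This is a sketch under stated assumptions: `shouldSummarize` and its input names (`totalTokens`, `episodesSinceLast`) are hypothetical, and the comparison operators follow the wording above ("exceeds" as strictly greater, "at least" as greater or equal). With the defaults, a session at exactly 200 tokens does not yet qualify.

```js
// Defaults mirror SUMMARIES.THRESHOLD_TOKENS and SUMMARIES.MIN_EPISODES_SINCE.
const SUMMARIES = { THRESHOLD_TOKENS: 200, MIN_EPISODES_SINCE: 5 };

// totalTokens: running token count for the session (assumed input).
// episodesSinceLast: episodes written after the last summary (assumed input).
function shouldSummarize({ totalTokens, episodesSinceLast }, cfg = SUMMARIES) {
  const enoughTokens = totalTokens > cfg.THRESHOLD_TOKENS;            // "exceeds"
  const enoughEpisodes = episodesSinceLast >= cfg.MIN_EPISODES_SINCE; // "at least"
  return enoughTokens && enoughEpisodes;
}

console.log(shouldSummarize({ totalTokens: 350, episodesSinceLast: 5 })); // true
console.log(shouldSummarize({ totalTokens: 350, episodesSinceLast: 4 })); // false
console.log(shouldSummarize({ totalTokens: 200, episodesSinceLast: 9 })); // false
```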
---
## Summary Rows and Cumulative Updates
Each session can have multiple summary rows in the `summaries` table.
The update strategy depends on the size of the most recent summary:
| Condition | Action |
|---|---|
| No existing summary | Generate fresh summary from all episodes |
| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |
This produces a chain of summary rows over time. Each row's `episode_range`
covers only the episodes summarised in that specific pass (e.g. `259-263`),
not all episodes in the session.
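The decision table above maps onto a three-way branch. The sketch below is illustrative only: `chooseSummaryAction` and the returned labels are hypothetical, the size of a summary is read from the row's `token_count` field by default (how the real service measures it is an assumption, so `measure` is injectable), and a summary exactly at the limit is treated as "under".

```js
const MAX_SUMMARY_TOKENS = 800; // SUMMARIES.MAX_SUMMARY_TOKENS default

// `latest` is the most recent summary row for the session, or null if none.
// `measure` decides how large that row is; swap it to count characters instead.
function chooseSummaryAction(latest, measure = (row) => row.token_count) {
  if (!latest) return 'create-fresh';
  return measure(latest) <= MAX_SUMMARY_TOKENS ? 'update-latest' : 'create-new-row';
}

console.log(chooseSummaryAction(null));                 // create-fresh
console.log(chooseSummaryAction({ token_count: 579 })); // update-latest
console.log(chooseSummaryAction({ token_count: 900 })); // create-new-row
```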
---
## Ollama Request
```js
{
model: EXTRACTION_MODEL, // qwen2.5:3b (set via EXTRACTION_MODEL env var)
prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
stream: false,
// No format: 'json' — free-text output required for summaries
options: {
temperature: 0.2,
num_predict: 500,
},
}
```
`temperature: 0.2` is slightly higher than extraction (0.1) — summaries
benefit from some fluency. `num_predict: 500` gives room for 5 thorough
sentences without risk of runoff.
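Sending that request might look like the following sketch. It assumes Ollama's `/api/generate` endpoint, Node 18+ with a global `fetch`, and the `EXTRACTION_URL` / `EXTRACTION_MODEL` variables described in this document; `buildGenerateRequest` and `generateSummary` are hypothetical names. With `stream: false`, Ollama returns a single JSON object whose `response` field holds the generated text.

```js
const EXTRACTION_URL = process.env.EXTRACTION_URL ?? 'http://localhost:11434';
const EXTRACTION_MODEL = process.env.EXTRACTION_MODEL ?? 'qwen2.5:3b';

// Builds the fetch options for one non-streaming generation call.
function buildGenerateRequest(prompt) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: EXTRACTION_MODEL,
      prompt,
      stream: false,
      options: { temperature: 0.2, num_predict: 500 },
    }),
  };
}

// Not invoked here: requires a reachable Ollama instance.
async function generateSummary(prompt) {
  const res = await fetch(`${EXTRACTION_URL}/api/generate`, buildGenerateRequest(prompt));
  if (!res.ok) throw new Error(`Ollama returned ${res.status}`);
  const data = await res.json();
  return data.response?.trim() ?? '';
}
```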
---
## Prompt Format
ChatML format — native to qwen2.5:
```
<|im_start|>user
Summarize the conversation below in 3-5 sentences.
Write in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the summary text.
Conversation:
{context}
<|im_end|>
<|im_start|>assistant
```
For cumulative updates, the instruction and context change:
```
<|im_start|>user
Update the summary below to incorporate the new exchanges.
Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the updated summary text.
Previous summary:
{existingSummary}
New exchanges:
{context}
<|im_end|>
<|im_start|>assistant
```
### Input truncation
Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the
most recent exchanges (sliced from the end). This keeps Qwen focused and
prevents the prompt from exceeding its effective context window.
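A minimal sketch of the truncation step, assuming episodes expose `user_message` and `ai_response` fields (hypothetical names) and that exchanges are joined with a blank line. `String.prototype.slice` with a negative index keeps the tail of the string, which is what "sliced from the end" amounts to. One consequence worth knowing: the cut can land in the middle of an older exchange, so the first retained lines may be a partial message.

```js
const MAX_CHARS = 3000;

// Joins episodes oldest-first, then keeps only the last `maxChars` characters
// so the most recent exchanges survive.
function buildContext(episodes, maxChars = MAX_CHARS) {
  const text = episodes
    .map((ep) => `User: ${ep.user_message}\nAssistant: ${ep.ai_response}`)
    .join('\n\n');
  return text.length > maxChars ? text.slice(-maxChars) : text;
}

const sample = [
  { user_message: 'a'.repeat(2000), ai_response: 'b'.repeat(2000) },
  { user_message: 'q', ai_response: 'final answer' },
];
const context = buildContext(sample);
console.log(context.length);                               // 3000
console.log(context.endsWith('Assistant: final answer'));  // true
```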
---
## ChatML Token Stripping
Qwen occasionally echoes ChatML tokens back into its response. The raw output
is cleaned before saving:
```js
const raw = data.response?.trim() ?? '';
const content = raw
.replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
.replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
.trim();
return content;
```
Without this, leaked tokens get stored in the summary and then injected
back into the next summarisation prompt — causing the model to append a new
summary after the old one rather than replacing it.
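To see what the two replacements do, the same expressions are wrapped below in a hypothetical `stripChatML` helper and run on sample strings (the inputs are made up for illustration). One behaviour to be aware of: the first pattern removes everything between a matched `<|im_start|>` and `<|im_end|>` pair, so if the model wraps its whole answer in such a pair the result is an empty string.

```js
function stripChatML(raw) {
  return (raw ?? '')
    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')            // whole echoed blocks
    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')  // stray single tokens
    .trim();
}

// A stray closing token is dropped and the text is kept.
console.log(stripChatML('The user asked about SQLite.<|im_end|>'));
// The user asked about SQLite.

// An echoed prompt block is removed, leaving only the text after it.
console.log(stripChatML('<|im_start|>user\nSummarize...<|im_end|>\nThe user asked about SQLite.'));
// The user asked about SQLite.

// A fully wrapped answer is removed entirely.
console.log(JSON.stringify(stripChatML('<|im_start|>assistant\nA summary.<|im_end|>')));
// ""
```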
---
## Episode Range Tracking
Each summary row stores `episode_range` as `"firstId-lastId"` covering only
the episodes summarised in that pass:
```js
const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b);
const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
```
This makes SummaryView cards meaningful — "Episodes 259-263" tells you
exactly which exchanges that summary covers, rather than always showing
the full session range.
---
## Summary Storage
Summaries are written directly to the memory service from orchestration:
```js
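// Create new row
await fetch(`${MEMORY_URL}/summaries`, {
  method: 'POST',
  // JSON content type is needed if the memory service parses bodies with express.json()
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
});
// Update existing row
await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ content, tokenCount, episodeRange }),
});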
```
`session.id` here is the internal SQLite integer ID — not the external UUID.
It is available directly on the `session` object passed from `chat/index.js`.
---
## Client-Side Indicator
The chat client shows a "Summarising…" spinner in the `ChatWindow` header
and on the InfoPanel's Session Memory button while summarisation may be
in progress.
Since summarisation is fire-and-forget with no completion signal back to
the client, the indicator is timer-based: it activates when the stream
finishes and clears after 8 seconds.
```js
// In App.jsx, watching the streaming state from useChat:
useEffect(() => {
if (prevStreaming.current && !streaming) {
setSummarising(true);
const t = setTimeout(() => setSummarising(false), 8000);
return () => clearTimeout(t);
}
prevStreaming.current = streaming;
}, [streaming]);
```
---
## Environment Variables
Set in `packages/orchestration-service/src/.env`:
| Variable | Default | Description |
|---|---|---|
| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |
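
One way the three `SUMMARY_*` overrides could be read is sketched below. `readInt` and `loadSummaryConfig` are hypothetical helpers, not the service's code, and falling back to the default when a value is missing or not a positive integer is an assumption about the desired behaviour.

```js
// Parses a positive integer from env, or returns the fallback.
function readInt(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Keys mirror the SUMMARIES constants; defaults mirror the table above.
function loadSummaryConfig(env = process.env) {
  return {
    THRESHOLD_TOKENS: readInt(env, 'SUMMARY_THRESHOLD_TOKENS', 200),
    MAX_SUMMARY_TOKENS: readInt(env, 'SUMMARY_MAX_TOKENS', 800),
    MIN_EPISODES_SINCE: readInt(env, 'SUMMARY_MIN_EPISODES', 5),
  };
}

console.log(loadSummaryConfig({ SUMMARY_MAX_TOKENS: '1200', SUMMARY_MIN_EPISODES: 'abc' }));
// { THRESHOLD_TOKENS: 200, MAX_SUMMARY_TOKENS: 1200, MIN_EPISODES_SINCE: 5 }
```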