diff --git a/docs/reference/API-routes.md b/docs/reference/API-routes.md index 7c59e50..a406a36 100644 --- a/docs/reference/API-routes.md +++ b/docs/reference/API-routes.md @@ -120,6 +120,38 @@ all projects use isolated memory. Returns `201` with the created project object. Only provided fields are updated — omitted fields are not touched. +### Summaries + +| Method | Path | Description | +|---|---|---| +| GET | /summaries/session/:sessionId | Get all summaries for a session (by external UUID) | +| GET | /summaries/project/:projectId | Get all summaries for a project | + +**GET /summaries/session/:sessionId** — resolves the external UUID to an +internal session ID, then fetches summaries from the memory service. +Returns an array of summary objects ordered by `created_at` ascending. + +**GET /summaries/project/:projectId** — proxies directly to the memory +service project summaries endpoint. + +**Summary object shape:** +```json +{ + "id": 8, + "session_id": 72, + "project_id": null, + "content": "The user asked about...", + "token_count": 579, + "episode_range": "246-251", + "created_at": 1776766518, + "updated_at": 1776766518 +} +``` + +> **Proxy requirement:** `/summaries` must be added to both the Caddyfile +> reverse proxy and the Vite dev proxy config alongside the other route +> prefixes. See `orchestration-service.md` for the Caddy block pattern. + ### Models | Method | Path | Description | @@ -269,6 +301,29 @@ Both fields are optional. Only provided fields are updated. Same request/response shape as orchestration `/projects` above. 
+### Summaries + +| Method | Path | Description | +|---|---|---| +| POST | /summaries | Create a new summary | +| GET | /sessions/:id/summaries | Get all summaries for a session (internal ID) | +| GET | /projects/:id/summaries | Get all summaries for a project | +| PATCH | /summaries/:id | Update a summary (content, tokenCount, episodeRange) | +| DELETE | /summaries/:id | Delete a summary | + +**POST /summaries — body:** +```json +{ + "sessionId": 72, + "content": "The user discussed...", + "tokenCount": 579, + "episodeRange": "246-251" +} +``` +`content` is required. Either `sessionId` or `projectId` is required. + +**PATCH /summaries/:id — body:** any subset of `content`, `tokenCount`, `episodeRange`. + ### Entities | Method | Path | Description | diff --git a/docs/services/entity-extraction.md b/docs/services/entity-extraction.md new file mode 100644 index 0000000..2f6ea12 --- /dev/null +++ b/docs/services/entity-extraction.md @@ -0,0 +1,178 @@ +# Memory Service + +**Package:** `@nexusai/memory-service` +**Location:** `packages/memory-service` +**Deployed on:** Mini PC 1 (192.168.0.81) +**Port:** 3002 + +## Purpose + +Responsible for all reading and writing of long-term memory. Acts as the +sole interface to both SQLite and Qdrant — no other service accesses these +stores directly. On episode creation, automatically calls the embedding +service to generate and store a vector in Qdrant. 
+ +## Dependencies + +- `express` — HTTP API +- `better-sqlite3` — SQLite driver +- `@qdrant/js-client-rest` — Qdrant vector store client +- `dotenv` — environment variable loading +- `@nexusai/shared` — shared utilities and constants + +## Environment Variables + +| Variable | Required | Default | Description | +|---|---|---|---| +| PORT | No | 3002 | Port to listen on | +| SQLITE_PATH | Yes | — | Path to SQLite database file | +| QDRANT_URL | No | http://localhost:6333 | Qdrant instance URL | +| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL | +| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for entity extraction | +| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for entity extraction | + +## Internal Structure + +``` +src/ +├── db/ +│ ├── index.js # SQLite connection + initialization + migrations +│ ├── schema.js # Table definitions, indexes, FTS5, triggers +│ ├── projects.js # Project CRUD functions +│ └── summaries.js # Summary CRUD functions +├── episodic/ +│ └── index.js # Session + episode CRUD, FTS search, embedding write path +├── semantic/ +│ └── index.js # Qdrant collection management, upsert, search, delete +├── entities/ +│ ├── index.js # Entity + relationship CRUD +│ └── extraction.js # Automatic entity extraction via qwen2.5:3b on Ollama +└── index.js # Express app + all route definitions +``` + +## SQLite Schema + +Seven core tables: + +- **sessions** — top-level conversation containers. 
Fields: `external_id`, `name`, `project_id`, `metadata` +- **episodes** — individual exchanges (user message + AI response) tied to a session +- **entities** — named things the system learns about (people, places, concepts) +- **relationships** — directional labeled links between entities +- **summaries** — condensed episode groups for efficient context retrieval +- **projects** — named groupings of sessions with `name`, `description`, `colour`, `icon`, `isolated`, `notes`, `system_prompt` + +### Migrations + +Schema changes that cannot use `CREATE TABLE IF NOT EXISTS` are applied as +idempotent migrations in `db/index.js` at startup: + +```js +try { db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`); } catch {} +try { db.exec(`ALTER TABLE sessions ADD COLUMN project_id INTEGER REFERENCES projects(id)`); } catch {} +try { db.exec(`CREATE INDEX IF NOT EXISTS idx_sessions_project ON sessions(project_id)`); } catch {} +try { db.exec(`ALTER TABLE projects ADD COLUMN isolated INTEGER NOT NULL DEFAULT 0`); } catch {} +try { db.exec(`ALTER TABLE projects ADD COLUMN notes TEXT`); } catch {} +try { db.exec(`ALTER TABLE projects ADD COLUMN system_prompt TEXT`); } catch {} +``` + +New migrations are always appended here — never modify the schema file for +existing tables since `ALTER TABLE` cannot use `IF NOT EXISTS`. + +### FTS5 Full-Text Search + +An `episodes_fts` virtual table enables keyword search across all episodes. +Three triggers (`episodes_fts_insert`, `episodes_fts_update`, `episodes_fts_delete`) +keep the FTS index automatically in sync with the episodes table. 
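As a rough illustration of the FTS5 paragraph above, the virtual table and its three sync triggers could look like the sketch below. The trigger names come from the text; the column names (`user_message`, `ai_response`), the external-content layout, and the `applyFtsSchema` helper are assumptions made for this example and may differ from what `schema.js` actually defines.

```javascript
// Sketch only: column names and the external-content layout are assumptions,
// not taken from schema.js. Trigger names match the ones listed in the docs.
const FTS_SCHEMA = `
CREATE VIRTUAL TABLE IF NOT EXISTS episodes_fts USING fts5(
  user_message, ai_response,
  content='episodes', content_rowid='id'
);

CREATE TRIGGER IF NOT EXISTS episodes_fts_insert AFTER INSERT ON episodes BEGIN
  INSERT INTO episodes_fts(rowid, user_message, ai_response)
  VALUES (new.id, new.user_message, new.ai_response);
END;

CREATE TRIGGER IF NOT EXISTS episodes_fts_delete AFTER DELETE ON episodes BEGIN
  INSERT INTO episodes_fts(episodes_fts, rowid, user_message, ai_response)
  VALUES ('delete', old.id, old.user_message, old.ai_response);
END;

CREATE TRIGGER IF NOT EXISTS episodes_fts_update AFTER UPDATE ON episodes BEGIN
  INSERT INTO episodes_fts(episodes_fts, rowid, user_message, ai_response)
  VALUES ('delete', old.id, old.user_message, old.ai_response);
  INSERT INTO episodes_fts(rowid, user_message, ai_response)
  VALUES (new.id, new.user_message, new.ai_response);
END;
`;

// Hypothetical helper: accepts any handle exposing exec(sql),
// such as a better-sqlite3 Database instance.
function applyFtsSchema(db) {
  db.exec(FTS_SCHEMA);
}
```

With an external-content table the index stores no copy of the text, so the delete and update triggers must pass the old column values to remove the stale entries; that is why each of them issues the special `'delete'` insert.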
+ +### SQLite Configuration + +- `journal_mode = WAL` — non-blocking reads during writes +- `foreign_keys = ON` — enforces referential integrity and cascade deletes +- PRAGMAs set via `db.pragma()`, not `db.exec()` + +### Dynamic Updates + +Both `updateSession` and `updateProject` build their `SET` clause dynamically +from only the fields passed — prevents partial updates from overwriting fields +that weren't touched. + +`updateProject` allowlist: +```js +const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt']; +``` + +## Qdrant / Semantic Layer + +Three Qdrant collections are initialized on service startup via `semantic.initCollections()`: + +| Collection | Purpose | +|---|---| +| `episodes` | Embeddings for individual conversation exchanges | +| `entities` | Embeddings for named entities | +| `summaries` | Embeddings for condensed episode summaries | + +All collections use **768-dimension vectors** with **Cosine similarity**, +matching `nomic-embed-text` via Ollama. Vector size and distance metric are +defined in `@nexusai/shared` — not hardcoded here. + +`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any +collection that doesn't already exist at startup — all three collections are +guaranteed to exist before any requests are handled, avoiding race conditions +between the first entity embed and an entity search. + +Each collection exposes upsert, search (with optional Qdrant filter), and +delete operations. The `wait: true` flag is used on all writes. + +## Embedding Write Path + +When a new episode is created: + +1. Episode saved to SQLite synchronously — response returned immediately +2. User message + AI response combined: `User: ...\nAssistant: ...` +3. Text sent to embedding service (`POST /embed`) +4. 
Vector upserted into `episodes` Qdrant collection with payload `{ sessionId, createdAt }` + +This step is **fire-and-forget** — if embedding fails, the episode is still +saved and searchable via FTS. The error is logged but not surfaced. + +> The Qdrant payload stores `sessionId` (the internal integer ID). See +> `memory-isolation.md` for how project-level filtering works. + +## Entity Layer + +Entities and relationships use upsert semantics with composite unique +constraints to prevent duplicates: + +- `UNIQUE(name, type)` on entities +- `UNIQUE(from_id, to_id, label)` on relationships +- `ON DELETE CASCADE` on relationship foreign keys + +After each episode is saved, `extraction.js` automatically extracts named +entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget. + +> For full details on the extraction pipeline, prompt format, constrained +> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`. + +## Summaries Layer + +Session summaries are generated by `orchestration-service/src/services/summarization.js` +after each episode write and stored here via `POST /summaries`. The memory +service is responsible only for CRUD — generation logic lives in orchestration. + +> For full details on trigger conditions, prompt format, cumulative updates, +> and ChatML token stripping, see `summarization.md`. + +## Project Delete Behaviour + +Deleting a project runs as a transaction — it first nulls out `project_id` +on all assigned sessions, then deletes the project. This avoids a foreign +key constraint failure since `sessions.project_id` has no `ON DELETE` rule: + +```js +const doDelete = db.transaction(() => { + db.prepare(`UPDATE sessions SET project_id = NULL WHERE project_id = ?`).run(id); + db.prepare(`DELETE FROM projects WHERE id = ?`).run(id); +}); +``` + +For all HTTP endpoints, see `api-routes.md`. 
\ No newline at end of file diff --git a/docs/services/memory-service.md b/docs/services/memory-service.md index 7134f7c..2f6ea12 100644 --- a/docs/services/memory-service.md +++ b/docs/services/memory-service.md @@ -38,7 +38,8 @@ src/ ├── db/ │ ├── index.js # SQLite connection + initialization + migrations │ ├── schema.js # Table definitions, indexes, FTS5, triggers -│ └── projects.js # Project CRUD functions +│ ├── projects.js # Project CRUD functions +│ └── summaries.js # Summary CRUD functions ├── episodic/ │ └── index.js # Session + episode CRUD, FTS search, embedding write path ├── semantic/ @@ -51,7 +52,7 @@ src/ ## SQLite Schema -Six core tables: +Seven core tables: - **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata` - **episodes** — individual exchanges (user message + AI response) tied to a session @@ -100,12 +101,9 @@ that weren't touched. const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt']; ``` -This means saving just `{ notes: "..." }` or `{ system_prompt: "..." }` won't -touch any other field. - ## Qdrant / Semantic Layer -Three Qdrant collections are initialized on service startup: +Three Qdrant collections are initialized on service startup via `semantic.initCollections()`: | Collection | Purpose | |---|---| @@ -117,9 +115,13 @@ All collections use **768-dimension vectors** with **Cosine similarity**, matching `nomic-embed-text` via Ollama. Vector size and distance metric are defined in `@nexusai/shared` — not hardcoded here. -Each collection exposes three operations in `src/semantic/index.js`: -upsert, search (with optional Qdrant filter), and delete. The `wait: true` -flag is used on all writes. 
+`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any +collection that doesn't already exist at startup — all three collections are +guaranteed to exist before any requests are handled, avoiding race conditions +between the first entity embed and an entity search. + +Each collection exposes upsert, search (with optional Qdrant filter), and +delete operations. The `wait: true` flag is used on all writes. ## Embedding Write Path @@ -133,8 +135,7 @@ When a new episode is created: This step is **fire-and-forget** — if embedding fails, the episode is still saved and searchable via FTS. The error is logged but not surfaced. -> The Qdrant payload stores `sessionId` (the internal integer ID). This is -> used for per-session and per-project filtering during semantic search. See +> The Qdrant payload stores `sessionId` (the internal integer ID). See > `memory-isolation.md` for how project-level filtering works. ## Entity Layer @@ -146,34 +147,20 @@ constraints to prevent duplicates: - `UNIQUE(from_id, to_id, label)` on relationships - `ON DELETE CASCADE` on relationship foreign keys -### Automatic Entity Extraction - After each episode is saved, `extraction.js` automatically extracts named -entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1). -This runs **fire-and-forget** — the episode is already saved and returned -before extraction begins. +entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget. -**Entity types extracted:** `person`, `place`, `project`, `technology`, -`concept`, `organization` +> For full details on the extraction pipeline, prompt format, constrained +> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`. -The extraction prompt uses ChatML format (native to qwen2.5) and primes the -response by ending with `[` to steer the model directly into JSON array output. 
-A list of already-known entities is injected into the prompt so the model -reuses existing `(name, type)` pairs rather than creating duplicates with -different types. +## Summaries Layer -After extraction, each entity is: -1. Upserted into SQLite via `upsertEntity` — notes are only written if - the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents - overwriting existing notes with speculative updates) -2. Embedded via the embedding service and upserted into the `entities` - Qdrant collection with `{ name, type, notes, projectId }` as payload — - `projectId` scopes entities to their project for isolated retrieval +Session summaries are generated by `orchestration-service/src/services/summarization.js` +after each episode write and stored here via `POST /summaries`. The memory +service is responsible only for CRUD — generation logic lives in orchestration. -`extractAndStoreEntities` receives `projectId` from `createEpisode`, which -receives it from the episode route, which receives it from orchestration's -`createEpisode` call. This ensures entities are tagged with the correct -project scope at extraction time. +> For full details on trigger conditions, prompt format, cumulative updates, +> and ChatML token stripping, see `summarization.md`. ## Project Delete Behaviour diff --git a/docs/services/orchestration-service.md b/docs/services/orchestration-service.md index a04d0d6..89aebfd 100644 --- a/docs/services/orchestration-service.md +++ b/docs/services/orchestration-service.md @@ -30,31 +30,33 @@ or inference services — all traffic flows through orchestration. 
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props | | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search | | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests | -| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json | +| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation | +| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation | ## Internal Structure ``` src/ ├── services/ -│ ├── memory.js # HTTP client for memory service -│ ├── inference.js # HTTP client for inference service -│ ├── embedding.js # HTTP client for embedding service -│ └── qdrant.js # HTTP client for Qdrant (direct vector search) +│ ├── memory.js # HTTP client for memory service +│ ├── inference.js # HTTP client for inference service +│ ├── embedding.js # HTTP client for embedding service +│ ├── qdrant.js # HTTP client for Qdrant (direct vector search) +│ └── summarization.js # Session summarisation — triggers after each episode ├── chat/ -│ └── index.js # Core pipeline — context assembly, isolation, auto-naming +│ └── index.js # Core pipeline — context assembly, isolation, auto-naming ├── config/ -│ └── settings.js # Settings load/save — reads/writes data/settings.json +│ └── settings.js # Settings load/save — reads/writes data/settings.json ├── routes/ -│ ├── chat.js # POST /chat and POST /chat/stream -│ ├── sessions.js # Session CRUD proxy -│ ├── projects.js # Project CRUD proxy — passes req.body straight through -│ ├── episodes.js # Episode list and delete proxy -│ ├── settings.js # GET /settings and PATCH /settings -│ ├── health.js # GET /health — pings all four services -│ └── models.js # GET /models — scans .gguf files live, merges with models.json - # GET /models/props — context window + loaded model from llama-server -└── index.js # Express app entry point +│ ├── chat.js # POST /chat and POST /chat/stream +│ ├── 
sessions.js # Session CRUD proxy +│ ├── projects.js # Project CRUD proxy +│ ├── episodes.js # Episode list and delete proxy +│ ├── summaries.js # GET /summaries/session/:id and /summaries/project/:id +│ ├── settings.js # GET /settings and PATCH /settings +│ ├── health.js # GET /health/services — pings all four services +│ └── models.js # GET /models and GET /models/props +└── index.js # Express app entry point ``` The `services/` layer wraps all downstream HTTP calls in named functions. @@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart | `topK` | 40 | Top-K token candidates per step | | `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. | -Defaults are defined in `config/settings.js` and fall back to constants in -`@nexusai/shared`. Values saved in `settings.json` take precedence. - ## Chat Pipeline Both `POST /chat` and `POST /chat/stream` share the same steps. The only @@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client. ### Steps 1. **Session resolution** — look up session by `externalId`. Auto-create if - not found. Clients generate a UUID for new conversations — no pre-creation - step needed. + not found. 2. **Project context resolution** — if the session has a `project_id`, fetch the project and all its session IDs. Used to scope semantic search. The project's `system_prompt` is also read at this step if set. 3. **System prompt resolution** — three-tier hierarchy: - - `project.system_prompt` — if the session is in a project and it's set (highest priority) + - `project.system_prompt` — highest priority - `settings.systemPrompt` — global setting from `settings.json` - - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort) + - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort) -4. 
**Recent episode retrieval** — fetch the most recent episodes for the - session (`recentEpisodeLimit`, default 5). +4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`). -5. **Semantic search** — embed the user message, query Qdrant for the top - most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated - against recent episodes. Non-critical — if it fails, pipeline continues with - recency-only context. +5. **Semantic search** — embed user message, query Qdrant for similar past + episodes. Deduplicated against recent episodes. Non-critical. -6. **Entity search** — query the `entities` Qdrant collection filtered by +6. **Entity search** — query `entities` Qdrant collection filtered by `projectId`. Non-project sessions receive no entity context. Non-critical. -7. **Prompt assembly** — combine resolved system prompt, entity context, - semantic episodes, recent episodes, and user message. +7. **Prompt assembly** — combine system prompt, entity context, semantic + episodes, recent episodes, and user message. -8. **Inference** — send to inference service with settings-derived parameters - (temperature, topP, topK, repeatPenalty). `/chat` awaits full response; +8. **Inference** — send to inference service. `/chat` awaits full response; `/chat/stream` pipes SSE chunks to the client. -9. **Episode write** — write the exchange back to memory with `projectId`. - Fire-and-forget for `/chat`; awaited for `/chat/stream`. +9. **Episode write** — write exchange back to memory with `projectId`. -10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary - inference call with a naming prompt (max 20 tokens, temperature 0.3) and - write the result back as `session.name`. Fully fire-and-forget. +10. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` called + fire-and-forget. See `summarization.md` for full details. + +11. 
**Auto-naming** — on first message with no session name, fires a secondary + inference call (max 20 tokens, temperature 0.3) to generate a session name. ### Prompt Structure @@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client. Here is what you know about entities relevant to this conversation: - {name} ({type}): {notes} -... (up to 5 entity results) --- Here are some relevant memories from earlier conversations: User: {past user message} Assistant: {past ai response} -... (up to semanticLimit semantic episodes) --- Here are some relevant memories from your past conversations: User: {past user message} Assistant: {past ai response} -... (up to recentEpisodeLimit recent episodes) --- End of recent memories --- User: {current message} Assistant: ``` -Entity context appears first — before episodic memory — because structured -facts about known entities are the most stable and reliable context. Semantic -episodes follow, then recent episodes as the immediate conversation flow. +## Summarisation + +After each episode write, `triggerSummary` is called fire-and-forget. It +checks token thresholds and episode counts before generating, then stores +the result in the memory service. + +> For full details on trigger conditions, prompt format, cumulative updates, +> ChatML token stripping, and episode range tracking, see `summarization.md`. ## SSE Stream Format @@ -168,46 +165,36 @@ data: {"text":"Hello"} data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42} ``` -The `[DONE]` sentinel is consumed internally and not forwarded. The stream -is terminated by `res.end()` after the done event. +The `[DONE]` sentinel is consumed internally and not forwarded. ## Models Route -`GET /models` scans `.gguf` files live on each request from `modelsFolderPath` -(read from settings). Merges results with a `models.json` file in the same -folder for richer metadata (label, description). Returns file size in GB. 
+`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges +with `models.json` for metadata. Returns file size in GB. -`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`. -Returns `{ contextWindow, modelAlias }`. `n_ctx` is at -`data.default_generation_settings.n_ctx` in the llama-server response. -Returns `503` if llama-server is unreachable. +`GET /models/props` fetches directly from llama-server. Returns +`{ contextWindow, modelAlias }`. Returns `503` if unreachable. ## Sessions Route Behaviour -`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both. -The validation guard only rejects requests where neither is provided: - -```js -if (!name?.trim() && projectId === undefined) { - return res.status(400).json({ error: 'name or projectId is required' }); -} -``` - -This allows `useChat` to write project assignment separately from rename -operations. +`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both. +Rejects only when neither is provided — allows `useChat` to write project +assignment separately from rename operations. ## Caddy Configuration -Each route prefix needs a handle block in the Caddyfile on Mini PC 2: +Each route prefix needs a handle block in the Caddyfile on Mini PC 2. 
+**Any new top-level route must be added here AND in `vite.config.js`.** ``` -handle /chat* { reverse_proxy localhost:4000 } -handle /sessions* { reverse_proxy localhost:4000 } -handle /models* { reverse_proxy localhost:4000 } -handle /projects* { reverse_proxy localhost:4000 } -handle /episodes* { reverse_proxy localhost:4000 } -handle /settings* { reverse_proxy localhost:4000 } -handle /health* { reverse_proxy localhost:4000 } +handle /chat* { reverse_proxy localhost:4000 } +handle /sessions* { reverse_proxy localhost:4000 } +handle /models* { reverse_proxy localhost:4000 } +handle /projects* { reverse_proxy localhost:4000 } +handle /episodes* { reverse_proxy localhost:4000 } +handle /settings* { reverse_proxy localhost:4000 } +handle /summaries* { reverse_proxy localhost:4000 } +handle /health* { reverse_proxy localhost:4000 } ``` After updating: `caddy reload --config /path/to/Caddyfile` diff --git a/docs/services/shared.md b/docs/services/shared.md index 69ffffa..3b7e4de 100644 --- a/docs/services/shared.md +++ b/docs/services/shared.md @@ -165,10 +165,16 @@ Orchestration pipeline defaults. Used as fallback values in | `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt | | `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt | | `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results | +| `ENTITIES_LIMIT` | `5` | Max entity search results to inject into prompt | +| `ENTITIES_THRESHOLD` | `0.55` | Minimum similarity score for entity results | | `TEMPERATURE` | `0.7` | Default inference temperature | | `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin | | `SYSTEM_PROMPT` | *(see below)* | Default system prompt | +> `ENTITIES_THRESHOLD` is set to `0.55` — lower than `SCORE_THRESHOLD` because +> entity notes generated by a 3B model tend to embed with lower cosine similarity +> than full episode text. Tune upward if irrelevant entities appear in context. 
+ > `repeatPenalty`, `topP`, and `topK` defaults are sourced from > `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`, > since those constants already define the canonical values. @@ -178,6 +184,25 @@ Default system prompt: > of past conversations with the user. Use them to provide consistent, > personalised responses." +#### `SUMMARIES` + +Controls the automatic session summarisation system in `orchestration-service/src/services/summarization.js`. + +| Key | Value | Description | +|---|---|---| +| `THRESHOLD_TOKENS` | `200` | Minimum total session tokens before summarisation is considered | +| `MAX_SUMMARY_TOKENS` | `800` | If existing summary exceeds this length (chars), create a new row instead of updating | +| `MIN_EPISODES_SINCE` | `5` | Minimum new episodes since last summary before re-summarising | + +These can be overridden per-deployment via environment variables in the +orchestration service `.env`: + +``` +SUMMARY_THRESHOLD_TOKENS=200 +SUMMARY_MAX_TOKENS=800 +SUMMARY_MIN_EPISODES=5 +``` + #### `SQLITE` | Key | Value | Description | diff --git a/docs/services/summarization.md b/docs/services/summarization.md new file mode 100644 index 0000000..f8af7b4 --- /dev/null +++ b/docs/services/summarization.md @@ -0,0 +1,201 @@ +# Summarization + +Session summarization generates rolling plain-text summaries of conversation +history, giving the model a condensed view of past context without consuming +the full context window with raw episodes. + +**Location:** `packages/orchestration-service/src/services/summarization.js` +**Triggered by:** `chat/index.js` after every episode write (fire-and-forget) +**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81) + +--- + +## Trigger Conditions + +`triggerSummary(session, allEpisodes)` calls `maybeSummarize` fire-and-forget. +`maybeSummarize` proceeds only when both conditions are met: + +1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200) +2. 
At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have + accumulated since the last summary + +The token threshold is intentionally low — it ensures summaries start +generating early in a session's life rather than only after very long +conversations. + +--- + +## Summary Rows and Cumulative Updates + +Each session can have multiple summary rows in the `summaries` table. +The update strategy depends on the size of the most recent summary: + +| Condition | Action | +|---|---| +| No existing summary | Generate fresh summary from all episodes | +| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context | +| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation | + +This produces a chain of summary rows over time. Each row's `episode_range` +covers only the episodes summarised in that specific pass (e.g. `259-263`), +not all episodes in the session. + +--- + +## Ollama Request + +```js +{ + model: EXTRACTION_MODEL, // qwen2.5:3b (set via EXTRACTION_MODEL env var) + prompt: buildSummaryPrompt(episodesToSummarize, existingSummary), + stream: false, + // No format: 'json' — free-text output required for summaries + options: { + temperature: 0.2, + num_predict: 500, + }, +} +``` + +`temperature: 0.2` is slightly higher than extraction (0.1) — summaries +benefit from some fluency. `num_predict: 500` gives room for 5 thorough +sentences without risk of runoff. + +--- + +## Prompt Format + +ChatML format — native to qwen2.5: + +``` +<|im_start|>user +Summarize the conversation below in 3-5 sentences. +Write in third person. Do not quote directly — paraphrase only. +Do not include greetings, sign-offs, or filler. Output only the summary text. + +Conversation: +{context} +<|im_end|> +<|im_start|>assistant +``` + +For cumulative updates, the instruction and context change: + +``` +<|im_start|>user +Update the summary below to incorporate the new exchanges. 
+Write 3-5 sentences in third person. Do not quote directly — paraphrase only. +Do not include greetings, sign-offs, or filler. Output only the updated summary text. + +Previous summary: +{existingSummary} + +New exchanges: +{context} +<|im_end|> +<|im_start|>assistant +``` + +### Input truncation + +Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the +most recent exchanges (sliced from the end). This keeps Qwen focused and +prevents the prompt from exceeding its effective context window. + +--- + +## ChatML Token Stripping + +Qwen occasionally echoes ChatML tokens back into its response. The raw output +is cleaned before saving: + +```js +const raw = data.response?.trim() ?? ''; +const content = raw + .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '') + .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '') + .trim(); +return content; +``` + +Without this, leaked tokens get stored in the summary and then injected +back into the next summarisation prompt — causing the model to append a new +summary after the old one rather than replacing it. + +--- + +## Episode Range Tracking + +Each summary row stores `episode_range` as `"firstId-lastId"` covering only +the episodes summarised in that pass: + +```js +const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b); +const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`; +``` + +This makes SummaryView cards meaningful — "Episodes 259-263" tells you +exactly which exchanges that summary covers, rather than always showing +the full session range. 
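The range calculation above can be wrapped in a small standalone helper, sketched here under stated assumptions: the name `buildEpisodeRange` and the `null` return for empty input are hypothetical and not part of `summarization.js`. The guard is there because the two-line snippet would produce `"undefined-undefined"` if it were ever handed an empty list.

```javascript
// Hypothetical helper wrapping the range snippet; the function name and the
// empty-input guard are assumptions, not part of summarization.js.
function buildEpisodeRange(episodes) {
  if (!Array.isArray(episodes) || episodes.length === 0) return null;
  // Sort a copy of the IDs numerically so the range is always low-high.
  const ids = episodes.map(ep => ep.id).sort((a, b) => a - b);
  return `${ids[0]}-${ids[ids.length - 1]}`;
}

// Episodes may arrive out of order; the range is still ascending.
const range = buildEpisodeRange([{ id: 261 }, { id: 259 }, { id: 263 }]);
console.log(range); // "259-263"
```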
+ +--- + +## Summary Storage + +Summaries are written directly to the memory service from orchestration: + +```js +// Create new row +await fetch(`${MEMORY_URL}/summaries`, { + method: 'POST', + body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }), +}); + +// Update existing row +await fetch(`${MEMORY_URL}/summaries/${latest.id}`, { + method: 'PATCH', + body: JSON.stringify({ content, tokenCount, episodeRange }), +}); +``` + +`session.id` here is the internal SQLite integer ID — not the external UUID. +It is available directly on the `session` object passed from `chat/index.js`. + +--- + +## Client-Side Indicator + +The chat client shows a "Summarising…" spinner in the `ChatWindow` header +and on the InfoPanel's Session Memory button while summarisation may be +in progress. + +Since summarisation is fire-and-forget with no completion signal back to +the client, the indicator is timer-based: it activates when the stream +finishes and clears after 8 seconds. + +```js +// In App.jsx, watching the streaming state from useChat: +useEffect(() => { + if (prevStreaming.current && !streaming) { + setSummarising(true); + const t = setTimeout(() => setSummarising(false), 8000); + return () => clearTimeout(t); + } + prevStreaming.current = streaming; +}, [streaming]); +``` + +--- + +## Environment Variables + +Set in `packages/orchestration-service/src/.env`: + +| Variable | Default | Description | +|---|---|---| +| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL | +| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation | +| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL | +| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers | +| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created | +| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising | \ No newline at end of file