documentation updates for entity extraction and summarization
@@ -120,6 +120,38 @@ all projects use isolated memory. Returns `201` with the created project object.
|
||||
|
||||
Only provided fields are updated — omitted fields are not touched.
|
||||
|
||||
### Summaries
|
||||
|
||||
| Method | Path | Description |
|---|---|---|
| GET | /summaries/session/:sessionId | Get all summaries for a session (by external UUID) |
| GET | /summaries/project/:projectId | Get all summaries for a project |
|
||||
|
||||
**GET /summaries/session/:sessionId** resolves the external UUID to an
internal session ID, then fetches summaries from the memory service.
Returns an array of summary objects ordered by `created_at` ascending.
|
||||
|
||||
**GET /summaries/project/:projectId** proxies directly to the memory
service's project summaries endpoint.
|
||||
|
||||
**Summary object shape:**
|
||||
```json
{
  "id": 8,
  "session_id": 72,
  "project_id": null,
  "content": "The user asked about...",
  "token_count": 579,
  "episode_range": "246-251",
  "created_at": 1776766518,
  "updated_at": 1776766518
}
```
|
||||
|
||||
> **Proxy requirement:** `/summaries` must be added to both the Caddyfile
|
||||
> reverse proxy and the Vite dev proxy config alongside the other route
|
||||
> prefixes. See `orchestration-service.md` for the Caddy block pattern.
|
||||
|
||||
### Models
|
||||
|
||||
| Method | Path | Description |
|
||||
@@ -269,6 +301,29 @@ Both fields are optional. Only provided fields are updated.
|
||||
|
||||
Same request/response shape as orchestration `/projects` above.
|
||||
|
||||
### Summaries
|
||||
|
||||
| Method | Path | Description |
|---|---|---|
| POST | /summaries | Create a new summary |
| GET | /sessions/:id/summaries | Get all summaries for a session (internal ID) |
| GET | /projects/:id/summaries | Get all summaries for a project |
| PATCH | /summaries/:id | Update a summary (content, tokenCount, episodeRange) |
| DELETE | /summaries/:id | Delete a summary |
|
||||
|
||||
**POST /summaries — body:**
|
||||
```json
{
  "sessionId": 72,
  "content": "The user discussed...",
  "tokenCount": 579,
  "episodeRange": "246-251"
}
```
|
||||
`content` is required. Either `sessionId` or `projectId` is required.
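
The body rules above can be sketched as a small validator. This is only an illustration: `validateSummaryBody` and its error strings are hypothetical names, not taken from the memory service source.

```js
// Hypothetical validator for the POST /summaries body (illustrative only).
// Returns an error message, or null when the body is acceptable.
function validateSummaryBody(body) {
  const { sessionId, projectId, content } = body ?? {};

  // `content` must be a non-empty string.
  if (typeof content !== 'string' || content.trim() === '') {
    return 'content is required';
  }

  // `== null` matches both null and undefined, so an explicit
  // `projectId: null` is treated the same as a missing field.
  if (sessionId == null && projectId == null) {
    return 'sessionId or projectId is required';
  }

  return null;
}
```

A route handler could respond with `400` whenever the function returns a non-null message.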
|
||||
|
||||
**PATCH /summaries/:id — body:** any subset of `content`, `tokenCount`, `episodeRange`.
|
||||
|
||||
### Entities
|
||||
|
||||
| Method | Path | Description |
|
||||
|
||||
178
docs/services/entity-extraction.md
Normal file
@@ -0,0 +1,178 @@
|
||||
# Memory Service
|
||||
|
||||
**Package:** `@nexusai/memory-service`
|
||||
**Location:** `packages/memory-service`
|
||||
**Deployed on:** Mini PC 1 (192.168.0.81)
|
||||
**Port:** 3002
|
||||
|
||||
## Purpose
|
||||
|
||||
Responsible for all reading and writing of long-term memory. Acts as the
|
||||
sole interface to both SQLite and Qdrant — no other service accesses these
|
||||
stores directly. On episode creation, automatically calls the embedding
|
||||
service to generate and store a vector in Qdrant.
|
||||
|
||||
## Dependencies
|
||||
|
||||
- `express` — HTTP API
|
||||
- `better-sqlite3` — SQLite driver
|
||||
- `@qdrant/js-client-rest` — Qdrant vector store client
|
||||
- `dotenv` — environment variable loading
|
||||
- `@nexusai/shared` — shared utilities and constants
|
||||
|
||||
## Environment Variables
|
||||
|
||||
| Variable | Required | Default | Description |
|
||||
|---|---|---|---|
|
||||
| PORT | No | 3002 | Port to listen on |
|
||||
| SQLITE_PATH | Yes | — | Path to SQLite database file |
|
||||
| QDRANT_URL | No | http://localhost:6333 | Qdrant instance URL |
|
||||
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
|
||||
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for entity extraction |
|
||||
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for entity extraction |
|
||||
|
||||
## Internal Structure
|
||||
|
||||
```
src/
├── db/
│   ├── index.js          # SQLite connection + initialization + migrations
│   ├── schema.js         # Table definitions, indexes, FTS5, triggers
│   ├── projects.js       # Project CRUD functions
│   └── summaries.js      # Summary CRUD functions
├── episodic/
│   └── index.js          # Session + episode CRUD, FTS search, embedding write path
├── semantic/
│   └── index.js          # Qdrant collection management, upsert, search, delete
├── entities/
│   ├── index.js          # Entity + relationship CRUD
│   └── extraction.js     # Automatic entity extraction via qwen2.5:3b on Ollama
└── index.js              # Express app + all route definitions
```
|
||||
|
||||
## SQLite Schema
|
||||
|
||||
Seven core tables:
|
||||
|
||||
- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
|
||||
- **episodes** — individual exchanges (user message + AI response) tied to a session
|
||||
- **entities** — named things the system learns about (people, places, concepts)
|
||||
- **relationships** — directional labeled links between entities
|
||||
- **summaries** — condensed episode groups for efficient context retrieval
|
||||
- **projects** — named groupings of sessions with `name`, `description`, `colour`, `icon`, `isolated`, `notes`, `system_prompt`
|
||||
|
||||
### Migrations
|
||||
|
||||
Schema changes that cannot use `CREATE TABLE IF NOT EXISTS` are applied as
|
||||
idempotent migrations in `db/index.js` at startup:
|
||||
|
||||
```js
try { db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`); } catch {}
try { db.exec(`ALTER TABLE sessions ADD COLUMN project_id INTEGER REFERENCES projects(id)`); } catch {}
try { db.exec(`CREATE INDEX IF NOT EXISTS idx_sessions_project ON sessions(project_id)`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN isolated INTEGER NOT NULL DEFAULT 0`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN notes TEXT`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN system_prompt TEXT`); } catch {}
```
|
||||
|
||||
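New migrations are always appended here. Do not edit the definitions of
existing tables in the schema file: `CREATE TABLE IF NOT EXISTS` skips tables
that already exist, and `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS`
form, which is why each statement is wrapped in `try`/`catch`.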
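
One possible refinement, sketched here as an assumption rather than as the service's actual code, is to swallow only the "duplicate column name" error that SQLite raises when a column already exists, so that genuine failures still surface. `runMigrations` is a hypothetical helper; `db` is anything with an `exec(sql)` method, such as a `better-sqlite3` database.

```js
// Hypothetical helper: apply ADD COLUMN migrations idempotently.
// Only the "duplicate column name" error is ignored; anything else is rethrown.
function runMigrations(db, statements) {
  const applied = [];
  for (const sql of statements) {
    try {
      db.exec(sql);
      applied.push(sql);
    } catch (err) {
      const alreadyApplied = /duplicate column name/i.test(String(err.message));
      if (!alreadyApplied) throw err;
    }
  }
  // The list of statements that actually changed the schema on this run.
  return applied;
}
```

Running the same list twice returns the applied statements the first time and an empty array the second time.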
|
||||
|
||||
### FTS5 Full-Text Search
|
||||
|
||||
An `episodes_fts` virtual table enables keyword search across all episodes.
|
||||
Three triggers (`episodes_fts_insert`, `episodes_fts_update`, `episodes_fts_delete`)
|
||||
keep the FTS index automatically in sync with the episodes table.
|
||||
|
||||
### SQLite Configuration
|
||||
|
||||
- `journal_mode = WAL` — non-blocking reads during writes
|
||||
- `foreign_keys = ON` — enforces referential integrity and cascade deletes
|
||||
- PRAGMAs set via `db.pragma()`, not `db.exec()`
|
||||
|
||||
### Dynamic Updates
|
||||
|
||||
Both `updateSession` and `updateProject` build their `SET` clause dynamically
from only the fields passed, so a partial update cannot overwrite fields that
were not supplied.
|
||||
|
||||
`updateProject` allowlist:
|
||||
```js
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
```
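
A minimal sketch of how a dynamic `SET` clause might be built from such an allowlist. `buildUpdate` is a hypothetical helper, not the actual `updateProject` implementation; it assumes the table has an `id` primary key.

```js
// Hypothetical helper: build an UPDATE statement from only the allowed,
// provided fields. Returns null when there is nothing to update.
function buildUpdate(table, id, fields, allowed) {
  const keys = Object.keys(fields).filter(
    (key) => allowed.includes(key) && fields[key] !== undefined
  );
  if (keys.length === 0) return null;

  // Column names come from the allowlist only; values are bound as parameters.
  const setClause = keys.map((key) => `${key} = ?`).join(', ');
  return {
    sql: `UPDATE ${table} SET ${setClause} WHERE id = ?`,
    params: [...keys.map((key) => fields[key]), id],
  };
}
```

With `better-sqlite3` the result could be executed as `db.prepare(q.sql).run(...q.params)`. Because column names never come from user input directly, unknown keys such as `bogus` are dropped silently.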
|
||||
|
||||
## Qdrant / Semantic Layer
|
||||
|
||||
Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
|
||||
|
||||
| Collection | Purpose |
|---|---|
| `episodes` | Embeddings for individual conversation exchanges |
| `entities` | Embeddings for named entities |
| `summaries` | Embeddings for condensed episode summaries |
|
||||
|
||||
All collections use **768-dimension vectors** with **Cosine similarity**,
|
||||
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
|
||||
defined in `@nexusai/shared` — not hardcoded here.
|
||||
|
||||
`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
|
||||
collection that doesn't already exist at startup — all three collections are
|
||||
guaranteed to exist before any requests are handled, avoiding race conditions
|
||||
between the first entity embed and an entity search.
|
||||
|
||||
Each collection exposes upsert, search (with optional Qdrant filter), and
|
||||
delete operations. The `wait: true` flag is used on all writes.
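
The startup behaviour described above could look roughly like the sketch below. The constants are stand-ins for the values that the text says live in `@nexusai/shared`, and `client` is assumed to expose `getCollections()` and `createCollection()` as the Qdrant REST client does; the real `initCollections()` may differ.

```js
// Stand-ins for the shared constants (assumed names and shapes).
const COLLECTIONS = { EPISODES: 'episodes', ENTITIES: 'entities', SUMMARIES: 'summaries' };
const VECTOR_SIZE = 768;
const DISTANCE = 'Cosine';

// Create every collection that does not exist yet; return the names created.
async function initCollections(client) {
  const { collections } = await client.getCollections();
  const existing = new Set(collections.map((c) => c.name));

  const created = [];
  for (const name of Object.values(COLLECTIONS)) {
    if (existing.has(name)) continue; // already present, nothing to do
    await client.createCollection(name, {
      vectors: { size: VECTOR_SIZE, distance: DISTANCE },
    });
    created.push(name);
  }
  return created;
}
```

Awaiting this function before the HTTP server starts listening is what would give the "exists before any request" guarantee.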
|
||||
|
||||
## Embedding Write Path
|
||||
|
||||
When a new episode is created:
|
||||
|
||||
1. Episode saved to SQLite synchronously; the response is returned immediately
2. User message + AI response combined: `User: ...\nAssistant: ...`
3. Text sent to embedding service (`POST /embed`)
4. Vector upserted into `episodes` Qdrant collection with payload `{ sessionId, createdAt }`
|
||||
|
||||
Steps 2-4 are **fire-and-forget**: if embedding fails, the episode is still
saved and searchable via FTS. The error is logged but not surfaced to the caller.
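
The write path can be sketched as follows. All names on `deps` (`saveEpisode`, `embed`, `upsertVector`, `logError`) are hypothetical placeholders for the SQLite write, the embedding call, the Qdrant upsert, and the logger.

```js
// Sketch of a fire-and-forget write path (illustrative, not the real code).
function createEpisode(deps, sessionId, userMessage, aiResponse) {
  // 1. Synchronous SQLite write; the episode is durable from here on.
  const episode = deps.saveEpisode(sessionId, userMessage, aiResponse);

  // 2. Combine both sides of the exchange into one text to embed.
  const text = `User: ${userMessage}\nAssistant: ${aiResponse}`;

  // 3-4. Embed and upsert in the background. The promise is deliberately
  // not awaited, and failures are logged instead of thrown.
  Promise.resolve()
    .then(() => deps.embed(text))
    .then((vector) =>
      deps.upsertVector(episode.id, vector, { sessionId, createdAt: episode.createdAt })
    )
    .catch((err) => deps.logError('episode embedding failed', err));

  return episode;
}
```

Starting the chain with `Promise.resolve()` means that even a synchronous throw inside `embed` ends up in the `.catch` handler rather than in the caller.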
|
||||
|
||||
> The Qdrant payload stores `sessionId` (the internal integer ID). See
|
||||
> `memory-isolation.md` for how project-level filtering works.
|
||||
|
||||
## Entity Layer
|
||||
|
||||
Entities and relationships use upsert semantics with composite unique
|
||||
constraints to prevent duplicates:
|
||||
|
||||
- `UNIQUE(name, type)` on entities
- `UNIQUE(from_id, to_id, label)` on relationships
- `ON DELETE CASCADE` on relationship foreign keys
|
||||
|
||||
After each episode is saved, `extraction.js` automatically extracts named
|
||||
entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
|
||||
|
||||
> For full details on the extraction pipeline, prompt format, constrained
|
||||
> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
|
||||
|
||||
## Summaries Layer
|
||||
|
||||
Session summaries are generated by `orchestration-service/src/services/summarization.js`
|
||||
after each episode write and stored here via `POST /summaries`. The memory
|
||||
service is responsible only for CRUD — generation logic lives in orchestration.
|
||||
|
||||
> For full details on trigger conditions, prompt format, cumulative updates,
|
||||
> and ChatML token stripping, see `summarization.md`.
|
||||
|
||||
## Project Delete Behaviour
|
||||
|
||||
Deleting a project runs as a transaction — it first nulls out `project_id`
|
||||
on all assigned sessions, then deletes the project. This avoids a foreign
|
||||
key constraint failure since `sessions.project_id` has no `ON DELETE` rule:
|
||||
|
||||
```js
const doDelete = db.transaction(() => {
  // Detach sessions first so the project row has no remaining references.
  db.prepare(`UPDATE sessions SET project_id = NULL WHERE project_id = ?`).run(id);
  db.prepare(`DELETE FROM projects WHERE id = ?`).run(id);
});
doDelete(); // run both statements atomically
```
|
||||
|
||||
For all HTTP endpoints, see `api-routes.md`.
|
||||
@@ -38,7 +38,8 @@ src/
|
||||
├── db/
|
||||
│ ├── index.js # SQLite connection + initialization + migrations
|
||||
│ ├── schema.js # Table definitions, indexes, FTS5, triggers
|
||||
│ └── projects.js # Project CRUD functions
|
||||
│ ├── projects.js # Project CRUD functions
|
||||
│ └── summaries.js # Summary CRUD functions
|
||||
├── episodic/
|
||||
│ └── index.js # Session + episode CRUD, FTS search, embedding write path
|
||||
├── semantic/
|
||||
@@ -51,7 +52,7 @@ src/
|
||||
|
||||
## SQLite Schema
|
||||
|
||||
Six core tables:
|
||||
Seven core tables:
|
||||
|
||||
- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
|
||||
- **episodes** — individual exchanges (user message + AI response) tied to a session
|
||||
@@ -100,12 +101,9 @@ that weren't touched.
|
||||
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
|
||||
```
|
||||
|
||||
This means saving just `{ notes: "..." }` or `{ system_prompt: "..." }` won't
|
||||
touch any other field.
|
||||
|
||||
## Qdrant / Semantic Layer
|
||||
|
||||
Three Qdrant collections are initialized on service startup:
|
||||
Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
|
||||
|
||||
| Collection | Purpose |
|
||||
|---|---|
|
||||
@@ -117,9 +115,13 @@ All collections use **768-dimension vectors** with **Cosine similarity**,
|
||||
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
|
||||
defined in `@nexusai/shared` — not hardcoded here.
|
||||
|
||||
Each collection exposes three operations in `src/semantic/index.js`:
|
||||
upsert, search (with optional Qdrant filter), and delete. The `wait: true`
|
||||
flag is used on all writes.
|
||||
`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
|
||||
collection that doesn't already exist at startup — all three collections are
|
||||
guaranteed to exist before any requests are handled, avoiding race conditions
|
||||
between the first entity embed and an entity search.
|
||||
|
||||
Each collection exposes upsert, search (with optional Qdrant filter), and
|
||||
delete operations. The `wait: true` flag is used on all writes.
|
||||
|
||||
## Embedding Write Path
|
||||
|
||||
@@ -133,8 +135,7 @@ When a new episode is created:
|
||||
This step is **fire-and-forget** — if embedding fails, the episode is still
|
||||
saved and searchable via FTS. The error is logged but not surfaced.
|
||||
|
||||
> The Qdrant payload stores `sessionId` (the internal integer ID). This is
|
||||
> used for per-session and per-project filtering during semantic search. See
|
||||
> The Qdrant payload stores `sessionId` (the internal integer ID). See
|
||||
> `memory-isolation.md` for how project-level filtering works.
|
||||
|
||||
## Entity Layer
|
||||
@@ -146,34 +147,20 @@ constraints to prevent duplicates:
|
||||
- `UNIQUE(from_id, to_id, label)` on relationships
|
||||
- `ON DELETE CASCADE` on relationship foreign keys
|
||||
|
||||
### Automatic Entity Extraction
|
||||
|
||||
After each episode is saved, `extraction.js` automatically extracts named
|
||||
entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1).
|
||||
This runs **fire-and-forget** — the episode is already saved and returned
|
||||
before extraction begins.
|
||||
entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
|
||||
|
||||
**Entity types extracted:** `person`, `place`, `project`, `technology`,
|
||||
`concept`, `organization`
|
||||
> For full details on the extraction pipeline, prompt format, constrained
|
||||
> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
|
||||
|
||||
The extraction prompt uses ChatML format (native to qwen2.5) and primes the
|
||||
response by ending with `[` to steer the model directly into JSON array output.
|
||||
A list of already-known entities is injected into the prompt so the model
|
||||
reuses existing `(name, type)` pairs rather than creating duplicates with
|
||||
different types.
|
||||
## Summaries Layer
|
||||
|
||||
After extraction, each entity is:
|
||||
1. Upserted into SQLite via `upsertEntity` — notes are only written if
|
||||
the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents
|
||||
overwriting existing notes with speculative updates)
|
||||
2. Embedded via the embedding service and upserted into the `entities`
|
||||
Qdrant collection with `{ name, type, notes, projectId }` as payload —
|
||||
`projectId` scopes entities to their project for isolated retrieval
|
||||
Session summaries are generated by `orchestration-service/src/services/summarization.js`
|
||||
after each episode write and stored here via `POST /summaries`. The memory
|
||||
service is responsible only for CRUD — generation logic lives in orchestration.
|
||||
|
||||
`extractAndStoreEntities` receives `projectId` from `createEpisode`, which
|
||||
receives it from the episode route, which receives it from orchestration's
|
||||
`createEpisode` call. This ensures entities are tagged with the correct
|
||||
project scope at extraction time.
|
||||
> For full details on trigger conditions, prompt format, cumulative updates,
|
||||
> and ChatML token stripping, see `summarization.md`.
|
||||
|
||||
## Project Delete Behaviour
|
||||
|
||||
|
||||
@@ -30,31 +30,33 @@ or inference services — all traffic flows through orchestration.
|
||||
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
|
||||
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
|
||||
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
|
||||
| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
|
||||
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
|
||||
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |
|
||||
|
||||
## Internal Structure
|
||||
|
||||
```
|
||||
src/
|
||||
├── services/
|
||||
│ ├── memory.js # HTTP client for memory service
|
||||
│ ├── inference.js # HTTP client for inference service
|
||||
│ ├── embedding.js # HTTP client for embedding service
|
||||
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
|
||||
│ ├── memory.js # HTTP client for memory service
|
||||
│ ├── inference.js # HTTP client for inference service
|
||||
│ ├── embedding.js # HTTP client for embedding service
|
||||
│ ├── qdrant.js # HTTP client for Qdrant (direct vector search)
|
||||
│ └── summarization.js # Session summarisation — triggers after each episode
|
||||
├── chat/
|
||||
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
|
||||
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
|
||||
├── config/
|
||||
│ └── settings.js # Settings load/save — reads/writes data/settings.json
|
||||
│ └── settings.js # Settings load/save — reads/writes data/settings.json
|
||||
├── routes/
|
||||
│ ├── chat.js # POST /chat and POST /chat/stream
|
||||
│ ├── sessions.js # Session CRUD proxy
|
||||
│ ├── projects.js # Project CRUD proxy — passes req.body straight through
|
||||
│ ├── episodes.js # Episode list and delete proxy
|
||||
│ ├── settings.js # GET /settings and PATCH /settings
|
||||
│ ├── health.js # GET /health — pings all four services
|
||||
│ └── models.js # GET /models — scans .gguf files live, merges with models.json
|
||||
# GET /models/props — context window + loaded model from llama-server
|
||||
└── index.js # Express app entry point
|
||||
│ ├── chat.js # POST /chat and POST /chat/stream
|
||||
│ ├── sessions.js # Session CRUD proxy
|
||||
│ ├── projects.js # Project CRUD proxy
|
||||
│ ├── episodes.js # Episode list and delete proxy
|
||||
│ ├── summaries.js # GET /summaries/session/:id and /summaries/project/:id
|
||||
│ ├── settings.js # GET /settings and PATCH /settings
|
||||
│ ├── health.js # GET /health/services — pings all four services
|
||||
│ └── models.js # GET /models and GET /models/props
|
||||
└── index.js # Express app entry point
|
||||
```
|
||||
|
||||
The `services/` layer wraps all downstream HTTP calls in named functions.
|
||||
@@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart
|
||||
| `topK` | 40 | Top-K token candidates per step |
|
||||
| `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |
|
||||
|
||||
Defaults are defined in `config/settings.js` and fall back to constants in
|
||||
`@nexusai/shared`. Values saved in `settings.json` take precedence.
|
||||
|
||||
## Chat Pipeline
|
||||
|
||||
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
|
||||
@@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client.
|
||||
### Steps
|
||||
|
||||
1. **Session resolution** — look up session by `externalId`. Auto-create if
|
||||
not found. Clients generate a UUID for new conversations — no pre-creation
|
||||
step needed.
|
||||
not found.
|
||||
|
||||
2. **Project context resolution** — if the session has a `project_id`, fetch
|
||||
the project and all its session IDs. Used to scope semantic search. The
|
||||
project's `system_prompt` is also read at this step if set.
|
||||
|
||||
3. **System prompt resolution** — three-tier hierarchy:
|
||||
- `project.system_prompt` — if the session is in a project and it's set (highest priority)
|
||||
- `project.system_prompt` — highest priority
|
||||
- `settings.systemPrompt` — global setting from `settings.json`
|
||||
- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort)
|
||||
- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)
|
||||
|
||||
4. **Recent episode retrieval** — fetch the most recent episodes for the
|
||||
session (`recentEpisodeLimit`, default 5).
|
||||
4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).
|
||||
|
||||
5. **Semantic search** — embed the user message, query Qdrant for the top
|
||||
most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
|
||||
against recent episodes. Non-critical — if it fails, pipeline continues with
|
||||
recency-only context.
|
||||
5. **Semantic search** — embed user message, query Qdrant for similar past
|
||||
episodes. Deduplicated against recent episodes. Non-critical.
|
||||
|
||||
6. **Entity search** — query the `entities` Qdrant collection filtered by
|
||||
6. **Entity search** — query `entities` Qdrant collection filtered by
|
||||
`projectId`. Non-project sessions receive no entity context. Non-critical.
|
||||
|
||||
7. **Prompt assembly** — combine resolved system prompt, entity context,
|
||||
semantic episodes, recent episodes, and user message.
|
||||
7. **Prompt assembly** — combine system prompt, entity context, semantic
|
||||
episodes, recent episodes, and user message.
|
||||
|
||||
8. **Inference** — send to inference service with settings-derived parameters
|
||||
(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
|
||||
8. **Inference** — send to inference service. `/chat` awaits full response;
|
||||
`/chat/stream` pipes SSE chunks to the client.
|
||||
|
||||
9. **Episode write** — write the exchange back to memory with `projectId`.
|
||||
Fire-and-forget for `/chat`; awaited for `/chat/stream`.
|
||||
9. **Episode write** — write exchange back to memory with `projectId`.
|
||||
|
||||
10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
|
||||
inference call with a naming prompt (max 20 tokens, temperature 0.3) and
|
||||
write the result back as `session.name`. Fully fire-and-forget.
|
||||
10. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` called
|
||||
fire-and-forget. See `summarization.md` for full details.
|
||||
|
||||
11. **Auto-naming** — on first message with no session name, fires a secondary
|
||||
inference call (max 20 tokens, temperature 0.3) to generate a session name.
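
The three-tier system prompt hierarchy in step 3 can be sketched as a small resolver. `DEFAULT_SYSTEM_PROMPT` is a placeholder for `ORCHESTRATION.SYSTEM_PROMPT`, and treating blank strings as "not set" is an assumption of this sketch.

```js
// Placeholder for the hardcoded constant in @nexusai/shared.
const DEFAULT_SYSTEM_PROMPT = 'You are a helpful assistant.';

// Resolve the system prompt: project, then global setting, then constant.
function resolveSystemPrompt(project, settings) {
  const fromProject = project?.system_prompt?.trim();
  if (fromProject) return fromProject; // highest priority

  const fromSettings = settings?.systemPrompt?.trim();
  if (fromSettings) return fromSettings; // global setting from settings.json

  return DEFAULT_SYSTEM_PROMPT; // last resort
}
```

A session outside any project would pass `null` as `project` and fall through to the second or third tier.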
|
||||
|
||||
### Prompt Structure
|
||||
|
||||
@@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client.
|
||||
|
||||
Here is what you know about entities relevant to this conversation:
|
||||
- {name} ({type}): {notes}
|
||||
... (up to 5 entity results)
|
||||
---
|
||||
Here are some relevant memories from earlier conversations:
|
||||
User: {past user message}
|
||||
Assistant: {past ai response}
|
||||
... (up to semanticLimit semantic episodes)
|
||||
---
|
||||
Here are some relevant memories from your past conversations:
|
||||
User: {past user message}
|
||||
Assistant: {past ai response}
|
||||
... (up to recentEpisodeLimit recent episodes)
|
||||
--- End of recent memories ---
|
||||
|
||||
User: {current message}
|
||||
Assistant:
|
||||
```
|
||||
|
||||
Entity context appears first — before episodic memory — because structured
|
||||
facts about known entities are the most stable and reliable context. Semantic
|
||||
episodes follow, then recent episodes as the immediate conversation flow.
|
||||
## Summarisation
|
||||
|
||||
After each episode write, `triggerSummary` is called fire-and-forget. It
|
||||
checks token thresholds and episode counts before generating, then stores
|
||||
the result in the memory service.
|
||||
|
||||
> For full details on trigger conditions, prompt format, cumulative updates,
|
||||
> ChatML token stripping, and episode range tracking, see `summarization.md`.
|
||||
|
||||
## SSE Stream Format
|
||||
|
||||
@@ -168,46 +165,36 @@ data: {"text":"Hello"}
|
||||
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
|
||||
```
|
||||
|
||||
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
|
||||
is terminated by `res.end()` after the done event.
|
||||
The `[DONE]` sentinel is consumed internally and not forwarded.
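
On the client side, the two event shapes shown above could be handled with a parser along these lines. `parseSseLine` is a hypothetical helper; the `[DONE]` check is purely defensive, since the text says that sentinel is not forwarded.

```js
// Parse one line of the /chat/stream response.
// Returns null for lines that carry no JSON payload.
function parseSseLine(line) {
  if (!line.startsWith('data:')) return null;

  const payload = line.slice('data:'.length).trim();
  if (payload === '' || payload === '[DONE]') return null;

  const event = JSON.parse(payload);
  return event.done
    ? { type: 'done', model: event.model, tokenCount: event.tokenCount }
    : { type: 'text', text: event.text };
}
```

A consumer would append `text` events to the visible reply and stop reading once a `done` event arrives.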
|
||||
|
||||
## Models Route
|
||||
|
||||
`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
|
||||
(read from settings). Merges results with a `models.json` file in the same
|
||||
folder for richer metadata (label, description). Returns file size in GB.
|
||||
`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
|
||||
with `models.json` for metadata. Returns file size in GB.
|
||||
|
||||
`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
|
||||
Returns `{ contextWindow, modelAlias }`. `n_ctx` is at
|
||||
`data.default_generation_settings.n_ctx` in the llama-server response.
|
||||
Returns `503` if llama-server is unreachable.
|
||||
`GET /models/props` fetches directly from llama-server. Returns
|
||||
`{ contextWindow, modelAlias }`. Returns `503` if unreachable.
|
||||
|
||||
## Sessions Route Behaviour
|
||||
|
||||
`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
|
||||
The validation guard only rejects requests where neither is provided:
|
||||
|
||||
```js
|
||||
if (!name?.trim() && projectId === undefined) {
|
||||
return res.status(400).json({ error: 'name or projectId is required' });
|
||||
}
|
||||
```
|
||||
|
||||
This allows `useChat` to write project assignment separately from rename
|
||||
operations.
|
||||
`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both.
|
||||
Rejects only when neither is provided — allows `useChat` to write project
|
||||
assignment separately from rename operations.
|
||||
|
||||
## Caddy Configuration
|
||||
|
||||
Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
|
||||
Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
|
||||
**Any new top-level route must be added here AND in `vite.config.js`.**
|
||||
|
||||
```
|
||||
handle /chat* { reverse_proxy localhost:4000 }
|
||||
handle /sessions* { reverse_proxy localhost:4000 }
|
||||
handle /models* { reverse_proxy localhost:4000 }
|
||||
handle /projects* { reverse_proxy localhost:4000 }
|
||||
handle /episodes* { reverse_proxy localhost:4000 }
|
||||
handle /settings* { reverse_proxy localhost:4000 }
|
||||
handle /health* { reverse_proxy localhost:4000 }
|
||||
handle /chat* { reverse_proxy localhost:4000 }
|
||||
handle /sessions* { reverse_proxy localhost:4000 }
|
||||
handle /models* { reverse_proxy localhost:4000 }
|
||||
handle /projects* { reverse_proxy localhost:4000 }
|
||||
handle /episodes* { reverse_proxy localhost:4000 }
|
||||
handle /settings* { reverse_proxy localhost:4000 }
|
||||
handle /summaries* { reverse_proxy localhost:4000 }
|
||||
handle /health* { reverse_proxy localhost:4000 }
|
||||
```
|
||||
|
||||
After updating: `caddy reload --config /path/to/Caddyfile`
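
For the Vite side, one way to keep the dev proxy aligned with the Caddy block is to derive it from a single list of prefixes. This is a sketch: the target URL mirrors the Caddy upstream and is an assumption about the dev setup, and `buildDevProxy` is a hypothetical helper.

```js
// Route prefixes that Caddy forwards to orchestration (from the block above).
const ROUTE_PREFIXES = [
  '/chat', '/sessions', '/models', '/projects',
  '/episodes', '/settings', '/summaries', '/health',
];

// Assumed dev target; adjust to wherever orchestration runs locally.
const ORCHESTRATION_TARGET = 'http://localhost:4000';

// Build an object in the shape Vite expects under `server.proxy`.
function buildDevProxy(prefixes, target) {
  return Object.fromEntries(
    prefixes.map((prefix) => [prefix, { target, changeOrigin: true }])
  );
}

const proxy = buildDevProxy(ROUTE_PREFIXES, ORCHESTRATION_TARGET);
```

In `vite.config.js` the resulting object would be placed at `server.proxy`, so adding a new top-level route only requires one new entry in the list plus the matching Caddy `handle` line.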
|
||||
|
||||
@@ -165,10 +165,16 @@ Orchestration pipeline defaults. Used as fallback values in
|
||||
| `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
|
||||
| `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
|
||||
| `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
|
||||
| `ENTITIES_LIMIT` | `5` | Max entity search results to inject into prompt |
|
||||
| `ENTITIES_THRESHOLD` | `0.55` | Minimum similarity score for entity results |
|
||||
| `TEMPERATURE` | `0.7` | Default inference temperature |
|
||||
| `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
|
||||
| `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
|
||||
|
||||
> `ENTITIES_THRESHOLD` is set to `0.55` — lower than `SCORE_THRESHOLD` because
|
||||
> entity notes generated by a 3B model tend to embed with lower cosine similarity
|
||||
> than full episode text. Tune upward if irrelevant entities appear in context.
|
||||
|
||||
> `repeatPenalty`, `topP`, and `topK` defaults are sourced from
|
||||
> `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
|
||||
> since those constants already define the canonical values.
|
||||
@@ -178,6 +184,25 @@ Default system prompt:
|
||||
> of past conversations with the user. Use them to provide consistent,
|
||||
> personalised responses."
|
||||
|
||||
#### `SUMMARIES`
|
||||
|
||||
Controls the automatic session summarisation system in `orchestration-service/src/services/summarization.js`.
|
||||
|
||||
| Key | Value | Description |
|---|---|---|
| `THRESHOLD_TOKENS` | `200` | Minimum total session tokens before summarisation is considered |
| `MAX_SUMMARY_TOKENS` | `800` | If the existing summary exceeds this length (measured in characters, despite the name), create a new row instead of updating |
| `MIN_EPISODES_SINCE` | `5` | Minimum new episodes since last summary before re-summarising |
|
||||
|
||||
These can be overridden per-deployment via environment variables in the
|
||||
orchestration service `.env`:
|
||||
|
||||
```
SUMMARY_THRESHOLD_TOKENS=200
SUMMARY_MAX_TOKENS=800
SUMMARY_MIN_EPISODES=5
```
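
A sketch of how these overrides might be read with safe fallbacks. `loadSummaryConfig` and `readPositiveInt` are hypothetical names; the real loading code in `@nexusai/shared` is not shown in this document.

```js
// Defaults from the SUMMARIES table above.
const SUMMARY_DEFAULTS = { THRESHOLD_TOKENS: 200, MAX_SUMMARY_TOKENS: 800, MIN_EPISODES_SINCE: 5 };

// Parse a positive integer, falling back when the value is missing or invalid.
function readPositiveInt(raw, fallback) {
  const parsed = Number.parseInt(raw ?? '', 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Build the effective config from an env-like object (e.g. process.env).
function loadSummaryConfig(env) {
  return {
    THRESHOLD_TOKENS: readPositiveInt(env.SUMMARY_THRESHOLD_TOKENS, SUMMARY_DEFAULTS.THRESHOLD_TOKENS),
    MAX_SUMMARY_TOKENS: readPositiveInt(env.SUMMARY_MAX_TOKENS, SUMMARY_DEFAULTS.MAX_SUMMARY_TOKENS),
    MIN_EPISODES_SINCE: readPositiveInt(env.SUMMARY_MIN_EPISODES, SUMMARY_DEFAULTS.MIN_EPISODES_SINCE),
  };
}
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the helper easy to test.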
|
||||
|
||||
#### `SQLITE`
|
||||
|
||||
| Key | Value | Description |
|
||||
|
||||
201
docs/services/summarization.md
Normal file
@@ -0,0 +1,201 @@
|
||||
# Summarization
|
||||
|
||||
Session summarization generates rolling plain-text summaries of conversation
|
||||
history, giving the model a condensed view of past context without consuming
|
||||
the full context window with raw episodes.
|
||||
|
||||
**Location:** `packages/orchestration-service/src/services/summarization.js`
|
||||
**Triggered by:** `chat/index.js` after every episode write (fire-and-forget)
|
||||
**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)
|
||||
|
||||
---
|
||||
|
||||
## Trigger Conditions
|
||||
|
||||
`triggerSummary(session, allEpisodes)` calls `maybeSummarize` fire-and-forget.
|
||||
`maybeSummarize` proceeds only when both conditions are met:
|
||||
|
||||
1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have accumulated since the last summary
|
||||
|
||||
The token threshold is intentionally low — it ensures summaries start
|
||||
generating early in a session's life rather than only after very long
|
||||
conversations.
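
The two conditions reduce to a single predicate. The sketch below assumes "exceeds" means strictly greater than and "at least" means greater than or equal; `shouldSummarize` is a hypothetical name, not the body of `maybeSummarize`.

```js
// Decide whether a summarisation pass should run.
// cfg is expected to carry THRESHOLD_TOKENS and MIN_EPISODES_SINCE.
function shouldSummarize(totalTokens, episodesSinceLastSummary, cfg) {
  const enoughTokens = totalTokens > cfg.THRESHOLD_TOKENS;
  const enoughNewEpisodes = episodesSinceLastSummary >= cfg.MIN_EPISODES_SINCE;
  return enoughTokens && enoughNewEpisodes;
}
```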
|
||||
|
||||
---
|
||||
|
||||
## Summary Rows and Cumulative Updates
|
||||
|
||||
Each session can have multiple summary rows in the `summaries` table.
|
||||
The update strategy depends on the size of the most recent summary:
|
||||
|
||||
| Condition | Action |
|---|---|
| No existing summary | Generate fresh summary from all episodes |
| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |
|
||||
|
||||
This produces a chain of summary rows over time. Each row's `episode_range`
|
||||
covers only the episodes summarised in that specific pass (e.g. `259-263`),
|
||||
not all episodes in the session.
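
The decision table could be expressed roughly as follows. This is a sketch under assumptions: `chooseSummaryAction` is a hypothetical name, the limit is hard-coded to the documented default of 800, and treating a summary of exactly 800 tokens as "over" is a guess, since the table does not specify the boundary.

```js
// Hypothetical constant; the documented default for SUMMARY_MAX_TOKENS is 800.
const MAX_SUMMARY_TOKENS = 800;

// Decide how the next pass should treat the latest summary row.
// `latest` is the most recent summary row (or null); `token_count` and
// `content` follow the summary object shape returned by the API.
function chooseSummaryAction(latest) {
  if (!latest) {
    return { action: 'create', existingSummary: null };
  }
  if (latest.token_count < MAX_SUMMARY_TOKENS) {
    // Small enough: fold the new episodes into the existing row.
    return { action: 'update', existingSummary: latest.content };
  }
  // Latest row is full: start a new row and summarise from scratch.
  return { action: 'create', existingSummary: null };
}
```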

---

## Ollama Request

```js
{
  model: EXTRACTION_MODEL, // qwen2.5:3b (set via EXTRACTION_MODEL env var)
  prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
  stream: false,
  // No format: 'json' here; summaries need free-text output
  options: {
    temperature: 0.2,
    num_predict: 500,
  },
}
```

`temperature: 0.2` is slightly higher than extraction (0.1), since summaries
benefit from some fluency. `num_predict: 500` leaves room for five thorough
sentences while capping runaway output.
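
For orientation, a request body like this is typically sent to Ollama's `/api/generate` endpoint, whose non-streaming reply carries the generated text in a `response` field. The wrapper below is a sketch, not the service's code: the function name `generateSummary`, the default URL and model, and the injectable `fetchImpl` parameter are assumptions chosen to make the example testable.

```js
// Sketch: POST a summarisation prompt to Ollama and return the raw text.
// `baseUrl`, `model`, and `fetchImpl` are hypothetical parameters.
async function generateSummary(prompt, options = {}) {
  const {
    baseUrl = 'http://localhost:11434',
    model = 'qwen2.5:3b',
    fetchImpl = fetch, // global fetch (Node 18+) unless a stub is supplied
  } = options;

  const res = await fetchImpl(`${baseUrl}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      prompt,
      stream: false,
      options: { temperature: 0.2, num_predict: 500 },
    }),
  });
  if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);

  const data = await res.json();
  return data.response?.trim() ?? '';
}
```

Because the call is fire-and-forget, the caller would normally attach a `.catch()` so that a failed request is logged rather than left as an unhandled rejection.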

---

## Prompt Format

ChatML format, native to qwen2.5:

```
<|im_start|>user
Summarize the conversation below in 3-5 sentences.
Write in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the summary text.

Conversation:
{context}
<|im_end|>
<|im_start|>assistant
```

For cumulative updates, the instruction and context change:

```
<|im_start|>user
Update the summary below to incorporate the new exchanges.
Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
Do not include greetings, sign-offs, or filler. Output only the updated summary text.

Previous summary:
{existingSummary}

New exchanges:
{context}
<|im_end|>
<|im_start|>assistant
```
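
One way to switch between the two templates is sketched below. It is illustrative only: `buildSummaryPromptSketch` is a hypothetical stand-in that takes already-formatted context text, whereas the real `buildSummaryPrompt` receives episode objects and its internals are not shown in this document.

```js
const IM_START = '<|im_start|>';
const IM_END = '<|im_end|>';

// Sketch: choose the fresh or cumulative template depending on whether
// an existing summary is supplied. `\u2014` is the dash used in the templates.
function buildSummaryPromptSketch(context, existingSummary = null) {
  const lines = existingSummary
    ? [
        'Update the summary below to incorporate the new exchanges.',
        'Write 3-5 sentences in third person. Do not quote directly \u2014 paraphrase only.',
        'Do not include greetings, sign-offs, or filler. Output only the updated summary text.',
        '',
        'Previous summary:',
        existingSummary,
        '',
        'New exchanges:',
        context,
      ]
    : [
        'Summarize the conversation below in 3-5 sentences.',
        'Write in third person. Do not quote directly \u2014 paraphrase only.',
        'Do not include greetings, sign-offs, or filler. Output only the summary text.',
        '',
        'Conversation:',
        context,
      ];
  // The trailing assistant header leaves the turn open for the model to complete.
  return [`${IM_START}user`, ...lines, IM_END, `${IM_START}assistant`, ''].join('\n');
}
```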

### Input truncation

Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the
most recent exchanges (sliced from the end). This keeps Qwen focused and
prevents the prompt from exceeding its effective context window.
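
Slicing from the end can be done with a negative index. The helper below is a minimal sketch; the name `truncateContext` is an assumption, and note that a character-based cut like this may start mid-message.

```js
const MAX_CHARS = 3000;

// Keep only the last `maxChars` characters so the most recent exchanges survive.
function truncateContext(context, maxChars = MAX_CHARS) {
  return context.length > maxChars ? context.slice(-maxChars) : context;
}
```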

---

## ChatML Token Stripping

Qwen occasionally echoes ChatML tokens back into its response. The raw output
is cleaned before saving:

```js
const raw = data.response?.trim() ?? '';
const content = raw
  .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
  .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
  .trim();
return content;
```

Without this, leaked tokens get stored in the summary and then injected
back into the next summarisation prompt, causing the model to append a new
summary after the old one rather than replacing it.
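
To see what the two passes do, the same regular expressions can be wrapped in a function and run on a made-up leaked response. The function name `stripChatMl` and the sample string are illustrative assumptions.

```js
// Same two-pass cleanup as above, wrapped for reuse.
function stripChatMl(raw) {
  return raw
    // Pass 1: drop whole echoed blocks, start token through end token.
    .replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
    // Pass 2: drop any stray tokens left on their own.
    .replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
    .trim();
}

// Hypothetical model output with a leaked end token and an echoed user turn.
const leaked =
  'The user asked about caching.<|im_end|>\n<|im_start|>user\nSummarize...<|im_end|>';

console.log(stripChatMl(leaked)); // prints "The user asked about caching."
```

The order matters: removing complete blocks first prevents the echoed turn's text from surviving once its delimiters are stripped.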

---

## Episode Range Tracking

Each summary row stores `episode_range` as `"firstId-lastId"` covering only
the episodes summarised in that pass:

```js
const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a, b) => a - b);
const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
```

This makes SummaryView cards meaningful: "Episodes 259-263" tells you
exactly which exchanges that summary covers, rather than always showing
the full session range.
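
A consumer such as a summary card may want the range back as numbers. The pair of helpers below is a sketch: both function names are hypothetical, and the empty-list guard is an addition for the example rather than documented behaviour.

```js
// Build the "firstId-lastId" string; returns null for an empty pass
// (the guard is an assumption added for this sketch).
function formatEpisodeRange(episodes) {
  if (episodes.length === 0) return null;
  const ids = episodes.map((ep) => ep.id).sort((a, b) => a - b);
  return `${ids[0]}-${ids[ids.length - 1]}`;
}

// Parse a stored range back into numbers, e.g. for a card label.
function parseEpisodeRange(range) {
  const match = /^(\d+)-(\d+)$/.exec(range);
  return match ? { first: Number(match[1]), last: Number(match[2]) } : null;
}
```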

---

## Summary Storage

Summaries are written directly to the memory service from orchestration:

```js
// Create new row
await fetch(`${MEMORY_URL}/summaries`, {
  method: 'POST',
  body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
});

// Update existing row
await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
  method: 'PATCH',
  body: JSON.stringify({ content, tokenCount, episodeRange }),
});
```

`session.id` here is the internal SQLite integer ID, not the external UUID.
It is available directly on the `session` object passed from `chat/index.js`.
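
The two calls can be folded into one helper that also sets a JSON content type and checks the response status. This is a sketch under assumptions: `saveSummary`, its parameter names, the explicit `Content-Type` header, and the error handling are illustrative and are not confirmed by the snippets above.

```js
// Sketch: create a summary row, or patch the latest one when `latestId` is given.
// `memoryUrl`, `latestId`, `payload`, and `fetchImpl` are hypothetical parameters.
async function saveSummary({ memoryUrl, latestId = null, payload, fetchImpl = fetch }) {
  const isCreate = latestId === null;
  const url = isCreate
    ? `${memoryUrl}/summaries`
    : `${memoryUrl}/summaries/${latestId}`;

  const res = await fetchImpl(url, {
    method: isCreate ? 'POST' : 'PATCH',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`Memory service responded with ${res.status}`);
  return res.status;
}
```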

---

## Client-Side Indicator

The chat client shows a "Summarising…" spinner in the `ChatWindow` header
and on the InfoPanel's Session Memory button while summarisation may be
in progress.

Since summarisation is fire-and-forget with no completion signal back to
the client, the indicator is timer-based: it activates when the stream
finishes and clears after 8 seconds.

```js
// In App.jsx, watching the streaming state from useChat:
useEffect(() => {
  if (prevStreaming.current && !streaming) {
    setSummarising(true);
    const t = setTimeout(() => setSummarising(false), 8000);
    return () => clearTimeout(t);
  }
  prevStreaming.current = streaming;
}, [streaming]);
```

---

## Environment Variables

Set in `packages/orchestration-service/src/.env`:

| Variable | Default | Description |
|---|---|---|
| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |
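
Reading these variables with their defaults might look like the sketch below. The function name `loadSummaryConfig` and the shape of the returned object are assumptions; only the variable names and default values come from the table.

```js
// Sketch: resolve the documented variables, falling back to their defaults.
// Pass a plain object in tests; defaults to process.env in Node.
function loadSummaryConfig(env = process.env) {
  // Parse an integer variable, using the fallback when unset or malformed.
  const int = (value, fallback) => {
    const n = Number.parseInt(value ?? '', 10);
    return Number.isNaN(n) ? fallback : n;
  };

  return {
    extractionUrl: env.EXTRACTION_URL ?? 'http://localhost:11434',
    extractionModel: env.EXTRACTION_MODEL ?? 'qwen2.5:3b',
    memoryServiceUrl: env.MEMORY_SERVICE_URL ?? 'http://localhost:3002',
    thresholdTokens: int(env.SUMMARY_THRESHOLD_TOKENS, 200),
    maxSummaryTokens: int(env.SUMMARY_MAX_TOKENS, 800),
    minEpisodesSince: int(env.SUMMARY_MIN_EPISODES, 5),
  };
}
```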