documentation updates for entity extraction and summarization
@@ -120,6 +120,38 @@ all projects use isolated memory. Returns `201` with the created project object.

Only provided fields are updated — omitted fields are not touched.

### Summaries

| Method | Path | Description |
|---|---|---|
| GET | /summaries/session/:sessionId | Get all summaries for a session (by external UUID) |
| GET | /summaries/project/:projectId | Get all summaries for a project |

**GET /summaries/session/:sessionId** — resolves the external UUID to an
internal session ID, then fetches summaries from the memory service.
Returns an array of summary objects ordered by `created_at` ascending.

**GET /summaries/project/:projectId** — proxies directly to the memory
service project summaries endpoint.

**Summary object shape:**
```json
{
  "id": 8,
  "session_id": 72,
  "project_id": null,
  "content": "The user asked about...",
  "token_count": 579,
  "episode_range": "246-251",
  "created_at": 1776766518,
  "updated_at": 1776766518
}
```
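
A client consuming these objects will usually want typed fields. The following is a minimal sketch, assuming `created_at` and `updated_at` are Unix timestamps in seconds (the sample value suggests this) and that `episode_range` is a `start-end` pair of episode IDs. `normalizeSummary` and `latestSummary` are hypothetical helpers, not part of the API.

```js
// Hypothetical client-side helpers; not part of the orchestration API.
// Assumes timestamps are Unix seconds and episode_range is "start-end".
function normalizeSummary(raw) {
  const [start, end] = String(raw.episode_range ?? '')
    .split('-')
    .map((part) => Number.parseInt(part, 10));

  return {
    id: raw.id,
    sessionId: raw.session_id,
    projectId: raw.project_id,
    content: raw.content,
    tokenCount: raw.token_count,
    // null when the range is missing or malformed
    episodeRange: Number.isInteger(start) && Number.isInteger(end) ? { start, end } : null,
    createdAt: new Date(raw.created_at * 1000),
    updatedAt: new Date(raw.updated_at * 1000),
  };
}

// The array is ordered by created_at ascending, so the last item is the newest.
function latestSummary(summaries) {
  return summaries.length > 0 ? normalizeSummary(summaries[summaries.length - 1]) : null;
}
```

Because the endpoint returns summaries oldest first, taking the last element is enough to find the most recent one without re-sorting.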

> **Proxy requirement:** `/summaries` must be added to both the Caddyfile
> reverse proxy and the Vite dev proxy config alongside the other route
> prefixes. See `orchestration-service.md` for the Caddy block pattern.

### Models

| Method | Path | Description |
@@ -269,6 +301,29 @@ Both fields are optional. Only provided fields are updated.

Same request/response shape as orchestration `/projects` above.

### Summaries

| Method | Path | Description |
|---|---|---|
| POST | /summaries | Create a new summary |
| GET | /sessions/:id/summaries | Get all summaries for a session (internal ID) |
| GET | /projects/:id/summaries | Get all summaries for a project |
| PATCH | /summaries/:id | Update a summary (content, tokenCount, episodeRange) |
| DELETE | /summaries/:id | Delete a summary |

**POST /summaries — body:**
```json
{
  "sessionId": 72,
  "content": "The user discussed...",
  "tokenCount": 579,
  "episodeRange": "246-251"
}
```
`content` is required. Either `sessionId` or `projectId` is required.
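
The two rules above can be expressed as a small guard. This is a sketch only: `validateSummaryBody` is a hypothetical name, the error strings are invented for illustration, and treating a whitespace-only `content` as missing is an assumption rather than documented behaviour.

```js
// Hypothetical validator mirroring the documented rules for POST /summaries.
// Returns an error message, or null when the body is acceptable.
function validateSummaryBody(body) {
  if (typeof body?.content !== 'string' || body.content.trim() === '') {
    return 'content is required';
  }
  // `== null` matches both undefined and null
  if (body.sessionId == null && body.projectId == null) {
    return 'sessionId or projectId is required';
  }
  return null;
}
```

A route handler could return `400` with the message whenever the result is not `null`.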

**PATCH /summaries/:id — body:** any subset of `content`, `tokenCount`, `episodeRange`.

### Entities

| Method | Path | Description |
178
docs/services/entity-extraction.md
Normal file
@@ -0,0 +1,178 @@
# Memory Service

**Package:** `@nexusai/memory-service`
**Location:** `packages/memory-service`
**Deployed on:** Mini PC 1 (192.168.0.81)
**Port:** 3002

## Purpose

Responsible for all reading and writing of long-term memory. Acts as the
sole interface to both SQLite and Qdrant — no other service accesses these
stores directly. On episode creation, automatically calls the embedding
service to generate and store a vector in Qdrant.

## Dependencies

- `express` — HTTP API
- `better-sqlite3` — SQLite driver
- `@qdrant/js-client-rest` — Qdrant vector store client
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities and constants

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 3002 | Port to listen on |
| SQLITE_PATH | Yes | — | Path to SQLite database file |
| QDRANT_URL | No | http://localhost:6333 | Qdrant instance URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for entity extraction |
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for entity extraction |

## Internal Structure

```
src/
├── db/
│   ├── index.js        # SQLite connection + initialization + migrations
│   ├── schema.js       # Table definitions, indexes, FTS5, triggers
│   ├── projects.js     # Project CRUD functions
│   └── summaries.js    # Summary CRUD functions
├── episodic/
│   └── index.js        # Session + episode CRUD, FTS search, embedding write path
├── semantic/
│   └── index.js        # Qdrant collection management, upsert, search, delete
├── entities/
│   ├── index.js        # Entity + relationship CRUD
│   └── extraction.js   # Automatic entity extraction via qwen2.5:3b on Ollama
└── index.js            # Express app + all route definitions
```

## SQLite Schema

Seven core tables:

- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
- **episodes** — individual exchanges (user message + AI response) tied to a session
- **entities** — named things the system learns about (people, places, concepts)
- **relationships** — directional labeled links between entities
- **summaries** — condensed episode groups for efficient context retrieval
- **projects** — named groupings of sessions with `name`, `description`, `colour`, `icon`, `isolated`, `notes`, `system_prompt`

### Migrations

Schema changes that cannot use `CREATE TABLE IF NOT EXISTS` are applied as
idempotent migrations in `db/index.js` at startup:

```js
try { db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`); } catch {}
try { db.exec(`ALTER TABLE sessions ADD COLUMN project_id INTEGER REFERENCES projects(id)`); } catch {}
try { db.exec(`CREATE INDEX IF NOT EXISTS idx_sessions_project ON sessions(project_id)`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN isolated INTEGER NOT NULL DEFAULT 0`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN notes TEXT`); } catch {}
try { db.exec(`ALTER TABLE projects ADD COLUMN system_prompt TEXT`); } catch {}
```

New migrations are always appended here — never modify the schema file for
existing tables since `ALTER TABLE` cannot use `IF NOT EXISTS`.
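
An empty `catch {}` also hides unrelated failures, such as a typo in a new migration. One possible refinement, shown here as a sketch and not as what `db/index.js` does, is to swallow only the error SQLite raises for a column that already exists. `runMigrations` is a hypothetical helper, and `db` stands for any object with an `exec(sql)` method, such as a `better-sqlite3` database.

```js
// Sketch: run ALTER TABLE migrations, ignoring only "already applied" errors.
// `db` is anything with an exec(sql) method, such as a better-sqlite3 Database.
function runMigrations(db, statements) {
  const applied = [];
  for (const sql of statements) {
    try {
      db.exec(sql);
      applied.push(sql);
    } catch (err) {
      // SQLite reports a repeated ADD COLUMN as "duplicate column name: <col>".
      if (!/duplicate column name/i.test(String(err && err.message))) {
        throw err; // a real failure: surface it instead of hiding it
      }
    }
  }
  return applied; // statements that actually changed the schema on this run
}
```

The return value lists the statements that ran successfully, which can be logged at startup to show which migrations were new.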

### FTS5 Full-Text Search

An `episodes_fts` virtual table enables keyword search across all episodes.
Three triggers (`episodes_fts_insert`, `episodes_fts_update`, `episodes_fts_delete`)
keep the FTS index automatically in sync with the episodes table.

### SQLite Configuration

- `journal_mode = WAL` — non-blocking reads during writes
- `foreign_keys = ON` — enforces referential integrity and cascade deletes
- PRAGMAs set via `db.pragma()`, not `db.exec()`

### Dynamic Updates

Both `updateSession` and `updateProject` build their `SET` clause dynamically
from only the fields passed — prevents partial updates from overwriting fields
that weren't touched.

`updateProject` allowlist:
```js
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
```
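
To make the dynamic `SET` clause concrete, here is a minimal sketch of how such a builder might look. `buildUpdate` is a hypothetical name and the real `updateProject` may be structured differently. Column names are taken only from the allowlist, never from the caller's keys, so they are safe to interpolate.

```js
// Sketch of an allowlisted dynamic UPDATE builder; buildUpdate is a hypothetical name.
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];

function buildUpdate(table, id, fields, allowlist) {
  // Keep only allowlisted keys that were actually provided.
  const keys = allowlist.filter((key) => fields[key] !== undefined);
  if (keys.length === 0) return null; // nothing to update

  const setClause = keys.map((key) => `${key} = ?`).join(', ');
  return {
    sql: `UPDATE ${table} SET ${setClause} WHERE id = ?`,
    params: [...keys.map((key) => fields[key]), id],
  };
}
```

With `better-sqlite3`, the result could be executed as `db.prepare(sql).run(...params)`. Passing `{ notes: "..." }` produces a statement that touches only the `notes` column.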

## Qdrant / Semantic Layer

Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:

| Collection | Purpose |
|---|---|
| `episodes` | Embeddings for individual conversation exchanges |
| `entities` | Embeddings for named entities |
| `summaries` | Embeddings for condensed episode summaries |

All collections use **768-dimension vectors** with **Cosine similarity**,
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
defined in `@nexusai/shared` — not hardcoded here.

`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
collection that doesn't already exist at startup — all three collections are
guaranteed to exist before any requests are handled, avoiding race conditions
between the first entity embed and an entity search.

Each collection exposes upsert, search (with optional Qdrant filter), and
delete operations. The `wait: true` flag is used on all writes.
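
The startup check described above might look roughly like the following. This is a sketch under assumptions: the `COLLECTIONS` and `VECTOR_CONFIG` shapes are guesses at what `@nexusai/shared` exports, and `client` is expected to behave like `QdrantClient` from `@qdrant/js-client-rest` (its `getCollections()` and `createCollection()` methods).

```js
// Assumed shape of the shared constants; the real values live in @nexusai/shared.
const COLLECTIONS = { EPISODES: 'episodes', ENTITIES: 'entities', SUMMARIES: 'summaries' };
const VECTOR_CONFIG = { size: 768, distance: 'Cosine' };

// Pure helper: which wanted collections are absent from the existing list?
function missingCollections(existingNames, wanted = Object.values(COLLECTIONS)) {
  const existing = new Set(existingNames);
  return wanted.filter((name) => !existing.has(name));
}

// `client` is expected to behave like QdrantClient from @qdrant/js-client-rest.
async function initCollections(client) {
  const { collections } = await client.getCollections();
  const missing = missingCollections(collections.map((c) => c.name));
  for (const name of missing) {
    await client.createCollection(name, { vectors: VECTOR_CONFIG });
  }
  return missing; // names created on this startup
}
```

Awaiting `initCollections` before the HTTP server starts listening is one way to get the ordering guarantee mentioned above.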

## Embedding Write Path

When a new episode is created:

1. Episode saved to SQLite synchronously — response returned immediately
2. User message + AI response combined: `User: ...\nAssistant: ...`
3. Text sent to embedding service (`POST /embed`)
4. Vector upserted into `episodes` Qdrant collection with payload `{ sessionId, createdAt }`

This step is **fire-and-forget** — if embedding fails, the episode is still
saved and searchable via FTS. The error is logged but not surfaced.
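
The four steps above can be sketched with injected dependencies. All names here (`saveEpisode`, `embed`, `upsertVector`, `log`) are hypothetical stand-ins for the service's real functions. The point is the shape of the fire-and-forget chain, not the actual implementation.

```js
// Sketch of the write path with injected dependencies (all names hypothetical).
// saveEpisode is synchronous; embed and upsertVector return promises.
function createEpisode(input, { saveEpisode, embed, upsertVector, log = console.error }) {
  const episode = saveEpisode(input); // step 1: SQLite write, synchronous

  // step 2: combine both sides of the exchange into one text
  const text = `User: ${input.userMessage}\nAssistant: ${input.aiResponse}`;

  // Steps 3-4 run in the background; the promise is deliberately not awaited.
  Promise.resolve()
    .then(() => embed(text))
    .then((vector) =>
      upsertVector(episode.id, vector, {
        sessionId: episode.sessionId,
        createdAt: episode.createdAt,
      })
    )
    .catch((err) => log('embedding failed:', err.message)); // logged, never rethrown

  return episode; // returned before the embedding finishes
}
```

Starting the chain with `Promise.resolve()` means a synchronous throw inside `embed` is also routed to the `.catch`, so the caller never sees an embedding error.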

> The Qdrant payload stores `sessionId` (the internal integer ID). See
> `memory-isolation.md` for how project-level filtering works.

## Entity Layer

Entities and relationships use upsert semantics with composite unique
constraints to prevent duplicates:

- `UNIQUE(name, type)` on entities
- `UNIQUE(from_id, to_id, label)` on relationships
- `ON DELETE CASCADE` on relationship foreign keys

After each episode is saved, `extraction.js` automatically extracts named
entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.

> For full details on the extraction pipeline, prompt format, constrained
> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.

## Summaries Layer

Session summaries are generated by `orchestration-service/src/services/summarization.js`
after each episode write and stored here via `POST /summaries`. The memory
service is responsible only for CRUD — generation logic lives in orchestration.

> For full details on trigger conditions, prompt format, cumulative updates,
> and ChatML token stripping, see `summarization.md`.

## Project Delete Behaviour

Deleting a project runs as a transaction — it first nulls out `project_id`
on all assigned sessions, then deletes the project. This avoids a foreign
key constraint failure since `sessions.project_id` has no `ON DELETE` rule:

```js
const doDelete = db.transaction(() => {
  db.prepare(`UPDATE sessions SET project_id = NULL WHERE project_id = ?`).run(id);
  db.prepare(`DELETE FROM projects WHERE id = ?`).run(id);
});
```

For all HTTP endpoints, see `api-routes.md`.
@@ -38,7 +38,8 @@ src/
|
|||||||
├── db/
|
├── db/
|
||||||
│ ├── index.js # SQLite connection + initialization + migrations
|
│ ├── index.js # SQLite connection + initialization + migrations
|
||||||
│ ├── schema.js # Table definitions, indexes, FTS5, triggers
|
│ ├── schema.js # Table definitions, indexes, FTS5, triggers
|
||||||
│ └── projects.js # Project CRUD functions
|
│ ├── projects.js # Project CRUD functions
|
||||||
|
│ └── summaries.js # Summary CRUD functions
|
||||||
├── episodic/
|
├── episodic/
|
||||||
│ └── index.js # Session + episode CRUD, FTS search, embedding write path
|
│ └── index.js # Session + episode CRUD, FTS search, embedding write path
|
||||||
├── semantic/
|
├── semantic/
|
||||||
@@ -51,7 +52,7 @@ src/
|
|||||||
|
|
||||||
## SQLite Schema
|
## SQLite Schema
|
||||||
|
|
||||||
Six core tables:
|
Seven core tables:
|
||||||
|
|
||||||
- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
|
- **sessions** — top-level conversation containers. Fields: `external_id`, `name`, `project_id`, `metadata`
|
||||||
- **episodes** — individual exchanges (user message + AI response) tied to a session
|
- **episodes** — individual exchanges (user message + AI response) tied to a session
|
||||||
@@ -100,12 +101,9 @@ that weren't touched.
|
|||||||
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
|
const allowed = ['name', 'description', 'colour', 'icon', 'isolated', 'notes', 'system_prompt'];
|
||||||
```
|
```
|
||||||
|
|
||||||
This means saving just `{ notes: "..." }` or `{ system_prompt: "..." }` won't
|
|
||||||
touch any other field.
|
|
||||||
|
|
||||||
## Qdrant / Semantic Layer
|
## Qdrant / Semantic Layer
|
||||||
|
|
||||||
Three Qdrant collections are initialized on service startup:
|
Three Qdrant collections are initialized on service startup via `semantic.initCollections()`:
|
||||||
|
|
||||||
| Collection | Purpose |
|
| Collection | Purpose |
|
||||||
|---|---|
|
|---|---|
|
||||||
@@ -117,9 +115,13 @@ All collections use **768-dimension vectors** with **Cosine similarity**,
|
|||||||
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
|
matching `nomic-embed-text` via Ollama. Vector size and distance metric are
|
||||||
defined in `@nexusai/shared` — not hardcoded here.
|
defined in `@nexusai/shared` — not hardcoded here.
|
||||||
|
|
||||||
Each collection exposes three operations in `src/semantic/index.js`:
|
`initCollections()` iterates `Object.values(COLLECTIONS)` and creates any
|
||||||
upsert, search (with optional Qdrant filter), and delete. The `wait: true`
|
collection that doesn't already exist at startup — all three collections are
|
||||||
flag is used on all writes.
|
guaranteed to exist before any requests are handled, avoiding race conditions
|
||||||
|
between the first entity embed and an entity search.
|
||||||
|
|
||||||
|
Each collection exposes upsert, search (with optional Qdrant filter), and
|
||||||
|
delete operations. The `wait: true` flag is used on all writes.
|
||||||
|
|
||||||
## Embedding Write Path
|
## Embedding Write Path
|
||||||
|
|
||||||
@@ -133,8 +135,7 @@ When a new episode is created:
|
|||||||
This step is **fire-and-forget** — if embedding fails, the episode is still
|
This step is **fire-and-forget** — if embedding fails, the episode is still
|
||||||
saved and searchable via FTS. The error is logged but not surfaced.
|
saved and searchable via FTS. The error is logged but not surfaced.
|
||||||
|
|
||||||
> The Qdrant payload stores `sessionId` (the internal integer ID). This is
|
> The Qdrant payload stores `sessionId` (the internal integer ID). See
|
||||||
> used for per-session and per-project filtering during semantic search. See
|
|
||||||
> `memory-isolation.md` for how project-level filtering works.
|
> `memory-isolation.md` for how project-level filtering works.
|
||||||
|
|
||||||
## Entity Layer
|
## Entity Layer
|
||||||
@@ -146,34 +147,20 @@ constraints to prevent duplicates:
|
|||||||
- `UNIQUE(from_id, to_id, label)` on relationships
|
- `UNIQUE(from_id, to_id, label)` on relationships
|
||||||
- `ON DELETE CASCADE` on relationship foreign keys
|
- `ON DELETE CASCADE` on relationship foreign keys
|
||||||
|
|
||||||
### Automatic Entity Extraction
|
|
||||||
|
|
||||||
After each episode is saved, `extraction.js` automatically extracts named
|
After each episode is saved, `extraction.js` automatically extracts named
|
||||||
entities from the conversation using `qwen2.5:3b` running on Ollama (Mini PC 1).
|
entities from the conversation using `qwen2.5:3b` on Ollama — fire-and-forget.
|
||||||
This runs **fire-and-forget** — the episode is already saved and returned
|
|
||||||
before extraction begins.
|
|
||||||
|
|
||||||
**Entity types extracted:** `person`, `place`, `project`, `technology`,
|
> For full details on the extraction pipeline, prompt format, constrained
|
||||||
`concept`, `organization`
|
> decoding, stoplist, and Qdrant storage, see `entity-extraction.md`.
|
||||||
|
|
||||||
The extraction prompt uses ChatML format (native to qwen2.5) and primes the
|
## Summaries Layer
|
||||||
response by ending with `[` to steer the model directly into JSON array output.
|
|
||||||
A list of already-known entities is injected into the prompt so the model
|
|
||||||
reuses existing `(name, type)` pairs rather than creating duplicates with
|
|
||||||
different types.
|
|
||||||
|
|
||||||
After extraction, each entity is:
|
Session summaries are generated by `orchestration-service/src/services/summarization.js`
|
||||||
1. Upserted into SQLite via `upsertEntity` — notes are only written if
|
after each episode write and stored here via `POST /summaries`. The memory
|
||||||
the entity is new (`COALESCE(entities.notes, excluded.notes)` prevents
|
service is responsible only for CRUD — generation logic lives in orchestration.
|
||||||
overwriting existing notes with speculative updates)
|
|
||||||
2. Embedded via the embedding service and upserted into the `entities`
|
|
||||||
Qdrant collection with `{ name, type, notes, projectId }` as payload —
|
|
||||||
`projectId` scopes entities to their project for isolated retrieval
|
|
||||||
|
|
||||||
`extractAndStoreEntities` receives `projectId` from `createEpisode`, which
|
> For full details on trigger conditions, prompt format, cumulative updates,
|
||||||
receives it from the episode route, which receives it from orchestration's
|
> and ChatML token stripping, see `summarization.md`.
|
||||||
`createEpisode` call. This ensures entities are tagged with the correct
|
|
||||||
project scope at extraction time.
|
|
||||||
|
|
||||||
## Project Delete Behaviour
|
## Project Delete Behaviour
|
||||||
|
|
||||||
|
|||||||
@@ -30,7 +30,8 @@ or inference services — all traffic flows through orchestration.
|
|||||||
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
|
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
|
||||||
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
|
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
|
||||||
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
|
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
|
||||||
| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
|
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
|
||||||
|
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |
|
||||||
|
|
||||||
## Internal Structure
|
## Internal Structure
|
||||||
|
|
||||||
@@ -40,7 +41,8 @@ src/
|
|||||||
│ ├── memory.js # HTTP client for memory service
|
│ ├── memory.js # HTTP client for memory service
|
||||||
│ ├── inference.js # HTTP client for inference service
|
│ ├── inference.js # HTTP client for inference service
|
||||||
│ ├── embedding.js # HTTP client for embedding service
|
│ ├── embedding.js # HTTP client for embedding service
|
||||||
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
|
│ ├── qdrant.js # HTTP client for Qdrant (direct vector search)
|
||||||
|
│ └── summarization.js # Session summarisation — triggers after each episode
|
||||||
├── chat/
|
├── chat/
|
||||||
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
|
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
|
||||||
├── config/
|
├── config/
|
||||||
@@ -48,12 +50,12 @@ src/
|
|||||||
├── routes/
|
├── routes/
|
||||||
│ ├── chat.js # POST /chat and POST /chat/stream
|
│ ├── chat.js # POST /chat and POST /chat/stream
|
||||||
│ ├── sessions.js # Session CRUD proxy
|
│ ├── sessions.js # Session CRUD proxy
|
||||||
│ ├── projects.js # Project CRUD proxy — passes req.body straight through
|
│ ├── projects.js # Project CRUD proxy
|
||||||
│ ├── episodes.js # Episode list and delete proxy
|
│ ├── episodes.js # Episode list and delete proxy
|
||||||
|
│ ├── summaries.js # GET /summaries/session/:id and /summaries/project/:id
|
||||||
│ ├── settings.js # GET /settings and PATCH /settings
|
│ ├── settings.js # GET /settings and PATCH /settings
|
||||||
│ ├── health.js # GET /health — pings all four services
|
│ ├── health.js # GET /health/services — pings all four services
|
||||||
│ └── models.js # GET /models — scans .gguf files live, merges with models.json
|
│ └── models.js # GET /models and GET /models/props
|
||||||
# GET /models/props — context window + loaded model from llama-server
|
|
||||||
└── index.js # Express app entry point
|
└── index.js # Express app entry point
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart
|
|||||||
| `topK` | 40 | Top-K token candidates per step |
|
| `topK` | 40 | Top-K token candidates per step |
|
||||||
| `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |
|
| `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |
|
||||||
|
|
||||||
Defaults are defined in `config/settings.js` and fall back to constants in
|
|
||||||
`@nexusai/shared`. Values saved in `settings.json` take precedence.
|
|
||||||
|
|
||||||
## Chat Pipeline
|
## Chat Pipeline
|
||||||
|
|
||||||
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
|
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
|
||||||
@@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client.
|
|||||||
### Steps
|
### Steps
|
||||||
|
|
||||||
1. **Session resolution** — look up session by `externalId`. Auto-create if
|
1. **Session resolution** — look up session by `externalId`. Auto-create if
|
||||||
not found. Clients generate a UUID for new conversations — no pre-creation
|
not found.
|
||||||
step needed.
|
|
||||||
|
|
||||||
2. **Project context resolution** — if the session has a `project_id`, fetch
|
2. **Project context resolution** — if the session has a `project_id`, fetch
|
||||||
the project and all its session IDs. Used to scope semantic search. The
|
the project and all its session IDs. Used to scope semantic search. The
|
||||||
project's `system_prompt` is also read at this step if set.
|
project's `system_prompt` is also read at this step if set.
|
||||||
|
|
||||||
3. **System prompt resolution** — three-tier hierarchy:
|
3. **System prompt resolution** — three-tier hierarchy:
|
||||||
- `project.system_prompt` — if the session is in a project and it's set (highest priority)
|
- `project.system_prompt` — highest priority
|
||||||
- `settings.systemPrompt` — global setting from `settings.json`
|
- `settings.systemPrompt` — global setting from `settings.json`
|
||||||
- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort)
|
- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)
|
||||||
|
|
||||||
4. **Recent episode retrieval** — fetch the most recent episodes for the
|
4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).
|
||||||
session (`recentEpisodeLimit`, default 5).
|
|
||||||
|
|
||||||
5. **Semantic search** — embed the user message, query Qdrant for the top
|
5. **Semantic search** — embed user message, query Qdrant for similar past
|
||||||
most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
|
episodes. Deduplicated against recent episodes. Non-critical.
|
||||||
against recent episodes. Non-critical — if it fails, pipeline continues with
|
|
||||||
recency-only context.
|
|
||||||
|
|
||||||
6. **Entity search** — query the `entities` Qdrant collection filtered by
|
6. **Entity search** — query `entities` Qdrant collection filtered by
|
||||||
`projectId`. Non-project sessions receive no entity context. Non-critical.
|
`projectId`. Non-project sessions receive no entity context. Non-critical.
|
||||||
|
|
||||||
7. **Prompt assembly** — combine resolved system prompt, entity context,
|
7. **Prompt assembly** — combine system prompt, entity context, semantic
|
||||||
semantic episodes, recent episodes, and user message.
|
episodes, recent episodes, and user message.
|
||||||
|
|
||||||
8. **Inference** — send to inference service with settings-derived parameters
|
8. **Inference** — send to inference service. `/chat` awaits full response;
|
||||||
(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
|
|
||||||
`/chat/stream` pipes SSE chunks to the client.
|
`/chat/stream` pipes SSE chunks to the client.
|
||||||
|
|
||||||
9. **Episode write** — write the exchange back to memory with `projectId`.
|
9. **Episode write** — write exchange back to memory with `projectId`.
|
||||||
Fire-and-forget for `/chat`; awaited for `/chat/stream`.
|
|
||||||
|
|
||||||
10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
|
10. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` called
|
||||||
inference call with a naming prompt (max 20 tokens, temperature 0.3) and
|
fire-and-forget. See `summarization.md` for full details.
|
||||||
write the result back as `session.name`. Fully fire-and-forget.
|
|
||||||
|
11. **Auto-naming** — on first message with no session name, fires a secondary
|
||||||
|
inference call (max 20 tokens, temperature 0.3) to generate a session name.
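
Steps 3 and 5 of the pipeline reduce to two small pure functions. The following is a minimal sketch: the function names are hypothetical, treating a blank string as "not set" is an assumption, and episodes are assumed to carry a unique `id`.

```js
// Step 3 sketch: first non-empty prompt wins, in priority order.
function resolveSystemPrompt(project, settings, fallback) {
  const candidates = [project?.system_prompt, settings?.systemPrompt, fallback];
  return candidates.find((value) => typeof value === 'string' && value.trim() !== '');
}

// Step 5 sketch: drop semantic hits that already appear among the recent episodes.
function dedupeSemantic(semanticEpisodes, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((episode) => episode.id));
  return semanticEpisodes.filter((episode) => !recentIds.has(episode.id));
}
```

A session outside any project can pass `null` for `project`, in which case resolution falls through to the global setting and then to the constant.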
|
||||||
|
|
||||||
### Prompt Structure
|
### Prompt Structure
|
||||||
|
|
||||||
@@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client.
|
|||||||
|
|
||||||
Here is what you know about entities relevant to this conversation:
|
Here is what you know about entities relevant to this conversation:
|
||||||
- {name} ({type}): {notes}
|
- {name} ({type}): {notes}
|
||||||
... (up to 5 entity results)
|
|
||||||
---
|
---
|
||||||
Here are some relevant memories from earlier conversations:
|
Here are some relevant memories from earlier conversations:
|
||||||
User: {past user message}
|
User: {past user message}
|
||||||
Assistant: {past ai response}
|
Assistant: {past ai response}
|
||||||
... (up to semanticLimit semantic episodes)
|
|
||||||
---
|
---
|
||||||
Here are some relevant memories from your past conversations:
|
Here are some relevant memories from your past conversations:
|
||||||
User: {past user message}
|
User: {past user message}
|
||||||
Assistant: {past ai response}
|
Assistant: {past ai response}
|
||||||
... (up to recentEpisodeLimit recent episodes)
|
|
||||||
--- End of recent memories ---
|
--- End of recent memories ---
|
||||||
|
|
||||||
User: {current message}
|
User: {current message}
|
||||||
Assistant:
|
Assistant:
|
||||||
```
|
```
|
||||||
|
|
||||||
Entity context appears first — before episodic memory — because structured
|
## Summarisation
|
||||||
facts about known entities are the most stable and reliable context. Semantic
|
|
||||||
episodes follow, then recent episodes as the immediate conversation flow.
|
After each episode write, `triggerSummary` is called fire-and-forget. It
|
||||||
|
checks token thresholds and episode counts before generating, then stores
|
||||||
|
the result in the memory service.
|
||||||
|
|
||||||
|
> For full details on trigger conditions, prompt format, cumulative updates,
|
||||||
|
> ChatML token stripping, and episode range tracking, see `summarization.md`.
|
||||||
|
|
||||||
## SSE Stream Format
|
## SSE Stream Format
|
||||||
|
|
||||||
@@ -168,37 +165,26 @@ data: {"text":"Hello"}
|
|||||||
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
|
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
|
||||||
```
|
```
|
||||||
|
|
||||||
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
|
The `[DONE]` sentinel is consumed internally and not forwarded.
|
||||||
is terminated by `res.end()` after the done event.
|
|
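
On the client side, the frames shown above can be decoded with a small buffer-based parser. This is a sketch, assuming each event is a single `data: <json>` line terminated by a blank line and that line endings are `\n`. `parseSseFrames` is a hypothetical helper.

```js
// Sketch of a client-side parser for the documented stream frames.
// Assumes each event is one "data: <json>" line followed by a blank line.
function parseSseFrames(buffer) {
  const events = [];
  const frames = buffer.split('\n\n');
  const rest = frames.pop(); // possibly incomplete trailing frame
  for (const frame of frames) {
    for (const line of frame.split('\n')) {
      if (line.startsWith('data: ')) {
        events.push(JSON.parse(line.slice('data: '.length)));
      }
    }
  }
  return { events, rest };
}
```

The caller keeps `rest` and prepends it to the next network chunk, so a frame split across two chunks is parsed only once it is complete. Since `[DONE]` is not forwarded, the event with `done: true` is the signal to stop reading.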
||||||
|
|
||||||
## Models Route
|
## Models Route
|
||||||
|
|
||||||
`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
|
`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
|
||||||
(read from settings). Merges results with a `models.json` file in the same
|
with `models.json` for metadata. Returns file size in GB.
|
||||||
folder for richer metadata (label, description). Returns file size in GB.
|
|
||||||
|
|
||||||
`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
|
`GET /models/props` fetches directly from llama-server. Returns
|
||||||
Returns `{ contextWindow, modelAlias }`. `n_ctx` is at
|
`{ contextWindow, modelAlias }`. Returns `503` if unreachable.
|
||||||
`data.default_generation_settings.n_ctx` in the llama-server response.
|
|
||||||
Returns `503` if llama-server is unreachable.
|
|
||||||
|
|
||||||
## Sessions Route Behaviour
|
## Sessions Route Behaviour
|
||||||
|
|
||||||
`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
|
`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both.
|
||||||
The validation guard only rejects requests where neither is provided:
|
Rejects only when neither is provided — allows `useChat` to write project
|
||||||
|
assignment separately from rename operations.
|
||||||
```js
|
|
||||||
if (!name?.trim() && projectId === undefined) {
|
|
||||||
return res.status(400).json({ error: 'name or projectId is required' });
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
This allows `useChat` to write project assignment separately from rename
|
|
||||||
operations.
|
|
||||||
|
|
||||||
## Caddy Configuration
|
## Caddy Configuration
|
||||||
|
|
||||||
Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
|
Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
|
||||||
|
**Any new top-level route must be added here AND in `vite.config.js`.**
|
||||||
|
|
||||||
```
|
```
|
||||||
handle /chat* { reverse_proxy localhost:4000 }
|
handle /chat* { reverse_proxy localhost:4000 }
|
||||||
@@ -207,6 +193,7 @@ handle /models* { reverse_proxy localhost:4000 }
|
|||||||
handle /projects* { reverse_proxy localhost:4000 }
|
handle /projects* { reverse_proxy localhost:4000 }
|
||||||
handle /episodes* { reverse_proxy localhost:4000 }
|
handle /episodes* { reverse_proxy localhost:4000 }
|
||||||
handle /settings* { reverse_proxy localhost:4000 }
|
handle /settings* { reverse_proxy localhost:4000 }
|
||||||
|
handle /summaries* { reverse_proxy localhost:4000 }
|
||||||
handle /health* { reverse_proxy localhost:4000 }
|
handle /health* { reverse_proxy localhost:4000 }
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -165,10 +165,16 @@ Orchestration pipeline defaults. Used as fallback values in
|
|||||||
| `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
|
| `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
|
||||||
| `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
|
| `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
|
||||||
| `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
|
| `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
|
||||||
|
| `ENTITIES_LIMIT` | `5` | Max entity search results to inject into prompt |
|
||||||
|
| `ENTITIES_THRESHOLD` | `0.55` | Minimum similarity score for entity results |
|
||||||
| `TEMPERATURE` | `0.7` | Default inference temperature |
|
| `TEMPERATURE` | `0.7` | Default inference temperature |
|
||||||
| `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
|
| `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
|
||||||
| `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
|
| `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
|
||||||
|
|
||||||
|
> `ENTITIES_THRESHOLD` is set to `0.55` — lower than `SCORE_THRESHOLD` because
|
||||||
|
> entity notes generated by a 3B model tend to embed with lower cosine similarity
|
||||||
|
> than full episode text. Tune upward if irrelevant entities appear in context.
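
To make the effect of the two thresholds concrete, the sketch below filters one set of scored results against each of them. The `filterByScore` helper, the `score` field name, and the inclusive `>=` comparison are assumptions for illustration; they are not taken from the orchestration code.

```js
// Hypothetical helper: keep results at or above a minimum similarity score,
// best match first, capped at `limit`. The `score` field is an assumed name.
function filterByScore(results, threshold, limit) {
  return results
    .filter((r) => r.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const ENTITIES_LIMIT = 5;
const ENTITIES_THRESHOLD = 0.55;
const SCORE_THRESHOLD = 0.75;

const candidates = [
  { id: 'a', score: 0.81 },
  { id: 'b', score: 0.6 },
  { id: 'c', score: 0.4 },
];

// The lower entity threshold admits 'b', which the episode threshold rejects.
const entityHits = filterByScore(candidates, ENTITIES_THRESHOLD, ENTITIES_LIMIT);
const episodeHits = filterByScore(candidates, SCORE_THRESHOLD, ENTITIES_LIMIT);
```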
|
||||||
|
|
||||||
> `repeatPenalty`, `topP`, and `topK` defaults are sourced from
|
> `repeatPenalty`, `topP`, and `topK` defaults are sourced from
|
||||||
> `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
|
> `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
|
||||||
> since those constants already define the canonical values.
|
> since those constants already define the canonical values.
|
||||||
@@ -178,6 +184,25 @@ Default system prompt:
|
|||||||
> of past conversations with the user. Use them to provide consistent,
|
> of past conversations with the user. Use them to provide consistent,
|
||||||
> personalised responses."
|
> personalised responses."
|
||||||
|
|
||||||
|
#### `SUMMARIES`
|
||||||
|
|
||||||
|
Controls the automatic session summarisation system in `orchestration-service/src/services/summarization.js`.
|
||||||
|
|
||||||
|
| Key | Value | Description |
|
||||||
|
|---|---|---|
|
||||||
|
| `THRESHOLD_TOKENS` | `200` | Minimum total session tokens before summarisation is considered |
|
||||||
|
| `MAX_SUMMARY_TOKENS` | `800` | If existing summary exceeds this length (chars), create a new row instead of updating |
|
||||||
|
| `MIN_EPISODES_SINCE` | `5` | Minimum new episodes since last summary before re-summarising |
|
||||||
|
|
||||||
|
These can be overridden per-deployment via environment variables in the
|
||||||
|
orchestration service `.env`:
|
||||||
|
|
||||||
|
```
|
||||||
|
SUMMARY_THRESHOLD_TOKENS=200
|
||||||
|
SUMMARY_MAX_TOKENS=800
|
||||||
|
SUMMARY_MIN_EPISODES=5
|
||||||
|
```
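
One way these overrides could be resolved against the defaults in the table is sketched below. The `intFromEnv` and `loadSummarySettings` helpers are hypothetical, and the mapping from each environment variable to its `SUMMARIES` key is inferred from the matching values and descriptions rather than confirmed from `config/settings.js`.

```js
// Hypothetical helper: read an integer from an env-like object, falling back
// to the documented default when the variable is unset or not a number.
function intFromEnv(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Assumed mapping of env vars to SUMMARIES keys (not confirmed by the source).
function loadSummarySettings(env = process.env) {
  return {
    THRESHOLD_TOKENS: intFromEnv(env, 'SUMMARY_THRESHOLD_TOKENS', 200),
    MAX_SUMMARY_TOKENS: intFromEnv(env, 'SUMMARY_MAX_TOKENS', 800),
    MIN_EPISODES_SINCE: intFromEnv(env, 'SUMMARY_MIN_EPISODES', 5),
  };
}

// Only the overridden key changes; the others keep their defaults.
const settings = loadSummarySettings({ SUMMARY_MAX_TOKENS: '1200' });
```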
|
||||||
|
|
||||||
#### `SQLITE`
|
#### `SQLITE`
|
||||||
|
|
||||||
| Key | Value | Description |
|
| Key | Value | Description |
|
||||||
|
|||||||
201
docs/services/summarization.md
Normal file
@@ -0,0 +1,201 @@
|
|||||||
|
# Summarization
|
||||||
|
|
||||||
|
Session summarization generates rolling plain-text summaries of conversation
|
||||||
|
history, giving the model a condensed view of past context without consuming
|
||||||
|
the full context window with raw episodes.
|
||||||
|
|
||||||
|
**Location:** `packages/orchestration-service/src/services/summarization.js`
|
||||||
|
**Triggered by:** `chat/index.js` after every episode write (fire-and-forget)
|
||||||
|
**Model:** `qwen2.5:3b` via Ollama on Mini PC 1 (192.168.0.81)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Trigger Conditions
|
||||||
|
|
||||||
|
`triggerSummary(session, allEpisodes)` calls `maybeSummarize` without awaiting it (fire-and-forget).
|
||||||
|
`maybeSummarize` proceeds only when both conditions are met:
|
||||||
|
|
||||||
|
1. Total session token count exceeds `SUMMARIES.THRESHOLD_TOKENS` (default 200)
|
||||||
|
2. At least `SUMMARIES.MIN_EPISODES_SINCE` (default 5) new episodes have
|
||||||
|
accumulated since the last summary
|
||||||
|
|
||||||
|
The token threshold is intentionally low — it ensures summaries start
|
||||||
|
generating early in a session's life rather than only after very long
|
||||||
|
conversations.
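
The two-part gate can be sketched roughly as follows. This is not the actual `maybeSummarize` body: the `shouldSummarize` name, the `token_count` and `id` fields on each episode, and the `lastSummarizedId` argument are assumptions, as is the choice to count every episode as "new" when no summary exists yet.

```js
const SUMMARIES = { THRESHOLD_TOKENS: 200, MIN_EPISODES_SINCE: 5 };

// Hypothetical gate: both conditions must hold before summarisation runs.
function shouldSummarize(allEpisodes, lastSummarizedId = null) {
  // Condition 1: total session tokens must exceed the threshold.
  const totalTokens = allEpisodes.reduce((sum, ep) => sum + (ep.token_count ?? 0), 0);
  if (totalTokens <= SUMMARIES.THRESHOLD_TOKENS) return false;

  // Condition 2: enough episodes written since the last summarised one.
  const newEpisodes = lastSummarizedId === null
    ? allEpisodes
    : allEpisodes.filter((ep) => ep.id > lastSummarizedId);
  return newEpisodes.length >= SUMMARIES.MIN_EPISODES_SINCE;
}
```

With the defaults, a session of five 50-token episodes passes the gate, while five 40-token episodes (exactly 200 tokens) does not, because the total must exceed the threshold.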
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Summary Rows and Cumulative Updates
|
||||||
|
|
||||||
|
Each session can have multiple summary rows in the `summaries` table.
|
||||||
|
The update strategy depends on the size of the most recent summary:
|
||||||
|
|
||||||
|
| Condition | Action |
|
||||||
|
|---|---|
|
||||||
|
| No existing summary | Generate fresh summary from all episodes |
|
||||||
|
| Latest summary under `MAX_SUMMARY_TOKENS` | Update: summarise new episodes with existing summary as context |
|
||||||
|
| Latest summary over `MAX_SUMMARY_TOKENS` | Create new row: treat as fresh summarisation |
|
||||||
|
|
||||||
|
This produces a chain of summary rows over time. Each row's `episode_range`
|
||||||
|
covers only the episodes summarised in that specific pass (e.g. `259-263`),
|
||||||
|
not all episodes in the session.
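
The table above can be read as a small decision function. The sketch below is illustrative only: `chooseSummaryAction` is a hypothetical name, and it compares the stored `token_count` against the limit, whereas the config table describes `MAX_SUMMARY_TOKENS` as a length in characters, so the real check may measure `content.length` instead. A summary exactly at the limit is treated as "over" here, which is also an assumption.

```js
const MAX_SUMMARY_TOKENS = 800;

// Hypothetical helper: decide whether to update the latest row or start a new one.
// `latest` is the most recent summary row for the session, or null/undefined.
function chooseSummaryAction(latest) {
  if (!latest) {
    return { action: 'create', existingSummary: null }; // no summary yet
  }
  if (latest.token_count < MAX_SUMMARY_TOKENS) {
    // Small enough: fold new episodes into the existing text.
    return { action: 'update', existingSummary: latest.content };
  }
  // Too large: start a fresh row without carrying the old text forward.
  return { action: 'create', existingSummary: null };
}
```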
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Ollama Request
|
||||||
|
|
||||||
|
```js
|
||||||
|
{
|
||||||
|
model: EXTRACTION_MODEL, // qwen2.5:3b (set via EXTRACTION_MODEL env var)
|
||||||
|
prompt: buildSummaryPrompt(episodesToSummarize, existingSummary),
|
||||||
|
stream: false,
|
||||||
|
// No format: 'json' — free-text output required for summaries
|
||||||
|
options: {
|
||||||
|
temperature: 0.2,
|
||||||
|
num_predict: 500,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`temperature: 0.2` is slightly higher than extraction (0.1) — summaries
|
||||||
|
benefit from some fluency. `num_predict: 500` gives room for 5 thorough
|
||||||
|
sentences without risk of runoff.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Prompt Format
|
||||||
|
|
||||||
|
ChatML format — native to qwen2.5:
|
||||||
|
|
||||||
|
```
|
||||||
|
<|im_start|>user
|
||||||
|
Summarize the conversation below in 3-5 sentences.
|
||||||
|
Write in third person. Do not quote directly — paraphrase only.
|
||||||
|
Do not include greetings, sign-offs, or filler. Output only the summary text.
|
||||||
|
|
||||||
|
Conversation:
|
||||||
|
{context}
|
||||||
|
<|im_end|>
|
||||||
|
<|im_start|>assistant
|
||||||
|
```
|
||||||
|
|
||||||
|
For cumulative updates, the instruction and context change:
|
||||||
|
|
||||||
|
```
|
||||||
|
<|im_start|>user
|
||||||
|
Update the summary below to incorporate the new exchanges.
|
||||||
|
Write 3-5 sentences in third person. Do not quote directly — paraphrase only.
|
||||||
|
Do not include greetings, sign-offs, or filler. Output only the updated summary text.
|
||||||
|
|
||||||
|
Previous summary:
|
||||||
|
{existingSummary}
|
||||||
|
|
||||||
|
New exchanges:
|
||||||
|
{context}
|
||||||
|
<|im_end|>
|
||||||
|
<|im_start|>assistant
|
||||||
|
```
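
A rough sketch of how the two templates might be assembled is shown below. It is not the real `buildSummaryPrompt`: the `wrapChatML` and `renderSummaryPrompt` names are hypothetical, the function takes an already-rendered `context` string rather than episode objects, and only the first instruction line of each template is reproduced (the full instruction text is listed above).

```js
// Wrap user content in a single-turn ChatML prompt, leaving the assistant turn open.
function wrapChatML(userContent) {
  return `<|im_start|>user\n${userContent}\n<|im_end|>\n<|im_start|>assistant\n`;
}

// Hypothetical renderer: picks the update template when a previous summary exists.
function renderSummaryPrompt(context, existingSummary = null) {
  const sections = existingSummary
    ? [
        'Update the summary below to incorporate the new exchanges.',
        '',
        'Previous summary:',
        existingSummary,
        '',
        'New exchanges:',
        context,
      ]
    : ['Summarize the conversation below in 3-5 sentences.', '', 'Conversation:', context];
  return wrapChatML(sections.join('\n'));
}
```

Keeping the ChatML wrapper separate from the instruction text means both templates share the same opening and closing tokens.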
|
||||||
|
|
||||||
|
### Input truncation
|
||||||
|
|
||||||
|
Episode context is truncated to `MAX_CHARS = 3000` characters, keeping the
|
||||||
|
most recent exchanges (sliced from the end). This keeps Qwen focused and
|
||||||
|
prevents the prompt from exceeding its effective context window.
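
Slicing from the end can be done with a negative `String.prototype.slice` index. The sketch below is a minimal version of that idea; `truncateContext` is a hypothetical name, and the real implementation may cut on episode boundaries rather than at an arbitrary character.

```js
const MAX_CHARS = 3000;

// Keep only the last `maxChars` characters so the most recent exchanges survive.
// Note: the cut can land mid-sentence, since it counts characters only.
function truncateContext(context, maxChars = MAX_CHARS) {
  return context.length > maxChars ? context.slice(-maxChars) : context;
}

// Older text ('a') is dropped; the newest 3000 characters ('b') are kept.
const truncated = truncateContext('a'.repeat(2000) + 'b'.repeat(3000));
```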
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ChatML Token Stripping
|
||||||
|
|
||||||
|
Qwen occasionally echoes ChatML tokens back into its response. The raw output
|
||||||
|
is cleaned before saving:
|
||||||
|
|
||||||
|
```js
|
||||||
|
const raw = data.response?.trim() ?? '';
|
||||||
|
const content = raw
|
||||||
|
.replace(/<\|im_start\|>.*?<\|im_end\|>/gs, '')
|
||||||
|
.replace(/<\|im_start\|>|<\|im_end\|>|<\|im_sep\|>/g, '')
|
||||||
|
.trim();
|
||||||
|
return content;
|
||||||
|
```
|
||||||
|
|
||||||
|
Without this, leaked tokens get stored in the summary and then injected
|
||||||
|
back into the next summarisation prompt — causing the model to append a new
|
||||||
|
summary after the old one rather than replacing it.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Episode Range Tracking
|
||||||
|
|
||||||
|
Each summary row stores `episode_range` as `"firstId-lastId"` covering only
|
||||||
|
the episodes summarised in that pass:
|
||||||
|
|
||||||
|
```js
|
||||||
|
const summarizedIds = episodesToSummarize.map(ep => ep.id).sort((a,b) => a - b);
|
||||||
|
const episodeRange = `${summarizedIds.at(0)}-${summarizedIds.at(-1)}`;
|
||||||
|
```
|
||||||
|
|
||||||
|
This makes SummaryView cards meaningful — "Episodes 259-263" tells you
|
||||||
|
exactly which exchanges that summary covers, rather than always showing
|
||||||
|
the full session range.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Summary Storage
|
||||||
|
|
||||||
|
Summaries are written directly to the memory service from orchestration:
|
||||||
|
|
||||||
|
```js
|
||||||
|
// Create new row
|
||||||
|
await fetch(`${MEMORY_URL}/summaries`, {
|
||||||
|
method: 'POST',
|
||||||
|
body: JSON.stringify({ sessionId: session.id, content, tokenCount, episodeRange }),
|
||||||
|
});
|
||||||
|
|
||||||
|
// Update existing row
|
||||||
|
await fetch(`${MEMORY_URL}/summaries/${latest.id}`, {
|
||||||
|
method: 'PATCH',
|
||||||
|
body: JSON.stringify({ content, tokenCount, episodeRange }),
|
||||||
|
});
|
||||||
|
```
|
||||||
|
|
||||||
|
`session.id` here is the internal SQLite integer ID — not the external UUID.
|
||||||
|
It is available directly on the `session` object passed from `chat/index.js`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Client-Side Indicator
|
||||||
|
|
||||||
|
The chat client shows a "Summarising…" spinner in the `ChatWindow` header
|
||||||
|
and on the InfoPanel's Session Memory button while summarisation may be
|
||||||
|
in progress.
|
||||||
|
|
||||||
|
Since summarisation is fire-and-forget with no completion signal back to
|
||||||
|
the client, the indicator is timer-based: it activates when the stream
|
||||||
|
finishes and clears after 8 seconds.
|
||||||
|
|
||||||
|
```js
|
||||||
|
// In App.jsx, watching the streaming state from useChat:
|
||||||
|
useEffect(() => {
|
||||||
|
if (prevStreaming.current && !streaming) {
|
||||||
|
setSummarising(true);
|
||||||
|
const t = setTimeout(() => setSummarising(false), 8000);
|
||||||
|
return () => clearTimeout(t);
|
||||||
|
}
|
||||||
|
prevStreaming.current = streaming;
|
||||||
|
}, [streaming]);
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Environment Variables
|
||||||
|
|
||||||
|
Set in `packages/orchestration-service/src/.env`:
|
||||||
|
|
||||||
|
| Variable | Default | Description |
|
||||||
|
|---|---|---|
|
||||||
|
| `EXTRACTION_URL` | `http://localhost:11434` | Ollama instance URL |
|
||||||
|
| `EXTRACTION_MODEL` | `qwen2.5:3b` | Model for summarisation |
|
||||||
|
| `MEMORY_SERVICE_URL` | `http://localhost:3002` | Memory service URL |
|
||||||
|
| `SUMMARY_THRESHOLD_TOKENS` | `200` | Token threshold before summarisation triggers |
|
||||||
|
| `SUMMARY_MAX_TOKENS` | `800` | Max summary length before a new row is created |
|
||||||
|
| `SUMMARY_MIN_EPISODES` | `5` | Min new episodes since last summary before re-summarising |
|
||||||