documentation updates for entity extraction and summarization

2026-04-21 03:50:38 -07:00
parent 32365e67f4
commit acda21317b
6 changed files with 540 additions and 107 deletions
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -30,31 +30,33 @@ or inference services — all traffic flows through orchestration.
 | LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
-| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
+| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
+| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |

 ## Internal Structure

 ```
 src/
 ├── services/
-│   ├── memory.js      # HTTP client for memory service
-│   ├── inference.js   # HTTP client for inference service
-│   ├── embedding.js   # HTTP client for embedding service
-│   └── qdrant.js      # HTTP client for Qdrant (direct vector search)
+│   ├── memory.js         # HTTP client for memory service
+│   ├── inference.js      # HTTP client for inference service
+│   ├── embedding.js      # HTTP client for embedding service
+│   ├── qdrant.js         # HTTP client for Qdrant (direct vector search)
+│   └── summarization.js  # Session summarisation — triggers after each episode
 ├── chat/
-│   └── index.js       # Core pipeline — context assembly, isolation, auto-naming
+│   └── index.js          # Core pipeline — context assembly, isolation, auto-naming
 ├── config/
-│   └── settings.js    # Settings load/save — reads/writes data/settings.json
+│   └── settings.js       # Settings load/save — reads/writes data/settings.json
 ├── routes/
-│   ├── chat.js        # POST /chat and POST /chat/stream
-│   ├── sessions.js    # Session CRUD proxy
-│   ├── projects.js    # Project CRUD proxy — passes req.body straight through
-│   ├── episodes.js    # Episode list and delete proxy
-│   ├── settings.js    # GET /settings and PATCH /settings
-│   ├── health.js      # GET /health — pings all four services
-│   └── models.js      # GET /models — scans .gguf files live, merges with models.json
-                       # GET /models/props — context window + loaded model from llama-server
-└── index.js           # Express app entry point
+│   ├── chat.js           # POST /chat and POST /chat/stream
+│   ├── sessions.js       # Session CRUD proxy
+│   ├── projects.js       # Project CRUD proxy
+│   ├── episodes.js       # Episode list and delete proxy
+│   ├── summaries.js      # GET /summaries/session/:id and /summaries/project/:id
+│   ├── settings.js       # GET /settings and PATCH /settings
+│   ├── health.js         # GET /health/services — pings all four services
+│   └── models.js         # GET /models and GET /models/props
+└── index.js              # Express app entry point
 ```

 The `services/` layer wraps all downstream HTTP calls in named functions.
@@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart
 | `topK` | 40 | Top-K token candidates per step |
 | `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |

-Defaults are defined in `config/settings.js` and fall back to constants in
-`@nexusai/shared`. Values saved in `settings.json` take precedence.
-
 ## Chat Pipeline

 Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client.
 ### Steps

 1. **Session resolution** — look up session by `externalId`. Auto-create if
-   not found. Clients generate a UUID for new conversations — no pre-creation
-   step needed.
+   not found.

 2. **Project context resolution** — if the session has a `project_id`, fetch
   the project and all its session IDs. Used to scope semantic search. The
   project's `system_prompt` is also read at this step if set.

 3. **System prompt resolution** — three-tier hierarchy:
-   - `project.system_prompt` — if the session is in a project and it's set (highest priority)
+   - `project.system_prompt` — highest priority
   - `settings.systemPrompt` — global setting from `settings.json`
-   - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort)
+   - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)

-4. **Recent episode retrieval** — fetch the most recent episodes for the
-   session (`recentEpisodeLimit`, default 5).
+4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).

-5. **Semantic search** — embed the user message, query Qdrant for the top
-   most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
-   against recent episodes. Non-critical — if it fails, pipeline continues with
-   recency-only context.
+5. **Semantic search** — embed user message, query Qdrant for similar past
+   episodes. Deduplicated against recent episodes. Non-critical.

-6. **Entity search** — query the `entities` Qdrant collection filtered by
+6. **Entity search** — query `entities` Qdrant collection filtered by
   `projectId`. Non-project sessions receive no entity context. Non-critical.

-7. **Prompt assembly** — combine resolved system prompt, entity context,
-   semantic episodes, recent episodes, and user message.
+7. **Prompt assembly** — combine system prompt, entity context, semantic
+   episodes, recent episodes, and user message.

-8. **Inference** — send to inference service with settings-derived parameters
-   (temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
+8. **Inference** — send to inference service. `/chat` awaits full response;
   `/chat/stream` pipes SSE chunks to the client.

-9. **Episode write** — write the exchange back to memory with `projectId`.
-   Fire-and-forget for `/chat`; awaited for `/chat/stream`.
+9. **Episode write** — write exchange back to memory with `projectId`.

-10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
-    inference call with a naming prompt (max 20 tokens, temperature 0.3) and
-    write the result back as `session.name`. Fully fire-and-forget.
+10. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` is
+    called fire-and-forget. See `summarization.md` for full details.
+
+11. **Auto-naming** — on first message with no session name, fires a secondary
+    inference call (max 20 tokens, temperature 0.3) to generate a session name.

 ### Prompt Structure

@@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client.

 Here is what you know about entities relevant to this conversation:
 - {name} ({type}): {notes}
-... (up to 5 entity results)
 ---
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to semanticLimit semantic episodes)
 ---
 Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to recentEpisodeLimit recent episodes)
 --- End of recent memories ---

 User: {current message}
 Assistant:
 ```

-Entity context appears first — before episodic memory — because structured
-facts about known entities are the most stable and reliable context. Semantic
-episodes follow, then recent episodes as the immediate conversation flow.
+## Summarisation
+
+After each episode write, `triggerSummary` is called fire-and-forget. It
+checks token thresholds and episode counts before generating a summary,
+then stores the result in the memory service.
+
+> For full details on trigger conditions, prompt format, cumulative updates,
+> ChatML token stripping, and episode range tracking, see `summarization.md`.

 ## SSE Stream Format

@@ -168,46 +165,36 @@ data: {"text":"Hello"}
 data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
 ```

-The `[DONE]` sentinel is consumed internally and not forwarded. The stream
-is terminated by `res.end()` after the done event.
+The `[DONE]` sentinel is consumed internally and not forwarded.

 ## Models Route

-`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
-(read from settings). Merges results with a `models.json` file in the same
-folder for richer metadata (label, description). Returns file size in GB.
+`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
+with `models.json` for metadata. Returns file size in GB.

-`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
-Returns `{ contextWindow, modelAlias }`. `n_ctx` is at
-`data.default_generation_settings.n_ctx` in the llama-server response.
-Returns `503` if llama-server is unreachable.
+`GET /models/props` fetches directly from llama-server. Returns
+`{ contextWindow, modelAlias }`, or `503` if llama-server is unreachable.

 ## Sessions Route Behaviour

-`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
-The validation guard only rejects requests where neither is provided:
-
-```js
-if (!name?.trim() && projectId === undefined) {
-  return res.status(400).json({ error: 'name or projectId is required' });
-}
-```
-
-This allows `useChat` to write project assignment separately from rename
-operations.
+`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both. It
+rejects a request with `400` only when neither is provided, which lets
+`useChat` write project assignment separately from rename operations.

 ## Caddy Configuration

-Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
+Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
+**Any new top-level route must be added here AND in `vite.config.js`.**

 ```
-handle /chat*     { reverse_proxy localhost:4000 }
-handle /sessions* { reverse_proxy localhost:4000 }
-handle /models*   { reverse_proxy localhost:4000 }
-handle /projects* { reverse_proxy localhost:4000 }
-handle /episodes* { reverse_proxy localhost:4000 }
-handle /settings* { reverse_proxy localhost:4000 }
-handle /health*   { reverse_proxy localhost:4000 }
+handle /chat*      { reverse_proxy localhost:4000 }
+handle /sessions*  { reverse_proxy localhost:4000 }
+handle /models*    { reverse_proxy localhost:4000 }
+handle /projects*  { reverse_proxy localhost:4000 }
+handle /episodes*  { reverse_proxy localhost:4000 }
+handle /settings*  { reverse_proxy localhost:4000 }
+handle /summaries* { reverse_proxy localhost:4000 }
+handle /health*    { reverse_proxy localhost:4000 }
 ```

 After updating: `caddy reload --config /path/to/Caddyfile`