documentation updates for entity extraction and summarization

Storme-bit
2026-04-21 03:50:38 -07:00
parent 32365e67f4
commit acda21317b
6 changed files with 540 additions and 107 deletions


@@ -30,31 +30,33 @@ or inference services — all traffic flows through orchestration.
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |
## Internal Structure
```
src/
├── services/
│ ├── memory.js # HTTP client for memory service
│ ├── inference.js # HTTP client for inference service
│ ├── embedding.js # HTTP client for embedding service
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
│ ├── memory.js # HTTP client for memory service
│ ├── inference.js # HTTP client for inference service
│ ├── embedding.js # HTTP client for embedding service
│ ├── qdrant.js # HTTP client for Qdrant (direct vector search)
│ └── summarization.js # Session summarisation — triggers after each episode
├── chat/
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
├── config/
│ └── settings.js # Settings load/save — reads/writes data/settings.json
│ └── settings.js # Settings load/save — reads/writes data/settings.json
├── routes/
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy — passes req.body straight through
│ ├── episodes.js # Episode list and delete proxy
│ ├── settings.js # GET /settings and PATCH /settings
│ ├── health.js # GET /health — pings all four services
│ └── models.js # GET /models — scans .gguf files live, merges with models.json
# GET /models/props — context window + loaded model from llama-server
└── index.js # Express app entry point
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy
│ ├── episodes.js # Episode list and delete proxy
│ ├── summaries.js # GET /summaries/session/:id and /summaries/project/:id
│ ├── settings.js # GET /settings and PATCH /settings
│ ├── health.js # GET /health/services — pings all four services
│ └── models.js # GET /models and GET /models/props
└── index.js # Express app entry point
```
The `services/` layer wraps all downstream HTTP calls in named functions.
@@ -77,9 +79,6 @@ via `appSettings.load()` — changes apply immediately without a service restart
| `topK` | 40 | Top-K token candidates per step |
| `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |
Defaults are defined in `config/settings.js` and fall back to constants in
`@nexusai/shared`. Values saved in `settings.json` take precedence.
## Chat Pipeline
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -88,42 +87,38 @@ difference is how the inference response is delivered to the client.
### Steps
1. **Session resolution** — look up session by `externalId`. Auto-create if
not found. Clients generate a UUID for new conversations — no pre-creation
step needed.
not found.
2. **Project context resolution** — if the session has a `project_id`, fetch
the project and all its session IDs. Used to scope semantic search. The
project's `system_prompt` is also read at this step if set.
3. **System prompt resolution** — three-tier hierarchy:
- `project.system_prompt`, if the session is in a project and it's set (highest priority)
- `project.system_prompt` — highest priority
- `settings.systemPrompt` — global setting from `settings.json`
- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant in `@nexusai/shared` (last resort)
- `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)
4. **Recent episode retrieval** — fetch the most recent episodes for the
session (`recentEpisodeLimit`, default 5).
4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).
5. **Semantic search** — embed the user message, query Qdrant for the top
most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
against recent episodes. Non-critical — if it fails, pipeline continues with
recency-only context.
5. **Semantic search** — embed user message, query Qdrant for similar past
episodes. Deduplicated against recent episodes. Non-critical.
6. **Entity search** — query the `entities` Qdrant collection filtered by
6. **Entity search** — query `entities` Qdrant collection filtered by
`projectId`. Non-project sessions receive no entity context. Non-critical.
7. **Prompt assembly** — combine resolved system prompt, entity context,
semantic episodes, recent episodes, and user message.
7. **Prompt assembly** — combine system prompt, entity context, semantic
episodes, recent episodes, and user message.
8. **Inference** — send to inference service with settings-derived parameters
(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
8. **Inference** — send to inference service. `/chat` awaits full response;
`/chat/stream` pipes SSE chunks to the client.
9. **Episode write** — write the exchange back to memory with `projectId`.
Fire-and-forget for `/chat`; awaited for `/chat/stream`.
9. **Episode write** — write exchange back to memory with `projectId`.
10. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
inference call with a naming prompt (max 20 tokens, temperature 0.3) and
write the result back as `session.name`. Fully fire-and-forget.
10. **Summarisation trigger**: `triggerSummary(session, allEpisodes)` is called
fire-and-forget. See `summarization.md` for full details.
11. **Auto-naming** — on first message with no session name, fires a secondary
inference call (max 20 tokens, temperature 0.3) to generate a session name.
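The three-tier lookup in step 3 can be sketched as a small pure function. This is an illustration only: `resolveSystemPrompt` and `FALLBACK_SYSTEM_PROMPT` are hypothetical names, the constant's text is a placeholder for `ORCHESTRATION.SYSTEM_PROMPT`, and treating an empty string as "not set" is an assumption.

```javascript
// Hypothetical stand-in for ORCHESTRATION.SYSTEM_PROMPT from @nexusai/shared.
const FALLBACK_SYSTEM_PROMPT = 'You are a helpful assistant.';

// Returns the first usable prompt in priority order:
// project.system_prompt, then settings.systemPrompt, then the constant.
// Assumption: null, undefined, and blank strings all count as "not set".
function resolveSystemPrompt(project, settings, fallback = FALLBACK_SYSTEM_PROMPT) {
  const candidates = [project?.system_prompt, settings?.systemPrompt, fallback];
  return candidates.find((p) => typeof p === 'string' && p.trim() !== '');
}
```

A session outside any project would pass `null` for `project`, so the result falls through to the global setting or the constant.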
### Prompt Structure
@@ -132,26 +127,28 @@ difference is how the inference response is delivered to the client.
Here is what you know about entities relevant to this conversation:
- {name} ({type}): {notes}
... (up to 5 entity results)
---
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to semanticLimit semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to recentEpisodeLimit recent episodes)
--- End of recent memories ---
User: {current message}
Assistant:
```
Entity context appears first — before episodic memory — because structured
facts about known entities are the most stable and reliable context. Semantic
episodes follow, then recent episodes as the immediate conversation flow.
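The ordering described above (entities, then semantic episodes, then recent episodes, then the user message) might be assembled roughly as follows. This is a sketch under assumptions: the field names `id`, `userMessage`, `aiResponse`, `name`, `type`, and `notes` are hypothetical, deduplication by `id` is one possible way to implement the step 5 rule, and the separators only approximate the layout shown in the prompt structure.

```javascript
// Drop semantic hits that already appear among the recent episodes.
// Assumption: episodes carry a unique `id`.
function dedupeSemantic(semantic, recent) {
  const recentIds = new Set(recent.map((e) => e.id));
  return semantic.filter((e) => !recentIds.has(e.id));
}

// Render episodes as alternating User/Assistant lines.
function formatEpisodes(episodes) {
  return episodes
    .map((e) => `User: ${e.userMessage}\nAssistant: ${e.aiResponse}`)
    .join('\n');
}

// Build the prompt: system prompt, entities, semantic, recent, current message.
// Empty sections are skipped entirely.
function assemblePrompt({ systemPrompt, entities = [], semantic = [], recent = [], message }) {
  const sections = [systemPrompt];
  if (entities.length > 0) {
    const lines = entities.map((e) => `- ${e.name} (${e.type}): ${e.notes}`);
    sections.push(
      'Here is what you know about entities relevant to this conversation:',
      ...lines,
      '---'
    );
  }
  const uniqueSemantic = dedupeSemantic(semantic, recent);
  if (uniqueSemantic.length > 0) {
    sections.push(
      'Here are some relevant memories from earlier conversations:',
      formatEpisodes(uniqueSemantic),
      '---'
    );
  }
  if (recent.length > 0) {
    sections.push(
      'Here are some relevant memories from your past conversations:',
      formatEpisodes(recent),
      '--- End of recent memories ---'
    );
  }
  sections.push(`User: ${message}`, 'Assistant:');
  return sections.join('\n');
}
```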
## Summarisation
After each episode write, `triggerSummary` is called fire-and-forget. It
checks token thresholds and episode counts before generating, then stores
the result in the memory service.
> For full details on trigger conditions, prompt format, cumulative updates,
> ChatML token stripping, and episode range tracking, see `summarization.md`.
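The fire-and-forget call pattern mentioned above can be illustrated with a small helper. `fireAndForget` is a hypothetical name, not something the source confirms exists; the point is only that the caller does not await the task and that failures are logged rather than propagated.

```javascript
// Start a background task without awaiting it.
// Both synchronous throws and promise rejections are routed to onError,
// so a failing task cannot break the request that triggered it.
function fireAndForget(task, onError = (err) => console.error('background task failed:', err)) {
  try {
    Promise.resolve(task()).catch(onError);
  } catch (err) {
    onError(err); // the task threw before returning a promise
  }
}
```

In the pipeline this would be used along the lines of `fireAndForget(() => triggerSummary(session, allEpisodes))`, leaving the chat response unaffected by summarisation errors.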
## SSE Stream Format
@@ -168,46 +165,36 @@ data: {"text":"Hello"}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
```
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
is terminated by `res.end()` after the done event.
The `[DONE]` sentinel is consumed internally and not forwarded.
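On the client side, the `data:` lines shown above could be consumed with a small buffer parser like the one below. This is a sketch, not the project's client code: `parseSseBuffer` is a hypothetical name, it assumes events are separated by a blank line (`\n\n`) as in standard SSE, and it does not handle `\r\n` line endings.

```javascript
// Split accumulated SSE text into parsed `data:` payloads.
// Returns the parsed events plus any trailing partial event, which the
// caller should prepend to the next chunk it receives.
function parseSseBuffer(buffer) {
  const parts = buffer.split('\n\n');
  const rest = parts.pop(); // incomplete trailing event, or '' if none
  const events = [];
  for (const part of parts) {
    for (const line of part.split('\n')) {
      if (!line.startsWith('data:')) continue;
      const payload = line.slice('data:'.length).trim();
      // [DONE] is not forwarded by the server; skipping it here is defensive.
      if (payload === '' || payload === '[DONE]') continue;
      events.push(JSON.parse(payload));
    }
  }
  return { events, rest };
}
```

A caller would append each `text` value to the displayed reply and stop reading once an event with `done: true` arrives.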
## Models Route
`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
(read from settings). Merges results with a `models.json` file in the same
folder for richer metadata (label, description). Returns file size in GB.
`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
with `models.json` for metadata. Returns file size in GB.
`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
Returns `{ contextWindow, modelAlias }`. `n_ctx` is at
`data.default_generation_settings.n_ctx` in the llama-server response.
Returns `503` if llama-server is unreachable.
`GET /models/props` fetches directly from llama-server. Returns
`{ contextWindow, modelAlias }`. Returns `503` if unreachable.
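The merge step behind `GET /models` might look something like this. Everything about the data shapes is an assumption for illustration: `files` as `{ name, sizeBytes }` entries from a directory scan, `models.json` as an array of `{ file, label, description }`, and GB computed as bytes divided by 1024³.

```javascript
// files: [{ name, sizeBytes }] as a directory scan might report them (hypothetical shape).
// manifest: [{ file, label, description }] parsed from models.json (hypothetical schema).
function mergeModels(files, manifest = []) {
  const metaByFile = new Map(manifest.map((m) => [m.file, m]));
  return files
    .filter((f) => f.name.toLowerCase().endsWith('.gguf'))
    .map((f) => {
      const meta = metaByFile.get(f.name) ?? {};
      return {
        file: f.name,
        label: meta.label ?? f.name, // fall back to the filename
        description: meta.description ?? null,
        sizeGB: Number((f.sizeBytes / 1024 ** 3).toFixed(2)),
      };
    });
}
```

Because the scan runs on each request, a model file dropped into the folder shows up without a restart, and a missing `models.json` entry only costs the richer label.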
## Sessions Route Behaviour
`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
The validation guard only rejects requests where neither is provided:
```js
if (!name?.trim() && projectId === undefined) {
return res.status(400).json({ error: 'name or projectId is required' });
}
```
This allows `useChat` to write project assignment separately from rename
operations.
`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both.
Rejects only when neither is provided — allows `useChat` to write project
assignment separately from rename operations.
## Caddy Configuration
Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
**Any new top-level route must be added here AND in `vite.config.js`.**
```
handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /health* { reverse_proxy localhost:4000 }
handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /summaries* { reverse_proxy localhost:4000 }
handle /health* { reverse_proxy localhost:4000 }
```
After updating: `caddy reload --config /path/to/Caddyfile`
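For the `vite.config.js` side, one way to keep the dev proxy in step with the Caddyfile is to generate it from a single list of prefixes. This is a sketch: `ROUTE_PREFIXES` and `buildDevProxy` are hypothetical names, and the target port is assumed to match the Caddy upstream. The resulting object is the shape Vite accepts under `server.proxy`.

```javascript
// Route prefixes mirrored from the Caddyfile handle blocks.
const ROUTE_PREFIXES = [
  '/chat', '/sessions', '/models', '/projects',
  '/episodes', '/settings', '/summaries', '/health',
];

// Build a proxy map: each prefix forwards to the orchestration service.
// Assumption: the dev target is the same host and port Caddy proxies to.
function buildDevProxy(prefixes, target = 'http://localhost:4000') {
  return Object.fromEntries(
    prefixes.map((prefix) => [prefix, { target, changeOrigin: true }])
  );
}
```

In `vite.config.js` the result would be passed as `server: { proxy: buildDevProxy(ROUTE_PREFIXES) }`, so adding a route means editing the list and the Caddyfile only.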