update documentation

2026-04-17 03:46:17 -07:00
parent 27e3c98304
commit 5145b9a7db
13 changed files with 822 additions and 794 deletions
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -39,56 +39,58 @@ src/
 │   ├── memory.js      # HTTP client for memory service
 │   ├── inference.js   # HTTP client for inference service
 │   ├── embedding.js   # HTTP client for embedding service
-│   └── qdrant.js      # HTTP client for Qdrant vector search
+│   └── qdrant.js      # HTTP client for Qdrant (direct vector search)
 ├── chat/
-│   └── index.js       # Core pipeline logic — context assembly and coordination
+│   └── index.js       # Core pipeline — context assembly, isolation, auto-naming
 ├── routes/
-│   ├── chat.js        # POST /chat and POST /chat/stream route handlers
-│   ├── sessions.js    # Session list, history, rename, and delete routes
-│   ├── projects.js    # Project CRUD routes — proxies to memory service
-│   └── models.js      # GET /models — reads models.json manifest from disk
+│   ├── chat.js        # POST /chat and POST /chat/stream
+│   ├── sessions.js    # Session CRUD proxy
+│   ├── projects.js    # Project CRUD proxy
+│   └── models.js      # GET /models — reads models.json from disk
 └── index.js           # Express app entry point
 ```

-The `services/` layer wraps all downstream HTTP calls in named functions,
-keeping the pipeline logic in `chat/index.js` readable and ensuring that
+The `services/` layer wraps all downstream HTTP calls in named functions, so
 URL or endpoint changes have a single place to be updated.

 ## Chat Pipeline

-Both `POST /chat` and `POST /chat/stream` share the same context assembly
-steps. The only difference is how the inference response is delivered to
-the client.
+Both `POST /chat` and `POST /chat/stream` share the same steps. The only
+difference is how the inference response is delivered to the client.

-1. **Session resolution** — looks up the session by `externalId` in the memory
-   service. If not found, auto-creates a new session. Clients can generate a
-   UUID for new conversations and pass it directly — no pre-creation step needed.
+### Steps

-2. **Recent episode retrieval** — fetches the most recent episodes for the session
-   (default: 5) from the memory service.
+1. **Session resolution** — look up session by `externalId`. Auto-create if
+   not found. Clients generate a UUID for new conversations — no pre-creation
+   step needed.

-3. **Semantic search** — embeds the user message via the embedding service, then
-   queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
-   Results are deduplicated against the recent episode set using a `Set` of IDs.
-   Full episode content is fetched from the memory service by ID. This step is
-   non-critical — if it fails, a warning is logged and the pipeline continues with
+2. **Project context resolution** — if the session has a `project_id`, fetch
+   the project and all its session IDs. Used to scope semantic search. See
+   `memory-isolation.md` for full behaviour.
+
+3. **Recent episode retrieval** — fetch the most recent episodes for the
+   session (`RECENT_EPISODE_LIMIT`, default 5).
+
+4. **Semantic search** — embed the user message, query Qdrant for the top-5
+   most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
+   recent episodes. Non-critical — if it fails, pipeline continues with
   recency-only context.

-4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
-   recent episodes, and the current user message into a single prompt string.
+5. **Prompt assembly** — combine system prompt, semantic episodes, recent
+   episodes, and user message.

-5. **Inference** — sends the assembled prompt to the inference service. `/chat`
-   awaits the full response; `/chat/stream` opens an SSE connection and pipes
-   chunks to the client as they arrive.
+6. **Inference** — send to inference service. `/chat` awaits full response;
+   `/chat/stream` pipes SSE chunks to the client.

-6. **Episode write** — writes the new exchange (user message + AI response)
-   back to the memory service as a fire-and-forget operation. For streaming,
-   the full response text is accumulated across chunks before writing.
+7. **Episode write** — write the exchange back to memory. Fire-and-forget
+   for `/chat`; awaited for `/chat/stream` to ensure the full text is
+   accumulated before saving.

-7. **Response** — returns the AI response, model name, session ID, and token
-   count to the client.
+8. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
+   inference call with a naming prompt (max 20 tokens, temperature 0.3) and
+   write the result back as `session.name`. Fully fire-and-forget.

-## Prompt Structure
+### Prompt Structure

 ```
 [System prompt]
@@ -108,212 +110,67 @@ User: {current message}
 Assistant:
 ```

-Semantic episodes appear before recent episodes so the model encounters
-long-range relevant context before the immediate conversation flow.
+Semantic episodes appear before recent episodes so the model sees
+long-range context before the immediate conversation flow.

 ## SSE Stream Format

-The inference service emits chunks from the llama.cpp provider in this format:
+Inference service → orchestration:
 ```
 data: {"response":"Hello","done":false}
-data: {"response":"!","done":false}
-data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
+data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
 data: [DONE]
 ```

-The orchestration service re-emits to the client as:
+Orchestration → client:
 ```
 data: {"text":"Hello"}
-data: {"text":"!"}
-data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
+data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
 ```

-The `[DONE]` sentinel from the inference service is consumed internally
-and not forwarded. The client stream is terminated by `res.end()` after
-the done event. Model name and token count are included on the done event
-so the client can display them in the UI.
+The `[DONE]` sentinel is consumed internally and not forwarded. The stream
+is terminated by `res.end()` after the done event.

 ## Models Manifest

-The `/models` endpoint reads a `models.json` file from disk at the path
-specified by `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside
-the model files, and is accessible to orchestration via a network share
-mounted at `/mnt/nexus-models`.
+`GET /models` reads `models.json` fresh on each request from
+`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
+accessible via an SMB mount at `/mnt/nexus-models`.

-The manifest is read fresh on each request — no restart needed when models
-are added or removed.
-
-**models.json format:**
 ```json
 [
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
 ]
 ```

- `value` — must match the model name as reported by `llama-server` (including `.gguf` extension)
- `label` — display name shown in the UI
+`value` must match the model name as reported by `llama-server` (including
+`.gguf` extension). No service restart needed when models are added or removed.

-## Endpoints
+## Sessions Route Behaviour

-### Health
+`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both. The
+validation guard rejects only requests that provide neither; a blank or
+whitespace-only `name` counts as missing:

-| Method | Path | Description |
-|---|---|---|
-| GET | /health | Service health check — reports downstream service URLs |
-
-### Chat
-
-| Method | Path | Description |
-|---|---|---|
-| POST | /chat | Send a message and receive a complete response |
-| POST | /chat/stream | Send a message and receive a streaming SSE response |
-
-### Sessions
-
-| Method | Path | Description |
-|---|---|---|
-| GET | /sessions | Get paginated list of all sessions |
-| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
-| PATCH | /sessions/:sessionId | Rename a session |
-| DELETE | /sessions/:sessionId | Delete a session and all its episodes |
-
-### Projects
-
-Projects are proxied directly from the memory service with no transformation.
-
-| Method | Path | Description |
-|---|---|---|
-| GET | /projects | Get all projects |
-| POST | /projects | Create a new project |
-| PATCH | /projects/:id | Update a project |
-| DELETE | /projects/:id | Delete a project |
-
-### Models
-
-| Method | Path | Description |
-|---|---|---|
-| GET | /models | Get list of available models from manifest file |
-
---
-
-**POST /chat**
-
-Request body:
-```json
-{
-  "sessionId": "your-session-uuid",
-  "message": "Hello, my name is Tim.",
-  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
-  "temperature": 0.7
+```js
+if (!name?.trim() && projectId === undefined) {
+  return res.status(400).json({ error: 'name or projectId is required' });
 }
 ```

-`model` and `temperature` are optional — fall back to inference service defaults
-if omitted.
-
-Response:
-```json
-{
-  "sessionId": "your-session-uuid",
-  "response": "Hello Tim! How can I help you today?",
-  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
-  "tokenCount": 87
-}
-```
-
---
-
-**POST /chat/stream**
-
-Same request body as `POST /chat`.
-
-Response is a stream of Server-Sent Events:
-```
-data: {"text":"Hello"}
-data: {"text":" Tim"}
-data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
-```
-
---
-
-**PATCH /sessions/:sessionId**
-
-Request body:
-```json
-{ "name": "My Renamed Session" }
-```
-
-Returns the updated session object. `name` is required and trimmed of whitespace.
-
---
-
-**DELETE /sessions/:sessionId**
-
-Returns `204 No Content`. Cascades to delete all episodes for the session.
-
---
-
-**GET /sessions/:sessionId/history**
-
-Query parameters:
-
-| Parameter | Default | Description |
-|---|---|---|
-| limit | 20 | Maximum number of episodes to return |
-| offset | 0 | Number of episodes to skip (for pagination) |
-
-Response:
-```json
-{
-  "sessionId": "your-session-uuid",
-  "episodes": [
-    {
-      "id": 42,
-      "session_id": 1,
-      "user_message": "Hello, my name is Tim.",
-      "ai_response": "Hello Tim! How can I help you today?",
-      "token_count": 87,
-      "created_at": 1712345678,
-      "metadata": null
-    }
-  ]
-}
-```
-
-Episodes are ordered newest first.
-
---
-
-**GET /models**
-
-Returns the parsed contents of `models.json`:
-```json
-[
-  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
-]
-```
-
-Returns `500` if the manifest file cannot be read or parsed.
+This allows `useChat` to write project assignment separately from rename
+operations.

 ## Caddy Configuration

-The Caddy reverse proxy on Mini PC 2 must have a handle block for each route
-prefix the client needs to reach. Current required blocks:
+Each route prefix needs a handle block in the Caddyfile on Mini PC 2:

 ```
-handle /chat* {
-    reverse_proxy localhost:4000
-}
-handle /sessions* {
-    reverse_proxy localhost:4000
-}
-handle /models* {
-    reverse_proxy localhost:4000
-}
-handle /projects* {
-    reverse_proxy localhost:4000
-}
+handle /chat*     { reverse_proxy localhost:4000 }
+handle /sessions* { reverse_proxy localhost:4000 }
+handle /models*   { reverse_proxy localhost:4000 }
+handle /projects* { reverse_proxy localhost:4000 }
 ```

-When adding new top-level routes to the orchestration service, add a matching
-block here and reload Caddy: `caddy reload --config /path/to/Caddyfile`
+After updating the Caddyfile, reload Caddy: `caddy reload --config /path/to/Caddyfile`
+
+For all HTTP endpoints, see `api-routes.md`.
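
The re-emission described under "SSE Stream Format" in `orchestration-service.md` can be sketched as a per-line translation. This is a minimal sketch under stated assumptions, not the service's actual code: `translateInferenceLine` is a hypothetical name, and the sketch assumes each SSE event arrives as one complete `data: ` line with a JSON payload in the shape shown in the doc.

```javascript
// Hypothetical helper (not from the repository): maps one SSE line from the
// inference service to the line forwarded to the client, or null when the
// line is consumed internally and nothing is forwarded.
function translateInferenceLine(line) {
  const prefix = 'data: ';
  if (!line.startsWith(prefix)) return null; // ignore blank or non-data lines

  const payload = line.slice(prefix.length).trim();
  if (payload === '[DONE]') return null; // sentinel is not forwarded

  const chunk = JSON.parse(payload);
  if (chunk.done) {
    // Final event: pass model name and token count through unchanged.
    const doneEvent = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
    return prefix + JSON.stringify(doneEvent);
  }

  // Intermediate chunk: rename `response` to `text` for the client.
  return prefix + JSON.stringify({ text: chunk.response });
}
```

A route handler could write each non-null result to the response and call `res.end()` once the done event has been forwarded; how the real handler buffers partial lines is not covered by this sketch.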