Updated documentation, streaming chat and chat history the update highlights

2026-04-05 23:56:18 -07:00
parent 4bd84ded04
commit 1f824c097d
2 changed files with 123 additions and 30 deletions
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -15,10 +15,13 @@ or inference services — all traffic flows through orchestration.
 ## Dependencies

 - `express` — HTTP API
- `node-fetch` — inter-service HTTP communication
+- `node-fetch` — inter-service HTTP communication (memory service client only)
 - `dotenv` — environment variable loading
 - `@nexusai/shared` — shared utilities

+> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
+> service clients use Node.js built-in `fetch`.
+
 ## Environment Variables

 | Variable | Required | Default | Description |
@@ -27,16 +30,20 @@ or inference services — all traffic flows through orchestration.
 | MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
 | EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
 | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
+| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |

 ## Internal Structure
 src/
 ├── services/
-│   ├── memory.js      # HTTP wrapper functions for memory service calls
-│   └── inference.js   # HTTP wrapper functions for inference service calls
+│   ├── memory.js      # HTTP client for memory service
+│   ├── inference.js   # HTTP client for inference service
+│   ├── embedding.js   # HTTP client for embedding service
+│   └── qdrant.js      # HTTP client for Qdrant vector search
 ├── chat/
 │   └── index.js       # Core pipeline logic — context assembly and coordination
 ├── routes/
-│   └── chat.js        # Express route handlers
+│   ├── chat.js        # POST /chat and POST /chat/stream route handlers
+│   └── sessions.js    # GET /sessions/:sessionId/history route handler
 └── index.js           # Express app entry point

 The `services/` layer wraps all downstream HTTP calls in named functions,
@@ -45,40 +52,71 @@ URL or endpoint changes have a single place to be updated.

 ## Chat Pipeline

-When a request hits `POST /chat`, the following steps run in order:
+Both `POST /chat` and `POST /chat/stream` share the same context assembly
+steps. The only difference is how the inference response is delivered to
+the client.

 1. **Session resolution** — looks up the session by `externalId` in the memory
   service. If not found, auto-creates a new session. Clients can generate a
   UUID for new conversations and pass it directly — no pre-creation step needed.

-2. **Memory retrieval** — fetches the most recent episodes for the session
-   (default: 10) from the memory service to use as conversational context.
+2. **Recent episode retrieval** — fetches the most recent episodes for the session
+   (default: 10) from the memory service.

-3. **Prompt assembly** — combines a system prompt, the retrieved episodes, and
-   the current user message into a single prompt string.
+3. **Semantic search** — embeds the user message via the embedding service, then
+   queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
+   Results are deduplicated against the recent episode set using a `Set` of IDs.
+   Full episode content is fetched from the memory service by ID. This step is
+   non-critical — if it fails, a warning is logged and the pipeline continues with
+   recency-only context.

-4. **Inference** — sends the assembled prompt to the inference service and
-   waits for the response.
+4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
+   recent episodes, and the current user message into a single prompt string.

-5. **Episode write** — writes the new exchange (user message + AI response)
-   back to the memory service as a fire-and-forget operation. The client
-   receives the response immediately without waiting for the write to complete.
+5. **Inference** — sends the assembled prompt to the inference service. `/chat`
+   awaits the full response; `/chat/stream` opens an SSE connection and pipes
+   chunks to the client as they arrive.

-6. **Response** — returns the AI response, model name, session ID, and token
+6. **Episode write** — writes the new exchange (user message + AI response)
+   back to the memory service as a fire-and-forget operation. For streaming,
+   the full response text is accumulated across chunks before writing.
+
+7. **Response** — returns the AI response, model name, session ID, and token
   count to the client.

 ## Prompt Structure
-
-The prompt sent to the inference service follows this structure:
 [System prompt]
-Here are some relevant memories from your past conversations:
+Here are some relevant memories from earlier conversations:
+User: {past user message}
+Assistant: {past ai response}
+... (up to 5 semantic episodes)
+Here is the recent conversation history:
 User: {past user message}
 Assistant: {past ai response}
 ... (up to 10 recent episodes)
--- End of recent memories ---
+--- End of memories ---
 User: {current message}
 Assistant:

+Semantic episodes appear before recent episodes so the model encounters
+long-range relevant context before the immediate conversation flow.
+
+## SSE Stream Format
+
+The inference service emits chunks in this format:
+data: {"model":"companion:latest","response":"Hello","done":false}
+data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...}
+data: [DONE]
+
+The orchestration service re-emits to the client as:
+data: {"text":"Hello"}
+data: {"text":"!"}
+data: {"done":true}
+
+The `[DONE]` sentinel from the inference service is consumed internally
+and not forwarded. The client stream is terminated by `res.end()` after
+the `{"done":true}` event.
+
 ## Endpoints

 ### Health
@@ -91,7 +129,14 @@ Assistant:

 | Method | Path | Description |
 |---|---|---|
-| POST | /chat | Send a message and receive a response |
+| POST | /chat | Send a message and receive a complete response |
+| POST | /chat/stream | Send a message and receive a streaming SSE response |
+
+### Sessions
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /sessions/:sessionId/history | Get paginated episode history for a session |

 ---

@@ -120,13 +165,52 @@ Response:
 }
 ```

-| Field | Description |
-|---|---|
-| `sessionId` | Echo of the provided session ID |
-| `response` | The AI's response text |
-| `model` | Model name as reported by the inference service |
-| `tokenCount` | Combined prompt + completion token count |
+---

-> Note: If `sessionId` does not exist in the memory service, a new session
-> is automatically created. Clients can safely generate a UUID for new
-> conversations and pass it on the first message.
+**POST /chat/stream**
+
+Same request body as `POST /chat`.
+
+Response is a stream of Server-Sent Events. Each event contains a text
+delta. The stream ends with a `done` event.
+data: {"text":"Hello"}
+data: {"text":" Tim"}
+data: {"text":"!"}
+data: {"done":true}
+
+Clients should read the `text` field from each chunk and accumulate them
+to build the full response string. The connection is closed by the server
+after the `{"done":true}` event.
+
+---
+
+**GET /sessions/:sessionId/history**
+
+Returns paginated episode history for a session identified by its external ID.
+
+Query parameters:
+
+| Parameter | Default | Description |
+|---|---|---|
+| limit | 20 | Maximum number of episodes to return |
+| offset | 0 | Number of episodes to skip (for pagination) |
+
+Response:
+```json
+{
+  "sessionId": "your-session-uuid",
+  "episodes": [
+    {
+      "id": 42,
+      "session_id": 1,
+      "user_message": "Hello, my name is Tim.",
+      "ai_response": "Hello Tim! How can I help you today?",
+      "token_count": 87,
+      "created_at": 1712345678,
+      "metadata": null
+    }
+  ]
+}
+```
+
+Episodes are ordered newest first. Returns `404` if the session does not exist.