Updated documentation, streaming chat and chat history the update highlights

2026-04-05 23:56:18 -07:00
parent 4bd84ded04
commit 1f824c097d
2 changed files with 123 additions and 30 deletions
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -15,10 +15,13 @@ or inference services — all traffic flows through orchestration.
 ## Dependencies
 - `express` — HTTP API
- `node-fetch` — inter-service HTTP communication
+- `node-fetch` — inter-service HTTP communication (memory service client only)
 - `dotenv` — environment variable loading
 - `@nexusai/shared` — shared utilities
 > `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
 > service clients use Node.js built-in `fetch`.
 ## Environment Variables
 | Variable | Required | Default | Description |
@@ -27,16 +30,20 @@ or inference services — all traffic flows through orchestration.
 | MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
 | EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
 | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 ## Internal Structure
 src/
 ├── services/
-│   ├── memory.js      # HTTP wrapper functions for memory service calls
+│   ├── memory.js      # HTTP client for memory service
-│   └── inference.js   # HTTP wrapper functions for inference service calls
+│   ├── inference.js   # HTTP client for inference service
 │   ├── embedding.js   # HTTP client for embedding service
 │   └── qdrant.js      # HTTP client for Qdrant vector search
 ├── chat/
 │   └── index.js       # Core pipeline logic — context assembly and coordination
 ├── routes/
-│   └── chat.js        # Express route handlers
+│   ├── chat.js        # POST /chat and POST /chat/stream route handlers
 │   └── sessions.js    # GET /sessions/:sessionId/history route handler
 └── index.js           # Express app entry point
 The `services/` layer wraps all downstream HTTP calls in named functions,
@@ -45,40 +52,71 @@ URL or endpoint changes have a single place to be updated.
 ## Chat Pipeline
-When a request hits `POST /chat`, the following steps run in order:
+Both `POST /chat` and `POST /chat/stream` share the same context assembly
 steps. The only difference is how the inference response is delivered to
 the client.
 1. **Session resolution** — looks up the session by `externalId` in the memory
   service. If not found, auto-creates a new session. Clients can generate a
   UUID for new conversations and pass it directly — no pre-creation step needed.
-2. **Memory retrieval** — fetches the most recent episodes for the session
+2. **Recent episode retrieval** — fetches the most recent episodes for the session
-   (default: 10) from the memory service to use as conversational context.
+   (default: 10) from the memory service.
-3. **Prompt assembly** — combines a system prompt, the retrieved episodes, and
+3. **Semantic search** — embeds the user message via the embedding service, then
-   the current user message into a single prompt string.
+   queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
   Results are deduplicated against the recent episode set using a `Set` of IDs.
   Full episode content is fetched from the memory service by ID. This step is
   non-critical — if it fails, a warning is logged and the pipeline continues with
   recency-only context.
-4. **Inference** — sends the assembled prompt to the inference service and
+4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
-   waits for the response.
+   recent episodes, and the current user message into a single prompt string.
-5. **Episode write** — writes the new exchange (user message + AI response)
+5. **Inference** — sends the assembled prompt to the inference service. `/chat`
-   back to the memory service as a fire-and-forget operation. The client
+   awaits the full response; `/chat/stream` opens an SSE connection and pipes
-   receives the response immediately without waiting for the write to complete.
+   chunks to the client as they arrive.
-6. **Response** — returns the AI response, model name, session ID, and token
+6. **Episode write** — writes the new exchange (user message + AI response)
   back to the memory service as a fire-and-forget operation. For streaming,
   the full response text is accumulated across chunks before writing.
 7. **Response** — returns the AI response, model name, session ID, and token
   count to the client.
 ## Prompt Structure
 The prompt sent to the inference service follows this structure:
 [System prompt]
-Here are some relevant memories from your past conversations:
+Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
 ... (up to 5 semantic episodes)
 Here is the recent conversation history:
 User: {past user message}
 Assistant: {past ai response}
 ... (up to 10 recent episodes)
--- End of recent memories ---
+--- End of memories ---
 User: {current message}
 Assistant:
 Semantic episodes appear before recent episodes so the model encounters
 long-range relevant context before the immediate conversation flow.
 ## SSE Stream Format
 The inference service emits chunks in this format:
 data: {"model":"companion:latest","response":"Hello","done":false}
 data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...}
 data: [DONE]
 The orchestration service re-emits to the client as:
 data: {"text":"Hello"}
 data: {"text":"!"}
 data: {"done":true}
 The `[DONE]` sentinel from the inference service is consumed internally
 and not forwarded. The client stream is terminated by `res.end()` after
 the `{"done":true}` event.
 ## Endpoints
 ### Health
@@ -91,7 +129,14 @@ Assistant:
 | Method | Path | Description |
 |---|---|---|
-| POST | /chat | Send a message and receive a response |
+| POST | /chat | Send a message and receive a complete response |
 | POST | /chat/stream | Send a message and receive a streaming SSE response |
 ### Sessions
 | Method | Path | Description |
 |---|---|---|
 | GET | /sessions/:sessionId/history | Get paginated episode history for a session |
 ---
@@ -120,13 +165,52 @@ Response:
 }
 ```
-| Field | Description |
+---
 |---|---|
 | `sessionId` | Echo of the provided session ID |
 | `response` | The AI's response text |
 | `model` | Model name as reported by the inference service |
 | `tokenCount` | Combined prompt + completion token count |
-> Note: If `sessionId` does not exist in the memory service, a new session
+**POST /chat/stream**
-> is automatically created. Clients can safely generate a UUID for new
+
-> conversations and pass it on the first message.
+Same request body as `POST /chat`.
 Response is a stream of Server-Sent Events. Each event contains a text
 delta. The stream ends with a `done` event.
 data: {"text":"Hello"}
 data: {"text":" Tim"}
 data: {"text":"!"}
 data: {"done":true}
 Clients should read the `text` field from each chunk and accumulate them
 to build the full response string. The connection is closed by the server
 after the `{"done":true}` event.
 ---
 **GET /sessions/:sessionId/history**
 Returns paginated episode history for a session identified by its external ID.
 Query parameters:
 | Parameter | Default | Description |
 |---|---|---|
 | limit | 20 | Maximum number of episodes to return |
 | offset | 0 | Number of episodes to skip (for pagination) |
 Response:
 ```json
 {
  "sessionId": "your-session-uuid",
  "episodes": [
    {
      "id": 42,
      "session_id": 1,
      "user_message": "Hello, my name is Tim.",
      "ai_response": "Hello Tim! How can I help you today?",
      "token_count": 87,
      "created_at": 1712345678,
      "metadata": null
    }
  ]
 }
 ```
 Episodes are ordered newest first. Returns `404` if the session does not exist.
--- a/docs/services/shared.md
+++ b/docs/services/shared.md
@@ -65,3 +65,12 @@ Default pagination and result limits for SQLite episode queries.
 | `DEFAULT_RECENT_LIMIT` | `10` | Default number of recent episodes to retrieve |
 | `DEFAULT_PAGE_SIZE` | `20` | Default episodes per page for paginated queries |
 | `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return |
 #### `SERVICES`
 Default URLs for inter-service communication. Used as fallback values
 when the corresponding environment variable is not set.
 | Key | Value | Description |
 |---|---|---|
 | `EMBEDDING_URL` | `http://localhost:3003` | Fallback embedding service URL |