diff --git a/docs/services/orchestration-service.md b/docs/services/orchestration-service.md index e7e61f2..8044cd8 100644 --- a/docs/services/orchestration-service.md +++ b/docs/services/orchestration-service.md @@ -15,10 +15,13 @@ or inference services — all traffic flows through orchestration. ## Dependencies - `express` — HTTP API -- `node-fetch` — inter-service HTTP communication +- `node-fetch` — inter-service HTTP communication (memory service client only) - `dotenv` — environment variable loading - `@nexusai/shared` — shared utilities +> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other +> service clients use Node.js built-in `fetch`. + ## Environment Variables | Variable | Required | Default | Description | @@ -27,16 +30,20 @@ or inference services — all traffic flows through orchestration. | MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL | | EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL | | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL | +| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search | ## Internal Structure src/ ├── services/ -│ ├── memory.js # HTTP wrapper functions for memory service calls -│ └── inference.js # HTTP wrapper functions for inference service calls +│ ├── memory.js # HTTP client for memory service +│ ├── inference.js # HTTP client for inference service +│ ├── embedding.js # HTTP client for embedding service +│ └── qdrant.js # HTTP client for Qdrant vector search ├── chat/ │ └── index.js # Core pipeline logic — context assembly and coordination ├── routes/ -│ └── chat.js # Express route handlers +│ ├── chat.js # POST /chat and POST /chat/stream route handlers +│ └── sessions.js # GET /sessions/:sessionId/history route handler └── index.js # Express app entry point The `services/` layer wraps all downstream HTTP calls in named functions, @@ -45,40 +52,71 @@ URL or endpoint changes have a single place to be updated. ## Chat Pipeline -When a request hits `POST /chat`, the following steps run in order: +Both `POST /chat` and `POST /chat/stream` share the same context assembly +steps. The only difference is how the inference response is delivered to +the client. 1. **Session resolution** — looks up the session by `externalId` in the memory service. If not found, auto-creates a new session. Clients can generate a UUID for new conversations and pass it directly — no pre-creation step needed. -2. **Memory retrieval** — fetches the most recent episodes for the session - (default: 10) from the memory service to use as conversational context. +2. **Recent episode retrieval** — fetches the most recent episodes for the session + (default: 10) from the memory service. -3. **Prompt assembly** — combines a system prompt, the retrieved episodes, and - the current user message into a single prompt string. +3. **Semantic search** — embeds the user message via the embedding service, then + queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75). + Results are deduplicated against the recent episode set using a `Set` of IDs. + Full episode content is fetched from the memory service by ID. This step is + non-critical — if it fails, a warning is logged and the pipeline continues with + recency-only context. -4. **Inference** — sends the assembled prompt to the inference service and - waits for the response. +4. **Prompt assembly** — combines the system prompt, semantic episodes (if any), + recent episodes, and the current user message into a single prompt string. -5. **Episode write** — writes the new exchange (user message + AI response) - back to the memory service as a fire-and-forget operation. The client - receives the response immediately without waiting for the write to complete. +5. **Inference** — sends the assembled prompt to the inference service. `/chat` + awaits the full response; `/chat/stream` opens an SSE connection and pipes + chunks to the client as they arrive. -6. **Response** — returns the AI response, model name, session ID, and token +6. **Episode write** — writes the new exchange (user message + AI response) + back to the memory service as a fire-and-forget operation. For streaming, + the full response text is accumulated across chunks before writing. + +7. **Response** — returns the AI response, model name, session ID, and token count to the client. ## Prompt Structure - -The prompt sent to the inference service follows this structure: [System prompt] -Here are some relevant memories from your past conversations: +Here are some relevant memories from earlier conversations: +User: {past user message} +Assistant: {past ai response} +... (up to 5 semantic episodes) +Here is the recent conversation history: User: {past user message} Assistant: {past ai response} ... (up to 10 recent episodes) ---- End of recent memories --- +--- End of memories --- User: {current message} Assistant: +Semantic episodes appear before recent episodes so the model encounters +long-range relevant context before the immediate conversation flow. + +## SSE Stream Format + +The inference service emits chunks in this format: +data: {"model":"companion:latest","response":"Hello","done":false} +data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...} +data: [DONE] + +The orchestration service re-emits to the client as: +data: {"text":"Hello"} +data: {"text":"!"} +data: {"done":true} + +The `[DONE]` sentinel from the inference service is consumed internally +and not forwarded. The client stream is terminated by `res.end()` after +the `{"done":true}` event. + ## Endpoints ### Health @@ -91,7 +129,14 @@ Assistant: | Method | Path | Description | |---|---|---| -| POST | /chat | Send a message and receive a response | +| POST | /chat | Send a message and receive a complete response | +| POST | /chat/stream | Send a message and receive a streaming SSE response | + +### Sessions + +| Method | Path | Description | +|---|---|---| +| GET | /sessions/:sessionId/history | Get paginated episode history for a session | --- @@ -120,13 +165,52 @@ Response: } ``` -| Field | Description | -|---|---| -| `sessionId` | Echo of the provided session ID | -| `response` | The AI's response text | -| `model` | Model name as reported by the inference service | -| `tokenCount` | Combined prompt + completion token count | +--- -> Note: If `sessionId` does not exist in the memory service, a new session -> is automatically created. Clients can safely generate a UUID for new -> conversations and pass it on the first message. \ No newline at end of file +**POST /chat/stream** + +Same request body as `POST /chat`. + +Response is a stream of Server-Sent Events. Each event contains a text +delta. The stream ends with a `done` event. +data: {"text":"Hello"} +data: {"text":" Tim"} +data: {"text":"!"} +data: {"done":true} + +Clients should read the `text` field from each chunk and accumulate them +to build the full response string. The connection is closed by the server +after the `{"done":true}` event. + +--- + +**GET /sessions/:sessionId/history** + +Returns paginated episode history for a session identified by its external ID. + +Query parameters: + +| Parameter | Default | Description | +|---|---|---| +| limit | 20 | Maximum number of episodes to return | +| offset | 0 | Number of episodes to skip (for pagination) | + +Response: +```json +{ + "sessionId": "your-session-uuid", + "episodes": [ + { + "id": 42, + "session_id": 1, + "user_message": "Hello, my name is Tim.", + "ai_response": "Hello Tim! How can I help you today?", + "token_count": 87, + "created_at": 1712345678, + "metadata": null + } + ] +} +``` + +Episodes are ordered newest first. Returns `404` if the session does not exist. \ No newline at end of file diff --git a/docs/services/shared.md b/docs/services/shared.md index 7fb248a..098f066 100644 --- a/docs/services/shared.md +++ b/docs/services/shared.md @@ -64,4 +64,13 @@ Default pagination and result limits for SQLite episode queries. |---|---|---| | `DEFAULT_RECENT_LIMIT` | `10` | Default number of recent episodes to retrieve | | `DEFAULT_PAGE_SIZE` | `20` | Default episodes per page for paginated queries | -| `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return | \ No newline at end of file +| `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return | + +#### `SERVICES` + +Default URLs for inter-service communication. Used as fallback values +when the corresponding environment variable is not set. + +| Key | Value | Description | +|---|---|---| +| `EMBEDDING_URL` | `http://localhost:3003` | Fallback embedding service URL | \ No newline at end of file