Updated documentation; streaming chat and chat history are the update highlights

Storme-bit
2026-04-05 23:56:18 -07:00
parent 4bd84ded04
commit 1f824c097d
2 changed files with 123 additions and 30 deletions


@@ -15,10 +15,13 @@ or inference services — all traffic flows through orchestration.
## Dependencies
- `express` — HTTP API
- `node-fetch` — inter-service HTTP communication
- `node-fetch` — inter-service HTTP communication (memory service client only)
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities
> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
> service clients use Node.js built-in `fetch`.
## Environment Variables
| Variable | Required | Default | Description |
@@ -27,16 +30,20 @@ or inference services — all traffic flows through orchestration.
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
## Internal Structure
src/
├── services/
│ ├── memory.js # HTTP wrapper functions for memory service calls
│ └── inference.js # HTTP wrapper functions for inference service calls
│ ├── memory.js # HTTP client for memory service
│ ├── inference.js # HTTP client for inference service
│ ├── embedding.js # HTTP client for embedding service
│ └── qdrant.js # HTTP client for Qdrant vector search
├── chat/
│ └── index.js # Core pipeline logic — context assembly and coordination
├── routes/
│ └── chat.js # Express route handlers
│ ├── chat.js # POST /chat and POST /chat/stream route handlers
│ └── sessions.js # GET /sessions/:sessionId/history route handler
└── index.js # Express app entry point
The `services/` layer wraps all downstream HTTP calls in named functions,
@@ -45,40 +52,71 @@ URL or endpoint changes have a single place to be updated.
## Chat Pipeline
When a request hits `POST /chat`, the following steps run in order:
Both `POST /chat` and `POST /chat/stream` share the same context assembly
steps. The only difference is how the inference response is delivered to
the client.
1. **Session resolution** — looks up the session by `externalId` in the memory
service. If not found, auto-creates a new session. Clients can generate a
UUID for new conversations and pass it directly — no pre-creation step needed.
2. **Memory retrieval** — fetches the most recent episodes for the session
(default: 10) from the memory service to use as conversational context.
2. **Recent episode retrieval** — fetches the most recent episodes for the session
(default: 10) from the memory service.
3. **Prompt assembly** — combines a system prompt, the retrieved episodes, and
the current user message into a single prompt string.
3. **Semantic search** — embeds the user message via the embedding service, then
queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
Results are deduplicated against the recent episode set using a `Set` of IDs.
Full episode content is fetched from the memory service by ID. This step is
non-critical — if it fails, a warning is logged and the pipeline continues with
recency-only context.
4. **Inference** — sends the assembled prompt to the inference service and
waits for the response.
4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
recent episodes, and the current user message into a single prompt string.
5. **Episode write** — writes the new exchange (user message + AI response)
back to the memory service as a fire-and-forget operation. The client
receives the response immediately without waiting for the write to complete.
5. **Inference** — sends the assembled prompt to the inference service. `/chat`
awaits the full response; `/chat/stream` opens an SSE connection and pipes
chunks to the client as they arrive.
6. **Response** — returns the AI response, model name, session ID, and token
6. **Episode write** — writes the new exchange (user message + AI response)
back to the memory service as a fire-and-forget operation. For streaming,
the full response text is accumulated across chunks before writing.
7. **Response** — returns the AI response, model name, session ID, and token
count to the client.
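
The deduplication and fallback behaviour described in the semantic search step could be sketched roughly as follows. This is an illustration only, not the code in `chat/index.js`: the function names, the `{ id, score }` hit shape, and the injected `searchFn` are assumptions, and the sketch assumes the top-5 limit and 0.75 score threshold were already applied by the Qdrant query.

```javascript
// Hypothetical helper: drop semantic hits that are already in the recent set.
// `hits` is assumed to look like [{ id, score }], `recentEpisodes` like [{ id, ... }].
function dedupeSemanticHits(hits, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((ep) => ep.id));
  return hits.filter((hit) => !recentIds.has(hit.id)).map((hit) => hit.id);
}

// Hypothetical wrapper: semantic search is non-critical, so any failure is
// logged as a warning and the pipeline continues with recency-only context.
async function semanticIdsOrEmpty(searchFn, recentEpisodes) {
  try {
    const hits = await searchFn(); // assumed to embed the message and query Qdrant
    return dedupeSemanticHits(hits, recentEpisodes);
  } catch (err) {
    console.warn('Semantic search failed, using recent episodes only:', err.message);
    return [];
  }
}
```

The returned IDs would then be used to fetch full episode content from the memory service, as step 3 describes.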
## Prompt Structure
The prompt sent to the inference service follows this structure:
[System prompt]
Here are some relevant memories from your past conversations:
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 semantic episodes)
Here is the recent conversation history:
User: {past user message}
Assistant: {past ai response}
... (up to 10 recent episodes)
--- End of recent memories ---
--- End of memories ---
User: {current message}
Assistant:
Semantic episodes appear before recent episodes so the model encounters
long-range relevant context before the immediate conversation flow.
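
A minimal sketch of how a prompt with this layout might be assembled is shown here. The function names are hypothetical, the `user_message` and `ai_response` field names are borrowed from the history response documented further down, and the exact line breaks, plus the choice to omit a section header when its list is empty, are assumptions rather than the service's confirmed behaviour.

```javascript
// Hypothetical formatter: one "User:/Assistant:" pair per episode.
function formatEpisodes(episodes) {
  return episodes
    .map((ep) => `User: ${ep.user_message}\nAssistant: ${ep.ai_response}`)
    .join('\n');
}

// Hypothetical builder: semantic episodes first, then recent episodes,
// then the end marker and the current user message.
function buildPrompt({ systemPrompt, semanticEpisodes = [], recentEpisodes = [], userMessage }) {
  const parts = [systemPrompt];
  if (semanticEpisodes.length > 0) {
    parts.push('Here are some relevant memories from earlier conversations:');
    parts.push(formatEpisodes(semanticEpisodes));
  }
  if (recentEpisodes.length > 0) {
    parts.push('Here is the recent conversation history:');
    parts.push(formatEpisodes(recentEpisodes));
  }
  parts.push('--- End of memories ---');
  parts.push(`User: ${userMessage}`);
  parts.push('Assistant:');
  return parts.join('\n');
}
```

Ending the string with a bare `Assistant:` line leaves the model to continue from that point.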
## SSE Stream Format
The inference service emits chunks in this format:
data: {"model":"companion:latest","response":"Hello","done":false}
data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...}
data: [DONE]
The orchestration service re-emits to the client as:
data: {"text":"Hello"}
data: {"text":"!"}
data: {"done":true}
The `[DONE]` sentinel from the inference service is consumed internally
and not forwarded. The client stream is terminated by `res.end()` after
the `{"done":true}` event.
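
The translation between the two formats could look something like the sketch below. `translateInferenceLine` is a hypothetical name, and emitting the `{"done":true}` event on the chunk that carries `"done":true` (rather than on the `[DONE]` sentinel) is an assumption chosen to match the example output above.

```javascript
// Hypothetical translator: takes one line from the inference stream and
// returns the SSE frames (if any) to write to the client.
function translateInferenceLine(line) {
  if (!line.startsWith('data: ')) return [];
  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return []; // sentinel is consumed, not forwarded

  const chunk = JSON.parse(payload);
  const frames = [];
  if (chunk.response) {
    frames.push(`data: ${JSON.stringify({ text: chunk.response })}\n\n`);
  }
  if (chunk.done) {
    frames.push(`data: ${JSON.stringify({ done: true })}\n\n`);
  }
  return frames;
}
```

In a route handler, each returned frame would be passed to `res.write()`, with `res.end()` called once the done frame has been sent.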
## Endpoints
### Health
@@ -91,7 +129,14 @@ Assistant:
| Method | Path | Description |
|---|---|---|
| POST | /chat | Send a message and receive a response |
| POST | /chat | Send a message and receive a complete response |
| POST | /chat/stream | Send a message and receive a streaming SSE response |
### Sessions
| Method | Path | Description |
|---|---|---|
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
---
@@ -120,13 +165,52 @@ Response:
}
```
| Field | Description |
|---|---|
| `sessionId` | Echo of the provided session ID |
| `response` | The AI's response text |
| `model` | Model name as reported by the inference service |
| `tokenCount` | Combined prompt + completion token count |
---
> Note: If `sessionId` does not exist in the memory service, a new session
> is automatically created. Clients can safely generate a UUID for new
> conversations and pass it on the first message.
**POST /chat/stream**
Same request body as `POST /chat`.
Response is a stream of Server-Sent Events. Each event contains a text
delta. The stream ends with a `done` event.
data: {"text":"Hello"}
data: {"text":" Tim"}
data: {"text":"!"}
data: {"done":true}
Clients should read the `text` field from each chunk and accumulate them
to build the full response string. The connection is closed by the server
after the `{"done":true}` event.
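
On the client side, the accumulation described above might be sketched as follows. `accumulateSseText` is a hypothetical helper that works on text that has already been fully received; a real client reading from the network would also need to buffer partial lines between reads.

```javascript
// Hypothetical client helper: concatenates `text` deltas from raw SSE text
// and reports whether the `done` event was seen.
function accumulateSseText(raw) {
  let text = '';
  let done = false;
  for (const line of raw.split('\n')) {
    if (!line.startsWith('data: ')) continue; // skip blank separator lines
    const event = JSON.parse(line.slice('data: '.length));
    if (event.done) {
      done = true;
      break;
    }
    if (typeof event.text === 'string') text += event.text;
  }
  return { text, done };
}
```

A `done` value of `false` after the connection closes would suggest the stream was cut off before the final event.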
---
**GET /sessions/:sessionId/history**
Returns paginated episode history for a session identified by its external ID.
Query parameters:
| Parameter | Default | Description |
|---|---|---|
| limit | 20 | Maximum number of episodes to return |
| offset | 0 | Number of episodes to skip (for pagination) |
Response:
```json
{
"sessionId": "your-session-uuid",
"episodes": [
{
"id": 42,
"session_id": 1,
"user_message": "Hello, my name is Tim.",
"ai_response": "Hello Tim! How can I help you today?",
"token_count": 87,
"created_at": 1712345678,
"metadata": null
}
]
}
```
Episodes are ordered newest first. Returns `404` if the session does not exist.
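
As an illustration of the query parameter defaults, a handler might normalise `limit` and `offset` along these lines. `parsePagination` is a hypothetical helper, and falling back to the defaults for missing or invalid values is an assumption; the actual route may validate differently.

```javascript
// Hypothetical helper: reads `limit` and `offset` from an Express-style
// query object, falling back to the documented defaults (20 and 0).
function parsePagination(query) {
  const limit = Number.parseInt(query.limit, 10);
  const offset = Number.parseInt(query.offset, 10);
  return {
    limit: Number.isInteger(limit) && limit > 0 ? limit : 20,
    offset: Number.isInteger(offset) && offset >= 0 ? offset : 0,
  };
}
```

For example, a request to `/sessions/:sessionId/history?limit=5&offset=10` would yield `{ limit: 5, offset: 10 }`, which corresponds to the third page of five episodes, newest first.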