Updated documentation, streaming chat and chat history the update highlights
This commit is contained in:
@@ -15,10 +15,13 @@ or inference services — all traffic flows through orchestration.
|
||||
## Dependencies
|
||||
|
||||
- `express` — HTTP API
|
||||
- `node-fetch` — inter-service HTTP communication
|
||||
- `node-fetch` — inter-service HTTP communication (memory service client only)
|
||||
- `dotenv` — environment variable loading
|
||||
- `@nexusai/shared` — shared utilities
|
||||
|
||||
> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
|
||||
> service clients use Node.js built-in `fetch`.
|
||||
|
||||
## Environment Variables
|
||||
|
||||
| Variable | Required | Default | Description |
|
||||
@@ -27,16 +30,20 @@ or inference services — all traffic flows through orchestration.
|
||||
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
|
||||
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
|
||||
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
|
||||
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
|
||||
|
||||
## Internal Structure
|
||||
src/
|
||||
├── services/
|
||||
│ ├── memory.js # HTTP wrapper functions for memory service calls
|
||||
│ └── inference.js # HTTP wrapper functions for inference service calls
|
||||
│ ├── memory.js # HTTP client for memory service
|
||||
│ ├── inference.js # HTTP client for inference service
|
||||
│ ├── embedding.js # HTTP client for embedding service
|
||||
│ └── qdrant.js # HTTP client for Qdrant vector search
|
||||
├── chat/
|
||||
│ └── index.js # Core pipeline logic — context assembly and coordination
|
||||
├── routes/
|
||||
│ └── chat.js # Express route handlers
|
||||
│ ├── chat.js # POST /chat and POST /chat/stream route handlers
|
||||
│ └── sessions.js # GET /sessions/:sessionId/history route handler
|
||||
└── index.js # Express app entry point
|
||||
|
||||
The `services/` layer wraps all downstream HTTP calls in named functions,
|
||||
@@ -45,40 +52,71 @@ URL or endpoint changes have a single place to be updated.
|
||||
|
||||
## Chat Pipeline
|
||||
|
||||
When a request hits `POST /chat`, the following steps run in order:
|
||||
Both `POST /chat` and `POST /chat/stream` share the same context assembly
|
||||
steps. The only difference is how the inference response is delivered to
|
||||
the client.
|
||||
|
||||
1. **Session resolution** — looks up the session by `externalId` in the memory
|
||||
service. If not found, auto-creates a new session. Clients can generate a
|
||||
UUID for new conversations and pass it directly — no pre-creation step needed.
|
||||
|
||||
2. **Memory retrieval** — fetches the most recent episodes for the session
|
||||
(default: 10) from the memory service to use as conversational context.
|
||||
2. **Recent episode retrieval** — fetches the most recent episodes for the session
|
||||
(default: 10) from the memory service.
|
||||
|
||||
3. **Prompt assembly** — combines a system prompt, the retrieved episodes, and
|
||||
the current user message into a single prompt string.
|
||||
3. **Semantic search** — embeds the user message via the embedding service, then
|
||||
queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
|
||||
Results are deduplicated against the recent episode set using a `Set` of IDs.
|
||||
Full episode content is fetched from the memory service by ID. This step is
|
||||
non-critical — if it fails, a warning is logged and the pipeline continues with
|
||||
recency-only context.
|
||||
|
||||
4. **Inference** — sends the assembled prompt to the inference service and
|
||||
waits for the response.
|
||||
4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
|
||||
recent episodes, and the current user message into a single prompt string.
|
||||
|
||||
5. **Episode write** — writes the new exchange (user message + AI response)
|
||||
back to the memory service as a fire-and-forget operation. The client
|
||||
receives the response immediately without waiting for the write to complete.
|
||||
5. **Inference** — sends the assembled prompt to the inference service. `/chat`
|
||||
awaits the full response; `/chat/stream` opens an SSE connection and pipes
|
||||
chunks to the client as they arrive.
|
||||
|
||||
6. **Response** — returns the AI response, model name, session ID, and token
|
||||
6. **Episode write** — writes the new exchange (user message + AI response)
|
||||
back to the memory service as a fire-and-forget operation. For streaming,
|
||||
the full response text is accumulated across chunks before writing.
|
||||
|
||||
7. **Response** — returns the AI response, model name, session ID, and token
|
||||
count to the client.
|
||||
|
||||
## Prompt Structure
|
||||
|
||||
The prompt sent to the inference service follows this structure:
|
||||
[System prompt]
|
||||
Here are some relevant memories from your past conversations:
|
||||
Here are some relevant memories from earlier conversations:
|
||||
User: {past user message}
|
||||
Assistant: {past ai response}
|
||||
... (up to 5 semantic episodes)
|
||||
Here is the recent conversation history:
|
||||
User: {past user message}
|
||||
Assistant: {past ai response}
|
||||
... (up to 10 recent episodes)
|
||||
--- End of recent memories ---
|
||||
--- End of memories ---
|
||||
User: {current message}
|
||||
Assistant:
|
||||
|
||||
Semantic episodes appear before recent episodes so the model encounters
|
||||
long-range relevant context before the immediate conversation flow.
|
||||
|
||||
## SSE Stream Format
|
||||
|
||||
The inference service emits chunks in this format:
|
||||
data: {"model":"companion:latest","response":"Hello","done":false}
|
||||
data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...}
|
||||
data: [DONE]
|
||||
|
||||
The orchestration service re-emits to the client as:
|
||||
data: {"text":"Hello"}
|
||||
data: {"text":"!"}
|
||||
data: {"done":true}
|
||||
|
||||
The `[DONE]` sentinel from the inference service is consumed internally
|
||||
and not forwarded. The client stream is terminated by `res.end()` after
|
||||
the `{"done":true}` event.
|
||||
|
||||
## Endpoints
|
||||
|
||||
### Health
|
||||
@@ -91,7 +129,14 @@ Assistant:
|
||||
|
||||
| Method | Path | Description |
|
||||
|---|---|---|
|
||||
| POST | /chat | Send a message and receive a response |
|
||||
| POST | /chat | Send a message and receive a complete response |
|
||||
| POST | /chat/stream | Send a message and receive a streaming SSE response |
|
||||
|
||||
### Sessions
|
||||
|
||||
| Method | Path | Description |
|
||||
|---|---|---|
|
||||
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
|
||||
|
||||
---
|
||||
|
||||
@@ -120,13 +165,52 @@ Response:
|
||||
}
|
||||
```
|
||||
|
||||
| Field | Description |
|
||||
|---|---|
|
||||
| `sessionId` | Echo of the provided session ID |
|
||||
| `response` | The AI's response text |
|
||||
| `model` | Model name as reported by the inference service |
|
||||
| `tokenCount` | Combined prompt + completion token count |
|
||||
---
|
||||
|
||||
> Note: If `sessionId` does not exist in the memory service, a new session
|
||||
> is automatically created. Clients can safely generate a UUID for new
|
||||
> conversations and pass it on the first message.
|
||||
**POST /chat/stream**
|
||||
|
||||
Same request body as `POST /chat`.
|
||||
|
||||
Response is a stream of Server-Sent Events. Each event contains a text
|
||||
delta. The stream ends with a `done` event.
|
||||
data: {"text":"Hello"}
|
||||
data: {"text":" Tim"}
|
||||
data: {"text":"!"}
|
||||
data: {"done":true}
|
||||
|
||||
Clients should read the `text` field from each chunk and accumulate them
|
||||
to build the full response string. The connection is closed by the server
|
||||
after the `{"done":true}` event.
|
||||
|
||||
---
|
||||
|
||||
**GET /sessions/:sessionId/history**
|
||||
|
||||
Returns paginated episode history for a session identified by its external ID.
|
||||
|
||||
Query parameters:
|
||||
|
||||
| Parameter | Default | Description |
|
||||
|---|---|---|
|
||||
| limit | 20 | Maximum number of episodes to return |
|
||||
| offset | 0 | Number of episodes to skip (for pagination) |
|
||||
|
||||
Response:
|
||||
```json
|
||||
{
|
||||
"sessionId": "your-session-uuid",
|
||||
"episodes": [
|
||||
{
|
||||
"id": 42,
|
||||
"session_id": 1,
|
||||
"user_message": "Hello, my name is Tim.",
|
||||
"ai_response": "Hello Tim! How can I help you today?",
|
||||
"token_count": 87,
|
||||
"created_at": 1712345678,
|
||||
"metadata": null
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Episodes are ordered newest first. Returns `404` if the session does not exist.
|
||||
@@ -65,3 +65,12 @@ Default pagination and result limits for SQLite episode queries.
|
||||
| `DEFAULT_RECENT_LIMIT` | `10` | Default number of recent episodes to retrieve |
|
||||
| `DEFAULT_PAGE_SIZE` | `20` | Default episodes per page for paginated queries |
|
||||
| `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return |
|
||||
|
||||
#### `SERVICES`
|
||||
|
||||
Default URLs for inter-service communication. Used as fallback values
|
||||
when the corresponding environment variable is not set.
|
||||
|
||||
| Key | Value | Description |
|
||||
|---|---|---|
|
||||
| `EMBEDDING_URL` | `http://localhost:3003` | Fallback embedding service URL |
|
||||
Reference in New Issue
Block a user