Updated documentation, streaming chat and chat history the update highlights
This commit is contained in:
@@ -15,10 +15,13 @@ or inference services — all traffic flows through orchestration.
|
|||||||
## Dependencies
|
## Dependencies
|
||||||
|
|
||||||
- `express` — HTTP API
|
- `express` — HTTP API
|
||||||
- `node-fetch` — inter-service HTTP communication
|
- `node-fetch` — inter-service HTTP communication (memory service client only)
|
||||||
- `dotenv` — environment variable loading
|
- `dotenv` — environment variable loading
|
||||||
- `@nexusai/shared` — shared utilities
|
- `@nexusai/shared` — shared utilities
|
||||||
|
|
||||||
|
> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
|
||||||
|
> service clients use Node.js built-in `fetch`.
|
||||||
|
|
||||||
## Environment Variables
|
## Environment Variables
|
||||||
|
|
||||||
| Variable | Required | Default | Description |
|
| Variable | Required | Default | Description |
|
||||||
@@ -27,16 +30,20 @@ or inference services — all traffic flows through orchestration.
|
|||||||
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
|
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
|
||||||
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
|
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
|
||||||
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
|
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
|
||||||
|
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
|
||||||
|
|
||||||
## Internal Structure
|
## Internal Structure
|
||||||
src/
|
src/
|
||||||
├── services/
|
├── services/
|
||||||
│ ├── memory.js # HTTP wrapper functions for memory service calls
|
│ ├── memory.js # HTTP client for memory service
|
||||||
│ └── inference.js # HTTP wrapper functions for inference service calls
|
│ ├── inference.js # HTTP client for inference service
|
||||||
|
│ ├── embedding.js # HTTP client for embedding service
|
||||||
|
│ └── qdrant.js # HTTP client for Qdrant vector search
|
||||||
├── chat/
|
├── chat/
|
||||||
│ └── index.js # Core pipeline logic — context assembly and coordination
|
│ └── index.js # Core pipeline logic — context assembly and coordination
|
||||||
├── routes/
|
├── routes/
|
||||||
│ └── chat.js # Express route handlers
|
│ ├── chat.js # POST /chat and POST /chat/stream route handlers
|
||||||
|
│ └── sessions.js # GET /sessions/:sessionId/history route handler
|
||||||
└── index.js # Express app entry point
|
└── index.js # Express app entry point
|
||||||
|
|
||||||
The `services/` layer wraps all downstream HTTP calls in named functions,
|
The `services/` layer wraps all downstream HTTP calls in named functions,
|
||||||
@@ -45,40 +52,71 @@ URL or endpoint changes have a single place to be updated.
|
|||||||
|
|
||||||
## Chat Pipeline
|
## Chat Pipeline
|
||||||
|
|
||||||
When a request hits `POST /chat`, the following steps run in order:
|
Both `POST /chat` and `POST /chat/stream` share the same context assembly
|
||||||
|
steps. The only difference is how the inference response is delivered to
|
||||||
|
the client.
|
||||||
|
|
||||||
1. **Session resolution** — looks up the session by `externalId` in the memory
|
1. **Session resolution** — looks up the session by `externalId` in the memory
|
||||||
service. If not found, auto-creates a new session. Clients can generate a
|
service. If not found, auto-creates a new session. Clients can generate a
|
||||||
UUID for new conversations and pass it directly — no pre-creation step needed.
|
UUID for new conversations and pass it directly — no pre-creation step needed.
|
||||||
|
|
||||||
2. **Memory retrieval** — fetches the most recent episodes for the session
|
2. **Recent episode retrieval** — fetches the most recent episodes for the session
|
||||||
(default: 10) from the memory service to use as conversational context.
|
(default: 10) from the memory service.
|
||||||
|
|
||||||
3. **Prompt assembly** — combines a system prompt, the retrieved episodes, and
|
3. **Semantic search** — embeds the user message via the embedding service, then
|
||||||
the current user message into a single prompt string.
|
queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
|
||||||
|
Results are deduplicated against the recent episode set using a `Set` of IDs.
|
||||||
|
Full episode content is fetched from the memory service by ID. This step is
|
||||||
|
non-critical — if it fails, a warning is logged and the pipeline continues with
|
||||||
|
recency-only context.
|
||||||
|
|
||||||
4. **Inference** — sends the assembled prompt to the inference service and
|
4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
|
||||||
waits for the response.
|
recent episodes, and the current user message into a single prompt string.
|
||||||
|
|
||||||
5. **Episode write** — writes the new exchange (user message + AI response)
|
5. **Inference** — sends the assembled prompt to the inference service. `/chat`
|
||||||
back to the memory service as a fire-and-forget operation. The client
|
awaits the full response; `/chat/stream` opens an SSE connection and pipes
|
||||||
receives the response immediately without waiting for the write to complete.
|
chunks to the client as they arrive.
|
||||||
|
|
||||||
6. **Response** — returns the AI response, model name, session ID, and token
|
6. **Episode write** — writes the new exchange (user message + AI response)
|
||||||
|
back to the memory service as a fire-and-forget operation. For streaming,
|
||||||
|
the full response text is accumulated across chunks before writing.
|
||||||
|
|
||||||
|
7. **Response** — returns the AI response, model name, session ID, and token
|
||||||
count to the client.
|
count to the client.
|
||||||
|
|
||||||
## Prompt Structure
|
## Prompt Structure
|
||||||
|
|
||||||
The prompt sent to the inference service follows this structure:
|
|
||||||
[System prompt]
|
[System prompt]
|
||||||
Here are some relevant memories from your past conversations:
|
Here are some relevant memories from earlier conversations:
|
||||||
|
User: {past user message}
|
||||||
|
Assistant: {past ai response}
|
||||||
|
... (up to 5 semantic episodes)
|
||||||
|
Here is the recent conversation history:
|
||||||
User: {past user message}
|
User: {past user message}
|
||||||
Assistant: {past ai response}
|
Assistant: {past ai response}
|
||||||
... (up to 10 recent episodes)
|
... (up to 10 recent episodes)
|
||||||
--- End of recent memories ---
|
--- End of memories ---
|
||||||
User: {current message}
|
User: {current message}
|
||||||
Assistant:
|
Assistant:
|
||||||
|
|
||||||
|
Semantic episodes appear before recent episodes so the model encounters
|
||||||
|
long-range relevant context before the immediate conversation flow.
|
||||||
|
|
||||||
|
## SSE Stream Format
|
||||||
|
|
||||||
|
The inference service emits chunks in this format:
|
||||||
|
data: {"model":"companion:latest","response":"Hello","done":false}
|
||||||
|
data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...}
|
||||||
|
data: [DONE]
|
||||||
|
|
||||||
|
The orchestration service re-emits to the client as:
|
||||||
|
data: {"text":"Hello"}
|
||||||
|
data: {"text":"!"}
|
||||||
|
data: {"done":true}
|
||||||
|
|
||||||
|
The `[DONE]` sentinel from the inference service is consumed internally
|
||||||
|
and not forwarded. The client stream is terminated by `res.end()` after
|
||||||
|
the `{"done":true}` event.
|
||||||
|
|
||||||
## Endpoints
|
## Endpoints
|
||||||
|
|
||||||
### Health
|
### Health
|
||||||
@@ -91,7 +129,14 @@ Assistant:
|
|||||||
|
|
||||||
| Method | Path | Description |
|
| Method | Path | Description |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| POST | /chat | Send a message and receive a response |
|
| POST | /chat | Send a message and receive a complete response |
|
||||||
|
| POST | /chat/stream | Send a message and receive a streaming SSE response |
|
||||||
|
|
||||||
|
### Sessions
|
||||||
|
|
||||||
|
| Method | Path | Description |
|
||||||
|
|---|---|---|
|
||||||
|
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -120,13 +165,52 @@ Response:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
| Field | Description |
|
---
|
||||||
|---|---|
|
|
||||||
| `sessionId` | Echo of the provided session ID |
|
|
||||||
| `response` | The AI's response text |
|
|
||||||
| `model` | Model name as reported by the inference service |
|
|
||||||
| `tokenCount` | Combined prompt + completion token count |
|
|
||||||
|
|
||||||
> Note: If `sessionId` does not exist in the memory service, a new session
|
**POST /chat/stream**
|
||||||
> is automatically created. Clients can safely generate a UUID for new
|
|
||||||
> conversations and pass it on the first message.
|
Same request body as `POST /chat`.
|
||||||
|
|
||||||
|
Response is a stream of Server-Sent Events. Each event contains a text
|
||||||
|
delta. The stream ends with a `done` event.
|
||||||
|
data: {"text":"Hello"}
|
||||||
|
data: {"text":" Tim"}
|
||||||
|
data: {"text":"!"}
|
||||||
|
data: {"done":true}
|
||||||
|
|
||||||
|
Clients should read the `text` field from each chunk and accumulate them
|
||||||
|
to build the full response string. The connection is closed by the server
|
||||||
|
after the `{"done":true}` event.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**GET /sessions/:sessionId/history**
|
||||||
|
|
||||||
|
Returns paginated episode history for a session identified by its external ID.
|
||||||
|
|
||||||
|
Query parameters:
|
||||||
|
|
||||||
|
| Parameter | Default | Description |
|
||||||
|
|---|---|---|
|
||||||
|
| limit | 20 | Maximum number of episodes to return |
|
||||||
|
| offset | 0 | Number of episodes to skip (for pagination) |
|
||||||
|
|
||||||
|
Response:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"sessionId": "your-session-uuid",
|
||||||
|
"episodes": [
|
||||||
|
{
|
||||||
|
"id": 42,
|
||||||
|
"session_id": 1,
|
||||||
|
"user_message": "Hello, my name is Tim.",
|
||||||
|
"ai_response": "Hello Tim! How can I help you today?",
|
||||||
|
"token_count": 87,
|
||||||
|
"created_at": 1712345678,
|
||||||
|
"metadata": null
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Episodes are ordered newest first. Returns `404` if the session does not exist.
|
||||||
@@ -65,3 +65,12 @@ Default pagination and result limits for SQLite episode queries.
|
|||||||
| `DEFAULT_RECENT_LIMIT` | `10` | Default number of recent episodes to retrieve |
|
| `DEFAULT_RECENT_LIMIT` | `10` | Default number of recent episodes to retrieve |
|
||||||
| `DEFAULT_PAGE_SIZE` | `20` | Default episodes per page for paginated queries |
|
| `DEFAULT_PAGE_SIZE` | `20` | Default episodes per page for paginated queries |
|
||||||
| `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return |
|
| `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return |
|
||||||
|
|
||||||
|
#### `SERVICES`
|
||||||
|
|
||||||
|
Default URLs for inter-service communication. Used as fallback values
|
||||||
|
when the corresponding environment variable is not set.
|
||||||
|
|
||||||
|
| Key | Value | Description |
|
||||||
|
|---|---|---|
|
||||||
|
| `EMBEDDING_URL` | `http://localhost:3003` | Fallback embedding service URL |
|
||||||
Reference in New Issue
Block a user