247 lines
7.4 KiB
Markdown
247 lines
7.4 KiB
Markdown
# Orchestration Service
|
|
|
|
**Package:** `@nexusai/orchestration-service`
|
|
**Location:** `packages/orchestration-service`
|
|
**Deployed on:** Mini PC 2 (192.168.0.205)
|
|
**Port:** 4000
|
|
|
|
## Purpose
|
|
|
|
The main entry point for all clients. Assembles context packages from
|
|
memory, routes prompts to inference, and writes new episodes back to
|
|
memory after each interaction. Clients never talk directly to the memory
|
|
or inference services — all traffic flows through orchestration.
|
|
|
|
## Dependencies
|
|
|
|
- `express` : HTTP API
|
|
- `cors` : cross-origin resource sharing middleware
|
|
- `node-fetch` : inter-service HTTP communication (memory service client only)
|
|
- `dotenv` : environment variable loading
|
|
- `@nexusai/shared` : shared utilities
|
|
|
|
> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
|
|
> service clients use Node.js built-in `fetch`.
|
|
|
|
## Environment Variables
|
|
|
|
| Variable | Required | Default | Description |
|
|
|---|---|---|---|
|
|
| PORT | No | 4000 | Port to listen on |
|
|
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
|
|
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
|
|
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
|
|
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
|
|
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
|
|
|
|
## Internal Structure
|
|
```
|
|
src/
|
|
├── services/
|
|
│ ├── memory.js # HTTP client for memory service
|
|
│ ├── inference.js # HTTP client for inference service
|
|
│ ├── embedding.js # HTTP client for embedding service
|
|
│ └── qdrant.js # HTTP client for Qdrant vector search
|
|
├── chat/
|
|
│ └── index.js # Core pipeline logic — context assembly and coordination
|
|
├── routes/
|
|
│ ├── chat.js # POST /chat and POST /chat/stream route handlers
|
|
│ └── sessions.js # GET /sessions/:sessionId/history route handler
|
|
└── index.js # Express app entry point
|
|
```
|
|
|
|
The `services/` layer wraps all downstream HTTP calls in named functions,
|
|
keeping the pipeline logic in `chat/index.js` readable and ensuring that
|
|
URL or endpoint changes have a single place to be updated.
|
|
|
|
## Chat Pipeline
|
|
|
|
Both `POST /chat` and `POST /chat/stream` share the same context assembly
|
|
steps. The only difference is how the inference response is delivered to
|
|
the client.
|
|
|
|
1. **Session resolution** — looks up the session by `externalId` in the memory
|
|
service. If not found, auto-creates a new session. Clients can generate a
|
|
UUID for new conversations and pass it directly — no pre-creation step needed.
|
|
|
|
2. **Recent episode retrieval** — fetches the most recent episodes for the session
|
|
(default: 10) from the memory service.
|
|
|
|
3. **Semantic search** — embeds the user message via the embedding service, then
|
|
queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
|
|
Results are deduplicated against the recent episode set using a `Set` of IDs.
|
|
Full episode content is fetched from the memory service by ID. This step is
|
|
non-critical — if it fails, a warning is logged and the pipeline continues with
|
|
recency-only context.
|
|
|
|
4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
|
|
recent episodes, and the current user message into a single prompt string.
|
|
|
|
5. **Inference** — sends the assembled prompt to the inference service. `/chat`
|
|
awaits the full response; `/chat/stream` opens an SSE connection and pipes
|
|
chunks to the client as they arrive.
|
|
|
|
6. **Episode write** — writes the new exchange (user message + AI response)
|
|
back to the memory service as a fire-and-forget operation. For streaming,
|
|
the full response text is accumulated across chunks before writing.
|
|
|
|
7. **Response** — returns the AI response, model name, session ID, and token
|
|
count to the client.
|
|
|
|
## Prompt Structure
|
|
[System prompt]
|
|
Here are some relevant memories from earlier conversations:
|
|
User: {past user message}
|
|
Assistant: {past ai response}
|
|
... (up to 5 semantic episodes)
|
|
Here is the recent conversation history:
|
|
User: {past user message}
|
|
Assistant: {past ai response}
|
|
... (up to 10 recent episodes)
|
|
--- End of memories ---
|
|
User: {current message}
|
|
Assistant:
|
|
|
|
Semantic episodes appear before recent episodes so the model encounters
|
|
long-range relevant context before the immediate conversation flow.
|
|
|
|
## SSE Stream Format
|
|
|
|
The inference service emits chunks in this format:
|
|
data: {"model":"companion:latest","response":"Hello","done":false}
|
|
data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...}
|
|
data: [DONE]
|
|
|
|
The orchestration service re-emits to the client as:
|
|
data: {"text":"Hello"}
|
|
data: {"text":"!"}
|
|
data: {"done":true}
|
|
|
|
The `[DONE]` sentinel from the inference service is consumed internally
|
|
and not forwarded. The client stream is terminated by `res.end()` after
|
|
the `{"done":true}` event.
|
|
|
|
## Endpoints
|
|
|
|
### Health
|
|
|
|
| Method | Path | Description |
|
|
|---|---|---|
|
|
| GET | /health | Service health check — reports downstream service URLs |
|
|
|
|
### Chat
|
|
|
|
| Method | Path | Description |
|
|
|---|---|---|
|
|
| POST | /chat | Send a message and receive a complete response |
|
|
| POST | /chat/stream | Send a message and receive a streaming SSE response |
|
|
|
|
### Sessions
|
|
|
|
| Method | Path | Description |
|
|
|---|---|---|
|
|
| GET | /sessions | Get paginated list of all sessions |
|
|
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
|
|
|
|
---
|
|
|
|
**POST /chat**
|
|
|
|
Request body:
|
|
```json
|
|
{
|
|
"sessionId": "your-session-uuid",
|
|
"message": "Hello, my name is Tim.",
|
|
"model": "companion:latest",
|
|
"temperature": 0.7
|
|
}
|
|
```
|
|
|
|
`model` and `temperature` are optional — fall back to inference service defaults
|
|
if omitted.
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"sessionId": "your-session-uuid",
|
|
"response": "Hello Tim! How can I help you today?",
|
|
"model": "companion:latest",
|
|
"tokenCount": 87
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
**POST /chat/stream**
|
|
|
|
Same request body as `POST /chat`.
|
|
|
|
Response is a stream of Server-Sent Events. Each event contains a text
|
|
delta. The stream ends with a `done` event.
|
|
data: {"text":"Hello"}
|
|
data: {"text":" Tim"}
|
|
data: {"text":"!"}
|
|
data: {"done":true}
|
|
|
|
Clients should read the `text` field from each chunk and accumulate them
|
|
to build the full response string. The connection is closed by the server
|
|
after the `{"done":true}` event.
|
|
|
|
---
|
|
|
|
**GET /sessions/:sessionId/history**
|
|
|
|
Returns paginated episode history for a session identified by its external ID.
|
|
|
|
Query parameters:
|
|
|
|
| Parameter | Default | Description |
|
|
|---|---|---|
|
|
| limit | 20 | Maximum number of episodes to return |
|
|
| offset | 0 | Number of episodes to skip (for pagination) |
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"sessionId": "your-session-uuid",
|
|
"episodes": [
|
|
{
|
|
"id": 42,
|
|
"session_id": 1,
|
|
"user_message": "Hello, my name is Tim.",
|
|
"ai_response": "Hello Tim! How can I help you today?",
|
|
"token_count": 87,
|
|
"created_at": 1712345678,
|
|
"metadata": null
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
**GET /sessions**
|
|
|
|
Returns a paginated list of all sessions, ordered by most recently active.
|
|
|
|
Query parameters:
|
|
|
|
| Parameter | Default | Description |
|
|
|---|---|---|
|
|
| limit | 20 | Maximum number of sessions to return |
|
|
| offset | 0 | Number of sessions to skip (for pagination) |
|
|
|
|
Response:
|
|
```json
|
|
[
|
|
{
|
|
"id": 1,
|
|
"external_id": "test-semantic",
|
|
"metadata": null,
|
|
"created_at": 1712345678,
|
|
"updated_at": 1712345999
|
|
}
|
|
]
|
|
```
|
|
|
|
Episodes are ordered newest first. Returns `404` if the session does not exist. |