# Orchestration Service

**Package:** `@nexusai/orchestration-service`
**Location:** `packages/orchestration-service`
**Deployed on:** Mini PC 2 (192.168.0.205)
**Port:** 4000

## Purpose

The main entry point for all clients. Assembles context packages from
memory, routes prompts to inference, and writes new episodes back to
memory after each interaction. Clients never talk directly to the memory
or inference services; all traffic flows through orchestration.

## Dependencies

- `express`: HTTP API
- `cors`: cross-origin resource sharing middleware
- `dotenv`: environment variable loading
- `@nexusai/shared`: shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 4000 | Port to listen on |
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | Yes | (none) | Path to `models.json` manifest file |

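
The table above could be resolved at startup roughly as follows. This is a minimal sketch, not the service's actual code: `loadConfig` and the returned property names are hypothetical, and failing fast when `MODELS_MANIFEST_PATH` is missing is an assumption about how the required variable might be enforced.

```javascript
// Hypothetical helper: resolves settings from an env-like object using the
// defaults listed in the table. Not taken from the service's source.
function loadConfig(env = process.env) {
  // Assumption: a missing required variable is treated as a startup error.
  if (!env.MODELS_MANIFEST_PATH) {
    throw new Error('MODELS_MANIFEST_PATH is required');
  }
  return {
    port: Number(env.PORT ?? 4000),
    memoryServiceUrl: env.MEMORY_SERVICE_URL ?? 'http://localhost:3002',
    embeddingServiceUrl: env.EMBEDDING_SERVICE_URL ?? 'http://localhost:3003',
    inferenceServiceUrl: env.INFERENCE_SERVICE_URL ?? 'http://localhost:3001',
    qdrantUrl: env.QDRANT_URL ?? 'http://localhost:6333',
    corsOrigin: env.CORS_ORIGIN ?? 'http://localhost:5173',
    modelsManifestPath: env.MODELS_MANIFEST_PATH,
  };
}
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the defaults easy to check in isolation.
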
## Internal Structure

```
src/
├── services/
│   ├── memory.js       # HTTP client for memory service
│   ├── inference.js    # HTTP client for inference service
│   ├── embedding.js    # HTTP client for embedding service
│   └── qdrant.js       # HTTP client for Qdrant vector search
├── chat/
│   └── index.js        # Core pipeline logic: context assembly and coordination
├── routes/
│   ├── chat.js         # POST /chat and POST /chat/stream route handlers
│   ├── sessions.js     # Session list, history, rename, and delete routes
│   ├── projects.js     # Project CRUD routes; proxies to memory service
│   └── models.js       # GET /models; reads models.json manifest from disk
└── index.js            # Express app entry point
```

The `services/` layer wraps all downstream HTTP calls in named functions,
keeping the pipeline logic in `chat/index.js` readable and ensuring that
URL or endpoint changes have a single place to be updated.

## Chat Pipeline

Both `POST /chat` and `POST /chat/stream` share the same context assembly
steps. The only difference is how the inference response is delivered to
the client.

1. **Session resolution**: looks up the session by `externalId` in the memory
   service. If not found, auto-creates a new session. Clients can generate a
   UUID for new conversations and pass it directly; no pre-creation step is needed.

2. **Recent episode retrieval**: fetches the most recent episodes for the session
   (default: 5) from the memory service.

3. **Semantic search**: embeds the user message via the embedding service, then
   queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
   Results are deduplicated against the recent episode set using a `Set` of IDs.
   Full episode content is fetched from the memory service by ID. This step is
   non-critical: if it fails, a warning is logged and the pipeline continues with
   recency-only context.

4. **Prompt assembly**: combines the system prompt, semantic episodes (if any),
   recent episodes, and the current user message into a single prompt string.

5. **Inference**: sends the assembled prompt to the inference service. `/chat`
   awaits the full response; `/chat/stream` opens an SSE connection and pipes
   chunks to the client as they arrive.

6. **Episode write**: writes the new exchange (user message + AI response)
   back to the memory service as a fire-and-forget operation. For streaming,
   the full response text is accumulated across chunks before writing.

7. **Response**: returns the AI response, model name, session ID, and token
   count to the client.

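
The deduplication and fallback behaviour of step 3 can be sketched as below. This is an illustration under stated assumptions, not the code in `chat/index.js`: the function names, the `{ id, score }` hit shape, and the `searchFn` callback (standing in for the embedding and Qdrant calls) are all hypothetical. Qdrant may already apply the score threshold server-side, in which case the filter here is redundant.

```javascript
// Hypothetical helper for step 3: keep only hits at or above the threshold
// that are not already present in the recent-episode set.
function selectSemanticIds(recentEpisodes, hits, threshold = 0.75, limit = 5) {
  const recentIds = new Set(recentEpisodes.map((episode) => episode.id));
  return hits
    .filter((hit) => hit.score >= threshold && !recentIds.has(hit.id))
    .slice(0, limit)
    .map((hit) => hit.id);
}

// Step 3 is non-critical: on any failure, log a warning and return an empty
// list so the pipeline continues with recency-only context.
async function semanticIdsOrEmpty(searchFn, recentEpisodes, message) {
  try {
    const hits = await searchFn(message);
    return selectSemanticIds(recentEpisodes, hits);
  } catch (err) {
    console.warn('Semantic search failed; using recent episodes only:', err.message);
    return [];
  }
}
```

The returned IDs would then be used to fetch full episode content from the memory service, as described in step 3.
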
## Prompt Structure

```
[System prompt]

Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 recent episodes)
--- End of recent memories ---

User: {current message}
Assistant:
```

Semantic episodes appear before recent episodes so the model encounters
long-range relevant context before the immediate conversation flow.

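
A prompt builder that produces this layout might look like the following sketch. `buildPrompt` and `formatEpisode` are hypothetical names; the episode fields (`user_message`, `ai_response`) follow the episode shape in the history response. Always emitting the recent-episode block, even when it is empty, is an assumption.

```javascript
// Hypothetical formatter for one stored exchange.
function formatEpisode(episode) {
  return `User: ${episode.user_message}\nAssistant: ${episode.ai_response}`;
}

// Assembles the prompt in the documented order: system prompt, semantic
// episodes (only if any), recent episodes, then the current message.
function buildPrompt(systemPrompt, semanticEpisodes, recentEpisodes, message) {
  const parts = [systemPrompt, ''];
  if (semanticEpisodes.length > 0) {
    parts.push('Here are some relevant memories from earlier conversations:');
    parts.push(...semanticEpisodes.map(formatEpisode));
    parts.push('---');
  }
  parts.push('Here are some relevant memories from your past conversations:');
  parts.push(...recentEpisodes.map(formatEpisode));
  parts.push('--- End of recent memories ---', '');
  parts.push(`User: ${message}`, 'Assistant:');
  return parts.join('\n');
}
```

Ending the string with a bare `Assistant:` line leaves the model to continue from that point.
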
## SSE Stream Format

The inference service emits chunks from the llama.cpp provider in this format:
```
data: {"response":"Hello","done":false}
data: {"response":"!","done":false}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
data: [DONE]
```

The orchestration service re-emits to the client as:
```
data: {"text":"Hello"}
data: {"text":"!"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
```

The `[DONE]` sentinel from the inference service is consumed internally
and not forwarded. The client stream is terminated by `res.end()` after
the done event. Model name and token count are included on the done event
so the client can display them in the UI.

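
The mapping between the two formats can be expressed as a small per-line translation. The sketch below is illustrative only: `translateInferenceLine` is a hypothetical name, and it assumes each upstream event arrives as one complete `data:` line.

```javascript
// Hypothetical translator for one `data:` line from the inference service.
// Returns the line to forward to the client, or null when nothing is sent.
function translateInferenceLine(line) {
  if (!line.startsWith('data:')) return null;
  const payload = line.slice('data:'.length).trim();
  if (payload === '[DONE]') return null; // sentinel is consumed, not forwarded
  const chunk = JSON.parse(payload);
  if (chunk.done) {
    // The final event carries the model name and token count for the UI.
    const { model, tokenCount } = chunk;
    return `data: ${JSON.stringify({ done: true, model, tokenCount })}`;
  }
  return `data: ${JSON.stringify({ text: chunk.response })}`;
}
```

When writing to the response, each forwarded line would be followed by a blank line (`\n\n`), which is how SSE delimits events.
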
## Models Manifest

The `/models` endpoint reads a `models.json` file from disk at the path
specified by `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside
the model files, and is accessible to orchestration via a network share
mounted at `/mnt/nexus-models`.

The manifest is read fresh on each request, so no restart is needed when models
are added or removed.

**models.json format:**
```json
[
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```

- `value`: must match the model name as reported by `llama-server` (including `.gguf` extension)
- `label`: display name shown in the UI

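
Reading the manifest on every request, rather than caching it at startup, could look like this minimal sketch. `readModelsManifest` is a hypothetical name and the array check is an added assumption. A synchronous read keeps the example short; an asynchronous read (`fs.promises.readFile`) would avoid blocking the event loop in a real route handler.

```javascript
const fs = require('node:fs');

// Hypothetical loader: reads and parses the manifest on every call, so edits
// to the file are picked up without restarting the service.
function readModelsManifest(manifestPath) {
  const raw = fs.readFileSync(manifestPath, 'utf8');
  const models = JSON.parse(raw);
  // Assumption: anything other than an array is treated as a bad manifest.
  if (!Array.isArray(models)) {
    throw new Error('models.json must contain an array');
  }
  return models;
}
```

Any error thrown here (missing file, invalid JSON) is the kind of failure the route would surface as a `500`.
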
## Endpoints

### Health

| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check; reports downstream service URLs |

### Chat

| Method | Path | Description |
|---|---|---|
| POST | /chat | Send a message and receive a complete response |
| POST | /chat/stream | Send a message and receive a streaming SSE response |

### Sessions

| Method | Path | Description |
|---|---|---|
| GET | /sessions | Get paginated list of all sessions |
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
| PATCH | /sessions/:sessionId | Rename a session |
| DELETE | /sessions/:sessionId | Delete a session and all its episodes |

### Projects

Project routes are proxied directly to the memory service with no transformation.

| Method | Path | Description |
|---|---|---|
| GET | /projects | Get all projects |
| POST | /projects | Create a new project |
| PATCH | /projects/:id | Update a project |
| DELETE | /projects/:id | Delete a project |

### Models

| Method | Path | Description |
|---|---|---|
| GET | /models | Get list of available models from manifest file |

---

**POST /chat**

Request body:
```json
{
  "sessionId": "your-session-uuid",
  "message": "Hello, my name is Tim.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7
}
```

`model` and `temperature` are optional; if omitted, the inference service
defaults apply.

Response:
```json
{
  "sessionId": "your-session-uuid",
  "response": "Hello Tim! How can I help you today?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "tokenCount": 87
}
```

---

**POST /chat/stream**

Same request body as `POST /chat`.

Response is a stream of Server-Sent Events:
```
data: {"text":"Hello"}
data: {"text":" Tim"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
```

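
On the client side, the events above could be reassembled into a full reply along these lines. This is a hedged sketch with hypothetical helper names (`parseSseEvents`, `collectReply`); it assumes the complete stream text is already available, whereas a real streaming client would also need to buffer partial lines between network chunks.

```javascript
// Hypothetical client-side helper: splits raw SSE text into parsed events.
function parseSseEvents(text) {
  return text
    .split('\n')
    .filter((line) => line.startsWith('data:'))
    .map((line) => JSON.parse(line.slice('data:'.length).trim()));
}

// Rebuilds the full reply from the `text` chunks and picks out the done
// event, which carries the model name and token count.
function collectReply(events) {
  const reply = events
    .filter((event) => 'text' in event)
    .map((event) => event.text)
    .join('');
  const done = events.find((event) => event.done === true) ?? null;
  return { reply, done };
}
```
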
---

**PATCH /sessions/:sessionId**

Request body:
```json
{ "name": "My Renamed Session" }
```

Returns the updated session object. `name` is required and trimmed of whitespace.

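
The `name` rule might be enforced with a check like the one below. This is a sketch only: `parseSessionName` is a hypothetical helper, and the `400` status for a missing or blank name is an assumption, since only the requirement itself is documented.

```javascript
// Hypothetical validator for the PATCH body: trims the name and rejects
// a missing, non-string, or whitespace-only value.
function parseSessionName(body) {
  const name = typeof body?.name === 'string' ? body.name.trim() : '';
  if (name === '') {
    // Assumption: the route would answer with 400 in this case.
    return { ok: false, status: 400, error: 'name is required' };
  }
  return { ok: true, name };
}
```
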
---

**DELETE /sessions/:sessionId**

Returns `204 No Content`. Cascades to delete all episodes for the session.

---

**GET /sessions/:sessionId/history**

Query parameters:

| Parameter | Default | Description |
|---|---|---|
| limit | 20 | Maximum number of episodes to return |
| offset | 0 | Number of episodes to skip (for pagination) |

Response:
```json
{
  "sessionId": "your-session-uuid",
  "episodes": [
    {
      "id": 42,
      "session_id": 1,
      "user_message": "Hello, my name is Tim.",
      "ai_response": "Hello Tim! How can I help you today?",
      "token_count": 87,
      "created_at": 1712345678,
      "metadata": null
    }
  ]
}
```

Episodes are ordered newest first.

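
The `limit` and `offset` defaults could be applied as in this sketch. `parsePagination` is a hypothetical helper, and falling back to the default on a missing or invalid value (rather than rejecting the request) is an assumption.

```javascript
// Hypothetical parser for the history query string. Query values arrive as
// strings, so they are parsed and validated before use.
function parsePagination(query) {
  const limit = Number.parseInt(query.limit, 10);
  const offset = Number.parseInt(query.offset, 10);
  return {
    limit: Number.isInteger(limit) && limit > 0 ? limit : 20,
    offset: Number.isInteger(offset) && offset >= 0 ? offset : 0,
  };
}
```
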
---

**GET /models**

Returns the parsed contents of `models.json`:
```json
[
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```

Returns `500` if the manifest file cannot be read or parsed.

## Caddy Configuration

The Caddy reverse proxy on Mini PC 2 must have a handle block for each route
prefix the client needs to reach. Current required blocks:

```
handle /chat* {
    reverse_proxy localhost:4000
}
handle /sessions* {
    reverse_proxy localhost:4000
}
handle /models* {
    reverse_proxy localhost:4000
}
handle /projects* {
    reverse_proxy localhost:4000
}
```

When adding new top-level routes to the orchestration service, add a matching
block here and reload Caddy: `caddy reload --config /path/to/Caddyfile`