# Orchestration Service

**Package:** `@nexusai/orchestration-service`
**Location:** `packages/orchestration-service`
**Deployed on:** Mini PC 2 (192.168.0.205)
**Port:** 4000

## Purpose

The main entry point for all clients. Assembles context packages from
memory, routes prompts to inference, and writes new episodes back to
memory after each interaction. Clients never talk directly to the memory
or inference services; all traffic flows through orchestration.

## Dependencies

- `express`: HTTP API
- `cors`: cross-origin resource sharing middleware
- `dotenv`: environment variable loading
- `@nexusai/shared`: shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 4000 | Port to listen on |
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |

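Since every variable is optional, the service can start with an empty environment. A minimal sketch of how these defaults might be applied is shown here; the `loadConfig` name and the shape of the returned object are illustrative assumptions, not the service's actual code.

```javascript
// Defaults copied from the environment variable table.
// `loadConfig` and the returned object shape are hypothetical.
const DEFAULTS = {
  PORT: "4000",
  MEMORY_SERVICE_URL: "http://localhost:3002",
  EMBEDDING_SERVICE_URL: "http://localhost:3003",
  INFERENCE_SERVICE_URL: "http://localhost:3001",
  LLAMA_SERVER_URL: "http://localhost:8080",
  QDRANT_URL: "http://localhost:6333",
  CORS_ORIGIN: "http://localhost:5173",
  EXTRACTION_URL: "http://localhost:11434",
  EXTRACTION_MODEL: "qwen2.5:3b",
};

function loadConfig(env = process.env) {
  const config = {};
  for (const [key, fallback] of Object.entries(DEFAULTS)) {
    // Unset and empty-string variables both fall back to the default.
    config[key] = env[key] ? env[key] : fallback;
  }
  // Environment values are strings; the port is more useful as a number.
  config.PORT = Number(config.PORT);
  return config;
}
```

In a real entry point this would typically run after `dotenv` has populated `process.env`.
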
## Internal Structure

```
src/
├── services/
│   ├── memory.js         # HTTP client for memory service
│   ├── inference.js      # HTTP client for inference service
│   ├── embedding.js      # HTTP client for embedding service
│   ├── qdrant.js         # HTTP client for Qdrant (direct vector search)
│   ├── graph.js          # HTTP client for memory-service graph endpoints
│   └── summarization.js  # Session summarisation, triggered after each episode
├── chat/
│   └── index.js          # Core pipeline: context assembly, graph expansion, auto-naming
├── config/
│   └── settings.js       # Settings load/save; reads/writes data/settings.json
├── routes/
│   ├── chat.js           # POST /chat and POST /chat/stream
│   ├── sessions.js       # Session CRUD proxy
│   ├── projects.js       # Project CRUD proxy
│   ├── episodes.js       # Episode list and delete proxy
│   ├── summaries.js      # GET /summaries/session/:id and /summaries/project/:id
│   ├── settings.js       # GET /settings and PATCH /settings
│   ├── health.js         # GET /health/services; pings all four services
│   └── models.js         # GET /models and GET /models/props
└── index.js              # Express app entry point
```

The `services/` layer wraps all downstream HTTP calls in named functions,
so a URL or endpoint change only needs to be made in one place.

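To illustrate the wrapping pattern, here is a hedged sketch of what a client such as `services/graph.js` could look like. The `POST /graph/neighbors` endpoint is taken from this document; the factory name, the injectable `fetchImpl` parameter, and the `{ entityIds }` request body are assumptions made for the example.

```javascript
// Hypothetical sketch of a services/ client. Only the endpoint path comes
// from this document; the request body shape is an assumption.
function createGraphClient(baseUrl, fetchImpl = fetch) {
  async function getNeighbors(entityIds) {
    const res = await fetchImpl(`${baseUrl}/graph/neighbors`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ entityIds }),
    });
    if (!res.ok) {
      throw new Error(`graph/neighbors failed with status ${res.status}`);
    }
    // Expected response shape per this document: { nodes, edges }
    return res.json();
  }

  return { getNeighbors };
}
```

Callers then depend only on `getNeighbors`, so a change to the URL or path stays inside this one module. Passing `fetchImpl` also makes the client easy to test without a running memory service.
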
## Settings

Settings are persisted to `data/settings.json` and loaded on every request
via `appSettings.load()`, so changes apply immediately without a service restart.

| Setting | Default | Description |
|---|---|---|
| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
| `semanticLimit` | 5 | Semantic search results injected into prompt |
| `scoreThreshold` | 0.5 | Minimum similarity score for Qdrant semantic results |
| `semanticWeight` | 1.0 | RRF weight for Qdrant semantic results |
| `keywordWeight` | 0 | RRF weight for FTS5 keyword results (`0` = disabled) |
| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing `.gguf` files |
| `temperature` | 0.7 | Inference temperature |
| `repeatPenalty` | 1.1 | Repeat token penalty |
| `topP` | 0.9 | Nucleus sampling probability mass |
| `topK` | 40 | Top-K token candidates per step |
| `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |

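The load-on-every-request behaviour can be sketched roughly as follows. This is not the actual `appSettings.load()` implementation: the function name, the merge-over-defaults strategy, the silent fallback on a missing file, and the placeholder used for `ORCHESTRATION.SYSTEM_PROMPT` are all assumptions.

```javascript
const fs = require("node:fs");

// Placeholder for ORCHESTRATION.SYSTEM_PROMPT, whose real text is not
// shown in this document.
const FALLBACK_SYSTEM_PROMPT = "<ORCHESTRATION.SYSTEM_PROMPT>";

// Defaults copied from the settings table.
const DEFAULT_SETTINGS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.5,
  semanticWeight: 1.0,
  keywordWeight: 0,
  modelsFolderPath: "/mnt/nexus-models",
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
  systemPrompt: null,
};

// Re-reads the file on every call, so an edit takes effect on the next request.
function loadSettings(filePath = "data/settings.json") {
  let stored = {};
  try {
    stored = JSON.parse(fs.readFileSync(filePath, "utf8"));
  } catch (err) {
    // Missing or unreadable file: use defaults only (assumed behaviour).
  }
  const merged = { ...DEFAULT_SETTINGS, ...stored };
  // A null system prompt reverts to the hardcoded constant.
  if (merged.systemPrompt === null || merged.systemPrompt === undefined) {
    merged.systemPrompt = FALLBACK_SYSTEM_PROMPT;
  }
  return merged;
}
```

Reading a small JSON file per request is cheap compared with an inference call, which is presumably why no caching layer is needed here.
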
## Chat Pipeline

Both `POST /chat` and `POST /chat/stream` share the same steps. The only
difference is how the inference response is delivered to the client.

### Steps

1. **Session resolution**: look up the session by `externalId`. Auto-create it
   if not found.

2. **Project context resolution**: if the session has a `project_id`, fetch
   the project and all its session IDs. These are used to scope semantic
   search. The project's `system_prompt` is also read at this step if set.

3. **System prompt resolution**: three-tier hierarchy:
   - `project.system_prompt`: highest priority
   - `settings.systemPrompt`: global setting from `settings.json`
   - `ORCHESTRATION.SYSTEM_PROMPT`: hardcoded constant (last resort)

4. **Recent episode retrieval**: fetch the most recent episodes (`recentEpisodeLimit`).

5. **Fused episode retrieval**: runs semantic (Qdrant) and keyword (FTS5)
   search in parallel, then merges results via Reciprocal Rank Fusion (RRF).
   Both paths are filtered against `recentIds` before fusion. FTS is scoped
   to the current session or all project sessions. If `keywordWeight` is `0`,
   the FTS call is skipped entirely. Non-critical: if one search fails, the
   step falls back to whichever strategy succeeded.

6. **Entity search**: query the `entities` Qdrant collection filtered by
   `projectId`. Returns entity IDs alongside Qdrant payload data (the Qdrant
   point ID equals the SQLite entity ID). Non-critical.

7. **Graph neighborhood expansion**: call `POST /graph/neighbors` on
   memory-service with the entity IDs from step 6. Returns a 1-hop subgraph
   `{ nodes, edges }`: entity objects plus the relationships connecting them.
   If no entities were found or the graph call fails, falls back to a flat
   entity list (no edges). Non-critical.

8. **Prompt assembly**: combine system prompt, graph context, fused episodes,
   recent episodes, and user message.

9. **Inference**: send to the inference service. `/chat` awaits the full
   response; `/chat/stream` pipes SSE chunks to the client.

10. **Episode write**: write the exchange back to memory with `projectId`.

11. **Summarisation trigger**: `triggerSummary(session, allEpisodes)` is called
    fire-and-forget. See `summarization.md` for full details.

12. **Auto-naming**: on the first message with no session name, fires a
    secondary inference call (max 20 tokens, temperature 0.3) to generate a
    session name.

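The fusion in step 5 can be sketched as a weighted Reciprocal Rank Fusion over the two ranked lists. This is an illustration only: the function and parameter names are hypothetical, the smoothing constant `RRF_K = 60` is the value commonly used with RRF rather than one confirmed for this service, and applying a single `limit` to the fused list is an assumption.

```javascript
// Weighted Reciprocal Rank Fusion sketch.
// score(episode) = sum over lists of weight / (RRF_K + rank), rank starting at 1.
// RRF_K = 60 is the conventional constant, assumed here.
const RRF_K = 60;

function fuseEpisodes({ semantic, keyword, recentIds, semanticWeight, keywordWeight, limit }) {
  const skip = new Set(recentIds);
  const scores = new Map();

  const addRanked = (results, weight) => {
    if (weight <= 0) return; // a weight of 0 disables that strategy
    results
      .filter((episode) => !skip.has(episode.id)) // drop episodes already in the recent list
      .forEach((episode, index) => {
        const rank = index + 1;
        const entry = scores.get(episode.id) || { episode, score: 0 };
        entry.score += weight / (RRF_K + rank);
        scores.set(episode.id, entry);
      });
  };

  addRanked(semantic, semanticWeight);
  addRanked(keyword, keywordWeight);

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.episode);
}
```

An episode that appears in both lists accumulates score from each, so it tends to outrank episodes found by only one strategy. If one of the searches fails, passing an empty array for that list gives the fallback behaviour described in step 5.
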
### Prompt Structure

```
[Resolved system prompt]

Here is what you know about entities relevant to this conversation and their connections:
- {name} ({type}): {notes}
  → {label} {neighbor_name} ({neighbor_type})
---
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
--- End of recent memories ---

User: {current message}
Assistant:
```

The entity block renders the full graph neighborhood: seed entities matched
by Qdrant search plus any neighbors pulled in by 1-hop traversal. Each entity
shows its `notes` and any outbound relationships with their targets. Neighbor
nodes that have no outbound edges within the subgraph appear without connection
lines.

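A rough sketch of how the template could be assembled is given here, including the three-tier system prompt resolution. The field names on nodes, edges, and episodes (`id`, `source`, `target`, `label`, `user`, `assistant`), the two-space indent before the arrow, and the choice to omit empty sections are assumptions; the actual code in `chat/index.js` may differ.

```javascript
// Three-tier system prompt resolution, highest priority first.
function resolveSystemPrompt(project, settings, hardcodedPrompt) {
  return (project && project.system_prompt) || settings.systemPrompt || hardcodedPrompt;
}

// One line per entity, followed by one arrow line per outbound edge.
function renderEntityBlock({ nodes, edges }) {
  const byId = new Map(nodes.map((node) => [node.id, node]));
  return nodes
    .flatMap((node) => {
      const outbound = edges
        .filter((edge) => edge.source === node.id && byId.has(edge.target))
        .map((edge) => {
          const target = byId.get(edge.target);
          return `  → ${edge.label} ${target.name} (${target.type})`;
        });
      return [`- ${node.name} (${node.type}): ${node.notes}`, ...outbound];
    })
    .join("\n");
}

function renderEpisodes(episodes) {
  return episodes
    .map((episode) => `User: ${episode.user}\nAssistant: ${episode.assistant}`)
    .join("\n");
}

function buildPrompt({ systemPrompt, graph, fused, recent, message }) {
  const context = [];
  if (graph.nodes.length > 0) {
    context.push(
      "Here is what you know about entities relevant to this conversation and their connections:",
      renderEntityBlock(graph),
      "---"
    );
  }
  if (fused.length > 0) {
    context.push("Here are some relevant memories from earlier conversations:", renderEpisodes(fused), "---");
  }
  if (recent.length > 0) {
    context.push(
      "Here are some relevant memories from your past conversations:",
      renderEpisodes(recent),
      "--- End of recent memories ---"
    );
  }
  const lines = [systemPrompt, ""];
  if (context.length > 0) lines.push(...context, "");
  lines.push(`User: ${message}`, "Assistant:");
  return lines.join("\n");
}
```

When the graph call fails and only a flat entity list is available, passing `edges: []` renders each entity without connection lines, which matches the fallback described in the pipeline steps.
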
|
## Summarisation

After each episode write, `triggerSummary` is called fire-and-forget. It
checks token thresholds and episode counts before generating a summary, then
stores the result in the memory service.

> For full details on trigger conditions, prompt format, cumulative updates,
> ChatML token stripping, and episode range tracking, see `summarization.md`.

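The fire-and-forget call pattern can be sketched as below. Only the calling convention is illustrated; `triggerSummary` is passed in as a parameter because its real implementation lives in `services/summarization.js` and is not reproduced here. The wrapper name and the logging callback are hypothetical.

```javascript
// Fire-and-forget sketch: start the summary without awaiting it, and attach
// a catch handler so a failure never becomes an unhandled rejection or
// delays the chat response. `fireAndForgetSummary` is a hypothetical helper.
function fireAndForgetSummary(triggerSummary, session, allEpisodes, log = console.error) {
  Promise.resolve()
    .then(() => triggerSummary(session, allEpisodes))
    .catch((err) => log("summary generation failed:", err.message));
  // Intentionally returns nothing: the caller has no promise to await.
}
```

Wrapping the call in `Promise.resolve().then(...)` means that even a synchronous throw inside `triggerSummary` is routed to the `catch` handler.
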
## SSE Stream Format

Inference service → orchestration:
```
data: {"response":"Hello","done":false}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
data: [DONE]
```

Orchestration → client:
```
data: {"text":"Hello"}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
```

The `[DONE]` sentinel is consumed internally and not forwarded.

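The translation between the two formats amounts to a per-line mapping. The sketch below shows one way to express it; the function name is illustrative, and it assumes each upstream event arrives as a complete `data:` line, which a real stream handler would need to guarantee by buffering partial chunks.

```javascript
// Translates one upstream SSE line into the line forwarded to the client,
// or returns null when nothing should be forwarded.
function translateSseLine(line) {
  if (!line.startsWith("data: ")) return null; // ignore blank lines and comments
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null; // sentinel is consumed, not forwarded

  const event = JSON.parse(payload);
  if (event.done) {
    // Final event: forwarded with the same fields.
    return `data: ${JSON.stringify(event)}`;
  }
  // Token event: `response` is renamed to `text`, and `done: false` is dropped.
  return `data: ${JSON.stringify({ text: event.response })}`;
}
```

For example, `translateSseLine('data: {"response":"Hello","done":false}')` yields `data: {"text":"Hello"}`.
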
## Models Route

`GET /models` scans `.gguf` files live from `modelsFolderPath` and merges
them with `models.json` for metadata. It returns file sizes in GB.

`GET /models/props` fetches directly from llama-server and returns
`{ contextWindow, modelAlias }`. It returns `503` if llama-server is unreachable.

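As an illustration of the scan-and-merge step, consider the following sketch. It assumes that `models.json` maps file names to metadata objects and that a GB is 1024³ bytes rounded to two decimals; neither detail is confirmed by this document, and `listModels` is a hypothetical name.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Lists .gguf files in a folder and merges optional metadata.
// Assumption: `metadata` is the parsed models.json, keyed by file name.
function listModels(modelsFolderPath, metadata = {}) {
  return fs
    .readdirSync(modelsFolderPath)
    .filter((file) => file.toLowerCase().endsWith(".gguf"))
    .map((file) => {
      const { size } = fs.statSync(path.join(modelsFolderPath, file));
      return {
        file,
        sizeGB: Number((size / 1024 ** 3).toFixed(2)), // bytes to GB
        ...(metadata[file] || {}),
      };
    });
}
```

Because the folder is read on each call, a model file copied into `modelsFolderPath` shows up in the next `GET /models` response without a restart.
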
## Sessions Route Behaviour

`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both.
It rejects the request only when neither is provided, which allows `useChat`
to write project assignment separately from rename operations.

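The validation rule can be sketched as a small pure function. The function name, the `400` status code, and the error message are assumptions for illustration; the document only states that the request is rejected when neither field is present.

```javascript
// Accepts `name`, `projectId`, or both; rejects only when neither is present.
// The 400 status and error text are assumed, not taken from the service.
function parseSessionPatch(body = {}) {
  const updates = {};
  if (body.name !== undefined) updates.name = body.name;
  if (body.projectId !== undefined) updates.projectId = body.projectId;

  if (Object.keys(updates).length === 0) {
    return { ok: false, status: 400, error: "Provide name, projectId, or both" };
  }
  return { ok: true, updates };
}
```

Checking against `undefined` rather than truthiness means `projectId: null` still counts as provided, which would let a client clear a project assignment if the memory service supports that.
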
## Caddy Configuration

Each route prefix needs a `handle` block in the Caddyfile on Mini PC 2.
**Any new top-level route must be added here AND in `vite.config.js`.**

```
handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /summaries* { reverse_proxy localhost:4000 }
handle /health* { reverse_proxy localhost:4000 }
```

After updating: `caddy reload --config /path/to/Caddyfile`

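On the `vite.config.js` side, the matching entries belong under Vite's `server.proxy` option. The sketch below generates them from one list of prefixes so the two configurations are easier to keep in sync. The helper name and the dev-server target `http://localhost:4000` are assumptions; the project's real `vite.config.js` is not shown in this document.

```javascript
// Route prefixes copied from the Caddyfile handle blocks.
const ROUTE_PREFIXES = [
  "/chat",
  "/sessions",
  "/models",
  "/projects",
  "/episodes",
  "/settings",
  "/summaries",
  "/health",
];

// Builds an object suitable for Vite's `server.proxy` option.
// The target is an assumed local orchestration address.
function buildProxy(prefixes, target = "http://localhost:4000") {
  return Object.fromEntries(
    prefixes.map((prefix) => [prefix, { target, changeOrigin: true }])
  );
}

// In vite.config.js this could be wired up as:
//   export default defineConfig({ server: { proxy: buildProxy(ROUTE_PREFIXES) } });
```

Adding a new top-level route then means one new `handle` line in the Caddyfile and one new entry in `ROUTE_PREFIXES`.
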
> Note: `/graph` routes are on the memory-service (port 3002) and are called
> internally by orchestration, so they do not need a Caddy entry.

For all HTTP endpoints, see `api-routes.md`.