# Orchestration Service

**Package:** `@nexusai/orchestration-service`
**Location:** `packages/orchestration-service`
**Deployed on:** Mini PC 2 (192.168.0.205)
**Port:** 4000

## Purpose

The main entry point for all clients. Assembles context packages from memory, routes prompts to inference, and writes new episodes back to memory after each interaction. Clients never talk directly to the memory or inference services; all traffic flows through orchestration.

## Dependencies

- `express`: HTTP API
- `cors`: cross-origin resource sharing middleware
- `dotenv`: environment variable loading
- `@nexusai/shared`: shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 4000 | Port to listen on |
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | Yes | - | Path to `models.json` manifest file |

## Internal Structure

```
src/
├── services/
│   ├── memory.js       # HTTP client for memory service
│   ├── inference.js    # HTTP client for inference service
│   ├── embedding.js    # HTTP client for embedding service
│   └── qdrant.js       # HTTP client for Qdrant (direct vector search)
├── chat/
│   └── index.js        # Core pipeline: context assembly, isolation, auto-naming
├── routes/
│   ├── chat.js         # POST /chat and POST /chat/stream
│   ├── sessions.js     # Session CRUD proxy
│   ├── projects.js     # Project CRUD proxy
│   └── models.js       # GET /models: reads models.json from disk
└── index.js            # Express app entry point
```

The `services/` layer wraps all downstream HTTP calls in named functions, so a URL or endpoint change is made in a single place.
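
To illustrate the idea behind the `services/` layer, here is a minimal sketch of what a module such as `services/memory.js` might look like. It is not the actual implementation: the helper names (`endpoint`, `requestJson`, `getSession`, `createEpisode`) and the paths `/sessions/:id` and `/episodes` are assumptions made for the example. Only the `MEMORY_SERVICE_URL` variable and its default come from the table above.

```js
// Hypothetical sketch of a services/ module. Function names and endpoint
// paths are assumptions for illustration, not the real memory service API.
const MEMORY_SERVICE_URL = process.env.MEMORY_SERVICE_URL || 'http://localhost:3002';

// Join a base URL and a path without doubling or dropping slashes.
function endpoint(baseUrl, path) {
  return `${baseUrl.replace(/\/+$/, '')}/${path.replace(/^\/+/, '')}`;
}

// Shared request helper: sends JSON, parses JSON, throws on a non-2xx status.
// Uses the global fetch available in Node 18+ unless another one is passed in.
async function requestJson(url, options = {}, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    ...options,
    headers: { 'Content-Type': 'application/json', ...options.headers },
  });
  if (!res.ok) {
    throw new Error(`${options.method || 'GET'} ${url} failed with status ${res.status}`);
  }
  return res.json();
}

// Named functions: callers never build downstream URLs themselves.
function getSession(externalId) {
  const path = `/sessions/${encodeURIComponent(externalId)}`; // assumed path
  return requestJson(endpoint(MEMORY_SERVICE_URL, path));
}

function createEpisode(episode) {
  return requestJson(endpoint(MEMORY_SERVICE_URL, '/episodes'), { // assumed path
    method: 'POST',
    body: JSON.stringify(episode),
  });
}
```

With this shape, a route handler would call `getSession(id)` rather than assembling a URL, so moving the memory service only means changing `MEMORY_SERVICE_URL`.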
## Chat Pipeline

Both `POST /chat` and `POST /chat/stream` share the same steps. The only difference is how the inference response is delivered to the client.

### Steps

1. **Session resolution**: look up the session by `externalId`. Auto-create if not found. Clients generate a UUID for new conversations, so no pre-creation step is needed.
2. **Project context resolution**: if the session has a `project_id`, fetch the project and all its session IDs. Used to scope semantic search. See `memory-isolation.md` for full behaviour.
3. **Recent episode retrieval**: fetch the most recent episodes for the session (`RECENT_EPISODE_LIMIT`, default 5).
4. **Semantic search**: embed the user message, then query Qdrant for the top-5 most similar past episodes (`SCORE_THRESHOLD` 0.75). Results are deduplicated against recent episodes. Non-critical: if it fails, the pipeline continues with recency-only context.
5. **Entity search**: reuse the embedded user message vector to query the `entities` Qdrant collection (score threshold 0.6, limit 5). Returns entity payloads (`name`, `type`, `notes`) directly, with no SQLite roundtrip needed. Non-critical: if it fails, the pipeline continues without entity context.
6. **Prompt assembly**: combine system prompt, entity context, semantic episodes, recent episodes, and user message.
7. **Inference**: send to the inference service. `/chat` awaits the full response; `/chat/stream` pipes SSE chunks to the client.
8. **Episode write**: write the exchange back to memory. Fire-and-forget for `/chat`; awaited for `/chat/stream` to ensure the full text is accumulated before saving.
9. **Auto-naming**: on `isFirstMessage && !session.name`, fire a secondary inference call with a naming prompt (max 20 tokens, temperature 0.3) and write the result back as `session.name`. Fully fire-and-forget.

### Prompt Structure

```
[System prompt]

Here is what you know about entities relevant to this conversation:
- {name} ({type}): {notes}
...
(up to 5 entity results)

---
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
...
(up to 5 semantic episodes)

---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
...
(up to 5 recent episodes)
--- End of recent memories ---

User: {current message}
Assistant:
```

Entity context appears first, before episodic memory, because structured facts about known entities are the most stable and reliable context. Semantic episodes follow, then recent episodes as the immediate conversation flow.

## SSE Stream Format

Inference service → orchestration:

```
data: {"response":"Hello","done":false}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
data: [DONE]
```

Orchestration → client:

```
data: {"text":"Hello"}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
```

The `[DONE]` sentinel is consumed internally and not forwarded. The stream is terminated by `res.end()` after the done event.

## Models Manifest

`GET /models` reads `models.json` fresh on each request from `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files, accessible via an SMB mount at `/mnt/nexus-models`.

```json
[
  {
    "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
    "label": "Gemma 4 26B Claude Distill"
  }
]
```

`value` must match the model name as reported by `llama-server` (including the `.gguf` extension). No service restart is needed when models are added or removed.

## Sessions Route Behaviour

`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both. The validation guard only rejects requests where neither is provided:

```js
if (!name?.trim() && projectId === undefined) {
  return res.status(400).json({ error: 'name or projectId is required' });
}
```

This allows `useChat` to write project assignment separately from rename operations.
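
To make the guard's edge cases concrete, the sketch below lifts the same condition into a standalone function and runs it against a few request bodies. The function name `validateSessionPatch`, its return shape, and the sample values are assumptions for illustration; only the condition and the error message come from the route code above.

```js
// Illustrative, pure version of the PATCH /sessions/:sessionId guard.
// The function name and return shape are hypothetical.
function validateSessionPatch(body = {}) {
  const { name, projectId } = body;
  if (!name?.trim() && projectId === undefined) {
    return { status: 400, error: 'name or projectId is required' };
  }
  return null; // null means the request passes the guard
}

const sampleBodies = [
  { name: 'Trip planning' },                      // rename only
  { projectId: 'proj-1' },                        // project assignment only
  { name: 'Trip planning', projectId: 'proj-1' }, // both at once
  { projectId: null },                            // passes: null is not undefined
  { name: '   ' },                                // rejected: blank name, no projectId
  {},                                             // rejected: neither field present
];

for (const body of sampleBodies) {
  const result = validateSessionPatch(body);
  console.log(JSON.stringify(body), '->', result ? result.status : 'accepted');
}
```

Because the check is `projectId === undefined` rather than a truthiness test, an explicit `projectId: null` passes the guard. How the handler interprets `null` afterwards is not covered here.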
## Caddy Configuration

Each route prefix needs a handle block in the Caddyfile on Mini PC 2:

```
handle /chat* {
    reverse_proxy localhost:4000
}
handle /sessions* {
    reverse_proxy localhost:4000
}
handle /models* {
    reverse_proxy localhost:4000
}
handle /projects* {
    reverse_proxy localhost:4000
}
```

After updating: `caddy reload --config /path/to/Caddyfile`

For all HTTP endpoints, see `api-routes.md`.