Orchestration Service
Package: @nexusai/orchestration-service
Location: packages/orchestration-service
Deployed on: Mini PC 2 (192.168.0.205)
Port: 4000
Purpose
The main entry point for all clients. Assembles context packages from memory, routes prompts to inference, and writes new episodes back to memory after each interaction. Clients never talk directly to the memory or inference services — all traffic flows through orchestration.
Dependencies
- express: HTTP API
- cors: cross-origin resource sharing middleware
- dotenv: environment variable loading
- @nexusai/shared: shared utilities
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 4000 | Port to listen on |
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by modelsFolderPath in settings.json |
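
As a rough illustration of how these defaults could be applied at startup, the sketch below reads each variable from an environment object and falls back to the documented default. The names DEFAULTS and loadConfig are hypothetical and are not taken from the service's source. Treating an empty string as unset is also an assumption.

```javascript
// Hypothetical helper: maps the documented variables to a config object.
// Default values are copied from the table above.
const DEFAULTS = {
  PORT: '4000',
  MEMORY_SERVICE_URL: 'http://localhost:3002',
  EMBEDDING_SERVICE_URL: 'http://localhost:3003',
  INFERENCE_SERVICE_URL: 'http://localhost:3001',
  LLAMA_SERVER_URL: 'http://localhost:8080',
  QDRANT_URL: 'http://localhost:6333',
  CORS_ORIGIN: 'http://localhost:5173',
};

function loadConfig(env = process.env) {
  const config = {};
  for (const [name, fallback] of Object.entries(DEFAULTS)) {
    // Assumption: unset and empty-string values both fall back to the default.
    config[name] = env[name] ? env[name] : fallback;
  }
  // PORT is the only numeric value, so convert it once here.
  config.PORT = Number(config.PORT);
  return config;
}
```

Calling loadConfig() with no argument reads process.env. Passing a plain object instead makes the fallback behaviour easy to check in isolation.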
Internal Structure
src/
├── services/
│ ├── memory.js # HTTP client for memory service
│ ├── inference.js # HTTP client for inference service
│ ├── embedding.js # HTTP client for embedding service
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
├── chat/
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
├── config/
│ └── settings.js # Settings load/save — reads/writes data/settings.json
├── routes/
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy — passes req.body straight through
│ ├── episodes.js # Episode list and delete proxy
│ ├── settings.js # GET /settings and PATCH /settings
│ ├── health.js # GET /health — pings all four services
│ └── models.js # GET /models — scans .gguf files live, merges with models.json
# GET /models/props — context window + loaded model from llama-server
└── index.js # Express app entry point
The services/ layer wraps all downstream HTTP calls in named functions,
so a URL or endpoint change only needs to be made in one place.
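
To make the idea concrete, here is a minimal sketch of what one wrapper in services/memory.js might look like. The endpoint path, the limit query parameter, and the function names recentEpisodesUrl and getRecentEpisodes are all assumptions for illustration. The real memory service API may differ. The sketch relies on the global fetch available in Node 18 and later.

```javascript
// Hypothetical sketch of a services/ module. The endpoint path, query
// parameter, and function names are illustrative assumptions.
const MEMORY_SERVICE_URL = process.env.MEMORY_SERVICE_URL || 'http://localhost:3002';

// Build the request URL in one place so a path change touches one function.
function recentEpisodesUrl(sessionId, limit, baseUrl = MEMORY_SERVICE_URL) {
  const url = new URL(`/sessions/${encodeURIComponent(sessionId)}/episodes`, baseUrl);
  url.searchParams.set('limit', String(limit));
  return url.toString();
}

// Named wrapper: callers never see the URL or the HTTP details.
// fetchImpl can be swapped for a stub in tests.
async function getRecentEpisodes(sessionId, limit, fetchImpl = fetch) {
  const res = await fetchImpl(recentEpisodesUrl(sessionId, limit));
  if (!res.ok) {
    throw new Error(`memory service responded with ${res.status}`);
  }
  return res.json();
}
```

Route handlers would call getRecentEpisodes(sessionId, limit) and stay unaware of the downstream address.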
Settings
Settings are persisted to data/settings.json and loaded on every request
via appSettings.load() — changes apply immediately without a service restart.
| Setting | Default | Description |
|---|---|---|
| recentEpisodeLimit | 5 | Recent episodes injected into prompt |
| semanticLimit | 5 | Semantic search results injected into prompt |
| scoreThreshold | 0.75 | Minimum similarity score for semantic results |
| modelsFolderPath | /mnt/nexus-models | Path to folder containing .gguf files |
| temperature | 0.7 | Inference temperature |
| repeatPenalty | 1.1 | Repeat token penalty |
| topP | 0.9 | Nucleus sampling probability mass |
| topK | 40 | Top-K token candidates per step |
| systemPrompt | (ORCHESTRATION.SYSTEM_PROMPT) | Global system prompt. null reverts to hardcoded constant. |
Defaults are defined in config/settings.js and fall back to constants in
@nexusai/shared. Values saved in settings.json take precedence.
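
The precedence rule can be sketched as a small merge step. This is not the actual config/settings.js. The names DEFAULT_SETTINGS and mergeSettings are hypothetical, and DEFAULT_SYSTEM_PROMPT is a placeholder standing in for ORCHESTRATION.SYSTEM_PROMPT. The source only states that a null systemPrompt reverts to the constant. Applying the same rule to every key is an assumption made here for simplicity.

```javascript
// Placeholder for ORCHESTRATION.SYSTEM_PROMPT from @nexusai/shared.
const DEFAULT_SYSTEM_PROMPT = 'placeholder system prompt';

// Default values copied from the settings table above.
const DEFAULT_SETTINGS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.75,
  modelsFolderPath: '/mnt/nexus-models',
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
  systemPrompt: DEFAULT_SYSTEM_PROMPT,
};

// `saved` is the parsed content of settings.json, or an empty object
// if nothing has been saved yet.
function mergeSettings(saved = {}) {
  const merged = { ...DEFAULT_SETTINGS };
  for (const [key, value] of Object.entries(saved)) {
    // Assumption: null or undefined means "not set", so the default stays.
    if (value !== null && value !== undefined) merged[key] = value;
  }
  return merged;
}
```

Because the merge runs on each load, a value written to settings.json is picked up by the next request, and clearing it brings the default back.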
Chat Pipeline
Both POST /chat and POST /chat/stream share the same steps. The only
difference is how the inference response is delivered to the client.
Steps
- Session resolution: look up the session by externalId and auto-create it if not found. Clients generate a UUID for new conversations, so no pre-creation step is needed.
- Project context resolution: if the session has a project_id, fetch the project and all its session IDs. These are used to scope semantic search. The project's system_prompt is also read at this step if set.
- System prompt resolution: three-tier hierarchy, highest priority first:
  - project.system_prompt: used if the session is in a project and the prompt is set (highest priority)
  - settings.systemPrompt: global setting from settings.json
  - ORCHESTRATION.SYSTEM_PROMPT: hardcoded constant in @nexusai/shared (last resort)
- Recent episode retrieval: fetch the most recent episodes for the session (recentEpisodeLimit, default 5).
- Semantic search: embed the user message and query Qdrant for the most similar past episodes (at most semanticLimit results, each scoring at least scoreThreshold). Results are deduplicated against recent episodes. Non-critical: if it fails, the pipeline continues with recency-only context.
- Entity search: query the entities Qdrant collection filtered by projectId. Non-project sessions receive no entity context. Non-critical.
- Prompt assembly: combine the resolved system prompt, entity context, semantic episodes, recent episodes, and user message.
- Inference: send to the inference service with settings-derived parameters (temperature, topP, topK, repeatPenalty). /chat awaits the full response; /chat/stream pipes SSE chunks to the client.
- Episode write: write the exchange back to memory with projectId. Fire-and-forget for /chat; awaited for /chat/stream.
- Auto-naming: on isFirstMessage && !session.name, fire a secondary inference call with a naming prompt (max 20 tokens, temperature 0.3) and write the result back as session.name. Fully fire-and-forget.
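
The system prompt resolution and semantic deduplication steps above can be sketched as two small pure functions. The function names, the FALLBACK_SYSTEM_PROMPT placeholder, and the assumption that every episode carries a unique id field are illustrative and not taken from chat/index.js. Treating a blank string as unset is likewise an assumption.

```javascript
// Placeholder for ORCHESTRATION.SYSTEM_PROMPT from @nexusai/shared.
const FALLBACK_SYSTEM_PROMPT = 'placeholder system prompt';

// Three-tier resolution: project prompt, then global setting, then constant.
// `project` may be null or undefined for sessions outside a project.
function resolveSystemPrompt(project, settings) {
  const projectPrompt = project?.system_prompt?.trim();
  if (projectPrompt) return projectPrompt;
  const globalPrompt = settings?.systemPrompt?.trim();
  if (globalPrompt) return globalPrompt;
  return FALLBACK_SYSTEM_PROMPT;
}

// Drop semantic hits that already appear among the recent episodes.
// Assumes each episode has a unique `id` field.
function dedupeSemantic(semanticEpisodes, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((ep) => ep.id));
  return semanticEpisodes.filter((ep) => !recentIds.has(ep.id));
}
```

Keeping both steps free of I/O means the non-critical search calls can fail independently while the resolution logic stays easy to test.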
Prompt Structure
[Resolved system prompt]
Here is what you know about entities relevant to this conversation:
- {name} ({type}): {notes}
... (up to 5 entity results)
---
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to semanticLimit semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to recentEpisodeLimit recent episodes)
--- End of recent memories ---
User: {current message}
Assistant:
Entity context appears first — before episodic memory — because structured facts about known entities are the most stable and reliable context. Semantic episodes follow, then recent episodes as the immediate conversation flow.
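
A minimal sketch of how the layout above could be assembled follows. The section headers are copied from the prompt structure. The function names, the field names on entities and episodes (name, type, notes, userMessage, aiResponse), the single-newline joins, and the choice to omit a section when it has no items are all assumptions.

```javascript
// Render a list of episodes as alternating User/Assistant lines.
// Field names userMessage and aiResponse are assumed for illustration.
function formatEpisodes(episodes) {
  return episodes.flatMap((ep) => [
    `User: ${ep.userMessage}`,
    `Assistant: ${ep.aiResponse}`,
  ]);
}

function assemblePrompt({ systemPrompt, entities, semantic, recent, message }) {
  const lines = [systemPrompt];
  // Entities come first: stable, structured facts.
  if (entities.length > 0) {
    lines.push('Here is what you know about entities relevant to this conversation:');
    for (const e of entities) lines.push(`- ${e.name} (${e.type}): ${e.notes}`);
    lines.push('---');
  }
  // Then semantic matches from earlier conversations.
  if (semantic.length > 0) {
    lines.push('Here are some relevant memories from earlier conversations:');
    lines.push(...formatEpisodes(semantic));
    lines.push('---');
  }
  // Then the most recent episodes of this session.
  if (recent.length > 0) {
    lines.push('Here are some relevant memories from your past conversations:');
    lines.push(...formatEpisodes(recent));
    lines.push('--- End of recent memories ---');
  }
  lines.push(`User: ${message}`, 'Assistant:');
  return lines.join('\n');
}
```

The trailing "Assistant:" line leaves the completion open for the model to continue.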
SSE Stream Format
Inference service → orchestration:
data: {"response":"Hello","done":false}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
data: [DONE]
Orchestration → client:
data: {"text":"Hello"}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
The [DONE] sentinel is consumed internally and not forwarded. The stream
is terminated by res.end() after the done event.
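
The translation between the two formats can be sketched as a per-line function. The name translateSseLine is hypothetical, and skipping malformed chunks instead of aborting the stream is an assumption. The field names come from the examples above.

```javascript
// Translate one `data:` line from the inference service into the line
// forwarded to the client. Returns null when nothing should be forwarded.
// The caller is expected to append the blank line that ends an SSE event.
function translateSseLine(line) {
  if (!line.startsWith('data:')) return null;
  const payload = line.slice('data:'.length).trim();
  // The [DONE] sentinel is consumed here and never forwarded.
  if (payload === '[DONE]') return null;
  let event;
  try {
    event = JSON.parse(payload);
  } catch {
    // Assumption: malformed chunks are skipped rather than treated as fatal.
    return null;
  }
  if (event.done) {
    const { model, tokenCount } = event;
    return `data: ${JSON.stringify({ done: true, model, tokenCount })}`;
  }
  // Token chunks are renamed from `response` to `text` for the client.
  return `data: ${JSON.stringify({ text: event.response })}`;
}
```

A streaming handler would write each non-null result to the response and call res.end() once the done event has been written.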
Models Route
GET /models scans .gguf files live on each request from modelsFolderPath
(read from settings). Merges results with a models.json file in the same
folder for richer metadata (label, description). Returns file size in GB.
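
The merge step might look something like the sketch below. The shape of models.json is not documented here, so the array-of-entries layout keyed by a file field is an assumption, as are the function name mergeModels, the output field names, and the use of binary gigabytes rounded to two decimals.

```javascript
// Assumed shapes (not confirmed by the source):
//   scanned:  [{ file: 'a.gguf', sizeBytes: 123 }]   from the live folder scan
//   manifest: [{ file: 'a.gguf', label: '...', description: '...' }]
function mergeModels(scanned, manifest) {
  const byFile = new Map(manifest.map((entry) => [entry.file, entry]));
  return scanned
    .filter((m) => m.file.toLowerCase().endsWith('.gguf'))
    .map((m) => {
      const meta = byFile.get(m.file) || {};
      return {
        file: m.file,
        // Fall back to the file name when the manifest has no entry.
        label: meta.label || m.file,
        description: meta.description || '',
        // Binary gigabytes rounded to two decimals (an assumed convention).
        sizeGB: Math.round((m.sizeBytes / 1024 ** 3) * 100) / 100,
      };
    });
}
```

Because the folder scan drives the result, a model dropped into the folder shows up without a manifest entry, just with less metadata.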
GET /models/props fetches directly from llama-server via LLAMA_SERVER_URL
and returns { contextWindow, modelAlias }. The context window value (n_ctx)
is at data.default_generation_settings.n_ctx in the llama-server response.
The route returns 503 if llama-server is unreachable.
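
As a small illustration of the response mapping, the sketch below pulls the two fields out of a parsed llama-server body. Only the n_ctx path is taken from the text above. Reading the alias from data.model_alias is an assumption and may not match every llama-server version. The function name toModelProps and the null fallbacks are also hypothetical.

```javascript
// `data` is the parsed JSON body returned by llama-server.
function toModelProps(data) {
  return {
    // Path documented above.
    contextWindow: data?.default_generation_settings?.n_ctx ?? null,
    // Assumed field name; adjust to whatever the running server reports.
    modelAlias: data?.model_alias ?? null,
  };
}
```

Optional chaining keeps the mapping from throwing when a field is missing, so the route can reserve 503 for the case where llama-server cannot be reached at all.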
Sessions Route Behaviour
PATCH /sessions/:sessionId accepts either name, projectId, or both.
The validation guard only rejects requests where neither is provided:
if (!name?.trim() && projectId === undefined) {
  return res.status(400).json({ error: 'name or projectId is required' });
}
This allows useChat to write project assignment separately from rename
operations.
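
To see which request bodies pass the guard, the same condition can be extracted into a pure function. The name isValidSessionPatch is hypothetical. The logic mirrors the snippet above.

```javascript
// Same condition as the route guard, inverted to return true for valid bodies.
function isValidSessionPatch({ name, projectId } = {}) {
  return !(!name?.trim() && projectId === undefined);
}

// Bodies that pass and fail the guard:
const examples = [
  [{ name: 'Trip planning' }, true],                 // rename only
  [{ projectId: 'p1' }, true],                       // project assignment only
  [{ name: 'Trip planning', projectId: 'p1' }, true],
  [{ projectId: null }, true],                       // null is not undefined
  [{ name: '   ' }, false],                          // blank name, no projectId
  [{}, false],
];
```

Note that a body of { projectId: null } passes because the check is against undefined only. How the downstream service interprets a null projectId is not covered here.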
Caddy Configuration
Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /health* { reverse_proxy localhost:4000 }
After updating: caddy reload --config /path/to/Caddyfile
For all HTTP endpoints, see api-routes.md.