nexusAI/docs/services/orchestration-service.md
2026-04-13 03:42:14 -07:00

Orchestration Service

Package: @nexusai/orchestration-service
Location: packages/orchestration-service
Deployed on: Mini PC 2 (192.168.0.205)
Port: 4000

Purpose

The main entry point for all clients. Assembles context packages from memory, routes prompts to inference, and writes new episodes back to memory after each interaction. Clients never talk directly to the memory or inference services — all traffic flows through orchestration.

Dependencies

  • express — HTTP API
  • cors — cross-origin resource sharing middleware
  • dotenv — environment variable loading
  • @nexusai/shared — shared utilities

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 4000 | Port to listen on |
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | Yes | (none) | Path to models.json manifest file |
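
The defaults above could be applied in one place at startup. The following is a minimal sketch, not the service's actual code: `loadConfig` and the property names on the returned object are hypothetical, and only the variable names and default values come from the table.

```javascript
// Sketch only: loadConfig is a hypothetical helper, not taken from the service.
function loadConfig(env) {
  // MODELS_MANIFEST_PATH is the only required variable, so fail fast without it.
  if (!env.MODELS_MANIFEST_PATH) {
    throw new Error("MODELS_MANIFEST_PATH is required");
  }
  return {
    port: Number(env.PORT ?? 4000),
    memoryServiceUrl: env.MEMORY_SERVICE_URL ?? "http://localhost:3002",
    embeddingServiceUrl: env.EMBEDDING_SERVICE_URL ?? "http://localhost:3003",
    inferenceServiceUrl: env.INFERENCE_SERVICE_URL ?? "http://localhost:3001",
    qdrantUrl: env.QDRANT_URL ?? "http://localhost:6333",
    corsOrigin: env.CORS_ORIGIN ?? "http://localhost:5173",
    modelsManifestPath: env.MODELS_MANIFEST_PATH,
  };
}

// Typical call once dotenv has populated process.env:
// const config = loadConfig(process.env);
```

Failing at startup when the required variable is missing surfaces a misconfiguration before the first GET /models request, rather than as a 500 at request time.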

Internal Structure

src/
├── services/
│   ├── memory.js      # HTTP client for memory service
│   ├── inference.js   # HTTP client for inference service
│   ├── embedding.js   # HTTP client for embedding service
│   └── qdrant.js      # HTTP client for Qdrant vector search
├── chat/
│   └── index.js       # Core pipeline logic — context assembly and coordination
├── routes/
│   ├── chat.js        # POST /chat and POST /chat/stream route handlers
│   ├── sessions.js    # Session list, history, rename, and delete routes
│   └── models.js      # GET /models — reads models.json manifest from disk
└── index.js           # Express app entry point

The services/ layer wraps all downstream HTTP calls in named functions, keeping the pipeline logic in chat/index.js readable and ensuring that URL or endpoint changes have a single place to be updated.

Chat Pipeline

Both POST /chat and POST /chat/stream share the same context assembly steps. The only difference is how the inference response is delivered to the client.

  1. Session resolution — looks up the session by externalId in the memory service. If not found, auto-creates a new session. Clients can generate a UUID for new conversations and pass it directly — no pre-creation step needed.

  2. Recent episode retrieval — fetches the most recent episodes for the session (default: 5) from the memory service.

  3. Semantic search — embeds the user message via the embedding service, then queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75). Results are deduplicated against the recent episode set using a Set of IDs. Full episode content is fetched from the memory service by ID. This step is non-critical — if it fails, a warning is logged and the pipeline continues with recency-only context.

  4. Prompt assembly — combines the system prompt, semantic episodes (if any), recent episodes, and the current user message into a single prompt string.

  5. Inference — sends the assembled prompt to the inference service. /chat awaits the full response; /chat/stream opens an SSE connection and pipes chunks to the client as they arrive.

  6. Episode write — writes the new exchange (user message + AI response) back to the memory service as a fire-and-forget operation. For streaming, the full response text is accumulated across chunks before writing.

  7. Response — returns the AI response, model name, session ID, and token count to the client.
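
Step 3 combines two ideas: deduplication against the recent set, and tolerance of failure. A minimal sketch of both follows. The function names, the `{ id, score }` hit shape, and the injected `deps.embed` and `deps.search` clients are assumptions made for illustration; only the limit of 5, the 0.75 threshold, the Set-based deduplication, and the warn-and-continue behavior come from the description above.

```javascript
// Hypothetical helper: keep only hits that are not already in the recent set.
function dedupeAgainstRecent(hits, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((episode) => episode.id));
  return hits.filter((hit) => !recentIds.has(hit.id)).map((hit) => hit.id);
}

// Hypothetical wrapper: deps.embed and deps.search stand in for the
// embedding and Qdrant clients, whose real signatures are not shown here.
async function semanticEpisodeIds(message, recentEpisodes, deps) {
  try {
    const vector = await deps.embed(message);
    const hits = await deps.search(vector, { limit: 5, scoreThreshold: 0.75 });
    return dedupeAgainstRecent(hits, recentEpisodes);
  } catch (err) {
    // Non-critical step: log and fall back to recency-only context.
    console.warn("semantic search failed, continuing without it:", err.message);
    return [];
  }
}
```

Returning an empty array on failure lets prompt assembly treat "search failed" and "no similar episodes" identically.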

Prompt Structure

[System prompt]

Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 recent episodes)
--- End of recent memories ---

User: {current message}
Assistant:

Semantic episodes appear before recent episodes so the model encounters long-range relevant context before the immediate conversation flow.
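
The layout above could be produced by a small pure function. This is a sketch under assumptions: `assemblePrompt` and `formatEpisode` are hypothetical names, episodes are assumed to carry the `user_message` and `ai_response` fields seen in the history response, and omitting a section header when its episode list is empty is a guess rather than documented behavior.

```javascript
// Hypothetical helpers; the header strings are copied from the layout above.
function formatEpisode(episode) {
  return `User: ${episode.user_message}\nAssistant: ${episode.ai_response}`;
}

function assemblePrompt({ systemPrompt, semanticEpisodes = [], recentEpisodes = [], message }) {
  const lines = [systemPrompt, ""];
  if (semanticEpisodes.length > 0) {
    lines.push("Here are some relevant memories from earlier conversations:");
    lines.push(...semanticEpisodes.map(formatEpisode));
    lines.push("---");
  }
  if (recentEpisodes.length > 0) {
    lines.push("Here are some relevant memories from your past conversations:");
    lines.push(...recentEpisodes.map(formatEpisode));
    lines.push("--- End of recent memories ---");
  }
  if (semanticEpisodes.length > 0 || recentEpisodes.length > 0) {
    lines.push(""); // blank line between the memory sections and the new turn
  }
  // The trailing "Assistant:" cues the model to write the next reply.
  lines.push(`User: ${message}`, "Assistant:");
  return lines.join("\n");
}
```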

SSE Stream Format

The inference service emits chunks from the llama.cpp provider in this format:

data: {"response":"Hello","done":false}
data: {"response":"!","done":false}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
data: [DONE]

The orchestration service re-emits to the client as:

data: {"text":"Hello"}
data: {"text":"!"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}

The [DONE] sentinel from the inference service is consumed internally and not forwarded. The client stream is terminated by res.end() after the done event. Model name and token count are included on the done event so the client can display them in the UI.
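
The translation between the two formats can be sketched as a per-line function. `translateInferenceLine` is a hypothetical name, and the sketch assumes each inference event arrives as one complete `data: ` line; a real handler would also need to buffer partial lines split across network chunks.

```javascript
// Hypothetical translator for one line of the inference stream.
// Returns the line to forward to the client, or null when nothing is sent.
function translateInferenceLine(line) {
  if (!line.startsWith("data: ")) return null; // blank separators, comments
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null; // sentinel is consumed, not forwarded
  const chunk = JSON.parse(payload);
  if (chunk.done) {
    const done = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
    return `data: ${JSON.stringify(done)}`;
  }
  return `data: ${JSON.stringify({ text: chunk.response })}`;
}
```

When writing the result to the response, each event needs a trailing blank line (`"\n\n"`), which is how Server-Sent Events delimit messages.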

Models Manifest

The /models endpoint reads a models.json file from disk at the path specified by MODELS_MANIFEST_PATH. The file lives on the main PC alongside the model files, and is accessible to orchestration via a network share mounted at /mnt/nexus-models.

The manifest is read fresh on each request — no restart needed when models are added or removed.

models.json format:

[
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
  • value — must match the model name as reported by llama-server (including .gguf extension)
  • label — display name shown in the UI
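
Reading the manifest on every request could look like the sketch below. `loadModels` is a hypothetical helper, the file reader is injected so that no particular fs call is implied, and the shape check on `value` and `label` is an added assumption rather than documented behavior.

```javascript
// Hypothetical helper: readText is injected, e.g. (p) => fs.readFileSync(p, "utf8").
function loadModels(readText, manifestPath) {
  // No caching: the file is read and parsed again on every call.
  const models = JSON.parse(readText(manifestPath));
  if (!Array.isArray(models)) {
    throw new Error("models.json must contain an array");
  }
  for (const entry of models) {
    // Assumed validation: both fields are expected to be strings.
    if (typeof entry.value !== "string" || typeof entry.label !== "string") {
      throw new Error("each model needs string value and label fields");
    }
  }
  return models;
}
```

A route handler would wrap this call in try/catch and answer with 500 on any thrown error, which covers both an unreadable file and invalid JSON.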

Endpoints

Health

| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check; reports downstream service URLs |

Chat

| Method | Path | Description |
|---|---|---|
| POST | /chat | Send a message and receive a complete response |
| POST | /chat/stream | Send a message and receive a streaming SSE response |

Sessions

| Method | Path | Description |
|---|---|---|
| GET | /sessions | Get paginated list of all sessions |
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
| PATCH | /sessions/:sessionId | Rename a session |
| DELETE | /sessions/:sessionId | Delete a session and all its episodes |

Models

| Method | Path | Description |
|---|---|---|
| GET | /models | Get list of available models from manifest file |

POST /chat

Request body:

{
  "sessionId": "your-session-uuid",
  "message": "Hello, my name is Tim.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7
}

model and temperature are optional. If either is omitted, the inference service's default is used.

Response:

{
  "sessionId": "your-session-uuid",
  "response": "Hello Tim! How can I help you today?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "tokenCount": 87
}

POST /chat/stream

Same request body as POST /chat.

Response is a stream of Server-Sent Events:

data: {"text":"Hello"}
data: {"text":" Tim"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
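
On the client side, the events can be folded into a single result. The sketch below uses a hypothetical `collectStream` helper and, for simplicity, operates on the complete stream text; a real client would read incrementally and handle lines split across chunks.

```javascript
// Hypothetical client-side helper: folds raw SSE text into one result object.
function collectStream(rawSse) {
  const result = { text: "", model: null, tokenCount: null, done: false };
  for (const line of rawSse.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank separator lines
    const event = JSON.parse(line.slice("data: ".length));
    if (event.done) {
      // Final event carries the metadata shown in the UI.
      result.done = true;
      result.model = event.model;
      result.tokenCount = event.tokenCount;
    } else if (typeof event.text === "string") {
      result.text += event.text;
    }
  }
  return result;
}
```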

PATCH /sessions/:sessionId

Request body:

{ "name": "My Renamed Session" }

Returns the updated session object. name is required; leading and trailing whitespace is trimmed.
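
The name rule can be sketched as a small validator. `normalizeSessionName` is a hypothetical helper, and rejecting a name that is empty after trimming is an assumption; the error response the service actually sends is not specified here.

```javascript
// Hypothetical validator for the PATCH body's name field.
function normalizeSessionName(name) {
  if (typeof name !== "string") {
    throw new Error("name is required");
  }
  const trimmed = name.trim();
  if (trimmed === "") {
    // Assumed: a whitespace-only name counts as missing.
    throw new Error("name is required");
  }
  return trimmed;
}
```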


DELETE /sessions/:sessionId

Returns 204 No Content. Cascades to delete all episodes for the session.


GET /sessions/:sessionId/history

Query parameters:

| Parameter | Default | Description |
|---|---|---|
| limit | 20 | Maximum number of episodes to return |
| offset | 0 | Number of episodes to skip (for pagination) |

Response:

{
  "sessionId": "your-session-uuid",
  "episodes": [
    {
      "id": 42,
      "session_id": 1,
      "user_message": "Hello, my name is Tim.",
      "ai_response": "Hello Tim! How can I help you today?",
      "token_count": 87,
      "created_at": 1712345678,
      "metadata": null
    }
  ]
}

Episodes are ordered newest first.
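
A client paging through history might build its request URLs as sketched below. `historyUrl` is a hypothetical helper and the base URL in the usage comment is only an example; the path and the limit and offset defaults come from this section.

```javascript
// Hypothetical client helper for building a paginated history URL.
function historyUrl(baseUrl, sessionId, { limit = 20, offset = 0 } = {}) {
  const path = `/sessions/${encodeURIComponent(sessionId)}/history`;
  const url = new URL(path, baseUrl);
  url.searchParams.set("limit", String(limit));
  url.searchParams.set("offset", String(offset));
  return url.toString();
}

// Third page of 20 episodes (skips the 40 newest):
// historyUrl("http://localhost:4000", "your-session-uuid", { offset: 40 })
```

Because episodes are returned newest first, increasing offset walks backward in time through the conversation.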


GET /models

Returns the parsed contents of models.json:

[
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]

Returns 500 if the manifest file cannot be read or parsed.