
# Orchestration Service

**Package:** `@nexusai/orchestration-service`
**Location:** `packages/orchestration-service`
**Deployed on:** Mini PC 2 (192.168.0.205)
**Port:** 4000

## Purpose

The orchestration service is the main entry point for all clients. It
assembles context packages from memory, routes prompts to inference, and
writes new episodes back to memory after each interaction. Clients never
talk directly to the memory or inference services; all traffic flows
through orchestration.

## Dependencies

- `express` — HTTP API
- `cors` — cross-origin resource sharing middleware
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 4000 | Port to listen on |
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
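
The table above can be read as a small config loader. The following is a minimal sketch, not the service's actual startup code: `loadConfig` and the property names on the returned object are hypothetical, and only the variable names and defaults come from the table. In the real service, `dotenv` would presumably populate `process.env` before something like this runs.

```javascript
// Hypothetical config loader. Variable names and defaults mirror the table;
// the function name and the shape of the returned object are assumptions.
function loadConfig(env = process.env) {
  // MODELS_MANIFEST_PATH is the only required variable, so fail fast without it.
  if (!env.MODELS_MANIFEST_PATH) {
    throw new Error("MODELS_MANIFEST_PATH is required");
  }
  return {
    // `||` treats an empty string the same as an unset variable.
    port: Number(env.PORT || 4000),
    memoryServiceUrl: env.MEMORY_SERVICE_URL || "http://localhost:3002",
    embeddingServiceUrl: env.EMBEDDING_SERVICE_URL || "http://localhost:3003",
    inferenceServiceUrl: env.INFERENCE_SERVICE_URL || "http://localhost:3001",
    qdrantUrl: env.QDRANT_URL || "http://localhost:6333",
    corsOrigin: env.CORS_ORIGIN || "http://localhost:5173",
    modelsManifestPath: env.MODELS_MANIFEST_PATH,
  };
}
```

Failing at startup when the required variable is missing surfaces the problem earlier than a `500` from `/models` would.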

## Internal Structure

```
src/
├── services/
│   ├── memory.js      # HTTP client for memory service
│   ├── inference.js   # HTTP client for inference service
│   ├── embedding.js   # HTTP client for embedding service
│   └── qdrant.js      # HTTP client for Qdrant vector search
├── chat/
│   └── index.js       # Core pipeline logic — context assembly and coordination
├── routes/
│   ├── chat.js        # POST /chat and POST /chat/stream route handlers
│   ├── sessions.js    # Session list, history, rename, and delete routes
│   ├── projects.js    # Project CRUD routes — proxies to memory service
│   └── models.js      # GET /models — reads models.json manifest from disk
└── index.js           # Express app entry point
```

The `services/` layer wraps all downstream HTTP calls in named functions,
keeping the pipeline logic in `chat/index.js` readable and ensuring that
URL or endpoint changes have a single place to be updated.
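
As an illustration of this wrapper pattern, a client module in the style of `services/memory.js` might look roughly like the sketch below. The factory name `createMemoryClient`, the injected `fetchImpl` parameter, and the memory service paths (`/sessions/:id/episodes`, `/episodes`) are all assumptions made for the example; the memory service's real API is documented separately.

```javascript
// Hypothetical services/ wrapper. The paths below are placeholders and are
// NOT the documented memory service API.
function createMemoryClient(baseUrl, fetchImpl = fetch) {
  // Single place where URLs are built and HTTP errors are turned into exceptions.
  async function request(path, options = {}) {
    const res = await fetchImpl(`${baseUrl}${path}`, {
      headers: { "Content-Type": "application/json" },
      ...options,
    });
    if (!res.ok) {
      throw new Error(`memory service responded ${res.status} for ${path}`);
    }
    return res.status === 204 ? null : res.json();
  }

  // Named functions keep call sites in the pipeline free of URL details.
  return {
    getRecentEpisodes: (sessionId, limit = 5) =>
      request(`/sessions/${sessionId}/episodes?limit=${limit}`),
    createEpisode: (episode) =>
      request("/episodes", { method: "POST", body: JSON.stringify(episode) }),
  };
}
```

With this shape, pipeline code calls `memory.getRecentEpisodes(id)` and never constructs a URL itself, so an endpoint rename touches one file.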

## Chat Pipeline

Both `POST /chat` and `POST /chat/stream` share the same context assembly
steps. The only difference is how the inference response is delivered to
the client.

1. **Session resolution** — looks up the session by `externalId` in the memory
   service. If not found, auto-creates a new session. Clients can generate a
   UUID for new conversations and pass it directly — no pre-creation step needed.

2. **Recent episode retrieval** — fetches the most recent episodes for the session
   (default: 5) from the memory service.

3. **Semantic search** — embeds the user message via the embedding service, then
   queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
   Results are deduplicated against the recent episode set using a `Set` of IDs.
   Full episode content is fetched from the memory service by ID. This step is
   non-critical — if it fails, a warning is logged and the pipeline continues with
   recency-only context.

4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
   recent episodes, and the current user message into a single prompt string.

5. **Inference** — sends the assembled prompt to the inference service. `/chat`
   awaits the full response; `/chat/stream` opens an SSE connection and pipes
   chunks to the client as they arrive.

6. **Episode write** — writes the new exchange (user message + AI response)
   back to the memory service as a fire-and-forget operation. For streaming,
   the full response text is accumulated across chunks before writing.

7. **Response** — returns the AI response, model name, session ID, and token
   count to the client.
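
Step 3 is the most involved part of the pipeline, so a sketch may help. The helper below is illustrative only: `findSemanticEpisodes` and the injected `embed`, `searchSimilar`, and `getEpisodeById` functions are hypothetical stand-ins for the `services/` clients, and their signatures are assumed. It shows the `Set`-based deduplication and the "log a warning and continue" behaviour described above.

```javascript
// Illustrative version of step 3. All dependency names and signatures are
// assumptions; only the limits and the fallback behaviour come from the docs.
async function findSemanticEpisodes(message, recentEpisodes, deps) {
  const { embed, searchSimilar, getEpisodeById, logger = console } = deps;
  try {
    const vector = await embed(message);
    const hits = await searchSimilar(vector, { limit: 5, scoreThreshold: 0.75 });

    // Drop hits that are already present in the recent episode set.
    const recentIds = new Set(recentEpisodes.map((episode) => episode.id));
    const freshIds = hits.map((hit) => hit.id).filter((id) => !recentIds.has(id));

    // Fetch the full episode content for each remaining hit.
    return await Promise.all(freshIds.map((id) => getEpisodeById(id)));
  } catch (err) {
    // Non-critical step: fall back to recency-only context.
    logger.warn(`semantic search failed, continuing without it: ${err.message}`);
    return [];
  }
}
```

Returning an empty array on failure means prompt assembly needs no special case; it simply sees no semantic episodes.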

## Prompt Structure

```
[System prompt]

Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 recent episodes)
--- End of recent memories ---

User: {current message}
Assistant:
```

Semantic episodes appear before recent episodes so the model encounters
long-range relevant context before the immediate conversation flow.
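
The template above could be produced by a function along these lines. This is a sketch under assumptions: `assemblePrompt` and `formatEpisodes` are hypothetical names, episodes are assumed to carry the `user_message` and `ai_response` fields shown in the history response, and omitting a memory section entirely when it has no episodes is a guess at the behaviour rather than something the template specifies.

```javascript
// Render episodes as alternating User/Assistant lines.
function formatEpisodes(episodes) {
  return episodes
    .map((ep) => `User: ${ep.user_message}\nAssistant: ${ep.ai_response}`)
    .join("\n");
}

// Hypothetical prompt builder following the documented template.
function assemblePrompt({ systemPrompt, semanticEpisodes = [], recentEpisodes = [], message }) {
  const memoryLines = [];
  // Semantic episodes go first so long-range context precedes recent turns.
  if (semanticEpisodes.length > 0) {
    memoryLines.push(
      "Here are some relevant memories from earlier conversations:",
      formatEpisodes(semanticEpisodes),
      "---"
    );
  }
  if (recentEpisodes.length > 0) {
    memoryLines.push(
      "Here are some relevant memories from your past conversations:",
      formatEpisodes(recentEpisodes),
      "--- End of recent memories ---"
    );
  }

  const sections = [systemPrompt];
  if (memoryLines.length > 0) sections.push(memoryLines.join("\n"));
  // The trailing "Assistant:" cues the model to produce the next turn.
  sections.push(`User: ${message}\nAssistant:`);
  return sections.join("\n\n");
}
```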

## SSE Stream Format

The inference service emits chunks from the llama.cpp provider in this format:
```
data: {"response":"Hello","done":false}
data: {"response":"!","done":false}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
data: [DONE]
```

The orchestration service re-emits to the client as:
```
data: {"text":"Hello"}
data: {"text":"!"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
```

The `[DONE]` sentinel from the inference service is consumed internally
and not forwarded. The client stream is terminated by `res.end()` after
the done event. Model name and token count are included on the done event
so the client can display them in the UI.
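
The translation between the two formats can be sketched as a pure function over single SSE lines. `translateSseLine` is a hypothetical name, and silently dropping lines that are not valid JSON is an assumption; the field mapping (`response` to `text`, pass-through of `model` and `tokenCount`, and swallowing `[DONE]`) follows the examples above.

```javascript
// Translate one SSE line from the inference service into the text to write
// to the client, or null when nothing should be forwarded.
function translateSseLine(line) {
  if (!line.startsWith("data:")) return null;
  const payload = line.slice("data:".length).trim();

  // The [DONE] sentinel is consumed here and never reaches the client.
  if (payload === "" || payload === "[DONE]") return null;

  let event;
  try {
    event = JSON.parse(payload);
  } catch {
    return null; // assumption: malformed chunks are skipped rather than fatal
  }

  if (event.done) {
    const done = { done: true, model: event.model, tokenCount: event.tokenCount };
    return `data: ${JSON.stringify(done)}\n\n`;
  }
  return `data: ${JSON.stringify({ text: event.response })}\n\n`;
}
```

A route handler would call `res.write()` with each non-null result and `res.end()` after forwarding the done event.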

## Models Manifest

The `/models` endpoint reads a `models.json` file from disk at the path
specified by `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside
the model files, and is accessible to orchestration via a network share
mounted at `/mnt/nexus-models`.

The manifest is read fresh on each request — no restart needed when models
are added or removed.

**models.json format:**
```json
[
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```

- `value` — must match the model name as reported by `llama-server` (including `.gguf` extension)
- `label` — display name shown in the UI
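
Reading the manifest on every request is straightforward with `fs/promises`. The sketch below assumes an Express-style `(req, res)` handler; the function names, the array check, and the shape of the error body are illustrative assumptions, while the `500` on read or parse failure matches the behaviour documented for `GET /models`.

```javascript
const fs = require("node:fs/promises");

// Read and parse the manifest from disk. Nothing is cached, so edits to
// models.json are picked up on the next request.
async function readModelsManifest(manifestPath) {
  const raw = await fs.readFile(manifestPath, "utf8");
  const models = JSON.parse(raw);
  if (!Array.isArray(models)) {
    throw new Error("models.json must contain a JSON array");
  }
  return models;
}

// Express-style handler: any read or parse failure becomes a 500.
async function handleGetModels(req, res) {
  try {
    res.json(await readModelsManifest(process.env.MODELS_MANIFEST_PATH));
  } catch (err) {
    // The error body shape is an assumption, not the documented response.
    res.status(500).json({ error: "Could not read models manifest" });
  }
}
```

Because the file sits on a network share, an unmounted share surfaces here as a read error and therefore as a `500`.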

## Endpoints

### Health

| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports downstream service URLs |

### Chat

| Method | Path | Description |
|---|---|---|
| POST | /chat | Send a message and receive a complete response |
| POST | /chat/stream | Send a message and receive a streaming SSE response |

### Sessions

| Method | Path | Description |
|---|---|---|
| GET | /sessions | Get paginated list of all sessions |
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
| PATCH | /sessions/:sessionId | Rename a session |
| DELETE | /sessions/:sessionId | Delete a session and all its episodes |

### Projects

Project routes are proxied directly to the memory service with no transformation.

| Method | Path | Description |
|---|---|---|
| GET | /projects | Get all projects |
| POST | /projects | Create a new project |
| PATCH | /projects/:id | Update a project |
| DELETE | /projects/:id | Delete a project |

### Models

| Method | Path | Description |
|---|---|---|
| GET | /models | Get list of available models from manifest file |

---

**POST /chat**

Request body:
```json
{
  "sessionId": "your-session-uuid",
  "message": "Hello, my name is Tim.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7
}
```

`model` and `temperature` are optional. If omitted, the inference service
defaults are used.

Response:
```json
{
  "sessionId": "your-session-uuid",
  "response": "Hello Tim! How can I help you today?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "tokenCount": 87
}
```
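
A client call might look like the following sketch. `sendChat`, `baseUrl`, and the injectable `fetchImpl` are hypothetical names introduced for illustration. Generating the session ID with `randomUUID()` relies on the auto-creation behaviour described under Chat Pipeline.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical client helper for POST /chat. A fresh UUID starts a new
// conversation; pass an existing sessionId to continue one.
async function sendChat(
  baseUrl,
  { sessionId = randomUUID(), message, model, temperature },
  fetchImpl = fetch
) {
  const res = await fetchImpl(`${baseUrl}/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // JSON.stringify drops undefined values, so omitted options are not sent
    // and the inference service defaults apply.
    body: JSON.stringify({ sessionId, message, model, temperature }),
  });
  if (!res.ok) {
    throw new Error(`POST /chat failed with status ${res.status}`);
  }
  return res.json(); // { sessionId, response, model, tokenCount }
}
```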

---

**POST /chat/stream**

Same request body as `POST /chat`.

Response is a stream of Server-Sent Events:
```
data: {"text":"Hello"}
data: {"text":" Tim"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
```
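
Because the endpoint is a `POST`, the browser `EventSource` API (which only issues `GET` requests) cannot be used directly; a client typically reads the response body itself. The sketch below is one way to do that. `createSseParser` and `streamChat` are hypothetical names, and the parser assumes each event is a single `data:` line as in the example above.

```javascript
// Incremental parser: feed it decoded text chunks and it calls onEvent for
// every complete `data:` line, buffering any partial line between chunks.
function createSseParser(onEvent) {
  let buffer = "";
  return function feed(chunk) {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop(); // last element is an incomplete line (or "")
    for (const line of lines) {
      if (line.startsWith("data:")) {
        onEvent(JSON.parse(line.slice("data:".length).trim()));
      }
    }
  };
}

// Hypothetical client: streams text to onText and resolves with the done event.
async function streamChat(baseUrl, body, onText) {
  const res = await fetch(`${baseUrl}/chat/stream`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`POST /chat/stream failed with status ${res.status}`);

  let doneEvent = null;
  const feed = createSseParser((event) => {
    if (event.done) doneEvent = event;
    else onText(event.text);
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    feed(decoder.decode(value, { stream: true }));
  }
  return doneEvent; // { done, model, tokenCount } or null if the stream was cut short
}
```

Buffering partial lines matters because network chunks do not necessarily align with event boundaries.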

---

**PATCH /sessions/:sessionId**

Request body:
```json
{ "name": "My Renamed Session" }
```

Returns the updated session object. `name` is required; leading and trailing whitespace is trimmed.
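
The validation rule can be sketched as a small helper. `parseSessionName` is a hypothetical name, and the `400` status and error message for a missing name are assumptions; the documentation above only states that `name` is required and trimmed.

```javascript
// Validate and normalise the rename payload.
// Assumption: a missing or blank name is rejected with 400.
function parseSessionName(body) {
  const name = typeof body?.name === "string" ? body.name.trim() : "";
  if (name === "") {
    return { ok: false, status: 400, error: "name is required" };
  }
  return { ok: true, name };
}
```

Trimming before the emptiness check means a whitespace-only name is treated the same as a missing one.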

---

**DELETE /sessions/:sessionId**

Returns `204 No Content`. Cascades to delete all episodes for the session.

---

**GET /sessions/:sessionId/history**

Query parameters:

| Parameter | Default | Description |
|---|---|---|
| limit | 20 | Maximum number of episodes to return |
| offset | 0 | Number of episodes to skip (for pagination) |

Response:
```json
{
  "sessionId": "your-session-uuid",
  "episodes": [
    {
      "id": 42,
      "session_id": 1,
      "user_message": "Hello, my name is Tim.",
      "ai_response": "Hello Tim! How can I help you today?",
      "token_count": 87,
      "created_at": 1712345678,
      "metadata": null
    }
  ]
}
```

Episodes are ordered newest first.
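
To page through a long session, a client can step `offset` by `limit`. The helpers below are a sketch: `historyUrl` and `fetchAllEpisodes` are hypothetical names, and stopping when a page comes back shorter than the page size is an assumption, since the response shown above carries no total count.

```javascript
// Build the history URL with explicit pagination parameters.
function historyUrl(baseUrl, sessionId, { limit = 20, offset = 0 } = {}) {
  const url = new URL(`/sessions/${encodeURIComponent(sessionId)}/history`, baseUrl);
  url.searchParams.set("limit", String(limit));
  url.searchParams.set("offset", String(offset));
  return url.toString();
}

// Fetch every episode for a session, newest first, one page at a time.
async function fetchAllEpisodes(baseUrl, sessionId, pageSize = 20, fetchImpl = fetch) {
  const all = [];
  for (let offset = 0; ; offset += pageSize) {
    const res = await fetchImpl(historyUrl(baseUrl, sessionId, { limit: pageSize, offset }));
    if (!res.ok) throw new Error(`history request failed with status ${res.status}`);
    const { episodes } = await res.json();
    all.push(...episodes);
    // Assumption: a short page means there is nothing further to fetch.
    if (episodes.length < pageSize) return all;
  }
}
```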

---

**GET /models**

Returns the parsed contents of `models.json`:
```json
[
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```

Returns `500` if the manifest file cannot be read or parsed.

## Caddy Configuration

The Caddy reverse proxy on Mini PC 2 must have a handle block for each route
prefix the client needs to reach. Current required blocks:

```
handle /chat* {
    reverse_proxy localhost:4000
}
handle /sessions* {
    reverse_proxy localhost:4000
}
handle /models* {
    reverse_proxy localhost:4000
}
handle /projects* {
    reverse_proxy localhost:4000
}
```

When adding new top-level routes to the orchestration service, add a matching
block here and reload Caddy: `caddy reload --config /path/to/Caddyfile`