
# Orchestration Service

**Package:** `@nexusai/orchestration-service`
**Location:** `packages/orchestration-service`
**Deployed on:** Mini PC 2 (192.168.0.205)
**Port:** 4000

## Purpose

The orchestration service is the main entry point for all clients. It
assembles context packages from memory, routes prompts to inference, and
writes new episodes back to memory after each interaction. Clients never
talk directly to the memory or inference services; all traffic flows
through orchestration.

## Dependencies

- `express` — HTTP API
- `cors` — cross-origin resource sharing middleware
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 4000 | Port to listen on |
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| EXTRACTION_URL | No | http://localhost:11434 | Ollama URL for summarisation |
| EXTRACTION_MODEL | No | qwen2.5:3b | Ollama model used for summarisation |
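
The defaults in the table can be applied in one place at startup. The sketch below is illustrative only: `loadConfig` and the property names are hypothetical, and it assumes an empty variable should be treated the same as an unset one.

```javascript
// Sketch: resolve configuration from environment variables, falling back
// to the defaults listed in the table. `loadConfig` is a hypothetical name.
function loadConfig(env = process.env) {
  return {
    port: Number(env.PORT || 4000),
    memoryServiceUrl: env.MEMORY_SERVICE_URL || 'http://localhost:3002',
    embeddingServiceUrl: env.EMBEDDING_SERVICE_URL || 'http://localhost:3003',
    inferenceServiceUrl: env.INFERENCE_SERVICE_URL || 'http://localhost:3001',
    llamaServerUrl: env.LLAMA_SERVER_URL || 'http://localhost:8080',
    qdrantUrl: env.QDRANT_URL || 'http://localhost:6333',
    corsOrigin: env.CORS_ORIGIN || 'http://localhost:5173',
    extractionUrl: env.EXTRACTION_URL || 'http://localhost:11434',
    extractionModel: env.EXTRACTION_MODEL || 'qwen2.5:3b',
  };
}
```

Passing the environment as a parameter keeps the function easy to test, for example `loadConfig({ PORT: '5000' })`.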

## Internal Structure

```
src/
├── services/
│   ├── memory.js         # HTTP client for memory service
│   ├── inference.js      # HTTP client for inference service
│   ├── embedding.js      # HTTP client for embedding service
│   ├── qdrant.js         # HTTP client for Qdrant (direct vector search)
│   ├── graph.js          # HTTP client for memory-service graph endpoints
│   └── summarization.js  # Session summarisation — triggers after each episode
├── chat/
│   └── index.js          # Core pipeline — context assembly, graph expansion, auto-naming
├── config/
│   └── settings.js       # Settings load/save — reads/writes data/settings.json
├── routes/
│   ├── chat.js           # POST /chat and POST /chat/stream
│   ├── sessions.js       # Session CRUD proxy
│   ├── projects.js       # Project CRUD proxy
│   ├── episodes.js       # Episode list and delete proxy
│   ├── summaries.js      # GET /summaries/session/:id and /summaries/project/:id
│   ├── settings.js       # GET /settings and PATCH /settings
│   ├── health.js         # GET /health/services — pings all four services
│   └── models.js         # GET /models and GET /models/props
└── index.js              # Express app entry point
```

The `services/` layer wraps every downstream HTTP call in a named function,
so a URL or endpoint change only needs to be made in one place.

## Settings

Settings are persisted to `data/settings.json` and loaded on every request
via `appSettings.load()` — changes apply immediately without a service restart.

| Setting | Default | Description |
|---|---|---|
| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
| `semanticLimit` | 5 | Semantic search results injected into prompt |
| `scoreThreshold` | 0.5 | Minimum similarity score for Qdrant semantic results |
| `semanticWeight` | 1.0 | RRF weight for Qdrant semantic results |
| `keywordWeight` | 0 | RRF weight for FTS5 keyword results (`0` = disabled) |
| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
| `temperature` | 0.7 | Inference temperature |
| `repeatPenalty` | 1.1 | Repeat token penalty |
| `topP` | 0.9 | Nucleus sampling probability mass |
| `topK` | 40 | Top-K token candidates per step |
| `systemPrompt` | *(ORCHESTRATION.SYSTEM_PROMPT)* | Global system prompt. `null` reverts to hardcoded constant. |
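
A per-request load that merges stored values over the defaults could look like the sketch below. This is not the actual `appSettings.load()` implementation: the function name, the fallback on a missing or malformed file, and storing `systemPrompt` as `null` by default are all assumptions made for illustration.

```javascript
const fs = require('node:fs');

// Defaults copied from the table above.
const DEFAULT_SETTINGS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.5,
  semanticWeight: 1.0,
  keywordWeight: 0,
  modelsFolderPath: '/mnt/nexus-models',
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
  systemPrompt: null, // assumption: null means "use the hardcoded constant"
};

// Hypothetical loader: the file is read on every call, so an edit to
// settings.json takes effect on the next request without a restart.
function loadSettings(filePath = 'data/settings.json') {
  let stored = {};
  try {
    stored = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  } catch (err) {
    // Assumed behaviour: a missing or malformed file falls back to defaults.
  }
  return { ...DEFAULT_SETTINGS, ...stored };
}
```

Because stored values are spread over the defaults, a settings file only needs to contain the keys that differ from the table.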

## Chat Pipeline

Both `POST /chat` and `POST /chat/stream` share the same steps. The only
difference is how the inference response is delivered to the client.

### Steps

1. **Session resolution** — look up session by `externalId`. Auto-create if
   not found.

2. **Project context resolution** — if the session has a `project_id`, fetch
   the project and all its session IDs. Used to scope semantic search. The
   project's `system_prompt` is also read at this step if set.

3. **System prompt resolution** — three-tier hierarchy:
   - `project.system_prompt` — highest priority
   - `settings.systemPrompt` — global setting from `settings.json`
   - `ORCHESTRATION.SYSTEM_PROMPT` — hardcoded constant (last resort)

4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).

5. **Fused episode retrieval** — runs semantic (Qdrant) and keyword (FTS5)
   search in parallel, then merges results via Reciprocal Rank Fusion (RRF).
   Results whose IDs are already in `recentIds` (the episodes from step 4) are
   dropped from both lists before fusion. FTS is scoped to the current session,
   or to all project sessions when the session belongs to a project. If
   `keywordWeight` is `0`, the FTS call is skipped entirely. Non-critical: if
   one search fails, the results of the other are used on their own.

6. **Entity search** — query `entities` Qdrant collection filtered by
   `projectId`. Returns entity IDs alongside Qdrant payload data (the Qdrant
   point ID equals the SQLite entity ID). Non-critical.

7. **Graph neighborhood expansion** — call `POST /graph/neighbors` on
   memory-service with the entity IDs from step 6. Returns a 1-hop subgraph
   `{ nodes, edges }` — entity objects plus the relationships connecting them.
   If no entities were found or the graph call fails, falls back to flat entity
   list (no edges). Non-critical.

8. **Prompt assembly** — combine system prompt, graph context, fused episodes,
   recent episodes, and user message.

9. **Inference** — send to inference service. `/chat` awaits full response;
   `/chat/stream` pipes SSE chunks to the client.

10. **Episode write** — write exchange back to memory with `projectId`.

11. **Summarisation trigger** — `triggerSummary(session, allEpisodes)` called
    fire-and-forget. See `summarization.md` for full details.

12. **Auto-naming** — on first message with no session name, fires a secondary
    inference call (max 20 tokens, temperature 0.3) to generate a session name.
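
The fusion in step 5 can be illustrated with a weighted form of Reciprocal Rank Fusion, where each list contributes `weight / (k + rank)` to an item's score. The sketch below is an assumption-laden illustration rather than the service's code: the function name, the `{ id }` item shape, and the constant `k = 60` (a commonly used value) are not taken from the source.

```javascript
// Weighted Reciprocal Rank Fusion (sketch).
// Each list adds weight / (k + rank) to an item's score; rank starts at 1.
// Items already present in recentIds are removed before ranks are assigned.
function fuseByRrf(semantic, keyword, options = {}) {
  const {
    semanticWeight = 1.0,
    keywordWeight = 0,
    k = 60, // assumption: the service's actual constant is not documented
    recentIds = new Set(),
  } = options;

  const scores = new Map();
  const addList = (items, weight) => {
    if (weight <= 0) return; // a weight of 0 disables that strategy
    items
      .filter((item) => !recentIds.has(item.id))
      .forEach((item, index) => {
        const entry = scores.get(item.id) || { item, score: 0 };
        entry.score += weight / (k + index + 1);
        scores.set(item.id, entry);
      });
  };

  addList(semantic, semanticWeight);
  addList(keyword, keywordWeight);

  // Highest fused score first.
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map(({ item, score }) => ({ ...item, rrfScore: score }));
}
```

An item that appears in both lists accumulates two contributions, so it tends to outrank items found by only one strategy.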

### Prompt Structure

```
[Resolved system prompt]

Here is what you know about entities relevant to this conversation and their connections:
- {name} ({type}): {notes}
  → {label} {neighbor_name} ({neighbor_type})
---
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
--- End of recent memories ---

User: {current message}
Assistant:
```

The entity block renders the full graph neighborhood — seed entities matched
by Qdrant search plus any neighbors pulled in by 1-hop traversal. Each entity
shows its `notes` and any outbound relationships with their targets. Neighbor
nodes that have no outbound edges within the subgraph appear without connection
lines.
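
As a rough illustration of how the template could be assembled, the sketch below resolves the system prompt and renders each section in order. The object shapes are assumptions: nodes as `{ id, name, type, notes }`, edges as `{ from, to, label }`, and episodes as `{ user, assistant }`. It also always emits every section, whereas the real pipeline may omit empty ones.

```javascript
// Three-tier system prompt resolution: project, then global setting,
// then the hardcoded constant.
function resolveSystemPrompt(project, settings, hardcoded) {
  return (project && project.system_prompt) || (settings && settings.systemPrompt) || hardcoded;
}

// Render each entity with its outbound relationships inside the subgraph.
// Assumed shapes: node = { id, name, type, notes }, edge = { from, to, label }.
function renderGraph({ nodes = [], edges = [] }) {
  const byId = new Map(nodes.map((node) => [node.id, node]));
  return nodes
    .map((node) => {
      const lines = [`- ${node.name} (${node.type}): ${node.notes}`];
      for (const edge of edges) {
        const target = byId.get(edge.to);
        if (edge.from === node.id && target) {
          lines.push(`  → ${edge.label} ${target.name} (${target.type})`);
        }
      }
      return lines.join('\n');
    })
    .join('\n');
}

// Assumed episode shape: { user, assistant }.
function renderEpisodes(episodes) {
  return episodes.map((e) => `User: ${e.user}\nAssistant: ${e.assistant}`).join('\n');
}

function buildPrompt({ systemPrompt, graph, fused, recent, message }) {
  return [
    systemPrompt,
    '',
    'Here is what you know about entities relevant to this conversation and their connections:',
    renderGraph(graph),
    '---',
    'Here are some relevant memories from earlier conversations:',
    renderEpisodes(fused),
    '---',
    'Here are some relevant memories from your past conversations:',
    renderEpisodes(recent),
    '--- End of recent memories ---',
    '',
    `User: ${message}`,
    'Assistant:',
  ].join('\n');
}
```

A neighbor with no outbound edges in the subgraph produces only its own line, which matches the behaviour described above.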

## Summarisation

After each episode write, `triggerSummary` is called fire-and-forget. It
checks token thresholds and episode counts before generating, then stores
the result in the memory service.

> For full details on trigger conditions, prompt format, cumulative updates,
> ChatML token stripping, and episode range tracking, see `summarization.md`.
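
The fire-and-forget pattern means the chat response is not delayed by summarisation and a summarisation failure never reaches the client. A minimal sketch of that pattern follows; `fireAndForget` is a hypothetical helper, and the actual call site may simply invoke `triggerSummary` without awaiting it.

```javascript
// Start a background task without awaiting it. Failures are logged rather
// than propagated, so the caller's request handling is never affected.
function fireAndForget(task, label = 'background task') {
  Promise.resolve()
    .then(task)
    .catch((err) => console.error(`${label} failed:`, err.message));
}
```

Under these assumptions a call site would look like `fireAndForget(() => triggerSummary(session, allEpisodes), 'summarisation')`.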

## SSE Stream Format

Inference service → orchestration:
```
data: {"response":"Hello","done":false}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
data: [DONE]
```

Orchestration → client:
```
data: {"text":"Hello"}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
```

The `[DONE]` sentinel is consumed internally and not forwarded.
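
The translation between the two formats can be sketched as a per-line function. `translateSseLine` is a hypothetical name, and the sketch assumes each upstream event arrives as a single complete `data:` line.

```javascript
// Translate one SSE line from the inference service into the line that is
// forwarded to the client. Returns null when nothing should be forwarded.
function translateSseLine(line) {
  if (!line.startsWith('data: ')) return null; // ignore blank or non-data lines
  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return null; // sentinel is consumed internally

  const event = JSON.parse(payload);
  if (event.done) {
    // Final event: pass through model and token count.
    const summary = { done: true, model: event.model, tokenCount: event.tokenCount };
    return `data: ${JSON.stringify(summary)}`;
  }
  // Token event: rename `response` to `text`.
  return `data: ${JSON.stringify({ text: event.response })}`;
}
```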

## Models Route

`GET /models` scans `modelsFolderPath` for `.gguf` files on every request
and merges the results with metadata from `models.json`. Each entry includes
the file size in GB.

`GET /models/props` fetches directly from llama-server and returns
`{ contextWindow, modelAlias }`, or `503` if llama-server is unreachable.
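
A live folder scan merged with stored metadata might look like the sketch below. Several details are assumptions not stated in the source: the function and field names, metadata keyed by file name, and GB computed as 1024³ bytes rounded to two decimals.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Sketch: list .gguf files in a folder and attach any known metadata.
// `metadata` is assumed to be an object keyed by file name.
function listModels(folder, metadata = {}) {
  return fs
    .readdirSync(folder)
    .filter((name) => name.toLowerCase().endsWith('.gguf'))
    .map((name) => {
      const { size } = fs.statSync(path.join(folder, name));
      return {
        ...metadata[name], // metadata fields first, so scan results win
        file: name,
        sizeGb: Number((size / 1024 ** 3).toFixed(2)),
      };
    });
}
```

Scanning on each request means a newly copied model file shows up without editing `models.json`.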

## Sessions Route Behaviour

`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both. The
request is rejected only when neither is provided, which lets `useChat`
write a project assignment separately from a rename.
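
The validation rule can be sketched as a small pure function. The name `validateSessionPatch`, the `400` status, and the error message are illustrative assumptions; only the accept and reject conditions come from the description above.

```javascript
// Accept `name`, `projectId`, or both; reject only when neither is present.
function validateSessionPatch(body = {}) {
  const update = {};
  if (body.name !== undefined) update.name = body.name;
  if (body.projectId !== undefined) update.projectId = body.projectId;

  if (Object.keys(update).length === 0) {
    // Assumed status code and message.
    return { ok: false, status: 400, error: 'Provide name, projectId, or both' };
  }
  return { ok: true, update };
}
```

Checking for `undefined` rather than truthiness lets a client send `projectId: null`, which under this sketch would be forwarded as an explicit value.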

## Caddy Configuration

Each route prefix needs a handle block in the Caddyfile on Mini PC 2.
**Any new top-level route must be added here AND in `vite.config.js`.**

```
handle /chat*      { reverse_proxy localhost:4000 }
handle /sessions*  { reverse_proxy localhost:4000 }
handle /models*    { reverse_proxy localhost:4000 }
handle /projects*  { reverse_proxy localhost:4000 }
handle /episodes*  { reverse_proxy localhost:4000 }
handle /settings*  { reverse_proxy localhost:4000 }
handle /summaries* { reverse_proxy localhost:4000 }
handle /health*    { reverse_proxy localhost:4000 }
```

After updating: `caddy reload --config /path/to/Caddyfile`

> Note: `/graph` routes are on the memory-service (port 3002) and are called
> internally by orchestration — they do not need a Caddy entry.
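
Keeping the Vite dev proxy in step with the Caddyfile is easier if the prefixes live in one list. The sketch below builds a proxy table in the shape accepted by Vite's `server.proxy` option. The helper name and the dev target `http://localhost:4000` are assumptions; the project's actual `vite.config.js` may be organised differently.

```javascript
// Route prefixes mirrored from the Caddyfile handle blocks above.
const ROUTE_PREFIXES = [
  '/chat',
  '/sessions',
  '/models',
  '/projects',
  '/episodes',
  '/settings',
  '/summaries',
  '/health',
];

// Build an object mapping each prefix to a proxy target, suitable for
// use as `server.proxy` in a Vite config.
function buildProxyTable(target) {
  return Object.fromEntries(
    ROUTE_PREFIXES.map((prefix) => [prefix, { target, changeOrigin: true }])
  );
}
```

In `vite.config.js` this would be used as `server: { proxy: buildProxyTable('http://localhost:4000') }`, so adding a route means one new entry in the list plus one new `handle` block in the Caddyfile.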

For all HTTP endpoints, see `api-routes.md`.