
# Orchestration Service

**Package:** `@nexusai/orchestration-service`
**Location:** `packages/orchestration-service`
**Deployed on:** Mini PC 2 (192.168.0.205)
**Port:** 4000

## Purpose

The orchestration service is the main entry point for all clients. It
assembles context packages from memory, routes prompts to inference, and
writes new episodes back to memory after each interaction. Clients never talk
directly to the memory or inference services; all traffic flows through
orchestration.

## Dependencies

- `express` — HTTP API
- `cors` — cross-origin resource sharing middleware
- `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PORT | No | 4000 | Port to listen on |
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
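
As a rough illustration of how these defaults could be resolved at startup, the sketch below reads each variable and falls back to the value from the table. `loadConfig` and `DEFAULTS` are hypothetical names, not taken from the service's source, and treating an empty string as unset is an assumption.

```javascript
// Hypothetical helper: resolves settings from the environment,
// falling back to the defaults listed in the table above.
const DEFAULTS = {
  PORT: '4000',
  MEMORY_SERVICE_URL: 'http://localhost:3002',
  EMBEDDING_SERVICE_URL: 'http://localhost:3003',
  INFERENCE_SERVICE_URL: 'http://localhost:3001',
  LLAMA_SERVER_URL: 'http://localhost:8080',
  QDRANT_URL: 'http://localhost:6333',
  CORS_ORIGIN: 'http://localhost:5173',
};

function loadConfig(env = process.env) {
  const config = {};
  for (const [key, fallback] of Object.entries(DEFAULTS)) {
    // Assumption: an empty string counts as unset, so a blank .env entry
    // does not override the default.
    config[key] = env[key] ? env[key] : fallback;
  }
  config.PORT = Number.parseInt(config.PORT, 10);
  return config;
}
```

Calling `loadConfig()` after `dotenv` has populated `process.env` would give one object that the rest of the service can read from.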

## Internal Structure

```
src/
├── services/
│   ├── memory.js      # HTTP client for memory service
│   ├── inference.js   # HTTP client for inference service
│   ├── embedding.js   # HTTP client for embedding service
│   └── qdrant.js      # HTTP client for Qdrant (direct vector search)
├── chat/
│   └── index.js       # Core pipeline — context assembly, isolation, auto-naming
├── config/
│   └── settings.js    # Settings load/save — reads/writes data/settings.json
├── routes/
│   ├── chat.js        # POST /chat and POST /chat/stream
│   ├── sessions.js    # Session CRUD proxy
│   ├── projects.js    # Project CRUD proxy
│   ├── episodes.js    # Episode list and delete proxy
│   ├── settings.js    # GET /settings and PATCH /settings
│   ├── health.js      # GET /health — pings all four services
│   └── models.js      # GET /models — scans .gguf files live, merges with models.json
                       # GET /models/props — context window + loaded model from llama-server
└── index.js           # Express app entry point
```

The `services/` layer wraps all downstream HTTP calls in named functions,
so a URL or endpoint change only needs to be made in one place.
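
A minimal sketch of what one such wrapper might look like is shown below. The function names and the endpoint path are illustrative assumptions, not the memory service's actual API; the point is only that callers use a named function and never build URLs themselves.

```javascript
// Hypothetical sketch of a services/ client module. The route shape and
// function names are assumptions made for illustration.
function createMemoryClient(baseUrl, fetchImpl = fetch) {
  async function request(path, options = {}) {
    const res = await fetchImpl(`${baseUrl}${path}`, {
      headers: { 'Content-Type': 'application/json' },
      ...options,
    });
    if (!res.ok) {
      throw new Error(`memory service responded ${res.status} for ${path}`);
    }
    return res.json();
  }

  return {
    // Assumed route; adjust to the memory service's real endpoints.
    getRecentEpisodes: (sessionId, limit) =>
      request(`/sessions/${encodeURIComponent(sessionId)}/episodes?limit=${limit}`),
  };
}
```

Passing `fetchImpl` as a parameter is a design choice for this sketch: it lets the wrapper be exercised with a stub instead of a live service.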

## Settings

Settings are persisted to `data/settings.json` and loaded on every request
via `appSettings.load()` — changes apply immediately without a service restart.

| Setting | Default | Description |
|---|---|---|
| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
| `semanticLimit` | 5 | Semantic search results injected into prompt |
| `scoreThreshold` | 0.75 | Minimum similarity score for semantic results |
| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
| `temperature` | 0.7 | Inference temperature |
| `repeatPenalty` | 1.1 | Repeat token penalty |
| `topP` | 0.9 | Nucleus sampling probability mass |
| `topK` | 40 | Top-K token candidates per step |

Defaults are defined in `config/settings.js`, which falls back to constants
in `@nexusai/shared`. Values saved in `settings.json` take precedence over both.
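
The precedence rule can be sketched as a shallow merge of saved values over defaults. This is an illustration under stated assumptions, not the real `appSettings.load()`: the defaults are copied from the table rather than imported from `@nexusai/shared`, and the error handling is a guess.

```javascript
const fs = require('node:fs');

// Defaults copied from the table above for illustration.
const DEFAULT_SETTINGS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.75,
  modelsFolderPath: '/mnt/nexus-models',
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
};

// Hypothetical load(): reads the file on every call, so edits apply
// without a restart. Saved values override defaults.
function load(filePath = 'data/settings.json') {
  let saved = {};
  try {
    saved = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  } catch (err) {
    // Assumption: a missing or invalid file means "use defaults only".
  }
  return { ...DEFAULT_SETTINGS, ...saved };
}
```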

## Chat Pipeline

Both `POST /chat` and `POST /chat/stream` share the same steps. The only
difference is how the inference response is delivered to the client.

### Steps

1. **Session resolution** — look up session by `externalId`. Auto-create if
   not found. Clients generate a UUID for new conversations — no pre-creation
   step needed.

2. **Project context resolution** — if the session has a `project_id`, fetch
   the project and all its session IDs. Used to scope semantic search. See
   `memory-isolation.md` for full behaviour.

3. **Recent episode retrieval** — fetch the most recent episodes for the
   session (`recentEpisodeLimit`, default 5).

4. **Semantic search** — embed the user message, then query Qdrant for up to
   `semanticLimit` similar past episodes scoring at least `scoreThreshold`. Deduplicated
   against recent episodes. Non-critical — if it fails, pipeline continues with
   recency-only context.

5. **Entity search** — reuse the embedded user message vector to query the
   `entities` Qdrant collection (score threshold 0.6, limit 5). Returns
   entity payloads (`name`, `type`, `notes`) directly — no SQLite roundtrip
   needed. Non-critical — if it fails, pipeline continues without entity context.

6. **Prompt assembly** — combine system prompt, entity context, semantic
   episodes, recent episodes, and user message.

7. **Inference** — send to inference service with settings-derived parameters
   (temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
   `/chat/stream` pipes SSE chunks to the client.

8. **Episode write** — write the exchange back to memory. Fire-and-forget
   for `/chat`; awaited for `/chat/stream` to ensure the full text is
   accumulated before saving.

9. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
   inference call with a naming prompt (max 20 tokens, temperature 0.3) and
   write the result back as `session.name`. Fully fire-and-forget.
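
Two recurring patterns in the steps above, non-critical steps that degrade gracefully and deduplication of semantic hits against recent episodes, can be sketched as small helpers. Both function names are hypothetical, and the unique `id` field on each episode is an assumption about the episode shape.

```javascript
// Runs an optional pipeline step. On failure it logs a warning and returns
// the fallback so the request can continue (steps 4 and 5).
async function nonCritical(label, step, fallback) {
  try {
    return await step();
  } catch (err) {
    console.warn(`${label} failed, continuing without it: ${err.message}`);
    return fallback;
  }
}

// Drops semantic hits that already appear among the recent episodes.
// Assumes each episode carries a unique `id` (not confirmed by the source).
function dedupeAgainstRecent(semanticEpisodes, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((ep) => ep.id));
  return semanticEpisodes.filter((ep) => !recentIds.has(ep.id));
}
```

In a pipeline this might be used as `await nonCritical('semantic search', () => searchEpisodes(vector), [])`, where `searchEpisodes` is a placeholder for the Qdrant query; an empty array then yields recency-only context.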

### Prompt Structure

```
[System prompt]

Here is what you know about entities relevant to this conversation:
- {name} ({type}): {notes}
... (up to 5 entity results)
---
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to semanticLimit semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to recentEpisodeLimit recent episodes)
--- End of recent memories ---

User: {current message}
Assistant:
```

Entity context appears first — before episodic memory — because structured
facts about known entities are the most stable and reliable context. Semantic
episodes follow, then recent episodes as the immediate conversation flow.
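
The ordering described above could be assembled roughly as follows. This is a sketch, not the service's actual code: the episode field names (`userMessage`, `aiResponse`) are assumptions, and skipping empty sections is a guess at the behaviour.

```javascript
// Formats episodes as alternating User/Assistant lines.
// Field names are assumed for illustration.
function formatEpisodes(episodes) {
  return episodes
    .map((ep) => `User: ${ep.userMessage}\nAssistant: ${ep.aiResponse}`)
    .join('\n');
}

// Hypothetical prompt builder: entities first, then semantic episodes,
// then recent episodes, then the current message.
function buildPrompt({ systemPrompt, entities = [], semantic = [], recent = [], message }) {
  const parts = [systemPrompt, ''];
  if (entities.length > 0) {
    parts.push('Here is what you know about entities relevant to this conversation:');
    parts.push(...entities.map((e) => `- ${e.name} (${e.type}): ${e.notes}`));
    parts.push('---');
  }
  if (semantic.length > 0) {
    parts.push('Here are some relevant memories from earlier conversations:');
    parts.push(formatEpisodes(semantic));
    parts.push('---');
  }
  if (recent.length > 0) {
    parts.push('Here are some relevant memories from your past conversations:');
    parts.push(formatEpisodes(recent));
    parts.push('--- End of recent memories ---');
  }
  // Keep exactly one blank line before the current message.
  if (parts[parts.length - 1] !== '') parts.push('');
  parts.push(`User: ${message}`, 'Assistant:');
  return parts.join('\n');
}
```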

## SSE Stream Format

Inference service → orchestration:
```
data: {"response":"Hello","done":false}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
data: [DONE]
```

Orchestration → client:
```
data: {"text":"Hello"}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
```

The `[DONE]` sentinel is consumed internally and not forwarded. The stream
is terminated by `res.end()` after the done event.
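
The translation between the two formats can be sketched as a pure function over individual SSE lines. `translateSseLine` is a hypothetical name, and the sketch assumes each `data:` line arrives whole; a real stream handler would also need to buffer partial chunks.

```javascript
// Translates one SSE line from the inference service into the line
// forwarded to the client, or null when nothing should be forwarded.
function translateSseLine(line) {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return null; // sentinel is consumed, not forwarded

  const event = JSON.parse(payload);
  if (event.done) {
    // The final event is passed through with its fields intact.
    return `data: ${JSON.stringify(event)}`;
  }
  // Token events are reduced to a { text } payload.
  return `data: ${JSON.stringify({ text: event.response })}`;
}
```

When writing to the response, each forwarded line would be followed by a blank line (`\n\n`), as the SSE format requires between events.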

## Models Route

`GET /models` scans `modelsFolderPath` (read from settings) for `.gguf` files
on every request. Results are merged with a `models.json` file in the same
folder for richer metadata (label, description). File size is returned in GB.
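
A rough sketch of the scan-and-merge step follows. The shape of `models.json` (an array of `{ file, label, description }` entries) and the output field names are assumptions made for illustration; the real manifest format may differ.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical scan: lists .gguf files in `folder` and merges in metadata
// from models.json when an entry with a matching file name exists.
function listModels(folder) {
  let manifest = [];
  try {
    const parsed = JSON.parse(fs.readFileSync(path.join(folder, 'models.json'), 'utf8'));
    if (Array.isArray(parsed)) manifest = parsed;
  } catch (err) {
    // No manifest or invalid JSON: fall back to file names only.
  }
  const byFile = new Map(manifest.map((m) => [m.file, m]));

  return fs
    .readdirSync(folder)
    .filter((name) => name.toLowerCase().endsWith('.gguf'))
    .map((name) => {
      const { size } = fs.statSync(path.join(folder, name));
      const meta = byFile.get(name) || {};
      return {
        file: name,
        label: meta.label || name,
        description: meta.description || '',
        sizeGb: Number((size / 1024 ** 3).toFixed(2)),
      };
    });
}
```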

`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
Returns `{ contextWindow, modelAlias }`. Used by the client to display
read-only context window size and the currently loaded model in the settings
panel. Returns `503` if llama-server is unreachable.
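
The failure mapping might look like the handler sketch below. `createPropsHandler` and `fetchProps` are hypothetical: `fetchProps` stands in for whatever code calls llama-server and resolves to `{ contextWindow, modelAlias }`, and the error body is an assumption.

```javascript
// Hypothetical Express-style handler factory for GET /models/props.
// `fetchProps` is assumed to reject when llama-server cannot be reached.
function createPropsHandler(fetchProps) {
  return async function propsHandler(req, res) {
    try {
      const { contextWindow, modelAlias } = await fetchProps();
      res.json({ contextWindow, modelAlias });
    } catch (err) {
      // Upstream unreachable or errored: report service unavailable.
      res.status(503).json({ error: 'llama-server unavailable' });
    }
  };
}
```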

## Sessions Route Behaviour

`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
The validation guard only rejects requests where neither is provided:

```js
if (!name?.trim() && projectId === undefined) {
  return res.status(400).json({ error: 'name or projectId is required' });
}
```

This allows `useChat` to write project assignment separately from rename
operations.

## Caddy Configuration

Each route prefix needs a handle block in the Caddyfile on Mini PC 2:

```
handle /chat*     { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models*   { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /health*   { reverse_proxy localhost:4000 }
```

After updating: `caddy reload --config /path/to/Caddyfile`

For all HTTP endpoints, see `api-routes.md`.