Storme-bit 1a97b19280 roadmap phase 1 complete

2026-04-27 03:10:39 -07:00

8.6 KiB

Orchestration Service

Package: @nexusai/orchestration-service
Location: packages/orchestration-service
Deployed on: Mini PC 2 (192.168.0.205)
Port: 4000

Purpose

The main entry point for all clients. Assembles context packages from memory, routes prompts to inference, and writes new episodes back to memory after each interaction. Clients never talk directly to the memory or inference services — all traffic flows through orchestration.

Dependencies

express — HTTP API
cors — cross-origin resource sharing middleware
dotenv — environment variable loading
@nexusai/shared — shared utilities

Environment Variables

Variable	Required	Default	Description
PORT	No	4000	Port to listen on
MEMORY_SERVICE_URL	No	http://localhost:3002	Memory service URL
EMBEDDING_SERVICE_URL	No	http://localhost:3003	Embedding service URL
INFERENCE_SERVICE_URL	No	http://localhost:3001	Inference service URL
LLAMA_SERVER_URL	No	http://localhost:8080	Direct llama-server URL for /models/props
QDRANT_URL	No	http://localhost:6333	Qdrant URL for semantic search
CORS_ORIGIN	No	http://localhost:5173	Allowed origin for CORS requests
EXTRACTION_URL	No	http://localhost:11434	Ollama URL for summarisation
EXTRACTION_MODEL	No	qwen2.5:3b	Ollama model used for summarisation
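
As a rough illustration of how the defaults in this table could be applied, the sketch below resolves each variable from the environment and falls back to the documented default. The `loadConfig` helper is hypothetical and not taken from the service's source; in the real service `dotenv` would normally populate `process.env` before anything like this runs.

```javascript
// Hypothetical helper: resolve configuration from environment variables,
// falling back to the defaults documented in the table above.
const DEFAULTS = {
  PORT: "4000",
  MEMORY_SERVICE_URL: "http://localhost:3002",
  EMBEDDING_SERVICE_URL: "http://localhost:3003",
  INFERENCE_SERVICE_URL: "http://localhost:3001",
  LLAMA_SERVER_URL: "http://localhost:8080",
  QDRANT_URL: "http://localhost:6333",
  CORS_ORIGIN: "http://localhost:5173",
  EXTRACTION_URL: "http://localhost:11434",
  EXTRACTION_MODEL: "qwen2.5:3b",
};

function loadConfig(env = process.env) {
  const config = {};
  for (const [key, fallback] of Object.entries(DEFAULTS)) {
    // Unset and empty-string values both fall back to the default.
    config[key] = env[key] ? env[key] : fallback;
  }
  // PORT is the only numeric value in the table.
  config.PORT = Number(config.PORT);
  return config;
}
```

Passing the environment as an argument keeps the function easy to exercise without mutating `process.env`.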

Internal Structure

src/
├── services/
│   ├── memory.js         # HTTP client for memory service
│   ├── inference.js      # HTTP client for inference service
│   ├── embedding.js      # HTTP client for embedding service
│   ├── qdrant.js         # HTTP client for Qdrant (direct vector search)
│   ├── graph.js          # HTTP client for memory-service graph endpoints
│   └── summarization.js  # Session summarisation — triggers after each episode
├── chat/
│   └── index.js          # Core pipeline — context assembly, graph expansion, auto-naming
├── config/
│   └── settings.js       # Settings load/save — reads/writes data/settings.json
├── routes/
│   ├── chat.js           # POST /chat and POST /chat/stream
│   ├── sessions.js       # Session CRUD proxy
│   ├── projects.js       # Project CRUD proxy
│   ├── episodes.js       # Episode list and delete proxy
│   ├── summaries.js      # GET /summaries/session/:id and /summaries/project/:id
│   ├── settings.js       # GET /settings and PATCH /settings
│   ├── health.js         # GET /health/services — pings all four services
│   └── models.js         # GET /models and GET /models/props
└── index.js              # Express app entry point

The services/ layer wraps every downstream HTTP call in a named function, so a URL or endpoint change only needs to be made in one place.
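
A minimal sketch of what one of these wrappers might look like follows. The factory name `createMemoryClient`, the function names, and the endpoint paths (`/sessions/:id`, `/episodes`) are assumptions made for illustration; the actual paths are defined by the memory service. It relies on the global `fetch` available in Node 18 and later.

```javascript
// Hypothetical services/ client. Endpoint paths are illustrative only.
function createMemoryClient(baseUrl, fetchImpl = globalThis.fetch) {
  // Single place where the base URL, headers and error handling live.
  async function request(path, options = {}) {
    const res = await fetchImpl(`${baseUrl}${path}`, {
      headers: { "Content-Type": "application/json" },
      ...options,
    });
    if (!res.ok) {
      throw new Error(`memory service responded ${res.status} for ${path}`);
    }
    return res.json();
  }

  return {
    // Named functions are what the rest of the code base would call.
    getSession: (externalId) =>
      request(`/sessions/${encodeURIComponent(externalId)}`),
    createEpisode: (episode) =>
      request("/episodes", { method: "POST", body: JSON.stringify(episode) }),
  };
}
```

Callers would use something like `createMemoryClient(process.env.MEMORY_SERVICE_URL || "http://localhost:3002")` and never build URLs themselves.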

Settings

Settings are persisted to data/settings.json and loaded on every request via appSettings.load() — changes apply immediately without a service restart.

Setting	Default	Description
`recentEpisodeLimit`	5	Recent episodes injected into prompt
`semanticLimit`	5	Semantic search results injected into prompt
`scoreThreshold`	0.5	Minimum similarity score for semantic results
`modelsFolderPath`	`/mnt/nexus-models`	Path to folder containing .gguf files
`temperature`	0.7	Inference temperature
`repeatPenalty`	1.1	Repeat token penalty
`topP`	0.9	Nucleus sampling probability mass
`topK`	40	Top-K token candidates per step
`systemPrompt`	(ORCHESTRATION.SYSTEM_PROMPT)	Global system prompt. `null` reverts to hardcoded constant.
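
The load-on-every-request behaviour can be sketched as below. This is an assumption about how such a store could work, not the contents of config/settings.js: `createSettingsStore` is a hypothetical name, and merging stored values over the defaults is one plausible way to keep newly added settings working with an older settings.json.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Defaults from the table above. systemPrompt: null means
// "use the hardcoded constant".
const DEFAULT_SETTINGS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.5,
  modelsFolderPath: "/mnt/nexus-models",
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
  systemPrompt: null,
};

// Hypothetical store; filePath would be data/settings.json in the service.
function createSettingsStore(filePath) {
  function load() {
    try {
      const stored = JSON.parse(fs.readFileSync(filePath, "utf8"));
      return { ...DEFAULT_SETTINGS, ...stored };
    } catch (err) {
      // Missing, unreadable or malformed file: serve the defaults.
      return { ...DEFAULT_SETTINGS };
    }
  }

  function save(patch) {
    const next = { ...load(), ...patch };
    fs.mkdirSync(path.dirname(filePath), { recursive: true });
    fs.writeFileSync(filePath, JSON.stringify(next, null, 2));
    return next;
  }

  return { load, save };
}
```

Because `load()` reads the file each time it is called, a value written by `save()` is visible to the very next request.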

Chat Pipeline

Both POST /chat and POST /chat/stream share the same steps. The only difference is how the inference response is delivered to the client.

Steps

1. Session resolution — look up session by externalId. Auto-create if not found.
2. Project context resolution — if the session has a project_id, fetch the project and all its session IDs. Used to scope semantic search. The project's system_prompt is also read at this step if set.
3. System prompt resolution — three-tier hierarchy:
   - project.system_prompt — highest priority
   - settings.systemPrompt — global setting from settings.json
   - ORCHESTRATION.SYSTEM_PROMPT — hardcoded constant (last resort)
4. Recent episode retrieval — fetch most recent episodes (recentEpisodeLimit).
5. Semantic search — embed user message, query Qdrant for similar past episodes. Deduplicated against recent episodes. Non-critical.
6. Entity search — query entities Qdrant collection filtered by projectId. Returns entity IDs alongside Qdrant payload data (the Qdrant point ID equals the SQLite entity ID). Non-critical.
7. Graph neighborhood expansion — call POST /graph/neighbors on memory-service with the entity IDs from step 6. Returns a 1-hop subgraph { nodes, edges } — entity objects plus the relationships connecting them. If no entities were found or the graph call fails, falls back to flat entity list (no edges). Non-critical.
8. Prompt assembly — combine system prompt, graph context, semantic episodes, recent episodes, and user message.
9. Inference — send to inference service. /chat awaits full response; /chat/stream pipes SSE chunks to the client.
10. Episode write — write exchange back to memory with projectId.
11. Summarisation trigger — triggerSummary(session, allEpisodes) called fire-and-forget. See summarization.md for full details.
12. Auto-naming — on first message with no session name, fires a secondary inference call (max 20 tokens, temperature 0.3) to generate a session name.
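
Three ideas from the steps above lend themselves to a short sketch: the system prompt hierarchy, deduplicating semantic hits against recent episodes, and treating a step as non-critical. All function names here are hypothetical, the `id` field on episodes is an assumed shape, and the fallback string is a placeholder standing in for ORCHESTRATION.SYSTEM_PROMPT, whose real value is not shown in this document.

```javascript
// Placeholder for ORCHESTRATION.SYSTEM_PROMPT (actual text not shown here).
const FALLBACK_SYSTEM_PROMPT = "You are a helpful assistant.";

// Three-tier hierarchy: project, then global setting, then constant.
// Note: in this sketch an empty string is treated like null.
function resolveSystemPrompt(project, settings, fallback = FALLBACK_SYSTEM_PROMPT) {
  if (project && project.system_prompt) return project.system_prompt;
  if (settings && settings.systemPrompt) return settings.systemPrompt;
  return fallback;
}

// Drop semantic hits that are already among the recent episodes.
// Assumes both lists carry an `id` field.
function dedupeAgainstRecent(semanticHits, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((episode) => episode.id));
  return semanticHits.filter((hit) => !recentIds.has(hit.id));
}

// Non-critical step: a failure degrades the context but does not
// abort the request.
async function nonCritical(label, fn, fallback) {
  try {
    return await fn();
  } catch (err) {
    console.warn(`${label} failed, continuing without it: ${err.message}`);
    return fallback;
  }
}
```

A call such as `await nonCritical("semantic search", () => searchEpisodes(message), [])`, with `searchEpisodes` standing in for whatever the real search function is, would leave the pipeline running with an empty result if Qdrant is down.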

Prompt Structure

[Resolved system prompt]

Here is what you know about entities relevant to this conversation and their connections:
- {name} ({type}): {notes}
  → {label} {neighbor_name} ({neighbor_type})
---
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
--- End of recent memories ---

User: {current message}
Assistant:

The entity block renders the full graph neighborhood — seed entities matched by Qdrant search plus any neighbors pulled in by 1-hop traversal. Each entity shows its notes and any outbound relationships with their targets. Neighbor nodes that have no outbound edges within the subgraph appear without connection lines.
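
To make the rendering rule concrete, here is a minimal sketch that turns a { nodes, edges } subgraph into the entity block lines. The field names (`id`, `name`, `type`, `notes` on nodes; `source`, `target`, `label` on edges) and the `renderEntityBlock` function are assumptions for illustration; the real payload shape is defined by the memory service.

```javascript
// Hypothetical renderer for the entity block of the prompt.
function renderEntityBlock({ nodes, edges = [] }) {
  const byId = new Map(nodes.map((node) => [node.id, node]));
  const lines = [];

  for (const node of nodes) {
    lines.push(`- ${node.name} (${node.type}): ${node.notes}`);

    // Only outbound edges are listed under an entity.
    for (const edge of edges) {
      if (edge.source !== node.id) continue;
      const target = byId.get(edge.target);
      if (!target) continue; // target lies outside the subgraph
      lines.push(`  → ${edge.label} ${target.name} (${target.type})`);
    }
  }

  return lines.join("\n");
}
```

With an empty `edges` array, which corresponds to the flat-list fallback, the output is just one line per entity.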

Summarisation

After each episode write, triggerSummary is called fire-and-forget. It checks token thresholds and episode counts before generating a summary, then stores the result in the memory service.

For full details on trigger conditions, prompt format, cumulative updates, ChatML token stripping, and episode range tracking, see summarization.md.

SSE Stream Format

Inference service → orchestration:

data: {"response":"Hello","done":false}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
data: [DONE]

Orchestration → client:

data: {"text":"Hello"}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}

The [DONE] sentinel is consumed internally and not forwarded.
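
The translation between the two formats can be sketched as a per-line function. `translateSseLine` is a hypothetical name, and the sketch assumes the stream has already been split into individual `data:` lines; the field names are the ones shown in the examples above.

```javascript
// Translate one SSE line from the inference service into the line
// forwarded to the client, or null when nothing should be forwarded.
function translateSseLine(line) {
  if (!line.startsWith("data:")) return null;

  const payload = line.slice("data:".length).trim();
  // The [DONE] sentinel is consumed here and never reaches the client.
  if (payload === "" || payload === "[DONE]") return null;

  let event;
  try {
    event = JSON.parse(payload);
  } catch {
    return null; // ignore malformed chunks in this sketch
  }

  if (event.done) {
    const final = { done: true, model: event.model, tokenCount: event.tokenCount };
    return `data: ${JSON.stringify(final)}`;
  }
  // Token chunks are renamed from "response" to "text".
  return `data: ${JSON.stringify({ text: event.response })}`;
}
```

When writing to the client response, each forwarded line would be followed by a blank line, since SSE events are terminated by an empty line.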

Models Route

GET /models scans modelsFolderPath for .gguf files on every request and merges the results with metadata from models.json. Each entry includes the file size in GB.

GET /models/props fetches directly from llama-server and returns { contextWindow, modelAlias }. It returns 503 if llama-server is unreachable.
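
A possible shape for the folder scan is sketched below. The `listModels` function, the output field names, and the assumption that models.json is an object keyed by file name are all hypothetical. The size is computed with 1024³ bytes per GB, which is also an assumption about how the route rounds.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical scan: list .gguf files and merge per-file metadata.
// `metadata` is assumed to be keyed by file name.
function listModels(modelsFolderPath, metadata = {}) {
  return fs
    .readdirSync(modelsFolderPath, { withFileTypes: true })
    .filter((entry) => entry.isFile() && entry.name.toLowerCase().endsWith(".gguf"))
    .map((entry) => {
      const { size } = fs.statSync(path.join(modelsFolderPath, entry.name));
      return {
        file: entry.name,
        // Bytes to GB, rounded to two decimals.
        sizeGb: Number((size / 1024 ** 3).toFixed(2)),
        ...(metadata[entry.name] || {}),
      };
    });
}
```

Reading the directory on each call is what makes a newly copied model appear without a restart.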

Sessions Route Behaviour

PATCH /sessions/:sessionId accepts name, projectId, or both. The request is rejected only when neither field is provided, which lets useChat write a project assignment separately from a rename.
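
The acceptance rule can be expressed as a small validation step. This is a sketch under assumptions: `validateSessionPatch` is a hypothetical helper, and the 400 status and error message are illustrative rather than taken from the route.

```javascript
// Hypothetical validator for the PATCH /sessions/:sessionId body.
function validateSessionPatch(body) {
  const source = body || {};
  const patch = {};

  // Copy only the fields the route understands.
  if (source.name !== undefined) patch.name = source.name;
  if (source.projectId !== undefined) patch.projectId = source.projectId;

  // Reject only when neither field is present.
  if (Object.keys(patch).length === 0) {
    return { ok: false, status: 400, error: "Provide name, projectId, or both" };
  }
  return { ok: true, patch };
}
```

In this sketch a `projectId` of `null` counts as provided, so it is passed through rather than rejected.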

Caddy Configuration

Each route prefix needs a handle block in the Caddyfile on Mini PC 2. Any new top-level route must be added here AND in vite.config.js.

handle /chat*      { reverse_proxy localhost:4000 }
handle /sessions*  { reverse_proxy localhost:4000 }
handle /models*    { reverse_proxy localhost:4000 }
handle /projects*  { reverse_proxy localhost:4000 }
handle /episodes*  { reverse_proxy localhost:4000 }
handle /settings*  { reverse_proxy localhost:4000 }
handle /summaries* { reverse_proxy localhost:4000 }
handle /health*    { reverse_proxy localhost:4000 }

After updating: caddy reload --config /path/to/Caddyfile

Note: /graph routes are on the memory-service (port 3002) and are called internally by orchestration — they do not need a Caddy entry.

For all HTTP endpoints, see api-routes.md.
