clean up
packages/orchestration-service/CLAUDE.md
@@ -0,0 +1,124 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

See the root [CLAUDE.md](../../CLAUDE.md) for overall architecture, service roles, and the end-to-end chat flow.

## Running This Service

```bash
npm run orchestration                          # From repo root (node src/index.js)
npm -w packages/orchestration-service run dev  # With --watch
```

Default port: **4000**. Depends on memory-service, embedding-service, inference-service, and Qdrant.

## Context Assembly (`src/chat/index.js`)

`assembleContext(externalId, userMessage)` is the core function that builds the inference prompt. Order of operations:

1. Resolve the session by `externalId`, creating it if missing, so every chat call is self-healing.
2. If the session has a `project_id`, load the project and fetch all sibling sessions (via `getProjectSessions`, hardcoded `limit=200`).
3. Fetch `recentEpisodeLimit` recent episodes from memory-service.
4. Embed the user message; search Qdrant EPISODES with `scoreThreshold`:
   - No project: `must: [sessionId == this session]`
   - Project: `should: [sessionId == s1, sessionId == s2, ...]` across all project sessions
   - Deduplicate against recent episode IDs before including.
5. Embed and search Qdrant ENTITIES; filter by `projectId` if applicable.
6. Build the prompt in this fixed order: **system prompt → entities → semantic episodes → recent episodes → user message → "Assistant:"**

The ordering prioritizes established facts (entities) and relevant past context (semantic) over pure recency.

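The episode filter and deduplication in step 4 can be sketched roughly as follows. This is a minimal sketch, not the service's actual code: the helper names `buildEpisodeFilter` and `dedupeAgainstRecent` and the `{ id }` hit shape are assumptions made for illustration. The `must`/`should` clauses with `key`/`match` conditions follow Qdrant's documented filter format.

```javascript
// Hypothetical helper: builds the Qdrant filter for the EPISODES search.
// projectSessionIds is empty when the session has no project; otherwise it
// holds the IDs of every session in the project (including this one).
function buildEpisodeFilter(sessionId, projectSessionIds) {
  const condition = (id) => ({ key: 'sessionId', match: { value: id } });
  if (!projectSessionIds || projectSessionIds.length === 0) {
    // No project: restrict results to this session only.
    return { must: [condition(sessionId)] };
  }
  // Project: a hit qualifies if it belongs to any session in the project.
  return { should: projectSessionIds.map(condition) };
}

// Hypothetical helper: drops semantic hits that already appear among the
// recent episodes, so no episode is included in the prompt twice.
function dedupeAgainstRecent(semanticHits, recentEpisodes) {
  const recentIds = new Set(recentEpisodes.map((episode) => episode.id));
  return semanticHits.filter((hit) => !recentIds.has(hit.id));
}
```

With only `should` clauses present, Qdrant requires at least one of them to match, which is what gives the project case its "any sibling session" behavior.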
## System Prompt Resolution

Priority from highest to lowest:

1. `project.system_prompt` (stored on the project row in memory-service)
2. `settings.systemPrompt` (saved in `data/settings.json`)
3. `ORCHESTRATION.SYSTEM_PROMPT` (shared constants fallback)

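The priority chain above might look like this in code. This is a sketch under assumptions: `resolveSystemPrompt` is a hypothetical name, and treating only `null`/`undefined` as "unset" (so an empty string would still win) is a guess about the real behavior.

```javascript
// Hypothetical resolver: returns the first prompt that is set, in priority order.
// `??` falls through only on null/undefined; how the real service treats an
// empty string is not stated in this document.
function resolveSystemPrompt(project, settings, sharedDefault) {
  return project?.system_prompt ?? settings?.systemPrompt ?? sharedDefault;
}
```

For example, `resolveSystemPrompt(null, { systemPrompt: 'Be brief.' }, 'default')` falls through the missing project and returns the saved setting.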
## Settings (`src/config/settings.js`)

On every `GET /settings` call, settings are loaded from `data/settings.json` and merged with defaults. `PATCH /settings` validates each field individually against these constraints:

| Field | Constraint |
|---|---|
| `recentEpisodeLimit` | integer, 1–20 |
| `semanticLimit` | integer, 1–20 |
| `scoreThreshold` | number, 0–1 |
| `temperature` | number, 0–2 |
| `repeatPenalty` | number, 1–2 |
| `topP` | number, 0–1 |
| `topK` | integer, 1–100 |
| `modelsFolderPath` | path must exist and be readable |
| `systemPrompt` | string (trimmed); `null` reverts to shared default |

`data/settings.json` is created on first save. Parent directories are created if missing.

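The numeric constraints in the table could be checked with a small rule table. This is an illustrative sketch only: `NUMERIC_RULES` and `validateNumericField` are hypothetical names, and inclusive bounds are an assumption, since the table does not say whether the endpoints are allowed.

```javascript
// Hypothetical rule table mirroring the numeric rows of the constraint table.
const NUMERIC_RULES = {
  recentEpisodeLimit: { min: 1, max: 20, integer: true },
  semanticLimit: { min: 1, max: 20, integer: true },
  scoreThreshold: { min: 0, max: 1, integer: false },
  temperature: { min: 0, max: 2, integer: false },
  repeatPenalty: { min: 1, max: 2, integer: false },
  topP: { min: 0, max: 1, integer: false },
  topK: { min: 1, max: 100, integer: true },
};

// Returns an error message for an invalid value, or null when it is acceptable.
// Bounds are treated as inclusive (assumption).
function validateNumericField(field, value) {
  const rule = NUMERIC_RULES[field];
  if (!rule) return `unknown field: ${field}`;
  if (typeof value !== 'number' || !Number.isFinite(value)) {
    return `${field} must be a number`;
  }
  if (rule.integer && !Number.isInteger(value)) {
    return `${field} must be an integer`;
  }
  if (value < rule.min || value > rule.max) {
    return `${field} must be between ${rule.min} and ${rule.max}`;
  }
  return null;
}
```

Validating each field on its own, as the section describes, lets a `PATCH` report precisely which field was rejected.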
## Streaming SSE (`src/chat/index.js` — `chatStream`)

The route sets SSE headers and delegates to `chatStream`, which:

1. Calls `inference.completeStream()` and receives a raw HTTP Response with a readable body.
2. Reads the body in chunks, buffers across chunk boundaries, and splits on `\n\n`.
3. For each event line starting with `data: `, parses the JSON and calls `onChunk(data.response)`.
4. Explicitly ignores the `[DONE]` sentinel (used by some llama-server versions).
5. After the stream ends, saves the assembled full response as an episode (same as non-streaming).

If a chunk fails to parse, the error is logged and the stream continues. If the response body closes with no text accumulated, the episode is not saved and a warning is logged.

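The buffering and parsing in steps 2 to 4 can be illustrated with a small parser. This is a sketch, not the actual `chatStream` code: `createSseParser` is a hypothetical name, and it takes already-decoded text so it can run without a network stream.

```javascript
// Hypothetical parser factory: feed() accepts decoded text chunks and calls
// onChunk for each `data: ` line whose JSON carries a string `response` field.
function createSseParser(onChunk) {
  let buffer = '';
  return function feed(text) {
    buffer += text;
    const events = buffer.split('\n\n');
    // The last element is an incomplete event (or ''); keep it for the next chunk.
    buffer = events.pop();
    for (const event of events) {
      for (const line of event.split('\n')) {
        if (!line.startsWith('data: ')) continue;
        const payload = line.slice('data: '.length);
        if (payload === '[DONE]') continue; // end-of-stream sentinel, not JSON
        try {
          const data = JSON.parse(payload);
          if (typeof data.response === 'string') onChunk(data.response);
        } catch (err) {
          // A malformed chunk is logged and skipped; the stream continues.
          console.warn('SSE chunk parse failed:', err.message);
        }
      }
    }
  };
}
```

When reading a real response body, each `Uint8Array` chunk would first be decoded with `new TextDecoder().decode(value, { stream: true })` so that multi-byte characters split across chunks are not corrupted.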
## Fire-and-Forget Tasks

After every successful chat turn:

- **Summarization** (`services/summarization.js` → `triggerSummary`): checks the token threshold, then the recency guard, then calls Ollama and POSTs the result to memory-service. It runs only if `SUMMARIES.THRESHOLD_TOKENS` is exceeded AND at least `SUMMARIES.MIN_EPISODES_SINCE` new episodes have occurred since the last summary.
- **Auto-naming** (`chat/index.js` → `autoNameSession`): fires only on the first message of a session. Uses temperature 0.3 and `maxTokens=20`, and prompts for a title of at most 5 words.

Both tasks catch all errors and log warnings without surfacing them to the client.

## Summarization Recency Guard

`src/services/summarization.js` reads the `episode_range` field of the latest existing summary (format: `"<startId>-<endId>"`). It counts SQLite episodes with `id > endId`; if there are fewer than `SUMMARIES.MIN_EPISODES_SINCE`, it skips summarization. This prevents rapid re-summarization on high-traffic sessions.

When the existing summary's token count exceeds `SUMMARIES.MAX_SUMMARY_TOKENS`, it is treated as "expired": a fresh summary is generated instead of an incremental update.

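The guard logic can be sketched as pure functions. Everything named here is hypothetical (`parseRangeEnd`, `shouldSummarize`, `summaryMode`, the `tokenCount` field, and passing episode IDs as an array in place of the SQLite count query); behavior for a missing or unparseable `episode_range` is also an assumption.

```javascript
// Hypothetical helper: extracts endId from an episode_range such as "12-40".
function parseRangeEnd(episodeRange) {
  const match = /^(\d+)-(\d+)$/.exec(episodeRange ?? '');
  return match ? Number(match[2]) : null;
}

// Hypothetical decision function. episodeIds stands in for the SQLite query
// that counts episodes with id > endId.
function shouldSummarize({ totalTokens, latestSummary, episodeIds }, limits) {
  if (totalTokens <= limits.THRESHOLD_TOKENS) return false; // threshold not exceeded
  if (!latestSummary) return true; // assumption: no prior summary means no guard
  const endId = parseRangeEnd(latestSummary.episode_range);
  if (endId === null) return true; // assumption: unparseable range does not block
  const newerCount = episodeIds.filter((id) => id > endId).length;
  return newerCount >= limits.MIN_EPISODES_SINCE;
}

// Hypothetical helper: an oversized existing summary is "expired" and replaced
// by a fresh one; otherwise it is updated incrementally.
function summaryMode(latestSummary, limits) {
  if (!latestSummary || latestSummary.tokenCount > limits.MAX_SUMMARY_TOKENS) {
    return 'fresh';
  }
  return 'incremental';
}
```

Keeping the decision separate from the Ollama and memory-service calls makes the guard easy to unit test without either dependency running.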
## Qdrant Calls (Direct, Not Via Memory-Service)

`src/services/qdrant.js` makes REST calls to Qdrant directly at `QDRANT_URL`, bypassing memory-service to keep semantic search fast. Orchestration fetches episode/entity content from memory-service by ID *after* getting vector search results from Qdrant.

`searchEntities` checks `projectId !== null && projectId !== undefined` before applying the filter, so a session with no project skips the filter entirely and searches globally.

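A sketch of how the optional `projectId` filter might be attached to the search request body. `buildEntitySearchBody` is a hypothetical name, and the `projectId` payload key is assumed from the description above; `vector`, `limit`, `with_payload`, and `filter` are fields of Qdrant's points search request.

```javascript
// Hypothetical helper: builds the request body for a Qdrant ENTITIES search.
function buildEntitySearchBody(vector, limit, projectId) {
  const body = { vector, limit, with_payload: true };
  // Explicit null/undefined check rather than a truthiness test, so a
  // projectId of 0 would still be applied as a filter.
  if (projectId !== null && projectId !== undefined) {
    body.filter = { must: [{ key: 'projectId', match: { value: projectId } }] };
  }
  return body;
}
```

The explicit comparison matters because a plain `if (projectId)` would silently drop the filter for a falsy but valid ID.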
## Models Endpoint

`GET /models` scans `modelsFolderPath` for `.gguf` files and optionally reads a `models.json` manifest (keyed by filename) for labels and descriptions. File size is reported in GB. Returns 500 if the folder is inaccessible.

`GET /models/props` proxies `/props` from llama-server and returns `{contextWindow, modelAlias}`. Returns 503 if llama-server is unreachable.

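The merge of scanned files with the manifest could look roughly like this. The sketch is heavily assumption-based: `describeModels`, the input shape `{ name, sizeBytes }`, the output field names, the manifest entry shape `{ label, description }`, and the use of decimal gigabytes (10^9 bytes) rounded to two places are all illustrative guesses, not the endpoint's documented response.

```javascript
// Hypothetical helper: turns directory entries plus an optional manifest
// (keyed by filename) into a model list.
function describeModels(files, manifest = {}) {
  return files
    .filter((file) => file.name.endsWith('.gguf'))
    .map((file) => {
      const entry = manifest[file.name] ?? {};
      return {
        filename: file.name,
        // Decimal GB, two decimal places (assumption).
        sizeGB: Number((file.sizeBytes / 1e9).toFixed(2)),
        label: entry.label ?? file.name,
        description: entry.description ?? null,
      };
    });
}
```

Files without a manifest entry still appear in the result, labeled by their filename.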
## Health Check

`GET /health/services` runs parallel fetch calls to all four dependent services with a 3-second `AbortSignal.timeout` each. Results are returned as an array; the endpoint itself never returns a non-2xx status, regardless of downstream status.

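The parallel check pattern can be sketched as below. `checkServices`, the `{ name, ok, status, error }` result shape, and the injectable `fetchImpl` parameter are assumptions for illustration; `AbortSignal.timeout` and the global `fetch` are standard in recent Node versions.

```javascript
// Hypothetical checker. `services` maps a service name to its health URL.
// fetchImpl defaults to the global fetch and can be replaced in tests.
async function checkServices(services, fetchImpl = fetch) {
  const checks = Object.entries(services).map(async ([name, url]) => {
    try {
      const res = await fetchImpl(url, { signal: AbortSignal.timeout(3000) });
      return { name, ok: res.ok, status: res.status };
    } catch (err) {
      // Timeouts and connection errors become a "down" entry instead of
      // rejecting, so one failing dependency never fails the whole check.
      return { name, ok: false, error: err.name };
    }
  });
  return Promise.all(checks);
}
```

Because every per-service promise resolves (never rejects), `Promise.all` always yields a full array, which matches the endpoint's behavior of reporting downstream failures in the body rather than through its own status code.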
## Background Model (qwen2.5:3b)

Used for entity extraction and summarization via Ollama on Mini PC 1. Uses **ChatML format** (`<|im_start|>` / `<|im_end|>`), not Phi3 format. Use `format: 'json'` only for structured extraction, never for free-text summarization.

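For reference, a ChatML prompt wraps each turn in `<|im_start|>` and `<|im_end|>` markers and ends with an open assistant turn. The helper below is a hypothetical sketch (`toChatML` and the `{ role, content }` message shape are assumptions), not the service's prompt builder.

```javascript
// Hypothetical helper: renders messages in ChatML and opens the assistant turn
// so the model continues from there.
function toChatML(messages) {
  const turns = messages.map(
    ({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>\n`
  );
  return turns.join('') + '<|im_start|>assistant\n';
}
```

A system message and a user message therefore produce two closed turns followed by the open `<|im_start|>assistant` line.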
## API Endpoints Quick Reference

| Method | Path | Notes |
|---|---|---|
| GET | `/health` | Returns service URLs |
| GET | `/health/services` | Parallel status of all dependencies |
| POST | `/chat` | Blocking completion |
| POST | `/chat/stream` | SSE streaming |
| GET/PATCH | `/settings` | Persistent settings |
| GET | `/models` | `.gguf` file scan |
| GET | `/models/props` | llama-server model info |
| GET | `/sessions` | Delegates to memory-service |
| GET | `/sessions/:sessionId/history` | Paginated episodes by external ID |
| PATCH | `/sessions/:sessionId` | `name` and/or `projectId` |
| DELETE | `/sessions/:sessionId` | |
| GET | `/episodes` | Delegates; supports `q` for FTS |
| DELETE | `/episodes/:id` | Delegates |
| GET/POST/PATCH/DELETE | `/projects` and `/projects/:id` | Delegates |
| POST | `/summaries/project/:projectId/generate` | On-demand; 422 if no data |
| GET | `/summaries/project/:projectId/overview` | |
| GET | `/summaries/session/:sessionId` | Resolves external ID first |
| GET | `/summaries/project/:projectId` | |