update documentation

This commit is contained in:
Storme-bit
2026-04-17 03:46:17 -07:00
parent 27e3c98304
commit 5145b9a7db
13 changed files with 822 additions and 794 deletions


@@ -39,56 +39,58 @@ src/
│ ├── memory.js # HTTP client for memory service
│ ├── inference.js # HTTP client for inference service
│ ├── embedding.js # HTTP client for embedding service
│ └── qdrant.js # HTTP client for Qdrant vector search
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
├── chat/
│ └── index.js # Core pipeline logic — context assembly and coordination
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
├── routes/
│ ├── chat.js # POST /chat and POST /chat/stream route handlers
│ ├── sessions.js # Session list, history, rename, and delete routes
│ ├── projects.js # Project CRUD routes — proxies to memory service
│ └── models.js # GET /models — reads models.json manifest from disk
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy
│ └── models.js # GET /models — reads models.json from disk
└── index.js # Express app entry point
```
The `services/` layer wraps all downstream HTTP calls in named functions,
keeping the pipeline logic in `chat/index.js` readable and ensuring that
The `services/` layer wraps all downstream HTTP calls in named functions.
URL or endpoint changes have a single place to be updated.
## Chat Pipeline
Both `POST /chat` and `POST /chat/stream` share the same context assembly
steps. The only difference is how the inference response is delivered to
the client.
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
difference is how the inference response is delivered to the client.
1. **Session resolution** — looks up the session by `externalId` in the memory
service. If not found, auto-creates a new session. Clients can generate a
UUID for new conversations and pass it directly — no pre-creation step needed.
### Steps
2. **Recent episode retrieval** — fetches the most recent episodes for the session
(default: 5) from the memory service.
1. **Session resolution** — look up session by `externalId`. Auto-create if
not found. Clients generate a UUID for new conversations — no pre-creation
step needed.
3. **Semantic search** — embeds the user message via the embedding service, then
queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
Results are deduplicated against the recent episode set using a `Set` of IDs.
Full episode content is fetched from the memory service by ID. This step is
non-critical — if it fails, a warning is logged and the pipeline continues with
2. **Project context resolution** — if the session has a `project_id`, fetch
the project and all its session IDs. Used to scope semantic search. See
`memory-isolation.md` for full behaviour.
3. **Recent episode retrieval** — fetch the most recent episodes for the
session (`RECENT_EPISODE_LIMIT`, default 5).
4. **Semantic search** — embed the user message, query Qdrant for the top-5
most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
recent episodes. Non-critical — if it fails, pipeline continues with
recency-only context.
4. **Prompt assembly** — combines the system prompt, semantic episodes (if any),
recent episodes, and the current user message into a single prompt string.
5. **Prompt assembly** — combine system prompt, semantic episodes, recent
episodes, and user message.
5. **Inference** — sends the assembled prompt to the inference service. `/chat`
awaits the full response; `/chat/stream` opens an SSE connection and pipes
chunks to the client as they arrive.
6. **Inference** — send to inference service. `/chat` awaits full response;
`/chat/stream` pipes SSE chunks to the client.
6. **Episode write** — writes the new exchange (user message + AI response)
back to the memory service as a fire-and-forget operation. For streaming,
the full response text is accumulated across chunks before writing.
7. **Episode write** — write the exchange back to memory. Fire-and-forget
for `/chat`; awaited for `/chat/stream` to ensure the full text is
accumulated before saving.
7. **Response** — returns the AI response, model name, session ID, and token
count to the client.
8. **Auto-naming** — on `isFirstMessage && !session.name`, fire a secondary
inference call with a naming prompt (max 20 tokens, temperature 0.3) and
write the result back as `session.name`. Fully fire-and-forget.
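
The deduplication in the semantic search step can be sketched as below. This is a minimal illustration, not the code in `chat/index.js`: the helper name `dedupeSemanticHits` and the `{ id, score }` hit shape are assumptions, and the hits are assumed to have already been limited and thresholded by Qdrant.

```js
// Hypothetical helper: the name and data shapes are assumptions for illustration.
// Drops any semantic hit whose episode is already present in the recent set.
function dedupeSemanticHits(recentEpisodes, hits) {
  // Build the ID set once so each membership check is O(1).
  const recentIds = new Set(recentEpisodes.map((episode) => episode.id));
  return hits.filter((hit) => !recentIds.has(hit.id));
}

// Usage: episode 2 is already in the recent set, so only episode 7 survives.
const recent = [{ id: 1 }, { id: 2 }];
const hits = [{ id: 2, score: 0.91 }, { id: 7, score: 0.8 }];
const uniqueHits = dedupeSemanticHits(recent, hits);
```

Filtering before fetching full episode content avoids a round trip to the memory service for episodes the prompt already contains.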
## Prompt Structure
### Prompt Structure
```
[System prompt]
@@ -108,212 +110,67 @@ User: {current message}
Assistant:
```
Semantic episodes appear before recent episodes so the model encounters
long-range relevant context before the immediate conversation flow.
Semantic episodes appear before recent episodes so the model sees
long-range context before the immediate conversation flow.
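
As a rough sketch of that ordering, assuming each episode carries `user_message` and `ai_response` fields: the function names and the way each episode is rendered are illustrative assumptions, and the real section layout in `chat/index.js` may differ.

```js
// Sketch only: the episode formatting is an assumption, not the actual layout.
function formatEpisode(episode) {
  return `User: ${episode.user_message}\nAssistant: ${episode.ai_response}`;
}

// Hypothetical assembler: system prompt, then semantic, then recent, then the new turn.
function assemblePrompt({ systemPrompt, semanticEpisodes = [], recentEpisodes = [], message }) {
  const parts = [systemPrompt];
  // Long-range context from semantic search goes first.
  parts.push(...semanticEpisodes.map(formatEpisode));
  // The immediate conversation flow follows, in the order the caller supplies.
  parts.push(...recentEpisodes.map(formatEpisode));
  // The prompt ends on an open "Assistant:" turn for the model to complete.
  parts.push(`User: ${message}\nAssistant:`);
  return parts.join('\n\n');
}

const prompt = assemblePrompt({
  systemPrompt: 'You are a helpful assistant.',
  semanticEpisodes: [{ user_message: 'My name is Tim.', ai_response: 'Nice to meet you, Tim.' }],
  recentEpisodes: [{ user_message: 'What is SSE?', ai_response: 'Server-Sent Events.' }],
  message: 'What is my name?',
});
```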
## SSE Stream Format
The inference service emits chunks from the llama.cpp provider in this format:
Inference service → orchestration:
```
data: {"response":"Hello","done":false}
data: {"response":"!","done":false}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
data: [DONE]
```
The orchestration service re-emits to the client as:
Orchestration → client:
```
data: {"text":"Hello"}
data: {"text":"!"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
```
The `[DONE]` sentinel from the inference service is consumed internally
and not forwarded. The client stream is terminated by `res.end()` after
the done event. Model name and token count are included on the done event
so the client can display them in the UI.
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
is terminated by `res.end()` after the done event.
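
The translation between the two formats could look roughly like this. It is a sketch under assumptions: `translateSseLine` is a hypothetical name, and it handles one complete `data:` line at a time, leaving chunk buffering and the `res.write` / `res.end()` calls to the caller.

```js
// Hypothetical per-line translator; returns null when nothing should be forwarded.
function translateSseLine(line) {
  const prefix = 'data: ';
  if (!line.startsWith(prefix)) return null; // ignore blank lines and non-data fields

  const payload = line.slice(prefix.length).trim();
  if (payload === '[DONE]') return null; // sentinel is consumed, never forwarded

  const chunk = JSON.parse(payload);
  if (chunk.done) {
    // Final event: pass the model name and token count through unchanged.
    const doneEvent = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
    return `data: ${JSON.stringify(doneEvent)}`;
  }
  // Regular chunk: rename "response" to "text" for the client.
  return `data: ${JSON.stringify({ text: chunk.response })}`;
}

const forwarded = [
  'data: {"response":"Hello","done":false}',
  'data: {"done":true,"model":"m.gguf","tokenCount":42}',
  'data: [DONE]',
].map(translateSseLine);
```

When writing each non-null result to the response, the caller appends a blank line (`\n\n`), since that is what delimits events in an SSE stream.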
## Models Manifest
The `/models` endpoint reads a `models.json` file from disk at the path
specified by `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside
the model files, and is accessible to orchestration via a network share
mounted at `/mnt/nexus-models`.
`GET /models` reads `models.json` fresh on each request from
`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
accessible via an SMB mount at `/mnt/nexus-models`.
The manifest is read fresh on each request — no restart needed when models
are added or removed.
**models.json format:**
```json
[
{ "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```
- `value` must match the model name as reported by `llama-server` (including `.gguf` extension)
- `label` — display name shown in the UI
`value` must match the model name as reported by `llama-server` (including
`.gguf` extension). No service restart needed when models are added or removed.
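
A handler that re-reads the file on every request might validate its contents along these lines. This is a hedged sketch: `parseManifest` is a hypothetical helper, and the shape checks are an assumption about what a safe manifest looks like, not a description of what `routes/models.js` does.

```js
// Hypothetical validator for the raw text of models.json.
// Throws on malformed input so the route handler can answer with an error status.
function parseManifest(raw) {
  const entries = JSON.parse(raw); // throws SyntaxError on invalid JSON
  if (!Array.isArray(entries)) {
    throw new Error('models.json must contain an array');
  }
  for (const entry of entries) {
    if (typeof entry?.value !== 'string' || typeof entry?.label !== 'string') {
      throw new Error('each manifest entry needs string "value" and "label" fields');
    }
  }
  return entries;
}

const models = parseManifest('[{ "value": "example.gguf", "label": "Example" }]');
```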
## Endpoints
## Sessions Route Behaviour
### Health
`PATCH /sessions/:sessionId` accepts either `name`, `projectId`, or both.
The validation guard only rejects requests where neither is provided:
| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check — reports downstream service URLs |
### Chat
| Method | Path | Description |
|---|---|---|
| POST | /chat | Send a message and receive a complete response |
| POST | /chat/stream | Send a message and receive a streaming SSE response |
### Sessions
| Method | Path | Description |
|---|---|---|
| GET | /sessions | Get paginated list of all sessions |
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
| PATCH | /sessions/:sessionId | Rename a session |
| DELETE | /sessions/:sessionId | Delete a session and all its episodes |
### Projects
Projects are proxied directly from the memory service with no transformation.
| Method | Path | Description |
|---|---|---|
| GET | /projects | Get all projects |
| POST | /projects | Create a new project |
| PATCH | /projects/:id | Update a project |
| DELETE | /projects/:id | Delete a project |
### Models
| Method | Path | Description |
|---|---|---|
| GET | /models | Get list of available models from manifest file |
---
**POST /chat**
Request body:
```json
{
"sessionId": "your-session-uuid",
"message": "Hello, my name is Tim.",
"model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"temperature": 0.7
```js
if (!name?.trim() && projectId === undefined) {
return res.status(400).json({ error: 'name or projectId is required' });
}
```
`model` and `temperature` are optional — fall back to inference service defaults
if omitted.
Response:
```json
{
"sessionId": "your-session-uuid",
"response": "Hello Tim! How can I help you today?",
"model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"tokenCount": 87
}
```
---
**POST /chat/stream**
Same request body as `POST /chat`.
Response is a stream of Server-Sent Events:
```
data: {"text":"Hello"}
data: {"text":" Tim"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
```
---
**PATCH /sessions/:sessionId**
Request body:
```json
{ "name": "My Renamed Session" }
```
Returns the updated session object. `name` is required and trimmed of whitespace.
---
**DELETE /sessions/:sessionId**
Returns `204 No Content`. Cascades to delete all episodes for the session.
---
**GET /sessions/:sessionId/history**
Query parameters:
| Parameter | Default | Description |
|---|---|---|
| limit | 20 | Maximum number of episodes to return |
| offset | 0 | Number of episodes to skip (for pagination) |
Response:
```json
{
"sessionId": "your-session-uuid",
"episodes": [
{
"id": 42,
"session_id": 1,
"user_message": "Hello, my name is Tim.",
"ai_response": "Hello Tim! How can I help you today?",
"token_count": 87,
"created_at": 1712345678,
"metadata": null
}
]
}
```
Episodes are ordered newest first.
---
**GET /models**
Returns the parsed contents of `models.json`:
```json
[
{ "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```
Returns `500` if the manifest file cannot be read or parsed.
This allows `useChat` to write project assignment separately from rename
operations.
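
To make the guard's behaviour concrete, the same condition can be lifted into a small predicate and run against a few bodies. `isValidSessionPatch` is a hypothetical name used only for this illustration.

```js
// Same condition as the route guard, inverted: true means the request is accepted.
function isValidSessionPatch({ name, projectId } = {}) {
  return !(!name?.trim() && projectId === undefined);
}

const results = {
  renameOnly: isValidSessionPatch({ name: 'Trip planning' }), // accepted
  assignOnly: isValidSessionPatch({ projectId: 3 }),          // accepted
  nullProject: isValidSessionPatch({ projectId: null }),      // accepted: null is not undefined
  blankName: isValidSessionPatch({ name: '   ' }),            // rejected: whitespace-only name
  emptyBody: isValidSessionPatch({}),                         // rejected: neither field given
};
```

Because the check is `projectId === undefined` rather than a truthiness test, an explicit `null` passes the guard. What the route then does with a `null` project is not covered here.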
## Caddy Configuration
The Caddy reverse proxy on Mini PC 2 must have a handle block for each route
prefix the client needs to reach. Current required blocks:
Each route prefix needs a handle block in the Caddyfile on Mini PC 2:
```
handle /chat* {
reverse_proxy localhost:4000
}
handle /sessions* {
reverse_proxy localhost:4000
}
handle /models* {
reverse_proxy localhost:4000
}
handle /projects* {
reverse_proxy localhost:4000
}
handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
```
When adding new top-level routes to the orchestration service, add a matching
block here and reload Caddy: `caddy reload --config /path/to/Caddyfile`
After updating: `caddy reload --config /path/to/Caddyfile`
For all HTTP endpoints, see `api-routes.md`.