update documentation
@@ -39,56 +39,58 @@ src/
│   ├── memory.js  # HTTP client for memory service
│   ├── inference.js  # HTTP client for inference service
│   ├── embedding.js  # HTTP client for embedding service
│   └── qdrant.js  # HTTP client for Qdrant vector search
│   └── qdrant.js  # HTTP client for Qdrant (direct vector search)
├── chat/
│   └── index.js  # Core pipeline logic: context assembly and coordination
│   └── index.js  # Core pipeline: context assembly, isolation, auto-naming
├── routes/
│   ├── chat.js  # POST /chat and POST /chat/stream route handlers
│   ├── sessions.js  # Session list, history, rename, and delete routes
│   ├── projects.js  # Project CRUD routes (proxies to memory service)
│   └── models.js  # GET /models: reads models.json manifest from disk
│   ├── chat.js  # POST /chat and POST /chat/stream
│   ├── sessions.js  # Session CRUD proxy
│   ├── projects.js  # Project CRUD proxy
│   └── models.js  # GET /models: reads models.json from disk
└── index.js  # Express app entry point
```

The `services/` layer wraps all downstream HTTP calls in named functions,
keeping the pipeline logic in `chat/index.js` readable and ensuring that
The `services/` layer wraps all downstream HTTP calls in named functions.
URL or endpoint changes have a single place to be updated.

## Chat Pipeline

Both `POST /chat` and `POST /chat/stream` share the same context assembly
steps. The only difference is how the inference response is delivered to
the client.
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
difference is how the inference response is delivered to the client.

1. **Session resolution**: looks up the session by `externalId` in the memory
service. If not found, auto-creates a new session. Clients can generate a
UUID for new conversations and pass it directly; no pre-creation step needed.
### Steps

2. **Recent episode retrieval**: fetches the most recent episodes for the session
(default: 5) from the memory service.
1. **Session resolution**: look up session by `externalId`. Auto-create if
not found. Clients generate a UUID for new conversations; no pre-creation
step needed.

3. **Semantic search**: embeds the user message via the embedding service, then
queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
Results are deduplicated against the recent episode set using a `Set` of IDs.
Full episode content is fetched from the memory service by ID. This step is
non-critical: if it fails, a warning is logged and the pipeline continues with
2. **Project context resolution**: if the session has a `project_id`, fetch
the project and all its session IDs. Used to scope semantic search. See
`memory-isolation.md` for full behaviour.

3. **Recent episode retrieval**: fetch the most recent episodes for the
session (`RECENT_EPISODE_LIMIT`, default 5).

4. **Semantic search**: embed the user message, query Qdrant for the top-5
most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
recent episodes. Non-critical: if it fails, pipeline continues with
recency-only context.

4. **Prompt assembly**: combines the system prompt, semantic episodes (if any),
recent episodes, and the current user message into a single prompt string.
5. **Prompt assembly**: combine system prompt, semantic episodes, recent
episodes, and user message.

5. **Inference**: sends the assembled prompt to the inference service. `/chat`
awaits the full response; `/chat/stream` opens an SSE connection and pipes
chunks to the client as they arrive.
6. **Inference**: send to inference service. `/chat` awaits full response;
`/chat/stream` pipes SSE chunks to the client.

6. **Episode write**: writes the new exchange (user message + AI response)
back to the memory service as a fire-and-forget operation. For streaming,
the full response text is accumulated across chunks before writing.
7. **Episode write**: write the exchange back to memory. Fire-and-forget
for `/chat`; awaited for `/chat/stream` to ensure the full text is
accumulated before saving.

7. **Response**: returns the AI response, model name, session ID, and token
count to the client.
8. **Auto-naming**: on `isFirstMessage && !session.name`, fire a secondary
inference call with a naming prompt (max 20 tokens, temperature 0.3) and
write the result back as `session.name`. Fully fire-and-forget.
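
The filtering and deduplication in the semantic search step can be sketched as follows. This is a minimal illustration, not the code in `chat/index.js`: the helper name `selectSemanticIds`, the `{ id, score }` hit shape, and the inclusive threshold comparison are all assumptions; only the 0.75 threshold, the top-5 limit, and the `Set` of IDs come from the text above.

```javascript
// Hypothetical helper: choose which semantic hits to fetch in full.
// Assumes each hit looks like { id, score } and each recent episode has an id.
const SCORE_THRESHOLD = 0.75; // value from the step list; inclusive comparison is an assumption
const TOP_K = 5;

function selectSemanticIds(hits, recentEpisodes) {
  // IDs already present in the recency window must not be repeated.
  const recentIds = new Set(recentEpisodes.map((ep) => ep.id));

  return hits
    .filter((hit) => hit.score >= SCORE_THRESHOLD) // drop weak matches
    .filter((hit) => !recentIds.has(hit.id))       // drop duplicates of recent episodes
    .sort((a, b) => b.score - a.score)             // best match first
    .slice(0, TOP_K)
    .map((hit) => hit.id);
}
```

The returned IDs would then be used to fetch full episode content from the memory service. An empty array simply means the prompt is built from recent episodes only.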

## Prompt Structure
### Prompt Structure

```
[System prompt]
@@ -108,212 +110,67 @@ User: {current message}
Assistant:
```

Semantic episodes appear before recent episodes so the model encounters
long-range relevant context before the immediate conversation flow.
Semantic episodes appear before recent episodes so the model sees
long-range context before the immediate conversation flow.

## SSE Stream Format

The inference service emits chunks from the llama.cpp provider in this format:
Inference service → orchestration:
```
data: {"response":"Hello","done":false}
data: {"response":"!","done":false}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
data: [DONE]
```

The orchestration service re-emits to the client as:
Orchestration → client:
```
data: {"text":"Hello"}
data: {"text":"!"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
```

The `[DONE]` sentinel from the inference service is consumed internally
and not forwarded. The client stream is terminated by `res.end()` after
the done event. Model name and token count are included on the done event
so the client can display them in the UI.
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
is terminated by `res.end()` after the done event.
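
The re-emission described above can be sketched as a per-line translation. This is only an illustration of the two wire formats shown in this section; the function name `translateSseLine` is hypothetical, and the real service may buffer or parse the upstream stream differently.

```javascript
// Hypothetical translator: takes one upstream SSE line from the inference
// service and returns the SSE frame to send to the client, or null when
// nothing should be forwarded.
function translateSseLine(line) {
  if (!line.startsWith('data: ')) return null; // ignore blank lines and comments

  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return null; // sentinel is consumed, not forwarded

  const chunk = JSON.parse(payload);
  if (chunk.done) {
    // Final event: pass model name and token count through unchanged.
    const { model, tokenCount } = chunk;
    return `data: ${JSON.stringify({ done: true, model, tokenCount })}\n\n`;
  }

  // Regular chunk: the upstream "response" field becomes "text".
  return `data: ${JSON.stringify({ text: chunk.response })}\n\n`;
}
```

A route handler would write each non-null result to the response and call `res.end()` once the done event has been written.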

## Models Manifest

The `/models` endpoint reads a `models.json` file from disk at the path
specified by `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside
the model files, and is accessible to orchestration via a network share
mounted at `/mnt/nexus-models`.
`GET /models` reads `models.json` fresh on each request from
`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
accessible via an SMB mount at `/mnt/nexus-models`.

The manifest is read fresh on each request, so no restart is needed when models
are added or removed.

**models.json format:**
```json
[
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```

- `value`: must match the model name as reported by `llama-server` (including `.gguf` extension)
- `label`: display name shown in the UI
`value` must match the model name as reported by `llama-server` (including
`.gguf` extension). No service restart needed when models are added or removed.
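
Reading the manifest on every request might look like the sketch below. The function names `parseManifest` and `loadManifest` are hypothetical, and the array-of-`{ value, label }` check is an assumption based on the `models.json` format shown in this section; the actual route in `routes/models.js` may validate differently or not at all.

```javascript
const fs = require('fs/promises');

// Hypothetical validator: assumes the manifest is a JSON array of
// { value, label } objects. Throws on malformed input so the route can
// answer with an error status.
function parseManifest(text) {
  const models = JSON.parse(text);
  if (!Array.isArray(models)) {
    throw new Error('models.json must contain a JSON array');
  }
  return models.map(({ value, label }) => ({ value, label }));
}

// Reads the file on every call (no caching), so edits to models.json are
// picked up without restarting the service.
async function loadManifest(manifestPath = process.env.MODELS_MANIFEST_PATH) {
  const text = await fs.readFile(manifestPath, 'utf8');
  return parseManifest(text);
}
```

Skipping a cache keeps the behaviour simple at the cost of one file read per request, which is a reasonable trade for a small manifest on a mounted share.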

## Endpoints
## Sessions Route Behaviour

### Health
`PATCH /sessions/:sessionId` accepts `name`, `projectId`, or both.
The validation guard only rejects requests where neither is provided:

| Method | Path | Description |
|---|---|---|
| GET | /health | Service health check; reports downstream service URLs |

### Chat

| Method | Path | Description |
|---|---|---|
| POST | /chat | Send a message and receive a complete response |
| POST | /chat/stream | Send a message and receive a streaming SSE response |

### Sessions

| Method | Path | Description |
|---|---|---|
| GET | /sessions | Get paginated list of all sessions |
| GET | /sessions/:sessionId/history | Get paginated episode history for a session |
| PATCH | /sessions/:sessionId | Rename a session |
| DELETE | /sessions/:sessionId | Delete a session and all its episodes |

### Projects

Projects are proxied directly from the memory service with no transformation.

| Method | Path | Description |
|---|---|---|
| GET | /projects | Get all projects |
| POST | /projects | Create a new project |
| PATCH | /projects/:id | Update a project |
| DELETE | /projects/:id | Delete a project |

### Models

| Method | Path | Description |
|---|---|---|
| GET | /models | Get list of available models from manifest file |

---

**POST /chat**

Request body:
```json
{
  "sessionId": "your-session-uuid",
  "message": "Hello, my name is Tim.",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7
```js
if (!name?.trim() && projectId === undefined) {
  return res.status(400).json({ error: 'name or projectId is required' });
}
```

`model` and `temperature` are optional and fall back to inference service defaults
if omitted.

Response:
```json
{
  "sessionId": "your-session-uuid",
  "response": "Hello Tim! How can I help you today?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "tokenCount": 87
}
```

---

**POST /chat/stream**

Same request body as `POST /chat`.

Response is a stream of Server-Sent Events:
```
data: {"text":"Hello"}
data: {"text":" Tim"}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
```

---

**PATCH /sessions/:sessionId**

Request body:
```json
{ "name": "My Renamed Session" }
```

Returns the updated session object. `name` is required and trimmed of whitespace.

---

**DELETE /sessions/:sessionId**

Returns `204 No Content`. Cascades to delete all episodes for the session.

---

**GET /sessions/:sessionId/history**

Query parameters:

| Parameter | Default | Description |
|---|---|---|
| limit | 20 | Maximum number of episodes to return |
| offset | 0 | Number of episodes to skip (for pagination) |

Response:
```json
{
  "sessionId": "your-session-uuid",
  "episodes": [
    {
      "id": 42,
      "session_id": 1,
      "user_message": "Hello, my name is Tim.",
      "ai_response": "Hello Tim! How can I help you today?",
      "token_count": 87,
      "created_at": 1712345678,
      "metadata": null
    }
  ]
}
```

Episodes are ordered newest first.

---

**GET /models**

Returns the parsed contents of `models.json`:
```json
[
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```

Returns `500` if the manifest file cannot be read or parsed.
This allows `useChat` to write project assignment separately from rename
operations.
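
To make the accepted payloads concrete, here is a small sketch built around the same guard condition as the route. The helper `buildSessionPatch` is hypothetical and not part of the service; treating `projectId: null` as a valid value (for example, to unassign a project) is an assumption that follows only from the guard checking `=== undefined`.

```javascript
// Hypothetical helper mirroring the PATCH guard: returns the fields to
// forward to the memory service, or null when the request should get a 400.
function buildSessionPatch(body = {}) {
  const { name, projectId } = body;

  // Same condition as the route guard: reject only when neither is usable.
  if (!name?.trim() && projectId === undefined) return null;

  const patch = {};
  if (name?.trim()) patch.name = name.trim();
  // Assumption: null passes through, since the guard only tests for undefined.
  if (projectId !== undefined) patch.projectId = projectId;
  return patch;
}
```

With this shape, `{ "projectId": 3 }` alone is accepted without touching the name, while `{}` and `{ "name": "   " }` are both rejected.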

## Caddy Configuration

The Caddy reverse proxy on Mini PC 2 must have a handle block for each route
prefix the client needs to reach. Current required blocks:
Each route prefix needs a handle block in the Caddyfile on Mini PC 2:

```
handle /chat* {
    reverse_proxy localhost:4000
}
handle /sessions* {
    reverse_proxy localhost:4000
}
handle /models* {
    reverse_proxy localhost:4000
}
handle /projects* {
    reverse_proxy localhost:4000
}
handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
```

When adding new top-level routes to the orchestration service, add a matching
block here and reload Caddy: `caddy reload --config /path/to/Caddyfile`
After updating: `caddy reload --config /path/to/Caddyfile`

For all HTTP endpoints, see `api-routes.md`.