updated documentation

2026-04-13 03:42:14 -07:00
parent 5f024093d1
commit 045da0d7f4
5 changed files with 464 additions and 112 deletions
--- a/docs/services/chat-client.md
+++ b/docs/services/chat-client.md
@@ -27,33 +27,46 @@ npm run dev         # local dev server on port 5173
 Vite bakes environment variables into the bundle at build time. The `.env`
 file is only needed on the machine running the build, not where files are served.
 After building, copy `dist/` contents to `/srv/nexusai` on Mini PC 2 for Caddy to serve.
 ## Environment Variables
 | Variable | Required | Default | Description |
 |---|---|---|---|
-| VITE_ORCHESTRATION_URL | No | `''` (empty) | Orchestration base URL. Empty string uses Vite proxy in dev, Caddy proxy in production. |
+| VITE_ORCHESTRATION_URL | No | `''` (empty) | Orchestration base URL. Must be set to the HTTPS domain in production to avoid mixed content errors. |
 Production value:
 ```
 VITE_ORCHESTRATION_URL=https://nexus.jellystorm.com
 ```
 ## Internal Structure
 ```
 src/
 ├── api/
-│   └── orchestration.js   # All fetch calls to the orchestration service
+│   └── orchestration.js    # All fetch calls to the orchestration service
 ├── config/
 │   └── constants.js        # FALLBACK_MODELS, DEFAULT_MODEL, API_DEFAULTS
 ├── hooks/
-│   ├── useSession.js       # Session list, history loading, active session state
+│   ├── useSession.js        # Session list, history loading, active session state
-│   └── useChat.js          # Message sending, SSE streaming, message state
+│   ├── useChat.js           # Message sending, SSE streaming, message state
 │   ├── useModels.js         # Dynamic model list fetched from /models endpoint
 │   └── useContextMenu.js   # Right-click context menu position and visibility
 ├── components/
-│   ├── App.jsx             # Root component — layout and shared state
+│   ├── App.jsx              # Root component — layout and shared state
-│   ├── SessionList.jsx     # Left sidebar — session list and new chat button
+│   ├── SessionList.jsx      # Left sidebar — session list, rename, delete
-│   ├── ChatWindow.jsx      # Centre panel — message thread and input bar
+│   ├── ChatWindow.jsx       # Centre panel — message thread and input bar
-│   ├── MessageBubble.jsx   # Individual message bubble (user or assistant)
+│   ├── MessageBubble.jsx    # Individual message bubble (user or assistant)
-│   └── InfoPanel.jsx       # Right panel — model selector and session metadata
+│   ├── InfoPanel.jsx        # Right panel — model selector and session metadata
-├── index.css               # Global reset and CSS variables
+│   └── SessionModal.jsx     # Modal dialog for session settings (rename)
-└── main.jsx                # React entry point
+├── index.css                # Global reset, CSS variables, utility classes
 └── main.jsx                 # React entry point
 ```
 ## Layout
 Three-panel layout with collapsible sidebars:
 ```
 ┌─────────────────┬──────────────────────────┬─────────────┐
 │  Session List   │       Chat Window         │  Info Panel │
 │  (collapsible)  │                           │ (collapsible)│
@@ -64,9 +77,54 @@ Three-panel layout with collapsible sidebars:
 │ Session 2       │                           │             │
 │                 │  [input bar]              │             │
 └─────────────────┴──────────────────────────┴─────────────┘
 ```
-On mobile, sidebars collapse to a 56px icon rail. The centre chat window
+Sidebars collapse to a 56px icon rail. The centre chat window always
-always fills the remaining space.
+fills the remaining space.
 ## CSS Architecture
 Styles follow a hybrid approach — CSS utility classes for static reusable
 rules, inline styles for dynamic prop-driven values.
 ### CSS Variables (`:root`)
 | Variable | Value | Description |
 |---|---|---|
 | `--bg-base` | `#0f1117` | Page background |
 | `--bg-surface` | `#1a1d27` | Panel backgrounds |
 | `--bg-elevated` | `#222536` | Elevated elements (inputs, cards) |
 | `--border` | `#2e3150` | Border colour |
 | `--accent` | `#6c63ff` | Primary accent (buttons, highlights) |
 | `--accent-hover` | `#574fd6` | Accent hover state |
 | `--text-primary` | `#e8e8f0` | Primary text |
 | `--text-secondary` | `#8b8fa8` | Secondary text |
 | `--text-muted` | `#555870` | Muted / placeholder text |
 | `--bubble-user` | `#6c63ff` | User message bubble background |
 | `--bubble-ai` | `#222536` | AI message bubble background |
 | `--sidebar-width` | `280px` | Expanded sidebar width |
 | `--panel-width` | `260px` | Expanded info panel width |
 | `--header-height` | `56px` | Shared header height across all panels |
 | `--radius-sm` | `6px` | Small border radius |
 | `--radius-md` | `8px` | Medium border radius |
 | `--radius-lg` | `12px` | Large border radius |
 ### Utility Classes
 | Class | Description |
 |---|---|
 | `.panel-header` | Shared header row — used in all three panels |
 | `.btn-reset` | Resets button styles (no border, bg, cursor pointer) |
 | `.btn-icon` | Icon button with hover state |
 | `.btn-primary` | Accent-coloured action button with `:hover` and `:disabled` states |
 | `.flex` / `.flex-col` | Flex layout helpers |
 | `.flex-1` / `.flex-shrink` | Flex sizing helpers |
 | `.items-center` / `.justify-center` / `.justify-between` | Alignment helpers |
 | `.overflow-hidden` / `.scroll-y` | Overflow helpers |
 | `.text-xs` / `.text-sm` / `.text-base` | Font size helpers |
 | `.text-muted` / `.text-secondary` / `.text-accent` | Colour helpers |
 | `.label-upper` | Uppercase section label style |
 | `.truncate` | Text overflow ellipsis |
 ## API Layer
@@ -78,39 +136,71 @@ All orchestration calls are centralised in `src/api/orchestration.js`:
 | `fetchSessionHistory` | GET | /sessions/:id/history | Load episode history on session select |
 | `sendMessage` | POST | /chat | Send message, await full response |
 | `streamMessage` | POST | /chat/stream | Send message, receive SSE token stream |
 | `fetchModels` | GET | /models | Load available models from manifest |
 | `renameSession` | PATCH | /sessions/:id | Rename a session |
 | `deleteSession` | DELETE | /sessions/:id | Delete a session |
 `streamMessage` returns an abort function — call it to cancel a stream mid-flight.
-It uses a buffer pattern to handle SSE chunks that may span multiple network packets.
+It uses a buffer pattern to reassemble SSE events that are split across network chunks.
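A minimal sketch of the buffer pattern described above, assuming newline-delimited `data:` lines. `createSseParser` and its callback are hypothetical names used for illustration; the actual code in `orchestration.js` may be structured differently.

```javascript
// Hypothetical helper, not taken from orchestration.js.
// Buffers raw text chunks and emits one parsed object per complete `data:` line.
function createSseParser(onEvent) {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    const lines = buffer.split('\n');
    // The last element is either '' or an incomplete line; keep it for the next chunk.
    buffer = lines.pop();
    for (const line of lines) {
      const trimmed = line.trim();
      if (!trimmed.startsWith('data:')) continue;
      const payload = trimmed.slice('data:'.length).trim();
      if (payload === '' || payload === '[DONE]') continue;
      onEvent(JSON.parse(payload));
    }
  };
}

// Feed the parser chunks that split events at arbitrary positions.
const events = [];
const push = createSseParser((event) => events.push(event));
push('data: {"text":"Hel');
push('lo"}\n\ndata: {"te');
push('xt":" Tim"}\n\ndata: {"done":true}\n\n');
console.log(events);
```

Keeping the unfinished tail in `buffer` is what makes the parser safe against chunk boundaries that fall in the middle of a JSON payload.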
 ## Streaming
 The chat input sends messages via `POST /chat/stream`. Tokens arrive as SSE events:
 ```
 data: {"text":"Hello"}
 data: {"text":" Tim"}
-data: {"done":true}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
 ```
 An empty assistant bubble is appended immediately when the stream opens, then
 updated token by token using `updateLastMessage`. The blinking cursor in
 `MessageBubble` is shown while `message.streaming === true` and disappears
-when `done` is received.
+when the done event is received. Model name and token count from the done
 event are stored in `useChat` state and displayed in the InfoPanel.
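The update flow above could be modelled as a pure function over the message list. This is only a sketch: `applyStreamEvent` and the `role`/`content` field names are assumptions, and storing `model` and `tokenCount` on the message itself (rather than in separate `useChat` state) is a simplification for illustration.

```javascript
// Hypothetical reducer: returns a new message array with the last
// assistant message updated according to one stream event.
function applyStreamEvent(messages, event) {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'assistant') return messages;
  const updated = event.done
    ? { ...last, streaming: false, model: event.model, tokenCount: event.tokenCount }
    : { ...last, content: last.content + (event.text ?? '') };
  return [...messages.slice(0, -1), updated];
}

// An empty assistant bubble is appended when the stream opens.
let messages = [
  { role: 'user', content: 'Hi' },
  { role: 'assistant', content: '', streaming: true },
];
const streamEvents = [
  { text: 'Hello' },
  { text: ' Tim' },
  { done: true, model: 'm.gguf', tokenCount: 87 },
];
for (const event of streamEvents) {
  messages = applyStreamEvent(messages, event);
}
console.log(messages[1]);
```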
-## Model Selector
+## Dynamic Model Selector
-Available models are defined in `InfoPanel.jsx`:
+Available models are fetched from `GET /models` on mount via the `useModels` hook.
 The hook initialises with `FALLBACK_MODELS` from `constants.js` and replaces them
 with the server response on success. If the fetch fails, the hook keeps the fallback
 list without surfacing an error in the UI; only a console warning is logged.
-| Label | Value |
+```js
-|---|---|
+// constants.js
-| Companion | `companion:latest` |
+export const FALLBACK_MODELS = [
-| Mistral Nemo | `mistral-nemo:latest` |
+  { value: 'companion:latest', label: 'Companion' },
-| Coder | `coder:latest` |
+  // ...
-| Qwen 2.5 Coder 14B | `qwen2.5-coder:14b` |
+];
 ```
-The selected model is passed with every chat request. To add a new model,
+The selected model is passed with every chat request. To add a model, update
-update the `MODELS` array in `InfoPanel.jsx`.
+`models.json` on the main PC — no client rebuild needed.
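The fallback behaviour of `useModels` might reduce to something like the following. This is a sketch under assumptions: `loadModels` is a hypothetical name, the fetch function is injected so the logic can be exercised without a server, and treating an empty or non-array response as a failure is a choice made here, not something the documentation states.

```javascript
// Stand-in for the list exported from constants.js.
const FALLBACK_MODELS = [{ value: 'companion:latest', label: 'Companion' }];

// Hypothetical loader: resolves to the server list, or to the fallback on any failure.
async function loadModels(fetchImpl, baseUrl = '') {
  try {
    const res = await fetchImpl(`${baseUrl}/models`);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const models = await res.json();
    return Array.isArray(models) && models.length > 0 ? models : FALLBACK_MODELS;
  } catch (err) {
    console.warn('Model fetch failed, using fallback list:', err.message);
    return FALLBACK_MODELS;
  }
}

// Simulate an unreachable orchestration service.
const failingFetch = async () => {
  throw new Error('network down');
};
loadModels(failingFetch).then((models) => console.log(models));
```

In a React hook the resolved list would be passed to a state setter; the decision logic stays the same.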
 ## Session Management
-Sessions are identified by a `external_id` — a human-readable string or UUID
+Sessions are identified by `external_id` — a UUID generated client-side via the
-generated client-side. New sessions are created locally with `uuid` and auto-registered
+`uuid` package. New sessions are created locally and auto-registered in the memory
-in the memory service on the first message. The session list refreshes after each
+service on the first message. The session list refreshes after each completed
-completed response to surface newly created sessions.
+response to surface newly created sessions.
 ### Session Actions
 The session list supports rename and delete:
 - **Hover** — reveals ✎ (rename) and ✕ (delete) icon buttons on the session row
 - **Right-click** — opens a context menu with the same actions
 Rename opens a `SessionModal` dialog. The modal is designed to expand into a full
 session settings panel in future — the title is already "Session Settings" to
 reflect this intent.
 Delete is immediate; a confirmation dialog is planned for a future update.
 Actions are disabled on unsaved (new) sessions that haven't had a message sent yet.
 ### Context Menu
 The context menu is implemented via the `useContextMenu` hook, which tracks
 `{ x, y, session }` state and attaches a `window` click listener that dismisses
 the menu on any outside click. The menu is rendered outside the sidebar div (via a
 React fragment) so that it is not clipped by `overflow: hidden`.
--- a/docs/services/inference-service.md
+++ b/docs/services/inference-service.md
@@ -2,7 +2,7 @@
 **Package:** `@nexusai/inference-service`  
 **Location:** `packages/inference-service`  
-**Deployed on:** Main PC  
+**Deployed on:** Main PC (192.168.0.79)  
 **Port:** 3001
 ## Purpose
@@ -15,7 +15,7 @@ to switch inference backends without changes to the rest of the system.
 ## Dependencies
 - `express` — HTTP API
-- `ollama` — Ollama client (used by the Ollama provider)
+- `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
 - `dotenv` — environment variable loading
 - `@nexusai/shared` — shared utilities
@@ -24,9 +24,13 @@ to switch inference backends without changes to the rest of the system.
 | Variable | Required | Default | Description |
 |---|---|---|---|
 | PORT | No | 3001 | Port to listen on |
-| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
+| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
-| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
+| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
-| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
+| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
 > `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
 > service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
 > reach this service on port 3001.
 ## Provider Architecture
@@ -39,14 +43,87 @@ signatures, so the rest of the service is unaware of which backend is active.
 | Provider | Value | Runtime |
 |---|---|---|
-| Ollama | `ollama` | Ollama via the `ollama` npm package |
+| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
-| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |
+| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |
-Switching providers requires only a `.env` change — no code modifications needed.
+Switching providers requires only a `.env` change — no code modifications needed:
 ```
 INFERENCE_PROVIDER=llamacpp
 INFERENCE_URL=http://localhost:8080
 ```
 ### Provider Validation
 The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately
 if the value is unknown, so a misconfigured provider fails fast instead of silently:
 ```
 Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
 ```
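A startup check that produces the error shown above could look roughly like this. `resolveProvider` and `VALID_PROVIDERS` are hypothetical names; only the error text and the two provider values come from this document.

```javascript
const VALID_PROVIDERS = ['ollama', 'llamacpp'];

// Hypothetical validator: returns the provider name or throws on an unknown value.
// The default mirrors the documented INFERENCE_PROVIDER default.
function resolveProvider(name = 'llamacpp') {
  if (!VALID_PROVIDERS.includes(name)) {
    throw new Error(
      `Unknown inference provider: "${name}". Valid options: ${VALID_PROVIDERS.join(', ')}`
    );
  }
  return name;
}

try {
  resolveProvider('foo');
} catch (err) {
  console.error(err.message);
}
```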
 ## llama.cpp Provider
 The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.
 ### Starting llama-server
 `llama-server` must be started manually on the main PC before the inference service
 can handle requests. It loads a single model at startup:
 ```powershell
 .\llama-gpu\llama-server.exe `
  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
  -ngl 99 `
  --reasoning off `
  --host 0.0.0.0 `
  --port 8080 `
  -c 64000
 ```
 Key flags:
 | Flag | Description |
 |---|---|
 | `-m` | Path to the `.gguf` model file |
 | `-ngl 99` | Offload as many layers as possible to GPU |
 | `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
 | `--host 0.0.0.0` | Allows connections from other machines on the LAN |
 | `--port 8080` | Port for the llama-server HTTP API |
 | `-c 64000` | Context window size in tokens |
 > `-c 64000` is intentionally large. Monitor VRAM usage — if pressure builds,
 > reduce this value. The NexusAI memory architecture handles context injection
 > so a smaller window (6–8K) is often sufficient.
 ### Model Naming
 The model name sent in API requests must match the name as reported by
 `llama-server` — including the `.gguf` extension. The reported name can be
 verified with:
 ```powershell
 Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
 ```
 Set `DEFAULT_MODEL` in `.env` to the exact reported name:
 ```
 DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
 ```
 ### Inference Parameters
 The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
 | NexusAI option | API field | Default |
 |---|---|---|
 | `temperature` | `temperature` | 0.7 |
 | `maxTokens` | `max_tokens` | 1024 |
 | `topP` | `top_p` | 0.9 |
 | `topK` | `top_k` | 40 |
 | `repeatPenalty` | `repeat_penalty` | 1.1 |
 | `seed` | `seed` | null (random) |
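The mapping in the table above can be sketched as a small translation function. `toLlamaCppParams` is a hypothetical name, and omitting `seed` from the request when it is `null` is an assumption about how "random" is expressed, not a confirmed detail of the provider.

```javascript
// Defaults taken from the table above.
const DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

// Hypothetical mapper from NexusAI option names to OpenAI-compatible field names.
function toLlamaCppParams(options = {}) {
  const merged = { ...DEFAULTS, ...options };
  const params = {
    temperature: merged.temperature,
    max_tokens: merged.maxTokens,
    top_p: merged.topP,
    top_k: merged.topK,
    repeat_penalty: merged.repeatPenalty,
  };
  // Assumption: a null seed means "let the server choose", so the field is left out.
  if (merged.seed !== null && merged.seed !== undefined) params.seed = merged.seed;
  return params;
}

console.log(toLlamaCppParams({ maxTokens: 256, seed: 7 }));
```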
 ## Internal Structure
 ```
 src/
 ├── providers/
 │   ├── ollama.js      # Ollama provider — uses ollama npm package
@@ -55,6 +132,27 @@ src/
 │   └── inference.js   # /complete and /complete/stream route handlers
 ├── infer.js           # Provider loader — selects and re-exports active provider
 └── index.js           # Express app + route definitions
 ```
 ## Streaming Response Format
 The llama.cpp provider yields chunks in this shape:
 ```js
 { response: "token text", done: false }
 // final chunk:
 { response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
 ```
 The inference route re-emits these as SSE events:
 ```
 data: {"response":"token text"}
 data: {"done":true,"model":"model-name.gguf","tokenCount":42}
 data: [DONE]
 ```
 `model` and `tokenCount` are read from the final llama.cpp chunk (the one with
 `finish_reason: stop`); `tokenCount` comes from `usage.completion_tokens`. Both are
 emitted on the done event so the orchestration layer can forward them to the client.
 ## Endpoints
@@ -79,7 +177,7 @@ Request body:
 ```json
 {
  "prompt": "What is the capital of France?",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
  "maxTokens": 1024
 }
@@ -93,33 +191,26 @@ Response:
 ```json
 {
  "text": "The capital of France is Paris.",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
 }
 ```
 | Field | Description |
 |---|---|
 | `text` | The model's response |
 | `model` | Model name as reported by the provider |
 | `done` | Whether generation completed normally |
 | `evalCount` | Number of tokens generated |
 | `promptEvalCount` | Number of tokens in the prompt |
 ---
 **POST /complete/stream**
-Same request body as `/complete` (`maxTokens` not applicable for streaming).
+Same request body as `/complete`.
-Response is a stream of Server-Sent Events. Each event contains a partial
+Response is a stream of Server-Sent Events:
-response chunk as JSON. The stream closes with a final `data: [DONE]` event.
+```
-data: {"model":"companion:latest","response":"The","done":false}
+data: {"response":"The"}
-data: {"model":"companion:latest","response":" capital","done":false}
+data: {"response":" capital of France is Paris."}
-data: {"model":"companion:latest","response":" of France is Paris.","done":false}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
 data: [DONE]
 ```
-Clients should read the `response` field from each chunk and accumulate
+Clients should accumulate `response` fields to build the full response string.
-them to build the full response string.
+The `done` event carries `model` and `tokenCount` for display in the UI.
--- a/docs/services/memory-service.md
+++ b/docs/services/memory-service.md
@@ -34,7 +34,7 @@ service to generate and store a vector in Qdrant.
 ```
 src/
 ├── db/
-│   ├── index.js       # SQLite connection + initialization
+│   ├── index.js       # SQLite connection + initialization + migrations
 │   └── schema.js      # Table definitions, indexes, FTS5, triggers
 ├── episodic/
 │   └── index.js       # Session + episode CRUD, FTS search, embedding write path
@@ -49,12 +49,29 @@ src/
 Five core tables:
-- **sessions** — top-level conversation containers, identified by an `external_id`
+- **sessions** — top-level conversation containers, identified by an `external_id` and optional `name`
 - **episodes** — individual exchanges (user message + AI response) tied to a session
 - **entities** — named things the system learns about (people, places, concepts)
 - **relationships** — directional labeled links between entities
 - **summaries** — condensed episode groups for efficient context retrieval
 ### Migrations
 Schema changes that cannot be expressed in `CREATE TABLE IF NOT EXISTS` are applied
 as migrations in `db/index.js` at startup, wrapped in try/catch to safely ignore
 already-applied changes:
 ```js
 try {
    db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`);
 } catch {
    // Column already exists — safe to ignore on subsequent startups
 }
 ```
 Current migrations:
 - `ALTER TABLE sessions ADD COLUMN name TEXT` — adds display name to sessions
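A bare `catch` also hides unrelated failures such as a missing table. One possible refinement, sketched here with a hypothetical `applyMigration` helper, is to ignore only SQLite's "duplicate column name" error and rethrow everything else. The `db` argument is assumed to be any object with an `exec(sql)` method.

```javascript
// Hypothetical wrapper around a single migration statement.
function applyMigration(db, sql) {
  try {
    db.exec(sql);
    return true; // applied on this startup
  } catch (err) {
    // SQLite reports an already-added column as "duplicate column name: <col>".
    if (/duplicate column name/i.test(String(err.message))) return false;
    throw err; // anything else is a real failure and should surface
  }
}

// Fake database standing in for a real connection on a second startup.
const alreadyMigratedDb = {
  exec() {
    throw new Error('duplicate column name: name');
  },
};
console.log(applyMigration(alreadyMigratedDb, 'ALTER TABLE sessions ADD COLUMN name TEXT'));
```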
 ### FTS5 Full-Text Search
 An `episodes_fts` virtual table enables keyword search across all episodes.
@@ -144,9 +161,14 @@ Entities and relationships are stored in SQLite with two key constraints:
 | Method | Path | Description |
 |---|---|---|
 | POST | /sessions | Create a new session |
 | GET | /sessions | Get paginated list of all sessions |
 | GET | /sessions/:id | Get session by internal ID |
 | GET | /sessions/by-external/:externalId | Get session by external ID |
-| DELETE | /sessions/:id | Delete session (cascades to episodes + summaries) |
+| PATCH | /sessions/by-external/:externalId | Update session name |
 | DELETE | /sessions/by-external/:externalId | Delete session (cascades to episodes + summaries) |
 > Route ordering matters in Express: `by-external/:externalId` must be defined before
 > `/:id` to prevent the literal string `by-external` being captured as an ID parameter.
 **POST /sessions body:**
 ```json
@@ -156,6 +178,20 @@ Entities and relationships are stored in SQLite with two key constraints:
 }
 ```
 **PATCH /sessions/by-external/:externalId body:**
 ```json
 {
  "name": "My Renamed Session"
 }
 ```
 Returns the updated session object. `name` is required and must be non-empty.
 **DELETE /sessions/by-external/:externalId**
 Returns `204 No Content` on success. Cascades to delete all associated episodes
 and summaries via SQLite `ON DELETE CASCADE`.
 ### Episodes
 | Method | Path | Description |
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -14,14 +14,10 @@ or inference services — all traffic flows through orchestration.
 ## Dependencies
-- `express` : HTTP API
+- `express` — HTTP API
-- `cors` : cross-origin resource sharing middleware
+- `cors` — cross-origin resource sharing middleware
-- `node-fetch` : inter-service HTTP communication (memory service client only)
+- `dotenv` — environment variable loading
-- `dotenv` : environment variable loading
+- `@nexusai/shared` — shared utilities
-- `@nexusai/shared` : shared utilities
-> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
-> service clients use Node.js built-in `fetch`.
 ## Environment Variables
@@ -33,6 +29,7 @@ or inference services — all traffic flows through orchestration.
 | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
 | MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
 ## Internal Structure
 ```
@@ -46,7 +43,8 @@ src/
 │   └── index.js       # Core pipeline logic — context assembly and coordination
 ├── routes/
 │   ├── chat.js        # POST /chat and POST /chat/stream route handlers
-│   └── sessions.js    # GET /sessions/:sessionId/history route handler
+│   ├── sessions.js    # Session list, history, rename, and delete routes
 │   └── models.js      # GET /models — reads models.json manifest from disk
 └── index.js           # Express app entry point
 ```
@@ -65,7 +63,7 @@ the client.
   UUID for new conversations and pass it directly — no pre-creation step needed.
 2. **Recent episode retrieval** — fetches the most recent episodes for the session
-   (default: 10) from the memory service.
+   (default: 5) from the memory service.
 3. **Semantic search** — embeds the user message via the embedding service, then
   queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
@@ -89,37 +87,68 @@ the client.
   count to the client.
 ## Prompt Structure
 ```
 [System prompt]
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
 ... (up to 5 semantic episodes)
-Here is the recent conversation history:
+---
 Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to 10 recent episodes)
+... (up to 5 recent episodes)
---- End of memories ---
+--- End of recent memories ---
 User: {current message}
 Assistant:
 ```
 Semantic episodes appear before recent episodes so the model encounters
 long-range relevant context before the immediate conversation flow.
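The prompt layout above could be assembled along these lines. This is a sketch, not the pipeline's actual code: `buildPrompt`, `formatEpisode`, and the `userMessage`/`aiResponse` field names are hypothetical, and skipping a section when its list is empty is an assumption.

```javascript
// Hypothetical episode formatter using the User/Assistant layout shown above.
function formatEpisode(episode) {
  return `User: ${episode.userMessage}\nAssistant: ${episode.aiResponse}`;
}

// Hypothetical prompt builder: semantic memories first, then recent episodes.
function buildPrompt({ systemPrompt, semantic = [], recent = [], message }) {
  const parts = [systemPrompt];
  if (semantic.length > 0) {
    parts.push('Here are some relevant memories from earlier conversations:');
    parts.push(...semantic.map(formatEpisode));
    parts.push('---');
  }
  if (recent.length > 0) {
    parts.push('Here are some relevant memories from your past conversations:');
    parts.push(...recent.map(formatEpisode));
    parts.push('--- End of recent memories ---');
  }
  parts.push(`User: ${message}`, 'Assistant:');
  return parts.join('\n');
}

const prompt = buildPrompt({
  systemPrompt: 'You are a helpful assistant.',
  semantic: [{ userMessage: 'My name is Tim.', aiResponse: 'Nice to meet you, Tim.' }],
  recent: [{ userMessage: 'What is 2 + 2?', aiResponse: '4' }],
  message: 'What is my name?',
});
console.log(prompt);
```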
 ## SSE Stream Format
-The inference service emits chunks in this format:
+The inference service emits chunks from the llama.cpp provider in this format:
-data: {"model":"companion:latest","response":"Hello","done":false}
+```
-data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...}
+data: {"response":"Hello","done":false}
 data: {"response":"!","done":false}
 data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
 data: [DONE]
 ```
 The orchestration service re-emits to the client as:
 ```
 data: {"text":"Hello"}
 data: {"text":"!"}
-data: {"done":true}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
 ```
 The `[DONE]` sentinel from the inference service is consumed internally
 and not forwarded. The client stream is terminated by `res.end()` after
-the `{"done":true}` event.
+the done event. Model name and token count are included on the done event
 so the client can display them in the UI.
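The translation between the two formats can be illustrated with a small function that maps one inference `data:` payload to the object sent to the client. `translateInferenceData` is a hypothetical name; the real route handler may interleave this with stream handling.

```javascript
// Hypothetical translator: inference payload string -> client event object,
// or null when nothing should be forwarded.
function translateInferenceData(payload) {
  if (payload === '[DONE]') return null; // sentinel is consumed, not forwarded
  const chunk = JSON.parse(payload);
  if (chunk.done) {
    return { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
  }
  return { text: chunk.response };
}

const inferencePayloads = [
  '{"response":"Hello","done":false}',
  '{"response":"!","done":false}',
  '{"done":true,"model":"model-name.gguf","tokenCount":42}',
  '[DONE]',
];
for (const payload of inferencePayloads) {
  const clientEvent = translateInferenceData(payload);
  if (clientEvent) console.log(`data: ${JSON.stringify(clientEvent)}`);
}
```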
 ## Models Manifest
 The `/models` endpoint reads a `models.json` file from disk at the path
 specified by `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside
 the model files, and is accessible to orchestration via a network share
 mounted at `/mnt/nexus-models`.
 The manifest is read fresh on each request — no restart needed when models
 are added or removed.
 **models.json format:**
 ```json
 [
  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
 ]
 ```
 - `value` — must match the model name as reported by `llama-server` (including `.gguf` extension)
 - `label` — display name shown in the UI
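Reading the manifest on every request might look like the sketch below. `readModelsManifest` is a hypothetical name, and dropping entries without a string `value` and `label` is an extra safeguard assumed here rather than documented behaviour. Any thrown error would correspond to the `500` response described for `GET /models`.

```javascript
const fs = require('node:fs');

// Hypothetical reader: parses the manifest from disk each time it is called,
// so edits to models.json are picked up without a restart.
function readModelsManifest(manifestPath) {
  const raw = fs.readFileSync(manifestPath, 'utf8');
  const models = JSON.parse(raw);
  if (!Array.isArray(models)) {
    throw new Error('models.json must contain a JSON array');
  }
  // Keep only entries that have the two fields the UI relies on.
  return models.filter(
    (m) => m && typeof m.value === 'string' && typeof m.label === 'string'
  );
}
```

A route handler would call this with the path from `MODELS_MANIFEST_PATH`. A synchronous read keeps the sketch short; an asynchronous read may be preferable under load.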
 ## Endpoints
@@ -142,6 +171,14 @@ the `{"done":true}` event.
 |---|---|---|
 | GET | /sessions | Get paginated list of all sessions |
 | GET | /sessions/:sessionId/history | Get paginated episode history for a session |
 | PATCH | /sessions/:sessionId | Rename a session |
 | DELETE | /sessions/:sessionId | Delete a session and all its episodes |
 ### Models
 | Method | Path | Description |
 |---|---|---|
 | GET | /models | Get list of available models from manifest file |
 ---
@@ -152,7 +189,7 @@ Request body:
 {
  "sessionId": "your-session-uuid",
  "message": "Hello, my name is Tim.",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7
 }
 ```
@@ -165,7 +202,7 @@ Response:
 {
  "sessionId": "your-session-uuid",
  "response": "Hello Tim! How can I help you today?",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "tokenCount": 87
 }
 ```
@@ -176,23 +213,34 @@ Response:
 Same request body as `POST /chat`.
-Response is a stream of Server-Sent Events. Each event contains a text
+Response is a stream of Server-Sent Events:
-delta. The stream ends with a `done` event.
+```
 data: {"text":"Hello"}
 data: {"text":" Tim"}
-data: {"text":"!"}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
-data: {"done":true}
+```
-Clients should read the `text` field from each chunk and accumulate them
+---
-to build the full response string. The connection is closed by the server
+
-after the `{"done":true}` event.
+**PATCH /sessions/:sessionId**
 Request body:
 ```json
 { "name": "My Renamed Session" }
 ```
 Returns the updated session object. `name` is required and trimmed of whitespace.
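The "required and trimmed" rule can be sketched as a small validation step. `normaliseSessionName` is a hypothetical helper, and the `400` status is an assumption; this document does not state which status code a missing name produces.

```javascript
// Hypothetical validator for the PATCH request body.
function normaliseSessionName(body) {
  const name = typeof body?.name === 'string' ? body.name.trim() : '';
  if (name === '') {
    return { ok: false, status: 400, error: 'name is required' };
  }
  return { ok: true, name };
}

console.log(normaliseSessionName({ name: '  My Renamed Session  ' }));
console.log(normaliseSessionName({ name: '   ' }));
```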
 ---
 **DELETE /sessions/:sessionId**
 Returns `204 No Content`. Cascades to delete all episodes for the session.
 ---
 **GET /sessions/:sessionId/history**
 Returns paginated episode history for a session identified by its external ID.
 Query parameters:
 | Parameter | Default | Description |
@@ -218,30 +266,17 @@ Response:
 }
 ```
 Episodes are ordered newest first.
 ---
-**GET /sessions**
+**GET /models**
-Returns a paginated list of all sessions, ordered by most recently active.
+Returns the parsed contents of `models.json`:
-Query parameters:
-| Parameter | Default | Description |
-|---|---|---|
-| limit | 20 | Maximum number of sessions to return |
-| offset | 0 | Number of sessions to skip (for pagination) |
 Response:
 ```json
 [
-  {
+  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
-   "id": 1,
-   "external_id": "test-semantic",
-   "metadata": null,
-   "created_at": 1712345678,
-   "updated_at": 1712345999
- }
 ]
 ```
-Episodes are ordered newest first. Returns `404` if the session does not exist.
+Returns `500` if the manifest file cannot be read or parsed.
--- a/docs/services/shared.md
+++ b/docs/services/shared.md
@@ -24,13 +24,40 @@ const DB   = getEnv('SQLITE_PATH');    // required — throws if missing
 ---
 ### `parseRow(row)`
 Parses a SQLite row object, deserialising any JSON-encoded `metadata` fields
 into plain objects. Returns `null` if the row is `null` or `undefined`.
 ```js
 const { parseRow } = require('@nexusai/shared');
 const session = parseRow(db.prepare('SELECT * FROM sessions WHERE id = ?').get(id));
 ```
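For readers without the shared package at hand, the described behaviour can be approximated as follows. `parseRowSketch` is an illustrative stand-in, not the shared implementation, and leaving `metadata` untouched when it is not valid JSON is an assumption.

```javascript
// Illustrative approximation of the behaviour described for parseRow.
function parseRowSketch(row) {
  if (row === null || row === undefined) return null;
  if (typeof row.metadata !== 'string') return row;
  try {
    return { ...row, metadata: JSON.parse(row.metadata) };
  } catch {
    return row; // assumption: malformed metadata is returned as-is
  }
}

console.log(parseRowSketch({ id: 1, external_id: 'test-semantic', metadata: '{"topic":"demo"}' }));
console.log(parseRowSketch(undefined));
```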
 ---
 ### `formatEpisodeText(userMessage, aiResponse)`
 Combines a user message and AI response into the canonical text format used
 for embedding:
 ```
 User: {userMessage}
 Assistant: {aiResponse}
 ```
 Used by the memory service's embedding write path to ensure consistent
 vector representations across all episodes.
 ---
 ### Constants
 Tuneable values and shared identifiers are centralised in `constants.js`
 rather than hardcoded across services. Import the relevant group by name.
 ```js
-const { QDRANT, COLLECTIONS, EPISODIC } = require('@nexusai/shared');
+const { QDRANT, COLLECTIONS, EPISODIC, LLAMACPP } = require('@nexusai/shared');
 ```
 #### `QDRANT`
@@ -40,15 +67,14 @@ embedding model and Qdrant collection setup.
 | Key | Value | Description |
 |---|---|---|
-| `DEFAULT_URL` | `http://localhost:6333` | Fallback Qdrant URL if `QDRANT_URL` env var is not set |
+| `DEFAULT_URL` | `http://localhost:6333` | Fallback Qdrant URL |
 | `VECTOR_SIZE` | `768` | Output dimensions of `nomic-embed-text` |
 | `DISTANCE_METRIC` | `'Cosine'` | Similarity metric used for all collections |
 | `DEFAULT_LIMIT` | `10` | Default top-k for vector searches |
 #### `COLLECTIONS`
-Canonical Qdrant collection names. Used by both the semantic layer and
+Canonical Qdrant collection names.
-any service that constructs Qdrant queries directly.
 | Key | Value |
 |---|---|
@@ -65,6 +91,8 @@ Default pagination and result limits for SQLite episode queries.
 | `DEFAULT_RECENT_LIMIT` | `10` | Default number of recent episodes to retrieve |
 | `DEFAULT_PAGE_SIZE` | `20` | Default episodes per page for paginated queries |
 | `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return |
 | `DEFAULT_OFFSET` | `0` | Default pagination offset |
 | `DEFAULT_SESSIONS_LIMIT` | `20` | Default number of sessions to return |
 #### `SERVICES`
@@ -74,3 +102,75 @@ when the corresponding environment variable is not set.
 | Key | Value | Description |
 |---|---|---|
 | `EMBEDDING_URL` | `http://localhost:3003` | Fallback embedding service URL |
 | `MEMORY_URL` | `http://localhost:3002` | Fallback memory service URL |
 | `INFERENCE_URL` | `http://localhost:3001` | Fallback inference service URL |
 #### `PORTS`
 Default port numbers for each service.
 | Key | Value |
 |---|---|
 | `INFERENCE` | `'3001'` |
 | `MEMORY` | `'3002'` |
 | `EMBEDDING` | `'3003'` |
 | `ORCHESTRATION` | `'4000'` |
 #### `OLLAMA`
 Ollama runtime defaults — used by the Ollama inference provider.
 | Key | Value | Description |
 |---|---|---|
 | `DEFAULT_URL` | `http://localhost:11434` | Fallback Ollama URL |
 | `EMBED_MODEL` | `'nomic-embed-text'` | Default embedding model |
 | `OLLAMA_MODEL` | `'companion:latest'` | Default chat model |
 #### `LLAMACPP`
 llama.cpp runtime defaults — used by the llama.cpp inference provider.
 | Key | Value | Description |
 |---|---|---|
 | `DEFAULT_URL` | `http://localhost:8080` | Fallback llama-server URL |
 | `DEFAULT_MODEL` | `'local-model'` | Fallback model name (override via `DEFAULT_MODEL` env var) |
 > Always set `DEFAULT_MODEL` in the inference service `.env` to the exact model
 > name reported by `llama-server` (including `.gguf` extension). The shared
 > constant is a last-resort fallback only.
 #### `INFERENCE_DEFAULTS`
 Default inference parameters applied when not specified in a request.
 | Key | Value | Description |
 |---|---|---|
 | `TEMPERATURE` | `0.7` | Controls randomness (0 = deterministic, 1 = creative) |
 | `MAX_TOKENS` | `1024` | Maximum tokens to generate |
 | `TOP_P` | `0.9` | Nucleus sampling probability mass |
 | `TOP_K` | `40` | Top-K candidates at each step |
 | `REPEAT_PENALTY` | `1.1` | Penalty for recently used tokens |
 | `SEED` | `null` | null = random; set integer for reproducible outputs |
 #### `ORCHESTRATION`
 Orchestration pipeline defaults.
 | Key | Value | Description |
 |---|---|---|
 | `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
 | `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
 | `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
 | `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
 | `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
 Default system prompt:
 > "You are a helpful, context-aware AI assistant. You have access to memories
 > of past conversations with the user. Use them to provide consistent,
 > personalised responses."
 #### `SQLITE`
 | Key | Value | Description |
 |---|---|---|
 | `DEFAULT_PATH` | `'./data/nexusai.db'` | Fallback SQLite database path |