updated documentation

2026-04-13 03:42:14 -07:00
parent 5f024093d1
commit 045da0d7f4
5 changed files with 464 additions and 112 deletions
--- a/docs/services/chat-client.md
+++ b/docs/services/chat-client.md
@@ -27,33 +27,46 @@ npm run dev         # local dev server on port 5173
 Vite bakes environment variables into the bundle at build time. The `.env`
 file is only needed on the machine running the build, not where files are served.

+After building, copy `dist/` contents to `/srv/nexusai` on Mini PC 2 for Caddy to serve.
+
 ## Environment Variables

 | Variable | Required | Default | Description |
 |---|---|---|---|
-| VITE_ORCHESTRATION_URL | No | `''` (empty) | Orchestration base URL. Empty string uses Vite proxy in dev, Caddy proxy in production. |
+| VITE_ORCHESTRATION_URL | No | `''` (empty) | Orchestration base URL. Must be set to the HTTPS domain in production to avoid mixed content errors. |
+
+Production value:
+```
+VITE_ORCHESTRATION_URL=https://nexus.jellystorm.com
+```

 ## Internal Structure
 ```
 src/
 ├── api/
 │   └── orchestration.js    # All fetch calls to the orchestration service
+├── config/
+│   └── constants.js        # FALLBACK_MODELS, DEFAULT_MODEL, API_DEFAULTS
 ├── hooks/
 │   ├── useSession.js        # Session list, history loading, active session state
-│   └── useChat.js          # Message sending, SSE streaming, message state
+│   ├── useChat.js           # Message sending, SSE streaming, message state
+│   ├── useModels.js         # Dynamic model list fetched from /models endpoint
+│   └── useContextMenu.js   # Right-click context menu position and visibility
 ├── components/
 │   ├── App.jsx              # Root component — layout and shared state
-│   ├── SessionList.jsx     # Left sidebar — session list and new chat button
+│   ├── SessionList.jsx      # Left sidebar — session list, rename, delete
 │   ├── ChatWindow.jsx       # Centre panel — message thread and input bar
 │   ├── MessageBubble.jsx    # Individual message bubble (user or assistant)
-│   └── InfoPanel.jsx       # Right panel — model selector and session metadata
-├── index.css               # Global reset and CSS variables
+│   ├── InfoPanel.jsx        # Right panel — model selector and session metadata
+│   └── SessionModal.jsx     # Modal dialog for session settings (rename)
+├── index.css                # Global reset, CSS variables, utility classes
 └── main.jsx                 # React entry point
 ```

 ## Layout

 Three-panel layout with collapsible sidebars:
+```
 ┌─────────────────┬──────────────────────────┬─────────────┐
 │  Session List   │       Chat Window         │  Info Panel │
 │  (collapsible)  │                           │ (collapsible)│
@@ -64,9 +77,54 @@ Three-panel layout with collapsible sidebars:
 │ Session 2       │                           │             │
 │                 │  [input bar]              │             │
 └─────────────────┴──────────────────────────┴─────────────┘
+```

-On mobile, sidebars collapse to a 56px icon rail. The centre chat window
-always fills the remaining space.
+Sidebars collapse to a 56px icon rail. The centre chat window always
+fills the remaining space.
+
+## CSS Architecture
+
+Styles follow a hybrid approach — CSS utility classes for static reusable
+rules, inline styles for dynamic prop-driven values.
+
+### CSS Variables (`:root`)
+
+| Variable | Value | Description |
+|---|---|---|
+| `--bg-base` | `#0f1117` | Page background |
+| `--bg-surface` | `#1a1d27` | Panel backgrounds |
+| `--bg-elevated` | `#222536` | Elevated elements (inputs, cards) |
+| `--border` | `#2e3150` | Border colour |
+| `--accent` | `#6c63ff` | Primary accent (buttons, highlights) |
+| `--accent-hover` | `#574fd6` | Accent hover state |
+| `--text-primary` | `#e8e8f0` | Primary text |
+| `--text-secondary` | `#8b8fa8` | Secondary text |
+| `--text-muted` | `#555870` | Muted / placeholder text |
+| `--bubble-user` | `#6c63ff` | User message bubble background |
+| `--bubble-ai` | `#222536` | AI message bubble background |
+| `--sidebar-width` | `280px` | Expanded sidebar width |
+| `--panel-width` | `260px` | Expanded info panel width |
+| `--header-height` | `56px` | Shared header height across all panels |
+| `--radius-sm` | `6px` | Small border radius |
+| `--radius-md` | `8px` | Medium border radius |
+| `--radius-lg` | `12px` | Large border radius |
+
+### Utility Classes
+
+| Class | Description |
+|---|---|
+| `.panel-header` | Shared header row — used in all three panels |
+| `.btn-reset` | Resets button styles (no border, bg, cursor pointer) |
+| `.btn-icon` | Icon button with hover state |
+| `.btn-primary` | Accent-coloured action button with `:hover` and `:disabled` states |
+| `.flex` / `.flex-col` | Flex layout helpers |
+| `.flex-1` / `.flex-shrink` | Flex sizing helpers |
+| `.items-center` / `.justify-center` / `.justify-between` | Alignment helpers |
+| `.overflow-hidden` / `.scroll-y` | Overflow helpers |
+| `.text-xs` / `.text-sm` / `.text-base` | Font size helpers |
+| `.text-muted` / `.text-secondary` / `.text-accent` | Colour helpers |
+| `.label-upper` | Uppercase section label style |
+| `.truncate` | Text overflow ellipsis |

 ## API Layer

@@ -78,39 +136,71 @@ All orchestration calls are centralised in `src/api/orchestration.js`:
 | `fetchSessionHistory` | GET | /sessions/:id/history | Load episode history on session select |
 | `sendMessage` | POST | /chat | Send message, await full response |
 | `streamMessage` | POST | /chat/stream | Send message, receive SSE token stream |
+| `fetchModels` | GET | /models | Load available models from manifest |
+| `renameSession` | PATCH | /sessions/:id | Rename a session |
+| `deleteSession` | DELETE | /sessions/:id | Delete a session |

 `streamMessage` returns an abort function — call it to cancel a stream mid-flight.
-It uses a buffer pattern to handle SSE chunks that may span multiple network packets.
+It buffers incoming data so that an SSE event split across multiple network packets is parsed only once it is complete.

 ## Streaming

 The chat input sends messages via `POST /chat/stream`. Tokens arrive as SSE events:
+```
 data: {"text":"Hello"}
 data: {"text":" Tim"}
-data: {"done":true}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
+```

 An empty assistant bubble is appended immediately when the stream opens, then
 updated token by token using `updateLastMessage`. The blinking cursor in
 `MessageBubble` is shown while `message.streaming === true` and disappears
-when `done` is received.
+when the done event is received. Model name and token count from the done
+event are stored in `useChat` state and displayed in the InfoPanel.
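
The buffering that `streamMessage` relies on can be sketched as follows. This is a minimal illustration, not the code in `src/api/orchestration.js`: `createSseParser` and its callback are hypothetical names, and it assumes every event is a single `data:` line carrying JSON, as in the examples above.

```js
// Hypothetical helper, not taken from src/api/orchestration.js.
// Returns a push(chunk) function that calls onEvent once per complete `data:` line.
function createSseParser(onEvent) {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    const lines = buffer.split('\n');
    // The last element is '' or an incomplete line; keep it for the next chunk.
    buffer = lines.pop();
    for (const line of lines) {
      const trimmed = line.trim();
      if (!trimmed.startsWith('data:')) continue; // skip blank separator lines
      onEvent(JSON.parse(trimmed.slice(5).trim()));
    }
  };
}

// Usage: the first event arrives split across two chunks.
const events = [];
const push = createSseParser((event) => events.push(event));
push('data: {"text":"Hel');
push('lo"}\n\ndata: {"done":true,"tokenCount":2}\n\n');
```

Keeping the trailing partial line in the buffer is what lets an event that spans two chunks be parsed exactly once.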

-## Model Selector
+## Dynamic Model Selector

-Available models are defined in `InfoPanel.jsx`:
+Available models are fetched from `GET /models` on mount via the `useModels` hook.
+The hook initialises with `FALLBACK_MODELS` from `constants.js` and replaces them
+with the server response on success. If the fetch fails, the fallback list stays in
+place and a warning is logged to the console; no error is shown in the UI.

-| Label | Value |
-|---|---|
-| Companion | `companion:latest` |
-| Mistral Nemo | `mistral-nemo:latest` |
-| Coder | `coder:latest` |
-| Qwen 2.5 Coder 14B | `qwen2.5-coder:14b` |
+```js
+// constants.js
+export const FALLBACK_MODELS = [
+  { value: 'companion:latest', label: 'Companion' },
+  // ...
+];
+```

-The selected model is passed with every chat request. To add a new model,
-update the `MODELS` array in `InfoPanel.jsx`.
+The selected model is passed with every chat request. To add a model, update
+`models.json` on the main PC — no client rebuild needed.
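
The fallback behaviour of the model list can be sketched without React. The function below is an illustration under stated assumptions rather than the `useModels` source: `loadModels`, the injectable `fetchImpl` parameter, and the rule that a non-array body counts as a failure are all hypothetical.

```js
// Hypothetical stand-in for the list exported from constants.js.
const FALLBACK_MODELS = [{ value: 'companion:latest', label: 'Companion' }];

// Resolves to the server's model list, or to the fallback list on any failure.
// fetchImpl is injectable so the function can be exercised without a network.
async function loadModels(baseUrl, fetchImpl = fetch) {
  try {
    const res = await fetchImpl(`${baseUrl}/models`);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const models = await res.json();
    // Assumption: anything other than an array is treated as a failed fetch.
    if (!Array.isArray(models)) throw new Error('unexpected response shape');
    return models;
  } catch (err) {
    console.warn('Model fetch failed, using fallback list:', err.message);
    return FALLBACK_MODELS;
  }
}
```

A hook built on this would seed its state with `FALLBACK_MODELS` and replace it with the resolved value after mount.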

 ## Session Management

-Sessions are identified by a `external_id` — a human-readable string or UUID
-generated client-side. New sessions are created locally with `uuid` and auto-registered
-in the memory service on the first message. The session list refreshes after each
-completed response to surface newly created sessions.
+Sessions are identified by `external_id` — a UUID generated client-side via the
+`uuid` package. New sessions are created locally and auto-registered in the memory
+service on the first message. The session list refreshes after each completed
+response to surface newly created sessions.
+
+### Session Actions
+
+The session list supports rename and delete:
+
+- **Hover** — reveals ✎ (rename) and ✕ (delete) icon buttons on the session row
+- **Right-click** — opens a context menu with the same actions
+
+Rename opens a `SessionModal` dialog. The modal is designed to expand into a full
+session settings panel in future — the title is already "Session Settings" to
+reflect this intent.
+
+Delete is immediate; a confirmation dialog is planned for a future update.
+
+Actions are disabled on unsaved (new) sessions that haven't had a message sent yet.
+
+### Context Menu
+
+The context menu is implemented by the `useContextMenu` hook, which tracks
+`{ x, y, session }` state and attaches a `window` click listener to dismiss the
+menu on any outside click. The menu is rendered outside the sidebar div (via a
+React fragment) so that it is not clipped by `overflow: hidden`.
--- a/docs/services/inference-service.md
+++ b/docs/services/inference-service.md
@@ -2,7 +2,7 @@

 **Package:** `@nexusai/inference-service`  
 **Location:** `packages/inference-service`  
-**Deployed on:** Main PC  
+**Deployed on:** Main PC (192.168.0.79)  
 **Port:** 3001

 ## Purpose
@@ -15,7 +15,7 @@ to switch inference backends without changes to the rest of the system.
 ## Dependencies

 - `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider)
+- `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
 - `dotenv` — environment variable loading
 - `@nexusai/shared` — shared utilities

@@ -24,9 +24,13 @@ to switch inference backends without changes to the rest of the system.
 | Variable | Required | Default | Description |
 |---|---|---|---|
 | PORT | No | 3001 | Port to listen on |
-| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) |
-| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime |
-| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider |
+| INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
+| INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
+| DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
+
+> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
+> service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
+> reach this service on port 3001.

 ## Provider Architecture

@@ -39,14 +43,87 @@ signatures, so the rest of the service is unaware of which backend is active.

 | Provider | Value | Runtime |
 |---|---|---|
-| Ollama | `ollama` | Ollama via the `ollama` npm package |
-| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) |
+| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
+| Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |

-Switching providers requires only a `.env` change — no code modifications needed.
+Switching providers requires only a `.env` change — no code modifications needed:
+```
 INFERENCE_PROVIDER=llamacpp
 INFERENCE_URL=http://localhost:8080
+```
+
+### Provider Validation
+
+The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately
+if an unknown value is set, so a misconfigured provider cannot fail silently later:
+```
+Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
+```
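
A loader with this behaviour might look like the sketch below. The `PROVIDERS` registry and `loadProvider` are hypothetical names; only the error message format is taken from the example above.

```js
// Hypothetical registry. A real loader would map each name to a provider module.
const PROVIDERS = {
  ollama: () => ({ name: 'ollama' }),
  llamacpp: () => ({ name: 'llamacpp' }),
};

// Returns the selected provider, or throws before the service starts handling requests.
function loadProvider(name = 'llamacpp') {
  if (!Object.prototype.hasOwnProperty.call(PROVIDERS, name)) {
    const valid = Object.keys(PROVIDERS).join(', ');
    throw new Error(`Unknown inference provider: "${name}". Valid options: ${valid}`);
  }
  return PROVIDERS[name]();
}
```

Calling `loadProvider(process.env.INFERENCE_PROVIDER)` once at module load is what would turn a typo in `.env` into a startup failure instead of a runtime one.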
+
+## llama.cpp Provider
+
+The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.
+
+### Starting llama-server
+
+`llama-server` must be started manually on the main PC before the inference service
+can handle requests. It loads a single model at startup:
+
+```powershell
+.\llama-gpu\llama-server.exe `
+  -m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
+  -ngl 99 `
+  --reasoning off `
+  --host 0.0.0.0 `
+  --port 8080 `
+  -c 64000
+```
+
+Key flags:
+
+| Flag | Description |
+|---|---|
+| `-m` | Path to the `.gguf` model file |
+| `-ngl 99` | Offload up to 99 layers to the GPU (the whole model when it has fewer layers than that) |
+| `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
+| `--host 0.0.0.0` | Allows connections from other machines on the LAN |
+| `--port 8080` | Port for the llama-server HTTP API |
+| `-c 64000` | Context window size in tokens |
+
+> `-c 64000` is intentionally large. Monitor VRAM usage — if pressure builds,
+> reduce this value. The NexusAI memory architecture handles context injection
+> so a smaller window (6–8K) is often sufficient.
+
+### Model Naming
+
+The model name sent in API requests must match the name as reported by
+`llama-server` — including the `.gguf` extension. The reported name can be
+verified with:
+
+```powershell
+Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
+```
+
+Set `DEFAULT_MODEL` in `.env` to the exact reported name:
+```
+DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
+```
+
+### Inference Parameters
+
+The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
+
+| NexusAI option | API field | Default |
+|---|---|---|
+| `temperature` | `temperature` | 0.7 |
+| `maxTokens` | `max_tokens` | 1024 |
+| `topP` | `top_p` | 0.9 |
+| `topK` | `top_k` | 40 |
+| `repeatPenalty` | `repeat_penalty` | 1.1 |
+| `seed` | `seed` | null (random) |
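
As a rough sketch of that mapping, assuming the provider builds a plain request object: `toLlamaCppParams` is a hypothetical helper, and omitting a `null` seed from the payload is an assumption rather than documented behaviour.

```js
// Defaults copied from the table above.
const DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

// Translates camelCase NexusAI options into OpenAI-compatible snake_case fields.
function toLlamaCppParams(options = {}) {
  const params = {
    temperature: options.temperature ?? DEFAULTS.temperature,
    max_tokens: options.maxTokens ?? DEFAULTS.maxTokens,
    top_p: options.topP ?? DEFAULTS.topP,
    top_k: options.topK ?? DEFAULTS.topK,
    repeat_penalty: options.repeatPenalty ?? DEFAULTS.repeatPenalty,
  };
  const seed = options.seed ?? DEFAULTS.seed;
  // Assumption: a null seed is left out so the server chooses a random one.
  if (seed !== null) params.seed = seed;
  return params;
}
```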

 ## Internal Structure
+```
 src/
 ├── providers/
 │   ├── ollama.js      # Ollama provider — uses ollama npm package
@@ -55,6 +132,27 @@ src/
 │   └── inference.js   # /complete and /complete/stream route handlers
 ├── infer.js           # Provider loader — selects and re-exports active provider
 └── index.js           # Express app + route definitions
+```
+
+## Streaming Response Format
+
+The llama.cpp provider yields chunks in this shape:
+```js
+{ response: "token text", done: false }
+// final chunk:
+{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
+```
+
+The inference route re-emits these as SSE events:
+```
+data: {"response":"token text"}
+data: {"done":true,"model":"model-name.gguf","tokenCount":42}
+data: [DONE]
+```
+
+`model` and `tokenCount` are captured from the final llama.cpp chunk (the one with
+`finish_reason: stop`), with `tokenCount` taken from `usage.completion_tokens`. Both
+are emitted on the done event so the orchestration layer can forward them to the client.

 ## Endpoints

@@ -79,7 +177,7 @@ Request body:
 ```json
 {
  "prompt": "What is the capital of France?",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
  "maxTokens": 1024
 }
@@ -93,33 +191,26 @@ Response:
 ```json
 {
  "text": "The capital of France is Paris.",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "done": true,
  "evalCount": 8,
  "promptEvalCount": 41
 }
 ```

-| Field | Description |
-|---|---|
-| `text` | The model's response |
-| `model` | Model name as reported by the provider |
-| `done` | Whether generation completed normally |
-| `evalCount` | Number of tokens generated |
-| `promptEvalCount` | Number of tokens in the prompt |
-
 ---

 **POST /complete/stream**

-Same request body as `/complete` (`maxTokens` not applicable for streaming).
+Same request body as `/complete`.

-Response is a stream of Server-Sent Events. Each event contains a partial
-response chunk as JSON. The stream closes with a final `data: [DONE]` event.
-data: {"model":"companion:latest","response":"The","done":false}
-data: {"model":"companion:latest","response":" capital","done":false}
-data: {"model":"companion:latest","response":" of France is Paris.","done":false}
+Response is a stream of Server-Sent Events:
+```
+data: {"response":"The"}
+data: {"response":" capital of France is Paris."}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
 data: [DONE]
+```

-Clients should read the `response` field from each chunk and accumulate
-them to build the full response string.
+Clients should accumulate `response` fields to build the full response string.
+The `done` event carries `model` and `tokenCount` for display in the UI.
--- a/docs/services/memory-service.md
+++ b/docs/services/memory-service.md
@@ -34,7 +34,7 @@ service to generate and store a vector in Qdrant.
 ```
 src/
 ├── db/
-│   ├── index.js       # SQLite connection + initialization
+│   ├── index.js       # SQLite connection + initialization + migrations
 │   └── schema.js      # Table definitions, indexes, FTS5, triggers
 ├── episodic/
 │   └── index.js       # Session + episode CRUD, FTS search, embedding write path
@@ -49,12 +49,29 @@ src/

 Five core tables:

- **sessions** — top-level conversation containers, identified by an `external_id`
+- **sessions** — top-level conversation containers, identified by an `external_id` and optional `name`
 - **episodes** — individual exchanges (user message + AI response) tied to a session
 - **entities** — named things the system learns about (people, places, concepts)
 - **relationships** — directional labeled links between entities
 - **summaries** — condensed episode groups for efficient context retrieval

+### Migrations
+
+Schema changes that cannot be expressed in `CREATE TABLE IF NOT EXISTS` are applied
+as migrations in `db/index.js` at startup, wrapped in try/catch to safely ignore
+already-applied changes:
+
+```js
+try {
+    db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`);
+} catch {
+    // Column already exists — safe to ignore on subsequent startups
+}
+```
+
+Current migrations:
+- `ALTER TABLE sessions ADD COLUMN name TEXT` — adds display name to sessions
+
 ### FTS5 Full-Text Search

 An `episodes_fts` virtual table enables keyword search across all episodes.
@@ -144,9 +161,14 @@ Entities and relationships are stored in SQLite with two key constraints:
 | Method | Path | Description |
 |---|---|---|
 | POST | /sessions | Create a new session |
+| GET | /sessions | Get paginated list of all sessions |
 | GET | /sessions/:id | Get session by internal ID |
 | GET | /sessions/by-external/:externalId | Get session by external ID |
-| DELETE | /sessions/:id | Delete session (cascades to episodes + summaries) |
+| PATCH | /sessions/by-external/:externalId | Update session name |
+| DELETE | /sessions/by-external/:externalId | Delete session (cascades to episodes + summaries) |
+
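+> Define the `by-external/:externalId` routes before `/:id` so the more specific paths
+> are always tried first. A lone `:id` parameter matches a single path segment, so it
+> would capture the literal string `by-external` only for a request to
+> `/sessions/by-external` with no external ID.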

 **POST /sessions body:**
 ```json
@@ -156,6 +178,20 @@ Entities and relationships are stored in SQLite with two key constraints:
 }
 ```

+**PATCH /sessions/by-external/:externalId body:**
+```json
+{
+  "name": "My Renamed Session"
+}
+```
+
+Returns the updated session object. `name` is required and must be non-empty.
+
+**DELETE /sessions/by-external/:externalId**
+
+Returns `204 No Content` on success. Cascades to delete all associated episodes
+and summaries via SQLite `ON DELETE CASCADE`.
+
 ### Episodes

 | Method | Path | Description |
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -14,14 +14,10 @@ or inference services — all traffic flows through orchestration.

 ## Dependencies

- `express` : HTTP API
- `cors` : cross-origin resource sharing middleware
- `node-fetch` : inter-service HTTP communication (memory service client only)
- `dotenv` : environment variable loading
- `@nexusai/shared` : shared utilities
-
-> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
-> service clients use Node.js built-in `fetch`.
+- `express` — HTTP API
+- `cors` — cross-origin resource sharing middleware
+- `dotenv` — environment variable loading
+- `@nexusai/shared` — shared utilities

 ## Environment Variables

@@ -33,6 +29,7 @@ or inference services — all traffic flows through orchestration.
 | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
+| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |

 ## Internal Structure
 ```
@@ -46,7 +43,8 @@ src/
 │   └── index.js       # Core pipeline logic — context assembly and coordination
 ├── routes/
 │   ├── chat.js        # POST /chat and POST /chat/stream route handlers
-│   └── sessions.js    # GET /sessions/:sessionId/history route handler
+│   ├── sessions.js    # Session list, history, rename, and delete routes
+│   └── models.js      # GET /models — reads models.json manifest from disk
 └── index.js           # Express app entry point
 ```

@@ -65,7 +63,7 @@ the client.
   UUID for new conversations and pass it directly — no pre-creation step needed.

 2. **Recent episode retrieval** — fetches the most recent episodes for the session
-   (default: 10) from the memory service.
+   (default: 5) from the memory service.

 3. **Semantic search** — embeds the user message via the embedding service, then
   queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
@@ -89,37 +87,68 @@ the client.
   count to the client.

 ## Prompt Structure
+```
 [System prompt]
+
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
 ... (up to 5 semantic episodes)
-Here is the recent conversation history:
+---
+Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to 10 recent episodes)
--- End of memories ---
+... (up to 5 recent episodes)
+--- End of recent memories ---
+
 User: {current message}
 Assistant:
+```

 Semantic episodes appear before recent episodes so the model encounters
 long-range relevant context before the immediate conversation flow.
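
The ordering can be illustrated with a small assembly function. This is a sketch only: `buildPrompt`, the `user`/`assistant` episode fields, and skipping a section when it has no episodes are assumptions, while the header and separator strings are copied from the structure above.

```js
// Episode objects use hypothetical field names `user` and `assistant`.
const formatEpisode = (ep) => `User: ${ep.user}\nAssistant: ${ep.assistant}`;

// Builds the prompt with semantic memories first, recent history second,
// and the current message last.
function buildPrompt({ systemPrompt, semantic = [], recent = [], message }) {
  const parts = [systemPrompt, ''];
  if (semantic.length > 0) {
    parts.push('Here are some relevant memories from earlier conversations:');
    parts.push(...semantic.map(formatEpisode));
    parts.push('---');
  }
  if (recent.length > 0) {
    parts.push('Here are some relevant memories from your past conversations:');
    parts.push(...recent.map(formatEpisode));
    parts.push('--- End of recent memories ---');
  }
  parts.push('', `User: ${message}`, 'Assistant:');
  return parts.join('\n');
}
```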

 ## SSE Stream Format

-The inference service emits chunks in this format:
-data: {"model":"companion:latest","response":"Hello","done":false}
-data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...}
+The inference service emits chunks from the llama.cpp provider in this format:
+```
+data: {"response":"Hello","done":false}
+data: {"response":"!","done":false}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
 data: [DONE]
+```

 The orchestration service re-emits to the client as:
+```
 data: {"text":"Hello"}
 data: {"text":"!"}
-data: {"done":true}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
+```

 The `[DONE]` sentinel from the inference service is consumed internally
 and not forwarded. The client stream is terminated by `res.end()` after
-the `{"done":true}` event.
+the done event. Model name and token count are included on the done event
+so the client can display them in the UI.
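
The translation between the two formats could be expressed as a pure function over each `data:` payload. `toClientEvent` is a hypothetical name, and the sketch assumes payloads have already been split out of the SSE stream.

```js
// Converts one inference-service payload into the payload sent to the client,
// or returns null when nothing should be forwarded.
function toClientEvent(payload) {
  if (payload === '[DONE]') return null; // sentinel is consumed, not forwarded
  const chunk = JSON.parse(payload);
  if (chunk.done) {
    return JSON.stringify({ done: true, model: chunk.model, tokenCount: chunk.tokenCount });
  }
  return JSON.stringify({ text: chunk.response });
}
```

A route handler would write each non-null result as `data: <payload>` followed by a blank line, then call `res.end()` after the done event.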
+
+## Models Manifest
+
+The `/models` endpoint reads a `models.json` file from disk at the path
+specified by `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside
+the model files, and is accessible to orchestration via a network share
+mounted at `/mnt/nexus-models`.
+
+The manifest is read fresh on each request — no restart needed when models
+are added or removed.
+
+**models.json format:**
+```json
+[
+  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
+]
+```
+
+- `value` — must match the model name as reported by `llama-server` (including `.gguf` extension)
+- `label` — display name shown in the UI
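
A minimal sketch of a per-request read, assuming Node's built-in `fs/promises` module: `readModelsManifest` is a hypothetical name, and rejecting a non-array manifest is an assumption.

```js
const { readFile } = require('node:fs/promises');

// Reads and parses the manifest on every call, so edits to the file are
// picked up without a restart. Rejects if the file is missing or malformed.
async function readModelsManifest(manifestPath) {
  const raw = await readFile(manifestPath, 'utf8');
  const models = JSON.parse(raw);
  // Assumption: the manifest must be a JSON array of { value, label } objects.
  if (!Array.isArray(models)) {
    throw new Error('models.json must contain an array');
  }
  return models;
}
```

A route handler would call this with `process.env.MODELS_MANIFEST_PATH`, send the result as JSON, and respond with `500` if the promise rejects.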

 ## Endpoints

@@ -142,6 +171,14 @@ the `{"done":true}` event.
 |---|---|---|
 | GET | /sessions | Get paginated list of all sessions |
 | GET | /sessions/:sessionId/history | Get paginated episode history for a session |
+| PATCH | /sessions/:sessionId | Rename a session |
+| DELETE | /sessions/:sessionId | Delete a session and all its episodes |
+
+### Models
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /models | Get list of available models from manifest file |

 ---

@@ -152,7 +189,7 @@ Request body:
 {
  "sessionId": "your-session-uuid",
  "message": "Hello, my name is Tim.",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7
 }
 ```
@@ -165,7 +202,7 @@ Response:
 {
  "sessionId": "your-session-uuid",
  "response": "Hello Tim! How can I help you today?",
-  "model": "companion:latest",
+  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "tokenCount": 87
 }
 ```
@@ -176,23 +213,34 @@ Response:

 Same request body as `POST /chat`.

-Response is a stream of Server-Sent Events. Each event contains a text
-delta. The stream ends with a `done` event.
+Response is a stream of Server-Sent Events:
+```
 data: {"text":"Hello"}
 data: {"text":" Tim"}
-data: {"text":"!"}
-data: {"done":true}
+data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
+```

-Clients should read the `text` field from each chunk and accumulate them
-to build the full response string. The connection is closed by the server
-after the `{"done":true}` event.
+---
+
+**PATCH /sessions/:sessionId**
+
+Request body:
+```json
+{ "name": "My Renamed Session" }
+```
+
+Returns the updated session object. `name` is required; surrounding whitespace is trimmed.
+
+---
+
+**DELETE /sessions/:sessionId**
+
+Returns `204 No Content`. Cascades to delete all episodes for the session.

 ---

 **GET /sessions/:sessionId/history**

-Returns paginated episode history for a session identified by its external ID.
-
 Query parameters:

 | Parameter | Default | Description |
@@ -218,30 +266,17 @@ Response:
 }
 ```

+Episodes are ordered newest first.
+
 ---

-**GET /sessions**
+**GET /models**

-Returns a paginated list of all sessions, ordered by most recently active.
-
-Query parameters:
-
-| Parameter | Default | Description |
-|---|---|---|
-| limit | 20 | Maximum number of sessions to return |
-| offset | 0 | Number of sessions to skip (for pagination) |
-
-Response:
+Returns the parsed contents of `models.json`:
 ```json
 [
-  {
-    "id": 1,
-    "external_id": "test-semantic",
-    "metadata": null,
-    "created_at": 1712345678,
-    "updated_at": 1712345999
-  }
+  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
 ]
 ```

-Episodes are ordered newest first. Returns `404` if the session does not exist.
+Returns `500` if the manifest file cannot be read or parsed.
--- a/docs/services/shared.md
+++ b/docs/services/shared.md
@@ -24,13 +24,40 @@ const DB   = getEnv('SQLITE_PATH');    // required — throws if missing

 ---

+### `parseRow(row)`
+
+Parses a SQLite row object, deserialising any JSON-encoded `metadata` fields
+into plain objects. Returns `null` if the row is `null` or `undefined`.
+
+```js
+const { parseRow } = require('@nexusai/shared');
+const session = parseRow(db.prepare('SELECT * FROM sessions WHERE id = ?').get(id));
+```
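
One way such a helper could work is sketched below. `parseRowSketch` is a hypothetical stand-in, not the `@nexusai/shared` implementation, and leaving malformed JSON untouched is an assumption.

```js
// Returns a copy of the row with a JSON-encoded `metadata` string parsed
// into an object, or null when no row was found.
function parseRowSketch(row) {
  if (row === null || row === undefined) return null;
  const parsed = { ...row };
  if (typeof parsed.metadata === 'string') {
    try {
      parsed.metadata = JSON.parse(parsed.metadata);
    } catch {
      // Assumption: malformed JSON is left as the original string.
    }
  }
  return parsed;
}
```

Returning `null` for a missing row lets callers pass the result of a `.get()` query straight through without checking it first.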
+
+---
+
+### `formatEpisodeText(userMessage, aiResponse)`
+
+Combines a user message and AI response into the canonical text format used
+for embedding:
+
+```
+User: {userMessage}
+Assistant: {aiResponse}
+```
+
+Used by the memory service's embedding write path to ensure consistent
+vector representations across all episodes.
+
+---
+
 ### Constants

 Tuneable values and shared identifiers are centralised in `constants.js`
 rather than hardcoded across services. Import the relevant group by name.

 ```js
-const { QDRANT, COLLECTIONS, EPISODIC } = require('@nexusai/shared');
+const { QDRANT, COLLECTIONS, EPISODIC, LLAMACPP } = require('@nexusai/shared');
 ```

 #### `QDRANT`
@@ -40,15 +67,14 @@ embedding model and Qdrant collection setup.

 | Key | Value | Description |
 |---|---|---|
-| `DEFAULT_URL` | `http://localhost:6333` | Fallback Qdrant URL if `QDRANT_URL` env var is not set |
+| `DEFAULT_URL` | `http://localhost:6333` | Fallback Qdrant URL |
 | `VECTOR_SIZE` | `768` | Output dimensions of `nomic-embed-text` |
 | `DISTANCE_METRIC` | `'Cosine'` | Similarity metric used for all collections |
 | `DEFAULT_LIMIT` | `10` | Default top-k for vector searches |

 #### `COLLECTIONS`

-Canonical Qdrant collection names. Used by both the semantic layer and
-any service that constructs Qdrant queries directly.
+Canonical Qdrant collection names.

 | Key | Value |
 |---|---|
@@ -65,6 +91,8 @@ Default pagination and result limits for SQLite episode queries.
 | `DEFAULT_RECENT_LIMIT` | `10` | Default number of recent episodes to retrieve |
 | `DEFAULT_PAGE_SIZE` | `20` | Default episodes per page for paginated queries |
 | `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return |
+| `DEFAULT_OFFSET` | `0` | Default pagination offset |
+| `DEFAULT_SESSIONS_LIMIT` | `20` | Default number of sessions to return |

 #### `SERVICES`

@@ -74,3 +102,75 @@ when the corresponding environment variable is not set.
 | Key | Value | Description |
 |---|---|---|
 | `EMBEDDING_URL` | `http://localhost:3003` | Fallback embedding service URL |
+| `MEMORY_URL` | `http://localhost:3002` | Fallback memory service URL |
+| `INFERENCE_URL` | `http://localhost:3001` | Fallback inference service URL |
+
+#### `PORTS`
+
+Default port numbers for each service.
+
+| Key | Value |
+|---|---|
+| `INFERENCE` | `'3001'` |
+| `MEMORY` | `'3002'` |
+| `EMBEDDING` | `'3003'` |
+| `ORCHESTRATION` | `'4000'` |
+
+#### `OLLAMA`
+
+Ollama runtime defaults — used by the Ollama inference provider.
+
+| Key | Value | Description |
+|---|---|---|
+| `DEFAULT_URL` | `http://localhost:11434` | Fallback Ollama URL |
+| `EMBED_MODEL` | `'nomic-embed-text'` | Default embedding model |
+| `OLLAMA_MODEL` | `'companion:latest'` | Default chat model |
+
+#### `LLAMACPP`
+
+llama.cpp runtime defaults — used by the llama.cpp inference provider.
+
+| Key | Value | Description |
+|---|---|---|
+| `DEFAULT_URL` | `http://localhost:8080` | Fallback llama-server URL |
+| `DEFAULT_MODEL` | `'local-model'` | Fallback model name (override via `DEFAULT_MODEL` env var) |
+
+> Always set `DEFAULT_MODEL` in the inference service `.env` to the exact model
+> name reported by `llama-server` (including `.gguf` extension). The shared
+> constant is a last-resort fallback only.
+
+#### `INFERENCE_DEFAULTS`
+
+Default inference parameters applied when not specified in a request.
+
+| Key | Value | Description |
+|---|---|---|
+| `TEMPERATURE` | `0.7` | Controls randomness (0 = deterministic, 1 = creative) |
+| `MAX_TOKENS` | `1024` | Maximum tokens to generate |
+| `TOP_P` | `0.9` | Nucleus sampling probability mass |
+| `TOP_K` | `40` | Top-K candidates at each step |
+| `REPEAT_PENALTY` | `1.1` | Penalty for recently used tokens |
+| `SEED` | `null` | null = random; set integer for reproducible outputs |
+
+#### `ORCHESTRATION`
+
+Orchestration pipeline defaults.
+
+| Key | Value | Description |
+|---|---|---|
+| `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
+| `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
+| `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
+| `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
+| `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
+
+Default system prompt:
+> "You are a helpful, context-aware AI assistant. You have access to memories
+> of past conversations with the user. Use them to provide consistent,
+> personalised responses."
+
+#### `SQLITE`
+
+| Key | Value | Description |
+|---|---|---|
+| `DEFAULT_PATH` | `'./data/nexusai.db'` | Fallback SQLite database path |