documentation updated for model inference settings

2026-04-18 06:41:50 -07:00
parent c198a00dde
commit 44989a2b8b
5 changed files with 182 additions and 41 deletions
--- a/docs/reference/API-routes.md
+++ b/docs/reference/API-routes.md
@@ -30,7 +30,10 @@ here for reference and direct debugging use.
  "temperature": 0.7
 }
 ```
-`model` and `temperature` are optional.
+`model` and `temperature` are optional. Inference parameters (`temperature`,
+`topP`, `topK`, `repeatPenalty`) are read from `settings.json` on every
+request; values for them in the request body, including `temperature`, are
+ignored. Change them via `PATCH /settings`.

 **POST /chat — response:**
 ```json
@@ -110,9 +113,74 @@ Returns `201` with the created project object.

 | Method | Path | Description |
 |---|---|---|
-| GET | /models | Available models from `models.json` manifest |
+| GET | /models | Available models scanned live from models folder |
+| GET | /models/props | Live model props from llama-server (context window, loaded model) |

-Returns array: `[{ "value": "model-name.gguf", "label": "Display Name" }]`
+**GET /models** — returns array:
+```json
+[{ "value": "model-name.gguf", "label": "Display Name", "description": null, "size": "19.7 GB" }]
+```
+Scans `.gguf` files live from `modelsFolderPath` (set in settings). Merges
+with `models.json` in the same folder for label and description metadata.
+
+**GET /models/props** — returns:
+```json
+{ "contextWindow": 64000, "modelAlias": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf" }
+```
+Fetches directly from llama-server `/props`. Returns `503` if llama-server
+is unreachable.
+
+### Settings
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /settings | Get all current settings |
+| PATCH | /settings | Update one or more settings |
+
+**GET /settings — response:**
+```json
+{
+  "recentEpisodeLimit": 9,
+  "semanticLimit": 5,
+  "scoreThreshold": 0.6,
+  "modelsFolderPath": "/mnt/nexus-models",
+  "temperature": 0.65,
+  "repeatPenalty": 1.3,
+  "topP": 0.9,
+  "topK": 41
+}
+```
+
+**PATCH /settings — body:** any subset of the above fields.
+
+| Field | Type | Range | Description |
+|---|---|---|---|
+| `recentEpisodeLimit` | integer | 1–20 | Recent episodes injected into prompt |
+| `semanticLimit` | integer | 1–20 | Max semantic search results |
+| `scoreThreshold` | float | 0–1 | Minimum similarity score |
+| `modelsFolderPath` | string | — | Path to folder containing .gguf files |
+| `temperature` | float | 0–2 | Inference randomness |
+| `repeatPenalty` | float | 1–2 | Repeat token penalty |
+| `topP` | float | 0–1 | Nucleus sampling probability mass |
+| `topK` | integer | 1–100 | Top-K token candidates per step |
+
+Settings are persisted to `data/settings.json` and read on every request —
+changes take effect immediately without a service restart.
+
+### Episodes
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /episodes | Paginated episode list across all sessions |
+| DELETE | /episodes/:id | Delete an episode (SQLite + Qdrant) |
+
+**GET /episodes — query params:**
+
+| Param | Default | Description |
+|---|---|---|
+| limit | 20 | Episodes per page |
+| offset | 0 | Pagination offset |
+| q | — | Keyword search (FTS) |

 ---

@@ -158,10 +226,11 @@ are not touched.
 | Method | Path | Description |
 |---|---|---|
 | POST | /episodes | Create episode + auto-embed into Qdrant |
+| GET | /episodes | Paginated episode list across all sessions |
 | GET | /episodes/search?q=&limit= | FTS keyword search across all episodes |
 | GET | /episodes/:id | Get episode by ID |
 | GET | /sessions/:id/episodes?limit=&offset= | Paginated episodes for a session |
-| DELETE | /episodes/:id | Delete an episode |
+| DELETE | /episodes/:id | Delete episode (SQLite + Qdrant cleanup) |

 > Route ordering: `/episodes/search` must be defined before `/episodes/:id`.

@@ -266,10 +335,14 @@ is awkward to encode in a path.
  "prompt": "What is the capital of France?",
  "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
  "temperature": 0.7,
-  "maxTokens": 1024
+  "maxTokens": 1024,
+  "topP": 0.9,
+  "topK": 40,
+  "repeatPenalty": 1.1
 }
 ```
-All fields except `prompt` are optional.
+All fields except `prompt` are optional. In normal usage these are forwarded
+from orchestration, which reads them from `settings.json`.

 **POST /complete — response:**
 ```json
--- a/docs/services/chat-client.md
+++ b/docs/services/chat-client.md
@@ -14,6 +14,7 @@ inference services. Served as static files by Caddy on Mini PC 2.
 ## Dependencies

 - `react` + `react-dom` — UI framework
+- `react-markdown` — Markdown rendering in message bubbles and memory viewer
 - `uuid` — session ID generation
 - `vite` + `@vitejs/plugin-react` — build tooling

@@ -63,13 +64,16 @@ export default defineConfig({
      '/sessions': 'http://192.168.0.205:4000',
      '/chat':     'http://192.168.0.205:4000',
      '/projects': 'http://192.168.0.205:4000',
+      '/episodes': 'http://192.168.0.205:4000',
+      '/settings': 'http://192.168.0.205:4000',
+      '/health':   'http://192.168.0.205:4000',
    }
  }
 });
 ```

 When adding new top-level routes to the orchestration service, add a matching
-entry here too.
+entry here and in the Caddy config.

 ## Internal Structure

@@ -84,19 +88,22 @@ src/
 │   ├── useChat.js           # Message sending, SSE streaming, message state
 │   ├── useModels.js         # Dynamic model list fetched from /models endpoint
 │   ├── useProjects.js       # Project list fetched from /projects endpoint
+│   ├── useSettings.js       # Settings fetch + saveSetting helper
 │   └── useContextMenu.js    # Right-click context menu position and visibility
 ├── components/
 │   ├── App.jsx              # Root component — layout, shared state, view routing
 │   ├── Sidebar.jsx          # Left sidebar — projects, recent chats, navigation
 │   ├── ChatWindow.jsx       # Centre panel — message thread and input bar
-│   ├── MessageBubble.jsx    # Individual message bubble (user or assistant)
+│   ├── MessageBubble.jsx    # Individual message bubble — renders markdown via react-markdown
 │   ├── InfoPanel.jsx        # Right panel — model selector and session metadata (slide-in)
 │   ├── SessionModal.jsx     # Modal for session rename, project assignment, delete
 │   ├── ProjectModal.jsx     # Modal for project create, edit, delete
 │   ├── AllChatsView.jsx     # Full paginated session list with multi-select bulk delete
 │   ├── AllProjectsView.jsx  # Project tile grid with create/edit/delete
 │   ├── ProjectView.jsx      # Individual project — session list, new chat button
-│   └── SettingsView.jsx     # Settings placeholder (Appearance, Memory, Models, About)
+│   ├── MemoryView.jsx       # Paginated, searchable, expandable, deletable episode viewer
+│   └── SettingsView.jsx     # Settings — Memory limits, Models (inference params, active
+│                            #   model, context window), Service Health, Appearance placeholder
 ├── index.css                # Global reset, CSS variables, utility classes
 └── main.jsx                 # React entry point
 ```
@@ -118,7 +125,7 @@ panel are persistent across all views.
 │ ⊞ View Projects  │  all-projects → AllProjectsView│
 │                  │  project      → ProjectView   │
 │ PROJECTS ▾       │  settings     → SettingsView  │
-│  [tile] [tile]   │                               │
+│  [tile] [tile]   │  memory       → MemoryView    │
 │  All Projects →  │                               │
 │                  │                               │
 │ RECENT CHATS ▾   │                               │
@@ -143,6 +150,7 @@ via the `⊹` button in the `ChatWindow` header.
 | `'all-projects'` | `AllProjectsView` | "View Projects" button or ⊞ icon |
 | `'project'` | `ProjectView` | Clicking a project tile in the sidebar |
 | `'settings'` | `SettingsView` | Settings button or ⚙ icon |
+| `'memory'` | `MemoryView` | "Open →" button in Settings → Memory section |

 `activeProject` state in `App.jsx` tracks which project `ProjectView` is
 displaying. Set via `onSelectProject` before navigating to `'project'`.
@@ -261,4 +269,19 @@ exposes `refreshProjects` for keeping the sidebar in sync after mutations.
 and a filtered session list. The "+ New Chat" button creates a new session,
 navigates to `'chat'`, and writes the project assignment after the first message.

-For memory isolation behaviour, see `memory-isolation.md`.
+For memory isolation behaviour, see `memory-isolation.md`.
+
+## Settings
+
+`useSettings` fetches from `GET /settings` on mount and exposes a `saveSetting(key, value)`
+helper that issues a `PATCH /settings` with a single key-value pair. The `saving`
+boolean is exposed for disabling save buttons during in-flight requests.
+
+`SettingsView` is organised into sections:
+
+- **Memory** — recent episode limit, semantic limit, score threshold, link to MemoryView
+- **Models** — models folder path, temperature, repeat penalty, Top-P, Top-K,
+  active model dropdown, read-only model info panel (file, size, context window,
+  loaded model from llama-server)
+- **About** — service health check panel, version
+- **Appearance** — theme (coming soon)
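
A minimal sketch of the single-key `PATCH /settings` call that `saveSetting(key, value)` is described as issuing, with a client-side range check taken from the field table in `API-routes.md`. The names `SETTING_RANGES` and `buildSettingsPatch` are hypothetical and do not come from the commit; the real hook may validate differently or not at all.

```javascript
// Hypothetical range table mirroring the documented PATCH /settings ranges.
// Fields without a documented range (e.g. modelsFolderPath) are not listed.
const SETTING_RANGES = {
  recentEpisodeLimit: { min: 1, max: 20, integer: true },
  semanticLimit:      { min: 1, max: 20, integer: true },
  scoreThreshold:     { min: 0, max: 1 },
  temperature:        { min: 0, max: 2 },
  repeatPenalty:      { min: 1, max: 2 },
  topP:               { min: 0, max: 1 },
  topK:               { min: 1, max: 100, integer: true },
};

// Build the fetch() init object for a single-key PATCH /settings request.
// Throws a RangeError before any network traffic if the value is out of range.
function buildSettingsPatch(key, value) {
  const range = SETTING_RANGES[key];
  if (range) {
    const valid =
      typeof value === 'number' &&
      Number.isFinite(value) &&
      value >= range.min &&
      value <= range.max &&
      (!range.integer || Number.isInteger(value));
    if (!valid) {
      throw new RangeError(`${key} must be within ${range.min}-${range.max}`);
    }
  }
  return {
    method: 'PATCH',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ [key]: value }),
  };
}
```

Under these assumptions, a hook could call `fetch('/settings', buildSettingsPatch('topK', 40))` and set its `saving` flag around the awaited request.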
--- a/docs/services/inference-service.md
+++ b/docs/services/inference-service.md
@@ -54,6 +54,11 @@ INFERENCE_URL=http://localhost:8080
 The provider loader throws immediately on an unknown value, preventing silent
 misconfiguration.

+> **LM Studio compatibility note:** LM Studio exposes an OpenAI-compatible
+> `/v1/chat/completions` endpoint with the same request shape as llama.cpp.
+> A future `lmstudio.js` provider would be nearly identical to `llamacpp.js` —
+> only the `BASE_URL` would differ. No architectural changes required.
+
 ## Internal Structure

 ```
@@ -109,14 +114,19 @@ Set `DEFAULT_MODEL` in `.env` to the exact reported name.

 ### Inference Parameters

-| NexusAI option | API field | Default |
-|---|---|---|
-| `temperature` | `temperature` | 0.7 |
-| `maxTokens` | `max_tokens` | 1024 |
-| `topP` | `top_p` | 0.9 |
-| `topK` | `top_k` | 40 |
-| `repeatPenalty` | `repeat_penalty` | 1.1 |
-| `seed` | `seed` | null (random) |
+All parameters are resolved in `resolveOptions()` — falling back to
+`INFERENCE_DEFAULTS` from `@nexusai/shared` if not provided in the request.
+In normal usage, orchestration reads these from `settings.json` and forwards
+them on every request.
+
+| NexusAI option | API field | Default | Description |
+|---|---|---|---|
+| `temperature` | `temperature` | 0.7 | Response randomness (0 = deterministic) |
+| `maxTokens` | `max_tokens` | 1024 | Max tokens to generate |
+| `topP` | `top_p` | 0.9 | Nucleus sampling probability mass |
+| `topK` | `top_k` | 40 | Top-K token candidates per step |
+| `repeatPenalty` | `repeat_penalty` | 1.1 | Penalty for recently used tokens |
+| `seed` | `seed` | null | null = random; integer for reproducible output |

 ## Streaming Response Format

--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -27,9 +27,10 @@ or inference services — all traffic flows through orchestration.
 | MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
 | EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
 | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
+| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
-| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
+| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |

 ## Internal Structure

@@ -42,17 +43,42 @@ src/
 │   └── qdrant.js      # HTTP client for Qdrant (direct vector search)
 ├── chat/
 │   └── index.js       # Core pipeline — context assembly, isolation, auto-naming
+├── config/
+│   └── settings.js    # Settings load/save — reads/writes data/settings.json
 ├── routes/
 │   ├── chat.js        # POST /chat and POST /chat/stream
 │   ├── sessions.js    # Session CRUD proxy
 │   ├── projects.js    # Project CRUD proxy
-│   └── models.js      # GET /models — reads models.json from disk
+│   ├── episodes.js    # Episode list and delete proxy
+│   ├── settings.js    # GET /settings and PATCH /settings
+│   ├── health.js      # GET /health — pings all four services
+│   └── models.js      # GET /models — scans .gguf files live, merges with models.json
+│                      # GET /models/props: context window + loaded model from llama-server
 └── index.js           # Express app entry point
 ```

 The `services/` layer wraps all downstream HTTP calls in named functions.
 URL or endpoint changes have a single place to be updated.

+## Settings
+
+Settings are persisted to `data/settings.json` and loaded on every request
+via `appSettings.load()` — changes apply immediately without a service restart.
+
+| Setting | Default | Description |
+|---|---|---|
+| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
+| `semanticLimit` | 5 | Semantic search results injected into prompt |
+| `scoreThreshold` | 0.75 | Minimum similarity score for semantic results |
+| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
+| `temperature` | 0.7 | Inference temperature |
+| `repeatPenalty` | 1.1 | Repeat token penalty |
+| `topP` | 0.9 | Nucleus sampling probability mass |
+| `topK` | 40 | Top-K token candidates per step |
+
+Defaults are defined in `config/settings.js` and fall back to constants in
+`@nexusai/shared`. Values saved in `settings.json` take precedence.
+
 ## Chat Pipeline

 Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -69,11 +95,11 @@ difference is how the inference response is delivered to the client.
   `memory-isolation.md` for full behaviour.

 3. **Recent episode retrieval** — fetch the most recent episodes for the
-   session (`RECENT_EPISODE_LIMIT`, default 5).
+   session (`recentEpisodeLimit`, default 5).

-4. **Semantic search** — embed the user message, query Qdrant for the top-5
-   most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
-   recent episodes. Non-critical — if it fails, pipeline continues with
+4. **Semantic search** — embed the user message, query Qdrant for the top
+   `semanticLimit` most similar past episodes (minimum score `scoreThreshold`). Deduplicated
+   against recent episodes. Non-critical — if it fails, pipeline continues with
   recency-only context.

 5. **Entity search** — reuse the embedded user message vector to query the
@@ -84,7 +110,8 @@ difference is how the inference response is delivered to the client.
 6. **Prompt assembly** — combine system prompt, entity context, semantic
   episodes, recent episodes, and user message.

-7. **Inference** — send to inference service. `/chat` awaits full response;
+7. **Inference** — send to inference service with settings-derived parameters
+   (temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
   `/chat/stream` pipes SSE chunks to the client.

 8. **Episode write** — write the exchange back to memory. Fire-and-forget
@@ -107,12 +134,12 @@ Here is what you know about entities relevant to this conversation:
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to 5 semantic episodes)
+... (up to semanticLimit semantic episodes)
 ---
 Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to 5 recent episodes)
+... (up to recentEpisodeLimit recent episodes)
 --- End of recent memories ---

 User: {current message}
@@ -141,20 +168,16 @@ data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
 The `[DONE]` sentinel is consumed internally and not forwarded. The stream
 is terminated by `res.end()` after the done event.

-## Models Manifest
+## Models Route

-`GET /models` reads `models.json` fresh on each request from
-`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
-accessible via an SMB mount at `/mnt/nexus-models`.
+`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
+(read from settings). Merges results with a `models.json` file in the same
+folder for richer metadata (label, description). Returns file size in GB.

-```json
-[
-  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
-]
-```
-
-`value` must match the model name as reported by `llama-server` (including
-`.gguf` extension). No service restart needed when models are added or removed.
+`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
+Returns `{ contextWindow, modelAlias }`. Used by the client to display
+read-only context window size and the currently loaded model in the settings
+panel. Returns `503` if llama-server is unreachable.

 ## Sessions Route Behaviour

@@ -179,6 +202,9 @@ handle /chat*     { reverse_proxy localhost:4000 }
 handle /sessions* { reverse_proxy localhost:4000 }
 handle /models*   { reverse_proxy localhost:4000 }
 handle /projects* { reverse_proxy localhost:4000 }
+handle /episodes* { reverse_proxy localhost:4000 }
+handle /settings* { reverse_proxy localhost:4000 }
+handle /health*   { reverse_proxy localhost:4000 }
 ```

 After updating: `caddy reload --config /path/to/Caddyfile`
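
The settings precedence described in the `## Settings` section of this file's diff (values saved in `settings.json` win, shared constants fill the gaps) can be sketched roughly as follows. This is an illustration only: `DEFAULT_SETTINGS` and `resolveSettings` are hypothetical names, the default values are copied from the defaults table above, and the actual `config/settings.js` may be structured differently.

```javascript
// Defaults copied from the orchestration settings table in this diff.
// In the real service these are said to come from @nexusai/shared constants.
const DEFAULT_SETTINGS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.75,
  modelsFolderPath: '/mnt/nexus-models',
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
};

// Merge saved settings over the defaults. Keys that are missing, null, or
// undefined in the saved object keep their default value.
function resolveSettings(saved = {}) {
  const resolved = { ...DEFAULT_SETTINGS };
  for (const [key, value] of Object.entries(saved)) {
    if (value !== undefined && value !== null) {
      resolved[key] = value;
    }
  }
  return resolved;
}
```

Calling `resolveSettings` with the parsed contents of `data/settings.json` on each request would give the "changes apply immediately" behaviour the section describes, assuming the file is re-read every time.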
--- a/docs/services/shared.md
+++ b/docs/services/shared.md
@@ -142,6 +142,9 @@ llama.cpp runtime defaults — used by the llama.cpp inference provider.
 #### `INFERENCE_DEFAULTS`

 Default inference parameters applied when not specified in a request.
+These are used as fallbacks in `resolveOptions()` in both providers.
+Orchestration reads live values from `settings.json` and forwards them
+on every request — these constants are the fallback layer only.

 | Key | Value | Description |
 |---|---|---|
@@ -154,16 +157,22 @@ Default inference parameters applied when not specified in a request.

 #### `ORCHESTRATION`

-Orchestration pipeline defaults.
+Orchestration pipeline defaults. Used as fallback values in
+`config/settings.js` when `settings.json` doesn't contain a key.

 | Key | Value | Description |
 |---|---|---|
 | `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
 | `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
 | `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
+| `TEMPERATURE` | `0.7` | Default inference temperature |
 | `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
 | `SYSTEM_PROMPT` | *(see below)* | Default system prompt |

+> In `config/settings.js`, the `repeatPenalty`, `topP`, and `topK` defaults
+> come from `INFERENCE_DEFAULTS` rather than `ORCHESTRATION`, since
+> `INFERENCE_DEFAULTS` already defines the canonical values.
+
 Default system prompt:
 > "You are a helpful, context-aware AI assistant. You have access to memories
 > of past conversations with the user. Use them to provide consistent,