diff --git a/docs/reference/API-routes.md b/docs/reference/API-routes.md
index 876cbfc..7ed9e3b 100644
--- a/docs/reference/API-routes.md
+++ b/docs/reference/API-routes.md
@@ -30,7 +30,10 @@ here for reference and direct debugging use.
   "temperature": 0.7
 }
 ```
-`model` and `temperature` are optional.
+`model` and `temperature` are optional. Inference parameters (temperature,
+topP, topK, repeatPenalty) are read from `settings.json` on every request —
+the request body values are not used for these; they are controlled via
+`PATCH /settings`.
 
 **POST /chat — response:**
 ```json
@@ -110,9 +113,74 @@ Returns `201` with the created project object.
 
 | Method | Path | Description |
 |---|---|---|
-| GET | /models | Available models from `models.json` manifest |
+| GET | /models | Available models scanned live from models folder |
+| GET | /models/props | Live model props from llama-server (context window, loaded model) |
 
-Returns array: `[{ "value": "model-name.gguf", "label": "Display Name" }]`
+**GET /models** — returns array:
+```json
+[{ "value": "model-name.gguf", "label": "Display Name", "description": null, "size": "19.7 GB" }]
+```
+Scans `.gguf` files live from `modelsFolderPath` (set in settings). Merges
+with `models.json` in the same folder for label and description metadata.
+
+**GET /models/props** — returns:
+```json
+{ "contextWindow": 64000, "modelAlias": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf" }
+```
+Fetches directly from llama-server `/props`. Returns `503` if llama-server
+is unreachable.
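The live scan described for `GET /models` could be sketched roughly as below. This is a minimal illustration, not the route's actual code: the `scanModels` name, the manifest shape (an array of `{ value, label, description }` objects), the fallback of using the file name as the label, and the decimal-GB size formatting are all assumptions.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper; the real route handler may be structured differently.
function scanModels(modelsFolderPath) {
  // Optional metadata manifest that lives next to the model files.
  const manifestPath = path.join(modelsFolderPath, 'models.json');
  let manifest = [];
  if (fs.existsSync(manifestPath)) {
    manifest = JSON.parse(fs.readFileSync(manifestPath, 'utf8'));
  }
  // Index manifest entries by file name for quick lookup.
  const meta = new Map(manifest.map((entry) => [entry.value, entry]));

  return fs
    .readdirSync(modelsFolderPath)
    .filter((name) => name.toLowerCase().endsWith('.gguf'))
    .sort()
    .map((name) => {
      const bytes = fs.statSync(path.join(modelsFolderPath, name)).size;
      const entry = meta.get(name) || {};
      return {
        value: name,
        // Assumption: fall back to the file name when no label is provided.
        label: entry.label || name,
        description: entry.description ?? null,
        // Assumption: decimal gigabytes, one decimal place.
        size: `${(bytes / 1e9).toFixed(1)} GB`,
      };
    });
}
```

Because the folder is read on each call, files added to or removed from the folder show up on the next request without restarting anything.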
+
+### Settings
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /settings | Get all current settings |
+| PATCH | /settings | Update one or more settings |
+
+**GET /settings — response:**
+```json
+{
+  "recentEpisodeLimit": 9,
+  "semanticLimit": 5,
+  "scoreThreshold": 0.6,
+  "modelsFolderPath": "/mnt/nexus-models",
+  "temperature": 0.65,
+  "repeatPenalty": 1.3,
+  "topP": 0.9,
+  "topK": 41
+}
+```
+
+**PATCH /settings — body:** any subset of the above fields.
+
+| Field | Type | Range | Description |
+|---|---|---|---|
+| `recentEpisodeLimit` | integer | 1–20 | Recent episodes injected into prompt |
+| `semanticLimit` | integer | 1–20 | Max semantic search results |
+| `scoreThreshold` | float | 0–1 | Minimum similarity score |
+| `modelsFolderPath` | string | — | Path to folder containing .gguf files |
+| `temperature` | float | 0–2 | Inference randomness |
+| `repeatPenalty` | float | 1–2 | Repeat token penalty |
+| `topP` | float | 0–1 | Nucleus sampling probability mass |
+| `topK` | integer | 1–100 | Top-K token candidates per step |
+
+Settings are persisted to `data/settings.json` and read on every request —
+changes take effect immediately without a service restart.
+
+### Episodes
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /episodes | Paginated episode list across all sessions |
+| DELETE | /episodes/:id | Delete an episode (SQLite + Qdrant) |
+
+**GET /episodes — query params:**
+
+| Param | Default | Description |
+|---|---|---|
+| limit | 20 | Episodes per page |
+| offset | 0 | Pagination offset |
+| q | — | Keyword search (FTS) |
 
 ---
 
@@ -158,10 +226,11 @@ are not touched.
 | Method | Path | Description |
 |---|---|---|
 | POST | /episodes | Create episode + auto-embed into Qdrant |
+| GET | /episodes | Paginated episode list across all sessions |
 | GET | /episodes/search?q=&limit= | FTS keyword search across all episodes |
 | GET | /episodes/:id | Get episode by ID |
 | GET | /sessions/:id/episodes?limit=&offset= | Paginated episodes for a session |
-| DELETE | /episodes/:id | Delete an episode |
+| DELETE | /episodes/:id | Delete episode (SQLite + Qdrant cleanup) |
 
 > Route ordering: `/episodes/search` must be defined before `/episodes/:id`.
 
@@ -266,10 +335,14 @@ is awkward to encode in a path.
   "prompt": "What is the capital of France?",
   "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
   "temperature": 0.7,
-  "maxTokens": 1024
+  "maxTokens": 1024,
+  "topP": 0.9,
+  "topK": 40,
+  "repeatPenalty": 1.1
 }
 ```
-All fields except `prompt` are optional.
+All fields except `prompt` are optional. In normal usage these are forwarded
+from orchestration, which reads them from `settings.json`.
 
 **POST /complete — response:**
 ```json
diff --git a/docs/services/chat-client.md b/docs/services/chat-client.md
index ab7a0f8..0828926 100644
--- a/docs/services/chat-client.md
+++ b/docs/services/chat-client.md
@@ -14,6 +14,7 @@ inference services. Served as static files by Caddy on Mini PC 2.
 ## Dependencies
 
 - `react` + `react-dom` — UI framework
+- `react-markdown` — Markdown rendering in message bubbles and memory viewer
 - `uuid` — session ID generation
 - `vite` + `@vitejs/plugin-react` — build tooling
 
@@ -63,13 +64,16 @@ export default defineConfig({
       '/sessions': 'http://192.168.0.205:4000',
       '/chat': 'http://192.168.0.205:4000',
       '/projects': 'http://192.168.0.205:4000',
+      '/episodes': 'http://192.168.0.205:4000',
+      '/settings': 'http://192.168.0.205:4000',
+      '/health': 'http://192.168.0.205:4000',
     }
   }
 });
 ```
 
 When adding new top-level routes to the orchestration service, add a matching
-entry here too.
+entry here and in the Caddy config.
 
 ## Internal Structure
 
@@ -84,19 +88,22 @@ src/
 │   ├── useChat.js          # Message sending, SSE streaming, message state
 │   ├── useModels.js        # Dynamic model list fetched from /models endpoint
 │   ├── useProjects.js      # Project list fetched from /projects endpoint
+│   ├── useSettings.js      # Settings fetch + saveSetting helper
 │   └── useContextMenu.js   # Right-click context menu position and visibility
 ├── components/
 │   ├── App.jsx             # Root component — layout, shared state, view routing
 │   ├── Sidebar.jsx         # Left sidebar — projects, recent chats, navigation
 │   ├── ChatWindow.jsx      # Centre panel — message thread and input bar
-│   ├── MessageBubble.jsx   # Individual message bubble (user or assistant)
+│   ├── MessageBubble.jsx   # Individual message bubble — renders markdown via react-markdown
 │   ├── InfoPanel.jsx       # Right panel — model selector and session metadata (slide-in)
 │   ├── SessionModal.jsx    # Modal for session rename, project assignment, delete
 │   ├── ProjectModal.jsx    # Modal for project create, edit, delete
 │   ├── AllChatsView.jsx    # Full paginated session list with multi-select bulk delete
 │   ├── AllProjectsView.jsx # Project tile grid with create/edit/delete
 │   ├── ProjectView.jsx     # Individual project — session list, new chat button
-│   └── SettingsView.jsx    # Settings placeholder (Appearance, Memory, Models, About)
+│   ├── MemoryView.jsx      # Paginated, searchable, expandable, deletable episode viewer
+│   └── SettingsView.jsx    # Settings — Memory limits, Models (inference params, active
+│                           # model, context window), Service Health, Appearance placeholder
 ├── index.css               # Global reset, CSS variables, utility classes
 └── main.jsx                # React entry point
 ```
@@ -118,7 +125,7 @@ panel are persistent across all views.
 │ ⊞ View Projects  │ all-projects → AllProjectsView│
 │                  │ project      → ProjectView    │
 │ PROJECTS ▾       │ settings     → SettingsView   │
-│ [tile] [tile]    │                               │
+│ [tile] [tile]    │ memory       → MemoryView     │
 │ All Projects →   │                               │
 │                  │                               │
 │ RECENT CHATS ▾   │                               │
@@ -143,6 +150,7 @@ via the `⊹` button in the `ChatWindow` header.
 | `'all-projects'` | `AllProjectsView` | "View Projects" button or ⊞ icon |
 | `'project'` | `ProjectView` | Clicking a project tile in the sidebar |
 | `'settings'` | `SettingsView` | Settings button or ⚙ icon |
+| `'memory'` | `MemoryView` | "Open →" button in Settings → Memory section |
 
 `activeProject` state in `App.jsx` tracks which project `ProjectView` is
 displaying. Set via `onSelectProject` before navigating to `'project'`.
@@ -261,4 +269,19 @@ exposes `refreshProjects` for keeping the sidebar in sync after mutations.
 and a filtered session list. The "+ New Chat" button creates a new session,
 navigates to `'chat'`, and writes the project assignment after the first message.
 
-For memory isolation behaviour, see `memory-isolation.md`.
\ No newline at end of file
+For memory isolation behaviour, see `memory-isolation.md`.
+
+## Settings
+
+`useSettings` fetches from `GET /settings` on mount and exposes a `saveSetting(key, value)`
+helper that issues a `PATCH /settings` with a single key-value pair. The `saving`
+boolean is exposed for disabling save buttons during in-flight requests.
+
+`SettingsView` is organised into sections:
+
+- **Memory** — recent episode limit, semantic limit, score threshold, link to MemoryView
+- **Models** — models folder path, temperature, repeat penalty, Top-P, Top-K,
+  active model dropdown, read-only model info panel (file, size, context window,
+  loaded model from llama-server)
+- **About** — service health check panel, version
+- **Appearance** — theme (coming soon)
\ No newline at end of file
diff --git a/docs/services/inference-service.md b/docs/services/inference-service.md
index bdcd686..f558725 100644
--- a/docs/services/inference-service.md
+++ b/docs/services/inference-service.md
@@ -54,6 +54,11 @@ INFERENCE_URL=http://localhost:8080
 The provider loader throws immediately on an unknown value, preventing silent
 misconfiguration.
 
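A fail-fast provider loader of the kind described above might look like the following sketch. The `PROVIDERS` registry, the `loadProvider` name, and the stub object returned for `llamacpp` are hypothetical; the docs only state that an unknown value throws immediately.

```javascript
// Hypothetical registry: the stub factory stands in for the real provider module.
const PROVIDERS = new Map([
  ['llamacpp', () => ({ name: 'llamacpp' })],
]);

function loadProvider(name) {
  const factory = PROVIDERS.get(name);
  if (!factory) {
    // Throw at startup instead of silently falling back to a default provider.
    const known = [...PROVIDERS.keys()].join(', ');
    throw new Error(`Unknown inference provider "${name}" (expected one of: ${known})`);
  }
  return factory();
}
```

Throwing during startup surfaces a mistyped provider name right away, before any request is served.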
+> **LM Studio compatibility note:** LM Studio exposes an OpenAI-compatible
+> `/v1/chat/completions` endpoint with the same request shape as llama.cpp.
+> A future `lmstudio.js` provider would be nearly identical to `llamacpp.js` —
+> only the `BASE_URL` would differ. No architectural changes required.
+
 ## Internal Structure
 
 ```
@@ -109,14 +114,19 @@ Set `DEFAULT_MODEL` in `.env` to the exact reported name.
 
 ### Inference Parameters
 
-| NexusAI option | API field | Default |
-|---|---|---|
-| `temperature` | `temperature` | 0.7 |
-| `maxTokens` | `max_tokens` | 1024 |
-| `topP` | `top_p` | 0.9 |
-| `topK` | `top_k` | 40 |
-| `repeatPenalty` | `repeat_penalty` | 1.1 |
-| `seed` | `seed` | null (random) |
+All parameters are resolved in `resolveOptions()` — falling back to
+`INFERENCE_DEFAULTS` from `@nexusai/shared` if not provided in the request.
+In normal usage, orchestration reads these from `settings.json` and forwards
+them on every request.
+
+| NexusAI option | API field | Default | Description |
+|---|---|---|---|
+| `temperature` | `temperature` | 0.7 | Response randomness (0 = deterministic) |
+| `maxTokens` | `max_tokens` | 1024 | Max tokens to generate |
+| `topP` | `top_p` | 0.9 | Nucleus sampling probability mass |
+| `topK` | `top_k` | 40 | Top-K token candidates per step |
+| `repeatPenalty` | `repeat_penalty` | 1.1 | Penalty for recently used tokens |
+| `seed` | `seed` | null | null = random; integer for reproducible output |
 
 ## Streaming Response Format
 
diff --git a/docs/services/orchestration-service.md b/docs/services/orchestration-service.md
index 56a4f98..e6ae466 100644
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -27,9 +27,10 @@ or inference services — all traffic flows through orchestration.
 | MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
 | EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
 | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
+| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
-| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
+| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
 
 ## Internal Structure
 
@@ -42,17 +43,42 @@ src/
 │   └── qdrant.js       # HTTP client for Qdrant (direct vector search)
 ├── chat/
 │   └── index.js        # Core pipeline — context assembly, isolation, auto-naming
+├── config/
+│   └── settings.js     # Settings load/save — reads/writes data/settings.json
 ├── routes/
 │   ├── chat.js         # POST /chat and POST /chat/stream
 │   ├── sessions.js     # Session CRUD proxy
 │   ├── projects.js     # Project CRUD proxy
-│   └── models.js       # GET /models — reads models.json from disk
+│   ├── episodes.js     # Episode list and delete proxy
+│   ├── settings.js     # GET /settings and PATCH /settings
+│   ├── health.js       # GET /health — pings all four services
+│   └── models.js       # GET /models — scans .gguf files live, merges with models.json
+                        # GET /models/props — context window + loaded model from llama-server
 └── index.js            # Express app entry point
 ```
 
 The `services/` layer wraps all downstream HTTP calls in named functions. URL
 or endpoint changes have a single place to be updated.
 
+## Settings
+
+Settings are persisted to `data/settings.json` and loaded on every request
+via `appSettings.load()` — changes apply immediately without a service restart.
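A load-on-every-request settings module could be sketched as follows. This is an illustration under stated assumptions rather than the contents of `config/settings.js`: the `createSettingsStore` factory, the `save` name, and the rejection of unknown keys are hypothetical, and the defaults are hard-coded here from the documented values, whereas the docs say the real module takes them from `@nexusai/shared`.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Assumption: hard-coded copies of the documented defaults.
const DEFAULTS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.75,
  modelsFolderPath: '/mnt/nexus-models',
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
};

// Hypothetical factory so the file location can be swapped (for example in tests).
function createSettingsStore(filePath) {
  function load() {
    // Re-read the file on every call so edits apply without a restart.
    let saved = {};
    try {
      saved = JSON.parse(fs.readFileSync(filePath, 'utf8'));
    } catch (err) {
      // A missing file means "use all defaults"; anything else is a real error.
      if (err.code !== 'ENOENT') throw err;
    }
    // Saved values take precedence over defaults.
    return { ...DEFAULTS, ...saved };
  }

  function save(patch) {
    const unknown = Object.keys(patch).filter((key) => !(key in DEFAULTS));
    if (unknown.length > 0) {
      throw new Error(`Unknown setting(s): ${unknown.join(', ')}`);
    }
    const next = { ...load(), ...patch };
    fs.mkdirSync(path.dirname(filePath), { recursive: true });
    fs.writeFileSync(filePath, JSON.stringify(next, null, 2));
    return next;
  }

  return { load, save };
}
```

Reading the file per request trades a small amount of disk I/O for not needing any cache invalidation or restart when a setting changes.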
+
+| Setting | Default | Description |
+|---|---|---|
+| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
+| `semanticLimit` | 5 | Semantic search results injected into prompt |
+| `scoreThreshold` | 0.75 | Minimum similarity score for semantic results |
+| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
+| `temperature` | 0.7 | Inference temperature |
+| `repeatPenalty` | 1.1 | Repeat token penalty |
+| `topP` | 0.9 | Nucleus sampling probability mass |
+| `topK` | 40 | Top-K token candidates per step |
+
+Defaults are defined in `config/settings.js` and fall back to constants in
+`@nexusai/shared`. Values saved in `settings.json` take precedence.
+
 ## Chat Pipeline
 
 Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -69,11 +95,11 @@ difference is how the inference response is delivered to the client.
    `memory-isolation.md` for full behaviour.
 
 3. **Recent episode retrieval** — fetch the most recent episodes for the
-   session (`RECENT_EPISODE_LIMIT`, default 5).
+   session (`recentEpisodeLimit`, default 5).
 
-4. **Semantic search** — embed the user message, query Qdrant for the top-5
-   most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
-   recent episodes. Non-critical — if it fails, pipeline continues with
+4. **Semantic search** — embed the user message, query Qdrant for the top
+   most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
+   against recent episodes. Non-critical — if it fails, pipeline continues with
    recency-only context.
 
 5. **Entity search** — reuse the embedded user message vector to query the
@@ -84,7 +110,8 @@ difference is how the inference response is delivered to the client.
 6. **Prompt assembly** — combine system prompt, entity context, semantic
    episodes, recent episodes, and user message.
 
-7. **Inference** — send to inference service. `/chat` awaits full response;
+7. **Inference** — send to inference service with settings-derived parameters
+   (temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
    `/chat/stream` pipes SSE chunks to the client.
 
 8. **Episode write** — write the exchange back to memory. Fire-and-forget
@@ -107,12 +134,12 @@ Here is what you know about entities relevant to this conversation:
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to 5 semantic episodes)
+... (up to semanticLimit semantic episodes)
 ---
 Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to 5 recent episodes)
+... (up to recentEpisodeLimit recent episodes)
 --- End of recent memories ---
 
 User: {current message}
@@ -141,20 +168,16 @@ data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
 The `[DONE]` sentinel is consumed internally and not forwarded. The stream is
 terminated by `res.end()` after the done event.
 
-## Models Manifest
+## Models Route
 
-`GET /models` reads `models.json` fresh on each request from
-`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
-accessible via an SMB mount at `/mnt/nexus-models`.
+`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
+(read from settings). Merges results with a `models.json` file in the same
+folder for richer metadata (label, description). Returns file size in GB.
 
-```json
-[
-  { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
-]
-```
-
-`value` must match the model name as reported by `llama-server` (including
-`.gguf` extension). No service restart needed when models are added or removed.
+`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
+Returns `{ contextWindow, modelAlias }`. Used by the client to display
+read-only context window size and the currently loaded model in the settings
+panel. Returns `503` if llama-server is unreachable.
 
 ## Sessions Route Behaviour
 
@@ -179,6 +202,9 @@ handle /chat* { reverse_proxy localhost:4000 }
 handle /sessions* { reverse_proxy localhost:4000 }
 handle /models* { reverse_proxy localhost:4000 }
 handle /projects* { reverse_proxy localhost:4000 }
+handle /episodes* { reverse_proxy localhost:4000 }
+handle /settings* { reverse_proxy localhost:4000 }
+handle /health* { reverse_proxy localhost:4000 }
 ```
 
 After updating: `caddy reload --config /path/to/Caddyfile`
diff --git a/docs/services/shared.md b/docs/services/shared.md
index 3ffbd34..69ffffa 100644
--- a/docs/services/shared.md
+++ b/docs/services/shared.md
@@ -142,6 +142,9 @@ llama.cpp runtime defaults — used by the llama.cpp inference provider.
 #### `INFERENCE_DEFAULTS`
 
 Default inference parameters applied when not specified in a request.
+These are used as fallbacks in `resolveOptions()` in both providers.
+Orchestration reads live values from `settings.json` and forwards them
+on every request — these constants are the fallback layer only.
 
 | Key | Value | Description |
 |---|---|---|
@@ -154,16 +157,22 @@ Default inference parameters applied when not specified in a request.
 
 #### `ORCHESTRATION`
 
-Orchestration pipeline defaults.
+Orchestration pipeline defaults. Used as fallback values in
+`config/settings.js` when `settings.json` doesn't contain a key.
 
 | Key | Value | Description |
 |---|---|---|
 | `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
 | `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
 | `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
+| `TEMPERATURE` | `0.7` | Default inference temperature |
 | `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
 | `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
 
+> `repeatPenalty`, `topP`, and `topK` defaults are sourced from
+> `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
+> since those constants already define the canonical values.
+
 Default system prompt:
 > "You are a helpful, context-aware AI assistant. You have access to memories
 > of past conversations with the user. Use them to provide consistent,