documentation updated for model inference settings

This commit is contained in:
Storme-bit
2026-04-18 06:41:50 -07:00
parent c198a00dde
commit 44989a2b8b
5 changed files with 182 additions and 41 deletions


@@ -30,7 +30,10 @@ here for reference and direct debugging use.
"temperature": 0.7
}
```
`model` and `temperature` are optional.
`model` and `temperature` are optional. Inference parameters (temperature,
topP, topK, repeatPenalty) are read from `settings.json` on every request, so
any values supplied for them in the request body, including `temperature`, are
ignored; change them via `PATCH /settings`.
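The precedence described above can be sketched as follows. This is a minimal illustration, not the service's actual code: `buildInferenceParams` and `SETTINGS_CONTROLLED` are hypothetical names, and the only behaviour assumed is that the four listed parameters always come from settings while `model` may still come from the request body.

```javascript
// Hypothetical list of parameters that settings.json controls.
const SETTINGS_CONTROLLED = ["temperature", "topP", "topK", "repeatPenalty"];

// Build the parameters forwarded to inference. Values in `body` for the
// settings-controlled keys are deliberately ignored.
function buildInferenceParams(body, settings) {
  const params = {};
  if (body.model !== undefined) params.model = body.model;
  for (const key of SETTINGS_CONTROLLED) {
    params[key] = settings[key];
  }
  return params;
}

const params = buildInferenceParams(
  { model: "model-name.gguf", temperature: 0.7 },
  { temperature: 0.65, topP: 0.9, topK: 41, repeatPenalty: 1.3 }
);
console.log(params.temperature); // 0.65, taken from settings, not the body
```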
**POST /chat — response:**
```json
@@ -110,9 +113,74 @@ Returns `201` with the created project object.
| Method | Path | Description |
|---|---|---|
| GET | /models | Available models from `models.json` manifest |
| GET | /models | Available models scanned live from models folder |
| GET | /models/props | Live model props from llama-server (context window, loaded model) |
Returns array: `[{ "value": "model-name.gguf", "label": "Display Name" }]`
**GET /models** — returns array:
```json
[{ "value": "model-name.gguf", "label": "Display Name", "description": null, "size": "19.7 GB" }]
```
Scans `.gguf` files live from `modelsFolderPath` (set in settings). Merges
with `models.json` in the same folder for label and description metadata.
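A rough sketch of the scan-and-merge step is shown below. It assumes `models.json` is an array of `{ value, label, description }` objects and that sizes are reported in binary gigabytes; neither is confirmed by this document, and `listModels` is a hypothetical name.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: list .gguf files in `folder` and merge optional
// metadata from a models.json manifest in the same folder.
function listModels(folder) {
  let manifest = [];
  try {
    const parsed = JSON.parse(
      fs.readFileSync(path.join(folder, "models.json"), "utf8")
    );
    if (Array.isArray(parsed)) manifest = parsed;
  } catch {
    // Missing or invalid manifest: fall back to file names only.
  }
  const byValue = new Map(manifest.map((m) => [m.value, m]));

  return fs
    .readdirSync(folder)
    .filter((name) => name.toLowerCase().endsWith(".gguf"))
    .sort()
    .map((name) => {
      const meta = byValue.get(name) ?? {};
      const bytes = fs.statSync(path.join(folder, name)).size;
      return {
        value: name,
        label: meta.label ?? name,
        description: meta.description ?? null,
        // Assumption: size is formatted as bytes / 1024^3 with one decimal.
        size: `${(bytes / 1024 ** 3).toFixed(1)} GB`,
      };
    });
}
```

Because the folder is read on every call, a file added to or removed from the folder shows up on the next request without a restart.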
**GET /models/props** — returns:
```json
{ "contextWindow": 64000, "modelAlias": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf" }
```
Fetches directly from llama-server `/props`. Returns `503` if llama-server
is unreachable.
### Settings
| Method | Path | Description |
|---|---|---|
| GET | /settings | Get all current settings |
| PATCH | /settings | Update one or more settings |
**GET /settings — response:**
```json
{
"recentEpisodeLimit": 9,
"semanticLimit": 5,
"scoreThreshold": 0.6,
"modelsFolderPath": "/mnt/nexus-models",
"temperature": 0.65,
"repeatPenalty": 1.3,
"topP": 0.9,
"topK": 41
}
```
**PATCH /settings — body:** any subset of the above fields.
| Field | Type | Range | Description |
|---|---|---|---|
| `recentEpisodeLimit` | integer | 1–20 | Recent episodes injected into prompt |
| `semanticLimit` | integer | 1–20 | Max semantic search results |
| `scoreThreshold` | float | 0–1 | Minimum similarity score |
| `modelsFolderPath` | string | — | Path to folder containing .gguf files |
| `temperature` | float | 0–2 | Inference randomness |
| `repeatPenalty` | float | 1–2 | Repeat token penalty |
| `topP` | float | 0–1 | Nucleus sampling probability mass |
| `topK` | integer | 1–100 | Top-K token candidates per step |
Settings are persisted to `data/settings.json` and read on every request —
changes take effect immediately without a service restart.
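One way a `PATCH /settings` body could be checked against the field table is sketched here. The ranges are read as 1 to 20, 0 to 1, 0 to 2, 1 to 2, and 1 to 100, and are treated as inclusive; both the inclusive bounds and the names `SETTING_RULES` and `validatePatch` are assumptions for illustration.

```javascript
// Assumed validation rules, transcribed from the field table.
const SETTING_RULES = {
  recentEpisodeLimit: { type: "integer", min: 1, max: 20 },
  semanticLimit: { type: "integer", min: 1, max: 20 },
  scoreThreshold: { type: "float", min: 0, max: 1 },
  modelsFolderPath: { type: "string" },
  temperature: { type: "float", min: 0, max: 2 },
  repeatPenalty: { type: "float", min: 1, max: 2 },
  topP: { type: "float", min: 0, max: 1 },
  topK: { type: "integer", min: 1, max: 100 },
};

// Returns a list of error messages; an empty list means the patch is valid.
function validatePatch(patch) {
  const errors = [];
  for (const [key, value] of Object.entries(patch)) {
    if (!Object.hasOwn(SETTING_RULES, key)) {
      errors.push(`${key}: unknown setting`);
      continue;
    }
    const rule = SETTING_RULES[key];
    if (rule.type === "string") {
      if (typeof value !== "string" || value.length === 0) {
        errors.push(`${key}: expected a non-empty string`);
      }
      continue;
    }
    if (typeof value !== "number" || !Number.isFinite(value)) {
      errors.push(`${key}: expected a number`);
    } else if (rule.type === "integer" && !Number.isInteger(value)) {
      errors.push(`${key}: expected an integer`);
    } else if (value < rule.min || value > rule.max) {
      errors.push(`${key}: must be between ${rule.min} and ${rule.max}`);
    }
  }
  return errors;
}

console.log(validatePatch({ temperature: 0.65, topK: 41 })); // []
```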
### Episodes
| Method | Path | Description |
|---|---|---|
| GET | /episodes | Paginated episode list across all sessions |
| DELETE | /episodes/:id | Delete an episode (SQLite + Qdrant) |
**GET /episodes — query params:**
| Param | Default | Description |
|---|---|---|
| limit | 20 | Episodes per page |
| offset | 0 | Pagination offset |
| q | — | Keyword search (FTS) |
---
@@ -158,10 +226,11 @@ are not touched.
| Method | Path | Description |
|---|---|---|
| POST | /episodes | Create episode + auto-embed into Qdrant |
| GET | /episodes | Paginated episode list across all sessions |
| GET | /episodes/search?q=&limit= | FTS keyword search across all episodes |
| GET | /episodes/:id | Get episode by ID |
| GET | /sessions/:id/episodes?limit=&offset= | Paginated episodes for a session |
| DELETE | /episodes/:id | Delete an episode |
| DELETE | /episodes/:id | Delete episode (SQLite + Qdrant cleanup) |
> Route ordering: `/episodes/search` must be defined before `/episodes/:id`.
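
The ordering rule follows from first-match routing. The toy router below is not Express; it is a hypothetical stand-in that matches routes in registration order, which is enough to show why `/episodes/search` is shadowed when `/episodes/:id` is registered first.

```javascript
// Minimal first-match router, standing in for Express-style route matching.
function createRouter() {
  const routes = [];
  return {
    get(pattern, name) {
      // Turn "/episodes/:id" into a regex where ":id" matches one path segment.
      const regex = new RegExp(
        "^" + pattern.replace(/:[^/]+/g, "[^/]+") + "$"
      );
      routes.push({ regex, name });
    },
    match(pathname) {
      const hit = routes.find((r) => r.regex.test(pathname));
      return hit ? hit.name : null;
    },
  };
}

const wrong = createRouter();
wrong.get("/episodes/:id", "getById");
wrong.get("/episodes/search", "search");

const right = createRouter();
right.get("/episodes/search", "search");
right.get("/episodes/:id", "getById");

console.log(wrong.match("/episodes/search")); // getById (":id" captures "search")
console.log(right.match("/episodes/search")); // search
console.log(right.match("/episodes/42")); // getById
```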
@@ -266,10 +335,14 @@ is awkward to encode in a path.
"prompt": "What is the capital of France?",
"model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"temperature": 0.7,
"maxTokens": 1024
"maxTokens": 1024,
"topP": 0.9,
"topK": 40,
"repeatPenalty": 1.1
}
```
All fields except `prompt` are optional.
All fields except `prompt` are optional. In normal usage these are forwarded
from orchestration, which reads them from `settings.json`.
**POST /complete — response:**
```json


@@ -14,6 +14,7 @@ inference services. Served as static files by Caddy on Mini PC 2.
## Dependencies
- `react` + `react-dom` — UI framework
- `react-markdown` — Markdown rendering in message bubbles and memory viewer
- `uuid` — session ID generation
- `vite` + `@vitejs/plugin-react` — build tooling
@@ -63,13 +64,16 @@ export default defineConfig({
'/sessions': 'http://192.168.0.205:4000',
'/chat': 'http://192.168.0.205:4000',
'/projects': 'http://192.168.0.205:4000',
'/episodes': 'http://192.168.0.205:4000',
'/settings': 'http://192.168.0.205:4000',
'/health': 'http://192.168.0.205:4000',
}
}
});
```
When adding new top-level routes to the orchestration service, add a matching
entry here too.
entry here and in the Caddy config.
## Internal Structure
@@ -84,19 +88,22 @@ src/
│ ├── useChat.js # Message sending, SSE streaming, message state
│ ├── useModels.js # Dynamic model list fetched from /models endpoint
│ ├── useProjects.js # Project list fetched from /projects endpoint
│ ├── useSettings.js # Settings fetch + saveSetting helper
│ └── useContextMenu.js # Right-click context menu position and visibility
├── components/
│ ├── App.jsx # Root component — layout, shared state, view routing
│ ├── Sidebar.jsx # Left sidebar — projects, recent chats, navigation
│ ├── ChatWindow.jsx # Centre panel — message thread and input bar
│ ├── MessageBubble.jsx # Individual message bubble (user or assistant)
│ ├── MessageBubble.jsx # Individual message bubble — renders markdown via react-markdown
│ ├── InfoPanel.jsx # Right panel — model selector and session metadata (slide-in)
│ ├── SessionModal.jsx # Modal for session rename, project assignment, delete
│ ├── ProjectModal.jsx # Modal for project create, edit, delete
│ ├── AllChatsView.jsx # Full paginated session list with multi-select bulk delete
│ ├── AllProjectsView.jsx # Project tile grid with create/edit/delete
│ ├── ProjectView.jsx # Individual project — session list, new chat button
│ └── SettingsView.jsx # Settings placeholder (Appearance, Memory, Models, About)
│ ├── MemoryView.jsx # Paginated, searchable, expandable, deletable episode viewer
│ └── SettingsView.jsx # Settings — Memory limits, Models (inference params, active
│ # model, context window), Service Health, Appearance placeholder
├── index.css # Global reset, CSS variables, utility classes
└── main.jsx # React entry point
```
@@ -118,7 +125,7 @@ panel are persistent across all views.
│ ⊞ View Projects │ all-projects → AllProjectsView│
│ │ project → ProjectView │
│ PROJECTS ▾ │ settings → SettingsView │
│ [tile] [tile] │
│ [tile] [tile] │ memory → MemoryView │
│ All Projects → │ │
│ │ │
│ RECENT CHATS ▾ │ │
@@ -143,6 +150,7 @@ via the `⊹` button in the `ChatWindow` header.
| `'all-projects'` | `AllProjectsView` | "View Projects" button or ⊞ icon |
| `'project'` | `ProjectView` | Clicking a project tile in the sidebar |
| `'settings'` | `SettingsView` | Settings button or ⚙ icon |
| `'memory'` | `MemoryView` | "Open →" button in Settings → Memory section |
`activeProject` state in `App.jsx` tracks which project `ProjectView` is
displaying. Set via `onSelectProject` before navigating to `'project'`.
@@ -262,3 +270,18 @@ and a filtered session list. The "+ New Chat" button creates a new session,
navigates to `'chat'`, and writes the project assignment after the first message.
For memory isolation behaviour, see `memory-isolation.md`.
## Settings
`useSettings` fetches from `GET /settings` on mount and exposes a `saveSetting(key, value)`
helper that issues a `PATCH /settings` with a single key-value pair. The `saving`
boolean is exposed for disabling save buttons during in-flight requests.
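The request side of `saveSetting` might look roughly like the sketch below, with the React state handling left out. `createSettingsClient`, the injected `fetchImpl`, and the `onSavingChange` callback are hypothetical names, and the assumption that `PATCH /settings` answers with a JSON body is not confirmed here.

```javascript
// Hypothetical core of a saveSetting helper. `fetchImpl` is injected so the
// sketch runs without a browser or a live orchestration service.
function createSettingsClient(fetchImpl, onSavingChange = () => {}) {
  return {
    async saveSetting(key, value) {
      onSavingChange(true); // would drive the `saving` boolean in the hook
      try {
        const res = await fetchImpl("/settings", {
          method: "PATCH",
          headers: { "Content-Type": "application/json" },
          // A single key-value pair, as described above.
          body: JSON.stringify({ [key]: value }),
        });
        if (!res.ok) throw new Error(`PATCH /settings failed: ${res.status}`);
        return await res.json();
      } finally {
        onSavingChange(false); // reset even when the request fails
      }
    },
  };
}
```

Resetting the flag in `finally` keeps save buttons from staying disabled after a failed request.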
`SettingsView` is organised into sections:
- **Memory** — recent episode limit, semantic limit, score threshold, link to MemoryView
- **Models** — models folder path, temperature, repeat penalty, Top-P, Top-K,
active model dropdown, read-only model info panel (file, size, context window,
loaded model from llama-server)
- **About** — service health check panel, version
- **Appearance** — theme (coming soon)


@@ -54,6 +54,11 @@ INFERENCE_URL=http://localhost:8080
The provider loader throws immediately on an unknown value, preventing silent
misconfiguration.
> **LM Studio compatibility note:** LM Studio exposes an OpenAI-compatible
> `/v1/chat/completions` endpoint with the same request shape as llama.cpp.
> A future `lmstudio.js` provider would be nearly identical to `llamacpp.js` —
> only the `BASE_URL` would differ. No architectural changes required.
## Internal Structure
```
@@ -109,14 +114,19 @@ Set `DEFAULT_MODEL` in `.env` to the exact reported name.
### Inference Parameters
| NexusAI option | API field | Default |
|---|---|---|
| `temperature` | `temperature` | 0.7 |
| `maxTokens` | `max_tokens` | 1024 |
| `topP` | `top_p` | 0.9 |
| `topK` | `top_k` | 40 |
| `repeatPenalty` | `repeat_penalty` | 1.1 |
| `seed` | `seed` | null (random) |
All parameters are resolved in `resolveOptions()` — falling back to
`INFERENCE_DEFAULTS` from `@nexusai/shared` if not provided in the request.
In normal usage, orchestration reads these from `settings.json` and forwards
them on every request.
| NexusAI option | API field | Default | Description |
|---|---|---|---|
| `temperature` | `temperature` | 0.7 | Response randomness (0 = deterministic) |
| `maxTokens` | `max_tokens` | 1024 | Max tokens to generate |
| `topP` | `top_p` | 0.9 | Nucleus sampling probability mass |
| `topK` | `top_k` | 40 | Top-K token candidates per step |
| `repeatPenalty` | `repeat_penalty` | 1.1 | Penalty for recently used tokens |
| `seed` | `seed` | null | null = random; integer for reproducible output |
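
The fallback and field mapping in the table could be implemented along these lines. This is a sketch only: the `INFERENCE_DEFAULTS` object here is a local stand-in for the constant in `@nexusai/shared`, `FIELD_MAP` is a hypothetical name, and omitting a `null` seed from the payload is an assumption about how "random" is signalled.

```javascript
// Local stand-in for INFERENCE_DEFAULTS, using the defaults from the table.
const INFERENCE_DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

// NexusAI option name -> API field name.
const FIELD_MAP = {
  temperature: "temperature",
  maxTokens: "max_tokens",
  topP: "top_p",
  topK: "top_k",
  repeatPenalty: "repeat_penalty",
  seed: "seed",
};

function resolveOptions(options = {}) {
  const resolved = {};
  for (const [option, apiField] of Object.entries(FIELD_MAP)) {
    // `??` only falls back on null/undefined, so temperature 0 is preserved.
    const value = options[option] ?? INFERENCE_DEFAULTS[option];
    // Assumption: a null seed is left out so the server picks a random one.
    if (value !== null) resolved[apiField] = value;
  }
  return resolved;
}

console.log(resolveOptions({ temperature: 0, topK: 41 }));
```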
## Streaming Response Format


@@ -27,9 +27,10 @@ or inference services — all traffic flows through orchestration.
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
## Internal Structure
@@ -42,17 +43,42 @@ src/
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
├── chat/
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
├── config/
│ └── settings.js # Settings load/save — reads/writes data/settings.json
├── routes/
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy
│ └── models.js # GET /models — reads models.json from disk
│ ├── episodes.js # Episode list and delete proxy
│ ├── settings.js # GET /settings and PATCH /settings
│ ├── health.js # GET /health — pings all four services
│ └── models.js # GET /models — scans .gguf files live, merges with models.json
# GET /models/props — context window + loaded model from llama-server
└── index.js # Express app entry point
```
The `services/` layer wraps all downstream HTTP calls in named functions.
URL or endpoint changes have a single place to be updated.
## Settings
Settings are persisted to `data/settings.json` and loaded on every request
via `appSettings.load()` — changes apply immediately without a service restart.
| Setting | Default | Description |
|---|---|---|
| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
| `semanticLimit` | 5 | Semantic search results injected into prompt |
| `scoreThreshold` | 0.75 | Minimum similarity score for semantic results |
| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
| `temperature` | 0.7 | Inference temperature |
| `repeatPenalty` | 1.1 | Repeat token penalty |
| `topP` | 0.9 | Nucleus sampling probability mass |
| `topK` | 40 | Top-K token candidates per step |
Defaults are defined in `config/settings.js` and fall back to constants in
`@nexusai/shared`. Values saved in `settings.json` take precedence.
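The precedence rule (saved values over defaults, re-read on each call) can be sketched as below. The `DEFAULTS` object copies the table above rather than importing from `@nexusai/shared`, and the `load(file)` and `save(file, patch)` signatures are hypothetical; the real `appSettings.load()` may differ.

```javascript
const fs = require("node:fs");

// Defaults copied from the table above (a stand-in for the shared constants).
const DEFAULTS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.75,
  modelsFolderPath: "/mnt/nexus-models",
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
};

// Read the file on every call so edits apply without a restart.
function load(file) {
  let saved = {};
  try {
    saved = JSON.parse(fs.readFileSync(file, "utf8"));
  } catch {
    // Missing or invalid file: use defaults only.
  }
  return { ...DEFAULTS, ...saved }; // saved values win over defaults
}

// Merge a partial update into the current settings and persist the result.
function save(file, patch) {
  const next = { ...load(file), ...patch };
  fs.writeFileSync(file, JSON.stringify(next, null, 2));
  return next;
}
```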
## Chat Pipeline
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -69,11 +95,11 @@ difference is how the inference response is delivered to the client.
`memory-isolation.md` for full behaviour.
3. **Recent episode retrieval** — fetch the most recent episodes for the
session (`RECENT_EPISODE_LIMIT`, default 5).
session (`recentEpisodeLimit`, default 5).
4. **Semantic search** — embed the user message, query Qdrant for the top-5
most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
recent episodes. Non-critical — if it fails, pipeline continues with
4. **Semantic search** — embed the user message, then query Qdrant for the most
similar past episodes (at most `semanticLimit` results, each scoring at least
`scoreThreshold`). Deduplicated against recent episodes. Non-critical — if it
fails, the pipeline continues with recency-only context.
5. **Entity search** — reuse the embedded user message vector to query the
@@ -84,7 +110,8 @@ difference is how the inference response is delivered to the client.
6. **Prompt assembly** — combine system prompt, entity context, semantic
episodes, recent episodes, and user message.
7. **Inference** — send to inference service. `/chat` awaits full response;
7. **Inference** — send to inference service with settings-derived parameters
(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
`/chat/stream` pipes SSE chunks to the client.
8. **Episode write** — write the exchange back to memory. Fire-and-forget
@@ -107,12 +134,12 @@ Here is what you know about entities relevant to this conversation:
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 semantic episodes)
... (up to semanticLimit semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 recent episodes)
... (up to recentEpisodeLimit recent episodes)
--- End of recent memories ---
User: {current message}
@@ -141,20 +168,16 @@ data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
is terminated by `res.end()` after the done event.
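Dropping the sentinel before forwarding could look like the sketch below. It works on an already-buffered payload for simplicity, whereas the real route streams chunks; `forwardableEvents` is a hypothetical name and the `{"token":"Hi"}` payload shape in the example is illustrative only.

```javascript
// Split a buffered SSE payload into events and drop the [DONE] sentinel.
function forwardableEvents(raw) {
  return raw
    .split(/\r?\n\r?\n/) // SSE events are separated by a blank line
    .map((event) => event.trim())
    .filter((event) => event.startsWith("data:"))
    .filter((event) => event.slice("data:".length).trim() !== "[DONE]");
}

const upstream =
  'data: {"token":"Hi"}\n\n' +
  'data: {"done":true,"tokenCount":1}\n\n' +
  "data: [DONE]\n\n";

for (const event of forwardableEvents(upstream)) {
  console.log(event); // the two JSON events; the sentinel is not printed
}
```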
## Models Manifest
## Models Route
`GET /models` reads `models.json` fresh on each request from
`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
accessible via an SMB mount at `/mnt/nexus-models`.
`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
(read from settings). Merges results with a `models.json` file in the same
folder for richer metadata (label, description). Returns file size in GB.
```json
[
{ "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```
`value` must match the model name as reported by `llama-server` (including
`.gguf` extension). No service restart needed when models are added or removed.
`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
Returns `{ contextWindow, modelAlias }`. Used by the client to display
read-only context window size and the currently loaded model in the settings
panel. Returns `503` if llama-server is unreachable.
## Sessions Route Behaviour
@@ -179,6 +202,9 @@ handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /health* { reverse_proxy localhost:4000 }
```
After updating: `caddy reload --config /path/to/Caddyfile`


@@ -142,6 +142,9 @@ llama.cpp runtime defaults — used by the llama.cpp inference provider.
#### `INFERENCE_DEFAULTS`
Default inference parameters applied when not specified in a request.
These are used as fallbacks in `resolveOptions()` in both providers.
Orchestration reads live values from `settings.json` and forwards them
on every request — these constants are the fallback layer only.
| Key | Value | Description |
|---|---|---|
@@ -154,16 +157,22 @@ Default inference parameters applied when not specified in a request.
#### `ORCHESTRATION`
Orchestration pipeline defaults.
Orchestration pipeline defaults. Used as fallback values in
`config/settings.js` when `settings.json` doesn't contain a key.
| Key | Value | Description |
|---|---|---|
| `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
| `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
| `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
| `TEMPERATURE` | `0.7` | Default inference temperature |
| `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
| `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
> In `config/settings.js`, the `repeatPenalty`, `topP`, and `topK` defaults
> come from `INFERENCE_DEFAULTS` rather than `ORCHESTRATION`, since those
> constants already define the canonical values.
Default system prompt:
> "You are a helpful, context-aware AI assistant. You have access to memories
> of past conversations with the user. Use them to provide consistent,