documentation updated for model inference settings
@@ -30,7 +30,10 @@ here for reference and direct debugging use.
 "temperature": 0.7
 }
 ```
-`model` and `temperature` are optional.
+`model` and `temperature` are optional. Inference parameters (temperature,
+topP, topK, repeatPenalty) are read from `settings.json` on every request —
+the request body values are not used for these; they are controlled via
+`PATCH /settings`.
 
 **POST /chat — response:**
 ```json
@@ -110,9 +113,74 @@ Returns `201` with the created project object.
 
 | Method | Path | Description |
 |---|---|---|
-| GET | /models | Available models from `models.json` manifest |
+| GET | /models | Available models scanned live from models folder |
+| GET | /models/props | Live model props from llama-server (context window, loaded model) |
 
-Returns array: `[{ "value": "model-name.gguf", "label": "Display Name" }]`
+**GET /models** — returns array:
+```json
+[{ "value": "model-name.gguf", "label": "Display Name", "description": null, "size": "19.7 GB" }]
+```
+Scans `.gguf` files live from `modelsFolderPath` (set in settings). Merges
+with `models.json` in the same folder for label and description metadata.
+
+**GET /models/props** — returns:
+```json
+{ "contextWindow": 64000, "modelAlias": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf" }
+```
+Fetches directly from llama-server `/props`. Returns `503` if llama-server
+is unreachable.
+
+### Settings
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /settings | Get all current settings |
+| PATCH | /settings | Update one or more settings |
+
+**GET /settings — response:**
+```json
+{
+"recentEpisodeLimit": 9,
+"semanticLimit": 5,
+"scoreThreshold": 0.6,
+"modelsFolderPath": "/mnt/nexus-models",
+"temperature": 0.65,
+"repeatPenalty": 1.3,
+"topP": 0.9,
+"topK": 41
+}
+```
+
+**PATCH /settings — body:** any subset of the above fields.
+
+| Field | Type | Range | Description |
+|---|---|---|---|
+| `recentEpisodeLimit` | integer | 1–20 | Recent episodes injected into prompt |
+| `semanticLimit` | integer | 1–20 | Max semantic search results |
+| `scoreThreshold` | float | 0–1 | Minimum similarity score |
+| `modelsFolderPath` | string | — | Path to folder containing .gguf files |
+| `temperature` | float | 0–2 | Inference randomness |
+| `repeatPenalty` | float | 1–2 | Repeat token penalty |
+| `topP` | float | 0–1 | Nucleus sampling probability mass |
+| `topK` | integer | 1–100 | Top-K token candidates per step |
+
+Settings are persisted to `data/settings.json` and read on every request —
+changes take effect immediately without a service restart.
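The field table for `PATCH /settings` can be turned into a small validation step. The sketch below is illustrative only: `SETTING_RULES` and `validateSettingsPatch` are hypothetical names, the ranges are copied from the table, and treating both bounds as inclusive is an assumption, since the commit does not show how the service validates the body.

```javascript
// Ranges copied from the PATCH /settings field table.
// Assumption: both bounds are inclusive.
const SETTING_RULES = {
  recentEpisodeLimit: { type: 'integer', min: 1, max: 20 },
  semanticLimit: { type: 'integer', min: 1, max: 20 },
  scoreThreshold: { type: 'float', min: 0, max: 1 },
  modelsFolderPath: { type: 'string' },
  temperature: { type: 'float', min: 0, max: 2 },
  repeatPenalty: { type: 'float', min: 1, max: 2 },
  topP: { type: 'float', min: 0, max: 1 },
  topK: { type: 'integer', min: 1, max: 100 },
};

// Hypothetical helper: returns a list of error strings, empty when the body is valid.
function validateSettingsPatch(body) {
  const errors = [];
  for (const [key, value] of Object.entries(body)) {
    const rule = SETTING_RULES[key];
    if (!rule) {
      errors.push(`${key}: unknown setting`);
      continue;
    }
    if (rule.type === 'string') {
      if (typeof value !== 'string' || value.length === 0) {
        errors.push(`${key}: expected a non-empty string`);
      }
      continue;
    }
    if (typeof value !== 'number' || !Number.isFinite(value)) {
      errors.push(`${key}: expected a number`);
      continue;
    }
    if (rule.type === 'integer' && !Number.isInteger(value)) {
      errors.push(`${key}: expected an integer`);
      continue;
    }
    if (value < rule.min || value > rule.max) {
      errors.push(`${key}: must be between ${rule.min} and ${rule.max}`);
    }
  }
  return errors;
}

// A partial body is fine because PATCH accepts any subset of fields.
console.log(validateSettingsPatch({ temperature: 0.65, topK: 41 }));
console.log(validateSettingsPatch({ topK: 0, unknownKey: true }));
```

Returning a list of messages rather than throwing on the first problem lets a settings form show every invalid field at once.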
+
+### Episodes
+
+| Method | Path | Description |
+|---|---|---|
+| GET | /episodes | Paginated episode list across all sessions |
+| DELETE | /episodes/:id | Delete an episode (SQLite + Qdrant) |
+
+**GET /episodes — query params:**
+
+| Param | Default | Description |
+|---|---|---|
+| limit | 20 | Episodes per page |
+| offset | 0 | Pagination offset |
+| q | — | Keyword search (FTS) |
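As a rough illustration of the `GET /episodes` query parameters, a client could build the request URL as follows. `buildEpisodesUrl` is a hypothetical helper; only the parameter names and defaults come from the table.

```javascript
// Hypothetical helper: builds the GET /episodes URL from the documented query params.
function buildEpisodesUrl({ limit = 20, offset = 0, q } = {}) {
  const params = new URLSearchParams({
    limit: String(limit),
    offset: String(offset),
  });
  // q is optional; omit it entirely when no keyword search is requested.
  if (q) params.set('q', q);
  return `/episodes?${params.toString()}`;
}

console.log(buildEpisodesUrl());
console.log(buildEpisodesUrl({ offset: 40, q: 'vector db' }));
```

With the default page size of 20, page `n` (zero-based) corresponds to `offset = n * 20`.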
 
 ---
 
@@ -158,10 +226,11 @@ are not touched.
 | Method | Path | Description |
 |---|---|---|
 | POST | /episodes | Create episode + auto-embed into Qdrant |
+| GET | /episodes | Paginated episode list across all sessions |
 | GET | /episodes/search?q=&limit= | FTS keyword search across all episodes |
 | GET | /episodes/:id | Get episode by ID |
 | GET | /sessions/:id/episodes?limit=&offset= | Paginated episodes for a session |
-| DELETE | /episodes/:id | Delete an episode |
+| DELETE | /episodes/:id | Delete episode (SQLite + Qdrant cleanup) |
 
 > Route ordering: `/episodes/search` must be defined before `/episodes/:id`.
 
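The route-ordering note matters because a router that tries routes in registration order (as Express does) will let `/episodes/:id` capture the literal path `/episodes/search`. The toy matcher below is only a stand-in to show the effect; `matchRoute` is hypothetical and is not the service's routing code.

```javascript
// Toy first-match router: returns the first pattern that matches the path.
// Segments starting with ':' match any value, like a path parameter.
function matchRoute(routes, path) {
  const pathParts = path.split('/');
  for (const pattern of routes) {
    const patternParts = pattern.split('/');
    if (patternParts.length !== pathParts.length) continue;
    const matches = patternParts.every(
      (part, i) => part.startsWith(':') || part === pathParts[i]
    );
    if (matches) return pattern;
  }
  return null;
}

const searchFirst = ['/episodes/search', '/episodes/:id'];
const idFirst = ['/episodes/:id', '/episodes/search'];

// With the documented order, the search route wins.
console.log(matchRoute(searchFirst, '/episodes/search'));
// With the order reversed, "search" is swallowed as an :id value.
console.log(matchRoute(idFirst, '/episodes/search'));
```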
@@ -266,10 +335,14 @@ is awkward to encode in a path.
 "prompt": "What is the capital of France?",
 "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
 "temperature": 0.7,
-"maxTokens": 1024
+"maxTokens": 1024,
+"topP": 0.9,
+"topK": 40,
+"repeatPenalty": 1.1
 }
 ```
-All fields except `prompt` are optional.
+All fields except `prompt` are optional. In normal usage these are forwarded
+from orchestration, which reads them from `settings.json`.
 
 **POST /complete — response:**
 ```json
@@ -14,6 +14,7 @@ inference services. Served as static files by Caddy on Mini PC 2.
 ## Dependencies
 
 - `react` + `react-dom` — UI framework
+- `react-markdown` — Markdown rendering in message bubbles and memory viewer
 - `uuid` — session ID generation
 - `vite` + `@vitejs/plugin-react` — build tooling
 
@@ -63,13 +64,16 @@ export default defineConfig({
 '/sessions': 'http://192.168.0.205:4000',
 '/chat': 'http://192.168.0.205:4000',
 '/projects': 'http://192.168.0.205:4000',
+'/episodes': 'http://192.168.0.205:4000',
+'/settings': 'http://192.168.0.205:4000',
+'/health': 'http://192.168.0.205:4000',
 }
 }
 });
 ```
 
 When adding new top-level routes to the orchestration service, add a matching
-entry here too.
+entry here and in the Caddy config.
 
 ## Internal Structure
 
@@ -84,19 +88,22 @@ src/
 │ ├── useChat.js # Message sending, SSE streaming, message state
 │ ├── useModels.js # Dynamic model list fetched from /models endpoint
 │ ├── useProjects.js # Project list fetched from /projects endpoint
+│ ├── useSettings.js # Settings fetch + saveSetting helper
 │ └── useContextMenu.js # Right-click context menu position and visibility
 ├── components/
 │ ├── App.jsx # Root component — layout, shared state, view routing
 │ ├── Sidebar.jsx # Left sidebar — projects, recent chats, navigation
 │ ├── ChatWindow.jsx # Centre panel — message thread and input bar
-│ ├── MessageBubble.jsx # Individual message bubble (user or assistant)
+│ ├── MessageBubble.jsx # Individual message bubble — renders markdown via react-markdown
 │ ├── InfoPanel.jsx # Right panel — model selector and session metadata (slide-in)
 │ ├── SessionModal.jsx # Modal for session rename, project assignment, delete
 │ ├── ProjectModal.jsx # Modal for project create, edit, delete
 │ ├── AllChatsView.jsx # Full paginated session list with multi-select bulk delete
 │ ├── AllProjectsView.jsx # Project tile grid with create/edit/delete
 │ ├── ProjectView.jsx # Individual project — session list, new chat button
-│ └── SettingsView.jsx # Settings placeholder (Appearance, Memory, Models, About)
+│ ├── MemoryView.jsx # Paginated, searchable, expandable, deletable episode viewer
+│ └── SettingsView.jsx # Settings — Memory limits, Models (inference params, active
+│ # model, context window), Service Health, Appearance placeholder
 ├── index.css # Global reset, CSS variables, utility classes
 └── main.jsx # React entry point
 ```
@@ -118,7 +125,7 @@ panel are persistent across all views.
 │ ⊞ View Projects │ all-projects → AllProjectsView│
 │ │ project → ProjectView │
 │ PROJECTS ▾ │ settings → SettingsView │
-│ [tile] [tile] │ │
+│ [tile] [tile] │ memory → MemoryView │
 │ All Projects → │ │
 │ │ │
 │ RECENT CHATS ▾ │ │
@@ -143,6 +150,7 @@ via the `⊹` button in the `ChatWindow` header.
 | `'all-projects'` | `AllProjectsView` | "View Projects" button or ⊞ icon |
 | `'project'` | `ProjectView` | Clicking a project tile in the sidebar |
 | `'settings'` | `SettingsView` | Settings button or ⚙ icon |
+| `'memory'` | `MemoryView` | "Open →" button in Settings → Memory section |
 
 `activeProject` state in `App.jsx` tracks which project `ProjectView` is
 displaying. Set via `onSelectProject` before navigating to `'project'`.
@@ -261,4 +269,19 @@ exposes `refreshProjects` for keeping the sidebar in sync after mutations.
 and a filtered session list. The "+ New Chat" button creates a new session,
 navigates to `'chat'`, and writes the project assignment after the first message.
 
-For memory isolation behaviour, see `memory-isolation.md`.
+For memory isolation behaviour, see `memory-isolation.md`.
+
+## Settings
+
+`useSettings` fetches from `GET /settings` on mount and exposes a `saveSetting(key, value)`
+helper that issues a `PATCH /settings` with a single key-value pair. The `saving`
+boolean is exposed for disabling save buttons during in-flight requests.
+
+`SettingsView` is organised into sections:
+
+- **Memory** — recent episode limit, semantic limit, score threshold, link to MemoryView
+- **Models** — models folder path, temperature, repeat penalty, Top-P, Top-K,
+active model dropdown, read-only model info panel (file, size, context window,
+loaded model from llama-server)
+- **About** — service health check panel, version
+- **Appearance** — theme (coming soon)
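The `saveSetting(key, value)` behaviour described in the Settings section can be sketched outside React as below. This is not the hook's actual code: `buildSettingPatch` and the injectable `fetchImpl` parameter are hypothetical, the `saving` state handling is left out, and the assumption that the endpoint answers with JSON is not confirmed by the commit.

```javascript
// Hypothetical helper: the request a single-key save would send.
function buildSettingPatch(key, value) {
  return {
    url: '/settings',
    options: {
      method: 'PATCH',
      headers: { 'Content-Type': 'application/json' },
      // Computed property name keeps the body to exactly one key-value pair.
      body: JSON.stringify({ [key]: value }),
    },
  };
}

// fetchImpl is injectable so the sketch can be exercised without a running service.
async function saveSetting(key, value, fetchImpl = fetch) {
  const { url, options } = buildSettingPatch(key, value);
  const res = await fetchImpl(url, options);
  if (!res.ok) {
    throw new Error(`PATCH /settings failed with status ${res.status}`);
  }
  // Assumption: the service responds with a JSON body.
  return res.json();
}

console.log(buildSettingPatch('topK', 40).options.body);
```

In a hook, a `saving` flag would typically be set to `true` before the call and reset in a `finally` block so that buttons re-enable even when the request fails.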
@@ -54,6 +54,11 @@ INFERENCE_URL=http://localhost:8080
 The provider loader throws immediately on an unknown value, preventing silent
 misconfiguration.
 
+> **LM Studio compatibility note:** LM Studio exposes an OpenAI-compatible
+> `/v1/chat/completions` endpoint with the same request shape as llama.cpp.
+> A future `lmstudio.js` provider would be nearly identical to `llamacpp.js` —
+> only the `BASE_URL` would differ. No architectural changes required.
+
 ## Internal Structure
 
 ```
@@ -109,14 +114,19 @@ Set `DEFAULT_MODEL` in `.env` to the exact reported name.
 
 ### Inference Parameters
 
-| NexusAI option | API field | Default |
-|---|---|---|
-| `temperature` | `temperature` | 0.7 |
-| `maxTokens` | `max_tokens` | 1024 |
-| `topP` | `top_p` | 0.9 |
-| `topK` | `top_k` | 40 |
-| `repeatPenalty` | `repeat_penalty` | 1.1 |
-| `seed` | `seed` | null (random) |
+All parameters are resolved in `resolveOptions()` — falling back to
+`INFERENCE_DEFAULTS` from `@nexusai/shared` if not provided in the request.
+In normal usage, orchestration reads these from `settings.json` and forwards
+them on every request.
+
+| NexusAI option | API field | Default | Description |
+|---|---|---|---|
+| `temperature` | `temperature` | 0.7 | Response randomness (0 = deterministic) |
+| `maxTokens` | `max_tokens` | 1024 | Max tokens to generate |
+| `topP` | `top_p` | 0.9 | Nucleus sampling probability mass |
+| `topK` | `top_k` | 40 | Top-K token candidates per step |
+| `repeatPenalty` | `repeat_penalty` | 1.1 | Penalty for recently used tokens |
+| `seed` | `seed` | null | null = random; integer for reproducible output |
 
 ## Streaming Response Format
 
@@ -27,9 +27,10 @@ or inference services — all traffic flows through orchestration.
 | MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
 | EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
 | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
+| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
 | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
 | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
-| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
+| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
 
 ## Internal Structure
 
@@ -42,17 +43,42 @@ src/
 │ └── qdrant.js # HTTP client for Qdrant (direct vector search)
 ├── chat/
 │ └── index.js # Core pipeline — context assembly, isolation, auto-naming
+├── config/
+│ └── settings.js # Settings load/save — reads/writes data/settings.json
 ├── routes/
 │ ├── chat.js # POST /chat and POST /chat/stream
 │ ├── sessions.js # Session CRUD proxy
 │ ├── projects.js # Project CRUD proxy
-│ └── models.js # GET /models — reads models.json from disk
+│ ├── episodes.js # Episode list and delete proxy
+│ ├── settings.js # GET /settings and PATCH /settings
+│ ├── health.js # GET /health — pings all four services
+│ └── models.js # GET /models — scans .gguf files live, merges with models.json
+# GET /models/props — context window + loaded model from llama-server
 └── index.js # Express app entry point
 ```
 
 The `services/` layer wraps all downstream HTTP calls in named functions.
 URL or endpoint changes have a single place to be updated.
 
+## Settings
+
+Settings are persisted to `data/settings.json` and loaded on every request
+via `appSettings.load()` — changes apply immediately without a service restart.
+
+| Setting | Default | Description |
+|---|---|---|
+| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
+| `semanticLimit` | 5 | Semantic search results injected into prompt |
+| `scoreThreshold` | 0.75 | Minimum similarity score for semantic results |
+| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
+| `temperature` | 0.7 | Inference temperature |
+| `repeatPenalty` | 1.1 | Repeat token penalty |
+| `topP` | 0.9 | Nucleus sampling probability mass |
+| `topK` | 40 | Top-K token candidates per step |
+
+Defaults are defined in `config/settings.js` and fall back to constants in
+`@nexusai/shared`. Values saved in `settings.json` take precedence.
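The precedence rule for settings (saved values override defaults) might look roughly like the sketch below. `DEFAULT_SETTINGS` and `resolveSettings` are hypothetical names; the default values are taken from the table, while ignoring keys that are not in the defaults is an assumption about behaviour the commit does not show.

```javascript
// Defaults as listed in the Settings table.
const DEFAULT_SETTINGS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.75,
  modelsFolderPath: '/mnt/nexus-models',
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
};

// Hypothetical merge: start from the defaults, then overlay any saved values.
// Assumption: unknown keys in the saved object are ignored.
function resolveSettings(saved = {}) {
  const resolved = { ...DEFAULT_SETTINGS };
  for (const key of Object.keys(DEFAULT_SETTINGS)) {
    if (saved[key] !== undefined && saved[key] !== null) {
      resolved[key] = saved[key];
    }
  }
  return resolved;
}

// A settings.json that only overrides two keys keeps the rest at their defaults.
console.log(resolveSettings({ temperature: 0.65, topK: 41 }));
```

Checking for `undefined` and `null` explicitly, rather than using `||`, keeps legitimate falsy values such as `temperature: 0` from being replaced by the default.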
+
 ## Chat Pipeline
 
 Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -69,11 +95,11 @@ difference is how the inference response is delivered to the client.
 `memory-isolation.md` for full behaviour.
 
 3. **Recent episode retrieval** — fetch the most recent episodes for the
-session (`RECENT_EPISODE_LIMIT`, default 5).
+session (`recentEpisodeLimit`, default 5).
 
-4. **Semantic search** — embed the user message, query Qdrant for the top-5
-most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
-recent episodes. Non-critical — if it fails, pipeline continues with
+4. **Semantic search** — embed the user message, query Qdrant for the top
+most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
+against recent episodes. Non-critical — if it fails, pipeline continues with
 recency-only context.
 
 5. **Entity search** — reuse the embedded user message vector to query the
@@ -84,7 +110,8 @@ difference is how the inference response is delivered to the client.
 6. **Prompt assembly** — combine system prompt, entity context, semantic
 episodes, recent episodes, and user message.
 
-7. **Inference** — send to inference service. `/chat` awaits full response;
+7. **Inference** — send to inference service with settings-derived parameters
+(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
 `/chat/stream` pipes SSE chunks to the client.
 
 8. **Episode write** — write the exchange back to memory. Fire-and-forget
@@ -107,12 +134,12 @@ Here is what you know about entities relevant to this conversation:
 Here are some relevant memories from earlier conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to 5 semantic episodes)
+... (up to semanticLimit semantic episodes)
 ---
 Here are some relevant memories from your past conversations:
 User: {past user message}
 Assistant: {past ai response}
-... (up to 5 recent episodes)
+... (up to recentEpisodeLimit recent episodes)
 --- End of recent memories ---
 
 User: {current message}
@@ -141,20 +168,16 @@ data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
 The `[DONE]` sentinel is consumed internally and not forwarded. The stream
 is terminated by `res.end()` after the done event.
 
-## Models Manifest
+## Models Route
 
-`GET /models` reads `models.json` fresh on each request from
-`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
-accessible via an SMB mount at `/mnt/nexus-models`.
+`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
+(read from settings). Merges results with a `models.json` file in the same
+folder for richer metadata (label, description). Returns file size in GB.
 
-```json
-[
-{ "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
-]
-```
-
-`value` must match the model name as reported by `llama-server` (including
-`.gguf` extension). No service restart needed when models are added or removed.
+`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
+Returns `{ contextWindow, modelAlias }`. Used by the client to display
+read-only context window size and the currently loaded model in the settings
+panel. Returns `503` if llama-server is unreachable.
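The scan-and-merge step described for `GET /models` could be sketched as a pure function over a directory listing. Everything here is illustrative: `listModels` and `formatSizeGB` are hypothetical names, the input shape (`name`, `sizeBytes`) stands in for whatever the filesystem scan returns, and both the decimal-gigabyte conversion and the filename fallback for `label` are assumptions.

```javascript
// Assumption: sizes are reported in decimal gigabytes with one decimal place.
function formatSizeGB(bytes) {
  return `${(bytes / 1e9).toFixed(1)} GB`;
}

// files: [{ name, sizeBytes }] from a folder scan; manifest: parsed models.json entries.
function listModels(files, manifest = []) {
  // Index manifest entries by their `value` (the .gguf file name).
  const metadata = new Map(manifest.map((entry) => [entry.value, entry]));
  return files
    .filter((file) => file.name.toLowerCase().endsWith('.gguf'))
    .map((file) => {
      const entry = metadata.get(file.name) || {};
      return {
        value: file.name,
        // Assumption: fall back to the file name without extension when no label exists.
        label: entry.label || file.name.replace(/\.gguf$/i, ''),
        description: entry.description ?? null,
        size: formatSizeGB(file.sizeBytes),
      };
    });
}

const scanned = [
  { name: 'model-name.gguf', sizeBytes: 19.7e9 },
  { name: 'models.json', sizeBytes: 512 },
];
const manifest = [{ value: 'model-name.gguf', label: 'Display Name' }];
console.log(listModels(scanned, manifest));
```

Because the list is rebuilt from the folder contents on each call, a model file without a manifest entry still appears, only with less metadata.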
 
 ## Sessions Route Behaviour
 
@@ -179,6 +202,9 @@ handle /chat* { reverse_proxy localhost:4000 }
 handle /sessions* { reverse_proxy localhost:4000 }
 handle /models* { reverse_proxy localhost:4000 }
 handle /projects* { reverse_proxy localhost:4000 }
+handle /episodes* { reverse_proxy localhost:4000 }
+handle /settings* { reverse_proxy localhost:4000 }
+handle /health* { reverse_proxy localhost:4000 }
 ```
 
 After updating: `caddy reload --config /path/to/Caddyfile`
@@ -142,6 +142,9 @@ llama.cpp runtime defaults — used by the llama.cpp inference provider.
 #### `INFERENCE_DEFAULTS`
 
 Default inference parameters applied when not specified in a request.
+These are used as fallbacks in `resolveOptions()` in both providers.
+Orchestration reads live values from `settings.json` and forwards them
+on every request — these constants are the fallback layer only.
 
 | Key | Value | Description |
 |---|---|---|
@@ -154,16 +157,22 @@ Default inference parameters applied when not specified in a request.
 
 #### `ORCHESTRATION`
 
-Orchestration pipeline defaults.
+Orchestration pipeline defaults. Used as fallback values in
+`config/settings.js` when `settings.json` doesn't contain a key.
 
 | Key | Value | Description |
 |---|---|---|
 | `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
 | `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
 | `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
+| `TEMPERATURE` | `0.7` | Default inference temperature |
 | `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
 | `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
 
+> `repeatPenalty`, `topP`, and `topK` defaults are sourced from
+> `INFERENCE_DEFAULTS` in `config/settings.js` rather than `ORCHESTRATION`,
+> since those constants already define the canonical values.
+
 Default system prompt:
 > "You are a helpful, context-aware AI assistant. You have access to memories
 > of past conversations with the user. Use them to provide consistent,