documentation updated for model inference settings

Storme-bit
2026-04-18 06:41:50 -07:00
parent c198a00dde
commit 44989a2b8b
5 changed files with 182 additions and 41 deletions

@@ -27,9 +27,10 @@ or inference services — all traffic flows through orchestration.
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
+| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
-| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
+| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |
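A minimal sketch of how the variables shown in this table might be resolved at startup, assuming plain `process.env` lookups with the documented defaults. The `loadConfig` helper is hypothetical and not taken from the codebase.

```javascript
// Hypothetical helper: resolves service URLs from the environment,
// falling back to the defaults listed in the table.
const DEFAULTS = {
  MEMORY_SERVICE_URL: 'http://localhost:3002',
  EMBEDDING_SERVICE_URL: 'http://localhost:3003',
  INFERENCE_SERVICE_URL: 'http://localhost:3001',
  LLAMA_SERVER_URL: 'http://localhost:8080',
  QDRANT_URL: 'http://localhost:6333',
  CORS_ORIGIN: 'http://localhost:5173',
};

function loadConfig(env = process.env) {
  const config = {};
  for (const [name, fallback] of Object.entries(DEFAULTS)) {
    // Treat empty strings as unset so a blank entry does not override the default.
    config[name] = env[name] ? env[name] : fallback;
  }
  // Legacy and optional: there is no default, so this may be undefined.
  config.MODELS_MANIFEST_PATH = env.MODELS_MANIFEST_PATH;
  return config;
}
```

Passing the environment as an argument keeps the function easy to exercise with a plain object instead of the real `process.env`.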
## Internal Structure
@@ -42,17 +43,42 @@ src/
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
├── chat/
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
+├── config/
+│ └── settings.js # Settings load/save — reads/writes data/settings.json
├── routes/
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy
-│ └── models.js # GET /models — reads models.json from disk
+│ ├── episodes.js # Episode list and delete proxy
+│ ├── settings.js # GET /settings and PATCH /settings
+│ ├── health.js # GET /health — pings all four services
+│ └── models.js # GET /models — scans .gguf files live, merges with models.json
+│               # GET /models/props — context window + loaded model from llama-server
└── index.js # Express app entry point
```
The `services/` layer wraps all downstream HTTP calls in named functions.
URL or endpoint changes have a single place to be updated.
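As an illustration of that pattern, the sketch below wraps one downstream call in a named function. The function name `getRecentEpisodes`, the `/episodes` path, and its query parameters are assumptions made for the example; only the default base URL comes from the environment variable table.

```javascript
// Hypothetical services/memory.js: one named function per downstream call.
const MEMORY_SERVICE_URL = process.env.MEMORY_SERVICE_URL || 'http://localhost:3002';

// fetchImpl is injectable so the wrapper can be exercised without a network.
// The default relies on the global fetch available in Node 18+.
async function getRecentEpisodes(sessionId, limit, fetchImpl = fetch) {
  const url = new URL('/episodes', MEMORY_SERVICE_URL); // assumed endpoint path
  url.searchParams.set('sessionId', sessionId);
  url.searchParams.set('limit', String(limit));

  const res = await fetchImpl(url);
  if (!res.ok) {
    throw new Error(`memory service responded ${res.status}`);
  }
  return res.json();
}
```

Route handlers would call `getRecentEpisodes(...)` rather than building URLs themselves, so a changed path or port is corrected in this one file.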
+## Settings
+Settings are persisted to `data/settings.json` and loaded on every request
+via `appSettings.load()` — changes apply immediately without a service restart.
+| Setting | Default | Description |
+|---|---|---|
+| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
+| `semanticLimit` | 5 | Semantic search results injected into prompt |
+| `scoreThreshold` | 0.75 | Minimum similarity score for semantic results |
+| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
+| `temperature` | 0.7 | Inference temperature |
+| `repeatPenalty` | 1.1 | Repeat token penalty |
+| `topP` | 0.9 | Nucleus sampling probability mass |
+| `topK` | 40 | Top-K token candidates per step |
+Defaults are defined in `config/settings.js` and fall back to constants in
+`@nexusai/shared`. Values saved in `settings.json` take precedence.
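The precedence rule can be sketched as a shallow merge that is redone on every load. This is a hedged illustration, not the actual `config/settings.js`: the `DEFAULT_SETTINGS` object stands in for the constants from `@nexusai/shared`, and the `save` helper is a guess at what `PATCH /settings` might call.

```javascript
const fs = require('node:fs');

// Assumed stand-in for the constants exported by @nexusai/shared.
const DEFAULT_SETTINGS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.75,
  modelsFolderPath: '/mnt/nexus-models',
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
};

// Reads settings.json on every call, so edits apply without a restart.
function load(filePath = 'data/settings.json') {
  let saved = {};
  try {
    saved = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  } catch (err) {
    // A missing file means "no overrides yet"; anything else is a real error.
    if (err.code !== 'ENOENT') throw err;
  }
  // Saved values take precedence over defaults.
  return { ...DEFAULT_SETTINGS, ...saved };
}

// Merges a partial update into the current settings and persists the result.
function save(patch, filePath = 'data/settings.json') {
  const next = { ...load(filePath), ...patch };
  fs.writeFileSync(filePath, JSON.stringify(next, null, 2));
  return next;
}
```

Reading the file per request trades a small amount of disk I/O for the restart-free behaviour described above.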
## Chat Pipeline
Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -69,11 +95,11 @@ difference is how the inference response is delivered to the client.
`memory-isolation.md` for full behaviour.
3. **Recent episode retrieval** — fetch the most recent episodes for the
-session (`RECENT_EPISODE_LIMIT`, default 5).
+session (`recentEpisodeLimit`, default 5).
-4. **Semantic search** — embed the user message, query Qdrant for the top-5
-most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
-recent episodes. Non-critical — if it fails, pipeline continues with
+4. **Semantic search** — embed the user message, query Qdrant for the most
+similar past episodes (up to `semanticLimit`, minimum score `scoreThreshold`). Deduplicated
+against recent episodes. Non-critical — if it fails, the pipeline continues with
recency-only context.
5. **Entity search** — reuse the embedded user message vector to query the
@@ -84,7 +110,8 @@ difference is how the inference response is delivered to the client.
6. **Prompt assembly** — combine system prompt, entity context, semantic
episodes, recent episodes, and user message.
-7. **Inference** — send to inference service. `/chat` awaits full response;
+7. **Inference** — send to inference service with settings-derived parameters
+(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
`/chat/stream` pipes SSE chunks to the client.
8. **Episode write** — write the exchange back to memory. Fire-and-forget
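The filtering described in step 4 could be sketched as follows, assuming each search hit carries an `id` and a `score` and each recent episode carries an `id`. Those field names and both helpers are hypothetical. Whether the threshold is applied by Qdrant or by the service is not stated here, so the sketch applies it locally.

```javascript
// Keeps hits that clear the threshold, drops those already present in the
// recent episodes, and caps the result at semanticLimit.
function selectSemanticEpisodes(hits, recentEpisodes, settings) {
  const recentIds = new Set(recentEpisodes.map((ep) => ep.id));
  return hits
    .filter((hit) => hit.score >= settings.scoreThreshold)
    .filter((hit) => !recentIds.has(hit.id))
    .sort((a, b) => b.score - a.score) // best matches first
    .slice(0, settings.semanticLimit);
}

// Non-critical step: on failure, fall back to recency-only context.
async function safeSemanticSearch(searchFn, recentEpisodes, settings) {
  try {
    const hits = await searchFn();
    return selectSemanticEpisodes(hits, recentEpisodes, settings);
  } catch (err) {
    console.warn('semantic search failed, continuing without it:', err.message);
    return [];
  }
}
```

Returning an empty array on failure lets prompt assembly proceed unchanged, with only the recent episodes as memory context.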
@@ -107,12 +134,12 @@ Here is what you know about entities relevant to this conversation:
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
-... (up to 5 semantic episodes)
+... (up to semanticLimit semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
-... (up to 5 recent episodes)
+... (up to recentEpisodeLimit recent episodes)
--- End of recent memories ---
User: {current message}
@@ -141,20 +168,16 @@ data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
is terminated by `res.end()` after the done event.
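A rough sketch of that relay logic, assuming the inference service emits `data: <payload>` lines and finishes with `data: [DONE]`. The upstream chunk shape, the `relayLine` helper, and the per-chunk token counting are assumptions. Only the `[DONE]` handling and the final done event follow the description above.

```javascript
// Handles one upstream SSE line. `res` only needs write() and end(),
// so an Express response or a test double both work.
function relayLine(line, res, state) {
  if (!line.startsWith('data: ')) return; // ignore blank lines and comments
  const payload = line.slice('data: '.length);

  if (payload === '[DONE]') {
    // The sentinel is consumed here; the client receives a done event instead.
    const done = { done: true, model: state.model, tokenCount: state.tokenCount };
    res.write(`data: ${JSON.stringify(done)}\n\n`);
    res.end();
    return;
  }

  state.tokenCount += 1; // assumes one token per chunk, a simplification
  res.write(`data: ${payload}\n\n`);
}
```

Keeping the running count in a `state` object lets the caller seed it with the model name before the first chunk arrives.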
-## Models Manifest
+## Models Route
-`GET /models` reads `models.json` fresh on each request from
-`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
-accessible via an SMB mount at `/mnt/nexus-models`.
+`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
+(read from settings). It merges the results with a `models.json` file in the same
+folder for richer metadata (label, description) and returns file size in GB.
-```json
-[
-{ "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
-]
-```
-`value` must match the model name as reported by `llama-server` (including
-`.gguf` extension). No service restart needed when models are added or removed.
+`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
+It returns `{ contextWindow, modelAlias }`, which the client uses to display the
+read-only context window size and the currently loaded model in the settings
+panel. It returns `503` if llama-server is unreachable.
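The scan-and-merge behaviour of `GET /models` might look roughly like this. The `listModels` helper, the shape of `models.json` (an array of `{ value, label, description }` entries, extrapolated from the earlier manifest format), the `sizeGB` field name, and the rounding are all assumptions.

```javascript
const fs = require('node:fs');
const path = require('node:path');

function listModels(folder) {
  // The manifest is optional: missing or unreadable metadata is not fatal here.
  let manifest = [];
  try {
    manifest = JSON.parse(fs.readFileSync(path.join(folder, 'models.json'), 'utf8'));
  } catch {
    manifest = [];
  }
  if (!Array.isArray(manifest)) manifest = [];
  const byValue = new Map(manifest.map((entry) => [entry.value, entry]));

  return fs
    .readdirSync(folder)
    .filter((name) => name.toLowerCase().endsWith('.gguf'))
    .sort()
    .map((name) => {
      const { size } = fs.statSync(path.join(folder, name));
      const meta = byValue.get(name) || {};
      return {
        value: name,
        label: meta.label || name, // fall back to the filename
        description: meta.description || '',
        sizeGB: Number((size / 1024 ** 3).toFixed(2)), // bytes to GiB, 2 decimals
      };
    });
}
```

Because the directory is read on each call, a model file copied into the folder shows up on the next request, with or without a matching manifest entry.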
## Sessions Route Behaviour
@@ -179,6 +202,9 @@ handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
+handle /episodes* { reverse_proxy localhost:4000 }
+handle /settings* { reverse_proxy localhost:4000 }
+handle /health* { reverse_proxy localhost:4000 }
```
After updating: `caddy reload --config /path/to/Caddyfile`