documentation updated for model inference settings
@@ -27,9 +27,10 @@ or inference services — all traffic flows through orchestration.
| MEMORY_SERVICE_URL | No | http://localhost:3002 | Memory service URL |
| EMBEDDING_SERVICE_URL | No | http://localhost:3003 | Embedding service URL |
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| LLAMA_SERVER_URL | No | http://localhost:8080 | Direct llama-server URL for /models/props |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
| MODELS_MANIFEST_PATH | No | — | Legacy — superseded by `modelsFolderPath` in settings.json |

## Internal Structure

@@ -42,17 +43,42 @@ src/
│ └── qdrant.js # HTTP client for Qdrant (direct vector search)
├── chat/
│ └── index.js # Core pipeline — context assembly, isolation, auto-naming
├── config/
│ └── settings.js # Settings load/save — reads/writes data/settings.json
├── routes/
│ ├── chat.js # POST /chat and POST /chat/stream
│ ├── sessions.js # Session CRUD proxy
│ ├── projects.js # Project CRUD proxy
│ └── models.js # GET /models — reads models.json from disk
│ ├── episodes.js # Episode list and delete proxy
│ ├── settings.js # GET /settings and PATCH /settings
│ ├── health.js # GET /health — pings all four services
│ └── models.js # GET /models — scans .gguf files live, merges with models.json
# GET /models/props — context window + loaded model from llama-server
└── index.js # Express app entry point
```

The `services/` layer wraps all downstream HTTP calls in named functions.
URL or endpoint changes have a single place to be updated.

## Settings

Settings are persisted to `data/settings.json` and loaded on every request
via `appSettings.load()` — changes apply immediately without a service restart.

| Setting | Default | Description |
|---|---|---|
| `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
| `semanticLimit` | 5 | Semantic search results injected into prompt |
| `scoreThreshold` | 0.75 | Minimum similarity score for semantic results |
| `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
| `temperature` | 0.7 | Inference temperature |
| `repeatPenalty` | 1.1 | Repeat token penalty |
| `topP` | 0.9 | Nucleus sampling probability mass |
| `topK` | 40 | Top-K token candidates per step |

Defaults are defined in `config/settings.js` and fall back to constants in
`@nexusai/shared`. Values saved in `settings.json` take precedence.
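
The precedence rule above (saved values over defaults, re-read on every call) can be sketched as follows. This is a minimal illustration, not the contents of `config/settings.js`: the `load(settingsPath)` signature, the inline `DEFAULTS` object, and the silent fallback on a missing file are all assumptions, and the real defaults come from `@nexusai/shared`.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Defaults copied from the settings table. In the real service these are
// said to fall back to constants in @nexusai/shared (not reproduced here).
const DEFAULTS = {
  recentEpisodeLimit: 5,
  semanticLimit: 5,
  scoreThreshold: 0.75,
  modelsFolderPath: "/mnt/nexus-models",
  temperature: 0.7,
  repeatPenalty: 1.1,
  topP: 0.9,
  topK: 40,
};

// Hypothetical load(): reads the file on every call so edits apply without
// a restart, and lets saved values override the defaults.
function load(settingsPath) {
  let saved = {};
  try {
    saved = JSON.parse(fs.readFileSync(settingsPath, "utf8"));
  } catch (err) {
    // Missing or invalid file: use defaults only (assumed behaviour).
  }
  return { ...DEFAULTS, ...saved };
}

// Usage: a file that overrides only temperature keeps every other default.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "settings-"));
const demoPath = path.join(demoDir, "settings.json");
fs.writeFileSync(demoPath, JSON.stringify({ temperature: 0.2 }));
const settings = load(demoPath);
console.log(settings.temperature, settings.topK);
```

Because nothing is cached, each request pays for one small file read; that cost is what buys the "no restart" behaviour described above.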

## Chat Pipeline

Both `POST /chat` and `POST /chat/stream` share the same steps. The only
@@ -69,11 +95,11 @@ difference is how the inference response is delivered to the client.
`memory-isolation.md` for full behaviour.

3. **Recent episode retrieval** — fetch the most recent episodes for the
session (`RECENT_EPISODE_LIMIT`, default 5).
session (`recentEpisodeLimit`, default 5).

4. **Semantic search** — embed the user message, query Qdrant for the top-5
most similar past episodes (`SCORE_THRESHOLD` 0.75). Deduplicated against
recent episodes. Non-critical — if it fails, pipeline continues with
4. **Semantic search** — embed the user message, query Qdrant for the top
most similar past episodes (`semanticLimit`, `scoreThreshold`). Deduplicated
against recent episodes. Non-critical — if it fails, pipeline continues with
recency-only context.

5. **Entity search** — reuse the embedded user message vector to query the
@@ -84,7 +110,8 @@ difference is how the inference response is delivered to the client.
6. **Prompt assembly** — combine system prompt, entity context, semantic
episodes, recent episodes, and user message.

7. **Inference** — send to inference service. `/chat` awaits full response;
7. **Inference** — send to inference service with settings-derived parameters
(temperature, topP, topK, repeatPenalty). `/chat` awaits full response;
`/chat/stream` pipes SSE chunks to the client.

8. **Episode write** — write the exchange back to memory. Fire-and-forget
@@ -107,12 +134,12 @@ Here is what you know about entities relevant to this conversation:
Here are some relevant memories from earlier conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 semantic episodes)
... (up to semanticLimit semantic episodes)
---
Here are some relevant memories from your past conversations:
User: {past user message}
Assistant: {past ai response}
... (up to 5 recent episodes)
... (up to recentEpisodeLimit recent episodes)
--- End of recent memories ---

User: {current message}
@@ -141,20 +168,16 @@ data: {"done":true,"model":"gemma-4-26B...gguf","tokenCount":42}
The `[DONE]` sentinel is consumed internally and not forwarded. The stream
is terminated by `res.end()` after the done event.
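
One way to picture the sentinel handling is a small filter over the upstream SSE lines. The sketch below is illustrative only: `filterSseLines` is a hypothetical helper, and the `{"token": ...}` payload shape is an assumption rather than the inference service's documented format. In the route itself, seeing the sentinel would be the point at which the done event is written and `res.end()` is called.

```javascript
// Hypothetical helper: collects the data payloads that should reach the
// client and stops at the [DONE] sentinel, which is never forwarded.
function filterSseLines(lines) {
  const forward = [];
  let sawDone = false;
  for (const line of lines) {
    if (!line.startsWith("data:")) continue; // skip blank separators and comments
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      sawDone = true; // consumed here instead of being sent on
      break;
    }
    forward.push(payload);
  }
  return { forward, sawDone };
}

// Usage with a made-up upstream stream (payload shape is an assumption).
const upstream = ['data: {"token":"Hel"}', "", 'data: {"token":"lo"}', "", "data: [DONE]"];
const result = filterSseLines(upstream);
console.log(result.forward.length, result.sawDone);
```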

## Models Manifest
## Models Route

`GET /models` reads `models.json` fresh on each request from
`MODELS_MANIFEST_PATH`. The file lives on the main PC alongside model files,
accessible via an SMB mount at `/mnt/nexus-models`.
`GET /models` scans `.gguf` files live on each request from `modelsFolderPath`
(read from settings). Merges results with a `models.json` file in the same
folder for richer metadata (label, description). Returns file size in GB.
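
The scan-and-merge behaviour could look roughly like the sketch below. It is not the code in `routes/models.js`: the `listModels` name, the `sizeGB` field, the fallback of `label` to the file name, and the assumption that `models.json` is an array of `{ value, label, description }` entries keyed by file name are all illustrative choices.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical scan: list .gguf files in the folder, then overlay label and
// description from models.json when an entry with the same `value` exists.
function listModels(modelsFolderPath) {
  let manifest = [];
  try {
    const raw = fs.readFileSync(path.join(modelsFolderPath, "models.json"), "utf8");
    const parsed = JSON.parse(raw);
    if (Array.isArray(parsed)) manifest = parsed;
  } catch (err) {
    // No manifest or invalid JSON: return scan results without extra metadata.
  }
  const byValue = new Map(manifest.map((entry) => [entry.value, entry]));

  return fs
    .readdirSync(modelsFolderPath)
    .filter((name) => name.toLowerCase().endsWith(".gguf"))
    .sort()
    .map((name) => {
      const bytes = fs.statSync(path.join(modelsFolderPath, name)).size;
      const meta = byValue.get(name) || {};
      return {
        value: name,
        label: meta.label || name, // assumed fallback: the file name
        description: meta.description || "",
        // 1024^3 bytes per GB is an assumption; the unit is not specified.
        sizeGB: Number((bytes / 1024 ** 3).toFixed(2)),
      };
    });
}

// Usage against a throwaway folder with one model file and a manifest.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "models-"));
fs.writeFileSync(path.join(dir, "demo-model.gguf"), Buffer.alloc(1024));
fs.writeFileSync(path.join(dir, "notes.txt"), "ignored");
fs.writeFileSync(
  path.join(dir, "models.json"),
  JSON.stringify([{ value: "demo-model.gguf", label: "Demo Model" }])
);
const models = listModels(dir);
console.log(models);
```

Scanning the folder on each request means a newly copied `.gguf` file shows up even when `models.json` has no entry for it; the manifest only enriches what the scan finds.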

```json
[
{ "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```

`value` must match the model name as reported by `llama-server` (including
`.gguf` extension). No service restart needed when models are added or removed.
`GET /models/props` fetches directly from llama-server via `LLAMA_SERVER_URL`.
Returns `{ contextWindow, modelAlias }`. Used by the client to display
read-only context window size and the currently loaded model in the settings
panel. Returns `503` if llama-server is unreachable.
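
A possible shape for the `/models/props` logic is sketched here, with the HTTP client injected so it runs without a live server. Several details are assumptions and should be checked against the actual llama-server version in use: the upstream path `/props`, the upstream field names `default_generation_settings.n_ctx` and `model_alias`, the error body, and treating a non-OK upstream response the same as an unreachable server. `getModelProps` is a hypothetical name.

```javascript
// Hypothetical handler core. `fetchImpl` is injected for the demo; in a
// service it would typically be the global fetch.
async function getModelProps(llamaServerUrl, fetchImpl) {
  try {
    const res = await fetchImpl(`${llamaServerUrl}/props`);
    if (!res.ok) {
      return { status: 503, body: { error: "llama-server unavailable" } };
    }
    const props = await res.json();
    return {
      status: 200,
      body: {
        // Upstream field names are assumptions about the llama-server JSON.
        contextWindow: props.default_generation_settings?.n_ctx ?? null,
        modelAlias: props.model_alias ?? null,
      },
    };
  } catch (err) {
    // Network failure: report the upstream as unreachable.
    return { status: 503, body: { error: "llama-server unreachable" } };
  }
}

// Usage with two fake upstreams: one healthy, one refusing connections.
const fakeUp = async () => ({
  ok: true,
  json: async () => ({ default_generation_settings: { n_ctx: 8192 }, model_alias: "demo" }),
});
const fakeDown = async () => {
  throw new Error("ECONNREFUSED");
};
const checks = Promise.all([
  getModelProps("http://localhost:8080", fakeUp),
  getModelProps("http://localhost:8080", fakeDown),
]);
checks.then(([up, down]) => console.log(up.status, up.body.contextWindow, down.status));
```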

## Sessions Route Behaviour

@@ -179,6 +202,9 @@ handle /chat* { reverse_proxy localhost:4000 }
handle /sessions* { reverse_proxy localhost:4000 }
handle /models* { reverse_proxy localhost:4000 }
handle /projects* { reverse_proxy localhost:4000 }
handle /episodes* { reverse_proxy localhost:4000 }
handle /settings* { reverse_proxy localhost:4000 }
handle /health* { reverse_proxy localhost:4000 }
```

After updating: `caddy reload --config /path/to/Caddyfile`