updated documentation

Storme-bit
2026-04-13 03:42:14 -07:00
parent 5f024093d1
commit 045da0d7f4
5 changed files with 464 additions and 112 deletions

@@ -27,33 +27,46 @@ npm run dev # local dev server on port 5173
Vite bakes environment variables into the bundle at build time. The `.env` Vite bakes environment variables into the bundle at build time. The `.env`
file is only needed on the machine running the build, not where files are served. file is only needed on the machine running the build, not where files are served.
After building, copy `dist/` contents to `/srv/nexusai` on Mini PC 2 for Caddy to serve.
## Environment Variables ## Environment Variables
| Variable | Required | Default | Description | | Variable | Required | Default | Description |
|---|---|---|---| |---|---|---|---|
| VITE_ORCHESTRATION_URL | No | `''` (empty) | Orchestration base URL. Empty string uses Vite proxy in dev, Caddy proxy in production. | | VITE_ORCHESTRATION_URL | No | `''` (empty) | Orchestration base URL. Must be set to the HTTPS domain in production to avoid mixed content errors. |
Production value:
```
VITE_ORCHESTRATION_URL=https://nexus.jellystorm.com
```
## Internal Structure ## Internal Structure
``` ```
src/ src/
├── api/ ├── api/
│ └── orchestration.js # All fetch calls to the orchestration service │ └── orchestration.js # All fetch calls to the orchestration service
├── config/
│ └── constants.js # FALLBACK_MODELS, DEFAULT_MODEL, API_DEFAULTS
├── hooks/ ├── hooks/
│ ├── useSession.js # Session list, history loading, active session state │ ├── useSession.js # Session list, history loading, active session state
│ └── useChat.js # Message sending, SSE streaming, message state │ ├── useChat.js # Message sending, SSE streaming, message state
│ ├── useModels.js # Dynamic model list fetched from /models endpoint
│ └── useContextMenu.js # Right-click context menu position and visibility
├── components/ ├── components/
│ ├── App.jsx # Root component — layout and shared state │ ├── App.jsx # Root component — layout and shared state
│ ├── SessionList.jsx # Left sidebar — session list and new chat button │ ├── SessionList.jsx # Left sidebar — session list, rename, delete
│ ├── ChatWindow.jsx # Centre panel — message thread and input bar │ ├── ChatWindow.jsx # Centre panel — message thread and input bar
│ ├── MessageBubble.jsx # Individual message bubble (user or assistant) │ ├── MessageBubble.jsx # Individual message bubble (user or assistant)
│ └── InfoPanel.jsx # Right panel — model selector and session metadata │ ├── InfoPanel.jsx # Right panel — model selector and session metadata
├── index.css # Global reset and CSS variables │ └── SessionModal.jsx # Modal dialog for session settings (rename)
└── main.jsx # React entry point ├── index.css # Global reset, CSS variables, utility classes
└── main.jsx # React entry point
``` ```
## Layout ## Layout
Three-panel layout with collapsible sidebars: Three-panel layout with collapsible sidebars:
```
┌─────────────────┬──────────────────────────┬─────────────┐ ┌─────────────────┬──────────────────────────┬─────────────┐
│ Session List │ Chat Window │ Info Panel │ │ Session List │ Chat Window │ Info Panel │
│ (collapsible) │ │ (collapsible)│ │ (collapsible) │ │ (collapsible)│
@@ -64,9 +77,54 @@ Three-panel layout with collapsible sidebars:
│ Session 2 │ │ │ │ Session 2 │ │ │
│ │ [input bar] │ │ │ │ [input bar] │ │
└─────────────────┴──────────────────────────┴─────────────┘ └─────────────────┴──────────────────────────┴─────────────┘
```
On mobile, sidebars collapse to a 56px icon rail. The centre chat window Sidebars collapse to a 56px icon rail. The centre chat window always
always fills the remaining space. fills the remaining space.
## CSS Architecture
Styles follow a hybrid approach — CSS utility classes for static reusable
rules, inline styles for dynamic prop-driven values.
### CSS Variables (`:root`)
| Variable | Value | Description |
|---|---|---|
| `--bg-base` | `#0f1117` | Page background |
| `--bg-surface` | `#1a1d27` | Panel backgrounds |
| `--bg-elevated` | `#222536` | Elevated elements (inputs, cards) |
| `--border` | `#2e3150` | Border colour |
| `--accent` | `#6c63ff` | Primary accent (buttons, highlights) |
| `--accent-hover` | `#574fd6` | Accent hover state |
| `--text-primary` | `#e8e8f0` | Primary text |
| `--text-secondary` | `#8b8fa8` | Secondary text |
| `--text-muted` | `#555870` | Muted / placeholder text |
| `--bubble-user` | `#6c63ff` | User message bubble background |
| `--bubble-ai` | `#222536` | AI message bubble background |
| `--sidebar-width` | `280px` | Expanded sidebar width |
| `--panel-width` | `260px` | Expanded info panel width |
| `--header-height` | `56px` | Shared header height across all panels |
| `--radius-sm` | `6px` | Small border radius |
| `--radius-md` | `8px` | Medium border radius |
| `--radius-lg` | `12px` | Large border radius |
### Utility Classes
| Class | Description |
|---|---|
| `.panel-header` | Shared header row — used in all three panels |
| `.btn-reset` | Resets button styles (no border, bg, cursor pointer) |
| `.btn-icon` | Icon button with hover state |
| `.btn-primary` | Accent-coloured action button with `:hover` and `:disabled` states |
| `.flex` / `.flex-col` | Flex layout helpers |
| `.flex-1` / `.flex-shrink` | Flex sizing helpers |
| `.items-center` / `.justify-center` / `.justify-between` | Alignment helpers |
| `.overflow-hidden` / `.scroll-y` | Overflow helpers |
| `.text-xs` / `.text-sm` / `.text-base` | Font size helpers |
| `.text-muted` / `.text-secondary` / `.text-accent` | Colour helpers |
| `.label-upper` | Uppercase section label style |
| `.truncate` | Text overflow ellipsis |
## API Layer ## API Layer
@@ -78,39 +136,71 @@ All orchestration calls are centralised in `src/api/orchestration.js`:
| `fetchSessionHistory` | GET | /sessions/:id/history | Load episode history on session select | | `fetchSessionHistory` | GET | /sessions/:id/history | Load episode history on session select |
| `sendMessage` | POST | /chat | Send message, await full response | | `sendMessage` | POST | /chat | Send message, await full response |
| `streamMessage` | POST | /chat/stream | Send message, receive SSE token stream | | `streamMessage` | POST | /chat/stream | Send message, receive SSE token stream |
| `fetchModels` | GET | /models | Load available models from manifest |
| `renameSession` | PATCH | /sessions/:id | Rename a session |
| `deleteSession` | DELETE | /sessions/:id | Delete a session |
`streamMessage` returns an abort function — call it to cancel a stream mid-flight. `streamMessage` returns an abort function — call it to cancel a stream mid-flight.
It uses a buffer pattern to handle SSE chunks that may span multiple network packets. Uses a buffer pattern to handle SSE chunks that may span multiple network packets.
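The buffer pattern can be sketched roughly as below. This is an illustration only, not the code in `orchestration.js`: `createSseParser` and `onEvent` are hypothetical names, and the sketch assumes every event is a single `data:` line terminated by a blank line.

```js
// Hypothetical helper: accumulates raw text chunks and emits complete SSE events.
function createSseParser(onEvent) {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    // Events are separated by a blank line; the last piece may be incomplete.
    const parts = buffer.split('\n\n');
    buffer = parts.pop(); // keep the unfinished tail for the next chunk
    for (const part of parts) {
      const line = part.trim();
      if (!line.startsWith('data:')) continue;
      const payload = line.slice(5).trim();
      if (payload === '[DONE]') continue; // sentinel, not JSON
      onEvent(JSON.parse(payload));
    }
  };
}
```

Feeding the parser one network chunk at a time means a JSON payload split across two packets is only parsed once its terminating blank line has arrived.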
## Streaming ## Streaming
The chat input sends messages via `POST /chat/stream`. Tokens arrive as SSE events: The chat input sends messages via `POST /chat/stream`. Tokens arrive as SSE events:
```
data: {"text":"Hello"} data: {"text":"Hello"}
data: {"text":" Tim"} data: {"text":" Tim"}
data: {"done":true} data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
```
An empty assistant bubble is appended immediately when the stream opens, then An empty assistant bubble is appended immediately when the stream opens, then
updated token by token using `updateLastMessage`. The blinking cursor in updated token by token using `updateLastMessage`. The blinking cursor in
`MessageBubble` is shown while `message.streaming === true` and disappears `MessageBubble` is shown while `message.streaming === true` and disappears
when `done` is received. when the done event is received. Model name and token count from the done
event are stored in `useChat` state and displayed in the InfoPanel.
## Model Selector ## Dynamic Model Selector
Available models are defined in `InfoPanel.jsx`: Available models are fetched from `GET /models` on mount via the `useModels` hook.
The hook initialises with `FALLBACK_MODELS` from `constants.js` and replaces them
with the server response on success. If the fetch fails, the fallback list stays in
use without surfacing an error in the UI; a warning is logged to the console.
| Label | Value | ```js
|---|---| // constants.js
| Companion | `companion:latest` | export const FALLBACK_MODELS = [
| Mistral Nemo | `mistral-nemo:latest` | { value: 'companion:latest', label: 'Companion' },
| Coder | `coder:latest` | // ...
| Qwen 2.5 Coder 14B | `qwen2.5-coder:14b` | ];
```
The selected model is passed with every chat request. To add a new model, The selected model is passed with every chat request. To add a model, update
update the `MODELS` array in `InfoPanel.jsx`. `models.json` on the main PC — no client rebuild needed.
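The fallback behaviour of `useModels` might look something like the sketch below, with the React state handling left out. `loadModels` and the injected `fetchFn` are hypothetical names, the fallback list is abbreviated, and treating an empty server response as a failure is an assumption rather than documented behaviour.

```js
// Abbreviated stand-in for the list in constants.js.
const FALLBACK_MODELS = [{ value: 'companion:latest', label: 'Companion' }];

// fetchFn is injected so the sketch does not depend on a particular HTTP client.
async function loadModels(fetchFn, fallback = FALLBACK_MODELS) {
  try {
    const models = await fetchFn();
    // Assumption: only a non-empty array replaces the fallback list.
    return Array.isArray(models) && models.length > 0 ? models : fallback;
  } catch (err) {
    console.warn('Failed to fetch models, using fallback list:', err.message);
    return fallback;
  }
}
```

In the hook, the resolved value would be written to state, so the selector always has at least the fallback entries to show.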
## Session Management ## Session Management
Sessions are identified by a `external_id` — a human-readable string or UUID Sessions are identified by `external_id` — a UUID generated client-side via the
generated client-side. New sessions are created locally with `uuid` and auto-registered `uuid` package. New sessions are created locally and auto-registered in the memory
in the memory service on the first message. The session list refreshes after each service on the first message. The session list refreshes after each completed
completed response to surface newly created sessions. response to surface newly created sessions.
### Session Actions
The session list supports rename and delete:
- **Hover** — reveals ✎ (rename) and ✕ (delete) icon buttons on the session row
- **Right-click** — opens a context menu with the same actions
Rename opens a `SessionModal` dialog. The modal is designed to expand into a full
session settings panel in future — the title is already "Session Settings" to
reflect this intent.
Delete is immediate; a confirmation dialog is planned for a future update.
Actions are disabled on unsaved (new) sessions that haven't had a message sent yet.
### Context Menu
The context menu is implemented via the `useContextMenu` hook, which tracks
`{ x, y, session }` state and attaches a `window` click listener to dismiss the
menu on any outside click. It is rendered outside the sidebar div (via a React
fragment) to avoid being clipped by `overflow: hidden`.

@@ -2,7 +2,7 @@
**Package:** `@nexusai/inference-service` **Package:** `@nexusai/inference-service`
**Location:** `packages/inference-service` **Location:** `packages/inference-service`
**Deployed on:** Main PC **Deployed on:** Main PC (192.168.0.79)
**Port:** 3001 **Port:** 3001
## Purpose ## Purpose
@@ -15,7 +15,7 @@ to switch inference backends without changes to the rest of the system.
## Dependencies ## Dependencies
- `express` — HTTP API - `express` — HTTP API
- `ollama` — Ollama client (used by the Ollama provider) - `ollama` — Ollama client (used by the Ollama provider, kept as fallback)
- `dotenv` — environment variable loading - `dotenv` — environment variable loading
- `@nexusai/shared` — shared utilities - `@nexusai/shared` — shared utilities
@@ -24,9 +24,13 @@ to switch inference backends without changes to the rest of the system.
| Variable | Required | Default | Description | | Variable | Required | Default | Description |
|---|---|---|---| |---|---|---|---|
| PORT | No | 3001 | Port to listen on | | PORT | No | 3001 | Port to listen on |
| INFERENCE_PROVIDER | No | ollama | Active inference provider (ollama, llamacpp) | | INFERENCE_PROVIDER | No | llamacpp | Active inference provider (`ollama` or `llamacpp`) |
| INFERENCE_URL | No | http://localhost:11434 | URL of the inference runtime | | INFERENCE_URL | No | http://localhost:8080 | URL of the inference runtime |
| DEFAULT_MODEL | No | llama3.2 | Default model name passed to the provider | | DEFAULT_MODEL | No | local-model | Default model name passed to the provider |
> `INFERENCE_URL` points to `llama-server` directly (port 8080), not to this
> service itself. The orchestration service uses `INFERENCE_SERVICE_URL` to
> reach this service on port 3001.
## Provider Architecture ## Provider Architecture
@@ -39,14 +43,87 @@ signatures, so the rest of the service is unaware of which backend is active.
| Provider | Value | Runtime | | Provider | Value | Runtime |
|---|---|---| |---|---|---|
| Ollama | `ollama` | Ollama via the `ollama` npm package | | llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) — **current default** |
| llama.cpp | `llamacpp` | llama.cpp server (OpenAI-compatible API) | | Ollama | `ollama` | Ollama via the `ollama` npm package — available as fallback |
Switching providers requires only a `.env` change — no code modifications needed. Switching providers requires only a `.env` change — no code modifications needed:
```
INFERENCE_PROVIDER=llamacpp INFERENCE_PROVIDER=llamacpp
INFERENCE_URL=http://localhost:8080 INFERENCE_URL=http://localhost:8080
```
### Provider Validation
The provider loader validates `INFERENCE_PROVIDER` at startup and throws immediately
if an unknown value is set, which prevents silent misconfiguration:
```
Error: Unknown inference provider: "foo". Valid options: ollama, llamacpp
```
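A loader that produces this error could be sketched as follows. The `PROVIDERS` registry and `loadProvider` are hypothetical names, and the placeholder objects stand in for the real provider modules; only the error message is taken from the text above.

```js
// Hypothetical registry; in the service each entry would be a provider module.
const PROVIDERS = {
  ollama: { name: 'ollama' },
  llamacpp: { name: 'llamacpp' },
};

function loadProvider(name = 'llamacpp') {
  // Own-property check so inherited keys such as "constructor" are rejected too.
  if (!Object.prototype.hasOwnProperty.call(PROVIDERS, name)) {
    const valid = Object.keys(PROVIDERS).join(', ');
    throw new Error(`Unknown inference provider: "${name}". Valid options: ${valid}`);
  }
  return PROVIDERS[name];
}
```

Calling the loader once at module load time is what turns a typo in `.env` into a startup failure instead of a runtime surprise.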
## llama.cpp Provider
The llama.cpp provider uses the OpenAI-compatible REST API exposed by `llama-server`.
### Starting llama-server
`llama-server` must be started manually on the main PC before the inference service
can handle requests. It loads a single model at startup:
```powershell
.\llama-gpu\llama-server.exe `
-m .\models\gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf `
-ngl 99 `
--reasoning off `
--host 0.0.0.0 `
--port 8080 `
-c 64000
```
Key flags:
| Flag | Description |
|---|---|
| `-m` | Path to the `.gguf` model file |
| `-ngl 99` | Offload as many layers as possible to GPU |
| `--reasoning off` | Disables thinking/reasoning delay on Gemma 4 models |
| `--host 0.0.0.0` | Allows connections from other machines on the LAN |
| `--port 8080` | Port for the llama-server HTTP API |
| `-c 64000` | Context window size in tokens |
> `-c 64000` is intentionally large. Monitor VRAM usage; if pressure builds,
> reduce this value. The NexusAI memory architecture handles context injection,
> so a smaller window (6-8K) is often sufficient.
### Model Naming
The model name sent in API requests must match the name as reported by
`llama-server` — including the `.gguf` extension. The reported name can be
verified with:
```powershell
Invoke-RestMethod -Uri "http://192.168.0.79:8080/v1/models"
```
Set `DEFAULT_MODEL` in `.env` to the exact reported name:
```
DEFAULT_MODEL=gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf
```
### Inference Parameters
The llamacpp provider maps NexusAI options to OpenAI-compatible fields:
| NexusAI option | API field | Default |
|---|---|---|
| `temperature` | `temperature` | 0.7 |
| `maxTokens` | `max_tokens` | 1024 |
| `topP` | `top_p` | 0.9 |
| `topK` | `top_k` | 40 |
| `repeatPenalty` | `repeat_penalty` | 1.1 |
| `seed` | `seed` | null (random) |
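The mapping in the table could be expressed along these lines. `toLlamaCppParams` is a hypothetical name, and omitting a `null` seed from the request body (so the server chooses one) is an assumption about how "random" is achieved.

```js
// Defaults taken from the table above.
const DEFAULTS = {
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  topK: 40,
  repeatPenalty: 1.1,
  seed: null,
};

function toLlamaCppParams(options = {}) {
  const params = {
    temperature: options.temperature ?? DEFAULTS.temperature,
    max_tokens: options.maxTokens ?? DEFAULTS.maxTokens,
    top_p: options.topP ?? DEFAULTS.topP,
    top_k: options.topK ?? DEFAULTS.topK,
    repeat_penalty: options.repeatPenalty ?? DEFAULTS.repeatPenalty,
  };
  const seed = options.seed ?? DEFAULTS.seed;
  // Assumption: a null seed is left out so the server picks a random one.
  if (seed !== null) params.seed = seed;
  return params;
}
```

Using `??` rather than `||` keeps explicit zero values, such as `temperature: 0`, from being replaced by the defaults.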
## Internal Structure ## Internal Structure
```
src/ src/
├── providers/ ├── providers/
│ ├── ollama.js # Ollama provider — uses ollama npm package │ ├── ollama.js # Ollama provider — uses ollama npm package
@@ -55,6 +132,27 @@ src/
│ └── inference.js # /complete and /complete/stream route handlers │ └── inference.js # /complete and /complete/stream route handlers
├── infer.js # Provider loader — selects and re-exports active provider ├── infer.js # Provider loader — selects and re-exports active provider
└── index.js # Express app + route definitions └── index.js # Express app + route definitions
```
## Streaming Response Format
The llama.cpp provider yields chunks in this shape:
```js
{ response: "token text", done: false }
// final chunk:
{ response: '', done: true, model: "model-name.gguf", tokenCount: 42 }
```
The inference route re-emits these as SSE events:
```
data: {"response":"token text"}
data: {"done":true,"model":"model-name.gguf","tokenCount":42}
data: [DONE]
```
`model` and `tokenCount` are captured from the llama.cpp `finish_reason: stop`
chunk (`usage.completion_tokens`) and emitted on the done event so the
orchestration layer can forward them to the client.
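The re-emission step can be pictured as a small formatter from provider chunks to SSE lines. This is a sketch under the shapes shown above; `toSseLines` is a hypothetical name and the real route handler may be organised differently.

```js
// Converts one provider chunk into the SSE lines written to the response.
function toSseLines(chunk) {
  if (!chunk.done) {
    return [`data: ${JSON.stringify({ response: chunk.response })}\n\n`];
  }
  const doneEvent = { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
  // The final chunk produces the done event followed by the [DONE] sentinel.
  return [`data: ${JSON.stringify(doneEvent)}\n\n`, 'data: [DONE]\n\n'];
}
```

A route handler would call `res.write(line)` for each returned line and `res.end()` after the sentinel.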
## Endpoints ## Endpoints
@@ -79,7 +177,7 @@ Request body:
```json ```json
{ {
"prompt": "What is the capital of France?", "prompt": "What is the capital of France?",
"model": "companion:latest", "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"temperature": 0.7, "temperature": 0.7,
"maxTokens": 1024 "maxTokens": 1024
} }
@@ -93,33 +191,26 @@ Response:
```json ```json
{ {
"text": "The capital of France is Paris.", "text": "The capital of France is Paris.",
"model": "companion:latest", "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"done": true, "done": true,
"evalCount": 8, "evalCount": 8,
"promptEvalCount": 41 "promptEvalCount": 41
} }
``` ```
| Field | Description |
|---|---|
| `text` | The model's response |
| `model` | Model name as reported by the provider |
| `done` | Whether generation completed normally |
| `evalCount` | Number of tokens generated |
| `promptEvalCount` | Number of tokens in the prompt |
--- ---
**POST /complete/stream** **POST /complete/stream**
Same request body as `/complete` (`maxTokens` not applicable for streaming). Same request body as `/complete`.
Response is a stream of Server-Sent Events. Each event contains a partial Response is a stream of Server-Sent Events:
response chunk as JSON. The stream closes with a final `data: [DONE]` event. ```
data: {"model":"companion:latest","response":"The","done":false} data: {"response":"The"}
data: {"model":"companion:latest","response":" capital","done":false} data: {"response":" capital of France is Paris."}
data: {"model":"companion:latest","response":" of France is Paris.","done":false} data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":8}
data: [DONE] data: [DONE]
```
Clients should read the `response` field from each chunk and accumulate Clients should accumulate `response` fields to build the full response string.
them to build the full response string. The `done` event carries `model` and `tokenCount` for display in the UI.

@@ -34,7 +34,7 @@ service to generate and store a vector in Qdrant.
``` ```
src/ src/
├── db/ ├── db/
│ ├── index.js # SQLite connection + initialization │ ├── index.js # SQLite connection + initialization + migrations
│ └── schema.js # Table definitions, indexes, FTS5, triggers │ └── schema.js # Table definitions, indexes, FTS5, triggers
├── episodic/ ├── episodic/
│ └── index.js # Session + episode CRUD, FTS search, embedding write path │ └── index.js # Session + episode CRUD, FTS search, embedding write path
@@ -49,12 +49,29 @@ src/
Five core tables: Five core tables:
- **sessions** — top-level conversation containers, identified by an `external_id` - **sessions** — top-level conversation containers, identified by an `external_id` and optional `name`
- **episodes** — individual exchanges (user message + AI response) tied to a session - **episodes** — individual exchanges (user message + AI response) tied to a session
- **entities** — named things the system learns about (people, places, concepts) - **entities** — named things the system learns about (people, places, concepts)
- **relationships** — directional labeled links between entities - **relationships** — directional labeled links between entities
- **summaries** — condensed episode groups for efficient context retrieval - **summaries** — condensed episode groups for efficient context retrieval
### Migrations
Schema changes that cannot be expressed in `CREATE TABLE IF NOT EXISTS` are applied
as migrations in `db/index.js` at startup, wrapped in try/catch to safely ignore
already-applied changes:
```js
try {
db.exec(`ALTER TABLE sessions ADD COLUMN name TEXT`);
} catch {
// Column already exists — safe to ignore on subsequent startups
}
```
Current migrations:
- `ALTER TABLE sessions ADD COLUMN name TEXT` — adds display name to sessions
### FTS5 Full-Text Search ### FTS5 Full-Text Search
An `episodes_fts` virtual table enables keyword search across all episodes. An `episodes_fts` virtual table enables keyword search across all episodes.
@@ -144,9 +161,14 @@ Entities and relationships are stored in SQLite with two key constraints:
| Method | Path | Description | | Method | Path | Description |
|---|---|---| |---|---|---|
| POST | /sessions | Create a new session | | POST | /sessions | Create a new session |
| GET | /sessions | Get paginated list of all sessions |
| GET | /sessions/:id | Get session by internal ID | | GET | /sessions/:id | Get session by internal ID |
| GET | /sessions/by-external/:externalId | Get session by external ID | | GET | /sessions/by-external/:externalId | Get session by external ID |
| DELETE | /sessions/:id | Delete session (cascades to episodes + summaries) | | PATCH | /sessions/by-external/:externalId | Update session name |
| DELETE | /sessions/by-external/:externalId | Delete session (cascades to episodes + summaries) |
> Route ordering matters in Express: `by-external/:externalId` must be defined before
> `/:id` to prevent the literal string `by-external` being captured as an ID parameter.
**POST /sessions body:** **POST /sessions body:**
```json ```json
@@ -156,6 +178,20 @@ Entities and relationships are stored in SQLite with two key constraints:
} }
``` ```
**PATCH /sessions/by-external/:externalId body:**
```json
{
"name": "My Renamed Session"
}
```
Returns the updated session object. `name` is required and must be non-empty.
**DELETE /sessions/by-external/:externalId**
Returns `204 No Content` on success. Cascades to delete all associated episodes
and summaries via SQLite `ON DELETE CASCADE`.
### Episodes ### Episodes
| Method | Path | Description | | Method | Path | Description |

@@ -14,14 +14,10 @@ or inference services — all traffic flows through orchestration.
## Dependencies ## Dependencies
- `express` : HTTP API - `express` — HTTP API
- `cors` : cross-origin resource sharing middleware - `cors` — cross-origin resource sharing middleware
- `node-fetch` : inter-service HTTP communication (memory service client only) - `dotenv` — environment variable loading
- `dotenv` : environment variable loading - `@nexusai/shared` — shared utilities
- `@nexusai/shared` : shared utilities
> `memory.js` uses `node-fetch` v2 (pinned) because it is CommonJS. All other
> service clients use Node.js built-in `fetch`.
## Environment Variables ## Environment Variables
@@ -33,6 +29,7 @@ or inference services — all traffic flows through orchestration.
| INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL | | INFERENCE_SERVICE_URL | No | http://localhost:3001 | Inference service URL |
| QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search | | QDRANT_URL | No | http://localhost:6333 | Qdrant URL for semantic search |
| CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests | | CORS_ORIGIN | No | http://localhost:5173 | Allowed origin for CORS requests |
| MODELS_MANIFEST_PATH | Yes | — | Path to `models.json` manifest file |
## Internal Structure ## Internal Structure
``` ```
@@ -46,7 +43,8 @@ src/
│ └── index.js # Core pipeline logic — context assembly and coordination │ └── index.js # Core pipeline logic — context assembly and coordination
├── routes/ ├── routes/
│ ├── chat.js # POST /chat and POST /chat/stream route handlers │ ├── chat.js # POST /chat and POST /chat/stream route handlers
│ └── sessions.js # GET /sessions/:sessionId/history route handler │ ├── sessions.js # Session list, history, rename, and delete routes
│ └── models.js # GET /models — reads models.json manifest from disk
└── index.js # Express app entry point └── index.js # Express app entry point
``` ```
@@ -65,7 +63,7 @@ the client.
UUID for new conversations and pass it directly — no pre-creation step needed. UUID for new conversations and pass it directly — no pre-creation step needed.
2. **Recent episode retrieval** — fetches the most recent episodes for the session 2. **Recent episode retrieval** — fetches the most recent episodes for the session
(default: 10) from the memory service. (default: 5) from the memory service.
3. **Semantic search** — embeds the user message via the embedding service, then 3. **Semantic search** — embeds the user message via the embedding service, then
queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75). queries Qdrant for the top-5 most similar past episodes (score threshold: 0.75).
@@ -89,37 +87,68 @@ the client.
count to the client. count to the client.
## Prompt Structure ## Prompt Structure
```
[System prompt] [System prompt]
Here are some relevant memories from earlier conversations: Here are some relevant memories from earlier conversations:
User: {past user message} User: {past user message}
Assistant: {past ai response} Assistant: {past ai response}
... (up to 5 semantic episodes) ... (up to 5 semantic episodes)
Here is the recent conversation history: ---
Here are some relevant memories from your past conversations:
User: {past user message} User: {past user message}
Assistant: {past ai response} Assistant: {past ai response}
... (up to 10 recent episodes) ... (up to 5 recent episodes)
--- End of memories --- --- End of recent memories ---
User: {current message} User: {current message}
Assistant: Assistant:
```
Semantic episodes appear before recent episodes so the model encounters Semantic episodes appear before recent episodes so the model encounters
long-range relevant context before the immediate conversation flow. long-range relevant context before the immediate conversation flow.
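Assembling a prompt in this order might be sketched as follows. The section labels follow the updated structure shown above; `buildPrompt`, `formatEpisodes`, and the `user_message` / `ai_response` field names are assumptions made for illustration.

```js
// Renders a list of episodes as alternating User/Assistant lines.
function formatEpisodes(episodes) {
  return episodes
    .map((e) => `User: ${e.user_message}\nAssistant: ${e.ai_response}`)
    .join('\n');
}

// Semantic matches come first, then recent history, then the current turn.
function buildPrompt({ systemPrompt, semantic = [], recent = [], message }) {
  return [
    systemPrompt,
    'Here are some relevant memories from earlier conversations:',
    formatEpisodes(semantic),
    '---',
    'Here are some relevant memories from your past conversations:',
    formatEpisodes(recent),
    '--- End of recent memories ---',
    `User: ${message}`,
    'Assistant:',
  ].join('\n');
}
```

Ending the prompt on a bare `Assistant:` line leaves the model to continue with the reply itself.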
## SSE Stream Format ## SSE Stream Format
The inference service emits chunks in this format: The inference service emits chunks from the llama.cpp provider in this format:
data: {"model":"companion:latest","response":"Hello","done":false} ```
data: {"model":"companion:latest","response":"!","done":true,"eval_count":3,...} data: {"response":"Hello","done":false}
data: {"response":"!","done":false}
data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
data: [DONE] data: [DONE]
```
The orchestration service re-emits to the client as: The orchestration service re-emits to the client as:
```
data: {"text":"Hello"} data: {"text":"Hello"}
data: {"text":"!"} data: {"text":"!"}
data: {"done":true} data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":42}
```
The `[DONE]` sentinel from the inference service is consumed internally The `[DONE]` sentinel from the inference service is consumed internally
and not forwarded. The client stream is terminated by `res.end()` after and not forwarded. The client stream is terminated by `res.end()` after
the `{"done":true}` event. the done event. Model name and token count are included on the done event
so the client can display them in the UI.
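The translation between the two formats can be sketched as a pure function over one inference payload at a time. `toClientEvent` is a hypothetical name; it takes the text after `data: ` and returns the object to forward, or `null` when nothing should be sent.

```js
// Maps one inference-service payload to the event forwarded to the client.
function toClientEvent(payload) {
  if (payload === '[DONE]') return null; // sentinel is consumed, not forwarded
  const chunk = JSON.parse(payload);
  if (chunk.done) {
    return { done: true, model: chunk.model, tokenCount: chunk.tokenCount };
  }
  return { text: chunk.response };
}
```

The caller would write each non-null result as `data: <json>` and call `res.end()` once it has forwarded an event with `done: true`.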
## Models Manifest
The `/models` endpoint reads a `models.json` file from disk at the path
specified by `MODELS_MANIFEST_PATH`. The file lives on the main PC alongside
the model files, and is accessible to orchestration via a network share
mounted at `/mnt/nexus-models`.
The manifest is read fresh on each request — no restart needed when models
are added or removed.
**models.json format:**
```json
[
{ "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
]
```
- `value` — must match the model name as reported by `llama-server` (including `.gguf` extension)
- `label` — display name shown in the UI
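A check of the manifest shape could look like the sketch below. `parseManifest` is a hypothetical name, and the validation rules are assumptions: the text above only states that the file is read on each request and that `/models` returns the parsed contents.

```js
// Parses manifest text and checks the shape described above.
function parseManifest(jsonText) {
  const models = JSON.parse(jsonText); // throws on malformed JSON
  if (!Array.isArray(models)) {
    throw new Error('models.json must contain an array');
  }
  for (const m of models) {
    if (!m || typeof m.value !== 'string' || typeof m.label !== 'string') {
      throw new Error('Each model needs string "value" and "label" fields');
    }
  }
  return models;
}
```

A route handler could read the file at `MODELS_MANIFEST_PATH`, pass the text through this function, and map any thrown error to a `500` response.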
## Endpoints ## Endpoints
@@ -142,6 +171,14 @@ the `{"done":true}` event.
|---|---|---| |---|---|---|
| GET | /sessions | Get paginated list of all sessions | | GET | /sessions | Get paginated list of all sessions |
| GET | /sessions/:sessionId/history | Get paginated episode history for a session | | GET | /sessions/:sessionId/history | Get paginated episode history for a session |
| PATCH | /sessions/:sessionId | Rename a session |
| DELETE | /sessions/:sessionId | Delete a session and all its episodes |
### Models
| Method | Path | Description |
|---|---|---|
| GET | /models | Get list of available models from manifest file |
--- ---
@@ -152,7 +189,7 @@ Request body:
{ {
"sessionId": "your-session-uuid", "sessionId": "your-session-uuid",
"message": "Hello, my name is Tim.", "message": "Hello, my name is Tim.",
"model": "companion:latest", "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"temperature": 0.7 "temperature": 0.7
} }
``` ```
@@ -165,7 +202,7 @@ Response:
{ {
"sessionId": "your-session-uuid", "sessionId": "your-session-uuid",
"response": "Hello Tim! How can I help you today?", "response": "Hello Tim! How can I help you today?",
"model": "companion:latest", "model": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf",
"tokenCount": 87 "tokenCount": 87
} }
``` ```
@@ -176,23 +213,34 @@ Response:
Same request body as `POST /chat`. Same request body as `POST /chat`.
Response is a stream of Server-Sent Events. Each event contains a text Response is a stream of Server-Sent Events:
delta. The stream ends with a `done` event. ```
data: {"text":"Hello"} data: {"text":"Hello"}
data: {"text":" Tim"} data: {"text":" Tim"}
data: {"text":"!"} data: {"done":true,"model":"gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf","tokenCount":87}
data: {"done":true} ```
Clients should read the `text` field from each chunk and accumulate them ---
to build the full response string. The connection is closed by the server
after the `{"done":true}` event. **PATCH /sessions/:sessionId**
Request body:
```json
{ "name": "My Renamed Session" }
```
Returns the updated session object. `name` is required and trimmed of whitespace.
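The "required and trimmed" rule might be implemented along these lines. `normaliseSessionName` is a hypothetical helper, and the `400` status for a missing name is an assumption; the documentation does not state which error code the route returns.

```js
// Validates a PATCH body and returns either the trimmed name or an error.
function normaliseSessionName(body) {
  const name = typeof body?.name === 'string' ? body.name.trim() : '';
  if (name === '') {
    // Assumption: missing or blank names are treated as a client error.
    return { ok: false, status: 400, error: 'name is required' };
  }
  return { ok: true, name };
}
```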
---
**DELETE /sessions/:sessionId**
Returns `204 No Content`. Cascades to delete all episodes for the session.
--- ---
**GET /sessions/:sessionId/history** **GET /sessions/:sessionId/history**
Returns paginated episode history for a session identified by its external ID.
Query parameters: Query parameters:
| Parameter | Default | Description | | Parameter | Default | Description |
@@ -218,30 +266,17 @@ Response:
} }
``` ```
Episodes are ordered newest first.
--- ---
**GET /sessions** **GET /models**
Returns a paginated list of all sessions, ordered by most recently active. Returns the parsed contents of `models.json`:
Query parameters:
| Parameter | Default | Description |
|---|---|---|
| limit | 20 | Maximum number of sessions to return |
| offset | 0 | Number of sessions to skip (for pagination) |
Response:
```json ```json
[ [
{ { "value": "gemma-4-26B-A4B-Claude-Distill-APEX-I-Mini.gguf", "label": "Gemma 4 26B Claude Distill" }
"id": 1,
"external_id": "test-semantic",
"metadata": null,
"created_at": 1712345678,
"updated_at": 1712345999
}
] ]
``` ```
Episodes are ordered newest first. Returns `404` if the session does not exist. Returns `500` if the manifest file cannot be read or parsed.

@@ -24,13 +24,40 @@ const DB = getEnv('SQLITE_PATH'); // required — throws if missing
--- ---
### `parseRow(row)`
Parses a SQLite row object, deserialising any JSON-encoded `metadata` fields
into plain objects. Returns `null` if the row is `null` or `undefined`.
```js
const { parseRow } = require('@nexusai/shared');
const session = parseRow(db.prepare('SELECT * FROM sessions WHERE id = ?').get(id));
```
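For orientation, a function with this behaviour could be as small as the sketch below. It is not the implementation in `@nexusai/shared`: the name `parseRowSketch` is hypothetical, and leaving unparseable `metadata` untouched is an assumption.

```js
// Returns a copy of the row with a JSON-encoded metadata string deserialised.
function parseRowSketch(row) {
  if (row === null || row === undefined) return null;
  const parsed = { ...row };
  if (typeof parsed.metadata === 'string') {
    try {
      parsed.metadata = JSON.parse(parsed.metadata);
    } catch {
      // Assumption: malformed metadata is left as the original string.
    }
  }
  return parsed;
}
```

Copying the row before modifying it keeps the object returned by the database driver unchanged.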
---
### `formatEpisodeText(userMessage, aiResponse)`
Combines a user message and AI response into the canonical text format used
for embedding:
```
User: {userMessage}
Assistant: {aiResponse}
```
Used by the memory service's embedding write path to ensure consistent
vector representations across all episodes.
---
### Constants
Tuneable values and shared identifiers are centralised in `constants.js`
rather than hardcoded across services. Import the relevant group by name.
```js
const { QDRANT, COLLECTIONS, EPISODIC, LLAMACPP } = require('@nexusai/shared');
```
#### `QDRANT`
@@ -40,15 +67,14 @@ embedding model and Qdrant collection setup.
| Key | Value | Description |
|---|---|---|
| `DEFAULT_URL` | `http://localhost:6333` | Fallback Qdrant URL |
| `VECTOR_SIZE` | `768` | Output dimensions of `nomic-embed-text` |
| `DISTANCE_METRIC` | `'Cosine'` | Similarity metric used for all collections |
| `DEFAULT_LIMIT` | `10` | Default top-k for vector searches |
#### `COLLECTIONS`
Canonical Qdrant collection names.
| Key | Value |
|---|---|
@@ -65,6 +91,8 @@ Default pagination and result limits for SQLite episode queries.
| `DEFAULT_RECENT_LIMIT` | `10` | Default number of recent episodes to retrieve |
| `DEFAULT_PAGE_SIZE` | `20` | Default episodes per page for paginated queries |
| `DEFAULT_SEARCH_LIMIT` | `10` | Default number of FTS search results to return |
| `DEFAULT_OFFSET` | `0` | Default pagination offset |
| `DEFAULT_SESSIONS_LIMIT` | `20` | Default number of sessions to return |
#### `SERVICES`
@@ -74,3 +102,75 @@ when the corresponding environment variable is not set.
| Key | Value | Description |
|---|---|---|
| `EMBEDDING_URL` | `http://localhost:3003` | Fallback embedding service URL |
| `MEMORY_URL` | `http://localhost:3002` | Fallback memory service URL |
| `INFERENCE_URL` | `http://localhost:3001` | Fallback inference service URL |
#### `PORTS`
Default port numbers for each service.
| Key | Value |
|---|---|
| `INFERENCE` | `'3001'` |
| `MEMORY` | `'3002'` |
| `EMBEDDING` | `'3003'` |
| `ORCHESTRATION` | `'4000'` |
#### `OLLAMA`
Ollama runtime defaults — used by the Ollama inference provider.
| Key | Value | Description |
|---|---|---|
| `DEFAULT_URL` | `http://localhost:11434` | Fallback Ollama URL |
| `EMBED_MODEL` | `'nomic-embed-text'` | Default embedding model |
| `OLLAMA_MODEL` | `'companion:latest'` | Default chat model |
#### `LLAMACPP`
llama.cpp runtime defaults — used by the llama.cpp inference provider.
| Key | Value | Description |
|---|---|---|
| `DEFAULT_URL` | `http://localhost:8080` | Fallback llama-server URL |
| `DEFAULT_MODEL` | `'local-model'` | Fallback model name (override via `DEFAULT_MODEL` env var) |
> Always set `DEFAULT_MODEL` in the inference service `.env` to the exact model
> name reported by `llama-server` (including `.gguf` extension). The shared
> constant is a last-resort fallback only.
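The precedence described in the note (environment variable first, shared constant last) might look like the sketch below. The `LLAMACPP` object here is a local stand-in with the values from the table, and `resolveModel` is a hypothetical helper name.

```js
// Local stand-in for the LLAMACPP group from @nexusai/shared (values from the table).
const LLAMACPP = {
  DEFAULT_URL: 'http://localhost:8080',
  DEFAULT_MODEL: 'local-model',
};

// Hypothetical helper: prefer DEFAULT_MODEL from the environment and fall
// back to the shared constant only when the variable is unset or empty.
function resolveModel(env = process.env) {
  return env.DEFAULT_MODEL || LLAMACPP.DEFAULT_MODEL;
}

console.log(resolveModel({ DEFAULT_MODEL: 'my-model.gguf' }));
console.log(resolveModel({}));
```

Because `'local-model'` is unlikely to match the name `llama-server` reports, relying on the fallback is best treated as a misconfiguration rather than a working default.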
#### `INFERENCE_DEFAULTS`
Default inference parameters applied when not specified in a request.
| Key | Value | Description |
|---|---|---|
| `TEMPERATURE` | `0.7` | Controls randomness (0 = deterministic, 1 = creative) |
| `MAX_TOKENS` | `1024` | Maximum tokens to generate |
| `TOP_P` | `0.9` | Nucleus sampling probability mass |
| `TOP_K` | `40` | Top-K candidates at each step |
| `REPEAT_PENALTY` | `1.1` | Penalty for recently used tokens |
| `SEED` | `null` | `null` picks a random seed; set an integer for reproducible outputs |
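One way defaults like these could be applied to an incoming request is sketched below. The constant is a local copy of the table values, and both the helper name `withInferenceDefaults` and the camelCase request fields are assumptions, not the inference service's actual request schema.

```js
// Local stand-in for INFERENCE_DEFAULTS from @nexusai/shared (values from the table).
const INFERENCE_DEFAULTS = {
  TEMPERATURE: 0.7,
  MAX_TOKENS: 1024,
  TOP_P: 0.9,
  TOP_K: 40,
  REPEAT_PENALTY: 1.1,
  SEED: null,
};

// Hypothetical helper: the request field names are assumptions.
// `??` only falls back on null/undefined, so an explicit 0 is preserved.
function withInferenceDefaults(request = {}) {
  return {
    temperature: request.temperature ?? INFERENCE_DEFAULTS.TEMPERATURE,
    maxTokens: request.maxTokens ?? INFERENCE_DEFAULTS.MAX_TOKENS,
    topP: request.topP ?? INFERENCE_DEFAULTS.TOP_P,
    topK: request.topK ?? INFERENCE_DEFAULTS.TOP_K,
    repeatPenalty: request.repeatPenalty ?? INFERENCE_DEFAULTS.REPEAT_PENALTY,
    seed: request.seed ?? INFERENCE_DEFAULTS.SEED,
  };
}

console.log(withInferenceDefaults({ temperature: 0, maxTokens: 256 }));
```

Using `??` rather than `||` matters here: a caller asking for `temperature: 0` (deterministic output) should not be silently bumped to `0.7`.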
#### `ORCHESTRATION`
Orchestration pipeline defaults.
| Key | Value | Description |
|---|---|---|
| `RECENT_EPISODE_LIMIT` | `5` | Recent episodes to inject into prompt |
| `SEMANTIC_LIMIT` | `5` | Semantic search results to inject into prompt |
| `SCORE_THRESHOLD` | `0.75` | Minimum similarity score for semantic results |
| `CORS_ORIGIN` | `'http://localhost:5173'` | Fallback allowed CORS origin |
| `SYSTEM_PROMPT` | *(see below)* | Default system prompt |
Default system prompt:
> "You are a helpful, context-aware AI assistant. You have access to memories
> of past conversations with the user. Use them to provide consistent,
> personalised responses."
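To illustrate how `SCORE_THRESHOLD` and `SEMANTIC_LIMIT` might interact, the sketch below filters and trims a list of semantic search hits. The `{ text, score }` result shape and the helper name `selectSemanticResults` are assumptions; the orchestration pipeline's real selection logic may differ.

```js
// Local stand-in for part of the ORCHESTRATION group (values from the table).
const ORCHESTRATION = {
  SEMANTIC_LIMIT: 5,
  SCORE_THRESHOLD: 0.75,
};

// Hypothetical helper: keep hits at or above the threshold, best first,
// capped at SEMANTIC_LIMIT. The { text, score } shape is an assumption.
function selectSemanticResults(results) {
  return results
    .filter((r) => r.score >= ORCHESTRATION.SCORE_THRESHOLD)
    .sort((a, b) => b.score - a.score)
    .slice(0, ORCHESTRATION.SEMANTIC_LIMIT);
}

const hits = [
  { text: 'likes hiking', score: 0.91 },
  { text: 'unrelated note', score: 0.42 },
  { text: 'lives in Oslo', score: 0.78 },
];
console.log(selectSemanticResults(hits).map((r) => r.text));
```

Raising the threshold trades recall for precision: fewer memories reach the prompt, but those that do are more likely to be relevant.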
#### `SQLITE`
| Key | Value | Description |
|---|---|---|
| `DEFAULT_PATH` | `'./data/nexusai.db'` | Fallback SQLite database path |