retrieval fusion

2026-04-27 07:03:46 -07:00
parent 27ad614130
commit 055683424d
6 changed files with 188 additions and 14 deletions
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -73,7 +73,7 @@ service by ID after the vector search.
 The core four-service architecture is complete and operational. Key capabilities:
- **Hybrid memory retrieval** — recent episodes + semantic search combined into every prompt
+- **Retrieval fusion** — Reciprocal Rank Fusion (RRF) merges semantic (Qdrant vector search) and keyword (SQLite FTS5) episode retrieval into a single ranked result set. Weights are configurable per strategy via settings; keyword search is off by default (`keywordWeight: 0`) and can be enabled without a service restart
 - **Entity layer + Knowledge graph** — automatic extraction of named entities and relationships from conversations via qwen2.5:3b. Entities and relationships are stored in SQLite with `mention_count` tracking. A graph traversal layer expands Qdrant entity search hits into a 1-hop neighborhood subgraph, injecting structured connected knowledge into every prompt
 - **Projects** — sessions grouped with shared or isolated memory pools
 - **Auto-naming** — sessions named automatically from first exchange via inference
--- a/docs/reference/API-routes.md
+++ b/docs/reference/API-routes.md
@@ -202,7 +202,9 @@ Returns `503` if llama-server is unreachable.
 |---|---|---|---|
 | `recentEpisodeLimit` | integer | 1–20 | Recent episodes injected into prompt |
 | `semanticLimit` | integer | 1–20 | Max semantic search results |
-| `scoreThreshold` | float | 0–1 | Minimum similarity score |
+| `scoreThreshold` | float | 0–1 | Minimum similarity score for Qdrant results |
+| `semanticWeight` | float | 0–5 | RRF weight for Qdrant semantic results |
+| `keywordWeight` | float | 0–5 | RRF weight for FTS5 keyword results (`0` = disabled) |
 | `modelsFolderPath` | string | — | Path to folder containing .gguf files |
 | `temperature` | float | 0–2 | Inference randomness |
 | `repeatPenalty` | float | 1–2 | Repeat token penalty |
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -64,11 +64,11 @@ The highest-leverage memory upgrade. Transforms NexusAI from "remembers conversa
 - [x] Relationship traversal queries
 - [x] Graph-aware context assembly in orchestration
-### 2. Retrieval Fusion + Full-Text Search
+### 2. Retrieval Fusion + Full-Text Search ✅
 Multi-strategy retrieval merged into a single ranked result set.
- [ ] Reciprocal Rank Fusion (RRF) — merge semantic (Qdrant) + keyword (FTS5) results
+- [x] Reciprocal Rank Fusion (RRF) — merge semantic (Qdrant) + keyword (FTS5) results
- [ ] Configurable weights per retrieval strategy
+- [x] Configurable weights per retrieval strategy (`semanticWeight`, `keywordWeight` via `PATCH /settings`)
- [ ] Score threshold tuning per collection
+- [x] Score threshold retained per-strategy; FTS scoped to session/project sessions; `keywordWeight: 0` default (disabled until tuned)
 ### 3. Memory Consolidation Lifecycle
 Prevents long-term memory degradation and enables compression.
--- a/docs/services/orchestration-service.md
+++ b/docs/services/orchestration-service.md
@@ -72,7 +72,9 @@ via `appSettings.load()` — changes apply immediately without a service restart
 |---|---|---|
 | `recentEpisodeLimit` | 5 | Recent episodes injected into prompt |
 | `semanticLimit` | 5 | Semantic search results injected into prompt |
-| `scoreThreshold` | 0.5 | Minimum similarity score for semantic results |
+| `scoreThreshold` | 0.5 | Minimum similarity score for Qdrant semantic results |
+| `semanticWeight` | 1.0 | RRF weight for Qdrant semantic results |
+| `keywordWeight` | 0 | RRF weight for FTS5 keyword results (`0` = disabled) |
 | `modelsFolderPath` | `/mnt/nexus-models` | Path to folder containing .gguf files |
 | `temperature` | 0.7 | Inference temperature |
 | `repeatPenalty` | 1.1 | Repeat token penalty |
@@ -101,8 +103,12 @@ difference is how the inference response is delivered to the client.
 4. **Recent episode retrieval** — fetch most recent episodes (`recentEpisodeLimit`).
-5. **Semantic search** — embed user message, query Qdrant for similar past
+5. **Fused episode retrieval** — runs semantic (Qdrant) and keyword (FTS5)
-   episodes. Deduplicated against recent episodes. Non-critical.
+   search in parallel, then merges results via Reciprocal Rank Fusion (RRF).
+   Both paths are filtered against `recentIds` before fusion. FTS is scoped
+   to the current session or all project sessions. If `keywordWeight` is `0`,
+   the FTS call is skipped entirely. Non-critical — failures fall back to
+   whichever strategy succeeded.
 6. **Entity search** — query `entities` Qdrant collection filtered by
   `projectId`. Returns entity IDs alongside Qdrant payload data (the Qdrant
@@ -114,8 +120,8 @@ difference is how the inference response is delivered to the client.
   If no entities were found or the graph call fails, falls back to flat entity
   list (no edges). Non-critical.
-8. **Prompt assembly** — combine system prompt, graph context, semantic
+8. **Prompt assembly** — combine system prompt, graph context, fused episodes,
-   episodes, recent episodes, and user message.
+   recent episodes, and user message.
 9. **Inference** — send to inference service. `/chat` awaits full response;
   `/chat/stream` pipes SSE chunks to the client.
--- a/docs/services/retrieval-fusion.md
+++ b/docs/services/retrieval-fusion.md
@@ -0,0 +1,153 @@
 # Retrieval Fusion
 **Implementation:** `packages/orchestration-service/src/chat/index.js`  
 **FTS scoping:** `packages/memory-service/src/episodic/index.js`, `src/index.js`  
 **Settings:** `semanticWeight`, `keywordWeight` via `PATCH /settings`
 ## Purpose
 Rather than relying solely on Qdrant vector similarity (which finds semantically
 related content but misses exact keyword matches) or FTS5 keyword search alone
 (which finds exact matches but not paraphrases), Reciprocal Rank Fusion (RRF)
 merges the ranked results from both strategies into a single better-ranked list.
 Episodes that rank highly in **both** lists score highest. An episode that is
 the top semantic match but irrelevant by keyword, or vice versa, scores lower
 than one that satisfies both.
 ## How RRF Works
 For each episode `d`, its fused score is:
 ```
 RRF(d) = w_semantic / (k + rank_semantic(d))
        + w_keyword  / (k + rank_keyword(d))
 ```
 - `rank_i(d)` — 1-based position in that strategy's result list (episode absent from a list contributes 0 for that term)
 - `k = 60` — smoothing constant (standard; not exposed in settings)
 - `w_semantic`, `w_keyword` — user-tunable weights; their defaults come from the `RETRIEVAL` constants
 Setting a weight to `0` removes that strategy's contribution entirely. Setting
 `keywordWeight` to `0` also short-circuits the FTS network call.
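
As a worked example of the formula above, the sketch below computes fused scores for a few sample rank combinations. The helper name `rrfScore` and the sample ranks are illustrative assumptions, not part of the codebase; a rank of `null` stands for an episode that is absent from that strategy's list.

```javascript
// Illustrative only: `rrfScore` is a hypothetical helper, not a function in the codebase.
const K = 60; // smoothing constant from the formula above

// Ranks are 1-based; null means the episode is absent from that strategy's list.
function rrfScore({ semanticRank, keywordRank }, { semanticWeight, keywordWeight }) {
    const term = (weight, rank) => (rank === null ? 0 : weight / (K + rank));
    return term(semanticWeight, semanticRank) + term(keywordWeight, keywordRank);
}

const equalWeights = { semanticWeight: 1.0, keywordWeight: 1.0 };

// Present in both lists: 1/62 + 1/63
const both = rrfScore({ semanticRank: 2, keywordRank: 3 }, equalWeights);

// Top semantic hit, absent from the keyword list: 1/61
const semanticOnly = rrfScore({ semanticRank: 1, keywordRank: null }, equalWeights);

// Same ranks as `both`, but with the keyword weight set to 0: 1/62
const keywordOff = rrfScore(
    { semanticRank: 2, keywordRank: 3 },
    { semanticWeight: 1.0, keywordWeight: 0 }
);

console.log(both > semanticOnly); // true
```

With equal weights, the episode that appears in both lists outscores the top hit of a single list, which is the behaviour described under Purpose.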
 ## Architecture
 Fusion lives in orchestration — the service already coordinates multiple data
 sources, and fusion is a retrieval strategy, not a storage concern.
 ```
 getFusedEpisodes()
 ├── getSemanticEpisodes()     — Qdrant embed+search → fetch full rows by ID
 │   (existing path, unchanged)
 └── getFTSResults()           — memory-service /episodes/search → full rows directly
    (skipped entirely if keywordWeight == 0)
         ↓
 fuseEpisodeResults()          — pure RRF, no I/O
         ↓
 fusedEpisodes[]               — top semanticLimit episodes by RRF score
 ```
 ### Data Shape Consistency
 Both sides must enter fusion as `Episode[]` — full SQLite row objects with
 the same shape — and both must be filtered against `recentIds` first:
 - **Semantic path**: `recentIds` filter applied before `getEpisodeById` fetch (existing behaviour)
 - **FTS path**: full rows returned directly; `recentIds` filter applied in `getFusedEpisodes` after receiving them
 The FTS call requests `semanticLimit * 2` rows so that enough candidates
 remain for fusion after the `recentIds` filter has removed some of them.
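
The headroom described above can be illustrated with a small sketch. The helper name `dropRecent` and the sample rows are assumptions made for this example; the real rows are full SQLite episode objects rather than bare IDs.

```javascript
// Illustrative only: `dropRecent` is a hypothetical helper name.
// Removes episodes that are already in the prompt as recent episodes.
function dropRecent(episodes, recentIds) {
    const recent = new Set(recentIds);
    return episodes.filter((ep) => !recent.has(ep.id));
}

const semanticLimit = 2;
const ftsLimit = semanticLimit * 2; // headroom requested from FTS

// Pretend FTS returned `ftsLimit` rows, two of which are also recent episodes.
const ftsRows = [{ id: 7 }, { id: 3 }, { id: 9 }, { id: 4 }];
const recentIds = [7, 9];

const keywordEps = dropRecent(ftsRows, recentIds);
console.log(keywordEps.map((ep) => ep.id)); // [ 3, 4 ]
```

Had only `semanticLimit` rows been requested in this example, the filter would have left a single keyword candidate for fusion.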
 ## FTS Session Scoping
 Without scoping, FTS5 searches across all episodes in the database. For
 context assembly, results must be constrained to the current session or
 project session pool — the same scope used for Qdrant semantic search.
 `searchEpisodes(query, limit, sessionIds)` in memory-service accepts an
 optional `sessionIds` array. When provided, the SQL becomes:
 ```sql
 SELECT e.* FROM episodes e
 JOIN episodes_fts fts ON e.id = fts.rowid
 WHERE episodes_fts MATCH ?
 AND e.session_id IN (?, ?, ...)
 ORDER BY rank
 LIMIT ?
 ```
 The HTTP endpoint `GET /episodes/search` accepts `sessionIds` as a
 comma-separated query param: `?q=hello&sessionIds=1,2,3`.
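
The placeholder list in the query above has to match the number of session IDs, and the comma-separated param has to be turned into an array first. A minimal sketch of both steps follows; `parseSessionIds` and `buildSearchQuery` are hypothetical names, and the assumption that session IDs are integers is taken only from the `1,2,3` example.

```javascript
// Illustrative only: both function names are hypothetical.

// "1,2,3" -> [1, 2, 3]; a missing or empty param means "no scoping".
// Assumes integer session IDs, as in the example URL.
function parseSessionIds(param) {
    if (!param) return [];
    return param
        .split(",")
        .map((part) => Number.parseInt(part.trim(), 10))
        .filter((n) => Number.isInteger(n));
}

// Builds the SQL text plus the positional parameters that go with it.
function buildSearchQuery(query, limit, sessionIds = []) {
    const scoped = sessionIds.length > 0;
    const placeholders = sessionIds.map(() => "?").join(", ");
    const sql = [
        "SELECT e.* FROM episodes e",
        "JOIN episodes_fts fts ON e.id = fts.rowid",
        "WHERE episodes_fts MATCH ?",
        scoped ? `AND e.session_id IN (${placeholders})` : null,
        "ORDER BY rank",
        "LIMIT ?",
    ]
        .filter(Boolean)
        .join("\n");
    // Parameter order must follow the order of the ? markers in the SQL.
    return { sql, params: [query, ...sessionIds, limit] };
}

const scopedQuery = buildSearchQuery("hello", 10, parseSessionIds("1,2,3"));
console.log(scopedQuery.params); // [ 'hello', 1, 2, 3, 10 ]
```

Binding the IDs as parameters rather than interpolating them into the SQL string keeps the query safe when the values come from a query string.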
 In orchestration, `ftsSessionIds` is set to:
 - `projectSessionIds` (all sessions in the project) — if the session belongs to a project
 - `[session.id]` — otherwise (single session only)
 This mirrors the Qdrant scoping logic exactly.
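
The scope selection above can be sketched as a small function. The name `resolveFtsSessionIds` and the `projectId` field on the session object are assumptions for illustration; the actual check in orchestration may be written differently.

```javascript
// Illustrative only: the function name and `session.projectId` are assumptions.
function resolveFtsSessionIds(session, projectSessionIds) {
    const inProject = session.projectId !== null && session.projectId !== undefined;
    // Project sessions share a pool; standalone sessions only see themselves.
    return inProject ? projectSessionIds : [session.id];
}

const standalone = resolveFtsSessionIds({ id: 12, projectId: null }, []);
const inProject = resolveFtsSessionIds({ id: 12, projectId: 3 }, [10, 11, 12]);

console.log(standalone); // [ 12 ]
console.log(inProject);  // [ 10, 11, 12 ]
```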
 ## `fuseEpisodeResults` — Implementation Detail
 ```js
 function fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit }) {
    const k = RETRIEVAL.RRF_K; // 60
    const scores = new Map();  // episode.id → { episode, score }
    // Score semantic results (already filtered against recentIds)
    semanticEps.forEach((ep, i) => {
        scores.set(ep.id, { episode: ep, score: semanticWeight / (k + i + 1) });
    });
    // Score + merge keyword results (already filtered against recentIds)
    keywordEps.forEach((ep, i) => {
        const contrib = keywordWeight / (k + i + 1);
        if (scores.has(ep.id)) {
            scores.get(ep.id).score += contrib;   // appears in both — sum scores
        } else if (contrib > 0) {
            scores.set(ep.id, { episode: ep, score: contrib });  // FTS-only episode
        }
        // contrib == 0 (keywordWeight: 0) → episode not added (guard prevents score-0 bleed-through)
    });
    return [...scores.values()]
        .sort((a, b) => b.score - a.score)
        .slice(0, limit)
        .map(({ episode }) => episode);
 }
 ```
 The `else if (contrib > 0)` guard prevents FTS-only episodes from entering
 the result set with a score of 0 when `keywordWeight` is 0 — verified by
 the test suite.
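
To make the guard's effect concrete, the sketch below repeats the function so that it runs on its own and calls it with small sample inputs. The `RETRIEVAL` stub and the bare `{ id }` episodes are stand-ins for the shared constants and the full SQLite rows.

```javascript
// Stand-in for the shared constants so the example is self-contained.
const RETRIEVAL = { RRF_K: 60 };

// Same logic as fuseEpisodeResults above, repeated so the snippet runs alone.
function fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit }) {
    const k = RETRIEVAL.RRF_K;
    const scores = new Map();
    semanticEps.forEach((ep, i) => {
        scores.set(ep.id, { episode: ep, score: semanticWeight / (k + i + 1) });
    });
    keywordEps.forEach((ep, i) => {
        const contrib = keywordWeight / (k + i + 1);
        if (scores.has(ep.id)) {
            scores.get(ep.id).score += contrib;
        } else if (contrib > 0) {
            scores.set(ep.id, { episode: ep, score: contrib });
        }
    });
    return [...scores.values()]
        .sort((a, b) => b.score - a.score)
        .slice(0, limit)
        .map(({ episode }) => episode);
}

const semanticEps = [{ id: 1 }, { id: 2 }];
const keywordEps = [{ id: 2 }, { id: 3 }];
const ids = (eps) => eps.map((ep) => ep.id);

// Keyword search disabled: episode 3 (FTS-only) is never added.
const keywordOff = ids(fuseEpisodeResults(semanticEps, keywordEps,
    { semanticWeight: 1.0, keywordWeight: 0, limit: 5 }));

// Equal weights: episode 2 appears in both lists and moves to the top.
const keywordOn = ids(fuseEpisodeResults(semanticEps, keywordEps,
    { semanticWeight: 1.0, keywordWeight: 1.0, limit: 5 }));

// Semantic weight 0: episode 1 still enters, with a score of 0.
const semanticOff = ids(fuseEpisodeResults(semanticEps, keywordEps,
    { semanticWeight: 0, keywordWeight: 1.0, limit: 5 }));

console.log(keywordOff);  // [ 1, 2 ]
console.log(keywordOn);   // [ 2, 1, 3 ]
console.log(semanticOff); // [ 2, 3, 1 ]
```

The last call shows that, in the code as written here, the guard covers only the keyword path: semantic-only episodes are still inserted with a score of 0 when `semanticWeight` is `0`, and can fill any slots left under `limit`.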
 ## Settings
 | Setting | Default | Range | Description |
 |---|---|---|---|
 | `semanticWeight` | 1.0 | 0–5 | Weight applied to Qdrant semantic results |
 | `keywordWeight` | 0 | 0–5 | Weight applied to FTS5 keyword results. `0` = disabled |
 Both are readable via `GET /settings` and writable via `PATCH /settings`
 without a service restart. Changes take effect on the next chat request.
 **To enable keyword search:**
 ```bash
 curl -X PATCH http://localhost:4000/settings \
  -H "Content-Type: application/json" \
  -d '{"keywordWeight": 1.0}'
 ```
 **To favour keyword matches over semantic:**
 ```bash
 curl -X PATCH http://localhost:4000/settings \
  -H "Content-Type: application/json" \
  -d '{"semanticWeight": 0.5, "keywordWeight": 2.0}'
 ```
 ## Constants (`packages/shared/src/config/constants.js`)
 | Constant | Value | Description |
 |---|---|---|
 | `RETRIEVAL.RRF_K` | 60 | RRF smoothing constant — not exposed in settings |
 | `RETRIEVAL.SEMANTIC_WEIGHT` | 1.0 | Default semantic weight |
 | `RETRIEVAL.KEYWORD_WEIGHT` | 0 | Default keyword weight (off) |
--- a/packages/orchestration-service/CLAUDE.md
+++ b/packages/orchestration-service/CLAUDE.md
@@ -24,9 +24,10 @@ Default port: **4000**. Depends on memory-service, embedding-service, inference-
   - No project: `must: [sessionId == this session]`
   - Project: `should: [sessionId == s1, sessionId == s2, ...]` across all project sessions
   - Dedup against recent episode IDs before including.
-5. Embed and search Qdrant ENTITIES (filtered by `projectId` if in a project). Returns entity IDs alongside payload — the Qdrant point ID equals the SQLite entity ID.
+5. Run **fused episode retrieval** via `getFusedEpisodes` — Qdrant semantic search and FTS5 keyword search run in parallel, both filtered against `recentIds`, then merged via Reciprocal Rank Fusion (RRF). If `keywordWeight` is `0`, the FTS call is skipped. Returns top `semanticLimit` episodes by fused score.
-6. Expand matched entities into a 1-hop graph neighborhood via `POST /graph/neighbors` on the memory-service. Returns `{ nodes, edges }` — the full entity objects plus connecting relationships. Falls back to flat entity list (no edges) if the graph call fails.
+6. Embed and search Qdrant ENTITIES (filtered by `projectId` if in a project). Returns entity IDs alongside payload — the Qdrant point ID equals the SQLite entity ID.
-7. Build prompt in this fixed order: **system prompt → graph context → semantic episodes → recent episodes → user message → "Assistant:"**
+7. Expand matched entities into a 1-hop graph neighborhood via `POST /graph/neighbors` on the memory-service. Returns `{ nodes, edges }` — the full entity objects plus connecting relationships. Falls back to flat entity list (no edges) if the graph call fails.
+8. Build prompt in this fixed order: **system prompt → graph context → fused episodes → recent episodes → user message → "Assistant:"**
 The ordering prioritizes established facts (graph context) and relevant past context (semantic) over pure recency.
@@ -100,6 +101,18 @@ When the existing summary's token count exceeds `SUMMARIES.MAX_SUMMARY_TOKENS`,
 `searchEntities` checks `projectId !== null && projectId !== undefined` before applying the filter — a session with no project skips the filter entirely and searches globally.
 ## Retrieval Fusion (`src/chat/index.js`)
 Three functions handle fusion — all pure or lightly async, all non-critical:
 - **`getFTSResults(userMessage, { limit, sessionIds })`** — calls `memory.searchEpisodes`; returns `[]` and logs a warning on failure
 - **`fuseEpisodeResults(semanticEps, keywordEps, { semanticWeight, keywordWeight, limit })`** — pure RRF implementation. Key guard: FTS-only episodes are only added to the scores Map if `contrib > 0` (prevents score-0 bleed-through when `keywordWeight: 0`)
 - **`getFusedEpisodes(userMessage, session, recentIds, projectSessionIds, settings)`** — orchestrates both paths in `Promise.all`, applies `recentIds` filter to FTS results, calls fusion. Short-circuits FTS call entirely if `keywordWeight === 0`
 FTS is scoped to `projectSessionIds` if in a project, otherwise `[session.id]` — mirrors Qdrant scoping exactly.
 > For RRF formula, weight semantics, and enabling keyword search, see `docs/services/retrieval-fusion.md`.
 ## Graph Service Client (`src/services/graph.js`)
 Thin HTTP client for memory-service graph endpoints. One function: