diff --git a/multi-representation-search/multi-representation-search.ipynb b/multi-representation-search/multi-representation-search.ipynb new file mode 100644 index 0000000..6421baa --- /dev/null +++ b/multi-representation-search/multi-representation-search.ipynb @@ -0,0 +1,561 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "2153bba9", + "metadata": {}, + "source": [ + "# Multi-Representation Search: Step-by-Step Build-Up\n", + "\n", + "A document is rarely well-represented by a single embedding. A research paper has a title, an abstract, body chunks, and category tags, each carrying a different signal. Treat all four as one dense vector and the title gets averaged out; chunk-level grounding for downstream reasoning disappears.\n", + "\n", + "This notebook builds a Qdrant retrieval pipeline that uses each representation deliberately. Over six steps you'll go from a naive dense-only baseline to a fully fused pipeline with four named-vector prefetches, Reciprocal Rank Fusion, document-level grouping, and optional formula-based score boosting. After each step you'll run the same query and see the top retrieved papers change.\n", + "\n", + "The design rationale (why each component is there, when to use it, when not to) lives in the accompanying [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/). This notebook focuses on running the code and watching the result list shift.\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/qdrant/examples/blob/master/multi-representation-search/multi-representation-search.ipynb)\n" + ] + }, + { + "cell_type": "markdown", + "id": "4b597568", + "metadata": {}, + "source": [ + "## Requirements\n", + "\n", + "This notebook uses [Qdrant Cloud Inference](https://qdrant.tech/documentation/inference/#qdrant-cloud-inference) to generate embeddings server-side, so no client-side embedding library is required. The free tier covers this notebook's footprint. Core BM25 runs on any Qdrant instance, but dense Cloud Inference is Cloud-only. To self-host, generate dense vectors on the client with a library like [FastEmbed](https://qdrant.tech/documentation/fastembed/) and pass them as raw vectors instead of `models.Document`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "59028f90", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install qdrant-client datasets" + ] + }, + { + "cell_type": "markdown", + "id": "c1e8c733", + "metadata": {}, + "source": [ + "## Dataset\n", + "\n", + "20 000 ML/CS arXiv papers (2018 and later) from the [`gfissore/arxiv-abstracts-2021`](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021) dataset. 
Each paper has a `title`, `abstract`, and `categories` (which this dataset returns as space-joined strings, so we split them before filtering).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed2823ba", + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "ML_CATEGORIES = {\"cs.LG\", \"cs.CV\", \"cs.CL\", \"cs.AI\", \"stat.ML\"}\n", + "\n", + "# Non-streaming so HF caches the parquet locally; first run downloads ~2.5 GB, re-runs are instant.\n", + "dataset = load_dataset(\"gfissore/arxiv-abstracts-2021\", split=\"train\")\n", + "\n", + "papers = []\n", + "# IDs are roughly chronological; iterate from the end to land on 2021/2020/2019 papers first.\n", + "for i in range(len(dataset) - 1, -1, -1):\n", + " if len(papers) >= 20000:\n", + " break\n", + " row = dataset[i]\n", + " if not row[\"abstract\"] or not row[\"title\"]:\n", + " continue\n", + " # categories arrive as space-joined strings (e.g. [\"cs.LG cs.CV\"]); split each entry.\n", + " cats = [tok for entry in row[\"categories\"] for tok in entry.split()]\n", + " if not any(c in ML_CATEGORIES for c in cats):\n", + " continue\n", + " # Year lives in the YYMM prefix of new-format arXiv IDs (\"2104.01234\" -> 2021).\n", + " arxiv_id = row[\"id\"]\n", + " if \"/\" in arxiv_id or \".\" not in arxiv_id:\n", + " continue # skip pre-2007 IDs like \"math/0506001\"\n", + " if 2000 + int(arxiv_id[:2]) < 2018:\n", + " continue\n", + " papers.append({\n", + " \"arxiv_id\": arxiv_id,\n", + " \"title\": row[\"title\"].strip(),\n", + " \"abstract\": row[\"abstract\"].strip(),\n", + " \"tags\": cats,\n", + " })\n", + "print(f\"Loaded {len(papers)} papers\")" + ] + }, + { + "cell_type": "markdown", + "id": "26339a5a", + "metadata": {}, + "source": [ + "## Schema\n", + "\n", + "One Qdrant collection. Each point is a chunk. Each chunk holds four named vectors that we'll fuse at query time:\n", + "\n", + "- `dense_chunk`: the chunk's own embedding (body content).\n", + "- `dense_title`: the paper title embedding (topical naming).\n", + "- `dense_abstract`: the paper abstract embedding (paper-level view).\n", + "- `sparse_title`: BM25 over the title (lexical matches on rare entity names, jargon, specific model or paper names).\n", + "\n", + "Categories live in the `tags` payload with a keyword index, so queries can pre-filter by category.\n", + "\n", + "`dense_title`, `dense_abstract`, and `sparse_title` are duplicated across every chunk of the same paper. That trades a bit of storage for one-shot query fusion (one collection, one Query API call, every representation reachable from any point). 
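Back-of-envelope on that storage cost: a 384-dim float32 vector is 384 × 4 bytes ≈ 1.5 KB, so duplicating `dense_title` and `dense_abstract` across, say, four chunks of the same paper adds three extra copies × ~3 KB ≈ 9 KB per paper (the duplicated sparse titles are comparatively tiny), or roughly 180 MB over 20,000 papers before any quantization. 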
For the typical case (a few dozen chunks per paper, embeddings under a kilobyte each) it's the simpler choice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "788e1d18", + "metadata": {}, + "outputs": [], + "source": [ + "from qdrant_client import QdrantClient, models\n", + "\n", + "# Replace url and api_key with your own from https://cloud.qdrant.io\n", + "client = QdrantClient(\n", + " url=\"https://xyz-example.qdrant.io:6333\",\n", + " api_key=\"\",\n", + " cloud_inference=True,\n", + ")\n", + "\n", + "# 384 is the output dimension of sentence-transformers/all-minilm-l6-v2, used below for every dense vector.\n", + "client.create_collection(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " vectors_config={\n", + " \"dense_chunk\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", + " \"dense_title\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", + " \"dense_abstract\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", + " },\n", + " sparse_vectors_config={\n", + " \"sparse_title\": models.SparseVectorParams(modifier=models.Modifier.IDF),\n", + " },\n", + ")\n", + "\n", + "# Index 'document_id' so the Query API can group by it; index 'tags' so we can filter on category.\n", + "client.create_payload_index(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " field_name=\"document_id\",\n", + " field_schema=models.PayloadSchemaType.KEYWORD,\n", + ")\n", + "client.create_payload_index(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " field_name=\"tags\",\n", + " field_schema=models.PayloadSchemaType.KEYWORD,\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "295e1a01", + "metadata": {}, + "source": [ + "## Ingestion\n", + "\n", + "Embeddings are generated server-side via Qdrant Cloud Inference:\n", + "\n", + "- `sentence-transformers/all-minilm-l6-v2` (384-dim) for the three dense vectors.\n", + "- `qdrant/bm25` (core BM25 since Qdrant 1.15) for the sparse vector, with `avg_len=10.0` calibrated for the title-only field (default is 256, calibrated for document-length text).\n", + "\n", + "Chunking uses a fixed two-sentence window for simplicity; the right chunking strategy depends on your document structure. One point per chunk, with the title and abstract Documents reused across every chunk of the same paper.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "725afca6", + "metadata": {}, + "outputs": [], + "source": [ + "DENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\n", + "BM25_MODEL = \"qdrant/bm25\"\n", + "\n", + "def chunk_sentences(text, target_len=2):\n", + " \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n", + " sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n", + " return [\". 
\".join(sentences[i:i + target_len])\n", + " for i in range(0, len(sentences), target_len)] or [text]\n", + "\n", + "\n", + "points = []\n", + "for paper in papers:\n", + " chunks = chunk_sentences(paper[\"abstract\"])\n", + "\n", + " # Title, abstract, and sparse docs are reused across every chunk of this paper; only the chunk text varies.\n", + " # Cloud Inference embeds each Document on the server, so you don't need a client-side embedding library.\n", + " title_doc = models.Document(text=paper[\"title\"], model=DENSE_MODEL)\n", + " abstract_doc = models.Document(text=paper[\"abstract\"], model=DENSE_MODEL)\n", + " # avg_len is the average word count of the indexed text.\n", + " # Default is 256 (document-length); setting it to the actual field length (~10 here) improves BM25 scoring accuracy.\n", + " sparse_doc = models.Document(\n", + " text=paper[\"title\"],\n", + " model=BM25_MODEL,\n", + " options={\"avg_len\": 10.0},\n", + " )\n", + "\n", + " for i, chunk in enumerate(chunks):\n", + " points.append(models.PointStruct(\n", + " id=len(points),\n", + " vector={\n", + " \"dense_chunk\": models.Document(text=chunk, model=DENSE_MODEL),\n", + " \"dense_title\": title_doc,\n", + " \"dense_abstract\": abstract_doc,\n", + " \"sparse_title\": sparse_doc,\n", + " },\n", + " payload={\n", + " \"document_id\": paper[\"arxiv_id\"],\n", + " \"title\": paper[\"title\"],\n", + " \"tags\": paper[\"tags\"],\n", + " \"chunk_index\": i,\n", + " \"chunk_text\": chunk,\n", + " },\n", + " ))\n", + "\n", + "client.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=256)\n", + "print(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")" + ] + }, + { + "cell_type": "markdown", + "id": "61b1aa7b", + "metadata": {}, + "source": [ + "## Query Helpers\n", + "\n", + "Two pieces used by every step below:\n", + "\n", + "- `SAMPLE_QUERY` is the single query we run through every step so we can watch the same query produce different results as capabilities are added.\n", + "- `show_results(retrieve_fn)` runs the retrieve function and prints the top 5 results: title, category tags, and an excerpt from the matching chunk. Accepts both chunk-level results (Steps 1-4) and grouped results (Steps 5-6, where each result is a paper with several chunks).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f70b01f8", + "metadata": {}, + "outputs": [], + "source": [ + "import textwrap\n", + "\n", + "SAMPLE_QUERY = \"diffusion models for image synthesis\"\n", + "\n", + "def show_results(retrieve_fn, query=SAMPLE_QUERY, k=5):\n", + " \"\"\"Print top-k results as: title, category tags, and a matching-chunk excerpt.\"\"\"\n", + " print(f\"Query: {query!r}\\n\")\n", + " for i, item in enumerate(retrieve_fn(query, limit=k), 1):\n", + " # item is a Point (Steps 1-4) or a Group (Steps 5-6).\n", + " # For groups, hits[0] is the top chunk for that paper.\n", + " point = item.hits[0] if hasattr(item, \"hits\") else item\n", + " payload = point.payload\n", + " title = payload[\"title\"]\n", + " tags = payload.get(\"tags\", [])\n", + " # Collapse whitespace (including embedded newlines) so the excerpt prints cleanly.\n", + " chunk = \" \".join(payload[\"chunk_text\"].split())\n", + " excerpt = chunk[:250].rstrip() + (\"...\" if len(chunk) > 250 else \"\")\n", + " print(textwrap.fill(f\"{i}. 
{title}\", width=140, initial_indent=\" \", subsequent_indent=\" \"))\n", + " if tags:\n", + " print(f\" [{', '.join(str(t) for t in tags[:3])}]\")\n", + " print(textwrap.fill(excerpt, width=140, initial_indent=\" \", subsequent_indent=\" \"))\n", + " print()\n" + ] + }, + { + "cell_type": "markdown", + "id": "4b9065fe", + "metadata": {}, + "source": [ + "## Step 1: Dense Over Chunks (Baseline)\n", + "\n", + "The naive baseline: encode the query with the dense model, search against `dense_chunk` only, return the chunk-level results' parent papers. No fusion, no title or sparse signal.\n", + "\n", + "This is what most \"vector search\" tutorials stop at. It's a reasonable default for short, homogeneous corpora where the chunk text already carries the full signal. It systematically underperforms when the signal lives outside the chunk: in the title (topical naming), or in keyword overlap that the embedding model has averaged out into a generic neighborhood.\n", + "\n", + "Each subsequent step closes one of those gaps.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "566dbbbd", + "metadata": {}, + "outputs": [], + "source": [ + "def retrieve_baseline(query, limit=10):\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " query=models.Document(text=query, model=DENSE_MODEL),\n", + " using=\"dense_chunk\",\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_baseline)\n" + ] + }, + { + "cell_type": "markdown", + "id": "f710ce2f", + "metadata": {}, + "source": [ + "## Step 2: Add Sparse Title With RRF\n", + "\n", + "Add a second prefetch: BM25 over the title. Then fuse the two ranked lists with **Reciprocal Rank Fusion (RRF)**.\n", + "\n", + "Why RRF instead of weighted averages of raw scores? RRF works on rank, not score. Dense scores live in [0, 1], sparse BM25 scores don't, and RRF doesn't have to reconcile the two. Linear weights are fragile: a weight that helps one query class hurts another, and the right weight depends on query length, model, and corpus.\n", + "\n", + "What does sparse add? Queries with rare entity names, jargon, or specific model/paper names often produce dense embeddings near generic neighborhoods. The sparse path catches those exact-token matches on the title. RRF promotes documents both paths agree on.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44b0f157", + "metadata": {}, + "outputs": [], + "source": [ + "def retrieve_hybrid(query, limit=10):\n", + " dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + " sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n", + " models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_hybrid)\n" + ] + }, + { + "cell_type": "markdown", + "id": "4bdf38f7", + "metadata": {}, + "source": [ + "## Step 3: Add Title Prefetch\n", + "\n", + "Add a third prefetch: the same dense query vector, but searched against `dense_title` instead of `dense_chunk`. We're now fusing across three representations: chunk content, title (lexical), and title (semantic).\n", + "\n", + "The title prefetch saves queries where the topic is named explicitly but not echoed in any single chunk. 
For example: \"diffusion models for high-resolution image synthesis\" surfaces a paper titled \"High-Resolution Image Synthesis with Latent Diffusion Models\" via the title path even when its chunks phrase the contribution differently. The chunk prefetch alone misses it; the title path catches it; RRF promotes it because both paths agree.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b62d81a9", + "metadata": {}, + "outputs": [], + "source": [ + "def retrieve_three_repr(query, limit=10):\n", + " dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + " sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n", + " models.Prefetch(query=dense_query, using=\"dense_title\", limit=50),\n", + " models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_three_repr)\n" + ] + }, + { + "cell_type": "markdown", + "id": "e59ce67e", + "metadata": {}, + "source": [ + "## Step 4: Add Abstract Prefetch\n", + "\n", + "Add a fourth prefetch on `dense_abstract`. The abstract gives a paper-level view that sits between the title (very short) and individual chunks (very local). It catches queries that match the paper's overall framing rather than a single passage or the title's topical naming.\n", + "\n", + "In a production setup where chunks are full paper bodies, the abstract is a meaningfully different representation. In this notebook's arXiv dataset (where chunks are 2-sentence slices of the abstract itself), the lift over Step 3 will be smaller because the abstract and the chunks share text. The prefetch is still worth wiring up; the pipeline shape is what generalizes to longer corpora.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9c0dd1d", + "metadata": {}, + "outputs": [], + "source": [ + "def retrieve_four_repr(query, limit=10):\n", + " dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + " sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n", + " models.Prefetch(query=dense_query, using=\"dense_title\", limit=50),\n", + " models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=50),\n", + " models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_four_repr)\n" + ] + }, + { + "cell_type": "markdown", + "id": "1fed2f91", + "metadata": {}, + "source": [ + "## Step 5: Group by Document\n", + "\n", + "So far results are chunks, and the same paper can appear multiple times in the top 10. Most consumers want one entry per document with the top chunks attached: a results UI, a citation list, an LLM that needs document-level attribution.\n", + "\n", + "`query_points_groups` collapses chunks back to documents using `group_by=\"document_id\"`. 
Each group's `hits` field carries the top-`group_size` chunks for that paper.\n", + "\n", + "This step also wires in an optional `tags` parameter that filters candidates to specific arXiv categories before retrieval runs. Qdrant pre-filters on the payload index we added in the schema, so filtering happens before the fusion math, not after.\n", + "\n", + "A few things worth knowing:\n", + "\n", + "- Grouping is a *presentation* choice, not a relevance technique. The candidates and their fused scores don't change; only the result shape does.\n", + "- You may need to adjust the per-prefetch `limit` based on the number of chunks per document; grouping only sees what the prefetch returns.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1694ce42", + "metadata": {}, + "outputs": [], + "source": [ + "def retrieve_grouped(query, limit=10, group_size=3, tags=None):\n", + "    dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + "    sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + "    # Optional category filter. When tags is provided, Qdrant pre-filters candidates\n", + "    # to points whose 'tags' payload includes any of the given values.\n", + "    query_filter = (\n", + "        models.Filter(must=[models.FieldCondition(key=\"tags\", match=models.MatchAny(any=tags))])\n", + "        if tags else None\n", + "    )\n", + "    # query_points_groups applies the filter inside each prefetch (before fusion), then fuses with RRF and groups results by document_id.\n", + "    return client.query_points_groups(\n", + "        collection_name=\"arxiv_multi_repr\",\n", + "        prefetch=[\n", + "            models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=100),\n", + "            models.Prefetch(query=dense_query, using=\"dense_title\", limit=100),\n", + "            models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=100),\n", + "            models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=100),\n", + "        ],\n", + "        query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + "        query_filter=query_filter,\n", + "        group_by=\"document_id\",\n", + "        group_size=group_size,\n", + "        limit=limit,\n", + "    ).groups\n", + "\n", + "show_results(retrieve_grouped)\n" + ] + }, + { + "cell_type": "markdown", + "id": "83c7905e", + "metadata": {}, + "source": [ + "## Step 6: Score Boosting With a Formula\n", + "\n", + "When you have ranking preferences that aren't captured by similarity alone (recency, source authority, geographic proximity, structured boosts), swap RRF for a `FormulaQuery`. Formulas operate on the prefetch scores and payload fields:\n", + "\n", + "- `$score[i]` references the score from prefetch `i`. Prefetch order is load-bearing.\n", + "- The `defaults` map provides fallback values for candidates that didn't appear in every prefetch, so the formula still evaluates.\n", + "\n", + "The formula below sums the chunk score with weighted contributions from the title, abstract, and sparse prefetches. This is a linear combination of raw scores, which breaks down when prefetches use different scoring scales. RRF avoids this by discarding scores; DBSF normalizes per prefetch; a custom formula has to align distributions itself, typically with [decay functions](https://qdrant.tech/documentation/search/search-relevance/#decay-functions). The full FormulaQuery syntax lives in the [Score Boosting](https://qdrant.tech/documentation/search/search-relevance/#score-boosting) reference.\n", + "\n", + "For time-based decay on a `published_at` payload field, swap a term for an `exp_decay` expression.\n", + "\n", + "
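A sketch of what that swap could look like; `published_at` is a hypothetical payload field (this notebook's payload doesn't store one), and the parameters follow the decay-functions reference linked above:

```python
# Hypothetical recency term: decays from 1.0 at the target date toward
# `midpoint` once the point's published_at is `scale` seconds away.
recency_term = models.ExpDecayExpression(
    exp_decay=models.DecayParamsExpression(
        x=models.DatetimeKeyExpression(datetime_key="published_at"),
        target=models.DatetimeExpression(datetime="2021-12-31T00:00:00Z"),
        scale=86400 * 365,  # one year, in seconds
        midpoint=0.5,
    )
)
# Then use recency_term in place of (or alongside) one weighted term
# in the SumExpression below.
```
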
For RRF vs. DBSF guidance, see the [hybrid-search FAQ](https://qdrant.tech/documentation/faq/qdrant-fundamentals/#when-should-i-use-reciprocal-rank-fusion-rrf-vs-distribution-based-score-fusion-dbsf-for-hybrid-search).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d25beee5", + "metadata": {}, + "outputs": [], + "source": [ + "def retrieve_boosted(query, limit=10, group_size=3):\n", + "    dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + "    sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + "    return client.query_points_groups(\n", + "        collection_name=\"arxiv_multi_repr\",\n", + "        prefetch=[\n", + "            # $score[0] = chunk, $score[1] = title, $score[2] = abstract, $score[3] = sparse\n", + "            models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=100),\n", + "            models.Prefetch(query=dense_query, using=\"dense_title\", limit=100),\n", + "            models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=100),\n", + "            models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=100),\n", + "        ],\n", + "        query=models.FormulaQuery(\n", + "            formula=models.SumExpression(sum=[\n", + "                models.MultExpression(mult=[1.0, \"$score[0]\"]),\n", + "                models.MultExpression(mult=[0.5, \"$score[1]\"]),\n", + "                models.MultExpression(mult=[0.4, \"$score[2]\"]),\n", + "                models.MultExpression(mult=[0.3, \"$score[3]\"]),\n", + "            ]),\n", + "            # A candidate can surface from any prefetch, so every $score[i] needs a fallback.\n", + "            defaults={\"$score[0]\": 0.0, \"$score[1]\": 0.0, \"$score[2]\": 0.0, \"$score[3]\": 0.0},\n", + "        ),\n", + "        group_by=\"document_id\",\n", + "        group_size=group_size,\n", + "        limit=limit,\n", + "    ).groups\n", + "\n", + "show_results(retrieve_boosted)\n" + ] + }, + { + "cell_type": "markdown", + "id": "ca1e7741", + "metadata": {}, + "source": [ + "## Wrap-up\n", + "\n", + "That's the recommended multi-representation pipeline end to end. The same schema works for any corpus with title-like, abstract-like, and body-like representations.\n", + "\n", + "If you ran this notebook with the same `SAMPLE_QUERY` (\"diffusion models for image synthesis\") and the same 20,000-paper arXiv slice, here's roughly what each step's top 5 should produce:\n", + "\n", + "- **Step 1 (`dense_chunk` only):** chunk-level results with the same paper appearing in multiple slots. SegDiff, LDM, GLIDE in the top 5.\n", + "- **Step 2 (+ `sparse_title`):** title-exact matches surface. Vector Quantized Diffusion Model jumps in.\n", + "- **Step 3 (+ `dense_title`):** LDM dominates with three of its own chunks. Semantic title match takes over.\n", + "- **Step 4 (+ `dense_abstract`):** modest shift. GLIDE returns thanks to abstract-level signal. Adding a prefetch isn't always dramatic.\n", + "- **Step 5 (grouping):** one entry per paper. The collapsed LDM chunks free up slots for Palette, Global Context, and Implicit Image Segmentation.\n", + "- **Step 6 (formula):** custom weighting reorders results. 
Vector Quantized Diffusion climbs back; ImageBART and Manifold-aware Synthesis enter as the formula amplifies raw scores differently from RRF's rank-based fusion.\n", + "\n", + "Swap the dataset, retune which representations earn their prefetch slots for your data, and wire in formula-based ranking preferences as needed.\n", + "\n", + "For the design rationale and references, see the [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/).\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}