From f954b2e3daae83500741c6b69a83b6f91199d5f1 Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Tue, 5 May 2026 19:26:53 -0400 Subject: [PATCH 1/8] add multi-representation search tutorial notebook Companion notebook to the multi-representation search tutorial in qdrant/landing_page#2334. Builds the recommended retrieval pipeline (three named-vector prefetches, RRF fusion, document-level grouping) step by step against a 2,000-paper ML/CS arXiv slice, with qualitative top-K results at each step. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../multi-representation-search.ipynb | 617 ++++++++++++++++++ 1 file changed, 617 insertions(+) create mode 100644 multi-representation-search/multi-representation-search.ipynb diff --git a/multi-representation-search/multi-representation-search.ipynb b/multi-representation-search/multi-representation-search.ipynb new file mode 100644 index 0000000..7c9cc10 --- /dev/null +++ b/multi-representation-search/multi-representation-search.ipynb @@ -0,0 +1,617 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "2153bba9", + "metadata": {}, + "source": [ + "# Multi-Representation Search: Step-by-Step Build-Up\n", + "\n", + "A document is rarely well-represented by a single embedding. A research paper has a title, an abstract, body chunks, and category tags. Each carries a different signal, and squashing all four into one dense vector loses most of that structure: the title gets averaged out, keyword matches on tags disappear, and chunk-level grounding for downstream reasoning is gone.\n", + "\n", + "This notebook builds a Qdrant retrieval pipeline that uses each representation deliberately. Over five steps you'll go from a naive dense-only baseline to a fully fused pipeline with three named-vector prefetches, Reciprocal Rank Fusion, document-level grouping, and optional formula-based score boosting. After each step you'll run the same query and see the top retrieved papers change.\n", + "\n", + "The design rationale (why each component is there, when to use it, when not to) lives in the accompanying [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/). This notebook focuses on running the code and watching the result list shift.\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/qdrant/examples/blob/master/multi-representation-search/multi-representation-search.ipynb)\n" + ] + }, + { + "cell_type": "markdown", + "id": "4b597568", + "metadata": {}, + "source": [ + "## Requirements\n", + "\n", + "Use Python <3.13. Not all dependencies support the newest Python versions yet.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "59028f90", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install qdrant-client fastembed datasets" + ] + }, + { + "cell_type": "markdown", + "id": "c1e8c733", + "metadata": {}, + "source": "## Dataset\n\nTwo thousand arXiv papers from the [`gfissore/arxiv-abstracts-2021`](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021) Hugging Face dataset, filtered to machine-learning and computer-science categories (`cs.LG`, `cs.CV`, `cs.CL`, `cs.AI`, `stat.ML`) so the queries below have natural matches. Each paper carries the four representations we want to fuse over: a `title` (a few topical tokens), an `abstract` (which we use both as a summary and as the source for chunked body content), and `categories` (controlled-vocabulary tags). 
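For concreteness, one loaded record comes out looking roughly like this (the values are illustrative, not taken from the dataset):\n\n```python\n{\n    \"arxiv_id\": \"2104.01234\",   # hypothetical ID, shape matches the loader below\n    \"title\": \"Latent Diffusion Models for Image Synthesis\",\n    \"abstract\": \"We present ...\",  # full abstract text\n    \"categories\": [\"cs.CV\", \"cs.LG\"],\n}\n```\n\n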
Swap in any other arXiv source as long as it exposes title, abstract, and categories.\n" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed2823ba", + "metadata": {}, + "outputs": [], + "source": "from datasets import load_dataset\n\nML_CATEGORIES = {\"cs.LG\", \"cs.CV\", \"cs.CL\", \"cs.AI\", \"stat.ML\"}\n\ndataset = load_dataset(\n \"gfissore/arxiv-abstracts-2021\", split=\"train\", streaming=True\n)\npapers = []\nfor row in dataset:\n if len(papers) >= 2000:\n break\n if not row[\"abstract\"] or not row[\"title\"]:\n continue\n cats = list(row[\"categories\"])\n if not any(c in ML_CATEGORIES for c in cats):\n continue # ML/CS papers only\n papers.append({\n \"arxiv_id\": row[\"id\"],\n \"title\": row[\"title\"].strip(),\n \"abstract\": row[\"abstract\"].strip(),\n \"categories\": cats,\n })\nprint(f\"Loaded {len(papers)} papers\")\n" + }, + { + "cell_type": "markdown", + "id": "26339a5a", + "metadata": {}, + "source": [ + "## Schema\n", + "\n", + "One Qdrant collection. Each point is a chunk. Each chunk holds four named vectors that we'll fuse at query time:\n", + "\n", + "- `dense_chunk`: the chunk's own embedding (body content).\n", + "- `dense_title`: the paper title embedding (topical naming).\n", + "- `dense_summary`: the paper abstract embedding (contribution focus).\n", + "- `sparse_keywords`: BM25 over the title and tags concatenated (lexical matches on short structured fields).\n", + "\n", + "`dense_title` and `dense_summary` are duplicated across every chunk of the same paper. That trades a bit of storage for one-shot query fusion (one collection, one Query API call, no `lookup_from`). For the typical case (a few dozen chunks per paper, embeddings under a kilobyte each) it's the simpler choice.\n", + "\n", + "We use *named vectors*, not a multivector field. Multivectors are designed for late-interaction models like ColBERT, where the MaxSim comparator combines per-token subvectors into one score per point. Title, summary, and chunk vectors are different kinds of content, so MaxSim would collapse the per-representation signal we want to fuse. The [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/) covers the contrast.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "788e1d18", + "metadata": {}, + "outputs": [], + "source": [ + "from qdrant_client import QdrantClient, models\n", + "\n", + "client = QdrantClient(\"http://localhost:6333\") # or QdrantClient(url=\"https://.cloud.qdrant.io\", api_key=\"...\") for Qdrant Cloud\n", + "\n", + "client.create_collection(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " vectors_config={\n", + " \"dense_chunk\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", + " \"dense_title\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", + " \"dense_summary\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", + " },\n", + " sparse_vectors_config={\n", + " \"sparse_keywords\": models.SparseVectorParams(modifier=models.Modifier.IDF),\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "295e1a01", + "metadata": {}, + "source": "## Ingestion\n\nEmbeddings are generated locally with [FastEmbed](https://qdrant.tech/documentation/fastembed/):\n\n- `BAAI/bge-small-en-v1.5` (384-dim, ~67 MB) for the three dense vectors. Trained with retrieval-specific contrastive objectives, which is what this tutorial does.\n- `Qdrant/bm25` for the sparse vector. 
The IDF modifier on the collection means Qdrant computes inverse-document-frequency weights at query time across the corpus.\n\nChunking uses a fixed two-sentence window for clarity. Chunking strategy has a real effect on retrieval quality and is its own design space (hierarchical, late, semantic chunking are all worth comparing). For now: one point per chunk, with the title and summary embeddings copied onto every chunk of the same paper.\n\nThe loop is deliberately straightforward (one paper at a time) so the per-vector logic stays easy to follow. The first run downloads the FastEmbed models; subsequent runs reuse the local cache. On a laptop CPU expect a few minutes for 2000 papers.\n" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "725afca6", + "metadata": {}, + "outputs": [], + "source": [ + "from fastembed import TextEmbedding, SparseTextEmbedding\n", + "\n", + "# Dense embeddings for title, summary, and chunk content; sparse BM25 for keyword matching.\n", + "dense_model = TextEmbedding(\"BAAI/bge-small-en-v1.5\")\n", + "sparse_model = SparseTextEmbedding(\"Qdrant/bm25\")\n", + "\n", + "def chunk_sentences(text, target_len=2):\n", + " \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n", + " sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n", + " return [\". \".join(sentences[i:i + target_len])\n", + " for i in range(0, len(sentences), target_len)] or [text]\n", + "\n", + "def to_sparse(sparse_emb):\n", + " \"\"\"Convert FastEmbed's SparseEmbedding into a Qdrant SparseVector.\"\"\"\n", + " return models.SparseVector(\n", + " indices=sparse_emb.indices.tolist(),\n", + " values=sparse_emb.values.tolist(),\n", + " )\n", + "\n", + "\n", + "points = []\n", + "for paper in papers:\n", + " chunks = chunk_sentences(paper[\"abstract\"])\n", + "\n", + " # Paper-level embeddings: computed once per paper, reused across every chunk below.\n", + " # next(iter(...)) extracts the single vector from FastEmbed's generator output.\n", + " title_vec = next(iter(dense_model.embed([paper[\"title\"]]))).tolist()\n", + " summary_vec = next(iter(dense_model.embed([paper[\"abstract\"]]))).tolist()\n", + " sparse_vec = to_sparse(next(iter(sparse_model.embed(\n", + " [paper[\"title\"] + \" \" + \" \".join(paper[\"categories\"])]\n", + " ))))\n", + "\n", + " # Chunk-level dense embedding: one vector per chunk.\n", + " chunk_vecs = [v.tolist() for v in dense_model.embed(chunks)]\n", + "\n", + " # One Qdrant point per chunk. 
dense_title, dense_summary, and sparse_keywords\n", + " # are the same for every chunk of this paper; only dense_chunk varies.\n", + " for i, (chunk, chunk_vec) in enumerate(zip(chunks, chunk_vecs)):\n", + " points.append(models.PointStruct(\n", + " id=len(points),\n", + " vector={\n", + " \"dense_chunk\": chunk_vec,\n", + " \"dense_title\": title_vec,\n", + " \"dense_summary\": summary_vec,\n", + " \"sparse_keywords\": sparse_vec,\n", + " },\n", + " payload={\n", + " \"document_id\": paper[\"arxiv_id\"],\n", + " \"title\": paper[\"title\"],\n", + " \"tags\": paper[\"categories\"],\n", + " \"chunk_index\": i,\n", + " \"chunk_text\": chunk,\n", + " },\n", + " ))\n", + "\n", + "client.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=64)\n", + "print(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "61b1aa7b", + "metadata": {}, + "source": [ + "## Query Helpers\n", + "\n", + "Three pieces used by every step below:\n", + "\n", + "- `embed_query(query)` produces the `(dense, sparse)` pair we feed into Qdrant. Both `dense_model` and `sparse_model` expose a `query_embed` method calibrated for queries: for BM25 it applies IDF weighting; for some dense models it applies a query-side prompt.\n", + "- `SAMPLE_QUERY` is the single query we run through every step so we can watch the same query produce different results as capabilities are added.\n", + "- `show_results(retrieve_fn)` runs the retrieve function and prints the top 5 results: title, category tags, and an excerpt from the matching chunk. Accepts both chunk-level results (Steps 1-3) and grouped results (Steps 4-5, where each result is a paper with several chunks).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "f70b01f8", + "metadata": {}, + "outputs": [], + "source": [ + "import textwrap\n", + "\n", + "def embed_query(query):\n", + " \"\"\"Produce a (dense, sparse) embedding pair for a query string.\"\"\"\n", + " dense = next(iter(dense_model.query_embed([query]))).tolist()\n", + " sparse = to_sparse(next(iter(sparse_model.query_embed([query]))))\n", + " return dense, sparse\n", + "\n", + "SAMPLE_QUERY = \"diffusion models for image synthesis\"\n", + "\n", + "def show_results(retrieve_fn, query=SAMPLE_QUERY, k=5):\n", + " \"\"\"Print top-k results as: title, category tags, and a matching-chunk excerpt.\"\"\"\n", + " print(f\"Query: {query!r}\\n\")\n", + " for i, item in enumerate(retrieve_fn(query, limit=k), 1):\n", + " # item is a Point (Steps 1-3) or a Group (Steps 4-5).\n", + " # For groups, hits[0] is the top chunk for that paper.\n", + " point = item.hits[0] if hasattr(item, \"hits\") else item\n", + " payload = point.payload\n", + " title = payload[\"title\"]\n", + " tags = payload.get(\"tags\", [])\n", + " # Collapse whitespace (including embedded newlines) so the excerpt prints cleanly.\n", + " chunk = \" \".join(payload[\"chunk_text\"].split())\n", + " excerpt = chunk[:250].rstrip() + (\"...\" if len(chunk) > 250 else \"\")\n", + " print(textwrap.fill(f\"{i}. 
{title}\", width=140, initial_indent=\" \", subsequent_indent=\" \"))\n", + " if tags:\n", + " print(f\" [{', '.join(str(t) for t in tags[:3])}]\")\n", + " print(textwrap.fill(excerpt, width=140, initial_indent=\" \", subsequent_indent=\" \"))\n", + " print()\n" + ] + }, + { + "cell_type": "markdown", + "id": "4b9065fe", + "metadata": {}, + "source": [ + "## Step 1: Dense Over Chunks (Baseline)\n", + "\n", + "The naive baseline: encode the query with the dense model, search against `dense_chunk` only, return the chunk-level results' parent papers. No fusion, no title or sparse signal.\n", + "\n", + "This is what most \"vector search\" tutorials stop at. It's a reasonable default for short, homogeneous corpora where the chunk text already carries the full signal. It systematically underperforms when the signal lives outside the chunk: in the title (topical naming), in tags (controlled vocabulary), or in keyword overlap that the embedding model has averaged out into a generic neighborhood.\n", + "\n", + "Each subsequent step closes one of those gaps.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "566dbbbd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: 'diffusion models for image synthesis'\n", + "\n", + " 1. Optimal Shape Design for Stokes Flow Via Minimax Differentiability\n", + " [math.OC]\n", + " We apply an gradient type algorithm to our problem. Numerical examples show that our theory is useful for practical purpose and the\n", + " proposed algorithm is feasible.\n", + "\n", + " 2. The small deviations of many-dimensional diffusion processes and rarefaction by boundaries\n", + " [math.PR math.AP]\n", + " We lead the algorithm of expansion of sojourn probability of many-dimensional diffusion processes in small domain. The principal member\n", + " of this expansion defines normalizing coefficient for special limit theorems.\n", + "\n", + " 3. Kinetic equation for finite systems of fermions with pairing\n", + " [nucl-th]\n", + " As a consequence, the density fluctuation and the longitudinal response function given by this approximation contain spurious\n", + " contributions. A simple prescription for restoring both local and global particle-number conservation is proposed\n", + "\n", + " 4. Exponential growth rates in a typed branching diffusion\n", + " [math.PR]\n", + " We also briefly discuss applications to traveling wave solutions of an associated reaction--diffusion equation.\n", + "\n", + " 5. Order of Epitaxial Self-Assembled Quantum Dots: Linear Analysis\n", + " [cond-mat.mtrl-sci]\n", + " It is likely that these two types of order are strongly linked; thus, a study of spatial order will also have strong implications for\n", + " size order. Here a study of spatial order is undertaken using a linear analysis of a commonly used model of SAQD for...\n", + "\n" + ] + } + ], + "source": [ + "def retrieve_baseline(query, limit=10):\n", + " dense, _ = embed_query(query)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " query=dense,\n", + " using=\"dense_chunk\",\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_baseline)\n" + ] + }, + { + "cell_type": "markdown", + "id": "f710ce2f", + "metadata": {}, + "source": [ + "## Step 2: Add Sparse Keywords With RRF\n", + "\n", + "Add a second prefetch: BM25 over title and tags. 
Then fuse the two ranked lists with **Reciprocal Rank Fusion (RRF)**.\n", + "\n", + "Why RRF instead of weighted averages of raw scores? RRF works on rank, not score. Dense scores live in [0, 1], sparse BM25 scores don't, and RRF doesn't have to reconcile the two. Linear weights are fragile: a weight that helps one query class hurts another, and the right weight depends on query length, model, and corpus.\n", + "\n", + "What does sparse add? Queries with rare entity names, jargon, or category tags often produce dense embeddings near generic neighborhoods. The sparse path catches those exact-token matches. RRF promotes documents both paths agree on.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "44b0f157", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: 'diffusion models for image synthesis'\n", + "\n", + " 1. Optimal Shape Design for Stokes Flow Via Minimax Differentiability\n", + " [math.OC]\n", + " We apply an gradient type algorithm to our problem. Numerical examples show that our theory is useful for practical purpose and the\n", + " proposed algorithm is feasible.\n", + "\n", + " 2. Testing turbulence model at metric scales with mid-infrared VISIR images at the VLT\n", + " [astro-ph]\n", + " The image quality improves in the infrared faster than the standard lambda^{-1/5} scaling and may be diffraction-limited at 30-m\n", + " apertures even without adaptive optics at wavelengths longer than 8 micron.\n", + "\n", + " 3. The small deviations of many-dimensional diffusion processes and rarefaction by boundaries\n", + " [math.PR math.AP]\n", + " We lead the algorithm of expansion of sojourn probability of many-dimensional diffusion processes in small domain. The principal member\n", + " of this expansion defines normalizing coefficient for special limit theorems.\n", + "\n", + " 4. Exponential growth rates in a typed branching diffusion\n", + " [math.PR]\n", + " We also briefly discuss applications to traveling wave solutions of an associated reaction--diffusion equation.\n", + "\n", + " 5. Testing turbulence model at metric scales with mid-infrared VISIR images at the VLT\n", + " [astro-ph]\n", + " We probe turbulence structure from centimetric to metric scales by simultaneous imagery at mid-infrared and visible wavelengths at the\n", + " VLT telescope and show that it departs significantly from the commonly used Kolmogorov model. The data can be fitte...\n", + "\n" + ] + } + ], + "source": [ + "def retrieve_hybrid(query, limit=10):\n", + " dense, sparse = embed_query(query)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense, using=\"dense_chunk\", limit=50),\n", + " models.Prefetch(query=sparse, using=\"sparse_keywords\", limit=50),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_hybrid)\n" + ] + }, + { + "cell_type": "markdown", + "id": "4bdf38f7", + "metadata": {}, + "source": [ + "## Step 3: Add Title Prefetch\n", + "\n", + "Add a third prefetch: the same dense query vector, but searched against `dense_title` instead of `dense_chunk`. We're now fusing across three representations: chunk content, keyword hits, and topical naming.\n", + "\n", + "The title prefetch saves queries where the topic is named explicitly but not echoed in any single chunk. 
For example: \"diffusion models for high-resolution image synthesis\" surfaces a paper titled \"High-Resolution Image Synthesis with Latent Diffusion Models\" via the title path even when its chunks phrase the contribution differently. The chunk prefetch alone misses it; the title path catches it; RRF promotes it because both paths agree.\n", + "\n", + "A representation only earns its own prefetch if it carries signal independent of the others. We're not adding `dense_summary` as a fourth prefetch here because abstracts often paraphrase the chunks they came from. If your corpus has summaries that surface different content (human-written summaries of long technical reports, for example), adding a fourth prefetch is worth it.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b62d81a9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: 'diffusion models for image synthesis'\n", + "\n", + " 1. Probability distributions generated by fractional diffusion equations\n", + " [cond-mat.stat-mech]\n", + " This property is a noteworthy generalization of what happens for the standard diffusion equation and can be relevant in treating\n", + " financial and economical problems where the stable probability distributions play a key role.\n", + "\n", + " 2. Optimal Shape Design for Stokes Flow Via Minimax Differentiability\n", + " [math.OC]\n", + " We apply an gradient type algorithm to our problem. Numerical examples show that our theory is useful for practical purpose and the\n", + " proposed algorithm is feasible.\n", + "\n", + " 3. Testing turbulence model at metric scales with mid-infrared VISIR images at the VLT\n", + " [astro-ph]\n", + " The image quality improves in the infrared faster than the standard lambda^{-1/5} scaling and may be diffraction-limited at 30-m\n", + " apertures even without adaptive optics at wavelengths longer than 8 micron.\n", + "\n", + " 4. Exponential growth rates in a typed branching diffusion\n", + " [math.PR]\n", + " We also briefly discuss applications to traveling wave solutions of an associated reaction--diffusion equation.\n", + "\n", + " 5. The small deviations of many-dimensional diffusion processes and rarefaction by boundaries\n", + " [math.PR math.AP]\n", + " We lead the algorithm of expansion of sojourn probability of many-dimensional diffusion processes in small domain. The principal member\n", + " of this expansion defines normalizing coefficient for special limit theorems.\n", + "\n" + ] + } + ], + "source": [ + "def retrieve_three_repr(query, limit=10):\n", + " dense, sparse = embed_query(query)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense, using=\"dense_chunk\", limit=50),\n", + " models.Prefetch(query=dense, using=\"dense_title\", limit=50),\n", + " models.Prefetch(query=sparse, using=\"sparse_keywords\", limit=50),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_three_repr)\n" + ] + }, + { + "cell_type": "markdown", + "id": "1fed2f91", + "metadata": {}, + "source": [ + "## Step 4: Group by Document\n", + "\n", + "So far results are chunks, and the same paper can appear multiple times in the top 10. 
Most consumers want one entry per document with the top chunks attached: a results UI, a citation list, an LLM that needs document-level attribution.\n", + "\n", + "`query_points_groups` collapses chunks back to documents using `group_by=\"document_id\"`. Each group's `hits` field carries the top-`group_size` chunks for that paper.\n", + "\n", + "A few things worth knowing:\n", + "\n", + "- Grouping is a *presentation* choice, not a relevance technique. The candidates and their fused scores don't change; only the result shape does.\n", + "- Increase the prefetch `limit` when grouping. If a paper has three good chunks but the prefetch only returned two, grouping doesn't have the third to consider.\n", + "- Use the `with_lookup` parameter when document-level metadata (full title, authors, dates) lives in a separate collection. It fetches one record per group instead of repeating it per chunk.\n", + "\n", + "When *not* to group: when an LLM benefits from seeing several independently ranked chunks across multiple documents in its context window. Collapsing those into per-document groups throws away ordering information the LLM could have used.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "1694ce42", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: 'diffusion models for image synthesis'\n", + "\n", + " 1. Specific heat and bimodality in canonical and grand canonical versions of the thermodynamic model\n", + " [nucl-th]\n", + " We address two issues in the thermodynamic model for nuclear disassembly. Surprisingly large differences in results for specific heat\n", + " were seen in predictions from the canonical and grand canonical ensembles when the nuclear system passes from liquid...\n", + "\n", + " 2. Modeling the three-point correlation function\n", + " [astro-ph]\n", + " We present new predictions for the galaxy three-point correlation function (3PCF) using high-resolution dissipationless cosmological\n", + " simulations of a flat LCDM Universe which resolve galaxy-size halos and subhalos. We create realistic mock galaxy cat...\n", + "\n", + " 3. Interpolating and sampling sequences in finite Riemann surfaces\n", + " [math.CV]\n", + " We provide a description of the interpolating and sampling sequences on a space of holomorphic functions with a uniform growth\n", + " restriction defined on finite Riemann surfaces.\n", + "\n", + " 4. Probability distributions generated by fractional diffusion equations\n", + " [cond-mat.stat-mech]\n", + " Fractional calculus allows one to generalize the linear, one-dimensional, diffusion equation by replacing either the first time\n", + " derivative or the second space derivative by a derivative of fractional order. The fundamental solutions of these equation...\n", + "\n", + " 5. Optimal Shape Design for Stokes Flow Via Minimax Differentiability\n", + " [math.OC]\n", + " We apply an gradient type algorithm to our problem. 
Numerical examples show that our theory is useful for practical purpose and the\n", + " proposed algorithm is feasible.\n", + "\n" + ] + } + ], + "source": [ + "def retrieve_grouped(query, limit=10, group_size=3):\n", + " dense, sparse = embed_query(query)\n", + " return client.query_points_groups(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense, using=\"dense_chunk\", limit=100),\n", + " models.Prefetch(query=dense, using=\"dense_title\", limit=100),\n", + " models.Prefetch(query=sparse, using=\"sparse_keywords\", limit=100),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " group_by=\"document_id\",\n", + " group_size=group_size,\n", + " limit=limit,\n", + " ).groups\n", + "\n", + "show_results(retrieve_grouped)\n" + ] + }, + { + "cell_type": "markdown", + "id": "83c7905e", + "metadata": {}, + "source": [ + "## Step 5: Score Boosting With a Formula\n", + "\n", + "When you have ranking preferences that aren't captured by similarity alone (recency, source authority, geographic proximity, structured boosts), swap RRF for a `FormulaQuery`. Formulas operate on the prefetch scores and payload fields:\n", + "\n", + "- `$score[i]` references the score from prefetch `i`. Prefetch order is load-bearing.\n", + "- The `defaults` map covers candidates that appeared in one prefetch but not another. Without it, a missing variable would error.\n", + "\n", + "The formula below sums the chunk score with a half-weighted title score and a smaller sparse contribution. Unlike RRF, this is a linear combination of raw scores and is fragile across query types unless you've held the weights up against representative queries. Treat the specific weights here as illustrative; the mechanism is the point.\n", + "\n", + "Formula vs reranker:\n", + "\n", + "- **Formula API**: structured preferences known up front (recency decay, source authority, geo proximity, content-type boosts). Cheap and deterministic.\n", + "- **Reranker** (a late-interaction or cross-encoder model): preferences that are \"this is more relevant than that\" but you can't easily express why in a closed form. Expensive but learns what you can't articulate.\n", + "\n", + "For time decay on a `published_at` payload field, swap the title term for an `exp_decay` expression from Qdrant's [decay functions reference](https://qdrant.tech/documentation/search/search-relevance/#decay-functions).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "d25beee5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: 'diffusion models for image synthesis'\n", + "\n", + " 1. Testing turbulence model at metric scales with mid-infrared VISIR images at the VLT\n", + " [astro-ph]\n", + " We probe turbulence structure from centimetric to metric scales by simultaneous imagery at mid-infrared and visible wavelengths at the\n", + " VLT telescope and show that it departs significantly from the commonly used Kolmogorov model. The data can be fitte...\n", + "\n", + " 2. Exponential growth rates in a typed branching diffusion\n", + " [math.PR]\n", + " We also briefly discuss applications to traveling wave solutions of an associated reaction--diffusion equation.\n", + "\n", + " 3. The small deviations of many-dimensional diffusion processes and rarefaction by boundaries\n", + " [math.PR math.AP]\n", + " We lead the algorithm of expansion of sojourn probability of many-dimensional diffusion processes in small domain. 
The principal member\n", + " of this expansion defines normalizing coefficient for special limit theorems.\n", + "\n", + " 4. Probability distributions generated by fractional diffusion equations\n", + " [cond-mat.stat-mech]\n", + " Fractional calculus allows one to generalize the linear, one-dimensional, diffusion equation by replacing either the first time\n", + " derivative or the second space derivative by a derivative of fractional order. The fundamental solutions of these equation...\n", + "\n", + " 5. Turbulent Diffusion of Lines and Circulations\n", + " [physics.flu-dyn physics.plasm-ph]\n", + " We study material lines and passive vectors in a model of turbulent flow at infinite-Reynolds number, the Kraichnan-Kazantsev ensemble\n", + " of velocities that are white-noise in time and rough (Hoelder continuous) in space. It is argued that the phenomeno...\n", + "\n" + ] + } + ], + "source": [ + "def retrieve_boosted(query, limit=10, group_size=3):\n", + " dense, sparse = embed_query(query)\n", + " return client.query_points_groups(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " # $score[0] = chunk, $score[1] = title, $score[2] = sparse\n", + " models.Prefetch(query=dense, using=\"dense_chunk\", limit=100),\n", + " models.Prefetch(query=dense, using=\"dense_title\", limit=100),\n", + " models.Prefetch(query=sparse, using=\"sparse_keywords\", limit=100),\n", + " ],\n", + " query=models.FormulaQuery(\n", + " formula=models.SumExpression(sum=[\n", + " \"$score[0]\",\n", + " models.MultExpression(mult=[0.5, \"$score[1]\"]),\n", + " models.MultExpression(mult=[0.3, \"$score[2]\"]),\n", + " ]),\n", + " defaults={\"$score[1]\": 0.0, \"$score[2]\": 0.0},\n", + " ),\n", + " group_by=\"document_id\",\n", + " group_size=group_size,\n", + " limit=limit,\n", + " ).groups\n", + "\n", + "show_results(retrieve_boosted)\n" + ] + }, + { + "cell_type": "markdown", + "id": "ca1e7741", + "metadata": {}, + "source": [ + "## Wrap-up\n", + "\n", + "That's the recommended multi-representation pipeline end to end. The same schema works for any corpus with title-like, summary-like, and body-like representations. 
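For instance, an illustrative mapping for a product catalog (not a prescription): product name -> `dense_title`, marketing copy -> `dense_summary`, spec-sheet sections -> `dense_chunk`, brand and category facets -> `sparse_keywords`. 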
Swap the dataset, retune which representations earn their prefetch slots for your data, and wire in formula-based ranking preferences as needed.\n", + "\n", + "For the design rationale and references, see the [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/).\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file From 12993cf251e713e54cc93259b9dede9649c8959f Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Wed, 6 May 2026 16:51:57 -0400 Subject: [PATCH 2/8] Improve category filter and scale corpus --- .../multi-representation-search.ipynb | 256 +++++------------- multi-representation-search/probe_queries.py | 150 ++++++++++ 2 files changed, 223 insertions(+), 183 deletions(-) create mode 100644 multi-representation-search/probe_queries.py diff --git a/multi-representation-search/multi-representation-search.ipynb b/multi-representation-search/multi-representation-search.ipynb index 7c9cc10..0a12771 100644 --- a/multi-representation-search/multi-representation-search.ipynb +++ b/multi-representation-search/multi-representation-search.ipynb @@ -40,7 +40,11 @@ "cell_type": "markdown", "id": "c1e8c733", "metadata": {}, - "source": "## Dataset\n\nTwo thousand arXiv papers from the [`gfissore/arxiv-abstracts-2021`](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021) Hugging Face dataset, filtered to machine-learning and computer-science categories (`cs.LG`, `cs.CV`, `cs.CL`, `cs.AI`, `stat.ML`) so the queries below have natural matches. Each paper carries the four representations we want to fuse over: a `title` (a few topical tokens), an `abstract` (which we use both as a summary and as the source for chunked body content), and `categories` (controlled-vocabulary tags). Swap in any other arXiv source as long as it exposes title, abstract, and categories.\n" + "source": [ + "## Dataset\n", + "\n", + "20 000 arXiv papers from the [`gfissore/arxiv-abstracts-2021`](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021) Hugging Face dataset, filtered to ML/CS and to papers from 2018 onward. Each paper exposes a `title`, `abstract`, and `categories` (which this dataset returns as space-joined strings, so we split them before filtering). Swap in any other arXiv source as long as it exposes those three fields." 
+ ] }, { "cell_type": "code", @@ -48,7 +52,40 @@ "id": "ed2823ba", "metadata": {}, "outputs": [], - "source": "from datasets import load_dataset\n\nML_CATEGORIES = {\"cs.LG\", \"cs.CV\", \"cs.CL\", \"cs.AI\", \"stat.ML\"}\n\ndataset = load_dataset(\n \"gfissore/arxiv-abstracts-2021\", split=\"train\", streaming=True\n)\npapers = []\nfor row in dataset:\n if len(papers) >= 2000:\n break\n if not row[\"abstract\"] or not row[\"title\"]:\n continue\n cats = list(row[\"categories\"])\n if not any(c in ML_CATEGORIES for c in cats):\n continue # ML/CS papers only\n papers.append({\n \"arxiv_id\": row[\"id\"],\n \"title\": row[\"title\"].strip(),\n \"abstract\": row[\"abstract\"].strip(),\n \"categories\": cats,\n })\nprint(f\"Loaded {len(papers)} papers\")\n" + "source": [ + "from datasets import load_dataset\n", + "\n", + "ML_CATEGORIES = {\"cs.LG\", \"cs.CV\", \"cs.CL\", \"cs.AI\", \"stat.ML\"}\n", + "\n", + "# Non-streaming so HF caches the parquet locally; first run downloads ~2.5 GB, re-runs are instant.\n", + "dataset = load_dataset(\"gfissore/arxiv-abstracts-2021\", split=\"train\")\n", + "\n", + "papers = []\n", + "# IDs are roughly chronological; iterate from the end to land on 2021/2020/2019 papers first.\n", + "for i in range(len(dataset) - 1, -1, -1):\n", + " if len(papers) >= 20000:\n", + " break\n", + " row = dataset[i]\n", + " if not row[\"abstract\"] or not row[\"title\"]:\n", + " continue\n", + " # categories arrive as space-joined strings (e.g. [\"cs.LG cs.CV\"]); split each entry.\n", + " cats = [tok for entry in row[\"categories\"] for tok in entry.split()]\n", + " if not any(c in ML_CATEGORIES for c in cats):\n", + " continue\n", + " # Year lives in the YYMM prefix of new-format arXiv IDs (\"2104.01234\" -> 2021).\n", + " arxiv_id = row[\"id\"]\n", + " if \"/\" in arxiv_id or \".\" not in arxiv_id:\n", + " continue # skip pre-2007 IDs like \"math/0506001\"\n", + " if 2000 + int(arxiv_id[:2]) < 2018:\n", + " continue\n", + " papers.append({\n", + " \"arxiv_id\": arxiv_id,\n", + " \"title\": row[\"title\"].strip(),\n", + " \"abstract\": row[\"abstract\"].strip(),\n", + " \"categories\": cats,\n", + " })\n", + "print(f\"Loaded {len(papers)} papers\")\n" + ] }, { "cell_type": "markdown", @@ -97,7 +134,18 @@ "cell_type": "markdown", "id": "295e1a01", "metadata": {}, - "source": "## Ingestion\n\nEmbeddings are generated locally with [FastEmbed](https://qdrant.tech/documentation/fastembed/):\n\n- `BAAI/bge-small-en-v1.5` (384-dim, ~67 MB) for the three dense vectors. Trained with retrieval-specific contrastive objectives, which is what this tutorial does.\n- `Qdrant/bm25` for the sparse vector. The IDF modifier on the collection means Qdrant computes inverse-document-frequency weights at query time across the corpus.\n\nChunking uses a fixed two-sentence window for clarity. Chunking strategy has a real effect on retrieval quality and is its own design space (hierarchical, late, semantic chunking are all worth comparing). For now: one point per chunk, with the title and summary embeddings copied onto every chunk of the same paper.\n\nThe loop is deliberately straightforward (one paper at a time) so the per-vector logic stays easy to follow. The first run downloads the FastEmbed models; subsequent runs reuse the local cache. 
On a laptop CPU expect a few minutes for 2000 papers.\n" + "source": [ + "## Ingestion\n", + "\n", + "Embeddings are generated locally with [FastEmbed](https://qdrant.tech/documentation/fastembed/):\n", + "\n", + "- `BAAI/bge-small-en-v1.5` (384-dim, ~67 MB) for the three dense vectors. Trained with retrieval-specific contrastive objectives, which is what this tutorial does.\n", + "- `Qdrant/bm25` for the sparse vector. The IDF modifier on the collection means Qdrant computes inverse-document-frequency weights at query time across the corpus.\n", + "\n", + "Chunking uses a fixed two-sentence window for clarity. Chunking strategy has a real effect on retrieval quality and is its own design space (hierarchical, late, semantic chunking are all worth comparing). For now: one point per chunk, with the title and summary embeddings copied onto every chunk of the same paper.\n", + "\n", + "The loop is deliberately straightforward (one paper at a time) so the per-vector logic stays easy to follow. The first run downloads the FastEmbed models; subsequent runs reuse the local cache. On a laptop CPU expect roughly 15–20 minutes for 20000 papers." + ] }, { "cell_type": "code", @@ -181,7 +229,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "f70b01f8", "metadata": {}, "outputs": [], @@ -232,43 +280,10 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "566dbbbd", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Query: 'diffusion models for image synthesis'\n", - "\n", - " 1. Optimal Shape Design for Stokes Flow Via Minimax Differentiability\n", - " [math.OC]\n", - " We apply an gradient type algorithm to our problem. Numerical examples show that our theory is useful for practical purpose and the\n", - " proposed algorithm is feasible.\n", - "\n", - " 2. The small deviations of many-dimensional diffusion processes and rarefaction by boundaries\n", - " [math.PR math.AP]\n", - " We lead the algorithm of expansion of sojourn probability of many-dimensional diffusion processes in small domain. The principal member\n", - " of this expansion defines normalizing coefficient for special limit theorems.\n", - "\n", - " 3. Kinetic equation for finite systems of fermions with pairing\n", - " [nucl-th]\n", - " As a consequence, the density fluctuation and the longitudinal response function given by this approximation contain spurious\n", - " contributions. A simple prescription for restoring both local and global particle-number conservation is proposed\n", - "\n", - " 4. Exponential growth rates in a typed branching diffusion\n", - " [math.PR]\n", - " We also briefly discuss applications to traveling wave solutions of an associated reaction--diffusion equation.\n", - "\n", - " 5. Order of Epitaxial Self-Assembled Quantum Dots: Linear Analysis\n", - " [cond-mat.mtrl-sci]\n", - " It is likely that these two types of order are strongly linked; thus, a study of spatial order will also have strong implications for\n", - " size order. 
Here a study of spatial order is undertaken using a linear analysis of a commonly used model of SAQD for...\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "def retrieve_baseline(query, limit=10):\n", " dense, _ = embed_query(query)\n", @@ -298,43 +313,10 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "44b0f157", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Query: 'diffusion models for image synthesis'\n", - "\n", - " 1. Optimal Shape Design for Stokes Flow Via Minimax Differentiability\n", - " [math.OC]\n", - " We apply an gradient type algorithm to our problem. Numerical examples show that our theory is useful for practical purpose and the\n", - " proposed algorithm is feasible.\n", - "\n", - " 2. Testing turbulence model at metric scales with mid-infrared VISIR images at the VLT\n", - " [astro-ph]\n", - " The image quality improves in the infrared faster than the standard lambda^{-1/5} scaling and may be diffraction-limited at 30-m\n", - " apertures even without adaptive optics at wavelengths longer than 8 micron.\n", - "\n", - " 3. The small deviations of many-dimensional diffusion processes and rarefaction by boundaries\n", - " [math.PR math.AP]\n", - " We lead the algorithm of expansion of sojourn probability of many-dimensional diffusion processes in small domain. The principal member\n", - " of this expansion defines normalizing coefficient for special limit theorems.\n", - "\n", - " 4. Exponential growth rates in a typed branching diffusion\n", - " [math.PR]\n", - " We also briefly discuss applications to traveling wave solutions of an associated reaction--diffusion equation.\n", - "\n", - " 5. Testing turbulence model at metric scales with mid-infrared VISIR images at the VLT\n", - " [astro-ph]\n", - " We probe turbulence structure from centimetric to metric scales by simultaneous imagery at mid-infrared and visible wavelengths at the\n", - " VLT telescope and show that it departs significantly from the commonly used Kolmogorov model. The data can be fitte...\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "def retrieve_hybrid(query, limit=10):\n", " dense, sparse = embed_query(query)\n", @@ -367,43 +349,10 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "id": "b62d81a9", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Query: 'diffusion models for image synthesis'\n", - "\n", - " 1. Probability distributions generated by fractional diffusion equations\n", - " [cond-mat.stat-mech]\n", - " This property is a noteworthy generalization of what happens for the standard diffusion equation and can be relevant in treating\n", - " financial and economical problems where the stable probability distributions play a key role.\n", - "\n", - " 2. Optimal Shape Design for Stokes Flow Via Minimax Differentiability\n", - " [math.OC]\n", - " We apply an gradient type algorithm to our problem. Numerical examples show that our theory is useful for practical purpose and the\n", - " proposed algorithm is feasible.\n", - "\n", - " 3. Testing turbulence model at metric scales with mid-infrared VISIR images at the VLT\n", - " [astro-ph]\n", - " The image quality improves in the infrared faster than the standard lambda^{-1/5} scaling and may be diffraction-limited at 30-m\n", - " apertures even without adaptive optics at wavelengths longer than 8 micron.\n", - "\n", - " 4. 
Exponential growth rates in a typed branching diffusion\n", - " [math.PR]\n", - " We also briefly discuss applications to traveling wave solutions of an associated reaction--diffusion equation.\n", - "\n", - " 5. The small deviations of many-dimensional diffusion processes and rarefaction by boundaries\n", - " [math.PR math.AP]\n", - " We lead the algorithm of expansion of sojourn probability of many-dimensional diffusion processes in small domain. The principal member\n", - " of this expansion defines normalizing coefficient for special limit theorems.\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "def retrieve_three_repr(query, limit=10):\n", " dense, sparse = embed_query(query)\n", @@ -443,44 +392,10 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "id": "1694ce42", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Query: 'diffusion models for image synthesis'\n", - "\n", - " 1. Specific heat and bimodality in canonical and grand canonical versions of the thermodynamic model\n", - " [nucl-th]\n", - " We address two issues in the thermodynamic model for nuclear disassembly. Surprisingly large differences in results for specific heat\n", - " were seen in predictions from the canonical and grand canonical ensembles when the nuclear system passes from liquid...\n", - "\n", - " 2. Modeling the three-point correlation function\n", - " [astro-ph]\n", - " We present new predictions for the galaxy three-point correlation function (3PCF) using high-resolution dissipationless cosmological\n", - " simulations of a flat LCDM Universe which resolve galaxy-size halos and subhalos. We create realistic mock galaxy cat...\n", - "\n", - " 3. Interpolating and sampling sequences in finite Riemann surfaces\n", - " [math.CV]\n", - " We provide a description of the interpolating and sampling sequences on a space of holomorphic functions with a uniform growth\n", - " restriction defined on finite Riemann surfaces.\n", - "\n", - " 4. Probability distributions generated by fractional diffusion equations\n", - " [cond-mat.stat-mech]\n", - " Fractional calculus allows one to generalize the linear, one-dimensional, diffusion equation by replacing either the first time\n", - " derivative or the second space derivative by a derivative of fractional order. The fundamental solutions of these equation...\n", - "\n", - " 5. Optimal Shape Design for Stokes Flow Via Minimax Differentiability\n", - " [math.OC]\n", - " We apply an gradient type algorithm to our problem. Numerical examples show that our theory is useful for practical purpose and the\n", - " proposed algorithm is feasible.\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "def retrieve_grouped(query, limit=10, group_size=3):\n", " dense, sparse = embed_query(query)\n", @@ -524,43 +439,10 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "id": "d25beee5", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Query: 'diffusion models for image synthesis'\n", - "\n", - " 1. Testing turbulence model at metric scales with mid-infrared VISIR images at the VLT\n", - " [astro-ph]\n", - " We probe turbulence structure from centimetric to metric scales by simultaneous imagery at mid-infrared and visible wavelengths at the\n", - " VLT telescope and show that it departs significantly from the commonly used Kolmogorov model. The data can be fitte...\n", - "\n", - " 2. 
Exponential growth rates in a typed branching diffusion\n", - " [math.PR]\n", - " We also briefly discuss applications to traveling wave solutions of an associated reaction--diffusion equation.\n", - "\n", - " 3. The small deviations of many-dimensional diffusion processes and rarefaction by boundaries\n", - " [math.PR math.AP]\n", - " We lead the algorithm of expansion of sojourn probability of many-dimensional diffusion processes in small domain. The principal member\n", - " of this expansion defines normalizing coefficient for special limit theorems.\n", - "\n", - " 4. Probability distributions generated by fractional diffusion equations\n", - " [cond-mat.stat-mech]\n", - " Fractional calculus allows one to generalize the linear, one-dimensional, diffusion equation by replacing either the first time\n", - " derivative or the second space derivative by a derivative of fractional order. The fundamental solutions of these equation...\n", - "\n", - " 5. Turbulent Diffusion of Lines and Circulations\n", - " [physics.flu-dyn physics.plasm-ph]\n", - " We study material lines and passive vectors in a model of turbulent flow at infinite-Reynolds number, the Kraichnan-Kazantsev ensemble\n", - " of velocities that are white-noise in time and rough (Hoelder continuous) in space. It is argued that the phenomeno...\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "def retrieve_boosted(query, limit=10, group_size=3):\n", " dense, sparse = embed_query(query)\n", @@ -603,15 +485,23 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", "name": "python", - "version": "3.11" + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/multi-representation-search/probe_queries.py b/multi-representation-search/probe_queries.py new file mode 100644 index 0000000..034c434 --- /dev/null +++ b/multi-representation-search/probe_queries.py @@ -0,0 +1,150 @@ +"""Try candidate queries against the multi-representation collection and print +all 5 step outputs side-by-side so we can pick the one with the best narrative arc.""" + +from qdrant_client import QdrantClient, models +from fastembed import TextEmbedding, SparseTextEmbedding + +COLLECTION = "arxiv_multi_repr" +client = QdrantClient("http://localhost:6333") +dense_model = TextEmbedding("BAAI/bge-small-en-v1.5") +sparse_model = SparseTextEmbedding("Qdrant/bm25") + + +def to_sparse(s): + return models.SparseVector(indices=s.indices.tolist(), values=s.values.tolist()) + + +def embed(query): + d = next(iter(dense_model.query_embed([query]))).tolist() + s = to_sparse(next(iter(sparse_model.query_embed([query])))) + return d, s + + +def step1(q, k=5): + d, _ = embed(q) + return client.query_points(COLLECTION, query=d, using="dense_chunk", limit=k).points + + +def step2(q, k=5): + d, s = embed(q) + return client.query_points( + COLLECTION, + prefetch=[ + models.Prefetch(query=d, using="dense_chunk", limit=50), + models.Prefetch(query=s, using="sparse_keywords", limit=50), + ], + query=models.FusionQuery(fusion=models.Fusion.RRF), + limit=k, + ).points + + +def step3(q, k=5): + d, s = embed(q) + return client.query_points( + COLLECTION, + prefetch=[ + models.Prefetch(query=d, using="dense_chunk", limit=50), + models.Prefetch(query=d, 
using="dense_title", limit=50), + models.Prefetch(query=s, using="sparse_keywords", limit=50), + ], + query=models.FusionQuery(fusion=models.Fusion.RRF), + limit=k, + ).points + + +def step4(q, k=5): + d, s = embed(q) + return client.query_points_groups( + COLLECTION, + prefetch=[ + models.Prefetch(query=d, using="dense_chunk", limit=100), + models.Prefetch(query=d, using="dense_title", limit=100), + models.Prefetch(query=s, using="sparse_keywords", limit=100), + ], + query=models.FusionQuery(fusion=models.Fusion.RRF), + group_by="document_id", + group_size=3, + limit=k, + ).groups + + +def step5(q, k=5): + d, s = embed(q) + return client.query_points_groups( + COLLECTION, + prefetch=[ + models.Prefetch(query=d, using="dense_chunk", limit=100), + models.Prefetch(query=d, using="dense_title", limit=100), + models.Prefetch(query=s, using="sparse_keywords", limit=100), + ], + query=models.FormulaQuery( + formula=models.SumExpression(sum=[ + "$score[0]", + models.MultExpression(mult=[0.5, "$score[1]"]), + models.MultExpression(mult=[0.3, "$score[2]"]), + ]), + defaults={"$score[1]": 0.0, "$score[2]": 0.0}, + ), + group_by="document_id", + group_size=3, + limit=k, + ).groups + + +def title_of(item): + p = item.hits[0] if hasattr(item, "hits") else item + return p.payload["title"].replace("\n", " ").strip() + + +def doc_id_of(item): + p = item.hits[0] if hasattr(item, "hits") else item + return p.payload["document_id"] + + +def run_query(q): + print(f"\n{'=' * 100}\nQUERY: {q!r}\n{'=' * 100}") + for name, fn in [("step1 dense-only", step1), + ("step2 +sparse RRF", step2), + ("step3 +title RRF", step3), + ("step4 grouped ", step4), + ("step5 formula ", step5)]: + results = fn(q, k=5) + seen = set() + dup_count = 0 + lines = [] + for i, item in enumerate(results, 1): + doc = doc_id_of(item) + t = title_of(item) + dup = " [DUP]" if doc in seen else "" + if doc in seen: + dup_count += 1 + seen.add(doc) + lines.append(f" {i}. 
{t[:90]}{dup}") + print(f"\n [{name}] unique_docs={len(seen)} dup_chunks={dup_count}") + for ln in lines: + print(ln) + + +CANDIDATES = [ + "diffusion models for image synthesis", # current + "transformer architecture for language modeling", + "adversarial examples in deep learning", + "contrastive self-supervised representation learning", + "graph neural networks for node classification", + "reinforcement learning from human feedback", + "neural machine translation with attention", + "knowledge distillation for model compression", + "object detection in autonomous driving", + "few-shot learning with meta-learning", + "BERT pretraining for sentence classification", + "convolutional neural network for image classification", + "generative adversarial networks for image generation", + "variational autoencoder for representation learning", + "speech recognition with recurrent neural networks", +] + +if __name__ == "__main__": + import sys + queries = sys.argv[1:] if len(sys.argv) > 1 else CANDIDATES + for q in queries: + run_query(q) From daac9080219864f3f61b110ed591c8fe4cd062de Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Mon, 11 May 2026 20:36:32 -0400 Subject: [PATCH 3/8] update notebook to match tutorial: Cloud Inference, core BM25, categories as filter Brings the notebook to 1:1 parity with the refactored tutorial: - Switches to Qdrant Cloud Inference for dense + core BM25 for sparse (drops FastEmbed from the notebook dependencies) - Renames dense_summary to dense_abstract and sparse_keywords to sparse_title - Moves categories from BM25 input to a filterable payload field with a keyword index; sparse_title now indexes only the title (avg_len=10) - Adds dense_abstract as a fourth prefetch with its own Step 4, so the build-up now reads: chunk -> +sparse -> +title -> +abstract -> group -> formula - Adds an optional tags filter to retrieve_grouped - Updates probe_queries.py to match the new schema and step structure --- .../multi-representation-search.ipynb | 355 ++---------------- multi-representation-search/probe_queries.py | 56 ++- 2 files changed, 75 insertions(+), 336 deletions(-) diff --git a/multi-representation-search/multi-representation-search.ipynb b/multi-representation-search/multi-representation-search.ipynb index 0a12771..83ae3fd 100644 --- a/multi-representation-search/multi-representation-search.ipynb +++ b/multi-representation-search/multi-representation-search.ipynb @@ -4,27 +4,13 @@ "cell_type": "markdown", "id": "2153bba9", "metadata": {}, - "source": [ - "# Multi-Representation Search: Step-by-Step Build-Up\n", - "\n", - "A document is rarely well-represented by a single embedding. A research paper has a title, an abstract, body chunks, and category tags. Each carries a different signal, and squashing all four into one dense vector loses most of that structure: the title gets averaged out, keyword matches on tags disappear, and chunk-level grounding for downstream reasoning is gone.\n", - "\n", - "This notebook builds a Qdrant retrieval pipeline that uses each representation deliberately. Over five steps you'll go from a naive dense-only baseline to a fully fused pipeline with three named-vector prefetches, Reciprocal Rank Fusion, document-level grouping, and optional formula-based score boosting. 
After each step you'll run the same query and see the top retrieved papers change.\n", - "\n", - "The design rationale (why each component is there, when to use it, when not to) lives in the accompanying [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/). This notebook focuses on running the code and watching the result list shift.\n", - "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/qdrant/examples/blob/master/multi-representation-search/multi-representation-search.ipynb)\n" - ] + "source": "# Multi-Representation Search: Step-by-Step Build-Up\n\nA document is rarely well-represented by a single embedding. A research paper has a title, an abstract, body chunks, and category tags, each carrying a different signal. Treat all four as one dense vector and the title gets averaged out; chunk-level grounding for downstream reasoning disappears.\n\nThis notebook builds a Qdrant retrieval pipeline that uses each representation deliberately. Over six steps you'll go from a naive dense-only baseline to a fully fused pipeline with four named-vector prefetches, Reciprocal Rank Fusion, document-level grouping, and optional formula-based score boosting. After each step you'll run the same query and see the top retrieved papers change.\n\nThe design rationale (why each component is there, when to use it, when not to) lives in the accompanying [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/). This notebook focuses on running the code and watching the result list shift.\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/qdrant/examples/blob/master/multi-representation-search/multi-representation-search.ipynb)\n" }, { "cell_type": "markdown", "id": "4b597568", "metadata": {}, - "source": [ - "## Requirements\n", - "\n", - "Use Python <3.13. Not all dependencies support the newest Python versions yet.\n" - ] + "source": "## Requirements\n\nThis notebook uses [Qdrant Cloud Inference](https://qdrant.tech/documentation/inference/#qdrant-cloud-inference) to generate embeddings server-side, so no client-side embedding library is required. The free tier covers this notebook's footprint. Core BM25 runs on any Qdrant instance, but dense Cloud Inference is Cloud-only. To self-host, generate dense vectors on the client with a library like [FastEmbed](https://qdrant.tech/documentation/fastembed/) and pass them as raw vectors instead of `models.Document`.\n" }, { "cell_type": "code", @@ -32,19 +18,13 @@ "id": "59028f90", "metadata": {}, "outputs": [], - "source": [ - "!pip install qdrant-client fastembed datasets" - ] + "source": "!pip install qdrant-client datasets\n" }, { "cell_type": "markdown", "id": "c1e8c733", "metadata": {}, - "source": [ - "## Dataset\n", - "\n", - "20 000 arXiv papers from the [`gfissore/arxiv-abstracts-2021`](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021) Hugging Face dataset, filtered to ML/CS and to papers from 2018 onward. Each paper exposes a `title`, `abstract`, and `categories` (which this dataset returns as space-joined strings, so we split them before filtering). Swap in any other arXiv source as long as it exposes those three fields." - ] + "source": "## Dataset\n\n20 000 ML/CS arXiv papers (2018 and later) from the [`gfissore/arxiv-abstracts-2021`](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021) dataset. 
Each paper has a `title`, `abstract`, and `categories` (which this dataset returns as space-joined strings, so we split them before filtering).\n" }, { "cell_type": "code", @@ -91,20 +71,7 @@ "cell_type": "markdown", "id": "26339a5a", "metadata": {}, - "source": [ - "## Schema\n", - "\n", - "One Qdrant collection. Each point is a chunk. Each chunk holds four named vectors that we'll fuse at query time:\n", - "\n", - "- `dense_chunk`: the chunk's own embedding (body content).\n", - "- `dense_title`: the paper title embedding (topical naming).\n", - "- `dense_summary`: the paper abstract embedding (contribution focus).\n", - "- `sparse_keywords`: BM25 over the title and tags concatenated (lexical matches on short structured fields).\n", - "\n", - "`dense_title` and `dense_summary` are duplicated across every chunk of the same paper. That trades a bit of storage for one-shot query fusion (one collection, one Query API call, no `lookup_from`). For the typical case (a few dozen chunks per paper, embeddings under a kilobyte each) it's the simpler choice.\n", - "\n", - "We use *named vectors*, not a multivector field. Multivectors are designed for late-interaction models like ColBERT, where the MaxSim comparator combines per-token subvectors into one score per point. Title, summary, and chunk vectors are different kinds of content, so MaxSim would collapse the per-representation signal we want to fuse. The [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/) covers the contrast.\n" - ] + "source": "## Schema\n\nOne Qdrant collection. Each point is a chunk. Each chunk holds four named vectors that we'll fuse at query time:\n\n- `dense_chunk`: the chunk's own embedding (body content).\n- `dense_title`: the paper title embedding (topical naming).\n- `dense_abstract`: the paper abstract embedding (paper-level view).\n- `sparse_title`: BM25 over the title (lexical matches on rare entity names, jargon, specific model or paper names).\n\nCategories live in the `tags` payload with a keyword index, so queries can pre-filter by category.\n\n`dense_title`, `dense_abstract`, and `sparse_title` are duplicated across every chunk of the same paper. That trades a bit of storage for one-shot query fusion (one collection, one Query API call, every representation reachable from any point). 
For the typical case (a few dozen chunks per paper, embeddings under a kilobyte each) it's the simpler choice.\n" }, { "cell_type": "code", @@ -112,40 +79,13 @@ "id": "788e1d18", "metadata": {}, "outputs": [], - "source": [ - "from qdrant_client import QdrantClient, models\n", - "\n", - "client = QdrantClient(\"http://localhost:6333\") # or QdrantClient(url=\"https://.cloud.qdrant.io\", api_key=\"...\") for Qdrant Cloud\n", - "\n", - "client.create_collection(\n", - " collection_name=\"arxiv_multi_repr\",\n", - " vectors_config={\n", - " \"dense_chunk\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", - " \"dense_title\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", - " \"dense_summary\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n", - " },\n", - " sparse_vectors_config={\n", - " \"sparse_keywords\": models.SparseVectorParams(modifier=models.Modifier.IDF),\n", - " },\n", - ")" - ] + "source": "from qdrant_client import QdrantClient, models\n\nclient = QdrantClient(\n url=\"https://xyz-example.qdrant.io:6333\",\n api_key=\"\",\n cloud_inference=True,\n)\n\n# 384 is the output dimension of sentence-transformers/all-minilm-l6-v2, used below for every dense vector.\nclient.create_collection(\n collection_name=\"arxiv_multi_repr\",\n vectors_config={\n \"dense_chunk\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n \"dense_title\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n \"dense_abstract\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n },\n sparse_vectors_config={\n \"sparse_title\": models.SparseVectorParams(modifier=models.Modifier.IDF),\n },\n)\n\n# Index the 'tags' payload as keyword so we can filter on category at query time.\nclient.create_payload_index(\n collection_name=\"arxiv_multi_repr\",\n field_name=\"tags\",\n field_schema=models.PayloadSchemaType.KEYWORD,\n)\n" }, { "cell_type": "markdown", "id": "295e1a01", "metadata": {}, - "source": [ - "## Ingestion\n", - "\n", - "Embeddings are generated locally with [FastEmbed](https://qdrant.tech/documentation/fastembed/):\n", - "\n", - "- `BAAI/bge-small-en-v1.5` (384-dim, ~67 MB) for the three dense vectors. Trained with retrieval-specific contrastive objectives, which is what this tutorial does.\n", - "- `Qdrant/bm25` for the sparse vector. The IDF modifier on the collection means Qdrant computes inverse-document-frequency weights at query time across the corpus.\n", - "\n", - "Chunking uses a fixed two-sentence window for clarity. Chunking strategy has a real effect on retrieval quality and is its own design space (hierarchical, late, semantic chunking are all worth comparing). For now: one point per chunk, with the title and summary embeddings copied onto every chunk of the same paper.\n", - "\n", - "The loop is deliberately straightforward (one paper at a time) so the per-vector logic stays easy to follow. The first run downloads the FastEmbed models; subsequent runs reuse the local cache. On a laptop CPU expect roughly 15–20 minutes for 20000 papers." 
- ] + "source": "## Ingestion\n\nEmbeddings are generated server-side via Qdrant Cloud Inference:\n\n- `sentence-transformers/all-minilm-l6-v2` (384-dim) for the three dense vectors.\n- `qdrant/bm25` (core BM25 since Qdrant 1.15) for the sparse vector, with `avg_len=10.0` calibrated for the title-only field (default is 256, calibrated for document-length text).\n\nChunking uses a fixed two-sentence window for simplicity; the right chunking strategy depends on your document structure. One point per chunk, with the title and abstract Documents reused across every chunk of the same paper.\n" }, { "cell_type": "code", @@ -153,79 +93,13 @@ "id": "725afca6", "metadata": {}, "outputs": [], - "source": [ - "from fastembed import TextEmbedding, SparseTextEmbedding\n", - "\n", - "# Dense embeddings for title, summary, and chunk content; sparse BM25 for keyword matching.\n", - "dense_model = TextEmbedding(\"BAAI/bge-small-en-v1.5\")\n", - "sparse_model = SparseTextEmbedding(\"Qdrant/bm25\")\n", - "\n", - "def chunk_sentences(text, target_len=2):\n", - " \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n", - " sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n", - " return [\". \".join(sentences[i:i + target_len])\n", - " for i in range(0, len(sentences), target_len)] or [text]\n", - "\n", - "def to_sparse(sparse_emb):\n", - " \"\"\"Convert FastEmbed's SparseEmbedding into a Qdrant SparseVector.\"\"\"\n", - " return models.SparseVector(\n", - " indices=sparse_emb.indices.tolist(),\n", - " values=sparse_emb.values.tolist(),\n", - " )\n", - "\n", - "\n", - "points = []\n", - "for paper in papers:\n", - " chunks = chunk_sentences(paper[\"abstract\"])\n", - "\n", - " # Paper-level embeddings: computed once per paper, reused across every chunk below.\n", - " # next(iter(...)) extracts the single vector from FastEmbed's generator output.\n", - " title_vec = next(iter(dense_model.embed([paper[\"title\"]]))).tolist()\n", - " summary_vec = next(iter(dense_model.embed([paper[\"abstract\"]]))).tolist()\n", - " sparse_vec = to_sparse(next(iter(sparse_model.embed(\n", - " [paper[\"title\"] + \" \" + \" \".join(paper[\"categories\"])]\n", - " ))))\n", - "\n", - " # Chunk-level dense embedding: one vector per chunk.\n", - " chunk_vecs = [v.tolist() for v in dense_model.embed(chunks)]\n", - "\n", - " # One Qdrant point per chunk. 
dense_title, dense_summary, and sparse_keywords\n", - " # are the same for every chunk of this paper; only dense_chunk varies.\n", - " for i, (chunk, chunk_vec) in enumerate(zip(chunks, chunk_vecs)):\n", - " points.append(models.PointStruct(\n", - " id=len(points),\n", - " vector={\n", - " \"dense_chunk\": chunk_vec,\n", - " \"dense_title\": title_vec,\n", - " \"dense_summary\": summary_vec,\n", - " \"sparse_keywords\": sparse_vec,\n", - " },\n", - " payload={\n", - " \"document_id\": paper[\"arxiv_id\"],\n", - " \"title\": paper[\"title\"],\n", - " \"tags\": paper[\"categories\"],\n", - " \"chunk_index\": i,\n", - " \"chunk_text\": chunk,\n", - " },\n", - " ))\n", - "\n", - "client.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=64)\n", - "print(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")\n" - ] + "source": "DENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\nBM25_MODEL = \"qdrant/bm25\"\n\ndef chunk_sentences(text, target_len=2):\n \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n return [\". \".join(sentences[i:i + target_len])\n for i in range(0, len(sentences), target_len)] or [text]\n\n\npoints = []\nfor paper in papers:\n chunks = chunk_sentences(paper[\"abstract\"])\n\n # Title, abstract, and sparse docs are reused across every chunk of this paper; only the chunk text varies.\n # Cloud Inference embeds each Document on the server, so you don't need a client-side embedding library.\n title_doc = models.Document(text=paper[\"title\"], model=DENSE_MODEL)\n abstract_doc = models.Document(text=paper[\"abstract\"], model=DENSE_MODEL)\n # avg_len is the average word count of the indexed text.\n # Default is 256 (document-length); setting it to the actual field length (~10 here) improves BM25 scoring accuracy.\n sparse_doc = models.Document(\n text=paper[\"title\"],\n model=BM25_MODEL,\n options={\"avg_len\": 10.0},\n )\n\n for i, chunk in enumerate(chunks):\n points.append(models.PointStruct(\n id=len(points),\n vector={\n \"dense_chunk\": models.Document(text=chunk, model=DENSE_MODEL),\n \"dense_title\": title_doc,\n \"dense_abstract\": abstract_doc,\n \"sparse_title\": sparse_doc,\n },\n payload={\n \"document_id\": paper[\"arxiv_id\"],\n \"title\": paper[\"title\"],\n \"tags\": paper[\"categories\"],\n \"chunk_index\": i,\n \"chunk_text\": chunk,\n },\n ))\n\nclient.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=64)\nprint(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")\n" }, { "cell_type": "markdown", "id": "61b1aa7b", "metadata": {}, - "source": [ - "## Query Helpers\n", - "\n", - "Three pieces used by every step below:\n", - "\n", - "- `embed_query(query)` produces the `(dense, sparse)` pair we feed into Qdrant. Both `dense_model` and `sparse_model` expose a `query_embed` method calibrated for queries: for BM25 it applies IDF weighting; for some dense models it applies a query-side prompt.\n", - "- `SAMPLE_QUERY` is the single query we run through every step so we can watch the same query produce different results as capabilities are added.\n", - "- `show_results(retrieve_fn)` runs the retrieve function and prints the top 5 results: title, category tags, and an excerpt from the matching chunk. 
Accepts both chunk-level results (Steps 1-3) and grouped results (Steps 4-5, where each result is a paper with several chunks).\n" - ] + "source": "## Query Helpers\n\nTwo pieces used by every step below:\n\n- `SAMPLE_QUERY` is the single query we run through every step so we can watch the same query produce different results as capabilities are added.\n- `show_results(retrieve_fn)` runs the retrieve function and prints the top 5 results: title, category tags, and an excerpt from the matching chunk. Accepts both chunk-level results (Steps 1-4) and grouped results (Steps 5-6, where each result is a paper with several chunks).\n" }, { "cell_type": "code", @@ -233,50 +107,13 @@ "id": "f70b01f8", "metadata": {}, "outputs": [], - "source": [ - "import textwrap\n", - "\n", - "def embed_query(query):\n", - " \"\"\"Produce a (dense, sparse) embedding pair for a query string.\"\"\"\n", - " dense = next(iter(dense_model.query_embed([query]))).tolist()\n", - " sparse = to_sparse(next(iter(sparse_model.query_embed([query]))))\n", - " return dense, sparse\n", - "\n", - "SAMPLE_QUERY = \"diffusion models for image synthesis\"\n", - "\n", - "def show_results(retrieve_fn, query=SAMPLE_QUERY, k=5):\n", - " \"\"\"Print top-k results as: title, category tags, and a matching-chunk excerpt.\"\"\"\n", - " print(f\"Query: {query!r}\\n\")\n", - " for i, item in enumerate(retrieve_fn(query, limit=k), 1):\n", - " # item is a Point (Steps 1-3) or a Group (Steps 4-5).\n", - " # For groups, hits[0] is the top chunk for that paper.\n", - " point = item.hits[0] if hasattr(item, \"hits\") else item\n", - " payload = point.payload\n", - " title = payload[\"title\"]\n", - " tags = payload.get(\"tags\", [])\n", - " # Collapse whitespace (including embedded newlines) so the excerpt prints cleanly.\n", - " chunk = \" \".join(payload[\"chunk_text\"].split())\n", - " excerpt = chunk[:250].rstrip() + (\"...\" if len(chunk) > 250 else \"\")\n", - " print(textwrap.fill(f\"{i}. {title}\", width=140, initial_indent=\" \", subsequent_indent=\" \"))\n", - " if tags:\n", - " print(f\" [{', '.join(str(t) for t in tags[:3])}]\")\n", - " print(textwrap.fill(excerpt, width=140, initial_indent=\" \", subsequent_indent=\" \"))\n", - " print()\n" - ] + "source": "import textwrap\n\nSAMPLE_QUERY = \"diffusion models for image synthesis\"\n\ndef show_results(retrieve_fn, query=SAMPLE_QUERY, k=5):\n \"\"\"Print top-k results as: title, category tags, and a matching-chunk excerpt.\"\"\"\n print(f\"Query: {query!r}\\n\")\n for i, item in enumerate(retrieve_fn(query, limit=k), 1):\n # item is a Point (Steps 1-4) or a Group (Steps 5-6).\n # For groups, hits[0] is the top chunk for that paper.\n point = item.hits[0] if hasattr(item, \"hits\") else item\n payload = point.payload\n title = payload[\"title\"]\n tags = payload.get(\"tags\", [])\n # Collapse whitespace (including embedded newlines) so the excerpt prints cleanly.\n chunk = \" \".join(payload[\"chunk_text\"].split())\n excerpt = chunk[:250].rstrip() + (\"...\" if len(chunk) > 250 else \"\")\n print(textwrap.fill(f\"{i}. 
{title}\", width=140, initial_indent=\" \", subsequent_indent=\" \"))\n if tags:\n print(f\" [{', '.join(str(t) for t in tags[:3])}]\")\n print(textwrap.fill(excerpt, width=140, initial_indent=\" \", subsequent_indent=\" \"))\n print()\n" }, { "cell_type": "markdown", "id": "4b9065fe", "metadata": {}, - "source": [ - "## Step 1: Dense Over Chunks (Baseline)\n", - "\n", - "The naive baseline: encode the query with the dense model, search against `dense_chunk` only, return the chunk-level results' parent papers. No fusion, no title or sparse signal.\n", - "\n", - "This is what most \"vector search\" tutorials stop at. It's a reasonable default for short, homogeneous corpora where the chunk text already carries the full signal. It systematically underperforms when the signal lives outside the chunk: in the title (topical naming), in tags (controlled vocabulary), or in keyword overlap that the embedding model has averaged out into a generic neighborhood.\n", - "\n", - "Each subsequent step closes one of those gaps.\n" - ] + "source": "## Step 1: Dense Over Chunks (Baseline)\n\nThe naive baseline: encode the query with the dense model, search against `dense_chunk` only, return the chunk-level results' parent papers. No fusion, no title or sparse signal.\n\nThis is what most \"vector search\" tutorials stop at. It's a reasonable default for short, homogeneous corpora where the chunk text already carries the full signal. It systematically underperforms when the signal lives outside the chunk: in the title (topical naming), or in keyword overlap that the embedding model has averaged out into a generic neighborhood.\n\nEach subsequent step closes one of those gaps.\n" }, { "cell_type": "code", @@ -284,32 +121,13 @@ "id": "566dbbbd", "metadata": {}, "outputs": [], - "source": [ - "def retrieve_baseline(query, limit=10):\n", - " dense, _ = embed_query(query)\n", - " return client.query_points(\n", - " collection_name=\"arxiv_multi_repr\",\n", - " query=dense,\n", - " using=\"dense_chunk\",\n", - " limit=limit,\n", - " ).points\n", - "\n", - "show_results(retrieve_baseline)\n" - ] + "source": "def retrieve_baseline(query, limit=10):\n return client.query_points(\n collection_name=\"arxiv_multi_repr\",\n query=models.Document(text=query, model=DENSE_MODEL),\n using=\"dense_chunk\",\n limit=limit,\n ).points\n\nshow_results(retrieve_baseline)\n" }, { "cell_type": "markdown", "id": "f710ce2f", "metadata": {}, - "source": [ - "## Step 2: Add Sparse Keywords With RRF\n", - "\n", - "Add a second prefetch: BM25 over title and tags. Then fuse the two ranked lists with **Reciprocal Rank Fusion (RRF)**.\n", - "\n", - "Why RRF instead of weighted averages of raw scores? RRF works on rank, not score. Dense scores live in [0, 1], sparse BM25 scores don't, and RRF doesn't have to reconcile the two. Linear weights are fragile: a weight that helps one query class hurts another, and the right weight depends on query length, model, and corpus.\n", - "\n", - "What does sparse add? Queries with rare entity names, jargon, or category tags often produce dense embeddings near generic neighborhoods. The sparse path catches those exact-token matches. RRF promotes documents both paths agree on.\n" - ] + "source": "## Step 2: Add Sparse Title With RRF\n\nAdd a second prefetch: BM25 over the title. Then fuse the two ranked lists with **Reciprocal Rank Fusion (RRF)**.\n\nWhy RRF instead of weighted averages of raw scores? RRF works on rank, not score. 
Dense scores live in [0, 1], sparse BM25 scores don't, and RRF doesn't have to reconcile the two. Linear weights are fragile: a weight that helps one query class hurts another, and the right weight depends on query length, model, and corpus.\n\nWhat does sparse add? Queries with rare entity names, jargon, or specific model/paper names often produce dense embeddings near generic neighborhoods. The sparse path catches those exact-token matches on the title. RRF promotes documents both paths agree on.\n" }, { "cell_type": "code", @@ -317,35 +135,13 @@ "id": "44b0f157", "metadata": {}, "outputs": [], - "source": [ - "def retrieve_hybrid(query, limit=10):\n", - " dense, sparse = embed_query(query)\n", - " return client.query_points(\n", - " collection_name=\"arxiv_multi_repr\",\n", - " prefetch=[\n", - " models.Prefetch(query=dense, using=\"dense_chunk\", limit=50),\n", - " models.Prefetch(query=sparse, using=\"sparse_keywords\", limit=50),\n", - " ],\n", - " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", - " limit=limit,\n", - " ).points\n", - "\n", - "show_results(retrieve_hybrid)\n" - ] + "source": "def retrieve_hybrid(query, limit=10):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n return client.query_points(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n ],\n query=models.FusionQuery(fusion=models.Fusion.RRF),\n limit=limit,\n ).points\n\nshow_results(retrieve_hybrid)\n" }, { "cell_type": "markdown", "id": "4bdf38f7", "metadata": {}, - "source": [ - "## Step 3: Add Title Prefetch\n", - "\n", - "Add a third prefetch: the same dense query vector, but searched against `dense_title` instead of `dense_chunk`. We're now fusing across three representations: chunk content, keyword hits, and topical naming.\n", - "\n", - "The title prefetch saves queries where the topic is named explicitly but not echoed in any single chunk. For example: \"diffusion models for high-resolution image synthesis\" surfaces a paper titled \"High-Resolution Image Synthesis with Latent Diffusion Models\" via the title path even when its chunks phrase the contribution differently. The chunk prefetch alone misses it; the title path catches it; RRF promotes it because both paths agree.\n", - "\n", - "A representation only earns its own prefetch if it carries signal independent of the others. We're not adding `dense_summary` as a fourth prefetch here because abstracts often paraphrase the chunks they came from. If your corpus has summaries that surface different content (human-written summaries of long technical reports, for example), adding a fourth prefetch is worth it.\n" - ] + "source": "## Step 3: Add Title Prefetch\n\nAdd a third prefetch: the same dense query vector, but searched against `dense_title` instead of `dense_chunk`. We're now fusing across three representations: chunk content, title (lexical), and title (semantic).\n\nThe title prefetch saves queries where the topic is named explicitly but not echoed in any single chunk. For example: \"diffusion models for high-resolution image synthesis\" surfaces a paper titled \"High-Resolution Image Synthesis with Latent Diffusion Models\" via the title path even when its chunks phrase the contribution differently. 
The chunk prefetch alone misses it; the title path catches it; RRF promotes it because both paths agree.\n" }, { "cell_type": "code", @@ -353,42 +149,27 @@ "id": "b62d81a9", "metadata": {}, "outputs": [], - "source": [ - "def retrieve_three_repr(query, limit=10):\n", - " dense, sparse = embed_query(query)\n", - " return client.query_points(\n", - " collection_name=\"arxiv_multi_repr\",\n", - " prefetch=[\n", - " models.Prefetch(query=dense, using=\"dense_chunk\", limit=50),\n", - " models.Prefetch(query=dense, using=\"dense_title\", limit=50),\n", - " models.Prefetch(query=sparse, using=\"sparse_keywords\", limit=50),\n", - " ],\n", - " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", - " limit=limit,\n", - " ).points\n", - "\n", - "show_results(retrieve_three_repr)\n" - ] + "source": "def retrieve_three_repr(query, limit=10):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n return client.query_points(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n models.Prefetch(query=dense_query, using=\"dense_title\", limit=50),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n ],\n query=models.FusionQuery(fusion=models.Fusion.RRF),\n limit=limit,\n ).points\n\nshow_results(retrieve_three_repr)\n" + }, + { + "cell_type": "markdown", + "id": "e59ce67e", + "source": "## Step 4: Add Abstract Prefetch\n\nAdd a fourth prefetch on `dense_abstract`. The abstract gives a paper-level view that sits between the title (very short) and individual chunks (very local). It catches queries that match the paper's overall framing rather than a single passage or the title's topical naming.\n\nIn a production setup where chunks are full paper bodies, the abstract is a meaningfully different representation. In this notebook's arXiv dataset (where chunks are 2-sentence slices of the abstract itself), the lift over Step 3 will be smaller because the abstract and the chunks share text. The prefetch is still worth wiring up; the pipeline shape is what generalizes to longer corpora.\n", + "metadata": {} + }, + { + "cell_type": "code", + "id": "e9c0dd1d", + "source": "def retrieve_four_repr(query, limit=10):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n return client.query_points(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n models.Prefetch(query=dense_query, using=\"dense_title\", limit=50),\n models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=50),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n ],\n query=models.FusionQuery(fusion=models.Fusion.RRF),\n limit=limit,\n ).points\n\nshow_results(retrieve_four_repr)\n", + "metadata": {}, + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", "id": "1fed2f91", "metadata": {}, - "source": [ - "## Step 4: Group by Document\n", - "\n", - "So far results are chunks, and the same paper can appear multiple times in the top 10. Most consumers want one entry per document with the top chunks attached: a results UI, a citation list, an LLM that needs document-level attribution.\n", - "\n", - "`query_points_groups` collapses chunks back to documents using `group_by=\"document_id\"`. 
Each group's `hits` field carries the top-`group_size` chunks for that paper.\n", - "\n", - "A few things worth knowing:\n", - "\n", - "- Grouping is a *presentation* choice, not a relevance technique. The candidates and their fused scores don't change; only the result shape does.\n", - "- Increase the prefetch `limit` when grouping. If a paper has three good chunks but the prefetch only returned two, grouping doesn't have the third to consider.\n", - "- Use the `with_lookup` parameter when document-level metadata (full title, authors, dates) lives in a separate collection. It fetches one record per group instead of repeating it per chunk.\n", - "\n", - "When *not* to group: when an LLM benefits from seeing several independently ranked chunks across multiple documents in its context window. Collapsing those into per-document groups throws away ordering information the LLM could have used.\n" - ] + "source": "## Step 5: Group by Document\n\nSo far results are chunks, and the same paper can appear multiple times in the top 10. Most consumers want one entry per document with the top chunks attached: a results UI, a citation list, an LLM that needs document-level attribution.\n\n`query_points_groups` collapses chunks back to documents using `group_by=\"document_id\"`. Each group's `hits` field carries the top-`group_size` chunks for that paper.\n\nThis step also wires in an optional `tags` parameter that filters candidates to specific arXiv categories before retrieval runs. Qdrant pre-filters on the payload index we added in the schema, so filtering happens before the fusion math, not after.\n\nA few things worth knowing:\n\n- Grouping is a *presentation* choice, not a relevance technique. The candidates and their fused scores don't change; only the result shape does.\n- You may need to adjust the per-prefetch `limit` based on the number of chunks per document; grouping only sees what the prefetch returns.\n" }, { "cell_type": "code", @@ -396,46 +177,13 @@ "id": "1694ce42", "metadata": {}, "outputs": [], - "source": [ - "def retrieve_grouped(query, limit=10, group_size=3):\n", - " dense, sparse = embed_query(query)\n", - " return client.query_points_groups(\n", - " collection_name=\"arxiv_multi_repr\",\n", - " prefetch=[\n", - " models.Prefetch(query=dense, using=\"dense_chunk\", limit=100),\n", - " models.Prefetch(query=dense, using=\"dense_title\", limit=100),\n", - " models.Prefetch(query=sparse, using=\"sparse_keywords\", limit=100),\n", - " ],\n", - " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", - " group_by=\"document_id\",\n", - " group_size=group_size,\n", - " limit=limit,\n", - " ).groups\n", - "\n", - "show_results(retrieve_grouped)\n" - ] + "source": "def retrieve_grouped(query, limit=10, group_size=3, tags=None):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n # Optional category filter. 
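For example, tags=[\"cs.CV\"] keeps only computer-vision papers. 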
When tags is provided, Qdrant pre-filters candidates\n # to points whose 'tags' payload includes any of the given values.\n query_filter = (\n models.Filter(must=[models.FieldCondition(key=\"tags\", match=models.MatchAny(any=tags))])\n if tags else None\n )\n # query_points_groups runs the prefetches, fuses with RRF, applies the filter, and groups results by document_id.\n return client.query_points_groups(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=100),\n models.Prefetch(query=dense_query, using=\"dense_title\", limit=100),\n models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=100),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=100),\n ],\n query=models.FusionQuery(fusion=models.Fusion.RRF),\n query_filter=query_filter,\n group_by=\"document_id\",\n group_size=group_size,\n limit=limit,\n ).groups\n\nshow_results(retrieve_grouped)\n" }, { "cell_type": "markdown", "id": "83c7905e", "metadata": {}, - "source": [ - "## Step 5: Score Boosting With a Formula\n", - "\n", - "When you have ranking preferences that aren't captured by similarity alone (recency, source authority, geographic proximity, structured boosts), swap RRF for a `FormulaQuery`. Formulas operate on the prefetch scores and payload fields:\n", - "\n", - "- `$score[i]` references the score from prefetch `i`. Prefetch order is load-bearing.\n", - "- The `defaults` map covers candidates that appeared in one prefetch but not another. Without it, a missing variable would error.\n", - "\n", - "The formula below sums the chunk score with a half-weighted title score and a smaller sparse contribution. Unlike RRF, this is a linear combination of raw scores and is fragile across query types unless you've held the weights up against representative queries. Treat the specific weights here as illustrative; the mechanism is the point.\n", - "\n", - "Formula vs reranker:\n", - "\n", - "- **Formula API**: structured preferences known up front (recency decay, source authority, geo proximity, content-type boosts). Cheap and deterministic.\n", - "- **Reranker** (a late-interaction or cross-encoder model): preferences that are \"this is more relevant than that\" but you can't easily express why in a closed form. Expensive but learns what you can't articulate.\n", - "\n", - "For time decay on a `published_at` payload field, swap the title term for an `exp_decay` expression from Qdrant's [decay functions reference](https://qdrant.tech/documentation/search/search-relevance/#decay-functions).\n" - ] + "source": "## Step 6: Score Boosting With a Formula\n\nWhen you have ranking preferences that aren't captured by similarity alone (recency, source authority, geographic proximity, structured boosts), swap RRF for a `FormulaQuery`. Formulas operate on the prefetch scores and payload fields:\n\n- `$score[i]` references the score from prefetch `i`. Prefetch order is load-bearing.\n- The `defaults` map provides fallback values for candidates that didn't appear in every prefetch, so the formula still evaluates.\n\nThe formula below sums the chunk score with weighted contributions from the title, abstract, and sparse prefetches. This is a linear combination of raw scores, which breaks down when prefetches use different scoring scales. 
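A toy illustration of that failure mode (plain Python with made-up numbers, not Qdrant API):\n\n```python\n# Cosine similarities are bounded near [0, 1]; BM25 scores are unbounded.\ndense_scores  = [0.86, 0.81, 0.79]\nsparse_scores = [21.4, 3.2, 17.8]\nfused = [d + 0.3 * s for d, s in zip(dense_scores, sparse_scores)]\nprint(fused)  # [7.28, 1.77, 6.13] -- the ordering tracks BM25 almost entirely\n```\n\n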
RRF avoids this by discarding scores; DBSF normalizes per prefetch; a custom formula has to align distributions itself, typically with [decay functions](https://qdrant.tech/documentation/search/search-relevance/#decay-functions). The full FormulaQuery syntax lives in the [Score Boosting](https://qdrant.tech/documentation/search/search-relevance/#score-boosting) reference.\n\nFor time-based decay on a `published_at` payload field, swap a term for an `exp_decay` expression.\n\nFor RRF vs. DBSF guidance, see the [hybrid-search FAQ](https://qdrant.tech/documentation/faq/qdrant-fundamentals/#when-should-i-use-reciprocal-rank-fusion-rrf-vs-distribution-based-score-fusion-dbsf-for-hybrid-search).\n" }, { "cell_type": "code", @@ -443,44 +191,13 @@ "id": "d25beee5", "metadata": {}, "outputs": [], - "source": [ - "def retrieve_boosted(query, limit=10, group_size=3):\n", - " dense, sparse = embed_query(query)\n", - " return client.query_points_groups(\n", - " collection_name=\"arxiv_multi_repr\",\n", - " prefetch=[\n", - " # $score[0] = chunk, $score[1] = title, $score[2] = sparse\n", - " models.Prefetch(query=dense, using=\"dense_chunk\", limit=100),\n", - " models.Prefetch(query=dense, using=\"dense_title\", limit=100),\n", - " models.Prefetch(query=sparse, using=\"sparse_keywords\", limit=100),\n", - " ],\n", - " query=models.FormulaQuery(\n", - " formula=models.SumExpression(sum=[\n", - " \"$score[0]\",\n", - " models.MultExpression(mult=[0.5, \"$score[1]\"]),\n", - " models.MultExpression(mult=[0.3, \"$score[2]\"]),\n", - " ]),\n", - " defaults={\"$score[1]\": 0.0, \"$score[2]\": 0.0},\n", - " ),\n", - " group_by=\"document_id\",\n", - " group_size=group_size,\n", - " limit=limit,\n", - " ).groups\n", - "\n", - "show_results(retrieve_boosted)\n" - ] + "source": "def retrieve_boosted(query, limit=10, group_size=3):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n return client.query_points_groups(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n # $score[0] = chunk, $score[1] = title, $score[2] = abstract, $score[3] = sparse\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=100),\n models.Prefetch(query=dense_query, using=\"dense_title\", limit=100),\n models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=100),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=100),\n ],\n query=models.FormulaQuery(\n formula=models.SumExpression(sum=[\n \"$score[0]\",\n models.MultExpression(mult=[0.5, \"$score[1]\"]),\n models.MultExpression(mult=[0.4, \"$score[2]\"]),\n models.MultExpression(mult=[0.3, \"$score[3]\"]),\n ]),\n defaults={\"$score[1]\": 0.0, \"$score[2]\": 0.0, \"$score[3]\": 0.0},\n ),\n group_by=\"document_id\",\n group_size=group_size,\n limit=limit,\n ).groups\n\nshow_results(retrieve_boosted)\n" }, { "cell_type": "markdown", "id": "ca1e7741", "metadata": {}, - "source": [ - "## Wrap-up\n", - "\n", - "That's the recommended multi-representation pipeline end to end. The same schema works for any corpus with title-like, summary-like, and body-like representations. 
Swap the dataset, retune which representations earn their prefetch slots for your data, and wire in formula-based ranking preferences as needed.\n", - "\n", - "For the design rationale and references, see the [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/).\n" - ] + "source": "## Wrap-up\n\nThat's the recommended multi-representation pipeline end to end. The same schema works for any corpus with title-like, abstract-like, and body-like representations. Swap the dataset, retune which representations earn their prefetch slots for your data, and wire in formula-based ranking preferences as needed.\n\nFor the design rationale and references, see the [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/).\n" } ], "metadata": { @@ -504,4 +221,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/multi-representation-search/probe_queries.py b/multi-representation-search/probe_queries.py index 034c434..8a94fc5 100644 --- a/multi-representation-search/probe_queries.py +++ b/multi-representation-search/probe_queries.py @@ -1,5 +1,8 @@ """Try candidate queries against the multi-representation collection and print -all 5 step outputs side-by-side so we can pick the one with the best narrative arc.""" +all 6 step outputs side-by-side so we can pick the one with the best narrative arc. + +Local dev script: uses FastEmbed for both dense and sparse, against a local Qdrant. +The notebook uses Qdrant Cloud Inference instead, but the schema names match.""" from qdrant_client import QdrantClient, models from fastembed import TextEmbedding, SparseTextEmbedding @@ -31,7 +34,7 @@ def step2(q, k=5): COLLECTION, prefetch=[ models.Prefetch(query=d, using="dense_chunk", limit=50), - models.Prefetch(query=s, using="sparse_keywords", limit=50), + models.Prefetch(query=s, using="sparse_title", limit=50), ], query=models.FusionQuery(fusion=models.Fusion.RRF), limit=k, @@ -45,7 +48,7 @@ def step3(q, k=5): prefetch=[ models.Prefetch(query=d, using="dense_chunk", limit=50), models.Prefetch(query=d, using="dense_title", limit=50), - models.Prefetch(query=s, using="sparse_keywords", limit=50), + models.Prefetch(query=s, using="sparse_title", limit=50), ], query=models.FusionQuery(fusion=models.Fusion.RRF), limit=k, @@ -53,13 +56,29 @@ def step3(q, k=5): def step4(q, k=5): + d, s = embed(q) + return client.query_points( + COLLECTION, + prefetch=[ + models.Prefetch(query=d, using="dense_chunk", limit=50), + models.Prefetch(query=d, using="dense_title", limit=50), + models.Prefetch(query=d, using="dense_abstract", limit=50), + models.Prefetch(query=s, using="sparse_title", limit=50), + ], + query=models.FusionQuery(fusion=models.Fusion.RRF), + limit=k, + ).points + + +def step5(q, k=5): d, s = embed(q) return client.query_points_groups( COLLECTION, prefetch=[ - models.Prefetch(query=d, using="dense_chunk", limit=100), - models.Prefetch(query=d, using="dense_title", limit=100), - models.Prefetch(query=s, using="sparse_keywords", limit=100), + models.Prefetch(query=d, using="dense_chunk", limit=100), + models.Prefetch(query=d, using="dense_title", limit=100), + models.Prefetch(query=d, using="dense_abstract", limit=100), + models.Prefetch(query=s, using="sparse_title", limit=100), ], query=models.FusionQuery(fusion=models.Fusion.RRF), group_by="document_id", @@ -68,22 +87,24 @@ def step4(q, k=5): ).groups -def step5(q, k=5): +def step6(q, k=5): d, s = embed(q) return 
client.query_points_groups( COLLECTION, prefetch=[ - models.Prefetch(query=d, using="dense_chunk", limit=100), - models.Prefetch(query=d, using="dense_title", limit=100), - models.Prefetch(query=s, using="sparse_keywords", limit=100), + models.Prefetch(query=d, using="dense_chunk", limit=100), + models.Prefetch(query=d, using="dense_title", limit=100), + models.Prefetch(query=d, using="dense_abstract", limit=100), + models.Prefetch(query=s, using="sparse_title", limit=100), ], query=models.FormulaQuery( formula=models.SumExpression(sum=[ "$score[0]", models.MultExpression(mult=[0.5, "$score[1]"]), - models.MultExpression(mult=[0.3, "$score[2]"]), + models.MultExpression(mult=[0.4, "$score[2]"]), + models.MultExpression(mult=[0.3, "$score[3]"]), ]), - defaults={"$score[1]": 0.0, "$score[2]": 0.0}, + defaults={"$score[1]": 0.0, "$score[2]": 0.0, "$score[3]": 0.0}, ), group_by="document_id", group_size=3, @@ -103,11 +124,12 @@ def doc_id_of(item): def run_query(q): print(f"\n{'=' * 100}\nQUERY: {q!r}\n{'=' * 100}") - for name, fn in [("step1 dense-only", step1), - ("step2 +sparse RRF", step2), - ("step3 +title RRF", step3), - ("step4 grouped ", step4), - ("step5 formula ", step5)]: + for name, fn in [("step1 dense-only ", step1), + ("step2 +sparse RRF ", step2), + ("step3 +title RRF ", step3), + ("step4 +abstract RRF", step4), + ("step5 grouped ", step5), + ("step6 formula ", step6)]: results = fn(q, k=5) seen = set() dup_count = 0 From 87726367febc4a17bc603caf1a0bde0252e658ca Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Mon, 11 May 2026 20:44:04 -0400 Subject: [PATCH 4/8] delete testing file --- multi-representation-search/probe_queries.py | 172 ------------------- 1 file changed, 172 deletions(-) delete mode 100644 multi-representation-search/probe_queries.py diff --git a/multi-representation-search/probe_queries.py b/multi-representation-search/probe_queries.py deleted file mode 100644 index 8a94fc5..0000000 --- a/multi-representation-search/probe_queries.py +++ /dev/null @@ -1,172 +0,0 @@ -"""Try candidate queries against the multi-representation collection and print -all 6 step outputs side-by-side so we can pick the one with the best narrative arc. - -Local dev script: uses FastEmbed for both dense and sparse, against a local Qdrant. 
-The notebook uses Qdrant Cloud Inference instead, but the schema names match.""" - -from qdrant_client import QdrantClient, models -from fastembed import TextEmbedding, SparseTextEmbedding - -COLLECTION = "arxiv_multi_repr" -client = QdrantClient("http://localhost:6333") -dense_model = TextEmbedding("BAAI/bge-small-en-v1.5") -sparse_model = SparseTextEmbedding("Qdrant/bm25") - - -def to_sparse(s): - return models.SparseVector(indices=s.indices.tolist(), values=s.values.tolist()) - - -def embed(query): - d = next(iter(dense_model.query_embed([query]))).tolist() - s = to_sparse(next(iter(sparse_model.query_embed([query])))) - return d, s - - -def step1(q, k=5): - d, _ = embed(q) - return client.query_points(COLLECTION, query=d, using="dense_chunk", limit=k).points - - -def step2(q, k=5): - d, s = embed(q) - return client.query_points( - COLLECTION, - prefetch=[ - models.Prefetch(query=d, using="dense_chunk", limit=50), - models.Prefetch(query=s, using="sparse_title", limit=50), - ], - query=models.FusionQuery(fusion=models.Fusion.RRF), - limit=k, - ).points - - -def step3(q, k=5): - d, s = embed(q) - return client.query_points( - COLLECTION, - prefetch=[ - models.Prefetch(query=d, using="dense_chunk", limit=50), - models.Prefetch(query=d, using="dense_title", limit=50), - models.Prefetch(query=s, using="sparse_title", limit=50), - ], - query=models.FusionQuery(fusion=models.Fusion.RRF), - limit=k, - ).points - - -def step4(q, k=5): - d, s = embed(q) - return client.query_points( - COLLECTION, - prefetch=[ - models.Prefetch(query=d, using="dense_chunk", limit=50), - models.Prefetch(query=d, using="dense_title", limit=50), - models.Prefetch(query=d, using="dense_abstract", limit=50), - models.Prefetch(query=s, using="sparse_title", limit=50), - ], - query=models.FusionQuery(fusion=models.Fusion.RRF), - limit=k, - ).points - - -def step5(q, k=5): - d, s = embed(q) - return client.query_points_groups( - COLLECTION, - prefetch=[ - models.Prefetch(query=d, using="dense_chunk", limit=100), - models.Prefetch(query=d, using="dense_title", limit=100), - models.Prefetch(query=d, using="dense_abstract", limit=100), - models.Prefetch(query=s, using="sparse_title", limit=100), - ], - query=models.FusionQuery(fusion=models.Fusion.RRF), - group_by="document_id", - group_size=3, - limit=k, - ).groups - - -def step6(q, k=5): - d, s = embed(q) - return client.query_points_groups( - COLLECTION, - prefetch=[ - models.Prefetch(query=d, using="dense_chunk", limit=100), - models.Prefetch(query=d, using="dense_title", limit=100), - models.Prefetch(query=d, using="dense_abstract", limit=100), - models.Prefetch(query=s, using="sparse_title", limit=100), - ], - query=models.FormulaQuery( - formula=models.SumExpression(sum=[ - "$score[0]", - models.MultExpression(mult=[0.5, "$score[1]"]), - models.MultExpression(mult=[0.4, "$score[2]"]), - models.MultExpression(mult=[0.3, "$score[3]"]), - ]), - defaults={"$score[1]": 0.0, "$score[2]": 0.0, "$score[3]": 0.0}, - ), - group_by="document_id", - group_size=3, - limit=k, - ).groups - - -def title_of(item): - p = item.hits[0] if hasattr(item, "hits") else item - return p.payload["title"].replace("\n", " ").strip() - - -def doc_id_of(item): - p = item.hits[0] if hasattr(item, "hits") else item - return p.payload["document_id"] - - -def run_query(q): - print(f"\n{'=' * 100}\nQUERY: {q!r}\n{'=' * 100}") - for name, fn in [("step1 dense-only ", step1), - ("step2 +sparse RRF ", step2), - ("step3 +title RRF ", step3), - ("step4 +abstract RRF", step4), - ("step5 grouped ", step5), - 
("step6 formula ", step6)]: - results = fn(q, k=5) - seen = set() - dup_count = 0 - lines = [] - for i, item in enumerate(results, 1): - doc = doc_id_of(item) - t = title_of(item) - dup = " [DUP]" if doc in seen else "" - if doc in seen: - dup_count += 1 - seen.add(doc) - lines.append(f" {i}. {t[:90]}{dup}") - print(f"\n [{name}] unique_docs={len(seen)} dup_chunks={dup_count}") - for ln in lines: - print(ln) - - -CANDIDATES = [ - "diffusion models for image synthesis", # current - "transformer architecture for language modeling", - "adversarial examples in deep learning", - "contrastive self-supervised representation learning", - "graph neural networks for node classification", - "reinforcement learning from human feedback", - "neural machine translation with attention", - "knowledge distillation for model compression", - "object detection in autonomous driving", - "few-shot learning with meta-learning", - "BERT pretraining for sentence classification", - "convolutional neural network for image classification", - "generative adversarial networks for image generation", - "variational autoencoder for representation learning", - "speech recognition with recurrent neural networks", -] - -if __name__ == "__main__": - import sys - queries = sys.argv[1:] if len(sys.argv) > 1 else CANDIDATES - for q in queries: - run_query(q) From b1b18b4674c9fd74eb51686d99e220365a48c942 Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Mon, 11 May 2026 21:15:02 -0400 Subject: [PATCH 5/8] make FormulaQuery example symmetric and add cloud-cluster comment Wraps every $score term in MultExpression (with weight 1.0 on chunk) so the formula reads uniformly. Adds a comment above the QdrantClient init pointing readers to https://cloud.qdrant.io for their own url and api_key. --- .../multi-representation-search.ipynb | 377 ++++++++++++++++-- 1 file changed, 349 insertions(+), 28 deletions(-) diff --git a/multi-representation-search/multi-representation-search.ipynb b/multi-representation-search/multi-representation-search.ipynb index 83ae3fd..10e28e7 100644 --- a/multi-representation-search/multi-representation-search.ipynb +++ b/multi-representation-search/multi-representation-search.ipynb @@ -4,13 +4,27 @@ "cell_type": "markdown", "id": "2153bba9", "metadata": {}, - "source": "# Multi-Representation Search: Step-by-Step Build-Up\n\nA document is rarely well-represented by a single embedding. A research paper has a title, an abstract, body chunks, and category tags, each carrying a different signal. Treat all four as one dense vector and the title gets averaged out; chunk-level grounding for downstream reasoning disappears.\n\nThis notebook builds a Qdrant retrieval pipeline that uses each representation deliberately. Over six steps you'll go from a naive dense-only baseline to a fully fused pipeline with four named-vector prefetches, Reciprocal Rank Fusion, document-level grouping, and optional formula-based score boosting. After each step you'll run the same query and see the top retrieved papers change.\n\nThe design rationale (why each component is there, when to use it, when not to) lives in the accompanying [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/). 
This notebook focuses on running the code and watching the result list shift.\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/qdrant/examples/blob/master/multi-representation-search/multi-representation-search.ipynb)\n" + "source": [ + "# Multi-Representation Search: Step-by-Step Build-Up\n", + "\n", + "A document is rarely well-represented by a single embedding. A research paper has a title, an abstract, body chunks, and category tags, each carrying a different signal. Treat all four as one dense vector and the title gets averaged out; chunk-level grounding for downstream reasoning disappears.\n", + "\n", + "This notebook builds a Qdrant retrieval pipeline that uses each representation deliberately. Over six steps you'll go from a naive dense-only baseline to a fully fused pipeline with four named-vector prefetches, Reciprocal Rank Fusion, document-level grouping, and optional formula-based score boosting. After each step you'll run the same query and see the top retrieved papers change.\n", + "\n", + "The design rationale (why each component is there, when to use it, when not to) lives in the accompanying [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/). This notebook focuses on running the code and watching the result list shift.\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/qdrant/examples/blob/master/multi-representation-search/multi-representation-search.ipynb)\n" + ] }, { "cell_type": "markdown", "id": "4b597568", "metadata": {}, - "source": "## Requirements\n\nThis notebook uses [Qdrant Cloud Inference](https://qdrant.tech/documentation/inference/#qdrant-cloud-inference) to generate embeddings server-side, so no client-side embedding library is required. The free tier covers this notebook's footprint. Core BM25 runs on any Qdrant instance, but dense Cloud Inference is Cloud-only. To self-host, generate dense vectors on the client with a library like [FastEmbed](https://qdrant.tech/documentation/fastembed/) and pass them as raw vectors instead of `models.Document`.\n" + "source": [ + "## Requirements\n", + "\n", + "This notebook uses [Qdrant Cloud Inference](https://qdrant.tech/documentation/inference/#qdrant-cloud-inference) to generate embeddings server-side, so no client-side embedding library is required. The free tier covers this notebook's footprint. Core BM25 runs on any Qdrant instance, but dense Cloud Inference is Cloud-only. To self-host, generate dense vectors on the client with a library like [FastEmbed](https://qdrant.tech/documentation/fastembed/) and pass them as raw vectors instead of `models.Document`.\n" + ] }, { "cell_type": "code", @@ -18,13 +32,19 @@ "id": "59028f90", "metadata": {}, "outputs": [], - "source": "!pip install qdrant-client datasets\n" + "source": [ + "!pip install qdrant-client datasets" + ] }, { "cell_type": "markdown", "id": "c1e8c733", "metadata": {}, - "source": "## Dataset\n\n20 000 ML/CS arXiv papers (2018 and later) from the [`gfissore/arxiv-abstracts-2021`](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021) dataset. 
Each paper has a `title`, `abstract`, and `categories` (which this dataset returns as space-joined strings, so we split them before filtering).\n" + "source": [ + "## Dataset\n", + "\n", + "20 000 ML/CS arXiv papers (2018 and later) from the [`gfissore/arxiv-abstracts-2021`](https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021) dataset. Each paper has a `title`, `abstract`, and `categories` (which this dataset returns as space-joined strings, so we split them before filtering).\n" + ] }, { "cell_type": "code", @@ -64,14 +84,27 @@ " \"abstract\": row[\"abstract\"].strip(),\n", " \"categories\": cats,\n", " })\n", - "print(f\"Loaded {len(papers)} papers\")\n" + "print(f\"Loaded {len(papers)} papers\")" ] }, { "cell_type": "markdown", "id": "26339a5a", "metadata": {}, - "source": "## Schema\n\nOne Qdrant collection. Each point is a chunk. Each chunk holds four named vectors that we'll fuse at query time:\n\n- `dense_chunk`: the chunk's own embedding (body content).\n- `dense_title`: the paper title embedding (topical naming).\n- `dense_abstract`: the paper abstract embedding (paper-level view).\n- `sparse_title`: BM25 over the title (lexical matches on rare entity names, jargon, specific model or paper names).\n\nCategories live in the `tags` payload with a keyword index, so queries can pre-filter by category.\n\n`dense_title`, `dense_abstract`, and `sparse_title` are duplicated across every chunk of the same paper. That trades a bit of storage for one-shot query fusion (one collection, one Query API call, every representation reachable from any point). For the typical case (a few dozen chunks per paper, embeddings under a kilobyte each) it's the simpler choice.\n" + "source": [ + "## Schema\n", + "\n", + "One Qdrant collection. Each point is a chunk. Each chunk holds four named vectors that we'll fuse at query time:\n", + "\n", + "- `dense_chunk`: the chunk's own embedding (body content).\n", + "- `dense_title`: the paper title embedding (topical naming).\n", + "- `dense_abstract`: the paper abstract embedding (paper-level view).\n", + "- `sparse_title`: BM25 over the title (lexical matches on rare entity names, jargon, specific model or paper names).\n", + "\n", + "Categories live in the `tags` payload with a keyword index, so queries can pre-filter by category.\n", + "\n", + "`dense_title`, `dense_abstract`, and `sparse_title` are duplicated across every chunk of the same paper. That trades a bit of storage for one-shot query fusion (one collection, one Query API call, every representation reachable from any point). 
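Back-of-the-envelope: a 384-dim float32 vector is 384 × 4 bytes ≈ 1.5 KB, so duplicating the title and abstract vectors across a 10-chunk paper adds roughly 2 × 1.5 KB × 10 = 30 KB (the sparse title vector adds far less). 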
For the typical case (a few dozen chunks per paper) it's the simpler choice.\n"
+   ]
  },
  {
   "cell_type": "code",
@@ -79,13 +112,51 @@
   "id": "788e1d18",
   "metadata": {},
   "outputs": [],
-   "source": "from qdrant_client import QdrantClient, models\n\nclient = QdrantClient(\n    url=\"https://xyz-example.qdrant.io:6333\",\n    api_key=\"\",\n    cloud_inference=True,\n)\n\n# 384 is the output dimension of sentence-transformers/all-minilm-l6-v2, used below for every dense vector.\nclient.create_collection(\n    collection_name=\"arxiv_multi_repr\",\n    vectors_config={\n        \"dense_chunk\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n        \"dense_title\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n        \"dense_abstract\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n    },\n    sparse_vectors_config={\n        \"sparse_title\": models.SparseVectorParams(modifier=models.Modifier.IDF),\n    },\n)\n\n# Index the 'tags' payload as keyword so we can filter on category at query time.\nclient.create_payload_index(\n    collection_name=\"arxiv_multi_repr\",\n    field_name=\"tags\",\n    field_schema=models.PayloadSchemaType.KEYWORD,\n)\n"
+   "source": [
+    "from qdrant_client import QdrantClient, models\n",
+    "\n",
+    "# Replace url and api_key with your own from https://cloud.qdrant.io -- never commit a live API key.\n",
+    "client = QdrantClient(\n",
+    "    url=\"https://xyz-example.eu-west-1-0.aws.cloud.qdrant.io\",\n",
+    "    api_key=\"<your-api-key>\",\n",
+    "    cloud_inference=True,\n",
+    ")\n",
+    "\n",
+    "# 384 is the output dimension of sentence-transformers/all-minilm-l6-v2, used below for every dense vector.\n",
+    "client.create_collection(\n",
+    "    collection_name=\"arxiv_multi_repr\",\n",
+    "    vectors_config={\n",
+    "        \"dense_chunk\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n",
+    "        \"dense_title\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n",
+    "        \"dense_abstract\": models.VectorParams(size=384, distance=models.Distance.COSINE),\n",
+    "    },\n",
+    "    sparse_vectors_config={\n",
+    "        \"sparse_title\": models.SparseVectorParams(modifier=models.Modifier.IDF),\n",
+    "    },\n",
+    ")\n",
+    "\n",
+    "# Index the 'tags' payload as keyword so we can filter on category at query time.\n",
+    "client.create_payload_index(\n",
+    "    collection_name=\"arxiv_multi_repr\",\n",
+    "    field_name=\"tags\",\n",
+    "    field_schema=models.PayloadSchemaType.KEYWORD,\n",
+    ")\n"
+   ]
  },
  {
   "cell_type": "markdown",
   "id": "295e1a01",
   "metadata": {},
-   "source": "## Ingestion\n\nEmbeddings are generated server-side via Qdrant Cloud Inference:\n\n- `sentence-transformers/all-minilm-l6-v2` (384-dim) for the three dense vectors.\n- `qdrant/bm25` (core BM25 since Qdrant 1.15) for the sparse vector, with `avg_len=10.0` calibrated for the title-only field (default is 256, calibrated for document-length text).\n\nChunking uses a fixed two-sentence window for simplicity; the right chunking strategy depends on your document structure. One point per chunk, with the title and abstract Documents reused across every chunk of the same paper.\n"
+   "source": [
+    "## Ingestion\n",
+    "\n",
+    "Embeddings are generated server-side via Qdrant Cloud Inference:\n",
+    "\n",
+    "- `sentence-transformers/all-minilm-l6-v2` (384-dim) for the three dense vectors.\n",
+    "- `qdrant/bm25` (core BM25 since Qdrant 1.15) for the sparse vector, with `avg_len=10.0` matched to the title-only field (the default of 256 is calibrated for document-length text).\n",
+    "\n",
+    "Chunking uses a fixed two-sentence window for simplicity; the right chunking strategy depends on your document structure. 
One point per chunk, with the title and abstract Documents reused across every chunk of the same paper.\n" + "source": [ + "## Ingestion\n", + "\n", + "Embeddings are generated server-side via Qdrant Cloud Inference:\n", + "\n", + "- `sentence-transformers/all-minilm-l6-v2` (384-dim) for the three dense vectors.\n", + "- `qdrant/bm25` (core BM25 since Qdrant 1.15) for the sparse vector, with `avg_len=10.0` calibrated for the title-only field (default is 256, calibrated for document-length text).\n", + "\n", + "Chunking uses a fixed two-sentence window for simplicity; the right chunking strategy depends on your document structure. One point per chunk, with the title and abstract Documents reused across every chunk of the same paper.\n" + ] }, { "cell_type": "code", @@ -93,13 +164,67 @@ "id": "725afca6", "metadata": {}, "outputs": [], - "source": "DENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\nBM25_MODEL = \"qdrant/bm25\"\n\ndef chunk_sentences(text, target_len=2):\n \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n return [\". \".join(sentences[i:i + target_len])\n for i in range(0, len(sentences), target_len)] or [text]\n\n\npoints = []\nfor paper in papers:\n chunks = chunk_sentences(paper[\"abstract\"])\n\n # Title, abstract, and sparse docs are reused across every chunk of this paper; only the chunk text varies.\n # Cloud Inference embeds each Document on the server, so you don't need a client-side embedding library.\n title_doc = models.Document(text=paper[\"title\"], model=DENSE_MODEL)\n abstract_doc = models.Document(text=paper[\"abstract\"], model=DENSE_MODEL)\n # avg_len is the average word count of the indexed text.\n # Default is 256 (document-length); setting it to the actual field length (~10 here) improves BM25 scoring accuracy.\n sparse_doc = models.Document(\n text=paper[\"title\"],\n model=BM25_MODEL,\n options={\"avg_len\": 10.0},\n )\n\n for i, chunk in enumerate(chunks):\n points.append(models.PointStruct(\n id=len(points),\n vector={\n \"dense_chunk\": models.Document(text=chunk, model=DENSE_MODEL),\n \"dense_title\": title_doc,\n \"dense_abstract\": abstract_doc,\n \"sparse_title\": sparse_doc,\n },\n payload={\n \"document_id\": paper[\"arxiv_id\"],\n \"title\": paper[\"title\"],\n \"tags\": paper[\"categories\"],\n \"chunk_index\": i,\n \"chunk_text\": chunk,\n },\n ))\n\nclient.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=64)\nprint(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")\n" + "source": [ + "DENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\n", + "BM25_MODEL = \"qdrant/bm25\"\n", + "\n", + "def chunk_sentences(text, target_len=2):\n", + " \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n", + " sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n", + " return [\". 
\".join(sentences[i:i + target_len])\n", + " for i in range(0, len(sentences), target_len)] or [text]\n", + "\n", + "\n", + "points = []\n", + "for paper in papers:\n", + " chunks = chunk_sentences(paper[\"abstract\"])\n", + "\n", + " # Title, abstract, and sparse docs are reused across every chunk of this paper; only the chunk text varies.\n", + " # Cloud Inference embeds each Document on the server, so you don't need a client-side embedding library.\n", + " title_doc = models.Document(text=paper[\"title\"], model=DENSE_MODEL)\n", + " abstract_doc = models.Document(text=paper[\"abstract\"], model=DENSE_MODEL)\n", + " # avg_len is the average word count of the indexed text.\n", + " # Default is 256 (document-length); setting it to the actual field length (~10 here) improves BM25 scoring accuracy.\n", + " sparse_doc = models.Document(\n", + " text=paper[\"title\"],\n", + " model=BM25_MODEL,\n", + " options={\"avg_len\": 10.0},\n", + " )\n", + "\n", + " for i, chunk in enumerate(chunks):\n", + " points.append(models.PointStruct(\n", + " id=len(points),\n", + " vector={\n", + " \"dense_chunk\": models.Document(text=chunk, model=DENSE_MODEL),\n", + " \"dense_title\": title_doc,\n", + " \"dense_abstract\": abstract_doc,\n", + " \"sparse_title\": sparse_doc,\n", + " },\n", + " payload={\n", + " \"document_id\": paper[\"arxiv_id\"],\n", + " \"title\": paper[\"title\"],\n", + " \"tags\": paper[\"categories\"],\n", + " \"chunk_index\": i,\n", + " \"chunk_text\": chunk,\n", + " },\n", + " ))\n", + "\n", + "client.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=64)\n", + "print(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")\n" + ] }, { "cell_type": "markdown", "id": "61b1aa7b", "metadata": {}, - "source": "## Query Helpers\n\nTwo pieces used by every step below:\n\n- `SAMPLE_QUERY` is the single query we run through every step so we can watch the same query produce different results as capabilities are added.\n- `show_results(retrieve_fn)` runs the retrieve function and prints the top 5 results: title, category tags, and an excerpt from the matching chunk. Accepts both chunk-level results (Steps 1-4) and grouped results (Steps 5-6, where each result is a paper with several chunks).\n" + "source": [ + "## Query Helpers\n", + "\n", + "Two pieces used by every step below:\n", + "\n", + "- `SAMPLE_QUERY` is the single query we run through every step so we can watch the same query produce different results as capabilities are added.\n", + "- `show_results(retrieve_fn)` runs the retrieve function and prints the top 5 results: title, category tags, and an excerpt from the matching chunk. 
Accepts both chunk-level results (Steps 1-4) and grouped results (Steps 5-6, where each result is a paper with several chunks).\n" + ] }, { "cell_type": "code", @@ -107,13 +232,44 @@ "id": "f70b01f8", "metadata": {}, "outputs": [], - "source": "import textwrap\n\nSAMPLE_QUERY = \"diffusion models for image synthesis\"\n\ndef show_results(retrieve_fn, query=SAMPLE_QUERY, k=5):\n \"\"\"Print top-k results as: title, category tags, and a matching-chunk excerpt.\"\"\"\n print(f\"Query: {query!r}\\n\")\n for i, item in enumerate(retrieve_fn(query, limit=k), 1):\n # item is a Point (Steps 1-4) or a Group (Steps 5-6).\n # For groups, hits[0] is the top chunk for that paper.\n point = item.hits[0] if hasattr(item, \"hits\") else item\n payload = point.payload\n title = payload[\"title\"]\n tags = payload.get(\"tags\", [])\n # Collapse whitespace (including embedded newlines) so the excerpt prints cleanly.\n chunk = \" \".join(payload[\"chunk_text\"].split())\n excerpt = chunk[:250].rstrip() + (\"...\" if len(chunk) > 250 else \"\")\n print(textwrap.fill(f\"{i}. {title}\", width=140, initial_indent=\" \", subsequent_indent=\" \"))\n if tags:\n print(f\" [{', '.join(str(t) for t in tags[:3])}]\")\n print(textwrap.fill(excerpt, width=140, initial_indent=\" \", subsequent_indent=\" \"))\n print()\n" + "source": [ + "import textwrap\n", + "\n", + "SAMPLE_QUERY = \"diffusion models for image synthesis\"\n", + "\n", + "def show_results(retrieve_fn, query=SAMPLE_QUERY, k=5):\n", + " \"\"\"Print top-k results as: title, category tags, and a matching-chunk excerpt.\"\"\"\n", + " print(f\"Query: {query!r}\\n\")\n", + " for i, item in enumerate(retrieve_fn(query, limit=k), 1):\n", + " # item is a Point (Steps 1-4) or a Group (Steps 5-6).\n", + " # For groups, hits[0] is the top chunk for that paper.\n", + " point = item.hits[0] if hasattr(item, \"hits\") else item\n", + " payload = point.payload\n", + " title = payload[\"title\"]\n", + " tags = payload.get(\"tags\", [])\n", + " # Collapse whitespace (including embedded newlines) so the excerpt prints cleanly.\n", + " chunk = \" \".join(payload[\"chunk_text\"].split())\n", + " excerpt = chunk[:250].rstrip() + (\"...\" if len(chunk) > 250 else \"\")\n", + " print(textwrap.fill(f\"{i}. {title}\", width=140, initial_indent=\" \", subsequent_indent=\" \"))\n", + " if tags:\n", + " print(f\" [{', '.join(str(t) for t in tags[:3])}]\")\n", + " print(textwrap.fill(excerpt, width=140, initial_indent=\" \", subsequent_indent=\" \"))\n", + " print()\n" + ] }, { "cell_type": "markdown", "id": "4b9065fe", "metadata": {}, - "source": "## Step 1: Dense Over Chunks (Baseline)\n\nThe naive baseline: encode the query with the dense model, search against `dense_chunk` only, return the chunk-level results' parent papers. No fusion, no title or sparse signal.\n\nThis is what most \"vector search\" tutorials stop at. It's a reasonable default for short, homogeneous corpora where the chunk text already carries the full signal. It systematically underperforms when the signal lives outside the chunk: in the title (topical naming), or in keyword overlap that the embedding model has averaged out into a generic neighborhood.\n\nEach subsequent step closes one of those gaps.\n" + "source": [ + "## Step 1: Dense Over Chunks (Baseline)\n", + "\n", + "The naive baseline: encode the query with the dense model, search against `dense_chunk` only, return the chunk-level results' parent papers. 
No fusion, no title or sparse signal.\n", + "\n", + "This is what most \"vector search\" tutorials stop at. It's a reasonable default for short, homogeneous corpora where the chunk text already carries the full signal. It systematically underperforms when the signal lives outside the chunk: in the title (topical naming), or in keyword overlap that the embedding model has averaged out into a generic neighborhood.\n", + "\n", + "Each subsequent step closes one of those gaps.\n" + ] }, { "cell_type": "code", @@ -121,13 +277,31 @@ "id": "566dbbbd", "metadata": {}, "outputs": [], - "source": "def retrieve_baseline(query, limit=10):\n return client.query_points(\n collection_name=\"arxiv_multi_repr\",\n query=models.Document(text=query, model=DENSE_MODEL),\n using=\"dense_chunk\",\n limit=limit,\n ).points\n\nshow_results(retrieve_baseline)\n" + "source": [ + "def retrieve_baseline(query, limit=10):\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " query=models.Document(text=query, model=DENSE_MODEL),\n", + " using=\"dense_chunk\",\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_baseline)\n" + ] }, { "cell_type": "markdown", "id": "f710ce2f", "metadata": {}, - "source": "## Step 2: Add Sparse Title With RRF\n\nAdd a second prefetch: BM25 over the title. Then fuse the two ranked lists with **Reciprocal Rank Fusion (RRF)**.\n\nWhy RRF instead of weighted averages of raw scores? RRF works on rank, not score. Dense scores live in [0, 1], sparse BM25 scores don't, and RRF doesn't have to reconcile the two. Linear weights are fragile: a weight that helps one query class hurts another, and the right weight depends on query length, model, and corpus.\n\nWhat does sparse add? Queries with rare entity names, jargon, or specific model/paper names often produce dense embeddings near generic neighborhoods. The sparse path catches those exact-token matches on the title. RRF promotes documents both paths agree on.\n" + "source": [ + "## Step 2: Add Sparse Title With RRF\n", + "\n", + "Add a second prefetch: BM25 over the title. Then fuse the two ranked lists with **Reciprocal Rank Fusion (RRF)**.\n", + "\n", + "Why RRF instead of weighted averages of raw scores? RRF works on rank, not score. Dense scores live in [0, 1], sparse BM25 scores don't, and RRF doesn't have to reconcile the two. Linear weights are fragile: a weight that helps one query class hurts another, and the right weight depends on query length, model, and corpus.\n", + "\n", + "What does sparse add? Queries with rare entity names, jargon, or specific model/paper names often produce dense embeddings near generic neighborhoods. The sparse path catches those exact-token matches on the title. 
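\n",
+    "\n",
+    "As a rough sketch of the arithmetic (Qdrant computes this server-side; 60 is the constant from the original RRF paper, and Qdrant's internal constant may differ):\n",
+    "\n",
+    "```python\n",
+    "def rrf_score(ranks, k=60):\n",
+    "    # ranks: this candidate's 1-based rank in each prefetch list,\n",
+    "    # omitting lists where the candidate never appeared\n",
+    "    return sum(1.0 / (k + r) for r in ranks)\n",
+    "```\n",
+    "\n",
+    "A candidate that ranks well in both lists collects two reciprocal terms, so 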
RRF promotes documents both paths agree on.\n" + ] }, { "cell_type": "code", @@ -135,13 +309,34 @@ "id": "44b0f157", "metadata": {}, "outputs": [], - "source": "def retrieve_hybrid(query, limit=10):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n return client.query_points(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n ],\n query=models.FusionQuery(fusion=models.Fusion.RRF),\n limit=limit,\n ).points\n\nshow_results(retrieve_hybrid)\n" + "source": [ + "def retrieve_hybrid(query, limit=10):\n", + " dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + " sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n", + " models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_hybrid)\n" + ] }, { "cell_type": "markdown", "id": "4bdf38f7", "metadata": {}, - "source": "## Step 3: Add Title Prefetch\n\nAdd a third prefetch: the same dense query vector, but searched against `dense_title` instead of `dense_chunk`. We're now fusing across three representations: chunk content, title (lexical), and title (semantic).\n\nThe title prefetch saves queries where the topic is named explicitly but not echoed in any single chunk. For example: \"diffusion models for high-resolution image synthesis\" surfaces a paper titled \"High-Resolution Image Synthesis with Latent Diffusion Models\" via the title path even when its chunks phrase the contribution differently. The chunk prefetch alone misses it; the title path catches it; RRF promotes it because both paths agree.\n" + "source": [ + "## Step 3: Add Title Prefetch\n", + "\n", + "Add a third prefetch: the same dense query vector, but searched against `dense_title` instead of `dense_chunk`. We're now fusing across three representations: chunk content, title (lexical), and title (semantic).\n", + "\n", + "The title prefetch saves queries where the topic is named explicitly but not echoed in any single chunk. For example: \"diffusion models for high-resolution image synthesis\" surfaces a paper titled \"High-Resolution Image Synthesis with Latent Diffusion Models\" via the title path even when its chunks phrase the contribution differently. 
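\n",
+    "\n",
+    "Once the next cell defines `retrieve_three_repr`, you can replay that exact query through the `show_results` helper from earlier (output depends on which papers your slice happened to load):\n",
+    "\n",
+    "```python\n",
+    "show_results(retrieve_three_repr, query=\"diffusion models for high-resolution image synthesis\")\n",
+    "```\n",
+    "\n",
+    "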
The chunk prefetch alone misses it; the title path catches it; RRF promotes it because both paths agree.\n" + ] }, { "cell_type": "code", @@ -149,27 +344,79 @@ "id": "b62d81a9", "metadata": {}, "outputs": [], - "source": "def retrieve_three_repr(query, limit=10):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n return client.query_points(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n models.Prefetch(query=dense_query, using=\"dense_title\", limit=50),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n ],\n query=models.FusionQuery(fusion=models.Fusion.RRF),\n limit=limit,\n ).points\n\nshow_results(retrieve_three_repr)\n" + "source": [ + "def retrieve_three_repr(query, limit=10):\n", + " dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + " sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n", + " models.Prefetch(query=dense_query, using=\"dense_title\", limit=50),\n", + " models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_three_repr)\n" + ] }, { "cell_type": "markdown", "id": "e59ce67e", - "source": "## Step 4: Add Abstract Prefetch\n\nAdd a fourth prefetch on `dense_abstract`. The abstract gives a paper-level view that sits between the title (very short) and individual chunks (very local). It catches queries that match the paper's overall framing rather than a single passage or the title's topical naming.\n\nIn a production setup where chunks are full paper bodies, the abstract is a meaningfully different representation. In this notebook's arXiv dataset (where chunks are 2-sentence slices of the abstract itself), the lift over Step 3 will be smaller because the abstract and the chunks share text. The prefetch is still worth wiring up; the pipeline shape is what generalizes to longer corpora.\n", - "metadata": {} + "metadata": {}, + "source": [ + "## Step 4: Add Abstract Prefetch\n", + "\n", + "Add a fourth prefetch on `dense_abstract`. The abstract gives a paper-level view that sits between the title (very short) and individual chunks (very local). It catches queries that match the paper's overall framing rather than a single passage or the title's topical naming.\n", + "\n", + "In a production setup where chunks are full paper bodies, the abstract is a meaningfully different representation. In this notebook's arXiv dataset (where chunks are 2-sentence slices of the abstract itself), the lift over Step 3 will be smaller because the abstract and the chunks share text. 
The prefetch is still worth wiring up; the pipeline shape is what generalizes to longer corpora.\n" + ] }, { "cell_type": "code", + "execution_count": null, "id": "e9c0dd1d", - "source": "def retrieve_four_repr(query, limit=10):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n return client.query_points(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n models.Prefetch(query=dense_query, using=\"dense_title\", limit=50),\n models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=50),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n ],\n query=models.FusionQuery(fusion=models.Fusion.RRF),\n limit=limit,\n ).points\n\nshow_results(retrieve_four_repr)\n", "metadata": {}, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "def retrieve_four_repr(query, limit=10):\n", + " dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + " sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + " return client.query_points(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=50),\n", + " models.Prefetch(query=dense_query, using=\"dense_title\", limit=50),\n", + " models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=50),\n", + " models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=50),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " limit=limit,\n", + " ).points\n", + "\n", + "show_results(retrieve_four_repr)\n" + ] }, { "cell_type": "markdown", "id": "1fed2f91", "metadata": {}, - "source": "## Step 5: Group by Document\n\nSo far results are chunks, and the same paper can appear multiple times in the top 10. Most consumers want one entry per document with the top chunks attached: a results UI, a citation list, an LLM that needs document-level attribution.\n\n`query_points_groups` collapses chunks back to documents using `group_by=\"document_id\"`. Each group's `hits` field carries the top-`group_size` chunks for that paper.\n\nThis step also wires in an optional `tags` parameter that filters candidates to specific arXiv categories before retrieval runs. Qdrant pre-filters on the payload index we added in the schema, so filtering happens before the fusion math, not after.\n\nA few things worth knowing:\n\n- Grouping is a *presentation* choice, not a relevance technique. The candidates and their fused scores don't change; only the result shape does.\n- You may need to adjust the per-prefetch `limit` based on the number of chunks per document; grouping only sees what the prefetch returns.\n" + "source": [ + "## Step 5: Group by Document\n", + "\n", + "So far results are chunks, and the same paper can appear multiple times in the top 10. Most consumers want one entry per document with the top chunks attached: a results UI, a citation list, an LLM that needs document-level attribution.\n", + "\n", + "`query_points_groups` collapses chunks back to documents using `group_by=\"document_id\"`. Each group's `hits` field carries the top-`group_size` chunks for that paper.\n", + "\n", + "This step also wires in an optional `tags` parameter that filters candidates to specific arXiv categories before retrieval runs. 
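\n",
+    "\n",
+    "For example, once `retrieve_grouped` below is defined, restricting results to `cs.CV` papers is one lambda away (an illustrative call, reusing the `show_results` helper):\n",
+    "\n",
+    "```python\n",
+    "show_results(lambda q, limit: retrieve_grouped(q, limit=limit, tags=[\"cs.CV\"]))\n",
+    "```\n",
+    "\n",
+    "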
Qdrant pre-filters on the payload index we added in the schema, so filtering happens before the fusion math, not after.\n", + "\n", + "A few things worth knowing:\n", + "\n", + "- Grouping is a *presentation* choice, not a relevance technique. The candidates and their fused scores don't change; only the result shape does.\n", + "- You may need to adjust the per-prefetch `limit` based on the number of chunks per document; grouping only sees what the prefetch returns.\n" + ] }, { "cell_type": "code", @@ -177,13 +424,53 @@ "id": "1694ce42", "metadata": {}, "outputs": [], - "source": "def retrieve_grouped(query, limit=10, group_size=3, tags=None):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n # Optional category filter. When tags is provided, Qdrant pre-filters candidates\n # to points whose 'tags' payload includes any of the given values.\n query_filter = (\n models.Filter(must=[models.FieldCondition(key=\"tags\", match=models.MatchAny(any=tags))])\n if tags else None\n )\n # query_points_groups runs the prefetches, fuses with RRF, applies the filter, and groups results by document_id.\n return client.query_points_groups(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=100),\n models.Prefetch(query=dense_query, using=\"dense_title\", limit=100),\n models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=100),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=100),\n ],\n query=models.FusionQuery(fusion=models.Fusion.RRF),\n query_filter=query_filter,\n group_by=\"document_id\",\n group_size=group_size,\n limit=limit,\n ).groups\n\nshow_results(retrieve_grouped)\n" + "source": [ + "def retrieve_grouped(query, limit=10, group_size=3, tags=None):\n", + " dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + " sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + " # Optional category filter. When tags is provided, Qdrant pre-filters candidates\n", + " # to points whose 'tags' payload includes any of the given values.\n", + " query_filter = (\n", + " models.Filter(must=[models.FieldCondition(key=\"tags\", match=models.MatchAny(any=tags))])\n", + " if tags else None\n", + " )\n", + " # query_points_groups runs the prefetches, fuses with RRF, applies the filter, and groups results by document_id.\n", + " return client.query_points_groups(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=100),\n", + " models.Prefetch(query=dense_query, using=\"dense_title\", limit=100),\n", + " models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=100),\n", + " models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=100),\n", + " ],\n", + " query=models.FusionQuery(fusion=models.Fusion.RRF),\n", + " query_filter=query_filter,\n", + " group_by=\"document_id\",\n", + " group_size=group_size,\n", + " limit=limit,\n", + " ).groups\n", + "\n", + "show_results(retrieve_grouped)\n" + ] }, { "cell_type": "markdown", "id": "83c7905e", "metadata": {}, - "source": "## Step 6: Score Boosting With a Formula\n\nWhen you have ranking preferences that aren't captured by similarity alone (recency, source authority, geographic proximity, structured boosts), swap RRF for a `FormulaQuery`. Formulas operate on the prefetch scores and payload fields:\n\n- `$score[i]` references the score from prefetch `i`. 
Prefetch order is load-bearing.\n",
+    "- The `defaults` map provides fallback values for candidates that didn't appear in every prefetch, so the formula still evaluates.\n",
+    "\n",
+    "The formula below sums the chunk score with weighted contributions from the title, abstract, and sparse prefetches. This is a linear combination of raw scores, which breaks down when prefetches use different scoring scales. RRF avoids this by discarding scores; DBSF normalizes per prefetch; a custom formula has to align distributions itself, typically with [decay functions](https://qdrant.tech/documentation/search/search-relevance/#decay-functions). The full FormulaQuery syntax lives in the [Score Boosting](https://qdrant.tech/documentation/search/search-relevance/#score-boosting) reference.\n",
+    "\n",
+    "For time-based decay on a `published_at` payload field, swap a term for an `exp_decay` expression; a hedged sketch appears in the wrap-up at the end of the notebook.\n",
+    "\n",
+    "For RRF vs. 
DBSF guidance, see the [hybrid-search FAQ](https://qdrant.tech/documentation/faq/qdrant-fundamentals/#when-should-i-use-reciprocal-rank-fusion-rrf-vs-distribution-based-score-fusion-dbsf-for-hybrid-search).\n" + ] }, { "cell_type": "code", @@ -191,13 +478,47 @@ "id": "d25beee5", "metadata": {}, "outputs": [], - "source": "def retrieve_boosted(query, limit=10, group_size=3):\n dense_query = models.Document(text=query, model=DENSE_MODEL)\n sparse_query = models.Document(text=query, model=BM25_MODEL)\n return client.query_points_groups(\n collection_name=\"arxiv_multi_repr\",\n prefetch=[\n # $score[0] = chunk, $score[1] = title, $score[2] = abstract, $score[3] = sparse\n models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=100),\n models.Prefetch(query=dense_query, using=\"dense_title\", limit=100),\n models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=100),\n models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=100),\n ],\n query=models.FormulaQuery(\n formula=models.SumExpression(sum=[\n \"$score[0]\",\n models.MultExpression(mult=[0.5, \"$score[1]\"]),\n models.MultExpression(mult=[0.4, \"$score[2]\"]),\n models.MultExpression(mult=[0.3, \"$score[3]\"]),\n ]),\n defaults={\"$score[1]\": 0.0, \"$score[2]\": 0.0, \"$score[3]\": 0.0},\n ),\n group_by=\"document_id\",\n group_size=group_size,\n limit=limit,\n ).groups\n\nshow_results(retrieve_boosted)\n" + "source": [ + "def retrieve_boosted(query, limit=10, group_size=3):\n", + " dense_query = models.Document(text=query, model=DENSE_MODEL)\n", + " sparse_query = models.Document(text=query, model=BM25_MODEL)\n", + " return client.query_points_groups(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " prefetch=[\n", + " # $score[0] = chunk, $score[1] = title, $score[2] = abstract, $score[3] = sparse\n", + " models.Prefetch(query=dense_query, using=\"dense_chunk\", limit=100),\n", + " models.Prefetch(query=dense_query, using=\"dense_title\", limit=100),\n", + " models.Prefetch(query=dense_query, using=\"dense_abstract\", limit=100),\n", + " models.Prefetch(query=sparse_query, using=\"sparse_title\", limit=100),\n", + " ],\n", + " query=models.FormulaQuery(\n", + " formula=models.SumExpression(sum=[\n", + " models.MultExpression(mult=[1.0, \"$score[0]\"]),\n", + " models.MultExpression(mult=[0.5, \"$score[1]\"]),\n", + " models.MultExpression(mult=[0.4, \"$score[2]\"]),\n", + " models.MultExpression(mult=[0.3, \"$score[3]\"]),\n", + " ]),\n", + " defaults={\"$score[1]\": 0.0, \"$score[2]\": 0.0, \"$score[3]\": 0.0},\n", + " ),\n", + " group_by=\"document_id\",\n", + " group_size=group_size,\n", + " limit=limit,\n", + " ).groups\n", + "\n", + "show_results(retrieve_boosted)\n" + ] }, { "cell_type": "markdown", "id": "ca1e7741", "metadata": {}, - "source": "## Wrap-up\n\nThat's the recommended multi-representation pipeline end to end. The same schema works for any corpus with title-like, abstract-like, and body-like representations. Swap the dataset, retune which representations earn their prefetch slots for your data, and wire in formula-based ranking preferences as needed.\n\nFor the design rationale and references, see the [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/).\n" + "source": [ + "## Wrap-up\n", + "\n", + "That's the recommended multi-representation pipeline end to end. The same schema works for any corpus with title-like, abstract-like, and body-like representations. 
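\n",
+    "\n",
+    "The `exp_decay` swap mentioned in Step 6 was not exercised above. As an unverified sketch, it might look like this, where `published_at` is a hypothetical datetime payload field and the expression classes follow the [Score Boosting](https://qdrant.tech/documentation/search/search-relevance/#score-boosting) reference:\n",
+    "\n",
+    "```python\n",
+    "recency = models.ExpDecayExpression(exp_decay=models.DecayParamsExpression(\n",
+    "    x=models.DatetimeKeyExpression(datetime_key=\"published_at\"),  # hypothetical field\n",
+    "    target=models.DatetimeExpression(datetime=\"2026-01-01T00:00:00Z\"),  # decay away from 'now'\n",
+    "    scale=86400 * 365,  # window of the decay in seconds (~1 year)\n",
+    "    midpoint=0.5,  # value the decay reaches at one scale from target\n",
+    "))\n",
+    "# then, e.g.: formula=models.SumExpression(sum=[\"$score[0]\", recency])\n",
+    "```\n",
+    "\n",
+    "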
Swap the dataset, retune which representations earn their prefetch slots for your data, and wire in formula-based ranking preferences as needed.\n", + "\n", + "For the design rationale and references, see the [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/).\n" + ] } ], "metadata": { @@ -221,4 +542,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} From 287eab561f978b66347b94d3068420c7cffe7fef Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Mon, 11 May 2026 22:05:20 -0400 Subject: [PATCH 6/8] add document_id payload index and expected-results summary Adds the missing keyword index on document_id so the grouping step (query_points_groups with group_by="document_id") works under strict mode. Tweaks the upload_points call to batch_size=256, parallel=2 for faster ingestion against Cloud Inference. Adds an expected-results summary to the wrap-up so readers running the same query against the same dataset can compare their output to the reference. --- .../multi-representation-search.ipynb | 26 +++++++++++++++---- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/multi-representation-search/multi-representation-search.ipynb b/multi-representation-search/multi-representation-search.ipynb index 10e28e7..ba33ebf 100644 --- a/multi-representation-search/multi-representation-search.ipynb +++ b/multi-representation-search/multi-representation-search.ipynb @@ -117,8 +117,8 @@ "\n", "# Replace url and api_key with your own from https://cloud.qdrant.io\n", "client = QdrantClient(\n", - " url=\"https://e78b6697-b948-4f6e-aa81-812786852034.eu-west-1-0.aws.cloud.qdrant.io\",\n", - " api_key=\"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3MiOiJtIiwic3ViamVjdCI6ImFwaS1rZXk6NWI2YjRmNGUtMWE2NC00YmEwLWJhNGYtZDE4MjFlM2E1YzE0In0.VmfrnBg4ZH6ferXerGYmSQlDe0lbQ1s8RDNACP1sz_A\",\n", + " url=\"https://xyz-example.qdrant.io:6333\",\n", + " api_key=\"\",\n", " cloud_inference=True,\n", ")\n", "\n", @@ -135,7 +135,12 @@ " },\n", ")\n", "\n", - "# Index the 'tags' payload as keyword so we can filter on category at query time.\n", + "# Index 'document_id' so the Query API can group by it; index 'tags' so we can filter on category.\n", + "client.create_payload_index(\n", + " collection_name=\"arxiv_multi_repr\",\n", + " field_name=\"document_id\",\n", + " field_schema=models.PayloadSchemaType.KEYWORD,\n", + ")\n", "client.create_payload_index(\n", " collection_name=\"arxiv_multi_repr\",\n", " field_name=\"tags\",\n", @@ -209,7 +214,7 @@ " },\n", " ))\n", "\n", - "client.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=64)\n", + "client.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=256, parallel=2)\n", "print(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")\n" ] }, @@ -515,7 +520,18 @@ "source": [ "## Wrap-up\n", "\n", - "That's the recommended multi-representation pipeline end to end. The same schema works for any corpus with title-like, abstract-like, and body-like representations. Swap the dataset, retune which representations earn their prefetch slots for your data, and wire in formula-based ranking preferences as needed.\n", + "That's the recommended multi-representation pipeline end to end. 
The same schema works for any corpus with title-like, abstract-like, and body-like representations.\n", + "\n", + "If you ran this notebook with the same `SAMPLE_QUERY` (\"diffusion models for image synthesis\") and the same 20,000-paper arXiv slice, here's roughly what each step's top 5 should produce:\n", + "\n", + "- **Step 1 (`dense_chunk` only):** chunk-level results with the same paper appearing in multiple slots. SegDiff, LDM, GLIDE in the top 5.\n", + "- **Step 2 (+ `sparse_title`):** title-exact matches surface. Vector Quantized Diffusion Model jumps in.\n", + "- **Step 3 (+ `dense_title`):** LDM dominates with three of its own chunks. Semantic title match takes over.\n", + "- **Step 4 (+ `dense_abstract`):** modest shift. GLIDE returns thanks to abstract-level signal. Adding a prefetch isn't always dramatic.\n", + "- **Step 5 (grouping):** one entry per paper. The collapsed LDM chunks free up slots for Palette, Global Context, and Implicit Image Segmentation.\n", + "- **Step 6 (formula):** custom weighting reorders results. Vector Quantized Diffusion climbs back; ImageBART and Manifold-aware Synthesis enter as the formula amplifies raw scores differently from RRF's rank-based fusion.\n", + "\n", + "Swap the dataset, retune which representations earn their prefetch slots for your data, and wire in formula-based ranking preferences as needed.\n", "\n", "For the design rationale and references, see the [tutorial](https://qdrant.tech/documentation/tutorials-search-engineering/multi-representation-search/).\n" ] From c01fe7c2ccb314c1212bc398267e7deafd96dff5 Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Wed, 13 May 2026 16:03:42 -0400 Subject: [PATCH 7/8] update tags label --- .../multi-representation-search.ipynb | 86 +------------------ 1 file changed, 3 insertions(+), 83 deletions(-) diff --git a/multi-representation-search/multi-representation-search.ipynb b/multi-representation-search/multi-representation-search.ipynb index ba33ebf..ac34cc3 100644 --- a/multi-representation-search/multi-representation-search.ipynb +++ b/multi-representation-search/multi-representation-search.ipynb @@ -52,40 +52,7 @@ "id": "ed2823ba", "metadata": {}, "outputs": [], - "source": [ - "from datasets import load_dataset\n", - "\n", - "ML_CATEGORIES = {\"cs.LG\", \"cs.CV\", \"cs.CL\", \"cs.AI\", \"stat.ML\"}\n", - "\n", - "# Non-streaming so HF caches the parquet locally; first run downloads ~2.5 GB, re-runs are instant.\n", - "dataset = load_dataset(\"gfissore/arxiv-abstracts-2021\", split=\"train\")\n", - "\n", - "papers = []\n", - "# IDs are roughly chronological; iterate from the end to land on 2021/2020/2019 papers first.\n", - "for i in range(len(dataset) - 1, -1, -1):\n", - " if len(papers) >= 20000:\n", - " break\n", - " row = dataset[i]\n", - " if not row[\"abstract\"] or not row[\"title\"]:\n", - " continue\n", - " # categories arrive as space-joined strings (e.g. 
[\"cs.LG cs.CV\"]); split each entry.\n", - " cats = [tok for entry in row[\"categories\"] for tok in entry.split()]\n", - " if not any(c in ML_CATEGORIES for c in cats):\n", - " continue\n", - " # Year lives in the YYMM prefix of new-format arXiv IDs (\"2104.01234\" -> 2021).\n", - " arxiv_id = row[\"id\"]\n", - " if \"/\" in arxiv_id or \".\" not in arxiv_id:\n", - " continue # skip pre-2007 IDs like \"math/0506001\"\n", - " if 2000 + int(arxiv_id[:2]) < 2018:\n", - " continue\n", - " papers.append({\n", - " \"arxiv_id\": arxiv_id,\n", - " \"title\": row[\"title\"].strip(),\n", - " \"abstract\": row[\"abstract\"].strip(),\n", - " \"categories\": cats,\n", - " })\n", - "print(f\"Loaded {len(papers)} papers\")" - ] + "source": "from datasets import load_dataset\n\nML_CATEGORIES = {\"cs.LG\", \"cs.CV\", \"cs.CL\", \"cs.AI\", \"stat.ML\"}\n\n# Non-streaming so HF caches the parquet locally; first run downloads ~2.5 GB, re-runs are instant.\ndataset = load_dataset(\"gfissore/arxiv-abstracts-2021\", split=\"train\")\n\npapers = []\n# IDs are roughly chronological; iterate from the end to land on 2021/2020/2019 papers first.\nfor i in range(len(dataset) - 1, -1, -1):\n if len(papers) >= 20000:\n break\n row = dataset[i]\n if not row[\"abstract\"] or not row[\"title\"]:\n continue\n # categories arrive as space-joined strings (e.g. [\"cs.LG cs.CV\"]); split each entry.\n cats = [tok for entry in row[\"categories\"] for tok in entry.split()]\n if not any(c in ML_CATEGORIES for c in cats):\n continue\n # Year lives in the YYMM prefix of new-format arXiv IDs (\"2104.01234\" -> 2021).\n arxiv_id = row[\"id\"]\n if \"/\" in arxiv_id or \".\" not in arxiv_id:\n continue # skip pre-2007 IDs like \"math/0506001\"\n if 2000 + int(arxiv_id[:2]) < 2018:\n continue\n papers.append({\n \"arxiv_id\": arxiv_id,\n \"title\": row[\"title\"].strip(),\n \"abstract\": row[\"abstract\"].strip(),\n \"tags\": cats,\n })\nprint(f\"Loaded {len(papers)} papers\")" }, { "cell_type": "markdown", @@ -169,54 +136,7 @@ "id": "725afca6", "metadata": {}, "outputs": [], - "source": [ - "DENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\n", - "BM25_MODEL = \"qdrant/bm25\"\n", - "\n", - "def chunk_sentences(text, target_len=2):\n", - " \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n", - " sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n", - " return [\". 
\".join(sentences[i:i + target_len])\n", - " for i in range(0, len(sentences), target_len)] or [text]\n", - "\n", - "\n", - "points = []\n", - "for paper in papers:\n", - " chunks = chunk_sentences(paper[\"abstract\"])\n", - "\n", - " # Title, abstract, and sparse docs are reused across every chunk of this paper; only the chunk text varies.\n", - " # Cloud Inference embeds each Document on the server, so you don't need a client-side embedding library.\n", - " title_doc = models.Document(text=paper[\"title\"], model=DENSE_MODEL)\n", - " abstract_doc = models.Document(text=paper[\"abstract\"], model=DENSE_MODEL)\n", - " # avg_len is the average word count of the indexed text.\n", - " # Default is 256 (document-length); setting it to the actual field length (~10 here) improves BM25 scoring accuracy.\n", - " sparse_doc = models.Document(\n", - " text=paper[\"title\"],\n", - " model=BM25_MODEL,\n", - " options={\"avg_len\": 10.0},\n", - " )\n", - "\n", - " for i, chunk in enumerate(chunks):\n", - " points.append(models.PointStruct(\n", - " id=len(points),\n", - " vector={\n", - " \"dense_chunk\": models.Document(text=chunk, model=DENSE_MODEL),\n", - " \"dense_title\": title_doc,\n", - " \"dense_abstract\": abstract_doc,\n", - " \"sparse_title\": sparse_doc,\n", - " },\n", - " payload={\n", - " \"document_id\": paper[\"arxiv_id\"],\n", - " \"title\": paper[\"title\"],\n", - " \"tags\": paper[\"categories\"],\n", - " \"chunk_index\": i,\n", - " \"chunk_text\": chunk,\n", - " },\n", - " ))\n", - "\n", - "client.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=256, parallel=2)\n", - "print(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")\n" - ] + "source": "DENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\nBM25_MODEL = \"qdrant/bm25\"\n\ndef chunk_sentences(text, target_len=2):\n \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n return [\". 
\".join(sentences[i:i + target_len])\n for i in range(0, len(sentences), target_len)] or [text]\n\n\npoints = []\nfor paper in papers:\n chunks = chunk_sentences(paper[\"abstract\"])\n\n # Title, abstract, and sparse docs are reused across every chunk of this paper; only the chunk text varies.\n # Cloud Inference embeds each Document on the server, so you don't need a client-side embedding library.\n title_doc = models.Document(text=paper[\"title\"], model=DENSE_MODEL)\n abstract_doc = models.Document(text=paper[\"abstract\"], model=DENSE_MODEL)\n # avg_len is the average word count of the indexed text.\n # Default is 256 (document-length); setting it to the actual field length (~10 here) improves BM25 scoring accuracy.\n sparse_doc = models.Document(\n text=paper[\"title\"],\n model=BM25_MODEL,\n options={\"avg_len\": 10.0},\n )\n\n for i, chunk in enumerate(chunks):\n points.append(models.PointStruct(\n id=len(points),\n vector={\n \"dense_chunk\": models.Document(text=chunk, model=DENSE_MODEL),\n \"dense_title\": title_doc,\n \"dense_abstract\": abstract_doc,\n \"sparse_title\": sparse_doc,\n },\n payload={\n \"document_id\": paper[\"arxiv_id\"],\n \"title\": paper[\"title\"],\n \"tags\": paper[\"tags\"],\n \"chunk_index\": i,\n \"chunk_text\": chunk,\n },\n ))\n\nclient.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=256, parallel=2)\nprint(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")" }, { "cell_type": "markdown", @@ -558,4 +478,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file From c17ad4f1e80943baa3c3238c4d2c8af2d5a7f539 Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Thu, 14 May 2026 10:12:38 -0400 Subject: [PATCH 8/8] remove parallelization for Google colab --- .../multi-representation-search.ipynb | 86 ++++++++++++++++++- 1 file changed, 83 insertions(+), 3 deletions(-) diff --git a/multi-representation-search/multi-representation-search.ipynb b/multi-representation-search/multi-representation-search.ipynb index ac34cc3..6421baa 100644 --- a/multi-representation-search/multi-representation-search.ipynb +++ b/multi-representation-search/multi-representation-search.ipynb @@ -52,7 +52,40 @@ "id": "ed2823ba", "metadata": {}, "outputs": [], - "source": "from datasets import load_dataset\n\nML_CATEGORIES = {\"cs.LG\", \"cs.CV\", \"cs.CL\", \"cs.AI\", \"stat.ML\"}\n\n# Non-streaming so HF caches the parquet locally; first run downloads ~2.5 GB, re-runs are instant.\ndataset = load_dataset(\"gfissore/arxiv-abstracts-2021\", split=\"train\")\n\npapers = []\n# IDs are roughly chronological; iterate from the end to land on 2021/2020/2019 papers first.\nfor i in range(len(dataset) - 1, -1, -1):\n if len(papers) >= 20000:\n break\n row = dataset[i]\n if not row[\"abstract\"] or not row[\"title\"]:\n continue\n # categories arrive as space-joined strings (e.g. 
[\"cs.LG cs.CV\"]); split each entry.\n cats = [tok for entry in row[\"categories\"] for tok in entry.split()]\n if not any(c in ML_CATEGORIES for c in cats):\n continue\n # Year lives in the YYMM prefix of new-format arXiv IDs (\"2104.01234\" -> 2021).\n arxiv_id = row[\"id\"]\n if \"/\" in arxiv_id or \".\" not in arxiv_id:\n continue # skip pre-2007 IDs like \"math/0506001\"\n if 2000 + int(arxiv_id[:2]) < 2018:\n continue\n papers.append({\n \"arxiv_id\": arxiv_id,\n \"title\": row[\"title\"].strip(),\n \"abstract\": row[\"abstract\"].strip(),\n \"tags\": cats,\n })\nprint(f\"Loaded {len(papers)} papers\")" + "source": [ + "from datasets import load_dataset\n", + "\n", + "ML_CATEGORIES = {\"cs.LG\", \"cs.CV\", \"cs.CL\", \"cs.AI\", \"stat.ML\"}\n", + "\n", + "# Non-streaming so HF caches the parquet locally; first run downloads ~2.5 GB, re-runs are instant.\n", + "dataset = load_dataset(\"gfissore/arxiv-abstracts-2021\", split=\"train\")\n", + "\n", + "papers = []\n", + "# IDs are roughly chronological; iterate from the end to land on 2021/2020/2019 papers first.\n", + "for i in range(len(dataset) - 1, -1, -1):\n", + " if len(papers) >= 20000:\n", + " break\n", + " row = dataset[i]\n", + " if not row[\"abstract\"] or not row[\"title\"]:\n", + " continue\n", + " # categories arrive as space-joined strings (e.g. [\"cs.LG cs.CV\"]); split each entry.\n", + " cats = [tok for entry in row[\"categories\"] for tok in entry.split()]\n", + " if not any(c in ML_CATEGORIES for c in cats):\n", + " continue\n", + " # Year lives in the YYMM prefix of new-format arXiv IDs (\"2104.01234\" -> 2021).\n", + " arxiv_id = row[\"id\"]\n", + " if \"/\" in arxiv_id or \".\" not in arxiv_id:\n", + " continue # skip pre-2007 IDs like \"math/0506001\"\n", + " if 2000 + int(arxiv_id[:2]) < 2018:\n", + " continue\n", + " papers.append({\n", + " \"arxiv_id\": arxiv_id,\n", + " \"title\": row[\"title\"].strip(),\n", + " \"abstract\": row[\"abstract\"].strip(),\n", + " \"tags\": cats,\n", + " })\n", + "print(f\"Loaded {len(papers)} papers\")" + ] }, { "cell_type": "markdown", @@ -136,7 +169,54 @@ "id": "725afca6", "metadata": {}, "outputs": [], - "source": "DENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\nBM25_MODEL = \"qdrant/bm25\"\n\ndef chunk_sentences(text, target_len=2):\n \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n return [\". 
\".join(sentences[i:i + target_len])\n for i in range(0, len(sentences), target_len)] or [text]\n\n\npoints = []\nfor paper in papers:\n chunks = chunk_sentences(paper[\"abstract\"])\n\n # Title, abstract, and sparse docs are reused across every chunk of this paper; only the chunk text varies.\n # Cloud Inference embeds each Document on the server, so you don't need a client-side embedding library.\n title_doc = models.Document(text=paper[\"title\"], model=DENSE_MODEL)\n abstract_doc = models.Document(text=paper[\"abstract\"], model=DENSE_MODEL)\n # avg_len is the average word count of the indexed text.\n # Default is 256 (document-length); setting it to the actual field length (~10 here) improves BM25 scoring accuracy.\n sparse_doc = models.Document(\n text=paper[\"title\"],\n model=BM25_MODEL,\n options={\"avg_len\": 10.0},\n )\n\n for i, chunk in enumerate(chunks):\n points.append(models.PointStruct(\n id=len(points),\n vector={\n \"dense_chunk\": models.Document(text=chunk, model=DENSE_MODEL),\n \"dense_title\": title_doc,\n \"dense_abstract\": abstract_doc,\n \"sparse_title\": sparse_doc,\n },\n payload={\n \"document_id\": paper[\"arxiv_id\"],\n \"title\": paper[\"title\"],\n \"tags\": paper[\"tags\"],\n \"chunk_index\": i,\n \"chunk_text\": chunk,\n },\n ))\n\nclient.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=256, parallel=2)\nprint(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")" + "source": [ + "DENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\n", + "BM25_MODEL = \"qdrant/bm25\"\n", + "\n", + "def chunk_sentences(text, target_len=2):\n", + " \"\"\"Split text into ~2-sentence chunks; fall back to the full text if it doesn't split cleanly.\"\"\"\n", + " sentences = [s.strip() for s in text.split(\". \") if s.strip()]\n", + " return [\". 
\".join(sentences[i:i + target_len])\n", + " for i in range(0, len(sentences), target_len)] or [text]\n", + "\n", + "\n", + "points = []\n", + "for paper in papers:\n", + " chunks = chunk_sentences(paper[\"abstract\"])\n", + "\n", + " # Title, abstract, and sparse docs are reused across every chunk of this paper; only the chunk text varies.\n", + " # Cloud Inference embeds each Document on the server, so you don't need a client-side embedding library.\n", + " title_doc = models.Document(text=paper[\"title\"], model=DENSE_MODEL)\n", + " abstract_doc = models.Document(text=paper[\"abstract\"], model=DENSE_MODEL)\n", + " # avg_len is the average word count of the indexed text.\n", + " # Default is 256 (document-length); setting it to the actual field length (~10 here) improves BM25 scoring accuracy.\n", + " sparse_doc = models.Document(\n", + " text=paper[\"title\"],\n", + " model=BM25_MODEL,\n", + " options={\"avg_len\": 10.0},\n", + " )\n", + "\n", + " for i, chunk in enumerate(chunks):\n", + " points.append(models.PointStruct(\n", + " id=len(points),\n", + " vector={\n", + " \"dense_chunk\": models.Document(text=chunk, model=DENSE_MODEL),\n", + " \"dense_title\": title_doc,\n", + " \"dense_abstract\": abstract_doc,\n", + " \"sparse_title\": sparse_doc,\n", + " },\n", + " payload={\n", + " \"document_id\": paper[\"arxiv_id\"],\n", + " \"title\": paper[\"title\"],\n", + " \"tags\": paper[\"tags\"],\n", + " \"chunk_index\": i,\n", + " \"chunk_text\": chunk,\n", + " },\n", + " ))\n", + "\n", + "client.upload_points(collection_name=\"arxiv_multi_repr\", points=points, batch_size=256)\n", + "print(f\"Uploaded {len(points)} chunks across {len(papers)} papers\")" + ] }, { "cell_type": "markdown", @@ -478,4 +558,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +}