
Add FlashRank reranker to HybridRetriever to improve retrieval quality #116

Open
GovindhKishore wants to merge 1 commit into reactome:main from GovindhKishore:feature/flashrank-reranking

@GovindhKishore

Summary

Adds a reranking layer to HybridRetriever in csv_chroma.py so that responses stop growing longer and noisier as more data sources are integrated into the retrieval pipeline.

Problem

The current pipeline retrieves documents from multiple subdirectories using BM25 + SelfQuery + MultiQuery expansion, resulting in ~90 documents being passed directly to create_stuff_documents_chain.

There is no cross-subdirectory relevance filtering - all retrieved documents are stuffed into the LLM prompt regardless of how relevant they are to the original user query. This causes:

  • Responses become longer and noisier as more data is added
  • Low-relevance documents from one subdirectory are treated the same as high-relevance documents from another
  • The LLM receives too much context, which reduces answer precision

Solution

A new module src/retrievers/reranker.py is introduced using FlashRank (ms-marco-MiniLM-L-12-v2). After weighted_reciprocal_rank merges results across all subdirectories, the reranker scores every retrieved document against the original user query using a cross-encoder model and returns only the top N most relevant documents.
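For context on why a second scoring pass helps: rank fusion looks only at each document's position in the per-retriever lists, never at the query text itself. The sketch below is a generic weighted reciprocal rank fusion, not the repository's weighted_reciprocal_rank code; the function name `weighted_rrf`, the document IDs, and the constant `c=60` (a conventional default) are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives sum(weight / (rank + c)) over the lists it
    appears in, with rank starting at 1. Note that the query is never
    consulted: only positions matter.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")

    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += weight / (rank + c)

    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from two retrievers (IDs are made up).
bm25_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.5, 0.5])
```

A document that ranks moderately well in several lists can outrank one that is highly relevant in a single list, which is the gap a query-aware reranker is meant to close.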

Two functions are provided:

  • rerank() - sync, called by retrieve_documents()
  • arerank() - async, called by aretrieve_documents()

arerank() uses asyncio.to_thread to run the blocking FlashRank inference in a background thread without freezing the async event loop.
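The sync/async pairing described above could look roughly like the following. This is a minimal sketch, not the code in src/retrievers/reranker.py: the `Document` dataclass stands in for the pipeline's document type, the `score_fn` parameter and the `overlap_score` toy scorer are hypothetical and replace the FlashRank cross-encoder so the snippet runs without extra dependencies, and the `top_n=3` default is an assumption.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Stand-in for the pipeline's Document type (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical scorer signature: (query, passage text) -> relevance score.
ScoreFn = Callable[[str, str], float]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer: fraction of query tokens that occur in the text.

    Placeholder only; the PR scores with a FlashRank cross-encoder instead.
    """
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query: str, docs: List[Document], top_n: int = 3,
           score_fn: ScoreFn = overlap_score) -> List[Document]:
    """Score every document against the query and keep the top_n best."""
    scored = [(score_fn(query, d.page_content), i, d) for i, d in enumerate(docs)]
    # Highest score first; the original index breaks ties (stable order).
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [d for _, _, d in scored[:top_n]]


async def arerank(query: str, docs: List[Document], top_n: int = 3,
                  score_fn: ScoreFn = overlap_score) -> List[Document]:
    """Async wrapper: run the blocking rerank() in a worker thread."""
    return await asyncio.to_thread(rerank, query, docs, top_n, score_fn)
```

Because both functions take and return a list of documents, callers can wrap their existing return value (for example `return rerank(query, merged_docs)`, where `merged_docs` is a hypothetical name for the fused results) without touching downstream code. `asyncio.to_thread` requires Python 3.9 or later.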

Changes

  • src/retrievers/reranker.py - new module containing reranking logic
  • src/retrievers/csv_chroma.py - import reranker, update return statements in both retrieve_documents() and aretrieve_documents()
  • config_default.yml - add reranker configuration block
  • pyproject.toml / poetry.lock - add flashrank dependency

Why FlashRank

  • Runs locally - no API key required
  • CPU only - no GPU needed
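  • Lightweight (FlashRank lists ms-marco-MiniLM-L-12-v2 at roughly 34MB; the ~4MB figure applies to its default ms-marco-TinyBERT-L-2-v2 model)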
  • No changes to downstream pipeline - same list[Document] type returned throughout

Impact

Since csv_chroma.py is shared by both Reactome and UniProt retrievers, reranking applies automatically to all current and future database integrations without any additional changes.

Test

# Input: 7 documents (mix of relevant and irrelevant)
# Query: "What does TP53 do in apoptosis?"

# Output after reranking (top 3):
# 1. score=0.9996 | TP53 activates apoptosis through BAX
# 2. score=0.9860 | TP53 and PUMA in intrinsic apoptosis  
# 3. score=0.8930 | p53 regulates cell death signalling

# Correctly dropped:
# RNA polymerase II transcription      (irrelevant)
# Reactome database overview           (irrelevant)
# General cancer pathway summary       (irrelevant)

Note

This contribution was developed with AI assistance (Claude) for understanding the codebase and implementation guidance. All code has been reviewed and understood.

Closes #115

Happy to make any changes based on maintainer feedback.

@GovindhKishore
Author

Hi @adamjohnwright @GFJHogue,

Just flagging this PR for your attention when you get a chance. It directly addresses the retrieval noise issue mentioned across several issues, and since it touches csv_chroma.py, which is shared by both the Reactome and UniProt retrievers, I wanted to make sure the right people are aware of it.

Happy to:

  • Add unit tests if needed
  • Adjust the top_n default value in config_default.yml
  • Discuss alternative reranking models if FlashRank is not preferred

Looking forward to any feedback!

@adamjohnwright
Contributor

@heliamoh, are you able to take a look to see if this resolves the issue(s)?

Development

Successfully merging this pull request may close these issues.

[Feature] Improve retrieval quality by adding reranking layer to HybridRetriever

2 participants