Introduction
In a previous article, we addressed a critical limitation of today’s Retrieval-Augmented Generation (RAG) systems: contextual information lost to independent chunking. However, this is just one of RAG’s shortcomings. Another significant challenge is multi-hop question answering (QA), which requires the system to gather and reason over evidence distributed across multiple chunks, often in non-contiguous sections of a document, to arrive at a complete and accurate answer.
The Challenge of Multi-Hop QA
Multi-hop QA tasks demand that multiple pieces of evidence from different sections of a document (or even multiple documents) be combined to answer a question. Some examples are straightforward. For instance:
“Compare iPhone 14 and iPhone 16 characteristics.” This can be handled with two independent retrievals: one for “iPhone 14” and one for “iPhone 16.” Others are more complex and require chained reasoning, such as: “Give me a detailed comparison of the latest macOS versions.” In this case, a model must first determine what the latest macOS versions are (e.g., Sequoia, Sonoma), and then perform separate retrievals or queries for each version before aggregating and comparing the information. This multi-step querying process is computationally expensive and slow.
A Simpler Alternative: Single-Model Chunk Scoring
What if we could avoid query generation altogether and directly score all document chunks against the original question using a single machine learning model? This approach solves three core RAG problems:
- Eliminates the need for query generation
- Mitigates the downsides of independent chunking
- Enables multi-hop reasoning in a single pass
The Nuance: Chunk Embeddings Instead of Token Embeddings
A common objection is: wouldn’t scoring all chunks at once require large context windows, the very problem RAG was meant to solve? Here’s the twist: we replace individual token embeddings with chunk embeddings, where each chunk is up to 200 tokens long. With a sequence length of 500 chunk positions, this setup effectively encodes up to 100,000 tokens’ worth of content (500 × 200).
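To make the idea concrete, here is a minimal sketch of how chunk embeddings can be produced with the jinaai/jina-embeddings-v3 model used in the training setup below; the chunk_document helper is an illustrative placeholder, not part of our released code.

```python
# Minimal sketch: one embedding per ~200-token chunk instead of one per token.
# Assumes the sentence-transformers wrapper for jinaai/jina-embeddings-v3 (1024-d vectors).
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

def chunk_document(text: str, max_tokens: int = 200) -> list[str]:
    """Naive whitespace chunking into ~200-token pieces (placeholder for a real chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

document = "..."  # any long document
chunks = chunk_document(document)

# A sequence of 500 chunk vectors stands in for up to 500 * 200 = 100,000 raw tokens.
chunk_embeddings = embedder.encode(chunks)                  # shape: (num_chunks, 1024)
question_embedding = embedder.encode("Compare iPhone 14 and iPhone 16 characteristics.")
```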
Training Setup
Training was conducted on a desktop PC with an 8GB GPU for a few hours. The setup was as follows:
- Embedding model: jinaai/jina-embeddings-v3
- Base model for fine-tuning: google-bert/bert-base-uncased
- Input format: [question, chunk_i_1, chunk_i_2, …, chunk_j_1, chunk_j_2, …] where chunk_i_1 is the first chunk embedding of document_i.
- Labels: [0, label_i_1, label_i_2, …, label_j_1, label_j_2, …] where label_i_x = 1 if chunk_i_x contributes to the answer, else 0; the leading 0 corresponds to the question position (see the sketch after this list)
- Chunks are shuffled across documents for each mini-batch, but order within a document is preserved
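As a rough illustration of this input format, the following sketch assembles one training sample; the tensor names and the exact shuffling scheme are assumptions based on the description above, not the released code.

```python
# Build one sample: [question_emb, chunk_i_1, ..., chunk_j_1, ...] with aligned 0/1 labels.
import torch

def build_sample(question_emb, docs):
    """question_emb: 1-d embedding of the question.
    docs: list of (chunk_embeddings [n_i, 1024], chunk_labels [n_i]) per document."""
    # One simple interpretation of the shuffling rule: permute documents,
    # but keep the chunk order inside each document.
    order = torch.randperm(len(docs)).tolist()
    embs = [torch.as_tensor(question_emb, dtype=torch.float32).unsqueeze(0)]
    labels = [torch.zeros(1)]                    # position 0 is the question -> label 0
    for idx in order:
        chunk_embs, chunk_labels = docs[idx]
        embs.append(torch.as_tensor(chunk_embs, dtype=torch.float32))
        labels.append(torch.as_tensor(chunk_labels, dtype=torch.float32))
    return torch.cat(embs), torch.cat(labels)    # [seq_len, 1024], [seq_len]
```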
Training parameters (a training-step sketch follows this list):
- Batch size: 16
- Precision: FP32
- Learning rate: 1e-5
- Epochs: 100
- Average sequence length: 511
- Loss function: Binary Cross-Entropy
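Putting the pieces together, here is a minimal training-step sketch under stated assumptions: the 1024-d chunk embeddings are projected to BERT’s 768-d hidden size (a detail not spelled out above) and fed in via inputs_embeds, and a linear head scores every position with binary cross-entropy.

```python
# Training-step sketch: BERT over chunk embeddings, BCE loss on per-position relevance.
import torch
from torch import nn
from transformers import BertModel

class ChunkScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1024, 768)         # jina-v3 dim -> BERT hidden dim (assumption)
        self.bert = BertModel.from_pretrained("google-bert/bert-base-uncased")
        self.head = nn.Linear(768, 1)            # one relevance logit per position

    def forward(self, chunk_embs, attention_mask):
        hidden = self.bert(inputs_embeds=self.proj(chunk_embs),
                           attention_mask=attention_mask).last_hidden_state
        return self.head(hidden).squeeze(-1)     # [batch, seq_len]

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ChunkScorer().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # lr from the list above
loss_fn = nn.BCEWithLogitsLoss()

def train_step(chunk_embs, attention_mask, labels):
    # chunk_embs: [batch, seq_len, 1024]; attention_mask, labels: [batch, seq_len]
    logits = model(chunk_embs, attention_mask)
    loss = loss_fn(logits[attention_mask.bool()], labels[attention_mask.bool()])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```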
Dataset
A hybrid dataset was selected for training, composed of several sources:
- Test set: 100 samples from HotpotQA
Results
Recall@10 was used as the performance metric: the fraction of relevant chunks retrieved among the top 10 candidates (computed as sketched after the sample output below).
Baseline comparisons:
- Raw cosine similarity using jinaai embeddings
- Cohere Rerank API, which evaluates relevance with a learned scoring model rather than relying solely on cosine similarity. Cosine similarity is efficient but often misses semantic nuances between a question and a passage; rerankers such as Cohere’s assign a more accurate relevance score based on a contextual reading of the query and each retrieved chunk, scored one at a time (a sketch of both baselines follows).
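For reference, both baselines can be approximated in a few lines; the Cohere model name below ("rerank-english-v3.0") is illustrative, so check Cohere’s documentation for current identifiers.

```python
# Baseline sketches: raw cosine similarity over chunk embeddings vs. Cohere Rerank.
import numpy as np
import cohere

def cosine_top10(question_embedding, chunk_embeddings):
    q = question_embedding / np.linalg.norm(question_embedding)
    c = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    scores = c @ q                                # cosine similarity per chunk
    return np.argsort(-scores)[:10]               # indices of the 10 best chunks

def cohere_top10(question, chunks, api_key):
    co = cohere.Client(api_key)
    response = co.rerank(model="rerank-english-v3.0", query=question,
                         documents=chunks, top_n=10)
    return [r.index for r in response.results]    # chunk indices, best first
```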
Sample output:
- Relevant chunks: [1, 312, 313]
- Top 10 candidates: [1, 313, 312, 2, 188, 46, 183, 325, 91, 149]
- Recall@10: 3/3
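For the sample above, recall@10 works out as follows; the tiny helper is only meant to make the metric explicit.

```python
# Recall@10 for the sample above: all 3 relevant chunks appear in the top 10 candidates.
def recall_at_k(relevant, candidates, k=10):
    return sum(c in relevant for c in candidates[:k]) / len(relevant)

print(recall_at_k({1, 312, 313}, [1, 313, 312, 2, 188, 46, 183, 325, 91, 149]))  # 1.0 (3/3)
```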
Open-Sourcing the Code
Want to try this yourself? We’ve open-sourced the code with training and testing scripts, including integrations with Cohere Rerank and cosine similarity baselines.
Future Work
This experiment used bert-base-uncased, which is limited to 512 input positions. Next steps:
- Experiment with larger models: bert-large, DeBERTa, Longformer, etc.
- Jointly fine-tune the embedding and scoring models to enable end-to-end optimization. This allows the embedding space to evolve in a way that is directly aligned with the downstream scoring task, potentially improving both chunk relevance and overall retrieval accuracy.
- Move beyond chunk scoring: fine-tune a large language model such as LLaMA 3 to reason directly over chunk embeddings and generate answers end-to-end. This enables the model to not only retrieve relevant content but also synthesize and articulate coherent multi-hop responses in a single step.
Conclusion
Transformer-based language models continue to impress. Across recent experiments, we have observed that they adapt quickly and efficiently to new input modalities such as character probabilities or chunk embeddings, and their adaptability to complex tasks like multi-hop reasoning is fascinating.
FAQs
Q: What is multi-hop question answering?
A: Multi-hop question answering involves integrating evidence spread across multiple chunks to derive a complete and accurate response.
Q: What is the challenge of multi-hop QA?
A: The challenge of multi-hop QA is that it requires the system to gather and reason over information distributed throughout the document — often spanning non-contiguous sections — to arrive at the correct answer.
Q: What is the proposed solution?
A: The proposed solution is to use a single machine learning model to directly score all document chunks against the original question.
Q: What is chunk embedding?
A: Chunk embedding is a technique that replaces individual token embeddings with a single embedding per chunk of up to 200 tokens, allowing far more content to fit within the same sequence length.
Q: What is the future work?
A: The future work includes experimenting with larger models, jointly fine-tuning the embedding and scoring models, and moving beyond chunk scoring to generate answers end-to-end.