Introduction to Multi-Hop Question Answering
In previous articles, we addressed a critical limitation of today’s Retrieval-Augmented Generation (RAG) systems: missing contextual information due to independent chunking. However, this is just one of RAG’s shortcomings. Another significant challenge is multi-hop question answering (QA), which involves integrating evidence spread across multiple chunks to derive a complete and accurate response. This means the system must gather and reason over information distributed throughout the document — often spanning non-contiguous sections — to arrive at the correct answer.
The Challenge of Multi-Hop QA
Multi-hop QA tasks demand that multiple pieces of evidence from different sections of a document (or even multiple documents) be combined to answer a question. Some examples are straightforward. For instance:
“Compare iPhone 14 and iPhone 16 characteristics.” This can be handled using two independent retrievals: one for “iPhone 14” and one for “iPhone 16.” But others are more complex and require chained reasoning, for example: “Give me a detailed comparison of the latest macOS versions.” In this case, a model must first determine what the latest macOS versions are (e.g., Sonoma, Catalina) and then perform separate retrievals or queries for each version before aggregating and comparing the information. This multi-step querying process is computationally expensive and slow.
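To make that cost concrete, here is a minimal sketch of such a multi-step baseline. The llm and retrieve helpers are hypothetical placeholders standing in for any LLM call and any vector-store lookup; they are not part of the approach proposed in this article.
# Sketch of the multi-step querying baseline described above.
# `llm` and `retrieve` are hypothetical stand-ins, passed in by the caller.
def multi_hop_answer(question, llm, retrieve):
    # Step 1: ask the LLM which entities the question depends on
    # (e.g., "latest macOS versions" -> ["Sonoma", "Catalina"]).
    sub_queries = llm(f"List the entities needed to answer: {question}").splitlines()
    # Step 2: one retrieval per sub-query -- the expensive, slow part.
    evidence = [chunk for q in sub_queries for chunk in retrieve(q, top_k=5)]
    # Step 3: aggregate the evidence and answer in a final LLM call.
    return llm(f"Answer '{question}' using:\n" + "\n".join(evidence))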
A Simpler Alternative: Single-Model Chunk Scoring
What if we could avoid query generation altogether and directly score all document chunks against the original question using a single machine learning model? This approach solves three core RAG problems:
- Eliminates the need for query generation
- Mitigates the downsides of independent chunking
- Enables multi-hop reasoning in a single pass
The Nuance: Chunk Embeddings Instead of Token Embeddings
A common objection is: wouldn’t scoring all chunks at once require large context windows, the very problem RAG was meant to solve? Here’s the twist: we replace individual token embeddings with chunk embeddings — each chunk being up to 200 tokens long. With a sequence length of 500, this setup effectively allows us to encode 100,000 tokens’ worth of content.
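One way to realize this is to bypass the encoder’s token-embedding layer and feed precomputed chunk embeddings directly. Below is a minimal sketch, not the exact implementation: it assumes a BERT-style encoder that accepts embeddings via inputs_embeds, 1024-dimensional chunk embeddings, a linear projection into the encoder’s hidden size, and a per-position head that scores each chunk against the question.
# Minimal sketch of a chunk-level scorer (assumptions noted above): each input
# position carries one chunk embedding (or the question embedding) instead of a token.
import torch.nn as nn
from transformers import AutoModel

class ChunkScorer(nn.Module):
    def __init__(self, emb_dim=1024, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.proj = nn.Linear(emb_dim, hidden)   # map chunk embeddings into the encoder's space
        self.head = nn.Linear(hidden, 1)         # one relevance logit per position

    def forward(self, input_values, position_ids=None, attention_mask=None):
        x = self.proj(input_values)              # B, S, hidden
        out = self.encoder(inputs_embeds=x,
                           position_ids=position_ids,
                           attention_mask=attention_mask)
        return self.head(out.last_hidden_state).squeeze(-1)  # B, S logits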
Training Setup
Training was conducted on a desktop PC with an 8GB GPU for a few hours. The setup was as follows:
- Embedding model: jinaai/jina-embeddings-v3
- Base model for fine-tuning: google/bert-base-uncased
- Input format: [question, chunk_i_1, chunk_i_2, …, chunk_j_1, chunk_j_2, …] where chunk_i_1 is the first chunk embedding of document_i.
- Labels: [0, label_i_1, label_i_2, …, label_j_2, …] where label_i_x = 1 if chunk_i_x contributes to the answer, else 0
- Document order is shuffled for each sample, but chunk order within a document is preserved
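For concreteness, here is a toy example of that layout; the embeddings are random stand-ins for the cached jinaai embeddings used in the real setup.
# Toy illustration of the input/label layout above (values are made up).
import numpy as np

question_emb = np.random.rand(1024)
chunk_i_1, chunk_i_2 = np.random.rand(1024), np.random.rand(1024)   # document i
chunk_j_1, chunk_j_2 = np.random.rand(1024), np.random.rand(1024)   # document j

input_values = [question_emb, chunk_i_1, chunk_i_2, chunk_j_1, chunk_j_2]
labels       = [0, 0, 1, 0, 0]   # here only chunk_i_2 contributes to the answer
position_ids = [0, 1, 2, 1, 2]   # chunk positions restart for each document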
Training Parameters
- Batch size: 16
- Precision: FP32
- Learning rate: 1e-5
- Epochs: 100
- Average sequence length: 511
- Loss function: Binary Cross-Entropy
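A single training step might look like the sketch below. It assumes the ChunkScorer sketch from earlier and a batch produced by the data collator shown at the end of this article; deriving the padding mask from all-zero input rows is an assumption, not necessarily the original masking scheme.
# Minimal sketch of one training step with the parameters above (LR 1e-5, BCE loss).
import torch

model = ChunkScorer().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.BCEWithLogitsLoss(reduction="none")

def training_step(batch):
    logits = model(batch["input_values"].cuda(),
                   position_ids=batch["position_ids"].cuda())
    labels = batch["labels"].float().cuda()
    # Ignore padded positions (assumed to be all-zero embedding rows).
    mask = (batch["input_values"].abs().sum(-1) > 0).float().cuda()
    loss = (loss_fn(logits, labels) * mask).sum() / mask.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()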
Dataset
A hybrid dataset drawn from several sources was used for training.
Test set: 100 samples from HotpotQA
Results
Recall@10 was used as the performance metric — how many relevant chunks were retrieved in the top 10 candidates.
Baseline comparisons:
- Raw cosine similarity using jinaai embeddings
- Cohere Rerank API, which evaluates relevance with a learned scoring model rather than relying solely on cosine similarity. Cosine similarity is efficient but often misses semantic nuances between a question and a passage, so Cohere and similar providers offer reranking models that read the query together with each retrieved chunk (one by one) and assign a more accurate, context-aware relevance score.
Sample output:
- Relevant chunks: [1, 312, 313]
- Top 10 candidates: [1, 313, 312, 2, 188, 46, 183, 325, 91, 149]
- Recall@10: 3/3
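The metric itself is easy to reproduce; a small helper could look like the sketch below, where scores is a hypothetical per-chunk relevance score coming from any of the methods above.
# Recall@k: how many ground-truth chunks appear among the top-k scored candidates.
def recall_at_k(scores, relevant_ids, k=10):
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = len(set(relevant_ids) & set(top_k))
    return hits, len(relevant_ids)

# For the sample above: recall_at_k(scores, [1, 312, 313]) -> (3, 3)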
Open-Sourcing the Code
Want to try this yourself? We’ve open-sourced the code with training and testing scripts, including integrations with Cohere Rerank and cosine similarity baselines.
Future Work
This experiment used bert-base-uncased, which has a limited input size of 512 tokens. Next steps:
- Experiment with larger models: bert-large, DeBERTa, Longformer, etc.
- Jointly fine-tune the embedding and scoring models to enable end-to-end optimization. This allows the embedding space to evolve in a way that is directly aligned with the downstream scoring task, potentially improving both chunk relevance and overall retrieval accuracy.
- Move beyond chunk scoring: fine-tune a large language model such as LLaMA 3 to reason directly over chunk embeddings and generate answers end-to-end. This enables the model to not only retrieve relevant content but also synthesize and articulate coherent multi-hop responses in a single step.
Final Thoughts
Transformer-based language models continue to impress. In our recent experiments they adapted quickly to new input modalities such as character probabilities and chunk embeddings, and their ability to handle complex tasks like multi-hop reasoning is fascinating.
Code Snippets
Chunk Embedding Generation (JinaAI)
# Example code for generating question and chunk embeddings
import hashlib
from transformers import AutoModel

# Hashing helper used as a cache key (assumed here; any stable text hash works).
def hashf(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

emb_cache = {}
embedding_model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
embedding_model.eval()
embedding_model.cuda()
# dataset yields (question, chunks_list, labels_list) samples
for step, (question, chunks_list, labels_list) in enumerate(dataset):
    print(f"\rEmb: {step}/10k", end="", flush=True)  # lightweight progress indicator
    h = hashf(question)
    if h not in emb_cache:
        emb_cache[h] = embedding_model.encode([question], task="retrieval.query")[0]
    for chunks in chunks_list:
        for chunk in chunks:
            h = hashf(chunk)
            if h not in emb_cache:
                emb_cache[h] = embedding_model.encode([chunk], task="retrieval.passage")[0]
Cohere Rerank API
# Example usage of Cohere Rerank API
# `question` is the query string, `documents` is the list of chunk texts to score.
import cohere

co = cohere.Client('your-api-key')
documents_reranked = co.rerank(model="rerank-v3.5", query=question, documents=documents, top_n=10)
top_indices = [r.index for r in documents_reranked.results]  # indices of the top 10 chunks
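The returned indices point back into the input list, so recovering the actual top chunks is a one-liner (assuming documents holds the chunk texts):
top_chunks = [documents[i] for i in top_indices]  # the 10 highest-scoring chunks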
Data Collator (Batch Creator)
# Example code for custom data collator
import random
from typing import Dict

import torch
from torch.nn.utils.rnn import pad_sequence

class DataCollator:
    def shuffle_lists(self, a, b):
        combined = list(zip(a, b))               # Pair corresponding elements
        random.shuffle(combined)                 # Shuffle the pairs
        a_shuffled, b_shuffled = zip(*combined)  # Unzip after shuffling
        return list(a_shuffled), list(b_shuffled)

    def __call__(self, features) -> Dict[str, torch.Tensor]:
        batch = {"input_values": [], "labels": [], "position_ids": []}
        for x in features:
            question, labels_list, chunks_list = x["question"], x["labels_list"], x["chunks_list"]
            # Look up cached embeddings for the question and every chunk
            question_emb, chunks_emb = emb_cache[hashf(question)], []
            for chunks in chunks_list:
                chunks_emb.append([emb_cache[hashf(chunk)] for chunk in chunks])
            # Shuffle document order; chunk order inside each document is preserved
            chunks_emb, labels_list = self.shuffle_lists(chunks_emb, labels_list)
            # Position 0 holds the question (label 0); chunk positions restart per document
            input_values, labels, position_ids = [question_emb], [0], [0]
            for i, embs in enumerate(chunks_emb):
                for idx, emb in enumerate(embs):
                    input_values.append(emb)
                    position_ids.append(idx + 1)
                labels += labels_list[i]
            input_values = torch.tensor(input_values)
            labels = torch.tensor(labels)
            position_ids = torch.tensor(position_ids, dtype=torch.long)
            batch["input_values"].append(input_values)
            batch["labels"].append(labels)
            batch["position_ids"].append(position_ids)
        batch["input_values"] = pad_sequence(batch["input_values"], batch_first=True, padding_value=0)  # B, S, C
        batch["labels"] = pad_sequence(batch["labels"], batch_first=True, padding_value=0)
        batch["position_ids"] = pad_sequence(batch["position_ids"], batch_first=True, padding_value=0)
        return batch
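In training, this collator is simply passed to a standard PyTorch DataLoader; train_dataset below is a placeholder name for the dataset described earlier.
# Wiring the collator into a DataLoader with the batch size used above.
from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=DataCollator())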
## Conclusion
In this article, we discussed the challenges of multi-hop question answering and proposed a simpler alternative using single-model chunk scoring. We also provided code snippets for chunk embedding generation, the Cohere Rerank API, and the data collator. Our experiment showed promising results, and we plan to explore further improvements in the future.
## FAQs
Q: What is multi-hop question answering?
A: Multi-hop question answering is a type of question answering task that requires integrating evidence from multiple sources to derive a complete and accurate response.
Q: What is the challenge of multi-hop QA?
A: The challenge of multi-hop QA is that it requires the model to gather and reason over information distributed throughout the document, often spanning non-contiguous sections, to arrive at the correct answer.
Q: What is the proposed solution?
A: The proposed solution is to use single-model chunk scoring, which eliminates the need for query generation, mitigates the downsides of independent chunking, and enables multi-hop reasoning in a single pass.
Q: What is the role of chunk embeddings?
A: Chunk embeddings replace individual token embeddings, so each input position represents an entire chunk rather than a single token. This lets a model with a modest sequence length cover far more document content in a single pass.
Q: What is the future work?
A: The future work includes experimenting with larger models such as bert-large, DeBERTa, and Longformer, jointly fine-tuning the embedding and scoring models for end-to-end optimization, and fine-tuning a large language model to reason directly over chunk embeddings and generate answers end-to-end.