Document Retrieval

Building an efficient IR system across 7 languages under tight computational limits

When you need to search through 200,000+ documents across multiple languages, and do it fast, the usual tricks might not work. Pre-trained language models deliver impressive results, but they're computationally expensive. Our task: build a multilingual information retrieval system that's both effective and resource-efficient.

The setup was straightforward: retrieve the single most relevant document for each query from a corpus spanning 7 languages (Arabic, German, English, Spanish, French, Italian, and Korean). But there was a catch: inference had to complete within 10 minutes, and we were limited to a Kaggle notebook's resources.

Our Approach

We tested three categories of methods, ranging from classical to modern:

TF-IDF

We started with TF-IDF, a classic lexical method that weighs terms by their frequency in a document and their rarity across the corpus. The baseline scored 0.52 on our test metric (Recall@10), but we improved it by boosting high-IDF terms, since rare terms often point to query-specific documents. Applying a scalar multiplier to the top 2 IDF terms of each query raised performance to 0.5871, roughly a 10% gain.
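A minimal sketch of the boosting idea, assuming scikit-learn's TfidfVectorizer; the multiplier value and toy corpus are illustrative, not our exact tuned configuration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "an introduction to quantum entanglement",
]
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

BOOST = 2.0  # hypothetical scalar multiplier for the top-IDF query terms

query = "quantum cats"
q_vec = vectorizer.transform([query]).toarray()[0]

# Find the query's nonzero terms, rank them by IDF, boost the top 2.
nonzero = np.flatnonzero(q_vec)
top2 = nonzero[np.argsort(vectorizer.idf_[nonzero])[-2:]]
q_vec[top2] *= BOOST

scores = cosine_similarity(q_vec.reshape(1, -1), doc_matrix)[0]
print(docs[int(np.argmax(scores))])  # most relevant document
```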

BM25: The Winner

BM25 refined the TF-IDF approach by accounting for document length and term-frequency saturation. It's a simple formula, but it works remarkably well.
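For reference, the standard Okapi BM25 score of a document $D$ for a query $Q = (q_1, \dots, q_n)$ is shown below; the parameter values are the common defaults, not necessarily the exact ones used here:

$$
\text{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

Here $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the corpus, and typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$. The repeated $f(q_i, D)$ in numerator and denominator produces the saturation effect; the $|D|/\mathrm{avgdl}$ term handles length normalization.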

Figure: IR system pipeline.

The results were striking: BM25 achieved 0.7714 Recall@10 on the dev set and 0.7562 on the final test submission. No complex models needed.
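The write-up doesn't pin down the implementation; as one concrete option, here is a minimal sketch with the rank_bm25 package (an assumption on our part, and whitespace tokenization is a simplification, especially for Korean and Arabic):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "le chat dort sur le tapis",
    "the cat sleeps on the mat",
    "el gato duerme",
]
tokenized = [doc.lower().split() for doc in corpus]  # naive whitespace tokens
bm25 = BM25Okapi(tokenized)  # library defaults: k1=1.5, b=0.75

query = "sleeping cat".lower().split()
scores = bm25.get_scores(query)   # one BM25 score per document
print(corpus[scores.argmax()])    # single most relevant document
```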

Text Embeddings

We tested multilingual-e5-small, a compact embedding model with under 250M parameters. To work around the model's input-length limit, we tried two chunking strategies (20-word and 200-word chunks). The results were disappointing: e5 with 200-word chunks scored only 0.5414, and the shorter chunks performed even worse at 0.3028. Longer chunks provided better context but introduced imbalances across languages.
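A sketch of chunked-embedding retrieval, assuming the sentence-transformers package; scoring each document by its best-matching chunk is an illustrative choice, not necessarily our exact pipeline (the "query: " and "passage: " prefixes are the convention e5 models expect):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

def chunk(text, size=200):
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)] or [""]

docs = ["first long document ...", "second long document ..."]  # toy corpus
chunks, owners = [], []
for doc_id, doc in enumerate(docs):
    for c in chunk(doc):
        chunks.append("passage: " + c)   # e5 models expect this prefix
        owners.append(doc_id)

chunk_emb = model.encode(chunks, normalize_embeddings=True)
query_emb = model.encode(["query: example question"], normalize_embeddings=True)
sims = util.cos_sim(query_emb, chunk_emb)[0]

# Score each document by its best-matching chunk.
best_per_doc = {}
for sim, doc_id in zip(sims.tolist(), owners):
    best_per_doc[doc_id] = max(sim, best_per_doc.get(doc_id, float("-inf")))
print(docs[max(best_per_doc, key=best_per_doc.get)])
```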

Reranking: Diminishing Returns

We fine-tuned DistilBERT as a reranker to refine the retrieved results. Unfortunately, adding reranking degraded performance for most models, likely because the reranker was too small. It slightly helped the embedding-based methods but hurt the strong keyword-based ones. For our final submission, we stuck with pure BM25.
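A hedged sketch of the reranking step using the Hugging Face transformers API; we load the generic DistilBERT checkpoint here rather than our fine-tuned weights, and the fine-tuning loop itself is omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-multilingual-cased"  # base checkpoint, not our fine-tuned one
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)
model.eval()

def rerank(query, candidates):
    """Re-order retrieved candidates by a cross-encoder relevance score."""
    inputs = tokenizer([query] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)
    order = torch.argsort(scores, descending=True)
    return [candidates[i] for i in order]

print(rerank("cat sleeping", ["the cat sleeps", "quantum physics intro"]))
```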

The Results

Below are the results of the fine-tuned models:

Figure: fine-tuned models' performances.

Here is the comparison of all methods for each language.

Model                          ar     de     en     es     fr     it     ko
TF-IDF (Baseline)            0.460  0.450  0.495  0.760  0.715  0.595  0.465
TF-IDF (Optimal)             0.585  0.560  0.450  0.775  0.790  0.610  0.605
e5_sl200                     0.420  0.425  0.520  0.685  0.670  0.555  0.515
BM25                         0.735  0.665  0.745  0.910  0.910  0.770  0.625
TF-IDF + Rerank (Baseline)   0.460  0.470  0.490  0.755  0.725  0.610  0.460
TF-IDF + Rerank (Optimal)    0.580  0.555  0.450  0.775  0.790  0.605  0.600
e5_sl200 + Rerank            0.440  0.435  0.525  0.675  0.685  0.565  0.520
BM25 + Rerank                0.725  0.660  0.735  0.895  0.895  0.765  0.620
Table: Recall@10 per language for different retrieval models and configurations.

BM25 dominated across all 7 languages, with particularly strong performance in Spanish (0.910) and French (0.910). Even in Korean, its weakest language here, it reached 0.625.

Takeaways

In constrained environments, classical information retrieval methods like BM25 can outperform small embedding models: they're efficient, interpretable, and surprisingly effective at keyword matching. Text embeddings excel at semantic understanding, but they're overkill when you don't have the computational resources to leverage them properly.