Anserini: BM25 Baselines for MS MARCO Passage Ranking

This page contains instructions for running BM25 baselines on the MS MARCO passage ranking task. Note that there is a separate MS MARCO document ranking task. We also have a separate page describing document expansion experiments (Doc2query) for this task.

Data Prep

We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO passage dataset:

mkdir collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

To confirm, collectionandqueries.tar.gz should have MD5 checksum of 31644046b18952c1386cd4564ba2ae69.
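The checksum comparison can also be scripted. The following is a minimal sketch using Python's standard `hashlib`; the helper name `md5_of_file` is hypothetical and not part of Anserini.

```python
import hashlib

# MD5 value quoted in this guide for collectionandqueries.tar.gz.
EXPECTED_MD5 = "31644046b18952c1386cd4564ba2ae69"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large archive never sits fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the archive downloaded, `md5_of_file("collections/msmarco-passage/collectionandqueries.tar.gz") == EXPECTED_MD5` should evaluate to `True`.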

Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl

The above script should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl, each with 1M lines (except for the last one, which should have 841,823 lines).
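To illustrate what the conversion produces, here is a rough sketch of the idea, not the actual script. It assumes each line of collection.tsv has the form `pid<TAB>passage text` and that each JSON object carries `id` and `contents` fields; both function names are hypothetical.

```python
import json


def tsv_to_json_lines(tsv_lines):
    """Yield one JSON string per `pid<TAB>passage` input line."""
    for line in tsv_lines:
        doc_id, text = line.rstrip("\n").split("\t", 1)
        # Assumed field names: "id" for the passage id, "contents" for the text.
        yield json.dumps({"id": doc_id, "contents": text})


def shard(items, max_per_shard):
    """Group items into consecutive lists of at most `max_per_shard` entries."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == max_per_shard:
            yield batch
            batch = []
    if batch:  # the final shard may be smaller than the others
        yield batch
```

With 8,841,823 passages and 1M passages per shard, this scheme gives eight full shards plus a ninth with 841,823 entries, which matches the file counts described above.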

We can now index these docs as a JsonCollection using Anserini:

sh target/appassembler/bin/IndexCollection -threads 9 -collection JsonCollection \
 -generator DefaultLuceneDocumentGenerator -input collections/msmarco-passage/collection_jsonl \
 -index indexes/msmarco-passage/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw

Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.

Performing Retrieval on the Dev Queries

The dev set contains more than 100k queries, so retrieving results for all of them would take a long time. To speed this up, we use only the queries that appear in the qrels file:

python tools/scripts/msmarco/filter_queries.py \
 --qrels collections/msmarco-passage/qrels.dev.small.tsv \
 --queries collections/msmarco-passage/queries.dev.tsv \
 --output collections/msmarco-passage/queries.dev.small.tsv
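The filtering step above amounts to keeping only the queries whose id appears in the qrels. A minimal sketch follows, assuming qrels lines start with the query id and query lines have the form `qid<TAB>query text`; `filter_queries` here is a hypothetical function, not the script's actual code.

```python
def filter_queries(qrels_lines, query_lines):
    """Return the query lines whose id appears in at least one qrels line."""
    # The first whitespace-separated field of each qrels line is the query id.
    judged = {line.split()[0] for line in qrels_lines if line.strip()}
    # Preserve the original order of the queries file.
    return [q for q in query_lines if q.split("\t", 1)[0] in judged]
```

A query with several relevance judgments is still kept only once, because the judged ids are collected into a set.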

The output queries file should contain 6980 lines. We can now perform a retrieval run using this smaller set of queries:

sh target/appassembler/bin/SearchMsmarco -hits 1000 -threads 1 \
 -index indexes/msmarco-passage/lucene-index-msmarco \
 -queries collections/msmarco-passage/queries.dev.small.tsv \
 -output runs/run.msmarco-passage.dev.small.tsv

Note that by default, the above script uses BM25 with tuned parameters k1=0.82, b=0.68. The option -hits specifies the number of documents to retrieve per query. Thus, the output file should have approximately 6980 × 1000 = 6.98M lines.
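For readers unfamiliar with the role of k1 and b, the sketch below shows one common formulation of the BM25 score contribution of a single query term. It is an illustration only and is not claimed to match Lucene's implementation in every detail; the function name and argument names are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq,
                    k1=0.82, b=0.68):
    """Score contribution of one term under a common BM25 formulation.

    tf: term frequency in the document
    doc_len / avg_doc_len: document length and collection average length
    num_docs / doc_freq: collection size and number of docs containing the term
    """
    # Inverse document frequency: rarer terms receive more weight.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly document length is normalized;
    # k1 controls how quickly repeated occurrences saturate.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + length_norm)
```

Setting b=0 disables length normalization entirely, while a larger k1 lets repeated occurrences of a term keep adding to the score for longer.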

Retrieval speed will vary by machine: on a modern desktop with an SSD, we can get ~0.07 s/query, so the run should finish in under ten minutes. We can perform multi-threaded retrieval by changing the -threads argument.

Finally, we can evaluate the retrieved documents using the official MS MARCO evaluation script:

python tools/scripts/msmarco/msmarco_passage_eval.py \
 collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.dev.small.tsv

The output should look like this:

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################

You can find this entry on the MS MARCO Passage Ranking Leaderboard as entry "BM25 (Lucene8, tuned)", so you've just replicated (part of) a leaderboard submission!
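As a reminder of what MRR@10 measures, here is a small sketch that computes it from in-memory data. It is not the official evaluation script; it assumes the mean is taken over all judged queries, and the function name and data layout are hypothetical.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    qrels: dict mapping query id -> set of relevant passage ids
    run:   dict mapping query id -> list of passage ids, best first
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, pid in enumerate(run.get(qid, [])[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    # Queries with no relevant passage in the top k contribute zero.
    return total / len(qrels)
```

For example, if one query has its first relevant passage at rank 2 and another has none in the top 10, the result is (0.5 + 0) / 2 = 0.25.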

We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@10. To do so, we first need to convert the run and qrels files to the TREC format:

python tools/scripts/msmarco/convert_msmarco_to_trec_run.py \
 --input runs/run.msmarco-passage.dev.small.tsv \
 --output runs/run.msmarco-passage.dev.small.trec

python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
 --input collections/msmarco-passage/qrels.dev.small.tsv \
 --output collections/msmarco-passage/qrels.dev.small.trec
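The two conversions above are simple line-by-line rewrites. The sketch below illustrates the idea under a few assumptions: MS MARCO run lines have the form `qid<TAB>pid<TAB>rank`, TREC run lines have the form `qid Q0 docid rank score tag`, and the score is set to 1/rank as a placeholder that preserves the ordering. The function names and the tag value are hypothetical.

```python
def msmarco_run_to_trec(lines, tag="Anserini"):
    """Yield TREC-format run lines from `qid<TAB>pid<TAB>rank` lines."""
    for line in lines:
        qid, pid, rank = line.split()
        # trec_eval orders results by score, so use a score that
        # decreases as the rank increases.
        score = 1.0 / int(rank)
        yield f"{qid} Q0 {pid} {rank} {score} {tag}"


def msmarco_qrels_to_trec(lines):
    """Yield space-separated qrels lines (`qid 0 pid relevance`)."""
    for line in lines:
        yield " ".join(line.split())
```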

And run the trec_eval tool:

tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
 collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec

The output should be:

map                   	all	0.1957
recall_1000           	all	0.8573

Average precision and recall@1000 are the two metrics we care about the most.
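For reference, both metrics can be computed for a single query as in the sketch below; trec_eval then averages them over all queries. This is an illustration of the standard definitions, not trec_eval's code, and the function names are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list given the set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant passages that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def recall_at_k(ranked, relevant, k=1000):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)
```

Recall@1000 bounds what a downstream reranker can achieve, since a relevant passage missing from the top 1000 can never be promoted.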

BM25 Tuning

Note that the figures above, obtained with k1=0.82 and b=0.68, differ slightly from the value reported in Document Expansion by Query Prediction, which uses the Anserini (system-wide) default of k1=0.9, b=0.4.

Tuning was accomplished with tools/scripts/msmarco/tune_bm25.py, using the queries found here; the basic approach is a grid search over parameter values in tenth increments. There are five different sets of 10k samples each (drawn with the shuf command). We tuned on each individual set and then averaged the parameter values across all five sets (this has a regularizing effect). In separate trials, we optimized for:

recall@1000, since Anserini output serves as input to downstream rerankers (e.g., based on BERT), and we want to maximize the number of relevant documents the rerankers have to work with;
MRR@10, for the case where Anserini output is directly presented to users (i.e., no downstream reranking).

It turns out that optimizing for MRR@10 and MAP yields the same settings.
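The tune-then-average procedure described above can be sketched as follows. This is not the code in tune_bm25.py: the `evaluate` callables stand in for a full retrieval-plus-evaluation run on one query set, and all names and grid bounds are hypothetical.

```python
from statistics import mean


def grid(start, stop, step=0.1):
    """Return values from start to stop inclusive, in increments of step."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 2) for i in range(n + 1)]


def tune(evaluate, k1_values, b_values):
    """Return the (k1, b) pair that maximizes evaluate(k1, b)."""
    candidates = ((k1, b) for k1 in k1_values for b in b_values)
    return max(candidates, key=lambda params: evaluate(*params))


def tune_and_average(evaluators, k1_values, b_values):
    """Tune on each query set separately, then average the best settings."""
    best = [tune(ev, k1_values, b_values) for ev in evaluators]
    avg_k1 = round(mean(k1 for k1, _ in best), 2)
    avg_b = round(mean(b for _, b in best), 2)
    return avg_k1, avg_b
```

Averaging across sets is how final values that lie off the tenth-increment grid, such as k1=0.82, can arise.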

Here's the comparison between the Anserini default and optimized parameters:

Setting	MRR@10	MAP	Recall@1000
Default (`k1=0.9`, `b=0.4`)	0.1840	0.1926	0.8526
Optimized for recall@1000 (`k1=0.82`, `b=0.68`)	0.1874	0.1957	0.8573
Optimized for MRR@10/MAP (`k1=0.60`, `b=0.62`)	0.1892	0.1972	0.8555

To replicate these results, the SearchMsmarco class above takes k1 and b parameters as command-line arguments, e.g., -k1 0.60 -b 0.62 (note that the default setting is k1=0.82 and b=0.68).

Replication Log

Results replicated by @ronakice on 2019-08-12 (commit 5b29d16)
Results replicated by @MathBunny on 2019-08-12 (commit 5b29d16)
Results replicated by @JMMackenzie on 2020-01-08 (commit f63cd22)
Results replicated by @edwinzhng on 2020-01-08 (commit 5cc923d)
Results replicated by @LuKuuu on 2020-01-15 (commit f21137b)
Results replicated by @kevinxyc1 on 2020-01-18 (commit 798cb3a)
Results replicated by @nikhilro on 2020-01-21 (commit 631589e)
Results replicated by @yuki617 on 2020-03-29 (commit 074723c)
Results replicated by @weipang142857 on 2020-04-20 (commit 074723c)
Results replicated by @HangCui0510 on 2020-04-23 (commit 0ae567d)
Results replicated by @x65han on 2020-04-25 (commit f5496b9)
Results replicated by @y276lin on 2020-04-26 (commit 8f48f8e)
Results replicated by @stephaniewhoo on 2020-04-26 (commit 8f48f8e)
Results replicated by @eiston on 2020-05-04 (commit dd84a5a)
Results replicated by @rohilg on 2020-05-09 (commit 20ee950)
Results replicated by @wongalvis14 on 2020-05-09 (commit ebac5d6)
Results replicated by @YimingDou on 2020-05-14 (commit 3b0a642)
Results replicated by @richard3983 on 2020-05-14 (commit a65646f)
Results replicated by @MXueguang on 2020-05-20 (commit 3b2751e)
Results replicated by @shaneding on 2020-05-23 (commit b6e0367)
Results replicated by @adamyy on 2020-05-28 (commit 94893f1)
Results replicated by @kelvin-jiang on 2020-05-28 (commit d55531a)
Results replicated by @TianchengY on 2020-05-28 (commit 2947a16)
Results replicated by @stariqmi on 2020-05-28 (commit 4914305)
Results replicated by @justinborromeo on 2020-06-10 (commit 7954eab)
Results replicated by @yxzhu16 on 2020-07-03 (commit 68ace26)
Results replicated by @LizzyZhang-tutu on 2020-07-13 (commit 8c98d5b)
Results replicated by @estella98 on 2020-07-29 (commit 99092a8)
Results replicated by @tangsaidi on 2020-08-19 (commit aba846)
Results replicated by @qguo96 on 2020-09-07 (commit e16b3c1)
Results replicated by @yuxuan-ji on 2020-09-08 (commit 0f9a8ec)
Results replicated by @wiltan-uw on 2020-09-09 (commit 93d913f)
Results replicated by @JeffreyCA on 2020-09-13 (commit bc2628b)
Results replicated by @jhuang265 on 2020-10-15 (commit 66711b9)
Results replicated by @rayyang29 on 2020-10-27 (commit ad8cc5a)
Results replicated by @Dahlia-Chehata on 2020-11-11 (commit 22c0ad3)
Results replicated by @rakeeb123 on 2020-12-07 (commit f50dcce)
Results replicated by @jrzhang12 on 2021-01-02 (commit be4e44d)
Results replicated by @HEC2018 on 2021-01-04 (commit 4de21ec)
Results replicated by @KaiSun314 on 2021-01-08 (commit 113f1c7)
Results replicated by @yemiliey on 2021-01-18 (commit 179c242)
Results replicated by @larryli1999 on 2021-01-22 (commit 3f9af5)
Results replicated by @ArthurChen189 on 2021-04-08 (commit 45a5a21)
