
Anserini: BM25 Baselines for FEVER Fact Verification

This page contains instructions for running BM25 baselines on the FEVER fact verification task.

Data Prep

We are going to use the repository's root directory as the working directory.

First, we need to download and extract the FEVER dataset:

mkdir collections/fever
mkdir indexes/fever

wget https://s3-eu-west-1.amazonaws.com/fever.public/wiki-pages.zip -P collections/fever
unzip collections/fever/wiki-pages.zip -d collections/fever

wget https://s3-eu-west-1.amazonaws.com/fever.public/train.jsonl -P collections/fever
wget https://s3-eu-west-1.amazonaws.com/fever.public/paper_dev.jsonl -P collections/fever

To verify the download, check that wiki-pages.zip has an MD5 checksum of ed8bfd894a2c47045dca61f0c8dc4c07.
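For example, on Linux (on macOS, the equivalent command is md5):

md5sum collections/fever/wiki-pages.zip
# expected output:
# ed8bfd894a2c47045dca61f0c8dc4c07  collections/fever/wiki-pages.zip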

Building Lucene Indexes

Next, we want to index the Wikipedia dump (wiki-pages.zip) using Anserini. Note that this dump contains only the introductory section of each Wikipedia article; we refer to these introductions as "paragraphs" from this point onward.
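Each file in wiki-pages is a JSONL file in which every record holds one page: id is the page title, text is the introductory paragraph, and lines enumerates the paragraph's sentences (the basis for sentence indexing). An abridged, illustrative record (hypothetical content, roughly of this form):

{"id": "Some_Page", "text": "Some Page is a thing . It was created in 1999 .", "lines": "0\tSome Page is a thing .\n1\tIt was created in 1999 ."}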

We will consider two variants: (1) Paragraph Indexing and (2) Sentence Indexing.

Paragraph Indexing

We can index paragraphs with FeverParagraphCollection, as follows:

sh target/appassembler/bin/IndexCollection \
 -collection FeverParagraphCollection -generator DefaultLuceneDocumentGenerator \
 -threads 9 -input collections/fever/wiki-pages \
 -index indexes/fever/lucene-index-fever-paragraph -storePositions -storeDocvectors -storeRaw 

Upon completion, we should have an index with 5,396,106 documents (paragraphs).
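As a sanity check, Anserini's IndexReaderUtils tool can print collection statistics, including the document count (assuming your Anserini version includes the -stats option):

sh target/appassembler/bin/IndexReaderUtils \
 -index indexes/fever/lucene-index-fever-paragraph -stats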

Sentence Indexing

We can index sentences with FeverSentenceCollection, as follows:

sh target/appassembler/bin/IndexCollection \
 -collection FeverSentenceCollection -generator DefaultLuceneDocumentGenerator \
 -threads 9 -input collections/fever/wiki-pages \
 -index indexes/fever/lucene-index-fever-sentence -storePositions -storeDocvectors -storeRaw 

Upon completion, we should have an index with 25,247,887 documents (sentences).

Performing Retrieval on the Dev Queries

Note that while this section uses the paragraph index, the steps are easily adapted to sentence indexing by changing the --granularity argument and the index path accordingly.

Before we can retrieve with our index, we need to generate the queries and qrels files for the dev split of the FEVER dataset:

python src/main/python/fever/generate_queries_and_qrels.py \
 --dataset_file collections/fever/paper_dev.jsonl \
 --output_queries_file collections/fever/queries.paragraph.dev.tsv \
 --output_qrels_file collections/fever/qrels.paragraph.dev.txt \
 --granularity paragraph
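The queries file is a two-column TSV of query ID and claim text (which is why we pass -topicReader TsvInt below), and the qrels file follows the standard TREC format of query ID, iteration, document ID (a Wikipedia page title), and relevance label. Illustrative lines from each (hypothetical content):

# queries.paragraph.dev.tsv: query_id<TAB>claim text
123456	Some claim about a Wikipedia entity.

# qrels.paragraph.dev.txt: query_id iter doc_id relevance
123456 0 Some_Wikipedia_Page 1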

We can now perform a retrieval run:

sh target/appassembler/bin/SearchCollection \
 -index indexes/fever/lucene-index-fever-paragraph \
 -topicReader TsvInt -topics collections/fever/queries.paragraph.dev.tsv \
 -output runs/run.fever-paragraph.dev.txt -bm25

Note that by default, the above uses the BM25 algorithm with parameters k1=0.9, b=0.4.
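For reference, a standard formulation of the BM25 score (Lucene's implementation differs slightly in its IDF smoothing) shows the role of these two parameters: k1 controls term-frequency saturation and b controls document-length normalization.

$$ \mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{\mathrm{tf}(t, d)}{\mathrm{tf}(t, d) + k_1 \left( 1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}} \right)} $$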

Evaluating with trec_eval

Finally, we can evaluate the retrieved documents using the official TREC evaluation tool, trec_eval.

target/appassembler/bin/trec_eval -c -m recall \
 collections/fever/qrels.paragraph.dev.txt runs/run.fever-paragraph.dev.txt

The output should be:

recall_5              	all	0.6622
recall_10             	all	0.7446
recall_15             	all	0.7834
recall_20             	all	0.8059
recall_30             	all	0.8353
recall_100            	all	0.8974
recall_200            	all	0.9176
recall_500            	all	0.9393
recall_1000           	all	0.9477

Comparing with FEVER Baseline

We can also compare our retrieval against the TF-IDF baseline described in the FEVER paper. Specifically, we want to compare the document retrieval metrics reported in Table 2 of the paper: Fully Supported, the fraction of verifiable claims for which a complete evidence set appears among the top-k retrieved documents, and Oracle Accuracy, an upper bound on end-to-end label accuracy given those documents.

We evaluate the run file produced earlier:

python src/main/python/fever/evaluate_doc_retrieval.py \
 --truth_file collections/fever/paper_dev.jsonl \
 --run_file runs/run.fever-paragraph.dev.txt

This run produces the following results:

k     Fully Supported    Oracle Accuracy
1     0.3887             0.5925
5     0.6517             0.7678
10    0.7349             0.8233
25    0.8117             0.8745
50    0.8570             0.9047
100   0.8900             0.9267

Note that this outperforms the TF-IDF baseline in the FEVER paper at every value of k.

BM25 Tuning

The above retrieval uses BM25 default parameters of k1=0.9, b=0.4. We can tune these parameters to obtain even better retrieval results.

We tune on a subset of the dataset's training split, which we generate as follows:

python src/main/python/fever/generate_subset.py \
 --dataset_file collections/fever/train.jsonl \
 --subset_file collections/fever/train-subset.jsonl \
 --length 2000

If necessary, decrease the subset length here to speed up the grid search later on.

We then generate the queries and qrels files for this subset:

python src/main/python/fever/generate_queries_and_qrels.py \
 --dataset_file collections/fever/train-subset.jsonl \
 --output_queries_file collections/fever/queries.paragraph.train-subset.tsv \
 --output_qrels_file collections/fever/qrels.paragraph.train-subset.txt \
 --granularity paragraph

We tune the BM25 parameters with a grid search over parameter values in 0.1 increments. We save the run files generated by this process to a new folder, runs/fever-bm25 (do not use the top-level runs folder here).

python src/main/python/fever/tune_bm25.py \
 --runs_folder runs/fever-bm25 \
 --index_folder indexes/fever/lucene-index-fever-paragraph \
 --queries_file collections/fever/queries.paragraph.train-subset.tsv \
 --qrels_file collections/fever/qrels.paragraph.train-subset.txt
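Conceptually, the script sweeps k1 and b, retrieving and evaluating once per setting. A minimal bash sketch of the same loop (the parameter ranges here are illustrative, not necessarily the script's actual ranges):

# Sweep BM25 parameters in 0.1 increments over illustrative ranges.
mkdir -p runs/fever-bm25
for k1 in $(seq 0.1 0.1 1.2); do
  for b in $(seq 0.1 0.1 1.0); do
    sh target/appassembler/bin/SearchCollection \
     -index indexes/fever/lucene-index-fever-paragraph \
     -topicReader TsvInt -topics collections/fever/queries.paragraph.train-subset.tsv \
     -output runs/fever-bm25/run.k1-${k1}.b-${b}.txt \
     -bm25 -bm25.k1 ${k1} -bm25.b ${b}
  done
done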

From the grid search, we observe that the parameters k1=0.9, b=0.1 perform fairly well. If we retrieve on the dev set with these parameters:

sh target/appassembler/bin/SearchCollection \
 -index indexes/fever/lucene-index-fever-paragraph \
 -topicReader TsvInt -topics collections/fever/queries.paragraph.dev.tsv \
 -output runs/run.fever-paragraph-0.9-0.1.dev.txt -bm25 -bm25.k1 0.9 -bm25.b 0.1

and we evaluate this run file:

python src/main/python/fever/evaluate_doc_retrieval.py \
 --truth_file collections/fever/paper_dev.jsonl \
 --run_file runs/run.fever-paragraph-0.9-0.1.dev.txt

then we obtain the following results:

k     Fully Supported    Oracle Accuracy
1     0.4121             0.6081
5     0.6899             0.7933
10    0.7636             0.8424
25    0.8263             0.8842
50    0.8651             0.9101
100   0.8932             0.9288

Reproduction Log*