This repository is organized by THUNLP and Microsoft AI. It contains ongoing work on an information retrieval (IR) and question answering (QA) pipeline for the novel coronavirus COVID-19 (SARS-CoV-2). The system is trained on MS MARCO, a large-scale reading comprehension dataset, and transferred directly to the medical domain. We hope this repository will help us work together against COVID-19.
The CORD-19 resource is constructed by Semantic Scholar at the Allen Institute for AI and will continue to be updated as new research is published in archival services and peer-reviewed publications. The shared task on Kaggle aims to help specialists in virology, pharmacy and microbiology find answers to open questions about COVID-19.
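The following models are implemented across the three stages of the pipeline (document retrieval, paragraph retrieval and QA).
- BM25 (document retrieval)
- Approximate Nearest Neighbor (ANN) (document retrieval)
- BERT (Base version of BERT with 12 layers; paragraph retrieval)
- Distilled BERT (BERT with 3 layers; paragraph retrieval)
- BERT (Base version; QA)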
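The models listed above form a cascade: documents are retrieved first, their paragraphs are then ranked, and answers are finally extracted from the best paragraphs. The sketch below only illustrates that control flow. Every function name is hypothetical and does not come from this repository, and the term-overlap score is a toy placeholder standing in for BM25, ANN and BERT.

```python
from typing import Dict, List


def term_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query terms found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms)


def retrieve_documents(query: str, docs: List[Dict[str, str]], top_k: int) -> List[Dict[str, str]]:
    """Stage 1 (stand-in for BM25 or ANN): keep the top_k documents."""
    ranked = sorted(docs, key=lambda d: term_overlap(query, d["text"]), reverse=True)
    return ranked[:top_k]


def rank_paragraphs(query: str, docs: List[Dict[str, str]], top_k: int) -> List[Dict[str, str]]:
    """Stage 2 (stand-in for the BERT ranker): split documents into paragraphs and re-score them."""
    paragraphs = [
        {"title": d["title"], "text": p}
        for d in docs
        for p in d["text"].split("\n")
        if p.strip()
    ]
    ranked = sorted(paragraphs, key=lambda p: term_overlap(query, p["text"]), reverse=True)
    return ranked[:top_k]


def extract_answers(query: str, paragraphs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Stage 3 (stand-in for the BERT QA model): pick the best-matching sentence per paragraph."""
    answers = []
    for p in paragraphs:
        sentences = [s.strip() for s in p["text"].split(".") if s.strip()]
        if not sentences:
            continue  # nothing to extract from this paragraph
        best = max(sentences, key=lambda s: term_overlap(query, s))
        answers.append({"text": best, "title": p["title"]})
    return answers
```

Each stage narrows the candidate set, so a more expensive model only has to process what the cheaper stage before it kept.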
Download and unzip the checkpoints, data and index files into the models and retrieval folders, respectively. You can find all resources on Tsinghua Cloud and Google Drive. Then install the required packages.
Build the BM25 index using Anserini. Download links for the collections are available in data.
./indexer/bm25_indexer/bin/IndexCollection -collection JsonCollection -es -es.index cord19 -input collection -generator LuceneDocumentGenerator -threads 1 -storePositions -storeDocvectors -storeRawDocs
pip install -r requirements.txt
Set the CUDA device.
export CUDA_VISIBLE_DEVICES=DEVICE_ID
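The same setting can be applied from inside a Python process, which may be convenient in notebooks. This is a minimal sketch; the helper name `select_cuda_device` is hypothetical and not part of this repository. The variable has to be set before any library initializes CUDA, otherwise it has no effect on the running process.

```python
import os


def select_cuda_device(device_id: str) -> str:
    """Restrict the GPUs visible to this process (call before CUDA is initialized)."""
    os.environ["CUDA_VISIBLE_DEVICES"] = device_id
    return os.environ["CUDA_VISIBLE_DEVICES"]
```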
Run the pipeline with the default settings: BM25 document retrieval, BERT paragraph retrieval and the BERT QA model.
python run_pipeline.py
Use ANN for document retrieval.
python run_pipeline.py --use_ann
Use Distilled BERT for paragraph retrieval.
python run_pipeline.py --ranking_model_path ./models/bert_ranking_model_distilled
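The invocations above differ only in optional flags, so a script that launches the pipeline can assemble the command from two options. The helper below is a hypothetical sketch, not part of this repository. It uses only the `--use_ann` and `--ranking_model_path` flags shown above, and it assumes the two flags can be combined, which the commands above do not confirm.

```python
from typing import List, Optional


def build_pipeline_command(
    use_ann: bool = False,
    ranking_model_path: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for run_pipeline.py."""
    command = ["python", "run_pipeline.py"]
    if use_ann:
        # Switch document retrieval from BM25 to ANN.
        command.append("--use_ann")
    if ranking_model_path is not None:
        # Point paragraph retrieval at a different ranking checkpoint.
        command.extend(["--ranking_model_path", ranking_model_path])
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root.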
Keyphrase Extraction: detailed guides for generating keyphrases are in the kpe folder.
The search result is a list of the top-k documents, and each document contains the following fields:
- "title": Document title
- "keyphrases": Extracted keyphrases
- "text": Document text
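A consumer of the search result might walk the list as sketched below. The function name is hypothetical, and the sketch assumes that "keyphrases" holds a list of strings, which the field description above does not specify.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("title", "keyphrases", "text")


def summarize_search_results(results: List[Dict[str, Any]], max_chars: int = 80) -> List[str]:
    """Turn each retrieved document into a one-line summary, in rank order."""
    lines = []
    for rank, doc in enumerate(results, start=1):
        missing = [field for field in REQUIRED_FIELDS if field not in doc]
        if missing:
            raise KeyError(f"result {rank} is missing fields: {missing}")
        snippet = doc["text"][:max_chars]  # truncate long document text
        keyphrases = ", ".join(doc["keyphrases"])  # assumed: list of strings
        lines.append(f"{rank}. {doc['title']} [{keyphrases}] {snippet}")
    return lines
```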
The QA result is a list of the top-k answers, and each answer contains the following fields:
- "text": Answer text
- "title": Title of the document the answer comes from
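Because every answer carries the title of its source document, answers can be grouped by document for display. The helper below is a hypothetical sketch that relies only on the two fields listed above.

```python
from collections import defaultdict
from typing import Dict, List


def group_answers_by_title(answers: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each source document title to the answer texts extracted from it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for answer in answers:
        grouped[answer["title"]].append(answer["text"])
    return dict(grouped)
```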
The following people contributed equally to this repository:
Aowei Lu, Jiahua Liu, Kaitao Zhang, Shi Yu, Si Sun, Zhenghao Liu