GitHub - Panjete/iidsearch: an efficient ranked retrieval system for English corpora, optimised with VBE and BPE.

Inverted Index Construction, and BPE Tokenisation

For implementation and algorithmic details, please refer to the report "algorithmic_details.pdf".

Note that the reader files have been written for TREC-like datasets and query files, and may need to be adapted for other file formats. Function signatures are documented in comments to make this easier.

To create the index and the dictionary file:

Call bash invidx.sh [directoryname] [indexfile] [compressionFlag] [tokenizerFlag]
This calls invidx_cons.py, which reads the input files, learns BPE (if requested) and constructs the dictionary and the postings lists.
[compressionFlag] (0 or 1) selects whether Variable Byte Encoding is applied, and [tokenizerFlag] (0 or 1) selects whether BPE tokenisation is used.
[directoryname] is the path to the directory containing the dataset to be indexed, and [indexfile] is the base name used for the generated output files.
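
For readers unfamiliar with Variable Byte Encoding, the following is a minimal Python sketch of how a postings list can be compressed with it. It is not taken from invidx_cons.py: the function names are hypothetical, and it assumes the common textbook convention of storing gaps between sorted document IDs in 7-bit chunks, with the high bit set on the last byte of each number. The on-disk layout used by this repository may differ (see algorithmic_details.pdf).

```python
from typing import Iterable, List


def vb_encode_number(n: int) -> bytes:
    """Encode one non-negative integer as 7-bit chunks, most significant first.

    Assumed convention: the high bit is set only on the final byte.
    """
    if n < 0:
        raise ValueError("variable byte encoding needs a non-negative integer")
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.append(n % 128)
        n //= 128
    chunks.reverse()
    chunks[-1] += 128  # mark the last byte of this number
    return bytes(chunks)


def vb_encode_postings(doc_ids: Iterable[int]) -> bytes:
    """Encode a sorted list of document IDs as variable-byte-encoded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - prev)
        prev = doc_id
    return bytes(out)


def vb_decode_postings(data: bytes) -> List[int]:
    """Invert vb_encode_postings: rebuild the document IDs from the gaps."""
    doc_ids = []
    n = 0
    prev = 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            n = n * 128 + (byte - 128)
            prev += n  # gaps are cumulative
            doc_ids.append(prev)
            n = 0
    return doc_ids


if __name__ == "__main__":
    encoded = vb_encode_postings([824, 829, 215406])
    print(list(encoded))
    print(vb_decode_postings(encoded))
```

Small gaps fit in a single byte, which is why encoding gaps rather than raw document IDs tends to shrink the index for frequent terms.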

To search the built index:

Call bash tf_idf_search.sh [queryfile] [resultfile] [indexfile] [dictionary]
This calls top.py, which uses a reader to detect the compression and encoding strategies used in [indexfile] and [dictionary].
Based on this, it selects the matching reader and query-processing file to process the queries and retrieve the results.
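
As a rough illustration of tf-idf ranked retrieval over an inverted index, here is a small in-memory Python sketch. It is an assumption-laden toy, not the logic in top.py or the process_query files: the names build_index and search are hypothetical, tokenisation is plain whitespace splitting, the weighting is (1 + ln tf) * ln(N / df), and no document-length normalisation is applied. The exact scheme used by this repository is described in algorithmic_details.pdf.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency}. docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query, k=10):
    """Return the top-k (doc_id, score) pairs for a whitespace-tokenised query."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term)
        if not postings:
            continue  # term not in the dictionary
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    docs = {"d1": "cat sat mat", "d2": "dog sat log", "d3": "cat cat dog"}
    index = build_index(docs)
    for doc_id, score in search(index, len(docs), "cat"):
        print(doc_id, round(score, 3))
```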

To compute the F1 scores:

Call python retrieval_efficiency.py (edit the filenames in retrieval_efficiency.py first).
The outputs are already trec_eval compatible, so further metrics can be computed by configuring trec_eval if needed.
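
To make the metric concrete, the sketch below computes precision, recall and F1 for one query, and formats one result line in the standard six-column trec_eval run format (query ID, the literal Q0, document ID, rank, score, run tag). The function names and the run tag "run0" are hypothetical; this is not the code in retrieval_efficiency.py, and whether the repository writes exactly these columns is an assumption.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for a single query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def trec_run_line(query_id, doc_id, rank, score, run_tag="run0"):
    """Format one result as a trec_eval run line."""
    return f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}"


if __name__ == "__main__":
    p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
    print(trec_run_line("401", "d1", 1, 2.5))
```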

The build file only checks whether lxml is available, and installs it if it is not.

The files in the files folder are samples of the file formats that the present code is compatible with.

Computation

The vanilla framework (no compression, no encoding) constructs the index for a corpus of around 2 GB in 982.55 seconds, and achieves an average query retrieval time of 1.276 seconds with an average precision of 0.611.

Repository contents:

files
README.md
algorithmic_details.pdf
bpetoken.py
build.sh
invidx.sh
invidx_cons.py
process_query00.py
process_query01.py
process_query10.py
process_query11.py
q_00.py
q_01.py
q_10.py
q_11.py
readers_00.py
readers_01.py
readers_10.py
readers_11.py
retrieval_efficiency.py
tf_idf_search.sh
top.py
