Repository for the Information Retrieval exam final project.
Repository organization:
- `med_data` contains the dataset used, a collection of 1032 articles from a medical journal. `MED.ALL` contains the document collection, `MED.QRY` contains a list of 30 queries, and `MED.REL` contains, for each query in `MED.QRY`, the list of known relevant documents, in the format `query_id 0 doc_id 1`.
- `load_data.py`: script with the class `LoadDataset()`, used to load the corpus of documents and, if available, the query and relevance files.
- `vsm.py`: script with the class `VectorSpaceModel()`, used to perform the retrieval. An object of this class is initialized with a list of tokens (for example, the dataset loaded with the `LoadDataset()` functions), which is then preprocessed through stop-word removal and stemming and stored in the `docs` attribute. Next, the inverted index for the terms in the corpus is built, together with the vocabulary containing all the terms in the collection; both are stored as attributes of the object (`index` and `vocab`), along with the number of documents (`n_docs`) and the number of terms (`n_terms`). Once the TF-IDF is computed, it is also stored as an attribute (`tfidf`). The class contains functions to compute the TF-IDF for each term in each document, to vectorize documents and queries, and to perform relevance and pseudo-relevance feedback. It also contains a function to perform standard preprocessing of terms and a function to evaluate the retrieval, given a set of queries and the known relevant documents associated with them.
- `ranked_retrieval.ipynb`: notebook showing all the implemented functions at work on the Medline dataset. It also evaluates the performance of the program by computing precision, recall and mean average precision.
- `run_vsm.py`: script to run the program from the command line, giving the corpus of documents as argument: `python run_vsm.py med_data/MED.ALL`. The user can modify the script by setting:
  - the required query, as the `QUERY` variable,
  - a value of `K`, corresponding to how many documents will be returned by the program,
  - a set of known relevant documents, as the `RELEVANT_DOCS` variable, to allow the program to also perform relevance feedback,
  - `PSEUDO=True`, if the user wishes to also perform pseudo-relevance feedback.
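
The flow described in the list above (reading relevance judgments, TF-IDF weighting, ranked retrieval, and evaluation) can be illustrated with a minimal, self-contained sketch. It does not use the repository's `LoadDataset()` or `VectorSpaceModel()` classes, whose method names are not listed here. The helper names (`parse_qrels`, `build_tfidf`, `cosine`, `rank`, `average_precision`), the toy documents, the 1-based document ids, and the specific weighting (raw term frequency times `log(N / df)`, ranked by cosine similarity) are all assumptions made for illustration, and stop-word removal and stemming are omitted.

```python
import math
from collections import Counter, defaultdict


def parse_qrels(lines):
    """Parse 'query_id 0 doc_id 1' lines into {query_id: {doc_id, ...}}."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        relevant[int(parts[0])].add(int(parts[2]))
    return dict(relevant)


def build_tfidf(docs):
    """Return (idf, vectors) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, idf, vectors, k):
    """Return the 1-based ids of the top-k documents for a tokenized query."""
    q_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scores = [(cosine(q_vec, vec), doc_id)
              for doc_id, vec in enumerate(vectors, start=1)]
    # Highest score first; ties broken by lower document id.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scores[:k]]


def average_precision(ranking, relevant):
    """Average precision of one ranking against a set of relevant doc ids."""
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


# Toy corpus (hypothetical, not taken from MED.ALL).
docs = [
    "heart disease treatment".split(),
    "lung cancer treatment".split(),
    "heart surgery outcome".split(),
]
qrels = parse_qrels(["1 0 1 1", "1 0 3 1"])
idf, vectors = build_tfidf(docs)
ranking = rank("heart disease".split(), idf, vectors, k=3)
ap = average_precision(ranking, qrels[1])
```

On this toy input the ranking is documents 1, 3, 2, so both documents judged relevant for query 1 are retrieved first and the average precision is 1.0. Mean average precision would be the mean of this value over all queries.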