Repository for the Information Retrieval exam final project

elena-buscaroli/IR_VectorSpaceModel


Information Retrieval - Vector Space Model

Repository for the Information Retrieval exam final project.

Repository organization:

  • med_data contains the dataset used, a collection of 1032 articles from a medical journal.

    • MED.ALL contains the documents collection,
    • MED.QRY contains a list of 30 queries,
    • MED.REL contains, for each query in MED.QRY, the list of known relevant documents, in the format query_id 0 doc_id 1.
  • load_data.py: script defining the LoadDataset() class, which loads the corpus of documents and, if available, the query and relevance judgment files.
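
To make the relevance format above concrete, here is a minimal sketch of how lines of the form query_id 0 doc_id 1 could be turned into a lookup table. It assumes one judgment per line and whitespace-separated fields. The function name parse_relevance and the sample lines are hypothetical and are not taken from load_data.py.

```python
from collections import defaultdict


def parse_relevance(lines):
    """Map each query id to the set of its known relevant document ids.

    Hypothetical helper: assumes each line reads 'query_id 0 doc_id 1'.
    """
    relevant = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        query_id, _, doc_id, _ = fields
        relevant[int(query_id)].add(int(doc_id))
    return dict(relevant)


# Made-up sample lines in the assumed format (not real MED.REL content).
sample = ["1 0 13 1", "1 0 14 1", "", "2 0 80 1"]
judgments = parse_relevance(sample)
print(judgments)
```

With a real file, the same function could be fed an open file handle, since iterating over a file yields its lines.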

  • vsm.py: script defining the VectorSpaceModel() class, which performs the retrieval. An object of this class is initialized with a list of tokens (for example, the dataset loaded with the LoadDataset() functions), which is preprocessed through stop-word removal and stemming and stored in the docs attribute. The inverted index over the terms in the corpus and the vocabulary of all terms in the collection are then built and stored as attributes of the object (index and vocab), together with the number of documents (n_docs) and the number of terms (n_terms). Once the TF-IDF weights are computed, they are also stored as an attribute (tfidf).

    It contains functions to compute the TF-IDF weight of each term in each document, to vectorize documents and queries, and to perform relevance and pseudo-relevance feedback. It also contains a function for standard preprocessing of terms and a function to evaluate the retrieval, given a set of queries and their known relevant documents.
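
The pipeline described above (inverted index, TF-IDF weights, vectorized queries) can be sketched as follows. This is an illustrative sketch, not the code in vsm.py: the function names are hypothetical, the weighting tf * log(N / df) and the cosine ranking are assumptions, and the three toy documents are made up and taken to be already stemmed.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Inverted index: term -> {doc_id: raw term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        for term, freq in Counter(tokens).items():
            index[term][doc_id] = freq
    return index


def tfidf_vectors(docs, index):
    """One sparse vector (term -> weight) per document.

    Assumed weighting: raw term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    vectors = [dict() for _ in docs]
    for term, postings in index.items():
        idf = math.log(n_docs / len(postings))
        for doc_id, freq in postings.items():
            vectors[doc_id][term] = freq * idf
    return vectors


def vectorize_query(tokens, index, n_docs):
    """Weight query terms like document terms; unknown terms are ignored."""
    return {
        term: freq * math.log(n_docs / len(index[term]))
        for term, freq in Counter(tokens).items()
        if term in index
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, docs, index, vectors, k=3):
    """Ids of the top-k documents with non-zero similarity, best first."""
    query = vectorize_query(query_tokens, index, len(docs))
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


# Toy corpus of already-preprocessed tokens (made up for illustration).
docs = [
    ["heart", "diseas", "treatment"],
    ["kidney", "diseas"],
    ["heart", "surgeri", "heart"],
]
index = build_index(docs)
vectors = tfidf_vectors(docs, index)
print(rank(["heart"], docs, index, vectors))
```

Storing vectors as dictionaries keeps them sparse, which suits a collection where each document uses only a small part of the vocabulary.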

  • ranked_retrieval.ipynb: notebook demonstrating all the implemented functions on the Medline dataset. It also evaluates the performance of the program by computing precision, recall and mean average precision.
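
For reference, the three evaluation measures named above can be computed as in the sketch below. It uses the standard textbook definitions; the function names and the toy rankings are hypothetical and do not reproduce the notebook's code or results.

```python
def precision_recall(ranking, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    hits = sum(1 for doc_id in ranking if doc_id in relevant)
    precision = hits / len(ranking) if ranking else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, relevance):
    """Average of average_precision over all queries in `rankings`."""
    scores = [
        average_precision(ranking, relevance.get(query_id, set()))
        for query_id, ranking in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up rankings and judgments for two queries.
rankings = {1: [3, 1, 7, 5], 2: [2, 4]}
relevance = {1: {1, 5, 9}, 2: {2}}
print(precision_recall(rankings[1], relevance[1]))
print(mean_average_precision(rankings, relevance))
```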

  • run_vsm.py: script to run the program from the command line, passing the corpus of documents as argument: python run_vsm.py med_data/MED.ALL. The user can modify the script by setting:

    • the required query, as the QUERY variable,
    • a value of K, the number of documents the program returns,
    • a set of known relevant documents, as the RELEVANT_DOCS variable, so that the program also performs relevance feedback,
    • PSEUDO=True, to also perform pseudo-relevance feedback.
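
As an illustration of what relevance feedback does with a set of known relevant documents, the sketch below applies the Rocchio update, a common choice for vector space models. Whether run_vsm.py uses Rocchio is not stated here, so treat the formula, the function name rocchio and the alpha, beta and gamma defaults as assumptions. Vectors are sparse dictionaries mapping terms to weights.

```python
from collections import defaultdict


def rocchio(query_vec, relevant_vecs, nonrelevant_vecs=(),
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards the centroid of relevant documents and away
    from the centroid of non-relevant ones (assumed Rocchio update)."""
    updated = defaultdict(float)
    for term, weight in query_vec.items():
        updated[term] += alpha * weight
    for vecs, coeff in ((relevant_vecs, beta), (nonrelevant_vecs, -gamma)):
        for vec in vecs:
            for term, weight in vec.items():
                # Dividing by len(vecs) averages the group into a centroid.
                updated[term] += coeff * weight / len(vecs)
    # Negative weights are dropped, as is usual for Rocchio.
    return {term: w for term, w in updated.items() if w > 0}


# Made-up vectors for illustration.
query = {"heart": 1.0}
relevant = [{"heart": 2.0, "valv": 4.0}, {"valv": 2.0}]
nonrelevant = [{"kidney": 1.0}]
new_query = rocchio(query, relevant, nonrelevant)
print(new_query)
```

Pseudo-relevance feedback can reuse the same update: instead of user-supplied relevant documents, the vectors of the top-ranked documents from a first retrieval pass are passed as relevant_vecs.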
