Information Retrieval - Vector Space Model

Repository for the Information Retrieval exam final project.

Repository organization:

med_data contains the dataset used, a collection of 1032 articles from a medical journal.
- MED.ALL contains the document collection,
- MED.QRY contains a list of 30 queries,
- MED.REL contains, for each query in MED.QRY, the list of known relevant documents, one per line, in the format query_id 0 doc_id 1.
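
As an illustration of the MED.REL format, the following is a minimal sketch of how such lines could be parsed into a mapping from query id to relevant document ids. The function name parse_relevance and the sample lines are hypothetical and are not taken from load_data.py.

```python
from collections import defaultdict


def parse_relevance(lines):
    """Map each query id to the set of its relevant document ids.

    Each line is expected to look like: query_id 0 doc_id 1
    """
    relevant = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query_id, doc_id = int(fields[0]), int(fields[2])
        relevant[query_id].add(doc_id)
    return dict(relevant)


# Made-up sample lines in the documented format (not real MED.REL content).
sample = ["1 0 13 1", "1 0 14 1", "", "2 0 80 1"]
relevance = parse_relevance(sample)
```

With a real file, the same function could be fed an open file handle, for example parse_relevance(open("med_data/MED.REL")).
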
load_data.py: script with the class LoadDataset(), used to load the corpus of documents and, if present, the query and relevance files.
vsm.py: script with the class VectorSpaceModel(), used to perform the retrieval. An object of this class is initialized by passing a list of tokens (for example the dataset loaded with the LoadDataset() functions), which is then preprocessed through stop-word removal and stemming and stored in the docs attribute. Next, the inverted index over the terms of the corpus is built, along with the vocabulary containing all the terms in the collection; both are stored as attributes of the object (index and vocab), together with the number of documents (n_docs) and the number of terms (n_terms). Once the TF-IDF is computed, it is also stored as an attribute (tfidf).

The script contains functions to compute the TF-IDF of each term in each document, to vectorize documents and queries, and to perform relevance and pseudo-relevance feedback. It also contains a function to perform standard preprocessing of terms and a function to evaluate the retrieval, given a set of queries and their known relevant documents.
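
The pipeline described above (inverted index, TF-IDF weighting, vectorization of documents and queries, ranking) can be sketched roughly as follows. This is not the code in vsm.py: all function names are hypothetical, the weighting tf * log(N / df) and cosine similarity are assumed choices, documents are assumed to be already tokenized, and stop-word removal and stemming are left out.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index


def tfidf_vectors(docs, index):
    """One sparse {term: weight} vector per document (assumed tf * log(N / df))."""
    n_docs = len(docs)
    vectors = [dict() for _ in docs]
    for term, postings in index.items():
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            vectors[doc_id][term] = tf * idf
    return vectors


def vectorize_query(tokens, index, n_docs):
    """Weight query terms with the same scheme; unknown terms are ignored."""
    vec = {}
    for term, tf in Counter(tokens).items():
        if term in index:
            vec[term] = tf * math.log(n_docs / len(index[term]))
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, docs, k=3):
    """Return up to k document ids with non-zero similarity, best first."""
    index = build_index(docs)
    doc_vecs = tfidf_vectors(docs, index)
    query = vectorize_query(query_tokens, index, len(docs))
    scores = [(cosine(query, vec), doc_id) for doc_id, vec in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scores[:k] if score > 0]


# Toy corpus for illustration only (not taken from MED.ALL).
docs = [
    ["blood", "pressure", "drug"],
    ["kidney", "transplant", "patient"],
    ["blood", "cell", "patient"],
]
top = rank(["kidney", "transplant"], docs, k=2)
```

Sparse dictionaries are used here only to keep the sketch dependency-free; a dense matrix representation would work equally well for a corpus of this size.
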
ranked_retrieval.ipynb: notebook showing all the implemented functions at work on the Medline dataset. It also evaluates the performance of the program by computing precision, recall and mean average precision.
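
For reference, the three evaluation measures can be computed along the following lines. This is a sketch using the standard textbook definitions, not the evaluation function from vsm.py; the function names and the toy rankings are hypothetical.

```python
def precision_recall(ranked, relevant):
    """Precision and recall of a returned list against the known relevant set."""
    hits = sum(1 for doc_id in ranked if doc_id in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values taken at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, relevance):
    """Average of the per-query average precision over all judged queries."""
    scores = [average_precision(rankings.get(q, []), rel) for q, rel in relevance.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up rankings and judgments: query id -> ranked doc ids / relevant doc ids.
rankings = {1: [3, 7, 5, 9], 2: [4, 2]}
relevance = {1: {3, 5}, 2: {2}}
map_score = mean_average_precision(rankings, relevance)
```
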
run_vsm.py: script to run the program from the command line, passing the corpus of documents as argument: python run_vsm.py med_data/MED.ALL. The user can modify the script by setting:
- the required query, as the QUERY variable,
- a value of K, corresponding to how many documents will be returned by the program,
- a set of known relevant documents as the RELEVANT_DOCS variable, to allow the program to perform also relevance feedback,
- PSEUDO=True if the user wishes to perform also pseudo-relevance feedback.
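
To illustrate what the two feedback options do to a query, here is a minimal sketch assuming a Rocchio-style update. The README does not state which feedback formula vsm.py uses, so the formula, the function names and the alpha, beta and gamma values (commonly cited defaults) are all assumptions.

```python
from collections import defaultdict


def centroid(vectors):
    """Average of a list of sparse {term: weight} vectors."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, non_relevant=(), alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards relevant documents and away from non-relevant ones."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for term, weight in centroid(list(relevant)).items():
        updated[term] += beta * weight
    for term, weight in centroid(list(non_relevant)).items():
        updated[term] -= gamma * weight
    # Negative or zero weights are dropped, as is usual for this update.
    return {term: weight for term, weight in updated.items() if weight > 0}


def pseudo_feedback(query, ranked_doc_vectors, k=2):
    """Pseudo-relevance feedback: treat the top-k retrieved documents as relevant."""
    return rocchio(query, ranked_doc_vectors[:k])


# Toy vectors for illustration only.
query = {"kidney": 1.0}
relevant_docs = [{"kidney": 2.0, "transplant": 4.0}]
other_docs = [{"blood": 2.0}]
new_query = rocchio(query, relevant_docs, other_docs)
```

In this sketch, relevance feedback corresponds to calling rocchio with the vectors of the documents listed in RELEVANT_DOCS, while pseudo-relevance feedback needs no user input beyond the value of k.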
