#GRAHAM SEARCH ENGINE
An implementation of a text-based search engine built on the Vector Space Model, written in Python. It uses the publicly available Reuters-21578 "ApteMod" corpus for text categorization.

Code available online at: https://github.com/aadijain/graham-engine
##Features
- Phrase Queries and Proximity Queries
- Retrieved documents ranked according to relevance
- File preview of all the retrieved documents
- Option to open returned documents in a new window
- Category classification of the retrieved documents
- Reduces the vocabulary size and the retrieval time by eliminating common words (stop words)
- Searches for similar forms of words
- Spelling correction in queries
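
The ranking feature can be illustrated with a minimal sketch of vector-space retrieval, assuming log-scaled TF-IDF weights and cosine similarity. This is not the project's actual code: the function names (`tokenize`, `build_index`, `rank`) and the tiny stop-word set are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word set for illustration; the project keeps its own list in the corpus.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def weigh(counts, idf):
    """Turn raw term counts into log-scaled TF-IDF weights."""
    return {t: (1.0 + math.log(tf)) * idf[t] for t, tf in counts.items() if t in idf}


def build_index(docs):
    """Return (TF-IDF vector per document, idf table) for a dict of name -> text."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    n = float(len(docs))
    idf = {term: math.log(n / d) for term, d in df.items()}
    vectors = {name: weigh(c, idf) for name, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document names ordered by decreasing similarity to the query."""
    vectors, idf = build_index(docs)
    q = weigh(Counter(tokenize(query)), idf)
    scores = [(cosine(q, vec), name) for name, vec in vectors.items()]
    return [name for score, name in sorted(scores, reverse=True) if score > 0]
```

Documents that share no weighted term with the query score zero and are left out of the result.
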
##Prerequisites
- Software required:
  - Python 2.7
  - gedit
- Python 2.7 libraries required:
  - nltk
  - textblob
  - math
  - re
  - json
  - time
  - sortedcontainers
  - os
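
A quick way to confirm that the libraries listed above are importable is sketched below. The helper name `missing_modules` is hypothetical and not part of the project; `math`, `re`, `json`, `time` and `os` ship with Python, so only the third-party packages should ever show up as missing.

```python
import importlib

# Library names taken from the prerequisites list.
REQUIRED = ["nltk", "textblob", "sortedcontainers", "math", "re", "json", "time", "os"]


def missing_modules(names):
    """Return the names that cannot be imported in the current interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    print("Missing: " + ", ".join(absent) if absent else "All required libraries found")
```
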
##Assumptions
- This program must be run on a Linux system
- The user enters a valid option when prompted
- The program was tested on an Ubuntu 16.04 LTS system with 4 GB of RAM
- The corpus provided contains a "cats.txt" file which maps file names to specified categories
##Getting Started
- Ensure all required software and libraries are installed
- Run 'python driver.py' in a terminal to start the program
- The program will prompt for input as and when required
- Choose the option to rebuild the corpus on the first run of the program or whenever the corpus is changed (Note: this may take up to 30 minutes depending on the system specs and corpus size)
- Entering the query:
  - Queries not enclosed in quotes are treated as a list of tokens and are searched using the vector space model
  - If the query is enclosed in quotes, it is treated as a phrase query and the entire phrase is searched as a unit
  - The "*" wildcard can be used in phrase queries to match any one token, which allows proximity queries
- Any result file can be opened in write mode
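
The query rules above can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: `parse_query` and `phrase_matches` are hypothetical names, and a document is assumed to be available as a plain list of tokens.

```python
import re


def parse_query(raw):
    """Return ('phrase', tokens) for a quoted query, otherwise ('terms', tokens)."""
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        # Keep "*" as its own token so it can act as a one-token wildcard.
        return "phrase", raw[1:-1].lower().split()
    return "terms", re.findall(r"[a-z0-9]+", raw.lower())


def phrase_matches(pattern, doc_tokens):
    """True if the pattern occurs contiguously in doc_tokens; '*' matches any one token."""
    n = len(pattern)
    if n == 0:
        return False
    for start in range(len(doc_tokens) - n + 1):
        window = doc_tokens[start:start + n]
        if all(p == "*" or p == t for p, t in zip(pattern, window)):
            return True
    return False
```

For example, the phrase query `"bank * england"` would match a document containing "bank of england" but not one containing only "bank england", because the wildcard stands for exactly one token.
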
##Modifying Corpus
- Any of the following files can be modified
- A list of common words which don't add any value is present in corpora/reuters/stopwords
- Category information for each file is present in corpora/reuters/cats.txt
- The database files are present in corpora/reuters/test and corpora/reuters/training
- Rebuild the index if any changes are made

NOTE: Modifying the corpus is not recommended.
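
When editing the category file, it may help to check that it still parses. The sketch below assumes each line of cats.txt holds a file name followed by one or more space-separated categories; that layout is an assumption and should be checked against the actual file. `parse_categories` is a hypothetical helper, not a function from the project.

```python
def parse_categories(lines):
    """Map each file name to its list of categories.

    Assumes lines of the form "<file name> <category> [<category> ...]".
    Blank lines are skipped.
    """
    categories = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        categories[parts[0]] = parts[1:]
    return categories


# Illustrative input only; these entries are made up, not copied from the corpus.
sample = ["test/0001 trade", "", "training/0002 grain wheat"]
print(parse_categories(sample))
```

Since an open file yields its lines one by one, the same function could be called as `parse_categories(open("corpora/reuters/cats.txt"))`.
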
##Authors
- Aadi Jain (2015A7PS104P)
- Aayushmaan Jain (2015A7PS043P)
- Aradhya Khandelwal (2015A7PS036P)
- Samip Jasani (2015A7PS127P)
- Tanvi Aggarwal (2015A7PS140P)