TitoTamburini/AtlasObscura-WebScraping

Advanced Data Mining 2022/23 - Homework 3

Build a dataset of the most relevant places on AtlasObscura.com by scraping the website with the BeautifulSoup Python module.
Build a search engine on the complete dataset.
Execute queries on the search engine.
Define a new score for the search engine.
Visualize the most relevant places in the dataset using the newly defined score.
Answer a theoretical question on sorting algorithms.
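
As an illustration of the search engine step, here is a minimal sketch of a vocabulary plus inverted index with a conjunctive (AND) query. It is not the code from main.ipynb: the function names, the tokenizer, and the three sample descriptions are assumptions made for this example, and the real notebook may preprocess text and store the index differently.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to an integer id and each term id to the set of documents containing it."""
    vocabulary = {}
    inverted_index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            term_id = vocabulary.setdefault(term, len(vocabulary))
            inverted_index[term_id].add(doc_id)
    return vocabulary, inverted_index


def conjunctive_query(query, vocabulary, inverted_index):
    """Return the documents that contain every term of the query."""
    term_ids = [vocabulary.get(term) for term in tokenize(query)]
    # An empty query or an unknown term cannot match any document.
    if not term_ids or any(term_id is None for term_id in term_ids):
        return set()
    return set.intersection(*(inverted_index[term_id] for term_id in term_ids))


# Hypothetical place descriptions, standing in for the scraped dataset.
places = {
    "place_1": "An abandoned subway station under the city",
    "place_2": "A museum of abandoned neon signs",
    "place_3": "A hidden garden in the old city",
}

vocabulary, inverted_index = build_index(places)
result = conjunctive_query("abandoned city", vocabulary, inverted_index)
print(result)  # {'place_1'}
```

Keeping the vocabulary (term to id) separate from the inverted index (id to documents) mirrors the two pickle files in this repository, although their exact layout is not documented here.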

Repository content:

  1. main.ipynb: main notebook, covering parts 1, 2, 3, 4 and 7
  2. CommandLine.sh: shell script containing the command-line solution
  3. RankingList.txt: output of the sorting query
  4. TSV_FILES.zip: all the .tsv files, one per Atlas Obscura place, ordered by page; output of part 1.2
  5. inverted_index.pkl, vocabulary.pkl: files needed by the search engine
  6. map.png: screenshot of the scatter_mapbox obtained in part 4
  7. merged.tsv: .tsv file containing the entire dataset
  8. places_url.txt: text file with all the URLs; output of part 1.1
  9. web_scraping_functions.py: Python module with the web scraping functions used in part 1.3
  10. sorting.py: Python module with the function for part 7