uu-spring-fever-scraper

This repository contains tools for scraping and analyzing natural language content related to the Spring Fever week (Lentekriebels). The tools are intended for academic research purposes; the repository was created for a research project initiated by Utrecht University.

Getting started

Prerequisites

  • Python 3.10 or newer

Requirements

All package requirements for running the application are specified in requirements.txt. Packages can be installed from PyPI, e.g. with pip install -r requirements.txt. Note: for lemmatization (during preprocessing), run python -m spacy download nl_core_news_lg before running preprocess.py.
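
As a quick sanity check (a minimal sketch, not part of the repository), the model installation can be verified as follows; spacy.load raises an OSError if the model is missing:

```python
import spacy

try:
    # The Dutch pipeline used for lemmatization during preprocessing.
    nlp = spacy.load("nl_core_news_lg")
except OSError:
    raise SystemExit("Model missing; run: python -m spacy download nl_core_news_lg")
```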

Scraper

The script scrape.py can be used for scraping web pages. The scrapers retrieve posts by individual authors/accounts on a web page; each post's anonymized author and text are exported to a CSV file.
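
As a rough illustration of the output shape (the file name and column names below are hypothetical, not the repository's actual schema), each row pairs an anonymized author with a post text:

```python
import csv

# Hypothetical records: (anonymized author, post text).
posts = [("@0", "Wat een mooi initiatief!"), ("@1", "Helemaal eens met @0.")]

with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["author", "text"])  # illustrative column names
    writer.writerows(posts)
```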

Currently, scrapers have been developed for websites that contain content relevant to the Lentekriebels week. These are:

  • Facebook (comments for any given post)
  • YouTube (comments for any given video)
  • www.ouders.nl (responses to any given forum post)

The web pages of interest for the present research project are defined in config/scrape.yaml, but the scrapers may be used for any page on these websites. In addition, the repository is set up such that it can be readily extended to be used for scraping other websites.
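
As an illustration of what such an extension could look like (the class name, method, and CSS selectors below are hypothetical; see the existing modules under scraper/ for the actual interface, and note that requests and BeautifulSoup are assumed here):

```python
import requests
from bs4 import BeautifulSoup

class ExampleForumScraper:
    """Hypothetical scraper for a new website; names are illustrative only."""

    def scrape(self, url: str) -> list[tuple[str, str]]:
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        posts = []
        # Selectors depend entirely on the target site's page layout.
        for node in soup.select("div.post"):
            author = node.select_one(".author").get_text(strip=True)
            text = node.select_one(".body").get_text(strip=True)
            posts.append((author, text))
        return posts
```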

If the Facebook or YouTube scraper is used, make sure to provide the appropriate credentials (cookies or an API key, respectively). See scraper/facebook.py and scraper/youtube.py for specifics.
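
For YouTube, for example, comments can be retrieved with an API key through the official Data API client. The sketch below shows generic YouTube Data API usage, not the repository's exact code:

```python
from googleapiclient.discovery import build

def fetch_comments(video_id: str, api_key: str) -> list[tuple[str, str]]:
    """Fetch (author, text) pairs for a video's top-level comments."""
    youtube = build("youtube", "v3", developerKey=api_key)
    request = youtube.commentThreads().list(
        part="snippet", videoId=video_id, maxResults=100, textFormat="plainText"
    )
    comments = []
    while request is not None:  # follow pagination until exhausted
        response = request.execute()
        for item in response["items"]:
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            comments.append((snippet["authorDisplayName"], snippet["textDisplay"]))
        request = youtube.commentThreads().list_next(request, response)
    return comments
```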

As is common for scrapers, the ones developed here rely on the page layout of the websites at the time of development. If the structure of these pages changes, the scrapers may no longer work.

Anonymization occurs by replacing usernames with an integer (e.g. @JohnDoe becomes @0, @JaneDoe becomes @1). This is done by first collecting the author usernames of all posts and assigning each an integer, and then replacing all occurrences of a username in the post texts with its assigned integer. This approach has a limitation: usernames that are mentioned in posts but never author a post are not recognized as usernames and are therefore not replaced. Therefore, if privacy is a concern (which it most likely is), make sure to perform a manual check of the retrieved data before further processing.
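
In outline, the two-pass replacement described above looks like this (a minimal sketch of the approach, not the repository's actual code):

```python
import re

def anonymize(posts: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Two-pass anonymization: map authors to integers, then rewrite mentions."""
    # Pass 1: assign an integer to every username that authored a post.
    mapping: dict[str, int] = {}
    for author, _ in posts:
        mapping.setdefault(author, len(mapping))
    # Pass 2: replace every @mention of a known author in the post texts.
    result = []
    for author, text in posts:
        for name, number in mapping.items():
            text = re.sub(rf"@{re.escape(name)}\b", f"@{number}", text)
        result.append((f"@{mapping[author]}", text))
    return result

# @JohnDoe becomes @0; @GhostUser never authored a post, so it is left as-is.
print(anonymize([("JohnDoe", "Hoi @JaneDoe en @GhostUser"), ("JaneDoe", "Dag @JohnDoe")]))
```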

Preprocessing

The script preprocess.py contains functionality for preprocessing natural language. The available options were developed with classical topic modelling approaches in mind. For topic modelling, it is desirable to "normalize" documents as much as possible, in order to increase the density of the distribution of the model to be fitted. All preprocessing options therefore help reduce the text to a form that keeps the features that best capture its meaning, while removing as much redundancy as possible. Moreover, the documents are assumed to be in Dutch, so Dutch lexicons are used wherever appropriate (although it should be straightforward to replace these if required).

The preprocessing options that can be executed are listed below.

  • Account masking: following the anonymization described above, normalizes documents by replacing all of @0, @1, @2, ... with @USER.
  • HTML tag removal: removes lingering HTML tags from the text.
  • Stopword removal: removes all stopwords included in the NLTK Dutch stopword corpus.
  • Punctuation removal: removes all punctuation characters categorized as such by unicodedata.category, as well as those contained in string.punctuation.
  • Lemmatization: reduces words to a root form, using the nl_core_news_lg model from spaCy.
  • Stemming: reduces words to their stem, e.g. by removing prefixes and suffixes, using the Dutch Snowball stemmer from NLTK. Lemmatization seems to perform better than stemming for the present corpus.
  • Lowercasing: casts all words in the text to lowercase.
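
To make a few of these steps concrete, the sketch below combines account masking, HTML tag removal, lemmatization, lowercasing, and stopword removal. It illustrates the named techniques, not the repository's implementation, and assumes the NLTK stopword corpus has been downloaded via nltk.download("stopwords"):

```python
import re
import spacy
from nltk.corpus import stopwords

nlp = spacy.load("nl_core_news_lg")
dutch_stopwords = set(stopwords.words("dutch"))

def preprocess(text: str) -> str:
    text = re.sub(r"@\d+", "@USER", text)       # account masking
    text = re.sub(r"<[^>]+>", " ", text)        # naive HTML tag removal
    tokens = [tok.lemma_ for tok in nlp(text)]  # lemmatization (spaCy)
    # Crude punctuation filter that keeps masked @-handles, then lowercase.
    tokens = [t.lower() for t in tokens if t.isalpha() or t.startswith("@")]
    tokens = [t for t in tokens if t not in dutch_stopwords]  # stopword removal
    return " ".join(tokens)

print(preprocess("<p>@0 vond de lessen erg leerzaam!</p>"))
```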

The preprocessing steps to be executed are defined in config/preprocess.yaml. The steps are applied in the order that they are listed. In addition, the files that preprocessing should be applied to are listed in the same .yaml file.
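
Schematically, the pipeline can be thought of as reading the YAML and applying each named step in turn (the keys and step names below are hypothetical, not the actual config/preprocess.yaml schema):

```python
import yaml

# Hypothetical step registry; the real step names live in config/preprocess.yaml.
STEPS = {
    "strip_whitespace": str.strip,
    "lowercase": str.lower,
}

config = yaml.safe_load("""
steps: [strip_whitespace, lowercase]
files: [data/comments.csv]
""")

def run_pipeline(text: str) -> str:
    # Steps are applied in the order they appear in the config.
    for name in config["steps"]:
        text = STEPS[name](text)
    return text

print(run_pipeline("  Voorbeeldtekst  "))  # -> "voorbeeldtekst"
```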

Contact

A publication about the results of the analysis performed on the basis of the data scraped using this repository is forthcoming (and will be linked here). Otherwise, this repository will no longer be actively maintained. Should you nonetheless wish to get in touch, you may do so by opening an Issue.
