DAPT-MLM-BERT

Based on "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks" (Gururangan et al., 2020).

Domain-adaptive pre-training (DAPT) is the process of tailoring a pre-trained model to the domain of a target classification task. This is done by continuing pre-training for an additional round on a large, unlabeled, domain-specific corpus.

I gave a pre-trained Spanish BERT model an additional round of domain-adaptive pre-training with the masked language modeling (MLM) objective, targeting the task of detecting misogynistic tweets in Spanish.
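To make the MLM objective concrete, here is a minimal sketch of the token corruption step as described in the original BERT paper: about 15% of tokens are selected for prediction, and of those 80% become `[MASK]`, 10% become a random vocabulary token, and 10% are left unchanged. This is not the code from the notebook; the function name `mask_tokens`, the whitespace tokenization, and the toy vocabulary are assumptions made for illustration only.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list following BERT's 80/10/10 MLM scheme.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at positions the model must predict, and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            # This position was selected for prediction.
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random vocabulary token
            else:
                masked.append(token)              # 10%: keep the original token
        else:
            # Not selected: input is unchanged and no loss is computed here.
            labels.append(None)
            masked.append(token)
    return masked, labels


# Toy example; a real pipeline would use the model's subword tokenizer.
example_tokens = "hoy es un buen día para salir".split()
example_vocab = ["casa", "perro", "ciudad"]  # hypothetical toy vocabulary
print(mask_tokens(example_tokens, example_vocab, rng=random.Random(0)))
```

In practice this step is usually delegated to a library data collator rather than written by hand; the sketch only shows what the model is asked to reconstruct during the additional pre-training round.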

Pre-training Corpus: profanitiesdatasetprocessed.csv

5.5M tweets in Spanish, each containing at least one common Spanish profanity.
Scraped using snscrape and the Twitter API.
Rationale:
1. BERT pre-trained on general-domain text is not adapted to the informal language used on Twitter.
2. Misogynistic tweets usually contain slurs.
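The selection criterion above (keep a tweet only if it contains at least one profanity) can be sketched as a simple lexicon filter. The word list below is a hypothetical placeholder, since the actual lexicon is not given here, and the helper names `build_pattern` and `filter_tweets` are likewise assumptions rather than the code used to build the corpus.

```python
import re

# Hypothetical placeholder lexicon; the real word list is not part of this README.
PROFANITY_LEXICON = ["idiota", "imbécil", "estúpida"]


def build_pattern(lexicon):
    """Compile one case-insensitive regex that matches any lexicon word as a whole word."""
    escaped = sorted((re.escape(word) for word in lexicon), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_tweets(tweets, lexicon=PROFANITY_LEXICON):
    """Return only the tweets that contain at least one word from the lexicon."""
    pattern = build_pattern(lexicon)
    return [tweet for tweet in tweets if pattern.search(tweet) is not None]


sample = ["Eres un IDIOTA total", "Qué bonito día", "no seas imbécil"]
print(filter_tweets(sample))
```

Whole-word matching keeps the filter precise but misses inflected forms such as plurals; a real lexicon would need to list those variants explicitly or match on stems.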

Fine-tuning Corpus: IberEval 2018 Automatic Misogyny Identification (AMI) dataset.

This repository contains only the code I used for pre-training (pre_training.ipynb), not the model itself. I ran the pre-training in a Docker container and tuned the following hyperparameters: weight decay (0, 0.1, 0.01, 0.001), epochs (1-4), batch size (16, 32, 64, 128), and optimizer (AdamW, SGD, Adadelta, Adagrad).
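For reference, the hyperparameter values listed above can be enumerated as a search space. This is only a sketch: it assumes an exhaustive grid and reads "epochs (1-4)" as the integers 1 through 4, neither of which is confirmed by the notebook, and `SEARCH_SPACE` and `iter_configs` are illustrative names.

```python
from itertools import product

# Values taken from the list above; epochs 1-4 is assumed to mean 1, 2, 3, 4.
SEARCH_SPACE = {
    "weight_decay": [0, 0.1, 0.01, 0.001],
    "epochs": [1, 2, 3, 4],
    "batch_size": [16, 32, 64, 128],
    "optimizer": ["AdamW", "SGD", "Adadelta", "Adagrad"],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(SEARCH_SPACE))
print(len(configs))
print(configs[0])
```

A full grid over these values is fairly large for language-model pre-training, so in practice one might vary a single hyperparameter at a time or sample a subset of the combinations.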

angelelliott/DAPT-MLM-BERT
