DAPT-MLM-BERT

Based on Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks (Gururangan et al., 2020).

Domain-adaptive pre-training (DAPT) tailors a pre-trained model to the domain of a target classification task by continuing pre-training on a large unlabeled domain-specific corpus.

I continued pre-training a Spanish BERT model with domain-adaptive masked language modeling (MLM) for the downstream task of misogynistic tweet detection in Spanish.
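
The sketch below shows what an MLM pre-training round of this kind typically looks like with Hugging Face Transformers. It is only an illustration: the checkpoint name (BETO), the CSV column name, and the hyperparameter values are assumptions, not taken from this repository.

```python
# Minimal sketch of domain-adaptive MLM pre-training with Hugging Face Transformers.
# The model checkpoint, column name, and hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "dccuchile/bert-base-spanish-wwm-uncased"  # assumed Spanish BERT checkpoint (BETO)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load the unlabeled tweet corpus; "text" is an assumed column name.
dataset = load_dataset("csv", data_files="profanitiesdatasetprocessed.csv")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Dynamic masking: 15% of tokens are masked each time a batch is sampled.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dapt-mlm-bert",
    num_train_epochs=2,
    per_device_train_batch_size=32,
    weight_decay=0.01,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```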

Pre-training Corpus: profanitiesdatasetprocessed.csv

  • 5.5M Spanish tweets, each containing at least one common Spanish profanity.
  • Scraped using snscrape and the Twitter API (see the sketch after this list).
  • Rationale:
    1. A BERT model pre-trained on general-domain text does not capture Twitter-specific language.
    2. Misogynistic tweets usually contain slurs and profanity.
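
The scraped corpus itself is not included in the repository. As a rough illustration of the collection step, here is a hedged snscrape sketch; the search query, tweet limit, and output path are placeholders, and the tweet attribute names vary across snscrape versions.

```python
# Hypothetical sketch of collecting Spanish tweets containing a given profanity with snscrape.
# The query term, limit, and output file are placeholders, not values from this repo.
import csv
import snscrape.modules.twitter as sntwitter

query = "puta lang:es"   # placeholder: one profanity, restricted to Spanish-language tweets
limit = 10_000           # placeholder cap for this sketch

with open("profanities_raw.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "date", "text"])
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        # `tweet.content` is named `rawContent` in newer snscrape releases.
        writer.writerow([tweet.id, tweet.date, tweet.content])
```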

Fine-tuning Corpus: IberEval 2018 Automatic Misogyny Identification (AMI) dataset.

This repository contains only the code I used for pre-training, not the resulting model. I ran training in a Docker container and experimented with the following hyperparameters: weight decay (0, 0.1, 0.01, 0.001), epochs (1-4), batch size (16, 32, 64, 128), and optimizer (AdamW, SGD, Adadelta, Adagrad).
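
As a sketch of how that grid could map onto the training setup (building on the MLM example above, and again only an assumption about how this repo wires it up): weight decay, epochs, and batch size fit directly into TrainingArguments, while a non-default optimizer can be passed to the Trainer explicitly.

```python
# Illustrative mapping of the hyperparameter grid onto Hugging Face TrainingArguments.
# Only one configuration is shown; the actual sweep in this repo may differ.
import torch
from transformers import Trainer, TrainingArguments

weight_decays = [0, 0.1, 0.01, 0.001]
epochs = [1, 2, 3, 4]
batch_sizes = [16, 32, 64, 128]

args = TrainingArguments(
    output_dir="dapt-sweep",
    weight_decay=weight_decays[2],               # 0.01
    num_train_epochs=epochs[1],                  # 2
    per_device_train_batch_size=batch_sizes[1],  # 32
)

# Trainer uses AdamW by default; another optimizer (e.g. SGD) can be supplied explicitly.
optimizer = torch.optim.SGD(model.parameters(), lr=5e-5)
trainer = Trainer(
    model=model,                   # `model`, `tokenized`, and `collator` from the MLM sketch above
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
    optimizers=(optimizer, None),  # (optimizer, lr_scheduler); Trainer builds a scheduler if None
)
trainer.train()
```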
