Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation

Accepted to the 3rd Workshop on NLP for Music and Audio (NLP4MusA)

Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller

Abstract - Byte-Pair Encoding (BPE) is an algorithm commonly used in Natural Language Processing to build a vocabulary of subwords, which has been recently applied to symbolic music. Given that symbolic music can differ significantly from text, particularly with polyphony, we investigate how BPE behaves with different types of musical content. This study provides a qualitative analysis of BPE's behavior across various instrumentations and evaluates its impact on a musical phrase segmentation task for both monophonic and polyphonic music. Our findings show that the BPE training process is highly dependent on the instrumentation and that BPE "supertokens" succeed in capturing abstract musical content. In a musical phrase segmentation task, BPE notably improves performance in a polyphonic setting, but enhances performance in monophonic tunes only within a specific range of BPE merges.
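
For readers unfamiliar with BPE, the following is a minimal sketch of the merge procedure on a toy token sequence. It is not the code used in train_bpe.py: the token names (Pitch_60, Dur_4) and the "+" convention for naming supertokens are assumptions chosen for illustration only.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent pair of adjacent tokens, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_token):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sequences, n_merges):
    """Learn up to `n_merges` merges; return the merge list and the re-encoded data."""
    merges = []
    for _ in range(n_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        # Hypothetical naming scheme for the new supertoken.
        new_token = pair[0] + "+" + pair[1]
        sequences = [merge_pair(seq, pair, new_token) for seq in sequences]
        merges.append(pair)
    return merges, sequences


# Toy monophonic sequence (hypothetical token names).
toy = [["Pitch_60", "Dur_4", "Pitch_62", "Dur_4", "Pitch_60", "Dur_4"]]
merges, encoded = train_bpe(toy, n_merges=1)
print(merges)
print(encoded)
```

Each merge adds one supertoken to the vocabulary, so the number of merges directly controls both vocabulary size and sequence length.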

Setup

Create environment:

conda create -n envtokenization python=3.9.2
conda activate envtokenization

Install requirements:

pip install --no-deps -r requirements.txt

Reproduce figures from pre-computed data

Download pre-computed data and models here
Data content:

exp231_save and exp232_save: pretrained models for phrase segmentation (Figure 3). Place them in a directory of your choice, then set model_path_231 and model_path_232 accordingly in figures_paper.ipynb.
corpus: raw MIDI files and phrase segmentation annotations. Place in ./corpus.
bpe_tokenizers: pre-trained BPE tokenizers. Place in ./bpe_tokenizers.
results: pre-computed data for text-BPE vs. music-BPE (Figure 1) and supertokens with pitches (Figure 4). Place in ./results.
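
Before opening the notebook, it can help to confirm that the downloaded folders are where the list above says they should be. The helper below is a small sketch, not part of the repository; the function name missing_dirs is hypothetical, and the check only covers the three folders with fixed locations.

```python
from pathlib import Path

# Folders with fixed locations, relative to the repository root.
EXPECTED_DIRS = ("corpus", "bpe_tokenizers", "results")


def missing_dirs(root="."):
    """Return the expected folder names that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```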

Then run the notebook figures_paper.ipynb.

Training

BPE tokenizer

Use:

python train_bpe.py --corpus=<poly|mono> --n_merges=<int>

Outputs:

train.bpe: Pre-trained BPE tokenizer
train.bpe.frq: Supertoken frequency (for Figure 1)

Options:

--tokenizer_init=<InitialTokenizer>: start from an already trained BPE tokenizer. If omitted, training starts from the initial vocabulary.
--output_file=<OutputFilename>: output filename (default: train.bpe)
--bypass_tokenize: skip the tokenization step that precedes BPE training, for use when the path 'data_tokenized/{TokenizerName}/{Corpus}/train' already exists. Tokenization can take a long time, so this saves time when running several trainings. Warning: if used together with --tokenizer_init, make sure the existing data was tokenized with that same initial tokenizer.
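
When training several tokenizers with different merge counts, the command lines can be generated rather than typed by hand. The sketch below only builds and prints the commands, using the flags documented above; the merge counts, the output filenames, and the assumption that the first run leaves the tokenized data in place for later runs are illustrative and should be checked against your setup.

```python
import shlex


def train_bpe_command(corpus, n_merges, output_file=None,
                      tokenizer_init=None, bypass_tokenize=False):
    """Build the argument list for one train_bpe.py run."""
    if corpus not in ("mono", "poly"):
        raise ValueError("corpus must be 'mono' or 'poly'")
    cmd = ["python", "train_bpe.py", f"--corpus={corpus}", f"--n_merges={n_merges}"]
    if tokenizer_init:
        cmd.append(f"--tokenizer_init={tokenizer_init}")
    if output_file:
        cmd.append(f"--output_file={output_file}")
    if bypass_tokenize:
        cmd.append("--bypass_tokenize")
    return cmd


# Hypothetical merge counts and filenames. Tokenization is skipped after
# the first run, assuming that run has already written the tokenized data.
commands = [
    train_bpe_command("mono", n, output_file=f"train_{n}.bpe", bypass_tokenize=(i > 0))
    for i, n in enumerate((128, 512, 2048))
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run once they look right.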

Musical phrase detection

Monophonic

Use:

python exp231_clfdata_tf.py --config=<config_file>

Required:

Pre-trained BPE tokenizer, located at the path given by the bpe_savepath field of the config file.

Outputs:

Trained models saved at PATH_CKPT (defined in exp231_clfdata_tf.py)

Options:

--precompute_data: builds pre-computed data mtc_clfdata_<TokenizerName>_bpe<NumBPE>.feather
--seed_split=<int>

No BPE

python exp231_clfdata_tf.py --config=config/clfdata_transformers_nobpe.yaml

With BPE

python exp231_clfdata_tf.py --config=config/clfdata_transformers_withbpe.yaml

Polyphonic

Use:

python exp232_clfdata_tf.py --config=<config_file>

Required:

Pre-trained BPE tokenizer, located at the path given by the bpe_savepath field of the config file.

Outputs:

Trained models saved at PATH_CKPT (defined in exp232_clfdata_tf.py)

Options:

--precompute_data: builds pre-computed data mtc_piano_clfdata_<TokenizerName>_bpe<NumBPE>_chunkafter.feather
--seed_split=<int>

No BPE

python exp232_clfdata_tf.py --config=config/clfdata_piano_transformers_nobpe.yaml

With BPE

python exp232_clfdata_tf.py --config=config/clfdata_piano_transformers_withbpe.yaml

Evaluation

Required:

Trained models for phrase segmentation. Only the best_loss.pt model (saved by exp231_clfdata_tf.py or exp232_clfdata_tf.py) is evaluated.
Pre-computed data with the matching number of BPE merges (e.g. mtc_clfdata_REMIVelocityMute_bpe4096.feather) in the current folder.
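
A mismatch between the model and the pre-computed data file is easy to miss, so a quick check of the expected filename can help. The sketch below follows the two filename patterns given in the training options (--precompute_data); the function names are hypothetical, and the value of NumBPE used for models trained without BPE is not documented here, so it is left to the caller.

```python
from pathlib import Path


def precomputed_name(tokenizer_name, num_bpe, polyphonic=False):
    """Build the pre-computed data filename from the documented patterns."""
    if polyphonic:
        return f"mtc_piano_clfdata_{tokenizer_name}_bpe{num_bpe}_chunkafter.feather"
    return f"mtc_clfdata_{tokenizer_name}_bpe{num_bpe}.feather"


def has_precomputed(tokenizer_name, num_bpe, polyphonic=False, folder="."):
    """Check whether the expected file exists in `folder` (default: current folder)."""
    return (Path(folder) / precomputed_name(tokenizer_name, num_bpe, polyphonic)).is_file()


print(precomputed_name("REMIVelocityMute", 4096))
```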

Outputs:

perfo.json created in the checkpoint folder.

The same scripts can be used to evaluate both BPE and non-BPE models.

Monophonic

python exp231_clfdata_evaluate.py <CheckpointFolder> [cuda:<device_number>]

The cuda:<device_number> argument is optional.

Polyphonic

python exp232_clfdata_evaluate.py <CheckpointFolder> [cuda:<device_number>]

The cuda:<device_number> argument is optional.
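
After evaluating several checkpoints, the perfo.json files can be gathered for side-by-side comparison. The sketch below is an assumption-laden helper, not part of the repository: the function names are hypothetical, and since the structure of perfo.json is not described here, the file is loaded as generic JSON without assuming any particular keys.

```python
import json
from pathlib import Path


def load_perfo(checkpoint_folder):
    """Load perfo.json from one checkpoint folder."""
    path = Path(checkpoint_folder) / "perfo.json"
    with path.open() as f:
        return json.load(f)


def collect_perfo(checkpoint_folders):
    """Map each folder to its perfo.json content, or None if not yet evaluated."""
    results = {}
    for folder in checkpoint_folders:
        try:
            results[str(folder)] = load_perfo(folder)
        except FileNotFoundError:
            results[str(folder)] = None
    return results
```

Calling collect_perfo on a list of checkpoint folders returns a dictionary that can be printed or passed to a plotting routine; folders without a perfo.json map to None, which makes unevaluated checkpoints easy to spot.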

Citation BibTeX

If you find this work helpful and use our code in your research, please cite our paper:

@inproceedings{le2024analyzing,
  title={Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation},
  author={Le, Dinh-Viet-Toan and Bigo, Louis and Keller, Mikaela},
  booktitle={Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)},
  year={2024},
}

dinhviettoanle/musicbpe-mono-poly
