forked from markovka17/dla
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
f99f08f
commit e1e513b
Showing
2 changed files
with
201 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,201 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Homework №5\n", | ||
"\n", | ||
" This homework will be dedicated to the Text-to-Speech(TTS), specifically the neural vocoder." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Data\n", | ||
"\n", | ||
" In this homework we will use only LJSpeech https://keithito.com/LJ-Speech-Dataset/.\n", | ||
"\n", | ||
" Use the following `featurizer` (his configuration is +- standard for this task):" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"code_folding": [ | ||
13, | ||
27 | ||
] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from IPython import display\n", | ||
"from dataclasses import dataclass\n", | ||
"\n", | ||
"import torch\n", | ||
"from torch import nn\n", | ||
"\n", | ||
"import torchaudio\n", | ||
"\n", | ||
"import librosa\n", | ||
"from matplotlib import pyplot as plt\n", | ||
"\n", | ||
"\n", | ||
"@dataclass\n", | ||
"class MelSpectrogramConfig:\n", | ||
" sr: int = 22050\n", | ||
" win_length: int = 1024\n", | ||
" hop_length: int = 256\n", | ||
" n_fft: int = 1024\n", | ||
" f_min: int = 0\n", | ||
" f_max: int = 8000\n", | ||
" n_mels: int = 80\n", | ||
" power: float = 1.0\n", | ||
" \n", | ||
" # value of melspectrograms if we fed a silence into `MelSpectrogram`\n", | ||
" pad_value: float = -11.5129251\n", | ||
"\n", | ||
"\n", | ||
"class MelSpectrogram(nn.Module):\n", | ||
"\n", | ||
" def __init__(self, config: MelSpectrogramConfig):\n", | ||
" super(MelSpectrogram, self).__init__()\n", | ||
" \n", | ||
" self.config = config\n", | ||
"\n", | ||
" self.mel_spectrogram = torchaudio.transforms.MelSpectrogram(\n", | ||
" sample_rate=config.sr,\n", | ||
" win_length=config.win_length,\n", | ||
" hop_length=config.hop_length,\n", | ||
" n_fft=config.n_fft,\n", | ||
" f_min=config.f_min,\n", | ||
" f_max=config.f_max,\n", | ||
" n_mels=config.n_mels\n", | ||
" )\n", | ||
"\n", | ||
" # The is no way to set power in constructor in 0.5.0 version.\n", | ||
" self.mel_spectrogram.spectrogram.power = config.power\n", | ||
"\n", | ||
" # Default `torchaudio` mel basis uses HTK formula. In order to be compatible with WaveGlow\n", | ||
" # we decided to use Slaney one instead (as well as `librosa` does by default).\n", | ||
" mel_basis = librosa.filters.mel(\n", | ||
" sr=config.sr,\n", | ||
" n_fft=config.n_fft,\n", | ||
" n_mels=config.n_mels,\n", | ||
" fmin=config.f_min,\n", | ||
" fmax=config.f_max\n", | ||
" ).T\n", | ||
" self.mel_spectrogram.mel_scale.fb.copy_(torch.tensor(mel_basis))\n", | ||
" \n", | ||
"\n", | ||
" def forward(self, audio: torch.Tensor) -> torch.Tensor:\n", | ||
" \"\"\"\n", | ||
" :param audio: Expected shape is [B, T]\n", | ||
" :return: Shape is [B, n_mels, T']\n", | ||
" \"\"\"\n", | ||
" \n", | ||
" mel = self.mel_spectrogram(audio) \\\n", | ||
" .clamp_(min=1e-5) \\\n", | ||
" .log_()\n", | ||
"\n", | ||
" return mel" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"featurizer = MelSpectrogram(MelSpectrogramConfig())\n", | ||
"wav, sr = torchaudio.load('../week01/audio.wav')\n", | ||
"mels = featurizer(wav)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"_, axes = plt.subplots(2, 1, figsize=(15, 7))\n", | ||
"axes[0].plot(wav.squeeze())\n", | ||
"axes[1].imshow(mels.squeeze())\n", | ||
"\n", | ||
"plt.show()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Model\n", | ||
"\n", | ||
" 1) In this homework you need to implement classical version of WaveNet.\n", | ||
" Pay attention on:\n", | ||
" 1.1) Causal convs. We recommend to implement it via padding.\n", | ||
" 1.2) \"Condition Network\" which align mel with wav\n", | ||
"\n", | ||
" 2) (Bonus) If you have already implemented WaveNet, you can try to implement [Parallel WaveGAN](https://www.dropbox.com/s/bj25vnmkblr9y8v/PWG.pdf?dl=0).\n", | ||
" This model is based on WaveNet and GAN.\n", | ||
"\n", | ||
" 3) (Bonus) Fast generation of WaveNet. https://arxiv.org/abs/1611.09482.\n", | ||
" Don't forget to compare perfomance." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Code\n", | ||
"\n", | ||
" 1) In this homework you are allowed to use pytorch-lighting.\n", | ||
"\n", | ||
" 2) Try to write code more structurally and cleanly!\n", | ||
"\n", | ||
" 3) Good logging of experiments save your nerves and time, so we ask you to use W&B.\n", | ||
" Log loss, generated and real wavs (in pair, i.e. real wav and wav from correspond mel). \n", | ||
" Do not remove the logs until we have checked your work and given you a grade!\n", | ||
"\n", | ||
" 4) We also ask you to organize your code in github repo with (Bonus) Docker and setup.py. You can use my template https://github.com/markovka17/dl-start-pack.\n", | ||
"\n", | ||
" 5) Your work must be reproducable, so fix seed, save the weights of model, and etc.\n", | ||
"\n", | ||
" 6) In the end of your work write inference utils. Anyone should be able to take your weight, load it into the model and run it on some melspec." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Report\n", | ||
"\n", | ||
" Finally, you need to write a report in W&B https://www.wandb.com/reports. Add examples of generated mel and audio, compare with GT.\n", | ||
" Don't forget to add link to your report." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.7" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
Empty file.