Update notebook table and transformers intro notebook (huggingface#9136)
sgugger committed Dec 16, 2020
1 parent fb650df commit 4d48973
Showing 3 changed files with 171 additions and 193 deletions.
4 changes: 2 additions & 2 deletions examples/README.md
@@ -55,11 +55,11 @@ Coming soon!
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
- | [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
+ | [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
- | [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
+ | [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -


210 changes: 92 additions & 118 deletions notebooks/02-transformers.ipynb
@@ -73,12 +73,14 @@
"\n",
"The transformers library allows you to benefits from large, pretrained language models without requiring a huge and costly computational\n",
"infrastructure. Most of the State-of-the-Art models are provided directly by their author and made available in the library \n",
"in PyTorch and TensorFlow in a transparent and interchangeable way. "
"in PyTorch and TensorFlow in a transparent and interchangeable way. \n",
"\n",
"If you're executing this notebook in Colab, you will need to install the transformers library. You can do so with this command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {
"id": "KnT3Jn6fSXai",
"pycharm": {
@@ -89,13 +91,12 @@
},
"outputs": [],
"source": [
"!pip install transformers\n",
"!pip install --upgrade tensorflow"
"# !pip install transformers"
]
},
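Not part of this commit, but if you are following along in a fresh Colab runtime, a one-line sanity check confirms which release the install picked up (a minimal sketch; the printed version will vary):

```python
# Verify the freshly installed package is importable and see its version
import transformers

print(transformers.__version__)
```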
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -111,13 +112,11 @@
{
"data": {
"text/plain": [
"<torch.autograd.grad_mode.set_grad_enabled at 0x7f9c03e5b3c8>"
"<torch.autograd.grad_mode.set_grad_enabled at 0x7ff0cc2a2c50>"
]
},
"execution_count": 2,
"metadata": {
"tags": []
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
@@ -130,7 +129,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"id": "1xMDTHQXSXai",
"pycharm": {
@@ -159,103 +158,56 @@
"source": [
"With only the above two lines of code, you're ready to use a BERT pre-trained model. \n",
"The tokenizers will allow us to map a raw textual input to a sequence of integers representing our textual input\n",
"in a way the model can manipulate."
"in a way the model can manipulate. Since we will be using a PyTorch model, we ask the tokenizer to return to us PyTorch tensors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "XgkFg52fSXai",
"outputId": "94b569d4-5415-4327-f39e-c9541b0a53e0",
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokens: ['This', 'is', 'an', 'input', 'example']\n",
"Tokens id: [1188, 1110, 1126, 7758, 1859]\n",
"Tokens PyTorch: tensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
"input_ids:\n",
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n"
]
}
],
"source": [
"# Tokens comes from a process that splits the input into sub-entities with interesting linguistic properties. \n",
"tokens = tokenizer.tokenize(\"This is an input example\")\n",
"print(\"Tokens: {}\".format(tokens))\n",
"\n",
"# This is not sufficient for the model, as it requires integers as input, \n",
"# not a problem, let's convert tokens to ids.\n",
"tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"print(\"Tokens id: {}\".format(tokens_ids))\n",
"\n",
"# Add the required special tokens\n",
"tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)\n",
"\n",
"# We need to convert to a Deep Learning framework specific format, let's use PyTorch for now.\n",
"tokens_pt = torch.tensor([tokens_ids])\n",
"print(\"Tokens PyTorch: {}\".format(tokens_pt))\n",
"\n",
"# Now we're ready to go through BERT with out input\n",
"outputs = model(tokens_pt)\n",
"last_hidden_state = outputs.last_hidden_state\n",
"pooler_output = outputs.pooler_output\n",
"\n",
"print(\"Token wise output: {}, Pooled output: {}\".format(last_hidden_state.shape, pooler_output.shape))"
"tokens_pt = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"for key, value in tokens_pt.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))"
]
},
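The printed input_ids contain seven positions for a five-word sentence because the tokenizer also inserts BERT's special tokens. A quick way to see that, reusing the `tokenizer` and `tokens_pt` objects defined above (a hedged aside, not part of the diff):

```python
# Decode the IDs back to a string: [CLS] and [SEP] were added automatically
print(tokenizer.decode(tokens_pt["input_ids"][0].tolist()))
# -> "[CLS] This is an input example [SEP]"
```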
{
"cell_type": "markdown",
"metadata": {
"id": "lBbvwNKXSXaj",
"pycharm": {
"name": "#%% md\n"
}
},
"metadata": {},
"source": [
"As you can see, BERT outputs two tensors:\n",
" - One with the generated representation for every token in the input `(1, NB_TOKENS, REPRESENTATION_SIZE)`\n",
" - One with an aggregated representation for the whole input `(1, REPRESENTATION_SIZE)`\n",
" \n",
"The first, token-based, representation can be leveraged if your task requires to keep the sequence representation and you\n",
"want to operate at a token-level. This is particularly useful for Named Entity Recognition and Question-Answering.\n",
"The tokenizer automatically converted our input to all the inputs expected by the model. It generated some additional tensors on top of the IDs: \n",
"\n",
"The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't\n",
"require a fine-grained token-level. This is the case for Sentiment-Analysis of the sequence or Information Retrieval."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DCxuDWH2SXaj",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The code you saw in the previous section introduced all the steps required to do simple model invocation.\n",
"For more day-to-day usage, transformers provides you higher-level methods which will makes your NLP journey easier.\n",
"Let's improve our previous example"
"- token_type_ids: This tensor will map every tokens to their corresponding segment (see below).\n",
"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequence with different lengths (see below).\n",
"\n",
"You can check our [glossary](https://huggingface.co/transformers/glossary.html) for more information about each of those keys. \n",
"\n",
"We can just feed this directly into our model:"
]
},
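To make those two extra tensors concrete, here is a short sketch (not part of the commit) of the cases where they become non-trivial: token_type_ids switches to 1 on the second segment of a sentence pair, and attention_mask drops to 0 on padding positions:

```python
# Sentence pair: the second segment is marked with token_type_ids == 1
pair = tokenizer("This is a question", "This is an answer", return_tensors="pt")
print(pair["token_type_ids"])

# Batch of unequal lengths: padded positions get attention_mask == 0
batch = tokenizer(["A short input", "A noticeably longer input example"],
                  padding=True, return_tensors="pt")
print(batch["attention_mask"])
```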
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "sgcNCdXUSXaj",
"outputId": "af2fb928-7c17-475b-cf81-89cfc4b1d9e5",
"id": "XgkFg52fSXai",
"outputId": "94b569d4-5415-4327-f39e-c9541b0a53e0",
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
@@ -266,52 +218,41 @@
"name": "stdout",
"output_type": "stream",
"text": [
"input_ids:\n",
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n",
"Difference with previous code: (0.0, 0.0)\n"
"Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
]
}
],
"source": [
"# tokens = tokenizer.tokenize(\"This is an input example\")\n",
"# tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"# tokens_pt = torch.tensor([tokens_ids])\n",
"\n",
"# This code can be factored into one-line as follow\n",
"tokens_pt2 = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"\n",
"for key, value in tokens_pt2.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))\n",
"\n",
"outputs2 = model(**tokens_pt2)\n",
"last_hidden_state2 = outputs2.last_hidden_state\n",
"pooler_output2 = outputs2.pooler_output\n",
"outputs = model(**tokens_pt)\n",
"last_hidden_state = outputs.last_hidden_state\n",
"pooler_output = outputs.pooler_output\n",
"\n",
"print(\"Difference with previous code: ({}, {})\".format((last_hidden_state2 - last_hidden_state).sum(), (pooler_output2 - pooler_output).sum()))"
"print(\"Token wise output: {}, Pooled output: {}\".format(last_hidden_state.shape, pooler_output.shape))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gC-7xGYPSXal"
"id": "lBbvwNKXSXaj",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"As you can see above, calling the tokenizer provides a convenient way to generate all the required parameters\n",
"that will go through the model. \n",
"\n",
"Moreover, you might have noticed it generated some additional tensors: \n",
"As you can see, BERT outputs two tensors:\n",
" - One with the generated representation for every token in the input `(1, NB_TOKENS, REPRESENTATION_SIZE)`\n",
" - One with an aggregated representation for the whole input `(1, REPRESENTATION_SIZE)`\n",
" \n",
"The first, token-based, representation can be leveraged if your task requires to keep the sequence representation and you\n",
"want to operate at a token-level. This is particularly useful for Named Entity Recognition and Question-Answering.\n",
"\n",
"- token_type_ids: This tensor will map every tokens to their corresponding segment (see below).\n",
"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequence with different lengths (see below)."
"The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't\n",
"require a fine-grained token-level. This is the case for Sentiment-Analysis of the sequence or Information Retrieval."
]
},
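If pooler_output does not fit your task, one common alternative is to aggregate the token-level tensor yourself. The sketch below reuses `last_hidden_state` and `tokens_pt` from the cells above; mean pooling is our illustration here, not something the notebook prescribes:

```python
# Mean-pool the per-token vectors into a single sequence vector,
# using the attention mask to ignore any padding positions
mask = tokens_pt["attention_mask"].unsqueeze(-1)                 # (1, seq_len, 1)
mean_pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_pooled.shape)                                         # torch.Size([1, 768])
```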
{
"cell_type": "code",
"execution_count": null,
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -357,7 +298,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -414,14 +355,47 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"metadata": {
"id": "Kubwm-wJSXan",
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3b971be3639d4fedb02778fb5c6898a0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=526681800.0, style=ProgressStyle(descri…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']\n",
"- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
}
],
"source": [
"from transformers import TFBertModel, BertModel\n",
"\n",
@@ -432,7 +406,7 @@
},
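The collapsed source above loads the same bert-base-cased checkpoint in both frameworks. When a checkpoint is only published for one framework, `from_pretrained` can convert the weights at load time; a hedged sketch, not part of this diff:

```python
from transformers import BertModel, TFBertModel

# from_pt / from_tf convert a checkpoint across frameworks on the fly
model_tf = TFBertModel.from_pretrained("bert-base-cased", from_pt=True)
model_pt = BertModel.from_pretrained("bert-base-cased", from_tf=True)
```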
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
Expand All @@ -448,8 +422,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"last_hidden_state differences: 1.0094e-05\n",
"pooler_output differences: 7.2969e-07\n"
"last_hidden_state differences: 1.2933e-05\n",
"pooler_output differences: 2.9691e-06\n"
]
}
],
@@ -482,7 +456,7 @@
"\n",
"For example, Google released a few months ago **T5** an Encoder/Decoder architecture based on Transformer and available in `transformers` with no more than 11 billions parameters. Microsoft also recently entered the game with **Turing-NLG** using 17 billions parameters. This kind of model requires tens of gigabytes to store the weights and a tremendous compute infrastructure to run such models which makes it impracticable for the common man !\n",
"\n",
"![transformers-parameters](https://lh5.googleusercontent.com/NRdXzEcgZV3ooykjIaTm9uvbr9QnSjDQHHAHb2kk_Lm9lIF0AhS-PJdXGzpcBDztax922XAp386hyNmWZYsZC1lUN2r4Ip5p9v-PHO19-jevRGg4iQFxgv5Olq4DWaqSA_8ptep7)\n",
"![transformers-parameters](https://github.com/huggingface/notebooks/blob/master/examples/images/model_parameters.png?raw=true)\n",
"\n",
"With the goal of making Transformer-based NLP accessible to everyone we @huggingface developed models that take advantage of a training process called **Distillation** which allows us to drastically reduce the resources needed to run such models with almost zero drop in performances.\n",
"\n",
@@ -673,7 +647,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.7.9"
},
"pycharm": {
"stem_cell": {