Update notebook table and transformers intro notebook (huggingface#9136)
sgugger committed Dec 16, 2020
1 parent fb650df commit 4d48973
Showing 3 changed files with 171 additions and 193 deletions.
4 changes: 2 additions & 2 deletions examples/README.md
@@ -55,11 +55,11 @@ Coming soon!
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
- | [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
+ | [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
- | [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
+ | [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -


210 changes: 92 additions & 118 deletions notebooks/02-transformers.ipynb
@@ -73,12 +73,14 @@
"\n",
"The transformers library allows you to benefits from large, pretrained language models without requiring a huge and costly computational\n",
"infrastructure. Most of the State-of-the-Art models are provided directly by their author and made available in the library \n",
"in PyTorch and TensorFlow in a transparent and interchangeable way. "
"in PyTorch and TensorFlow in a transparent and interchangeable way. \n",
"\n",
"If you're executing this notebook in Colab, you will need to install the transformers library. You can do so with this command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {
"id": "KnT3Jn6fSXai",
"pycharm": {
@@ -89,13 +91,12 @@
},
"outputs": [],
"source": [
"!pip install transformers\n",
"!pip install --upgrade tensorflow"
"# !pip install transformers"
]
},
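Not part of this commit, but if you are following along in a fresh Colab runtime, a one-line sanity check confirms which release the install picked up (a minimal sketch; the printed version will vary):

```python
# Verify the freshly installed package is importable and see its version
import transformers

print(transformers.__version__)
```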
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -111,13 +112,11 @@
{
"data": {
"text/plain": [
"<torch.autograd.grad_mode.set_grad_enabled at 0x7f9c03e5b3c8>"
"<torch.autograd.grad_mode.set_grad_enabled at 0x7ff0cc2a2c50>"
]
},
"execution_count": 2,
"metadata": {
"tags": []
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
@@ -130,7 +129,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"id": "1xMDTHQXSXai",
"pycharm": {
@@ -159,103 +158,56 @@
"source": [
"With only the above two lines of code, you're ready to use a BERT pre-trained model. \n",
"The tokenizers will allow us to map a raw textual input to a sequence of integers representing our textual input\n",
"in a way the model can manipulate."
"in a way the model can manipulate. Since we will be using a PyTorch model, we ask the tokenizer to return to us PyTorch tensors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "XgkFg52fSXai",
"outputId": "94b569d4-5415-4327-f39e-c9541b0a53e0",
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokens: ['This', 'is', 'an', 'input', 'example']\n",
"Tokens id: [1188, 1110, 1126, 7758, 1859]\n",
"Tokens PyTorch: tensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
"input_ids:\n",
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n"
]
}
],
"source": [
"# Tokens comes from a process that splits the input into sub-entities with interesting linguistic properties. \n",
"tokens = tokenizer.tokenize(\"This is an input example\")\n",
"print(\"Tokens: {}\".format(tokens))\n",
"\n",
"# This is not sufficient for the model, as it requires integers as input, \n",
"# not a problem, let's convert tokens to ids.\n",
"tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"print(\"Tokens id: {}\".format(tokens_ids))\n",
"\n",
"# Add the required special tokens\n",
"tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)\n",
"\n",
"# We need to convert to a Deep Learning framework specific format, let's use PyTorch for now.\n",
"tokens_pt = torch.tensor([tokens_ids])\n",
"print(\"Tokens PyTorch: {}\".format(tokens_pt))\n",
"\n",
"# Now we're ready to go through BERT with out input\n",
"outputs = model(tokens_pt)\n",
"last_hidden_state = outputs.last_hidden_state\n",
"pooler_output = outputs.pooler_output\n",
"\n",
"print(\"Token wise output: {}, Pooled output: {}\".format(last_hidden_state.shape, pooler_output.shape))"
"tokens_pt = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"for key, value in tokens_pt.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))"
]
},
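The printed input_ids contain seven positions for a five-word sentence because the tokenizer also inserts BERT's special tokens. A quick way to see that, reusing the `tokenizer` and `tokens_pt` objects defined above (a hedged aside, not part of the diff):

```python
# Decode the IDs back to a string: [CLS] and [SEP] were added automatically
print(tokenizer.decode(tokens_pt["input_ids"][0].tolist()))
# -> "[CLS] This is an input example [SEP]"
```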
{
"cell_type": "markdown",
"metadata": {
"id": "lBbvwNKXSXaj",
"pycharm": {
"name": "#%% md\n"
}
},
"metadata": {},
"source": [
"As you can see, BERT outputs two tensors:\n",
" - One with the generated representation for every token in the input `(1, NB_TOKENS, REPRESENTATION_SIZE)`\n",
" - One with an aggregated representation for the whole input `(1, REPRESENTATION_SIZE)`\n",
" \n",
"The first, token-based, representation can be leveraged if your task requires to keep the sequence representation and you\n",
"want to operate at a token-level. This is particularly useful for Named Entity Recognition and Question-Answering.\n",
"The tokenizer automatically converted our input to all the inputs expected by the model. It generated some additional tensors on top of the IDs: \n",
"\n",
"The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't\n",
"require a fine-grained token-level. This is the case for Sentiment-Analysis of the sequence or Information Retrieval."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DCxuDWH2SXaj",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The code you saw in the previous section introduced all the steps required to do simple model invocation.\n",
"For more day-to-day usage, transformers provides you higher-level methods which will makes your NLP journey easier.\n",
"Let's improve our previous example"
"- token_type_ids: This tensor will map every tokens to their corresponding segment (see below).\n",
"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequence with different lengths (see below).\n",
"\n",
"You can check our [glossary](https://huggingface.co/transformers/glossary.html) for more information about each of those keys. \n",
"\n",
"We can just feed this directly into our model:"
]
},
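To make those two extra tensors concrete, here is a short sketch (not part of the commit) of the cases where they become non-trivial: token_type_ids switches to 1 on the second segment of a sentence pair, and attention_mask drops to 0 on padding positions:

```python
# Sentence pair: the second segment is marked with token_type_ids == 1
pair = tokenizer("This is a question", "This is an answer", return_tensors="pt")
print(pair["token_type_ids"])

# Batch of unequal lengths: padded positions get attention_mask == 0
batch = tokenizer(["A short input", "A noticeably longer input example"],
                  padding=True, return_tensors="pt")
print(batch["attention_mask"])
```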
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "sgcNCdXUSXaj",
"outputId": "af2fb928-7c17-475b-cf81-89cfc4b1d9e5",
"id": "XgkFg52fSXai",
"outputId": "94b569d4-5415-4327-f39e-c9541b0a53e0",
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
@@ -266,52 +218,41 @@
"name": "stdout",
"output_type": "stream",
"text": [
"input_ids:\n",
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n",
"Difference with previous code: (0.0, 0.0)\n"
"Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
]
}
],
"source": [
"# tokens = tokenizer.tokenize(\"This is an input example\")\n",
"# tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"# tokens_pt = torch.tensor([tokens_ids])\n",
"\n",
"# This code can be factored into one-line as follow\n",
"tokens_pt2 = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"\n",
"for key, value in tokens_pt2.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))\n",
"\n",
"outputs2 = model(**tokens_pt2)\n",
"last_hidden_state2 = outputs2.last_hidden_state\n",
"pooler_output2 = outputs2.pooler_output\n",
"outputs = model(**tokens_pt)\n",
"last_hidden_state = outputs.last_hidden_state\n",
"pooler_output = outputs.pooler_output\n",
"\n",
"print(\"Difference with previous code: ({}, {})\".format((last_hidden_state2 - last_hidden_state).sum(), (pooler_output2 - pooler_output).sum()))"
"print(\"Token wise output: {}, Pooled output: {}\".format(last_hidden_state.shape, pooler_output.shape))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gC-7xGYPSXal"
"id": "lBbvwNKXSXaj",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"As you can see above, calling the tokenizer provides a convenient way to generate all the required parameters\n",
"that will go through the model. \n",
"\n",
"Moreover, you might have noticed it generated some additional tensors: \n",
"As you can see, BERT outputs two tensors:\n",
" - One with the generated representation for every token in the input `(1, NB_TOKENS, REPRESENTATION_SIZE)`\n",
" - One with an aggregated representation for the whole input `(1, REPRESENTATION_SIZE)`\n",
" \n",
"The first, token-based, representation can be leveraged if your task requires to keep the sequence representation and you\n",
"want to operate at a token-level. This is particularly useful for Named Entity Recognition and Question-Answering.\n",
"\n",
"- token_type_ids: This tensor will map every tokens to their corresponding segment (see below).\n",
"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequence with different lengths (see below)."
"The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't\n",
"require a fine-grained token-level. This is the case for Sentiment-Analysis of the sequence or Information Retrieval."
]
},
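If pooler_output does not fit your task, one common alternative is to aggregate the token-level tensor yourself. The sketch below reuses `last_hidden_state` and `tokens_pt` from the cells above; mean pooling is our illustration here, not something the notebook prescribes:

```python
# Mean-pool the per-token vectors into a single sequence vector,
# using the attention mask to ignore any padding positions
mask = tokens_pt["attention_mask"].unsqueeze(-1)                 # (1, seq_len, 1)
mean_pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_pooled.shape)                                         # torch.Size([1, 768])
```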
{
"cell_type": "code",
"execution_count": null,
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -357,7 +298,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -414,14 +355,47 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"metadata": {
"id": "Kubwm-wJSXan",
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3b971be3639d4fedb02778fb5c6898a0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=526681800.0, style=ProgressStyle(descri…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']\n",
"- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
}
],
"source": [
"from transformers import TFBertModel, BertModel\n",
"\n",
@@ -432,7 +406,7 @@
},
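The collapsed source above loads the same bert-base-cased checkpoint in both frameworks. When a checkpoint is only published for one framework, `from_pretrained` can convert the weights at load time; a hedged sketch, not part of this diff:

```python
from transformers import BertModel, TFBertModel

# from_pt / from_tf convert a checkpoint across frameworks on the fly
model_tf = TFBertModel.from_pretrained("bert-base-cased", from_pt=True)
model_pt = BertModel.from_pretrained("bert-base-cased", from_tf=True)
```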
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
Expand All @@ -448,8 +422,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"last_hidden_state differences: 1.0094e-05\n",
"pooler_output differences: 7.2969e-07\n"
"last_hidden_state differences: 1.2933e-05\n",
"pooler_output differences: 2.9691e-06\n"
]
}
],
@@ -482,7 +456,7 @@
"\n",
"For example, Google released a few months ago **T5** an Encoder/Decoder architecture based on Transformer and available in `transformers` with no more than 11 billions parameters. Microsoft also recently entered the game with **Turing-NLG** using 17 billions parameters. This kind of model requires tens of gigabytes to store the weights and a tremendous compute infrastructure to run such models which makes it impracticable for the common man !\n",
"\n",
"![transformers-parameters](https://lh5.googleusercontent.com/NRdXzEcgZV3ooykjIaTm9uvbr9QnSjDQHHAHb2kk_Lm9lIF0AhS-PJdXGzpcBDztax922XAp386hyNmWZYsZC1lUN2r4Ip5p9v-PHO19-jevRGg4iQFxgv5Olq4DWaqSA_8ptep7)\n",
"![transformers-parameters](https://github.com/huggingface/notebooks/blob/master/examples/images/model_parameters.png?raw=true)\n",
"\n",
"With the goal of making Transformer-based NLP accessible to everyone we @huggingface developed models that take advantage of a training process called **Distillation** which allows us to drastically reduce the resources needed to run such models with almost zero drop in performances.\n",
"\n",
@@ -673,7 +647,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.7.9"
},
"pycharm": {
"stem_cell": {