
Commit

revise textcnn code and utils
astonzhang committed Aug 18, 2018
1 parent 82e87b3 commit 2ec11df
Showing 5 changed files with 181 additions and 153 deletions.
2 changes: 1 addition & 1 deletion STYLE_GUIDE.md
@@ -19,7 +19,7 @@
* First person → 我们
* Second person → 你、大家
* Tools and components
* Gluon, MXNet, NumPy, NDArray, Symbol, Block, HybridBlock, ResNet-18, Fashion-MNIST
* Gluon, MXNet, NumPy, spaCy, NDArray, Symbol, Block, HybridBlock, ResNet-18, Fashion-MNIST
* Treat these as ordinary words; do not wrap them in backticks
* Dense class/instance, Sequential class/instance, HybridSequential class/instance
* Do not wrap these in backticks
6 changes: 6 additions & 0 deletions chapter_appendix/gluonbook.md
@@ -7,18 +7,24 @@
|:--|:-:|
| `accuracy`|[Implementation of Softmax Regression from Scratch](../chapter_deep-learning-basics/softmax-regression-scratch.md)|
| `bbox_to_rect`|[Object Detection and Bounding Boxes](../chapter_computer-vision/bounding-box.md)|
| `count_tokens`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `data_iter`|[Implementation of Linear Regression from Scratch](../chapter_deep-learning-basics/linear-regression-scratch.md)|
| `data_iter_consecutive`|[Recurrent Neural Networks](../chapter_recurrent-neural-networks/rnn.md)|
| `data_iter_random`|[Recurrent Neural Networks](../chapter_recurrent-neural-networks/rnn.md)|
| `download_imdb`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `download_voc_pascal`|[Semantic Segmentation and Datasets](../chapter_computer-vision/semantic-segmentation-and-dataset.md)|
| `evaluate_accuracy`|[Image Augmentation](../chapter_computer-vision/image-augmentation.md)|
| `get_tokenized_imdb`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `grad_clipping`|[Recurrent Neural Networks](../chapter_recurrent-neural-networks/rnn.md)|
| `linreg`|[Implementation of Linear Regression from Scratch](../chapter_deep-learning-basics/linear-regression-scratch.md)|
| `load_data_fashion_mnist`|[Deep Convolutional Neural Networks (AlexNet)](../chapter_convolutional-neural-networks/alexnet.md)|
| `load_data_pikachu`|[Object Detection Dataset](../chapter_computer-vision/object-detection-dataset.md)|
| `optimize`|[Gluon Implementation of Gradient Descent and Stochastic Gradient Descent](../chapter_optimization/gd-sgd-gluon.md)|
| `plt`|[Implementation of Linear Regression from Scratch](../chapter_deep-learning-basics/linear-regression-scratch.md)|
| `predict_rnn`|[Recurrent Neural Networks](../chapter_recurrent-neural-networks/rnn.md)|
| `predict_sentiment`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `preprocess_imdb`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `read_imdb`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `read_voc_images`|[Semantic Segmentation and Datasets](../chapter_computer-vision/semantic-segmentation-and-dataset.md)|
| `Residual`|[Residual Networks (ResNet)](../chapter_convolutional-neural-networks/resnet.md)|
| `resnet18`|[Gluon Implementation of Multi-GPU Computation](../chapter_computational-performance/multiple-gpus-gluon.md)|
105 changes: 11 additions & 94 deletions chapter_natural-language-processing/sentiment-analysis-cnn.md
@@ -50,40 +50,14 @@ import tarfile
We first download the dataset to `../data`. The compressed archive is 81 MB, so downloading and extracting it takes a while. After extraction, the dataset will be located under `../data/aclImdb`.

```{.python .input n=2}
def download_imdb(data_dir='../data'):
"""Download the IMDb Dataset."""
imdb_dir = os.path.join(data_dir, 'aclImdb')
url = ('http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')
sha1 = '01ada507287d82875905620988597833ad4e0903'
fname = gutils.download(url, data_dir, sha1_hash=sha1)
with tarfile.open(fname, 'r') as f:
f.extractall(data_dir)
return imdb_dir
imdb_dir = download_imdb()
gb.download_imdb()
```
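
As an illustrative aside (not part of this commit), the extracted layout can be checked; the standard aclImdb archive contains `train` and `test` folders, each with `pos` and `neg` subfolders of individual review files.

```{.python .input}
import os  # os may already be imported at the top of this notebook

# List the top level of the extracted dataset; expect entries such as
# 'train' and 'test'.
sorted(os.listdir(os.path.join('../data', 'aclImdb')))
```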

Next, we read the training and test datasets.

```{.python .input n=3}
def readIMDB(dir_url, seg='train'):
pos_or_neg = ['pos', 'neg']
data = []
for label in pos_or_neg:
files = os.listdir(os.path.join('../data/',dir_url, seg, label))
for file in files:
with open(os.path.join('../data/',dir_url, seg, label, file), 'r',
encoding='utf8') as rf:
review = rf.read().replace('\n', '')
if label == 'pos':
data.append([review, 1])
elif label == 'neg':
data.append([review, 0])
return data
train_data = readIMDB('aclImdb', 'train')
test_data = readIMDB('aclImdb', 'test')
train_data = gb.read_imdb('aclImdb', 'train')
test_data = gb.read_imdb('aclImdb', 'test')
random.shuffle(train_data)
random.shuffle(test_data)
```
@@ -93,32 +67,15 @@ random.shuffle(test_data)
Next, we tokenize each review to obtain tokenized reviews. Here we use the simplest method: splitting on whitespace. We will explore other tokenization methods in the exercises of this section.

```{.python .input n=4}
def tokenizer(text):
return [tok.lower() for tok in text.split(' ')]
train_tokenized = []
for review, score in train_data:
train_tokenized.append(tokenizer(review))
test_tokenized = []
for review, score in test_data:
test_tokenized.append(tokenizer(review))
train_tokenized, test_tokenized = gb.get_tokenized_imdb(train_data, test_data)
```

## Creating the Vocabulary

Now we can create a vocabulary from the tokenized training dataset. Here we define the special token "<unk>" (unknown), which represents any word that does not appear in the vocabulary of the training dataset.

```{.python .input n=5}
token_counter = collections.Counter()
def count_token(train_tokenized):
for sample in train_tokenized:
for token in sample:
if token not in token_counter:
token_counter[token] = 1
else:
token_counter[token] += 1
count_token(train_tokenized)
token_counter = gb.count_tokens(train_tokenized)
vocab = text.vocab.Vocabulary(token_counter, unknown_token='<unk>',
reserved_tokens=None)
```
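
As an illustrative aside (not part of this commit), the resulting `Vocabulary` can be queried directly; the word 'movie' is assumed to occur in the training reviews.

```{.python .input}
# Index 0 is reserved for the unknown token '<unk>'; all other indices map to
# words seen in the training data.
len(vocab), vocab.token_to_idx['movie'], vocab.idx_to_token[:3]
```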
@@ -128,37 +85,8 @@ vocab = text.vocab.Vocabulary(token_counter, unknown_token='<unk>',
Next, we continue preprocessing the data. Each variable-length review is padded with the special token `PAD` into a sequence of length `maxlen` and represented as an NDArray. Since the model uses a max-pooling layer, which keeps only the largest value after convolution, padding with zeros does not affect the result.

```{.python .input n=6}
def encode_samples(tokenized_samples, vocab):
features = []
for sample in tokenized_samples:
feature = []
for token in sample:
if token in vocab.token_to_idx:
feature.append(vocab.token_to_idx[token])
else:
feature.append(0)
features.append(feature)
return features
def pad_samples(features, maxlen=500, PAD=0):
padded_features = []
for feature in features:
if len(feature) > maxlen:
padded_feature = feature[:maxlen]
else:
padded_feature = feature
        # Append the PAD token so that every sequence has the same length (maxlen).
while len(padded_feature) < maxlen:
padded_feature.append(PAD)
padded_features.append(padded_feature)
return padded_features
train_features = encode_samples(train_tokenized, vocab)
test_features = encode_samples(test_tokenized, vocab)
train_features = nd.array(pad_samples(train_features, 1000, 0))
test_features = nd.array(pad_samples(test_features, 1000, 0))
train_labels = nd.array([score for _, score in train_data])
test_labels = nd.array([score for _, score in test_data])
train_features, test_features, train_labels, test_labels = gb.preprocess_imdb(
train_tokenized, test_tokenized, train_data, test_data, vocab)
```

## Loading Pretrained Word Vectors
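
The body of this section is collapsed in the diff view. For context, here is a minimal sketch (not part of this commit) of how pretrained vectors are typically attached to the vocabulary with `mxnet.contrib.text`; the 100-dimensional `glove.6B.100d.txt` file name is an assumption and may differ from what the book uses.

```{.python .input}
from mxnet.contrib import text  # may already be imported at the top of this notebook

# Load pretrained 100-dimensional GloVe vectors for the words in `vocab`;
# words missing from GloVe receive zero vectors by default.
glove_embedding = text.embedding.create(
    'glove', pretrained_file_name='glove.6B.100d.txt', vocabulary=vocab)
glove_embedding.idx_to_vec.shape
```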
@@ -247,6 +175,7 @@ class TextCNN(nn.Block):
            setattr(self, 'pool_{i}', pool)  # Set self.pool_{i} to the i-th pooling layer
self.dropout = nn.Dropout(0.5)
self.decoder = nn.Dense(num_outputs)
def forward(self, inputs):
        # inputs has shape (batch_size, sentence length); transpose to (sentence length, batch_size)
inputs = inputs.T
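
Only fragments of the TextCNN definition are visible in this hunk. For orientation, below is a minimal self-contained sketch (not part of this commit) of the textCNN idea those fragments implement: parallel `Conv1D` branches with different kernel widths, each followed by global max pooling, then concatenation, dropout, and a dense output layer. All names and hyperparameters here are illustrative assumptions; the book's actual model may differ (for example, it can combine a trainable embedding with a fixed pretrained one).

```{.python .input}
# Illustrative sketch only: a compact textCNN, not the code of this commit.
from mxnet import nd
from mxnet.gluon import nn

class TextCNNSketch(nn.Block):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 num_outputs, **kwargs):
        super(TextCNNSketch, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # One Conv1D branch per kernel size; each captures n-grams of that width.
        self.convs = nn.Sequential()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.add(nn.Conv1D(c, k, activation='relu'))
        self.pool = nn.GlobalMaxPool1D()
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Dense(num_outputs)

    def forward(self, inputs):
        # inputs: (batch_size, seq_len) -> embeddings: (batch_size, seq_len, embed_size)
        embeddings = self.embedding(inputs)
        # Conv1D expects the layout (batch_size, channels, seq_len).
        embeddings = embeddings.transpose((0, 2, 1))
        # Each branch yields (batch_size, channels_i, 1) after pooling; flatten
        # each result and concatenate along the channel dimension.
        encoding = nd.concat(*[nd.flatten(self.pool(conv(embeddings)))
                               for conv in self.convs], dim=1)
        return self.decoder(self.dropout(encoding))

net_sketch = TextCNNSketch(len(vocab), 100, kernel_sizes=[3, 4, 5],
                           num_channels=[100, 100, 100], num_outputs=2)
net_sketch.initialize()
net_sketch(nd.ones((4, 50))).shape  # expect (4, 2)
```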
@@ -327,17 +256,11 @@ gb.train(train_loader, test_loader, net, loss, trainer, ctx, num_epochs)
Next, we use the trained model to classify the sentiment of two simple sentences.

```{.python .input}
def get_sentiment(vocab, sentence):
sentence = nd.array([vocab.token_to_idx[token] for token in sentence],
ctx=gb.try_gpu())
label = nd.argmax(net(nd.reshape(sentence, shape=(1, -1))), axis=1)
return 'positive' if label.asscalar() == 1 else 'negative'
get_sentiment(vocab, ['i', 'think', 'this', 'movie', 'is', 'great'])
gb.predict_sentiment(net, vocab, ['i', 'think', 'this', 'movie', 'is', 'great'])
```

```{.python .input}
get_sentiment(vocab, ['the', 'show', 'is', 'terribly', 'boring'])
gb.predict_sentiment(net, vocab, ['the', 'show', 'is', 'terribly', 'boring'])
```

## Summary
@@ -349,13 +272,7 @@

## Exercises

* Use the full IMDb dataset and set the number of epochs to 5. What accuracy can your model reach on the training and test datasets? Can you improve classification accuracy further by tuning hyperparameters?

* Can you improve classification accuracy by using larger pretrained word vectors, such as 300-dimensional GloVe vectors?

* Can you improve classification accuracy by using the spaCy tokenizer? You need to install spaCy (`pip install spacy`) and its English package (`python -m spacy download en`). In the code, first import spaCy (`import spacy`), then load the English package (`spacy_en = spacy.load('en')`), and finally define `def tokenizer(text): return [tok.text for tok in spacy_en.tokenizer(text)]` to replace the original whitespace-based `tokenizer` function. Note that GloVe stores noun phrases by joining the words with "-", so the phrase "new york" is represented as "new-york" in GloVe, whereas after spaCy tokenization "new york" may remain as "new york".

* With the three methods above, can you raise the model's accuracy on the test set above 0.87?
* Using the three methods introduced in the exercises of the previous section (tuning hyperparameters, using larger pretrained word vectors, and using the spaCy tokenizer), can you raise the model's accuracy on the test set above 0.87?



123 changes: 65 additions & 58 deletions chapter_natural-language-processing/sentiment-analysis.md
@@ -36,39 +36,35 @@ import tarfile

```{.python .input n=4}
def download_imdb(data_dir='../data'):
"""Download the IMDb Dataset."""
imdb_dir = os.path.join(data_dir, 'aclImdb')
url = ('http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')
sha1 = '01ada507287d82875905620988597833ad4e0903'
fname = gutils.download(url, data_dir, sha1_hash=sha1)
with tarfile.open(fname, 'r') as f:
f.extractall(data_dir)
return imdb_dir
imdb_dir = download_imdb()
download_imdb()
```

Next, we read the training and test datasets.

```{.python .input n=5}
def readIMDB(dir_url, seg='train'):
def read_imdb(dir_url, seg='train'):
pos_or_neg = ['pos', 'neg']
data = []
for label in pos_or_neg:
files = os.listdir(os.path.join('../data/',dir_url, seg, label))
files = os.listdir(os.path.join('../data/', dir_url, seg, label))
for file in files:
with open(os.path.join('../data/',dir_url, seg, label, file), 'r',
encoding='utf8') as rf:
with open(os.path.join('../data/', dir_url, seg, label, file),
'r', encoding='utf8') as rf:
review = rf.read().replace('\n', '')
if label == 'pos':
data.append([review, 1])
elif label == 'neg':
data.append([review, 0])
return data
train_data = readIMDB('aclImdb', 'train')
test_data = readIMDB('aclImdb', 'test')
train_data = read_imdb('aclImdb', 'train')
test_data = read_imdb('aclImdb', 'test')
random.shuffle(train_data)
random.shuffle(test_data)
```
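
An illustrative peek (not part of this commit): each element of the data is a `[review_text, label]` pair, where the label is 1 for a positive review and 0 for a negative one.

```{.python .input}
# Show the label and the first 60 characters of one (shuffled) training review.
train_data[0][1], train_data[0][0][:60]
```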
@@ -78,32 +74,37 @@ random.shuffle(test_data)
Next, we tokenize each review to obtain tokenized reviews. Here we use the simplest method: splitting on whitespace. We will explore other tokenization methods in the exercises of this section.

```{.python .input n=6}
def tokenizer(text):
return [tok.lower() for tok in text.split(' ')]
train_tokenized = []
for review, score in train_data:
train_tokenized.append(tokenizer(review))
test_tokenized = []
for review, score in test_data:
test_tokenized.append(tokenizer(review))
def get_tokenized_imdb(train_data, test_data):
def tokenizer(text):
return [tok.lower() for tok in text.split(' ')]
train_tokenized = []
for review, score in train_data:
train_tokenized.append(tokenizer(review))
test_tokenized = []
for review, score in test_data:
test_tokenized.append(tokenizer(review))
return train_tokenized, test_tokenized
train_tokenized, test_tokenized = get_tokenized_imdb(train_data, test_data)
```
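
An illustrative peek (not part of this commit) at the first few tokens of one tokenized review:

```{.python .input}
# Each element of train_tokenized is a list of lowercased tokens.
train_tokenized[0][:10]
```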

## Creating the Vocabulary

Now we can create a vocabulary from the tokenized training dataset. Here we define the special token "&lt;unk&gt;" (unknown), which represents any word that does not appear in the vocabulary of the training dataset.

```{.python .input n=7}
token_counter = collections.Counter()
def count_token(train_tokenized):
for sample in train_tokenized:
def count_tokens(samples):
token_counter = collections.Counter()
for sample in samples:
for token in sample:
if token not in token_counter:
token_counter[token] = 1
else:
token_counter[token] += 1
return token_counter
count_token(train_tokenized)
token_counter = count_tokens(train_tokenized)
vocab = text.vocab.Vocabulary(token_counter, unknown_token='<unk>',
reserved_tokens=None)
```
@@ -113,37 +114,43 @@ vocab = text.vocab.Vocabulary(token_counter, unknown_token='<unk>',
Next, we continue preprocessing the data. Each variable-length review is padded with the special token `PAD` into a sequence of length `maxlen` and represented as an NDArray.

```{.python .input n=8}
def encode_samples(tokenized_samples, vocab):
features = []
for sample in tokenized_samples:
feature = []
for token in sample:
if token in vocab.token_to_idx:
feature.append(vocab.token_to_idx[token])
def preprocess_imdb(train_tokenized, test_tokenized, train_data, test_data,
vocab):
def encode_samples(tokenized_samples, vocab):
features = []
for sample in tokenized_samples:
feature = []
for token in sample:
if token in vocab.token_to_idx:
feature.append(vocab.token_to_idx[token])
else:
feature.append(0)
features.append(feature)
return features
def pad_samples(features, maxlen=500, PAD=0):
padded_features = []
for feature in features:
if len(feature) > maxlen:
padded_feature = feature[:maxlen]
else:
feature.append(0)
features.append(feature)
return features
def pad_samples(features, maxlen=500, PAD=0):
padded_features = []
for feature in features:
if len(feature) > maxlen:
padded_feature = feature[:maxlen]
else:
padded_feature = feature
        # Append the PAD token so that every sequence has the same length (maxlen).
while len(padded_feature) < maxlen:
padded_feature.append(PAD)
padded_features.append(padded_feature)
return padded_features
train_features = encode_samples(train_tokenized, vocab)
test_features = encode_samples(test_tokenized, vocab)
train_features = nd.array(pad_samples(train_features, 500, 0))
test_features = nd.array(pad_samples(test_features, 500, 0))
train_labels = nd.array([score for _, score in train_data])
test_labels = nd.array([score for _, score in test_data])
padded_feature = feature
        # Append the PAD token so that every sequence has the same length (maxlen).
while len(padded_feature) < maxlen:
padded_feature.append(PAD)
padded_features.append(padded_feature)
return padded_features
train_features = encode_samples(train_tokenized, vocab)
test_features = encode_samples(test_tokenized, vocab)
train_features = nd.array(pad_samples(train_features, 500, 0))
test_features = nd.array(pad_samples(test_features, 500, 0))
train_labels = nd.array([score for _, score in train_data])
test_labels = nd.array([score for _, score in test_data])
return train_features, test_features, train_labels, test_labels
train_features, test_features, train_labels, test_labels = preprocess_imdb(
train_tokenized, test_tokenized, train_data, test_data, vocab)
```
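
As an illustrative sanity check (not part of this commit), the shapes can be inspected; with `maxlen=500`, each feature matrix should have 500 columns and one row per review.

```{.python .input}
# Expect (number of training reviews, 500), (number of test reviews, 500),
# and 1-D label vectors of matching length.
train_features.shape, test_features.shape, train_labels.shape, test_labels.shape
```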

## Loading Pretrained Word Vectors
@@ -220,17 +227,17 @@ gb.train(train_loader, test_loader, net, loss, trainer, ctx, num_epochs)
Next, we use the trained model to classify the sentiment of two simple sentences.

```{.python .input n=18}
def get_sentiment(vocab, sentence):
def predict_sentiment(net, vocab, sentence):
sentence = nd.array([vocab.token_to_idx[token] for token in sentence],
ctx=gb.try_gpu())
label = nd.argmax(net(nd.reshape(sentence, shape=(1, -1))), axis=1)
return 'positive' if label.asscalar() == 1 else 'negative'
get_sentiment(vocab, ['i', 'think', 'this', 'movie', 'is', 'great'])
predict_sentiment(net, vocab, ['i', 'think', 'this', 'movie', 'is', 'great'])
```

```{.python .input}
get_sentiment(vocab, ['the', 'show', 'is', 'terribly', 'boring'])
predict_sentiment(net, vocab, ['the', 'show', 'is', 'terribly', 'boring'])
```

## Summary
@@ -244,7 +251,7 @@

* Can you improve classification accuracy by using larger pretrained word vectors, such as 300-dimensional GloVe vectors?

* Can you improve classification accuracy by using the spacy tokenizer? You need to install spacy (`pip install spacy`) and its English package (`python -m spacy download en`). In the code, first import spacy (`import spacy`), then load the English package (`spacy_en = spacy.load('en')`), and finally define `def tokenizer(text): return [tok.text for tok in spacy_en.tokenizer(text)]` to replace the original whitespace-based `tokenizer` function. Note that GloVe stores noun phrases by joining the words with "-", so the phrase "new york" is represented as "new-york" in GloVe, whereas after spacy tokenization "new york" may remain as "new york".
* Can you improve classification accuracy by using the spaCy tokenizer? You need to install spaCy (`pip install spacy`) and its English package (`python -m spacy download en`). In the code, first import spaCy (`import spacy`), then load the English package (`spacy_en = spacy.load('en')`), and finally define `def tokenizer(text): return [tok.text for tok in spacy_en.tokenizer(text)]` to replace the original whitespace-based `tokenizer` function (a sketch follows this exercise list). Note that GloVe stores noun phrases by joining the words with "-", so the phrase "new york" is represented as "new-york" in GloVe, whereas after spaCy tokenization "new york" may remain as "new york".

* With the three methods above, can you raise the model's accuracy on the test set above 0.85?
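
A minimal sketch of the spaCy-based tokenizer described in the exercise above, assuming spaCy 2.x and that the English package has been downloaded as instructed:

```{.python .input}
# Requires: pip install spacy && python -m spacy download en
import spacy

spacy_en = spacy.load('en')

def tokenizer(text):
    # Return the raw token strings produced by spaCy's tokenizer.
    return [tok.text for tok in spacy_en.tokenizer(text)]

tokenizer('new york is a big city')
```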
