
Commit

revise textcnn code and utils
astonzhang committed Aug 18, 2018
1 parent 82e87b3 commit 2ec11df
Showing 5 changed files with 181 additions and 153 deletions.
2 changes: 1 addition & 1 deletion STYLE_GUIDE.md
@@ -19,7 +19,7 @@
* First person → 我们
* Second person → 你、大家
* Tools and components
* Gluon, MXNet, NumPy, NDArray, Symbol, Block, HybridBlock, ResNet-18, Fashion-MNIST
* Gluon, MXNet, NumPy, spaCy, NDArray, Symbol, Block, HybridBlock, ResNet-18, Fashion-MNIST
* Treat these as ordinary words; do not wrap them in backticks
* Dense class/instance, Sequential class/instance, HybridSequential class/instance
* Do not wrap these in backticks
6 changes: 6 additions & 0 deletions chapter_appendix/gluonbook.md
@@ -7,18 +7,24 @@
|:--|:-:|
| `accuracy`|[Implementation of Softmax Regression from Scratch](../chapter_deep-learning-basics/softmax-regression-scratch.md)|
| `bbox_to_rect`|[Object Detection and Bounding Boxes](../chapter_computer-vision/bounding-box.md)|
| `count_tokens`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `data_iter`|[Implementation of Linear Regression from Scratch](../chapter_deep-learning-basics/linear-regression-scratch.md)|
| `data_iter_consecutive`|[Recurrent Neural Networks](../chapter_recurrent-neural-networks/rnn.md)|
| `data_iter_random`|[Recurrent Neural Networks](../chapter_recurrent-neural-networks/rnn.md)|
| `download_imdb`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `download_voc_pascal`|[Semantic Segmentation and Datasets](../chapter_computer-vision/semantic-segmentation-and-dataset.md)|
| `evaluate_accuracy`|[Image Augmentation](../chapter_computer-vision/image-augmentation.md)|
| `get_tokenized_imdb`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `grad_clipping`|[Recurrent Neural Networks](../chapter_recurrent-neural-networks/rnn.md)|
| `linreg`|[Implementation of Linear Regression from Scratch](../chapter_deep-learning-basics/linear-regression-scratch.md)|
| `load_data_fashion_mnist`|[Deep Convolutional Neural Networks (AlexNet)](../chapter_convolutional-neural-networks/alexnet.md)|
| `load_data_pikachu`|[Object Detection Dataset](../chapter_computer-vision/object-detection-dataset.md)|
| `optimize`|[Gluon Implementation of Gradient Descent and Stochastic Gradient Descent](../chapter_optimization/gd-sgd-gluon.md)|
| `plt`|[Implementation of Linear Regression from Scratch](../chapter_deep-learning-basics/linear-regression-scratch.md)|
| `predict_rnn`|[Recurrent Neural Networks](../chapter_recurrent-neural-networks/rnn.md)|
| `predict_sentiment`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `preprocess_imdb`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `read_imdb`|[Text Sentiment Classification: Using Recurrent Neural Networks](../chapter_natural-language-processing/sentiment-analysis.md)|
| `read_voc_images`|[Semantic Segmentation and Datasets](../chapter_computer-vision/semantic-segmentation-and-dataset.md)|
| `Residual`|[Residual Networks (ResNet)](../chapter_convolutional-neural-networks/resnet.md)|
| `resnet18`|[Gluon Implementation of Multi-GPU Computation](../chapter_computational-performance/multiple-gpus-gluon.md)|
105 changes: 11 additions & 94 deletions chapter_natural-language-processing/sentiment-analysis-cnn.md
@@ -50,40 +50,14 @@ import tarfile
We first download the dataset to `../data`. The compressed archive is 81 MB, so downloading and extracting it takes a while. After extraction, the dataset will be located under `../data/aclImdb`.

```{.python .input n=2}
def download_imdb(data_dir='../data'):
"""Download the IMDb Dataset."""
imdb_dir = os.path.join(data_dir, 'aclImdb')
url = ('http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')
sha1 = '01ada507287d82875905620988597833ad4e0903'
fname = gutils.download(url, data_dir, sha1_hash=sha1)
with tarfile.open(fname, 'r') as f:
f.extractall(data_dir)
return imdb_dir
imdb_dir = download_imdb()
gb.download_imdb()
```
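
As an illustrative aside (not part of this commit), the extracted layout can be checked; the standard aclImdb archive contains `train` and `test` folders, each with `pos` and `neg` subfolders of individual review files.

```{.python .input}
import os  # os may already be imported at the top of this notebook

# List the top level of the extracted dataset; expect entries such as
# 'train' and 'test'.
sorted(os.listdir(os.path.join('../data', 'aclImdb')))
```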

Next, we read the training and test datasets.

```{.python .input n=3}
def readIMDB(dir_url, seg='train'):
pos_or_neg = ['pos', 'neg']
data = []
for label in pos_or_neg:
files = os.listdir(os.path.join('../data/',dir_url, seg, label))
for file in files:
with open(os.path.join('../data/',dir_url, seg, label, file), 'r',
encoding='utf8') as rf:
review = rf.read().replace('\n', '')
if label == 'pos':
data.append([review, 1])
elif label == 'neg':
data.append([review, 0])
return data
train_data = readIMDB('aclImdb', 'train')
test_data = readIMDB('aclImdb', 'test')
train_data = gb.read_imdb('aclImdb', 'train')
test_data = gb.read_imdb('aclImdb', 'test')
random.shuffle(train_data)
random.shuffle(test_data)
```
@@ -93,32 +67,15 @@ random.shuffle(test_data)
Next, we tokenize each review to obtain tokenized reviews. Here we use the simplest method: splitting on whitespace. We will explore other tokenization methods in the exercises of this section.

```{.python .input n=4}
def tokenizer(text):
return [tok.lower() for tok in text.split(' ')]
train_tokenized = []
for review, score in train_data:
train_tokenized.append(tokenizer(review))
test_tokenized = []
for review, score in test_data:
test_tokenized.append(tokenizer(review))
train_tokenized, test_tokenized = gb.get_tokenized_imdb(train_data, test_data)
```

## Creating the Vocabulary

Now we can create a vocabulary from the tokenized training dataset. Here we define the special token "<unk>" (unknown), which represents any word that does not appear in the vocabulary of the training dataset.

```{.python .input n=5}
token_counter = collections.Counter()
def count_token(train_tokenized):
for sample in train_tokenized:
for token in sample:
if token not in token_counter:
token_counter[token] = 1
else:
token_counter[token] += 1
count_token(train_tokenized)
token_counter = gb.count_tokens(train_tokenized)
vocab = text.vocab.Vocabulary(token_counter, unknown_token='<unk>',
reserved_tokens=None)
```
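
As an illustrative aside (not part of this commit), the resulting `Vocabulary` can be queried directly; the word 'movie' is assumed to occur in the training reviews.

```{.python .input}
# Index 0 is reserved for the unknown token '<unk>'; all other indices map to
# words seen in the training data.
len(vocab), vocab.token_to_idx['movie'], vocab.idx_to_token[:3]
```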
@@ -128,37 +85,8 @@ vocab = text.vocab.Vocabulary(token_counter, unknown_token='<unk>',
Next, we continue preprocessing the data. Each variable-length review is padded with the special token `PAD` into a sequence of length `maxlen` and represented as an NDArray. Since the model uses a max-pooling layer, which keeps only the largest value after convolution, padding with zeros does not affect the result.

```{.python .input n=6}
def encode_samples(tokenized_samples, vocab):
features = []
for sample in tokenized_samples:
feature = []
for token in sample:
if token in vocab.token_to_idx:
feature.append(vocab.token_to_idx[token])
else:
feature.append(0)
features.append(feature)
return features
def pad_samples(features, maxlen=500, PAD=0):
padded_features = []
for feature in features:
if len(feature) > maxlen:
padded_feature = feature[:maxlen]
else:
padded_feature = feature
        # Append the PAD token so that every sequence has the same length (maxlen).
while len(padded_feature) < maxlen:
padded_feature.append(PAD)
padded_features.append(padded_feature)
return padded_features
train_features = encode_samples(train_tokenized, vocab)
test_features = encode_samples(test_tokenized, vocab)
train_features = nd.array(pad_samples(train_features, 1000, 0))
test_features = nd.array(pad_samples(test_features, 1000, 0))
train_labels = nd.array([score for _, score in train_data])
test_labels = nd.array([score for _, score in test_data])
train_features, test_features, train_labels, test_labels = gb.preprocess_imdb(
train_tokenized, test_tokenized, train_data, test_data, vocab)
```

## Loading Pretrained Word Vectors
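
The body of this section is collapsed in the diff view. For context, here is a minimal sketch (not part of this commit) of how pretrained vectors are typically attached to the vocabulary with `mxnet.contrib.text`; the 100-dimensional `glove.6B.100d.txt` file name is an assumption and may differ from what the book uses.

```{.python .input}
from mxnet.contrib import text  # may already be imported at the top of this notebook

# Load pretrained 100-dimensional GloVe vectors for the words in `vocab`;
# words missing from GloVe receive zero vectors by default.
glove_embedding = text.embedding.create(
    'glove', pretrained_file_name='glove.6B.100d.txt', vocabulary=vocab)
glove_embedding.idx_to_vec.shape
```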
@@ -247,6 +175,7 @@ class TextCNN(nn.Block):
            setattr(self, 'pool_{i}', pool)  # Set self.pool_{i} to the i-th pooling layer
self.dropout = nn.Dropout(0.5)
self.decoder = nn.Dense(num_outputs)
def forward(self, inputs):
        # inputs has shape (batch_size, sentence length); transpose to (sentence length, batch_size)
inputs = inputs.T
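
Only fragments of the TextCNN definition are visible in this hunk. For orientation, below is a minimal self-contained sketch (not part of this commit) of the textCNN idea those fragments implement: parallel `Conv1D` branches with different kernel widths, each followed by global max pooling, then concatenation, dropout, and a dense output layer. All names and hyperparameters here are illustrative assumptions; the book's actual model may differ (for example, it can combine a trainable embedding with a fixed pretrained one).

```{.python .input}
# Illustrative sketch only: a compact textCNN, not the code of this commit.
from mxnet import nd
from mxnet.gluon import nn

class TextCNNSketch(nn.Block):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 num_outputs, **kwargs):
        super(TextCNNSketch, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # One Conv1D branch per kernel size; each captures n-grams of that width.
        self.convs = nn.Sequential()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.add(nn.Conv1D(c, k, activation='relu'))
        self.pool = nn.GlobalMaxPool1D()
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Dense(num_outputs)

    def forward(self, inputs):
        # inputs: (batch_size, seq_len) -> embeddings: (batch_size, seq_len, embed_size)
        embeddings = self.embedding(inputs)
        # Conv1D expects the layout (batch_size, channels, seq_len).
        embeddings = embeddings.transpose((0, 2, 1))
        # Each branch yields (batch_size, channels_i, 1) after pooling; flatten
        # each result and concatenate along the channel dimension.
        encoding = nd.concat(*[nd.flatten(self.pool(conv(embeddings)))
                               for conv in self.convs], dim=1)
        return self.decoder(self.dropout(encoding))

net_sketch = TextCNNSketch(len(vocab), 100, kernel_sizes=[3, 4, 5],
                           num_channels=[100, 100, 100], num_outputs=2)
net_sketch.initialize()
net_sketch(nd.ones((4, 50))).shape  # expect (4, 2)
```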
@@ -327,17 +256,11 @@ gb.train(train_loader, test_loader, net, loss, trainer, ctx, num_epochs)
Next, we use the trained model to classify the sentiment of two simple sentences.

```{.python .input}
def get_sentiment(vocab, sentence):
sentence = nd.array([vocab.token_to_idx[token] for token in sentence],
ctx=gb.try_gpu())
label = nd.argmax(net(nd.reshape(sentence, shape=(1, -1))), axis=1)
return 'positive' if label.asscalar() == 1 else 'negative'
get_sentiment(vocab, ['i', 'think', 'this', 'movie', 'is', 'great'])
gb.predict_sentiment(net, vocab, ['i', 'think', 'this', 'movie', 'is', 'great'])
```

```{.python .input}
get_sentiment(vocab, ['the', 'show', 'is', 'terribly', 'boring'])
gb.predict_sentiment(net, vocab, ['the', 'show', 'is', 'terribly', 'boring'])
```

## Summary
@@ -349,13 +272,7 @@

## Exercises

* Use the full IMDb dataset and set the number of epochs to 5. What accuracy can your model reach on the training and test datasets? Can you improve classification accuracy further by tuning hyperparameters?

* Can you improve classification accuracy by using larger pretrained word vectors, such as 300-dimensional GloVe vectors?

* Can you improve classification accuracy by using the spaCy tokenizer? You need to install spaCy (`pip install spacy`) and its English package (`python -m spacy download en`). In the code, first import spaCy (`import spacy`), then load the English package (`spacy_en = spacy.load('en')`), and finally define `def tokenizer(text): return [tok.text for tok in spacy_en.tokenizer(text)]` to replace the original whitespace-based `tokenizer` function. Note that GloVe stores noun phrases by joining the words with "-", so the phrase "new york" is represented as "new-york" in GloVe, whereas after spaCy tokenization "new york" may remain as "new york".

* With the three methods above, can you raise the model's accuracy on the test set above 0.87?
* Using the three methods introduced in the exercises of the previous section (tuning hyperparameters, using larger pretrained word vectors, and using the spaCy tokenizer), can you raise the model's accuracy on the test set above 0.87?



123 changes: 65 additions & 58 deletions chapter_natural-language-processing/sentiment-analysis.md
@@ -36,39 +36,35 @@ import tarfile

```{.python .input n=4}
def download_imdb(data_dir='../data'):
"""Download the IMDb Dataset."""
imdb_dir = os.path.join(data_dir, 'aclImdb')
url = ('http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')
sha1 = '01ada507287d82875905620988597833ad4e0903'
fname = gutils.download(url, data_dir, sha1_hash=sha1)
with tarfile.open(fname, 'r') as f:
f.extractall(data_dir)
return imdb_dir
imdb_dir = download_imdb()
download_imdb()
```

Next, we read the training and test datasets.

```{.python .input n=5}
def readIMDB(dir_url, seg='train'):
def read_imdb(dir_url, seg='train'):
pos_or_neg = ['pos', 'neg']
data = []
for label in pos_or_neg:
files = os.listdir(os.path.join('../data/',dir_url, seg, label))
files = os.listdir(os.path.join('../data/', dir_url, seg, label))
for file in files:
with open(os.path.join('../data/',dir_url, seg, label, file), 'r',
encoding='utf8') as rf:
with open(os.path.join('../data/', dir_url, seg, label, file),
'r', encoding='utf8') as rf:
review = rf.read().replace('\n', '')
if label == 'pos':
data.append([review, 1])
elif label == 'neg':
data.append([review, 0])
return data
train_data = readIMDB('aclImdb', 'train')
test_data = readIMDB('aclImdb', 'test')
train_data = read_imdb('aclImdb', 'train')
test_data = read_imdb('aclImdb', 'test')
random.shuffle(train_data)
random.shuffle(test_data)
```
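
An illustrative peek (not part of this commit): each element of the data is a `[review_text, label]` pair, where the label is 1 for a positive review and 0 for a negative one.

```{.python .input}
# Show the label and the first 60 characters of one (shuffled) training review.
train_data[0][1], train_data[0][0][:60]
```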
@@ -78,32 +74,37 @@ random.shuffle(test_data)
Next, we tokenize each review to obtain tokenized reviews. Here we use the simplest method: splitting on whitespace. We will explore other tokenization methods in the exercises of this section.

```{.python .input n=6}
def tokenizer(text):
return [tok.lower() for tok in text.split(' ')]
train_tokenized = []
for review, score in train_data:
train_tokenized.append(tokenizer(review))
test_tokenized = []
for review, score in test_data:
test_tokenized.append(tokenizer(review))
def get_tokenized_imdb(train_data, test_data):
def tokenizer(text):
return [tok.lower() for tok in text.split(' ')]
train_tokenized = []
for review, score in train_data:
train_tokenized.append(tokenizer(review))
test_tokenized = []
for review, score in test_data:
test_tokenized.append(tokenizer(review))
return train_tokenized, test_tokenized
train_tokenized, test_tokenized = get_tokenized_imdb(train_data, test_data)
```
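
An illustrative peek (not part of this commit) at the first few tokens of one tokenized review:

```{.python .input}
# Each element of train_tokenized is a list of lowercased tokens.
train_tokenized[0][:10]
```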

## Creating the Vocabulary

Now we can create a vocabulary from the tokenized training dataset. Here we define the special token "&lt;unk&gt;" (unknown), which represents any word that does not appear in the vocabulary of the training dataset.

```{.python .input n=7}
token_counter = collections.Counter()
def count_token(train_tokenized):
for sample in train_tokenized:
def count_tokens(samples):
token_counter = collections.Counter()
for sample in samples:
for token in sample:
if token not in token_counter:
token_counter[token] = 1
else:
token_counter[token] += 1
return token_counter
count_token(train_tokenized)
token_counter = count_tokens(train_tokenized)
vocab = text.vocab.Vocabulary(token_counter, unknown_token='<unk>',
reserved_tokens=None)
```
@@ -113,37 +114,43 @@ vocab = text.vocab.Vocabulary(token_counter, unknown_token='<unk>',
Next, we continue preprocessing the data. Each variable-length review is padded with the special token `PAD` into a sequence of length `maxlen` and represented as an NDArray.

```{.python .input n=8}
def encode_samples(tokenized_samples, vocab):
features = []
for sample in tokenized_samples:
feature = []
for token in sample:
if token in vocab.token_to_idx:
feature.append(vocab.token_to_idx[token])
def preprocess_imdb(train_tokenized, test_tokenized, train_data, test_data,
vocab):
def encode_samples(tokenized_samples, vocab):
features = []
for sample in tokenized_samples:
feature = []
for token in sample:
if token in vocab.token_to_idx:
feature.append(vocab.token_to_idx[token])
else:
feature.append(0)
features.append(feature)
return features
def pad_samples(features, maxlen=500, PAD=0):
padded_features = []
for feature in features:
if len(feature) > maxlen:
padded_feature = feature[:maxlen]
else:
feature.append(0)
features.append(feature)
return features
def pad_samples(features, maxlen=500, PAD=0):
padded_features = []
for feature in features:
if len(feature) > maxlen:
padded_feature = feature[:maxlen]
else:
padded_feature = feature
        # Append the PAD token so that every sequence has the same length (maxlen).
while len(padded_feature) < maxlen:
padded_feature.append(PAD)
padded_features.append(padded_feature)
return padded_features
train_features = encode_samples(train_tokenized, vocab)
test_features = encode_samples(test_tokenized, vocab)
train_features = nd.array(pad_samples(train_features, 500, 0))
test_features = nd.array(pad_samples(test_features, 500, 0))
train_labels = nd.array([score for _, score in train_data])
test_labels = nd.array([score for _, score in test_data])
padded_feature = feature
        # Append the PAD token so that every sequence has the same length (maxlen).
while len(padded_feature) < maxlen:
padded_feature.append(PAD)
padded_features.append(padded_feature)
return padded_features
train_features = encode_samples(train_tokenized, vocab)
test_features = encode_samples(test_tokenized, vocab)
train_features = nd.array(pad_samples(train_features, 500, 0))
test_features = nd.array(pad_samples(test_features, 500, 0))
train_labels = nd.array([score for _, score in train_data])
test_labels = nd.array([score for _, score in test_data])
return train_features, test_features, train_labels, test_labels
train_features, test_features, train_labels, test_labels = preprocess_imdb(
train_tokenized, test_tokenized, train_data, test_data, vocab)
```
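
As an illustrative sanity check (not part of this commit), the shapes can be inspected; with `maxlen=500`, each feature matrix should have 500 columns and one row per review.

```{.python .input}
# Expect (number of training reviews, 500), (number of test reviews, 500),
# and 1-D label vectors of matching length.
train_features.shape, test_features.shape, train_labels.shape, test_labels.shape
```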

## Loading Pretrained Word Vectors
@@ -220,17 +227,17 @@ gb.train(train_loader, test_loader, net, loss, trainer, ctx, num_epochs)
Next, we use the trained model to classify the sentiment of two simple sentences.

```{.python .input n=18}
def get_sentiment(vocab, sentence):
def predict_sentiment(net, vocab, sentence):
sentence = nd.array([vocab.token_to_idx[token] for token in sentence],
ctx=gb.try_gpu())
label = nd.argmax(net(nd.reshape(sentence, shape=(1, -1))), axis=1)
return 'positive' if label.asscalar() == 1 else 'negative'
get_sentiment(vocab, ['i', 'think', 'this', 'movie', 'is', 'great'])
predict_sentiment(net, vocab, ['i', 'think', 'this', 'movie', 'is', 'great'])
```

```{.python .input}
get_sentiment(vocab, ['the', 'show', 'is', 'terribly', 'boring'])
predict_sentiment(net, vocab, ['the', 'show', 'is', 'terribly', 'boring'])
```

## Summary
@@ -244,7 +251,7 @@

* Can you improve classification accuracy by using larger pretrained word vectors, such as 300-dimensional GloVe vectors?

* Can you improve classification accuracy by using the spacy tokenizer? You need to install spacy (`pip install spacy`) and its English package (`python -m spacy download en`). In the code, first import spacy (`import spacy`), then load the English package (`spacy_en = spacy.load('en')`), and finally define `def tokenizer(text): return [tok.text for tok in spacy_en.tokenizer(text)]` to replace the original whitespace-based `tokenizer` function. Note that GloVe stores noun phrases by joining the words with "-", so the phrase "new york" is represented as "new-york" in GloVe, whereas after spacy tokenization "new york" may remain as "new york".
* Can you improve classification accuracy by using the spaCy tokenizer? You need to install spaCy (`pip install spacy`) and its English package (`python -m spacy download en`). In the code, first import spaCy (`import spacy`), then load the English package (`spacy_en = spacy.load('en')`), and finally define `def tokenizer(text): return [tok.text for tok in spacy_en.tokenizer(text)]` to replace the original whitespace-based `tokenizer` function (a sketch follows this exercise list). Note that GloVe stores noun phrases by joining the words with "-", so the phrase "new york" is represented as "new-york" in GloVe, whereas after spaCy tokenization "new york" may remain as "new york".

* With the three methods above, can you raise the model's accuracy on the test set above 0.85?
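
A minimal sketch of the spaCy-based tokenizer described in the exercise above, assuming spaCy 2.x and that the English package has been downloaded as instructed:

```{.python .input}
# Requires: pip install spacy && python -m spacy download en
import spacy

spacy_en = spacy.load('en')

def tokenizer(text):
    # Return the raw token strings produced by spaCy's tokenizer.
    return [tok.text for tok in spacy_en.tokenizer(text)]

tokenizer('new york is a big city')
```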
