vivanov879/word2vec


Torch implementation of word2vec and sentiment analysis, as explained in assignment1.pdf. The word2vec model is trained with a negative-sampling loss function and max-margin learning.
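
As a rough sketch of how these two ideas fit together (illustrative names only; the actual loss lives in word2vec.lua): for a true (center, outer) pair the model's distance z_pos should be smaller, by at least a margin, than the distance z_neg to a randomly sampled negative word.

-- Hedged sketch of negative sampling + max-margin (hinge) loss.
-- z_pos: distance for a true (center, context) pair,
-- z_neg: distance for (center, random negative word); names are illustrative.
local margin = 1.0
local function max_margin_loss(z_pos, z_neg)
  -- zero loss once the true pair is closer than the negative by the margin
  return math.max(0, margin + z_pos - z_neg)
end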

  • In Terminal.app, run python extract_datasets_for_torch.py to generate the datasets for sentiment_analysis.lua.
  • Run th filter_sentences.lua to build a vocabulary of words and then write the sentences to file using this vocabulary.
  • Run th word2vec.lua to train the word2vec model. You do not need a complex multi-layer network to generate embeddings here; the task is simply to separate word embeddings based on co-occurrence. If we had documents here, we would need a more complex system, for example a convnet over documents, to generate the embeddings, because there is no obvious way to combine multiple documents into a simple embedding. Here we simply have a vocabulary of words and map an embedding to each word. You can try it yourself: add several layers to the network and use the very last layer (the layer before the difference between the embeddings is put into z), some middle layer, or the embedding layer as the output embedding for the words. Getting embeddings from a single embedding layer is logical, because the model can adjust each value in the embedding matrix to pull words together or apart.
  • Run th visualize_word_vectors.lua to project several words into 2D space using SVD and then plot these words (see the sketch after this list). You'll get something like Result_example.png.
  • Run th gen_sentiment_files.lua to generate training and dev sets of words and their labels.
  • Run th sentiment_analysis.lua to train a sentiment analysis model.
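
The SVD projection used for visualization can be sketched as follows (assuming emb is the vocab_size x 100 matrix taken from the trained embedding layer's weights; the exact names in visualize_word_vectors.lua may differ):

-- Project word embeddings to 2-D with a thin SVD (a sketch, not the repo's exact code).
require 'torch'
local u, s, v = torch.svd(emb)            -- emb: vocab_size x 100
local coords = u[{ {}, {1, 2} }]:clone()  -- first two left singular vectors
coords[{ {}, 1 }]:mul(s[1])               -- scale by the singular values
coords[{ {}, 2 }]:mul(s[2])
-- coords[i] now holds the 2-D position at which word i is plotted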

You can use the program with your own text. filter_sentences.lua works with tokenized input text. You can tokenize your text using https://github.com/moses-smt/mosesdecoder by running:

cd ~/mosesdecoder/scripts/tokenizer/ 
./pre-tokenizer.perl ~/word2vec/trainSentences_raw | ./lowercase.perl | ./tokenizer.perl > ~/word2vec/trainSentences

One of the models tried in word2vec.lua and its results:

require 'nn'
require 'nngraph'
-- Embedding(vocab_size, dim) is the lookup-table module used in this repo.

-- Inputs: the index of the center word and of an outer (context) word.
word_center = nn.Identity()()
word_outer = nn.Identity()()

-- Center word: embedding lookup, then a small nonlinear projection to 50 dims.
x_center_ = Embedding(vocab_size, 100)(word_center)
x_center = nn.Linear(100, 50)(x_center_)
x_center = nn.Tanh()(x_center)

-- Outer word: the same architecture with its own parameters.
x_outer_ = Embedding(vocab_size, 100)(word_outer)
x_outer = nn.Linear(100, 50)(x_outer_)
x_outer = nn.Tanh()(x_outer)

-- z = squared Euclidean distance between the two projections.
x_center_minus = nn.MulConstant(-1)(x_center)
z = nn.CAddTable()({x_outer, x_center_minus})
z = nn.Power(2)(z)
z = nn.Sum(2)(z)

-- Expose the raw embedding outputs alongside z.
m = nn.gModule({word_center, word_outer}, {z, x_outer_, x_center_})
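
A minimal smoke test of this graph (a sketch; vocab_size is a placeholder, and the expected input shape depends on how Embedding is implemented, here assumed to be a 1-D tensor of word indices):

vocab_size = 1000  -- placeholder; set from the real vocabulary
-- ... build m as above ...
local center = torch.Tensor{5}   -- index of a center word
local outer  = torch.Tensor{42}  -- index of an outer (context) word
local z, x_outer_, x_center_ = unpack(m:forward({center, outer}))
print(z)  -- one squared distance per input word pair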

Example output of th word2vec.lua after many iterations:

loss = 0.11914144, grad_params:norm() = 2.9512e-03, params:norm() = 9.8262e+01	
loss = 0.11778496, grad_params:norm() = 3.2691e-03, params:norm() = 9.8274e+01	
loss = 0.11646408, grad_params:norm() = 2.9714e-03, params:norm() = 9.8286e+01	
loss = 0.11348016, grad_params:norm() = 2.8652e-03, params:norm() = 9.8298e+01	
loss = 0.11632667, grad_params:norm() = 3.1923e-03, params:norm() = 9.8311e+01	
loss = 0.11513715, grad_params:norm() = 3.1598e-03, params:norm() = 9.8323e+01	
loss = 0.11461787, grad_params:norm() = 2.8214e-03, params:norm() = 9.8335e+01	
loss = 0.11431365, grad_params:norm() = 3.1080e-03, params:norm() = 9.8347e+01	
loss = 0.12215819, grad_params:norm() = 4.3113e-03, params:norm() = 9.8360e+01	
loss = 0.11354396, grad_params:norm() = 3.1466e-03, params:norm() = 9.8372e+01	
loss = 0.11717846, grad_params:norm() = 3.3619e-03, params:norm() = 9.8385e+01	
loss = 0.11574432, grad_params:norm() = 2.9117e-03, params:norm() = 9.8397e+01	
loss = 0.11931591, grad_params:norm() = 3.1105e-03, params:norm() = 9.8410e+01	
loss = 0.11486262, grad_params:norm() = 2.3376e-03, params:norm() = 9.8422e+01	
loss = 0.11434479, grad_params:norm() = 2.6629e-03, params:norm() = 9.8435e+01	
loss = 0.11297270, grad_params:norm() = 3.0902e-03, params:norm() = 9.8447e+01	

Output of th sentiment_analysis.lua after a number of iterations; the goal was to achieve a precision of at least 0.3 on the dev set:

train set: loss = 1.55104364, f1_score = 0.18394721, precision = 0.34597694, recall = 0.22747006, grad_params:norm() = 6.7158e-01, params:norm() = 6.3476e+01	
dev set:   loss = 1.56834475, f1_score = 0.16801552, precision = 0.30563835, recall = 0.21562503	
success	
train set: loss = 1.55557282, f1_score =    nan, precision = 0.29000712, recall = 0.23968426, grad_params:norm() = 8.7115e-01, params:norm() = 6.3757e+01	
dev set:   loss = 1.56860018, f1_score =    nan, precision = 0.24876680, recall = 0.21427333	
train set: loss = 1.54965129, f1_score = 0.17010089, precision = 0.35540269, recall = 0.21828832, grad_params:norm() = 2.2832e-01, params:norm() = 6.4020e+01	
dev set:   loss = 1.56864349, f1_score = 0.16472635, precision = 0.28914897, recall = 0.21383932	
train set: loss = 1.55040840, f1_score = 0.18304011, precision = 0.35890092, recall = 0.22776055, grad_params:norm() = 7.1540e-01, params:norm() = 6.4286e+01	
dev set:   loss = 1.56893610, f1_score = 0.17009003, precision = 0.30058033, recall = 0.21598218	
success	
train set: loss = 1.55578923, f1_score =    nan, precision = 0.27823843, recall = 0.23775610, grad_params:norm() = 8.7676e-01, params:norm() = 6.4530e+01	
dev set:   loss = 1.56943533, f1_score =    nan, precision = 0.24482958, recall = 0.21237577	
train set: loss = 1.54900364, f1_score = 0.17349076, precision = 0.36970442, recall = 0.22035081, grad_params:norm() = 2.1700e-01, params:norm() = 6.4779e+01	
dev set:   loss = 1.56876765, f1_score = 0.16603706, precision = 0.28586120, recall = 0.21404013	
train set: loss = 1.55048328, f1_score = 0.18201624, precision = 0.35189609, recall = 0.22688347, grad_params:norm() = 7.2782e-01, params:norm() = 6.5017e+01	
dev set:   loss = 1.56869145, f1_score = 0.16786719, precision = 0.30135164, recall = 0.21560279	
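
For reference, a sketch of how such per-class metrics behave (assumed definitions; see sentiment_analysis.lua for the actual computation). The nan values above are consistent with a 0/0 arising when some class receives no predictions:

-- Sketch only: standard precision/recall/F1 from per-class counts.
local function prf(tp, fp, fn)
  local precision = tp / (tp + fp)  -- 0/0 = nan when nothing was predicted
  local recall    = tp / (tp + fn)  -- 0/0 = nan when the class never occurs
  local f1        = 2 * precision * recall / (precision + recall)
  return precision, recall, f1
end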

Output of th visualize_word_vectors.lua: see Result_example.png.
