
word mover's distance #92

Closed
zachmayer opened this issue Apr 22, 2016 · 21 comments

@zachmayer

See the paper and the gensim implementation.

I've found this method to do a better job of measuring "distance" between documents than summing the word vectors and comparing.

@zachmayer zachmayer changed the title word mover distance word mover's distance Apr 22, 2016
@dselivanov (Owner)

There is an emdist package for calculating Earth Mover's Distance. I'm not sure it uses the same algorithm as FastEMD.

@zachmayer (Author) commented Apr 26, 2016

emdist looks really interesting, but it doesn't support matrices with more than 4 dimensions. For example, this works:

set.seed(1)
# emdist::emd() takes signature matrices: the first column is a weight,
# the remaining columns are coordinates -- here, 4 location dimensions.
x <- matrix(runif(5), nrow = 1)
y <- matrix(runif(5), nrow = 1)
emdist::emd(x, y)  # works

but this fails:

set.seed(1)
# 9 location dimensions after the weight column -- emd() errors out here,
# as emdist appears to support at most 4 location dimensions.
x <- matrix(runif(10), nrow = 1)
y <- matrix(runif(10), nrow = 1)
emdist::emd(x, y)  # fails

@pommedeterresautee commented Apr 26, 2016

@dselivanov Below are my thoughts without checking the source code (so I may be wrong). According to the package PDF:

R code by Simon Urbanek, EMD code by Yossi Rubner.

http://robotics.stanford.edu/~rubner/emd/default.htm
The code on Rubner's website is based on a paper from 1998.

@dselivanov (Owner)

@pommedeterresautee I saw the comments, but I'm still not sure about the algorithm...

Anyway, I think it won't be too hard to port FastEMD.

@pommedeterresautee

@dselivanov yep, there are obviously several versions of the algorithm in circulation, so maybe porting FastEMD is indeed the easiest option; you would be sure of the quality.
@zachmayer have you tried earth mover's distance on FastSent vectors? (I am not sure it makes sense, since in FastSent you are supposed to sum/average word vectors to build the vector representation of a document, but maybe using word mover's distance the way you would with w2v vectors would give interesting results for comparing two documents.)

@zachmayer (Author)

@pommedeterresautee I haven't tried EMD on FastSent vectors, but that seems like a really interesting idea.

I guess there are 2 ways to use EMD:

  1. Compare 2 matrices of word vectors, where the rows are words and the columns are embedding dimensions.
  2. Compare 2 vectors directly, where each vector is a "document" vector of some kind.

I guess 2 is a special case of 1; a toy sketch of both is below.
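
(A toy sketch of the two shapes -- my own illustration, not from the thread -- using emdist's signature format, where the first column is a weight and the remaining columns are coordinates, with made-up 3-dimensional "embeddings" to stay under the 4-dimension limit:)

library(emdist)
set.seed(1)

# Case 1: each document is a set of word vectors (one row per word),
# with uniform weights in the first signature column.
doc1_words <- matrix(runif(9),  nrow = 3)  # 3 words x 3 dims
doc2_words <- matrix(runif(12), nrow = 4)  # 4 words x 3 dims
emd(cbind(rep(1/3, 3), doc1_words),
    cbind(rep(1/4, 4), doc2_words))

# Case 2: one "document" vector per document -- a one-row signature,
# i.e. the degenerate case of the above.
emd(cbind(1, matrix(colMeans(doc1_words), nrow = 1)),
    cbind(1, matrix(colMeans(doc2_words), nrow = 1)))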

@dselivanov (Owner) commented Jun 20, 2016

Just re-read the paper. Looks really interesting. I also think we can focus on the relaxed word mover's distance (RWMD), for the following reasons (see the sketch after this list):

  1. It has much lower computational complexity: O(N^2) vs O(N^3 log N).
  2. RWMD results are very close to WMD.
  3. It is quite trivial to implement efficiently in pure R.

@zachmayer did you use Euclidean distance or cosine?
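
(A minimal pure-R sketch of the relaxation, with hypothetical names -- not the package implementation. Each word's mass moves entirely to its nearest word in the other document, which lower-bounds the full WMD; taking the max over the two directions gives the tighter bound.)

# wv1, wv2: word-vector matrices (words x dims) for the two documents.
# w1, w2:   normalized bag-of-words weights (each sums to 1).
rwmd <- function(wv1, wv2, w1, w2) {
  normalize <- function(m) m / sqrt(rowSums(m^2))
  # Pairwise cosine distances between the two documents' words.
  d <- 1 - tcrossprod(normalize(wv1), normalize(wv2))
  # Relaxation: every word sends all of its mass to its nearest neighbour.
  lb1 <- sum(w1 * apply(d, 1, min))
  lb2 <- sum(w2 * apply(d, 2, min))
  max(lb1, lb2)  # the tighter of the two lower bounds on WMD
}

# Toy usage with random 5-dimensional "embeddings":
set.seed(1)
wv1 <- matrix(rnorm(15), nrow = 3)  # 3 words
wv2 <- matrix(rnorm(20), nrow = 4)  # 4 words
rwmd(wv1, wv2, rep(1/3, 3), rep(1/4, 4))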

@zachmayer (Author)

I tried both Euclidean and cosine distance. I got much better results from word mover's distance with cosine than with Euclidean.

@dselivanov (Owner)

I was confused by the fact that the authors used Euclidean distance.

@zachmayer (Author)

I agree. They should have compared to cosine distance.

@dselivanov dselivanov added this to the 0.4 milestone Jun 20, 2016
@dselivanov dselivanov mentioned this issue Jun 20, 2016
@dselivanov (Owner) commented Jun 22, 2016

@zachmayer I have pretty efficient code for RWMD. Can you suggest any test cases / good datasets? I found these two examples: a blog post and the gensim WMD tutorial.

@zachmayer (Author)

Those examples look great to me. The "President greets the press in Chicago" sentence is the one I thought of off the top of my head.

@pommedeterresautee commented Jun 22, 2016

http://www.cis.upenn.edu/~ccb/ppdb/ (there are many more if you are interested).

This repository should provide you lots of stuff to work on:
https://github.com/brmson/dataset-sts

http://ixa2.si.ehu.es/stswiki/index.php/Main_Page

@pommedeterresautee

For what it's worth, I also implemented it (in Python) and got nice results.
However, it is not that impressive compared to a simple averaged word embedding + cosine.
I also tried the EMD function from a Python package; the results are of similar quality.
Curious to know your opinion, @dselivanov.

@zachmayer (Author)

Maybe the Stanford question answering dataset? Probably not the perfect use case, but it could be interesting to try out: https://twitter.com/stanfordnlp/status/744539496230707200

@pommedeterresautee

The new Stanford dataset is about finding the correct answer inside a provided document. Since the whole document has the same subject as the question, I imagine any simple algorithm won't work well unless it's trained for this specific task.

The paraphrase task seems to be the one we are trying to solve, and it would make sense for a future text2vec vignette.

@dselivanov (Owner)

RWMD and a bunch of other distances are now in the 0.4 branch; see ?dist2 and pdist2 (pdist2 is not finished yet).
I implemented both cosine and Euclidean distances for comparing word vectors in RWMD. Based on my quick tests, RWMD with cosine distance is faster and more accurate, and both variants are much faster than the linked Python implementations (but maybe I missed something). If you use a good BLAS such as OpenBLAS/ATLAS/vecLib, RWMD with cosine distance is parallelised out of the box.

I don't have tutorials or use cases yet. I have only run quick tests on the movie_review dataset. Performance on sentiment analysis using KNN with word embeddings trained on a Wikipedia dump is nearly the same as KNN on LSA (SVD) vectors.
Looking forward to your feedback.
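
(A hypothetical usage sketch of the call shape described above, assuming a document-term matrix dtm and a word-vector matrix wv with matching vocabularies; the exact 0.4-branch signatures may differ -- see ?dist2 and ?RWMD.)

library(text2vec)
# Assumed inputs: dtm (documents x vocabulary), wv (vocabulary x dims).
rwmd_model <- RWMD$new(wv)
# Pairwise relaxed word mover's distances between two sets of documents:
d <- dist2(dtm[1:10, ], dtm[1:100, ], method = rwmd_model, norm = "none")
dim(d)  # 10 x 100 distance matrix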

@dselivanov dselivanov self-assigned this Jun 28, 2016
@zachmayer
Copy link
Author

This is awesome, thank you!

@pommedeterresautee

@dselivanov regarding your findings, as I said before, even using the full WMD I don't see a big difference compared to simply averaging word vectors and comparing (sketched below). However, I have not compared word embedding averaging with... SVD!
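
(For reference, a sketch of the averaged-embedding baseline mentioned here, with hypothetical inputs -- wv is a words-x-dims embedding matrix whose rownames give the vocabulary.)

doc_vector <- function(tokens, wv) {
  # Average the embeddings of the document's in-vocabulary tokens.
  colMeans(wv[intersect(tokens, rownames(wv)), , drop = FALSE])
}
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Toy usage:
set.seed(1)
wv <- matrix(rnorm(50), nrow = 10,
             dimnames = list(paste0("w", 1:10), NULL))
cosine(doc_vector(c("w1", "w2", "w3"), wv),
       doc_vector(c("w2", "w4"), wv))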

dselivanov added a commit that referenced this issue Oct 3, 2016
dselivanov added a commit that referenced this issue Oct 3, 2016
…elated to #92.

@dselivanov (Owner)

RWMD in master. Closing, see ?RWMD.

@dselivanov (Owner)

Linear-Complexity Relaxed Word Mover’s Distance - https://arxiv.org/abs/1711.07227
