
word mover's distance #92

Closed
zachmayer opened this issue Apr 22, 2016 · 21 comments

@zachmayer

See the paper and the gensim implementation.

I've found this method to do a better job of measuring "distance" between documents than summing the word vectors and comparing.

@zachmayer zachmayer changed the title word mover distance word mover's distance Apr 22, 2016
@dselivanov (Owner)

There is an emdist package for calculating Earth Mover's Distance. I'm not sure it uses the same algorithm as FastEMD.

@zachmayer (Author) commented Apr 26, 2016

emdist looks really interesting, but it doesn't support matrices with more than 4 dimensions. For example, this works:

set.seed(1)
# emdist::emd() takes signature matrices: the first column is a weight,
# the remaining columns are coordinates -- here, 4 location dimensions.
x <- matrix(runif(5), nrow = 1)
y <- matrix(runif(5), nrow = 1)
emdist::emd(x, y)  # works

but this fails:

set.seed(1)
# 9 location dimensions after the weight column -- emd() errors out here,
# as emdist appears to support at most 4 location dimensions.
x <- matrix(runif(10), nrow = 1)
y <- matrix(runif(10), nrow = 1)
emdist::emd(x, y)  # fails

@pommedeterresautee commented Apr 26, 2016

@dselivanov Below are my thoughts without checking the source code (so I may be wrong). According to the package PDF:

R code by Simon Urbanek, EMD code by Yossi Rubner.

http://robotics.stanford.edu/~rubner/emd/default.htm
The code on Rubner's website is based on a paper from 1998.

@dselivanov (Owner)

@pommedeterresautee I saw the comments, but I'm still not sure about the algorithm...

Anyway, I think it won't be too hard to port FastEMD.

@pommedeterresautee

@dselivanov yep, there are obviously several versions of the algorithm in circulation, so maybe porting FastEMD is indeed the easiest option; you would be sure of the quality.
@zachmayer have you tried earth mover's distance on FastSent vectors? (I am not sure it makes sense, since in FastSent you are supposed to sum/average word vectors to build the vector representation of a document, but maybe using word mover's distance the way you would with w2v vectors would give interesting results for comparing two documents.)

@zachmayer (Author)

@pommedeterresautee I haven't tried EMD on FastSent vectors, but that seems like a really interesting idea.

I guess there are 2 ways to use EMD:

  1. Compare 2 matrices of word vectors, where the rows are words and the columns are embedding dimensions.
  2. Compare 2 vectors directly, where each vector is a "document" vector of some kind.

I guess 2 is a special case of 1; a toy sketch of both is below.
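
(A toy sketch of the two shapes -- my own illustration, not from the thread -- using emdist's signature format, where the first column is a weight and the remaining columns are coordinates, with made-up 3-dimensional "embeddings" to stay under the 4-dimension limit:)

library(emdist)
set.seed(1)

# Case 1: each document is a set of word vectors (one row per word),
# with uniform weights in the first signature column.
doc1_words <- matrix(runif(9),  nrow = 3)  # 3 words x 3 dims
doc2_words <- matrix(runif(12), nrow = 4)  # 4 words x 3 dims
emd(cbind(rep(1/3, 3), doc1_words),
    cbind(rep(1/4, 4), doc2_words))

# Case 2: one "document" vector per document -- a one-row signature,
# i.e. the degenerate case of the above.
emd(cbind(1, matrix(colMeans(doc1_words), nrow = 1)),
    cbind(1, matrix(colMeans(doc2_words), nrow = 1)))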

@dselivanov (Owner) commented Jun 20, 2016

Just re-read the paper. Looks really interesting. I also think we can focus on the relaxed word mover's distance (RWMD), for the following reasons (see the sketch after this list):

  1. It has much lower computational complexity: O(N^2) vs O(N^3 log N).
  2. RWMD results are very close to WMD.
  3. It is quite trivial to implement efficiently in pure R.

@zachmayer did you use Euclidean distance or cosine?
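
(A minimal pure-R sketch of the relaxation, with hypothetical names -- not the package implementation. Each word's mass moves entirely to its nearest word in the other document, which lower-bounds the full WMD; taking the max over the two directions gives the tighter bound.)

# wv1, wv2: word-vector matrices (words x dims) for the two documents.
# w1, w2:   normalized bag-of-words weights (each sums to 1).
rwmd <- function(wv1, wv2, w1, w2) {
  normalize <- function(m) m / sqrt(rowSums(m^2))
  # Pairwise cosine distances between the two documents' words.
  d <- 1 - tcrossprod(normalize(wv1), normalize(wv2))
  # Relaxation: every word sends all of its mass to its nearest neighbour.
  lb1 <- sum(w1 * apply(d, 1, min))
  lb2 <- sum(w2 * apply(d, 2, min))
  max(lb1, lb2)  # the tighter of the two lower bounds on WMD
}

# Toy usage with random 5-dimensional "embeddings":
set.seed(1)
wv1 <- matrix(rnorm(15), nrow = 3)  # 3 words
wv2 <- matrix(rnorm(20), nrow = 4)  # 4 words
rwmd(wv1, wv2, rep(1/3, 3), rep(1/4, 4))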

@zachmayer (Author)

I tried both Euclidean and cosine distance. I got much better results from word mover's distance with cosine than with Euclidean.

@dselivanov (Owner)

I was confused by the fact that the authors used Euclidean distance.

@zachmayer (Author)

I agree. They should have compared to cosine distance.

@dselivanov dselivanov added this to the 0.4 milestone Jun 20, 2016
@dselivanov dselivanov mentioned this issue Jun 20, 2016
@dselivanov (Owner) commented Jun 22, 2016

@zachmayer I have pretty efficient code for RWMD. Can you suggest any test cases / good datasets? I found these two examples: a blog post and the gensim WMD tutorial.

@zachmayer (Author)

Those examples look great to me. The "President greets the press in Chicago" sentence is the one I thought of off the top of my head.

@pommedeterresautee commented Jun 22, 2016

http://www.cis.upenn.edu/~ccb/ppdb/ (there are many more if you are interested).

This repository should provide you lots of stuff to work on:
https://github.com/brmson/dataset-sts

http://ixa2.si.ehu.es/stswiki/index.php/Main_Page

@pommedeterresautee

For what it's worth, I also implemented it (in Python) and got nice results.
However, it is not that impressive compared to a simple averaged word embedding + cosine.
I also tried the EMD function from a Python package; the results are of similar quality.
Curious to know your opinion, @dselivanov.

@zachmayer (Author)

Maybe the Stanford question answering dataset? Probably not the perfect use case, but it could be interesting to try out: https://twitter.com/stanfordnlp/status/744539496230707200

@pommedeterresautee

The new Stanford dataset is about finding the correct answer inside a provided document. Since the whole document has the same subject as the question, I imagine any simple algorithm won't work well unless it's trained for this specific task.

The paraphrase task seems to be the one we are trying to solve, and it would make sense for a future text2vec vignette.

@dselivanov (Owner)

RWMD and a bunch of other distances are now in the 0.4 branch; see ?dist2 and pdist2 (pdist2 is not finished yet).
I implemented both cosine and Euclidean distances for comparing word vectors in RWMD. Based on my quick tests, RWMD with cosine distance is faster and more accurate, and both variants are much faster than the linked Python implementations (but maybe I missed something). If you use a good BLAS such as OpenBLAS/ATLAS/vecLib, RWMD with cosine distance is parallelised out of the box.

I don't have tutorials or use cases yet. I have only run quick tests on the movie_review dataset. Performance on sentiment analysis using KNN with word embeddings trained on a Wikipedia dump is nearly the same as KNN on LSA (SVD) vectors.
Looking forward to your feedback.
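
(A hypothetical usage sketch of the call shape described above, assuming a document-term matrix dtm and a word-vector matrix wv with matching vocabularies; the exact 0.4-branch signatures may differ -- see ?dist2 and ?RWMD.)

library(text2vec)
# Assumed inputs: dtm (documents x vocabulary), wv (vocabulary x dims).
rwmd_model <- RWMD$new(wv)
# Pairwise relaxed word mover's distances between two sets of documents:
d <- dist2(dtm[1:10, ], dtm[1:100, ], method = rwmd_model, norm = "none")
dim(d)  # 10 x 100 distance matrix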

@dselivanov dselivanov self-assigned this Jun 28, 2016
@zachmayer
Copy link
Author

This is awesome, thank you!

@pommedeterresautee

@dselivanov regarding your findings, as I said before, even using the full WMD I don't see a big difference compared to simply averaging word vectors and comparing (sketched below). However, I have not compared word embedding averaging with... SVD!
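
(For reference, a sketch of the averaged-embedding baseline mentioned here, with hypothetical inputs -- wv is a words-x-dims embedding matrix whose rownames give the vocabulary.)

doc_vector <- function(tokens, wv) {
  # Average the embeddings of the document's in-vocabulary tokens.
  colMeans(wv[intersect(tokens, rownames(wv)), , drop = FALSE])
}
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Toy usage:
set.seed(1)
wv <- matrix(rnorm(50), nrow = 10,
             dimnames = list(paste0("w", 1:10), NULL))
cosine(doc_vector(c("w1", "w2", "w3"), wv),
       doc_vector(c("w2", "w4"), wv))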

dselivanov added a commit that referenced this issue Oct 3, 2016
dselivanov added a commit that referenced this issue Oct 3, 2016
…elated to #92.

@dselivanov (Owner)

RWMD in master. Closing, see ?RWMD.

@dselivanov (Owner)

Linear-Complexity Relaxed Word Mover’s Distance - https://arxiv.org/abs/1711.07227
