word mover's distance #92
Comments
There is an emdist package for calculating Earth mover's distance. Not sure it uses the same algorithm as FastEMD.
emdist looks really interesting, but it doesn't support matrices with > 4 dimensions, e.g.:
set.seed(1)
x <- matrix(runif(5), nrow=1)
y <- matrix(runif(5), nrow=1)
emdist::emd(x, y)
vs
set.seed(1)
x <- matrix(runif(10), nrow=1)
y <- matrix(runif(10), nrow=1)
emdist::emd(x, y)
@dselivanov Below are my thoughts, without having checked the source code (so I may be wrong).
http://robotics.stanford.edu/~rubner/emd/default.htm
@pommedeterresautee I saw the comments, but I'm still not sure about the algorithm... Anyway, I think it will not be too hard to port FastEMD.
@dselivanov yep, there are obviously several versions of the algo implemented. Maybe taking FastEMD is the easiest option indeed. You will be sure of the quality.
@pommedeterresautee I haven't tried EMD on fastsent vectors, but that seems like a really interesting idea. I guess there are 2 ways to use EMD:
I guess 2 is a special case of 1
Just re-read the paper. Looks really interesting. Also, I think we can focus on the relaxed word mover's distance (RWMD), because of the following:
@zachmayer did you use Euclidean distance or cosine?
I tried both Euclidean distance and cosine distance. Got much better results from cosine distance and word mover's distance than from Euclidean.
I was confused by the fact that the authors used Euclidean distance.
I agree. They should have compared to cosine distance.
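To make the Euclidean vs cosine question concrete, here is a minimal sketch in plain Python. The vectors are made-up toy values, not real word embeddings, and the function names are just for illustration. It shows the one difference that matters here: two vectors pointing in the same direction but with different lengths are at cosine distance 0, while their Euclidean distance is not 0.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance; sensitive to vector length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between the vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors (assumption): v is just u scaled by 2.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]

print("euclidean:", euclidean_distance(u, v))
print("cosine:   ", cosine_distance(u, v))
```

If the embeddings are L2-normalized first, the two measures rank pairs the same way, so the gap between them mostly shows up with unnormalized vectors.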
@zachmayer I have pretty efficient code for RWMD. Can you suggest any test cases / good datasets? I found these two examples: blog post and gensim wmd tutorial.
Those examples look great to me. The "President greets the press in Chicago" sentence is the one I thought of off the top of my head.
http://www.cis.upenn.edu/~ccb/ppdb/ (there are many more if you are interested). This repository should provide you lots of stuff to work on:
For what it's worth, I also implemented it (in Python) and got nice results.
Maybe the Stanford question answering dataset? Probably not the perfect use case, but could be interesting to try out: https://twitter.com/stanfordnlp/status/744539496230707200
The new Stanford dataset is about finding the correct answer inside a provided document. As the whole document has the same subject as the question, I imagine a simple algo won't work well unless it's trained to perform this specific task. Paraphrase detection seems to be the task we are trying to solve, and it would make sense for a future text2vec vignette.
RWMD and a bunch of other distances are now in the 0.4 branch - see I don't have tutorials / use cases yet. I only had quick tests on
This is awesome, thank you!
@dselivanov regarding your findings, as said before, even using the full WMD I don't see a big difference compared with simply averaging the word vectors. However, I have not compared word embedding averaging and... SVD!
RWMD in master. Closing, see
Linear-Complexity Relaxed Word Mover’s Distance - https://arxiv.org/abs/1711.07227
See the paper and the gensim implementation.
I've found this method to do a better job of measuring "distance" between documents than summing the word vectors and comparing.
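For anyone landing here later, a rough pure-Python sketch of why this can happen. It implements the relaxed WMD as I understand it from the WMD paper: each word sends all of its weight to the nearest word in the other document, and the larger of the two one-sided costs is taken. This is not the text2vec or gensim code, and the 2-D embeddings and helper names below are hypothetical toy values chosen so that the averaged vectors collide.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nbow(tokens):
    """Normalized bag-of-words weights: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def one_sided_cost(weights_from, words_to, emb):
    """Each source word moves all of its mass to its nearest target word."""
    return sum(
        weight * min(euclidean(emb[word], emb[other]) for other in words_to)
        for word, weight in weights_from.items()
    )

def rwmd(doc_a, doc_b, emb):
    """Relaxed WMD: the larger of the two one-sided relaxations."""
    wa, wb = nbow(doc_a), nbow(doc_b)
    return max(one_sided_cost(wa, wb, emb), one_sided_cost(wb, wa, emb))

def centroid_distance(doc_a, doc_b, emb):
    """Distance between the weighted average word vectors of two documents."""
    def centroid(tokens):
        weights = nbow(tokens)
        dim = len(emb[tokens[0]])
        return [sum(w * emb[t][k] for t, w in weights.items()) for k in range(dim)]
    return euclidean(centroid(doc_a), centroid(doc_b))

# Hypothetical 2-D "embeddings" (assumption, for illustration only).
toy_emb = {
    "left": [-1.0, 0.0], "right": [1.0, 0.0],
    "up": [0.0, 1.0], "down": [0.0, -1.0],
}
doc_a = ["left", "right"]
doc_b = ["up", "down"]

avg_dist = centroid_distance(doc_a, doc_b, toy_emb)  # both averages sit at the origin
rwmd_dist = rwmd(doc_a, doc_b, toy_emb)              # every word is still far from the other doc
print("centroid distance:", avg_dist)
print("RWMD:             ", rwmd_dist)
```

The two toy documents share no nearby words, yet their averaged vectors are identical, so the averaging approach reports no difference while RWMD does. Real embeddings will rarely cancel this cleanly, but the same loss of information is there in a milder form.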