This repository has been archived by the owner on May 18, 2023. It is now read-only.

Similarities between documents and query may be >1 #3

Open
hrs opened this issue Mar 21, 2016 · 2 comments

@hrs
Owner

hrs commented Mar 21, 2016

The README claims that similarities between documents and queries shouldn't be greater than 1. However:

import tfidf

# Build a small corpus of three documents, then score a query against it.
table = tfidf.tfidf()
table.addDocument("foo", ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"])
table.addDocument("bar", ["alpha", "bravo", "charlie", "india", "juliet", "kilo"])
table.addDocument("baz", ["kilo", "lima", "mike", "november"])
print(table.similarities(["alpha", "bravo", "charlie", "india"]))

Yields [['foo', 0.5625], ['bar', 1.0416666666666665], ['baz', 0.0]]. Whoops!

This is happening because the query isn't being normalized. The ranking of results should still be correct, but it'd be better to normalize the query so we can make guarantees about the range of the output.
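To illustrate how normalization gives a guaranteed bound, here is a minimal sketch. It is not this library's code: it assumes raw term counts as weights (no IDF), and the helpers `term_counts`, `l2_normalize`, and `cosine_similarity` are hypothetical names introduced only for this example. Scaling both the query and the document to unit length before taking the dot product should keep every score between 0 and 1.

```python
import math


def term_counts(words):
    """Count occurrences of each word (hypothetical helper, raw counts only)."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts


def l2_normalize(weights):
    """Scale a {term: weight} dict to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)  # empty or all-zero vector: nothing to scale
    return {term: w / norm for term, w in weights.items()}


def cosine_similarity(query_weights, doc_weights):
    """Dot product of the two unit-length vectors; in [0, 1] for non-negative weights."""
    q = l2_normalize(query_weights)
    d = l2_normalize(doc_weights)
    return sum(w * d[term] for term, w in q.items() if term in d)


# Same corpus and query as in the report above.
documents = [
    ("foo", ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]),
    ("bar", ["alpha", "bravo", "charlie", "india", "juliet", "kilo"]),
    ("baz", ["kilo", "lima", "mike", "november"]),
]
query = ["alpha", "bravo", "charlie", "india"]

scores = [
    [name, cosine_similarity(term_counts(query), term_counts(words))]
    for name, words in documents
]
print(scores)
```

In this sketch "bar" still ranks above "foo" and "baz" still scores zero, so the ordering from the report is preserved while the values stay bounded. The actual scores differ from the library's because the weighting is different.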

@hrs hrs self-assigned this Mar 21, 2016
@tianye2856

I'm running into the same problem. Please fix it, thanks.

@shanalikhan

What solution did you end up using to fix this?
