Get fuzzy duplicates #72
I could probably create such a function, but it would likely make inefficient calls to `fuzzyjoin` and be quite slow. Someone else could likely write it more elegantly and make it run faster. I only work with small data, so this is less of a problem for me.
I think you have the wrong drob. I'm @danlovesproofs on twitter. (He seems cool though.)

Ha, I didn't realize that pasting a tweet would tag people - doh. Thanks for the heads up.
Coming up with something functional was quicker than I thought. Check out: It's missing much of the functionality above, but it will do the job most of the time. Maybe just start by keeping it simple like this? I will make this a Markdown file to show the output:
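The gist itself isn't captured in this thread. As a rough, language-neutral sketch of the kind of simple approach being described (Python, with the stdlib's `difflib` similarity standing in for R's `fuzzyjoin`; `get_fuzzy_dupes` and the 0.8 threshold are my own hypothetical choices, not the gist's):

```python
from difflib import SequenceMatcher

def get_fuzzy_dupes(values, threshold=0.8):
    """Greedy one-pass clustering of a character vector: each value
    joins the first cluster whose anchor (first member) is similar
    enough, measured by difflib's ratio (1.0 = identical)."""
    clusters = []
    for v in values:
        for cluster in clusters:
            if SequenceMatcher(None, v, cluster[0]).ratio() >= threshold:
                cluster.append(v)
                break
        else:
            clusters.append([v])
    # report only the fuzzy duplicates, stored together
    return [c for c in clusters if len(c) > 1]

print(get_fuzzy_dupes(["banana", "bananna", "apple", "appel", "cherry"]))
```

A stricter threshold returns fewer, tighter clusters; singleton values are dropped from the output, mirroring how `get_dupes` returns only the duplicated records.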
I think multiple columns is too complex. It hurts my head. In which case, this could just take a vector as the input. Does that meet user needs? Edit: as I was falling asleep, I decided that maybe two columns is okay. As long as the …
I have some really crude code related to this (solving the problem I tweeted about). It's not "fuzzy" but approaches the problem from a different angle: it looks for actual matches, but one variable at a time, in a way that guards against label switching. It then forms putative groups/entities by looking for agreement across the variables. That's why I'm interested in this and the gist.
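That agreement-across-variables idea could be sketched like this (not the actual code being referenced; Python, using union-find as one way to form the putative groups, with `putative_groups` and `min_agreement` as invented names):

```python
def putative_groups(records, min_agreement=2):
    """Link two records whenever they agree exactly on at least
    `min_agreement` fields; the connected components of those links
    are the putative groups/entities. Uses a small union-find."""
    n = len(records)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            agreeing = sum(a == b for a, b in zip(records[i], records[j]))
            if agreeing >= min_agreement:
                parent[find(i)] = find(j)  # union the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]
```

Comparing one variable at a time and requiring agreement on several of them is what guards against a single mistyped or switched label splitting an entity.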
Dear @jennybc, Here's how I was thinking about this:
The more general way is to learn the weights on appropriate distance measures for each column from training data, using some supervised technique, and then match the rest.
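A minimal sketch of that supervised idea, assuming record pairs have already been reduced to per-column distance vectors with match/non-match labels (plain logistic regression by gradient descent; `learn_weights` is a hypothetical name and the toy training set is made up):

```python
import math

def learn_weights(features, labels, epochs=500, lr=0.5):
    """Fit logistic-regression weights over per-column distance
    features. Labels are 1 for known matches, 0 for non-matches;
    the learned weight for a column reflects how predictive that
    column's distance is of a true match."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g

    def score(x):
        return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return score

# toy data: a small distance in column 1 signals a match
score = learn_weights([(0.1, 0.9), (0.2, 0.7), (0.9, 0.2), (0.8, 0.1)],
                      [1, 1, 0, 0])
```

Pairs scoring above 0.5 would then be treated as fuzzy duplicates when matching the rest of the data.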
Dear @sfirke: As you already anticipate and indirectly state, there is no one distance measure that will work everywhere; it depends on the kinds of errors you expect, etc. The idea is to generally support lots of straightforward distance measures. Some common ones include various edit distance measures, but also edit distance weighted by typing errors (if the data is typed; if it instead comes from voice recognition etc., different error patterns apply). And if we autodetect dates or numbers in some columns, we could use absolute distance for those. I built some of this stuff from the ground up for merging a shapefile with electoral returns: http://projects.iq.harvard.edu/eda/data
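The "pick a distance per column" idea could look roughly like this (a hedged sketch, not anything from `fuzzyjoin`; the autodetection here is just a `float()` probe, and `levenshtein`/`column_distance` are invented names):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def column_distance(x, y):
    """Dispatch on autodetected type: absolute difference for values
    that parse as numbers, edit distance for everything else."""
    try:
        return abs(float(x) - float(y))
    except (TypeError, ValueError):
        return levenshtein(str(x), str(y))
```

A real implementation would also need date parsing and the typing-error-weighted variants mentioned above; this only shows the dispatch shape.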
If I ever revisit this function, a note to self: consider wrapping refinr, an R port of OpenRefine's value-clustering and merging: https://github.com/ChrisMuir/refinr
Inspired by https://twitter.com/JennyBryan/status/777953052129054720 (as well as @almartin82 and @chrishaid):
Input: a data.frame with column(s) specified, the name of the fuzzy matching algorithm to use (like `fuzzyjoin`), and (optionally) the allowable distance on that metric, to be used as a threshold for returning a record as a duplicate.
Output: Returns just the duplicate records, stored together, similar to `get_dupes`. But it also contains a unique key for each set of duplicates (maybe derived from just the first of the fuzzy duplicates) in a new variable, and the distance from the anchor duplicate [if there are more than 2 duplicates, make one of them the anchor - maybe the most common value, and as a tiebreaker the one closest to the other two]. This would be used to then recode with `plyr::mapvalues` or similar.

That assumes it matters which near-duplicate value survives, in a way that can't be automated. If it doesn't, then also provide a convenience function that swaps the anchor duplicate value into all dupes in a cluster and removes the unique-key and distance columns.
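One way the anchor-and-key output described above could look, as a sketch rather than the eventual janitor implementation (Python; `difflib` similarity stands in for whichever metric is chosen, and `annotate_cluster`/`collapse_cluster` are invented names):

```python
from collections import Counter
from difflib import SequenceMatcher

def dist(a, b):
    # 0.0 = identical; grows with string dissimilarity
    return 1 - SequenceMatcher(None, a, b).ratio()

def annotate_cluster(values):
    """Pick the anchor (most common value; ties broken by smallest
    total distance to the others), then return each value with the
    anchor as its cluster key plus its distance from the anchor."""
    counts = Counter(values)
    top = max(counts.values())
    candidates = [v for v in counts if counts[v] == top]
    anchor = min(candidates, key=lambda v: sum(dist(v, w) for w in values))
    return [(v, anchor, round(dist(v, anchor), 3)) for v in values]

def collapse_cluster(values):
    """The convenience step: swap the anchor value into every dupe,
    dropping the key and distance columns entirely."""
    anchor = annotate_cluster(values)[0][1]
    return [anchor] * len(values)
```

The annotated form supports manual review before recoding (the `plyr::mapvalues` workflow); `collapse_cluster` is the fully automated path for when it doesn't matter which near-duplicate survives.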