Possible to enrich the get_dupes() #546

sapphire75710 · 2023-06-01T07:13:31Z

Feature requests

I am using get_dupes() to detect duplicates in my datasets. The relevant variables are all numeric.

For the sex, age, agelength, LocIDorg, PopulationUniverse, and value variables, the rule is exact matching: two records are duplicates only if these values are identical.

For the year.07 variable, the rule is slightly relaxed: if the difference between two records is smaller than 0.7, then these two records are considered duplicates. Is it possible to have this relaxed rule for duplicate matching in get_dupes()? Please let me know.

# duplicates <- janitor::get_dupes(df, sex, age, agelength, LocIDorg, PopulationUniverse, year.07, value)

sfirke · 2023-06-02T15:47:14Z

Hi! I think this is outside the scope of that function. I'd call this "fuzzy" duplicate identification and in my experience it gets complex quickly, with each case requiring a unique solution. You might try:

- Expanding your data to contain values within the full range of that variable, then looking for duplicates including that column. That might be done with an expand() or some kind of join. Also take a look at the fuzzyjoin package; because you're working with a numeric variable, this might not be so bad.
- Binning the variable somehow, then looking to see if there are duplicate bins.
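For illustration, the join idea above could be sketched in base R roughly as follows. This is not part of janitor and only one possible reading of the suggestion. The data values are invented, only three of the exact-match columns from the issue are used to keep it short, and `row_id`, `exact_cols`, `fuzzy_col`, and `tol` are hypothetical helper names.

```r
# Invented example data; column names follow the issue, values are made up.
df <- data.frame(
  sex     = c(1, 1, 1, 2),
  age     = c(30, 30, 30, 30),
  value   = c(5, 5, 5, 5),
  year.07 = c(2000.0, 2000.5, 2001.0, 2000.2)
)

exact_cols <- c("sex", "age", "value")  # must match exactly
fuzzy_col  <- "year.07"                 # compared with a tolerance
tol        <- 0.7                       # threshold from the issue

# Hypothetical helper column so that each pair of rows can be identified.
df$row_id <- seq_len(nrow(df))

# Self-join on the exact-match columns: every pair of rows that agrees on
# those columns appears once per ordering (including each row with itself).
pairs <- merge(df, df, by = exact_cols, suffixes = c(".a", ".b"))

# Keep each unordered pair once, and only if the fuzzy column is within tol.
year_a <- pairs[[paste0(fuzzy_col, ".a")]]
year_b <- pairs[[paste0(fuzzy_col, ".b")]]
is_close <- pairs$row_id.a < pairs$row_id.b & abs(year_a - year_b) < tol
fuzzy_dupes <- pairs[is_close, ]

# Binning alternative: simpler, but it misses close values that fall on
# opposite sides of a bin edge (rows 2 and 3 here land in different bins).
df$year_bin <- floor(df[[fuzzy_col]] / tol)
# The binned column could then be passed to get_dupes(), for example:
# janitor::get_dupes(df, sex, age, value, year_bin)
```

With this data, rows 1 and 2 pair up and so do rows 2 and 3, but rows 1 and 3 do not, since they are 1.0 apart. Matching with a tolerance is not transitive, so whether such chains count as one duplicate group is a decision each case has to make for itself.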

It could be a good candidate for a StackOverflow question, if you can share reproducible data that folks can work with. Good luck!

sfirke closed this as completed Jun 2, 2023
