Possible to enrich the get_dupes() #546

Closed
sapphire75710 opened this issue Jun 1, 2023 · 1 comment

@sapphire75710

Feature requests

I am using get_dupes() to detect duplicates in my datasets. The relevant variables are all numeric.

For the sex, age, agelength, LocIDorg, PopulationUniverse, and value variables, the rule is strict: two records count as duplicates only if their values are identical.

For the year.07 variable, the rule is slightly relaxed: if the difference between two records is smaller than 0.7, then these two records are considered duplicates. Is it possible to have this relaxed rule for duplicate matching in get_dupes()? Please let me know.

# duplicates <- janitor::get_dupes(df, sex, age, agelength, LocIDorg, PopulationUniverse, year.07, value)
@sfirke
Owner

sfirke commented Jun 2, 2023

Hi! I think this is outside the scope of that function. I'd call this "fuzzy" duplicate identification and in my experience it gets complex quickly, with each case requiring a unique solution. You might try:

  • Expanding your data to contain values within the full range of that variable, then looking for duplicates including that column. That might be done with an expand() or some kind of join. Also take a look at the fuzzyjoin package; because you're working with a numeric variable, this might not be so bad.
  • Binning the variable somehow, then looking to see if there are duplicate bins.
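
As a rough illustration of the join idea above, here is a minimal base R sketch. It is not a janitor feature and not necessarily what was meant by the suggestion: it self-joins the data on the exact-match columns with merge(), then keeps pairs whose year.07 values differ by less than the tolerance. The toy data frame, the reduced set of columns, and the helper names (exact_cols, tol_col, tol, row_id) are all assumptions made for the example.

```r
# Hypothetical toy data: a subset of the columns from the question.
df <- data.frame(
  sex     = c(1, 1, 1, 2, 2),
  age     = c(30, 30, 30, 40, 40),
  year.07 = c(2000.0, 2000.5, 2003.0, 2010.0, 2011.0),
  value   = c(5, 5, 5, 7, 7)
)

exact_cols <- c("sex", "age", "value")  # must match exactly
tol_col    <- "year.07"                 # may differ by less than `tol`
tol        <- 0.7

# Tag rows so each pair can be identified after the self-join.
df$row_id <- seq_len(nrow(df))

# Self-join on the exact-match columns: every row is paired with every
# row that shares the same keys (including itself).
pairs <- merge(df, df, by = exact_cols, suffixes = c(".a", ".b"))

# Keep each unordered pair once, and only if the tolerance column is close.
is_close <- pairs$row_id.a < pairs$row_id.b &
  abs(pairs[[paste0(tol_col, ".a")]] - pairs[[paste0(tol_col, ".b")]]) < tol
fuzzy_pairs <- pairs[is_close, ]

# Rows of the original data involved in at least one fuzzy-duplicate pair.
dupe_ids    <- sort(unique(c(fuzzy_pairs$row_id.a, fuzzy_pairs$row_id.b)))
fuzzy_dupes <- df[df$row_id %in% dupe_ids, ]
```

The self-join grows quadratically within each group of identical keys, so this may be slow when many rows share the same exact-match values. Binning is cheaper, but it can miss pairs that sit less than 0.7 apart on opposite sides of a bin boundary.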

It could be a good candidate for a StackOverflow question, if you can share reproducible data that folks can work with. Good luck!

@sfirke sfirke closed this as completed Jun 2, 2023