create simpler has_dupes() function if significantly faster than get_dupes() on large data.frames #67

Closed
sfirke opened this issue Oct 5, 2016 · 10 comments


@sfirke
Owner

sfirke commented Oct 5, 2016

For CRAN version 0.2.0 I wasn't focused on optimization, but it might be due now.

On the October 4th CRAN logs file (http://cran-logs.rstudio.com/2016/2016-10-04.csv.gz), with 926,002 rows x 10 columns:

> microbenchmark(dat %>% get_dupes(package, country, ip_id))
Unit: seconds
                                       expr      min       lq     mean median       uq      max neval
 dat %>% get_dupes(package, country, ip_id) 4.095232 4.381469 4.580609 4.5755 4.695437 5.884169   100

4.5 seconds seems pretty pokey for interactive use. I wonder if there's a way to get that down. Though 533k rows were returned... if performance is better when there are relatively few duplicate rows returned, that would be more acceptable.

@chrishaid
Collaborator

chrishaid commented Oct 5, 2016

Probably move that functionality into C++. I could look into this.

@sfirke
Owner Author

sfirke commented Oct 5, 2016

I know zero C++, so I would happily defer to you on this if we choose to prioritize it. But let's first see what other feedback comes in on the package. There might be higher-yield places to invest time than this; I was just noting it in the moment because the operation took a noticeable amount of time.

@rgknight
Collaborator

rgknight commented Oct 6, 2016

Does get_dupes need to be more complicated than an NSE version of something like this?

sep_dups <- function(df, ...){
  # capture the ... columns lazily so they can be passed to the SE select_()
  target <- df %>% select_(.dots = lazyeval::lazy_dots(...))
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)

  df[dup_index, , drop = FALSE]
}
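
For example, assuming dplyr is loaded and using the built-in mtcars dataset:

sep_dups(mtcars, cyl, gear)  # returns the rows whose cyl/gear combination appears more than once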

I use this SE duplicated approach in https://github.com/rgknight/knightr, and it appears to be considerably faster.

library(janitor)         # for get_dupes()
library(microbenchmark)

uniques <- c(1:10)
len <- 10^5
x <- sample(uniques, len, replace = TRUE)
y <- 11:10000000
z <- append(x, y)
z <- as.data.frame(z)
m <- microbenchmark(get_dupes(z),
                    times = 10L)
m2 <- microbenchmark(knightr::sep_dups(z),
                     times = 10L)
summary(m)
summary(m2)

Results

Unit: seconds
         expr      min       lq     mean   median       uq      max neval
 get_dupes(z) 10.47748 10.56888 11.02739 10.96821 11.52733 11.71295    10

Unit: microseconds
                 expr     min      lq     mean   median     uq     max neval
 knightr::sep_dups(z) 287.847 327.968 382.3651 348.0285 421.74 598.087    10

@sfirke
Owner Author

sfirke commented Oct 13, 2016

get_dupes is a little more complicated than that because it creates the count column and reorders the columns, but generally you are right. I had a simpler version initially but ran into problems with dplyr::n() not working in a package the way it does interactively. I solved that with the current approach, which is computationally quite inefficient.

I think with some thought, get_dupes can be refactored along the lines you suggest to be much faster, without writing anything in C++.
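
Something along these lines, perhaps (a rough base-R sketch, not tested and not code on any branch; dupe_count is a placeholder name):

# illustrative sketch only: flag duplicate rows with duplicated(), then add a
# count column without dplyr::n(); assumes at least one column is supplied
get_dupes_sketch <- function(dat, ...) {
  var_names <- sapply(substitute(list(...))[-1], deparse)
  target <- dat[, var_names, drop = FALSE]
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  dupes <- dat[dup_index, , drop = FALSE]
  # count occurrences of each combination; ave() sidesteps dplyr::n() entirely
  dupes$dupe_count <- ave(seq_len(nrow(dupes)), dupes[var_names], FUN = length)
  # sort by the selected variables and put them (plus the count) first
  dupes[do.call(order, dupes[var_names]),
        c(var_names, "dupe_count", setdiff(names(dat), var_names))]
}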

@sfirke
Owner Author

sfirke commented Jan 29, 2017

I modified get_dupes, on the optimizing_get_dupes branch - check that branch out for the below testing. Now if there are >100k rows in the input data.frame, it doesn't calculate the N (this can be overridden).

By my testing it was basically as fast as sep_dups (mentioned above and redefined below for testing) when I took out the last select and arrange of the result, which makes it a more apples-to-apples comparison. With those back in, it lags slightly. But I think they are essential: without the duplicates sorted so you can inspect them together, the result mostly just tells you that you have duplicates.

# Ryan's data for testing - a one-column data.frame with >10MM rows
uniques <- c(1:10)
len <- 10^5
x <- sample(uniques, len, replace = TRUE)
y <- 11:10000000
z <- append(x, y)
z <- as.data.frame(z)

# source file linked above on this page; 960k records, real world example
dat <- readr::read_csv(gzfile("2016-10-04.csv.gz")) 

sep_dups <- function(dat, ...){
  var_names <- lapply(unlist(as.list(substitute(list(...)))[-1L]), deparse) %>% unlist # sad!
  target <- dat %>% select(one_of(var_names))
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  dat[dup_index, , drop = FALSE]
}

library(microbenchmark)

microbenchmark(get_dupes = dat %>% get_dupes(package, country, ip_id),
               get_dupes_Ns = dat %>% get_dupes(package, country, ip_id, show_n = TRUE),
               sep_dups = dat %>% sep_dups(package, country, ip_id),
               times = 5)

microbenchmark(get_dupes = get_dupes(z, z),
               get_dupes_Ns = get_dupes(z, show_n = TRUE),
               sep_dups = sep_dups(z, z),
               times = 5)

Yields:

Unit: seconds
         expr      min       lq     mean   median       uq      max neval cld
    get_dupes 2.071470 2.379026 2.826163 2.439346 3.200552 4.040421     5  a 
 get_dupes_Ns 5.331746 5.449574 5.513692 5.451409 5.509487 5.826245     5   b
     sep_dups 1.627400 1.718317 1.969888 1.931549 2.066020 2.506156     5  a 

Unit: seconds
         expr       min        lq      mean    median        uq       max neval cld
    get_dupes  1.334536  1.379275  1.810781  2.012506  2.089803  2.237786     5  a 
 get_dupes_Ns 11.722700 11.958248 13.507509 12.260594 12.445033 19.150969     5   b
     sep_dups  1.312836  1.419128  1.689508  1.487448  1.589686  2.638443     5  a 

I'm still open to someone making the underlying code faster, in addition to this workaround of a new show_n parameter that defaults to FALSE on large data.frames. I don't know how to do that myself. But in the meantime I think this is a nice compromise that gives full info for everyday "small data" but won't leave a user hanging for 10+ seconds on larger data.frames.

@sfirke
Owner Author

sfirke commented Jan 29, 2017

@rgknight when you get a chance, could you try cloning that branch and running the above tests? I'm curious what you think of this approach, and whether I've missed something in your sep_dups, which I rewrote slightly above. If you approve, I can add tests and merge into master.

@rgknight
Collaborator

I took a quick look at the code but haven't run the tests. The approach looks good. My only question is about auto-suppressing the count while also having a count option.

It looks like you intended it to have three options:

  • TRUE = always
  • FALSE = never
  • NULL = if the df is small

However, you set show_n to TRUE in the function definition, so the NULL condition will never be triggered, unless you explicitly set it to NULL, which I think is a typo/mistake?

More generally, the function definition implies two options but there are actually three. Do you have examples of other functions that take this approach? I would lean toward having two options, TRUE/FALSE, and letting people who are speed-sensitive set it to FALSE. Actually, I would prefer the default to be FALSE and let non-speed-sensitive folks set it to TRUE, since it's an optional adornment, but I think we may disagree there.

If you want to keep three options, I'd recommend making it a character option with "no", "whenfast", and "always", like the useNA argument of table(), with a default of "whenfast" (or some better name).
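
Roughly what I have in mind (a sketch only, with illustrative names, not the actual janitor signature):

# a three-level option in the style of table()'s useNA, resolved with match.arg()
resolve_show_n <- function(dat, show_n = c("whenfast", "always", "no")) {
  show_n <- match.arg(show_n)
  switch(show_n,
         always   = TRUE,
         no       = FALSE,
         whenfast = nrow(dat) <= 100000)  # only compute the count when the data is small
}

resolve_show_n(mtcars)           # TRUE: small data, default "whenfast"
resolve_show_n(mtcars, "no")     # FALSE: never compute the count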

@sfirke
Owner Author

sfirke commented Feb 5, 2017

To your first point, the NULL condition does get triggered - see if((nrow(dat) > 100000 & missing(show_n)) | !show_n) - the missing() call checks whether show_n was explicitly specified as TRUE vs. defaulted to TRUE because the user didn't supply a value. dat %>% get_dupes(package, country, ip_id) and dat %>% get_dupes(package, country, ip_id, show_n = FALSE) (with a very large dat) do the same thing, but there's no warning in the 2nd instance about defaulting to show_n = FALSE.
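
In case it's useful, a minimal demo of that missing() behavior (f is just a toy function):

f <- function(x = TRUE) missing(x)
f()      # TRUE: x was left at its default
f(TRUE)  # FALSE: x was supplied explicitly, even though the value is the same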

So it's working as I intended. BUT, you raise a good point about how to approach this. I do want it to default to the "whenfast" case for use in interactive checking of dirty data. But the current approach might lead to the appearance or disappearance of the n column depending on the size of the input, which would be potentially bad for users counting on a certain output. I can't think of other R functions that behave like this, with a default input of TRUE not always behaving like a specified TRUE (because as you note, TRUE/FALSE implies two options but there are three).

So yeah, I'll do that w/ whenfast, then write tests and set up a PR. Thanks for reviewing!

@rgknight
Collaborator

rgknight commented Feb 8, 2017

Cool, I didn't know that missing() works that way -- helpful, thanks!

@sfirke
Owner Author

sfirke commented Mar 5, 2018

The above draft version, where it defaults to different behavior above a certain # of rows, seems ugly to me now. If get_dupes can't be optimized much more, I would prefer creating the has_dupes function that Ryan proposes in the comments here: #18. It just returns TRUE/FALSE. Then, if the # of rows or columns is very large, get_dupes could print a message that it will likely be slow and suggest calling has_dupes first instead.
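
If we go that route, it could be as simple as this rough sketch (illustrative only, nothing implemented yet):

has_dupes <- function(dat, ...) {
  var_names <- sapply(substitute(list(...))[-1], deparse)
  target <- if (length(var_names) == 0) dat else dat[, var_names, drop = FALSE]
  any(duplicated(target))  # TRUE if any combination of the chosen columns repeats
}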

As to actually optimizing get_dupes, let's continue that conversation at #163.

@sfirke sfirke closed this as completed Mar 5, 2018
@sfirke sfirke changed the title improve speed of get_dupes create simpler has_dupes() function if significantly faster than get_dupes() on large data.frames Mar 5, 2018
@sfirke sfirke reopened this Mar 5, 2018
@sfirke sfirke closed this as completed Apr 15, 2022