create simpler has_dupes() function if significantly faster than get_dupes() on large data.frames #67

Closed
sfirke opened this issue Oct 5, 2016 · 10 comments


@sfirke
Owner

sfirke commented Oct 5, 2016

For CRAN version 0.2.0 I wasn't focused on optimization, but it might be due now.

On the October 4th CRAN logs file (http://cran-logs.rstudio.com/2016/2016-10-04.csv.gz), with 926,002 rows x 10 columns:

> microbenchmark(dat %>% get_dupes(package, country, ip_id))
Unit: seconds
                                       expr      min       lq     mean median       uq      max neval
 dat %>% get_dupes(package, country, ip_id) 4.095232 4.381469 4.580609 4.5755 4.695437 5.884169   100

4.5 seconds seems pretty pokey for interactive use. I wonder if there's a way to get that down. Though 533k rows were returned... if performance is better when there are relatively few duplicate rows returned, that would be more acceptable.

@chrishaid
Collaborator

chrishaid commented Oct 5, 2016

Probably move that functionality into C++. I could look into this.

@sfirke
Owner Author

sfirke commented Oct 5, 2016

I know zero C++, so I would happily defer to you on this if we choose to prioritize it. But let's first see what other feedback comes in on the package. There might be higher-yield places to invest time than this; I was just noting it in the moment because the operation took a noticeable amount of time.

@rgknight
Collaborator

rgknight commented Oct 6, 2016

Does get_dupes need to be more complicated than an NSE version of something like this?

sep_dups <- function(df, ...){
  # capture the ... columns lazily so they can be passed to the SE select_()
  target <- df %>% select_(.dots = lazyeval::lazy_dots(...))
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)

  df[dup_index, , drop = FALSE]
}
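
For example, assuming dplyr is loaded and using the built-in mtcars dataset:

sep_dups(mtcars, cyl, gear)  # returns the rows whose cyl/gear combination appears more than once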

I use this SE duplicated approach in https://github.com/rgknight/knightr, and it appears to be considerably faster.

library(janitor)         # for get_dupes()
library(microbenchmark)

uniques <- c(1:10)
len <- 10^5
x <- sample(uniques, len, replace = TRUE)
y <- 11:10000000
z <- append(x, y)
z <- as.data.frame(z)
m <- microbenchmark(get_dupes(z),
                    times = 10L)
m2 <- microbenchmark(knightr::sep_dups(z),
                     times = 10L)
summary(m)
summary(m2)

Results

Unit: seconds
         expr      min       lq     mean   median       uq      max neval
 get_dupes(z) 10.47748 10.56888 11.02739 10.96821 11.52733 11.71295    10

Unit: microseconds
                 expr     min      lq     mean   median     uq     max neval
 knightr::sep_dups(z) 287.847 327.968 382.3651 348.0285 421.74 598.087    10

@sfirke
Owner Author

sfirke commented Oct 13, 2016

get_dupes is a little more complicated than that because it creates the count column and reorders the columns, but generally you are right. I had a simpler version initially but ran into problems with dplyr::n() not working in a package the way it does interactively. I solved that with the current approach, which is computationally quite inefficient.

I think with some thought, get_dupes can be refactored along the lines you suggest to be much faster, without writing anything in C++.
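
Something along these lines, perhaps (a rough base-R sketch, not tested and not code on any branch; dupe_count is a placeholder name):

# illustrative sketch only: flag duplicate rows with duplicated(), then add a
# count column without dplyr::n(); assumes at least one column is supplied
get_dupes_sketch <- function(dat, ...) {
  var_names <- sapply(substitute(list(...))[-1], deparse)
  target <- dat[, var_names, drop = FALSE]
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  dupes <- dat[dup_index, , drop = FALSE]
  # count occurrences of each combination; ave() sidesteps dplyr::n() entirely
  dupes$dupe_count <- ave(seq_len(nrow(dupes)), dupes[var_names], FUN = length)
  # sort by the selected variables and put them (plus the count) first
  dupes[do.call(order, dupes[var_names]),
        c(var_names, "dupe_count", setdiff(names(dat), var_names))]
}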

@sfirke
Owner Author

sfirke commented Jan 29, 2017

I modified get_dupes, on the optimizing_get_dupes branch - check that branch out for the below testing. Now if there are >100k rows in the input data.frame, it doesn't calculate the N (this can be overridden).

By my testing it was basically as fast as sep_dups (mentioned above and redefined below for testing) when I took out the last select and arrange of the result, which makes it a more apples-to-apples comparison. With those back in, it lags slightly. But I think they are essential: without the duplicates sorted so you can inspect them together, the result mostly just tells you that you have duplicates.

# Ryan's data for testing - a one-column data.frame with >10MM rows
uniques <- c(1:10)
len <- 10^5
x <- sample(uniques, len, replace = TRUE)
y <- 11:10000000
z <- append(x, y)
z <- as.data.frame(z)

# source file linked above on this page; 960k records, real world example
dat <- readr::read_csv(gzfile("2016-10-04.csv.gz")) 

sep_dups <- function(dat, ...){
  var_names <- lapply(unlist(as.list(substitute(list(...)))[-1L]), deparse) %>% unlist # sad!
  target <- dat %>% select(one_of(var_names))
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  dat[dup_index, , drop = FALSE]
}

library(microbenchmark)

microbenchmark(get_dupes = dat %>% get_dupes(package, country, ip_id),
               get_dupes_Ns = dat %>% get_dupes(package, country, ip_id, show_n = TRUE),
               sep_dups = dat %>% sep_dups(package, country, ip_id),
               times = 5)

microbenchmark(get_dupes = get_dupes(z, z),
               get_dupes_Ns = get_dupes(z, show_n = TRUE),
               sep_dups = sep_dups(z, z),
               times = 5)

Yields:

Unit: seconds
         expr      min       lq     mean   median       uq      max neval cld
    get_dupes 2.071470 2.379026 2.826163 2.439346 3.200552 4.040421     5  a 
 get_dupes_Ns 5.331746 5.449574 5.513692 5.451409 5.509487 5.826245     5   b
     sep_dups 1.627400 1.718317 1.969888 1.931549 2.066020 2.506156     5  a 

Unit: seconds
         expr       min        lq      mean    median        uq       max neval cld
    get_dupes  1.334536  1.379275  1.810781  2.012506  2.089803  2.237786     5  a 
 get_dupes_Ns 11.722700 11.958248 13.507509 12.260594 12.445033 19.150969     5   b
     sep_dups  1.312836  1.419128  1.689508  1.487448  1.589686  2.638443     5  a 

I'm still open to someone making the underlying code faster, in addition to this workaround of a new show_n parameter that defaults to FALSE on large data.frames. I don't know how to do that myself. But in the meantime I think this is a nice compromise that gives full info for everyday "small data" but won't leave a user hanging for 10+ seconds on larger data.frames.

@sfirke
Owner Author

sfirke commented Jan 29, 2017

@rgknight when you get a chance, could you try cloning that branch and running the above tests? I'm curious what you think of this approach, and whether I've missed something in your sep_dups, which I rewrote slightly above. If you approve, I can add tests and merge into master.

@rgknight
Collaborator

I took a quick look at the code but haven't run the tests. The approach looks good. My only question is about auto-suppressing the count while also having a count option.

It looks like you intended it to have three options:

  • TRUE = always
  • FALSE = never
  • NULL = if the df is small

However, you set show_n to TRUE in the function definition, so the NULL condition will never be triggered, unless you explicitly set it to NULL, which I think is a typo/mistake?

More generally, the function definition implies two options but there are actually three. Do you have examples of other functions that take this approach? I would lean toward having two options, TRUE/FALSE, and letting people who are speed-sensitive set it to FALSE. Actually, I would prefer the default to be FALSE and let non-speed-sensitive folks set it to TRUE, since it's an optional adornment, but I think we may disagree there.

If you want to keep three options, I'd recommend making it a character option with "no", "whenfast", and "always", like the useNA argument of table(), with a default of "whenfast" (or some better name).
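
Roughly what I have in mind (a sketch only, with illustrative names, not the actual janitor signature):

# a three-level option in the style of table()'s useNA, resolved with match.arg()
resolve_show_n <- function(dat, show_n = c("whenfast", "always", "no")) {
  show_n <- match.arg(show_n)
  switch(show_n,
         always   = TRUE,
         no       = FALSE,
         whenfast = nrow(dat) <= 100000)  # only compute the count when the data is small
}

resolve_show_n(mtcars)           # TRUE: small data, default "whenfast"
resolve_show_n(mtcars, "no")     # FALSE: never compute the count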

@sfirke
Owner Author

sfirke commented Feb 5, 2017

To your first point, the NULL condition does get triggered - see if((nrow(dat) > 100000 & missing(show_n)) | !show_n) - the missing() call checks whether show_n was explicitly specified as TRUE vs. defaulted to TRUE because the user didn't supply a value. dat %>% get_dupes(package, country, ip_id) and dat %>% get_dupes(package, country, ip_id, show_n = FALSE) (with a very large dat) do the same thing, but there's no warning in the 2nd instance about defaulting to show_n = FALSE.
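
In case it's useful, a minimal demo of that missing() behavior (f is just a toy function):

f <- function(x = TRUE) missing(x)
f()      # TRUE: x was left at its default
f(TRUE)  # FALSE: x was supplied explicitly, even though the value is the same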

So it's working as I intended. BUT, you raise a good point about how to approach this. I do want it to default to the "whenfast" case for use in interactive checking of dirty data. But the current approach might lead to the appearance or disappearance of the n column depending on the size of the input, which would be potentially bad for users counting on a certain output. I can't think of other R functions that behave like this, with a default input of TRUE not always behaving like a specified TRUE (because as you note, TRUE/FALSE implies two options but there are three).

So yeah, I'll do that w/ whenfast, then write tests and set up a PR. Thanks for reviewing!

@rgknight
Collaborator

rgknight commented Feb 8, 2017

Cool, I didn't know that missing() works that way -- helpful, thanks!

@sfirke
Owner Author

sfirke commented Mar 5, 2018

The above draft version, where it defaults to different behavior above a certain # of rows, seems ugly to me now. If get_dupes can't be optimized much more, I would prefer creating the has_dupes function that Ryan proposes in the comments here: #18. It just returns TRUE/FALSE. Then, if the # of rows or columns is very large, get_dupes could print a message that it will likely be slow and suggest calling has_dupes first instead.
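
If we go that route, it could be as simple as this rough sketch (illustrative only, nothing implemented yet):

has_dupes <- function(dat, ...) {
  var_names <- sapply(substitute(list(...))[-1], deparse)
  target <- if (length(var_names) == 0) dat else dat[, var_names, drop = FALSE]
  any(duplicated(target))  # TRUE if any combination of the chosen columns repeats
}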

As to actually optimizing get_dupes, let's continue that conversation at #163.

@sfirke sfirke closed this as completed Mar 5, 2018
@sfirke sfirke changed the title improve speed of get_dupes create simpler has_dupes() function if significantly faster than get_dupes() on large data.frames Mar 5, 2018
@sfirke sfirke reopened this Mar 5, 2018
@sfirke sfirke closed this as completed Apr 15, 2022