-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
create simpler has_dupes() function if significantly faster than get_dupes() on large data.frames #67
Comments
Probably move that functionality into C++. I could look into this. c Chris Haid Director of Research and Analysis KIPP Chicago 312.961.0510 github.com/chrishaid Work hard. Be nice. On Wed, Oct 5, 2016 at 1:14 PM, Sam Firke notifications@github.com wrote:
|
I know zero C++ so I would happily defer to you on this if we choose to prioritize this. But let's first see if other feedback comes in on the package. There might be higher-yield places to invest time than this, I was just noting it in the moment as the operation took a noticeable amount of time. |
Does
I use this SE
Results
|
I think with some thought, |
I modified By my testing it was basically as fast as
Yields:
I'm still open to someone making the underlying code faster, in addition to this workaround of a new parameter |
@rgknight when you get a chance, try cloning that branch and running the above tests? Curious to see what you think about this approach and also if I've missed something in your |
I took a quick look at the code but haven't run the tests. The approach looks good. My only question is about the auto-suppress of the count while also having a count option. It looks like you intended it to have three options:
However, you set More generally, the function definition implies two options but there are actually three. Do you have examples of other functions that take this approach? I would lean toward having two options, TRUE/FALSE, and let people who are speed sensitive set it to FALSE. Actually I would prefer the default to be FALSE and let non-speed sensitive folks set it to TRUE since it's an optional adornment, but I think we may disagree there. If you want to keep three options, I'd recommend making it a vector with "no", "whenfast" and always", like the |
To your first point, the NULL condition does get triggered - see So it's working as I intended it. BUT, you raise a good point about how to approach this. I do want it to default to the "whenfast" case for use in interactive checking of dirty data. But the current approach might lead to the appearance or disappearance of the So yeah, I'll do that w/ whenfast, then write tests and set up a PR. Thanks for reviewing! |
Cool I didn't know that missing works that way -- helpful, thanks! |
The above draft version, where it defaults to different behavior for a high # of rows, seems ugly to me now. I would prefer, if this can't be optimized much more, creating the As to actually optimizing |
for CRAN version 0.2.0 I wasn't interested in optimization but it might be due now.
On the October 4th CRAN logs file, with 926,002 rows x 10 columns:
4.5 seconds seems pretty pokey for interactive use. I wonder if there's a way to get that down. Though 533k rows were returned... if performance is better when there are relatively few duplicate rows returned, that would be more acceptable.
The text was updated successfully, but these errors were encountered: