function to compare data.frames that should be the same - variable names and col type consistency #50
I have this problem too. I got 20+ spreadsheets back that were supposed to use the same template, but some users had added or removed columns. Combine that with type mismatches and it took a lot of work to bind them together. I took the same approach as you, with reading them all as characters. It was not good. The new I've been thinking: a function that takes two data.frames and returns a description of how the 2nd is different than the 1st. Like:
You'd use this interactively to quickly compare data.frames and operate on them such that they can be bound. What do you think? I'm not sure what format the result should be (showing both DFs vs. the leaner "how is df2 different than df1?", a data.frame vs. a list or text output). |
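One way such a function could look, as a minimal base-R sketch. The helper name `describe_df_diff()` and its list-shaped return value are assumptions made for illustration; they are not part of janitor or any other package.

```r
# Hypothetical helper: describe how df2 differs from df1.
# The name and return format are assumptions for illustration only.
describe_df_diff <- function(df1, df2) {
  # Collapse each column's class vector to one string, e.g. "POSIXct, POSIXt"
  col_class <- function(df) {
    vapply(df, function(col) paste(class(col), collapse = ", "), character(1))
  }
  class1 <- col_class(df1)
  class2 <- col_class(df2)
  shared <- intersect(names(df1), names(df2))
  # Shared columns whose class description differs between the two inputs
  changed <- shared[class1[shared] != class2[shared]]
  list(
    added_cols = setdiff(names(df2), names(df1)),
    removed_cols = setdiff(names(df1), names(df2)),
    type_changes = data.frame(
      column = changed,
      df1_class = unname(class1[changed]),
      df2_class = unname(class2[changed]),
      stringsAsFactors = FALSE
    )
  )
}

df1 <- data.frame(id = 1:2, grade = c(3L, 4L), school = c("a", "b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:2, grade = c("K", "1"), notes = c("x", "y"),
                  stringsAsFactors = FALSE)
res <- describe_df_diff(df1, df2)
```

Here `res$added_cols` and `res$removed_cols` name the columns that only one input has, and `res$type_changes` lists shared columns whose classes disagree. A list is only one of the output formats floated above; a data.frame or text report would be equally plausible.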
My typical headache is two data.frames that are supposed to have the same columns and types. I call |
Agreed, this describes a lot of common data import problems. One potential end product would be a function that:
I think the first discrete step is a function Other functions could:
I need to get back to development. Just tried to bind_rows two tables that should allow it and got:
Instead, because I haven't created this tool, I'll go try to fix column types one by one... |
I ran into this problem today when I actually had a little time to write this function. How about this as a starting point?
Looks like a good start, a couple of thoughts:
I've come back to use this function several times, which is a sign it's useful. Maybe "compare_dfs" is a better name for it. It fundamentally answers the question, "these two data.frames are supposed to have the same columns and types - do they?" |
There's a draft from @bfgray3 in #179, specific code feedback can go there but let's discuss the functionality in this issue. My reactions:
All of these bullet points could be addressed by a function that returns a data.frame, with 3 list-cols and N-1 rows for N data.frames (b/c the 1st is the referent - or maybe it gets a dummy row). Then in the basic case of calling on two identical DFs, it returns a 1- or 2-row DF with all NAs. |
See also |
My two cents for each of @sfirke's bullet points above:
I haven't followed the code questions, but Re: what should it return... I still think that you need two functions:
This lets you string the assert_ version in a pipe with bind_rows. I use this check/assert separation a lot in my code and find it extremely useful. |
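The check/assert split described above might be sketched like this in base R. The names `check_same_cols()` and `assert_same_cols()` are hypothetical, and `rbind()` stands in for `bind_rows` so the example has no package dependencies.

```r
# Hypothetical check_/assert_ pair; names are illustrative, not an existing API.
# Collapse each column's class vector to one string per column.
col_classes <- function(df) {
  vapply(df, function(col) paste(class(col), collapse = ", "), character(1))
}

# check_: returns TRUE or FALSE and never stops.
check_same_cols <- function(df1, df2) {
  c1 <- col_classes(df1)
  c2 <- col_classes(df2)
  setequal(names(c1), names(c2)) && all(c1[names(c2)] == c2)
}

# assert_: stops on a mismatch, otherwise returns df1 unchanged
# so the call can sit in the middle of a pipeline.
assert_same_cols <- function(df1, df2) {
  if (!check_same_cols(df1, df2)) {
    stop("data.frames differ in column names or classes", call. = FALSE)
  }
  df1
}

a <- data.frame(id = 1:2, x = c("u", "v"), stringsAsFactors = FALSE)
b <- data.frame(x = c("w", "z"), id = 3:4, stringsAsFactors = FALSE)

# Column order differs but names and classes agree, so the assertion passes.
combined <- rbind(assert_same_cols(a, b), b)
```

Because the assert version hands back its first argument, a failed comparison halts the pipeline before the bind, while the check version stays usable inside `if` conditions.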
I just used this function again - so am convinced of its merits. Let's aim first for a function Then if it feels worthwhile, easy to have |
I am currently pulling my hair out about a big I wrote the original version of a function like this for Stata that was fairly popular back in the day called cfout that returned differences in values between datasets. This is what I would expect a function called I still think there are several types of output you would want, and therefore several functions in this family:
Check out this new repo: https://github.com/thomasp85/pearls. |
Thinking about this again. I woke up thinking about a data.frame result of See
I wonder if this df ^^^ is the way to go with storing the comparison info. It's good for the user as is. But we could also then use it as the building block for a diagnostic report, tooling that would make non-bindable dfs compatible, etc. |
When I wrote a function like this for Stata that had pretty wide adoption (cfout), I used the transpose of what you generated: one row per column, with the columns named after the dfs. Are you sure you want columns for columns? Having them as rows would let you have 3 or 4 columns: df_1, df_2, is_identical, difference_type. Then you could classify differences in types versus missing columns, etc.
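A rough sketch of that row-per-column layout in base R. The function name `compare_cols_long()` and the two difference labels are assumptions for illustration; cfout's actual output is not reproduced here.

```r
# Hypothetical sketch of the row-per-column layout; names are assumptions.
compare_cols_long <- function(df_1, df_2) {
  col_class <- function(df) {
    vapply(df, function(col) paste(class(col), collapse = ", "), character(1))
  }
  c1 <- col_class(df_1)
  c2 <- col_class(df_2)
  all_cols <- union(names(c1), names(c2))
  out <- data.frame(
    column_name = all_cols,
    df_1 = unname(c1[all_cols]),  # NA where the column is absent from df_1
    df_2 = unname(c2[all_cols]),  # NA where the column is absent from df_2
    stringsAsFactors = FALSE
  )
  out$is_identical <- !is.na(out$df_1) & !is.na(out$df_2) & out$df_1 == out$df_2
  # Classify each non-identical row as a missing column or a type mismatch
  out$difference_type <- ifelse(
    out$is_identical, NA_character_,
    ifelse(is.na(out$df_1) | is.na(out$df_2), "missing column", "type mismatch")
  )
  out
}

df_1 <- data.frame(id = 1:2, grade = c(3L, 4L), school = c("a", "b"),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(id = 1:2, grade = c("K", "1"), notes = c("x", "y"),
                   stringsAsFactors = FALSE)
cmp <- compare_cols_long(df_1, df_2)
```

With few data.frames and many columns, this shape stays narrow and can be filtered on `difference_type` directly.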
Another few spanners in the works for this one:
For 1: What I mean is: In my use cases, I often have factors that may be subsets of one another, and I'd like to know if the final version will all work together. More explicitly, the comparison df above by @sfirke would show these as the same when I'd like to know that they differ:
data.frame(A=factor(c("A", "B")))
data.frame(A=factor(c("C", "D")))
For 2: The solution should probably be very clear about what is and is not tested. If a column doesn't fall into the canonical types, we may have issues.
Architecture
For the architecture, I agree that there are multiple solutions from a similar set of operations. I like @rgknight's list above with one exception that I think fits a different (though still very useful) scope. The only exception in my view is |
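To make the factor concern from "For 1" concrete, here is a small base-R illustration. The `describe_factor()` helper is hypothetical and only shows one way the levels could be folded into the type label.

```r
# Illustrative only: class() alone hides a difference in factor levels.
df_a <- data.frame(A = factor(c("A", "B")))
df_b <- data.frame(A = factor(c("C", "D")))

# Both columns report the same class ...
same_class <- identical(class(df_a$A), class(df_b$A))

# ... but their level sets do not match.
same_levels <- identical(levels(df_a$A), levels(df_b$A))

# Hypothetical descriptor that folds the levels into the type label,
# so the two columns above would be reported as different types.
describe_factor <- function(x) {
  sprintf("factor(levels=c(%s))", paste0('"', levels(x), '"', collapse = ", "))
}

label_a <- describe_factor(df_a$A)
label_b <- describe_factor(df_b$A)
```

Comparing `label_a` with `label_b` instead of the bare class names surfaces the mismatch that a class-only comparison would miss.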
Ryan that is a great call to put data.frames in the cols, column names in the rows. For lots of reasons, including easier implementation and there typically being a few DFs with many columns. Bill, you raise good implementation concerns. I think this function would be useful as a minimal version that handles simple data types, then as we build it up we can consider how to approach POSIX, list-columns, etc. And agreed that comparing data values is out of scope and already covered by |
I had a need for this last week, and I came up with the following implementation. It has the following features:
I need this pretty often, and I'd be happy to make a PR for this if it feels like the right fit.
# Describe the class of a single column as one string (S3 generic).
compare_df_types_class_detect <- function(x) {
  UseMethod("compare_df_types_class_detect")
}
# Factors: fold the levels (and orderedness) into the description so that
# factors with different levels are reported as different types.
compare_df_types_class_detect.factor <- function(x) {
  all_classes <- class(x)
  level_text <- sprintf("levels=c(%s)", paste('"', levels(x), '"', sep="", collapse=", "))
  if (is.ordered(x)) {
    level_text <- paste0(level_text, ", ordered=TRUE")
  }
  factor_text <- sprintf("factor(%s)", level_text)
  mask_factor <- all_classes == "factor"
  if (!any(mask_factor)) {
    stop("Cannot handle a factor that does not have a class of factor. Please report this as a bug with a reproducible example.")
  } else if (sum(mask_factor) != 1) {
    stop("More than one of the classes shows up as a factor. Please report this as a bug with a reproducible example.")
  }
  all_classes[mask_factor] <- factor_text
  # collapse (not sep) joins a multi-class vector into a single string
  paste(all_classes, collapse=", ")
}
# Everything else: the class vector joined into a single string.
compare_df_types_class_detect.default <- function(x) {
  all_classes <- class(x)
  paste(all_classes, collapse=", ")
}
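# Build a two-column data.frame: column name and class description.
compare_df_types_df_maker <- function(x, class_colname="class") {
  ret <-
    data.frame(
      column_name=names(x),
      # vapply requires exactly one string per column
      X=vapply(X=x, FUN=compare_df_types_class_detect, FUN.VALUE=character(1)),
      stringsAsFactors=FALSE
    )
  names(ret)[2] <- class_colname
  ret
}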
# Compare column classes across the data.frames passed via `...`.
# `return` selects which rows to show; `bind_check` selects the matching rule.
compare_df_types <- function(..., return=c("all", "matches", "mismatches"), bind_check=c("rbind", "bind_rows")) {
  return <- match.arg(return)
  bind_check <- match.arg(bind_check)
  # Names given explicitly, e.g. compare_df_types(a=df1, b=df2)
  direct_names <- names(list(...))
  # Otherwise recover the argument expressions from the call.
  # Note: setdiff() drops duplicates, so passing the same object twice yields one name.
  indirect_names <-
    setdiff(
      as.character(match.call(expand.dots=TRUE)),
      as.character(match.call(expand.dots=FALSE))
    )
  if (is.null(direct_names)) {
    final_names <- indirect_names
  } else {
    final_names <- direct_names
    mask_replace <- final_names %in% ""
    final_names[mask_replace] <- indirect_names[mask_replace]
  }
  args <- list(...)
  # Start with the first data.frame, then merge in the others by column name.
  ret <- compare_df_types_df_maker(args[[1]], class_colname=final_names[1])
  for (idx in (1+seq_len(length(args) - 1))) {
    ret <-
      merge(
        ret,
        compare_df_types_df_maker(args[[idx]], class_colname=final_names[idx]),
        by="column_name",
        all=TRUE
      )
  }
  if (return == "all" || ncol(ret) == 2) {
    if (return != "all") {
      warning("Only one data.frame provided, so all its classes are provided.")
    }
    ret
  } else {
    # Is this the best way to check for all row values to be equal?
    bind_check_fun <-
      list(
        # rbind: every data.frame must have the column, with the same class as the first
        rbind=function(idx) {
          all(unlist(ret[idx,3:ncol(ret)]) %in% ret[idx,2])
        },
        # bind_rows: a column may be missing (NA), but present classes must all match
        bind_rows=function(idx) {
          all(
            unlist(ret[idx,3:ncol(ret)]) %in%
              c(NA_character_,
                na.omit(unlist(ret[idx,2:ncol(ret)]))[1])
          )
        }
      )
    mask_match <-
      sapply(
        X=seq_len(nrow(ret)),
        FUN=bind_check_fun[[bind_check]]
      )
    if (return == "matches") {
      ret[mask_match,]
    } else if (return == "mismatches") {
      ret[!mask_match,]
    }
  }
}
# TRUE when the comparison returns no rows (by default: no mismatches).
# Prints the offending rows when verbose.
compare_df_types_success <- function(..., return="mismatches", bind_check=c("rbind", "bind_rows"), verbose=TRUE) {
  return <- match.arg(return)
  bind_check <- match.arg(bind_check)
  ret <- compare_df_types(..., return=return, bind_check=bind_check)
  if (nrow(ret) > 0 && verbose) {
    print(ret)
  }
  nrow(ret) == 0
}
@billdenney this is great! I think it accomplishes the purposes that have been discussed above over the years (!) in this issue. Want to add documentation, tests, and make a PR? Given that you have the code (and the functionality seems great to me), think we could squeeze it in for an April 21 submission to CRAN? |
* Allow data.frame row-binding comparison (Fix #50)
* Allow list inputs to `compare_df_types()`
* Address code review comments
* typo columne -> column
* make bind_rows the default value of bind_method swapping it in for rbind, since this is a tidyverse-aligned package and my quick poll shows more peers using dplyr::bind_rows
* re-describe new functions
Thanks to everyone who contributed to this thread over the years! Having this merged and on its way to CRAN soon feels great. Vive le open-source! |
here's my use case - I am writing functions to suck up n state export files.
I can't depend on `read.csv`'s type hinting because I have situations where the nth data file will contain a data type (say, 'K' for grade, where grade had always been integer on the first 10 files) that doesn't play nicely. To solve that, I'm reading the raw files in as character, and then doing type conversion myself.
But this seems like a janitor kind of job
Is this in scope/ out of scope? Any thoughts about how to move forward? @chrishaid would love your thoughts here as well
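For reference, the read-as-character workaround described in this post can be sketched in base R as below. The inline strings are made-up stand-ins for the state export files, and `read_as_character()` is a hypothetical wrapper, not an existing function.

```r
# Stand-ins for two export files; real code would pass file paths to read.csv().
file_1 <- "student,grade\ns1,3\ns2,4\n"
file_2 <- "student,grade\ns3,K\ns4,1\n"

# Hypothetical wrapper: read every column as character to avoid per-file type guessing.
read_as_character <- function(txt) {
  read.csv(text = txt, colClasses = "character", stringsAsFactors = FALSE)
}

# Every column is character, so the files stack without type conflicts.
combined <- do.call(rbind, lapply(list(file_1, file_2), read_as_character))

# Convert types once, on the combined data; as.is = TRUE keeps strings as character.
converted <- utils::type.convert(combined, as.is = TRUE)
```

Converting after the bind means a stray value such as 'K' keeps the whole `grade` column as character, whereas guessing types per file would make the first file's `grade` integer and the second's character.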