function to compare data.frames that should be the same - variable names and col type consistency #50

almartin82 · 2016-08-05T17:29:05Z

here's my use case - I am writing functions to suck up n state export files.

I can't depend on read.csv's type hinting because I have situations where the nth data file will contain a data type (say, 'K' for grade, where grade had always been integer on the first 10 files) that doesn't play nicely.

To solve that, I'm reading the raw files in as character, and then doing type conversion myself.

But this seems like a janitor kind of job

Is this in scope/ out of scope? Any thoughts about how to move forward? @chrishaid would love your thoughts here as well

The text was updated successfully, but these errors were encountered:

sfirke · 2016-08-06T19:23:18Z

I have this problem too. I got 20+ spreadsheets back that were supposed to use the same template, but some users had added or removed columns. Combine that with type mismatches and it took a lot of work to bind them together. I took the same approach as you, with reading them all as characters. It was not good.

The new readr::read_csv 1.0.0 that Hadley dropped yesterday is some help, since you can quickly save what you determine to be the correct column specs and then pass that to all other read_csv calls. But that doesn't work for rearranged or added/removed columns.

I've been thinking: a function that takes two data.frames and returns a description of how the 2nd is different than the 1st. Like:

27 Identical columns
2 columns of same name, different type: year (date vs. string), ID (int vs. string)
1 absent column: rating (str)
3 additional columns: school1 (int), school1_name (str), district (str)

You'd use this interactively to quickly compare data.frames and operate on them such that they can be bound.

What do you think? I'm not sure what format the result should be (showing both DFs vs. the leaner "how is df2 different than df1?", a data.frame vs. a list or text output).

sfirke · 2016-08-06T19:25:20Z

My typical headache is two data.frames that are supposed to have the same columns and types. I call bind_rows and it fails, telling me the first error. Then I fix one thing at a time, not having a sense of how long this cycle will go. Whipping up comparison function code on the fly is non-trivial.

almartin82 · 2016-08-07T12:13:00Z

Agreed, this describes a lot of common data import problems.

One potential end product would be a function that:

takes a list of data frames
figures out the minimum common type per column
does the missing / additional column analysis
has rule-based options (return all / return common) for rbinding it all together.

sfirke · 2016-12-23T16:45:55Z

I think the first discrete step is a function check_bindable that returns an interactive report of what differences there are in data.frame variables: column names and types (and maybe ordering?). Purely diagnostic, fixing it is up to the user.

Other functions could:

spit out the universal column type specs that would work for a list of data.frames; you'd then go use that in your read_ calls to load the data.frames to that spec (Andrew's bullets 1-3 above);
force-rbind them together, allowing for user-specified options (Andrew's 4th bullet)

sfirke · 2017-01-23T03:02:21Z

I need to get back to development. Just tried to bind_rows two tables that should allow it and got:

Error in bind_rows_(x, .id) :
Can not automatically convert from numeric to character in column "In what field(s) of study is your Bachelor’s degree? (If your Bachelor’s degree is in more than one field, please select all that apply.) - Accounting".

Instead, because I haven't created this tool, I'll go try to fix column types one by one...

sfirke · 2017-03-02T15:25:32Z

I ran into this problem today when I actually had a little time to write this function. How about this as a starting point?

check_bindable <- function(df1, df2){ # should eventually take a list of dfs
  vars_in_df1_only <- names(df1)[!names(df1) %in% names(df2)]
  vars_in_df2_only <- names(df2)[!names(df2) %in% names(df1)]
  
  # subset DFs to only the same for class investigation
  common_vars <- Reduce(intersect, list(names(df1), names(df2)))
  df1_common <- df1[, common_vars]
  df2_common <- df2[, common_vars]

  if(sum(names(df1_common) == names(df2_common)) != ncol(df1_common)){stop("something is wrong, names in df1_common don't match names in df2_common")}

  df1_col_types <- lapply(df1_common, class) %>%
    lapply(`[[`, 1) %>% # get first class, in case of POSIX there are multiple
    unlist
  
  df2_col_types <- lapply(df2_common, class) %>%
    lapply(`[[`, 1) %>% # get first class, in case of POSIX there are multiple
    unlist
  
  col_mismatch_index <- df1_col_types != df2_col_types
  col_mismatches <- names(df1_common)[col_mismatch_index]
  col_mismatch_class1 <- df1_col_types[col_mismatch_index]
  col_mismatch_class2 <- df2_col_types[col_mismatch_index]
  
  # result is ragged, so return as a list
  list(
    vars_in_df1_only = vars_in_df1_only,
    vars_in_df2_only = vars_in_df2_only,
    column_class_mismatches = data_frame(variable = col_mismatches,
               class_in_df1 = col_mismatch_class1,
               class_in_df2 = col_mismatch_class2
               )
  )
}

check_bindable(mtcars %>%
                 mutate(cyl = as.factor(cyl),
                        new_var = "hi"),
               mtcars %>%
                 select(-mpg, -wt) %>%
                 rename(CARB = carb)
               )
#> $vars_in_df1_only
#> [1] "mpg"     "wt"      "carb"    "new_var"
#> 
#> $vars_in_df2_only
#> [1] "CARB"
#> 
#> $column_class_mismatches
#> # A tibble: 1 × 3
#>   variable class_in_df1 class_in_df2
#>      <chr>        <chr>        <chr>
#> 1      cyl       factor      numeric

rgknight · 2017-03-02T17:22:02Z

Looks like a good start, a couple of thoughts:

check_ functions should return TRUE/FALSE only. In your example, it should return FALSE with a message or warning. There should maybe be a warn=TRUE option in case you want to turn the warning off to do some custom handling if it's FALSE.
We will want an assert_bindable function as well that will stop() if it's not bindable.
We only care about differences in type that are not bindable. I believe that vctrs has the canonical list of allowable type differences. We should ignore these differences. Coercion rules r-lib/vctrs#7
There are two very different things happening here. Checking that the columns are the same and checking type differences. I think they may want to be different functions, with a parent function that calls both. Then if you only care about type differences you can call check_ whatever it's called, combinable or something.
I don't know if check_bindable is quite right for the parent function because the column names don't need to be the same for it to be bindable, and I am often OK with column differences. I might use check_appendable to signal that it is more strict than required for bind_rows to pass, since appending in most languages would fail if there are a different number of columns.
ensurer may have some good ideas here, see the ensure_as_template example. I don't think we'll want to do exactly that, but it may be worth a look if you haven't seen it. https://github.com/smbache/ensurer/blob/fea96587bac2a5abe2c44a5e6ee6e0ba90e21557/vignettes/ensurer.Rmd

sfirke · 2018-02-06T19:38:56Z

I've come back to use this function several times, which is a sign it's useful. Maybe "compare_dfs" is a better name for it. It fundamentally answers the question, "these two data.frames are supposed to have the same columns and types - do they?"

sfirke · 2018-02-10T18:56:13Z

There's a draft from @bfgray3 in #179, specific code feedback can go there but let's discuss the functionality in this issue.

My reactions:

What should this return when the data.frame cols are identical? It's a balance of being informative (like a message saying "everything matches") vs. returning TRUE or NULL so that if you're running it on a list of 100 data.frames, you can quickly write code to ID the ones that don't match, or maybe use in an assertr-style check.
Should there be two different functions for calling on two DFs vs. on a list of 3+? Or could the latter cover the former use case as well? Fine if under the hood it calls another function, I mean do we present to the user as two options and if so how does the output differ.
For comparing 3+ DFs, I suggest that instead of comparing them all to each other (the permutations will get crazy) they all be compared to the 1st one. Users can pass as the 1st DF a preferred one or empty DF with the spec they like.
If this result is a nested list (each of DFs 2-N), that will be tough for beginners to do anything with besides read in the console. I wonder if there's a friendlier output format... maybe use list-columns? Then the result could be a data.frame, one row for each data.frame in 2-N, with list-cols for "only in this DF", "in reference DF but not this one", and a data.frame of type differences. Then that could quickly be filtered for !is.na(type_mismatches) by a beginner.

All of these bullet points could be addressed by a function that returns a data.frame, with 3 list-cols and N-1 rows for N data.frames (b/c the 1st is the referent - or maybe it gets a dummy row). Then in the basic case of calling on two identical DFs, it returns a 1- or 2-row DF with all NAs.

maelle · 2018-02-10T21:34:52Z

See also vis_compare in https://github.com/ropensci/visdat

bfgray3 · 2018-02-13T03:39:53Z

My two cents for each of @sfirke's bullet points above:

I think returning TRUE when everything is golden is a good idea. We could also include a boolean verbose argument with a default value to control printing of messages.
This is an interesting question. For example, purrr provides two functions:

> identical(purrr::pmap(list(c(3,4), c(1,2)), `+`),
+           purrr::map2(c(3,4), c(1,2), `+`))
[1] TRUE

Good idea; I agree.
Good idea; I agree.

rgknight · 2018-02-13T04:03:25Z

I haven't followed the code questions, but Re: what should it return...

I still think that you need two functions:

A check_ version that returns true or false with a warning
An assert_ version that will fail if there are differences and otherwise invisibly return the dfs as a list

This lets you string the assert_ version in a pipe with bind_rows.

I use this check/assert separation a lot in my code and find it extremely useful.

sfirke · 2018-03-12T20:34:42Z

I just used this function again - so am convinced of its merits. Let's aim first for a function compare_dfs. It returns diagnostic info. That is most important.

Then if it feels worthwhile, easy to have check_bindable or same_df_structure that returns TRUE/FALSE depending on a clean result from compare_dfs. This function could be then put in a pipeline with assertr::verify(check_bindable(.)) - say after a purrr::map call to read a directory full of spreadsheets.

rgknight · 2018-03-14T20:01:12Z

I am currently pulling my hair out about a big map() procedure that includes a merge and recently introduced a type difference somewhere in the list of dfs... so yeah, worthwhile.

I wrote the original version of a function like this for Stata that was fairly popular back in the day called cfout that returned differences in values between datasets. This is what I would expected a function called compare_dfs to do.

I still think there are several types of output you would want, and therefore several functions in this family:

compare_types or something like that, that returns a long df with the actual differences in name / type. Or maybe a list with two elements: type differences and variable differences. Or maybe it's compare_names and compare_types as separate functions, where compare_types only compares the common vars? I think that may be better -- would be easy to write a wrapper for the two if you wanted.
compare_dfs that would output a df with the differences in values between two dfs
check_bindable as above, returns true if bindable using tidyverse rules, false if not with an informative warning
assert_bindable as above, invisibly returns the df if bindable, raises an informative error if not. The difference to wrapping assertr is that we could control the error message

bfgray3 · 2018-04-04T21:47:28Z

Check out this new repo: https://github.com/thomasp85/pearls.

sfirke · 2018-09-10T19:41:48Z

Thinking about this again. I woke up thinking about a data.frame result of compare_df_types that would have one row for each DF, one column for every variable, and the grid would be populated with the class in that DF.

See result here:

library(tibble)
x <- tibble(a = 1:3, b = "words")
y <- tibble(a = 4:5, c = as.factor("a factor"))
z <- tibble(a = 6:8, c = "more text")

# a function - not yet written - creates a data.frame with:
# first row = input data.frame names
# a column for each variable found in any of the data.frames,
# indicating the presence and class of the variable in each data.frame

# Here would tbe the result of compare_df_types(x, y, z)
result <- tibble(
  df_name = c("x", "y", "z"),
  a = "integer",
  b = c("character", NA, NA),
  c = c(NA, "factor", "character")
)

> result
# A tibble: 3 x 4
  df_name a       b         c        
  <chr>   <chr>   <chr>     <chr>    
1 x       integer character NA       
2 y       integer NA        factor   
3 z       integer NA        character
 
# Most useful for finding mismatched columns, so allow that as an argument
# This function is a helper for compare_dfs allowing users to view only
# columns that are mismatched
hide_full_matches <- function(dat){
  count_unique <- vapply(dat, function(x) length(unique(x)), 0)
  single_value <- count_unique == 1
  dat[, !single_value]
}
 

hide_full_matches(result)
# A tibble: 3 x 3
  df_name b         c        
  <chr>   <chr>     <chr>    
1 x       character NA       
2 y       NA        factor   
3 z       NA        character

sfirke · 2018-09-10T19:42:21Z

Looks better in the console:

sfirke · 2018-09-10T19:43:46Z

I wonder if this df ^^^ is the way to go with storing the comparison info. It's good for the user as is. But we could also then use it as the building block for a diagnostic report, tooling that would make non-bindable dfs compatible, etc.

rgknight · 2018-09-10T19:56:06Z

When I wrote a function like this for Stata that had pretty wide adoption (cfout), I used the transpose of what you generated: one row per column with the columns being the df name. Are you sure you want columns for columns? Having them as rows would let you have 3 or 4 columns: df_1, df_2, is_identical, diference_type. Then you could classify differences in types versus missing columns, etc.

…

On Mon, Sep 10, 2018 at 3:43 PM Sam Firke ***@***.***> wrote: I wonder if this df ^^^ is the way to go with storing the comparison info. It's good for the user as is. But we could also then use it as the building block for a diagnostic report, tooling that would make non-bindable dfs compatible, etc. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#50 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AH-db1IMzD9Qzk3-_RAVxpn84n_EG2euks5uZsDygaJpZM4Jd5Od> .

billdenney · 2018-09-10T20:30:41Z

Another few spanners in the works for this one:

integer is always integer, character is always character, but factor is not always factor.
What about more complex types? POSIXct? POSIXlt? data.frame in a data.frame?

For 1:

What I mean is: In my use cases, I often have factors that may be subsets of one another, and I'd like to know if the final version will all work together.

More explicitly, the comparison df above by @sfirke would show these as the same when I'd like to know that they differ:

data.frame(A=factor(c("A", "B")))
data.frame(A=factor(c("C", "D")))

For 2:

The solution should probably be very clear about what is and is not tested. If a column doesn't fall into the canonical types, we may have issues.

Architecture

For the architecture, I agree that there are multiple solutions from a similar set of operations. I like @rgknight's list above with one exception that I think fits a different (though still very useful) scope.

The only exception in my view is compare_dfs() which I think is a value-to-value comparison that would attempt to be a Unix diff of the two data.frames (e.g. these rows were removed, these rows were inserted, these columns were removed, these columns were inserted, these columns were renamed (a harder problem), these cells were modified (also hard)). I think that this comparison is outside the scope of this issue (and is likely better for the compare package mentioned earlier or the daff package, https://cran.r-project.org/web/packages/daff/).

sfirke · 2018-09-11T13:57:00Z

Ryan that is a great call to put data.frames in the cols, column names in the rows. For lots of reasons, including easier implementation and there typically being a few DFs with many columns.

Bill, you raise good implementation concerns. I think this function would be useful as a minimal version that handles simple data types, then as we build it up we can consider how to approach POSIX, list-columns, etc. And agreed that comparing data values is out of scope and already covered by daff... a function name like compare_df_types will be more descriptive than the original compare_dfs.

billdenney · 2019-02-17T22:08:19Z

I had a need for this last week, and I came up with the following implementation. It has the following features:

It works with arbitrarily many data.frames and gets the names from the inputs automatically.
data.frame names are columns, data.frame columns are rows.
To be more tidyverse-friendly, it adds a column called "column_name" with the column names (note that this will cause issues if you have a data.frame named "column_name" that you pass in; maybe a check should be added).
It combines all the class names of the column, so that it gives full class information (sometimes classes may be combinable because they are instances of the same class, those will show as mismatches here).
For factors, it gives the levels and if they are ordered.
It can report all columns, only the columns that match, or only the columns that don't match (the third being the most useful in my use cases).
Class name creation is assigned by a generic function so that it can be overridden for classes that can be bound together (such as the POSIXt classes or numeric/integer).
Missing columns in one data frame which are present in another are given as NA. That would fail to bind with rbind(), but it would succeed with bind_rows(). You can choose the behavior with the bind_check argument.
A secondary function returns nrow(mismatching) == 0 for a boolean version indicating if success will occur or not.

I need this pretty often, and I'd be happy to make a PR for this if it feels like the right fit.

compare_df_types_class_detect <- function(x) {
  UseMethod("compare_df_types_class_detect")
}
compare_df_types_class_detect.factor <- function(x) {
  all_classes <- class(x)
  all_levels <- levels(x)
  level_text <- sprintf("levels=c(%s)", paste('"', levels(x), '"', sep="", collapse=", "))
  if (is.ordered(x)) {
    level_text <- paste0(level_text, ", ordered=TRUE")
  }
  factor_text <- sprintf("factor(%s)", level_text)
  mask_factor <- all_classes == "factor"
  if (!any(mask_factor)) {
    stop("Cannot handle a factor that does not have a class of factor.  Please report this as a bug with a reproducible example.")
  } else if (sum(mask_factor) != 1) {
    stop("More than one of the classes shows up as a factor.  Please report this as a bug with a reproducible example.")
  }
  all_classes[mask_factor] <- factor_text
  paste(all_classes, sep=", ")
}
compare_df_types_class_detect.default <- function(x) {
  all_classes <- class(x)
  paste(all_classes, sep=", ")
}

compare_df_types_df_maker <- function(x, class_colname="class") {
  ret <-
    data.frame(
      column_name=names(x),
      X=sapply(X=x, FUN=compare_df_types_class_detect),
      stringsAsFactors=FALSE
    )
  names(ret)[2] <- class_colname
  ret
}

compare_df_types <- function(..., return=c("all", "matches", "mismatches"), bind_check=c("rbind", "bind_rows")) {
  return <- match.arg(return)
  bind_check <- match.arg(bind_check)
  direct_names <- names(list(...))
  indirect_names <-
    setdiff(
      as.character(match.call(expand.dots=TRUE)), 
      as.character(match.call(expand.dots=FALSE))
    )
  if (is.null(direct_names)) {
    final_names <- indirect_names
  } else {
    final_names <- direct_names
    mask_replace <- final_names %in% ""
    final_names[mask_replace] <- indirect_names[mask_replace]
  }
  args <- list(...)
  ret <- compare_df_types_df_maker(args[[1]], class_colname=final_names[1])
  for (idx in (1+seq_len(length(args) - 1))) {
    ret <-
      merge(
        ret,
        compare_df_types_df_maker(args[[idx]], class_colname=final_names[idx]),
        by="column_name",
        all=TRUE
      )
  }
  if (return == "all" | ncol(ret) == 2) {
    if (return != "all") {
      warning("Only one data.frame provided, so all its classes are provided.")
    }
    ret
  } else {
    # Is this the best way to check for all row values to be equal?
    bind_check_fun <-
      list(
        rbind=function(idx) {
          all(unlist(ret[idx,3:ncol(ret)]) %in% ret[idx,2])
        },
        bind_rows=function(idx) {
          all(
            unlist(ret[idx,3:ncol(ret)]) %in%
              c(NA_character_,
                na.omit(unlist(ret[idx,2:ncol(ret)]))[1])
          )
        }
      )
    mask_match <-
      sapply(
        X=seq_len(nrow(ret)),
        FUN=bind_check_fun[[bind_check]]
      )
    if (return == "matches") {
      ret[mask_match,]
    } else if (return == "mismatches") {
      ret[!mask_match,]
    }
  }
}

compare_df_types_success <- function(..., return="mismatches", bind_check=c("rbind", "bind_rows"), verbose=TRUE) {
  return <- match.arg(return)
  bind_check <- match.arg(bind_check)
  ret <- compare_df_types(..., return=return, bind_check=bind_check)
  if (nrow(ret) & verbose) {
    print(ret)
  }
  nrow(ret) == 0
}

sfirke · 2019-04-11T18:45:01Z

@billdenney this is great! I think it accomplishes the purposes that have been discussed above over the years (!) in this issue. Want to add documentation, tests, and make a PR? Given that you have the code (and the functionality seems great to me), think we could squeeze it in for an April 21 submission to CRAN?

* Allow data.frame row-binding comparison (Fix #50) * Allow list inputs to `compare_df_types()` * Address code review comments * typo columne -> column * make bind_rows the default value of bind_method swapping it in for rbind, since this is a tidyverse-aligned package and my quick poll shows more peers using dplyr::bind_rows * re-describe new functions

sfirke · 2019-04-20T17:08:28Z

Thanks to everyone who contributed to this thread over the years! Having this merged and on its way to CRAN soon feels great. Vive le open-source!

almartin82 self-assigned this Aug 7, 2016

rgknight mentioned this issue Oct 23, 2016

Proposal for dups_ family of functions #75

Closed

sfirke added hop-right-in good first issue labels Feb 6, 2018

sfirke unassigned almartin82 Feb 6, 2018

bfgray3 mentioned this issue Feb 10, 2018

compare_dfs #179

Closed

sfirke changed the title ~~does janitor have tools for intelligently cleaning up data types?~~ function to compare data.frames that should be the same - variable names and col type consistency Feb 10, 2018

sfirke added seeking comments Users and any interested parties should please weigh in - this is in a discussion phase! and removed good first issue hop-right-in labels Mar 12, 2018

sfirke mentioned this issue Sep 21, 2018

Breaking off tabyl and adorn_ into their own package? #240

Open

sfirke added this to the v1.2 milestone Apr 11, 2019

sfirke mentioned this issue Apr 11, 2019

Submit v1.2 to CRAN #277

Closed

billdenney added a commit to billdenney/janitor that referenced this issue Apr 12, 2019

Allow data.frame row-binding comparison (Fix sfirke#50)

16dba69

billdenney mentioned this issue Apr 12, 2019

Allow data.frame row-binding comparison (Fix #50) #284

Merged

sfirke added pr-awaiting-review and removed seeking comments Users and any interested parties should please weigh in - this is in a discussion phase! labels Apr 17, 2019

sfirke closed this as completed in #284 Apr 20, 2019

sfirke removed the pr-awaiting-review label Apr 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

function to compare data.frames that should be the same - variable names and col type consistency #50

function to compare data.frames that should be the same - variable names and col type consistency #50

almartin82 commented Aug 5, 2016 •

edited

Loading

sfirke commented Aug 6, 2016

sfirke commented Aug 6, 2016

almartin82 commented Aug 7, 2016

sfirke commented Dec 23, 2016

sfirke commented Jan 23, 2017

sfirke commented Mar 2, 2017

rgknight commented Mar 2, 2017

sfirke commented Feb 6, 2018

sfirke commented Feb 10, 2018 •

edited

Loading

maelle commented Feb 10, 2018

bfgray3 commented Feb 13, 2018

rgknight commented Feb 13, 2018

sfirke commented Mar 12, 2018

rgknight commented Mar 14, 2018

bfgray3 commented Apr 4, 2018

sfirke commented Sep 10, 2018

sfirke commented Sep 10, 2018

sfirke commented Sep 10, 2018

rgknight commented Sep 10, 2018 via email

billdenney commented Sep 10, 2018 •

edited

Loading

sfirke commented Sep 11, 2018

billdenney commented Feb 17, 2019 •

edited

Loading

sfirke commented Apr 11, 2019

sfirke commented Apr 20, 2019

function to compare data.frames that should be the same - variable names and col type consistency #50

function to compare data.frames that should be the same - variable names and col type consistency #50

Comments

almartin82 commented Aug 5, 2016 • edited Loading

sfirke commented Aug 6, 2016

sfirke commented Aug 6, 2016

almartin82 commented Aug 7, 2016

sfirke commented Dec 23, 2016

sfirke commented Jan 23, 2017

sfirke commented Mar 2, 2017

rgknight commented Mar 2, 2017

sfirke commented Feb 6, 2018

sfirke commented Feb 10, 2018 • edited Loading

maelle commented Feb 10, 2018

bfgray3 commented Feb 13, 2018

rgknight commented Feb 13, 2018

sfirke commented Mar 12, 2018

rgknight commented Mar 14, 2018

bfgray3 commented Apr 4, 2018

sfirke commented Sep 10, 2018

sfirke commented Sep 10, 2018

sfirke commented Sep 10, 2018

rgknight commented Sep 10, 2018 via email

billdenney commented Sep 10, 2018 • edited Loading

For 1:

For 2:

Architecture

sfirke commented Sep 11, 2018

billdenney commented Feb 17, 2019 • edited Loading

sfirke commented Apr 11, 2019

sfirke commented Apr 20, 2019

almartin82 commented Aug 5, 2016 •

edited

Loading

sfirke commented Feb 10, 2018 •

edited

Loading

billdenney commented Sep 10, 2018 •

edited

Loading

billdenney commented Feb 17, 2019 •

edited

Loading