Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

function to compare data.frames that should be the same - variable names and col type consistency #50

Closed
almartin82 opened this issue Aug 5, 2016 · 24 comments · Fixed by #284
Milestone

Comments

@almartin82
Copy link
Collaborator

almartin82 commented Aug 5, 2016

here's my use case - I am writing functions to suck up n state export files.

I can't depend on read.csv's type hinting because I have situations where the nth data file will contain a data type (say, 'K' for grade, where grade had always been integer on the first 10 files) that doesn't play nicely.

To solve that, I'm reading the raw files in as character, and then doing type conversion myself.

But this seems like a janitor kind of job
image

Is this in scope/ out of scope? Any thoughts about how to move forward? @chrishaid would love your thoughts here as well

@sfirke
Copy link
Owner

sfirke commented Aug 6, 2016

I have this problem too. I got 20+ spreadsheets back that were supposed to use the same template, but some users had added or removed columns. Combine that with type mismatches and it took a lot of work to bind them together. I took the same approach as you, with reading them all as characters. It was not good.

The new readr::read_csv 1.0.0 that Hadley dropped yesterday is some help, since you can quickly save what you determine to be the correct column specs and then pass that to all other read_csv calls. But that doesn't work for rearranged or added/removed columns.

I've been thinking: a function that takes two data.frames and returns a description of how the 2nd is different than the 1st. Like:

27 Identical columns
2 columns of same name, different type: year (date vs. string), ID (int vs. string)
1 absent column: rating (str)
3 additional columns: school1 (int), school1_name (str), district (str)

You'd use this interactively to quickly compare data.frames and operate on them such that they can be bound.

What do you think? I'm not sure what format the result should be (showing both DFs vs. the leaner "how is df2 different than df1?", a data.frame vs. a list or text output).

@sfirke
Copy link
Owner

sfirke commented Aug 6, 2016

My typical headache is two data.frames that are supposed to have the same columns and types. I call bind_rows and it fails, telling me the first error. Then I fix one thing at a time, not having a sense of how long this cycle will go. Whipping up comparison function code on the fly is non-trivial.

@almartin82
Copy link
Collaborator Author

Agreed, this describes a lot of common data import problems.

One potential end product would be a function that:

  • takes a list of data frames
  • figures out the minimum common type per column
  • does the missing / additional column analysis
  • has rule-based options (return all / return common) for rbinding it all together.

@sfirke
Copy link
Owner

sfirke commented Dec 23, 2016

I think the first discrete step is a function check_bindable that returns an interactive report of what differences there are in data.frame variables: column names and types (and maybe ordering?). Purely diagnostic, fixing it is up to the user.

Other functions could:

  • spit out the universal column type specs that would work for a list of data.frames; you'd then go use that in your read_ calls to load the data.frames to that spec (Andrew's bullets 1-3 above);
  • force-rbind them together, allowing for user-specified options (Andrew's 4th bullet)

@sfirke
Copy link
Owner

sfirke commented Jan 23, 2017

I need to get back to development. Just tried to bind_rows two tables that should allow it and got:

Error in bind_rows_(x, .id) :
Can not automatically convert from numeric to character in column "In what field(s) of study is your Bachelor’s degree? (If your Bachelor’s degree is in more than one field, please select all that apply.) - Accounting".

Instead, because I haven't created this tool, I'll go try to fix column types one by one...

@sfirke
Copy link
Owner

sfirke commented Mar 2, 2017

I ran into this problem today when I actually had a little time to write this function. How about this as a starting point?

check_bindable <- function(df1, df2){ # should eventually take a list of dfs
  vars_in_df1_only <- names(df1)[!names(df1) %in% names(df2)]
  vars_in_df2_only <- names(df2)[!names(df2) %in% names(df1)]
  
  # subset DFs to only the same for class investigation
  common_vars <- Reduce(intersect, list(names(df1), names(df2)))
  df1_common <- df1[, common_vars]
  df2_common <- df2[, common_vars]

  if(sum(names(df1_common) == names(df2_common)) != ncol(df1_common)){stop("something is wrong, names in df1_common don't match names in df2_common")}

  df1_col_types <- lapply(df1_common, class) %>%
    lapply(`[[`, 1) %>% # get first class, in case of POSIX there are multiple
    unlist
  
  df2_col_types <- lapply(df2_common, class) %>%
    lapply(`[[`, 1) %>% # get first class, in case of POSIX there are multiple
    unlist
  
  col_mismatch_index <- df1_col_types != df2_col_types
  col_mismatches <- names(df1_common)[col_mismatch_index]
  col_mismatch_class1 <- df1_col_types[col_mismatch_index]
  col_mismatch_class2 <- df2_col_types[col_mismatch_index]
  
  # result is ragged, so return as a list
  list(
    vars_in_df1_only = vars_in_df1_only,
    vars_in_df2_only = vars_in_df2_only,
    column_class_mismatches = data_frame(variable = col_mismatches,
               class_in_df1 = col_mismatch_class1,
               class_in_df2 = col_mismatch_class2
               )
  )
}

check_bindable(mtcars %>%
                 mutate(cyl = as.factor(cyl),
                        new_var = "hi"),
               mtcars %>%
                 select(-mpg, -wt) %>%
                 rename(CARB = carb)
               )
#> $vars_in_df1_only
#> [1] "mpg"     "wt"      "carb"    "new_var"
#> 
#> $vars_in_df2_only
#> [1] "CARB"
#> 
#> $column_class_mismatches
#> # A tibble: 1 × 3
#>   variable class_in_df1 class_in_df2
#>      <chr>        <chr>        <chr>
#> 1      cyl       factor      numeric

@rgknight
Copy link
Collaborator

rgknight commented Mar 2, 2017

Looks like a good start, a couple of thoughts:

  1. check_ functions should return TRUE/FALSE only. In your example, it should return FALSE with a message or warning. There should maybe be a warn=TRUE option in case you want to turn the warning off to do some custom handling if it's FALSE.

  2. We will want an assert_bindable function as well that will stop() if it's not bindable.

  3. We only care about differences in type that are not bindable. I believe that vctrs has the canonical list of allowable type differences. We should ignore these differences. Coercion rules r-lib/vctrs#7

  4. There are two very different things happening here. Checking that the columns are the same and checking type differences. I think they may want to be different functions, with a parent function that calls both. Then if you only care about type differences you can call check_ whatever it's called, combinable or something.

  5. I don't know if check_bindable is quite right for the parent function because the column names don't need to be the same for it to be bindable, and I am often OK with column differences. I might use check_appendable to signal that it is more strict than required for bind_rows to pass, since appending in most languages would fail if there are a different number of columns.

  6. ensurer may have some good ideas here, see the ensure_as_template example. I don't think we'll want to do exactly that, but it may be worth a look if you haven't seen it. https://github.com/smbache/ensurer/blob/fea96587bac2a5abe2c44a5e6ee6e0ba90e21557/vignettes/ensurer.Rmd

@sfirke
Copy link
Owner

sfirke commented Feb 6, 2018

I've come back to use this function several times, which is a sign it's useful. Maybe "compare_dfs" is a better name for it. It fundamentally answers the question, "these two data.frames are supposed to have the same columns and types - do they?"

@sfirke
Copy link
Owner

sfirke commented Feb 10, 2018

There's a draft from @bfgray3 in #179, specific code feedback can go there but let's discuss the functionality in this issue.

My reactions:

  • What should this return when the data.frame cols are identical? It's a balance of being informative (like a message saying "everything matches") vs. returning TRUE or NULL so that if you're running it on a list of 100 data.frames, you can quickly write code to ID the ones that don't match, or maybe use in an assertr-style check.
  • Should there be two different functions for calling on two DFs vs. on a list of 3+? Or could the latter cover the former use case as well? Fine if under the hood it calls another function, I mean do we present to the user as two options and if so how does the output differ.
  • For comparing 3+ DFs, I suggest that instead of comparing them all to each other (the permutations will get crazy) they all be compared to the 1st one. Users can pass as the 1st DF a preferred one or empty DF with the spec they like.
  • If this result is a nested list (each of DFs 2-N), that will be tough for beginners to do anything with besides read in the console. I wonder if there's a friendlier output format... maybe use list-columns? Then the result could be a data.frame, one row for each data.frame in 2-N, with list-cols for "only in this DF", "in reference DF but not this one", and a data.frame of type differences. Then that could quickly be filtered for !is.na(type_mismatches) by a beginner.

All of these bullet points could be addressed by a function that returns a data.frame, with 3 list-cols and N-1 rows for N data.frames (b/c the 1st is the referent - or maybe it gets a dummy row). Then in the basic case of calling on two identical DFs, it returns a 1- or 2-row DF with all NAs.

@sfirke sfirke changed the title does janitor have tools for intelligently cleaning up data types? function to compare data.frames that should be the same - variable names and col type consistency Feb 10, 2018
@maelle
Copy link

maelle commented Feb 10, 2018

See also vis_compare in https://github.com/ropensci/visdat

@bfgray3
Copy link
Contributor

bfgray3 commented Feb 13, 2018

My two cents for each of @sfirke's bullet points above:

  • I think returning TRUE when everything is golden is a good idea. We could also include a boolean verbose argument with a default value to control printing of messages.
  • This is an interesting question. For example, purrr provides two functions:
> identical(purrr::pmap(list(c(3,4), c(1,2)), `+`),
+           purrr::map2(c(3,4), c(1,2), `+`))
[1] TRUE
  • Good idea; I agree.
  • Good idea; I agree.

@rgknight
Copy link
Collaborator

I haven't followed the code questions, but Re: what should it return...

I still think that you need two functions:

  1. A check_ version that returns true or false with a warning

  2. An assert_ version that will fail if there are differences and otherwise invisibly return the dfs as a list

This lets you string the assert_ version in a pipe with bind_rows.

I use this check/assert separation a lot in my code and find it extremely useful.

@sfirke
Copy link
Owner

sfirke commented Mar 12, 2018

I just used this function again - so am convinced of its merits. Let's aim first for a function compare_dfs. It returns diagnostic info. That is most important.

Then if it feels worthwhile, easy to have check_bindable or same_df_structure that returns TRUE/FALSE depending on a clean result from compare_dfs. This function could be then put in a pipeline with assertr::verify(check_bindable(.)) - say after a purrr::map call to read a directory full of spreadsheets.

@sfirke sfirke added seeking comments Users and any interested parties should please weigh in - this is in a discussion phase! and removed good first issue hop-right-in labels Mar 12, 2018
@rgknight
Copy link
Collaborator

I am currently pulling my hair out about a big map() procedure that includes a merge and recently introduced a type difference somewhere in the list of dfs... so yeah, worthwhile.

I wrote the original version of a function like this for Stata that was fairly popular back in the day called cfout that returned differences in values between datasets. This is what I would expected a function called compare_dfs to do.

I still think there are several types of output you would want, and therefore several functions in this family:

  • compare_types or something like that, that returns a long df with the actual differences in name / type. Or maybe a list with two elements: type differences and variable differences. Or maybe it's compare_names and compare_types as separate functions, where compare_types only compares the common vars? I think that may be better -- would be easy to write a wrapper for the two if you wanted.
  • compare_dfs that would output a df with the differences in values between two dfs
  • check_bindable as above, returns true if bindable using tidyverse rules, false if not with an informative warning
  • assert_bindable as above, invisibly returns the df if bindable, raises an informative error if not. The difference to wrapping assertr is that we could control the error message

@bfgray3
Copy link
Contributor

bfgray3 commented Apr 4, 2018

Check out this new repo: https://github.com/thomasp85/pearls.

@sfirke
Copy link
Owner

sfirke commented Sep 10, 2018

Thinking about this again. I woke up thinking about a data.frame result of compare_df_types that would have one row for each DF, one column for every variable, and the grid would be populated with the class in that DF.

See result here:

library(tibble)
x <- tibble(a = 1:3, b = "words")
y <- tibble(a = 4:5, c = as.factor("a factor"))
z <- tibble(a = 6:8, c = "more text")

# a function - not yet written - creates a data.frame with:
# first row = input data.frame names
# a column for each variable found in any of the data.frames,
# indicating the presence and class of the variable in each data.frame

# Here would tbe the result of compare_df_types(x, y, z)
result <- tibble(
  df_name = c("x", "y", "z"),
  a = "integer",
  b = c("character", NA, NA),
  c = c(NA, "factor", "character")
)

> result
# A tibble: 3 x 4
  df_name a       b         c        
  <chr>   <chr>   <chr>     <chr>    
1 x       integer character NA       
2 y       integer NA        factor   
3 z       integer NA        character
 
# Most useful for finding mismatched columns, so allow that as an argument
# This function is a helper for compare_dfs allowing users to view only
# columns that are mismatched
hide_full_matches <- function(dat){
  count_unique <- vapply(dat, function(x) length(unique(x)), 0)
  single_value <- count_unique == 1
  dat[, !single_value]
}
 

hide_full_matches(result)
# A tibble: 3 x 3
  df_name b         c        
  <chr>   <chr>     <chr>    
1 x       character NA       
2 y       NA        factor   
3 z       NA        character

@sfirke
Copy link
Owner

sfirke commented Sep 10, 2018

Looks better in the console:
image

@sfirke
Copy link
Owner

sfirke commented Sep 10, 2018

I wonder if this df ^^^ is the way to go with storing the comparison info. It's good for the user as is. But we could also then use it as the building block for a diagnostic report, tooling that would make non-bindable dfs compatible, etc.

@rgknight
Copy link
Collaborator

rgknight commented Sep 10, 2018 via email

@billdenney
Copy link
Collaborator

billdenney commented Sep 10, 2018

Another few spanners in the works for this one:

  1. integer is always integer, character is always character, but factor is not always factor.
  2. What about more complex types? POSIXct? POSIXlt? data.frame in a data.frame?

For 1:

What I mean is: In my use cases, I often have factors that may be subsets of one another, and I'd like to know if the final version will all work together.

More explicitly, the comparison df above by @sfirke would show these as the same when I'd like to know that they differ:

data.frame(A=factor(c("A", "B")))
data.frame(A=factor(c("C", "D")))

For 2:

The solution should probably be very clear about what is and is not tested. If a column doesn't fall into the canonical types, we may have issues.

Architecture

For the architecture, I agree that there are multiple solutions from a similar set of operations. I like @rgknight's list above with one exception that I think fits a different (though still very useful) scope.

The only exception in my view is compare_dfs() which I think is a value-to-value comparison that would attempt to be a Unix diff of the two data.frames (e.g. these rows were removed, these rows were inserted, these columns were removed, these columns were inserted, these columns were renamed (a harder problem), these cells were modified (also hard)). I think that this comparison is outside the scope of this issue (and is likely better for the compare package mentioned earlier or the daff package, https://cran.r-project.org/web/packages/daff/).

@sfirke
Copy link
Owner

sfirke commented Sep 11, 2018

Ryan that is a great call to put data.frames in the cols, column names in the rows. For lots of reasons, including easier implementation and there typically being a few DFs with many columns.

Bill, you raise good implementation concerns. I think this function would be useful as a minimal version that handles simple data types, then as we build it up we can consider how to approach POSIX, list-columns, etc. And agreed that comparing data values is out of scope and already covered by daff... a function name like compare_df_types will be more descriptive than the original compare_dfs.

@billdenney
Copy link
Collaborator

billdenney commented Feb 17, 2019

I had a need for this last week, and I came up with the following implementation. It has the following features:

  • It works with arbitrarily many data.frames and gets the names from the inputs automatically.
  • data.frame names are columns, data.frame columns are rows.
  • To be more tidyverse-friendly, it adds a column called "column_name" with the column names (note that this will cause issues if you have a data.frame named "column_name" that you pass in; maybe a check should be added).
  • It combines all the class names of the column, so that it gives full class information (sometimes classes may be combinable because they are instances of the same class, those will show as mismatches here).
  • For factors, it gives the levels and if they are ordered.
  • It can report all columns, only the columns that match, or only the columns that don't match (the third being the most useful in my use cases).
  • Class name creation is assigned by a generic function so that it can be overridden for classes that can be bound together (such as the POSIXt classes or numeric/integer).
  • Missing columns in one data frame which are present in another are given as NA. That would fail to bind with rbind(), but it would succeed with bind_rows(). You can choose the behavior with the bind_check argument.
  • A secondary function returns nrow(mismatching) == 0 for a boolean version indicating if success will occur or not.

I need this pretty often, and I'd be happy to make a PR for this if it feels like the right fit.

compare_df_types_class_detect <- function(x) {
  UseMethod("compare_df_types_class_detect")
}
compare_df_types_class_detect.factor <- function(x) {
  all_classes <- class(x)
  all_levels <- levels(x)
  level_text <- sprintf("levels=c(%s)", paste('"', levels(x), '"', sep="", collapse=", "))
  if (is.ordered(x)) {
    level_text <- paste0(level_text, ", ordered=TRUE")
  }
  factor_text <- sprintf("factor(%s)", level_text)
  mask_factor <- all_classes == "factor"
  if (!any(mask_factor)) {
    stop("Cannot handle a factor that does not have a class of factor.  Please report this as a bug with a reproducible example.")
  } else if (sum(mask_factor) != 1) {
    stop("More than one of the classes shows up as a factor.  Please report this as a bug with a reproducible example.")
  }
  all_classes[mask_factor] <- factor_text
  paste(all_classes, sep=", ")
}
compare_df_types_class_detect.default <- function(x) {
  all_classes <- class(x)
  paste(all_classes, sep=", ")
}

compare_df_types_df_maker <- function(x, class_colname="class") {
  ret <-
    data.frame(
      column_name=names(x),
      X=sapply(X=x, FUN=compare_df_types_class_detect),
      stringsAsFactors=FALSE
    )
  names(ret)[2] <- class_colname
  ret
}

compare_df_types <- function(..., return=c("all", "matches", "mismatches"), bind_check=c("rbind", "bind_rows")) {
  return <- match.arg(return)
  bind_check <- match.arg(bind_check)
  direct_names <- names(list(...))
  indirect_names <-
    setdiff(
      as.character(match.call(expand.dots=TRUE)), 
      as.character(match.call(expand.dots=FALSE))
    )
  if (is.null(direct_names)) {
    final_names <- indirect_names
  } else {
    final_names <- direct_names
    mask_replace <- final_names %in% ""
    final_names[mask_replace] <- indirect_names[mask_replace]
  }
  args <- list(...)
  ret <- compare_df_types_df_maker(args[[1]], class_colname=final_names[1])
  for (idx in (1+seq_len(length(args) - 1))) {
    ret <-
      merge(
        ret,
        compare_df_types_df_maker(args[[idx]], class_colname=final_names[idx]),
        by="column_name",
        all=TRUE
      )
  }
  if (return == "all" | ncol(ret) == 2) {
    if (return != "all") {
      warning("Only one data.frame provided, so all its classes are provided.")
    }
    ret
  } else {
    # Is this the best way to check for all row values to be equal?
    bind_check_fun <-
      list(
        rbind=function(idx) {
          all(unlist(ret[idx,3:ncol(ret)]) %in% ret[idx,2])
        },
        bind_rows=function(idx) {
          all(
            unlist(ret[idx,3:ncol(ret)]) %in%
              c(NA_character_,
                na.omit(unlist(ret[idx,2:ncol(ret)]))[1])
          )
        }
      )
    mask_match <-
      sapply(
        X=seq_len(nrow(ret)),
        FUN=bind_check_fun[[bind_check]]
      )
    if (return == "matches") {
      ret[mask_match,]
    } else if (return == "mismatches") {
      ret[!mask_match,]
    }
  }
}

compare_df_types_success <- function(..., return="mismatches", bind_check=c("rbind", "bind_rows"), verbose=TRUE) {
  return <- match.arg(return)
  bind_check <- match.arg(bind_check)
  ret <- compare_df_types(..., return=return, bind_check=bind_check)
  if (nrow(ret) & verbose) {
    print(ret)
  }
  nrow(ret) == 0
}

@sfirke
Copy link
Owner

sfirke commented Apr 11, 2019

@billdenney this is great! I think it accomplishes the purposes that have been discussed above over the years (!) in this issue. Want to add documentation, tests, and make a PR? Given that you have the code (and the functionality seems great to me), think we could squeeze it in for an April 21 submission to CRAN?

@sfirke sfirke added this to the v1.2 milestone Apr 11, 2019
billdenney added a commit to billdenney/janitor that referenced this issue Apr 12, 2019
@sfirke sfirke added pr-awaiting-review and removed seeking comments Users and any interested parties should please weigh in - this is in a discussion phase! labels Apr 17, 2019
sfirke pushed a commit that referenced this issue Apr 20, 2019
* Allow data.frame row-binding comparison (Fix #50)

* Allow list inputs to `compare_df_types()`

* Address code review comments

* typo columne -> column

* make bind_rows the default value of bind_method

swapping it in for rbind, since this is a tidyverse-aligned package and my quick poll shows more peers using dplyr::bind_rows

* re-describe new functions
@sfirke
Copy link
Owner

sfirke commented Apr 20, 2019

Thanks to everyone who contributed to this thread over the years! Having this merged and on its way to CRAN soon feels great. Vive le open-source!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants