
Using tabyl with index-numbers #317

Closed
gekess opened this issue Nov 25, 2019 · 11 comments

gekess commented Nov 25, 2019

Feature requests

I would like to use tabyl for recurring procedures with a map() function, looping over several variables in a data frame without naming them, but rather having the loop access the next variable in line. It is possible with table(), but I don't like table()'s properties and would rather do it with tabyl().

Here is an example with table():

map(mtcars[1:6], function(x) table(x,mtcars$gear))

With tabyl I would have to specify each tabulation, correct?

tabyl(mtcars, mpg, gear)
tabyl(mtcars, cyl, gear)
# etc.
tabyl(mtcars, wt, gear)

That seems tedious to me, and I would prefer a looping solution.

jzadra (Contributor) commented Nov 25, 2019

suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(janitor))


mtcars %>% 
  pivot_longer(-gear) %>% 
  group_split(name) %>% 
  map(function(x) {
    varname <- x %>% distinct(name) %>% pull(name) 
    
    x %>% 
      rename(!!varname := value) %>% 
      tabyl(!!sym(varname), gear)
  })
#> [[1]]
#>  am  3 4 5
#>   0 15 4 0
#>   1  0 8 5
#> 
#> [[2]]
#>  carb 3 4 5
#>     1 3 4 0
#>     2 4 4 2
#>     3 3 0 0
#>     4 5 4 1
#>     6 0 0 1
#>     8 0 0 1
#> 
#> [[3]]
#>  cyl  3 4 5
#>    4  1 8 2
#>    6  2 4 1
#>    8 12 0 2
#> 
#> [[4]]
#>   disp 3 4 5
#>   71.1 0 1 0
#>   75.7 0 1 0
#>   78.7 0 1 0
#>   79.0 0 1 0
#>   95.1 0 0 1
#>  108.0 0 1 0
#>  120.1 1 0 0
#>  120.3 0 0 1
#>  121.0 0 1 0
#>  140.8 0 1 0
#>  145.0 0 0 1
#>  146.7 0 1 0
#>  160.0 0 2 0
#>  167.6 0 2 0
#>  225.0 1 0 0
#>  258.0 1 0 0
#>  275.8 3 0 0
#>  301.0 0 0 1
#>  304.0 1 0 0
#>  318.0 1 0 0
#>  350.0 1 0 0
#>  351.0 0 0 1
#>  360.0 2 0 0
#>  400.0 1 0 0
#>  440.0 1 0 0
#>  460.0 1 0 0
#>  472.0 1 0 0
#> 
#> [[5]]
#>  drat 3 4 5
#>  2.76 2 0 0
#>  2.93 1 0 0
#>  3.00 1 0 0
#>  3.07 3 0 0
#>  3.08 2 0 0
#>  3.15 2 0 0
#>  3.21 1 0 0
#>  3.23 1 0 0
#>  3.54 0 0 1
#>  3.62 0 0 1
#>  3.69 0 1 0
#>  3.70 1 0 0
#>  3.73 1 0 0
#>  3.77 0 0 1
#>  3.85 0 1 0
#>  3.90 0 2 0
#>  3.92 0 3 0
#>  4.08 0 2 0
#>  4.11 0 1 0
#>  4.22 0 1 1
#>  4.43 0 0 1
#>  4.93 0 1 0
#> 
#> [[6]]
#>   hp 3 4 5
#>   52 0 1 0
#>   62 0 1 0
#>   65 0 1 0
#>   66 0 2 0
#>   91 0 0 1
#>   93 0 1 0
#>   95 0 1 0
#>   97 1 0 0
#>  105 1 0 0
#>  109 0 1 0
#>  110 1 2 0
#>  113 0 0 1
#>  123 0 2 0
#>  150 2 0 0
#>  175 2 0 1
#>  180 3 0 0
#>  205 1 0 0
#>  215 1 0 0
#>  230 1 0 0
#>  245 2 0 0
#>  264 0 0 1
#>  335 0 0 1
#> 
#> [[7]]
#>   mpg 3 4 5
#>  10.4 2 0 0
#>  13.3 1 0 0
#>  14.3 1 0 0
#>  14.7 1 0 0
#>  15.0 0 0 1
#>  15.2 2 0 0
#>  15.5 1 0 0
#>  15.8 0 0 1
#>  16.4 1 0 0
#>  17.3 1 0 0
#>  17.8 0 1 0
#>  18.1 1 0 0
#>  18.7 1 0 0
#>  19.2 1 1 0
#>  19.7 0 0 1
#>  21.0 0 2 0
#>  21.4 1 1 0
#>  21.5 1 0 0
#>  22.8 0 2 0
#>  24.4 0 1 0
#>  26.0 0 0 1
#>  27.3 0 1 0
#>  30.4 0 1 1
#>  32.4 0 1 0
#>  33.9 0 1 0
#> 
#> [[8]]
#>   qsec 3 4 5
#>  14.50 0 0 1
#>  14.60 0 0 1
#>  15.41 1 0 0
#>  15.50 0 0 1
#>  15.84 1 0 0
#>  16.46 0 1 0
#>  16.70 0 0 1
#>  16.87 1 0 0
#>  16.90 0 0 1
#>  17.02 1 1 0
#>  17.05 1 0 0
#>  17.30 1 0 0
#>  17.40 1 0 0
#>  17.42 1 0 0
#>  17.60 1 0 0
#>  17.82 1 0 0
#>  17.98 1 0 0
#>  18.00 1 0 0
#>  18.30 0 1 0
#>  18.52 0 1 0
#>  18.60 0 1 0
#>  18.61 0 1 0
#>  18.90 0 2 0
#>  19.44 1 0 0
#>  19.47 0 1 0
#>  19.90 0 1 0
#>  20.00 0 1 0
#>  20.01 1 0 0
#>  20.22 1 0 0
#>  22.90 0 1 0
#> 
#> [[9]]
#>  vs  3  4 5
#>   0 12  2 4
#>   1  3 10 1
#> 
#> [[10]]
#>     wt 3 4 5
#>  1.513 0 0 1
#>  1.615 0 1 0
#>  1.835 0 1 0
#>  1.935 0 1 0
#>  2.140 0 0 1
#>  2.200 0 1 0
#>  2.320 0 1 0
#>  2.465 1 0 0
#>  2.620 0 1 0
#>  2.770 0 0 1
#>  2.780 0 1 0
#>  2.875 0 1 0
#>  3.150 0 1 0
#>  3.170 0 0 1
#>  3.190 0 1 0
#>  3.215 1 0 0
#>  3.435 1 0 0
#>  3.440 1 2 0
#>  3.460 1 0 0
#>  3.520 1 0 0
#>  3.570 1 0 1
#>  3.730 1 0 0
#>  3.780 1 0 0
#>  3.840 1 0 0
#>  3.845 1 0 0
#>  4.070 1 0 0
#>  5.250 1 0 0
#>  5.345 1 0 0
#>  5.424 1 0 0

Created on 2019-11-25 by the reprex package (v0.3.0)

gekess (Author) commented Nov 27, 2019

Thanks jzadra,

I would like to understand your code better to improve my own skills: from what I gathered, you transform the data from wide to long format and then split the data frame into list elements, each containing "gear" plus the variable to be tabulated and its corresponding values.
Doesn't this approach leave behind the tidyverse idea of cases in rows and variables in columns? In base R it seemed practical to me to use indices for rows and columns. Then I read that Hadley W. doesn't approve of them, but in that post he didn't explain why, and dplyr seems to follow a different logic. Could you point me to some literature or posts where this issue is discussed and explained?

And as an addendum: would you be so kind as to add a short comment after each line explaining your rationale for it? What is the goal of your transformation?

jzadra (Contributor) commented Nov 27, 2019

Certainly!

Yes, going to long does go against the tidy data standard, but only for a moment, because in the next step we split the long format into separate tables that are each tidy again. Going wide to long is useful on many occasions when you want to functionalize something (e.g. using map) or vectorize some operation over multiple columns. Often you then return to wide afterwards.
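
A minimal sketch of that wide → long → wide round trip in isolation (illustrative only; the scaling step is just a placeholder for whatever operation you want applied to every column at once):

suppressPackageStartupMessages(require(tidyverse))

mtcars %>% 
  tibble::rownames_to_column("car") %>% 
  pivot_longer(-c(car, gear)) %>%                        # wide -> long: one row per car/variable pair
  group_by(name) %>% 
  mutate(value = as.numeric(scale(value))) %>%           # same operation applied to every variable at once
  ungroup() %>% 
  pivot_wider(names_from = name, values_from = value)    # back to the original wide shape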

The reason for not using indices, in my opinion, is that they are incredibly prone to breakage. If you change something in an earlier processing step that adds a column or a row, all of a sudden your indices aren't referring to what they used to refer to. When you refer to things by name, on the other hand, it doesn't matter where they move to.
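
To make that concrete (an illustrative sketch; df and df2 are made-up objects):

df  <- data.frame(mpg = mtcars$mpg, gear = mtcars$gear)
df2 <- cbind(id = seq_len(nrow(df)), df)   # an earlier step now adds an id column

df[[1]]       # mpg
df2[[1]]      # now the id column -- same index, different variable
df2[["mpg"]]  # still mpg, no matter where the column sits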

My general goal below is to avoid typing the names of all the columns we want to compare with gear. To do that I make a list of separate tables, each with a column for gear and a column for the other variable's values that has the same name in every list table: value (the default from pivot_longer; you could name it whatever you want). We also have a name column that we use to split into separate tables, and from which we can grab the name of the variable so we know what we're looking at later.


1. Reshape to long format, so that the gear column repeats alongside a name key column and a value column:

mtcars %>% 
  pivot_longer(-gear) %>% 

2. We want to end up with a separate table for each non-gear variable. group_split() generates a list of tibbles based on the name value:

  group_split(name) %>%

3. Now we write a function that does the same thing to each tibble in the list. Inside the function environment, x becomes the table object, the same as we'd get if we pulled one of the tables out of the list and assigned it to x:

  map(function(x) { 

4. Because we want to include the name of the variable we are crossing with gear, we extract it here from the column that holds it, to be used later. Since every row has the same name, distinct() gets us down to one row, and since we just want a character object, pull() grabs it out of the tibble. (Base R's split() used to name the list elements, which would have made this unnecessary. Hadley refuses to do this for group_split() and it isn't clear why, but they are adamant about it, so we are left with this small extra step.)

    varname <- x %>% distinct(name) %>% pull(name) 

5. Again, x is the tibble, but it has gear, name, and value columns. If we don't do the rename, the tabyls will all just show value by gear, so we won't know which variable is being crossed with gear. The rename is done using rlang, which is definitely more advanced R (you can find it in Hadley's Advanced R book). The !! says to look for the varname object outside of the x object, and := takes the place of = when we do an assignment whose left-hand side has to be evaluated first (i.e. we don't want the new name to literally be !!varname, we want it to be mpg, etc.). So the code needs to evaluate as rename(mpg = value), and we have to get at the value of varname first.

    x %>% 
      rename(!!varname := value) %>% 

6. Now, finally, the tabyl. If we hadn't renamed value, this part would simply be tabyl(value, gear). But since the other variable has a different name each time through the map loop, we need to use that variable's name in the code. Again, !! says to look for the varname object outside of x, and sym() converts the character object varname into a symbol. Just as we don't write tabyl("mpg", gear), where "mpg" would be taken as a literal string rather than an object, passing the raw varname would be the same as passing "mpg", when we actually want the bare, unquoted symbol mpg.

      tabyl(!!sym(varname), gear)
  })
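
If the tidy evaluation pieces (!!, sym(), :=) are new to you, the same ideas in isolation may be easier to follow. A small standalone sketch (varname and new_col are just illustrative objects, not part of the answer above):

suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(janitor))

varname <- "cyl"                       # a column name stored as a character string

# sym() turns the string into the bare symbol cyl and !! splices it in,
# so this evaluates as tabyl(mtcars, cyl, gear):
mtcars %>% 
  tabyl(!!sym(varname), gear)

# := lets the name on the left-hand side come from a variable,
# so this evaluates as mutate(cyl_label = as.character(cyl)):
new_col <- paste0(varname, "_label")
mtcars %>% 
  mutate(!!new_col := as.character(!!sym(varname))) %>% 
  select(cyl, !!sym(new_col)) %>% 
  head()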


Hope that helps!

sfirke (Owner) commented Nov 30, 2019

Thank you @jzadra for providing this thoughtful, detailed response! 🙌

sfirke closed this as completed Nov 30, 2019
gekess (Author) commented Nov 30, 2019

@jzadra : wonderful!!! Appreciate this a lot!


RachelK1994 commented Jun 24, 2021

Hi there!

I found this extremely helpful. I essentially replicated the code above for my own dataset (switching the order of variables in the table slightly) and it worked once, but since I tried to clean it up slightly I've repeatedly generated this error:

Error in :=(!!varname, value) :
Check that is.data.table(DT) == TRUE. Otherwise, := and :=(...) are defined for use in j, once only and in particular ways. See help(":=").

Any help would be hugely appreciated (@jzadra I wonder whether you have any quick thoughts?)!

Best,
Rachel

PS New to this forum so apologies in advance if I've missed some of the conventions.

library(readxl)
TPRdata <- read_excel("TPR-061121.xlsx")
View(TPRdata)
install.packages("janitor")
install.packages("reshape")
library(ggplot2)
library(lubridate)
library(tidyr)
library(tidyverse)
library(plyr)
library(dplyr)
library(glue)
library(purrr)
library(data.table)
library(reshape)
library(janitor)
Iffunction2 <- function(x) ifelse(x == "NA", "NA",ifelse(x == "-1", "Suppressed", ifelse(is.null(x), "Blank", ifelse(x == "Not Provided", "Not Provided", "Value"))))
TPRdata2 <- mutate_all(TPRdata, Iffunction2)
TPRdata2$state2 <- TPRdata$state
View(TPRdata2)
TPRdata2 %>% 
  pivot_longer(-state2) %>% 
  group_split(name) %>% 
  map(function(x) {
    varname <- x %>% distinct(name) %>% pull(name) 
    
    x %>% 
      rename(!!varname := value) %>% 
      tabyl(state2, !!sym(varname)) %>%
      adorn_percentages("row") %>%
      adorn_pct_formatting(digits = 2) %>%
      adorn_ns()
  })

sfirke (Owner) commented Jun 24, 2021

I can't say for sure, but it looks like := is used in the data.table package, and since you load that package after the tidyverse packages, it's calling the data.table version of that operator. Try not loading data.table and see if that fixes your problem.
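
A related thing worth trying (a hedged sketch, not a confirmed diagnosis): some of the other attached packages, such as plyr and reshape, also define a rename(), so masking could be involved. Calling the verbs with explicit namespaces removes any dependence on attach order. This version uses mtcars so it runs as-is with the tidyverse attached for %>%; swap in TPRdata2/state2 for the real data:

mtcars %>% 
  tidyr::pivot_longer(-gear) %>% 
  dplyr::group_split(name) %>% 
  purrr::map(function(x) {
    varname <- x %>% dplyr::distinct(name) %>% dplyr::pull(name) 
    
    x %>% 
      dplyr::rename(!!varname := value) %>%          # dplyr's rename() handles := via tidy evaluation
      janitor::tabyl(!!rlang::sym(varname), gear) %>% 
      janitor::adorn_percentages("row") %>% 
      janitor::adorn_pct_formatting(digits = 2) %>% 
      janitor::adorn_ns()
  })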

@RachelK1994

Thank you very much for getting back to me @sfirke! I've tried this, but unfortunately I then get the message "Error in :=(!!varname, value) : could not find function ":="". Is it just the dplyr package I need to load, or is there perhaps another package I'm missing (or perhaps I've loaded another redundant package that is causing the issue)?

Best, Rachel

sfirke (Owner) commented Jun 25, 2021 via email

jzadra (Contributor) commented Jun 25, 2021 via email

@mariahelmrich

I have spent days trying to figure this out! Your code worked for me, thank you!!
