figure out the API for tabyl() and helpers #101

Closed
sfirke opened this issue Mar 23, 2017 · 11 comments

@sfirke
Owner

sfirke commented Mar 23, 2017

This is scattered across other issues, consolidating it here.

For 1.0 I think we are headed toward:
Main function:

  • consolidate crosstab() into tabyl()
  • can't call it on one or two vectors anymore, you may only supply a data.frame and column name(s) (I still wonder if I'll miss this functionality, but the API is so much simpler).
    • what if you could call it on a single vector, or a data.frame and specify one or two vectors? Is that a compromise that keeps the most useful part of the vector method (a single var, more likely to be outside a data.frame) and eliminates the most confusing part (calling crosstab on two vectors)? I have trouble letting go of tabyl(a_vector)
  • one column name = frequency df (currently called a tabyl); two = a df currently called crosstab; three = list of crosstab dfs

Helper functions:
It would be nice to have them all as modules with a common prefix. The base input is a tabyl(df, col1, col2). Then you can do everything to it that you currently can with adorn_crosstab.

I think the obstacle to modularity is that the ordering of the steps is tricky. A function that calculates percentages on a data.frame without totals row/col will give nonsense values on one with totals. It could analyze the numeric values to decide whether the right and bottom vectors are column totals, but that seems dangerous and heavyweight; it could also take a user flag. Neither of those is as nice as the current adorn_crosstab which gets to do everything at once so applies the right steps and logic.
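A minimal base R sketch of the ordering problem described above. It assumes a plain count matrix built from mtcars rather than janitor's own output, and col_percents is a hypothetical name, not a janitor function. Once a totals row has been appended, the same percentage calculation halves every cell and reports each total as 50%:

```r
# 2-way counts of am by cyl as a plain integer matrix (base R only).
counts <- unclass(table(mtcars$am, mtcars$cyl))

# Hypothetical helper: divide every cell by its column sum.
col_percents <- function(m) sweep(m, 2, colSums(m), "/")

# On bare counts, each column sums to 1 as expected.
good <- col_percents(counts)

# With a totals row appended first, the column sums double, so every
# cell is halved and the totals row itself comes out as 0.5.
with_totals <- rbind(counts, Total = colSums(counts))
bad <- col_percents(with_totals)
```

Comparing good and bad cell by cell shows why a percentage helper has to know whether totals are already present.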

Similar issue with adding % signs, I think.

This is the genesis for the master helper function adorn_crosstab. But then sometimes you don't want % signs, you want numeric values you can plot or do calculations with. Thus ns_to_percents being exported.

Maybe ... a cleaner API to adorn_crosstab (call it adorn_tabyl?) that covers all helpers, so they don't need to be called individually? Could be a lot of possible arguments, but hey this is for getting fussy formatting right. Then don't export the sub-helpers if their functionality can be fully accessed, in any combination, through the master helper function?

Perhaps there is a way to implement the modularity in some clever specific order.

@sfirke
Owner Author

sfirke commented Mar 23, 2017

Right now helpers only pertain to the 2-way crosstab. Would they also work on a single-var tabyl? I like the current options set right now (with valid_percent popping up automatically when applicable), and a 1-way has simpler permutations, but something could be gained if one had the option of doing the same % sign formatting as in crosstab.

For the 3-way tabyl (a list of 2-way crosstabs), I think we'd expect the user to add the helpers the way they would with a 2-way, but using purrr::map.
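A rough base R sketch of that 3-way idea. Here lapply stands in for purrr::map, and a hand-built list of 2-way count matrices stands in for the 3-way tabyl; two_way and add_totals_row are hypothetical helpers for illustration only:

```r
# One 2-way count matrix (am x cyl) per level of gear. Fixed factor levels
# keep every matrix the same shape even when a combination is absent.
two_way <- function(d) {
  unclass(table(
    factor(d$am, levels = c(0, 1)),
    factor(d$cyl, levels = c(4, 6, 8))
  ))
}
three_way <- lapply(split(mtcars, mtcars$gear), two_way)

# Hypothetical helper that works on a single 2-way matrix.
add_totals_row <- function(m) rbind(m, Total = colSums(m))

# Apply the same helper to every element of the list, as purrr::map would.
three_way_totals <- lapply(three_way, add_totals_row)
```

The helper never needs to know it is being used on a 3-way result; it only ever sees one 2-way table at a time.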

@rgknight
Collaborator

We can think about whether this is feasible some more. I think it is but I don't know how to do it. We'd have to either beef up the tabyl class to provide some information about what adorn_ options have already been added or have tabyl create an environment that would contain that information. Or maybe make it lazily collect adornments then evaluate them together, but that's way outside of my ability.

Perhaps we could make an adornments() function, then you place the other adorn functions inside it so adornments() can evaluate what was supplied and apply it in a reasonable way? Some of the assertion packages take an approach like this to multiple assertions.
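One way the "collect lazily, then evaluate together" idea might look, as a base R sketch on a count matrix. Every name here (add_adornment, want_totals, want_percents, apply_adornments, the "pending" attribute) is hypothetical and only illustrates the mechanism:

```r
# Record a requested adornment on the table without applying it yet.
add_adornment <- function(tbl, step) {
  attr(tbl, "pending") <- union(attr(tbl, "pending"), step)
  tbl
}
want_totals <- function(tbl) add_adornment(tbl, "totals")
want_percents <- function(tbl) add_adornment(tbl, "percents")

# Apply whatever was requested in one fixed order: totals on the raw
# counts first, then row percentages over the result.
apply_adornments <- function(tbl) {
  pending <- attr(tbl, "pending")
  attr(tbl, "pending") <- NULL
  if ("totals" %in% pending) tbl <- rbind(tbl, Total = colSums(tbl))
  if ("percents" %in% pending) tbl <- sweep(tbl, 1, rowSums(tbl), "/")
  tbl
}

counts <- unclass(table(mtcars$am, mtcars$cyl))

# The order in which adornments are requested does not matter.
a <- apply_adornments(want_percents(want_totals(counts)))
b <- apply_adornments(want_totals(want_percents(counts)))
```

The cost is the extra evaluation step at the end of the chain, which the user has to remember to call.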

Maybe @chrishaid has thoughts?

@chrishaid
Collaborator

chrishaid commented Mar 23, 2017 via email

@sfirke
Owner Author

sfirke commented Mar 28, 2017

Clever ideas, both of you: attributes, or calling adorn_ functions in a list or similar so they get evaluated in the correct order.

The back-end to the latter (the list) sounds more complicated to write, and the user interface will be simpler without it. I haven't used attributes before but it looks like it could be pretty unobtrusive to add a "totals" attribute. After you've added a totals row or column, it's IMO unlikely you're doing much more with that data.frame than printing it, anyway, so most users shouldn't even notice.

Then we'd have:

  • adorn_totals
  • adorn_percentages (this has the option of keeping Ns, and adding % signs; making % signs optional makes ns_to_percents unnecessary). This function loses the show_totals argument.

Two more challenges:

  • Maybe we could break out adding the "%" sign to another adorn_ function, but that would require another attribute since it would have to get inserted into the middle of say "42.4% (32)".
  • There's a rounding argument in adorn_percents() right now that offers the unexported janitor function janitor:::round_half_up(). It seems cleanest to export this function, then allow users to call it with mutate_if(is.numeric) or similar and drop it from adorn_totals. But how could it fit into the pipeline of these functions? You would need to call it when calculating the percentages and before they are a character string combined with Ns. Could an attribute avoid having to call this from adorn_percentages?
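To make the rounding point concrete: base R's round() sends exact halves to the even neighbor, which is why a separate round-half-up function is useful. The version below is a sketch only (round_half_up_sketch is a hypothetical name, and janitor's unexported function may be implemented differently). It also shows why rounding has to happen while the percentages are still numeric, before they are pasted together with the Ns:

```r
# Round halves away from zero. The small epsilon guards against values such
# as 2.4999999 that are really 2.5 with floating point error.
round_half_up_sketch <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * trunc(abs(x) * scale + 0.5 + sqrt(.Machine$double.eps)) / scale
}

pct <- c(12.5, 37.5)
base_rounded <- round(pct)                    # halves go to the even neighbor
half_up_rounded <- round_half_up_sketch(pct)  # halves always go up

# Rounding must be applied before building the display string, because
# afterwards the value is character and can no longer be rounded.
cell <- paste0(round_half_up_sketch(12.5), "% (", 32, ")")
```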

@sfirke
Owner Author

sfirke commented Apr 7, 2017

I realized a problem with the totals attribute: even if it successfully passes info about totals row/col to subsequent functions like adorn_percents, that may be unintuitive to the user - that converting to %s at this point will still work even with totals in place - unless they read the docs.

What if we went bigger and attached the original tabyl data.frame to itself as an attribute? Then the steps can truly go in any order, getting closer to an MS Excel PivotTable. And it embraces the seeming oddness of the order of the steps not mattering; we'll put that aspect front and center in the examples.

So you could then say:

mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percents("row") %>% # "keep_N" would probably be an arg here
  adorn_percentage_sign %>%
  adorn_totals() %>%
  adorn_rounding("half to up")

With those function calls coming in any order, because they can all see the underlying tabyl data. I think to make the implementation easier, we'll also want attributes of what steps have been attached. E.g., adorn_percents needs to know if there are totals row/col on the tabyl; it's easier to tell that with an attribute than by trying to sniff it out from the actual data.frame that gets passed in.

Would it ruin using adorn_totals or adorn_percentages on non-tabyl inputs? I like having that option. Maybe in that case you first call a function as_tabyl() that attaches the attribute and makes a regular data.frame behave like a tabyl?

At that point is there a reason to extend tbl_df into a very similar class tabyl - then these functions look to see if it's that class? Or is that just making it more complicated.

A grander vision, but at least it feels like a more coherent API.

One more possible issue would be that attaching the data.frame to itself seems inefficient memory-wise - but the result of a call to tabyl is always going to be relatively small, thankfully. Probably good to have a simple function untabyl() that strips the tabyl class and attributes off, in case anyone finds them undesirable.
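A small base R sketch of the attach-the-counts idea and of stripping it back off. The function names, the "core" attribute, and the "tabyl_sketch" class are all assumptions made for illustration, not the eventual janitor implementation:

```r
# Attach an untouched copy of the counts to the data.frame and mark its class.
as_tabyl_sketch <- function(df) {
  stopifnot(is.data.frame(df))
  attr(df, "core") <- df
  class(df) <- c("tabyl_sketch", class(df))
  df
}

# Strip the extra class and attribute, leaving an ordinary data.frame.
untabyl_sketch <- function(tbl) {
  attr(tbl, "core") <- NULL
  class(tbl) <- setdiff(class(tbl), "tabyl_sketch")
  tbl
}

# A helper can always compute from the stored counts, whatever the visible
# cells currently contain, and carry the attribute forward.
show_row_percents <- function(tbl) {
  core <- attr(tbl, "core")
  out <- core / rowSums(core)
  attr(out, "core") <- core
  class(out) <- class(tbl)
  out
}

counts <- as.data.frame.matrix(table(mtcars$am, mtcars$cyl))
pct <- show_row_percents(as_tabyl_sketch(counts))
plain <- untabyl_sketch(pct)
```

Because the stored copy is only as large as the summary table itself, the memory overhead stays small, as noted above.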

@sfirke
Owner Author

sfirke commented Apr 10, 2017

I played around a bit with this... the evaluation is tricky, as if you allow these adornment helpers to get called in any order, the number of permutations quickly gets unwieldy. E.g., does adorn_totals need to understand what to do if % signs have been added vs. not?

I see two options:

  1. A set of nearly empty adorn_* helper functions that attach an attribute to a tabyl df, which might already have some other related attributes, then send that tabyl off to a master function for processing (which is similar to the current adorn_crosstab(), and itself calls non-exported sub-functions). This is like what @rgknight suggested above re: collecting the adornments in any order, then evaluating in a certain order. Feels decidedly not magical on the back end ... but maybe that's a nicer interface for the user than the current adorn_crosstab()? Basically we're just adding a layer in front of that function.

I have the start of this approach working.

  2. Restrict the order that these functions can get called in, enforced by checks on attributes and failing with helpful error messages like stop("cannot call adorn_totals after adorn_percentages") so the user can quickly switch the order of their pipes. Then, say, adorn_percentages only needs to worry about totals. I think the order would go totals, rounding, percentages, percentage_sign.

That is easier to program, I think, since say, adorn_percentage_sign can confirm with an attribute check that adorn_percentages has been called in which case it can wiggle the % sign in at the end, or before the parenthetical (n) if that option was chosen. With the safety of that input validation, it could just insert the % signs vs. rebuilding the whole thing from scratch. (Although now that I type this, I think adding the % sign could be an argument to adorn_percentages instead of its own function - but this point still holds for the other adorn_* helpers).

The latter is easier to implement. And as long as the error messages give the correct reordering, sacrificing the ability to apply these in any order seems minor. Maybe I will think about the theory of this - is there a sensible hierarchy of the order in which the steps should apply? And then give it a shot.
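A sketch of how the second option could be enforced with attributes, in base R on a count matrix. The _sketch functions and the "totals" and "percentages" attributes are hypothetical stand-ins, not the final janitor functions:

```r
# Totals must come before percentages, so refuse to run afterwards and
# tell the user how to fix the pipeline.
adorn_totals_sketch <- function(m) {
  if (isTRUE(attr(m, "percentages"))) {
    stop("cannot call adorn_totals_sketch() after adorn_percentages_sketch(); ",
         "swap the order of the two calls", call. = FALSE)
  }
  out <- rbind(m, Total = colSums(m))
  attr(out, "totals") <- TRUE
  out
}

# Row percentages; records that it ran and carries the totals flag forward.
adorn_percentages_sketch <- function(m) {
  out <- sweep(m, 1, rowSums(m), "/")
  attr(out, "totals") <- attr(m, "totals")
  attr(out, "percentages") <- TRUE
  out
}

counts <- unclass(table(mtcars$am, mtcars$cyl))

# Allowed order: totals, then percentages.
ok <- adorn_percentages_sketch(adorn_totals_sketch(counts))

# Disallowed order: capture the error message instead of stopping.
msg <- tryCatch(
  adorn_totals_sketch(adorn_percentages_sketch(counts)),
  error = function(e) conditionMessage(e)
)
```

Each helper only has to check a flag, rather than guess from the numbers whether an earlier step has already run.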

@sfirke
Owner Author

sfirke commented Apr 10, 2017

@chrishaid and @rgknight I welcome thoughts on that last big comment. And on this smaller one: tabyl sounds pretty close to tibble and now we're talking about making it a data.frame class... I think we should consider a name that will avoid confusion. Yes it's the result of a call to tabyl but perhaps the noun should be different.

I'm thinking maybe some imagery that captures the idea of an underlying counts table masked by lots of layered adornment attributes... something like light, or shadows, or makeup or disguise? It should describe an unadorned tabyl of crosstab counts too.

Should tabyl itself get renamed? I like that it competes with "table". Maybe the object name could be "twoway" or "two_way" or "two_way_df"?

@rgknight
Collaborator

rgknight commented Apr 10, 2017

I think you're on the right track with "%" format as an option to adorn_percent. I think the problem is trying to mix adorn_ functions that format with adorn_ functions that add new values. We might want to think about moving one of those entirely to options. Totals are hard. I think they might want to be an option, too, with an attribute that you can override that tells each subsequent call to also calculate the totals. Either that or create an attribute that has the format for each column so adorn_totals knows what to do.

Here's a couple of ideas

Totals as an option

mtcars %>%
  tabyl(am, cyl, totals = "row") %>%  # We will know to create and format totals with each call
  adorn_count() %>%
  adorn_percents("row") %>%
  adorn_percentage_sign %>%
  adorn_rounding("half to up")

Formatting as an option

mtcars %>%
  tabyl(am, cyl, rounding = 0) %>%  # define default rounding to zero decimals
  adorn_percents("row", format= "%") %>% 
  adorn_count(format= "()") %>%
  adorn_totals()

Or maybe there's even a utility somewhere to format like Excel, e.g., "0%,-0%,0%" would do the rounding, too, or you could use the sprintf formats. I don't think that adorn_totals() would need to be last, but it could obviously only know about functions applied before it was called. Maybe there's an adorn_simple() for quick access that adds both the count and the column % formatted as you do with crosstab.

Formatting as a function

mtcars %>%
  tabyl(am, cyl) %>% 
  adorn_metrics("N", "row_percent") %>%
  adorn_totals() %>%
  adorn_format(formats = c("()", "%"), rounding = 0)

I think I don't like this one because you have to match the format to the metric by position.

Conclusion

After playing with these options I think I like creating a format option for tabyl that stores the default format as an attribute. Each adorn_ can also accept the same format option to over-write the default. Probably using the sprintf format style since that's easier to implement and how you'd actually create it. Then people could add $ and whatnot. So the above / quick_tabyl version would be

mtcars %>%
  tabyl(am, cyl) %>% 
  adorn_percent(format = "%0.0f%%" ) %>%
  adorn_count(format = "(%0.0f)" ) %>%
  adorn_totals("row")

The sprintf formats are hard to understand though so maybe we also provide some named shortcuts for folks.
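For reference, this is how those sprintf formats behave in base R, along with one possible shape for named shortcuts. The format_shortcuts vector and its names are hypothetical, just one way the idea could be exposed:

```r
# Hypothetical named shortcuts mapping friendly names to sprintf formats.
format_shortcuts <- c(
  percent = "%0.0f%%",   # whole-number percentage with a literal % sign
  parens  = "(%0.0f)",   # whole number wrapped in parentheses
  dollars = "$%0.2f"     # two decimals with a leading $
)

pct <- 42.4
n <- 32

# Combine a formatted percentage with a formatted count.
cell <- paste(
  sprintf(format_shortcuts[["percent"]], pct),
  sprintf(format_shortcuts[["parens"]], n)
)
# cell is "42% (32)"

price <- sprintf(format_shortcuts[["dollars"]], 3.5)
# price is "$3.50"
```

Note that a literal percent sign has to be doubled as %% inside the format string, which is part of what makes these formats hard to read.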

@sfirke
Owner Author

sfirke commented Oct 3, 2017

Feeling so close to ready to merge in a huge PR that implements tabyl 1.0. What else could it be called: a 1-way, 2-way, or 3-way contingency/count table that has its values attached as an attribute?

Is there something better than tabyl? That one has ambiguous pronunciation (I say it "table" but then it's indistinguishable from table()) and is awfully close to tibble.

@sfirke
Owner Author

sfirke commented Oct 3, 2017

  • enumerate - descriptive, but not a good noun, and used in ~100 packages
  • abacus - not the name of other CRAN functions ... but pretty obscure
  • reckon - terrible as a noun

I'll buy a beverage of your choice for anyone who comes up with a better noun/verb than "tabyl" that I implement.

@sfirke
Owner Author

sfirke commented Oct 9, 2017

Still open to better names than "tabyl" - but I can't think of one myself.

I think the API is pretty set. My velocity on janitor is pretty slow, and I've implemented the best approach I saw, as best I could. Time to let it into the wild.
