Feature Request: A function for quick basic standardization of an otherwise tidy (almost) df #566
Comments
Hi and thanks for the suggestion! Here is my take on designing function(s) that tackle this problem.
As a user, my desired workflow would be to run a suite of examination checks, get a list of issues with suggested fixes, and then be able to specify a set of those fixes to be applied. I would not be comfortable applying a set of changes without review, and I expect in many cases the user will need to intervene manually after the examination function finds a possible issue. You might leverage an R package that does assertive checks for the inspection part. I think there's value in this, and in theory it's in scope for the janitor package, but:
In summary, I think this could be a valuable suite of tools, but a huge undertaking to do a rigorous and complete job. I'd suggest first making sure it doesn't already exist, then if it doesn't, iterating on this for a while. If you end up developing this, I'd be interested in taking a look - feel free to post back here and I'll see it. I'm going to close this issue as unplanned, but feel free to reply if you have a question or want to discuss further. Good luck with this effort!
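The examine-then-apply workflow described in this comment could look roughly like the sketch below. It is only an illustration in base R: `examine_cols_sketch()` and `apply_fixes_sketch()` are hypothetical names that do not exist in janitor, the only check implemented is "numbers stored as strings", and the sample values are made up.

```r
# Hypothetical helper (not part of janitor): report suspicious columns
# without changing anything.
examine_cols_sketch <- function(df) {
  empty <- data.frame(column = character(), issue = character(),
                      suggested_fix = character(), stringsAsFactors = FALSE)
  issues <- list()
  for (nm in names(df)) {
    x <- df[[nm]]
    if (!is.character(x)) next
    # Ignore real NAs and the string "NA" when judging the column.
    present <- x[!is.na(x) & x != "NA"]
    if (length(present) == 0) next
    if (!anyNA(suppressWarnings(as.numeric(present)))) {
      issues[[nm]] <- data.frame(column = nm,
                                 issue = "numbers stored as strings",
                                 suggested_fix = "as.numeric",
                                 stringsAsFactors = FALSE)
    }
  }
  if (length(issues) == 0) return(empty)
  out <- do.call(rbind, issues)
  rownames(out) <- NULL
  out
}

# Hypothetical helper: apply the suggested fix only to columns the user picked.
apply_fixes_sketch <- function(df, issues, columns) {
  chosen <- issues[issues$column %in% columns, , drop = FALSE]
  for (nm in chosen$column) {
    x <- df[[nm]]
    x[x %in% "NA"] <- NA_character_  # the string "NA" becomes a real NA
    df[[nm]] <- as.numeric(x)
  }
  df
}

# Made-up example data for illustration.
df <- data.frame(uid = c("001", "002"), col3 = c("1.5", "NA"),
                 stringsAsFactors = FALSE)
issues <- examine_cols_sketch(df)   # flags both uid and col3
fixed <- apply_fixes_sketch(df, issues, columns = "col3")  # user keeps uid as-is
```

Here the report flags `uid` as well, which is exactly the kind of case where a user would want to review before applying: converting it would drop the leading zeros.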
Thanks for your reply, @sfirke, this is helpful. I'll keep that in mind for the future, while I improve on this.
Hello,
I would like to propose the addition of a new function called standardize_tidy_df_cols(df, ...). This function is designed to standardize the columns of a tidy dataset. As we know, a tidy dataset is one that:
But even within a tidy data frame there may be a lot of frustrating issues. Similar to how the clean_names function standardizes column names, the proposed function that I am working on, standardize_tidy_df_cols(df, ...), will standardize the columns of a df which, although messy, follows the 3 principles of a tidy df as mentioned above.
For example, here is an otherwise tidy dataset, but with a lot of potential issues:
Here are some of the issues:
- The uid col has leading zeros, so it should not be coerced to numeric, as otherwise it will alter the numbers.
- Missing values appear both as NA and as the string "NA".
- col3, col4, col5 have numeric values stored in strings and are of class character, but should be of class numeric.
- col8 has numeric values stored in strings, with some scientific notation also stored in strings, and as a result is of class character, but should be of class numeric.
- There are logical columns, one that has logical values stored as strings and the other one as actual logical values.
The function that I am writing and currently improving fixes all of the above issues and has a parameter to preserve those columns on which the transformations should not be applied. In the above example, the uids can be kept as class character by specifying this in the function call. It also takes care of things like replacing "NA" with NA in a numeric column, and replacing "NA" with "" in a character column.
I can imagine people getting stuck on similar problems, which are not specific to my dataset but apply to a lot of datasets. Instead of coercing to the right types and fixing these one at a time, especially when there are 100s of columns, they can use this function first, see where they are, and hopefully it will take care of a lot of their issues.
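To make the idea concrete, here is a minimal base R sketch of the kind of behaviour described above. It is not the actual implementation: the name `standardize_cols_sketch()`, the `preserve` argument, the `flag` and `note` columns, and all sample values are assumptions made up for illustration.

```r
# Hypothetical sketch (not the real function, not part of janitor).
# Character columns are converted to numeric or logical when every
# non-missing value parses cleanly; columns named in `preserve` are untouched.
standardize_cols_sketch <- function(df, preserve = character()) {
  for (nm in setdiff(names(df), preserve)) {
    x <- df[[nm]]
    if (!is.character(x)) next
    x[x %in% "NA"] <- NA_character_          # the string "NA" counts as missing
    present <- x[!is.na(x)]
    has_values <- length(present) > 0
    num <- suppressWarnings(as.numeric(present))  # also parses "1e3"
    if (has_values && !anyNA(num)) {
      df[[nm]] <- as.numeric(x)
    } else if (has_values && all(toupper(present) %in% c("TRUE", "FALSE"))) {
      df[[nm]] <- as.logical(toupper(x))
    } else {
      x[is.na(x)] <- ""                      # character column: missing becomes ""
      df[[nm]] <- x
    }
  }
  df
}

# Made-up example data for illustration.
messy <- data.frame(
  uid  = c("001", "002", "003"),
  col3 = c("1.5", "NA", "3"),
  col8 = c("1e3", "2.5", NA),
  flag = c("TRUE", "false", "NA"),
  note = c("a", "NA", "b"),
  stringsAsFactors = FALSE
)
clean <- standardize_cols_sketch(messy, preserve = "uid")
str(clean)
```

Passing `preserve = "uid"` keeps the identifiers as character, so their leading zeros survive; without it, the sketch would convert them to numbers.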
This function of course doesn't fix everything and also requires that a data frame supplied to it should follow the tidy data principles, but in my experience, I have come across a lot of datasets that follow these principles, but then need much more tidying.
Do you think there is value in adding such a function to janitor?
Please let me know and we can chat, thanks!
Best,
Aarsh