
clean_names function simplification #102

Closed
pgensler opened this issue Mar 27, 2017 · 5 comments
@pgensler

I was looking over the documentation for your clean_names function, and I think it can probably be simplified to something like this, using the stringr package:

library(stringr)
colnames(df) <- str_replace_all(colnames(df), "[:punct:]|[:space:]", "")

It would probably be easiest to get the column names from a dataframe, store them in a vector, do the transformations, and then re-assign the result.

This may look a little complicated, but essentially it strings together POSIX character classes with an "or" operator to capture pretty much any special character and replace it with nothing. You could define your own class with the brackets. Hope you find this useful.
http://www.regular-expressions.info/posixbrackets.html
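As a quick illustration of the POSIX classes in the pattern above (the column names here are made up for the example):

```r
# [:punct:] matches punctuation, [:space:] matches whitespace;
# the "|" alternation removes either. Requires stringr.
library(stringr)

x <- c("m p g", "c-y-l", "h/p", "DRAT")
str_replace_all(x, "[:punct:]|[:space:]", "")
#> [1] "mpg"  "cyl"  "hp"   "DRAT"
```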

@pgensler
Author

Probably should have written a pull request, but this is what I have simplified the function down to. I think stringr is definitely a good package to depend on, but please feel free to send feedback on this:

#Function to clean the column names of a df
#Goal: take any odd characters out of the dataset and standardize
#column names into tidy ones, using only _ as a separator


#Requires stringr, which is a wrapper around stringi but keeps the code
#simple and easy to read
library(stringr)

clean_names <- function(df) {
  # For each element of the names vector, do the following:
  #   if a name contains a punctuation or space character, replace it with _
  names(df) <- str_replace_all(names(df), "[:punct:]|[:space:]", "_")
  names(df) <- str_to_lower(names(df))
  return(df)
}


#Demo data: tbl_df() comes from dplyr
library(dplyr)
df <- tbl_df(mtcars)
names(df)[1] <- "m p g"
names(df)[2] <- "c-y-l"
names(df)[3] <- "h/p"
names(df)[4] <- "DRAT"
names(df)

#Correct way to use the function:
df1 <- clean_names(df)
names(df1)


# Unit tests:
# replace any special characters with a single underscore  
# replace cases where two underscores occur in a column name with _
# new: can also replace chars like slashes in column name
# 
# Not sure about:
# String-final underscores?
# Duplicate column names - can you provide an MWE?
# Could also add [:digit:] to remove any digits in column headings
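The double-underscore case from the notes above could be handled in base R with no extra dependency; a sketch, where `nm` stands in for a vector of already-substituted names:

```r
# Collapse runs of underscores to one, then trim leading/trailing ones.
# `nm` is a hypothetical vector of names after punctuation substitution.
nm <- c("m_p_g", "c__y__l", "_hp_", "drat")
nm <- gsub("_+", "_", nm)    # "c__y__l" -> "c_y_l"
nm <- gsub("^_|_$", "", nm)  # "_hp_"    -> "hp"
nm
#> [1] "m_p_g" "c_y_l" "hp"    "drat"
```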

The other approach would be to have the function modify the dataframe in place, but I have heard that is bad practice. I'm also wondering whether something like this should be submitted as a pull request to tidyr, as it seems like a task any analyst would run into. Let me know what you think.

@sfirke
Owner

sfirke commented Mar 28, 2017

Hi and thanks for sharing these thoughts. Besides looking at the underlying clean_names.R file, please review the tests for clean_names for the full list of cases the function addresses.

There are a lot of things I want to improve in janitor, but refactoring clean_names() is not on my radar. While the current function code could be made shorter, I don't think that would make it more readable; I like that the current approach makes a series of transparent, precise adjustments, as it seems less likely to introduce unforeseen consequences. And even if this terser stringr version passed all of the current tests (it does not), it would introduce another package dependency. While I am fond of stringr, I'd like to avoid unnecessary dependencies.

If you are looking for a way to contribute to this function, you could take a shot at #96?

@sfirke sfirke closed this as completed Mar 28, 2017
@pgensler
Author

Thanks for the feedback.
This revision was not meant to cover every use case, as noted in the "Not sure about" comments at the end of the code. I wrote it because the columns in my dataframe contain headers with /, which are not supported in your function. If you were to write a separate test for each special character in R (/, &, %, $, etc.), the function would be incredibly long, so this is really an effort to make it more readable.

I only suggested stringr because it would likely be faster than base gsub. Is there a particular reason you chose base R over another package? I understand dependencies come at a cost, but I think it would help with other issues, such as verifying the encoding of bad characters in column names. That said, I'd definitely be willing to take a look at #96 to see if I can assist with that issue.

@rgknight
Collaborator

Can you open a separate issue for "/" failing? That's a bug that should be fixed.

Speed isn't much of a concern here because you shouldn't have so many column names that it makes a difference. If you think clean_names is slow, feel free to share some benchmarking. I agree with Sam that refactoring to stringr is not at all necessary.

I could see [[:punct:]] having advantages because of its support for locales, but you would have to exclude "_", perhaps by first converting it to a space, since it should be an allowable character.
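A minimal sketch of that ordering in base R (the helper name make_clean is made up for illustration): protect existing underscores first, then strip punctuation, then turn whitespace back into underscores.

```r
# Sketch: turn "_" into a space first so the [[:punct:]] pass
# doesn't strip it; then collapse whitespace back to "_".
make_clean <- function(nm) {
  nm <- gsub("_", " ", nm)             # protect existing underscores
  nm <- gsub("[[:punct:]]", "", nm)    # drop remaining punctuation
  nm <- gsub("[[:space:]]+", "_", nm)  # spaces (old and new) -> "_"
  tolower(nm)
}
make_clean(c("m p g", "c-y-l", "h/p", "DRAT", "already_clean"))
#> [1] "m_p_g" "cyl" "hp" "drat" "already_clean"
```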

@pgensler
Author

@rgknight Agreed, and sorry for such a late response. Maybe something like a try-catch/switch could first detect any _, convert them to spaces, and then run the [[:punct:]] character class?
