
clean_names function simplification #102

Closed
pgensler opened this issue Mar 27, 2017 · 5 comments
@pgensler

I was looking over the documentation for your clean_names function, and I think it can probably be simplified to something like this, using the stringr package:

library(stringr)
colnames(df) <- str_replace_all(colnames(df), "[:punct:]|[:space:]", "")

It would probably be easiest to get the column names from a dataframe, store them in a vector, do the transformations, and then re-assign the result.

This may look a little complicated, but essentially it strings together POSIX character classes with an "or" operator to capture pretty much any special character and replace it with nothing. You could define your own class with the brackets. Hope you find this useful.
http://www.regular-expressions.info/posixbrackets.html
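As a quick illustration of the POSIX classes in the pattern above (the column names here are made up for the example):

```r
# [:punct:] matches punctuation, [:space:] matches whitespace;
# the "|" alternation removes either. Requires stringr.
library(stringr)

x <- c("m p g", "c-y-l", "h/p", "DRAT")
str_replace_all(x, "[:punct:]|[:space:]", "")
#> [1] "mpg"  "cyl"  "hp"   "DRAT"
```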

@pgensler
Author

Probably should have written a pull request, but this is what I have simplified the function down to. I think stringr is definitely a good package to depend on, but please feel free to send feedback on this:

#Function to clean the column names of a df
#Goal: take any odd characters out of the dataset and standardize
#column names into tidy ones, using only _ as a separator


#Requires stringr, which is a wrapper around stringi but keeps the code
#simple and easy to read
library(stringr)

clean_names <- function(df) {
  # For each element of the names vector, do the following:
  #   if a name contains a punctuation or space character, replace it with _
  names(df) <- str_replace_all(names(df), "[:punct:]|[:space:]", "_")
  names(df) <- str_to_lower(names(df))
  return(df)
}


#Demo data: tbl_df() comes from dplyr
library(dplyr)
df <- tbl_df(mtcars)
names(df)[1] <- "m p g"
names(df)[2] <- "c-y-l"
names(df)[3] <- "h/p"
names(df)[4] <- "DRAT"
names(df)

#Correct way to use the function:
df1 <- clean_names(df)
names(df1)


# Unit tests:
# replace any special characters with a single underscore  
# replace cases where two underscores occur in a column name with _
# new: can also replace chars like slashes in column name
# 
# Not sure about:
# String-final underscores?
# Duplicate column names - can you provide an MWE?
# Could also add [:digit:] to remove any digits in column headings
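The double-underscore case from the notes above could be handled in base R with no extra dependency; a sketch, where `nm` stands in for a vector of already-substituted names:

```r
# Collapse runs of underscores to one, then trim leading/trailing ones.
# `nm` is a hypothetical vector of names after punctuation substitution.
nm <- c("m_p_g", "c__y__l", "_hp_", "drat")
nm <- gsub("_+", "_", nm)    # "c__y__l" -> "c_y_l"
nm <- gsub("^_|_$", "", nm)  # "_hp_"    -> "hp"
nm
#> [1] "m_p_g" "c_y_l" "hp"    "drat"
```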

The other approach would be to have the function modify the dataframe in place, but I have heard that is bad practice. I'm also wondering whether something like this should be submitted as a pull request to tidyr, as it seems like a task any analyst would run into. Let me know what you think.

@sfirke
Owner

sfirke commented Mar 28, 2017

Hi and thanks for sharing these thoughts. Besides looking at the underlying clean_names.R file, please review the tests for clean_names for the full list of cases the function addresses.

There are a lot of things I want to improve in janitor, but refactoring clean_names() is not on my radar. While the current function code could be made shorter, I don't think that would make it more readable; I like that the current approach makes a series of transparent, precise adjustments, as it seems less likely to introduce unforeseen consequences. And even if this terser stringr version passed all of the current tests (it does not), it would introduce another package dependency. While I am fond of stringr, I'd like to avoid unnecessary dependencies.

If you are looking for a way to contribute to this function, you could take a shot at #96?

@sfirke sfirke closed this as completed Mar 28, 2017
@pgensler
Author

Thanks for the feedback.
This revision was not meant to cover every use case, as noted in the "Not sure about" comments at the end of the code. I wrote it because the columns in my dataframe contain headers with /, which are not supported in your function. If you were to write a separate test for each special character in R (/, &, %, $, etc.), the function would be incredibly long, so this is really an effort to make it more readable.

I only suggested stringr because it would likely be faster than base gsub. Is there a particular reason you chose base R over another package? I understand dependencies come at a cost, but I think it would help with other issues, such as verifying the encoding of bad characters in column names. That said, I'd definitely be willing to take a look at #96 to see if I can assist with that issue.

@rgknight
Collaborator

Can you open a separate issue for "/" failing? That's a bug that should be fixed.

Speed isn't much of a concern here because you shouldn't have so many column names that it makes a difference. If you think clean_names is slow, feel free to share some benchmarking. I agree with Sam that refactoring to stringr is not at all necessary.

I could see [[:punct:]] having advantages because of its support for locales, but you would have to exclude "_", perhaps by first converting it to a space, since it should be an allowable character.
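A minimal sketch of that ordering in base R (the helper name make_clean is made up for illustration): protect existing underscores first, then strip punctuation, then turn whitespace back into underscores.

```r
# Sketch: turn "_" into a space first so the [[:punct:]] pass
# doesn't strip it; then collapse whitespace back to "_".
make_clean <- function(nm) {
  nm <- gsub("_", " ", nm)             # protect existing underscores
  nm <- gsub("[[:punct:]]", "", nm)    # drop remaining punctuation
  nm <- gsub("[[:space:]]+", "_", nm)  # spaces (old and new) -> "_"
  tolower(nm)
}
make_clean(c("m p g", "c-y-l", "h/p", "DRAT", "already_clean"))
#> [1] "m_p_g" "cyl" "hp" "drat" "already_clean"
```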

@pgensler
Author

@rgknight Agreed, and sorry for such a late response. Maybe something like a try-catch/switch could first detect any _, convert them to spaces, and then run the [[:punct:]] character class?
