offer other case options as an argument in clean_names #96

sfirke · 2017-03-13T13:18:38Z

Suggested by @masalmon on Twitter. I only use snake_case, but it would be more inclusive of other coding styles to offer camelCase or maybe PascalCase. Perhaps wrap @Tazinho's snakecase package?

maelle · 2017-03-13T13:22:35Z

For the record, I only use snake case but I can imagine other people using janitor and wishing that, not sure they would install a package just for changing column names (not undermining your great work @Tazinho !) so that it'd be easier for them to have it inside janitor?

Tazinho · 2017-03-13T14:04:36Z

Just some notes on that. It is super easy to get pascal- and camel-case once you, @sfirke have converted to clean snake_case. So it might still be useful to build it up on the already available clean_names() function, which has all the tests for valid data.frame names. Of course you can wrap snakecase, just let me know any requirementes. Also note that snakecase currently imports purrr, stringr and magrittr-pipes, but I could translate to a baseR version if this is an issue.

On the long run I think it would be best if the functionality would be implemented in stringr/stringi. However, at the moment it is not, which is as far as I know, because these packages claim more generality and it is not always obvious what the snake case result should look like.

Tazinho · 2017-05-19T17:32:18Z

Just wanted to let you know that snakecase is on cran now and it is planned to support a lot of stuff that might be interesting for you. See the issues for that or this talk:
https://www.youtube.com/watch?v=T6p0l8XzP64
Note also that there are (currently experimantally) options to fill empty names and make sure that names are not dublicated.
However, I am not sure if it is a good idea to reproduce exactly your testcases in the long run. If you have any further ideas what should be supported for your usecase, don't hesitate to let me know.

maelle · 2017-05-19T19:12:33Z

Nice work! Since I have a special character in my name I like the related issue 😀

sfirke · 2017-05-23T12:59:15Z

@Tazinho congrats on the package being on CRAN! I have a data source at work that gives me camelCase and it looks like to_snake_case will get it more to my liking 👏

I think this would be nice to offer camelCase and PascalCase as options to clean_names, maybe a case = parameter. And I agree, it appears that once the current clean_names runs, the conversions with snakecase functions will be simple one-liners.

I like minimizing dependencies to the extent possible, but I don't think it'll be a dealbreaker here.
Though if we make #120 happen and that adds a dependency on stringi, janitor would then be requiring stringi and stringr (via snakecase) which seems a little odd as I think stringr is just a subset of stringi?

Tazinho · 2017-05-23T13:10:23Z

I will think about resolving stringr to stringi. In general the stringi/stringr dependency seems to be a really good thing. Tazinho/snakecase#46

Tazinho · 2017-07-11T10:48:15Z

I just finished a bigger update on the devversion of the snakecase package, including a change to the replace_special_characters argument, which addresses also the following issues:

#120 (transliteration of special characters)
Tazinho/snakecase#57 (stringr vs stringi dependency)

You can now do the following:

devtools::install_github("Tazinho/snakecase")
library(snakecase)
to_any_case("àngst häschen", case = "all_caps", 
            replace_special_characters = c("german", "Latin-ASCII"), 
            postprocess = "-")
[1] "ANGST-HAESCHEN"

So you can supply a string of combinations of entries from the stringi::stri_trans_list() and transliteration dictionaries (currently only "german"), which are implemented in the snakecase package. The dictionaries are necessary I think, because I often want to replace ö -> oe or ß to ss and this is currently not possible via stringi::stri_trans_general. It also gives the possibility to build dictionaries for other (specific) special charactersets, as I have seen that you'ld like to replace %, €, @ and so on with clean_names.

I was also thinking about the dependencies stringr and stringi. The problem is, that I only need stringi::stri_trans_list() and stringi::stri_trans_general. I don't think it is reasonable to ask if the stringr maintainers include these in stringr, since it is out of its scope. I also don't think it is reasonable to rewrite every usage of stringr I made via stringi. It is correct, that stringr mostly wrappes stringi, however, within stringr also a lot of error checking and logic is provided, so that it would just blow up the snakecase package.

Pls let me know your thoughts on that.
Best, Malte

Tazinho · 2017-07-21T13:31:11Z

Hi, since snakecase is ready for CRAN (when stringi is updated), I thought it is a good time for this issue and just played around a bit with your testcases. I tried to give you the same parsing (default) and case options from the snakecase pkg and a good default transliteration. At the same time I wanted to give you backward compatibility.

I think

the janitor specific transliteration of special characters like %, " etc. and the call to make.names() should stay at the beginning. (Within snakecase otherwise "can't" couldn't be translated to "cant". It would add an underscore and the best result would be "can_t".)
the parsing, the case conversion and the transliteration of characters should happen within to_any_case().
the numbering of duplicated entries should remain like before

Basically this looks like this:

### ____________________________________________________________________________
### LIBRARIES
> # devtools::install_github("Tazinho/snakecase")
> # library(snakecase)
> # library(magrittr)
### ____________________________________________________________________________
### OLD VERSION
> clean_names2 <- function(old_names){
+   new_names <- old_names %>%
+     gsub("'", "", .) %>% # remove quotation marks
+     gsub("\"", "", .) %>% # remove quotation marks
+     gsub("%", "percent", .) %>%
+     gsub("^[ ]+", "", .) %>%
+     make.names(.) %>%
+     gsub("[.]+", "_", .) %>% # convert 1+ periods to single _
+     gsub("[_]+", "_", .) %>% # fix rare cases of multiple consecutive underscores
+     tolower(.) %>%
+     gsub("_$", "", .) %>% # remove string-final underscores
+     stringi::stri_trans_general("latin-ascii") 
+   
+   # Handle duplicated names - they mess up dplyr pipelines
+   # This appends the column number to repeated instances of duplicate variable names
+   dupe_count <- sapply(1:length(new_names), function(i) { 
+     sum(new_names[i] == new_names[1:i]) })
+   
+   new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1],
+                                      dupe_count[dupe_count > 1],
+                                      sep = "_")
+   new_names
+ }
### ____________________________________________________________________________
### NEW VERSION
> clean_names3 <- function(old_names, case = "snake"){
+   new_names <- old_names %>%
+     gsub("'", "", .) %>% # remove quotation marks
+     gsub("\"", "", .) %>% # remove quotation marks
+     gsub("%", "percent", .) %>%
+     gsub("^[ ]+", "", .) %>%
+     make.names(.) %>%
+     # gsub("[.]+", "_", .) %>% # convert 1+ periods to single _
+     # gsub("[_]+", "_", .) %>% # fix rare cases of multiple consecutive underscores
+     # tolower(.) %>%
+     # gsub("_$", "", .) %>% # remove string-final underscores
+     # stringi::stri_trans_general("latin-ascii") 
+     to_any_case(case = case, preprocess = "\\.", 
+                 replace_special_characters = c("german", "Latin-ASCII"))
+   
+   # Handle duplicated names - they mess up dplyr pipelines
+   # This appends the column number to repeated instances of duplicate variable names
+   dupe_count <- sapply(1:length(new_names), function(i) { 
+     sum(new_names[i] == new_names[1:i]) })
+   
+   new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1],
+                                      dupe_count[dupe_count > 1],
+                                      sep = "_")
+   new_names
+ }
### ____________________________________________________________________________
### TEST CASES
> string <- c("sp ace", "repeated", "a**#@", "%", "#", "!",
+             "d(!)9", "REPEATED", "can\"'t", "hi_`there`", "  leading spaces",
+             "€", "ação", "farœ", "r.stüdio:v.1.0.143")
### ____________________________________________________________________________
### COMPARISON
> clean_names2(string)
 [1] "sp_ace"             "repeated"           "a"                 
 [4] "percent"            "x"                  "x_2"               
 [7] "d_9"                "repeated_2"         "cant"              
[10] "hi_there"           "leading_spaces"     "x_3"               
[13] "acao"               "faroe"              "r_studio_v_1_0_143"
> clean_names3(string)
 [1] "sp_ace"              "repeated"            "a"                  
 [4] "percent"             "x"                   "x_2"                
 [7] "d_9"                 "repeated_2"          "cant"               
[10] "hi_there"            "leading_spaces"      "x_3"                
[13] "acao"                "faroe"               "r_stuedio_v_1_0_143"
> clean_names3(string, case = "all_caps")
 [1] "SP_ACE"              "REPEATED"            "A"                  
 [4] "PERCENT"             "X"                   "X_2"                
 [7] "D_9"                 "REPEATED_2"          "CANT"               
[10] "HI_THERE"            "LEADING_SPACES"      "X_3"                
[13] "ACAO"                "FAROE"               "R_STUEDIO_V_1_0_143"

Further I would add "_" around "percentage" during the replacement and switch from sapply to vapply. I think thats basically what can be done. Let me know what you think. If you like it, I will add a pull request within the next days after the next CRAN update.

Best,
Malte

sfirke · 2017-07-21T15:57:12Z

Thanks Malte! I'm tied up at the moment but am excited to dig into this. I'll take a thorough look through and reply within the next week.

Tazinho · 2017-07-21T17:50:25Z

Sounds good, thanks :)

sfirke · 2017-08-01T20:58:41Z

Err I am late w/ this but will play around with this tonight. And I think you're right about sapply -> vapply, good catch. I thought the way I did % -> percent usually led to okay underscore separation but I will explore that too.

sfirke · 2017-08-03T14:07:09Z

@Tazinho I played around with clean_names3() above, and 👍, please submit a pull request if you are willing! I am on board with having the case options to clean_names(), putting underscores around the % -> percent replacement, and replacing sapply with vapply.

I do wonder about this line: replace_special_characters = c("german", "Latin-ASCII"). I worry about it being too niche/specific. I see how it benefits users for whom ü -> ue is preferable to ü -> u. But then this is inconsistent with how other transliterated characters are handled, and depends on a list defined by a single user in your package (vs. some broader, publicly-defined mapping).

I'm not sure what my preference is: skip this transformation? Allow users to pass through their own lists? Something else? But wanted to note this as one part I'm unsure about.

Tazinho · 2017-08-03T18:43:20Z

This sounds super good. Thanks also for taking a deeper look into the package. I will try to handle all issues latest next week.

About replace_special_characters. I can totally understand. The issue is that I wanted to give it the full flexibility of stringi::stri_trans_general(), so all the chaining possibilities of transliterations from stringi::stri_trans_list(). The issue with this, beside the current windows encoding thing in stringi's CRAN version, is that it is just not possible to translate ü to ue. I tried all transliteration options in stringi::stri_trans_list(). There are some sources on the internet where this is handled somehow, but nothing convenient with stringi or R that I am aware of. Maybe this is something that should be an issue on the stringi side. gagolews/stringi#269

I'ld also like to mention, that there are some scandinavian letters, which are tranliterated by "Latin-ASCII" to two characters (so word length can change anyway).

So as a german I can say, that ü, ä, ö and ß are really annoying. I think the native transliteration to ue, ae, oe and ss is definitely clear. These characters are to my best knowledge also only in the german alphabet. So transliteration of these cases is always obvious, besides maybe in abbreviations and that ß is always small, which is a general problem for parsing.

One problem is that I currently only doing this only for german characters. So if you like the idea in general, there might be more transliteration lists in the future, which you might wanna add.

I would suggest that I implement for now only "Latin-ASCII". However I will handle specific local transliterations in the future, we can always add it to janitor package, when we are more clear about it.

The list thing is also a good idea and I might switch someday from character to list, but your users won't recognice it. Only if you give them the ... argument to get full support of snakecase. Which could be of interest...

sfirke · 2017-09-21T18:56:37Z

Wahoo, this is integrated - sorry it took me six months (!). Thanks @Tazinho for the package + patch and @maelle for suggesting this!

Tazinho · 2017-09-22T15:44:06Z

Wuuhooo :) Partytime. Thanks also @sfirke and @maelle

sfirke · 2018-01-22T21:25:27Z

Came here to cheerfully report that my long-running woes of Windows tests failing b/c of encoding in the clean_names tests are over, thanks to the latest testthat release (thanks to r-lib/testthat#605 specifically).

My reward for revisiting this old issue is that I got to see this dinosaur GIF I'd missed the first time ^^^ 😀

sfirke mentioned this issue Mar 28, 2017

clean_names function simplification #102

Closed

Tazinho mentioned this issue May 23, 2017

resolve stringr dependency to stringi instead Tazinho/snakecase#57

Closed

sfirke added good first issue hop-right-in labels May 26, 2017

Tazinho mentioned this issue Jul 11, 2017

clean_names could transliterate accented characters #120

Closed

sfirke mentioned this issue Aug 3, 2017

minor improvements to clean_names #130

Closed

Tazinho mentioned this issue Aug 3, 2017

integrate to janitor Tazinho/snakecase#90

Closed

Tazinho mentioned this issue Aug 16, 2017

Integrate snakecase-pkg into janitor #131

Merged

sfirke closed this as completed Sep 21, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

offer other case options as an argument in clean_names #96

offer other case options as an argument in clean_names #96

sfirke commented Mar 13, 2017

maelle commented Mar 13, 2017

Tazinho commented Mar 13, 2017

Tazinho commented May 19, 2017 •

edited

Loading

maelle commented May 19, 2017

sfirke commented May 23, 2017

Tazinho commented May 23, 2017 •

edited

Loading

Tazinho commented Jul 11, 2017 •

edited

Loading

Tazinho commented Jul 21, 2017 •

edited

Loading

sfirke commented Jul 21, 2017

Tazinho commented Jul 21, 2017

sfirke commented Aug 1, 2017

sfirke commented Aug 3, 2017

Tazinho commented Aug 3, 2017

sfirke commented Sep 21, 2017

Tazinho commented Sep 22, 2017

sfirke commented Jan 22, 2018

offer other case options as an argument in clean_names #96

offer other case options as an argument in clean_names #96

Comments

sfirke commented Mar 13, 2017

maelle commented Mar 13, 2017

Tazinho commented Mar 13, 2017

Tazinho commented May 19, 2017 • edited Loading

maelle commented May 19, 2017

sfirke commented May 23, 2017

Tazinho commented May 23, 2017 • edited Loading

Tazinho commented Jul 11, 2017 • edited Loading

Tazinho commented Jul 21, 2017 • edited Loading

sfirke commented Jul 21, 2017

Tazinho commented Jul 21, 2017

sfirke commented Aug 1, 2017

sfirke commented Aug 3, 2017

Tazinho commented Aug 3, 2017

sfirke commented Sep 21, 2017

Tazinho commented Sep 22, 2017

sfirke commented Jan 22, 2018

Tazinho commented May 19, 2017 •

edited

Loading

Tazinho commented May 23, 2017 •

edited

Loading

Tazinho commented Jul 11, 2017 •

edited

Loading

Tazinho commented Jul 21, 2017 •

edited

Loading