Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

offer other case options as an argument in clean_names #96

Closed
sfirke opened this issue Mar 13, 2017 · 16 comments
Closed

offer other case options as an argument in clean_names #96

sfirke opened this issue Mar 13, 2017 · 16 comments

Comments

@sfirke
Copy link
Owner

sfirke commented Mar 13, 2017

Suggested by @masalmon on Twitter. I only use snake_case, but it would be more inclusive of other coding styles to offer camelCase or maybe PascalCase. Perhaps wrap @Tazinho's snakecase package?

@maelle
Copy link

maelle commented Mar 13, 2017

For the record, I only use snake case but I can imagine other people using janitor and wishing that, not sure they would install a package just for changing column names (not undermining your great work @Tazinho !) so that it'd be easier for them to have it inside janitor?

@Tazinho
Copy link
Contributor

Tazinho commented Mar 13, 2017

Just some notes on that. It is super easy to get pascal- and camel-case once you, @sfirke have converted to clean snake_case. So it might still be useful to build it up on the already available clean_names() function, which has all the tests for valid data.frame names. Of course you can wrap snakecase, just let me know any requirementes. Also note that snakecase currently imports purrr, stringr and magrittr-pipes, but I could translate to a baseR version if this is an issue.

On the long run I think it would be best if the functionality would be implemented in stringr/stringi. However, at the moment it is not, which is as far as I know, because these packages claim more generality and it is not always obvious what the snake case result should look like.

@Tazinho
Copy link
Contributor

Tazinho commented May 19, 2017

Just wanted to let you know that snakecase is on cran now and it is planned to support a lot of stuff that might be interesting for you. See the issues for that or this talk:
https://www.youtube.com/watch?v=T6p0l8XzP64
Note also that there are (currently experimantally) options to fill empty names and make sure that names are not dublicated.
However, I am not sure if it is a good idea to reproduce exactly your testcases in the long run. If you have any further ideas what should be supported for your usecase, don't hesitate to let me know.

@maelle
Copy link

maelle commented May 19, 2017

Nice work! Since I have a special character in my name I like the related issue 😀

@sfirke
Copy link
Owner Author

sfirke commented May 23, 2017

@Tazinho congrats on the package being on CRAN! I have a data source at work that gives me camelCase and it looks like to_snake_case will get it more to my liking 👏

I think this would be nice to offer camelCase and PascalCase as options to clean_names, maybe a case = parameter. And I agree, it appears that once the current clean_names runs, the conversions with snakecase functions will be simple one-liners.

I like minimizing dependencies to the extent possible, but I don't think it'll be a dealbreaker here.
Though if we make #120 happen and that adds a dependency on stringi, janitor would then be requiring stringi and stringr (via snakecase) which seems a little odd as I think stringr is just a subset of stringi?

@Tazinho
Copy link
Contributor

Tazinho commented May 23, 2017

I will think about resolving stringr to stringi. In general the stringi/stringr dependency seems to be a really good thing. Tazinho/snakecase#46

@Tazinho
Copy link
Contributor

Tazinho commented Jul 11, 2017

I just finished a bigger update on the devversion of the snakecase package, including a change to the replace_special_characters argument, which addresses also the following issues:

#120 (transliteration of special characters)
Tazinho/snakecase#57 (stringr vs stringi dependency)

You can now do the following:

devtools::install_github("Tazinho/snakecase")
library(snakecase)
to_any_case("àngst häschen", case = "all_caps", 
            replace_special_characters = c("german", "Latin-ASCII"), 
            postprocess = "-")
[1] "ANGST-HAESCHEN"

So you can supply a string of combinations of entries from the stringi::stri_trans_list() and transliteration dictionaries (currently only "german"), which are implemented in the snakecase package. The dictionaries are necessary I think, because I often want to replace ö -> oe or ß to ss and this is currently not possible via stringi::stri_trans_general. It also gives the possibility to build dictionaries for other (specific) special charactersets, as I have seen that you'ld like to replace %, €, @ and so on with clean_names.

I was also thinking about the dependencies stringr and stringi. The problem is, that I only need stringi::stri_trans_list() and stringi::stri_trans_general. I don't think it is reasonable to ask if the stringr maintainers include these in stringr, since it is out of its scope. I also don't think it is reasonable to rewrite every usage of stringr I made via stringi. It is correct, that stringr mostly wrappes stringi, however, within stringr also a lot of error checking and logic is provided, so that it would just blow up the snakecase package.

Pls let me know your thoughts on that.
Best, Malte

@Tazinho
Copy link
Contributor

Tazinho commented Jul 21, 2017

Hi, since snakecase is ready for CRAN (when stringi is updated), I thought it is a good time for this issue and just played around a bit with your testcases. I tried to give you the same parsing (default) and case options from the snakecase pkg and a good default transliteration. At the same time I wanted to give you backward compatibility.

I think

  • the janitor specific transliteration of special characters like %, " etc. and the call to make.names() should stay at the beginning. (Within snakecase otherwise "can't" couldn't be translated to "cant". It would add an underscore and the best result would be "can_t".)
  • the parsing, the case conversion and the transliteration of characters should happen within to_any_case().
  • the numbering of duplicated entries should remain like before

Basically this looks like this:

### ____________________________________________________________________________
### LIBRARIES
> # devtools::install_github("Tazinho/snakecase")
> # library(snakecase)
> # library(magrittr)
### ____________________________________________________________________________
### OLD VERSION
> clean_names2 <- function(old_names){
+   new_names <- old_names %>%
+     gsub("'", "", .) %>% # remove quotation marks
+     gsub("\"", "", .) %>% # remove quotation marks
+     gsub("%", "percent", .) %>%
+     gsub("^[ ]+", "", .) %>%
+     make.names(.) %>%
+     gsub("[.]+", "_", .) %>% # convert 1+ periods to single _
+     gsub("[_]+", "_", .) %>% # fix rare cases of multiple consecutive underscores
+     tolower(.) %>%
+     gsub("_$", "", .) %>% # remove string-final underscores
+     stringi::stri_trans_general("latin-ascii") 
+   
+   # Handle duplicated names - they mess up dplyr pipelines
+   # This appends the column number to repeated instances of duplicate variable names
+   dupe_count <- sapply(1:length(new_names), function(i) { 
+     sum(new_names[i] == new_names[1:i]) })
+   
+   new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1],
+                                      dupe_count[dupe_count > 1],
+                                      sep = "_")
+   new_names
+ }
### ____________________________________________________________________________
### NEW VERSION
> clean_names3 <- function(old_names, case = "snake"){
+   new_names <- old_names %>%
+     gsub("'", "", .) %>% # remove quotation marks
+     gsub("\"", "", .) %>% # remove quotation marks
+     gsub("%", "percent", .) %>%
+     gsub("^[ ]+", "", .) %>%
+     make.names(.) %>%
+     # gsub("[.]+", "_", .) %>% # convert 1+ periods to single _
+     # gsub("[_]+", "_", .) %>% # fix rare cases of multiple consecutive underscores
+     # tolower(.) %>%
+     # gsub("_$", "", .) %>% # remove string-final underscores
+     # stringi::stri_trans_general("latin-ascii") 
+     to_any_case(case = case, preprocess = "\\.", 
+                 replace_special_characters = c("german", "Latin-ASCII"))
+   
+   # Handle duplicated names - they mess up dplyr pipelines
+   # This appends the column number to repeated instances of duplicate variable names
+   dupe_count <- sapply(1:length(new_names), function(i) { 
+     sum(new_names[i] == new_names[1:i]) })
+   
+   new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1],
+                                      dupe_count[dupe_count > 1],
+                                      sep = "_")
+   new_names
+ }
### ____________________________________________________________________________
### TEST CASES
> string <- c("sp ace", "repeated", "a**#@", "%", "#", "!",
+             "d(!)9", "REPEATED", "can\"'t", "hi_`there`", "  leading spaces",
+             "", "ação", "farœ", "r.stüdio:v.1.0.143")
### ____________________________________________________________________________
### COMPARISON
> clean_names2(string)
 [1] "sp_ace"             "repeated"           "a"                 
 [4] "percent"            "x"                  "x_2"               
 [7] "d_9"                "repeated_2"         "cant"              
[10] "hi_there"           "leading_spaces"     "x_3"               
[13] "acao"               "faroe"              "r_studio_v_1_0_143"
> clean_names3(string)
 [1] "sp_ace"              "repeated"            "a"                  
 [4] "percent"             "x"                   "x_2"                
 [7] "d_9"                 "repeated_2"          "cant"               
[10] "hi_there"            "leading_spaces"      "x_3"                
[13] "acao"                "faroe"               "r_stuedio_v_1_0_143"
> clean_names3(string, case = "all_caps")
 [1] "SP_ACE"              "REPEATED"            "A"                  
 [4] "PERCENT"             "X"                   "X_2"                
 [7] "D_9"                 "REPEATED_2"          "CANT"               
[10] "HI_THERE"            "LEADING_SPACES"      "X_3"                
[13] "ACAO"                "FAROE"               "R_STUEDIO_V_1_0_143"

Further I would add "_" around "percentage" during the replacement and switch from sapply to vapply. I think thats basically what can be done. Let me know what you think. If you like it, I will add a pull request within the next days after the next CRAN update.

Best,
Malte

@sfirke
Copy link
Owner Author

sfirke commented Jul 21, 2017

Thanks Malte! I'm tied up at the moment but am excited to dig into this. I'll take a thorough look through and reply within the next week.

@Tazinho
Copy link
Contributor

Tazinho commented Jul 21, 2017

Sounds good, thanks :)

@sfirke
Copy link
Owner Author

sfirke commented Aug 1, 2017

Err I am late w/ this but will play around with this tonight. And I think you're right about sapply -> vapply, good catch. I thought the way I did % -> percent usually led to okay underscore separation but I will explore that too.

@sfirke
Copy link
Owner Author

sfirke commented Aug 3, 2017

@Tazinho I played around with clean_names3() above, and 👍, please submit a pull request if you are willing! I am on board with having the case options to clean_names(), putting underscores around the % -> percent replacement, and replacing sapply with vapply.

I do wonder about this line: replace_special_characters = c("german", "Latin-ASCII"). I worry about it being too niche/specific. I see how it benefits users for whom ü -> ue is preferable to ü -> u. But then this is inconsistent with how other transliterated characters are handled, and depends on a list defined by a single user in your package (vs. some broader, publicly-defined mapping).

I'm not sure what my preference is: skip this transformation? Allow users to pass through their own lists? Something else? But wanted to note this as one part I'm unsure about.

@Tazinho
Copy link
Contributor

Tazinho commented Aug 3, 2017

This sounds super good. Thanks also for taking a deeper look into the package. I will try to handle all issues latest next week.

About replace_special_characters. I can totally understand. The issue is that I wanted to give it the full flexibility of stringi::stri_trans_general(), so all the chaining possibilities of transliterations from stringi::stri_trans_list(). The issue with this, beside the current windows encoding thing in stringi's CRAN version, is that it is just not possible to translate ü to ue. I tried all transliteration options in stringi::stri_trans_list(). There are some sources on the internet where this is handled somehow, but nothing convenient with stringi or R that I am aware of. Maybe this is something that should be an issue on the stringi side. gagolews/stringi#269

I'ld also like to mention, that there are some scandinavian letters, which are tranliterated by "Latin-ASCII" to two characters (so word length can change anyway).

So as a german I can say, that ü, ä, ö and ß are really annoying. I think the native transliteration to ue, ae, oe and ss is definitely clear. These characters are to my best knowledge also only in the german alphabet. So transliteration of these cases is always obvious, besides maybe in abbreviations and that ß is always small, which is a general problem for parsing.

One problem is that I currently only doing this only for german characters. So if you like the idea in general, there might be more transliteration lists in the future, which you might wanna add.

I would suggest that I implement for now only "Latin-ASCII". However I will handle specific local transliterations in the future, we can always add it to janitor package, when we are more clear about it.

The list thing is also a good idea and I might switch someday from character to list, but your users won't recognice it. Only if you give them the ... argument to get full support of snakecase. Which could be of interest...

@sfirke
Copy link
Owner Author

sfirke commented Sep 21, 2017

Wahoo, this is integrated - sorry it took me six months (!). Thanks @Tazinho for the package + patch and @maelle for suggesting this!

@sfirke sfirke closed this as completed Sep 21, 2017
@Tazinho
Copy link
Contributor

Tazinho commented Sep 22, 2017

Wuuhooo :) Partytime. Thanks also @sfirke and @maelle

image

@sfirke
Copy link
Owner Author

sfirke commented Jan 22, 2018

Came here to cheerfully report that my long-running woes of Windows tests failing b/c of encoding in the clean_names tests are over, thanks to the latest testthat release (thanks to r-lib/testthat#605 specifically).

My reward for revisiting this old issue is that I got to see this dinosaur GIF I'd missed the first time ^^^ 😀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants