
remove_empty only to last rows of last columns of data.frame #419

Closed
lucasmation opened this issue Jan 12, 2021 · 5 comments

Comments

@lucasmation

lucasmation commented Jan 12, 2021

Thanks for the amazing package.

In my application, I extract several tables that sit in the same spreadsheet (across hundreds of spreadsheets with a similar format, though some tables have more rows than others, since users add rows to tables as needed). I find the starting point of each table with regular expressions on its in-sheet column names, and take its end to be the row before the start of the subsequent table.

After extracting, several of the resulting tables (data.frames) have empty rows at the end (or empty columns on the right). Those are conceptually different from empty rows and columns in the middle of the table.

So I would like to be able to remove just the empty cells at the bottom and right margins of the table.
Perhaps the syntax could be something like:

remove_empty(remove_bottom=TRUE)

Edit 1: This is the solution I came up with to remove rows.

library(data.table)

## Aux function to remove all-NA rows below the table
remove_empty_row_last <- function(dt){
  dt <- copy(dt)  # work on a copy so the caller's table is not modified by reference
  dt[, row_empty := rowSums(is.na(dt)) == ncol(dt)]
  while (nrow(dt) > 0 && dt[.N, row_empty]) {
    dt <- dt[-.N]  # drop the last row while it is all-NA
  }
  dt[, row_empty := NULL]  # do not leave the helper column in the output
  dt
}

d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d)

It requires a data.table, but you could adapt it to a dplyr-based approach.
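For what it's worth, a dplyr-based adaptation of the same idea could look like this (a sketch; `remove_empty_row_last_df` is just an illustrative name, not a janitor function):

```r
library(dplyr)

# Keep every row up to the last non-empty one, so only the trailing
# run of all-NA rows is dropped; interior all-NA rows survive.
remove_empty_row_last_df <- function(df) {
  empty <- rowSums(is.na(df)) == ncol(df)
  last_kept <- max(which(!empty), 0)  # last non-empty row (0 if all rows are empty)
  df %>% slice(seq_len(last_kept))
}

d <- tibble(a = c(1, NA, 3, NA, 5, NA, NA),
            b = c(1, NA, 3, 4, 5, NA, NA))
remove_empty_row_last_df(d)  # drops rows 6-7, keeps the all-NA row 2
```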

EDIT 2: I cross-posted on this SO question

@billdenney
Collaborator

Thanks for the suggestion. My first impression is that this is rather special-purpose. (That is to say, I have not had a need for this in my workflow.)

  • Can you please describe the use case in a bit more detail: specifically, why is an all-NA row in the middle good to include while an all-NA row at the bottom is bad?
  • And how do you know that the first row that would be removed is not like a middle row? (In your example data: how do you know that row 6 should be removed while row 2 should be kept?)

If @sfirke finds it to be a good feature, I would suggest implementing it by expanding the which argument so that it would be possible to select any of c("rows", "cols", "rows top", "rows middle", "rows bottom", "cols left", "cols middle", "cols right"), where "rows" implies all of the rows and "cols" implies all of the columns (both overriding the more specific choices, with a warning, if both are given).
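As a sketch of what the "rows bottom" choice could mean (base R, not janitor's implementation; `remove_empty_bottom` is a hypothetical name):

```r
# Flag only the trailing run of all-NA rows: an empty row counts as
# "bottom" only if every row after it is also empty.
remove_empty_bottom <- function(dat) {
  empty <- rowSums(is.na(dat)) == ncol(dat)
  trailing <- rev(cumprod(rev(empty))) == 1
  dat[!trailing, , drop = FALSE]
}

dat <- data.frame(a = c(1, NA, 3, NA, NA), b = c(1, NA, 3, NA, NA))
remove_empty_bottom(dat)  # removes rows 4-5; the all-NA row 2 stays
```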

@lucasmation
Author

Ok. Maybe this is too special-purpose. Let me describe the case.

Let's say a spreadsheet (tab) has several tables, A, B, C. It is easy to identify the starting line of each table because each table starts with the column names. However, there is no consistent marker of the end of the tables. So I can define the end of table A as the row number for the start of table B minus 1. This will work, but table A will be imported with a bunch of empty rows at the end.

These empty rows at the end are conceptually different because I know they do not contain information. Empty rows in the middle may indicate some parsing error, typing error, etc. So, in my application, I want to differentiate between them.

Feel free to disregard this suggestion and close the issue.

@aghaynes

I would have a related, but slightly different suggestion... My use case is an extract from a database where the first few variables are identifiers of some sort and the rest are actual data. Suppose that some variables are conditional on something, in which case you would end up with something like:

id  v1  v2  v3
a   b   c   d
b
c

I need to keep the id column for merging with other data sets, so throwing it away isn't possible. But if remove_empty could be restricted to a subset of columns (via tidyselect, perhaps, or anything else), then I could do something like remove_empty(d, "rows", vars = starts_with("v")) to remove rows like b and c.

Admittedly, I'm not sure how relevant it would be to removing columns (maybe there are particular columns that should be retained regardless of them always being empty?)
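With current dplyr, the row half of this can already be approximated with `if_all()` and tidyselect (a sketch assuming the `v*` naming above; the `vars` argument itself remains hypothetical):

```r
library(dplyr)

d <- tibble(id = c("a", "b", "c"),
            v1 = c("b", NA, NA),
            v2 = c("c", NA, NA),
            v3 = c("d", NA, NA))

# Drop rows where every v* column is NA, regardless of the id column
d %>% filter(!if_all(starts_with("v"), is.na))  # keeps only row "a"
```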

@ljanda-amplify

I have this same need: there are some fully NULL columns that I want to retain and some that I want to drop (two different types of columns: one holds student responses, which could legitimately be fully null, while the other is a column we don't want if it is fully null). Adding something along the lines of vars = starts_with("v") would be useful!

@billdenney
Collaborator

This is related to #530, and that is where future discussion is planned to occur.

Thanks to all contributors to this issue. Overall, these requests seem difficult to generalize, except for restricting removal to specific columns (discussion of that moves to #530). Each seems useful, but each would require code specialized to its use case, and that specialization would add to the code base and API complexity.

If you have ideas for generalized use cases, please open a new, specific issue for those.

@billdenney closed this as not planned on Feb 10, 2023