Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: join_many_to_one() and join_one_to_one() #427

Open
billdenney opened this issue Feb 5, 2021 · 13 comments
Open

Feature Request: join_many_to_one() and join_one_to_one() #427

billdenney opened this issue Feb 5, 2021 · 13 comments

Comments

@billdenney
Copy link
Collaborator

I often need to perform join operations which confirm that the join is either 1:1 or many:1. I have written functions for this in my personal package, but I think that they may be of general interest. That said, they are a bit of a stretch for normal janitor functionality, so if you don't think it's a good fit, no worries.

@billdenney
Copy link
Collaborator Author

billdenney commented Apr 15, 2022

@sfirke , If I make a PR for this, do you think it would be of interest? (And no worries if you think it's out of scope.)

@sfirke
Copy link
Owner

sfirke commented Apr 15, 2022

Sorry I missed this the first time. Can you explain a little more? I'm reading that as a wrapper for dplyr::left_join that fails if a record from the left table is matched to more than one in the right table?

@billdenney
Copy link
Collaborator Author

You've effectively got it. For join_one_to_one() it is a wrapper for full_join() that ensures that each row in both the left and right arguments is matched exactly once. For join_many_to_one() it is a wrapper for left_join() that ensures that each row in the left side matches exactly one row in the right table (but, it does not require that all rows in the right table are matched).

In my personal package, I have a more complex join_control() function that lets you specify several kinds of requirements on the join like no missing rows, no duplicated rows, all rows must be present, etc. and you can combine the requirements.

@jzadra
Copy link
Contributor

jzadra commented Apr 15, 2022

If you don't put this in janitor, I'd love to see your personal code. I also run into these questions after a join. Looking at the row numbers before and after tells you something, but this sounds like it's much more specific.

@billdenney
Copy link
Collaborator Author

@jzadra , I just made a PR for them. Yes, the personal code is much more extensive in its testing. It creates a row number column for each of the input data.frames and then checks that the output row numbering is as-expected for missingness, uniqueness, etc.

@jzadra
Copy link
Contributor

jzadra commented Apr 15, 2022

@billdenney Sounds great. It would be cool if it also provided a summary of which variables were used in the join match that ended up producing multiple rows (and showing the value of those variables), or that didn't have a match and resulted in NAs for the variables from the table being added. Happy to help with this, I think I'd use it a lot.

@billdenney
Copy link
Collaborator Author

@jzadra , It does that summary reporting, too. 😄

@jzadra
Copy link
Contributor

jzadra commented Apr 15, 2022

@billdenney Sweet! Is your personal package public?

@billdenney
Copy link
Collaborator Author

It's kinda, sorta public. It is public in that it's on GitHub. You're welcome to check it out (it's my bsd.report repo). But, it should not be relied upon to be stable. I change things in it all the time and do not consider the API to be stable at all. And, I'm aware of the parts of it that are well-thought-through and tested vs the ones that are one-off hacks that got some job done for a given day.

@billdenney
Copy link
Collaborator Author

@jzadra , For this particular feature, I'd suggest going to PR #480.

@sda030
Copy link

sda030 commented Jun 24, 2022

Any reason why no suggesting this for the dplyr package? I think many people have similar issues when joining, and I prefer the idea of having all join functions in the same package - mentally easier to remember that one package does everything related to a topic rather than splitting across several. I also think that you join_control is the better selling point than the join_one_to_one(). Perhaps the latter convenience wrapper is better suited for jantitor?

@billdenney
Copy link
Collaborator Author

@sda030 , That's a good question. I thought I had proposed something like this in dplyr in the past, but I had not. I'll put a feeler out over there to see if it would be of interest for dplyr.

@billdenney
Copy link
Collaborator Author

@sda030 , And I've now added an issue over there to see if it would be of interest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants