
[FEA] Audit the existing code to see what is and what is not covered by the retry framework #8258

Open
revans2 opened this issue May 9, 2023 · 3 comments
Labels:
- epic: Issue that encompasses a significant feature or body of work
- reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
- task: Work required that improves the product but is not user facing

Comments

@revans2
Collaborator

revans2 commented May 9, 2023

Is your feature request related to a problem? Please describe.
We have covered a lot of operators in the retry framework, but not all of them, and split and retry is not yet supported everywhere. This issue is to go through the code for each operator and, wherever that code allocates GPU memory, determine:

  1. Is that allocation covered by the retry framework?
  2. If so, does the retry code support split and retry?
  3. If there is no split-and-retry coverage, would the algorithm support it? (Some algorithms inherently cannot without a substantial change to how they work.)

Once we have all of this information, we can file follow-on issues to cover anything that was missed.
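To make points 2 and 3 concrete, here is a minimal sketch of what "split and retry" means, written in Python for brevity with a simulated out-of-memory error. All names (`SimulatedGpuOom`, `with_split_and_retry`, `sum_rows`) are hypothetical and are not the spark-rapids retry API. A real framework would also try to spill or free memory and retry the same input before splitting; this sketch skips that step and goes straight to halving the input.

```python
from collections import deque
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class SimulatedGpuOom(Exception):
    """Hypothetical stand-in for a GPU out-of-memory error."""


def with_split_and_retry(batch: List[T], process: Callable[[List[T]], R]) -> List[R]:
    """Run `process` on `batch`, halving any piece that hits a (simulated) OOM.

    Returns one result per piece that eventually succeeded, in input order.
    """
    results: List[R] = []
    pending = deque([batch])
    while pending:
        piece = pending.popleft()
        try:
            results.append(process(piece))
        except SimulatedGpuOom:
            if len(piece) <= 1:
                # Nothing left to split: surface the failure to the caller.
                raise
            mid = len(piece) // 2
            # Push the second half first so the first half is processed next,
            # which keeps the results in input order.
            pending.appendleft(piece[mid:])
            pending.appendleft(piece[:mid])
    return results


def sum_rows(rows: List[int], max_rows: int = 4) -> int:
    """Hypothetical operator that 'runs out of memory' on more than max_rows rows."""
    if len(rows) > max_rows:
        raise SimulatedGpuOom(f"{len(rows)} rows exceed the budget of {max_rows}")
    return sum(rows)


partial_sums = with_split_and_retry(list(range(10)), sum_rows)
print(partial_sums)       # [1, 9, 11, 24]
print(sum(partial_sums))  # 45
```

Splitting works here only because partial sums can be combined afterwards. An operator whose result for one half cannot simply be merged with the result for the other half is the kind of case point 3 asks about.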

NOTE that all expressions should already be covered (at least everything except Random, which is the only non-deterministic expression that we support).
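A rough illustration of why a non-deterministic expression such as Random complicates retry, assuming a retry simply re-evaluates the expression from the same input: the failed attempt has already advanced the random number generator, so unless its state is checkpointed and restored, a retried task can return different values than an uninterrupted one. The names here (`SimulatedGpuOom`, `rand_with_retry`) are hypothetical, and this is a simplified model, not the plugin's implementation.

```python
import random
from typing import List


class SimulatedGpuOom(Exception):
    """Hypothetical stand-in for a GPU out-of-memory error."""


def rand_with_retry(rng: random.Random, num_rows: int, failures: int,
                    restore_state: bool) -> List[float]:
    """Hypothetical rand() projection that OOMs `failures` times before succeeding.

    If `restore_state` is True, the generator state is checkpointed before each
    attempt and rolled back on failure, so the retry redraws the same values.
    """
    while True:
        checkpoint = rng.getstate()
        try:
            values = [rng.random() for _ in range(num_rows)]
            if failures > 0:
                failures -= 1
                # The values were drawn, but the (simulated) allocation failed afterwards.
                raise SimulatedGpuOom()
            return values
        except SimulatedGpuOom:
            if restore_state:
                rng.setstate(checkpoint)


no_failure = rand_with_retry(random.Random(42), 5, failures=0, restore_state=False)
restored = rand_with_retry(random.Random(42), 5, failures=1, restore_state=True)
not_restored = rand_with_retry(random.Random(42), 5, failures=1, restore_state=False)
print(restored == no_failure)      # True
print(not_restored == no_failure)  # False
```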

@revans2 added the feature request, ? - Needs Triage, task, and reliability labels May 9, 2023
@mattahrens removed the ? - Needs Triage label May 11, 2023
@revans2
Collaborator Author

revans2 commented May 17, 2023

Looking at Spark 3.3.0, I see the following Execs that we need to look through and handle retry for. I will go through this list and update it with either follow-on issues to add full retry support or an indication that it is done.

@mattahrens removed the feature request label Jun 7, 2023
@firestarman
Collaborator

Hi, I saw that most of the follow-on issues don't have a roadmap or priority specified.

I will work on these issues, and it would be good to know their roadmap and priority. Or is following the order of the list here good enough?

@revans2
Collaborator Author

revans2 commented Sep 19, 2023

@firestarman I think the order here is fine. Be aware that the non-deterministic expressions for Project and Filter might be a little more difficult, so it is okay if you want to skip them for now.

@sameerz added the epic label Oct 17, 2023