[TASK] Big Reliability Epic #1870

Closed
14 tasks done
revans2 opened this issue Mar 4, 2021 · 3 comments
Labels
epic: Issue that encompasses a significant feature or body of work
P0: Must have for release
reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
task: Work required that improves the product but is not user facing
test: Only impacts tests

Comments

@revans2
Collaborator

revans2 commented Mar 4, 2021

We recently had an issue where contiguousSplit started to fail on 2GB partitions. We know that there are some issues with similar limits in shuffle (#45), but the unknown unknowns are more problematic, because without knowing what they are we cannot make informed decisions about prioritizing fixes.

We need to come up with a test plan to really hammer on size limits in both cudf and this plugin so we can have a better understanding of what limits exist and so we can come up with a proper plan to address them.
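As a rough illustration of what "hammering on size limits" could look like, here is a minimal Python sketch that generates byte sizes just below, at, and just above a suspected limit. It assumes the 2GB failures come from sizes or offsets stored as signed 32-bit integers, which this issue does not confirm; `boundary_sizes` and `fits_int32` are hypothetical helper names, not part of cudf or the plugin.

```python
# Assumption: the suspected limit is the largest signed 32-bit value.
INT32_MAX = 2**31 - 1


def boundary_sizes(limit=INT32_MAX, deltas=(-1, 0, 1)):
    """Return sizes just below, at, and just above `limit` (hypothetical helper)."""
    return [limit + d for d in deltas]


def fits_int32(n):
    """True if `n` can be stored as a non-negative signed 32-bit size."""
    return 0 <= n <= INT32_MAX


if __name__ == "__main__":
    # Each size would be fed to the operation under test (for example, a split
    # or a shuffle of a partition with that many bytes) to see where it breaks.
    for size in boundary_sizes():
        print(size, fits_int32(size))
```

A real test plan would pass each generated size to the operation under test and record which ones fail, so the limits become known rather than discovered in production.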

Avoid Crashes:

Highest priority:

Next on the list:

Test for new issues:

Auto Tune:

Better Error Reporting:

@revans2 revans2 added the "? - Needs Triage", "test", and "task" labels Mar 4, 2021
@sameerz sameerz added the "P0" label and removed the "? - Needs Triage" label Apr 13, 2021
@sameerz
Collaborator

sameerz commented May 28, 2021

Discussed. We need to break the larger work items down into tasks.

@sameerz sameerz added the "epic" label Jul 13, 2021
@sameerz
Collaborator

sameerz commented Jul 13, 2021

This is an epic that is being supported by other issues. Not specific to a release.

@sameerz sameerz changed the title from "[TASK] Figure out a testing plan for enterpriseiness" to "[TASK] Figure out a testing plan for enterprisiness" Mar 29, 2022
@revans2 revans2 added the "reliability" label Apr 12, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Apr 24, 2022
This PR is for NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870. It enables cuDF JNI to throw CUDA errors with specific error codes. This PR relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others.

With this improvement, it should be easier to track CUDA errors triggered by JVM APIs.

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #10551
@revans2
Collaborator Author

revans2 commented Apr 4, 2023

This became a dumping ground for a lot of reliability issues. I am going to rename this, remove everything that is not done, and then file new epics to track each of the individual issues.

@revans2 revans2 changed the title from "[TASK] Figure out a testing plan for enterprisiness" to "[TASK] Big Reliability Epic" Apr 4, 2023
@revans2 revans2 closed this as completed Apr 4, 2023