Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Diversify libcudf's exceptions to simplify exception handling #10200

Closed
vyasr opened this issue Feb 2, 2022 · 14 comments
Closed

[FEA] Diversify libcudf's exceptions to simplify exception handling #10200

vyasr opened this issue Feb 2, 2022 · 14 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@vyasr
Copy link
Contributor

vyasr commented Feb 2, 2022

Is your feature request related to a problem? Please describe.
Currently the vast majority of exceptions thrown by libcudf are cudf::logic_error due to the use of the CUDF_EXPECTS and CUDF_FAIL macros. These are in large part used for input validation. Aside from cudf::logic_error, the only other exceptions that are thrown with any frequency are std::bad_alloc or cuda_error. While the exception payloads are typically sufficiently informative for direct users of the C++ API, they are less useful for users of higher-level language libraries that rely on libcudf. In particular, many cuDF Python functions rely on parsing the exception payloads to translate C++ exceptions into suitable exception types on the Python side.

Describe the solution you'd like
It would be helpful for libcudf to make use of a wider range of standard exceptions to provide more informative exceptions for higher-level libraries to parse. From the Python perspective, it would be especially valuable if libcudf threw exceptions that matched the exceptions that Cython already knows how to translate into Python. While it's not clear that all of these would be useful, at least a few of them certainly could be. For example, a number of cases would be nicely handled if libcudf threw a std::invalid_argument rather than a cudf::logic_error since Cython will automatically translate the former into a Python ValueError. One way to accomplish this would be to allow CUDF_EXPECTS to accept a third parameter indicating the exception type to throw. An example of this already exists in rmm with the corresponding RMM_EXPECTS macro.

For cases where no existing mapping exists, it would be helpful if exception payloads were standardized in some way so that the messages could be parsed more easily. For instance, a very simple approach might be for CUDF_EXPECTS to stringify and embed the name of the exception into the message. It is also possible for the Python layer to introduce a custom exception handler, so we could do that if cudf introduces additional custom exceptions. I am open to systematically embedding additional information in the message as well if others think of useful ways to do so.

Describe alternatives you've considered
libcudf could also use some sort of logging framework rather than relying entirely on exceptions. Such a framework would be significantly more work to implement, but it would have the benefit of also offering additional information. cuDF Python uses warnings in some non-failing cases to provide information to users, and a C++ logging framework could provide information that we could translate to Python warnings.

Additional context
I'm not familiar with how our JNI code currently handles this or whether there are any utilities to simplify translation of C++ exceptions into Java. Ideally libcudf would adopt an exception handling pattern that is suitable for use with all of the higher level libraries that rely on it, so it would be valuable to get some feedback from someone with more experience with cudf's Java interface.

@vyasr vyasr added feature request New feature or request Needs Triage Need team to review and classify labels Feb 2, 2022
@vyasr
Copy link
Contributor Author

vyasr commented Feb 2, 2022

CC @jlowe @revans2 for a Java perspective. AFAICT everything in the JNI just uses a single CudfException class. I don't know if there's any automatic way to generate a wider set of Java exceptions from C++ exceptions through features of the Native Interface.

@jrhemstad
Copy link
Contributor

jrhemstad commented Feb 2, 2022

One way to accomplish this would be to allow CUDF_EXPECTS to accept a third parameter indicating the exception type to throw.

Note that RMM's version already has this: https://github.com/rapidsai/rmm/blob/0515ca47bc6abdc4619a099976bdbf059cbde624/include/rmm/detail/error.hpp#L87-L120

From the Python perspective, it would be especially valuable if libcudf threw exceptions that matched the exceptions that Cython already knows how to translate into Python.

I totally support this so long as we have a precise list of the exact exceptions you'd want libcudf to throw and precisely when you'd want them to be thrown.

it would be helpful if exception payloads were standardized in some way so that the messages could be parsed more easily

I do not support this. The what() message is not the correct channel to carry programmatic information. I think the fact that we are currently parsing exception message strings is a fundamentally flawed approach and we need to get away from that ASAP as opposed to leaning further into it.

It is also possible for the Python layer to introduce a custom exception handler, so we could do that if cudf introduces additional custom exceptions.

This is the correct approach. If you want to know about a situation that occurred that doesn't correspond to one of the exceptions Cython already knows about, then libcudf should introduce a custom exception type that cudf knows to catch that type.

@github-actions
Copy link

github-actions bot commented Mar 5, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Mar 24, 2022
@codereport
Copy link
Contributor

I have been looking into both C++ and Python exceptions. Here is a brief summary of the investigation.

C++ Exceptions: https://en.cppreference.com/w/cpp/error/exception
Python Exceptions: https://docs.python.org/3/library/exceptions.html#bltin-exceptions

There are 26 C++ exceptions and 42 Python exception (+ or - depending on how you count them). Before looking at the Cython mappings, I managed to line up 7 of their 11 mappings. I missed both ArithmeticErrors, both TypeErrors. The only error I "mapped" that Cython doesn't was the SystemError.

These are the recommended Cython mappings:
image

And this is an "exploded" version to highlight the missing C++ exceptions:

C++ Python
logic_error
invalid_argument ValueError
domain_error ValueError
length_error
out_of_range IndexError
future_error(C++11)
bad_optional_access(C++17)
runtime_error RuntimeError
range_error ArithmeticError
overflow_error OverflowError
underflow_error ArithmeticError
regex_error(C++11)
system_error(C++11) SystemError
ios_base::failure(C++11) IOError
filesystem::filesystem_error(C++17)
nonexistent_local_time(C++20)
ambiguous_local_time(C++20)
format_error(C++20)
bad_typeid TypeError
bad_cast TypeError
bad_any_cast(C++17)
bad_weak_ptr(C++11)
bad_function_call(C++11)
bad_alloc MemoryError
bad_array_new_length(C++11)
bad_exception
bad_variant_access(C++17)

Probably the place to start is just by tacking a single pairing, such as std::invalid_argument and ValueError.

@jrhemstad
Copy link
Contributor

Probably the place to start is just by tacking a single pairing

I think the place to start is to find where C++ exception what() messages are currently being parsed from Python in order to raise the proper Python exception. That kind of exception string parsing is very bad and we should seek to eliminate it ASAP.

@jlowe
Copy link
Member

jlowe commented Mar 29, 2022

One consideration that recently popped up on the Spark side is differentiating errors that are recoverable by reattempting the task vs. errors that are fatal to the process. For example, CUDA's cudaErrorIllegalAddress and some other CUDA errors will cause every subsequent CUDA API call to return an error and thus are effectively fatal to the process. It would be nice if there was a specific exception subclass/interface that could be caught for these type of exceptions.

Failing that, it would be nice to have an exception specific to CUDA errors that includes the CUDA error code so the application can discern these "fatal" errors when appropriate. Today CUDA errors are encoded in the exception string rather than accessible directly within the exception.

@jrhemstad
Copy link
Contributor

It would be nice if there was a specific exception subclass/interface that could be caught for these type of exceptions.

This would be nice, but it is harder than it sounds because CUDA doesn't really document all that well which errors are like this and I believe it can depend on the context in which they occurred.

it would be nice to have an exception specific to CUDA errors that includes the CUDA error code

Definitely. I could have sworn we already did that, but I guess not.

@jrhemstad
Copy link
Contributor

jrhemstad commented Mar 29, 2022

This would be nice, but it is harder than it sounds because CUDA doesn't really document all that well which errors are like this and I believe it can depend on the context in which they occurred.

Hm, one way we could hack around this is to call cudaGetLastError twice and if the second invocation doesn't return cudaSuccess (and returns the same error as the first) then it is nearly certain that a sticky error occurred. I say "nearly" because it is possible that some other asynchronous error occurred between the two calls to cudaGetLastError.

A heavier hammer would be to do a cudaDeviceSynchronize() after the cudaGetLastError().

So like this:

void throw_on_cuda_error(cudaError_t error){
   if(cudaSuccess != error){
       cudaGetLastError();                      // Clear any lingering cudart error
       auto const last = cudaGetLastError();    // Check if cudart error still exists (sticky)
       bool const is_sticky = (error == last) 
                              and (last == cudaDeviceSynchronize()); // Sync to ensure "last" wasnt an async error
       throw is_sticky ? sticky_error{""} : cudart_error{""}; 
   }
}

@vyasr
Copy link
Contributor Author

vyasr commented Mar 29, 2022

Probably the place to start is just by tacking a single pairing

I think the place to start is to find where C++ exception what() messages are currently being parsed from Python in order to raise the proper Python exception. That kind of exception string parsing is very bad and we should seek to eliminate it ASAP.

My observations of this kind of parsing is why I suggested starting with the invalid_exception->ValueError map. I suspect that most cases where we are currently doing this are where libcudf is doing a CUDF_FAIL on an input parameter, and that the most common case of those right now is mapping to a Python ValueError due to an invalid argument. I could also imagine there being some cases of OverflowError, IndexError, and/or TypeError popping up. A command like grep -C 5 -R "except RuntimeError" python/cudf/cudf --include="*.py" should give a reasonable starting point of places where we are catching a C++ generated RuntimeError and reraising as a different type of error.

@vyasr
Copy link
Contributor Author

vyasr commented Mar 29, 2022

it would be nice to have an exception specific to CUDA errors that includes the CUDA error code

Definitely. I could have sworn we already did that, but I guess not.

We do have cuda_error, which is thrown by CUDA_TRY. It sticks the error into the error message, though. Perhaps now would be a good time to address the TODO requesting that the error code be encoded in the exception itself?

@jrhemstad
Copy link
Contributor

I filed #10553 separately as it is orthogonal to the rest of the exception improvements.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

github-actions bot commented Aug 9, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapids-bot bot pushed a commit that referenced this issue Nov 8, 2022
We should not be encouraging users to rely specific error messages. Anywhere that is currently doing so is likely an indication that libcudf should be throwing a more specific type of exception instead of just a `cudf::logic_error`. This PR removes the testing utilities that were previously used for this purpose and reworks the relevant tests.

Related to #10200.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #12076
@vyasr vyasr assigned vyasr and unassigned codereport and vyasr Nov 29, 2022
rapids-bot bot pushed a commit that referenced this issue Dec 2, 2022
This change enables libcudf APIs throwing exceptions other than `cudf::logic_error`. This code was adapted from rmm. This contributes to #10200.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #12078
rapids-bot bot pushed a commit that referenced this issue Jan 5, 2023
This PR removes support for checking exception messages from `assert_exceptions_equal`. See #12076 for the same change made in C++, #10200 for more context, and #7917 for the change in cudf Python's developer documentation where we officially changed our policy to not consider exception payloads part of the stable API.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #12424
wence- added a commit to wence-/cudf that referenced this issue Feb 6, 2023
Partial progress towards rapidsai#10200, this will enable catching and
re-raising a useful overflow message in to_csv if the requested
dataframe write cannot be converted to a single string column without
overflow.
rapids-bot bot pushed a commit that referenced this issue Feb 17, 2023
Since writing to CSV files is implemented by converting all columns in
a dataframe to strings, and then concatenating those columns, when we
attempt to write a large dataframe to CSV without specifying a chunk
size, we can easily overflow the maximum column size.

Currently the error message is rather inscrutable: that the requested
size of a string column exceeds the column size limit. To help the
user, catch this error and provide a useful error message that points
them towards setting the `chunksize` argument.

So that we don't produce false positive advice, tighten the scope by
only catching `OverflowError`, to do this, make partial progress
towards resolving #10200 by throwing `std::overflow_error` when
checking for overflow of string column lengths.

Closes #12690.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Ashwin Srinath (https://github.com/shwina)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #12705
@vyasr
Copy link
Contributor Author

vyasr commented Mar 6, 2023

Between #12076, #12078, and #12426 enough work has been done to consider this discussion closed. I will open a new issue to discuss follow-up work in a more task-oriented format to better measure completion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

5 participants