
Pin to aws-sdk-cpp<1.11 #14173

Merged: 3 commits into rapidsai:branch-23.10 on Sep 22, 2023
Conversation

@pentschev (Member) commented Sep 22, 2023

Description

Pin conda packages to aws-sdk-cpp<1.11. The recent upgrade to version 1.11.* has caused several issues with cleanup (more details on the changes can be read in this link), leading Distributed and Dask-CUDA processes to segfault. The stack trace for one of those crashes looks like the following:

(gdb) bt
#0  0x00007f5125359a0c in Aws::Utils::Logging::s_aws_logger_redirect_get_log_level(aws_logger*, unsigned int) () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so
#1  0x00007f5124968f83 in aws_event_loop_thread () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#2  0x00007f5124ad9359 in thread_fn () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1
#3  0x00007f519958f6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f5198b1361f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Such segfaults now manifest frequently in CI, and in some cases are reproducible with a hit rate of ~30%. Given the approaching release, the safest option is probably to pin to an older version of the package until we pinpoint the exact cause of the issue and a patched build is released upstream.

aws-sdk-cpp is statically linked into the pyarrow pip package, which prevents us from applying the same pin there. cuDF is currently pinned to pyarrow=12.0.1, which seems to be built against aws-sdk-cpp=1.10.*, as per recent build logs.
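For illustration, a minimal sketch of how such a constraint could be expressed in a conda recipe's meta.yaml (hypothetical file and section names; the files actually changed by this PR may differ):

```yaml
# Hypothetical excerpt from a conda recipe's meta.yaml.
# The upper bound keeps the package on the 1.10 series, avoiding the
# 1.11.* cleanup regression that triggers the segfaults described above.
requirements:
  host:
    - aws-sdk-cpp <1.11
  run:
    - aws-sdk-cpp <1.11
```

The same `aws-sdk-cpp <1.11` spec works in conda environment files; it does not help for pip-installed pyarrow, where the library is statically linked.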

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pentschev requested a review from a team as a code owner September 22, 2023 13:36
@github-actions bot added the conda label Sep 22, 2023
@pentschev changed the title from Pin to aws-sdk-cpp<11 to Pin to aws-sdk-cpp<1.11 Sep 22, 2023
@wence- added the bug (Something isn't working) label Sep 22, 2023
@pentschev added the 3 - Ready for Review (Ready for review by team) and non-breaking (Non-breaking change) labels Sep 22, 2023
@pentschev (Member, Author) commented:

I think this is a non-breaking change, but please correct the label if that's not the case.

@galipremsagar (Contributor) left a comment:


Thanks @pentschev

@pentschev (Member, Author) commented:

/merge

@rapids-bot bot merged commit 40bdd8a into rapidsai:branch-23.10 Sep 22, 2023
58 checks passed
rapids-bot bot pushed a commit that referenced this pull request Oct 19, 2023
…or new CI containers (#14296)

The aws-sdk-cpp pinning introduced in #14173 causes problems because newer builds of libarrow require a newer version of aws-sdk-cpp. Even though we restrict to libarrow 12.0.1, this restriction is insufficient to create solvable environments because the conda (mamba) solver doesn't seem to consistently reach far back enough into the history of builds to pull the last build that was compatible with the aws-sdk-cpp version that we need. For now, the safest way for us to avoid this problem is to downgrade to arrow 12.0.0, for which all conda package builds are pinned to the older version of aws-sdk-cpp that does not have the bug in question.

Separately, while the above issue was being worked around, we also got new builds of our CI images [that removed system installs of CTK packages from CUDA 12 images](rapidsai/ci-imgs#77). This change was made because, for CUDA 12, we can get all the necessary pieces of the CTK from conda-forge. However, it turns out that the cudf_kafka builds were implicitly relying on system CTK packages, and the cudf_kafka build is in fact not fully compatible with conda-forge CTK packages because it does not use CMake via scikit-build (nor any other more sophisticated library discovery mechanism like pkg-config) and therefore does not know how to find conda-forge CTK headers/libraries. This PR introduces a set of temporary patches to get around this limitation. These patches are not a long-term fix and are only put in place assuming that #14292 is merged in the near future, before we cut a 23.12 release.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #14296
raydouglass pushed a commit that referenced this pull request Nov 7, 2023
Labels
3 - Ready for Review (Ready for review by team) · bug (Something isn't working) · non-breaking (Non-breaking change)
Projects: Archived in project
4 participants