[BUG] very large shuffles can fail #45
Labels
bug
Something isn't working
P1
Nice to have for release
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
SQL
part of the SQL/Dataframe plugin
Describe the bug
Spark has a hard limit of 2 GB (2^31 bytes) on a single serialized shuffle element. In some cases we can exceed this, so when we shuffle data we need to ensure that the largest batch we serialize stays under 2 GB. We cannot enforce this in the serializer, because by that point it is too late. We need to do it in the shuffle executor, before the data reaches the serializer.
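A minimal sketch of what the pre-shuffle splitting could look like, assuming we can estimate a batch's serialized size up front. The names here (`ShuffleBatchSplitter`, `sliceRanges`, `estimatedBytes`) are hypothetical and not part of the plugin's actual API; the point is only to show computing row ranges so every slice stays under the limit:

```scala
// Hypothetical sketch: split a batch into row ranges before serialization
// so no single serialized shuffle element exceeds Spark's 2 GB cap.
object ShuffleBatchSplitter {
  // Spark's hard cap on a single serialized shuffle element: 2^31 - 1 bytes.
  val MaxElementBytes: Long = Int.MaxValue.toLong

  // Target well below the cap to leave headroom for serialization overhead
  // and for the size estimate being inexact (assumption: half is enough).
  val TargetBytes: Long = MaxElementBytes / 2

  /** Row ranges [start, end) to slice a batch into so that each slice's
    * estimated serialized size stays under TargetBytes. */
  def sliceRanges(numRows: Int, estimatedBytes: Long): Seq[(Int, Int)] = {
    if (estimatedBytes <= TargetBytes || numRows <= 1) {
      Seq((0, numRows))
    } else {
      // Number of slices, rounded up, so each slice fits under the target.
      val numSlices = ((estimatedBytes + TargetBytes - 1) / TargetBytes).toInt
      val rowsPerSlice = math.max(1, numRows / numSlices)
      (0 until numRows by rowsPerSlice).map { start =>
        (start, math.min(start + rowsPerSlice, numRows))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Example: a ~5 GiB batch of 10M rows is split into several slices,
    // each estimated to serialize to less than TargetBytes.
    sliceRanges(10000000, 5L * 1024 * 1024 * 1024).foreach(println)
  }
}
```

The key design point is that the split happens on row boundaries before the serializer ever sees the batch, since once serialization has started there is no way to recover from an oversized element.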