Port whole parsePartitions method from Spark3.3 to Gpu side #6048
Conversation
Signed-off-by: Allen Xu <allxu@nvidia.com>
Overall looks fine. It's definitely copying more code than I expected, but I assume it all has changes from 3.3 that we don't want to pick up other versions of.
I think we should file a follow-up to investigate this code more and see if there is another way to replace the paths.
build
tests failing due to parquet test issue #6054
// Copied from org/apache/spark/sql/types/AbstractDataType.scala
// for https://github.com/NVIDIA/spark-rapids/issues/6026
// It can be removed when Spark 3.3.0 is the minimum supported Spark version
private[sql] object AnyTimestampType extends AbstractDataType with Serializable {
This seems very bad. We are creating a separate type hierarchy from the one in Apache Spark which we should never do. Same goes for copying the TimestampNTZType code. We should never copy the type code. We should use it directly, or if we cannot because it is not available in all supported Spark versions, shim the code that needs to reference the type in some way as we have done in the past, even for code referencing TimestampNTZType.
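The shimming approach described here can be sketched roughly as follows. This is a hedged illustration with stubbed types and hypothetical names, not the plugin's actual layout: a per-Spark-version object lives in a version-specific source set, so the Apache Spark type is only referenced where it exists.

```scala
// Illustrative shim pattern with stubbed types (not the plugin's real code).
sealed trait DataType
case object TimestampType extends DataType     // exists in all supported Spark versions
case object TimestampNTZType extends DataType  // only exists from Spark 3.3 on

// Would live in a Spark 3.3+ source set, where TimestampNTZType is real:
object TimestampShims330 {
  def isAnyTimestamp(dt: DataType): Boolean =
    dt == TimestampType || dt == TimestampNTZType
}

// Would live in a pre-3.3 source set, which cannot reference TimestampNTZType:
object TimestampShimsPre330 {
  def isAnyTimestamp(dt: DataType): Boolean = dt == TimestampType
}
```

Common code then calls whichever shim object the build wires in for the active Spark version, instead of copying Spark's type hierarchy.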
Added shim code and removed some unnecessary files, but I'm not quite sure the shim source files are under the right packages; please help take another look. Thanks!
Signed-off-by: Allen Xu <allxu@nvidia.com>
Resolved review threads (outdated) on:
- sql-plugin/src/main/scala/org/apache/spark/sql/catalyst/util/rapids/TimestampFormatter.scala
- .../src/main/scala/org/apache/spark/sql/execution/datasources/rapids/GpuPartitioningUtils.scala (two threads)
- sql-plugin/src/main/320+/scala/org/apache/spark/sql/types/shims/PartitionValueCastShims.scala
I agree with Tom's comments, but now that shims are in place, this is no longer blocking the PR.
build
build
Signed-off-by: Allen Xu <allxu@nvidia.com>
build
build
build
object PartitionValueCastShims {
  def isSupportedType(dt: DataType): Boolean = dt match {
    // AnsiIntervalType
    case it: AnsiIntervalType => true
why doesn't the 330 shim have the AnyTimestampType checks?
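For context, a hedged sketch of what adding such a check could look like, with stubbed stand-in types and a hypothetical object name (not the actual shim code). In Spark 3.3, AnyTimestampType accepts both timestamp flavors, so an equivalent match would cover both:

```scala
// Stubbed sketch of an isSupportedType that also accepts both timestamp
// flavors, as the question above suggests (hypothetical, not the real shim).
sealed trait DataType
case object AnsiIntervalType extends DataType  // stand-in for Spark's ANSI interval types
case object TimestampType extends DataType
case object TimestampNTZType extends DataType
case object StringType extends DataType

object PartitionValueCastShimsSketch {
  def isSupportedType(dt: DataType): Boolean = dt match {
    case AnsiIntervalType => true
    // AnyTimestampType in Spark 3.3 matches both timestamp flavors:
    case TimestampType | TimestampNTZType => true
    case _ => false
  }
}
```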
 * }}}
 */
private[datasources] def parsePartitions(
    paths: Seq[Path],
Nit, not going to block this, but these should be indented 4 spaces. Same with the functions below.
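For readers unfamiliar with what parsePartitions does: it turns Hive-style directory names such as `year=2022/month=7` into partition columns and values. A toy standalone illustration of that idea (not the Spark or plugin implementation):

```scala
// Toy illustration of Hive-style partition-path parsing (not Spark's code).
object ToyPartitionParser {
  // "year=2022/month=7" -> Seq(("year", "2022"), ("month", "7"))
  def parsePartitionPath(path: String): Seq[(String, String)] =
    path.split('/').toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(k, v) if k.nonEmpty => Some(k -> v)
        case _ => None // not a partition directory (e.g. a data file name)
      }
    }
}
```

The real method additionally discovers a common base path, infers value types, and validates that all paths agree on the partition columns, which is what makes the 3.3 behavior worth porting wholesale.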
I'm going to merge this and then follow up with a PR to fix the remaining issues.
Signed-off-by: Allen Xu <allxu@nvidia.com>
This fixes the Databricks issue mentioned in #6026, but not the Hive issue.
Also, I'm not sure it's feasible to write a local test for this, since the CPU path will fail on Spark versions before 3.3.0. Maybe a unit test covering only parsePartitions could be added.
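A unit test along those lines would mostly assert on partition-value type inference. The sketch below shows the kind of check such a test could make, using toy inference logic with made-up rules, not Spark's actual inference; a real test would call the ported parsePartitions directly.

```scala
// Standalone sketch of value inference a parsePartitions unit test might
// assert on (toy rules, not Spark's actual inference).
object ToyPartitionValueInference {
  def inferType(raw: String): String =
    if (raw.matches("""-?\d+""")) "int"
    else if (raw.matches("""\d{4}-\d{2}-\d{2}""")) "date"
    else if (raw.matches("""\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}.*""")) "timestamp"
    else "string"
}
```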