
Port whole parsePartitions method from Spark3.3 to Gpu side #6048

Merged

Conversation

wjxiz1992
Collaborator
Signed-off-by: Allen Xu <allxu@nvidia.com>

This fixes the Databricks issue mentioned in #6026, but not the Hive issue.

Also, I'm not sure it's feasible to write a locally runnable integration test for this, as the CPU will fail on Spark versions before 3.3.0. Maybe a unit test just for parsePartitions can be added.

Signed-off-by: Allen Xu <allxu@nvidia.com>
@sameerz sameerz added the bug Something isn't working label Jul 21, 2022
Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 self-assigned this Jul 22, 2022
@wjxiz1992 wjxiz1992 requested review from tgravescs, wbo4958 and GaryShen2008 and removed request for wbo4958, tgravescs and GaryShen2008 July 22, 2022 09:27
tgravescs
tgravescs previously approved these changes Jul 22, 2022
Collaborator

@tgravescs tgravescs left a comment

Overall looks fine. Definitely copying more code than I expected, but I assume it all has changes needed from 3.3 that we don't want to pick up in other versions.

I think we should file a follow up to investigate this code more to see if there is another way to replace the paths.

@tgravescs
Collaborator

build

@tgravescs
Collaborator

tests failing due to parquet test issue #6054

jlowe
jlowe previously requested changes Jul 22, 2022
// Copied from org/apache/spark/sql/types/AbstractDataType.scala
// for https://github.com/NVIDIA/spark-rapids/issues/6026
// It can be removed when Spark 3.3.0 is the minimum supported Spark version
private[sql] object AnyTimestampType extends AbstractDataType with Serializable {
Member

@jlowe jlowe Jul 22, 2022


This seems very bad. We are creating a separate type hierarchy from the one in Apache Spark, which we should never do. The same goes for copying the TimestampNTZType code. We should never copy the type code. We should use it directly, or, if we cannot because it is not available in all supported Spark versions, shim the code that needs to reference the type, as we have done in the past even for code referencing TimestampNTZType.
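The shim approach described here can be sketched roughly as follows. All names in this sketch (TimestampTypeShims, isTimestampType, the object names) are illustrative stand-ins, not the actual spark-rapids shim API; it only shows the general pattern of putting version-specific type checks behind a common interface instead of copying Spark's type code.

```scala
// Rough, hypothetical sketch of the shim pattern discussed above.
// In spark-rapids, each supported Spark version has its own source set,
// and the build selects the matching shim at compile time. None of the
// names below are the real shim API.

// Common interface, visible to version-agnostic code.
trait TimestampTypeShims {
  // True if the given Catalyst type name should be treated as a
  // timestamp type on this Spark version.
  def isTimestampType(typeName: String): Boolean
}

// Shim compiled only against Spark < 3.3.0, where only TimestampType exists.
object PreSpark330TimestampShims extends TimestampTypeShims {
  def isTimestampType(typeName: String): Boolean =
    typeName == "TimestampType"
}

// Shim compiled only against Spark >= 3.3.0, where AnyTimestampType
// also matches the new TimestampNTZType.
object Spark330TimestampShims extends TimestampTypeShims {
  def isTimestampType(typeName: String): Boolean =
    typeName == "TimestampType" || typeName == "TimestampNTZType"
}
```

Version-agnostic code depends only on the trait, so no copy of Spark's AnyTimestampType hierarchy is needed; each Spark version supplies its own implementation.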

Collaborator Author


Added shim code and removed some unnecessary files. I'm not quite sure the shim source files are under the right packages, though; please take another look. Thanks!

Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 requested a review from jlowe July 25, 2022 10:05
@wjxiz1992 wjxiz1992 changed the title port whole parsePartitions method from Spark3.3 to Gpu side Port whole parsePartitions method from Spark3.3 to Gpu side Jul 25, 2022
@jlowe jlowe dismissed their stale review July 25, 2022 14:48

Agree with Tom's comments, but now that shims are in place, no longer blocking this PR.

@tgravescs
Collaborator

build

1 similar comment
@jlowe
Member

jlowe commented Jul 25, 2022

build

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 requested a review from tgravescs July 26, 2022 03:48
@sameerz
Collaborator

sameerz commented Jul 26, 2022

build

2 similar comments
@wjxiz1992
Collaborator Author

build

@wjxiz1992
Collaborator Author

build

object PartitionValueCastShims {
def isSupportedType(dt: DataType): Boolean = dt match {
// AnsiIntervalType
case it: AnsiIntervalType => true
Collaborator

@tgravescs tgravescs Jul 26, 2022


Why doesn't the 330 shim have the AnyTimestampType checks?


* }}}
*/
private[datasources] def parsePartitions(
paths: Seq[Path],
Collaborator

Nit, not going to block this, but these should be indented 4 spaces. Same with the functions below.

Collaborator

@tgravescs tgravescs left a comment

I'm going to merge this and then followup with a PR to fix the remaining issues.
