
Port whole parsePartitions method from Spark3.3 to Gpu side #6048

Merged

Conversation

wjxiz1992
Collaborator
Signed-off-by: Allen Xu <allxu@nvidia.com>

This fixes the Databricks issue mentioned in #6026, but not the Hive issue.

Also, I'm not sure it's feasible to write a locally runnable integration test for this, as the CPU will fail on Spark versions before 3.3.0. Maybe a unit test just for parsePartitions can be added.

Signed-off-by: Allen Xu <allxu@nvidia.com>
@sameerz sameerz added the bug Something isn't working label Jul 21, 2022
Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 self-assigned this Jul 22, 2022
@wjxiz1992 wjxiz1992 requested review from tgravescs, wbo4958 and GaryShen2008 and removed request for wbo4958, tgravescs and GaryShen2008 July 22, 2022 09:27
tgravescs
tgravescs previously approved these changes Jul 22, 2022
Collaborator

@tgravescs tgravescs left a comment

Overall looks fine. Definitely copying more code than I expected, but I assume it all has changes needed from 3.3 that we don't want to pick up in other versions.

I think we should file a follow up to investigate this code more to see if there is another way to replace the paths.

@tgravescs
Collaborator

build

@tgravescs
Collaborator

tests failing due to parquet test issue #6054

jlowe
jlowe previously requested changes Jul 22, 2022
// Copied from org/apache/spark/sql/types/AbstractDataType.scala
// for https://github.com/NVIDIA/spark-rapids/issues/6026
// It can be removed when Spark 3.3.0 is the minimum supported Spark version
private[sql] object AnyTimestampType extends AbstractDataType with Serializable {
Member

@jlowe jlowe Jul 22, 2022


This seems very bad. We are creating a separate type hierarchy from the one in Apache Spark, which we should never do. The same goes for copying the TimestampNTZType code. We should never copy the type code. We should use it directly, or, if we cannot because it is not available in all supported Spark versions, shim the code that needs to reference the type, as we have done in the past even for code referencing TimestampNTZType.
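The shim approach described here can be sketched roughly as follows. All names in this sketch (TimestampTypeShims, isTimestampType, the object names) are illustrative stand-ins, not the actual spark-rapids shim API; it only shows the general pattern of putting version-specific type checks behind a common interface instead of copying Spark's type code.

```scala
// Rough, hypothetical sketch of the shim pattern discussed above.
// In spark-rapids, each supported Spark version has its own source set,
// and the build selects the matching shim at compile time. None of the
// names below are the real shim API.

// Common interface, visible to version-agnostic code.
trait TimestampTypeShims {
  // True if the given Catalyst type name should be treated as a
  // timestamp type on this Spark version.
  def isTimestampType(typeName: String): Boolean
}

// Shim compiled only against Spark < 3.3.0, where only TimestampType exists.
object PreSpark330TimestampShims extends TimestampTypeShims {
  def isTimestampType(typeName: String): Boolean =
    typeName == "TimestampType"
}

// Shim compiled only against Spark >= 3.3.0, where AnyTimestampType
// also matches the new TimestampNTZType.
object Spark330TimestampShims extends TimestampTypeShims {
  def isTimestampType(typeName: String): Boolean =
    typeName == "TimestampType" || typeName == "TimestampNTZType"
}
```

Version-agnostic code depends only on the trait, so no copy of Spark's AnyTimestampType hierarchy is needed; each Spark version supplies its own implementation.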

Collaborator Author


Added shim code and removed some unnecessary files. I'm not quite sure the shim source files are under the right packages, though; please take another look. Thanks!

Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 requested a review from jlowe July 25, 2022 10:05
@wjxiz1992 wjxiz1992 changed the title port whole parsePartitions method from Spark3.3 to Gpu side Port whole parsePartitions method from Spark3.3 to Gpu side Jul 25, 2022
@jlowe jlowe dismissed their stale review July 25, 2022 14:48

Agree with Tom's comments, but now that shims are in place, no longer blocking this PR.

@tgravescs
Collaborator

build

1 similar comment
@jlowe
Member

jlowe commented Jul 25, 2022

build

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 requested a review from tgravescs July 26, 2022 03:48
@sameerz
Collaborator

sameerz commented Jul 26, 2022

build

2 similar comments
@wjxiz1992
Collaborator Author

build

@wjxiz1992
Collaborator Author

build

object PartitionValueCastShims {
def isSupportedType(dt: DataType): Boolean = dt match {
// AnsiIntervalType
case it: AnsiIntervalType => true
Collaborator

@tgravescs tgravescs Jul 26, 2022


Why doesn't the 330 shim have the AnyTimestampType checks?


* }}}
*/
private[datasources] def parsePartitions(
paths: Seq[Path],
Collaborator

Nit, not going to block this, but these should be indented 4 spaces. Same with the functions below.

Collaborator

@tgravescs tgravescs left a comment

I'm going to merge this and then followup with a PR to fix the remaining issues.
