[SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to #32459

Closed
cloud-fan wants to merge 2 commits

cloud-fan
Contributor

What changes were proposed in this pull request?

This is a follow-up of #32198

Before #32198, WriteTaskStatsTracker.newRow could assume that each row was written to the current file. After #32198, the tracker no longer knows which file a row is written to.

This PR adds a file path parameter to WriteTaskStatsTracker.newRow to restore that connection.
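
To illustrate the problem, here is a minimal, Spark-free sketch. It assumes that several files can be open at once and that rows for them arrive interleaved; the PR description does not spell out that scenario, so treat it as an assumption. `StubRow`, `LastOpenedFileTracker`, `PathAwareTracker`, and `InterleavedWritesDemo` are hypothetical names, not Spark classes.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's InternalRow so the sketch runs without Spark.
final case class StubRow(values: Seq[Any])

// Guesses the target file from the most recent newFile call.
class LastOpenedFileTracker {
  private var currentFile: Option[String] = None
  val rowsPerFile: mutable.Map[String, Int] =
    mutable.Map.empty[String, Int].withDefaultValue(0)

  def newFile(filePath: String): Unit = {
    currentFile = Some(filePath)
  }

  // No file path is passed in, so the row is attributed to the last opened file.
  def newRow(row: StubRow): Unit = {
    currentFile.foreach(f => rowsPerFile(f) += 1)
  }
}

// Is told explicitly which file each row goes to.
class PathAwareTracker {
  val rowsPerFile: mutable.Map[String, Int] =
    mutable.Map.empty[String, Int].withDefaultValue(0)

  def newFile(filePath: String): Unit = {
    rowsPerFile(filePath) = 0
  }

  def newRow(filePath: String, row: StubRow): Unit = {
    rowsPerFile(filePath) += 1
  }
}

object InterleavedWritesDemo {
  // Opens two files, then writes three rows that alternate between them.
  def run(): (Map[String, Int], Map[String, Int]) = {
    val guessing = new LastOpenedFileTracker
    val aware = new PathAwareTracker
    Seq("a.parquet", "b.parquet").foreach { f =>
      guessing.newFile(f)
      aware.newFile(f)
    }
    Seq("a.parquet", "b.parquet", "a.parquet").foreach { f =>
      guessing.newRow(StubRow(Seq(1)))
      aware.newRow(f, StubRow(Seq(1)))
    }
    (guessing.rowsPerFile.toMap, aware.rowsPerFile.toMap)
  }
}
```

In this sketch the first tracker credits all three rows to `b.parquet`, because that file was opened last, while the second tracker records two rows for `a.parquet` and one for `b.parquet`.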

Why are the changes needed?

To avoid breaking custom WriteTaskStatsTracker implementations that need to know which file each row is written to.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A

@github-actions github-actions bot added the SQL label May 6, 2021
@cloud-fan
Contributor Author

cc @c21

Contributor

@imback82 imback82 left a comment

LGTM

@SparkQA
SparkQA commented May 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42738/

@SparkQA
SparkQA commented May 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42738/

Contributor

@c21 c21 left a comment

LGTM.

@SparkQA
SparkQA commented May 6, 2021

Test build #138216 has finished for PR 32459 at commit 88478b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -151,7 +151,7 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration)
     }
   }

-  override def newRow(row: InternalRow): Unit = {
+  override def newRow(filePath: String, row: InternalRow): Unit = {
Member

@dongjoon-hyun dongjoon-hyun May 6, 2021

From the Apache Spark codebase's side, this is a no-op, unused parameter, and there is no test coverage in this PR. Do you think we could add a sample custom WriteTaskStatsTracker test case to prevent a future regression, @cloud-fan?

To not break some custom WriteTaskStatsTracker implementations.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you so much for adding a test suite, @cloud-fan.

@SparkQA
SparkQA commented May 7, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42755/

@SparkQA
SparkQA commented May 7, 2021

Test build #138233 has finished for PR 32459 at commit 8e9f6cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CustomWriteTaskStatsTrackerSuite extends SparkFunSuite
    • class CustomWriteTaskStatsTracker extends WriteTaskStatsTracker
    • case class CustomWriteTaskStats(numRowsPerFile: Map[String, Int]) extends WriteTaskStats
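
For readers who want a feel for what such a tracker could look like, here is a rough, Spark-free sketch that reuses the class names listed above. It is not the code from this PR: the bodies are guesses, the `WriteTaskStats` and `WriteTaskStatsTracker` traits below are trimmed-down stand-ins for the real Spark interfaces, and `StubRow` and `CustomTrackerExample` are hypothetical helpers.

```scala
import scala.collection.mutable

// Stand-ins so the sketch compiles on its own (assumptions, much smaller
// than the real Spark interfaces).
final case class StubRow(values: Seq[Any])
trait WriteTaskStats
trait WriteTaskStatsTracker {
  def newFile(filePath: String): Unit
  def newRow(filePath: String, row: StubRow): Unit
  def getFinalStats(): WriteTaskStats
}

case class CustomWriteTaskStats(numRowsPerFile: Map[String, Int]) extends WriteTaskStats

class CustomWriteTaskStatsTracker extends WriteTaskStatsTracker {
  private val numRowsPerFile = mutable.Map.empty[String, Int]

  // Register the file so that files with zero rows still show up in the stats.
  override def newFile(filePath: String): Unit = {
    numRowsPerFile.put(filePath, 0)
  }

  // The file path parameter tells the tracker which counter to bump.
  override def newRow(filePath: String, row: StubRow): Unit = {
    numRowsPerFile(filePath) = numRowsPerFile.getOrElse(filePath, 0) + 1
  }

  override def getFinalStats(): WriteTaskStats =
    CustomWriteTaskStats(numRowsPerFile.toMap)
}

object CustomTrackerExample {
  // Writes two rows to file "a" and one row to file "b", interleaved.
  def stats(): WriteTaskStats = {
    val tracker = new CustomWriteTaskStatsTracker
    tracker.newFile("a")
    tracker.newFile("b")
    tracker.newRow("a", StubRow(Seq(1)))
    tracker.newRow("b", StubRow(Seq(2)))
    tracker.newRow("a", StubRow(Seq(3)))
    tracker.getFinalStats()
  }
}
```

A test in the spirit of the one requested in review could then check that the final stats map each file path to the number of rows written to it.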

@cloud-fan
Contributor Author

Thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in e83910f May 7, 2021