[BEAM-7545] Adding RowCount to TextTable #8951
R: @akedin
Run Java_Examples_Dataflow PreCommit
Run JavaPortabilityApi PreCommit
I think the overall approach is good; a few comments:
sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextRowCountEstimator.java
    }

    FileIO.ReadableFile file =
        FileIO.ReadMatches.matchToReadableFile(
I am not sure whether directoryTreatment matters here; we can probably just always skip directories. It might matter for actual processing, but for row count estimation we have to look at the files regardless of what kind of directory handling we want at run time.
Yes. I have set its default value to SKIP, so it will just skip the directories. Do you think it would be better to remove the setter for it?
sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextRowCountEstimator.java
      return 0L;
    }

    return totalFileSizes * numberOfReadLines / linesSize;
Can you add a comment here explaining the formula?
I am also wondering whether we should add configuration options for:
- skipping empty files from calculation;
- choosing a different statistic, not just mean (e.g. median or any percentile);
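For reference, the formula above can be read as "total bytes divided by the average bytes per sampled line". The following is only a rough sketch of that reading, not the code in this PR: the class and method names are hypothetical, and the median variant is just one way the "different statistic" idea might look.

```java
import java.util.Arrays;

/** Hypothetical sketch; names are illustrative and not part of the PR. */
public class RowCountFormulaSketch {

  /**
   * Mean-based estimate. Average line size is linesSize / numberOfReadLines,
   * so rows ~= totalFileSizes / (linesSize / numberOfReadLines).
   * Multiplying first avoids losing precision to integer division
   * (at the cost of possible overflow for extremely large inputs).
   */
  public static long estimateByMean(long totalFileSizes, long numberOfReadLines, long linesSize) {
    if (linesSize == 0L) {
      return 0L; // nothing was sampled, so no estimate is possible
    }
    return totalFileSizes * numberOfReadLines / linesSize;
  }

  /** Median-based variant: less sensitive to a few unusually long sampled lines. */
  public static long estimateByMedian(long totalFileSizes, long[] sampledLineSizes) {
    if (sampledLineSizes.length == 0) {
      return 0L;
    }
    long[] sorted = sampledLineSizes.clone();
    Arrays.sort(sorted);
    long median = sorted[sorted.length / 2]; // upper median for even-length samples
    if (median == 0L) {
      return 0L;
    }
    return totalFileSizes / median;
  }

  public static void main(String[] args) {
    // 100 sampled lines totalling 5,000 bytes means 50 bytes per line on average.
    System.out.println(estimateByMean(1_000_000L, 100L, 5_000L)); // 20000
    // The 1000-byte outlier does not move the median (50 bytes).
    System.out.println(estimateByMedian(1_000_000L, new long[] {40L, 50L, 1000L})); // 20000
  }
}
```

With the mean, the same three-line sample (1,090 bytes over 3 lines) would give an estimate of roughly 2,752 rows, which shows how much a single outlier can skew the result.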
...sions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/text/TextTable.java
Run Java_Examples_Dataflow PreCommit
Run Java PostCommit
LGTM
This change enables row count estimation for text tables. It first reads a few rows from the files in the directory, then uses the length of those rows and the total size of the files to estimate the number of rows in the table.
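The sampling approach described above might look roughly like the following when applied to local files. This is a simplified sketch using plain `java.nio` rather than Beam's `FileIO`, so it is not the PR's implementation; `SampledRowCountSketch`, `estimateRows`, and `maxSampleLines` are hypothetical names, and it assumes UTF-8 text with single-byte `\n` line endings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; uses local files instead of Beam's FileIO. */
public class SampledRowCountSketch {

  public static long estimateRows(List<Path> files, int maxSampleLines) throws IOException {
    long totalFileSizes = 0L;
    long sampledLines = 0L;
    long sampledBytes = 0L;
    for (Path file : files) {
      if (Files.isDirectory(file)) {
        continue; // directories are always skipped for estimation
      }
      totalFileSizes += Files.size(file);
      if (sampledLines >= maxSampleLines) {
        continue; // enough lines sampled; only the size of this file is needed
      }
      try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
        String line;
        while (sampledLines < maxSampleLines && (line = reader.readLine()) != null) {
          // +1 accounts for the line terminator (assumed to be a single '\n').
          sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
          sampledLines++;
        }
      }
    }
    if (sampledBytes == 0L) {
      return 0L;
    }
    return totalFileSizes * sampledLines / sampledBytes;
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("rows", ".txt");
    StringBuilder content = new StringBuilder();
    for (int i = 0; i < 100; i++) {
      content.append("abcdefghi\n"); // 10 bytes per line
    }
    Files.write(tmp, content.toString().getBytes(StandardCharsets.UTF_8));
    // 1,000 bytes total; 5 sampled lines take 50 bytes, so the estimate is 100.
    System.out.println(estimateRows(Collections.singletonList(tmp), 5));
    Files.delete(tmp);
  }
}
```

The estimate is exact here only because every line has the same length; with variable-length rows, the accuracy depends on how representative the sampled lines are.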