[BEAM-7545] Adding RowCount to TextTable #8951

ghost · 2019-06-26T16:40:56Z

This change enables row count estimation for text tables. It first reads a few rows from the files in the directory, then uses the length of those rows and the total size of the files to estimate the number of rows in the table.
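As a rough illustration of the approach described above, here is a minimal sketch using only `java.nio`. It is not the `TextRowCountEstimator` code from this PR: the class name `SampledRowCountSketch`, the `maxSampledLines` parameter, and the `+ 1` newline allowance are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch; not the estimator added by this PR. */
class SampledRowCountSketch {

  /** Estimates the row count of all regular files in dir from at most maxSampledLines lines. */
  static long estimateRowCount(Path dir, int maxSampledLines) throws IOException {
    long totalFileSizes = 0L;
    long sampledLines = 0L;
    long sampledBytes = 0L;

    try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
      for (Path file : files) {
        if (!Files.isRegularFile(file)) {
          continue; // sub-directories are skipped in this sketch
        }
        totalFileSizes += Files.size(file);
        if (sampledLines >= maxSampledLines) {
          continue; // enough samples; only the file size is still needed
        }
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
          while (sampledLines < maxSampledLines) {
            String line = reader.readLine();
            if (line == null) {
              break; // end of this file
            }
            // Assumption: one extra byte for the newline that readLine() strips.
            sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
            sampledLines++;
          }
        }
      }
    }
    if (sampledBytes == 0) {
      return 0L; // nothing could be sampled
    }
    // total bytes divided by the average sampled line size
    return totalFileSizes * sampledLines / sampledBytes;
  }

  /** Builds two small files in a temp directory and estimates their combined row count. */
  static long demo() {
    try {
      Path dir = Files.createTempDirectory("rows");
      Files.write(dir.resolve("a.txt"), "aaaa\nbbbb\ncccc\n".getBytes(StandardCharsets.UTF_8));
      Files.write(dir.resolve("b.txt"), "dddd\neeee\n".getBytes(StandardCharsets.UTF_8));
      return estimateRowCount(dir, 2);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // 25 bytes total, 2 sampled lines of 5 bytes each: prints 5
  }
}
```

Only a bounded number of lines is read no matter how large the files are, so the estimate stays cheap. Its accuracy depends on how representative the sampled lines are.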

ghost · 2019-06-26T16:53:13Z

R: @akedin

ghost · 2019-06-26T17:19:36Z

Run Java_Examples_Dataflow PreCommit

ghost · 2019-06-26T18:22:02Z

Run Java_Examples_Dataflow PreCommit

ghost · 2019-06-26T19:24:26Z

Run JavaPortabilityApi PreCommit

ghost · 2019-06-26T19:24:35Z

Run Java_Examples_Dataflow PreCommit

ghost · 2019-06-26T19:58:29Z

Run JavaPortabilityApi PreCommit

ghost · 2019-06-26T19:58:45Z

Run Java_Examples_Dataflow PreCommit

akedin

I think the overall approach is good. A few comments:

sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java

sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextRowCountEstimator.java

akedin · 2019-06-26T21:30:32Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextRowCountEstimator.java

+      }
+
+      FileIO.ReadableFile file =
+          FileIO.ReadMatches.matchToReadableFile(


I am not sure directoryTreatment matters here; we can probably just always skip directories. It might matter for actual processing, but for row count estimation we have to look at the files no matter what kind of directory handling we want at run time.

Yes. I have set its default value to SKIP, so it will just skip the directories. Do you think it would be better to remove the setter for it?

sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextRowCountEstimator.java

akedin · 2019-06-26T21:57:00Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextRowCountEstimator.java

+      return 0L;
+    }
+
+    return totalFileSizes * numberOfReadLines / linesSize;


Can you add a comment here explaining the formula?

I am also wondering whether we should add configuration options for:

skipping empty files in the calculation;

choosing a different statistic, not just the mean (e.g. the median or any percentile).
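To make the formula and the suggested alternative concrete, here is a small sketch. The variable names `totalFileSizes`, `numberOfReadLines`, and `linesSize` come from the diff above; the class `RowCountFormulaSketch`, both method names, and the median variant are hypothetical and are not part of this PR.

```java
import java.util.Arrays;

/** Hypothetical sketch of the estimate discussed above; names are illustrative. */
class RowCountFormulaSketch {

  /**
   * Mean-based estimate. The average sampled line size is linesSize / numberOfReadLines,
   * so rows ~= totalFileSizes / (linesSize / numberOfReadLines).
   */
  static long estimateWithMean(long totalFileSizes, long numberOfReadLines, long linesSize) {
    if (linesSize == 0) {
      return 0L; // nothing sampled, so no estimate is possible
    }
    // Multiply before dividing so integer division does not truncate the average.
    return totalFileSizes * numberOfReadLines / linesSize;
  }

  /** Median-based alternative: divide the total size by the median sampled line size. */
  static long estimateWithMedian(long totalFileSizes, long[] sampledLineSizes) {
    if (sampledLineSizes.length == 0) {
      return 0L;
    }
    long[] sorted = sampledLineSizes.clone();
    Arrays.sort(sorted);
    long median = sorted[sorted.length / 2]; // upper median for even-length samples
    return median == 0 ? 0L : totalFileSizes / median;
  }

  public static void main(String[] args) {
    long[] sizes = {10, 10, 10, 10, 160}; // one unusually long sampled line
    long total = 4_000;
    long linesSize = Arrays.stream(sizes).sum(); // 200
    System.out.println(estimateWithMean(total, sizes.length, linesSize)); // 4000 * 5 / 200 = 100
    System.out.println(estimateWithMedian(total, sizes)); // 4000 / 10 = 400
  }
}
```

The example shows why the choice of statistic matters: a single long line pulls the mean up and the estimate down, while the median ignores it. With the mean-based formula as written, an empty file adds zero bytes to `totalFileSizes` and contributes no sampled lines, so it does not change the result.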

...sions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/text/TextTable.java

ghost · 2019-06-27T22:33:42Z

Run Java_Examples_Dataflow PreCommit

ghost · 2019-06-27T22:33:55Z

Run Java PostCommit

akedin

LGTM

ghost force-pushed the TextTableRowCount branch from 79b2fc9 to d622188 Compare June 26, 2019 16:43

ghost force-pushed the TextTableRowCount branch from d622188 to 9c89178 Compare June 26, 2019 17:59

akedin reviewed Jun 26, 2019

ghost force-pushed the TextTableRowCount branch from 9c89178 to 22209df Compare June 27, 2019 21:09

[BEAM-7545] Adding RowCount to TextTable.

16ceca5

ghost force-pushed the TextTableRowCount branch from 22209df to 16ceca5 Compare June 28, 2019 17:39

akedin approved these changes Jul 1, 2019

akedin merged commit 3d576f7 into apache:master Jul 1, 2019
