[SPARK-13260][SQL] count(*) does not work with CSV data source #11169

HyukjinKwon · 2016-02-11T08:51:40Z

https://issues.apache.org/jira/browse/SPARK-13260
This is a quick fix for count(*).

When requiredColumns is empty, the scan currently returns sqlContext.sparkContext.emptyRDD[Row], which contains no rows and so loses the row count.

Just like the JSON datasource, this PR lets the CSV datasource count the rows without parsing each set of tokens.
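
The behaviour described above can be illustrated with a minimal sketch in plain Python. It is not the Spark implementation: `pruned_scan`, the `(name, cast)` schema pairs and the sample records are hypothetical names chosen for illustration, and "parsing" is modelled here as casting each token to its column type.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical schema representation: (column name, function that casts a token).
Schema = Sequence[Tuple[str, Callable[[str], object]]]


def pruned_scan(
    tokenized: Iterable[Sequence[str]],
    schema: Schema,
    required_columns: Sequence[str],
) -> List[tuple]:
    """Return one row per record, keeping only the required columns."""
    if not required_columns:
        # count(*) case: emit one empty row per record so the row count
        # survives, and skip casting the tokens altogether.
        return [() for _ in tokenized]

    names = [name for name, _ in schema]
    # Resolve each required column to its position and cast function once.
    picks = [(names.index(col), schema[names.index(col)][1]) for col in required_columns]
    return [tuple(cast(tokens[i]) for i, cast in picks) for tokens in tokenized]


schema = [("id", int), ("name", str)]
records = [["1", "a"], ["2", "b"], ["3", "c"]]

print(len(pruned_scan(records, schema, [])))   # number of rows seen by count(*)
print(pruned_scan(records, schema, ["id"]))    # ordinary pruned scan
```

Returning an empty collection in the first branch would reproduce the reported bug, since the count would then be zero regardless of the input.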

SparkQA · 2016-02-11T10:19:23Z

Test build #51091 has finished for PR 11169 at commit b52e156.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-02-12T00:08:19Z

cc @rxin @falaki

falaki · 2016-02-12T01:00:00Z

@HyukjinKwon Thanks for submitting this. I think it is better to parse everything when the required columns are empty. The reason is that, depending on the parsing mode, the number of rows might turn out different.

rxin · 2016-02-12T01:49:53Z

hm it seems like an important optimization that we wouldn't be able to do if we cannot skip parsing columns.

HyukjinKwon · 2016-02-12T01:52:34Z

@rxin @falaki I think I should have described this in more detail. This works identically to the original CSV datasource.

When the parsing mode is drop-malformed, it will try to parse everything; in the other modes, it will not.

A similar issue was found here https://github.com/databricks/spark-csv/issues/219 and was fixed in databricks/spark-csv#220.

HyukjinKwon · 2016-02-12T01:54:25Z

CSVRelation.scala#L193-L199 makes sure that it parses everything in drop-malformed mode, but not in the other modes.
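
To make the mode dependence concrete, here is a rough sketch in plain Python rather than the code at the lines referenced above. `count_rows` is a hypothetical helper, the mode strings are illustrative, and a record is assumed to be malformed simply when its token count differs from the number of schema fields.

```python
from typing import Iterable, Sequence


def count_rows(tokenized: Iterable[Sequence[str]], num_fields: int, mode: str) -> int:
    """Count records for count(*), inspecting them only when the mode needs it."""
    if mode == "DROPMALFORMED":
        # Malformed records are dropped, so every record has to be checked
        # before it can be counted. (Assumption: malformed means a wrong
        # number of tokens.)
        return sum(1 for tokens in tokenized if len(tokens) == num_fields)
    # Other modes: every record counts as one row, no inspection needed.
    return sum(1 for _ in tokenized)


records = [["1", "a"], ["2"], ["3", "c"]]  # the second record is short

print(count_rows(records, 2, "DROPMALFORMED"))
print(count_rows(records, 2, "PERMISSIVE"))
```

With a short record in the input, the two modes give different counts, which is why parsing cannot be skipped in drop-malformed mode.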

HyukjinKwon · 2016-02-12T02:57:44Z

@rxin Can we maybe merge this for now and then take the optimisation into account in another PR?

That optimisation would also apply to every pruned scan (in drop-malformed mode), so I think we should deal with it in another PR.

falaki · 2016-02-12T19:51:39Z

Thanks @HyukjinKwon, this looks good to me. The GitHub diff makes it look like a much bigger change than it is.

rxin · 2016-02-12T19:54:50Z

I'm going to merge this. Thanks.

count(*) does not work with CSV data source

b52e156

asfgit closed this in ac7d6af Feb 12, 2016

HyukjinKwon deleted the SPARK-13260 branch September 23, 2016 18:28
