[SPARK-13260][SQL] count(*) does not work with CSV data source #11169

Closed
wants to merge 1 commit into from


HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-13260
This is a quick fix for count(*).

When requiredColumns is empty, the scan currently returns sqlContext.sparkContext.emptyRDD[Row], which contains no rows, so the row count is lost.

Just like the JSON datasource, this PR lets the CSV datasource count the rows without parsing each set of tokens.
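
To illustrate the difference, here is a minimal sketch in plain Python rather than the actual Spark code. The names `scan_old`, `scan_new`, and `parse_line` are hypothetical and exist only for this illustration; the sketch assumes that "required columns is empty" is exactly what a count(*) query asks the data source for.

```python
import csv
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]


def parse_line(line: str) -> List[str]:
    """Split one CSV line into tokens (a stand-in for the real tokenizer)."""
    return next(csv.reader([line]))


def scan_old(lines: Sequence[str], header: Sequence[str],
             required: Sequence[str]) -> List[Row]:
    """Toy model of the old behaviour: no required columns, no rows."""
    if not required:
        return []  # analogous to emptyRDD[Row]: counting it gives 0
    idx = [header.index(c) for c in required]
    return [tuple(parse_line(line)[i] for i in idx) for line in lines]


def scan_new(lines: Sequence[str], header: Sequence[str],
             required: Sequence[str]) -> List[Row]:
    """Toy model of the fix: keep one empty row per record."""
    if not required:
        # One empty row per input line; no token values are extracted.
        return [() for _ in lines]
    idx = [header.index(c) for c in required]
    return [tuple(parse_line(line)[i] for i in idx) for line in lines]
```

With three input lines and an empty `required` list, `scan_old` yields nothing to count, while `scan_new` yields three empty rows, so the count survives even though no column values are produced.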

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51091 has finished for PR 11169 at commit b52e156.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

cc @rxin @falaki

@falaki
Contributor

falaki commented Feb 12, 2016

@HyukjinKwon Thanks for submitting this. I think it is better to parse everything when the required columns are empty. The reason is that, depending on the parsing mode, the number of rows might turn out different.

@rxin
Contributor

rxin commented Feb 12, 2016

hm it seems like an important optimization that we wouldn't be able to do if we cannot skip parsing columns.

@HyukjinKwon
Member Author

@rxin @falaki I think I should have described this in more detail. This works identically to the original CSV datasource.

When the parsing mode is drop-malformed, it tries to parse every row; in the other modes it does not.

A similar issue was reported at https://github.com/databricks/spark-csv/issues/219 and fixed in databricks/spark-csv#220.

@HyukjinKwon
Member Author

CSVRelation.scala#L193-L199 makes sure that everything is parsed in drop-malformed mode, but not in the other modes.
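
The mode-dependent behaviour can be sketched roughly as follows. This is a plain-Python toy model, not the code at the lines referenced above. The function `count_rows` and the mode strings are hypothetical, and the sketch assumes that a "malformed" record is one whose token count differs from the number of schema fields.

```python
import csv
from typing import Sequence


def count_rows(lines: Sequence[str], num_fields: int, mode: str) -> int:
    """Count records, inspecting tokens only when the mode requires it."""
    if mode != "drop-malformed":
        # Other modes: every input line counts, no tokenizing needed.
        return len(lines)

    count = 0
    for line in lines:
        tokens = next(csv.reader([line]))
        # Assumption: a record is malformed when its token count
        # does not match the schema length; such records are dropped.
        if len(tokens) == num_fields:
            count += 1
    return count
```

Under these assumptions, the same input can produce different counts depending on the mode, which is why drop-malformed mode has to look at every record while the other modes can skip that work.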

@HyukjinKwon
Member Author

@rxin Can we maybe merge this for now and then take the optimisation into account in another PR?

That optimisation would also apply to all pruned scans (in drop-malformed mode), so I think we should deal with it in another PR.

@falaki
Contributor

falaki commented Feb 12, 2016

Thanks @HyukjinKwon, this looks good to me. The GitHub diff makes it look like a much bigger change than it is.

@rxin
Contributor

rxin commented Feb 12, 2016

I'm going to merge this. Thanks.

@asfgit asfgit closed this in ac7d6af Feb 12, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-13260 branch September 23, 2016 18:28