[SPARK-16896][SQL] Handle duplicated field names in header consistently with null or empty strings in CSV #14745

Closed
HyukjinKwon wants to merge 7 commits into apache:master from HyukjinKwon:SPARK-16896

Conversation

@HyukjinKwon (Member) commented Aug 22, 2016

What changes were proposed in this pull request?

Currently, the CSV datasource allows loading duplicated empty-string fields, or fields matching `nullValue`, in the header. It'd be great if it could handle normal fields as well.

This PR proposes handling duplicates consistently with the existing behaviour, taking case sensitivity (`spark.sql.caseSensitive`) into account, as below:

The data below:

```
fieldA,fieldB,,FIELDA,fielda,,
1,2,3,4,5,6,7
```

is parsed as below:

```scala
spark.read.format("csv").option("header", "true").load("test.csv").show()
```
  • when `spark.sql.caseSensitive` is `false` (the default):

    +-------+------+---+-------+-------+---+---+
    |fieldA0|fieldB|_c2|FIELDA3|fieldA4|_c5|_c6|
    +-------+------+---+-------+-------+---+---+
    |      1|     2|  3|      4|      5|  6|  7|
    +-------+------+---+-------+-------+---+---+
    
  • when `spark.sql.caseSensitive` is `true`:

    +-------+------+---+-------+-------+---+---+
    |fieldA0|fieldB|_c2| FIELDA|fieldA4|_c5|_c6|
    +-------+------+---+-------+-------+---+---+
    |      1|     2|  3|      4|      5|  6|  7|
    +-------+------+---+-------+-------+---+---+
    

In more detail:

There is a good reference for this problem: `read.csv()` in R. So I initially wanted to propose similar behaviour.

In R's case, the CSV data below:

```
fieldA,fieldB,,fieldA,fieldA,,
1,2,3,4,5,6,7
```

is parsed as below:

```r
test <- read.csv(file="test.csv",header=TRUE,sep=",")
> test
  fieldA fieldB X fieldA.1 fieldA.2 X.1 X.2
1      1      2 3        4        5   6   7
```

However, the Spark CSV datasource already handles duplicated empty strings and `nullValue` as field names. So the data below:

```
,,,fieldA,,fieldB,
1,2,3,4,5,6,7
```

is parsed as below:

```scala
spark.read.format("csv").option("header", "true").load("test.csv").show()
```
```
+---+---+---+------+---+------+---+
|_c0|_c1|_c2|fieldA|_c4|fieldB|_c6|
+---+---+---+------+---+------+---+
|  1|  2|  3|     4|  5|     6|  7|
+---+---+---+------+---+------+---+
```

R numbers each duplicate in order of occurrence, whereas Spark uses the field's positional index, as it already does for `nullValue` and empty-string fields.

In terms of case sensitivity, R appears to be case-sensitive, as below (this does not seem to be configurable):

```
a,a,a,A,A
1,2,3,4,5
```

is parsed as below:

```r
test <- read.csv(file="test.csv",header=TRUE,sep=",")
> test
  a a.1 a.2 A A.1
1 1   2   3 4   5
```
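For reference, here is a minimal standalone sketch of the index-suffix deduplication described above. It is written for illustration only and approximates, rather than reproduces, the code in this patch:

```scala
// Standalone sketch: make header names unique by appending the positional index.
// For brevity, the `nullValue` option is folded into the empty/null check.
def dedupHeader(row: Array[String], caseSensitive: Boolean): Array[String] = {
  // Normalise names for duplicate detection when case-insensitive.
  val names = row.filter(_ != null).map(n => if (caseSensitive) n else n.toLowerCase)
  // diff removes one occurrence per distinct name, leaving only the extras.
  val duplicates = names.diff(names.distinct).distinct.toSet
  row.zipWithIndex.map { case (value, index) =>
    if (value == null || value.isEmpty) {
      s"_c$index" // null or empty names fall back to the default _c<index>
    } else if (duplicates.contains(if (caseSensitive) value else value.toLowerCase)) {
      s"$value$index" // duplicated names get the positional index as a suffix
    } else {
      value
    }
  }
}
```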

How was this patch tested?

Unit test in CSVSuite.

@SparkQA

SparkQA commented Aug 22, 2016

Test build #64178 has finished for PR 14745 at commit a3c3dc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

row.zipWithIndex.map { case (value, index) =>
  if (value == null || value.isEmpty || value == options.nullValue) {
    // When there are empty strings or the values set in `nullValue`, put the
    // index as a post-fix.
Member

suffix?

Member Author

Thanks!

@felixcheung (Member)

looks good to me.
do we need to consider case? is "a1" the same as "A1"?

@HyukjinKwon (Member, Author)

Hm, yea, I think we should take that into account as spark.sql.caseSensitive is false by default. I will take a look at R as well and will fix this up tomorrow. Thank you for reviewing @felixcheung .
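For reference, `spark.sql.caseSensitive` can be toggled at runtime before reading (a usage sketch):

```scala
// Usage sketch: enable case-sensitive analysis, then reload the CSV.
spark.conf.set("spark.sql.caseSensitive", "true")
spark.read.format("csv").option("header", "true").load("test.csv").show()
```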

@SparkQA

SparkQA commented Aug 22, 2016

Test build #64201 has finished for PR 14745 at commit 94620ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Generates a header from the given row which is null-safe and duplicates-safe.
 */
private def makeSafeHeader(row: Array[String], options: CSVOptions): Array[String] = {
Contributor

I suggest putting this function in utils and writing a separate unit test for it.
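For illustration, such a unit test might look along these lines. This is a sketch against the `dedupHeader` helper sketched earlier in this thread; the actual helper and test suite in the patch may differ:

```scala
test("SPARK-16896: duplicated field names get index suffixes") {
  // `dedupHeader` is the illustrative helper from earlier in this thread,
  // not necessarily the signature that ends up in the patch.
  val header = Array("a", "b", "", "a", "")
  assert(dedupHeader(header, caseSensitive = false) ===
    Array("a0", "b", "_c2", "a3", "_c4"))
}
```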

@HyukjinKwon (Member, Author)

@falaki and @felixcheung Do you mind if I ask another quick look please?

@SparkQA

SparkQA commented Aug 23, 2016

Test build #64250 has finished for PR 14745 at commit 316049b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

Cc @rxin as well.

@SparkQA

SparkQA commented Aug 30, 2016

Test build #64630 has finished for PR 14745 at commit 0c02581.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

cc @cloud-fan Do you mind if I ask to take a look please?

  firstRow.zipWithIndex.map { case (value, index) => s"_c$index" }
}
val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
val header = makeSafeHeader(firstRow, csvOptions, caseSensitive)
Contributor

can we just make makeSafeHeader a private method in this class?

Member Author

Sure, let me fix this up.

@SparkQA

SparkQA commented Oct 9, 2016

Test build #66605 has finished for PR 14745 at commit fd24c5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

retest this please

@SparkQA

SparkQA commented Oct 9, 2016

Test build #66608 has finished for PR 14745 at commit fd24c5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      }
    }
  } else {
    row.zipWithIndex.map { case (value, index) =>
Contributor

this can be case (_, index) =>

    caseSensitive: Boolean): Array[String] = {
  if (options.headerFlag) {
    val duplicates = {
      val safeRow = if (!caseSensitive) {
Contributor

nit:

```scala
val headerNames = row.filter(_ != null).map(name => if (caseSensitive) name else name.toLowerCase)
```
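As a side note, the duplicate set can then be derived from such `headerNames` with the diff/distinct idiom (an illustrative sketch, not necessarily the exact code merged here):

```scala
// Sketch: compute the set of duplicated header names.
val headerNames = Seq("fielda", "fieldb", "fielda", "fielda")
// diff removes one occurrence of each distinct name, keeping only the extras;
// a final distinct collapses repeated extras into the duplicate set.
val duplicates = headerNames.diff(headerNames.distinct).distinct
// duplicates == List("fielda")
```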

@cloud-fan (Contributor)

LGTM

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66652 has finished for PR 14745 at commit 969c8f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit closed this in 90217f9 Oct 11, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

[SPARK-16896][SQL] Handle duplicated field names in header consistently with null or empty strings in CSV

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#14745 from HyukjinKwon/SPARK-16896.
@HyukjinKwon deleted the SPARK-16896 branch January 2, 2018 03:44
@NiharGharat

As R has a separator for duplicated column headers, like `a a.1 a.2 A A.1`, can we have one as well, like an underscore?

@HyukjinKwon (Member, Author)

I think you can simply rename the columns manually after loading.
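For example, a positional rename with `toDF` (the column names here are illustrative):

```scala
// Sketch: rename columns positionally after loading; names are illustrative.
val df = spark.read.format("csv").option("header", "true").load("test.csv")
val renamed = df.toDF("a", "a_1", "a_2", "A_3", "A_4")
```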

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jul 6, 2021
…8645)

Fixes the mangled-name bug in `read_csv` with duplicate columns, which mismatched pandas behavior.
#### csv file:
```csv
A,A,A.1,A,A.2,A,A.4,A,A
1,2,3,4.0,a,a,a.4,a,a
2,4,6,8.0,b,b,b.4,b,a
3,6,2,6.0,c,c,c.4,c,c
```

| A | A   | A.1   | A   | A.2   | A   | A.4 | A     | A   |
|---|-----|-------|-----|-------|-----|-----|-------|-----|
| A | A.1 | A.1.1 | A.2 | A.2.1 | A.3 | A.4 | A.4.1 | A.5 |


#### Pandas:
```python
In [1]: import pandas as pd
In [2]: pd.read_csv("test.csv")
Out[2]: 
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c
```

#### cudf: (21.08 nightly docker)
```python
In [1]: import cudf
In [2]: cudf.__version__
Out[2]: '21.08.00a+238.gfba09e66d8'
In [3]: cudf.read_csv("test.csv")
Out[3]: 
   A  A.1 A.2 A.3 A.4 A.5
0  1    3   a   a   a   a
1  2    6   b   b   b   a
2  3    2   c   c   c   c
```


This PR fixes this issue.
```python
In [2]: cudf.read_csv("test.csv")
Out[2]: 
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c
```



Related info (Spark):
Spark duplicate column naming:
https://issues.apache.org/jira/browse/SPARK-16896
apache/spark#14745
The cudf Spark plugin doesn't use libcudf names, so this PR does not affect it.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #8645