
[Audit][FEA][SPARK-36831] Support ANSI interval types from CSV source #4146

Open · Tracked by #2063
gerashegalov opened this issue Nov 18, 2021 · 5 comments
Assignees: res-life
Labels: audit_3.3.0 (Audit related tasks for 3.3.0), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)


gerashegalov (Collaborator)

Make sure the plugin can read ANSI interval types from CSV source

gerashegalov added the audit_3.3.0 label on Nov 18, 2021
res-life self-assigned this on Feb 17, 2022
res-life (Collaborator)

@revans2 Please help review the solution:

Interval types are described at https://spark.apache.org/docs/latest/sql-ref-datatypes.html

Currently the plugin does not support CSV writes, so this only covers reading interval types from CSV.

Spark's code for reading interval strings is in IntervalUtils.fromDayTimeString.

There is a legacy form and a normal form, switched by SQLConf.LEGACY_FROM_DAYTIME_STRING.

Legacy form
By default, the legacy from-daytime-string parsing is disabled, so we may not support it.

SQLConf.LEGACY_FROM_DAYTIME_STRING See: https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3042
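
For reference, a hedged sketch of how that legacy behavior would be toggled (assuming an active SparkSession named spark; the config key is taken from the SQLConf.LEGACY_FROM_DAYTIME_STRING definition linked above):

    // Off by default; the legacy day-time string parser is only used when this is enabled,
    // which is why we plan not to support the legacy form.
    spark.conf.set("spark.sql.legacy.fromDayTimeString.enabled", "true")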

For parseDayTimeLegacy, see:
https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala#L463-L475

Normal form
By default, Spark uses this form; see the examples in the table below. We will support only this form.

| type | example | valid | comment |
| --- | --- | --- | --- |
| INTERVAL DAY | INTERVAL '100' DAY | true | |
| INTERVAL DAY | 100 DAY | true | |
| INTERVAL DAY TO HOUR | INTERVAL '100 10' DAY TO HOUR | true | |
| INTERVAL DAY TO HOUR | 100 10 | true | |
| INTERVAL DAY TO SECOND | INTERVAL '100 10:30:40.999999' DAY TO SECOND | true | |
| INTERVAL DAY TO SECOND | 100 10:30:40.999999 | true | |
| INTERVAL DAY TO SECOND | INTERVAL '100 10:30:40.    999999' DAY TO SECOND | false | has extra spaces |
| INTERVAL DAY TO SECOND | INTERVAL '100     10:30:40.999999' DAY TO SECOND | false | has extra spaces |
| INTERVAL DAY TO SECOND | 100     10:30:40.999999 | false | has extra spaces |
| INTERVAL DAY TO SECOND | 100 10:30:     40.999999 | false | has extra spaces |
| INTERVAL DAY TO SECOND | INTERVAL '-100 10:30:40.999999' DAY TO SECOND | true | |
| INTERVAL DAY TO SECOND | INTERVAL -'-100 10:30:40.999999' DAY TO SECOND | true | two negative signs make it positive |
| INTERVAL DAY TO SECOND | -100 10:30:40.999999 | true | |
| INTERVAL DAY TO SECOND | INTERVAL '100 26:30:40.999999' DAY TO SECOND | false | hour is 26 > 23 |

Invalid values become null when reading CSV.
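
For context, a minimal CPU-side sketch (not from this issue) of reading such a column with Spark 3.3+; the file path and column name are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DayTimeIntervalType, StructField, StructType}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // One day-time interval column; DayTimeIntervalType() defaults to DAY TO SECOND.
    val schema = StructType(Seq(StructField("i", DayTimeIntervalType())))
    val df = spark.read
      .schema(schema)
      .csv("/tmp/intervals.csv") // rows like: INTERVAL '100 10:30:40.999999' DAY TO SECOND
    df.show(truncate = false)    // rows that fail to parse show null for column i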

Proposed solution for the normal form

Use cuDF ColumnView.extractRe to extract the day, hour, ..., second fields by specifying capture groups in a regular expression, and then compute the total microseconds.

The GPU code looks like this:

    val intervalCV = builder.buildAndPutOnDevice()
    val start = System.nanoTime()
    val p = "^INTERVAL\\s+([+-])?'([+-])?(\\d{1,9}) (\\d{1,2}):(\\d{1,2}):" +
        "(\\d{1,2})(\\.\\d{1,9})?'\\s+DAY\\s+TO\\s+SECOND$"

    // e.g.: INTERVAL -'-100 10:30:40.999999' DAY TO SECOND
    // extractRe output column 0: outer sign (-)
    // column 1: inner sign (-)
    // column 2: days (100)
    // column 3: hours (10)
    // column 4: minutes (30)
    // column 5: seconds (40)
    // column 6: fractional seconds (.999999)

    withResource(intervalCV.extractRe(p)) {
      table => {
        println("row count: " + table.getRowCount)

        val micros = table.getColumn(2).castTo(DType.INT64).mul(Scalar.fromLong(86400L * 1000000L))  // days to micros
            .add(table.getColumn(3).castTo(DType.INT64).mul(Scalar.fromLong(3600 * 1000000L)))       // hours to micros
            .add(table.getColumn(4).castTo(DType.INT64).mul(Scalar.fromLong(60 * 1000000L)))         // minutes to micros
            .add(table.getColumn(5).castTo(DType.INT64).mul(Scalar.fromLong(1000000L)))              // seconds to micros
            .add(table.getColumn(6).castTo(DType.INT64))                                             // fractional-seconds column
      }
    }
    val timeS = (System.nanoTime() - start).toDouble / TimeUnit.SECONDS.toNanos(1)
    println(s"GPU used time $timeS S")

The CPU code looks like this:

    // intervals: Array[String] with rowCount entries, prepared beforehand
    var i = 0
    while (i < rowCount) {
      IntervalUtils.fromDayTimeString(intervals(i))
      i += 1
    }

Row count: 10,000,000
GPU time: 0.86659265 s
CPU time: 16.818680952 s
GPU speedup: about 19x

Is this acceptable, or is there a more efficient approach?

revans2 (Collaborator) commented Feb 24, 2022

I really dislike regular expression use in casts, but it is a good first step. It would be nice to file a follow-on issue to write a custom kernel to do this for us.

Also, I assume you know that your code to do the conversion is leaking a lot of column views; I assume you did that just for readability of the code.

Second, have you tested this with CSV? The patch that added support for writing/reading CSV (https://issues.apache.org/jira/browse/SPARK-36831) did not add anything that calls fromDayTimeString or similar. The Parquet code just stored these as an int or a long. Please check whether CSV is doing the same, because I suspect it is, and then we don't need to do as much special processing.

res-life (Collaborator) commented Feb 25, 2022

Filed an issue: rapidsai/cudf#10356

CSV is a text format; the day-time interval is stored in string form, e.g.:

INTERVAL '100 10:30:40.999999' DAY TO SECOND 

Spark uses a similar method to parse an interval string into a day-time interval: IntervalUtils.castStringToDTInterval.
The IntervalUtils.fromDayTimeString called in the example code above also invokes IntervalUtils.castStringToDTInterval.
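
For illustration, a hedged sketch of calling that parser directly (the signature and field constants are assumed from the Spark 3.2/3.3 sources):

    import org.apache.spark.sql.catalyst.util.IntervalUtils
    import org.apache.spark.sql.types.DayTimeIntervalType
    import org.apache.spark.unsafe.types.UTF8String

    // Parses the string into microseconds (a Long); malformed input throws.
    val micros: Long = IntervalUtils.castStringToDTInterval(
      UTF8String.fromString("INTERVAL '100 10:30:40.999999' DAY TO SECOND"),
      DayTimeIntervalType.DAY,    // start field
      DayTimeIntervalType.SECOND) // end field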

I'm aware of the leaks in the example code; thanks for the kind reminder.

res-life (Collaborator) commented Apr 8, 2022

The plugin already supports reading day-time intervals from CSV through a temporary implementation, but it is still waiting on cuDF kernel support.

res-life (Collaborator)

The cuDF issue was closed without a fix.
For details see rapidsai/cudf#10356 (comment).

revans2 mentioned this issue on Oct 27, 2022