parquet writer support for TIMESTAMP_MILLIS #726

Merged
8 commits merged into NVIDIA:branch-0.3 on Sep 15, 2020

Conversation

razajafri (Collaborator)

Signed-off-by: Raza Jafri rjafri@nvidia.com

This adds support for writing TIMESTAMP_MILLIS in the Parquet writer.

@jlowe PTAL as you are the original author

fixes #142

"false")
val newBatch: ColumnarBatch = if (castToMillis.equals("true")) {
new ColumnarBatch(GpuColumnVector.extractColumns(batch).map(cv => {
if (cv.dataType() == DataTypes.TimestampType) {
Collaborator:

Nit: use `map { ... }`.

@@ -105,12 +105,27 @@ abstract class ColumnarOutputWriter(path: String, context: TaskAttemptContext,
*/
def write(batch: ColumnarBatch, statsTrackers: Seq[ColumnarWriteTaskStatsTracker]): Unit = {
var needToCloseBatch = true
val castToMillis = conf.get(GpuParquetFileFormat.PARQUET_WRITE_TIMESTAMP_CAST_TO_MILLIS,
"false")
val newBatch: ColumnarBatch = if (castToMillis.equals("true")) {
Collaborator:

We can do a boolean check like we do for the other RAPIDS confs instead of a string equality check.
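
A minimal sketch of that suggestion, assuming `conf` is the Hadoop `Configuration` from the task attempt context as in the diff above; Hadoop can parse the boolean itself, so the string comparison goes away:

```scala
import org.apache.hadoop.conf.Configuration

// Sketch only: read the flag once as a boolean via Configuration.getBoolean
// instead of comparing strings. The key is the one added in this PR.
def shouldCastToMillis(conf: Configuration): Boolean =
  conf.getBoolean(GpuParquetFileFormat.PARQUET_WRITE_TIMESTAMP_CAST_TO_MILLIS, false)
```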

@@ -34,6 +34,8 @@ import org.apache.spark.sql.rapids.execution.TrampolineUtil
import org.apache.spark.sql.types.{DateType, StructType, TimestampType}

object GpuParquetFileFormat {
val PARQUET_WRITE_TIMESTAMP_CAST_TO_MILLIS = "com.nvidia.spark.rapids.parquet.write.castToMillis"
Collaborator:

Should this be in RapidsConf, where all the other confs are?

Member:

I don't think we need a new config here. We should just do what Spark is doing here and use the same SQLConf config. That way if we add INT96 support it's straightforward. Making this a boolean means we'll have to update it if/when INT96 or other types are supported. Let's simplify and use the Spark conf key which precludes the need for a new conf key.
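
For reference, Spark's own `ParquetFileFormat.prepareWrite` roughly copies the session-level setting into the Hadoop `Configuration` the writers see. A hedged sketch of doing the same here, assuming a `SparkSession` and the writer's Hadoop conf are in hand:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

// Sketch: propagate Spark's existing SQLConf setting into the Hadoop
// Configuration, mirroring what ParquetFileFormat.prepareWrite does,
// rather than introducing a plugin-specific key.
def propagateOutputTimestampType(spark: SparkSession, conf: Configuration): Unit =
  conf.set(
    SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key,
    spark.sessionState.conf.parquetOutputTimestampType.toString)
```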

data_path = spark_tmp_path + '/PARQUET_DATA'
with_gpu_session(
lambda spark : unary_op_df(spark, gen).write.parquet(data_path),
conf={'spark.sql.parquet.outputTimestampType': 'TIMESTAMP_MILLIS'})
Collaborator:

Nit: could we parameterize this so that in the future, when we hopefully support INT96, testing it too is a very small change?

@@ -125,6 +125,15 @@ def test_ts_read_round_trip(spark_tmp_path, ts_write, ts_rebase, small_file_opt,
conf={'spark.rapids.sql.format.parquet.smallFiles.enabled': small_file_opt,
'spark.sql.sources.useV1SourceList': v1_enabled_list})

def test_parquet_write_ts_millis(spark_tmp_path):
gen = TimestampGen()
Collaborator:

What are we doing to avoid ts_rebaseModeIn*? We cannot support the full range of write/read options for timestamps unless something has changed recently.

Collaborator (Author):

We might not be hitting ambiguous dates when generating the TimestampGen data in this test, since the tests are passing.

Collaborator (Author):

Upon digging into this further, I am a little confused as to why the tests were passing without setting ts_rebaseModeIn*. Per the documentation, the default mode is EXCEPTION, which should throw an exception when it encounters an ambiguous date, but when I don't set ts_rebaseModeIn* it behaves as if I had set the value to CORRECTED. Upon explicitly setting the value to EXCEPTION, the test threw an exception. I have made the change to explicitly set ts_rebaseModeIn* to CORRECTED. Please let me know if this is acceptable or if you had something else in mind.
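
For context, a hedged sketch of pinning the rebase mode explicitly; the key names assume the Spark 3.0 legacy confs, and CORRECTED means timestamps are written and read as-is in the proleptic Gregorian calendar with no rebasing:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: pin the rebase mode so the test does not depend on defaults.
// Key names are assumed to be the Spark 3.0 legacy conf keys.
def withCorrectedRebase(spark: SparkSession): Unit = {
  spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
  spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
}
```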

@@ -162,11 +164,15 @@ class GpuParquetFileFormat extends ColumnarFileFormat with Logging {
val outputTimestampType = sparkSession.sessionState.conf.parquetOutputTimestampType
if (outputTimestampType != ParquetOutputTimestampType.TIMESTAMP_MICROS) {
Collaborator:

This is really convoluted code now. Could we just use a match on the output type? And why do we need to set a second config to reflect the value of the first one?

Member:

+1 we should just port the SQLConf key/value pair into the Hadoop Configuration as Spark does.

Collaborator:

Still not doing what I asked.

sparkSession.sessionState.conf.parquetOutputTimestampType match {
  case ParquetOutputTimestampType.TIMESTAMP_MICROS =>
  case ParquetOutputTimestampType.TIMESTAMP_MILLIS =>
  case outputTimestampType =>
    val hasTimestamps = dataSchema.exists { field =>
      TrampolineUtil.dataTypeExistsRecursively(field.dataType, _.isInstanceOf[TimestampType])
    }
    if (hasTimestamps) {
      throw new UnsupportedOperationException(
        s"Unsupported output timestamp type: $outputTimestampType")
    }
}

Because this code is doing almost exactly the same thing as the code above, we might be able to combine the two.
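
For instance, the two supported cases in the sketch above can be collapsed with a pattern alternation, reusing the names and imports from that snippet. This is only an illustration; the duplicated code elsewhere in the file is not shown here, so it may not be the exact consolidation the reviewer had in mind:

```scala
sparkSession.sessionState.conf.parquetOutputTimestampType match {
  // Both supported types need no extra handling here.
  case ParquetOutputTimestampType.TIMESTAMP_MICROS |
       ParquetOutputTimestampType.TIMESTAMP_MILLIS =>
  case outputTimestampType =>
    val hasTimestamps = dataSchema.exists { field =>
      TrampolineUtil.dataTypeExistsRecursively(field.dataType, _.isInstanceOf[TimestampType])
    }
    if (hasTimestamps) {
      throw new UnsupportedOperationException(
        s"Unsupported output timestamp type: $outputTimestampType")
    }
}
```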

@@ -105,12 +105,27 @@ abstract class ColumnarOutputWriter(path: String, context: TaskAttemptContext,
*/
def write(batch: ColumnarBatch, statsTrackers: Seq[ColumnarWriteTaskStatsTracker]): Unit = {
var needToCloseBatch = true
val castToMillis = conf.get(GpuParquetFileFormat.PARQUET_WRITE_TIMESTAMP_CAST_TO_MILLIS,
Member:

This should not look up the config on each batch, only once when the writer is created. It's wasteful to redundantly re-parse the config on every batch.



@sameerz added the "feature request" (New feature or request) label on Sep 11, 2020
@@ -105,12 +107,28 @@ abstract class ColumnarOutputWriter(path: String, context: TaskAttemptContext,
*/
def write(batch: ColumnarBatch, statsTrackers: Seq[ColumnarWriteTaskStatsTracker]): Unit = {
var needToCloseBatch = true
val outputTimestampType = conf.get(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key)
Member:

Still looking up the config on every batch; this should be done once when the ColumnarOutputWriter instance is created.
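
A sketch of the requested shape, with the constructor abbreviated from the diff context above: resolve the value once when the writer is constructed, then reuse the cached field for every call to write().

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.sql.internal.SQLConf

// Sketch: the output timestamp type is resolved once per writer instance,
// not re-read and re-parsed on every batch inside write().
abstract class ColumnarOutputWriter(path: String, context: TaskAttemptContext) {
  private val conf = context.getConfiguration

  // Looked up and parsed exactly once, when the writer is created.
  protected val outputTimestampType: String =
    conf.get(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key)
}
```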

@@ -105,12 +107,28 @@ abstract class ColumnarOutputWriter(path: String, context: TaskAttemptContext,
*/
def write(batch: ColumnarBatch, statsTrackers: Seq[ColumnarWriteTaskStatsTracker]): Unit = {
var needToCloseBatch = true
val outputTimestampType = conf.get(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key)
val newBatch = if (outputTimestampType == ParquetOutputTimestampType.TIMESTAMP_MILLIS) {
Member:

Does this work? It looks like we're comparing a string with an enum here.
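
In Scala, comparing a String against an Enumeration value compiles but always evaluates to false, so this condition could never be true. A hedged sketch of one possible fix, reusing `conf` from the diff context above:

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.ParquetOutputTimestampType

// Sketch: the Hadoop conf stores the setting as a String, so compare
// against the enum value's string form rather than the enum value itself.
val outputTimestampType = conf.get(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key)
val castToMillis =
  outputTimestampType == ParquetOutputTimestampType.TIMESTAMP_MILLIS.toString
```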

Collaborator (Author):

Thank you for pointing this out!

@razajafri (Collaborator, Author)

Thanks for the review @jlowe @revans2 @kuhushukla, can you please take another look? @revans2 I thought about merging this test with the other write_test, but I stuck with this because it keeps things simple.

@razajafri (Collaborator, Author)

build

1 similar comment
@razajafri (Collaborator, Author)

build

@revans2 (Collaborator) commented Sep 11, 2020

I also just noticed that all of the changes are in generic code, so setting a Parquet config could impact ORC output.

@revans2 (Collaborator) commented Sep 14, 2020

@razajafri things are looking off after the upmerge. Not sure if GitHub is confused or what, but it shows 50 files have changed.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator, Author)

build

@razajafri (Collaborator, Author)

@revans2 I apologize, but I had to rebase. scanTableBeforeWrite is called before write, so all the columns should be in MICROS, which precludes the need to convert the columns before checking when the switch happened.

I might be missing something very basic that you are trying to point out.

@razajafri (Collaborator, Author)

build

revans2 previously approved these changes Sep 15, 2020
@@ -18,14 +18,16 @@ package com.nvidia.spark.rapids

import scala.collection.mutable

import ai.rapids.cudf.{HostBufferConsumer, HostMemoryBuffer, NvtxColor, NvtxRange, Table, TableWriter}
import ai.rapids.cudf.{DType, HostBufferConsumer, HostMemoryBuffer, NvtxColor, NvtxRange, Table, TableWriter}
Collaborator:

Can we revert all of the changes to this file? It looks like none of them are needed.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator, Author)

build

@revans2 (Collaborator) commented Sep 15, 2020

Odd, the CI job says it passed, but GitHub does not show it as passed...

@razajafri (Collaborator, Author)

Can we force merge, or do we need to build again?

revans2 merged commit 3219fa4 into NVIDIA:branch-0.3 on Sep 15, 2020
@revans2 (Collaborator) commented Sep 15, 2020

Bypassed the GitHub checks because the CI build passed, but it looked like it didn't update GitHub, despite a log message in the job saying it did.

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#726)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels: feature request (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
[FEA] Support TIMESTAMP_MILLIS for spark.sql.parquet.outputTimestampType
5 participants