[SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support #35262
Conversation
@sunchao, @dongjoon-hyun, @viirya, @LuciferYang could you please review?
Will do. @parthchandra could you use a different JIRA for this?
Thank you for pinging me, @parthchandra.
Yeah, this looks like more than a follow-up.
Force-pushed from 4ada974 to 2c73794.
Created a new JIRA and modified the commit message(s) and the PR title.
@Override
public void initFromPage(int valueCount, ByteBufferInputStream in) throws IOException {
  deltaByteArrayReader.initFromPage(valueCount, in);
  this.valueCount = valueCount;
Hmm... maybe a stupid question, but why does VectorizedDeltaByteArrayReader need to hold valueCount?
Nope, it doesn't. Removed.
private int currentRow = 0;

// temporary variable used by getBinary
Binary binaryVal;
This should be private.
done
}

@Override
public Binary readBinary(int len) {
-  return deltaByteArrayReader.readBytes();
+  readValues(1, null, 0,
+    (w, r, v, l) ->
Will this lambda create a new object every time readBinary is called?
I really hope not. AFAIK, lambdas are highly optimized to avoid object-creation overhead. I'm not sure whether the function-call overhead might also be eliminated by inlining.
cc @rednaxelafx, can you help check whether multiple objects or a single object will be generated in this lambda scenario?
I changed the test "parquet v2 pages - delta encoding" in ParquetEncodingSuite into a loop:
while (true) {
  val actual = spark.read.parquet(path).collect()
  assert(actual.sortBy(_.getInt(0)) === data.map(Row.fromTuple))
}
and dumped the memory many times. I then found many objects of class org.apache.spark.sql.execution.datasources.parquet.VectorizedDeltaByteArrayReader$$Lambda$3232 in the memory dump. It seems that because the lambda captures an external variable, binaryVal, a new object is generated every time the method is called. @parthchandra
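(For illustration, a minimal runnable sketch of the behavior described above: on current HotSpot JVMs, a lambda that captures enclosing state is typically instantiated on every evaluation, while a non-capturing lambda may be cached and reused. This is implementation behavior, not a spec guarantee.)

```java
import java.util.function.Supplier;

public class LambdaCaptureDemo {
  // Returns a capturing lambda: it closes over the parameter 'captured',
  // so the JVM typically allocates a fresh instance on each call.
  static Supplier<String> capturing(String captured) {
    return () -> captured;
  }

  // Returns a non-capturing lambda: no enclosing state is referenced,
  // so the instance can be cached and reused across calls.
  static Runnable stateless() {
    return () -> { };
  }

  public static void main(String[] args) {
    System.out.println(capturing("x") == capturing("x"));  // typically false
    System.out.println(stateless() == stateless());        // typically true
  }
}
```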
What you say makes sense: the reference to the external variable will cause multiple object instantiations. Thank you for doing this research!
I tried something similar with the unit test in ParquetEncodingSuite and saw only a single instance of the lambda created (not sure why).
I've changed the code to use a WritableColumnVector of size 1, which eliminates the need to access the variable directly.
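(A sketch of what that fix could look like, reusing names that appear elsewhere in this PR's diff, namely binaryValVector and ByteBufferOutputWriter.writeArrayByteBuffer; the readValues signature here is an assumption. The key point is that a method reference to a static method captures nothing, so no per-call allocation occurs.)

```java
// Reusable size-1 scratch vector; readBinary routes the decoded value through
// it instead of through a local variable that a lambda would have to capture.
private final WritableColumnVector binaryValVector = new OnHeapColumnVector(1, BinaryType);

@Override
public Binary readBinary(int len) {
  // Non-capturing method reference: no new lambda object per call.
  readValues(1, binaryValVector, 0, ByteBufferOutputWriter::writeArrayByteBuffer);
  return Binary.fromConstantByteArray(binaryValVector.getBinary(0));
}
```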
interface ByteBufferOutputWriter {
  void write(WritableColumnVector c, int rowId, ByteBuffer val, int length);

  static void writeArrayByteBuffer(WritableColumnVector c, int rowId, ByteBuffer val,
Is it good practice to add static methods to an interface? I'm not sure.
I don't know if it is frowned upon. In this case, not including them in the interface only leads to some code bloat.
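(For reference, static interface methods have been legal since Java 8 and are a common home for stock implementations of a functional method. A self-contained sketch of the pattern, with the PR's writer signature simplified to avoid Spark dependencies:)

```java
import java.nio.ByteBuffer;

public class StaticInterfaceDemo {
  // The functional method and its stock implementations live on one interface,
  // so callers can pass a method reference like Writer::skipWrite without a
  // separate utility class.
  interface Writer {
    void write(int rowId, ByteBuffer val, int length);

    static void writeArray(int rowId, ByteBuffer val, int length) {
      System.out.println("write " + length + " bytes at row " + rowId);
    }

    // No-op used when values are skipped rather than materialized.
    static void skipWrite(int rowId, ByteBuffer val, int length) { }
  }

  public static void main(String[] args) {
    Writer w = Writer::writeArray;
    w.write(0, ByteBuffer.allocate(4), 4);  // prints: write 4 bytes at row 0
  }
}
```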
@@ -283,25 +294,30 @@ private void initDataReader(
  } catch (IOException e) {
    throw new IOException("could not read page in col " + descriptor, e);
  }
  if (CorruptDeltaByteArrays.requiresSequentialReads(writerVersion, dataEncoding) &&
      previousReader != null && previousReader instanceof RequiresPreviousReader) {
Is previousReader != null necessary? previousReader instanceof RequiresPreviousReader already covers previousReader != null.
You're right, it is not needed.
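(This is guaranteed by the language: instanceof evaluates to false for a null reference, per JLS 15.20.2, so the separate null check is dead code. A quick demonstration:)

```java
public class InstanceofNullDemo {
  public static void main(String[] args) {
    Object previousReader = null;
    // No NPE: instanceof on a null reference simply yields false.
    System.out.println(previousReader instanceof CharSequence);  // prints: false
  }
}
```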
Will this PR speed up string-related benchmark results when using Parquet Data Page V2?
val data = (1 to 8197).map { i =>
  (i,
    i.toLong, i.toShort, Array[Byte](i.toByte),
    if (i % 2 == 1) s"test_${i}" else null,
nit: use test_$i instead of test_${i}.
done
VectorizedValuesReader {

  private final MemoryMode memoryMode;
  private int valueCount;
This valueCount can also be removed.
Removed
@parthchandra Can you update the benchmark run with Java 8 again? The marked data is much slower than before; I'm not sure whether this data is reasonable.
@@ -93,6 +96,12 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont
    HadoopInputFile.fromPath(file, configuration), options);
  this.reader = new ParquetRowGroupReaderImpl(fileReader);
  this.fileSchema = fileReader.getFileMetaData().getSchema();
  try {
    this.writerVersion = VersionParser.parse(fileReader.getFileMetaData().getCreatedBy());
  } catch (Exception e) {
Will other types of exceptions be thrown here besides VersionParseException?
Well, yes. I encountered at least one case where the version information was empty and the version check threw an NPE.
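(A sketch of the defensive handling this implies; treating any parse failure as an unknown writer version is an assumption about the surrounding code, not a quote of it:)

```java
try {
  this.writerVersion = VersionParser.parse(fileReader.getFileMetaData().getCreatedBy());
} catch (Exception e) {
  // VersionParser.parse throws VersionParseException for malformed strings, but
  // an empty or absent createdBy value can surface as a NullPointerException
  // instead, so catch broadly. Leaving writerVersion unset just means the
  // corrupt DELTA_BYTE_ARRAY workaround cannot be applied.
}
```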
Updated the JDK 8 benchmark results as well.
After comparing the new bench data, I find that the data corresponding to …
It's hard to reasonably compare the numbers across runs (even though the difference is substantial) because of the difference in the environment. (See line 237 in 6e64e92.)
Let me do a profile run to see if any obvious bottlenecks stand out.
@parthchandra I think we should add some UTs similar to … I manually verified that there was no such problem before this PR.
@Override
public void initFromPage(int valueCount, ByteBufferInputStream in) throws IOException {
  if (memoryMode == MemoryMode.OFF_HEAP) {
    lengthsVector = new OffHeapColumnVector(valueCount, IntegerType);
Maybe we should call lengthsVector.putInts(0, valueCount, 0); to ensure the initial values of the OffHeapColumnVector, or find another way to avoid reading unexpected values at line 75 when memoryMode is MemoryMode.OFF_HEAP.
Thank you for finding this issue! Let me address this and add the unit test(s) as well.
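(A minimal sketch of the suggested fix, using the WritableColumnVector API shown in the diff. On-heap int arrays are zero-initialized by the JVM, so only the off-heap branch needs the explicit write:)

```java
@Override
public void initFromPage(int valueCount, ByteBufferInputStream in) throws IOException {
  if (memoryMode == MemoryMode.OFF_HEAP) {
    lengthsVector = new OffHeapColumnVector(valueCount, IntegerType);
    // Off-heap allocations are not guaranteed to be zeroed; write the default
    // lengths up front so positions skipped for nulls read back as 0 rather
    // than whatever bytes were left in the buffer.
    lengthsVector.putInts(0, valueCount, 0);
  } else {
    lengthsVector = new OnHeapColumnVector(valueCount, IntegerType);
  }
  // ... remainder of initFromPage as in the diff above
}
```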
@Override
public void skipBinary(int total) {
  if (total == 0) {
We can remove this too.
removed
prefixLengthReader.readIntegers(prefixLengthReader.getTotalValueCount(),
  prefixLengthVector, 0);
suffixReader.initFromPage(valueCount, in);
suffixReader.readBinary(prefixLengthReader.getTotalValueCount(), suffixVector, 0);
Instead of eagerly reading the suffixes, we can have a method in VectorizedDeltaLengthByteArrayReader that just returns the suffix at rowId:
public ByteBuffer getBytes(int rowId) {
int length = lengthsVector.getInt(rowId);
try {
return in.slice(length);
} catch (EOFException e) {
throw new ParquetDecodingException("Failed to read " + length + " bytes");
}
}
I tried this approach here, and it can improve the benchmark.
// but it incurs the same cost of copying the values twice _and_ c.getBinary
// is a _slow_ byte-by-byte copy.
// The following always uses the faster System.arraycopy method.
byte[] out = new byte[length];
We can also potentially skip this copying, at least for OnHeapColumnVector. I tried it and it gives some extra performance improvement.
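(For illustration, one way to skip the copy on the on-heap path is to hand back a ByteBuffer view over the vector's backing array instead of materializing a fresh byte[]. The method name follows the getByteBuffer rename settled on later in this review, and the byteData field name is an assumption about OnHeapColumnVector's internals:)

```java
@Override
public ByteBuffer getByteBuffer(int rowId, int count) {
  // Wrap a view over the existing on-heap array; no allocation or copy.
  // Callers must treat the buffer as read-only.
  return ByteBuffer.wrap(byteData, rowId, count);
}
```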
[info] OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
[info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
[info] String with Nulls Scan (0.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SQL CSV 5721 5727 8 1.8 545.6 1.0X
[info] SQL Json 6289 6295 9 1.7 599.7 0.9X
[info] SQL Parquet Vectorized: DataPageV1 700 800 87 15.0 66.7 8.2X
[info] SQL Parquet Vectorized: DataPageV2 994 1031 52 10.5 94.8 5.8X
[info] SQL Parquet MR: DataPageV1 2035 2051 23 5.2 194.1 2.8X
[info] SQL Parquet MR: DataPageV2 2289 2454 232 4.6 218.3 2.5X
[info] ParquetReader Vectorized: DataPageV1 472 482 15 22.2 45.0 12.1X
[info] ParquetReader Vectorized: DataPageV2 640 645 4 16.4 61.0 8.9X
[info] SQL ORC Vectorized 670 694 35 15.7 63.9 8.5X
[info] SQL ORC MR 1846 2047 284 5.7 176.0 3.1X
[info] OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
[info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
[info] String with Nulls Scan (50.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SQL CSV 4825 4890 91 2.2 460.2 1.0X
[info] SQL Json 5298 7385 2951 2.0 505.3 0.9X
[info] SQL Parquet Vectorized: DataPageV1 701 889 169 14.9 66.9 6.9X
[info] SQL Parquet Vectorized: DataPageV2 684 737 58 15.3 65.2 7.1X
[info] SQL Parquet MR: DataPageV1 1857 1869 17 5.6 177.1 2.6X
[info] SQL Parquet MR: DataPageV2 2034 2146 159 5.2 193.9 2.4X
[info] ParquetReader Vectorized: DataPageV1 474 493 11 22.1 45.2 10.2X
[info] ParquetReader Vectorized: DataPageV2 585 586 1 17.9 55.8 8.2X
[info] SQL ORC Vectorized 810 845 53 12.9 77.3 6.0X
[info] SQL ORC MR 1854 1935 114 5.7 176.8 2.6X
[info] OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
[info] Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
[info] String with Nulls Scan (95.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] SQL CSV 3212 3256 63 3.3 306.3 1.0X
[info] SQL Json 3693 3695 3 2.8 352.2 0.9X
[info] SQL Parquet Vectorized: DataPageV1 147 203 46 71.2 14.0 21.8X
[info] SQL Parquet Vectorized: DataPageV2 160 286 144 65.4 15.3 20.0X
[info] SQL Parquet MR: DataPageV1 1229 1351 172 8.5 117.2 2.6X
[info] SQL Parquet MR: DataPageV2 1074 1099 36 9.8 102.4 3.0X
[info] ParquetReader Vectorized: DataPageV1 107 109 2 97.9 10.2 30.0X
[info] ParquetReader Vectorized: DataPageV2 124 127 2 84.7 11.8 25.9X
[info] SQL ORC Vectorized 262 308 86 40.0 25.0 12.3X
[info] SQL ORC MR 1002 1070 96 10.5 95.5 3.2X
@sunchao I merged your changes into the PR. Also updated the benchmarks.
@sunchao should we continue this?
LGTM
cc @cloud-fan @sadikovi if you want to take another look.
private ByteBuffer previous;
private int currentRow = 0;

// temporary variable used by getBinary
nit: getBinary -> readBinary. Also, can we add some comments on what tempBinaryValVector is for?
@@ -443,6 +444,8 @@ public UTF8String getUTF8String(int rowId) {
  }
}

public abstract ByteBuffer getBytesUnsafe(int rowId, int count);
nit: maybe add a few comments here
I agree, the method name is misleading since there is a memory copy involved; it just does not call System.arraycopy in OnHeapColumnVector.
I agree as well. @sunchao, given that this is from your patch, is it OK to change the name to, say, getByteBuffer?
Yeah, I think we can use getByteBuffer; the "unsafe" here is a bit confusing.
@@ -16,50 +16,126 @@
 */
package org.apache.spark.sql.execution.datasources.parquet;

import static org.apache.spark.sql.types.DataTypes.BinaryType;
Do we have a clear import-order definition for static imports? @sunchao @dongjoon-hyun
length = lengthsVector.getInt(currentRow + i);
int remaining = length;
while (remaining > 0) {
  remaining -= in.skip(remaining);
Did I miss anything? Do we really need length here?
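(The reviewer's point: the intermediate length variable adds nothing. A minimal equivalent of the loop, keeping the retry because InputStream.skip may skip fewer bytes than requested:)

```java
int remaining = lengthsVector.getInt(currentRow + i);
while (remaining > 0) {
  // skip() may consume fewer bytes than asked; the compound assignment
  // narrows its long return value back to int.
  remaining -= in.skip(remaining);
}
```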
Addressed the latest few comments.
LGTM +1
I left a few comments, would appreciate it if you could take a look. Thanks.
@@ -283,6 +290,11 @@ private void initDataReader(
  } catch (IOException e) {
    throw new IOException("could not read page in col " + descriptor, e);
  }
  if (CorruptDeltaByteArrays.requiresSequentialReads(writerVersion, dataEncoding) &&
When does this happen? Can you add a comment on why we need this?
Added a comment. The detailed explanation is in the comment in VectorizedDeltaByteArrayReader.setPreviousValue.
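(For context, a sketch of the hand-off this guards. parquet-mr versions affected by PARQUET-246 wrote DELTA_BYTE_ARRAY pages whose first value borrows a prefix from the last value of the previous page, so a new page's reader must be seeded with that value. The dataColumn name and the setter call are assumptions based on the RequiresPreviousReader interface named in the diff:)

```java
if (CorruptDeltaByteArrays.requiresSequentialReads(writerVersion, dataEncoding) &&
    previousReader instanceof RequiresPreviousReader) {
  // Seed the new reader with the previous page's reader so the first
  // prefix-delta can be resolved against the last value it decoded.
  ((RequiresPreviousReader) dataColumn).setPreviousReader(previousReader);
}
```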
@@ -90,13 +90,18 @@ public void initFromPage(int valueCount, ByteBufferInputStream in) throws IOExce
  Preconditions.checkArgument(miniSize % 8 == 0,
    "miniBlockSize must be multiple of 8, but it's " + miniSize);
  this.miniBlockSizeInValues = (int) miniSize;
  // True value count. May be less than valueCount because of nulls
I think it would be more useful to annotate the method getTotalValueCount instead of here.
Added the comment to getTotalValueCount as well.
@@ -283,6 +290,11 @@ private void initDataReader(
  } catch (IOException e) {
    throw new IOException("could not read page in col " + descriptor, e);
  }
  if (CorruptDeltaByteArrays.requiresSequentialReads(writerVersion, dataEncoding) &&
      previousReader instanceof RequiresPreviousReader) {
    // previous reader can only be set if reading sequentially
nit: [P]revious.
private ByteBuffer previous;
private int currentRow = 0;

// temporary variable used by readBinary
nit: Upper case.
Fixed
// temporary variable used by readBinary
private final WritableColumnVector binaryValVector;
// temporary variable used by skipBinary
nit: Upper case.
Fixed
static void skipWrite(WritableColumnVector c, int rowId, ByteBuffer val, int length) { }

}
nit: new line.
Ok
@@ -221,6 +221,13 @@ protected UTF8String getBytesAsUTF8String(int rowId, int count) {
  return UTF8String.fromAddress(null, data + rowId, count);
}

@Override
public ByteBuffer getBytesUnsafe(int rowId, int count) {
Can we replace it with:
return ByteBuffer.wrap(getBytes(rowId, count));
We could, but it would incur an additional function call in performance-sensitive code. @sunchao?
It seems fine to use getBytes here also; the function call will be inlined by the JIT if this is in a hot path.
    i += skipCount + 1
  }
}
ditto.
Ok
// reads at least twice from the reader). This will catch any issues with state
// maintained by the reader(s).
// Add at least one string with a null
val data = (1 to 81971).map { i =>
I don't quite understand how this number was chosen. Can you elaborate? Can we make it 2 * 4096 + 1 - would it work as well?
Yes, it would. Changed. (Sorry, that number sneaked in after I tested something else and forgot to undo it.)
Updated getBytesUnsafe to getByteBuffer and cleaned up the off-heap implementation.
Thanks @sadikovi for the review! Looks like we need to fix lint. Since the Spark 3.3 branch has been cut, I've asked on the dev mailing list to see if we can still include this in the release.
@sunchao, @LuciferYang, @sadikovi thank you for your reviews!
Sorry for the delay. Going to merge this now since the PR is included in the allowed list for Spark 3.3. The linter issue looks unrelated.
…NGTH_BYTE_ARRAY encodings for Parquet V2 support

### What changes were proposed in this pull request?
This PR provides a vectorized implementation of the DELTA_BYTE_ARRAY encoding of Parquet V2. The PR also implements the DELTA_LENGTH_BYTE_ARRAY encoding, which is needed by the former.

### Why are the changes needed?
The current support for Parquet V2 in the vectorized reader uses a non-vectorized version of the above encodings and needs to be vectorized.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Reproduces all the tests for the encodings from the Parquet implementation. Also adds more cases to the Parquet encoding test suite.

Closes #35262 from parthchandra/SPARK-36879-PR3.

Lead-authored-by: Parth Chandra <parthc@apache.org>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Chao Sun <sunchao@apple.com>
Merged to master/3.3, thanks @parthchandra and all!
Thank you @sunchao, @LuciferYang, @sadikovi. I'll submit a few smaller follow-up PRs for the issues that were deferred as a result of the review.
What changes were proposed in this pull request?
This PR provides a vectorized implementation of the DELTA_BYTE_ARRAY encoding of Parquet V2. The PR also implements the DELTA_LENGTH_BYTE_ARRAY encoding which is needed by the former.
Why are the changes needed?
The current support for Parquet V2 in the vectorized reader uses a non-vectorized version of the above encoding and needs to be vectorized.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Reproduces all the tests for the encodings from the Parquet implementation. Also adds more cases to the Parquet Encoding test suite.