Spark: Add read/write support for UUIDs #7399

Merged — 3 commits merged into apache:master from the spark-uuid-read-write-support-3.4 branch on May 1, 2023

Conversation

@nastra (Contributor) commented Apr 21, 2023

fixes #4581

@nastra nastra marked this pull request as draft April 21, 2023 14:22
@github-actions github-actions bot added the spark label Apr 21, 2023
@nastra nastra force-pushed the spark-uuid-read-write-support-3.4 branch 2 times, most recently from ef2b760 to 0c81705 on April 21, 2023 15:34
  ThreadLocal.withInitial(
      () -> {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.order(ByteOrder.BIG_ENDIAN);
        return buffer;
      });
Member:

This is the default, right? Just setting it to be sure?

Contributor Author (@nastra):

I was mainly aligning with other places in the code that also use a thread-local ByteBuffer when reading UUIDs.

Contributor:

Big endian is correct. See UUIDUtil for another implementation.

Member:

Not that it's incorrect; it's just the default for all new ByteBuffers. Just wondering why we were setting it explicitly.

Contributor:

I think it's usually good to be explicit. Are we sure this is the default, or is it the default only on certain architectures?

Member:

https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html#order()

The byte order is used when reading or writing multibyte values, and when creating buffers that are views of this byte buffer. The order of a newly-created byte buffer is always [BIG_ENDIAN](https://docs.oracle.com/javase/7/docs/api/java/nio/ByteOrder.html#BIG_ENDIAN).
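
A quick check confirms this holds regardless of the platform's native order (illustrative snippet, not code from the PR):

  import java.nio.ByteBuffer;
  import java.nio.ByteOrder;

  public class ByteOrderCheck {
    public static void main(String[] args) {
      // Per the javadoc above, a newly created buffer is always big-endian...
      ByteBuffer buffer = ByteBuffer.allocate(16);
      System.out.println(buffer.order());           // BIG_ENDIAN on every JVM
      // ...while the platform's native order may differ.
      System.out.println(ByteOrder.nativeOrder());  // e.g. LITTLE_ENDIAN on x86
    }
  }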

@nastra nastra force-pushed the spark-uuid-read-write-support-3.4 branch from 0c81705 to 461276b on April 24, 2023 15:17
@nastra nastra force-pushed the spark-uuid-read-write-support-3.4 branch from 461276b to dbd7ba4 on April 24, 2023 15:58
@RussellSpitzer (Member) left a comment:

LGTM

@nastra nastra marked this pull request as ready for review April 25, 2023 07:11
@Fokko (Contributor) left a comment:

This is awesome!

@nastra nastra force-pushed the spark-uuid-read-write-support-3.4 branch from dbd7ba4 to b4f2593 on April 25, 2023 08:10
@nastra nastra added this to the Iceberg 1.3.0 milestone Apr 25, 2023
@RussellSpitzer (Member) left a comment:

Lgtm! Thanks for doing this!


  @Override
  public void nonNullWrite(int rowId, UTF8String data, ColumnVector output) {
    ByteBuffer buffer = UUIDUtil.convertToByteBuffer(UUID.fromString(data.toString()));
Contributor:

This allocates a buffer. We may want to have a buffer here as a thread-local or a field to avoid allocation in a tight loop.

Contributor Author (@nastra):

I agree with that observation. I initially used a thread-local to reduce byte[] allocation, but I couldn't get it to work: ((BytesColumnVector) output).setRef(..) just stores a reference to the passed byte[], so subsequent writes would overwrite previously written values.
Worth mentioning that GenericOrcWriters does the same thing when writing UUIDs.
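
A minimal sketch of the aliasing problem described above (illustrative only, not the PR's code; BytesColumnVector is the Hive vector class that ORC writes into):

  import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;

  byte[] scratch = new byte[16];             // one reused, thread-local-style buffer
  BytesColumnVector vector = new BytesColumnVector();

  for (int rowId = 0; rowId < 2; rowId++) {
    scratch[15] = (byte) rowId;              // overwrite the shared buffer in place
    vector.setRef(rowId, scratch, 0, 16);    // stores a reference, not a copy
  }
  // Both rows now alias the same array, so row 0 reads back row 1's bytes.
  // setVal(..) copies into the vector's own buffers and avoids the aliasing,
  // which is why the PR allocates a fresh buffer per value instead.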

Contributor:

Probably worth mentioning in a comment?

Contributor Author (@nastra):

Makes sense; I've added a comment for this as part of #7496.

    ByteBuffer buffer = BUFFER.get();
    buffer.rewind();
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
Contributor:

In other places, like UUIDUtil, we use putLong(int offset, long value) instead of putLong(long value) so that the position is not updated and we don't need to worry about the buffer's internal state. I think that's usually a better approach.

Also, we might want to update UUIDUtil to share this code:

  public static ByteBuffer convertToByteBuffer(UUID value) {
    return convertToByteBuffer(value, null);
  }

  public static ByteBuffer convertToByteBuffer(UUID value, ByteBuffer reuse) {
    ByteBuffer buffer;
    if (reuse != null) {
      buffer = reuse;
    } else {
      buffer = ByteBuffer.allocate(16);
    }

    buffer.order(ByteOrder.BIG_ENDIAN);
    buffer.putLong(0, value.getMostSignificantBits());
    buffer.putLong(8, value.getLeastSignificantBits());
    return buffer;
  }
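
For illustration, a caller could then pair the reuse overload with a thread-local scratch buffer (a sketch that assumes uuid is in scope; because convertToByteBuffer only uses absolute puts, no rewind() is needed and the buffer's position is never touched):

  private static final ThreadLocal<ByteBuffer> BUFFER =
      ThreadLocal.withInitial(() -> ByteBuffer.allocate(16));

  // The same array is handed back on every call, so the bytes must be copied
  // out before the next conversion (see the setRef discussion above).
  ByteBuffer buffer = UUIDUtil.convertToByteBuffer(uuid, BUFFER.get());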

Contributor Author (@nastra):

That makes sense; I've updated it.
We still have a few places in the code that call buffer.putLong(uuid.getMostSignificantBits()); I'll follow up on those and update them independently.

Contributor Author (@nastra):

I've opened #7525 to address those other places in Spark.

@@ -74,6 +76,11 @@ public UTF8String ofRow(VarCharVector vector, int rowId) {
        null, vector.getDataBuffer().memoryAddress() + start, end - start);
  }

  @Override
  public UTF8String ofRow(FixedSizeBinaryVector vector, int rowId) {
    return UTF8String.fromString(UUIDUtil.convert(vector.get(rowId)).toString());
Contributor:

Is there a way to get the underlying array and offset?

Contributor Author (@nastra):

vector.get(rowId) returns the byte[] of length 16 for the given rowId. I think we could get the underlying array and offset from the underlying ArrowBuf, but we would still need to read it into a new byte[], which is what vector.get(rowId) already does underneath.
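
For illustration, going through the underlying buffer directly would still involve the same copy (sketch, assuming ArrowBuf's getBytes(long, byte[]) and 16-byte fixed-width rows):

  // Bypassing vector.get(rowId) does not avoid the allocation or the copy.
  byte[] bytes = new byte[16];
  vector.getDataBuffer().getBytes(rowId * 16L, bytes);  // same 16-byte copy
  return UTF8String.fromString(UUIDUtil.convert(bytes).toString());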

@@ -329,6 +329,8 @@ public Object primitive(Type.PrimitiveType primitive) {
        return UTF8String.fromString((String) obj);
      case DECIMAL:
        return Decimal.apply((BigDecimal) obj);
      case UUID:
        return UTF8String.fromString(UUID.nameUUIDFromBytes((byte[]) obj).toString());
Contributor:

Why does generatePrimitive provide byte[]? Shouldn't it create a String for Spark already?

Contributor Author (@nastra):

My guess would be that RandomUtil.generatePrimitive(..) is used in other places where UUIDs are expected to be byte[].
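
For context, UUID.nameUUIDFromBytes derives a deterministic version-3 (MD5 name-based) UUID from arbitrary input bytes, so any byte[] the generator produces maps to a valid UUID string (a quick illustration, not code from the PR):

  import java.util.UUID;

  byte[] randomBytes = {0x01, 0x02, 0x03};          // any generator output works
  UUID uuid = UUID.nameUUIDFromBytes(randomBytes);  // deterministic version-3 UUID
  System.out.println(uuid);                         // same bytes always yield same UUID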

@nastra nastra requested a review from rdblue April 26, 2023 07:13
@nastra nastra force-pushed the spark-uuid-read-write-support-3.4 branch from b4f2593 to 7fcc85b on April 26, 2023 07:15
@nastra nastra closed this Apr 26, 2023
@nastra nastra reopened this Apr 26, 2023
@rdblue rdblue merged commit fc3cd2e into apache:master May 1, 2023
@rdblue (Contributor) commented May 1, 2023

Thanks, @nastra!

Successfully merging this pull request may close: Spark: Cannot read or write UUID columns (#4581)