Arrow: Add support for TimeType / UUIDType #2739

nastra · 2021-06-25T08:53:35Z

This partly fixes #2486 and #2485. I did not include all types because the PR would otherwise become too large. Adding support for new types has been somewhat painful, so once this PR is merged I plan to refactor the code in the arrow project to reduce duplication and complexity before adding support for more types.

rymurr · 2021-06-28T08:47:03Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

@@ -169,6 +172,9 @@ public VectorHolder read(VectorHolder reuse, int numValsToRead) {
          case TIMESTAMP_MILLIS:
            vectorizedColumnIterator.nextBatchTimestampMillis(vec, typeWidth, nullabilityHolder);
            break;
+          case UUID:


How come only UUID was added to this switch?

TIME_MICROS and TIMESTAMP_MICROS are evaluated as LONG: https://github.com/nastra/iceberg/blob/21af7bcd73f450e997d1af085634567a734a16b9/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L240-L255

So both are handled in the LONG branch of that switch statement. Only UUID needs to be handled differently.
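
To make the reply above concrete, here is a minimal, self-contained sketch of the idea. It does not use the Iceberg or Arrow APIs: the enums `LogicalType` and `ReadType` and the methods `readTypeFor`, `typeWidth`, `toBytes` and `fromBytes` are hypothetical stand-ins, not the real code in `VectorizedArrowReader`. It also assumes that a UUID is read as a 16-byte fixed-width value in big-endian order.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class ReadTypeSketch {
  // Hypothetical stand-ins for the logical types and read types discussed above.
  enum LogicalType { TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIME_MICROS, UUID }

  enum ReadType { LONG, TIMESTAMP_MILLIS, UUID }

  // Microsecond-precision time and timestamp share the plain LONG path;
  // UUID gets its own read type.
  static ReadType readTypeFor(LogicalType type) {
    switch (type) {
      case TIMESTAMP_MICROS:
      case TIME_MICROS:
        return ReadType.LONG;
      case TIMESTAMP_MILLIS:
        return ReadType.TIMESTAMP_MILLIS;
      case UUID:
        return ReadType.UUID;
      default:
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
  }

  // Assumed widths: 8 bytes for 64-bit values, 16 bytes for a UUID.
  static int typeWidth(ReadType readType) {
    return readType == ReadType.UUID ? 16 : Long.BYTES;
  }

  // Encode a UUID as 16 bytes (ByteBuffer is big-endian by default).
  static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  // Decode 16 bytes back into a UUID.
  static UUID fromBytes(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    long mostSignificant = buf.getLong();
    long leastSignificant = buf.getLong();
    return new UUID(mostSignificant, leastSignificant);
  }

  public static void main(String[] args) {
    for (LogicalType type : LogicalType.values()) {
      ReadType readType = readTypeFor(type);
      System.out.println(type + " -> " + readType + " (" + typeWidth(readType) + " bytes)");
    }
    UUID uuid = UUID.randomUUID();
    System.out.println("round trip ok: " + uuid.equals(fromBytes(toBytes(uuid))));
  }
}
```

The point of the sketch is only that types sharing a physical representation can share a branch, while a type with a different width needs its own.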

rymurr · 2021-06-28T08:53:57Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

@@ -240,6 +247,12 @@ private void allocateFieldVector(boolean dictionaryEncodedVector) {
            this.readType = ReadType.LONG;
            this.typeWidth = (int) BigIntVector.TYPE_WIDTH;
            break;
+          case TIME_MICROS:


As discussed offline, do we want to add support for MILLIS?

It looks like Parquet's TimeWriter only writes micros (https://github.com/nastra/iceberg/blob/50f4ecca7711e69f63589fea828d26230fac8d59/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetWriter.java#L256), so I think we don't need to handle TIME_MILLIS.

I will look into supporting TIME_MILLIS as a follow-up tomorrow.

After some investigation, supporting TIME_MILLIS might be a bit more involved. I opened #2755.
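
For context on the two units, here is a small sketch of how a time of day maps to microseconds and how a millisecond value would be widened. The thread does not say why TIME_MILLIS is more involved; one relevant fact from the Parquet format is that millisecond-precision TIME annotates a 32-bit integer while microsecond-precision TIME annotates a 64-bit integer. The class and method names below (`TimeUnitsSketch`, `toMicrosOfDay`, `fromMicrosOfDay`, `millisToMicros`) are hypothetical and are not part of Iceberg or Parquet.

```java
import java.time.LocalTime;

final class TimeUnitsSketch {
  // Time of day as microseconds from midnight, held in a 64-bit long.
  static long toMicrosOfDay(LocalTime time) {
    return time.toNanoOfDay() / 1_000L;
  }

  // Inverse of toMicrosOfDay.
  static LocalTime fromMicrosOfDay(long microsOfDay) {
    return LocalTime.ofNanoOfDay(Math.multiplyExact(microsOfDay, 1_000L));
  }

  // A millisecond value fits in a 32-bit int; widen to long before scaling
  // so the multiplication cannot overflow.
  static long millisToMicros(int millisOfDay) {
    return millisOfDay * 1_000L;
  }

  public static void main(String[] args) {
    LocalTime time = LocalTime.of(1, 2, 3);
    long micros = toMicrosOfDay(time);
    System.out.println(time + " -> " + micros + " micros");
    System.out.println(micros + " micros -> " + fromMicrosOfDay(micros));
    System.out.println("1500 millis -> " + millisToMicros(1500) + " micros");
  }
}
```

A reader that only has a 64-bit path would need a step like `millisToMicros` for millisecond data, which is an assumption about the kind of extra work involved rather than a description of #2755.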

rymurr

lgtm, thanks @nastra

github-actions bot added the arrow label Jun 25, 2021

nastra force-pushed the support-timetype-uuidtype branch from 92a551b to 2eacfed Compare June 25, 2021 10:51

Add support for TimeType / UUIDType

21af7bc

nastra force-pushed the support-timetype-uuidtype branch from 2eacfed to 21af7bc Compare June 25, 2021 11:51

rymurr reviewed Jun 28, 2021

rymurr approved these changes Jun 28, 2021

rymurr merged commit aa65c06 into apache:master Jun 28, 2021

nastra deleted the support-timetype-uuidtype branch June 28, 2021 16:06

nastra mentioned this pull request Dec 21, 2022

Python: Parse UUID as binary in PyArrow #6468

Merged
