Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow: Add support for TimeType / UUIDType #2739

Merged
merged 1 commit into from
Jun 28, 2021

Conversation

nastra
Copy link
Contributor

@nastra nastra commented Jun 25, 2021

This is partly fixing #2486 and #2485. I didn't want to include all types as otherwise the PR would become too large. It's been a bit of a pain adding new type support. So I'm planning to refactor the code after this PR is merged in the arrow project in order to reduce code duplication and complexity before adding support for more types.

@nastra nastra force-pushed the support-timetype-uuidtype branch from 2eacfed to 21af7bc Compare June 25, 2021 11:51
@@ -169,6 +172,9 @@ public VectorHolder read(VectorHolder reuse, int numValsToRead) {
case TIMESTAMP_MILLIS:
vectorizedColumnIterator.nextBatchTimestampMillis(vec, typeWidth, nullabilityHolder);
break;
case UUID:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come only UUID was added to this switch?

Copy link
Contributor Author

@nastra nastra Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIME_MICROS and TIMESTAMP_MICROS are evaluated as LONG: https://github.com/nastra/iceberg/blob/21af7bcd73f450e997d1af085634567a734a16b9/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L240-L255

so both are basically handled in the LONG part of that switch statement. Only UUID needs to be handled differently

@@ -240,6 +247,12 @@ private void allocateFieldVector(boolean dictionaryEncodedVector) {
this.readType = ReadType.LONG;
this.typeWidth = (int) BigIntVector.TYPE_WIDTH;
break;
case TIME_MICROS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed offline do we want to add support for MILLIs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like Parquet's TimeWriter only writes micros in https://github.com/nastra/iceberg/blob/50f4ecca7711e69f63589fea828d26230fac8d59/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetWriter.java#L256, so I think the answer would be that we don't need to handle TIME_MILLIS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look into supporting TIME_MILLIS as a follow-up tomorrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after some investigation, supporting TIME_MILLIS might be a bit more involved. I opened #2755

Copy link
Contributor

@rymurr rymurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks @nastra

@rymurr rymurr merged commit aa65c06 into apache:master Jun 28, 2021
@nastra nastra deleted the support-timetype-uuidtype branch June 28, 2021 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants