DRILL-7948: Unable to query file with required fixed_len_byte_array decimal columns #2254

Merged
merged 2 commits into from
Jun 16, 2021

Conversation

vvysotskyi
Member

DRILL-7948: Unable to query file with required fixed_len_byte_array decimal columns

Description

  1. Simplified logic for selecting required column reader and fixed some absent cases
  2. Fixed DictionaryVarDecimalReader to handle FIXED_LEN_BYTE_ARRAY and BINARY types
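For context on item 2: Parquet stores a DECIMAL kept in a FIXED_LEN_BYTE_ARRAY or BINARY column as the big-endian two's-complement bytes of the unscaled value, and the scale comes from the column's DECIMAL(precision, scale) annotation. The sketch below shows that decoding in plain Java. It is only an illustration of the byte layout, not the code of DictionaryVarDecimalReader; the class name, method name, and the 13-byte width are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical helper, not part of Drill: decodes the raw bytes of a Parquet DECIMAL value.
class DecimalBytesSketch {

    // Treats the bytes as a big-endian two's-complement unscaled integer and
    // applies the scale taken from the column's DECIMAL(precision, scale) annotation.
    static BigDecimal decode(byte[] unscaledBytes, int scale) {
        return new BigDecimal(new BigInteger(unscaledBytes), scale);
    }

    public static void main(String[] args) {
        // Fixed-width form (width of 13 bytes is an assumption): 8000000 = 0x7A1200, left-padded with zeros.
        byte[] fixed = new byte[13];
        fixed[10] = 0x7A;
        fixed[11] = 0x12;
        fixed[12] = 0x00;

        // Variable-width (BINARY) form of the same unscaled value.
        byte[] binary = {0x7A, 0x12, 0x00};

        System.out.println(decode(fixed, 6));   // prints 8.000000
        System.out.println(decode(binary, 6));  // prints 8.000000

        // Negative values are sign-extended: all 0xFF bytes encode -1, which is -0.01 at scale 2.
        byte[] minusOne = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        System.out.println(decode(minusOne, 2)); // prints -0.01
    }
}
```

Both physical types decode through the same path here, which is why a reader that handles one of them can usually be extended to the other once the value length is taken from the right place.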

Documentation

NA

Testing

Added UT.

Member

@vdiravka vdiravka left a comment

The main fix is for BINARY/FIXED_LEN_BYTE_ARRAY columns, both with dictionary and non-dictionary ColumnChunkMetaData, right?
And the missing case was PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY with ConvertedType.INTERVAL, right? Should we cover it with a test case?

@vvysotskyi
Member Author

@vdiravka, sorry, but I don't know how to generate a Parquet file that has the correct INTERVAL type and uses dictionary encoding, so no test was added for that case. We do have tests for INTERVAL with non-dictionary encoding.
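For readers unfamiliar with the type under discussion: per the Parquet format specification, an INTERVAL value is a 12-byte FIXED_LEN_BYTE_ARRAY holding three little-endian unsigned 32-bit integers (months, days, milliseconds). A minimal sketch of decoding one such value follows. The class and method names are hypothetical, and this is not how Drill's readers are structured.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not part of Drill: decodes one Parquet INTERVAL value.
class IntervalBytesSketch {

    // Returns {months, days, millis}; each field is an unsigned 32-bit value widened to long.
    static long[] decode(byte[] value) {
        if (value.length != 12) {
            throw new IllegalArgumentException("INTERVAL must be 12 bytes, got " + value.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(value).order(ByteOrder.LITTLE_ENDIAN);
        long months = Integer.toUnsignedLong(buf.getInt());
        long days = Integer.toUnsignedLong(buf.getInt());
        long millis = Integer.toUnsignedLong(buf.getInt());
        return new long[] {months, days, millis};
    }

    public static void main(String[] args) {
        // 1 month, 2 days, 1000 ms (0x03E8), each stored least-significant byte first.
        byte[] raw = {1, 0, 0, 0, 2, 0, 0, 0, (byte) 0xE8, 0x03, 0, 0};
        System.out.println(Arrays.toString(decode(raw))); // prints [1, 2, 1000]
    }
}
```

With dictionary encoding, the same 12-byte values would sit in the dictionary page and the data pages would carry indexes into it, so the decoding step itself does not change.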

@vvysotskyi
Member Author

Yes, the main change is the fix for the FIXED_LEN_BYTE_ARRAY decimal type.

@vdiravka
Member

Let me try to generate it with Drill's ParquetSimpleTestFileGenerator.

@vvysotskyi
Member Author

@vdiravka, thanks for pointing me to ParquetSimpleTestFileGenerator. I have generated a Parquet file using it, but it looks like it writes almost all columns with non-dictionary encoding (except for _INT96_RAW in the last row group):

java -jar parquet-tools-1.12.0-SNAPSHOT_LocalMode.jar meta parquet/drill/parquet_test_file_simple
file:                              file:/private/tmp/parquet/drill/parquet_test_file_simple
creator:                           parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
extra:                             writer.model.name = example

file schema:                       ParquetLogicalDataTypes
--------------------------------------------------------------------------------
rowKey:                            REQUIRED INT32 R:0 D:0
_UTF8:                             REQUIRED BINARY L:STRING R:0 D:0
_Enum:                             REQUIRED BINARY L:ENUM R:0 D:0
_INT32_RAW:                        REQUIRED INT32 R:0 D:0
_INT_8:                            REQUIRED INT32 L:INTEGER(8,true) R:0 D:0
_INT_16:                           REQUIRED INT32 L:INTEGER(16,true) R:0 D:0
_INT_32:                           REQUIRED INT32 L:INTEGER(32,true) R:0 D:0
_UINT_8:                           REQUIRED INT32 L:INTEGER(8,false) R:0 D:0
_UINT_16:                          REQUIRED INT32 L:INTEGER(16,false) R:0 D:0
_UINT_32:                          REQUIRED INT32 L:INTEGER(32,false) R:0 D:0
_DECIMAL_decimal9:                 REQUIRED INT32 L:DECIMAL(9,2) R:0 D:0
_INT64_RAW:                        REQUIRED INT64 R:0 D:0
_INT_64:                           REQUIRED INT64 L:INTEGER(64,true) R:0 D:0
_UINT_64:                          REQUIRED INT64 L:INTEGER(64,false) R:0 D:0
_DECIMAL_decimal18:                REQUIRED INT64 L:DECIMAL(18,2) R:0 D:0
_DECIMAL_fixed_n:                  REQUIRED FIXED_LEN_BYTE_ARRAY L:DECIMAL(20,2) R:0 D:0
_DECIMAL_unlimited:                REQUIRED BINARY L:DECIMAL(30,2) R:0 D:0
_DATE_int32:                       REQUIRED INT32 L:DATE R:0 D:0
_TIME_MILLIS_int32:                REQUIRED INT32 L:TIME(MILLIS,true) R:0 D:0
_TIMESTAMP_MILLIS_int64:           REQUIRED INT64 L:TIMESTAMP(MILLIS,true) R:0 D:0
_TIMESTAMP_MICROS_int64:           REQUIRED INT64 L:TIMESTAMP(MICROS,true) R:0 D:0
_INTERVAL_fixed_len_byte_array_12: REQUIRED FIXED_LEN_BYTE_ARRAY L:INTERVAL R:0 D:0
_INT96_RAW:                        REQUIRED INT96 R:0 D:0

row group 1:                       RC:3 TS:1152 OFFSET:4
--------------------------------------------------------------------------------
rowKey:                             INT32 SNAPPY DO:0 FPO:4 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 1, max: 3, num_nulls: 0]
_UTF8:                              BINARY SNAPPY DO:0 FPO:41 SZ:62/71/1.15 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: UTF8 string1, max: UTF8 string3, num_nulls: 0]
_Enum:                              BINARY SNAPPY DO:0 FPO:103 SZ:59/65/1.10 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: MAX_VALUE, max: RANDOM_VALUE, num_nulls: 0]
_INT32_RAW:                         INT32 SNAPPY DO:0 FPO:162 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: -2147483648, max: 2147483647, num_nulls: 0]
_INT_8:                             INT32 SNAPPY DO:0 FPO:199 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: -128, max: 127, num_nulls: 0]
_INT_16:                            INT32 SNAPPY DO:0 FPO:236 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: -32768, max: 32767, num_nulls: 0]
_INT_32:                            INT32 SNAPPY DO:0 FPO:273 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: -2147483648, max: 2147483647, num_nulls: 0]
_UINT_8:                            INT32 SNAPPY DO:0 FPO:310 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 255, num_nulls: 0]
_UINT_16:                           INT32 SNAPPY DO:0 FPO:347 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 65535, num_nulls: 0]
_UINT_32:                           INT32 SNAPPY DO:0 FPO:384 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 4294967295, num_nulls: 0]
_DECIMAL_decimal9:                  INT32 SNAPPY DO:0 FPO:421 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: -0.01, max: 12345.67, num_nulls: 0]
_INT64_RAW:                         INT64 SNAPPY DO:0 FPO:458 SZ:49/47/0.96 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: -9223372036854775808, max: 9223372036854775807, num_nulls: 0]
_INT_64:                            INT64 SNAPPY DO:0 FPO:507 SZ:49/47/0.96 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: -9223372036854775808, max: 9223372036854775807, num_nulls: 0]
_UINT_64:                           INT64 SNAPPY DO:0 FPO:556 SZ:49/47/0.96 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 18446744073709551615, num_nulls: 0]
_DECIMAL_decimal18:                 INT64 SNAPPY DO:0 FPO:605 SZ:49/47/0.96 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: -0.01, max: 12345678901234.56, num_nulls: 0]
_DECIMAL_fixed_n:                   FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:654 SZ:46/82/1.78 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 0.00, max: 2808600455222908552998455437577489916754369068.00, num_nulls: 0]
_DECIMAL_unlimited:                 BINARY SNAPPY DO:0 FPO:700 SZ:55/126/2.29 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 0.00, max: 3395389607300375329868809150482838932772815902977278052454891160836652.00, num_nulls: 0]
_DATE_int32:                        INT32 SNAPPY DO:0 FPO:755 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 1969-12-31, max: 5350-02-17, num_nulls: 0]
_TIME_MILLIS_int32:                 INT32 SNAPPY DO:0 FPO:792 SZ:37/35/0.95 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 00:00:00.001+0000, max: 00:20:34.567+0000, num_nulls: 0]
_TIMESTAMP_MILLIS_int64:            INT64 SNAPPY DO:0 FPO:829 SZ:49/47/0.96 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 1970-01-01T00:00:00.000+0000, max: 2038-01-19T03:14:07.999+0000, num_nulls: 0]
_TIMESTAMP_MICROS_int64:            INT64 SNAPPY DO:0 FPO:878 SZ:49/47/0.96 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: 1970-01-01T00:00:00.000000+0000, max: +294247-01-10T04:00:54.775807+0000, num_nulls: 0]
_INTERVAL_fixed_len_byte_array_12:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:927 SZ:53/59/1.11 VC:3 ENC:PLAIN,BIT_PACKED ST:[num_nulls: 0, min/max not defined]
_INT96_RAW:                         INT96 SNAPPY DO:980 FPO:1029 SZ:78/82/1.05 VC:3 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]

Here is an example of the output for a file that uses dictionary encoding:

java -jar parquet-tools-1.12.0-SNAPSHOT_LocalMode.jar meta dict_dec.parquet
file:        file:/private/tmp/dict_dec.parquet
creator:     parquet-cpp version 1.5.1-SNAPSHOT

file schema: schema
--------------------------------------------------------------------------------
RecId:       OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
RegHrs:      OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(29,6) R:0 D:1

row group 1: RC:1 TS:232 OFFSET:4
--------------------------------------------------------------------------------
RecId:        INT64 SNAPPY DO:4 FPO:28 SZ:96/92/0.96 VC:1 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: 250, max: 250, num_nulls: 0]
RegHrs:       FIXED_LEN_BYTE_ARRAY SNAPPY DO:195 FPO:227 SZ:136/132/0.97 VC:1 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: 8.000000, max: 8.000000, num_nulls: 0]
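The difference between the two dumps above is in the ENC: field of each column chunk line. As a rough aid for scanning such output, the sketch below extracts that field and reports whether a dictionary encoding is listed. It assumes the line format shown in these dumps; the class and method names are hypothetical. Note that a listed dictionary encoding only means the chunk contains a dictionary, since a writer may still fall back to PLAIN for some pages.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for eyeballing `parquet-tools meta` output; it does not read Parquet files.
class EncodingFieldSketch {

    // Captures the comma-separated encoding list that follows "ENC:" up to the next space.
    private static final Pattern ENC = Pattern.compile("ENC:([A-Z0-9_,]+)");

    // True if the column chunk line lists PLAIN_DICTIONARY or RLE_DICTIONARY.
    static boolean listsDictionaryEncoding(String metaLine) {
        Matcher m = ENC.matcher(metaLine);
        if (!m.find()) {
            return false;
        }
        return Arrays.stream(m.group(1).split(","))
                .anyMatch(e -> e.equals("PLAIN_DICTIONARY") || e.equals("RLE_DICTIONARY"));
    }

    public static void main(String[] args) {
        String dict = "RegHrs: FIXED_LEN_BYTE_ARRAY SNAPPY DO:195 FPO:227 SZ:136/132/0.97 VC:1 "
                + "ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: 8.000000, max: 8.000000, num_nulls: 0]";
        String plain = "_INTERVAL_fixed_len_byte_array_12: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:927 "
                + "SZ:53/59/1.11 VC:3 ENC:PLAIN,BIT_PACKED ST:[num_nulls: 0, min/max not defined]";
        System.out.println(listsDictionaryEncoding(dict));  // prints true
        System.out.println(listsDictionaryEncoding(plain)); // prints false
    }
}
```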

@vvysotskyi
Member Author

@vdiravka, did you have a chance to take a look at generating a parquet file?

@vdiravka
Member

vdiravka commented Jun 14, 2021

To me it looks like some sort of bug in the Parquet library. Anyway, there seems to be a workaround: you can remove " optional int96 _INT96_RAW ; \n" from the schema, and then dictionary encoding is used for _INTERVAL_fixed_len_byte_array_12, which is the column you are interested in.

vitalii@vitalii-UX331UN:~/IdeaProjects/parquet-mr/parquet-cli$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main meta /tmp/parquet/drill/parquet_test_file_simple
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/vitalii/IdeaProjects/parquet-mr/parquet-cli/target/dependency/hadoop-auth-2.10.1.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

File path:  /tmp/parquet/drill/parquet_test_file_simple
Created by: parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
Properties:
  writer.model.name: example
Schema:
message ParquetLogicalDataTypes {
  required int32 rowKey;
  required binary _UTF8 (STRING);
  required binary _Enum (ENUM);
  required fixed_len_byte_array(16) _UUID (UUID);
  required int32 _INT32_RAW;
  required int32 _INT_8 (INTEGER(8,true));
  required int32 _INT_16 (INTEGER(16,true));
  required int32 _INT_32 (INTEGER(32,true));
  required int32 _UINT_8 (INTEGER(8,false));
  required int32 _UINT_16 (INTEGER(16,false));
  required int32 _UINT_32 (INTEGER(32,false));
  required int32 _DECIMAL_decimal9 (DECIMAL(9,2));
  required int64 _INT64_RAW;
  required int64 _INT_64 (INTEGER(64,true));
  required int64 _UINT_64 (INTEGER(64,false));
  required int64 _DECIMAL_decimal18 (DECIMAL(18,2));
  required fixed_len_byte_array(20) _DECIMAL_fixed_n (DECIMAL(20,2));
  required binary _DECIMAL_unlimited (DECIMAL(30,2));
  required int32 _DATE_int32 (DATE);
  required int32 _TIME_MILLIS_int32 (TIME(MILLIS,true));
  required int64 _TIMESTAMP_MILLIS_int64 (TIMESTAMP(MILLIS,true));
  required int64 _TIMESTAMP_MICROS_int64 (TIMESTAMP(MICROS,true));
  required fixed_len_byte_array(12) _INTERVAL_fixed_len_byte_array_12 (INTERVAL);
}


Row group 0:  count: 3  435.00 B records  start: 4  total: 1.274 kB
--------------------------------------------------------------------------------
                                   type      encodings count     avg size   nulls   min / max
rowKey                             INT32     S   D     3         11.00 B    0       "1" / "3"
_UTF8                              BINARY    S   D     3         22.33 B    0       "UTF8 string1" / "UTF8 string3"
_Enum                              BINARY    S   D     3         26.33 B    0       "MAX_VALUE" / "RANDOM_VALUE"
_UUID                              FIXED[16] S _ R     3         20.67 B  0       "01010101-0101-0101-0101-0..." / "01010101-0101-0101-0101-0..."
_INT32_RAW                         INT32     S   D     3         16.33 B    0       "-2147483648" / "2147483647"
_INT_8                             INT32     S   D     3         13.67 B    0       "-128" / "127"
_INT_16                            INT32     S   D     3         15.00 B    0       "-32768" / "32767"
_INT_32                            INT32     S   D     3         16.33 B    0       "-2147483648" / "2147483647"
_UINT_8                            INT32     S   D     3         13.67 B    0       "0" / "255"
_UINT_16                           INT32     S   D     3         14.67 B    0       "0" / "65535"
_UINT_32                           INT32     S   D     3         17.33 B    0       "0" / "4294967295"
_DECIMAL_decimal9                  INT32     S   D     3         17.33 B    0       "-0.01" / "12345.67"
_INT64_RAW                         INT64     S   D     3         21.00 B    0       "-9223372036854775808" / "9223372036854775807"
_INT_64                            INT64     S   D     3         21.00 B    0       "-9223372036854775808" / "9223372036854775807"
_UINT_64                           INT64     S   D     3         21.33 B    0       "0" / "18446744073709551615"
_DECIMAL_decimal18                 INT64     S   D     3         21.33 B    0       "-0.01" / "12345678901234.56"
_DECIMAL_fixed_n                   FIXED[20] S _ R     3         23.00 B  0       "0.00" / "2808600455222908552998455..."
_DECIMAL_unlimited                 BINARY    S   D     3         20.33 B    0       "0.00" / "3395389607300375329868809..."
_DATE_int32                        INT32     S   D     3         17.33 B    0       "1969-12-31" / "5350-02-17"
_TIME_MILLIS_int32                 INT32     S   D     3         17.33 B    0       "00:00:00.001+0000" / "00:20:34.567+0000"
_TIMESTAMP_MILLIS_int64            INT64     S   D     3         20.33 B    0       "1970-01-01T00:00:00.000+0000" / "2038-01-19T03:14:07.999+0000"
_TIMESTAMP_MICROS_int64            INT64     S   D     3         22.00 B    0       "1970-01-01T00:00:00.00000..." / "+294247-01-10T04:00:54.77..."
_INTERVAL_fixed_len_byte_array_12  FIXED[12] S _ R     3         25.33 B  0  

where R means RLE_DICTIONARY or PLAIN_DICTIONARY.

Initially, I got the following metadata for this file:

vitalii@vitalii-UX331UN:~/IdeaProjects/parquet-mr/parquet-cli$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main meta /tmp/parquet/drill/parquet_test_file_simple
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/vitalii/IdeaProjects/parquet-mr/parquet-cli/target/dependency/hadoop-auth-2.10.1.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

File path:  /tmp/parquet/drill/parquet_test_file_simple
Created by: parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
Properties:
  writer.model.name: example
Schema:
message ParquetLogicalDataTypes {
  required int32 rowKey;
  required binary _UTF8 (STRING);
  required binary _Enum (ENUM);
  required fixed_len_byte_array(16) _UUID (UUID);
  required int32 _INT32_RAW;
  required int32 _INT_8 (INTEGER(8,true));
  required int32 _INT_16 (INTEGER(16,true));
  required int32 _INT_32 (INTEGER(32,true));
  required int32 _UINT_8 (INTEGER(8,false));
  required int32 _UINT_16 (INTEGER(16,false));
  required int32 _UINT_32 (INTEGER(32,false));
  required int32 _DECIMAL_decimal9 (DECIMAL(9,2));
  required int64 _INT64_RAW;
  required int64 _INT_64 (INTEGER(64,true));
  required int64 _UINT_64 (INTEGER(64,false));
  required int64 _DECIMAL_decimal18 (DECIMAL(18,2));
  required fixed_len_byte_array(20) _DECIMAL_fixed_n (DECIMAL(20,2));
  required binary _DECIMAL_unlimited (DECIMAL(30,2));
  required int32 _DATE_int32 (DATE);
  required int32 _TIME_MILLIS_int32 (TIME(MILLIS,true));
  required int64 _TIMESTAMP_MILLIS_int64 (TIMESTAMP(MILLIS,true));
  required int64 _TIMESTAMP_MICROS_int64 (TIMESTAMP(MICROS,true));
  required fixed_len_byte_array(12) _INTERVAL_fixed_len_byte_array_12 (INTERVAL);
  required int96 _INT96_RAW;
}


Row group 0:  count: 3  361.00 B records  start: 4  total: 1.058 kB
--------------------------------------------------------------------------------
                                   type      encodings count     avg size   nulls   min / max
rowKey                             INT32     S   _     3         12.33 B    0       "1" / "3"
_UTF8                              BINARY    S   _     3         20.67 B    0       "UTF8 string1" / "UTF8 string3"
_Enum                              BINARY    S   _     3         19.67 B    0       "MAX_VALUE" / "RANDOM_VALUE"
_UUID                              FIXED[16] S   _     3         9.67 B   0       "01010101-0101-0101-0101-0..." / "01010101-0101-0101-0101-0..."
_INT32_RAW                         INT32     S   _     3         12.33 B    0       "-2147483648" / "2147483647"
_INT_8                             INT32     S   _     3         12.33 B    0       "-128" / "127"
_INT_16                            INT32     S   _     3         12.33 B    0       "-32768" / "32767"
_INT_32                            INT32     S   _     3         12.33 B    0       "-2147483648" / "2147483647"
_UINT_8                            INT32     S   _     3         12.33 B    0       "0" / "255"
_UINT_16                           INT32     S   _     3         12.33 B    0       "0" / "65535"
_UINT_32                           INT32     S   _     3         12.33 B    0       "0" / "4294967295"
_DECIMAL_decimal9                  INT32     S   _     3         12.33 B    0       "-0.01" / "12345.67"
_INT64_RAW                         INT64     S   _     3         16.33 B    0       "-9223372036854775808" / "9223372036854775807"
_INT_64                            INT64     S   _     3         16.33 B    0       "-9223372036854775808" / "9223372036854775807"
_UINT_64                           INT64     S   _     3         16.33 B    0       "0" / "18446744073709551615"
_DECIMAL_decimal18                 INT64     S   _     3         16.33 B    0       "-0.01" / "12345678901234.56"
_DECIMAL_fixed_n                   FIXED[20] S   _     3         15.33 B  0       "0.00" / "2808600455222908552998455..."
_DECIMAL_unlimited                 BINARY    S   _     3         18.33 B    0       "0.00" / "3395389607300375329868809..."
_DATE_int32                        INT32     S   _     3         12.33 B    0       "1969-12-31" / "5350-02-17"
_TIME_MILLIS_int32                 INT32     S   _     3         12.33 B    0       "00:00:00.001+0000" / "00:20:34.567+0000"
_TIMESTAMP_MILLIS_int64            INT64     S   _     3         16.33 B    0       "1970-01-01T00:00:00.000+0000" / "2038-01-19T03:14:07.999+0000"
_TIMESTAMP_MICROS_int64            INT64     S   _     3         16.33 B    0       "1970-01-01T00:00:00.00000..." / "+294247-01-10T04:00:54.77..."
_INTERVAL_fixed_len_byte_array_12  FIXED[12] S   _     3         17.67 B  0       
_INT96_RAW                         INT96     S _ R     3         26.00 B    0  

DRILL-7948: Enable testNullableIntervalDictionaryEncoding test
@vvysotskyi
Member Author

@vdiravka, I have added the requested unit tests and fixed several issues. Please take a look.

Member

@vdiravka vdiravka left a comment

Nice work! LGTM +1

@vvysotskyi vvysotskyi merged commit f056ea7 into apache:master Jun 16, 2021
Leon-WTF pushed a commit to Leon-WTF/drill that referenced this pull request Jul 6, 2021