Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor parquet thrift reader #14097

Merged
merged 31 commits into from
Sep 20, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Sep 12, 2023

Description

Refactors the current CompactProtocolReader used to parse parquet file metadata. The main goal of the refactor is to allow easier use of std::optional fields in the thrift structs to prevent situations as in #14024 where an optional field is an empty string. The writer cannot distinguish between present-but-empty and not-present, so chooses the latter when writing the field. This PR adds a ParquetFieldOptional functor that can wrap the other field functors, obviating the need to write a new optional functor for each type.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 12, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 12, 2023
@etseidl
Copy link
Contributor Author

etseidl commented Sep 12, 2023

I've also moved the definitions of the field functors to compact_protocol_reader.cpp as they are not required to be in the header for CPR to function properly. Having them defined in the header also leads to unnecessary compilations when fields or structs are modified.

Also added an implementation PoC for unions that uses optional fields rather than an isset struct.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick first pass
thanks for clearing out the header!

cpp/src/io/parquet/parquet_common.hpp Show resolved Hide resolved
Comment on lines 121 to 130
/** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {};

/**
* Union to specify the order used for the min_value and max_value fields for a column.
*/
struct ColumnOrder {
std::optional<TypeDefinedOrder> TYPE_ORDER;
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this equivalent to bool is_type_defined_order?

Copy link
Contributor Author

@etseidl etseidl Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parquet spec says unions are like enhanced enums. Many of the fields are just placeholders to signify which field is valid, but then the fields can have state, as with LogicalType. Right now, LogicalType is a struct of structs, with a struct field that's a struct of bools to indicate which field is valid. I'm proposing replacing that setup with a struct of optional structs, where only one field has a value.

ColumnOrder is a strange case because there's only one defined value at present, but that may change in a bit.

Copy link
Contributor Author

@etseidl etseidl Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this, since this is supposed to be thought of as an enum, we could do something like:

struct TypeDefinedOrder {};

struct SomethingElse {
  int some_state;
};

struct ColumnOrder {
  enum class Type {
    UNDEFINED = 0,
    TYPE_ORDER = 1,
    SOMETHING_ELSE = 2
  };
  std::optional<TypeDefinedOrder> type_order;
  std::optional<SomethingElse> something_else;

  Type operator() const {
    if (type_order.has_value()) { return Type::TYPE_ORDER; }
    if (something_else.has_value()) { return Type::SOMETHING_ELSE; }
    return Type::UNDEFINED;
  }

  // or
  Type type; // assigned during read
  Type operator() const { return type; }
};

Then the ColumnOrder could be used in a switch statement, and the case for SOMETHING_ELSE could access column_order.something_else.value().some_state. Empty structs could be replaced with a bool I suppose, or not even be defined if we save the enumerator value (the logic for empty struct is already separate from the non-empty case).

I would probably do it this way if I was writing a generic thrift parser, but for just parsing the parquet types, I think this may be overkill 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wound up making the above change (with mods to make it compile). I think treating these unions as enums is the way to go. And having a bunch of empty optional structs is unnecessary and takes up space.

cpp/src/io/parquet/parquet.hpp Outdated Show resolved Hide resolved
@vuule vuule added feature request New feature or request non-breaking Non-breaking change cuIO cuIO issue labels Sep 13, 2023
@etseidl etseidl marked this pull request as ready for review September 13, 2023 06:00
@etseidl etseidl requested a review from a team as a code owner September 13, 2023 06:00
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

Depending on how @vuule sees it, shall we take this chance to rename involved classes or functions with snake_case to align with libcudf guidance?

@vuule
Copy link
Contributor

vuule commented Sep 18, 2023

Looks great.

Depending on how @vuule sees it, shall we take this chance to rename involved classes or functions with snake_case to align with libcudf guidance?

I like to keep such changes separate.
@etseidl do we have naming directly from PQ specs in the code here? For example, we have a bunch of types in ORC that keep the ORC specs names.

@etseidl
Copy link
Contributor Author

etseidl commented Sep 18, 2023

do we have naming directly from PQ specs in the code here?

For the most part, yes. So from the parquet thrift IDL:

/** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {}

union ColumnOrder {
  1: TypeDefinedOrder TYPE_ORDER;
}

and I'm adding

struct ColumnOrder {
  enum Type { UNDEFINED, TYPE_ORDER };
  Type type;

  operator Type() const { return type; }
};

There are a few places where the names don't quite match, so ColumnMetaData from the thrift becomes ColumnChunkMetaData in cudf. The parquet spec uses CamelCase for type names, and snake_case for field names.

But for function names we're certainly free to switch to snake_case as @PointKernel suggests.

@etseidl
Copy link
Contributor Author

etseidl commented Sep 18, 2023

So, assuming we're keeping the names from parquet as they are, shall I rename CompactProtocolReader, its functions, and all the field functors in compact_protocol_reader.cpp? Or just the functors that don't change the interface, and save the more drastic changes for another day?

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few nitpicks, looks good otherwise

cpp/src/io/parquet/compact_protocol_reader.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/compact_protocol_reader.cpp Outdated Show resolved Hide resolved
@vuule
Copy link
Contributor

vuule commented Sep 19, 2023

/ok to test

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for the refactoring.

shall I rename CompactProtocolReader, its functions, and all the field functors in compact_protocol_reader.cpp?

We could do it in a separate PR and clean up both thrift reader and writer.

@PointKernel
Copy link
Member

/merge

@rapids-bot rapids-bot bot merged commit 40d4cc5 into rapidsai:branch-23.10 Sep 20, 2023
57 of 58 checks passed
@etseidl etseidl deleted the refactor_parquet_thrift branch September 20, 2023 18:50
@etseidl etseidl mentioned this pull request Oct 10, 2023
3 tasks
rapids-bot bot pushed a commit that referenced this pull request Oct 20, 2023
Continuation of #14097, this PR refactors the LogicalType struct to use the new way of treating unions defined in the parquet thrift (more enum like than struct like).

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14264
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants