Add support for struct type in ORC writer #9025

Merged on Sep 22, 2021 (82 commits)
Changes from 69 commits
Commits
b28b2b9
enable struct columns on the python layer; enable stream creation for…
vuule Jul 28, 2021
77e07eb
move input metadata types to types.hpp
vuule Jul 28, 2021
d983db4
replace table_metadata and table_metadata_with_nullability; update tests
vuule Jul 29, 2021
524f7db
simplify table creation in orc tests
vuule Jul 29, 2021
295312a
metadata
vuule Jul 29, 2021
f6dba5b
chunked metadata test fix; nullable getter fix
vuule Jul 29, 2021
6940680
add list column to a metadata test
vuule Jul 29, 2021
08bf157
Revert "add list column to a metadata test"
vuule Jul 29, 2021
3abf86d
refactor regular and chunked orc writer options
vuule Jul 30, 2021
c06fcbc
add the metadata + list test again
vuule Jul 30, 2021
eec7941
generate default column names
vuule Jul 30, 2021
7c74d95
subtype footer
vuule Aug 2, 2021
03bd98f
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Aug 2, 2021
94d9892
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Aug 3, 2021
6e83117
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Aug 10, 2021
c90d04a
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Aug 12, 2021
0fa6aa7
don't use valid_buf as null_mask; remove a redundant syncthreads
vuule Aug 16, 2021
f9cec92
typo fix
vuule Aug 17, 2021
8e333c9
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Aug 17, 2021
4a65fad
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Aug 18, 2021
184c686
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Aug 20, 2021
444d4cf
pushdown null masks pt 1
vuule Aug 20, 2021
df2e916
pushdown null masks pt2
vuule Aug 20, 2021
6e76ccc
pushdown mask different approach
vuule Aug 21, 2021
7682899
pushdown for struct - host side complete
vuule Aug 24, 2021
11eef31
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Aug 24, 2021
9780279
null mask row group alignment - lists only
vuule Aug 26, 2021
4b7627f
move alignment to separate function
vuule Aug 26, 2021
f965f3c
per rg valid count stub
vuule Aug 27, 2021
b7c86a1
set pd masks in d_column views
vuule Aug 27, 2021
fd35ba4
complete per rg valid counting
vuule Aug 27, 2021
1728e6c
basic logic bit borrowing
vuule Aug 28, 2021
69563f6
rowgroup alignment complete
vuule Aug 30, 2021
ba6b7c4
small prep clean up
vuule Aug 31, 2021
5826eb9
further clean up, prep for null mask compacting
vuule Sep 1, 2021
4671003
encode_nested_null_mask placeholder
vuule Sep 1, 2021
ef8d850
prep clean up in encode_nested_null_mask
vuule Sep 2, 2021
ee95438
more clarity changes
vuule Sep 2, 2021
958bc01
pd bits scan
vuule Sep 4, 2021
7807fe3
compact null masks - complete
vuule Sep 7, 2021
daca245
fix a few logic errors; at least one more to go
vuule Sep 8, 2021
24c7fa4
fix rg alignment
vuule Sep 8, 2021
4112554
OOB fixes
vuule Sep 8, 2021
94d90a3
use pushdown masks in val encode
vuule Sep 8, 2021
8465421
account for pushdown masks in decimal column size
vuule Sep 9, 2021
6419fc6
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Sep 9, 2021
63c6e39
fix cython merge error; YOLO
vuule Sep 9, 2021
f23874f
add missing sync
vuule Sep 10, 2021
240ae27
fix null access when column has a pushdown mask but no null mask
vuule Sep 10, 2021
1f11e50
fix value count kernel
vuule Sep 11, 2021
77275f6
reset previously_borrowed when a rowgroup happens to be aligned
vuule Sep 11, 2021
38f6ab0
add py test
vuule Sep 11, 2021
f31172b
style fix
vuule Sep 11, 2021
0c9b9ad
rename test
vuule Sep 11, 2021
b4bc467
pre-commit style fix
vuule Sep 11, 2021
98d581d
python style fix and such
vuule Sep 12, 2021
71c5e82
include column offset in null mask access
vuule Sep 12, 2021
4a04bbd
init orc_column_view members >: (
vuule Sep 12, 2021
e3fdc76
pushdown list null masks
vuule Sep 13, 2021
db5fc6b
reduce_pushdown_masks docs
vuule Sep 14, 2021
06db73e
move metadata comp util to test utils (un-duplicate code)
vuule Sep 14, 2021
a7220e6
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Sep 14, 2021
0d4dcd9
update yaml
vuule Sep 14, 2021
a433079
newline at EOF
vuule Sep 14, 2021
cd894cd
include structs in more cpp tests
vuule Sep 14, 2021
c51cf4f
kernel clean up
vuule Sep 14, 2021
a6be5b5
writer clean up pt1
vuule Sep 14, 2021
1249913
writer clean up pt2
vuule Sep 15, 2021
01d3033
Fix Java build after ORC struct write change
jlowe Sep 14, 2021
44d7538
precision; null mask offset
vuule Sep 15, 2021
db2a8cc
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Sep 15, 2021
f05a503
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Sep 15, 2021
b9ef40a
rename bit_is_set_or
vuule Sep 15, 2021
cacef12
have orc_column_device_view inherit from column_device_view
vuule Sep 16, 2021
b9214c4
sliced table support
vuule Sep 17, 2021
3623391
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Sep 17, 2021
be18ba4
delete submodules (?)
vuule Sep 17, 2021
3ddd197
enable multi-rowgroup borrow even without pushdown masks
vuule Sep 18, 2021
181f897
review changes mostly
vuule Sep 18, 2021
90bdbb5
experimental
vuule Sep 18, 2021
f681fdb
list_struct_buff as module fixture
vuule Sep 18, 2021
8c20fb9
Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …
vuule Sep 18, 2021
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -238,6 +238,7 @@ test:
- test -f $PREFIX/include/cudf_test/cudf_gtest.hpp
- test -f $PREFIX/include/cudf_test/cxxopts.hpp
- test -f $PREFIX/include/cudf_test/file_utilities.hpp
+ - test -f $PREFIX/include/cudf_test/io_metadata_utilities.hpp
- test -f $PREFIX/include/cudf_test/iterator_utilities.hpp
- test -f $PREFIX/include/cudf_test/table_utilities.hpp
- test -f $PREFIX/include/cudf_test/timestamp_utilities.cuh
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -563,6 +563,7 @@ add_library(cudftestutil STATIC
tests/utilities/base_fixture.cpp
tests/utilities/column_utilities.cu
tests/utilities/table_utilities.cu
+ tests/io/metadata_utilities.cpp
tests/strings/utilities.cu)

set_target_properties(cudftestutil
2 changes: 1 addition & 1 deletion cpp/include/cudf/column/column_device_view.cuh
@@ -1175,7 +1175,7 @@ __device__ inline bitmask_type get_mask_offset_word(bitmask_type const* __restri
size_type source_word_index = destination_word_index + word_index(source_begin_bit);
bitmask_type curr_word = source[source_word_index];
bitmask_type next_word = 0;
- if (word_index(source_end_bit) >
+ if (word_index(source_end_bit - 1) >
word_index(source_begin_bit +
destination_word_index * detail::size_in_bits<bitmask_type>())) {
next_word = source[source_word_index + 1];
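
The one-character change in this hunk matters when the source bit range ends exactly on a word boundary. A minimal host-side sketch of the idea follows, assuming `source_end_bit` is an exclusive bound (which the `- 1` suggests) and that mask words are 32 bits wide. The helpers `needs_next_word_old` and `needs_next_word_fixed` are hypothetical names for illustration, not functions from the diff, and the sketch ignores the `destination_word_index` term.

```cpp
#include <cassert>
#include <cstdint>

using bitmask_type = std::uint32_t;  // assumed 32-bit mask words
constexpr int bits_per_word = sizeof(bitmask_type) * 8;

// Index of the mask word that contains the given bit.
constexpr int word_index(int bit) { return bit / bits_per_word; }

// Old check: compares against the word holding the exclusive end bit.
constexpr bool needs_next_word_old(int begin_bit, int end_bit)
{
  return word_index(end_bit) > word_index(begin_bit);
}

// Fixed check: compares against the word holding the last bit inside the range.
constexpr bool needs_next_word_fixed(int begin_bit, int end_bit)
{
  return word_index(end_bit - 1) > word_index(begin_bit);
}

// The range [0, 32) fits entirely in word 0, yet the old check asks for word 1,
// which may lie past the end of the mask allocation.
static_assert(needs_next_word_old(0, 32), "old check reads one word too far");
static_assert(!needs_next_word_fixed(0, 32), "fixed check stays within word 0");

// The range [8, 40) really does spill into word 1, so both checks agree.
static_assert(needs_next_word_old(8, 40) && needs_next_word_fixed(8, 40), "both spill");
```

The two checks only disagree when the end bit is a multiple of the word size, which is why the bug would surface as an occasional out-of-bounds read rather than a consistent failure.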
16 changes: 8 additions & 8 deletions cpp/include/cudf/io/orc.hpp
@@ -389,7 +389,7 @@ class orc_writer_options {
// Set of columns to output
table_view _table;
// Optional associated metadata
- const table_metadata* _metadata = nullptr;
+ const table_input_metadata* _metadata = nullptr;

friend orc_writer_options_builder;

@@ -445,7 +445,7 @@ class orc_writer_options {
/**
* @brief Returns associated metadata.
*/
- table_metadata const* get_metadata() const { return _metadata; }
+ table_input_metadata const* get_metadata() const { return _metadata; }

// Setters

@@ -475,7 +475,7 @@ class orc_writer_options {
*
* @param meta Associated metadata.
*/
- void set_metadata(table_metadata* meta) { _metadata = meta; }
+ void set_metadata(table_input_metadata const* meta) { _metadata = meta; }
};

class orc_writer_options_builder {
@@ -541,7 +541,7 @@ class orc_writer_options_builder {
* @param meta Associated metadata.
* @return this for chaining.
*/
- orc_writer_options_builder& metadata(table_metadata* meta)
+ orc_writer_options_builder& metadata(table_input_metadata const* meta)
{
options._metadata = meta;
return *this;
@@ -592,7 +592,7 @@ class chunked_orc_writer_options {
// Enable writing column statistics
bool _enable_statistics = true;
// Optional associated metadata
- const table_metadata_with_nullability* _metadata = nullptr;
+ const table_input_metadata* _metadata = nullptr;

friend chunked_orc_writer_options_builder;

@@ -638,7 +638,7 @@ class chunked_orc_writer_options {
/**
* @brief Returns associated metadata.
*/
- table_metadata_with_nullability const* get_metadata() const { return _metadata; }
+ table_input_metadata const* get_metadata() const { return _metadata; }

// Setters

@@ -661,7 +661,7 @@ class chunked_orc_writer_options {
*
* @param meta Associated metadata.
*/
- void metadata(table_metadata_with_nullability* meta) { _metadata = meta; }
+ void metadata(table_input_metadata const* meta) { _metadata = meta; }
};

class chunked_orc_writer_options_builder {
@@ -712,7 +712,7 @@ class chunked_orc_writer_options_builder {
* @param meta Associated metadata.
* @return this for chaining.
*/
- chunked_orc_writer_options_builder& metadata(table_metadata_with_nullability* meta)
+ chunked_orc_writer_options_builder& metadata(table_input_metadata const* meta)
{
options._metadata = meta;
return *this;
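
With these hunks, the regular and the chunked writer options both accept the same metadata type through a pointer-to-const. The pattern can be sketched roughly as below, using hypothetical stand-in types (every name suffixed `_sketch`, plus the `make_options` helper, is illustrative and not part of the cudf API). The sketch shows that the builder only stores the pointer, so the metadata object has to outlive the options built from it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for column_in_metadata / table_input_metadata.
struct column_in_metadata_sketch {
  std::string name;
};
struct table_input_metadata_sketch {
  std::vector<column_in_metadata_sketch> column_metadata;
};

// Stand-in for the options class: holds a non-owning pointer to const metadata.
class writer_options_sketch {
  table_input_metadata_sketch const* _metadata = nullptr;
  friend class writer_options_builder_sketch;

 public:
  table_input_metadata_sketch const* get_metadata() const { return _metadata; }
  void set_metadata(table_input_metadata_sketch const* meta) { _metadata = meta; }
};

// Stand-in for the builder: each setter returns *this so calls can be chained.
class writer_options_builder_sketch {
  writer_options_sketch options;

 public:
  writer_options_builder_sketch& metadata(table_input_metadata_sketch const* meta)
  {
    options._metadata = meta;
    return *this;
  }
  writer_options_sketch build() const { return options; }
};

// Taking the metadata by const reference works because the setter accepts `const*`;
// with a pointer-to-non-const parameter this call would not compile.
inline writer_options_sketch make_options(table_input_metadata_sketch const& meta)
{
  return writer_options_builder_sketch{}.metadata(&meta).build();
}
```

Accepting `const*` lets callers share one read-only metadata object between several writers, at the cost of leaving lifetime management to the caller.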
169 changes: 0 additions & 169 deletions cpp/include/cudf/io/parquet.hpp
@@ -24,8 +24,6 @@

#include <rmm/mr/device/per_device_resource.hpp>

- #include <thrust/optional.h>
-
#include <iostream>
#include <memory>
#include <string>
@@ -375,173 +373,6 @@ table_with_metadata read_parquet(
* @{
* @file
*/
- class table_input_metadata;
-
- class column_in_metadata {
- friend table_input_metadata;
- std::string _name = "";
- thrust::optional<bool> _nullable;
- // TODO: This isn't implemented yet
- bool _list_column_is_map = false;
- bool _use_int96_timestamp = false;
- // bool _output_as_binary = false;
- thrust::optional<uint8_t> _decimal_precision;
- std::vector<column_in_metadata> children;
-
- public:
- /**
- * @brief Get the children of this column metadata
- *
- * @return this for chaining
- */
- column_in_metadata& add_child(column_in_metadata const& child)
- {
- children.push_back(child);
- return *this;
- }
-
- /**
- * @brief Set the name of this column
- *
- * @return this for chaining
- */
- column_in_metadata& set_name(std::string const& name)
- {
- _name = name;
- return *this;
- }
-
- /**
- * @brief Set the nullability of this column
- *
- * Only valid in case of chunked writes. In single writes, this option is ignored.
- *
- * @return column_in_metadata&
- */
- column_in_metadata& set_nullability(bool nullable)
- {
- _nullable = nullable;
- return *this;
- }
-
- /**
- * @brief Specify that this list column should be encoded as a map in the written parquet file
- *
- * The column must have the structure list<struct<key, value>>. This option is invalid otherwise
- *
- * @return this for chaining
- */
- column_in_metadata& set_list_column_as_map()
- {
- _list_column_is_map = true;
- return *this;
- }
-
- /**
- * @brief Specifies whether this timestamp column should be encoded using the deprecated int96
- * physical type. Only valid for the following column types:
- * timestamp_s, timestamp_ms, timestamp_us, timestamp_ns
- *
- * @param req True = use int96 physical type. False = use int64 physical type
- * @return this for chaining
- */
- column_in_metadata& set_int96_timestamps(bool req)
- {
- _use_int96_timestamp = req;
- return *this;
- }
-
- /**
- * @brief Set the decimal precision of this column. Only valid if this column is a decimal
- * (fixed-point) type
- *
- * @param precision The integer precision to set for this decimal column
- * @return this for chaining
- */
- column_in_metadata& set_decimal_precision(uint8_t precision)
- {
- _decimal_precision = precision;
- return *this;
- }
-
- /**
- * @brief Get reference to a child of this column
- *
- * @param i Index of the child to get
- * @return this for chaining
- */
- column_in_metadata& child(size_type i) { return children[i]; }
-
- /**
- * @brief Get const reference to a child of this column
- *
- * @param i Index of the child to get
- * @return this for chaining
- */
- column_in_metadata const& child(size_type i) const { return children[i]; }
-
- /**
- * @brief Get the name of this column
- */
- std::string get_name() const { return _name; }
-
- /**
- * @brief Get whether nullability has been explicitly set for this column.
- */
- bool is_nullability_defined() const { return _nullable.has_value(); }
-
- /**
- * @brief Gets the explicitly set nullability for this column.
- * @throws If nullability is not explicitly defined for this column.
- * Check using `is_nullability_defined()` first.
- */
- bool nullable() const { return _nullable.value(); }
-
- /**
- * @brief If this is the metadata of a list column, returns whether it is to be encoded as a map.
- */
- bool is_map() const { return _list_column_is_map; }
-
- /**
- * @brief Get whether to encode this timestamp column using deprecated int96 physical type
- */
- bool is_enabled_int96_timestamps() const { return _use_int96_timestamp; }
-
- /**
- * @brief Get whether precision has been set for this decimal column
- */
- bool is_decimal_precision_set() const { return _decimal_precision.has_value(); }
-
- /**
- * @brief Get the decimal precision that was set for this column.
- * @throws If decimal precision was not set for this column.
- * Check using `is_decimal_precision_set()` first.
- */
- uint8_t get_decimal_precision() const { return _decimal_precision.value(); }
-
- /**
- * @brief Get the number of children of this column
- */
- size_type num_children() const { return children.size(); }
- };
-
- class table_input_metadata {
- public:
- table_input_metadata() = default; // Required by cython
-
- /**
- * @brief Construct a new table_input_metadata from a table_view.
- *
- * The constructed table_input_metadata has the same structure as the passed table_view
- *
- * @param table The table_view to construct metadata for
- * @param user_data Optional Additional metadata to encode, as key-value pairs
- */
- table_input_metadata(table_view const& table, std::map<std::string, std::string> user_data = {});
-
- std::vector<column_in_metadata> column_metadata;
- std::map<std::string, std::string> user_data; //!< Format-dependent metadata as key-values pairs
- };

/**
* @brief Class to build `parquet_writer_options`.
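
The metadata classes removed from this header (commit 77e07eb describes moving the input metadata types to types.hpp) describe nested columns through chained setters and optional fields. A reduced sketch of how such metadata could be assembled for a struct column is given below. It is an illustration only: `column_meta_sketch` and `make_struct_meta` are hypothetical names, only a subset of the setters is reproduced, and `std::optional` is swapped in for `thrust::optional` so the example compiles without Thrust (C++17 assumed).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Reduced, hypothetical stand-in for column_in_metadata.
class column_meta_sketch {
  std::string _name;
  std::optional<bool> _nullable;                   // unset until the caller decides
  std::optional<std::uint8_t> _decimal_precision;  // unset for non-decimal columns
  std::vector<column_meta_sketch> _children;

 public:
  // Setters return *this so they can be chained.
  column_meta_sketch& set_name(std::string const& name) { _name = name; return *this; }
  column_meta_sketch& set_nullability(bool nullable) { _nullable = nullable; return *this; }
  column_meta_sketch& set_decimal_precision(std::uint8_t p) { _decimal_precision = p; return *this; }
  column_meta_sketch& add_child(column_meta_sketch const& c) { _children.push_back(c); return *this; }

  std::string const& get_name() const { return _name; }
  bool is_nullability_defined() const { return _nullable.has_value(); }
  // value() throws std::bad_optional_access when nothing was set,
  // so callers should test is_nullability_defined() first.
  bool nullable() const { return _nullable.value(); }
  bool is_decimal_precision_set() const { return _decimal_precision.has_value(); }
  std::uint8_t get_decimal_precision() const { return _decimal_precision.value(); }
  column_meta_sketch const& child(int i) const { return _children[i]; }
  int num_children() const { return static_cast<int>(_children.size()); }
};

// Builds metadata for a struct column "s" with an integer-like child "x"
// and a decimal child "price" with precision 9 (all names are made up).
inline column_meta_sketch make_struct_meta()
{
  column_meta_sketch parent;
  parent.set_name("s").set_nullability(true);
  parent.add_child(column_meta_sketch{}.set_name("x"))
    .add_child(column_meta_sketch{}.set_name("price").set_decimal_precision(9));
  return parent;
}
```

Keeping nullability and precision optional lets a writer tell "not specified" apart from an explicit `false` or `0`, which is why each getter is paired with an `is_..._defined` or `is_..._set` check.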