Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Add libcudf policy that list columns must be sanitized #11748

Closed
nvdbaranec opened this issue Sep 22, 2022 · 6 comments
Closed

[DOC] Add libcudf policy that list columns must be sanitized #11748

nvdbaranec opened this issue Sep 22, 2022 · 6 comments
Labels
0 - Backlog In queue waiting for assignment doc Documentation libcudf Affects libcudf (C++/CUDA) code.

Comments

@nvdbaranec
Copy link
Contributor

I found a case where the parquet writer seems to be incorrectly writing the validity data for the top level of a list column. It ends up writing out an all-valid mask, when there are actually nulls. Confirmed this is a writer issue by loading the resulting file in external tools as well. ORC writer does not exhibit this behavior.

Repro:

constexpr int num_rows = 16;

std::mt19937 gen(6542);
std::bernoulli_distribution bn(0.7f);
auto valids =
  cudf::detail::make_counting_transform_iterator(0, [&](int index) { return bn(gen); });
auto values = thrust::make_counting_iterator(0);

// list<float>
constexpr int floats_per_row = 4;
auto c1_offset_iter = cudf::detail::make_counting_transform_iterator(0, [floats_per_row](cudf::size_type idx){
    return idx * floats_per_row;
  });
cudf::test::fixed_width_column_wrapper<cudf::offset_type> c1_offsets(c1_offset_iter, c1_offset_iter + num_rows + 1);    
cudf::test::fixed_width_column_wrapper<float> c1_floats(values, values + (num_rows * floats_per_row), valids);
auto c1 = cudf::make_lists_column(num_rows,
                                  c1_offsets.release(),
                                  c1_floats.release(),
                                  cudf::UNKNOWN_NULL_COUNT,
                                  cudf::test::detail::make_null_mask(valids, valids + num_rows));

cudf::test::print(*c1);

// write it out
cudf::table_view tbl({*c1});  
std::string filepath("bad_list_write.parquet");
cudf::io::parquet_writer_options out_args =
  cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, tbl);
cudf::io::write_parquet(out_args);

cudf::io::parquet_reader_options read_args =
  cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath});
auto result = cudf::io::read_parquet(read_args);

cudf::test::print(result.tbl->get_column(0));

Expected (nested float column not showed)

List<float>:
Length : 16
Offsets : 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64
Null count: 5
1111111010011010

Got

List<float>:
Length : 16
Offsets : 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64
Null count: 0
1111111111111111
@nvdbaranec nvdbaranec added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. labels Sep 22, 2022
@vuule
Copy link
Contributor

vuule commented Sep 23, 2022

The issue only reproes with list columns that have non-empty null elements. Modifying the failing test to have zero-length null lists fixes the test.
Such input validation is (arguably) out of scope of the Parquet writer. cuDF APIs assume that list columns are sanitized.

@ttnghia
Copy link
Contributor

ttnghia commented Sep 23, 2022

So we may end up to always calling purge_nonempty_nulls before writing (of course after checking has_nulls)?

@vuule
Copy link
Contributor

vuule commented Sep 23, 2022

It depends on where the data is coming from. For example, cuIO readers do not create such columns.

Maybe @mythrocks can comment on the cases in which we can create non-conforming columns.

rapids-bot bot pushed a commit that referenced this issue Oct 5, 2022
…ization for nested preprocessing (#11752)

Fixes an issue where using user bounds with parquet files containing both nested and non-nested types could result in incorrect row counts for the non-nested columns.  Originally reported by @etseidl 

The nature of the fix also implements a longstanding desired optimization:  when running the preprocess step for nested types, ignore pages for non-nested hierarchies.  This can result in significant speedups for files containing only a few nested columns.

<s>The tests added for this PR seem to tease a bug in the parquet writer into happening (#11748) so I will leave this as a draft until that issue is resolved.</s>

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11752
@GregoryKimball
Copy link
Contributor

I will convert this into a documentation issue. The update will be needed in https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#list-columns

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment doc Documentation cuIO cuIO issue and removed Needs Triage Need team to review and classify cuIO cuIO issue labels Oct 21, 2022
@GregoryKimball GregoryKimball changed the title [BUG] Corrupt data using parquet writer to write list columns. [DOC] Add convention that all lists must be sanitized Oct 21, 2022
@GregoryKimball GregoryKimball removed the bug Something isn't working label Oct 21, 2022
@GregoryKimball GregoryKimball changed the title [DOC] Add convention that all lists must be sanitized [DOC] Add libcudf policy that list columns must be sanitized Oct 21, 2022
@bdice
Copy link
Contributor

bdice commented Oct 21, 2022

@vyasr
Copy link
Contributor

vyasr commented Oct 21, 2022

Agreed. Closing.

@vyasr vyasr closed this as completed Oct 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment doc Documentation libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

6 participants