[BUG] ORC timestamps loaded with specified timestamp type are corrupted #9365

jlowe · 2021-10-04T22:10:44Z

Describe the bug
Recently, RAPIDS Accelerator for Apache Spark tests have been failing when the input file on disk contains timestamp types. Timestamps appear to be corrupted whenever the caller specifies a timestamp type to use for the timestamp columns being loaded.

Steps/Code to reproduce bug
Apply the following patch, which demonstrates the issue. If you remove the .timestamp_type(cudf::data_type... line on the ORC options builder, the test passes.

diff --git a/cpp/tests/io/orc_test.cpp b/cpp/tests/io/orc_test.cpp
index cdf0a3b275..707b6450a5 100644
--- a/cpp/tests/io/orc_test.cpp
+++ b/cpp/tests/io/orc_test.cpp
@@ -306,6 +306,31 @@ TYPED_TEST(OrcWriterTimestampTypeTest, TimestampsWithNulls)
   CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());
 }
 
+TEST_F(OrcWriterTest, SimpleTimestamps)
+{
+  int64_t num_rows = 100;
+  
+  auto int_data = random_values<int64_t>(num_rows);
+  auto validity  = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return true; });
+
+  column_wrapper<int64_t> const intcol{int_data.begin(), int_data.end(), validity};
+  auto tscol = cudf::bit_cast(intcol, cudf::data_type{cudf::type_id::TIMESTAMP_NANOSECONDS});
+  table_view expected({tscol});
+
+  auto filepath = temp_env->get_temp_filepath("OrcSimpleTimestamps.orc");
+  cudf_io::orc_writer_options out_opts =
+    cudf_io::orc_writer_options::builder(cudf_io::sink_info{filepath}, expected);
+  cudf_io::write_orc(out_opts);
+
+  cudf_io::orc_reader_options in_opts =
+    cudf_io::orc_reader_options::builder(cudf_io::source_info{filepath})
+      .use_index(false)
+      .timestamp_type(cudf::data_type{cudf::type_id::TIMESTAMP_NANOSECONDS});
+  auto result = cudf_io::read_orc(in_opts);
+
+  CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());
+}
+
 TEST_F(OrcWriterTest, MultiColumn)
 {
   constexpr auto num_rows = 10;

Expected behavior
Requesting TIMESTAMP_NANOSECONDS should return the same data as not requesting a timestamp result type.


vuule · 2021-10-04T22:46:55Z

@PointKernel the timing and type suggest that this could be related to #9278. Can you please look into this?

PointKernel · 2021-10-04T23:04:24Z

I will take care of this.

jlowe · 2021-10-05T19:26:12Z

Thanks for looking into this, @PointKernel! Note that this has blocked the RAPIDS Accelerator 21.12 CI pipelines. If it will take a while to develop a fix, we may want to consider reverting the change that triggered the regression.

PointKernel · 2021-10-05T19:37:10Z

I will probably work on this tomorrow. I think I know where the issue is, but I am not 100% sure. I will let you know whether it is a quick fix or not.

Closes #9365

This PR gets rid of integer overflow issues along with the clock rate logic by directly operating on timestamp type id. It also fixes a truncation bug in Parquet. Corresponding unit tests are added.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)

URL: #9382
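The fix description above mentions integer overflow in clock-rate logic and a switch to operating on the timestamp type id. The thread does not show the actual code, so the following is only a hypothetical sketch of the general idea, not libcudf's implementation. The names `timestamp_unit`, `to_ticks_narrow`, and `to_ticks` are invented here for illustration. It contrasts scaling by a numeric clock rate through a 32-bit intermediate (which wraps) with dispatching on the target type and doing the arithmetic in 64 bits.

```cpp
#include <cstdint>

// Hypothetical stand-in for the timestamp members of a type id enum.
enum class timestamp_unit { SECONDS, MILLISECONDS, MICROSECONDS, NANOSECONDS };

// Illustrative "clock rate" scaling with a 32-bit intermediate.
// Unsigned arithmetic is used so the wraparound (mod 2^32) is well-defined.
constexpr std::int64_t to_ticks_narrow(std::int64_t seconds, std::uint32_t ticks_per_second)
{
  return static_cast<std::uint32_t>(seconds) * ticks_per_second;
}

// Type-based conversion: select the scale from the target unit and keep
// every intermediate in 64 bits. `nanos` is the sub-second part, 0..999'999'999.
constexpr std::int64_t to_ticks(std::int64_t seconds, std::int64_t nanos, timestamp_unit unit)
{
  switch (unit) {
    case timestamp_unit::SECONDS: return seconds;
    case timestamp_unit::MILLISECONDS: return seconds * 1'000 + nanos / 1'000'000;
    case timestamp_unit::MICROSECONDS: return seconds * 1'000'000 + nanos / 1'000;
    case timestamp_unit::NANOSECONDS: return seconds * 1'000'000'000 + nanos;
  }
  return 0;  // unreachable for valid enum values
}
```

For example, `to_ticks(1'633'385'444, 123'456'789, timestamp_unit::NANOSECONDS)` yields `1'633'385'444'123'456'789`, whereas `to_ticks_narrow(1'633'385'444, 1'000'000'000)` cannot represent the nanosecond count because the product is reduced modulo 2^32.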

jlowe added the bug, Needs Triage, libcudf, cuIO, and Spark labels Oct 4, 2021

jlowe mentioned this issue Oct 4, 2021

[BUG] 21.12 parquet and orc test failures NVIDIA/spark-rapids#3742

Closed

PointKernel self-assigned this Oct 4, 2021

beckernick removed the Needs Triage label Oct 5, 2021

PointKernel mentioned this issue Oct 6, 2021

Fix timestamp truncation/overflow bugs in orc/parquet #9382

Merged

rapids-bot bot closed this as completed in #9382 Oct 7, 2021

jlowe mentioned this issue Oct 8, 2021

Restore disabled ORC and Parquet tests NVIDIA/spark-rapids#3773

Merged
