Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ORC timestamps loaded with specified timestamp type are corrupted #9365

Closed
jlowe opened this issue Oct 4, 2021 · 4 comments · Fixed by #9382
Closed

[BUG] ORC timestamps loaded with specified timestamp type are corrupted #9365

jlowe opened this issue Oct 4, 2021 · 4 comments · Fixed by #9382
Assignees
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Copy link
Member

jlowe commented Oct 4, 2021

Describe the bug
Recently RAPIDS Accelerator for Apache Spark tests have been failing when timestamp types are involved in the input file on disk. I noticed that timestamps seem to be corrupted if the caller specifies a timestamp type to use for any timestamp columns being loaded.

Steps/Code to reproduce bug
Apply the following patch which demonstrates the issue. If you remove the .timestamp_type(cudf::data_type... line on the ORC options builder then the test will pass.

diff --git a/cpp/tests/io/orc_test.cpp b/cpp/tests/io/orc_test.cpp
index cdf0a3b275..707b6450a5 100644
--- a/cpp/tests/io/orc_test.cpp
+++ b/cpp/tests/io/orc_test.cpp
@@ -306,6 +306,31 @@ TYPED_TEST(OrcWriterTimestampTypeTest, TimestampsWithNulls)
   CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());
 }
 
+TEST_F(OrcWriterTest, SimpleTimestamps)
+{
+  int64_t num_rows = 100;
+  
+  auto int_data = random_values<int64_t>(num_rows);
+  auto validity  = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return true; });
+
+  column_wrapper<int64_t> const intcol{int_data.begin(), int_data.end(), validity};
+  auto tscol = cudf::bit_cast(intcol, cudf::data_type{cudf::type_id::TIMESTAMP_NANOSECONDS});
+  table_view expected({tscol});
+
+  auto filepath = temp_env->get_temp_filepath("OrcSimpleTimestamps.orc");
+  cudf_io::orc_writer_options out_opts =
+    cudf_io::orc_writer_options::builder(cudf_io::sink_info{filepath}, expected);
+  cudf_io::write_orc(out_opts);
+
+  cudf_io::orc_reader_options in_opts =
+    cudf_io::orc_reader_options::builder(cudf_io::source_info{filepath})
+      .use_index(false)
+      .timestamp_type(cudf::data_type{cudf::type_id::TIMESTAMP_NANOSECONDS});
+  auto result = cudf_io::read_orc(in_opts);
+
+  CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());
+}
+
 TEST_F(OrcWriterTest, MultiColumn)
 {
   constexpr auto num_rows = 10;

Expected behavior
Requesting TIMESTAMP_NANOSECONDS should return the same data as not requesting a timestamp result type.

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Oct 4, 2021
@vuule
Copy link
Contributor

vuule commented Oct 4, 2021

@PointKernel timing/type suggests that this could be related to #9278, can you please look into this?

@PointKernel
Copy link
Member

I will take care of this.

@PointKernel PointKernel self-assigned this Oct 4, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Oct 5, 2021
@jlowe
Copy link
Member Author

jlowe commented Oct 5, 2021

Thanks for looking into this, @PointKernel! Note that this has blocked the RAPIDS Accelerator 21.12 CI pipelines. If it will take a while to develop a fix, we may want to consider reverting the change that triggered the regression.

@PointKernel
Copy link
Member

I will work on this probably tomorrow. I feel I know where the issue is but not 100% sure. Will let you know whether it would be a quick fix or not.

@rapids-bot rapids-bot bot closed this as completed in #9382 Oct 7, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 7, 2021
Closes #9365

This PR gets rid of integer overflow issues along with the clock rate logic by directly operating on timestamp type id. It also fixes a truncation bug in Parquet. Corresponding unit tests are added.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #9382
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants