[R][C++] PreBuffer is not enabled when scanning parquet via exec nodes #29623

Closed
asfimport opened this issue Sep 17, 2021 · 11 comments

Comments

In ExecNode_Scan a ScanOptions object is built up. If we are reading parquet, we should enable pre-buffering. This can be done by creating a ParquetFragmentScanOptions object and enabling pre-buffering on it.

Alternatively, we could just default pre-buffering to true for asynchronous scans of parquet data.
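
For concreteness, a minimal C++ sketch of the first approach (written for illustration, not taken from the issue; the helper name MakePreBufferedScanOptions is hypothetical, while the types come from the Arrow datasets API):

#include <memory>

#include "arrow/dataset/file_parquet.h"  // ParquetFragmentScanOptions
#include "arrow/dataset/scanner.h"       // ScanOptions

// Build ScanOptions that ask the Parquet reader to pre-buffer, i.e. to
// coalesce and prefetch the byte ranges it will need instead of issuing
// many small reads.
std::shared_ptr<arrow::dataset::ScanOptions> MakePreBufferedScanOptions() {
  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  auto parquet_options =
      std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();
  parquet_options->arrow_reader_properties->set_pre_buffer(true);
  options->fragment_scan_options = std::move(parquet_options);
  return options;
}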

Reporter: Weston Pace / @westonpace
Assignee: Neal Richardson / @nealrichardson


Note: This issue was originally created as ARROW-14025. Please see the migration documentation for further details.

Neal Richardson / @nealrichardson:
@westonpace is this the right idea (+/- whatever it takes to get it to compile)? I believe this means we would also get whatever FragmentScanOptions the dataset was created with (so that we could correctly query CSVs etc.).

diff --git a/r/src/compute-exec.cpp b/r/src/compute-exec.cpp
index c61f7a3d1..5c60f2ac2 100644
--- a/r/src/compute-exec.cpp
+++ b/r/src/compute-exec.cpp
@@ -113,7 +113,17 @@ std::shared_ptr<compute::ExecNode> ExecNode_Scan(
   arrow::dataset::internal::Initialize();
 
   // TODO: pass in FragmentScanOptions
-  auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  if (dataset->type_name() == "filesystem") {
+    // HALP: dataset is Dataset but needs to be cast to FileSystemDataset
+    // to have format() method
+    auto fmt = dataset->format();
+    auto options = fmt->default_fragment_scan_options;
+    if (fmt->type_name() == "parquet") {
+      options->arrow_reader_properties.pre_buffer_ = true;
+    }
+  } else {
+    auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  }
 
   options->use_async = true;
   options->use_threads = arrow::r::GetBoolOption("arrow.use_threads", true);

David Li / @lidavidm:
Something like this:

diff --git a/r/src/compute-exec.cpp b/r/src/compute-exec.cpp
index c61f7a3d1..cd34ad42f 100644
--- a/r/src/compute-exec.cpp
+++ b/r/src/compute-exec.cpp
@@ -114,6 +114,15 @@ std::shared_ptr<compute::ExecNode> ExecNode_Scan(
 
   // TODO: pass in FragmentScanOptions
   auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  if (dataset->type_name() == "filesystem") {
+    auto fs_dataset = static_cast<const arrow::dataset::FileSystemDataset&>(*dataset);
+    if (fs_dataset.format()->type_name() == "parquet") {
+      auto fragment_scan_options = std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();
+      fragment_scan_options->arrow_reader_properties->set_pre_buffer(true);
+      fragment_scan_options->arrow_reader_properties->set_cache_options(arrow::io::CacheOptions::LazyDefaults());
+      options->fragment_scan_options = std::move(fragment_scan_options);
+    }
+  }
 
   options->use_async = true;
   options->use_threads = arrow::r::GetBoolOption("arrow.use_threads", true);

 

Weston Pace / @westonpace:
Per a brief external discussion with @nealrichardson & @lidavidm, the best thing to do here will be to change the default to pre-buffer and make sure we have the CacheOptions set to LazyDefaults.

David Li / @lidavidm:
If we really want to avoid the dynamic type checking, we could instead set the fragment scan options on the ParquetFileFormat instance when it is created. I think ParquetFragmentScanOptions$create and ParquetFileFormat$create would need to gain some options, or we could just make lazy caching the only possibility by changing dataset___ParquetFragmentScanOptions__Make in dataset.cpp. (This would change all scans, not just the ones that go through the exec plan.)
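
A rough sketch of that alternative (the helper name is hypothetical; default_fragment_scan_options is the FileFormat member this relies on):

#include <memory>

#include "arrow/dataset/file_parquet.h"
#include "arrow/io/caching.h"

// Bake lazy pre-buffering into the format instance itself, so every scan
// of this format picks it up, not just scans that go through an exec plan.
std::shared_ptr<arrow::dataset::ParquetFileFormat> MakePreBufferedParquetFormat() {
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  auto scan_options =
      std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();
  scan_options->arrow_reader_properties->set_pre_buffer(true);
  scan_options->arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::LazyDefaults());
  format->default_fragment_scan_options = std::move(scan_options);
  return format;
}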

Neal Richardson / @nealrichardson:
So the Dataset gets created with ParquetFragmentScanOptions (or whatever the format's equivalent is); shouldn't we grab that and start from it? That's perhaps less important for Parquet, but for CSV don't we need it in order to get the parsing options?

David Li / @lidavidm:
Ah, yeah, we shouldn't just override them completely, you're right.
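
Putting the two points together, a sketch that starts from the format's existing options instead of replacing them (checked_cast and the surrounding dataset variable are assumed from the context of ExecNode_Scan):

#include "arrow/dataset/file_base.h"   // FileSystemDataset
#include "arrow/dataset/scanner.h"     // ScanOptions
#include "arrow/util/checked_cast.h"

// `dataset` is the std::shared_ptr<arrow::dataset::Dataset> passed into
// ExecNode_Scan. Reuse whatever FragmentScanOptions the format was created
// with (preserving e.g. CSV parse options) rather than overriding them.
auto options = std::make_shared<arrow::dataset::ScanOptions>();
if (dataset->type_name() == "filesystem") {
  const auto& fs_dataset =
      arrow::internal::checked_cast<const arrow::dataset::FileSystemDataset&>(
          *dataset);
  options->fragment_scan_options =
      fs_dataset.format()->default_fragment_scan_options;
}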

Weston Pace / @westonpace:

> So the Dataset gets created with ParquetFragmentScanOptions (or whatever format)

Does it? I mean, this sounds like a good idea, and I like it. But do we specify scan options when creating a dataset?

David Li / @lidavidm:
Er, it's when creating the FooFileFormat, right? That has a default_fragment_scan_options. Maybe the easiest thing is to just change the default scan options for the Parquet format to enable pre-buffering.

Neal Richardson / @nealrichardson:
Yes, when we create a FileSystemDataset, the FileFormat gets created and it takes an options object, which for parquet is ParquetFragmentScanOptions. Here are the R defaults: https://github.com/apache/arrow/blob/master/r/R/dataset-format.R#L292-L294

David Li / @lidavidm:
So we could just change that default to TRUE and then in dataset.cpp also set the cache options. (And if desired, also expose a flag to not set cache options or something.)
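
A sketch of what that change in dataset.cpp might look like (the parameter list here is an assumption for illustration, not the actual binding signature):

#include <memory>

#include "arrow/dataset/file_parquet.h"
#include "arrow/io/caching.h"

namespace ds = arrow::dataset;

std::shared_ptr<ds::ParquetFragmentScanOptions>
dataset___ParquetFragmentScanOptions__Make(bool use_buffered_stream,
                                           int64_t buffer_size, bool pre_buffer) {
  auto options = std::make_shared<ds::ParquetFragmentScanOptions>();
  if (use_buffered_stream) {
    options->reader_properties->enable_buffered_stream();
  } else {
    options->reader_properties->disable_buffered_stream();
  }
  options->reader_properties->set_buffer_size(buffer_size);
  options->arrow_reader_properties->set_pre_buffer(pre_buffer);
  if (pre_buffer) {
    // Pair pre-buffering with lazy caching, per the discussion above.
    options->arrow_reader_properties->set_cache_options(
        arrow::io::CacheOptions::LazyDefaults());
  }
  return options;
}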

Neal Richardson / @nealrichardson:
Issue resolved by pull request #11386.

asfimport added this to the 6.0.0 milestone on Jan 11, 2023