[R][C++] PreBuffer is not enabled when scanning parquet via exec nodes #29623

Closed
asfimport opened this issue Sep 17, 2021 · 11 comments

Comments

In ExecNode_Scan a ScanOptions object is built up. If we are reading parquet, we should enable pre-buffering. This can be done by creating a ParquetFragmentScanOptions object and enabling pre-buffering on it.

Alternatively, we could just default pre-buffering to true for asynchronous scans of parquet data.
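
For concreteness, a minimal C++ sketch of the first approach (written for illustration, not taken from the issue; the helper name MakePreBufferedScanOptions is hypothetical, while the types come from the Arrow datasets API):

#include <memory>

#include "arrow/dataset/file_parquet.h"  // ParquetFragmentScanOptions
#include "arrow/dataset/scanner.h"       // ScanOptions

// Build ScanOptions that ask the Parquet reader to pre-buffer, i.e. to
// coalesce and prefetch the byte ranges it will need instead of issuing
// many small reads.
std::shared_ptr<arrow::dataset::ScanOptions> MakePreBufferedScanOptions() {
  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  auto parquet_options =
      std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();
  parquet_options->arrow_reader_properties->set_pre_buffer(true);
  options->fragment_scan_options = std::move(parquet_options);
  return options;
}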

Reporter: Weston Pace / @westonpace
Assignee: Neal Richardson / @nealrichardson


Note: This issue was originally created as ARROW-14025. Please see the migration documentation for further details.

Neal Richardson / @nealrichardson:
@westonpace is this the right idea (+/- whatever it takes to get it to compile)? I believe this means we would also get whatever FragmentScanOptions the dataset was created with (so that we could correctly query CSVs etc.).

diff --git a/r/src/compute-exec.cpp b/r/src/compute-exec.cpp
index c61f7a3d1..5c60f2ac2 100644
--- a/r/src/compute-exec.cpp
+++ b/r/src/compute-exec.cpp
@@ -113,7 +113,17 @@ std::shared_ptr<compute::ExecNode> ExecNode_Scan(
   arrow::dataset::internal::Initialize();
 
   // TODO: pass in FragmentScanOptions
-  auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  if (dataset->type_name() == "filesystem") {
+    // HALP: dataset is Dataset but needs to be cast to FileSystemDataset
+    // to have format() method
+    auto fmt = dataset->format();
+    auto options = fmt->default_fragment_scan_options;
+    if (fmt->type_name() == "parquet") {
+      options->arrow_reader_properties.pre_buffer_ = true;
+    }
+  } else {
+    auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  }
 
   options->use_async = true;
   options->use_threads = arrow::r::GetBoolOption("arrow.use_threads", true);

David Li / @lidavidm:
Something like this:

diff --git a/r/src/compute-exec.cpp b/r/src/compute-exec.cpp
index c61f7a3d1..cd34ad42f 100644
--- a/r/src/compute-exec.cpp
+++ b/r/src/compute-exec.cpp
@@ -114,6 +114,15 @@ std::shared_ptr<compute::ExecNode> ExecNode_Scan(
 
   // TODO: pass in FragmentScanOptions
   auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  if (dataset->type_name() == "filesystem") {
+    auto fs_dataset = static_cast<const arrow::dataset::FileSystemDataset&>(*dataset);
+    if (fs_dataset.format()->type_name() == "parquet") {
+      auto fragment_scan_options = std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();
+      fragment_scan_options->arrow_reader_properties->set_pre_buffer(true);
+      fragment_scan_options->arrow_reader_properties->set_cache_options(arrow::io::CacheOptions::LazyDefaults());
+      options->fragment_scan_options = std::move(fragment_scan_options);
+    }
+  }
 
   options->use_async = true;
   options->use_threads = arrow::r::GetBoolOption("arrow.use_threads", true);

 

Weston Pace / @westonpace:
Per a brief external discussion with @nealrichardson & @lidavidm, the best thing to do here will be to change the default to pre-buffer and make sure we have the CacheOptions set to LazyDefaults.

David Li / @lidavidm:
If we really want to avoid the dynamic type checking, we could instead set the fragment scan options on the ParquetFileFormat instance when it is created. I think ParquetFragmentScanOptions$create and ParquetFileFormat$create would need to gain some options, or we could just make lazy caching the only possibility by changing dataset___ParquetFragmentScanOptions__Make in dataset.cpp. (This would change all scans, not just the ones that go through the exec plan.)
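
A rough sketch of that alternative (the helper name is hypothetical; default_fragment_scan_options is the FileFormat member this relies on):

#include <memory>

#include "arrow/dataset/file_parquet.h"
#include "arrow/io/caching.h"

// Bake lazy pre-buffering into the format instance itself, so every scan
// of this format picks it up, not just scans that go through an exec plan.
std::shared_ptr<arrow::dataset::ParquetFileFormat> MakePreBufferedParquetFormat() {
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  auto scan_options =
      std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();
  scan_options->arrow_reader_properties->set_pre_buffer(true);
  scan_options->arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::LazyDefaults());
  format->default_fragment_scan_options = std::move(scan_options);
  return format;
}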

Neal Richardson / @nealrichardson:
So the Dataset gets created with ParquetFragmentScanOptions (or whatever the format's equivalent is); shouldn't we grab that and start from it? That's perhaps less important for Parquet, but for CSV don't we need it in order to get the parsing options?

David Li / @lidavidm:
Ah, yeah, we shouldn't just override them completely, you're right.
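
Putting the two points together, a sketch that starts from the format's existing options instead of replacing them (checked_cast and the surrounding dataset variable are assumed from the context of ExecNode_Scan):

#include "arrow/dataset/file_base.h"   // FileSystemDataset
#include "arrow/dataset/scanner.h"     // ScanOptions
#include "arrow/util/checked_cast.h"

// `dataset` is the std::shared_ptr<arrow::dataset::Dataset> passed into
// ExecNode_Scan. Reuse whatever FragmentScanOptions the format was created
// with (preserving e.g. CSV parse options) rather than overriding them.
auto options = std::make_shared<arrow::dataset::ScanOptions>();
if (dataset->type_name() == "filesystem") {
  const auto& fs_dataset =
      arrow::internal::checked_cast<const arrow::dataset::FileSystemDataset&>(
          *dataset);
  options->fragment_scan_options =
      fs_dataset.format()->default_fragment_scan_options;
}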

Weston Pace / @westonpace:

> So the Dataset gets created with ParquetFragmentScanOptions (or whatever format)

Does it? I mean, this sounds like a good idea, and I like it. But do we specify scan options when creating a dataset?

David Li / @lidavidm:
Er, it's when creating the FooFileFormat, right? That has a default_fragment_scan_options. Maybe the easiest thing is to just change the default scan options for the Parquet format to enable pre-buffering.

Neal Richardson / @nealrichardson:
Yes, when we create a FileSystemDataset, the FileFormat gets created and it takes an options object, which for parquet is ParquetFragmentScanOptions. Here are the R defaults: https://github.com/apache/arrow/blob/master/r/R/dataset-format.R#L292-L294

David Li / @lidavidm:
So we could just change that default to TRUE and then in dataset.cpp also set the cache options. (And if desired, also expose a flag to not set cache options or something.)
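
A sketch of what that change in dataset.cpp might look like (the parameter list here is an assumption for illustration, not the actual binding signature):

#include <memory>

#include "arrow/dataset/file_parquet.h"
#include "arrow/io/caching.h"

namespace ds = arrow::dataset;

std::shared_ptr<ds::ParquetFragmentScanOptions>
dataset___ParquetFragmentScanOptions__Make(bool use_buffered_stream,
                                           int64_t buffer_size, bool pre_buffer) {
  auto options = std::make_shared<ds::ParquetFragmentScanOptions>();
  if (use_buffered_stream) {
    options->reader_properties->enable_buffered_stream();
  } else {
    options->reader_properties->disable_buffered_stream();
  }
  options->reader_properties->set_buffer_size(buffer_size);
  options->arrow_reader_properties->set_pre_buffer(pre_buffer);
  if (pre_buffer) {
    // Pair pre-buffering with lazy caching, per the discussion above.
    options->arrow_reader_properties->set_cache_options(
        arrow::io::CacheOptions::LazyDefaults());
  }
  return options;
}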

Neal Richardson / @nealrichardson:
Issue resolved by pull request #11386.

asfimport added this to the 6.0.0 milestone on Jan 11, 2023