[R][C++] PreBuffer is not enabled when scanning parquet via exec nodes #29623
Comments
Neal Richardson / @nealrichardson:
diff --git a/r/src/compute-exec.cpp b/r/src/compute-exec.cpp
index c61f7a3d1..5c60f2ac2 100644
--- a/r/src/compute-exec.cpp
+++ b/r/src/compute-exec.cpp
@@ -113,7 +113,17 @@ std::shared_ptr<compute::ExecNode> ExecNode_Scan(
   arrow::dataset::internal::Initialize();
   // TODO: pass in FragmentScanOptions
-  auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  if (dataset->type_name() == "filesystem") {
+    // HALP: dataset is Dataset but needs to be cast to FileSystemDataset
+    // to have format() method
+    auto fmt = dataset->format();
+    auto options = fmt->default_fragment_scan_options;
+    if (fmt->type_name() == "parquet") {
+      options->arrow_reader_properties.pre_buffer_ = true;
+    }
+  } else {
+    auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  }
   options->use_async = true;
   options->use_threads = arrow::r::GetBoolOption("arrow.use_threads", true);
David Li / @lidavidm:
diff --git a/r/src/compute-exec.cpp b/r/src/compute-exec.cpp
index c61f7a3d1..cd34ad42f 100644
--- a/r/src/compute-exec.cpp
+++ b/r/src/compute-exec.cpp
@@ -114,6 +114,15 @@ std::shared_ptr<compute::ExecNode> ExecNode_Scan(
   // TODO: pass in FragmentScanOptions
   auto options = std::make_shared<arrow::dataset::ScanOptions>();
+  if (dataset->type_name() == "filesystem") {
+    auto fs_dataset = static_cast<const arrow::dataset::FileSystemDataset&>(*dataset);
+    if (fs_dataset.format()->type_name() == "parquet") {
+      auto fragment_scan_options = std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();
+      fragment_scan_options->arrow_reader_properties->set_pre_buffer(true);
+      fragment_scan_options->arrow_reader_properties->set_cache_options(arrow::io::CacheOptions::LazyDefaults());
+      options->fragment_scan_options = std::move(fragment_scan_options);
+    }
+  }
   options->use_async = true;
   options->use_threads = arrow::r::GetBoolOption("arrow.use_threads", true);
Weston Pace / @westonpace:
David Li / @lidavidm:
Neal Richardson / @nealrichardson:
Does it? I mean, this sounds like a good idea, and I like it. But do we specify scan options when creating a dataset?
Neal Richardson / @nealrichardson:
David Li / @lidavidm:
Neal Richardson / @nealrichardson:
In ExecNode_Scan, a ScanOptions object is built up. If we are reading Parquet, we should enable pre-buffering. This is done by creating a ParquetFragmentScanOptions object, enabling pre-buffering on it, and attaching it to the ScanOptions.
Alternatively, we could just default pre-buffering to true for asynchronous scans of Parquet data.
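The approach described above can be sketched in isolation. The sketch below is not Arrow code: every type in it (`Dataset`, `FileSystemDataset`, `InMemoryDataset`, `FileFormat`, `ScanOptions`, `FragmentScanOptions`, `ParquetFragmentScanOptions`) is a simplified, hypothetical stand-in that borrows names from the diffs in this thread, and `MakeScanOptions` and the `pre_buffer` field are invented for illustration. It only shows the control flow: declare the options once outside any branch, check the dataset's type tag before downcasting, and attach Parquet-specific options only when the format is Parquet.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for the Arrow dataset types named in this issue.
// They mirror only the members discussed here and are NOT the real Arrow API.
struct FragmentScanOptions {
  virtual ~FragmentScanOptions() = default;
};

struct ParquetFragmentScanOptions : FragmentScanOptions {
  bool pre_buffer = false;  // stand-in for the reader's pre-buffer setting
};

struct ScanOptions {
  std::shared_ptr<FragmentScanOptions> fragment_scan_options;
  bool use_async = false;
  bool use_threads = false;
};

struct FileFormat {
  explicit FileFormat(std::string name) : name_(std::move(name)) {}
  const std::string& type_name() const { return name_; }

 private:
  std::string name_;
};

struct Dataset {
  virtual ~Dataset() = default;
  virtual std::string type_name() const = 0;
};

struct FileSystemDataset : Dataset {
  explicit FileSystemDataset(std::string format_name)
      : format_(std::move(format_name)) {}
  std::string type_name() const override { return "filesystem"; }
  const FileFormat& format() const { return format_; }

 private:
  FileFormat format_;
};

struct InMemoryDataset : Dataset {
  std::string type_name() const override { return "in-memory"; }
};

// Builds scan options for a dataset, enabling pre-buffering for Parquet files.
std::shared_ptr<ScanOptions> MakeScanOptions(const Dataset& dataset,
                                             bool use_threads) {
  // Declared once, outside any branch, so it is still in scope further down.
  auto options = std::make_shared<ScanOptions>();
  if (dataset.type_name() == "filesystem") {
    // The type tag was checked above, so the downcast is valid here.
    // Binding by const reference avoids copying the dataset.
    const auto& fs_dataset = static_cast<const FileSystemDataset&>(dataset);
    if (fs_dataset.format().type_name() == "parquet") {
      auto parquet_options = std::make_shared<ParquetFragmentScanOptions>();
      parquet_options->pre_buffer = true;
      options->fragment_scan_options = std::move(parquet_options);
    }
  }
  options->use_async = true;
  options->use_threads = use_threads;
  return options;
}
```

With these stand-ins, calling `MakeScanOptions(FileSystemDataset("parquet"), true)` yields options whose `fragment_scan_options` is a `ParquetFragmentScanOptions` with `pre_buffer` set, while a non-Parquet or non-filesystem dataset leaves `fragment_scan_options` empty. In the real code, the equivalent setting is reached through `arrow_reader_properties`, as in David Li's diff.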
Reporter: Weston Pace / @westonpace
Assignee: Neal Richardson / @nealrichardson
PRs and other links:
Note: This issue was originally created as ARROW-14025. Please see the migration documentation for further details.