Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(rust): Request for preliminary directional check-in: making file loading much better factored #12750

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

itamarst
Copy link
Contributor

@itamarst itamarst commented Nov 28, 2023

Motivation

User stories

Issues with the current code

Structural issues

  • In polars-lazy, the various lazy file readers had a bad pattern where there was both a path and paths on the structs.
    • The path had two completely different meanings: a glob, and a specific file. A glob would be converted to a specific file with a specific path.
    • The paths would also be converted into specific files with a specific path.
    • In short, this is an implicit state machine with ... 6 states? Some of which were never addressed.
  • In addition, any time time a PathBuf was used for a specific path to a file or as a glob, it had two potential meanings: a local path on the filesystem, or a cloud URI. Dispatch was done based on is_cloud_url().

Both these cases ought to be modeled via the type system, not string inspection or implicit, vague state machines.

Functionality issues

Because all loading was assumed to be from a PathBuf, loading from a Python file-like object is not possible.

Proposed design changes

  1. Switch from path + paths in LazyFileListReader to a single SpecificOrMultipleReaderFactories. This is still not great, since there's still a requirement for the code to do things in a certain order if it doesn't want a panic; a better refactor would involve splitting the classes into a builder phase and a final phase. But it's a large enough branch as it is, and at least now there's only two states instead of six!
  2. Use of PathBuf for things to load is replaced with a ReaderFactory; for now it's either a filesystem path or a URI, but a Python factory function that returns file-like objects could be added. It's a factory to allow for opening a file multiple times, and because of the need for (de)serialization.
  • mmap.rs is probably the wrong place for ReaderFactory, it should probably be elsewhere.

I did some of the refactoring, but there's a lot more, and it will percolate throughout the code base, so this seemed like a good place to get some feedback.

State of the PR

There's lots and lots of code that hasn't been convered from PathBuf to ReaderFactory, and I haven't quite proven to myself that it will support python file-like objects as well. But it probably will?

@itamarst itamarst marked this pull request as draft November 28, 2023 16:32
@@ -84,3 +91,41 @@ impl<'a, T: 'a + MmapBytesReader> From<&'a T> for ReaderBytes<'a> {
}
}
}

/// Create MmapBytesReaders for a specific "file", either locally or remotely.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces the use of PathBuf as a source of data to be read.


#[derive(Clone)]
#[cfg(feature = "csv")]
pub struct LazyCsvReader<'a> {
path: PathBuf,
Copy link
Contributor Author

@itamarst itamarst Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice there are multiple potential states in the original code.

Expected states:

  • path is glob, paths is empty.
  • path is specific path, paths is empty.
  • path is empty, paths is non-empty list of specific paths.

Unexpected and hopefully not present states:

  • path is glob, paths is non-empty list of specific paths.
  • path is specific path, paths is non-empty list of specific paths.
  • path is empty, paths is empty.

@ritchie46
Copy link
Member

I hope to get to this in a few days. A bit busy atm.

crates/polars-lazy/src/scan/file_list_reader.rs Outdated Show resolved Hide resolved
crates/polars-lazy/src/scan/ndjson.rs Show resolved Hide resolved
crates/polars-lazy/src/scan/ipc.rs Show resolved Hide resolved
@@ -199,7 +210,10 @@ impl LogicalPlanBuilder {
// We set the hive partitions of the first path to determine the schema.
// On iteration the partition values will be re-set per file.
if hive_partitioning {
file_info.init_hive_partitions(path.as_path())?;
if let Some(uri) = uri {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct and hive only applies to cloud URLs? This is why it's better not to have "a string" as a datastructure, and explicitly separate local paths from cloud URLs in the type system.

crates/polars-lazy/src/scan/csv.rs Outdated Show resolved Hide resolved
crates/polars-io/src/mmap.rs Outdated Show resolved Hide resolved
crates/polars-lazy/src/scan/file_list_reader.rs Outdated Show resolved Hide resolved
@itamarst
Copy link
Contributor Author

itamarst commented Dec 5, 2023

Thank you, I will try to update the sketch later this week.

@itamarst
Copy link
Contributor Author

itamarst commented Dec 6, 2023

I hadn't quite realized before the ways in which this is exposed as part of the public Rust API, mostly I was thinking about it from perspective of implementation. So now I'm thinking about that half more in my sketch.

@itamarst
Copy link
Contributor Author

itamarst commented Dec 6, 2023

OK, here's a new design. It turns LazyFileListReader into a single-state factory, rather than the 6 states in main or the 2 states in previous PR iteration. It also makes more explicit what happens when you load e.g. a string (personally I am uncomfortable with idea that globbing might be used by mistake so I think it's helpful to have a few extra more-explicit utility methods for loading).

@@ -44,9 +44,8 @@ impl LazyIpcReader {
}

impl LazyFileListReader for LazyIpcReader {
fn finish_no_glob(self) -> PolarsResult<LazyFrame> {
fn finish_no_glob(self, path: &Path) -> PolarsResult<LazyFrame> {
Copy link
Contributor Author

@itamarst itamarst Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and similar changes should actually be load_specific(self, location: ScanLocation).

@@ -112,10 +112,10 @@
//! use polars::prelude::*;
//! # fn example() -> PolarsResult<()> {
//!
//! let df = LazyCsvReader::new("reddit.csv")
//! let df = LazyCsvReader::new()
Copy link
Contributor Author

@itamarst itamarst Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff shows how the public API differs. By moving the path/paths/glob/etc to be loaded to the load_* methods (which consume the factory and finish the LazyFrame construction), there is no longer a requirement for weird internal state tracking as there was previously with path/paths.

@@ -233,20 +228,12 @@ impl<'a> LazyCsvReader<'a> {
/// Modify a schema before we run the lazy scanning.
///
/// Important! Run this function latest in the builder!
pub fn with_schema_modify<F>(self, f: F) -> PolarsResult<Self>
pub fn with_schema_modify<F>(mut self, f: F, location: &ScanLocation) -> PolarsResult<Self>
Copy link
Contributor Author

@itamarst itamarst Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kinda broken at this stage in the refactor, but insofar as it's only use in Polars is in the py-polars wrapper I assume it can be made to work with some further tweaking.

@itamarst
Copy link
Contributor Author

Sounds like Polars is going to stabilize APIs soon? If so, would love to get this in first since it will become much harder afterwards.

@stinodego
Copy link
Member

stinodego commented Feb 14, 2024

Sounds like Polars is going to stabilize APIs soon? If so, would love to get this in first since it will become much harder afterwards.

We do not consider the Rust API stable yet and are not planning to soon. Breaking changes are allowed in every PR.

The Python versions are going to 1.0.0 soon though, but that should not impact this PR.

Unfortunately, I will not be much help in reviewing this PR as this is not my expertise.

@stinodego stinodego changed the title Request for preliminary directional check-in: making file loading much better factored refactor(rust): Request for preliminary directional check-in: making file loading much better factored Feb 26, 2024
@github-actions github-actions bot added internal An internal refactor or improvement rust Related to Rust Polars labels Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants