Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(python): Fix LazyFrame fetch method references #18033

Merged
merged 7 commits into from
Sep 25, 2024

Conversation

edwinvehmaanpera
Copy link
Contributor

@edwinvehmaanpera edwinvehmaanpera commented Aug 3, 2024

Fixes #17675 and other similar problems

Removing fetch from the miscellaneous.rst listing meant that the method was still listed as a method under LazyFrame, but appeared as a dead link and the method could not be found in the docs at all. This adds it back and fixes the links. Generally depreciated functions are still in the docs and there does not seem to be an elegant way of hiding a "public" function from autodocs entirely.

What fetch looks like under LazyFrame now:
image

Removed some other fetch references.

I am not sure what do about the fetch part in the user-guide. Should it be removed entirely or replaced with something else?

### Execution on a partial dataset
While you're writing, optimizing or checking your query on a large dataset, querying all available data may lead to a slow development process.
You can instead execute the query with the `.fetch` method. The `.fetch` method takes a parameter `n_rows` and tries to 'fetch' that number of rows at the data source. The number of rows cannot be guaranteed, however, as the lazy API does not count how many rows there are at each stage of the query.
Here we "fetch" 100 rows from the source file and apply the predicates.

@github-actions github-actions bot added documentation Improvements or additions to documentation python Related to Python Polars labels Aug 3, 2024
Copy link

codecov bot commented Aug 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.89%. Comparing base (262f7bc) to head (ea5cd0c).
Report is 25 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18033      +/-   ##
==========================================
+ Coverage   79.87%   79.89%   +0.01%     
==========================================
  Files        1519     1521       +2     
  Lines      205839   206678     +839     
  Branches     2898     2906       +8     
==========================================
+ Hits       164416   165120     +704     
- Misses      40875    41010     +135     
  Partials      548      548              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Aug 4, 2024

We actually don't want to put the fetch docs reference back in (which is a bit unusual, but it's a debug function that has acted as a major footgun, so we prefer to simply remove it from the docs so people don't come across it at all). See: #17642 (comment)

@edwinvehmaanpera
Copy link
Contributor Author

edwinvehmaanpera commented Aug 4, 2024

In that case should it be excluded/delisted entirely somehow or should I just put it back and leave it as-is? Having the method listed but not documented can be confusing.

What about the other changes and the user-guide?

@coastalwhite
Copy link
Collaborator

coastalwhite commented Aug 4, 2024

It should probably be removed from the notes as it is not true anymore. LazyFrame.head will also get pushed into the scan if possible.

@edwinvehmaanpera
Copy link
Contributor Author

Removed fetch docs reference again. As well as the fetch section in the user-guide.

@ritchie46
Copy link
Member

Can you do a rebase? @rodrigogiraoserrao can you review this one?

Co-authored-by: Rodrigo Girão Serrão <5621605+rodrigogiraoserrao@users.noreply.github.com>
@edwinvehmaanpera
Copy link
Contributor Author

Are we good to go now? @rodrigogiraoserrao

Copy link
Collaborator

@rodrigogiraoserrao rodrigogiraoserrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me; thanks for the help.

@@ -53,27 +53,18 @@ We look at [streaming in more detail here](streaming.md).

While you're writing, optimizing or checking your query on a large dataset, querying all available data may lead to a slow development process.

You can instead execute the query with the `.fetch` method. The `.fetch` method takes a parameter `n_rows` and tries to 'fetch' that number of rows at the data source. The number of rows cannot be guaranteed, however, as the lazy API does not count how many rows there are at each stage of the query.
Instead, you can scan a subset of your partitions or use `.head`/`.collect` at the beginning and end of your query, respectively.
Keep in mind that the results of aggregations and filters on subsets of your data may not be representative of the result you would get on the full data.
Copy link
Member

@ritchie46 ritchie46 Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heads that are after an aggregation don't influence an aggregation. I don't think this sentence makes much sense. Of course head followed by aggregation influences a result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion is to have head at the beginning of your query so in front of the aggregations.

Copy link
Contributor Author

@edwinvehmaanpera edwinvehmaanpera Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. in the given example

q9 = (
    pl.scan_csv(f"docs/assets/data/reddit.csv")
    .head(10)
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .fetch(n_rows=int(100))
    .collect()
)

Copy link
Contributor Author

@edwinvehmaanpera edwinvehmaanpera Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it too obvious? Maybe a little, but I have had these types of operations fail in ways I didn't expect. Calling head(10000) and returning 0 rows might confuse someone. It is just single sentence so I feel like it is justified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, with the accompanied query it makes sense. 👍

@ritchie46 ritchie46 merged commit 39d63aa into pola-rs:main Sep 25, 2024
16 of 17 checks passed
@ritchie46
Copy link
Member

Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

polars.LazyFrame.head recommends using fetch()
6 participants