
feat: make pandas and NumPy optional dependencies, don't require PyArrow for plotting with Polars/Modin/cuDF #3452

Merged
binste merged 24 commits into vega:main from MarcoGorelli:narwhalify on Jul 15, 2024

Conversation

MarcoGorelli
Contributor

@MarcoGorelli MarcoGorelli commented Jun 30, 2024

Suggested in #3445 (comment)

closes #3456
closes #3445
closes #3473
closes #3471 (this wasn't an issue to begin with)

Demo

Polars can work without PyArrow or pandas installed (EDIT: and without NumPy either!):

[screenshot: Polars demo without PyArrow, pandas, or NumPy installed]

PyArrow tables are still supported

(notice the .to_arrow)

[screenshot: PyArrow table demo, using .to_arrow]

Ibis is still supported, and now the Date dtype works too!

(previously, it would have been necessary to manually cast 'date' to 'datetime' type before plotting)

[screenshot: Ibis demo with Date dtype]

Performance

There's also a perf improvement from staying Polars-native where possible - for example, if I run

import polars as pl
import altair as alt
from datetime import date

# dataset: https://github.com/Mcompetitions/M6-methods/blob/main/assets_m6.csv
df = pl.read_csv('../scratch/assets.csv', try_parse_dates=True)
df = df.filter(pl.col('date') > date(2022, 11, 15))

def func():
    alt.Chart(df).mark_line().encode(
        x='date', y='price', color='symbol',
    ).to_json()

results = %timeit -o -q func()
results.best

in IPython, then I see:

  • Altair main branch: 0.0336 s
  • Altair MarcoGorelli:narwhalify branch: 0.0243 s (~1.4× faster)

That's for a (4475, 3)-shaped dataframe; for dataframes with many columns (especially with strings / categoricals) I'd expect the difference to grow.

Addendum: I hadn't noticed this originally, but this PR also makes Altair's import times about twice as fast
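
A rough way to sanity-check the import-time claim (a minimal sketch; absolute numbers will vary by machine and branch):

import time

start = time.perf_counter()
import altair  # noqa: E402 - deliberately imported after the timer starts
elapsed = time.perf_counter() - start

print(f"import altair took {elapsed:.3f}s")

(Python's -X importtime flag gives a more detailed per-module breakdown.)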

@MarcoGorelli MarcoGorelli changed the title WIP: make pandas optional dependency, don't require PyArrow for plotting with Polars/Modin/cuDF WIP: feat: make pandas optional dependency, don't require PyArrow for plotting with Polars/Modin/cuDF Jun 30, 2024
@MarcoGorelli MarcoGorelli changed the title WIP: feat: make pandas optional dependency, don't require PyArrow for plotting with Polars/Modin/cuDF feat: WIP make pandas optional dependency, don't require PyArrow for plotting with Polars/Modin/cuDF Jun 30, 2024
@dangotbanned
Member

dangotbanned commented Jun 30, 2024

@MarcoGorelli if all the tests are passing locally for you, try running these two:

hatch run generate-schema-wrapper
hatch run update-init-file

If there are any changes other than whitespace, you'll get a more informative error than https://github.com/vega/altair/actions/runs/9733301918/job/26860053484?pr=3452

Otherwise those two should clean everything up.

@dangotbanned
Member

dangotbanned commented Jun 30, 2024

https://github.com/vega/altair/actions/runs/9734437115/job/26862450764?pr=3452

@MarcoGorelli try this to run the tests on the matrix locally:

hatch test --all

The help for hatch is pretty useful as well:

hatch test -h

@dangotbanned
Member

@MarcoGorelli these were two minor things that I couldn't figure out how to request any other way.

fix: add missing pandas import to doctest

519     Examples
520     --------
521     >>> data = pd.DataFrame({'foo': ['A', 'B', 'A', 'B'],
UNEXPECTED EXCEPTION: NameError("name 'pd' is not defined")
Traceback (most recent call last):
  File "..\Local\hatch\env\virtual\.pythons\3.12\python\Lib\doctest.py", line 1361, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest altair.utils.core.parse_shorthand[0]>", line 1, in <module>
NameError: name 'pd' is not defined
C:\Users\danie\Documents\GitHub\altair\altair\utils\core.py:521: UnexpectedException
============================== short test summary info ==============================
FAILED altair/utils/core.py::altair.utils.core.parse_shorthand
========== 1 failed, 1250 passed, 1 skipped, 2 xfailed, 1 xpassed in 30.64s ==========
diff --git a/altair/utils/core.py b/altair/utils/core.py
index e6620146..d5d753db 100644
--- a/altair/utils/core.py
+++ b/altair/utils/core.py
@@ -520,6 +520,7 @@ def parse_shorthand(
 
     Examples
     --------
+    >>> import pandas as pd
     >>> data = pd.DataFrame({'foo': ['A', 'B', 'A', 'B'],
     ...                      'bar': [1, 2, 3, 4]})
 

refactor(typing): Reuse _typing alias as InferredVegaLiteType

diff --git a/altair/utils/core.py b/altair/utils/core.py
index e6620146..f3a5469a 100644
--- a/altair/utils/core.py
+++ b/altair/utils/core.py
@@ -42,6 +42,7 @@ else:
 if TYPE_CHECKING:
     from types import ModuleType
     import typing as t
+    from altair.vegalite.v5.schema._typing import StandardType_T as InferredVegaLiteType
     from altair.utils._dfi_types import DataFrame as DfiDataFrame
     from altair.utils.data import DataType
     from narwhals.typing import IntoExpr
@@ -199,9 +200,6 @@ TIMEUNITS = [
 ]
 
 
-InferredVegaLiteType = Literal["ordinal", "nominal", "quantitative", "temporal"]
-
-
 def infer_vegalite_type_for_pandas(
     data: object,
 ) -> InferredVegaLiteType | tuple[InferredVegaLiteType, list[Any]]:

@MarcoGorelli MarcoGorelli changed the title feat: WIP make pandas optional dependency, don't require PyArrow for plotting with Polars/Modin/cuDF feat: make pandas optional dependency, don't require PyArrow for plotting with Polars/Modin/cuDF Jul 2, 2024
@MarcoGorelli MarcoGorelli marked this pull request as ready for review July 2, 2024 15:26
@mattijn
Contributor

mattijn commented Jul 2, 2024

Thank you very much for making this Pull Request. That is very much appreciated!

This PR introduces a new dependency on narwhals. Narwhals is described as a lightweight and extensible compatibility layer between multiple dataframe libraries.
Narwhals can replace our hard dependency on pandas; the size of pandas alone is ~11 MB (not counting its subdependencies). This by itself is a very interesting development already.

Next to this, it enables support for native dataframes (pandas still included) without converting these to pyarrow Tables through the dataframe interchange protocol.

The latest release of Altair is ~850 kB and after this PR, Altair will have the following hard dependencies:

  • jinja2 : ~130 kB
  • jsonschema : ~90 kB
  • typing_extensions: ~35 kB
  • packaging : ~55 kB
  • narwhals : ~80 kB
  • numpy : 18 MB (also removed as part of this PR)

When using Altair with only remote endpoints (like APIs that are returning JSON data), I have the feeling that we do not need to rely on numpy. Therefore I'm still hopeful that we can reduce this hard dependency on numpy as well one day.

I have a few questions regarding this PR and narwhals.

  1. Just curious: why is the lower limit of pandas in narwhals version 1.1.5, when the supported lower limit of pandas in Altair before this PR was version 0.25?

  2. You serialize the nw.Date type to:

if dtype == nw.Date:
	columns.append(nw.col(name).dt.to_string("%Y-%m-%d"))

And change a test from:

"a": {"date": "2019-01-01T00:00:00"},

To:

"a": {"date": "2019-01-01"},

At first sight, I think we cannot introduce this breaking change. Does this change still adhere to the full ISO-8601 date representation required by JavaScript? See also past issues like #1027.

  3. Can you explain the behavior of nw.from_native()?
    When I look at this code:
data = nw.from_native(data, eager_only=True, strict=False)
if isinstance(data, nw.DataFrame):
	return data
if isinstance(data, DataFrameLike):
	from altair.utils.data import arrow_table_from_dfi_dataframe

	pa_table = arrow_table_from_dfi_dataframe(data)
	return nw.from_native(pa_table, eager_only=True)

I'm a bit surprised that the data object, as a result of passing it into nw.from_native, can be a non-narwhals dataframe. If it is not supported by narwhals, does it return the original data object?

  4. Maybe a naive question, but do you have any intention to support dataframes that implement the dataframe interchange protocol within narwhals?

  5. I'm not a type expert, but I cannot really understand why there need to be quotes around the type "IntoDataFrame", like in:

DataType: TypeAlias = Union[
    Dict[Any, Any], "IntoDataFrame", SupportsGeoInterface, DataFrameLike
]

I remember discussing this with @dangotbanned recently in #3431 (see #3431 (comment) onwards), but I can't find a clear answer on this topic anymore.

Thanks again for this PR! Really promising! (also this PR will supersede #3384)

@dangotbanned
Member

dangotbanned commented Jul 3, 2024

  5. I'm not a type expert, but I cannot really understand why there need to be quotes around the type "IntoDataFrame", like in:
DataType: TypeAlias = Union[
    Dict[Any, Any], "IntoDataFrame", SupportsGeoInterface, DataFrameLike
]

I remember discussing this with @dangotbanned recently in #3431 (see #3431 (comment) onwards), but I can't find a clear answer on this topic anymore.

Thanks for the ping @mattijn

The good news is that you can now trust the ruff config for managing cases like this (particularly the TCH rules); what @MarcoGorelli has written is correct.

https://github.com/MarcoGorelli/altair/blob/2df9422dca75a26eb16fd0c8030b6e89b4d4fa0f/altair/utils/data.py#L44-L66

All of the imports within the TYPE_CHECKING block must be quoted when used in a runtime alias definition. IntoDataFrame is used in: DataType, TIntoDataFrame, SampleReturnType.

AFAIK, this should be fine to move to a regular import though
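
For illustration, a stripped-down sketch of the pattern in question (a simplified stand-in, not Altair's actual definitions):

from typing import TYPE_CHECKING, Any, Dict, Union

from typing_extensions import TypeAlias

if TYPE_CHECKING:
    # Only available to the type checker; this import never runs at runtime.
    from narwhals.typing import IntoDataFrame

# This alias *is* evaluated at runtime, so an unquoted `IntoDataFrame` here
# would raise NameError. The quotes turn it into a forward reference that
# only the type checker resolves.
DataType: TypeAlias = Union[Dict[Any, Any], "IntoDataFrame"]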

dangotbanned added a commit to dangotbanned/altair that referenced this pull request Jul 3, 2024
The user will see this message:

```
altair\utils\data.py:17:5: TID251 `typing.Optional` is banned: Use `Union[T, None]` instead.
`typing.Optional` is likely to be confused with `altair.Optional`, which have similar but different semantic meaning.
See vega#3449
Found 1 error.
```

Partial fix for vega#3452 (comment)
@MarcoGorelli
Contributor Author

Thanks so much for your quick and careful reviews @mattijn and @dangotbanned 🙏 ! Much appreciated

  1. No reason :) I just lowered the minimum version of pandas in Narwhals - I set it to 0.25.3 rather than 0.25 though, as when I tried using 0.25 in CI it just froze indefinitely. 0.25.3 is also the one that Altair is currently testing against in CI - is it OK to set that as the minimum?

  2. Interesting, thanks, and I'm sorry for not having picked up on that! Quoting the issue

    If you were east of London when running this code, you wouldn't see this issue 😄.

    that might be why I didn't see it when testing it out 😳 .
Polars doesn't let us format Dates with time directives (because the underlying Rust crate chrono doesn't either) - but we can just cast to datetime first, and I've added a comment about why this is done. On my laptop, pyarrow.compute.strftime is about 200 times slower than casting Date to Datetime - so, in the grand scheme of things, the extra cast to ensure consistency is quite negligible and not the bottleneck of the sanitize function.

  3. That's because of strict=False in the first line of the function (see https://narwhals-dev.github.io/narwhals/api-reference/narwhals/#narwhals.from_native) - the idea is as follows (see also the sketch after this list):

    • strict=False: if the object isn't supported by Narwhals, then it just gets passed through as-is
    • strict=True: if the object isn't supported by Narwhals, then raise
  4. We'd like to have 2 levels of supported dataframes in Narwhals:

    • full API
    • basic, i.e. an extended and more user-friendly version of the dataframe interchange protocol. Anything which already supports the interchange protocol would be in scope. My main gripe with the interchange protocol is that it's bundled with the dataframe libraries themselves, which means it is super slow to change - for example, Jon reported "BUG: Conversion of datetime64[ns, UTC] to Arrow C format string is not implemented" (pandas-dev/pandas#54239), and it got fixed, but it's going to be several years until the minimum pandas version gets bumped up enough that the fix can be used... 😭
  5. thanks @dangotbanned for helping out with this one!
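
A minimal sketch of the two modes described in item 3 (assuming polars and narwhals are installed, and narwhals' API as it was at the time of this PR; the plain list stands in for any object narwhals doesn't recognise):

import narwhals as nw
import polars as pl

supported = pl.DataFrame({"a": [1, 2, 3]})
unsupported = [1, 2, 3]  # not a dataframe narwhals knows about

# A supported object gets wrapped as a narwhals DataFrame.
assert isinstance(nw.from_native(supported, eager_only=True), nw.DataFrame)

# strict=False: an unsupported object is passed through unchanged.
assert nw.from_native(unsupported, strict=False) is unsupported

# strict=True: an unsupported object raises instead.
try:
    nw.from_native(unsupported, strict=True)
except TypeError:
    print("raised, as described above")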

I've the feeling that we do not need to rely on numpy

"in for a penny, in for a pound" - I've added a commit to do this. NumPy is used very little, mostly isinstance checks and np.ceil, so I think it can already be made optional. Then Altair would truly be the lightest visualisation library in the ecosystem?

I've caught up with Polars devs, and they're on board with using Altair in polars.DataFrame.plot if the plots can be done directly without going via pandas, so we may want to have a conversation later about how that should look (personally, I think df.plot.line(x='x', y='y') as a shorthand for alt.Chart(df).mark_line().encode(x='x', y='y').interactive() would be really nice, and users would ❤️ love ❤️ it)
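
Returning to the NumPy point above: as an illustration, one common pattern for this kind of optional dependency (illustrative only, not necessarily the exact code in this PR) is to only treat a value as a NumPy object if NumPy has already been imported by something else:

import sys
from typing import Any


def is_numpy_object(obj: Any) -> bool:
    """Check for NumPy arrays/scalars without importing NumPy eagerly."""
    np = sys.modules.get("numpy")
    if np is None:
        # NumPy was never imported anywhere, so `obj` cannot be a NumPy object.
        return False
    return isinstance(obj, (np.ndarray, np.generic))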

@dangotbanned
Member

dangotbanned commented Jul 3, 2024

Thanks @MarcoGorelli

When I originally suggested narwhals I didn't expect you'd take an active role - and this PR is certainly far beyond that! 😄
I get the impression that this is beneficial for both projects, as any issues you've encountered have been quickly resolved in subsequent narwhals releases

I've caught up with Polars devs, and they're on board with using Altair in polars.DataFrame.plot if the plots can be done directly without going via pandas, so we may want to have a conversation later about how that should look (personally, I think df.plot.line(x='x', y='y') as a shorthand for alt.Chart(df).mark_line().encode(x='x', y='y').interactive() would be really nice, and users would ❤️ love ❤️ it)

I think most people would agree on the value of this, but probably best for another issue.
@joelostblom would probably be able to chime in, there was also this discussion #2789 (comment)

I would expect there would be more support if the implementation was different to the existing hvPlot (polars) / hvPlot accessor.
Currently, the burden would be on altair to define which kinds of charts to offer (e.g. line, scatter, etc.), but this isn't a 1-to-1 match with how many of the other plotting libraries work.

My suggestion would be for polars to define a protocol with some chart methods, with altair implementing those that make sense for quick EDA.
That way, these methods/functions may have less of a chance of leaking out into the rest of the altair api.

@MarcoGorelli
Contributor Author

Thanks @dangotbanned !

I would expect there would be more support if the implementation was different to the existing hvPlot (polars) / hvPlot accessor.

Definitely - plot is marked as unstable in Polars 😎 What I had in mind was just something like

import altair as alt
from polars import DataFrame


class PlotNameSpace:
    def __init__(self, df: DataFrame) -> None:
        self.df = df

    def line(self, *args, **kwargs):
        return alt.Chart(self.df).mark_line().encode(*args, **kwargs).interactive()

    def point(self, *args, **kwargs):
        return alt.Chart(self.df).mark_point().encode(*args, **kwargs).interactive()

    ...
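
Hypothetical usage, assuming Polars wired this class up as its DataFrame.plot namespace (an assumption for illustration only):

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

# shorthand for alt.Chart(df).mark_line().encode(x="x", y="y").interactive()
chart = df.plot.line(x="x", y="y")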

Anyway, we can figure out the details later, whether or not Altair decides to go ahead with Narwhals

(for the record, I think hvplot is amazing and its developers are really friendly, so I'd still advocate for it to be advertised in the Polars docs - but if Altair could provide a more lightweight alternative which doesn't require other heavy libraries, I think it'd be a more suitable default)

@mattijn
Contributor

mattijn commented Jul 3, 2024

Really awesome to see you were able to remove numpy as a hard dependency too! It's great to see this happening!

A few more comments while you might still have this PR open.

In my opinion, there is no need to serialize microseconds for type Date (this also reduces characters in the vega-lite specification). If we define local_iso_fmt_string = "%Y-%m-%dT%H:%M:%S" and use that when nw.Date applies, then when nw.Datetime applies we can use f"{local_iso_fmt_string}%.f" or something similar.

A single comment like "# using `strict=False` will return the `data` as-is if the object cannot be converted." would be appreciated, for my future self or other people who will be reading the following line:

data = nw.from_native(data, eager_only=True, strict=False)

This line can be removed:

# (~200x faster for pyarrow) and so isn't the bottleneck in this function.

Thanks again for pushing this!

@mattijn
Contributor

mattijn commented Jul 3, 2024

A bit off topic, but regarding polars.DataFrame.plot functionality: (1) you also might learn some lessons from this archived repo: https://github.com/altair-viz/pdvega, (2) there is also some work underway to extend the .interactive() method within altair #3394.

@MarcoGorelli MarcoGorelli changed the title feat: make pandas optional dependency, don't require PyArrow for plotting with Polars/Modin/cuDF feat: make pandas and NumPy optional dependencies, don't require PyArrow for plotting with Polars/Modin/cuDF Jul 3, 2024
Contributor

@mattijn mattijn left a comment


For me this is a +1; it would be great if another maintainer could do a review as well.

@binste
Contributor

binste commented Jul 3, 2024

This is exciting! :) I'd also like to have a quick look - hopefully I'll get to it this weekend. @jonmmease, if you have a moment you might want to have a look as well, as you've worked on the interchange protocol implementation and hence might be aware of edge cases we need to consider.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Jul 14, 2024

Thanks @jonmmease for your careful review, really appreciate it! 🙏

But when the "vegafusion" data transformer is enabled, VegaFusion is able to detect which columns are needed and it will request only the required columns when loading into pyarrow (See here).

Aah, that's amazing! I'd missed that.
To address this here, we've made a new release of Narwhals which includes interchange-level support - that is to say, if the input object implements the interchange protocol, then we let you inspect the schema

Like this, parse_shorthand doesn't need to do any data conversion, and so:

  • for Ibis users using the default data transformers, we fix the Date dtype bug
  • for Ibis users using the vegafusion data transformer, we preserve the status quo 😇

(and maybe polars.LazyFrame should also implement the interchange protocol just at the metadata level, but that's one for a separate discussion)

@MarcoGorelli MarcoGorelli marked this pull request as ready for review July 14, 2024 19:10
@mattijn
Contributor

mattijn commented Jul 15, 2024

Thanks for the changes @MarcoGorelli!

I still have one question about something which I cannot really understand. Given the following example:

import datetime
import polars as pl
import pyarrow as pa
import narwhals as nw

data_type_date = [{"date": datetime.date(2024,7,15)}]
local_iso_fmt_string = "%Y-%m-%dT%H:%M:%S"
nw_selector = [nw.col('date').cast(nw.Datetime).dt.to_string(local_iso_fmt_string)]

pl_df = pl.DataFrame(data_type_date)
nw_pl_df = nw.from_native(pl_df, eager_only=True)
nw_pl_df.select(nw_selector).rows(named=True)

This returns

[{'date': '2024-07-15T00:00:00'}]

But if I do:

py_tb = pa.Table.from_pylist(data_type_date)
nw_py_tb = nw.from_native(py_tb, eager_only=True)
nw_py_tb.select(nw_selector).rows(named=True)

It returns

[{'date': '2024-07-15T00:00:00.000000'}]

I expected these outputs to be equal.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Jul 15, 2024

thanks @mattijn for taking a look!

Libraries differ in how they interpret strftime directives - there's an issue about this in Arrow: apache/arrow#20146.
We also have a note about this in the Narwhals docs: https://narwhals-dev.github.io/narwhals/api-reference/expressions_dt/#narwhals.expression.ExprDateTimeNamespace.to_string

Practically speaking, if '2024-07-15' gets transformed to

  • '2024-07-15T00:00:00' for Polars and pandas-like libraries, and
  • '2024-07-15T00:00:00.000000' for PyArrow

does that cause any issues?

Note that, if the datetime contains fractional seconds, then the outputs match

In [2]: import datetime
   ...: import polars as pl
   ...: import pyarrow as pa
   ...: import narwhals as nw
   ...:
   ...: data_type_date = [{"date": datetime.datetime(2024,7,15,1,2,3,123456)}]
   ...: local_iso_fmt_string = "%Y-%m-%dT%H:%M:%S%.f"
   ...: nw_selector = [nw.col('date').cast(nw.Datetime).dt.to_string(local_iso_fmt_string)]
   ...:
   ...: pl_df = pl.DataFrame(data_type_date)
   ...: nw_pl_df = nw.from_native(pl_df, eager_only=True)
   ...: nw_pl_df.select(nw_selector).rows(named=True)
Out[2]: [{'date': '2024-07-15T01:02:03.123456'}]

In [3]: py_tb = pa.Table.from_pylist(data_type_date)
   ...: nw_py_tb = nw.from_native(py_tb, eager_only=True)
   ...: nw_py_tb.select(nw_selector).rows(named=True)
Out[3]: [{'date': '2024-07-15T01:02:03.123456'}]

@mattijn
Contributor

mattijn commented Jul 15, 2024

Yes for type nw.Datetime it works correctly. The question I have is about the nw.Date type.

One difference between your latest comment and my comment above is that, for me, the local_iso_fmt_string does not contain a %.f.

I thought it was an issue with pyarrow, but if I do:

import pyarrow.compute as pc
pc.strftime(py_tb['date'], format=local_iso_fmt_string).to_pylist()

It returns

['2024-07-15T00:00:00']

That seems right. And in the narwhals docs you link to (thanks for the link!) it is mentioned that:

for PyArrow, we replace "%S.%f" with "%S".

That makes me even more confused about your statement:

Practically speaking, if '2024-07-15' gets transformed to

  • '2024-07-15T00:00:00' for Polars and pandas-like libraries, and
  • '2024-07-15T00:00:00.000000' for PyArrow

Thanks for answering so promptly! Appreciated.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Jul 15, 2024

The difference is the .cast(nw.Datetime) 😉 If you start with a datetime, then PyArrow's strftime outputs fractional seconds even if we just specify '%S':

In [1]: import pyarrow as pa

In [2]: from datetime import date, datetime

In [3]: import pyarrow.compute as pc

In [4]: pc.strftime(pa.array([datetime(2020, 1, 1)]), '%Y-%m-%dT%H:%M:%S')
Out[4]:
<pyarrow.lib.StringArray object at 0x7fd513149cc0>
[
  "2020-01-01T00:00:00.000000"
]

So, in summary:

  • if we're starting with dates, then we cast to datetime and format with '%Y-%m-%dT%H:%M:%S'. PyArrow differs from other libraries here in that it outputs '.000000' too
  • if we start with datetimes, then we format with '%Y-%m-%dT%H:%M:%S%.f'

@mattijn
Contributor

mattijn commented Jul 15, 2024

Interesting... Let me see if I can find something on this in the pyarrow library. Thanks!

@mattijn
Contributor

mattijn commented Jul 15, 2024

We have to cast the nw.Date type to nw.Datetime only because polars does not support time directives, isn't it? Maybe we should check upfront whether we have to cast or not. @jonmmease, how important is this? Do you have any suggestions here? It's about this piece:

altair/altair/utils/core.py

Lines 451 to 456 in 8b4b3db

if dtype == nw.Date:
    # Polars doesn't allow formatting `Date` with time directives.
    # The date -> datetime cast is extremely fast compared with `to_string`
    columns.append(
        nw.col(name).cast(nw.Datetime).dt.to_string(local_iso_fmt_string)
    )

@MarcoGorelli
Contributor Author

MarcoGorelli commented Jul 15, 2024

Sure - if you're OK with a teeny-tiny bit of Polars-specific logic, this can work 👍

        if dtype == nw.Date and nw.get_native_namespace(data) is get_polars():
            # Polars doesn't allow formatting `Date` with time directives.
            # The date -> datetime cast is extremely fast compared with `to_string`
            columns.append(
                nw.col(name).cast(nw.Datetime).dt.to_string(local_iso_fmt_string)
            )
        elif dtype == nw.Date:
            columns.append(nw.col(name).dt.to_string(local_iso_fmt_string))
        elif dtype == nw.Datetime:
            columns.append(nw.col(name).dt.to_string(f"{local_iso_fmt_string}%.f"))

Just pushed a commit to do that

Contributor

@jonmmease jonmmease left a comment


To address this here, we've made a new release of Narwhals which includes interchange-level support - that is to say, if the input object implements the interchange protocol, then we let you inspect the schema

Amazing! This looks great, and having the narwhals metadata interface to all DataFrame interchange protocol objects is really nice.

@mattijn I don't think the difference in the inclusion of fractional seconds will make any difference to Vega in the browser, so I wouldn't worry about it. That said, @MarcoGorelli's small polars workaround looks fine as well, so I'm happy to keep that since it maintains identical results.

Thanks again for this great PR @MarcoGorelli, and for all of the work you put into narwhals to make it possible (even during the review process!). I'm ready for this to be merged, but want to pass it back over to you one last time in case any of my review comments shake something loose for you.

Please add a comment with a ✅ when you're ready for it to merge.

-        and column.describe_categorical["is_ordered"]
-        and column.describe_categorical["categories"] is not None
+        nw.is_ordered_categorical(column)
+        and not (categories := column.cat.get_categories()).is_empty()
Contributor


Oh, nice. I had missed that dropping Python 3.7 means we can use the walrus operator!

Contributor Author


😄 there are a few others if you're brave enough to try out auto-walrus

if isinstance(data, DataFrameLike):
    from altair.utils.data import arrow_table_from_dfi_dataframe

    pa_table = arrow_table_from_dfi_dataframe(data)
Contributor


Not a priority now (and I can't think of any specific scenarios that would benefit), was just curious.

For posterity, the ibis team is in the process of removing the hard pyarrow dependency for some use cases (ibis-project/ibis#9552), but it will still be required for many (all?) specific backends, I think.

# Early return if already a Narwhals DataFrame
return data
# Using `strict=False` will return `data` as-is if the object cannot be converted.
data = nw.from_native(data, eager_only=True, strict=False)
Contributor


Ok, so they get wrapped by narwhals fine (just as normal pandas DataFrames), but then we use from_native whenever we need to be able to tell the difference between a GeoDataFrame and a regular DataFrame?

Contributor


I was referring to the file name itself. It's currently test_dataframe_interchange.py, and I was suggesting renaming to test_to_values_narwhals.py.

@MarcoGorelli
Contributor Author

thanks @jonmmease ! review comments look great to me, much appreciated 🙏 ✅

@binste
Contributor

binste commented Jul 15, 2024

Thanks to everyone involved! This is very exciting and we should cut a release soon :) I'll open a new discussion for it.

@binste binste merged commit 2036b99 into vega:main Jul 15, 2024
11 checks passed
dangotbanned added a commit to dangotbanned/altair that referenced this pull request Jul 16, 2024
binste pushed a commit that referenced this pull request Jul 20, 2024
* refactor(typing): reduce type ignores in `api.py`

* feat(typing): adds `_is_undefined` guard

Direct copy from another PR, but is helpful here https://github.com/dangotbanned/altair/blob/d607c70824860ef3478d7d3996d26dc516d69369/altair/vegalite/v5/api.py#L504

* fix(typing): `ignore[index]` from #3455

* fix(typing): `ignore[arg-type]` from #3452

* refactor(typing): use `_is_undefined` in some cases

* refactor: reduce code duplication in `Chart.to_dict`

* fix(typing): Recover `TopLevelMixin.resolve_` return types

I removed these in 6adf564 due to how complex they were getting, but both `mypy` and `pyright` seem happy with this solution

* fix(typing): narrow annotated types for `_check_if_valid_subspec`

Following investigation during #3480 (comment)

* refactor: remove unused `_wrapper_classes` argument from `SchemaBase`

Fixes:
- #3480 (comment)
- #3480 (comment)
- #3480 (comment)