feat: make pandas and NumPy optional dependencies, don't require PyArrow for plotting with Polars/Modin/cuDF #3452

Merged Jul 15, 2024 · 24 commits
Changes from 1 commit
fix ibis-date32 bug too, make sure to only convert to arrow once
MarcoGorelli committed Jul 11, 2024
commit e91ed4d9783fd1910df1132234bc9628e880ce12
10 changes: 6 additions & 4 deletions altair/utils/_vegafusion_data.py
@@ -12,8 +12,9 @@
Callable,
)

import narwhals as nw

from altair.utils._importers import import_vegafusion
from altair.utils.core import DataFrameLike
from altair.utils.data import (
DataType,
ToValuesReturnType,
@@ -22,9 +23,9 @@
)
from altair.vegalite.data import default_data_transformer


if TYPE_CHECKING:
from narwhals.typing import IntoDataFrame
from altair.utils.core import DataFrameLike
from vegafusion.runtime import ChartState # type: ignore

# Temporary storage for dataframes that have been extracted
@@ -70,9 +71,10 @@ def vegafusion_data_transformer(
"""VegaFusion Data Transformer"""
if data is None:
return vegafusion_data_transformer
elif isinstance(data, DataFrameLike) and not isinstance(data, SupportsGeoInterface):
elif isinstance(data, nw.DataFrame) and not isinstance(data, SupportsGeoInterface):
table_name = f"table_{uuid.uuid4()}".replace("-", "_")
extracted_inline_tables[table_name] = data
# vegafusion doesn't support Narwhals, so we extract the native object.
extracted_inline_tables[table_name] = nw.to_native(data)
return {"url": VEGAFUSION_PREFIX + table_name}
else:
# Use default transformer for geo interface objects
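The transformer branch above stores the unwrapped native frame under a generated table name and hands VegaFusion a URL-style reference instead of inline values. A minimal pure-Python sketch of that pattern (the prefix value and function name here are hypothetical; the real constants live in `altair/utils/_vegafusion_data.py`):

```python
import uuid

# Hypothetical stand-ins: the actual prefix value is not shown in this diff.
VEGAFUSION_PREFIX = "vegafusion+dataset://"
extracted_inline_tables = {}

def register_inline_table(native_frame):
    """Store the *native* frame under a unique name and return a URL reference.

    The real code unwraps first (`nw.to_native(data)`) because VegaFusion
    doesn't understand Narwhals wrappers.
    """
    table_name = f"table_{uuid.uuid4()}".replace("-", "_")
    extracted_inline_tables[table_name] = native_frame
    return {"url": VEGAFUSION_PREFIX + table_name}

spec = register_inline_table({"a": [1, 2, 3]})
assert spec["url"].startswith(VEGAFUSION_PREFIX)
```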
7 changes: 4 additions & 3 deletions altair/utils/core.py
@@ -468,12 +468,13 @@ def sanitize_narwhals_dataframe(
return data.select(columns)


def narwhalify(data: DataType) -> nw.DataFrame:
"""Wrap `data` in `narwhals.DataFrame`.
def narwhalify(data: DataType) -> nw.DataFrame[Any]:
"""Wrap `data` in `narwhals.DataFrame` (if possible).

If `data` is not supported by Narwhals, but it is convertible
to a PyArrow table, then first convert to a PyArrow Table,
and then wrap in `narwhals.DataFrame`.
If it can't even be converted to a PyArrow Table, return as-is.
"""
# Using `strict=False` will return `data` as-is if the object cannot be converted.
data = nw.from_native(data, eager_only=True, strict=False)
Contributor:

What does narwhals do with geopandas DataFrames? Are these just treated the same as pandas?

Contributor Author:

For now they're going down different paths:

https://github.com/MarcoGorelli/altair/blob/5d54bc4263dd5bdba600819b316dda8c592bdd8f/altair/utils/data.py#L316-L334

Maybe this can all be unified further, although the pandas/geointerface paths are doing some very library-specific things

Contributor:

Ok, so they get wrapped by narwhals fine (just as normal pandas DataFrames), but then we use from_native whenever we need to be able to tell the difference between a GeoDataFrame and a regular DataFrame?

Contributor Author:

that's right!

it's a (practically) free operation: narwhals.DataFrame just holds a reference to the original dataframe, with no copying or conversion involved. Getting the original dataframe back out is just a matter of accessing 1-2 properties

https://github.com/narwhals-dev/narwhals/blob/a5276cd6e80781c61143c71041db81bd700a0e12/narwhals/translate.py#L41-L75
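The "free wrapper" point above can be sketched in a few lines. This is not narwhals itself, just a toy class showing why wrap/unwrap costs nothing: the wrapper only stores a reference to the native frame.

```python
class FrameWrapper:
    """Toy model of the wrap/unwrap pattern: no copy, no conversion."""

    def __init__(self, native):
        self._native = native  # only a reference is kept

    def to_native(self):
        return self._native    # unwrapping is attribute access

native = {"a": [1, 2, 3]}       # stand-in for a pandas/polars frame
wrapped = FrameWrapper(native)
assert wrapped.to_native() is native  # the very same object comes back
```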

@@ -634,7 +635,7 @@ def parse_shorthand(
if "type" not in attrs and is_data_type(data):
data_nw = narwhalify(data)
Contributor:

Something important that I think we're losing here is that previously we were checking the types of columns in DataFrame Interchange Protocol DataFrames without loading them all into memory as pyarrow (which is what narwhalify does for types that narwhals doesn't support). Ibis has an optimization where calling dfi.get_column_by_name is a metadata-only operation that doesn't trigger loading all columns into memory.

When used with plain Altair, the full dataset is loaded into memory anyway during the to_values() calculation. But when the "vegafusion" data transformer is enabled, VegaFusion is able to detect which columns are needed and it will request only the required columns when loading into pyarrow (See here).

So unless/until narwhals can support wrapping DataFrame Interchange Protocol DataFrames directly, I think we need to avoid the arrow conversion here and fall back to the legacy infer_vegalite_type_for_dfi_column behavior for this case.
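The Ibis optimization the comment describes can be modeled with a toy interchange object (this is not Ibis's implementation, just an illustration of "metadata-only column access"): type inspection should not trigger a load of the column's data.

```python
class LazyColumn:
    def __init__(self, name, loader):
        self.name = name
        self._loader = loader   # only invoked when values are actually needed

    def dtype(self):
        return "int64"          # metadata is available without loading data

class LazyInterchangeFrame:
    """Toy DataFrame-Interchange-style object: get_column_by_name is cheap."""

    def __init__(self, columns):
        self._columns = columns
        self.loads = 0          # counts how often data was materialized

    def _load(self, name):
        self.loads += 1
        return self._columns[name]

    def get_column_by_name(self, name):
        return LazyColumn(name, lambda: self._load(name))

dfi = LazyInterchangeFrame({"date": [...], "value": [...]})
col = dfi.get_column_by_name("value")
assert col.dtype() == "int64" and dfi.loads == 0  # type known, nothing loaded
```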

unescaped_field = attrs["field"].replace("\\", "")
if isinstance(data_nw, nw.DataFrame) and unescaped_field in data_nw.columns:
if unescaped_field in data_nw.columns:
column = data_nw[unescaped_field]
if column.dtype in {nw.Object, nw.Unknown} and _is_pandas_dataframe(data):
attrs["type"] = infer_vegalite_type_for_pandas(nw.to_native(column))
39 changes: 19 additions & 20 deletions altair/utils/data.py
@@ -202,9 +202,8 @@ def sample(
else:
# Maybe this should raise an error or return something useful?
return None
try:
data = narwhalify(data)
except TypeError:
data = narwhalify(data)
if not isinstance(data, nw.DataFrame):
# Maybe this should raise an error or return something useful? Currently,
# if data is of type SupportsGeoInterface it lands here
return None
@@ -325,29 +324,29 @@ def _to_text_kwds(prefix: str, extension: str, filename: str, urlpath: str, /) -
def to_values(data: DataType) -> ToValuesReturnType:
"""Replace a DataFrame by a data model with values."""
check_data_type(data)
if isinstance(data, SupportsGeoInterface):
if _is_pandas_dataframe(data):
data = sanitize_pandas_dataframe(data)
data_native = nw.to_native(data, strict=False)
if isinstance(data_native, SupportsGeoInterface):
if _is_pandas_dataframe(data_native):
data_native = sanitize_pandas_dataframe(data_native)
# Maybe the type could be further clarified here that it is
# SupportGeoInterface and then the ignore statement is not needed?
data_sanitized = sanitize_geo_interface(data.__geo_interface__) # type: ignore[arg-type]
data_sanitized = sanitize_geo_interface(data_native.__geo_interface__)
return {"values": data_sanitized}
elif _is_pandas_dataframe(data):
data = sanitize_pandas_dataframe(data)
return {"values": data.to_dict(orient="records")}
elif isinstance(data, dict):
if "values" not in data:
elif _is_pandas_dataframe(data_native):
data_native = sanitize_pandas_dataframe(data_native)
return {"values": data_native.to_dict(orient="records")}
elif isinstance(data_native, dict):
if "values" not in data_native:
msg = "values expected in data dict, but not present."
raise KeyError(msg)
return data
try:
data = narwhalify(data)
except TypeError as exc:
return data_native
elif isinstance(data, nw.DataFrame):
data = sanitize_narwhals_dataframe(data)
return {"values": data.rows(named=True)}
else:
# Should never reach this state as tested by check_data_type
msg = f"Unrecognized data type: {type(data)}"
raise ValueError(msg) from exc
data = sanitize_narwhals_dataframe(data)
return {"values": data.rows(named=True)}
raise ValueError(msg)


def check_data_type(data: DataType) -> None:
@@ -435,7 +434,7 @@ def arrow_table_from_dfi_dataframe(dfi_df: DataFrameLike) -> pa.Table:
# has more control over the conversion, and may have broader compatibility.
# This is the case for Polars, which supports Date32 columns in direct conversion
# while pyarrow does not yet support this type in from_dataframe
for convert_method_name in ("arrow", "to_arrow", "to_arrow_table"):
for convert_method_name in ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow"):
convert_method = getattr(dfi_df, convert_method_name, None)
if callable(convert_method):
result = convert_method()
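The change above adds `to_pyarrow` (Ibis's converter name) to the list of duck-typed conversion methods tried before falling back to the interchange protocol. A simplified sketch of that lookup loop, using a fake table class rather than real Ibis (the real function falls back to `pa.interchange.from_dataframe` instead of raising):

```python
def to_arrow_table(frame):
    # Prefer the library's own Arrow converter, trying known names in order.
    for name in ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow"):
        method = getattr(frame, name, None)
        if callable(method):
            return method()
    # Simplification: the real code falls back to the interchange protocol here.
    raise TypeError("no known Arrow conversion method")

class FakeIbisTable:
    """Fake: real Ibis tables expose to_pyarrow, the name this commit adds."""

    def to_pyarrow(self):
        return "arrow table"

assert to_arrow_table(FakeIbisTable()) == "arrow table"
```

Letting the library drive its own conversion matters here because Polars supports Date32 columns directly while `pyarrow.interchange.from_dataframe` does not yet.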
15 changes: 11 additions & 4 deletions altair/vegalite/v5/api.py
@@ -13,6 +13,7 @@
from .schema import core, channels, mixins, Undefined, SCHEMA_URL

from altair.utils import Optional
from altair.utils.data import narwhalify as _narwhalify
from .data import data_transformers
from ... import utils
from ...expr import core as _expr_core
@@ -1015,11 +1016,17 @@ def to_dict(
# TopLevelMixin instance does not necessarily have copy defined but due to how
# Altair is set up this should hold. Too complex to type hint right now
copy = self.copy(deep=False) # type: ignore[attr-defined]
original_data = getattr(copy, "data", Undefined)
copy.data = _prepare_data(original_data, context)

if original_data is not Undefined:
context["data"] = original_data
data = getattr(copy, "data", Undefined)
try:
data = _narwhalify(data) # type: ignore[arg-type]
except TypeError:
# Non-narwhalifiable type still supported by Altair, such as dict.
pass
copy.data = _prepare_data(data, context)

if data is not Undefined:
context["data"] = data

# remaining to_dict calls are not at top level
context["top_level"] = False
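The try/except in `to_dict` above lets non-dataframe inputs that Altair still accepts (like plain dict specs) pass through unwrapped. A self-contained sketch of that pattern, with a stub standing in for Altair's `narwhalify`:

```python
def narwhalify_stub(data):
    # Stand-in for altair's narwhalify: dataframe-likes wrap, dicts raise.
    if isinstance(data, dict):
        raise TypeError("dict has no dataframe interface")
    return ("wrapped", data)

def prepare(data):
    try:
        data = narwhalify_stub(data)
    except TypeError:
        pass  # non-narwhalifiable types (e.g. dict specs) stay as-is
    return data

assert prepare({"values": [1, 2]}) == {"values": [1, 2]}
assert prepare([1, 2]) == ("wrapped", [1, 2])
```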
1 change: 1 addition & 0 deletions pyproject.toml
@@ -67,6 +67,7 @@ all = [
dev = [
"hatch",
"ruff>=0.5.1",
"ibis-framework",
"ipython",
"pandas>=0.25.3",
"pytest",
7 changes: 5 additions & 2 deletions tests/utils/test_dataframe_interchange.py
Contributor:

Let's rename this test file to something like test_to_values_narwhals.

Contributor Author:

Ah I can't see the original line you'd commented on any more...any chance you remember which test it was? Sorry about this

Contributor:

I was referring to the file name itself. It's currently test_dataframe_interchange.py, and I was suggesting renaming to test_to_values_narwhals.py.

@@ -3,6 +3,7 @@
import pandas as pd
import pytest
import sys
import narwhals.stable.v1 as nw

try:
import pyarrow as pa
@@ -36,8 +37,9 @@ def test_arrow_timestamp_conversion():
"value": [102, 129, 139],
}
pa_table = pa.table(data)
nw_frame = nw.from_native(pa_table)

values = to_values(pa_table)
values = to_values(nw_frame)
expected_values = {
"values": [
{"date": "2004-08-01T00:00:00.000000", "value": 102},
@@ -54,8 +56,9 @@ def test_duration_raises():
df = pd.DataFrame(td).reset_index()
df.columns = ["id", "timedelta"]
pa_table = pa.table(df)
nw_frame = nw.from_native(pa_table)
with pytest.raises(ValueError) as e: # noqa: PT011
to_values(pa_table)
to_values(nw_frame)

# Check that exception mentions the duration[ns] type,
# which is what the pandas timedelta is converted into
15 changes: 15 additions & 0 deletions tests/vegalite/v5/test_api.py
@@ -1,6 +1,8 @@
"""Unit tests for altair API"""

from datetime import date
import io
import ibis
import sys
import json
import operator
@@ -1082,3 +1084,16 @@ def test_polars_with_pandas_nor_pyarrow(monkeypatch: pytest.MonkeyPatch):
assert "pandas" not in sys.modules
assert "pyarrow" not in sys.modules
assert "numpy" not in sys.modules
Contributor:

I hadn't seen monkeypatch.delitem before, that's really handy



def test_ibis_with_date_32():
df = pl.DataFrame(
{"a": [1, 2, 3], "b": [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)]}
)
tbl = ibis.memtable(df)
result = alt.Chart(tbl).mark_line().encode(x="a", y="b").to_dict()
assert next(iter(result["datasets"].values())) == [
{"a": 1, "b": "2020-01-01T00:00:00.000000"},
{"a": 2, "b": "2020-01-02T00:00:00.000000"},
{"a": 3, "b": "2020-01-03T00:00:00.000000"},
]