-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
from_polars
#370
from_polars
#370
Conversation
Trying to fix the build error with numpy 2.0. In the meantime, you are missing the dependency in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting to see some Pandas/Polars compatibility in action! I left a few comments as ideas to possibly decrease code duplication, if you haven't yet considered them.
f5ad514
to
873a556
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we are breaking new grounds by supporting polars, pandas and NumPy in here. I think/hope we might find a slightly easier-to-maintain approach still. Here are some thoughts:
I think some of the difficulties originate from having the mapping of different backend libraries intertwined with the business logic of this library. Things might get easier if the mapping is more cleanly separated behind wrapper classes. The CategoricalMatrix
takes a cat_vec
as an argument. Right now, the __init__
method maps all possible input types to the disassembled representation of a tuple of NumPy arrays+dtype (namely indices
, categories
, and _input_dtype
). The cat
property then reassembles these with a similar mapping. Instead, one may consider wrapping the cat_vec
in a class that exposes exactly the data that we might want to derive from cat_vec
as properties/functions. These functions would encapsulate the necessary mapping logic. Something like:
class CatVec:
_wrapped: np.ndarray | pd.Series | pl.Series
@property
def categories(self) -> np.ndarray:
# get categories from `self._wrapped` on the fly
...
@property
def codes(self) -> np.ndarray: ...
@property
def shape(self) -> tuple[int, ...]: ...
def contains_missing(self) -> bool: ...
The CategoricalMatrix.index
, CategoricalMatrix.shape
, etc members could be replaced by properties that simply call through to the (privately) stored CatVec
instance. Might this work better?
@@ -255,7 +283,7 @@ class CategoricalMatrix(MatrixBase): | |||
|
|||
def __init__( | |||
self, | |||
cat_vec: Union[list, np.ndarray, pd.Categorical], | |||
cat_vec, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy might actually be useful here. It is quite good with union types and narrowing them in if-statements these days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy
doesn't play nice with optional imports. I tried conditioning imports on type hinting to no avail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using TYPE_CHECKING
didn't work? I think it is reasonable to assume both Pandas and Polars are installed in the dev environment, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using TYPE_CHECKING didn't work?
No. :\ I may have done it wrong though.
I appreciate that the current code is messy and wrapper classes would help streamline it, but it'll look better once we are past the transition phase. We currently have a At some point, we will do away with the If your concern is maintenance, it's all temporary, so it might not warrant a lot of infrastructure. |
As you prefer |
) | ||
indices.append(dense_mxidx) | ||
if dense_columns: | ||
matrices.append(_dense_matrix(df, dense_columns, dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem in glum comes from here. It should be df[dense_columns]. However, using column names doesn't work here because of potential duplicates. This is why we had a logic to use the column index.
This PR adds basic support for Polars.
On the surface level, it adds a
from_polars
function, which mirrorsfrom_pandas
(except in that it doesn't allow for object columns, which are painful to work with in Polars). Because accessor methods differ between Pandas and Polars, it also restructures theCategoricalMatrix
so it stores categories instead of a categorical array (which should also lower its memory footprint).This PR also streamlines
from_pandas
and collects tests into a dedicated file.I haven't extended the formula interface, which will be left to a future PR.