Can't exclude datetime[ns] by targeting pl.DateTime #5300

Closed
2 tasks done
thomasaarholt opened this issue Oct 22, 2022 · 14 comments · Fixed by #9641
Labels
A-timeseries Area: date/time functionality enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@thomasaarholt
Contributor

thomasaarholt commented Oct 22, 2022

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I was trying to drop a datetime[ns] column today, so I did pl.all().exclude(pl.Datetime), but this does not drop it. It does drop datetime[us] columns.

Reproducible example

import pandas as pd
import polars as pl
(
    pl.from_pandas(pd.date_range("2021-01-01", "2021-01-02"))
    .to_frame()
    .with_column(pl.lit(1))
    .select(pl.all().exclude(pl.Datetime))
)
# leaves both datetime and literal columns

Expected behavior

Works for the regular polars-created type

(
    pl.Series(name="foo", values=[], dtype=pl.Datetime)
    .to_frame()
    .with_column(pl.lit(1))
    .select(pl.all().exclude(pl.Datetime))
)

Installed versions

---Version info---
Polars: 0.14.22
Index type: UInt32
Platform: Linux-5.10.124-linuxkit-aarch64-with-glibc2.28
Python: 3.9.15 (main, Oct 14 2022, 00:47:57) 
[GCC 8.3.0]
---Optional dependencies---
pyarrow: 8.0.0
pandas: 1.4.4
numpy: 1.23.4
fsspec: 2022.10.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: 3.6.1
@thomasaarholt thomasaarholt added bug Something isn't working python Related to Python Polars labels Oct 22, 2022
@thomasaarholt
Contributor Author

Selecting has the same issue:

import pandas as pd
import polars as pl
(
    pl.from_pandas(pd.date_range("2021-01-01", "2021-01-02"))
    .to_frame()
    .with_column(pl.lit(1))
    .select(pl.col(pl.Datetime))
)
# Empty dataframe

@zundertj
Collaborator

zundertj commented Oct 22, 2022

pl.Datetime takes in a tu (time unit) argument: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.datatypes.Datetime.html

The default is us. When you create the regular polars created type:

>>> pl.Series(name="foo", values=[], dtype=pl.Datetime)
shape: (0,)
Series: 'foo' [datetime[μs]]
[
]

it returns us, because that is the default, so the exclude takes it out.

However, converting from pandas results in a different time unit, ns:

>>> pl.from_pandas(pd.date_range("2021-01-01", "2021-01-02"))
shape: (2,)
Series: '' [datetime[ns]]
[
        2021-01-01 00:00:00
        2021-01-02 00:00:00
]

and hence the exclude does not pick up this type.

You can fix your first example by doing:

import pandas as pd
import polars as pl
(
    pl.from_pandas(pd.date_range("2021-01-01", "2021-01-02"))
    .to_frame()
    .with_column(pl.lit(1))
    .select(pl.all().exclude(pl.Datetime("ns")))  # <<<--- note the "ns" here!
)

@thomasaarholt
Contributor Author

Ah! That's interesting! Hm. I think I need to sleep on whether I'm fine with this the way it is 😅

@thomasaarholt
Contributor Author

thomasaarholt commented Oct 23, 2022

Having slept on this, I think that it is currently quite difficult to exclude all datetime columns. If you don't know the incoming datetime dtype, you either have to exhaustively drop all combinations of time_unit and time_zone, which comes out to 1833 combinations, or target all other dtypes and call df.select(all_other_dtypes).

pl.DateTimeAllDtype, pl.DateTimeParentDType or something similar, would be a useful dtype in this case.

Or we could specify None for both arguments in pl.Datetime, and depending on whether you're using pl.col(pl.Datetime()).cast() or pl.select(pl.col(pl.Datetime())), it would either default to "us"/"UTC" or target all Datetimes, respectively. This could also be the behaviour if you're passing pl.Datetime (as opposed to pl.Datetime()), which I was.

I prefer some variation of this second suggestion.

@thomasaarholt
Contributor Author

The motivation is that I have several pl.Datetime columns with time_unit="ns" and time_zone="UTC" that I needed to drop before converting all non-number columns to pl.Categorical for later input straight into xgboost as an arrow dataset. In this case, I can

@zundertj
Collaborator

zundertj commented Oct 23, 2022

You can provide a list of data types, so for the three time units, you could easily check:

select(pl.all().exclude([pl.Datetime("ms"), pl.Datetime("ns"), pl.Datetime("us")]))

That would be unwieldy with all possible timezones, although I guess in practice you won't have a mixture of timezones (or at least I hope so).

Alternatively, if you are ok with determining this in eager mode:

non_datetimes = [c for c, tp in df.schema.items() if not isinstance(tp, pl.Datetime)]

I do agree neither of these solutions is very elegant.

@mcrumiller
Contributor

mcrumiller commented Oct 25, 2022

I feel excluding all pl.Datetimes makes the most sense here. The call .exclude(pl.Datetime) doesn't specify which type of datetime, and you might not know what the type is at compile time (where #4982 would help a lot), and when the base type has multiple subtypes, and you reference the base type, it feels it should apply to all subtypes as well.

@alexander-beedie
Collaborator

alexander-beedie commented Oct 25, 2022

I have some ideas here... should have some time in the evening over the next few days to take a look (and certainly at the weekend if things get busy during the week). Agree that there needs to be a simple way to match "all Datetimes" - or, equally, "all Durations".

@ritchie46
Member

I was also thinking about accepting a wildcard "*" as timezone that evaluates to always true in comparison.

@zundertj
Collaborator

zundertj commented Oct 26, 2022

I feel excluding all pl.Datetimes makes the most sense here. The call .exclude(pl.Datetime) doesn't specify which type of datetime, and you might not know what the type is at compile time (where #4982 would help a lot), and when the base type has multiple subtypes, and you reference the base type, it feels it should apply to all subtypes as well.

I agree, but then we have to change pl.Datetime to not default to a particular time unit, which may have implications elsewhere. But there is hopefully a way to make that work.

Edit: just seen the response by @ritchie46 : would be even better if we can not have the wildcard, it is additional syntax and thing to remember. But not sure if it is feasible.

@ritchie46
Member

Edit: just seen the response by @ritchie46 : would be even better if we can not have the wildcard, it is additional syntax and thing to remember. But not sure if it is feasible.

We would only set that wildcard in the exclude logic if no timezone is given. A user does not have to know.
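
To make the wildcard idea concrete, here is a toy model in plain Python. It is not Polars code and says nothing about how Polars implements dtype matching: ToyDatetime, WILDCARD, matches and exclude_pattern are all hypothetical names invented for this sketch. Applying the wildcard to the time unit as well as the time zone is also an assumption; the comment above only mentions the time zone.

```python
from dataclasses import dataclass
from typing import Optional

WILDCARD = "*"  # hypothetical sentinel meaning "matches any value"


@dataclass(frozen=True)
class ToyDatetime:
    """Stand-in for a parameterised datetime dtype (not a Polars class)."""

    time_unit: str = "us"
    time_zone: Optional[str] = None


def matches(pattern: ToyDatetime, actual: ToyDatetime) -> bool:
    """A wildcard field always compares true; other fields must be equal."""
    unit_ok = pattern.time_unit in (WILDCARD, actual.time_unit)
    zone_ok = pattern.time_zone in (WILDCARD, actual.time_zone)
    return unit_ok and zone_ok


def exclude_pattern(
    time_unit: Optional[str] = None, time_zone: Optional[str] = None
) -> ToyDatetime:
    """Fill in the wildcard for anything the user did not specify."""
    return ToyDatetime(
        time_unit=WILDCARD if time_unit is None else time_unit,
        time_zone=WILDCARD if time_zone is None else time_zone,
    )


def exclude(schema: dict, pattern: ToyDatetime) -> list:
    """Return the column names that do not match the pattern."""
    return [
        name
        for name, dtype in schema.items()
        if not (isinstance(dtype, ToyDatetime) and matches(pattern, dtype))
    ]


schema = {
    "a": ToyDatetime("ns", "UTC"),
    "b": ToyDatetime("us", None),
    "c": "Int64",
}

kept_all = exclude(schema, exclude_pattern())      # nothing specified
kept_ns = exclude(schema, exclude_pattern("ns"))   # only the unit specified
print(kept_all, kept_ns)
```

The user never writes the wildcard here; it is only filled in inside exclude_pattern, which mirrors the point that it would be an internal detail of the exclude logic rather than extra syntax to remember.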

@arturdaraujo

I think this issue should be closed due to inactivity

@alexander-beedie
Collaborator

alexander-beedie commented Feb 16, 2023

FYI: following #6425 you can now exclude all Datetime cols like so...

from polars.datatypes import DATETIME_DTYPES
df.select( pl.exclude(DATETIME_DTYPES) )

...though wildcard support for timezones is still outstanding.

The various "official" dtype groups available are:

DATETIME_DTYPES
DURATION_DTYPES
FLOAT_DTYPES
INTEGER_DTYPES
NUMERIC_DTYPES
TEMPORAL_DTYPES

@zundertj zundertj added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working labels Mar 7, 2023
@zundertj
Collaborator

zundertj commented Mar 7, 2023

Leaving this open for the time being; the suggestion is to be able to replace this:

from polars.datatypes import DATETIME_DTYPES
df.select( pl.exclude(DATETIME_DTYPES) )

with

df.select(pl.exclude(pl.Datetime("*")))

7 participants