Ability to control the assumed century cutoff with Expr.str.to_date when parsing 2-digit years #17213

Closed
MarkRotchell opened this issue Jun 26, 2024 · 5 comments · Fixed by #17661
Labels
A-timeseries (Area: date/time functionality), enhancement (New feature or an improvement of an existing feature)

Comments

@MarkRotchell

Description

Currently, when parsing dates with 2-digit years, any value up to 49 is assumed to be in the 21st century and any value of 50 or more is assumed to be in the 20th century, so that:

>>> pl.DataFrame({'dt':'31-12-49'}).select(pl.col('dt').str.to_date('%d-%m-%y'))
┌────────────┐
│ dt         │
│ ---        │
│ date       │
╞════════════╡
│ 2049-12-31 │
└────────────┘

>>> pl.DataFrame({'dt':'01-01-50'}).select(pl.col('dt').str.to_date('%d-%m-%y'))
┌────────────┐
│ dt         │
│ ---        │
│ date       │
╞════════════╡
│ 1950-01-01 │
└────────────┘

There are obviously ways around this, for example

pl.DataFrame({'dt':'01-01-50'}).select((pl.col('dt')+'-20').str.to_date('%d-%m-%y-%C'))

But it would be good to have direct control over the assumed century cutoff, rather than having it fixed at 2050. Potentially this could be set via pl.Config, or passed as an extra argument to Expr.str.to_date?
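To make the request concrete, here is a minimal sketch of what a configurable cutoff could mean, written in plain Python rather than polars. The names `expand_two_digit_year`, `parse_dmy`, `pivot`, and `base_century` are hypothetical and do not correspond to any existing polars or chrono API; the default `pivot=50` simply mirrors the behaviour shown above.

```python
from datetime import date


def expand_two_digit_year(yy: int, pivot: int = 50, base_century: int = 2000) -> int:
    """Expand a 2-digit year using an explicit pivot (hypothetical helper).

    Values below `pivot` land in `base_century`; values at or above it
    land in the century before.
    """
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a 2-digit year, got {yy}")
    if yy < pivot:
        return base_century + yy
    return base_century - 100 + yy


def parse_dmy(text: str, pivot: int = 50) -> date:
    """Parse a 'DD-MM-YY' string with a caller-chosen century cutoff."""
    day, month, yy = (int(part) for part in text.split("-"))
    return date(expand_two_digit_year(yy, pivot), month, day)


print(parse_dmy("31-12-49"))            # 2049-12-31
print(parse_dmy("01-01-50"))            # 1950-01-01
print(parse_dmy("01-01-50", pivot=69))  # 2050-01-01
```

With the pivot exposed as an argument, the caller decides where the century boundary falls instead of relying on a built-in default.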

@MarkRotchell MarkRotchell added the enhancement New feature or an improvement of an existing feature label Jun 26, 2024
@alexander-beedie
Collaborator

If such a thing were added it should definitely be explicit rather than something stashed invisibly in the Config object 🤔 Got any examples of other APIs that allow this? (It makes sense, but I can't recall actually having seen it done elsewhere).

@Julian-J-S
Contributor

My first assumption was that this comes from chrono (the Rust date/datetime parser) rather than from polars, but chrono seems to be "fine":

fn main() {
    use chrono::NaiveDate; // 0.4.38

    let Y_M_D = "%y-%m-%d";
    let D_M_Y = "%d-%m-%y";

    let Y_M_D_Text = ["49-01-01", "49-12-31", "50-01-01"];
    let D_M_Y_Text = ["01-01-49", "31-12-49", "01-01-50"];

    for date_str in Y_M_D_Text {
        println!("{:?}", NaiveDate::parse_from_str(date_str, Y_M_D));
    }

    for date_str in D_M_Y_Text {
        println!("{:?}", NaiveDate::parse_from_str(date_str, D_M_Y));
    }
}

// Ok(2049-01-01)
// Ok(2049-12-31)
// Ok(2050-01-01)
// Ok(2049-01-01)
// Ok(2049-12-31)
// Ok(2050-01-01)

Must be some polars optimized performance implementation that is not using chrono?! 😆

Anyway, as a side note, would strongly advise you to use a clear and unambiguous format whenever possible! 😉

@MarcoGorelli
Collaborator

MarcoGorelli commented Jun 26, 2024

> Must be some polars optimized performance implementation that is not using chrono?! 😆

There's a fast path for some fixed-length formats; it might be that.

@MarkRotchell
Author

Python's docs for datetime describe a different cutoff, with no way to override it:

> When 2-digit years are parsed, they are converted according to the POSIX and ISO C standards: values 69–99 are mapped to 1969–1999, and values 0–68 are mapped to 2000–2068.

pandas seems to have the same cutoff as Python, again with no way to override it. Perhaps, with that cutoff being roughly an extra 20 years out, it causes problems less frequently.
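The cutoff quoted from Python's docs can be checked with the standard library alone. This is just a small illustration; `parse_dmy_stdlib` is a throwaway helper name, not part of any library.

```python
from datetime import datetime


def parse_dmy_stdlib(text: str) -> datetime:
    """Parse 'DD-MM-YY' with datetime.strptime and its built-in %y rule."""
    return datetime.strptime(text, "%d-%m-%y")


for text in ("01-01-49", "01-01-50", "31-12-68", "01-01-69"):
    print(text, "->", parse_dmy_stdlib(text).date())

# 01-01-49 -> 2049-01-01
# 01-01-50 -> 2050-01-01
# 31-12-68 -> 2068-12-31
# 01-01-69 -> 1969-01-01
```

So "01-01-50" parses as 2050 here, whereas the polars output earlier in this issue gives 1950 for the same string.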

I suggest that, even if we don't agree on implementing a way to control this in polars, we should at least make it explicit in the docs that we follow a different convention from Python's datetime.

> would strongly advise you to use a clear and unambiguous format whenever possible! 😉

Unfortunately, I encountered this with data received from a third party. I'm certainly not a fan of YY myself if I can avoid it.

@MarkRotchell
Author

One possible solution, given that this is already being handled by polars rather than delegated to chrono, is to add a new format specifier. Perhaps %20y could indicate "2-digit year assumed to start with 20".

This would be in line with the convention in chrono for allowing a number between the % and the letter to indicate a parameter, e.g. %3f to indicate a three-digit decimal fraction of a second.
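As a rough illustration of the proposed semantics, the sketch below emulates a `%20y`-style specifier on top of Python's `datetime.strptime`. To be clear, `%20y` is not supported by strptime, chrono, or polars; the helper `parse_with_century` and its regex are assumptions made up for this example, and it does not handle edge cases such as an escaped `%%20y`.

```python
import re
from datetime import date, datetime

# Hypothetical specifier: "%" + two century digits + "y", e.g. "%20y" or "%19y".
_CENTURY_YEAR = re.compile(r"%(\d{2})y")


def parse_with_century(text: str, fmt: str) -> date:
    """Parse `text`, treating '%CCy' as a 2-digit year in century CC."""
    match = _CENTURY_YEAR.search(fmt)
    if match is None:
        # No century given: fall back to strptime's own %y rule.
        return datetime.strptime(text, fmt).date()
    century = int(match.group(1))
    # Parse with plain %y first, then force the requested century.
    plain_fmt = _CENTURY_YEAR.sub("%y", fmt)
    parsed = datetime.strptime(text, plain_fmt).date()
    return parsed.replace(year=century * 100 + parsed.year % 100)


print(parse_with_century("01-01-50", "%d-%m-%20y"))  # 2050-01-01
print(parse_with_century("01-01-50", "%d-%m-%19y"))  # 1950-01-01
```

Putting the century in the format string keeps the choice explicit at the call site, which matches the preference stated earlier in the thread for something explicit over a hidden Config setting.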

4 participants