
[BUG] to_date with two digit year uses different century between CPU and GPU #2118

Open
andygrove opened this issue Apr 13, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@andygrove
Contributor

Describe the bug
I enabled spark.rapids.sql.incompatibleDateFormats.enabled and attempted to parse 31/12/99 using the format dd/MM/yy.

Spark on CPU returns 2099-12-31 but Spark on GPU returns 1999-12-31.

Steps/Code to reproduce bug

Update ParseDateTimeSuite.scala:

  • Add value "31/12/99" to timestampValues variable

  • Add new test:

testSparkResultsAreEqual("to_date dd/MM/yy",
  datesAsStrings,
  conf = new SparkConf().set(SQLConf.LEGACY_TIME_PARSER_POLICY.key, "CORRECTED")) {
  df => df.withColumn("c1", to_date(col("c0"), "dd/MM/yy"))
}

Expected behavior
The test should pass, or we should fall back to the CPU for the yy pattern.

Additional context
Spark 3.1.1

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels Apr 13, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 13, 2021
@andygrove andygrove self-assigned this Apr 13, 2021
@andygrove andygrove added this to the Apr 12 - Apr 23 milestone Apr 13, 2021
@andygrove
Contributor Author

andygrove commented Apr 13, 2021

I plan to disable yy on the GPU for now and to file a follow-on issue about handling this consistently with Spark.

cuDF parsing is based on strptime, and I found the following in the strptime docs. For now I am assuming cuDF behaves the same way, but I still need to confirm that, as well as investigate the rules that Spark implements.

The year within century. When a century is not otherwise specified, values in the range [69,99] shall refer to years 1969 to 1999 inclusive, and values in the range [00,68] shall refer to years 2000 to 2068 inclusive; leading zeros shall be permitted but shall not be required.

Spark uses DateTimeFormatter which has different rules:

For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive.
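The java.time side of this is easy to check directly, since Spark's CORRECTED parser policy delegates to DateTimeFormatter. A minimal sketch (the class name is illustrative):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TwoDigitYearDemo {
    public static void main(String[] args) {
        // DateTimeFormatter's "yy" pattern parses two-digit years against a
        // base of 2000, so results always fall in the range 2000-2099.
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd/MM/yy");
        LocalDate parsed = LocalDate.parse("31/12/99", fmt);
        System.out.println(parsed); // 2099-12-31, matching Spark on CPU

        // POSIX strptime instead pivots at 69: [69,99] -> 1969-1999 and
        // [00,68] -> 2000-2068, so "99" would become 1999, which matches
        // the 1999-12-31 result seen on the GPU.
    }
}
```

This confirms that the two rule sets diverge for any two-digit year in [69,99], which is exactly the range hit by the 31/12/99 repro above.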

@sameerz
Collaborator

sameerz commented Apr 19, 2021

We will now fall back to the CPU, but we are leaving this open so we can fix the correctness issue on the GPU in the longer run.

@andygrove andygrove removed their assignment Aug 23, 2021