
[BUG] FAILED schema_evolution_test.py::test_column_add_after_partition[orc][DATAGEN_SEED=1700413303, INJECT_OOM, IGNORE_ORDER({'local': True})] in 330cdh #9807

Open
pxLi opened this issue Nov 21, 2023 · 5 comments
Labels
bug Something isn't working test Only impacts tests

Comments

@pxLi
Collaborator

pxLi commented Nov 21, 2023

Describe the bug
Mismatched CPU and GPU results. Internal pipeline: rapids-it-cloudera1-raplab

FAILED ../../src/main/python/schema_evolution_test.py::test_column_add_after_partition[orc][DATAGEN_SEED=1700413303, INJECT_OOM, IGNORE_ORDER({'local': True})]

[2023-11-19T17:02:42.493Z] ### CPU RUN ###
[2023-11-19T17:02:42.493Z] ### GPU RUN ###
[2023-11-19T17:02:42.493Z] ### COLLECT: GPU TOOK 1.816152572631836 CPU TOOK 2.8293328285217285 ###
[2023-11-19T17:02:42.493Z] --- CPU OUTPUT
[2023-11-19T17:02:42.493Z] +++ GPU OUTPUT
[2023-11-19T17:02:42.493Z] @@ -2323,7 +2323,7 @@
[2023-11-19T17:02:42.493Z]  Row(c=747949758072796799, new_0=False, new_1=68, new_2=-32415, new_3=1382142121, new_4=-2229952796518441831, new_5=None, new_6=6.443761270526859e+250, new_7='`0ÌÙÛ\x1bärÑE\x1e¼ª0D\x18ò@¤õLvÊ|\x99ã\x99]y»', new_8=datetime.date(6979, 5, 9), new_9=datetime.datetime(1842, 8, 20, 11, 32, 25, 128767), new_10=[datetime.date(2495, 9, 28), datetime.date(7989, 10, 3), datetime.date(7668, 3, 8), datetime.date(2483, 1, 7), datetime.date(7474, 12, 20), datetime.date(4109, 8, 6), datetime.date(6820, 2, 7), datetime.date(8399, 6, 9), datetime.date(5965, 4, 10), datetime.date(339, 10, 8), datetime.date(4812, 8, 16), datetime.date(9039, 3, 14)], new_11=Row(child0=None), new_12=Row(c0=[-4807041446153269385, -492861108868460956, 2445107072012345415, 8495904361905650441, 0, 3069048035912442783], c1=True), a=0, b='x')
[2023-11-19T17:02:42.493Z]  Row(c=756767049961098093, new_0=None, new_1=None, new_2=None, new_3=None, new_4=None, new_5=None, new_6=None, new_7=None, new_8=None, new_9=None, new_10=None, new_11=None, new_12=None, a=1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=758360272413391259, new_0=False, new_1=93, new_2=-24932, new_3=-1362240737, new_4=-326270103423974582, new_5=8.31479808960357e-27, new_6=-7.480874109391324e-227, new_7='wÏ\x02þ\xadèó\x11Ç\x8e\x05\x82\x86Ð$Î)ÑNä¢È\x10Uý\x028f<\x0f', new_8=datetime.date(4850, 1, 25), new_9=datetime.datetime(2178, 3, 5, 9, 50, 14, 38792), new_10=[datetime.date(2000, 3, 1), datetime.date(8824, 9, 15), datetime.date(214, 3, 10), datetime.date(9963, 2, 1)], new_11=Row(child0=Decimal('67480762349703055.83')), new_12=Row(c0=[0, 2092818463731806519], c1=True), a=0, b='z')
[2023-11-19T17:02:42.493Z] -Row(c=761887531456626263, new_0=False, new_1=-59, new_2=-27428, new_3=649879972, new_4=300241518059178542, new_5=0.0, new_6=-4.4153881189739035e+222, new_7=None, new_8=datetime.date(6464, 3, 19), new_9=datetime.datetime(2163, 1, 20, 6, 5, 33, 728202), new_10=[datetime.date(1582, 10, 15)], new_11=Row(child0=Decimal('-274330538087067736.29')), new_12=Row(c0=[-9169051718731907692, 3207217407084977732, 5828166240839985036, -769860422017136925, -6317725922746163153, -4531858387965210345, -7549594355732088622, 2870330501981589583, 4918878943345752411, -1, -2538576527345834980, -1777000261340738155, -454263620209757316, 8207708571539683202, 0, -7855648709717721967, -1580255216891208658, -5557174777925197249, 8490753925456683644, -6041309477268441620], c1=True), a=0, b='y')
[2023-11-19T17:02:42.493Z] +Row(c=761887531456626263, new_0=False, new_1=-59, new_2=-27428, new_3=649879972, new_4=300241518059178542, new_5=0.0, new_6=-4.4153881189739035e+222, new_7=None, new_8=datetime.date(6464, 3, 19), new_9=datetime.datetime(2163, 1, 20, 6, 5, 33, 728202), new_10=[datetime.date(1582, 10, 13)], new_11=Row(child0=Decimal('-274330538087067736.29')), new_12=Row(c0=[-9169051718731907692, 3207217407084977732, 5828166240839985036, -769860422017136925, -6317725922746163153, -4531858387965210345, -7549594355732088622, 2870330501981589583, 4918878943345752411, -1, -2538576527345834980, -1777000261340738155, -454263620209757316, 8207708571539683202, 0, -7855648709717721967, -1580255216891208658, -5557174777925197249, 8490753925456683644, -6041309477268441620], c1=True), a=0, b='y')
[2023-11-19T17:02:42.493Z]  Row(c=768380761926021119, new_0=True, new_1=-87, new_2=4034, new_3=None, new_4=8519045545037378653, new_5=5.065325137478845e+29, new_6=-4.666103834234089e-215, new_7='¶Î\x8f\x8cU¶\x9caa\nÿ[Ä\x90µ)\x85îm²d\x19Ù¨\x13Ý+\x15\xad\x16', new_8=datetime.date(2000, 3, 1), new_9=datetime.datetime(2141, 2, 23, 16, 7, 4, 731178), new_10=[datetime.date(3286, 5, 19), datetime.date(8069, 9, 17), datetime.date(5396, 12, 17), datetime.date(631, 11, 13), datetime.date(3529, 2, 18), datetime.date(2590, 4, 17), datetime.date(8565, 7, 10), datetime.date(1629, 6, 13)], new_11=Row(child0=Decimal('-880390660518370190.39')), new_12=Row(c0=None, c1=True), a=1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=789081968798337592, new_0=None, new_1=None, new_2=None, new_3=None, new_4=None, new_5=None, new_6=None, new_7=None, new_8=None, new_9=None, new_10=None, new_11=None, new_12=None, a=-1, b='z')
[2023-11-19T17:02:42.493Z]  Row(c=806216176268352943, new_0=True, new_1=-45, new_2=-26835, new_3=-1020998181, new_4=-4168755332298322914, new_5=-8.087319503242584e+24, new_6=5.857303241779751e+83, new_7=':Å°ø°Á÷\x00DM²PÚRT/\x13AH\x92eý~£µ+\x0b5c4', new_8=datetime.date(1650, 11, 11), new_9=datetime.datetime(2179, 5, 24, 1, 16, 57, 493474), new_10=[datetime.date(72, 12, 13), datetime.date(2033, 2, 23), datetime.date(8471, 1, 7), datetime.date(6000, 3, 1), datetime.date(7344, 6, 4), datetime.date(1875, 5, 20), datetime.date(9395, 8, 23), datetime.date(858, 8, 22), datetime.date(8000, 3, 1), datetime.date(1505, 1, 19), datetime.date(2624, 1, 30), datetime.date(5096, 11, 8), datetime.date(3053, 7, 14), datetime.date(4000, 2, 29), datetime.date(7627, 6, 8), datetime.date(4691, 5, 26)], new_11=Row(child0=None), new_12=Row(c0=[8883040341267712284, -5854667154153339848, -7992805157714132332, 3849673508210869062, -3458498021831871544, 5689542648647886628, -5973772120704534010, -6341409587235122230, 8986229899183325292, -3347809925252056780, 4127516452669723253, -914703985335797152, 8228864223747791863, 5034508634006205111, -4701340369623626054, 6539709654621966036, 3217011538087074601], c1=True), a=-1, b='y')

Steps/Code to reproduce bug
Re-run the internal rapids-it-cloudera1-raplab pipeline.

Expected behavior
The test passes.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue


@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify test Only impacts tests labels Nov 21, 2023
@revans2
Collaborator

revans2 commented Nov 21, 2023

I had this fail outside of CDH as well.

@sameerz
Collaborator

sameerz commented Nov 21, 2023

Is this a case where the timezone is not set in the CDH test environment?

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 21, 2023
@jlowe
Member

jlowe commented Nov 22, 2023

The problem relates to an arguably invalid date being generated during the test. During the failed run, the date 1582-10-13 is generated. According to the Gregorian calendar, that date doesn't exist: it falls within the 10 days lost during the transition from the Julian to the Gregorian calendar, in which 1582-10-04 is followed directly by 1582-10-15. When the CPU writes this date to an ORC file, it marshals it to 1582-10-15, which is the value seen on both a CPU and a GPU read:

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> Seq("1582-10-13").toDF("s").selectExpr("cast(s as date) as d").repartition(1).write.orc("/tmp/orccpu")

scala> spark.read.orc("/tmp/orccpu").show()
+----------+
|         d|
+----------+
|1582-10-15|
+----------+

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> spark.read.orc("/tmp/orccpu").show()
23/11/22 16:28:01 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU

+----------+
|         d|
+----------+
|1582-10-15|
+----------+

However, when the GPU writes the date, it encodes it in such a way that the CPU and GPU readers differ on the value:

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> Seq("1582-10-13").toDF("s").selectExpr("cast(s as date) as d").repartition(1).write.orc("/tmp/orcgpu")
23/11/22 16:26:39 WARN GpuOverrides: 
!Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
  @Partitioning <SinglePartition$> could run on GPU
  ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
    @Expression <AttributeReference> d#27 could run on GPU

scala> spark.read.orc("/tmp/orcgpu").show()
23/11/22 16:28:08 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU

+----------+
|         d|
+----------+
|1582-10-13|
+----------+

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> spark.read.orc("/tmp/orcgpu").show()
+----------+
|         d|
+----------+
|1582-10-03|
+----------+

I tried pandas and the ORC Java tools against the CPU- and GPU-generated ORC files. All readers agree the CPU-written file has the date 1582-10-15. With the GPU-written file, pandas and the libcudf reader see the date as 1582-10-13, whereas the Spark CPU reader and the ORC Java tools read it as 1582-10-03.
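
This off-by-ten behavior is consistent with the two calendar systems involved: libcudf writes the proleptic Gregorian day count, while the ORC Java reader interprets day counts using the hybrid Julian/Gregorian calendar. Here is a minimal sketch of the same calendar arithmetic using plain java.time and java.util (an illustration of the mismatch, not the actual reader code paths):

import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}
import java.util.concurrent.TimeUnit

// Day count the GPU writer stores for 1582-10-13: days since the epoch
// in the proleptic Gregorian calendar (the calendar java.time uses).
val days = LocalDate.of(1582, 10, 13).toEpochDay  // -141429

// Interpreting the same day count with the hybrid Julian/Gregorian
// calendar (java.util.GregorianCalendar's default, cutover at 1582-10-15)
// lands ten days earlier, on the Julian date 1582-10-03.
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.setTimeInMillis(TimeUnit.DAYS.toMillis(days))
printf("%04d-%02d-%02d%n",
  cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1,
  cal.get(Calendar.DAY_OF_MONTH))  // prints 1582-10-03

The same rebasing through the hybrid calendar is why the Spark CPU reader and the ORC Java tools both land on 1582-10-03 for the day count the GPU wrote.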

@jlowe
Member

jlowe commented Nov 22, 2023

Relates to #131

@jlowe
Member

jlowe commented Dec 6, 2023

Updated the compatibility docs to describe the incompatibilities with the lost days associated with the switch to the Gregorian calendar. Putting this back into the backlog since fixing is deemed low priority.
