
[BUG] test_multi_types_window_aggs_for_rows_lead_lag fails against Spark 3.1.0 #999

Closed · Fixed by #1813

jlowe opened this issue Oct 21, 2020 · 5 comments

Labels: bug (Something isn't working) · P1 (Nice to have for release) · Spark 3.1+ (Bugs only related to Spark 3.1 or higher)

Comments

jlowe (Member) commented Oct 21, 2020

Describe the bug
Nightly builds of the integration tests against Spark 3.1.0 are failing in the test_multi_types_window_aggs_for_rows_lead_lag tests.

Steps/Code to reproduce bug
Run integration tests against latest Spark 3.1.0-SNAPSHOT

Expected behavior
Tests do not fail.

jlowe added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Oct 21, 2020
jlowe (Member, Author) commented Oct 21, 2020

Looks like the GPU is somehow returning nulls or no result. Here's an excerpt from the failure log. Note the gpu = None.

11:24:30  =================================== FAILURES ===================================
11:24:30  _ test_multi_types_window_aggs_for_rows_lead_lag[partBy:Long-orderBy:Long-Long] _
11:24:30  
11:24:30  a_gen = Long, b_gen = Long, c_gen = Long
11:24:30  
11:24:30      @ignore_order
11:24:30      @approximate_float
11:24:30      @pytest.mark.parametrize('c_gen', lead_lag_data_gens, ids=idfn)
11:24:30      @pytest.mark.parametrize('b_gen', part_and_order_gens, ids=meta_idfn('orderBy:'))
11:24:30      @pytest.mark.parametrize('a_gen', part_and_order_gens, ids=meta_idfn('partBy:'))
11:24:30      def test_multi_types_window_aggs_for_rows_lead_lag(a_gen, b_gen, c_gen):
11:24:30          data_gen = [
11:24:30                  ('a', RepeatSeqGen(a_gen, length=20)),
11:24:30                  ('b', b_gen),
11:24:30                  ('c', c_gen)]
11:24:30          # By default for many operations a range of unbounded to unbounded is used
11:24:30          # This will not work until https://github.com/NVIDIA/spark-rapids/issues/216
11:24:30          # is fixed.
11:24:30      
11:24:30          # Ordering needs to include c because with nulls and especially on booleans
11:24:30          # it is possible to get a different ordering when it is ambiguous.
11:24:30          baseWindowSpec = Window.partitionBy('a').orderBy('b', 'c')
11:24:30          inclusiveWindowSpec = baseWindowSpec.rowsBetween(-10, 100)
11:24:30      
11:24:30          defaultVal = gen_scalar_value(c_gen, force_no_nulls=False)
11:24:30      
11:24:30          def do_it(spark):
11:24:30              return gen_df(spark, data_gen, length=2048) \
11:24:30                      .withColumn('inc_count_1', f.count('*').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_count_c', f.count('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_max_c', f.max('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_min_c', f.min('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('lead_5_c', f.lead('c', 5).over(baseWindowSpec)) \
11:24:30                      .withColumn('lead_def_c', f.lead('c', 2, defaultVal).over(baseWindowSpec)) \
11:24:30                      .withColumn('lag_1_c', f.lag('c', 1).over(baseWindowSpec)) \
11:24:30                      .withColumn('lag_def_c', f.lag('c', 4, defaultVal).over(baseWindowSpec)) \
11:24:30                      .withColumn('row_num', f.row_number().over(baseWindowSpec))
11:24:30  >       assert_gpu_and_cpu_are_equal_collect(do_it, conf={'spark.rapids.sql.hasNans': 'false'})
11:24:30  
11:24:30  src/main/python/window_function_test.py:113: 
11:24:30  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:24:30  src/main/python/asserts.py:296: in assert_gpu_and_cpu_are_equal_collect
11:24:30      _assert_gpu_and_cpu_are_equal(func, True, conf=conf)
11:24:30  src/main/python/asserts.py:288: in _assert_gpu_and_cpu_are_equal
11:24:30      assert_equal(from_cpu, from_gpu)
11:24:30  src/main/python/asserts.py:86: in assert_equal
11:24:30      _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
11:24:30  src/main/python/asserts.py:38: in _assert_equal
11:24:30      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:24:30  src/main/python/asserts.py:31: in _assert_equal
11:24:30      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
11:24:30  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:24:30  
11:24:30  cpu = -6068833985444302436, gpu = None
11:24:30  float_check = <function get_float_check.<locals>.<lambda> at 0x7f63ca8033b0>
11:24:30  path = [1, 'lag_1_c']
11:24:30  
11:24:30      def _assert_equal(cpu, gpu, float_check, path):
11:24:30          t = type(cpu)
11:24:30          if (t is Row):
11:24:30              assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {}".format(path)
11:24:30              if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
11:24:30                  for field in cpu.__fields__:
11:24:30                      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
11:24:30              else:
11:24:30                  for index in range(len(cpu)):
11:24:30                      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:24:30          elif (t is list):
11:24:30              assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {}".format(path)
11:24:30              for index in range(len(cpu)):
11:24:30                  _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:24:30          elif (t is pytypes.GeneratorType):
11:24:30              index = 0
11:24:30              # generator has no zip :( so we have to do this the hard way
11:24:30              done = False
11:24:30              while not done:
11:24:30                  sub_cpu = None
11:24:30                  sub_gpu = None
11:24:30                  try:
11:24:30                      sub_cpu = next(cpu)
11:24:30                  except StopIteration:
11:24:30                      done = True
11:24:30      
11:24:30                  try:
11:24:30                      sub_gpu = next(gpu)
11:24:30                  except StopIteration:
11:24:30                      done = True
11:24:30      
11:24:30                  if done:
11:24:30                      assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
11:24:30                  else:
11:24:30                      _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
11:24:30      
11:24:30                  index = index + 1
11:24:30          elif (t is int):
11:24:30  >           assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
11:24:30  E           AssertionError: GPU and CPU int values are different at [1, 'lag_1_c']
11:24:30  
11:24:30  src/main/python/asserts.py:63: AssertionError
11:24:30  ----------------------------- Captured stdout call -----------------------------
11:24:30  ### CPU RUN ###
11:24:30  ### GPU RUN ###
11:24:30  ### COLLECT: GPU TOOK 0.8450026512145996 CPU TOOK 0.69598388671875 ###
11:24:30  _ test_multi_types_window_aggs_for_rows_lead_lag[partBy:Long-orderBy:Long-Double] _
11:24:30  
11:24:30  a_gen = Long, b_gen = Long, c_gen = Double
11:24:30  
11:24:30      @ignore_order
11:24:30      @approximate_float
11:24:30      @pytest.mark.parametrize('c_gen', lead_lag_data_gens, ids=idfn)
11:24:30      @pytest.mark.parametrize('b_gen', part_and_order_gens, ids=meta_idfn('orderBy:'))
11:24:30      @pytest.mark.parametrize('a_gen', part_and_order_gens, ids=meta_idfn('partBy:'))
11:24:30      def test_multi_types_window_aggs_for_rows_lead_lag(a_gen, b_gen, c_gen):
11:24:30          data_gen = [
11:24:30                  ('a', RepeatSeqGen(a_gen, length=20)),
11:24:30                  ('b', b_gen),
11:24:30                  ('c', c_gen)]
11:24:30          # By default for many operations a range of unbounded to unbounded is used
11:24:30          # This will not work until https://github.com/NVIDIA/spark-rapids/issues/216
11:24:30          # is fixed.
11:24:30      
11:24:30          # Ordering needs to include c because with nulls and especially on booleans
11:24:30          # it is possible to get a different ordering when it is ambiguous.
11:24:30          baseWindowSpec = Window.partitionBy('a').orderBy('b', 'c')
11:24:30          inclusiveWindowSpec = baseWindowSpec.rowsBetween(-10, 100)
11:24:30      
11:24:30          defaultVal = gen_scalar_value(c_gen, force_no_nulls=False)
11:24:30      
11:24:30          def do_it(spark):
11:24:30              return gen_df(spark, data_gen, length=2048) \
11:24:30                      .withColumn('inc_count_1', f.count('*').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_count_c', f.count('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_max_c', f.max('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_min_c', f.min('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('lead_5_c', f.lead('c', 5).over(baseWindowSpec)) \
11:24:30                      .withColumn('lead_def_c', f.lead('c', 2, defaultVal).over(baseWindowSpec)) \
11:24:30                      .withColumn('lag_1_c', f.lag('c', 1).over(baseWindowSpec)) \
11:24:30                      .withColumn('lag_def_c', f.lag('c', 4, defaultVal).over(baseWindowSpec)) \
11:24:30                      .withColumn('row_num', f.row_number().over(baseWindowSpec))
11:24:30  >       assert_gpu_and_cpu_are_equal_collect(do_it, conf={'spark.rapids.sql.hasNans': 'false'})
11:24:30  
11:24:30  src/main/python/window_function_test.py:113: 
11:24:30  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:24:30  src/main/python/asserts.py:296: in assert_gpu_and_cpu_are_equal_collect
11:24:30      _assert_gpu_and_cpu_are_equal(func, True, conf=conf)
11:24:30  src/main/python/asserts.py:288: in _assert_gpu_and_cpu_are_equal
11:24:30      assert_equal(from_cpu, from_gpu)
11:24:30  src/main/python/asserts.py:86: in assert_equal
11:24:30      _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
11:24:30  src/main/python/asserts.py:38: in _assert_equal
11:24:30      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:24:30  src/main/python/asserts.py:31: in _assert_equal
11:24:30      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
11:24:30  src/main/python/asserts.py:68: in _assert_equal
11:24:30      assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
11:24:30  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:24:30  
11:24:30  lhs = -3.546714417938851e-160, rhs = None
11:24:30  
11:24:30  >   return lambda lhs,rhs: lhs == pytest.approx(rhs, **_approximate_float_args)
11:24:30  E   TypeError: cannot make approximate comparisons to non-numeric values: None
11:24:30  
11:24:30  src/main/python/conftest.py:24: TypeError

sameerz added the P1 (Nice to have for release) and Spark 3.1+ (Bugs only related to Spark 3.1 or higher) labels and removed the ? - Needs Triage label on Oct 27, 2020
sameerz (Collaborator) commented Nov 17, 2020

Need to mark as xfail until there is a cudf fix.

revans2 (Collaborator) commented Nov 17, 2020

It is now failing in a very different way from before. It looks like Spark changed in an incompatible way, and we are now getting failures; it appears we have to move OffsetWindowFunctionMeta into the shim layer (sketched after the stack trace below).

E                   py4j.protocol.Py4JJavaError: An error occurred while calling o60729.collectToPython.
E                   : java.lang.VerifyError: Bad type on operand stack
E                   Exception Details:
E                     Location:
E                       com/nvidia/spark/rapids/OffsetWindowFunctionMeta.<init>(Lorg/apache/spark/sql/catalyst/expressions/OffsetWindowFunction;Lcom/nvidia/spark/rapids/RapidsConf;Lscala/Option;Lcom/nvidia/spark/rapids/ConfKeysAndIncompat;)V @11: invokespecial
E                     Reason:
E                       Type 'org/apache/spark/sql/catalyst/expressions/OffsetWindowFunction' (current frame, stack[1]) is not assignable to 'org/apache/spark/sql/catalyst/expressions/Expression'
E                     Current Frame:
E                       bci: @11
E                       flags: { flagThisUninit }
E                       locals: { uninitializedThis, 'org/apache/spark/sql/catalyst/expressions/OffsetWindowFunction', 'com/nvidia/spark/rapids/RapidsConf', 'scala/Option', 'com/nvidia/spark/rapids/ConfKeysAndIncompat', '[Z' }
E                       stack: { uninitializedThis, 'org/apache/spark/sql/catalyst/expressions/OffsetWindowFunction', 'com/nvidia/spark/rapids/RapidsConf', 'scala/Option', 'com/nvidia/spark/rapids/ConfKeysAndIncompat' }
E                     Bytecode:
E                       0x0000000: b800 a63a 052a 2b2c 2d19 04b7 0056 1905
E                       0x0000010: 1008 0454 2ab2 005b 2ab7 0027 c000 29b6
E                       0x0000020: 002c 2ab7 005e bb00 6059 2ab7 0063 b600
E                       0x0000030: 67b5 0015 1905 1009 0454 2ab2 005b 2ab7
E                       0x0000040: 0027 c000 29b6 0069 2ab7 005e bb00 6059
E                       0x0000050: 2ab7 0063 b600 67b5 0019 1905 100a 0454
E                       0x0000060: 2ab2 005b 2ab7 0027 c000 29b6 006b 2ab7
E                       0x0000070: 005e bb00 6059 2ab7 0063 b600 67b5 001b
E                       0x0000080: 2a19 0510 0b04 542a b700 27c0 0029 b600
E                       0x0000090: 6bb6 0032 b200 703a 0659 c700 1b57 1906
E                       0x00000a0: c700 0c19 0510 0c04 54a7 0023 1905 100d
E                       0x00000b0: 0454 a700 4719 06b6 003e 9a00 0c19 0510
E                       0x00000c0: 0e04 54a7 0036 1905 100f 0454 b200 7bb2
E                       0x00000d0: 0080 05bd 0082 5903 2ab6 0084 5359 042a
E                       0x00000e0: b600 8653 c000 88b6 008c b600 90c0 0092
E                       0x00000f0: 1905 1010 0454 a700 34b2 007b b200 8006
E                       0x0000100: bd00 8259 032a b600 8453 5904 2ab6 0086
E                       0x0000110: 5359 052a b600 9453 c000 88b6 008c b600
E                       0x0000120: 90c0 0092 1905 1011 0454 b500 1f19 0510
E                       0x0000130: 1204 54b1                              
E                     Stackmap Table:
E                       full_frame(@172,{Object[#2],Object[#41],Object[#114],Object[#116],Object[#118],Object[#168],Object[#109]},{Object[#2]})
E                       full_frame(@181,{Object[#2],Object[#41],Object[#114],Object[#116],Object[#118],Object[#168],Object[#109]},{Object[#2],Object[#76]})
E                       same_locals_1_stack_item_frame(@198,Object[#2])
E                       same_locals_1_stack_item_frame(@204,Object[#2])
E                       same_locals_1_stack_item_frame(@249,Object[#2])
E                       full_frame(@298,{Object[#2],Object[#41],Object[#114],Object[#116],Object[#118],Object[#168],Object[#109]},{Object[#2],Object[#146]})
E                   
E                   	at com.nvidia.spark.rapids.GpuOverrides$.$anonfun$commonExpressions$16(GpuOverrides.scala:760)
E                   	at com.nvidia.spark.rapids.ReplacementRule.wrap(GpuOverrides.scala:186)
E                   	at com.nvidia.spark.rapids.GpuOverrides$.$anonfun$wrapExpr$1(GpuOverrides.scala:632)
E                   	at scala.Option.map(Option.scala:230)
E                   	at com.nvidia.spark.rapids.GpuOverrides$.wrapExpr(GpuOverrides.scala:632)
E                   	at com.nvidia.spark.rapids.BaseExprMeta.$anonfun$childExprs$2(RapidsMeta.scala:670)
E                   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
E                   	at scala.collection.immutable.List.foreach(List.scala:392)
E                   	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
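To make the shim idea concrete, here is a minimal sketch; the trait and object names are hypothetical, not the actual spark-rapids API. The point is that any source file naming OffsetWindowFunction, whose position in Spark's expression class hierarchy differs between 3.0.x and 3.1.x, gets compiled once against each Spark version, while common code only calls through a version-neutral trait and never loads the incompatible symbol itself.

```scala
// Version-neutral interface seen by the common code. Names here are
// illustrative; the real plugin wires its shims up differently.
trait SparkShims {
  // True if the given expression is a lead/lag-style offset window function.
  def isOffsetWindowFunction(expr: Any): Boolean
}

// Compiled only against Spark 3.0.x jars.
class Spark30Shims extends SparkShims {
  import org.apache.spark.sql.catalyst.expressions.OffsetWindowFunction
  override def isOffsetWindowFunction(expr: Any): Boolean =
    expr.isInstanceOf[OffsetWindowFunction]
}

// Compiled only against Spark 3.1.x jars: the same source text, but the
// reference resolves against 3.1's class hierarchy, so no VerifyError.
class Spark31Shims extends SparkShims {
  import org.apache.spark.sql.catalyst.expressions.OffsetWindowFunction
  override def isOffsetWindowFunction(expr: Any): Boolean =
    expr.isInstanceOf[OffsetWindowFunction]
}
```

The JVM only verifies the bytecode of the shim class it actually loads at runtime, so each Spark version sees code compiled against its own class hierarchy.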

mythrocks (Collaborator) commented Feb 25, 2021

> It looks like Spark changed in an incompatible way, and we are now getting failures; it appears we have to move OffsetWindowFunctionMeta into the shim layer.

Yep. OffsetWindowFunction compiled with Spark 3.0.2 is not recognized as a valid Expression in Spark 3.1.1. Moving the object construction into the shims layer sorted this out.

The original problem remained:

> Looks like the GPU is somehow returning nulls or no result.

This is the result of a change in Spark 3.1.x semantics for Lag.offset. Whereas the offset for, say, LAG(col, 5) used to be stored internally as Int(5) in 3.0.2, it was changed to Int(-5) in 3.1.x, trampling over the GpuWindowExec logic.
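As a minimal sketch of what per-version handling could look like (all names here are hypothetical; the actual change is in #1813), each Spark version's shim can translate the internally stored offset into the positive "rows back" count that the GPU window logic expects:

```scala
// Hypothetical shim trait: normalize the internally stored Lag offset to
// the positive row count that downstream GPU window code assumes.
trait LagOffsetShim {
  def rowsBack(storedOffset: Int): Int
}

// Spark 3.0.x stores LAG(col, 5) with offset 5: pass it through unchanged.
object Spark30LagOffsetShim extends LagOffsetShim {
  override def rowsBack(storedOffset: Int): Int = storedOffset
}

// Spark 3.1.x stores LAG(col, 5) with offset -5: negate it back to 5.
object Spark31LagOffsetShim extends LagOffsetShim {
  override def rowsBack(storedOffset: Int): Int = -storedOffset
}
```

With the 3.0.x assumption baked in, a 3.1.x plan hands the GPU a negated offset, which is why the results above come back as nulls instead of the lagged values.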

A fix is available in #1813.

sameerz (Collaborator) commented Feb 26, 2021

Closed by #1813

sameerz closed this as completed on Feb 26, 2021