
[BUG] test_multi_types_window_aggs_for_rows_lead_lag fails against Spark 3.1.0 #999

Closed · Fixed by #1813

jlowe opened this issue Oct 21, 2020 · 5 comments

Labels: bug (Something isn't working) · P1 (Nice to have for release) · Spark 3.1+ (Bugs only related to Spark 3.1 or higher)

Comments

jlowe (Member) commented Oct 21, 2020

Describe the bug
Nightly builds of the integration tests against Spark 3.1.0 are failing in the test_multi_types_window_aggs_for_rows_lead_lag tests.

Steps/Code to reproduce bug
Run integration tests against latest Spark 3.1.0-SNAPSHOT

Expected behavior
Tests do not fail.

jlowe added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Oct 21, 2020
jlowe (Member, Author) commented Oct 21, 2020

Looks like the GPU is somehow returning nulls or no result. Here's an excerpt from the failure log. Note the gpu = None.

11:24:30  =================================== FAILURES ===================================
11:24:30  _ test_multi_types_window_aggs_for_rows_lead_lag[partBy:Long-orderBy:Long-Long] _
11:24:30  
11:24:30  a_gen = Long, b_gen = Long, c_gen = Long
11:24:30  
11:24:30      @ignore_order
11:24:30      @approximate_float
11:24:30      @pytest.mark.parametrize('c_gen', lead_lag_data_gens, ids=idfn)
11:24:30      @pytest.mark.parametrize('b_gen', part_and_order_gens, ids=meta_idfn('orderBy:'))
11:24:30      @pytest.mark.parametrize('a_gen', part_and_order_gens, ids=meta_idfn('partBy:'))
11:24:30      def test_multi_types_window_aggs_for_rows_lead_lag(a_gen, b_gen, c_gen):
11:24:30          data_gen = [
11:24:30                  ('a', RepeatSeqGen(a_gen, length=20)),
11:24:30                  ('b', b_gen),
11:24:30                  ('c', c_gen)]
11:24:30          # By default for many operations a range of unbounded to unbounded is used
11:24:30          # This will not work until https://github.com/NVIDIA/spark-rapids/issues/216
11:24:30          # is fixed.
11:24:30      
11:24:30          # Ordering needs to include c because with nulls and especially on booleans
11:24:30          # it is possible to get a different ordering when it is ambiguous.
11:24:30          baseWindowSpec = Window.partitionBy('a').orderBy('b', 'c')
11:24:30          inclusiveWindowSpec = baseWindowSpec.rowsBetween(-10, 100)
11:24:30      
11:24:30          defaultVal = gen_scalar_value(c_gen, force_no_nulls=False)
11:24:30      
11:24:30          def do_it(spark):
11:24:30              return gen_df(spark, data_gen, length=2048) \
11:24:30                      .withColumn('inc_count_1', f.count('*').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_count_c', f.count('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_max_c', f.max('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_min_c', f.min('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('lead_5_c', f.lead('c', 5).over(baseWindowSpec)) \
11:24:30                      .withColumn('lead_def_c', f.lead('c', 2, defaultVal).over(baseWindowSpec)) \
11:24:30                      .withColumn('lag_1_c', f.lag('c', 1).over(baseWindowSpec)) \
11:24:30                      .withColumn('lag_def_c', f.lag('c', 4, defaultVal).over(baseWindowSpec)) \
11:24:30                      .withColumn('row_num', f.row_number().over(baseWindowSpec))
11:24:30  >       assert_gpu_and_cpu_are_equal_collect(do_it, conf={'spark.rapids.sql.hasNans': 'false'})
11:24:30  
11:24:30  src/main/python/window_function_test.py:113: 
11:24:30  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:24:30  src/main/python/asserts.py:296: in assert_gpu_and_cpu_are_equal_collect
11:24:30      _assert_gpu_and_cpu_are_equal(func, True, conf=conf)
11:24:30  src/main/python/asserts.py:288: in _assert_gpu_and_cpu_are_equal
11:24:30      assert_equal(from_cpu, from_gpu)
11:24:30  src/main/python/asserts.py:86: in assert_equal
11:24:30      _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
11:24:30  src/main/python/asserts.py:38: in _assert_equal
11:24:30      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:24:30  src/main/python/asserts.py:31: in _assert_equal
11:24:30      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
11:24:30  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:24:30  
11:24:30  cpu = -6068833985444302436, gpu = None
11:24:30  float_check = <function get_float_check.<locals>.<lambda> at 0x7f63ca8033b0>
11:24:30  path = [1, 'lag_1_c']
11:24:30  
11:24:30      def _assert_equal(cpu, gpu, float_check, path):
11:24:30          t = type(cpu)
11:24:30          if (t is Row):
11:24:30              assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {}".format(path)
11:24:30              if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
11:24:30                  for field in cpu.__fields__:
11:24:30                      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
11:24:30              else:
11:24:30                  for index in range(len(cpu)):
11:24:30                      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:24:30          elif (t is list):
11:24:30              assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {}".format(path)
11:24:30              for index in range(len(cpu)):
11:24:30                  _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:24:30          elif (t is pytypes.GeneratorType):
11:24:30              index = 0
11:24:30              # generator has no zip :( so we have to do this the hard way
11:24:30              done = False
11:24:30              while not done:
11:24:30                  sub_cpu = None
11:24:30                  sub_gpu = None
11:24:30                  try:
11:24:30                      sub_cpu = next(cpu)
11:24:30                  except StopIteration:
11:24:30                      done = True
11:24:30      
11:24:30                  try:
11:24:30                      sub_gpu = next(gpu)
11:24:30                  except StopIteration:
11:24:30                      done = True
11:24:30      
11:24:30                  if done:
11:24:30                      assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
11:24:30                  else:
11:24:30                      _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
11:24:30      
11:24:30                  index = index + 1
11:24:30          elif (t is int):
11:24:30  >           assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
11:24:30  E           AssertionError: GPU and CPU int values are different at [1, 'lag_1_c']
11:24:30  
11:24:30  src/main/python/asserts.py:63: AssertionError
11:24:30  ----------------------------- Captured stdout call -----------------------------
11:24:30  ### CPU RUN ###
11:24:30  ### GPU RUN ###
11:24:30  ### COLLECT: GPU TOOK 0.8450026512145996 CPU TOOK 0.69598388671875 ###
11:24:30  _ test_multi_types_window_aggs_for_rows_lead_lag[partBy:Long-orderBy:Long-Double] _
11:24:30  
11:24:30  a_gen = Long, b_gen = Long, c_gen = Double
11:24:30  
11:24:30      @ignore_order
11:24:30      @approximate_float
11:24:30      @pytest.mark.parametrize('c_gen', lead_lag_data_gens, ids=idfn)
11:24:30      @pytest.mark.parametrize('b_gen', part_and_order_gens, ids=meta_idfn('orderBy:'))
11:24:30      @pytest.mark.parametrize('a_gen', part_and_order_gens, ids=meta_idfn('partBy:'))
11:24:30      def test_multi_types_window_aggs_for_rows_lead_lag(a_gen, b_gen, c_gen):
11:24:30          data_gen = [
11:24:30                  ('a', RepeatSeqGen(a_gen, length=20)),
11:24:30                  ('b', b_gen),
11:24:30                  ('c', c_gen)]
11:24:30          # By default for many operations a range of unbounded to unbounded is used
11:24:30          # This will not work until https://github.com/NVIDIA/spark-rapids/issues/216
11:24:30          # is fixed.
11:24:30      
11:24:30          # Ordering needs to include c because with nulls and especially on booleans
11:24:30          # it is possible to get a different ordering when it is ambiguous.
11:24:30          baseWindowSpec = Window.partitionBy('a').orderBy('b', 'c')
11:24:30          inclusiveWindowSpec = baseWindowSpec.rowsBetween(-10, 100)
11:24:30      
11:24:30          defaultVal = gen_scalar_value(c_gen, force_no_nulls=False)
11:24:30      
11:24:30          def do_it(spark):
11:24:30              return gen_df(spark, data_gen, length=2048) \
11:24:30                      .withColumn('inc_count_1', f.count('*').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_count_c', f.count('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_max_c', f.max('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('inc_min_c', f.min('c').over(inclusiveWindowSpec)) \
11:24:30                      .withColumn('lead_5_c', f.lead('c', 5).over(baseWindowSpec)) \
11:24:30                      .withColumn('lead_def_c', f.lead('c', 2, defaultVal).over(baseWindowSpec)) \
11:24:30                      .withColumn('lag_1_c', f.lag('c', 1).over(baseWindowSpec)) \
11:24:30                      .withColumn('lag_def_c', f.lag('c', 4, defaultVal).over(baseWindowSpec)) \
11:24:30                      .withColumn('row_num', f.row_number().over(baseWindowSpec))
11:24:30  >       assert_gpu_and_cpu_are_equal_collect(do_it, conf={'spark.rapids.sql.hasNans': 'false'})
11:24:30  
11:24:30  src/main/python/window_function_test.py:113: 
11:24:30  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:24:30  src/main/python/asserts.py:296: in assert_gpu_and_cpu_are_equal_collect
11:24:30      _assert_gpu_and_cpu_are_equal(func, True, conf=conf)
11:24:30  src/main/python/asserts.py:288: in _assert_gpu_and_cpu_are_equal
11:24:30      assert_equal(from_cpu, from_gpu)
11:24:30  src/main/python/asserts.py:86: in assert_equal
11:24:30      _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
11:24:30  src/main/python/asserts.py:38: in _assert_equal
11:24:30      _assert_equal(cpu[index], gpu[index], float_check, path + [index])
11:24:30  src/main/python/asserts.py:31: in _assert_equal
11:24:30      _assert_equal(cpu[field], gpu[field], float_check, path + [field])
11:24:30  src/main/python/asserts.py:68: in _assert_equal
11:24:30      assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
11:24:30  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
11:24:30  
11:24:30  lhs = -3.546714417938851e-160, rhs = None
11:24:30  
11:24:30  >   return lambda lhs,rhs: lhs == pytest.approx(rhs, **_approximate_float_args)
11:24:30  E   TypeError: cannot make approximate comparisons to non-numeric values: None
11:24:30  
11:24:30  src/main/python/conftest.py:24: TypeError

sameerz added the P1 (Nice to have for release) and Spark 3.1+ (Bugs only related to Spark 3.1 or higher) labels and removed the ? - Needs Triage label on Oct 27, 2020
sameerz (Collaborator) commented Nov 17, 2020

Need to mark as xfail until there is a cudf fix.

revans2 (Collaborator) commented Nov 17, 2020

It is now failing in a very different way from before. It looks like Spark changed in an incompatible way, and we are now getting failures; it appears we have to move OffsetWindowFunctionMeta into the shim layer (sketched after the stack trace below).

E                   py4j.protocol.Py4JJavaError: An error occurred while calling o60729.collectToPython.
E                   : java.lang.VerifyError: Bad type on operand stack
E                   Exception Details:
E                     Location:
E                       com/nvidia/spark/rapids/OffsetWindowFunctionMeta.<init>(Lorg/apache/spark/sql/catalyst/expressions/OffsetWindowFunction;Lcom/nvidia/spark/rapids/RapidsConf;Lscala/Option;Lcom/nvidia/spark/rapids/ConfKeysAndIncompat;)V @11: invokespecial
E                     Reason:
E                       Type 'org/apache/spark/sql/catalyst/expressions/OffsetWindowFunction' (current frame, stack[1]) is not assignable to 'org/apache/spark/sql/catalyst/expressions/Expression'
E                     Current Frame:
E                       bci: @11
E                       flags: { flagThisUninit }
E                       locals: { uninitializedThis, 'org/apache/spark/sql/catalyst/expressions/OffsetWindowFunction', 'com/nvidia/spark/rapids/RapidsConf', 'scala/Option', 'com/nvidia/spark/rapids/ConfKeysAndIncompat', '[Z' }
E                       stack: { uninitializedThis, 'org/apache/spark/sql/catalyst/expressions/OffsetWindowFunction', 'com/nvidia/spark/rapids/RapidsConf', 'scala/Option', 'com/nvidia/spark/rapids/ConfKeysAndIncompat' }
E                     Bytecode:
E                       0x0000000: b800 a63a 052a 2b2c 2d19 04b7 0056 1905
E                       0x0000010: 1008 0454 2ab2 005b 2ab7 0027 c000 29b6
E                       0x0000020: 002c 2ab7 005e bb00 6059 2ab7 0063 b600
E                       0x0000030: 67b5 0015 1905 1009 0454 2ab2 005b 2ab7
E                       0x0000040: 0027 c000 29b6 0069 2ab7 005e bb00 6059
E                       0x0000050: 2ab7 0063 b600 67b5 0019 1905 100a 0454
E                       0x0000060: 2ab2 005b 2ab7 0027 c000 29b6 006b 2ab7
E                       0x0000070: 005e bb00 6059 2ab7 0063 b600 67b5 001b
E                       0x0000080: 2a19 0510 0b04 542a b700 27c0 0029 b600
E                       0x0000090: 6bb6 0032 b200 703a 0659 c700 1b57 1906
E                       0x00000a0: c700 0c19 0510 0c04 54a7 0023 1905 100d
E                       0x00000b0: 0454 a700 4719 06b6 003e 9a00 0c19 0510
E                       0x00000c0: 0e04 54a7 0036 1905 100f 0454 b200 7bb2
E                       0x00000d0: 0080 05bd 0082 5903 2ab6 0084 5359 042a
E                       0x00000e0: b600 8653 c000 88b6 008c b600 90c0 0092
E                       0x00000f0: 1905 1010 0454 a700 34b2 007b b200 8006
E                       0x0000100: bd00 8259 032a b600 8453 5904 2ab6 0086
E                       0x0000110: 5359 052a b600 9453 c000 88b6 008c b600
E                       0x0000120: 90c0 0092 1905 1011 0454 b500 1f19 0510
E                       0x0000130: 1204 54b1                              
E                     Stackmap Table:
E                       full_frame(@172,{Object[#2],Object[#41],Object[#114],Object[#116],Object[#118],Object[#168],Object[#109]},{Object[#2]})
E                       full_frame(@181,{Object[#2],Object[#41],Object[#114],Object[#116],Object[#118],Object[#168],Object[#109]},{Object[#2],Object[#76]})
E                       same_locals_1_stack_item_frame(@198,Object[#2])
E                       same_locals_1_stack_item_frame(@204,Object[#2])
E                       same_locals_1_stack_item_frame(@249,Object[#2])
E                       full_frame(@298,{Object[#2],Object[#41],Object[#114],Object[#116],Object[#118],Object[#168],Object[#109]},{Object[#2],Object[#146]})
E                   
E                   	at com.nvidia.spark.rapids.GpuOverrides$.$anonfun$commonExpressions$16(GpuOverrides.scala:760)
E                   	at com.nvidia.spark.rapids.ReplacementRule.wrap(GpuOverrides.scala:186)
E                   	at com.nvidia.spark.rapids.GpuOverrides$.$anonfun$wrapExpr$1(GpuOverrides.scala:632)
E                   	at scala.Option.map(Option.scala:230)
E                   	at com.nvidia.spark.rapids.GpuOverrides$.wrapExpr(GpuOverrides.scala:632)
E                   	at com.nvidia.spark.rapids.BaseExprMeta.$anonfun$childExprs$2(RapidsMeta.scala:670)
E                   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
E                   	at scala.collection.immutable.List.foreach(List.scala:392)
E                   	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
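To make the shim idea concrete, here is a minimal sketch; the trait and object names are hypothetical, not the actual spark-rapids API. The point is that any source file naming OffsetWindowFunction, whose position in Spark's expression class hierarchy differs between 3.0.x and 3.1.x, gets compiled once against each Spark version, while common code only calls through a version-neutral trait and never loads the incompatible symbol itself.

```scala
// Version-neutral interface seen by the common code. Names here are
// illustrative; the real plugin wires its shims up differently.
trait SparkShims {
  // True if the given expression is a lead/lag-style offset window function.
  def isOffsetWindowFunction(expr: Any): Boolean
}

// Compiled only against Spark 3.0.x jars.
class Spark30Shims extends SparkShims {
  import org.apache.spark.sql.catalyst.expressions.OffsetWindowFunction
  override def isOffsetWindowFunction(expr: Any): Boolean =
    expr.isInstanceOf[OffsetWindowFunction]
}

// Compiled only against Spark 3.1.x jars: the same source text, but the
// reference resolves against 3.1's class hierarchy, so no VerifyError.
class Spark31Shims extends SparkShims {
  import org.apache.spark.sql.catalyst.expressions.OffsetWindowFunction
  override def isOffsetWindowFunction(expr: Any): Boolean =
    expr.isInstanceOf[OffsetWindowFunction]
}
```

The JVM only verifies the bytecode of the shim class it actually loads at runtime, so each Spark version sees code compiled against its own class hierarchy.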

mythrocks (Collaborator) commented Feb 25, 2021

> It looks like Spark changed in an incompatible way, and we are now getting failures; it appears we have to move OffsetWindowFunctionMeta into the shim layer.

Yep. OffsetWindowFunction compiled with Spark 3.0.2 is not recognized as a valid Expression in Spark 3.1.1. Moving the object construction into the shims layer sorted this out.

The original problem remained:

> Looks like the GPU is somehow returning nulls or no result.

This is the result of a change in Spark 3.1.x semantics for Lag.offset. Whereas the offset for, say, LAG(col, 5) used to be stored internally as Int(5) in 3.0.2, it was changed to Int(-5) in 3.1.x, trampling over the GpuWindowExec logic.
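As a minimal sketch of what per-version handling could look like (all names here are hypothetical; the actual change is in #1813), each Spark version's shim can translate the internally stored offset into the positive "rows back" count that the GPU window logic expects:

```scala
// Hypothetical shim trait: normalize the internally stored Lag offset to
// the positive row count that downstream GPU window code assumes.
trait LagOffsetShim {
  def rowsBack(storedOffset: Int): Int
}

// Spark 3.0.x stores LAG(col, 5) with offset 5: pass it through unchanged.
object Spark30LagOffsetShim extends LagOffsetShim {
  override def rowsBack(storedOffset: Int): Int = storedOffset
}

// Spark 3.1.x stores LAG(col, 5) with offset -5: negate it back to 5.
object Spark31LagOffsetShim extends LagOffsetShim {
  override def rowsBack(storedOffset: Int): Int = -storedOffset
}
```

With the 3.0.x assumption baked in, a 3.1.x plan hands the GPU a negated offset, which is why the results above come back as nulls instead of the lagged values.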

A fix is available in #1813.

sameerz (Collaborator) commented Feb 26, 2021

Closed by #1813

sameerz closed this as completed on Feb 26, 2021