[BUG] assert failed test_struct_self_join and test_computation_in_grpby_columns #5286

pxLi · 2022-04-20T03:07:00Z

Describe the bug
Two integration tests failed. The failures appear to be related to recent cudf changes.

[2022-04-20T02:45:21.259Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_computation_in_grpby_columns[IGNORE_ORDER]
[2022-04-20T02:45:21.260Z] FAILED ../../src/main/python/join_test.py::test_struct_self_join[IGNORE_ORDER({'local': True})]

join_test.py::test_struct_self_join[IGNORE_ORDER({'local': True})],

[2022-04-20T02:45:21.257Z] ### COLLECT: GPU TOOK 0.4855940341949463 CPU TOOK 0.5880005359649658 ###
[2022-04-20T02:45:21.257Z] CPU OUTPUT: [Row(col=Row(name=Row(firstname='Adam ', middlename='', lastname='Green'), newname=Row(firstname='Adam ', lastname='Green')), name=Row(firstname='Adam ', middlename='', lastname='Green')), Row(col=Row(name=Row(firstname='Bob ', middlename='Middle', lastname='Green'), newname=Row(firstname='Bob ', lastname='Green')), name=Row(firstname='Bob ', middlename='Middle', lastname='Green')), Row(col=Row(name=Row(firstname='Cathy ', middlename='', lastname='Green'), newname=Row(firstname='Cathy ', lastname='Green')), name=Row(firstname='Cathy ', middlename='', lastname='Green'))]
[2022-04-20T02:45:21.258Z] GPU OUTPUT: [Row(col=Row(name=Row(firstname='Adam ', middlename=None, lastname='Green'), newname=Row(firstname='Adam ', lastname='Green')), name=Row(firstname='Adam ', middlename=None, lastname='Green')), Row(col=Row(name=Row(firstname='Bob ', middlename='Middle', lastname='Green'), newname=Row(firstname='Bob ', lastname='Green')), name=Row(firstname='Bob ', middlename='Middle', lastname='Green')), Row(col=Row(name=Row(firstname='Cathy ', middlename=None, lastname='Green'), newname=Row(firstname='Cathy ', lastname='Green')), name=Row(firstname='Cathy ', middlename=None, lastname='Green'))]
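
To see exactly where the two outputs above diverge, here is a minimal sketch that walks both results and lists every path at which the CPU value is `''` and the GPU value is `None`. It uses plain dicts and lists as stand-ins for pyspark `Row` objects so that it runs without Spark; the helper names `empty_to_null_paths` and `person` are made up for this illustration and are not part of the test suite.

```python
def empty_to_null_paths(cpu, gpu, path=()):
    """Yield every path at which the CPU value is '' and the GPU value is None."""
    if isinstance(cpu, dict) and isinstance(gpu, dict):
        for key in cpu:
            yield from empty_to_null_paths(cpu[key], gpu.get(key), path + (key,))
    elif isinstance(cpu, list) and isinstance(gpu, list):
        for index, (c, g) in enumerate(zip(cpu, gpu)):
            yield from empty_to_null_paths(c, g, path + (index,))
    elif cpu == "" and gpu is None:
        yield path


def person(first, middle):
    """Build one output row (hypothetical helper mirroring the rows in the log)."""
    name = {"firstname": first, "middlename": middle, "lastname": "Green"}
    newname = {"firstname": first, "lastname": "Green"}
    return {"col": {"name": dict(name), "newname": newname}, "name": dict(name)}


# Values copied from the CPU OUTPUT / GPU OUTPUT lines above.
cpu_rows = [person("Adam ", ""), person("Bob ", "Middle"), person("Cathy ", "")]
gpu_rows = [person("Adam ", None), person("Bob ", "Middle"), person("Cathy ", None)]

paths = list(empty_to_null_paths(cpu_rows, gpu_rows))
for p in paths:
    print(p)
```

Every reported path ends in `middlename`, the only field that holds an empty string in the input, and the row with a non-empty middle name matches on both sides.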

hash_aggregate_test.py::test_computation_in_grpby_columns[IGNORE_ORDER],

[2022-04-20T02:45:21.255Z] ### COLLECT: GPU TOOK 2.2361795902252197 CPU TOOK 0.8191990852355957 ###
[2022-04-20T02:45:21.255Z] CPU OUTPUT: [Row(substring(a, 2, 10)=None, sum(b)=306548), Row(substring(a, 2, 10)='', sum(b)=-148367), Row(substring(a, 2, 10)='a', sum(b)=198339), Row(substring(a, 2, 10)='aa', sum(b)=-213522), Row(substring(a, 2, 10)='aaa', sum(b)=462), Row(substring(a, 2, 10)='aaaa', sum(b)=-210714), Row(substring(a, 2, 10)='aaaaa', sum(b)=-48129), Row(substring(a, 2, 10)='aaaaaa', sum(b)=-137496), Row(substring(a, 2, 10)='aaaaaaa', sum(b)=-30235), Row(substring(a, 2, 10)='aaaaaaaa', sum(b)=163441), Row(substring(a, 2, 10)='aaaaaaaaa', sum(b)=206396), Row(substring(a, 2, 10)='aaaaaaaaaa', sum(b)=-807799)]
[2022-04-20T02:45:21.255Z] GPU OUTPUT: [Row(substring(a, 2, 10)=None, sum(b)=-67160), Row(substring(a, 2, 10)=None, sum(b)=306548), Row(substring(a, 2, 10)='', sum(b)=-81207), Row(substring(a, 2, 10)='a', sum(b)=198339), Row(substring(a, 2, 10)='aa', sum(b)=-213522), Row(substring(a, 2, 10)='aaa', sum(b)=462), Row(substring(a, 2, 10)='aaaa', sum(b)=-210714), Row(substring(a, 2, 10)='aaaaa', sum(b)=-48129), Row(substring(a, 2, 10)='aaaaaa', sum(b)=-137496), Row(substring(a, 2, 10)='aaaaaaa', sum(b)=-30235), Row(substring(a, 2, 10)='aaaaaaaa', sum(b)=163441), Row(substring(a, 2, 10)='aaaaaaaaa', sum(b)=206396), Row(substring(a, 2, 10)='aaaaaaaaaa', sum(b)=-807799)]
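
A quick arithmetic check on the two outputs above, as a sketch: only the rows that differ are copied from the log, and the helper name `unmatched` is made up for this illustration. The GPU-only rows (an extra `None` group and a smaller `''` group) add up to the CPU's single `''` group, which is consistent with some empty-string keys being treated as null before aggregation.

```python
from collections import Counter

# Only the rows that differ between the two runs, copied from the log above.
cpu_rows = [(None, 306548), ("", -148367)]
gpu_rows = [(None, -67160), (None, 306548), ("", -81207)]


def unmatched(left, right):
    """Rows of `left` with no identical counterpart in `right` (multiset difference)."""
    return sorted((Counter(left) - Counter(right)).elements(), key=lambda r: r[1])


gpu_only = unmatched(gpu_rows, cpu_rows)  # rows only the GPU produced
cpu_only = unmatched(cpu_rows, gpu_rows)  # rows only the CPU produced

# Compare the sum of the GPU-only partial aggregates with the CPU-only aggregate.
gpu_only_total = sum(total for _, total in gpu_only)
cpu_only_total = sum(total for _, total in cpu_only)
print(gpu_only, cpu_only, gpu_only_total == cpu_only_total)
```

The totals match (-67160 + -81207 = -148367), so no values were lost; they were only assigned to the wrong group.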

detailed log,

[2022-04-20T02:45:21.254Z] =================================== FAILURES ===================================
[2022-04-20T02:45:21.254Z] ______________________ test_computation_in_grpby_columns _______________________
[2022-04-20T02:45:21.254Z] [gw2] linux -- Python 3.8.13 /databricks/conda/envs/cudf-udf/bin/python
[2022-04-20T02:45:21.254Z] 
[2022-04-20T02:45:21.254Z]     @ignore_order
[2022-04-20T02:45:21.254Z]     def test_computation_in_grpby_columns():
[2022-04-20T02:45:21.254Z]         conf = {'spark.rapids.sql.batchSizeBytes' : '250'}
[2022-04-20T02:45:21.254Z]         data_gen = [
[2022-04-20T02:45:21.254Z]                 ('a', RepeatSeqGen(StringGen('a{1,20}'), length=50)),
[2022-04-20T02:45:21.254Z]                 ('b', short_gen)]
[2022-04-20T02:45:21.254Z] >       assert_gpu_and_cpu_are_equal_collect(
[2022-04-20T02:45:21.254Z]             lambda spark: gen_df(spark, data_gen).groupby(f.substring(f.col('a'), 2, 10)).agg(f.sum('b')),
[2022-04-20T02:45:21.254Z]             conf = conf)
[2022-04-20T02:45:21.254Z] 
[2022-04-20T02:45:21.254Z] ../../src/main/python/hash_aggregate_test.py:352: 
[2022-04-20T02:45:21.254Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-04-20T02:45:21.254Z] ../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
[2022-04-20T02:45:21.254Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2022-04-20T02:45:21.254Z] ../../src/main/python/asserts.py:439: in _assert_gpu_and_cpu_are_equal
[2022-04-20T02:45:21.254Z]     assert_equal(from_cpu, from_gpu)
[2022-04-20T02:45:21.254Z] ../../src/main/python/asserts.py:106: in assert_equal
[2022-04-20T02:45:21.254Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2022-04-20T02:45:21.254Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-04-20T02:45:21.254Z] 
[2022-04-20T02:45:21.254Z] cpu = [Row(substring(a, 2, 10)=None, sum(b)=306548), Row(substring(a, 2, 10)='', sum(b)=-148367), Row(substring(a, 2, 10)='a...aa', sum(b)=-213522), Row(substring(a, 2, 10)='aaa', sum(b)=462), Row(substring(a, 2, 10)='aaaa', sum(b)=-210714), ...]
[2022-04-20T02:45:21.254Z] gpu = [Row(substring(a, 2, 10)=None, sum(b)=-67160), Row(substring(a, 2, 10)=None, sum(b)=306548), Row(substring(a, 2, 10)='...0)='a', sum(b)=198339), Row(substring(a, 2, 10)='aa', sum(b)=-213522), Row(substring(a, 2, 10)='aaa', sum(b)=462), ...]
[2022-04-20T02:45:21.254Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7fae16be90d0>
[2022-04-20T02:45:21.254Z] path = []
[2022-04-20T02:45:21.254Z] 
[2022-04-20T02:45:21.254Z]     def _assert_equal(cpu, gpu, float_check, path):
[2022-04-20T02:45:21.255Z]         t = type(cpu)
[2022-04-20T02:45:21.255Z]         if (t is Row):
[2022-04-20T02:45:21.255Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2022-04-20T02:45:21.255Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2022-04-20T02:45:21.255Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2022-04-20T02:45:21.255Z]                 for field in cpu.__fields__:
[2022-04-20T02:45:21.255Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2022-04-20T02:45:21.255Z]             else:
[2022-04-20T02:45:21.255Z]                 for index in range(len(cpu)):
[2022-04-20T02:45:21.255Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2022-04-20T02:45:21.255Z]         elif (t is list):
[2022-04-20T02:45:21.255Z] >           assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2022-04-20T02:45:21.255Z] E           AssertionError: CPU and GPU list have different lengths at [] CPU: 12 GPU: 13
[2022-04-20T02:45:21.255Z] 
[2022-04-20T02:45:21.255Z] ../../src/main/python/asserts.py:40: AssertionError
[2022-04-20T02:45:21.255Z] ----------------------------- Captured stdout call -----------------------------
[2022-04-20T02:45:21.255Z] ### CPU RUN ###
[2022-04-20T02:45:21.255Z] ### GPU RUN ###
[2022-04-20T02:45:21.255Z] ### COLLECT: GPU TOOK 2.2361795902252197 CPU TOOK 0.8191990852355957 ###
[2022-04-20T02:45:21.255Z] CPU OUTPUT: [Row(substring(a, 2, 10)=None, sum(b)=306548), Row(substring(a, 2, 10)='', sum(b)=-148367), Row(substring(a, 2, 10)='a', sum(b)=198339), Row(substring(a, 2, 10)='aa', sum(b)=-213522), Row(substring(a, 2, 10)='aaa', sum(b)=462), Row(substring(a, 2, 10)='aaaa', sum(b)=-210714), Row(substring(a, 2, 10)='aaaaa', sum(b)=-48129), Row(substring(a, 2, 10)='aaaaaa', sum(b)=-137496), Row(substring(a, 2, 10)='aaaaaaa', sum(b)=-30235), Row(substring(a, 2, 10)='aaaaaaaa', sum(b)=163441), Row(substring(a, 2, 10)='aaaaaaaaa', sum(b)=206396), Row(substring(a, 2, 10)='aaaaaaaaaa', sum(b)=-807799)]
[2022-04-20T02:45:21.255Z] GPU OUTPUT: [Row(substring(a, 2, 10)=None, sum(b)=-67160), Row(substring(a, 2, 10)=None, sum(b)=306548), Row(substring(a, 2, 10)='', sum(b)=-81207), Row(substring(a, 2, 10)='a', sum(b)=198339), Row(substring(a, 2, 10)='aa', sum(b)=-213522), Row(substring(a, 2, 10)='aaa', sum(b)=462), Row(substring(a, 2, 10)='aaaa', sum(b)=-210714), Row(substring(a, 2, 10)='aaaaa', sum(b)=-48129), Row(substring(a, 2, 10)='aaaaaa', sum(b)=-137496), Row(substring(a, 2, 10)='aaaaaaa', sum(b)=-30235), Row(substring(a, 2, 10)='aaaaaaaa', sum(b)=163441), Row(substring(a, 2, 10)='aaaaaaaaa', sum(b)=206396), Row(substring(a, 2, 10)='aaaaaaaaaa', sum(b)=-807799)]
[2022-04-20T02:45:21.255Z] ____________________________ test_struct_self_join _____________________________
[2022-04-20T02:45:21.255Z] [gw0] linux -- Python 3.8.13 /databricks/conda/envs/cudf-udf/bin/python
[2022-04-20T02:45:21.255Z] 
[2022-04-20T02:45:21.255Z] spark_tmp_table_factory = <conftest.TmpTableFactory object at 0x7f4a624ae940>
[2022-04-20T02:45:21.255Z] 
[2022-04-20T02:45:21.255Z]     @ignore_order(local=True)
[2022-04-20T02:45:21.255Z]     def test_struct_self_join(spark_tmp_table_factory):
[2022-04-20T02:45:21.255Z]         def do_join(spark):
[2022-04-20T02:45:21.255Z]             data = [
[2022-04-20T02:45:21.255Z]                 (("Adam ", "", "Green"), "1", "M", 1000),
[2022-04-20T02:45:21.255Z]                 (("Bob ", "Middle", "Green"), "2", "M", 2000),
[2022-04-20T02:45:21.255Z]                 (("Cathy ", "", "Green"), "3", "F", 3000)
[2022-04-20T02:45:21.255Z]             ]
[2022-04-20T02:45:21.255Z]             schema = (StructType()
[2022-04-20T02:45:21.255Z]                       .add("name", StructType()
[2022-04-20T02:45:21.255Z]                            .add("firstname", StringType())
[2022-04-20T02:45:21.256Z]                            .add("middlename", StringType())
[2022-04-20T02:45:21.256Z]                            .add("lastname", StringType()))
[2022-04-20T02:45:21.256Z]                       .add("id", StringType())
[2022-04-20T02:45:21.256Z]                       .add("gender", StringType())
[2022-04-20T02:45:21.256Z]                       .add("salary", IntegerType()))
[2022-04-20T02:45:21.256Z]             df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
[2022-04-20T02:45:21.256Z]             df_name = spark_tmp_table_factory.get()
[2022-04-20T02:45:21.256Z]             df.createOrReplaceTempView(df_name)
[2022-04-20T02:45:21.256Z]             resultdf = spark.sql(
[2022-04-20T02:45:21.256Z]                 "select struct(name, struct(name.firstname, name.lastname) as newname)" +
[2022-04-20T02:45:21.256Z]                 " as col,name from " + df_name + " union" +
[2022-04-20T02:45:21.256Z]                 " select struct(name, struct(name.firstname, name.lastname) as newname) as col,name" +
[2022-04-20T02:45:21.256Z]                 " from " + df_name)
[2022-04-20T02:45:21.256Z]             resultdf_name = spark_tmp_table_factory.get()
[2022-04-20T02:45:21.256Z]             resultdf.createOrReplaceTempView(resultdf_name)
[2022-04-20T02:45:21.256Z]             return spark.sql("select a.* from {} a, {} b where a.name=b.name".format(
[2022-04-20T02:45:21.256Z]                 resultdf_name, resultdf_name))
[2022-04-20T02:45:21.256Z] >       assert_gpu_and_cpu_are_equal_collect(do_join)
[2022-04-20T02:45:21.256Z] 
[2022-04-20T02:45:21.256Z] ../../src/main/python/join_test.py:771: 
[2022-04-20T02:45:21.256Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-04-20T02:45:21.256Z] ../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
[2022-04-20T02:45:21.256Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2022-04-20T02:45:21.256Z] ../../src/main/python/asserts.py:439: in _assert_gpu_and_cpu_are_equal
[2022-04-20T02:45:21.256Z]     assert_equal(from_cpu, from_gpu)
[2022-04-20T02:45:21.256Z] ../../src/main/python/asserts.py:106: in assert_equal
[2022-04-20T02:45:21.256Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2022-04-20T02:45:21.256Z] ../../src/main/python/asserts.py:42: in _assert_equal
[2022-04-20T02:45:21.256Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2022-04-20T02:45:21.256Z] ../../src/main/python/asserts.py:35: in _assert_equal
[2022-04-20T02:45:21.256Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2022-04-20T02:45:21.256Z] ../../src/main/python/asserts.py:35: in _assert_equal
[2022-04-20T02:45:21.256Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2022-04-20T02:45:21.256Z] ../../src/main/python/asserts.py:35: in _assert_equal
[2022-04-20T02:45:21.256Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2022-04-20T02:45:21.256Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-04-20T02:45:21.256Z] 
[2022-04-20T02:45:21.256Z] cpu = '', gpu = None
[2022-04-20T02:45:21.256Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f4a6780d430>
[2022-04-20T02:45:21.256Z] path = [0, 'col', 'name', 'middlename']
[2022-04-20T02:45:21.256Z] 
[2022-04-20T02:45:21.256Z]     def _assert_equal(cpu, gpu, float_check, path):
[2022-04-20T02:45:21.256Z]         t = type(cpu)
[2022-04-20T02:45:21.256Z]         if (t is Row):
[2022-04-20T02:45:21.256Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2022-04-20T02:45:21.256Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2022-04-20T02:45:21.256Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2022-04-20T02:45:21.256Z]                 for field in cpu.__fields__:
[2022-04-20T02:45:21.256Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2022-04-20T02:45:21.256Z]             else:
[2022-04-20T02:45:21.256Z]                 for index in range(len(cpu)):
[2022-04-20T02:45:21.256Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2022-04-20T02:45:21.256Z]         elif (t is list):
[2022-04-20T02:45:21.256Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2022-04-20T02:45:21.257Z]             for index in range(len(cpu)):
[2022-04-20T02:45:21.257Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2022-04-20T02:45:21.257Z]         elif (t is tuple):
[2022-04-20T02:45:21.257Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2022-04-20T02:45:21.257Z]             for index in range(len(cpu)):
[2022-04-20T02:45:21.257Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2022-04-20T02:45:21.257Z]         elif (t is pytypes.GeneratorType):
[2022-04-20T02:45:21.257Z]             index = 0
[2022-04-20T02:45:21.257Z]             # generator has no zip :( so we have to do this the hard way
[2022-04-20T02:45:21.257Z]             done = False
[2022-04-20T02:45:21.257Z]             while not done:
[2022-04-20T02:45:21.257Z]                 sub_cpu = None
[2022-04-20T02:45:21.257Z]                 sub_gpu = None
[2022-04-20T02:45:21.257Z]                 try:
[2022-04-20T02:45:21.257Z]                     sub_cpu = next(cpu)
[2022-04-20T02:45:21.257Z]                 except StopIteration:
[2022-04-20T02:45:21.257Z]                     done = True
[2022-04-20T02:45:21.257Z]     
[2022-04-20T02:45:21.257Z]                 try:
[2022-04-20T02:45:21.257Z]                     sub_gpu = next(gpu)
[2022-04-20T02:45:21.257Z]                 except StopIteration:
[2022-04-20T02:45:21.257Z]                     done = True
[2022-04-20T02:45:21.257Z]     
[2022-04-20T02:45:21.257Z]                 if done:
[2022-04-20T02:45:21.257Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
[2022-04-20T02:45:21.257Z]                 else:
[2022-04-20T02:45:21.257Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
[2022-04-20T02:45:21.257Z]     
[2022-04-20T02:45:21.257Z]                 index = index + 1
[2022-04-20T02:45:21.257Z]         elif (t is dict):
[2022-04-20T02:45:21.257Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
[2022-04-20T02:45:21.257Z]             # so sort the items to do our best with ignoring the order of dicts
[2022-04-20T02:45:21.257Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)
[2022-04-20T02:45:21.257Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)
[2022-04-20T02:45:21.257Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
[2022-04-20T02:45:21.257Z]         elif (t is int):
[2022-04-20T02:45:21.257Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2022-04-20T02:45:21.257Z]         elif (t is float):
[2022-04-20T02:45:21.257Z]             if (math.isnan(cpu)):
[2022-04-20T02:45:21.257Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2022-04-20T02:45:21.257Z]             else:
[2022-04-20T02:45:21.257Z]                 assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2022-04-20T02:45:21.257Z]         elif isinstance(cpu, str):
[2022-04-20T02:45:21.257Z] >           assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
[2022-04-20T02:45:21.257Z] E           AssertionError: GPU and CPU string values are different at [0, 'col', 'name', 'middlename']
[2022-04-20T02:45:21.257Z] 
[2022-04-20T02:45:21.257Z] ../../src/main/python/asserts.py:84: AssertionError
[2022-04-20T02:45:21.257Z] ----------------------------- Captured stdout call -----------------------------
[2022-04-20T02:45:21.257Z] ### CPU RUN ###
[2022-04-20T02:45:21.257Z] ### GPU RUN ###
[2022-04-20T02:45:21.257Z] ### COLLECT: GPU TOOK 0.4855940341949463 CPU TOOK 0.5880005359649658 ###
[2022-04-20T02:45:21.257Z] CPU OUTPUT: [Row(col=Row(name=Row(firstname='Adam ', middlename='', lastname='Green'), newname=Row(firstname='Adam ', lastname='Green')), name=Row(firstname='Adam ', middlename='', lastname='Green')), Row(col=Row(name=Row(firstname='Bob ', middlename='Middle', lastname='Green'), newname=Row(firstname='Bob ', lastname='Green')), name=Row(firstname='Bob ', middlename='Middle', lastname='Green')), Row(col=Row(name=Row(firstname='Cathy ', middlename='', lastname='Green'), newname=Row(firstname='Cathy ', lastname='Green')), name=Row(firstname='Cathy ', middlename='', lastname='Green'))]
[2022-04-20T02:45:21.258Z] GPU OUTPUT: [Row(col=Row(name=Row(firstname='Adam ', middlename=None, lastname='Green'), newname=Row(firstname='Adam ', lastname='Green')), name=Row(firstname='Adam ', middlename=None, lastname='Green')), Row(col=Row(name=Row(firstname='Bob ', middlename='Middle', lastname='Green'), newname=Row(firstname='Bob ', lastname='Green')), name=Row(firstname='Bob ', middlename='Middle', lastname='Green')), Row(col=Row(name=Row(firstname='Cathy ', middlename=None, lastname='Green'), newname=Row(firstname='Cathy ', lastname='Green')), name=Row(firstname='Cathy ', middlename=None, lastname='Green'))]


abellina · 2022-04-20T16:28:52Z

It looks like the issue is in this diff: https://github.com/rapidsai/cudf/compare/6c79b5902d55bab599731a9bded7e89b9c4875c5..65b1cbdeda9cab57243d0a98e646c860ef86039e#diff-50ba2711690aca8e4f28d7b491373a4dd76443127c8b452a77b6c1fe2388d9e3.

There were some string changes here that could be related, so I am reverting those to confirm.

abellina · 2022-04-20T19:31:40Z

Reverting rapidsai/cudf#10673 fixes the test failures. The problem appears to be specific to joins over rows that contain empty strings; regular projections work fine.

pxLi · 2022-04-21T00:40:05Z

Thanks for looking into this! I wonder whether we could add some unit tests on the cudf JNI side so that we catch this kind of error earlier.

abellina · 2022-04-22T19:45:54Z

Quick update: here is a minimal repro case in Java. This test fails, whereas we should get back a table with a single column holding a single row with the empty string.

I'll move to working on this in cuDF.

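  @Test
  void testPartitionStrings() {
    // A single-row string column holding the empty string (valid, not null).
    try (Table t = new Table.TestBuilder().column("").build();
         // Round-trip the table through a contiguous split and take the first piece.
         ContiguousTable ct = t.contiguousSplit()[0]) {
      // The partition map sends row 0 to partition 0, out of 2 partitions.
      try (ColumnVector parts = ColumnVector.fromInts(0);
           PartitionedTable pt = ct.getTable().partition(parts, 2)) {
        ColumnVector partitioned = pt.getTable().getColumn(0);
        try (HostColumnVector hostP = partitioned.copyToHost()) {
          // Currently fails: the empty string comes back as null.
          assert(!hostP.isNull(0));
        }
      }
    }
  }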

abellina · 2022-04-22T20:07:25Z

> Thanks for looking into this! I wonder whether we could add some unit tests on the cudf JNI side so that we catch this kind of error earlier.

@pxLi I'll try, but this is a chain of things. I need a string column with an empty-string row, then I need to call contiguous split, and finally I need to call partition.

Removing rapidsai/cudf#10673 fixes the issue, and so does removing the contiguous split. If the row isn't a string, or if it is a non-empty string, everything works. It seems we are assuming that "" (a size-0 string) is null, so we are losing track of the fact that it is a valid string.
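
To illustrate the distinction being lost, here is a simplified Python model of an Arrow-style string column (concatenated bytes, row offsets, and one validity flag per row). This is only a sketch of the concept, not cuDF's implementation: the class `StringColumn` and the function `copy_assuming_empty_is_null` are hypothetical names invented for this example, and the "buggy" copy is an assumed failure mode that merely reproduces the symptom.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StringColumn:
    """Simplified string column: concatenated bytes, row offsets, validity flags."""
    chars: bytes
    offsets: List[int]  # one more entry than there are rows
    valid: List[bool]   # one flag per row; False means null

    @classmethod
    def from_values(cls, values: List[Optional[str]]) -> "StringColumn":
        chars = b""
        offsets = [0]
        valid = []
        for v in values:
            if v is not None:
                chars += v.encode("utf-8")
            offsets.append(len(chars))
            valid.append(v is not None)
        return cls(chars, offsets, valid)

    def to_values(self) -> List[Optional[str]]:
        out = []
        for i, ok in enumerate(self.valid):
            raw = self.chars[self.offsets[i]:self.offsets[i + 1]]
            out.append(raw.decode("utf-8") if ok else None)
        return out


def copy_assuming_empty_is_null(col: StringColumn) -> StringColumn:
    """Hypothetical buggy copy: rebuilds validity from sizes instead of carrying it over."""
    sizes = [col.offsets[i + 1] - col.offsets[i] for i in range(len(col.valid))]
    return StringColumn(col.chars, list(col.offsets), [size > 0 for size in sizes])


col = StringColumn.from_values(["", "Middle", None])
print(col.to_values())                               # ['', 'Middle', None]
print(copy_assuming_empty_is_null(col).to_values())  # [None, 'Middle', None]
```

In this model an empty string and a null both occupy zero bytes, so only the validity flag tells them apart. Any step that derives validity from the size turns `''` into `None`, which matches what the failing tests report.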

sameerz · 2022-04-29T04:56:19Z

@abellina is this resolved?

jlowe · 2022-04-29T13:28:23Z

> @abellina is this resolved?

Almost. The cudf change is in, but we still need to re-enable the disabled tests. I'll be posting a PR shortly.

pxLi added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Apr 20, 2022

pxLi changed the title ~~[BUG] assert failed test_struct_self_join and test_computation_in_grpby_columns in databricks runtime~~ [BUG] assert failed test_struct_self_join and test_computation_in_grpby_columns Apr 20, 2022

abellina self-assigned this Apr 20, 2022

abellina added the P0 (Must have for release) label Apr 20, 2022

jlowe mentioned this issue Apr 20, 2022

Temporarily xfail tests to restore premerge builds #5288

Merged

abellina mentioned this issue Apr 22, 2022

[BUG] corruption in string column after contig split + partition rapidsai/cudf#10717

Closed

abellina added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label Apr 25, 2022

abellina mentioned this issue Apr 25, 2022

Fix scatter for all-empty-string column case rapidsai/cudf#10724

Merged

sameerz removed the ? - Needs Triage (Need team to review and classify) label Apr 26, 2022

jlowe mentioned this issue Apr 29, 2022

Restore test_computation_in_grpby_columns and test_struct_self_join #5400

Merged

sameerz added this to the Apr 18 - Apr 29 milestone Apr 29, 2022

jlowe closed this as completed in #5400 Apr 29, 2022
