
[BUG] left_semi_join operation is abnormal and serious time-consuming #569

Closed
chenrui17 opened this issue Aug 15, 2020 · 12 comments · Fixed by #594 or #754
Assignees
Labels
bug Something isn't working P0 Must have for release performance A performance related task/issue

Comments

@chenrui17

chenrui17 commented Aug 15, 2020

Describe the bug
I tested TPC-DS query 10 with a recent rapids-0.2 commit, and I noticed that during the reduce phase after the shuffle, several tasks are extremely time-consuming. Using nsys I found the time is spent in left_semi_join, so I suspect this is a bug.
Normally this query completes in tens of seconds, but now these tasks take 5-8 minutes. At the same time, GPU utilization is very high during this period.
[screenshots attached]

@chenrui17 chenrui17 added ? - Needs Triage Need team to review and classify bug Something isn't working labels Aug 15, 2020
@chenrui17 chenrui17 changed the title left_semi_join operation isAbnormal operation and serious time-consuming → left_semi_join operation is abnormal and serious time-consuming[BUG] Aug 15, 2020
@chenrui17
Author

[screenshot attached]
I tested in local mode; it looks like a GpuCoalesceBatches execution exception occurs first.

@jlowe jlowe added P0 Must have for release performance A performance related task/issue and removed ? - Needs Triage Need team to review and classify labels Aug 17, 2020
@jlowe
Member

jlowe commented Aug 17, 2020

This is definitely a bug. No GPU operation should take thousands of seconds as shown in the nsys trace.

@chenrui17 does this happen every time you run this query or only sometimes? Does the scale factor of the tpc-ds data matter or will this occur even with smaller scale factors?

@chenrui17
Author

> This is definitely a bug. No GPU operation should take thousands of seconds as shown in the nsys trace.
>
> @chenrui17 does this happen every time you run this query or only sometimes? Does the scale factor of the tpc-ds data matter or will this occur even with smaller scale factors?

It happens every time, and my TPC-DS scale factor is 1000.
I will make the scale factor smaller and try again.

@chenrui17
Author

> This is definitely a bug. No GPU operation should take thousands of seconds as shown in the nsys trace.
>
> @chenrui17 does this happen every time you run this query or only sometimes? Does the scale factor of the tpc-ds data matter or will this occur even with smaller scale factors?

I also tried two other scale factors. With scale factor 100, TPC-DS query 10 seems to execute normally: the total query time is 14s, with the reduce phase taking 8s. However, while most tasks take 1s, some still take 7s, as shown in the figure below.

With scale factor 10000, query 10 fails to execute successfully and gets stuck in the reduce phase mentioned above.

So even with a small amount of data, the problem still exists.
[screenshot attached]

@revans2
Collaborator

revans2 commented Aug 17, 2020

I think I have been able to reproduce this issue. I ran query 10 with a SF 200 data set encoded in parquet.

spark.sql.shuffle.partitions=1
spark.sql.files.maxPartitionBytes=1g
spark.rapids.memory.pinnedPool.size=16g
spark.rapids.sql.concurrentGpuTasks=2

8 CPU cores per GPU.

With the plugin it can take over 50 seconds to run the semi-join. If I disable just LeftSemi as a join option the query takes 6 seconds. If I run on the CPU for everything it completes in about 15 seconds.

It is producing the correct result in all cases, so it looks to be a performance related issue of some kind.

It actually looks scaling-related. When I increase the number of workers I see a very large decrease in the total runtime. It is almost as if the join is somehow being done by brute force. I'll dig into the source for it and see if I can come up with something.

@revans2
Collaborator

revans2 commented Aug 17, 2020

So I read through the code and got a profile too. It looks like building the hash table, just for the right table keys is taking about 87% of the time, and then probing the table is taking the other 13% of the time. I really don't know why. I'll try to get a repro case and file something in cudf about this.

@jlowe this looks to be bad enough that I think we want to disable both semi and anti joins (they share the same implementation), but I would like to hear your opinion on this.

@jlowe
Member

jlowe commented Aug 17, 2020

> this looks to be bad enough that I think we want to disable both semi and anti joins

I think we need to get a bit more details on how it's slow before disabling outright. IIUC semi/anti join should always be as fast or faster than a normal join in cudf. The build phase constructs a hash table just like a regular join, but instead of a multimap it's just a map. The lookup phase should also be as fast or faster since it only cares about a key hit in the table and doesn't care about any value associated with it.

It would be interesting to know if a regular join on the same input tables and same join keys is significantly faster.
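To make that point concrete, here is a hypothetical plain-Python sketch (not cudf code, and not how cudf stores its table): a left semi join only needs to test key membership on the build side, so a plain set suffices where a regular hash join would need a multimap from key to matching row indices.

```python
# Hypothetical sketch (not cudf code): a left semi join only tests whether a
# build-side key exists, so a set works where a regular hash join needs a
# multimap of key -> matching row indices.

def left_semi_join(left_rows, right_rows, key):
    build = {key(r) for r in right_rows}              # build phase: distinct keys only
    return [r for r in left_rows if key(r) in build]  # probe phase: keep rows with a hit

left = [(1, "a"), (2, "b"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z")]
print(left_semi_join(left, right, key=lambda r: r[0]))  # [(2, 'b')]
```

Note that duplicate build-side keys (the two rows with key 2) collapse into a single set entry, which is exactly why the build phase should be no slower than a regular join's.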

@JustPlay

> If I disable just LeftSemi as a join option the query takes 6 seconds

So, how can I disable the semi (and anti) join? @revans2

@revans2
Collaborator

revans2 commented Aug 18, 2020

@JustPlay I had to go in and change the code to disable just those joins. We don't have a config for that yet.

@jlowe I was able to make some progress on this. I dumped the inputs to the join and was able to reproduce the issue in c++. It has to do with null values and null equality.

TEST_F(JoinTest, LeftSemiJoinPerf)
{
  // Read the dumped join inputs back in from parquet.
  cudf::io::read_parquet_args left_args{cudf::io::source_info("/tmp/input/left_0")};
  cudf::io::table_with_metadata left = cudf::io::read_parquet(left_args);

  std::cerr << "LEFT: " << left.tbl->num_columns() << " X " << left.tbl->num_rows() << std::endl;

  cudf::io::read_parquet_args right_args{cudf::io::source_info("/tmp/input/right_0")};
  cudf::io::table_with_metadata right = cudf::io::read_parquet(right_args);

  std::cerr << "RIGHT: " << right.tbl->num_columns() << " X " << right.tbl->num_rows() << std::endl;

  // Join on the first column of each table, returning columns 0-3 of the left
  // table; nulls compare unequal, which is what Spark requires.
  auto join_table = cudf::left_semi_join(*left.tbl, *right.tbl, {0}, {0}, {0, 1, 2, 3},
                                         cudf::null_equality::UNEQUAL);

  std::cerr << "RESULT: " << join_table->num_columns() << " X " << join_table->num_rows() << std::endl;
}

On the 200GB data set, by the time we reach the left semi join there are 513,298 null key values in the right table. If I say nulls are equal (cudf::null_equality::EQUAL, which is the default), it takes about 63ms to read in both the left and right tables and do the join. If I say they are not equal, which is what Spark needs, the same thing takes about 50 seconds.

When I do the join manually in Spark I am not able to reproduce it, because Spark automatically filters all nulls from both the left and right tables before doing the join. My guess as to why it is taking so long is that cudf sees nulls as not equal, so each time it sees one it inserts it into the table as a new entry; but because they all hash to the same value, it has to walk the table looking for a free location. This degrades the hash table enough that probing it takes a lot longer too.

The fastest fix for this would be to filter out any nulls before we do the join. We should also file a follow-on issue with cudf to see if there is a better fix on their side. I am off today, so if this can wait until Thursday I can work on it; if not, feel free to put in a fix yourself and we can work with cudf on this later.
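The guessed failure mode above can be modeled in a few lines of plain Python (hypothetical, not the cudf implementation): if every null key is inserted as a distinct entry but all nulls hash to the same bucket, linear probing collapses into one ever-growing cluster and insertion cost grows quadratically with the null count.

```python
# Hypothetical model of the suspected slowdown (plain Python, not cudf):
# with null_equality::UNEQUAL every null key becomes a distinct entry, yet
# all nulls hash identically, so open addressing degenerates into a crawl
# along one giant cluster.

def insert_all(keys, capacity):
    table = [None] * capacity
    probes = 0
    for k in keys:
        i = hash(k) % capacity
        while table[i] is not None:   # linear probing for a free slot
            probes += 1
            i = (i + 1) % capacity
        table[i] = ("entry", k)
    return probes

NULL = 0  # stand-in: all nulls share one hash but each is kept as a new entry
print(insert_all([NULL] * 100, capacity=1 << 12))   # 4950 probes
print(insert_all([NULL] * 2000, capacity=1 << 12))  # 1999000 probes: ~quadratic
```

Twenty times more null keys costs roughly four hundred times more probes, which is consistent with a 63ms join blowing up to tens of seconds once half a million null keys pile into one cluster.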

@jlowe jlowe changed the title left_semi_join operation is abnormal and serious time-consuming[BUG] [BUG] left_semi_join operation is abnormal and serious time-consuming Aug 18, 2020
@jlowe
Member

jlowe commented Aug 18, 2020

@revans2 Thursday should be fine. I'm chasing some other things today, and you're already setup to reproduce the issue easily and therefore can verify the fix.

> The fastest fix for this would be to filter out any nulls before we do the join.

Agreed. I assume we'd only need to pre-filter the nulls in the build table for the hash table, thinking it would be faster for the streamed table rows to fail a hit in the hash table, that now no longer contains null entries, than it would be to manifest the non-null version of the streamed table prior to the join.
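In set terms, the asymmetry described above looks like this (hypothetical Python sketch, not the plugin code): dropping null keys from the build side alone already gives NULL != NULL semantics, because a streamed null key simply misses in a table that contains no null entries.

```python
# Hypothetical sketch: pre-filter nulls only on the build (right) side.
# A streamed None key then naturally misses, matching Spark's NULL != NULL
# join semantics without materializing a filtered copy of the left side.

def left_semi_join_null_unequal(left_keys, right_keys):
    build = {k for k in right_keys if k is not None}  # build side filtered
    return [k for k in left_keys if k in build]       # None probes just miss

print(left_semi_join_null_unequal([1, None, 2, 3], [2, None, None, 5]))  # [2]
```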

@sameerz sameerz added this to the Aug 17 - Aug 28 milestone Aug 18, 2020
@JustPlay

At least q5, q10, q35, and q69 in TPC-DS have this problem.

@revans2
Collaborator

revans2 commented Sep 11, 2020

Reopening because the fix broke correctness in some cases and had to be reverted for the 0.2 release. I'll try to get it fixed properly, but the 0.2 release is too close to do it in that time frame.

@revans2 revans2 reopened this Sep 11, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#569)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>