
Clear pending tasks in the worker when the context is canceled to avoid deadlocks in StopAndWait when tasks are queued for the worker. #62

Merged
merged 6 commits on Jun 17, 2024

Conversation

CorentinClabaut
Collaborator

closes #61

@alitto
Owner

alitto commented Jun 8, 2024

Hey @CorentinClabaut, thanks for submitting this PR 🙌
I noticed the GitHub Actions workflow was specifying an unsupported version of Go (1.15), so I pushed a fix to the master branch: 3f439a7
Do you mind rebasing this PR so that we can retry the failed pipelines 🙂?

@CorentinClabaut
Collaborator Author

Hey @alitto no problem for the PR :)
I've just merged master, let me know if something else needs to be done.

@alitto
Owner

alitto commented Jun 10, 2024

It seems a test is failing due to a race condition 🤔.
Apparently, the race condition appears at this line:

  github.com/alitto/pond.(*WorkerPool).stop.func1()
      /home/runner/work/pond/pond/pond.go:358 +0x44

I wonder if that's related to moving the closing of the tasks channel up. I'll continue digging when I get a chance.
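
For context, a minimal sketch of the kind of hazard that suspicion points at. This is deliberately buggy demonstration code, not pond's actual implementation: closing a channel while another goroutine may still send on it usually ends in a "send on closed channel" panic, and the race detector can flag it as well.

package main

import (
    "sync"
    "time"
)

// Deliberately buggy sketch (not pond's code): the tasks channel is closed
// while a producer goroutine may still be sending on it.
func main() {
    tasks := make(chan func())
    var wg sync.WaitGroup

    // Consumer: drains tasks until the channel is closed.
    wg.Add(1)
    go func() {
        defer wg.Done()
        for task := range tasks {
            task()
        }
    }()

    // Producer: keeps submitting tasks; if close() below runs first, the
    // next send panics with "send on closed channel".
    go func() {
        for i := 0; i < 1000; i++ {
            tasks <- func() {}
        }
    }()

    time.Sleep(time.Millisecond)
    close(tasks) // closed "too early", racing with the producer's sends
    wg.Wait()
}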

@CorentinClabaut
Collaborator Author

I just saw this error as well:

=== RUN   TestSubmitWithContextCancelWithIdleTasks
    pond_blackbox_test.go:582: Expected int32(1) but was int32(2)
--- FAIL: TestSubmitWithContextCancelWithIdleTasks (0.00s)

This one seems to be due to:

select {
case <-context.Done():
    ...
case task, ok := <-tasks:
    ...
}

Here, if both channels have something ready, either case can be triggered, since Go's select does not prioritize its cases (https://stackoverflow.com/questions/46200343/force-priority-of-go-select-statement).

I can push a fix for this and see if it fixes the race condition as well.
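
For reference, a common way to give cancellation priority is to check the context in a non-blocking select before the combined select. The sketch below assumes a simplified worker loop; workerLoop and its parameters are hypothetical names, not pond's actual API.

package worker

import "context"

// workerLoop is a hypothetical sketch, not pond's actual code. It gives a
// cancelled context priority over a ready task by checking it first in a
// non-blocking select.
func workerLoop(ctx context.Context, tasks <-chan func()) {
    for {
        // Non-blocking check: if the context is already cancelled, stop
        // even when a task is also ready to be received.
        select {
        case <-ctx.Done():
            return
        default:
        }

        select {
        case <-ctx.Done():
            return
        case task, ok := <-tasks:
            if !ok {
                return // tasks channel was closed
            }
            task()
        }
    }
}

Note that if the context is cancelled at the same instant a task becomes ready, the second select can still pick either case; the pre-check only guarantees that an already-cancelled context is never ignored.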

@CorentinClabaut
Collaborator Author

Hey @alitto, it should be all good now.
You were right, it was due to moving the closing of the tasks channel up.

@CorentinClabaut
Collaborator Author

Hey @alitto

The issue is now:

=== RUN   TestPurgeDuringSubmit
    pond_test.go:62: Expected int(1) but was int(0)
--- FAIL: TestPurgeDuringSubmit (0.00s)

I'm not sure how this issue could be triggered by this PR though.

Could it be that in some situations the purge might reset idleWorkerCount before we do the check in the test?

Owner

@alitto alitto left a comment


Leaving some comments, looks good overall 👍

worker.go (review comment resolved)
pond.go (review comment resolved)
@alitto
Owner

alitto commented Jun 12, 2024

> Hey @alitto
>
> The issue is now:
>
> === RUN   TestPurgeDuringSubmit
>     pond_test.go:62: Expected int(1) but was int(0)
> --- FAIL: TestPurgeDuringSubmit (0.00s)
>
> I'm not sure how this issue could be triggered by this PR though.
>
> Could it be that in some situations the purge might reset idleWorkerCount before we do the check in the test?

Mhm, I think the idleWorkerCount counter might be slower to update when running in GitHub Actions; I have the feeling I've seen this behavior before.
Adding an extra sleep after submitting the first task should help, I think:

// Submit a task to ensure at least 1 worker is started
pool.SubmitAndWait(func() {
    atomic.AddInt32(&doneCount, 1)
})

// Ensure idle worker count is updated
time.Sleep(1 * time.Millisecond)

assertEqual(t, 1, pool.IdleWorkers())
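
A slightly more robust variant of the same idea would be to poll the counter with a deadline instead of sleeping a fixed amount. A minimal sketch, where waitFor is a hypothetical helper (not part of pond's test suite) and the standard time package is assumed to be imported:

// waitFor polls a condition until it holds or the timeout elapses, which is
// less sensitive to slow CI runners than a single fixed sleep.
func waitFor(timeout time.Duration, cond func() bool) bool {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        if cond() {
            return true
        }
        time.Sleep(time.Millisecond)
    }
    return cond()
}

// Possible usage in the test, under the same assumptions:
// if !waitFor(100*time.Millisecond, func() bool { return pool.IdleWorkers() == 1 }) {
//     t.Errorf("expected 1 idle worker, got %d", pool.IdleWorkers())
// }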

@CorentinClabaut
Collaborator Author

CorentinClabaut commented Jun 17, 2024

> > Hey @alitto
> > The issue is now:
> >
> > === RUN   TestPurgeDuringSubmit
> >     pond_test.go:62: Expected int(1) but was int(0)
> > --- FAIL: TestPurgeDuringSubmit (0.00s)
> >
> > I'm not sure how this issue could be triggered by this PR though.
> > Could it be that in some situations the purge might reset idleWorkerCount before we do the check in the test?
>
> Mhm, I think the idleWorkerCount counter might be slower to update when running in GitHub Actions; I have the feeling I've seen this behavior before. Adding an extra sleep after submitting the first task should help, I think:
>
> // Submit a task to ensure at least 1 worker is started
> pool.SubmitAndWait(func() {
>     atomic.AddInt32(&doneCount, 1)
> })
>
> // Ensure idle worker count is updated
> time.Sleep(1 * time.Millisecond)
>
> assertEqual(t, 1, pool.IdleWorkers())

That makes sense. I can see that's what you did in TestSubmitToIdle. I'll add it here and in another test that seems to need it, to make sure this issue doesn't reappear later.

Owner

@alitto alitto left a comment


Thank you very much @CorentinClabaut! I will take care of fixing the codecov action, which is apparently missing a token, and then publish a new version of pond with your changes 🙂

@alitto alitto merged commit 8097a00 into alitto:master Jun 17, 2024
17 of 18 checks passed
@CorentinClabaut
Collaborator Author

Thank you for the merge @alitto :)

Successfully merging this pull request may close these issues.

Deadlock occurs when the pool parent context is canceled with queued tasks in the worker.