Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: investigate flakiness of parallel/test-async-wrap-getasyncid #13020

Closed
refack opened this issue May 14, 2017 · 35 comments
Closed

test: investigate flakiness of parallel/test-async-wrap-getasyncid #13020

refack opened this issue May 14, 2017 · 35 comments
Labels
arm Issues and PRs related to the ARM platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.

Comments

@refack
Copy link
Contributor

refack commented May 14, 2017

  • Version: master
  • Platform: Windows smartos15-64 ++
  • Subsystem: test,async-hooks

file:

https://github.com/nodejs/node/blob/master/test/parallel/test-async-wrap-getasyncid.js

falakings

  1. https://ci.nodejs.org/job/node-test-binary-windows/8471/
  2. https://ci.nodejs.org/job/node-test-binary-windows/8362/
  3. https://ci.nodejs.org/job/node-test-binary-windows/8402/

TAP output

2	parallel/test-async-wrap-getasyncid	
duration_ms	0.232
severity	fail
stack	
Mismatched <anonymous> function calls. Expected 1, actual 0.
    at Object.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\3\VS_VERSION\vs2015\label\win2008r2\test\parallel\test-async-wrap-getasyncid.js:155:42)
    at Module._compile (module.js:569:30)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at Function.Module.runMain (module.js:605:10)
    at startup (bootstrap_node.js:144:16)
    at bootstrap_node.js:561:3
(node:5848) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.

Hypothesis

Seems like a net.server sometimes doesn't complete it's creation before the test exits...
const server = net.createServer(common.mustCall((socket) => {

@refack refack changed the title test: investigate flakiness of test-async-wrap-getasyncid (on windows test: investigate flakiness of parallal/test-async-wrap-getasyncid (on windows) May 14, 2017
@refack refack added async_hooks Issues and PRs related to the async hooks subsystem. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform. labels May 14, 2017
@refack refack self-assigned this May 14, 2017
@refack
Copy link
Contributor Author

refack commented May 14, 2017

@refack
Copy link
Contributor Author

refack commented May 14, 2017

@addaleax @AndreasMadsen is it possible that the anonymous scope + eager gc calls, cleans up the server before it has a chance to (h)->flags = UV__HANDLE_REF?

@addaleax
Copy link
Member

As far as I understand common.mustCall(), all missed functions are listed on exit. If you were right, that would mean that the .listen() handler wouldn’t get called, but it’s not listed so it seems safe to assume it does get called?

@AndreasMadsen
Copy link
Member

The only explanation that makes sense to me, is that the client connects, sends data, and shutdown itself before the actual server received any connection. Thus the libuv loop doesn't know that there a connection incoming and will just exit the process.

We could wait until the server received its connection before running handle.shutdown() or alternatively simply remove the common.mustCall from net.createServer().

In any case, the net.createServer() should actually not call socket.end(); or socket.destroy(); since that creates a race conditon between socket.destroy(); and handle.shutdown().

@refack
Copy link
Contributor Author

refack commented May 14, 2017

safe to assume it does get called?

Yep...

We could wait until the server received its connection before running handle.shutdown()

👍 handle.shutdown() only in req.on('end')
I'm on it.

@AndreasMadsen
Copy link
Member

handle.shutdown() only in req.on('end')

We need to wait for both server.on('connection') and wreq.oncomplete. In particular, req.on('end') will not happen at all before handle.shutdown().

@AndreasMadsen
Copy link
Member

AndreasMadsen commented May 24, 2017

@refack was this issue solved? I don't remember.

@refack
Copy link
Contributor Author

refack commented May 24, 2017

I think someone somewhere said something about #12854 might solve this... 🤔
Anyway I did not see it flake in a while.
Running another stress https://ci.nodejs.org/job/node-stress-single-test/nodes=win10/1232/

@AndreasMadsen
Copy link
Member

I think the script path was spelled wrong, rerunning: https://ci.nodejs.org/job/node-stress-single-test/1240/

@refack
Copy link
Contributor Author

refack commented May 29, 2017

✔️ More rigorous stress (-j 16 parallel/test-async-wrap-getasyncid X 100): https://ci.nodejs.org/job/node-stress-single-test/1242/nodes=win10/

@refack
Copy link
Contributor Author

refack commented May 29, 2017

Closing

@refack refack closed this as completed May 29, 2017
@refack
Copy link
Contributor Author

refack commented Jun 1, 2017

Hello form smartos15-64

	9	parallel/test-async-wrap-getasyncid	
duration_ms	0.627
severity	fail
stack	
Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
    at Object.exports.mustCall (/home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos15-64/test/common/index.js:483:10)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos15-64/test/parallel/test-async-wrap-getasyncid.js:162:42)
    at Module._compile (module.js:569:30)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at Function.Module.runMain (module.js:605:10)
    at startup (bootstrap_node.js:158:16)
(node:53941) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.

@refack refack reopened this Jun 1, 2017
@addaleax addaleax changed the title test: investigate flakiness of parallal/test-async-wrap-getasyncid (on windows) test: investigate flakiness of parallel/test-async-wrap-getasyncid (on windows) Jun 1, 2017
@addaleax addaleax changed the title test: investigate flakiness of parallel/test-async-wrap-getasyncid (on windows) test: investigate flakiness of parallel/test-async-wrap-getasyncid Jun 1, 2017
@addaleax addaleax added async_wrap and removed windows Issues and PRs related to the Windows platform. async_hooks Issues and PRs related to the async hooks subsystem. labels Jun 1, 2017
@refack refack removed their assignment Jun 1, 2017
@joyeecheung
Copy link
Member

I am getting a timeout of this test today on MacOS 10.12.4 and can reproduce this throughout the day (I have not rebooted my machine yet):

▶ tools/test.py parallel/test-async-wrap-getasyncid
=== release test-async-wrap-getasyncid ===
Path: parallel/test-async-wrap-getasyncid
(node:28564) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.
Command: out/Release/node /Users/joyee/projects/node/test/parallel/test-async-wrap-getasyncid.js
--- TIMEOUT ---
[01:00|% 100|+   0|-   1]: Done

@refack
Copy link
Contributor Author

refack commented Jun 16, 2017

I am getting a timeout of this test today on MacOS 10.12.4 and can reproduce this throughout the day (I have not rebooted my machine yet):

Do you have a way of knowing which case without breaking the spell? (especially interested if it's the one starting at L160)

@tniessen
Copy link
Member

node-test-binary-windows > #9523 > 0,vs2015,win2008r2:

not ok 6 parallel/test-async-wrap-getasyncid
  ---
  duration_ms: 0.272
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
        at Object.exports.mustCall (c:\workspace\node-test-binary-windows\RUN_SUBSET\0\VS_VERSION\vs2015\label\win2008r2\test\common\index.js:486:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\0\VS_VERSION\vs2015\label\win2008r2\test\parallel\test-async-wrap-getasyncid.js:162:42)
        at Module._compile (module.js:569:30)
        at Object.Module._extensions..js (module.js:580:10)
        at Module.load (module.js:503:32)
        at tryModuleLoad (module.js:466:12)
        at Function.Module._load (module.js:458:3)
        at Function.Module.runMain (module.js:605:10)
        at startup (bootstrap_node.js:158:16)
    (node:3320) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.
  ...

@refack refack added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Jun 30, 2017
@joyeecheung
Copy link
Member

joyeecheung commented Jul 5, 2017

@refack Bumped into this one again, this line

dns.resolve6('::1', () => {})

causes a hang on my machine when ipv6 is disabled automatically since my current ISP does not support that. I guess I could not stably reproduce this before because I keep moving around...

BTW dig ::1 AAAA hangs on my machine as well

@refack
Copy link
Contributor Author

refack commented Jul 12, 2017

@refack We could start out by extracting the offending test to a separate file. Generally speaking, I actually don't like having this many logically separate tests in one file, it makes it hard to debug.

I'll get to it after my nap 😪

@refack
Copy link
Contributor Author

refack commented Jul 13, 2017

@refack We could start out by extracting the offending test to a separate file. Generally speaking, I actually don't like having this many logically separate tests in one file, it makes it hard to debug.

I'll get to it after my nap 😪

Well I napped on it, and I'm now thinking there's probably a non-async-hooks bug somewhere in there, so I'll copy the flaky block to a new file, as I don't want to make this disappear without a solution. Maybe do a lite pummel in the new test (loop a 1000 times or something)

@Trott Trott added the smartos Issues and PRs related to the SmartOS platform. label Jul 13, 2017
@Trott
Copy link
Member

Trott commented Jul 13, 2017

On CI, with this recent recurrence (perhaps since disabling snapshots?), we're (so far) only seeing it on SmartOS, right?

/cc @nodejs/platform-smartos

@refack
Copy link
Contributor Author

refack commented Jul 13, 2017

On CI, with this recent recurrence (perhaps since disabling snapshots?), we're (so far) only seeing it on SmartOS, right?

Nope. macOS fail is post snapshotpocalypse https://ci.nodejs.org/job/node-test-commit-osx/11049/nodes=osx1010/
/cc @nodejs/platform-macos

@misterdjules
Copy link

@refack @Trott I can reproduce on SmartOS, will look into this. I would be surprised if I could make any significant progress on this before Monday though, is that ok?

@refack
Copy link
Contributor Author

refack commented Jul 13, 2017

@misterdjules of course. Even Amazing. This issue has been there for a 2 months with relatively low frequency of failures. And we're stumped.

@misterdjules
Copy link

From a quick look, it seems to me that there is a race between the client and the server.

In the test's current form, the server is closed by the client, because I would think the test currently assumes that if the client successfully wrote data on the client connection and closed it, then the server must have called its connection callback.

However, it is not necessarily the case, and I would think the server instance should be closed only when we can be sure that the server ran its connection callback which is...in the connection callback.

I'm currently stress testing the following patch:

diff --git a/test/parallel/test-async-wrap-getasyncid.js b/test/parallel/test-async-wrap-getasyncid.js
index 6b9d33b..778ca4c 100644
--- a/test/parallel/test-async-wrap-getasyncid.js
+++ b/test/parallel/test-async-wrap-getasyncid.js
@@ -160,6 +160,7 @@ if (common.hasCrypto) {
   const stream_wrap = process.binding('stream_wrap');
   const tcp_wrap = process.binding('tcp_wrap');
   const server = net.createServer(common.mustCall((socket) => {
+    server.close();
     socket.on('data', (x) => {
       socket.end();
       socket.destroy();
@@ -176,7 +177,6 @@ if (common.hasCrypto) {
 
     sreq.oncomplete = common.mustCall(() => {
       handle.close();
-      server.close();
     });
 
     wreq.handle = handle;

on SmartOS and I'll report results back. In the meantime I'd appreciate initial feedback on the analysis above.

@misterdjules
Copy link

I'll report results back.

parallel/test-async-wrap-getasyncid has been running with the change mentionned above successfully for > 12 hours without any failure on the same SmartOS VM on which I was able to consistently reproduce the problem < 1 hour. I'll submit a PR asap, which might be no sooner than Monday (today is a day off for me).

misterdjules pushed a commit to misterdjules/node-1 that referenced this issue Jul 17, 2017
The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once the server's connection
callback has been called, which guarantees that the server can accept
connections until it actually accepted the single connection that this
test requires to accept, making the test not flaky.

Fixes: nodejs#13020
@misterdjules
Copy link

PR submitted at #14329.

@misterdjules
Copy link

FWIW, I ran parallel/test-async-wrap-getasyncid with the changes in #14329 for > 48 hours without a single failure on SmartOS.

@misterdjules
Copy link

Fixed by 2da1af0.

misterdjules pushed a commit that referenced this issue Jul 25, 2017
The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once the server's connection
callback has been called, which guarantees that the server can accept
connections until it actually accepted the single connection that this
test requires to accept, making the test not flaky.

PR-URL: #14329
Fixes: #13020
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Wyatt Preul <wpreul@gmail.com>
addaleax pushed a commit that referenced this issue Jul 27, 2017
The test used to assume that if the client successfully writes data to
the server and closes the connection, then the server must have received
that data and ran its connection callback wrapped by common.mustCall.

However, it is not necessarily the case, and as a result the test was
flaky.

With this change, the server is closed only once the server's connection
callback has been called, which guarantees that the server can accept
connections until it actually accepted the single connection that this
test requires to accept, making the test not flaky.

PR-URL: #14329
Fixes: #13020
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Wyatt Preul <wpreul@gmail.com>
@refack
Copy link
Contributor Author

refack commented Oct 7, 2017

Windows VS2017 is back.
https://ci.nodejs.org/job/node-test-binary-windows/11616/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=2/console

not ok 479 sequential/test-async-wrap-getasyncid
  ---
  duration_ms: 0.215
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED
  ...

@refack refack reopened this Oct 7, 2017
@refack refack removed the smartos Issues and PRs related to the SmartOS platform. label Oct 7, 2017
@refack
Copy link
Contributor Author

refack commented Oct 11, 2017

And ARM:

not ok 263 sequential/test-async-wrap-getasyncid
  ---
  duration_ms: 1.736
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (/home/iojs/build/workspace/node-test-binary-arm/test/common/index.js:485:10)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-binary-arm/test/sequential/test-async-wrap-getasyncid.js:19:35)
        at Module._compile (module.js:600:30)
        at Object.Module._extensions..js (module.js:611:10)
        at Module.load (module.js:521:32)
        at tryModuleLoad (module.js:484:12)
        at Function.Module._load (module.js:476:3)
        at Function.Module.runMain (module.js:641:10)
        at startup (bootstrap_node.js:187:16)
    (node:1457) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.
    (node:1457) [INSPECTOR_ASYNC_STACK_TRACES_NOT_AVAILABLE] Warning: Warning: Async stack traces in debugger are not available on 32bit platforms. The feature is disabled.
  ...

https://ci.nodejs.org/job/node-test-binary-arm/10781/RUN_SUBSET=1,label=pi2-raspbian-wheezy/console

@misterdjules
Copy link

@refack I'm thinking that it might be beneficial to open separate issues for these 2 new occurrences. The reason is that the windows failure is a crash, and so seems like a different problem than the original issue.

The failure on ARM seems more similar to the original issue, but it's still different (2 actual calls vs 0 actual call in the original issue), and so the cause might be different.

Being able to discuss these separate problems in separate issues could help make discussions less confusing.

@BridgeAR BridgeAR added freebsd Issues and PRs related to the FreeBSD platform. windows Issues and PRs related to the Windows platform. arm Issues and PRs related to the ARM platform. and removed freebsd Issues and PRs related to the FreeBSD platform. labels Dec 29, 2017
@refack refack removed their assignment Oct 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arm Issues and PRs related to the ARM platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants