Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix shutdown of closed fd when c-ares opens a second fd #15435

Merged
merged 1 commit into from
Jun 7, 2018

Conversation

apolcyn
Copy link
Contributor

@apolcyn apolcyn commented May 17, 2018

After our conversation earlier, I realized we are currently doing a "use after close" on the UDP socket during the fallback to TCP case, and that the ARES_STAY_OPEN flag appears to be actually necessary rather than a simplification.

After adding the log message that this PR add's to the iomgr, the bug can currently be reproduced with:

tools/run_tests/start_port_server.py
make make resolver_component_tests_runner_invoker -j8
make make resolver_component_test -j8
export GRPC_DNS_RESOLVER=ares
bins/opt/resolver_component_tests_runner_invoker --test_bin_name=resolver_component_test > log 2>&1

Now check the log and you'll find a message of "Error shutting down fd. errno: 9" within the
test that triggers fallback to TCP.

The problem is that the UDP socket is closed by c-ares once the query finishes (during the call to ares_process_fd on the TCP socket. But then the UDP socket read closure is left hanging, and we don't really have any way of kicking it out besides calling grpc_fd_shutdown on it, which happens to call shutdown on the closed socket (shutdown returns an error of EBADFD).


update: the added GPR_ERROR log message on shutdown errors caused a bunch of @no_logging test/core/end2end tests to fail because shutdown is failing with ENOTCONN. I've updated to filter out those logs because as I know of the shutdown API, ENOTCONN can occur for reasons outside of an applications control.


Update: also removed already_closed parameter to grpc_fd_orphan

This PR wound up removing the already_closed parameter to grpc_fd_orphan because we realized that c-ares was the only caller that set already_closed to true, and it turns out that it's better for c-ares to pass a non-NULL release_fd parameter to grpc_fd_orphan than it is to pass already_closed=true, because a non-NULL release_fd parameter has better semantics (not only does it avoid closing, but e.g. under epollsig it also handles removal from the epoll set).


This change is Reviewable

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                        FILE SIZE
 ++++++++++++++ GROWING                                                                          ++++++++++++++
  +0.0%     +48 [None]                                                                              +872  +0.0%
  +2.4%     +64 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posi     +64  +2.4%
       +22%     +64 grpc_ares_ev_driver_create                                                           +64   +22%
  +0.6%     +64 src/core/lib/iomgr/ev_epollex_linux.cc                                               +64  +0.6%
       +87%     +71 fd_shutdown                                                                          +71   +87%

  +0.0%    +176 TOTAL                                                                              +1000  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@apolcyn apolcyn force-pushed the fix_shutdown_closed_socket branch 2 times, most recently from 386dc84 to 6ae80d7 Compare May 18, 2018 02:06
@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                        FILE SIZE
 ++++++++++++++ GROWING                                                                          ++++++++++++++
  +0.0%     +48 [None]                                                                              +840  +0.0%
  +2.4%     +64 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posi     +64  +2.4%
       +22%     +64 grpc_ares_ev_driver_create                                                           +64   +22%
  +0.6%     +64 src/core/lib/iomgr/ev_epollex_linux.cc                                               +64  +0.6%
       +84%     +69 fd_shutdown                                                                          +69   +84%

  +0.0%    +176 TOTAL                                                                               +968  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



1 similar comment
@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                        FILE SIZE
 ++++++++++++++ GROWING                                                                          ++++++++++++++
  +0.0%     +48 [None]                                                                              +840  +0.0%
  +2.4%     +64 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posi     +64  +2.4%
       +22%     +64 grpc_ares_ev_driver_create                                                           +64   +22%
  +0.6%     +64 src/core/lib/iomgr/ev_epollex_linux.cc                                               +64  +0.6%
       +84%     +69 fd_shutdown                                                                          +69   +84%

  +0.0%    +176 TOTAL                                                                               +968  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                        FILE SIZE
 ++++++++++++++ GROWING                                                                          ++++++++++++++
  +0.0%     +48 [None]                                                                              +840  +0.0%
  +2.4%     +64 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posi     +64  +2.4%
       +22%     +64 grpc_ares_ev_driver_create                                                           +64   +22%
  +0.6%     +64 src/core/lib/iomgr/ev_epollex_linux.cc                                               +64  +0.6%
       +84%     +69 fd_shutdown                                                                          +69   +84%

  +0.0%    +176 TOTAL                                                                               +968  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

2 similar comments
@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

2 similar comments
@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@markdroth
Copy link
Member

Thanks for catching this bug! This is what I was fumbling towards in some of my questions yesterday, but I didn't have enough of the code in my head to articulate the question clearly. But seeing this PR made me look at the code in more detail, and I think you were on the right track to begin with.


Reviewed 2 of 2 files at r1, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


src/core/lib/iomgr/ev_epollex_linux.cc, line 443 at r3 (raw file):

    if (shutdown(fd->fd, SHUT_RDWR)) {
      if (errno != ENOTCONN) {
        gpr_log(GPR_ERROR, "Error shutting down fd. errno: %d", errno);

I'd like @sreecha to confirm that this change is reasonable.


test/cpp/naming/resolver_component_test.cc, line 277 at r3 (raw file):

}

void LogAbortOnError(gpr_log_func_args* args) {

I don't think it really makes sense to test for log messages. Logs tend to be useful for humans but not very appropriate for automated testing.

Unfortunately, I don't have a better suggestion. The only way I see to test for this bug directly would be to find a way to open a new fd with the same number between the time that c-ares closes it and when we call shutdown on it, so that we can verify that's not happening. But it's not clear how to orchestrate that.


Comments from Reviewable

@apolcyn
Copy link
Contributor Author

apolcyn commented May 18, 2018

Review status: 2 of 10 files reviewed at latest revision, 2 unresolved discussions.


test/cpp/naming/resolver_component_test.cc, line 277 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think it really makes sense to test for log messages. Logs tend to be useful for humans but not very appropriate for automated testing.

Unfortunately, I don't have a better suggestion. The only way I see to test for this bug directly would be to find a way to open a new fd with the same number between the time that c-ares closes it and when we call shutdown on it, so that we can verify that's not happening. But it's not clear how to orchestrate that.

Taking a look to see if it's doable to catch it that way, maybe using a background thread that opens and closes sockets


Comments from Reviewable

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                          FILE SIZE
 ++++++++++++++ GROWING                                                                            ++++++++++++++
  +0.0%     +88 [None]                                                                             +1.30Ki  +0.0%
  +7.3%    +192 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posix.    +192  +7.3%
       +63%    +128 fd_node_destroy                                                                       +128   +63%
      [NEW]    +120 fd_node_shutdown_locked(fd_node*, char const*) [clone .isra.1] [clone .part.2]        +120  [NEW]
       +22%     +64 grpc_ares_ev_driver_create                                                             +64   +22%
      +8.3%      +7 [Unmapped]                                                                              +7  +8.3%
  +0.6%     +64 src/core/lib/iomgr/ev_epollex_linux.cc                                                 +64  +0.6%
       +84%     +69 fd_shutdown                                                                            +69   +84%

  +0.0%    +344 TOTAL                                                                              +1.55Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@apolcyn apolcyn force-pushed the fix_shutdown_closed_socket branch from 21630cd to 594e049 Compare May 18, 2018 23:38
@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                          FILE SIZE
 ++++++++++++++ GROWING                                                                            ++++++++++++++
  +0.0%     +88 [None]                                                                             +1.35Ki  +0.0%
  +7.3%    +192 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posix.    +192  +7.3%
       +63%    +128 fd_node_destroy                                                                       +128   +63%
      [NEW]    +120 fd_node_shutdown_locked(fd_node*, char const*) [clone .isra.1] [clone .part.2]        +120  [NEW]
       +22%     +64 grpc_ares_ev_driver_create                                                             +64   +22%
      +8.3%      +7 [Unmapped]                                                                              +7  +8.3%
  +0.6%     +64 src/core/lib/iomgr/ev_epollex_linux.cc                                                 +64  +0.6%
       +89%     +73 fd_shutdown                                                                            +73   +89%

  +0.0%    +344 TOTAL                                                                              +1.60Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



1 similar comment
@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                          FILE SIZE
 ++++++++++++++ GROWING                                                                            ++++++++++++++
  +0.0%     +88 [None]                                                                             +1.35Ki  +0.0%
  +7.3%    +192 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posix.    +192  +7.3%
       +63%    +128 fd_node_destroy                                                                       +128   +63%
      [NEW]    +120 fd_node_shutdown_locked(fd_node*, char const*) [clone .isra.1] [clone .part.2]        +120  [NEW]
       +22%     +64 grpc_ares_ev_driver_create                                                             +64   +22%
      +8.3%      +7 [Unmapped]                                                                              +7  +8.3%
  +0.6%     +64 src/core/lib/iomgr/ev_epollex_linux.cc                                                 +64  +0.6%
       +89%     +73 fd_shutdown                                                                            +73   +89%

  +0.0%    +344 TOTAL                                                                              +1.60Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@apolcyn
Copy link
Contributor Author

apolcyn commented May 31, 2018

heads up that there are test failures introduced in the last commit that I need to fix here

Copy link
Contributor

@sreecha sreecha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the PR description, can you please add a separate note on why you removed already_closed parameter ?

This will help us in future (just in case we realize that we do need already_closed :) )

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                          FILE SIZE
 ++++++++++++++ GROWING                                                                            ++++++++++++++
  +0.4%     +48 src/core/lib/iomgr/ev_epollex_linux.cc                                                 +48  +0.4%
      [NEW]    +177 fd_orphan                                                                             +177  [NEW]
       +89%     +73 fd_shutdown                                                                            +73   +89%
      +1.0%      +3 [Unmapped]                                                                              +3  +1.0%
  +0.6%     +16 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posix.     +16  +0.6%
       +21%    +213 grpc_ares_notify_on_event_locked                                                      +213   +21%
      [NEW]    +116 fd_node_shutdown_locked(fd_node*, char const*) [clone .isra.1] [clone .part.2]        +116  [NEW]
       +22%     +64 grpc_ares_ev_driver_create                                                             +64   +22%

 -------------- SHRINKING                                                                          --------------
  -0.3%     -32 src/core/lib/iomgr/ev_poll_posix.cc                                                    -32  -0.3%
      [DEL]    -364 fd_orphan                                                                             -364  [DEL]
  -0.2%     -16 src/core/lib/iomgr/ev_epollsig_linux.cc                                                -16  -0.2%
      [DEL]    -656 fd_orphan                                                                             -656  [DEL]
  -0.5%     -16 src/core/lib/iomgr/tcp_client_posix.cc                                                 -16  -0.5%
      -1.2%     -16 on_writable                                                                            -16  -1.2%
  -0.2%     -16 src/core/lib/iomgr/tcp_posix.cc                                                        -16  -0.2%
      -7.4%     -14 [Unmapped]                                                                             -14  -7.4%
      -2.0%      -2 tcp_unref                                                                               -2  -2.0%
  -0.3%     -16 src/core/lib/iomgr/tcp_server_posix.cc                                                 -16  -0.3%
      -6.4%     -16 deactivated_all_ports                                                                  -16  -6.4%

 -+-+-+-+-+-+-+ MIXED                                                                              +-+-+-+-+-+-+-
  -0.0%     -24 [None]                                                                                +240  +0.0%

  -0.0%     -56 TOTAL                                                                                 +208  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@apolcyn
Copy link
Contributor Author

apolcyn commented May 31, 2018

@sreecha I added a section to explain for removal of already_closed to the description

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@apolcyn
Copy link
Contributor Author

apolcyn commented May 31, 2018

test failures are unrelated, @markdroth PTAL this should be ready for another look

@apolcyn apolcyn force-pushed the fix_shutdown_closed_socket branch from 1e56d5a to fd883d5 Compare June 4, 2018 17:18
@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                          FILE SIZE
 ++++++++++++++ GROWING                                                                            ++++++++++++++
  +0.0%      +8 [None]                                                                                +240  +0.0%
  +0.4%     +48 src/core/lib/iomgr/ev_epollex_linux.cc                                                 +48  +0.4%
      [NEW]    +177 fd_orphan                                                                             +177  [NEW]
       +89%     +73 fd_shutdown                                                                            +73   +89%
      +1.0%      +3 [Unmapped]                                                                              +3  +1.0%
  +0.6%     +16 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posix.     +16  +0.6%
       +21%    +213 grpc_ares_notify_on_event_locked                                                      +213   +21%
      [NEW]    +116 fd_node_shutdown_locked(fd_node*, char const*) [clone .isra.1] [clone .part.2]        +116  [NEW]
       +22%     +64 grpc_ares_ev_driver_create                                                             +64   +22%

 -------------- SHRINKING                                                                          --------------
  -0.3%     -32 src/core/lib/iomgr/ev_poll_posix.cc                                                    -32  -0.3%
      [DEL]    -364 fd_orphan                                                                             -364  [DEL]
  -0.2%     -16 src/core/lib/iomgr/ev_epollsig_linux.cc                                                -16  -0.2%
      [DEL]    -656 fd_orphan                                                                             -656  [DEL]
  -0.5%     -16 src/core/lib/iomgr/tcp_client_posix.cc                                                 -16  -0.5%
      -1.2%     -16 on_writable                                                                            -16  -1.2%
  -0.2%     -16 src/core/lib/iomgr/tcp_posix.cc                                                        -16  -0.2%
      -7.4%     -14 [Unmapped]                                                                             -14  -7.4%
      -2.0%      -2 tcp_unref                                                                               -2  -2.0%
  -0.3%     -16 src/core/lib/iomgr/tcp_server_posix.cc                                                 -16  -0.3%
      -6.4%     -16 deactivated_all_ports                                                                  -16  -6.4%

  -0.0%     -24 TOTAL                                                                                 +208  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@markdroth
Copy link
Member

I'd prefer to see the polling change split into its own PR, since it seems potentially invasive. Thanks!


Reviewed 3 of 12 files at r13, 19 of 19 files at r14.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


src/core/lib/iomgr/ev_poll_posix.cc, line 436 at r14 (raw file):

  gpr_mu_lock(&fd->mu);
  REF_BY(fd, 1, reason); /* remove active status, but keep referenced */
  if (!has_watchers(fd)) {

What if has_watchers() is false but release_fd is non-null? Won't we incorrectly close the fd in that case?


src/core/lib/iomgr/ev_posix.h, line 102 at r14 (raw file):

   notify_on_write.
   MUST NOT be called with a pollset lock taken */
void grpc_fd_orphan(grpc_fd* fd, grpc_closure* on_done, int* release_fd,

Can we move this polling change to a separate PR? It seems invasive enough that I'd prefer to keep it separate.


test/cpp/naming/resolver_component_test.cc, line 390 at r11 (raw file):

Previously, apolcyn wrote…

One more idea: perhaps the stress test could be run as a standalone C++ test that doesn't use any of the *runner.py or *_invoker.cc files. My hesitation is that the subprocess management, health checking, and bazel vs. non-bazel compatibility requirements would cause it to become complicated though.

I think the way you have it now is good enough. None of these options are that pretty, so let's just do whatever is the least amount of work.

Thanks for looking into this.


Comments from Reviewable

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                                                                          FILE SIZE
 ++++++++++++++ GROWING                                                                            ++++++++++++++
  +0.0%      +8 [None]                                                                                +968  +0.0%
  +0.6%     +64 src/core/lib/iomgr/ev_epollex_linux.cc                                                 +64  +0.6%
       +89%     +73 fd_shutdown                                                                            +73   +89%
  +0.6%     +16 src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posix.     +16  +0.6%
       +21%    +213 grpc_ares_notify_on_event_locked                                                      +213   +21%
      [NEW]    +116 fd_node_shutdown_locked(fd_node*, char const*) [clone .isra.1] [clone .part.2]        +116  [NEW]
       +22%     +64 grpc_ares_ev_driver_create                                                             +64   +22%

  +0.0%     +88 TOTAL                                                                              +1.02Ki  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@apolcyn
Copy link
Contributor Author

apolcyn commented Jun 5, 2018

Done. I've removed this from this PR, and opened #15648 for all of that instead.


Review status: 8 of 10 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


Comments from Reviewable

@apolcyn
Copy link
Contributor Author

apolcyn commented Jun 5, 2018

Review status: 8 of 10 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


test/cpp/naming/resolver_component_test.cc, line 390 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think the way you have it now is good enough. None of these options are that pretty, so let's just do whatever is the least amount of work.

Thanks for looking into this.

Ack. sounds good, I'll leave things as is


Comments from Reviewable

@grpc-testing
Copy link

[microbenchmarks] Performance differences noted:
Benchmark                                                                                 cpu_time    real_time
----------------------------------------------------------------------------------------  ----------  -----------
BM_PumpStreamServerToClient<InProcess>/32768                                              -7%         -7%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/262144/1/1  +8%         +8%

@apolcyn
Copy link
Contributor Author

apolcyn commented Jun 5, 2018

Review status: 8 of 10 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


src/core/lib/iomgr/ev_poll_posix.cc, line 436 at r14 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

What if has_watchers() is false but release_fd is non-null? Won't we incorrectly close the fd in that case?

I believe things are still correct in this case: if release_rd is non-null, then fd->released will be set, and so close_fd_locked will actually not close the fd (the fd->released flag is consumed by the close_fd_locked function.)


Comments from Reviewable

@markdroth
Copy link
Member

Nice work!

@apolcyn
Copy link
Contributor Author

apolcyn commented Jun 7, 2018

bazel test failures: #15610
mac test failure is bins/opt/h2_compress_test simple_request under poll. It looks like bins/opt/h2_compress_test simple_request has a history of failure (under tsan).

@apolcyn apolcyn merged commit 701a4b1 into grpc:master Jun 7, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants