
[SPARK-23182][CORE] Allow enabling TCP keep alive on the RPC connections #20512

Closed
wants to merge 1 commit

Conversation

peshopetrov
Contributor

@peshopetrov peshopetrov commented Feb 5, 2018

What changes were proposed in this pull request?

Make it possible for the master to enable TCP keep alive on the RPC connections with clients.

How was this patch tested?

Manually tested.

Added the following:

spark.rpc.io.enableTcpKeepAlive  true

to spark-defaults.conf.

Observed the following on the Spark master:

$ netstat -town | grep 7077
tcp6       0      0 10.240.3.134:7077       10.240.1.25:42851       ESTABLISHED keepalive (6736.50/0/0)
tcp6       0      0 10.240.3.134:44911      10.240.3.134:7077       ESTABLISHED keepalive (4098.68/0/0)
tcp6       0      0 10.240.3.134:7077       10.240.3.134:44911      ESTABLISHED keepalive (4098.68/0/0)

This shows that the keep-alive setting is taking effect.

It's currently possible to enable TCP keep-alive on the worker / executor side, but it is not possible to configure it on other RPC connections, and it's unclear to me why that is the case. Keep-alive matters most for the master, to protect it against suddenly departing workers / executors, so I think it's very important to have it. In particular, this makes the master resilient when using preemptible worker VMs in GCE. GCE has the concept of shutdown scripts, but it doesn't guarantee to execute them, so workers often don't get shut down gracefully and the TCP connections on the master linger because there's nothing to close them. Hence the need to enable keep-alive.

This enables keep-alive on connections besides the master's connections, but that shouldn't cause harm.
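For readers who want to see roughly what such a flag amounts to at the socket layer, here is a minimal, hypothetical sketch (the class and method names are illustrative, not the actual patch) of applying a keep-alive flag to a Netty server bootstrap, the kind of server Spark's transport layer is built on:

```java
// Hypothetical illustration, not the actual Spark patch: apply a keep-alive
// flag to a Netty ServerBootstrap so accepted connections get SO_KEEPALIVE.
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;

public final class KeepAliveServerSketch {
  /**
   * Sets SO_KEEPALIVE as a child option so it applies to every accepted
   * connection, which is what prevents half-dead worker sockets from
   * lingering on the master after an ungraceful shutdown.
   */
  public static void applyKeepAlive(ServerBootstrap bootstrap, boolean enableTcpKeepAlive) {
    if (enableTcpKeepAlive) {
      bootstrap.childOption(ChannelOption.SO_KEEPALIVE, true);
    }
  }
}
```

With SO_KEEPALIVE set, the OS periodically probes idle connections and eventually closes those whose peer never answers, which is exactly the lingering-connection symptom described above.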

@jerryshao
Contributor

Is it possible that TCP keepalive is disabled by the kernel, so that your approach would not work? I was also wondering whether it would be better to add an application-level heartbeat message to detect lost workers.

@peshopetrov
Contributor Author

For completeness, it should be possible to enable OS-level TCP keep-alives. The client already enables TCP keepalive on its side, and it should be possible on the server too.

However, independent of that, it perhaps also makes sense to have application-level heartbeats, because it doesn't seem possible to tune the TCP keepalive timeouts from within the JVM.
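To illustrate the timeout point: the classic java.net.Socket API only exposes an on/off switch for keep-alive, so the probe timing has to come from the operating system (on Linux, the net.ipv4.tcp_keepalive_* sysctls). A minimal sketch, with a placeholder host and port:

```java
// Minimal illustration: the standard Java socket API can only toggle
// keep-alive on or off; the idle time, probe interval, and probe count come
// from OS settings (e.g. net.ipv4.tcp_keepalive_time / _intvl / _probes on Linux).
import java.io.IOException;
import java.net.Socket;

public final class KeepAliveSocketSketch {
  public static void main(String[] args) throws IOException {
    try (Socket socket = new Socket("example.com", 80)) {  // placeholder endpoint
      socket.setKeepAlive(true);                           // on/off only; no interval knob here
      System.out.println("keep-alive enabled: " + socket.getKeepAlive());
    }
  }
}
```

This limitation is one reason application-level heartbeats remain attractive: they can be tuned per application rather than per host.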

@peshopetrov
Contributor Author

Any update?
We have rolled out our Spark clusters with this change and it seems to be working great. We see no lingering connections on the masters.

@vundela

vundela commented Apr 24, 2018

cc @squito @vanzin
Can you please comment on this PR?

@squito
Contributor

squito commented Apr 24, 2018

This is just far enough outside my expertise that I don't have an opinion -- but @zsxwing might have some thoughts.

@srowen
Member

srowen commented Dec 16, 2018

Is there any downside to enabling keepalive? Should it be on by default, or always? Seems OK to me.

@SparkQA

SparkQA commented Dec 17, 2018

Test build #4475 has finished for PR 20512 at commit c5e2d98.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@peshopetrov
Contributor Author

I don't think there is, but I didn't want to change the default behavior.

@srowen
Member

srowen commented Dec 17, 2018

Ignore the test failure for now, I think it's unrelated.
For Spark 3, I think we can make the default 'true' and leave this as a safety valve, since this is probably what people will want to set anyway.
For back-porting to Spark 2.x, we could then open an identical PR that sets it to false.

@SparkQA

SparkQA commented Dec 22, 2018

Test build #4483 has finished for PR 20512 at commit c5e2d98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

I took another look at the code, and as you say, it's already enabled in TransportClientFactory.createClient, always. It also sets things like TCP_NODELAY. I think this should at least be on by default, or could be, and I am not sure it's worth another config. I'm thinking of consistency here. @rxin wrote that part of the client code a long time ago.
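For context, the client-side settings being referred to are presumably applied through the standard Netty bootstrap options; a hedged sketch of that pattern (illustrative only, not copied from TransportClientFactory):

```java
// Hedged sketch of typical Netty client bootstrap options of the kind
// discussed here (illustrative; not the actual TransportClientFactory code).
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;

public final class ClientOptionsSketch {
  public static void applyOptions(Bootstrap bootstrap, int connectTimeoutMs) {
    bootstrap
        .option(ChannelOption.TCP_NODELAY, true)        // disable Nagle's algorithm for low latency
        .option(ChannelOption.SO_KEEPALIVE, true)       // keep-alive already on for outbound connections
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeoutMs);
  }
}
```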

Contributor

@squito squito left a comment

I took another look at this. It's worth noting that this change is much broader than what is suggested in the summary -- it's changing the setting for every server, e.g. the driver's RPC server, the external shuffle server, etc.

I'm not a real expert here, but I think turning on keep-alive is probably fine. The downside is a few more messages sent over idle connections to make sure they're alive, and perhaps false positives in cases where one end is just under heavy load.

@srowen you mentioned NODELAY as well -- that seems like a bigger change to me, one I'd investigate more before changing (maybe others have enough experience to say more definitively whether it's OK to flip; I'm just saying that I myself would proceed more cautiously on that one).

```
@@ -172,6 +174,14 @@ public boolean verboseMetrics() {
    return conf.getBoolean(SPARK_NETWORK_VERBOSE_METRICS, false);
  }

  /**
   * Whether to enable TCP keep-alive. If true, the TCP keep-alives are enabled, which removes
   * connections that are idle for too long.
```

Not really accurate -- it sends a message when the connection is idle, but if the connection is still up then it leaves the connection up. It's only if the keep-alive message doesn't get ack'ed that the connection is removed. I'd either say "... to detect broken connections" OR just leave it at "whether to enable TCP keep-alive", as it's perhaps best explained by easily searchable external references.

@rxin
Contributor

rxin commented Jan 2, 2019

I pinged some people offline who know more about networking to take a look at this. It might take a while before they can due to holiday schedule though.

@coderplay
Contributor

coderplay commented Jan 8, 2019

Clarification: I am not a Spark expert; I just got an invitation from @rxin because he thinks I have some knowledge about Linux TCP.

Generally, the patch looks good to me for the purpose of preventing inactivity from disconnecting the channel. But from the diff, it looks like this commit will impact other RPCs and the shuffle transport as well, not only the master RPCs as the title declares.

Min

@srowen
Member

srowen commented Jan 8, 2019

I see @peshopetrov, could you maybe narrow the scope of the change to only affect the master RPCs? That seems OK, and seems like your intent.

@peshopetrov
Contributor Author

peshopetrov commented Jan 8, 2019

TCP keepalive will be disabled unless explicitly set per transport type. E.g.:

spark.rpc.io.enableTcpKeepAlive      true
spark.shuffle.io.enableTcpKeepAlive  true

We actually set both, but if only spark.rpc.io.enableTcpKeepAlive is set then it only applies to RPCs, or am I missing something?
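For readers wondering how one flag can apply per transport: Spark's transport configs are scoped by a module name, which is how spark.rpc.io.* and spark.shuffle.io.* get read independently. A simplified, hypothetical sketch of that pattern (not Spark's actual TransportConf code):

```java
// Simplified, hypothetical sketch of module-scoped transport configuration:
// each transport ("rpc", "shuffle", ...) resolves its own "spark.<module>.io.*"
// keys, so enabling spark.rpc.io.enableTcpKeepAlive affects only the RPC transport.
import java.util.Map;

public final class ModuleScopedConfSketch {
  private final String module;                 // e.g. "rpc" or "shuffle"
  private final Map<String, String> settings;  // backing configuration

  public ModuleScopedConfSketch(String module, Map<String, String> settings) {
    this.module = module;
    this.settings = settings;
  }

  private String key(String suffix) {
    return "spark." + module + "." + suffix;   // e.g. "spark.rpc.io.enableTcpKeepAlive"
  }

  /** Mirrors the getter pattern shown in the diff above; defaults to false. */
  public boolean enableTcpKeepAlive() {
    return Boolean.parseBoolean(settings.getOrDefault(key("io.enableTcpKeepAlive"), "false"));
  }
}
```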

@srowen
Member

srowen commented Jan 9, 2019

@peshopetrov it looks like the TransportServer is used in two implementations of the shuffle service as well as the block transfer service, in addition to the RPC server. I think that's what people are getting at.

I don't see an objection to setting keep-alive on all of these connections, so one resolution is just to change the JIRA/PR to reflect the fact that it's going to affect more than RPCs.

Another resolution is to add new arguments or configs to make only the RPC server enable this new setting. That could be fine too.

I personally would favor the first option; just update the description. I don't see an argument that there's substantial downside to enabling this for all of these server connections.

@rxin
Contributor

rxin commented Jan 9, 2019 via email

@srowen
Member

srowen commented Jan 9, 2019

My understanding is that there is a benefit for the master's RPC connections, as described in the PR description. The benefit or cost for other server connections is unclear; probably no harm other than a few extra messages on slow connections, and probably no real upside.

The argument for this change as-is is that it addresses the RPC master connection and just applies the setting consistently elsewhere, where it's probably neither positive nor negative.

There is also an argument for making this change a little more complex to affect only the master RPCs, as described in the PR currently.

I could live with either one.

@srowen
Member

srowen commented Jan 12, 2019

@rxin just to finish this off, which argument is more compelling to you? I tend to favor making this change, but could live with a variation that is narrower too.

@coderplay
Contributor

coderplay commented Jan 12, 2019

TCP keep-alive is non-invasive; the only minor downside is that it generates extra network packets. But considering Spark's big-data exchange use cases, these packets are negligible. I am good with this one config ruling all transport types; just please change the title / commit message.

@rxin
Contributor

rxin commented Jan 12, 2019 via email

@srowen srowen changed the title [SPARK-23182][CORE] Allow enabling TCP keep alive on the master RPC connections. [SPARK-23182][CORE] Allow enabling TCP keep alive on the RPC connections Jan 12, 2019
@SparkQA

SparkQA commented Jan 13, 2019

Test build #4506 has finished for PR 20512 at commit c5e2d98.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2019

Test build #4509 has finished for PR 20512 at commit c5e2d98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 13, 2019

Merged to master

@srowen srowen closed this in c01152d Jan 13, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#20512 from peshopetrov/master.

Authored-by: Petar Petrov <petar.petrov@leanplum.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>