
Add Spark multi-user support for standalone mode #750

Closed
wants to merge 1 commit into from

Conversation

jerryshao
Contributor

This patch adds multi-user support for standalone mode.

Currently Spark's ExecutorBackend does not distinguish the user who submits the app; instead it uses the user who started the Spark cluster to communicate with HDFS, which introduces file permission issues. This patch addresses the issue in two ways (a sketch follows the list):

  1. ExecutorBackend uses the app's user to access HDFS, which keeps file permissions consistent with the app's user.
  2. On secure HDFS, the client driver obtains a delegation token and distributes it to the cluster; ExecutorBackend uses the delegation token to authenticate and access HDFS.
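
To illustrate the intended mechanism (a hedged sketch, not the patch's exact code): executor-side HDFS access runs under the submitting user via Hadoop's UserGroupInformation, with the delegation token assumed to arrive in a hypothetical SPARK_HDFS_TOKEN environment variable.

import java.security.PrivilegedExceptionAction

import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.security.token.{Token, TokenIdentifier}

// Sketch only: "SPARK_HDFS_TOKEN" is a hypothetical variable name.
def runAsAppUser(appUser: String)(work: => Unit) {
  val ugi = UserGroupInformation.createRemoteUser(appUser)
  if (UserGroupInformation.isSecurityEnabled) {
    // On secure HDFS, attach the delegation token shipped by the driver so
    // the executor can authenticate as the app's user.
    Option(System.getenv("SPARK_HDFS_TOKEN")).foreach { encoded =>
      val token = new Token[TokenIdentifier]()
      token.decodeFromUrlString(encoded)
      ugi.addToken(token)
    }
  }
  // Everything inside `work` now runs with the app user's identity.
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    def run() { work }
  })
}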

I've tested it on CDH 4.1.2 and Hadoop 1.0.4, with and without security enabled, but I cannot cover other versions.

This patch does not solve the delegation token renewal issue when communicating with secure Hadoop; for long-running apps like Spark Streaming and Shark Server, access will fail once the delegation token expires.

Any advice about a renewal mechanism is really appreciated; I will add it to this patch.

Thanks
Jerry

@AmplabJenkins

Thank you for your pull request. An admin will review this request soon.

@markhamstra
Contributor

There's a lot of code duplication among the hadoop1, hadoop2 and yarn utils. Can you DRY this out, please?

@andyk
Member

andyk commented Jul 30, 2013

Jenkins, ok to test.

@jerryshao
Contributor Author

Hi @markhamstra, thanks for your advice. Spark uses conditional compilation to handle hadoop1, hadoop2, and yarn separately, so if I extract the duplicated code, a common package is needed, but where to put that package is a problem. My concern is that adding it to Spark core may pollute Spark with different versions of Hadoop. If you have a good solution, please let me know.

Thanks
Jerry

@velvia
Contributor

velvia commented Aug 2, 2013

Hi Jerry, I believe someone suggested using the hadoop-client jar to abstract away talking to different versions.
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client/1.0.1

It seems this is affecting packaging and other issues, so maybe we should accelerate this work.

@jerryshao
Contributor Author

Hi @velvia, could you please be more specific? I don't quite follow your meaning.

def runAsUser(func: (Product) => Unit, args: Product, user: String) {
  val ugi = UserGroupInformation.createRemoteUser(user)
  if (UserGroupInformation.isSecurityEnabled) {
    Option(System.getenv(HDFS_TOKEN_KEY)) match {


An environment variable isn't very secure for passing the token, as anyone on that machine could simply run ps and grab it. Perhaps this is OK for a first cut, especially if you are limiting access to the machines, but I think this will eventually need to be made more secure.


Sorry, I was wrong: environment variables are readable only by the user owning the process.

@tgravescs

@velvia I also don't follow what you are asking. The Hadoop API has changed between versions. They have backpedaled a bit in the latest Hadoop 2 releases (2.1.0-beta) to try to make most of the APIs compatible with Hadoop 1, so perhaps once Spark moves to a later version of Hadoop 2, some of these shims will no longer be needed.

@velvia
Contributor

velvia commented Aug 5, 2013

Ok, let me try to be more clear.

Currently, Spark is compiled directly against a specific version of a Hadoop jar. As you discovered, this leads to problems because you have to manually recompile Spark against different Hadoop versions.

There has been talk recently that we should try building against the hadoop-client jar. This may allow us to have a Hadoop-version-independent build of Spark, so that you no longer need to build against a specific Hadoop version. It would also remove a huge chain of dependencies from the distribution.

I personally don't have experience with hadoop-client, so I can't vouch for whether it would work, but it's worth trying.
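
For concreteness, a hedged sketch of what that could look like in SparkBuild.scala; the hadoop.version property name is an assumption for illustration, not an existing build setting:

// Hypothetical fragment: depend on hadoop-client and let the version be
// supplied at build time instead of hard-coding one Hadoop jar.
val hadoopVersion = sys.props.getOrElse("hadoop.version", "1.0.4")

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion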


@tgravescs

The hadoop2-yarn profile definitely uses APIs that do not exist in Hadoop 1. I'm not sure about the hadoop1 and hadoop2 profiles.

@AmplabJenkins

Thank you for your pull request. An admin will review this request soon.

@jerryshao
Contributor Author

@velvia and @tgravescs, thanks for your comments.

Passing the token between parent and child processes via an environment variable is a simple and easy-to-implement approach for now, but as you said it is not fully secure: people can read the token under /proc/. However, that requires logging in to the machine as the same user as the process owner, so this was my compromise implementation (a sketch of the handoff follows).
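
For reference, a minimal sketch of that worker-to-executor handoff using Hadoop's Token encode/decode helpers; SPARK_HDFS_TOKEN is a stand-in name, not the constant used in the patch:

import org.apache.hadoop.security.token.{Token, TokenIdentifier}

// Parent (worker) side: serialize the delegation token into the child
// executor's environment; the child decodes it before calling doAs().
def launchExecutorWithToken(command: Seq[String],
                            token: Token[_ <: TokenIdentifier]): Process = {
  val builder = new ProcessBuilder(command: _*)
  // encodeToUrlString() gives a text-safe encoding of the token bytes,
  // visible under /proc/<pid>/environ only to the same user (or root).
  builder.environment().put("SPARK_HDFS_TOKEN", token.encodeToUrlString())
  builder.start()
}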

As for the hadoop-client jar, I'm not familiar with it and am not sure it can solve API compatibility across versions, so I will investigate whether it is feasible.

@mateiz
Member

mateiz commented Aug 6, 2013

@jerryshao it looks like the environment var can only be read by that job's user, according to @tgravescs's last comment, so hopefully that's fine?

Jey Kottalam from the AMP Lab has been working on a hadoop-client version of Spark's build system, so we probably don't need to worry about that here. The key is just to make sure this works with all the versions of the Hadoop code. However, I am curious whether the code for this will be the same across Hadoop versions. Has the security API changed between Hadoop 1 and 2?

@jerryshao
Contributor Author

@mateiz It's not the job's user but the process owner who can read the environment vars; I think they are different, and currently that's OK, as I described in my last comment.

Most of the security API is the same between Hadoop 1 and 2, except that some of it is deprecated, but it's quite different in Hadoop YARN.

@mateiz
Member

mateiz commented Aug 11, 2013

Okay, thanks. We might wait until #803 is merged to merge this, since that changes some of the ways we interact with Hadoop, but this is definitely something we want in 0.8. CCing @jey to take a look at this too.

@jey
Contributor

jey commented Aug 20, 2013

@mateiz, is this targeted at 0.8? If so, I can look at updating it to work with our current master that has #838 (hadoop agnostic builds) merged.

@mateiz
Member

mateiz commented Aug 20, 2013

Yeah, this would be nice to add in if you don't mind taking a look.

@jerryshao
Contributor Author

Hi @jey, do you have any design doc about the Hadoop integration? So much code has changed that I need to figure out how to update my patch.

@jey
Contributor

jey commented Aug 21, 2013

@jerryshao, I'm happy to take care of updating this PR, but had a question: does your patch provide the same functionality under YARN and has it been tested with YARN? Thanks.

@jey
Contributor

jey commented Aug 21, 2013

Here's my branch with an initial conversion of your patch. I haven't tested it against an HDFS install with security enabled yet.

https://github.com/jey/spark/tree/hdfs-auth

@jerryshao
Contributor Author

Hi @jey, thanks for your help. YARN already provides multi-user support and HDFS auth (added by @tgravescs), so my patch only implements this functionality for standalone mode.

I will check out your branch and run it on my secure cluster to see if it is OK.

@jerryshao
Contributor Author

Hi @jey, I checked out your branch and tested it on a CDH 4.1.2 cluster with and without security enabled; it seems fine. All the unit tests pass as well.

BTW, there's a problem when I run sbt/sbt gen-idea or sbt/sbt eclipse to generate project files:

[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.apache.hadoop#hadoop-yarn-api;2.0.0-mr1-cdh4.1.2: not found
[warn]  :: org.apache.hadoop#hadoop-yarn-common;2.0.0-mr1-cdh4.1.2: not found
[warn]  :: org.apache.hadoop#hadoop-yarn-client;2.0.0-mr1-cdh4.1.2: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::

When I changed the Hadoop version to 1.0.4, the same problem occurred:

[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.apache.hadoop#hadoop-yarn-api;1.0.4: not found
[warn]  :: org.apache.hadoop#hadoop-yarn-common;1.0.4: not found
[warn]  :: org.apache.hadoop#hadoop-yarn-client;1.0.4: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::

sbt still tries to create the yarn module even when yarn is not enabled, and I believe the dependencies above do not exist. I'm not sure whether I'm misusing the commands or whether something should be changed in SparkBuild.scala to treat the yarn module separately (one possible direction is sketched below).
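
One possible direction, sketched under the assumption of project names like core, repl, examples, bagel, and yarn in SparkBuild.scala; this is illustrative only, not the actual fix:

// Hypothetical sketch: only aggregate the yarn subproject when YARN support
// is requested, so gen-idea/eclipse don't try to resolve hadoop-yarn-*
// artifacts for non-YARN Hadoop versions.
val isYarnEnabled = sys.env.contains("SPARK_WITH_YARN")

lazy val allProjects = Seq[ProjectReference](core, repl, examples, bagel) ++
  (if (isYarnEnabled) Seq[ProjectReference](yarn) else Nil)

lazy val root = Project("root", file("."), settings = rootSettings)
  .aggregate(allProjects: _*)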

@jey
Contributor

jey commented Aug 23, 2013

Hi @jerryshao, thanks for catching that bug. I just submitted #860 with a fix for that issue.

@rxin
Member

rxin commented Sep 22, 2013

Hi @jey and @jerryshao - did you guys decide on who is going to continue this patch?

@jerryshao
Contributor Author

Hi @rxin, I think jey's updated patch is fine; I've already tested it under CDH 4.1.2, and it works with or without security enabled. That said, I have some remaining concerns:

  1. I don't have access to enough Hadoop versions; I've only tested under CDH 4.1.2 and Apache Hadoop 1.0.4, so I'm not sure it is OK on other versions.
  2. Passing the HDFS delegation token from the worker to the executor backend via an environment variable, as in my implementation, is not elegant. I think I can change it to send the HDFS token over Akka once the executor has registered.
  3. There is no HDFS delegation token renewal mechanism. A delegation token expires after 7 days by default, so we should renew it before expiration; otherwise applications like Shark Server and Spark Streaming will fail (a rough renewal sketch follows this list).
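
To make concern 3 concrete, a rough sketch of a renewal loop, assuming a Hadoop version where Token.renew(Configuration) is available; the interval and error handling are placeholders:

import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.token.{Token, TokenIdentifier}

// Hypothetical renewal loop: renew well inside the default 24-hour renew
// interval. Renewal only extends the token up to its max lifetime (7 days
// by default); past that, a fresh token must be obtained.
def scheduleTokenRenewal(token: Token[_ <: TokenIdentifier],
                         conf: Configuration) {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    def run() {
      try {
        token.renew(conf)
      } catch {
        case e: Exception => // log and retry on the next tick
      }
    }
  }, 12, 12, TimeUnit.HOURS)
}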

In general, I will continue this patch after jey's work to make it more solid.

Thanks
Jerry

@jey
Contributor

jey commented Sep 22, 2013

I've only rebased the patch and don't know anything about HDFS security and related issues, so I think it would make sense for @jerryshao to continue the patch.

@rxin
Member

rxin commented Sep 26, 2013

Hi @jerryshao - can you take this over and submit a new PR to the ASF repo?

@jerryshao
Contributor Author

Ok, I will refactor this patch and submit it to the ASF repo.

@jerryshao jerryshao closed this Sep 27, 2013