Add Spark multi-user support for standalone mode #750
Conversation
Thank you for your pull request. An admin will review this request soon.
There's a lot of code duplication among the hadoop1, hadoop2 and yarn utils. Can you DRY this out, please?
Jenkins, ok to test.
Hi @markhamstra, thanks for your advice. Spark uses compile options to handle hadoop1, hadoop2 and yarn separately; if I extract the duplicated code, a common package is needed, but where to put this package is a problem. My concern is that adding it to Spark core may pollute Spark with different versions of Hadoop. If you have a good solution, please let me know. Thanks
Hi Jerry, I believe someone suggested using the hadoop-client jar to abstract out talking to different versions. It seems this is affecting packaging and other issues, so maybe we should accelerate this work.
Hi @velvia, would you please be more specific? I can't quite catch your meaning.
```scala
def runAsUser(func: (Product) => Unit, args: Product, user: String) {
  val ugi = UserGroupInformation.createRemoteUser(user)
  if (UserGroupInformation.isSecurityEnabled) {
    Option(System.getenv(HDFS_TOKEN_KEY)) match {
```
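For readers without a secure Hadoop cluster at hand, here is a runnable sketch of the flow the excerpt above begins. Hadoop's `UserGroupInformation` is replaced by a stub: `StubUgi`, the `SPARK_HDFS_TOKEN` variable name, and the `doAs` bodies are illustrative assumptions, not the patch's actual code.

```scala
// Hedged sketch of the runAsUser flow with Hadoop's UserGroupInformation
// stubbed out so the control flow can run stand-alone.
object RunAsUserSketch {
  val HDFS_TOKEN_KEY = "SPARK_HDFS_TOKEN" // hypothetical env var name

  // Stand-in for UserGroupInformation.createRemoteUser(user)
  case class StubUgi(user: String) {
    // The real ugi.doAs wraps the body in a PrivilegedExceptionAction
    def doAs[A](body: => A): A = body
  }

  def runAsUser(func: Product => Unit, args: Product, user: String): Unit = {
    val ugi = StubUgi(user)
    Option(System.getenv(HDFS_TOKEN_KEY)) match {
      case Some(_) =>
        // Real code would add the token's credentials to the UGI here
        ugi.doAs { func(args) }
      case None =>
        ugi.doAs { func(args) }
    }
  }

  def main(args: Array[String]): Unit = {
    runAsUser(p => println(s"task ran for ${p.productElement(0)}"),
              Tuple1("alice"), "alice")
  }
}
```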
An environment variable isn't very secure for passing the token, since anyone on that machine could simply do a ps and get it. Perhaps this is OK for a first cut if you are limiting access to the machines, but I think this will eventually need to be made more secure.
Sorry, I was wrong: environment variables are readable only by the user owning the process.
@velvia I also don't follow what you are asking. The Hadoop API has changed between versions. They have back-pedalled a bit in the latest Hadoop 2 versions (2.1.0-beta) to try to make most of the APIs compatible with Hadoop 1. So perhaps once Spark moves to a later version of Hadoop 2, some of those will not be needed.
Ok, let me try to be more clear. Currently, Spark is compiled directly against a specific version of Hadoop. There has been talk recently that we should try building against the hadoop-client jar instead. I personally don't have experience with hadoop-client, so I can't vouch for whether it would help here.
The hadoop2-yarn profile definitely uses APIs that do not exist in Hadoop 1. I'm not sure about the hadoop1 and hadoop2 profiles.
Thank you for your pull request. An admin will review this request soon.
@velvia and @tgravescs, thanks for your comments. Passing the token between parent and child processes via an environment variable is a simple and easy-to-implement approach for now, but as you said it is not secure: people can read this token under /proc/. That said, they can only do so if they can log in to the machine as the same user that owns the process, which was the compromise my implementation makes. As for the hadoop-client jar, I'm not familiar with it and I'm not sure that jar can solve API compatibility across versions, so I will investigate whether it is feasible.
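The parent-to-child hand-off described here can be sketched in plain Scala with `java.lang.ProcessBuilder`. The `SPARK_HDFS_TOKEN` variable name is a hypothetical stand-in; the patch's actual key is not shown in this excerpt.

```scala
// Sketch of passing a delegation token to a child process via its environment.
object TokenEnvSketch {
  val HdfsTokenKey = "SPARK_HDFS_TOKEN" // hypothetical env var name

  // Parent side: build the child command with the token exported in
  // the child's environment.
  def childWithToken(token: String, cmd: String*): ProcessBuilder = {
    val pb = new ProcessBuilder(cmd: _*)
    pb.environment().put(HdfsTokenKey, token)
    pb
  }

  // Child side: the executor would read the token back out of its environment.
  def readToken(env: java.util.Map[String, String]): Option[String] =
    Option(env.get(HdfsTokenKey))

  def main(args: Array[String]): Unit = {
    val pb = childWithToken("dummy-token", "echo", "hi")
    println(readToken(pb.environment()).getOrElse("<missing>"))
  }
}
```

Since the token only lives in the child's environment map, it is visible under /proc/&lt;pid&gt;/environ to the process owner, which matches the visibility discussion above.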
@jerryshao looks like the environment var can only be read by that job's user according to @tgravescs's last comment, so hopefully that's fine? Jey Kottalam from the AMP Lab has been working on a hadoop-client version of Spark's build system, so we probably don't need to worry about that here. The key is to just make sure this works in all the versions of the Hadoop code. However, I am curious as to whether the code for this will be the same across Hadoop versions or not. Has the security API changed between Hadoop 1 and 2? |
@mateiz It's not the job's user but the process owner who can read the environment var; I think they are different, and currently that's OK as I described in my last comment. Most of the security API is the same between Hadoop 1 and 2, except some parts are deprecated, but it's quite different in Hadoop YARN.
Yeah, this would be nice to add in if you don't mind taking a look. |
Hi @jey, do you have any design doc about the Hadoop integration? So much code has changed that I have to figure out how to update my patch.
@jerryshao, I'm happy to take care of updating this PR, but had a question: does your patch provide the same functionality under YARN and has it been tested with YARN? Thanks. |
Here's my branch with an initial conversion of your patch. I haven't tested it against an HDFS install with security enabled yet. |
Hi @jey, thanks for your help. YARN already provides multi-user support and HDFS auth (by @tgravescs), so my patch only implements this functionality in standalone mode. I will check out your branch and run it on my secure cluster to see if it is OK.
Hi @jey, I checked out your branch and tested it on a CDH 4.1.2 cluster with and without security enabled; it seems fine, and all the unit tests pass. BTW, there's a problem when I run
Also, when I changed the Hadoop version to 1.0.4, the same problem occurs:
sbt still tries to create the yarn module even when yarn is not enabled, and I don't think the dependencies above exist. I'm not sure whether I'm using this command incorrectly or whether something should be changed in
Hi @jerryshao, thanks for catching that bug. I just submitted #860 with a fix for that issue. |
Hi @jey and @jerryshao - did you guys decide on who is going to continue this patch? |
Hi @rxin, I think jey's updated patch is fine; I've already tested it under CDH 4.1.2 with and without security enabled. That said, I have some concerns:
Generally I will continue this patch after jey's work to make it more reasonable. Thanks |
I've only rebased the patch and don't know anything about HDFS security and related issues, so I think it would make sense for @jerryshao to continue the patch. |
Hi @jerryshao - can you take this over and submit a new pr to the asf repo? |
Ok, I will refactor this patch and submit to asf repo. |
This patch adds multi-user support for standalone mode.
Currently, Spark's ExecutorBackend does not distinguish the user who submits the app; instead it uses the user who started the Spark cluster to communicate with HDFS, which introduces file permission issues. This patch solves this in two aspects:
I've tested it on CDH 4.1.2 and Hadoop 1.0.4, with and without security enabled, but I cannot cover other versions.
This patch does not solve the delegation token renewal issue when communicating with secure Hadoop; for long-running apps like Spark Streaming or Shark Server, access will fail when the delegation token expires.
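One possible shape for such a renewal mechanism is a background scheduler that re-requests the token well before it expires. This is only a sketch under stated assumptions: `renewToken` stands in for the real HDFS delegation-token renewal call, and the interval would come from the token's actual expiry in practice.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Sketch: periodically invoke a token-renewal callback on a single
// background thread. The callback is a stand-in for real HDFS renewal.
object TokenRenewerSketch {
  def schedule(intervalMs: Long)(renewToken: () => Unit): ScheduledExecutorService = {
    val exec = Executors.newSingleThreadScheduledExecutor()
    exec.scheduleAtFixedRate(
      new Runnable { def run(): Unit = renewToken() },
      intervalMs, intervalMs, TimeUnit.MILLISECONDS)
    exec
  }

  def main(args: Array[String]): Unit = {
    val renewals = new AtomicInteger(0)
    val exec = schedule(50) { () => renewals.incrementAndGet(); () }
    Thread.sleep(300) // let a few renewal ticks fire
    exec.shutdownNow()
    println(renewals.get() >= 1)
  }
}
```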
Any advice about a renewal mechanism is really appreciated; I will add it to this patch.
Thanks
Jerry