
ORC encrypted write should fallback to CPU [databricks] #5604

Merged

Conversation

razajafri
Collaborator

This PR falls back to the CPU if any of the ORC encryption configurations are set.

The configs we check for are:

hadoop.security.key.provider.path
orc.key.provider
orc.encrypt
orc.mask
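
For illustration, setting even one of these options on an ORC write is enough to trigger the fallback (a hypothetical example, not taken from the PR itself; assumes df is an existing DataFrame):

df.write.format("orc") \
    .option("orc.encrypt", "pii:ssn,email") \
    .save("/tmp/encrypted_orc")  # the plugin routes this write to the CPU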

fixes #5463

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@revans2
Collaborator

revans2 commented May 24, 2022

@razajafri Naive question: why does it matter that any of them are set? Will ORC encrypt the output if any one of them is set by itself?

writer.option("orc.mask", mask)
writer.format("orc").mode('overwrite').option("path", write_path).saveAsTable(spark_tmp_table_factory.get())
if path == "" and provider == "" and encrypt == "" and mask == "":
pytest.xfail("Expected to fail as non-encrypted write will not fallback to CPU")
Collaborator

I don't like xfail here. If we expect it NOT to fall back to the CPU, then we should write a test that verifies it does not.

Collaborator Author

Will a skip be better?

Collaborator

A skip is slightly better, but still ugly.
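
For reference, the skip variant under discussion would replace the xfail in the snippet above with something like this (a minimal sketch; pytest.skip is standard pytest, and the variables come from the quoted test):

if path == "" and provider == "" and encrypt == "" and mask == "":
    pytest.skip("non-encrypted write is expected to stay on the GPU")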

@jlowe jlowe added this to the May 23 - Jun 3 milestone May 24, 2022
@razajafri
Collaborator Author

@razajafri Naive question: why does it matter that any of them are set? Will ORC encrypt the output if any one of them is set by itself?

No, from what I have researched, all of them have to be set to encrypt the file. This is us being defensive, because either the user has made a mistake or they don't know what they are doing. In either case, I wanted the CPU version to handle the error checking, if any, and inform the user. I can change that if we don't want to fall back to the CPU when not all of the confs are set.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@revans2
Collaborator

revans2 commented May 24, 2022

But the configs overlap in some cases with reading encrypted data. Do we want to fall back to the CPU for a write when the read is the only thing that is encrypted?

@razajafri
Collaborator Author

But the configs overlap in some cases with reading encrypted data. Do we want to fall back to the CPU for a write when the read is the only thing that is encrypted?

That's a good point, but I haven't found a way to set the options in a global sense. The only two ways I found to set the options are while creating a table, like

CREATE TABLE encrypted (
  ssn STRING,
  email STRING,
  name STRING
)
USING ORC
OPTIONS (
  hadoop.security.key.provider.path "kms://http@localhost:9600/kms",
  orc.key.provider "hadoop",
  orc.encrypt "pii:ssn,email",
  orc.mask "nullify:ssn;sha256:email"
)

Further details of this can be found here. The other way is while reading or writing an individual file, like spark.read.option(...).orc(...) or spark.write.option(...).orc(...); a sketch follows below.

In the first case, the table write and read are both encrypted, so I don't think there is a way to have one and not the other. In the second case, it's up to the user whether they want to encrypt the read or the write.

I wasn't able to find the case you are suggesting, which would provide a global option to be used for writing ORC. If this case exists, then yes, this could be a problem.
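
A sketch of the second method, per-file options on read and write (the option values are hypothetical, reusing the keys from the CREATE TABLE example above; assumes spark is an active SparkSession and df an existing DataFrame):

df.write \
    .option("hadoop.security.key.provider.path", "kms://http@localhost:9600/kms") \
    .option("orc.key.provider", "hadoop") \
    .option("orc.encrypt", "pii:ssn,email") \
    .option("orc.mask", "nullify:ssn;sha256:email") \
    .orc("/tmp/encrypted_orc")

spark.read \
    .option("hadoop.security.key.provider.path", "kms://http@localhost:9600/kms") \
    .option("orc.key.provider", "hadoop") \
    .orc("/tmp/encrypted_orc")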

revans2 previously approved these changes May 25, 2022
@razajafri
Collaborator Author

build

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Collaborator Author

razajafri commented May 26, 2022

So, it seems like Databricks 3.2.1 is failing this test because HadoopShimsPre2_3$NullKeyProvider is loaded

Edit: More details can be found at https://issues.apache.org/jira/browse/SPARK-34578

@razajafri
Collaborator Author

build

@razajafri razajafri requested a review from revans2 May 27, 2022 18:38
@razajafri
Collaborator Author

@revans2 can you take another look, please?

@pytest.mark.parametrize("provider", ["", "hadoop"])
@pytest.mark.parametrize("encrypt", ["", "pii:a"])
@pytest.mark.parametrize("mask", ["", "sha256:a"])
@pytest.mark.skipif(is_databricks104_or_later(), reason="The test will fail on Databricks10.4 because `HadoopShimsPre2_3$NullKeyProvider` is loaded")
Collaborator

Does it make sense to try to set the provider to something else to be able to test the behavior on DB?

Collaborator Author

razajafri commented May 31, 2022

I tried setting it to test:///, which is what Spark sets in its test code, but that doesn't work.

I am following the Spark team's lead on just skipping these tests, although their tests are in Scala, so they have control over skipping only when the provider doesn't have test keys. I am skipping them for DB 10.4 only.

Edit: Sorry, I meant setting the provider URL to test:///. Let me try to set the provider to one of the options and see.
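
The attempt described above, expressed as a write option (a sketch based on this thread; test:/// is the value Spark's own tests use, and writer is assumed to be a DataFrameWriter as in the earlier snippet):

writer.option("hadoop.security.key.provider.path", "test:///")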

Collaborator Author

Other combinations of the confs have resulted in similar errors.

Collaborator

What type of errors? Are you implying that we can't use encryption on Databricks?

Collaborator Author

So, I am not saying DB doesn't support encryption, but when I set the provider to unknown I get errors saying that the provider is required. When I set it to hadoop I get the following error:

Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.IllegalArgumentException: Unknown key pii

If I set it to aws, I get an error that the provider URL is unreachable, or something along those lines. I imagine an encryption server needs to be set up before encryption can actually be tested on DB.
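
A compact summary of these attempts, as hypothetical per-write options; the outcomes in the comments are quoted or paraphrased from this thread:

# orc.key.provider "unknown" -> error saying a key provider is required
# orc.key.provider "hadoop"  -> java.lang.IllegalArgumentException: Unknown key pii
# orc.key.provider "aws"     -> error that the provider URL is unreachable
df.write.format("orc") \
    .option("orc.key.provider", "hadoop") \
    .option("orc.encrypt", "pii:a") \
    .save("/tmp/encrypted_orc")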

Collaborator

OK, let's leave it as a skip for 22.06. Can you file a follow-on issue to see if we can find a way to test on DB 10.4? We can discuss on the issue whether we think it's needed.

Collaborator Author

Here is the follow-on. Please add to the issue or comment on it if something needs to be added:

#5722

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Collaborator Author

build

@sameerz sameerz added the task Work required that improves the product but is not user facing label Jun 2, 2022
@razajafri razajafri merged commit db7b611 into NVIDIA:branch-22.06 Jun 2, 2022
@razajafri razajafri deleted the SP-5463-ORC-encrypted-writes branch June 2, 2022 04:53
Labels
task Work required that improves the product but is not user facing

Successfully merging this pull request may close these issues.

[FEA] Check on ORC encryption for writes