
ORC encrypted write should fallback to CPU [databricks] #5604

Merged

Conversation

razajafri
Collaborator

This PR falls back to the CPU if any of the ORC encryption configurations are set.

The configs we check for are:

hadoop.security.key.provider.path
orc.key.provider
orc.encrypt
orc.mask
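
For illustration, setting even one of these options on an ORC write is enough to trigger the fallback (a hypothetical example, not taken from the PR itself; assumes df is an existing DataFrame):

df.write.format("orc") \
    .option("orc.encrypt", "pii:ssn,email") \
    .save("/tmp/encrypted_orc")  # the plugin routes this write to the CPU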

fixes #5463

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@revans2
Collaborator

revans2 commented May 24, 2022

@razajafri Naive question: why does it matter that any of them are set? Will ORC encrypt the output if any one of them is set by itself?

writer.option("orc.mask", mask)
writer.format("orc").mode('overwrite').option("path", write_path).saveAsTable(spark_tmp_table_factory.get())
if path == "" and provider == "" and encrypt == "" and mask == "":
pytest.xfail("Expected to fail as non-encrypted write will not fallback to CPU")
Collaborator

I don't like xfail here. If we expect it NOT to fall back to the CPU, then we should write a test that verifies it does not.

Collaborator Author

Will a skip be better?

Collaborator

A skip is slightly better, but still ugly.
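
For reference, the skip variant under discussion would replace the xfail in the snippet above with something like this (a minimal sketch; pytest.skip is standard pytest, and the variables come from the quoted test):

if path == "" and provider == "" and encrypt == "" and mask == "":
    pytest.skip("non-encrypted write is expected to stay on the GPU")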

@jlowe jlowe added this to the May 23 - Jun 3 milestone May 24, 2022
@razajafri
Collaborator Author

@razajafri Naive question: why does it matter that any of them are set? Will ORC encrypt the output if any one of them is set by itself?

No, from what I have researched, all of them have to be set to encrypt the file. This is us being defensive, because either the user has made a mistake or they don't know what they are doing. In either case, I wanted the CPU version to handle the error checking, if any, and inform the user. I can change that if we don't want to fall back to the CPU when not all of the confs are set.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@revans2
Collaborator

revans2 commented May 24, 2022

But the configs overlap in some cases with reading encrypted data. Do we want to fall back to the CPU for a write when the read is the only thing that is encrypted?

@razajafri
Collaborator Author

But the configs overlap in some cases with reading encrypted data. Do we want to fall back to the CPU for a write when the read is the only thing that is encrypted?

That's a good point, but I haven't found a way to set the options in a global sense. The only two ways I found to set the options are while creating a table, like

CREATE TABLE encrypted (
  ssn STRING,
  email STRING,
  name STRING
)
USING ORC
OPTIONS (
  hadoop.security.key.provider.path "kms://http@localhost:9600/kms",
  orc.key.provider "hadoop",
  orc.encrypt "pii:ssn,email",
  orc.mask "nullify:ssn;sha256:email"
)

Further details of this can be found here. The other way is while reading or writing an individual file, like spark.read.option(...).orc(...) or spark.write.option(...).orc(...); a sketch follows below.

In the first case, the table write and read are both encrypted, so I don't think there is a way to have one and not the other. In the second case, it's up to the user whether they want to encrypt the read or the write.

I wasn't able to find the case you are suggesting, which would provide a global option to be used for writing ORC. If this case exists, then yes, this could be a problem.
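
A sketch of the second method, per-file options on read and write (the option values are hypothetical, reusing the keys from the CREATE TABLE example above; assumes spark is an active SparkSession and df an existing DataFrame):

df.write \
    .option("hadoop.security.key.provider.path", "kms://http@localhost:9600/kms") \
    .option("orc.key.provider", "hadoop") \
    .option("orc.encrypt", "pii:ssn,email") \
    .option("orc.mask", "nullify:ssn;sha256:email") \
    .orc("/tmp/encrypted_orc")

spark.read \
    .option("hadoop.security.key.provider.path", "kms://http@localhost:9600/kms") \
    .option("orc.key.provider", "hadoop") \
    .orc("/tmp/encrypted_orc")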

revans2 previously approved these changes May 25, 2022
@razajafri
Collaborator Author

build

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Collaborator Author

razajafri commented May 26, 2022

So, it seems like Databricks 3.2.1 is failing this test because HadoopShimsPre2_3$NullKeyProvider is loaded

Edit: More details can be found at https://issues.apache.org/jira/browse/SPARK-34578

@razajafri
Collaborator Author

build

@razajafri razajafri requested a review from revans2 May 27, 2022 18:38
@razajafri
Collaborator Author

@revans2 can you take another look, please?

@pytest.mark.parametrize("provider", ["", "hadoop"])
@pytest.mark.parametrize("encrypt", ["", "pii:a"])
@pytest.mark.parametrize("mask", ["", "sha256:a"])
@pytest.mark.skipif(is_databricks104_or_later(), reason="The test will fail on Databricks10.4 because `HadoopShimsPre2_3$NullKeyProvider` is loaded")
Collaborator

Does it make sense to try to set the provider to something else to be able to test the behavior on DB?

Collaborator Author

razajafri commented May 31, 2022

I tried setting it to test:///, which is what Spark sets in its test code, but that doesn't work.

I am following the Spark team's lead on just skipping these tests, although their tests are in Scala, so they have control over skipping only when the provider doesn't have test keys. I am skipping them for DB 10.4 only.

Edit: Sorry, I meant setting the provider URL to test:///. Let me try to set the provider to one of the options and see.
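
The attempt described above, expressed as a write option (a sketch based on this thread; test:/// is the value Spark's own tests use, and writer is assumed to be a DataFrameWriter as in the earlier snippet):

writer.option("hadoop.security.key.provider.path", "test:///")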

Collaborator Author

Other combinations of the confs have resulted in similar errors.

Collaborator

What type of errors? Are you implying that we can't use encryption on Databricks?

Collaborator Author

So, I am not saying DB doesn't support encryption, but when I set the provider to unknown I get errors saying that the provider is required. When I set it to hadoop I get the following error:

Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.IllegalArgumentException: Unknown key pii

If I set it to aws, I get an error that the provider URL is unreachable, or something along those lines. I imagine an encryption server needs to be set up before encryption can actually be tested on DB.
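
A compact summary of these attempts, as hypothetical per-write options; the outcomes in the comments are quoted or paraphrased from this thread:

# orc.key.provider "unknown" -> error saying a key provider is required
# orc.key.provider "hadoop"  -> java.lang.IllegalArgumentException: Unknown key pii
# orc.key.provider "aws"     -> error that the provider URL is unreachable
df.write.format("orc") \
    .option("orc.key.provider", "hadoop") \
    .option("orc.encrypt", "pii:a") \
    .save("/tmp/encrypted_orc")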

Collaborator

OK, let's leave it as a skip for 22.06. Can you file a follow-on issue to see if we can find a way to test on DB 10.4? We can discuss on the issue whether we think it's needed.

Collaborator Author

Here is the follow-on. Please add to the issue or comment on it if something needs to be added:

#5722

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Collaborator Author

build

@sameerz sameerz added the task Work required that improves the product but is not user facing label Jun 2, 2022
@razajafri razajafri merged commit db7b611 into NVIDIA:branch-22.06 Jun 2, 2022
@razajafri razajafri deleted the SP-5463-ORC-encrypted-writes branch June 2, 2022 04:53
Labels
task Work required that improves the product but is not user facing

Successfully merging this pull request may close these issues.

[FEA] Check on ORC encryption for writes