ORC encrypted write should fallback to CPU [databricks] #5604

Merged

28 changes: 28 additions & 0 deletions integration_tests/src/main/python/orc_write_test.py
@@ -15,6 +15,7 @@
import pytest

from asserts import assert_gpu_and_cpu_writes_are_equal_collect, assert_gpu_fallback_write
from spark_session import is_databricks104_or_later
from datetime import date, datetime, timezone
from data_gen import *
from marks import *
@@ -166,3 +167,30 @@ def create_empty_df(spark, path):
        lambda spark, path: spark.read.orc(path),
        data_path,
        conf={'spark.rapids.sql.format.orc.write.enabled': True})

@allow_non_gpu('DataWritingCommandExec')
@pytest.mark.parametrize("path", ["", "kms://http@localhost:9600/kms"])
@pytest.mark.parametrize("provider", ["", "hadoop"])
@pytest.mark.parametrize("encrypt", ["", "pii:a"])
@pytest.mark.parametrize("mask", ["", "sha256:a"])
@pytest.mark.skipif(is_databricks104_or_later(), reason="The test will fail on Databricks10.4 because `HadoopShimsPre2_3$NullKeyProvider` is loaded")
Collaborator:

Does it make sense to try to set the provider to something else to be able to test behavior on DB?

@razajafri (Collaborator, Author) on May 31, 2022:

I tried setting it to test:///, which is what Spark sets in their test code, but that doesn't work.

I am following the Spark team's lead on just skipping these tests, although their tests are in Scala, so they have control over skipping only when the provider doesn't have test keys. I am skipping them for DB 10.4 only.

Edit: Sorry, I meant setting the provider URL to test:///. Let me try to set the provider to one of the options and see.

@razajafri (Collaborator, Author):

Other combinations of the conf have resulted in similar errors.

Collaborator:

What type of errors? Are you implying that we can't use encryption on Databricks?

@razajafri (Collaborator, Author):

So, I am not saying DB doesn't support encryption, but when I set the provider to unknown I get errors saying that the provider is required. When I set it to hadoop I get the following error:

Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.IllegalArgumentException: Unknown key pii

If I set it to aws I get an error that the provider URL is unreachable, or something along those lines. I imagine an encryption server needs to be set up before encryption can actually be tested on DB.

Collaborator:

OK, let's leave it as skip for 22.06. Can you file a follow-on issue to see if we can find a way to test on DB 10.4? We can discuss on the issue whether we think it's needed.

@razajafri (Collaborator, Author):

Here is the follow-on issue. Please add to it or comment on it if something needs to be added:

#5722

def test_orc_write_encryption_fallback(spark_tmp_path, spark_tmp_table_factory, path, provider, encrypt, mask):
    def write_func(spark, write_path):
        writer = unary_op_df(spark, gen).coalesce(1).write
        if path != "":
            writer.option("hadoop.security.key.provider.path", path)
        if provider != "":
            writer.option("orc.key.provider", provider)
        if encrypt != "":
            writer.option("orc.encrypt", encrypt)
        if mask != "":
            writer.option("orc.mask", mask)
        writer.format("orc").mode('overwrite').option("path", write_path).saveAsTable(spark_tmp_table_factory.get())
    if path == "" and provider == "" and encrypt == "" and mask == "":
        pytest.skip("Skip this test when none of the encryption confs are set")
    gen = IntegerGen()
    data_path = spark_tmp_path + '/ORC_DATA'
    assert_gpu_fallback_write(write_func,
                              lambda spark, path: spark.read.orc(path),
                              data_path,
                              'DataWritingCommandExec')
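
For context, a minimal sketch (not part of this PR) of the kind of user write the test exercises. The output path and key names are illustrative; a key provider that actually holds a "pii" key would be needed for the CPU write itself to succeed, but the presence of any one of these options is enough to trigger the fallback.

# Hypothetical standalone reproduction of the fallback (illustrative only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Any one of the four encryption-related options tags the write as
# unsupported on the GPU, so it runs on the CPU DataWritingCommandExec.
(spark.range(100).withColumnRenamed("id", "a")
    .write
    .option("orc.key.provider", "hadoop")
    .option("orc.encrypt", "pii:a")
    .option("orc.mask", "sha256:a")
    .mode("overwrite")
    .orc("/tmp/orc_encrypted_fallback"))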
@@ -58,6 +58,17 @@ object GpuOrcFileFormat extends Logging {
s"${RapidsConf.ENABLE_ORC_WRITE} to true")
}

val keyProviderPath = options.getOrElse("hadoop.security.key.provider.path", "")
val keyProvider = options.getOrElse("orc.key.provider", "")
val encrypt = options.getOrElse("orc.encrypt", "")
val mask = options.getOrElse("orc.mask", "")

if (!keyProvider.isEmpty || !keyProviderPath.isEmpty || !encrypt.isEmpty || !mask.isEmpty) {
meta.willNotWorkOnGpu("Encryption is not yet supported on GPU. If encrypted ORC " +
"writes are not required unset the \"hadoop.security.key.provider.path\" and " +
"\"orc.key.provider\" and \"orc.encrypt\" and \"orc.mask\"")
}

FileFormatChecks.tag(meta, schema, OrcFormatType, WriteFileOp)

val sqlConf = spark.sessionState.conf
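
As a usage note, a hedged sketch of how to surface the reason string above at runtime: spark.rapids.sql.explain is an existing RAPIDS Accelerator config, but the df name, output path, and the exact wording of the logged output here are illustrative.

# Sketch: log why operators stayed on the CPU (assumes the plugin is
# loaded; df and the output path are hypothetical).
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
df.write.option("orc.encrypt", "pii:a").orc("/tmp/orc_enc")
# The driver log should then include the willNotWorkOnGpu reason, i.e.
# that encryption is not yet supported on the GPU.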