ORC encrypted write should fallback to CPU [databricks] #5604
Conversation
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri Naive question: why does it matter that any of them are set? Will ORC encrypt the output if any one of them is set by itself?
    writer.option("orc.mask", mask)
    writer.format("orc").mode('overwrite').option("path", write_path).saveAsTable(spark_tmp_table_factory.get())
    if path == "" and provider == "" and encrypt == "" and mask == "":
        pytest.xfail("Expected to fail as non-encrypted write will not fallback to CPU")
I don't like xfail here. If we expect it to NOT fall back to the CPU, then we should write a test that verifies that it did that.
Will a skip be better?
A skip is slightly better, but still ugly.
No, from what I have researched, all of them have to be set to encrypt the file. This is us being defensive: either the user has made a mistake or they don't know what they are doing. In either case, I wanted the CPU version to handle any error checking and inform the user. I can change that if we don't want to fall back to the CPU when not all of the confs are set.
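The defensive rule described above ("fall back if any encryption conf is set, even an incomplete set") can be sketched in plain Python. This is a hypothetical model only; the real check lives in the plugin's Scala code (`GpuOrcFileFormat.scala`), and the conf names below are the standard ORC encryption keys.

```python
# Hypothetical sketch of the fallback rule -- not the plugin's actual code.
# Conf names are the standard ORC encryption options.
ORC_ENCRYPTION_CONFS = (
    "hadoop.security.key.provider.path",
    "orc.key.provider",
    "orc.encrypt",
    "orc.mask",
)

def should_fallback_to_cpu(options: dict) -> bool:
    """Fall back if ANY encryption conf is set (non-empty), so the CPU path
    can do the error checking and report problems to the user."""
    return any(options.get(k) for k in ORC_ENCRYPTION_CONFS)
```

Note that an incomplete set of confs (e.g. only `orc.mask`) still triggers the fallback, which is exactly the "user may have made a mistake" case discussed above.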
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
But the configs overlap in some cases with reading encrypted data. Do we want to fall back to the CPU for a write when the read is the only thing that is encrypted?
That's a good point, but I haven't found a way to set the options in a global sense. The only two ways I found to set the options are (1) while creating a table (further details can be found here) and (2) while reading or writing the individual file. In the first case, the table write and read are both encrypted, so I don't think there is a way to have one and not the other; in the second case, it's up to the user whether they want to encrypt the read or the write. I wasn't able to find the case you are suggesting, where a global option would be used for writing ORC. If that case exists, then yes, this could be a problem.
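For the second way (per-write options), here is a minimal sketch of what setting all four confs together might look like. The key names are the standard ORC encryption options; the KMS URI and column names are illustrative placeholders, not values from this PR.

```python
# Illustrative only: the provider path and column names below are
# placeholders. Per the discussion above, all four confs are needed
# together to actually encrypt the file.
orc_encryption_opts = {
    "orc.key.provider": "hadoop",
    "hadoop.security.key.provider.path": "kms://http@localhost:9600/kms",
    "orc.encrypt": "pii:ssn",    # encrypt column "ssn" with key "pii"
    "orc.mask": "nullify:ssn",   # mask shown to readers without the key
}

# With a real SparkSession, these would be applied per write, roughly:
#   df.write.format("orc").options(**orc_encryption_opts).save(path)
```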
build
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
So, it seems like Databricks 3.2.1 is failing this test. Edit: More details can be found at https://issues.apache.org/jira/browse/SPARK-34578
build |
@revans2 can you take another look please |
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala
    @pytest.mark.parametrize("provider", ["", "hadoop"])
    @pytest.mark.parametrize("encrypt", ["", "pii:a"])
    @pytest.mark.parametrize("mask", ["", "sha256:a"])
    @pytest.mark.skipif(is_databricks104_or_later(), reason="The test will fail on Databricks 10.4 because `HadoopShimsPre2_3$NullKeyProvider` is loaded")
Does it make sense to try to set the provider to something else to be able to test the behavior on DB?
I tried setting it to test:///, which is what Spark sets in their test code, but that doesn't work. I am following the Spark team's lead on just skipping these tests, although their tests are in Scala, so they have control over skipping them only when the provider doesn't have test keys. I am just skipping them for DB 10.4 only.
Edit: Sorry, I meant setting the provider URL to test:///. Let me try to set the provider to one of the options and see.
Other combinations of the conf have resulted in similar errors
What type of errors? Are you implying that we can't use encryption on Databricks?
So, I am not saying DB doesn't support encryption, but when I set the provider to unknown, I get errors saying that the provider is required. When I set it to hadoop, I get the following error:
    Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.IllegalArgumentException: Unknown key pii
If I set it to aws, I get an error that the provider URL is unreachable, or something along those lines. I imagine an encryption server needs to be set up before encryption can actually be tested on DB.
OK, let's leave it as a skip for 22.06. Can you file a follow-on issue to see if we can find a way to test this on DB 10.4? We can discuss on the issue whether we think it's needed.
Here is the follow-on. Please add to the issue, or comment on it if something needs to be added.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
This PR falls back to the CPU if any of the ORC encryption configurations are set.
The configs that we are checking for are
fixes #5463
Signed-off-by: Raza Jafri rjafri@nvidia.com