[BUG] GetJsonObject removes leading space characters #10215
Labels: bug (Something isn't working), ? - Needs Triage (Need team to review and classify)

revans2 added the "bug" and "? - Needs Triage" labels on Jan 18, 2024.
revans2 changed the title from "[BUG] GetJsonPath removes leading space characters" to "[BUG] GetJsonObject removes leading space characters" on Jan 23, 2024.

Comments
Ran some tests on the current plugin with Spark 3.4.1. The GPU's results look the same as https://jsonpath.com/'s, but not the same as the CPU's:

scala> val df = Seq("""{" a":"b","b":"c"}""", """{"a":"b","b":"c"}""", """{"a":"b"," a":"c"}""").toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.write.mode("overwrite").parquet("json_obj_data")
scala> val df2 = spark.read.parquet("json_obj_data")
df2: org.apache.spark.sql.DataFrame = [value: string]
scala> df2.selectExpr("value", "get_json_object(value, '$[\\\' a\\\']') as ba", "get_json_object(value, '$. a') as sa", "get_json_object(value, '$.a') as a").show()
+------------------+----+----+----+
| value| ba| sa| a|
+------------------+----+----+----+
|{" a":"b","b":"c"}|null|null|null|
|{"a":"b"," a":"c"}| b| b| b|
| {"a":"b","b":"c"}| b| b| b|
+------------------+----+----+----+
scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> spark.conf.set("spark.rapids.sql.expression.GetJsonObject", true)
scala> df2.selectExpr("value", "get_json_object(value, '$[\\\' a\\\']') as ba", "get_json_object(value, '$. a') as sa", "get_json_object(value, '$.a') as a").show()
24/02/22 14:50:49 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
@Partitioning <SinglePartition$> could run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> get_json_object(value#94, $[' a']) AS ba#133 will run on GPU
*Expression <GetJsonObject> get_json_object(value#94, $[' a']) will run on GPU
*Expression <Alias> get_json_object(value#94, $. a) AS sa#134 will run on GPU
*Expression <GetJsonObject> get_json_object(value#94, $. a) will run on GPU
*Expression <Alias> get_json_object(value#94, $.a) AS a#135 will run on GPU
*Expression <GetJsonObject> get_json_object(value#94, $.a) will run on GPU
*Exec <FileSourceScanExec> will run on GPU
+------------------+----+----+----+
| value| ba| sa| a|
+------------------+----+----+----+
|{" a":"b","b":"c"}| b| b|null|
|{"a":"b"," a":"c"}| c| c| b|
| {"a":"b","b":"c"}|null|null| b|
+------------------+----+----+----+

Not sure if it's a bug of Spark; we can match it in cudf as another
Describe the bug

Spark and https://jsonpath.com/ agree that whitespace should not be stripped from a JSON path. Our implementation treats $. a and $.a the same, but no other implementation I have found does that. What is more, even if the name is quoted we still end up doing the same thing: $[' b'] is the same as $['b'].
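To make the distinction concrete, here is a minimal, hypothetical sketch of a JSONPath field tokenizer that preserves whitespace in names, matching the Spark CPU and jsonpath.com behavior described above. Everything here (PathSketch, the regexes, fields) is an illustrative invention for simple single-step paths, not the plugin's actual parser:

```scala
// Hypothetical sketch: extract field names from a simple JSONPath like
// $.a, $. a, or $[' a'], WITHOUT trimming whitespace from the name.
object PathSketch {
  // Dot step: everything after '.' up to the next '.' or '[' is the
  // field name, leading spaces included.
  private val dot = """\.([^.\[]*)""".r
  // Bracket step: the quoted name is taken verbatim, spaces and all.
  private val bracket = """\['([^']*)'\]""".r

  def fields(path: String): List[String] =
    (dot.findAllMatchIn(path).map(_.group(1)) ++
     bracket.findAllMatchIn(path).map(_.group(1))).toList

  def main(args: Array[String]): Unit = {
    println(fields("$.a"))     // List(a)
    println(fields("$. a"))    // List( a)  -- a distinct key from "a"
    println(fields("$[' a']")) // List( a)
  }
}
```

A tokenizer exhibiting the reported bug would instead trim the captured name, collapsing both $. a and $[' a'] onto the key "a" and producing the mismatched results shown in the GPU table above.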