Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32002][SQL]Support ExtractValue from nested ArrayStruct #28860

Closed
wants to merge 5 commits into from

Conversation

AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu AngersZhuuuu commented Jun 18, 2020

What changes were proposed in this pull request?

For data nest.json

{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}

run with

val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

will got error

org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;
'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]
+- SubqueryAlias `nest_table`
+- Relationa#6 json

This is because after resolve a.b, it construct GetArrayStructField from a#
result with type ArrayType(ArrayType(StructType)), then it can't be extract through current ExtractValue method.

Support extract StructField form nested Array-Struct combination

Why are the changes needed?

Support more use case

Does this PR introduce any user-facing change?

Yes, users will be able to select from nested array and structs.

How was this patch tested?

Added UT

@SparkQA
Copy link

SparkQA commented Jun 18, 2020

Test build #124217 has finished for PR 28860 at commit da9a3d5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 18, 2020

Test build #124220 has finished for PR 28860 at commit b8f4a69.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 19, 2020

Test build #124252 has finished for PR 28860 at commit 400ad7d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jun 19, 2020

Test build #124268 has finished for PR 28860 at commit 400ad7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-32002][SQL]Support Extract valve from nested ArrayStruct [SPARK-32002][SQL]Support ExtractValue from nested ArrayStruct Jun 20, 2020
@gatorsmile
Copy link
Member

cc @MaxGekk @HyukjinKwon

@AngersZhuuuu
Copy link
Contributor Author

Seems this way will change the schema, maybe we should add a new class to get multidimensional array form GetArrayStruct

Co-authored-by: Ruslan Dautkhanov <Tagar@users.noreply.github.com>
@SparkQA
Copy link

SparkQA commented Jun 20, 2020

Test build #124314 has finished for PR 28860 at commit b6e92c0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@HyukjinKwon
Copy link
Member

cc @viirya too FYI

@@ -59,6 +59,23 @@ object ExtractValue {
GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
ordinal, fields.length, containsNull)

case (ExtractNestedArray(StructType(fields), containsNull, containsNullSeq),
NonNullLiteral(v, StringType)) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

case (ExtractNestedArray(
    StructType(fields), containsNull, containsNullSeq), NonNullLiteral(v, StringType)) =>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also update the documentation and table above.

@@ -95,6 +112,50 @@ object ExtractValue {

trait ExtractValue extends Expression

object ExtractNestedArray {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add documentation here.

@@ -95,6 +112,50 @@ object ExtractValue {

trait ExtractValue extends Expression

object ExtractNestedArray {

type ReturnType = Option[(DataType, Boolean, Seq[Boolean])]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add some comments for what this type means.


def extractArrayStruct(expr: Expression): ReturnType = {
expr match {
case gas @ GetArrayStructFields(child, _, _, _, _) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid arguments matching. This is actually an anti pattern - https://github.com/databricks/scala-style-guide#pattern-matching

expr match {
case gas @ GetArrayStructFields(child, _, _, _, _) =>
extractArrayStruct(child) match {
case Some((e, deep)) => Some(e, deep + 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deep -> depth?

extractArrayType(dataType)
}

def extractArrayType(dataType: DataType): ReturnType = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we combine this and unapply?


checkAnswer(sql(
"""
|SELECT a.b.c FROM nest
Copy link
Member

@HyukjinKwon HyukjinKwon Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add deeper cases? Also, you can simplify the test cases, for example, as below:

val df = spark.range(10).select(array(struct(array(struct("id")).alias("col1"))).alias("col0"))
df.selectExpr("col0.col1.id")

@@ -95,6 +112,50 @@ object ExtractValue {

trait ExtractValue extends Expression

object ExtractNestedArray {
Copy link
Member

@HyukjinKwon HyukjinKwon Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also name it something like ExtractNestedArrayType

@SparkQA
Copy link

SparkQA commented Jun 26, 2020

Test build #124544 has finished for PR 28860 at commit b6e92c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@guiyanakuang
Copy link
Member

org.apache.spark.sql.execution.ProjectionOverSchema.scala. A Scala extractor that projects an expression over a given schema. I think this code also needs to have matching rules added to it. To support nested ArrayStruct. Otherwise, ParquetSchemaPruning won't work either.

@AngersZhuuuu
Copy link
Contributor Author

org.apache.spark.sql.execution.ProjectionOverSchema.scala. A Scala extractor that projects an expression over a given schema. I think this code also needs to have matching rules added to it. To support nested ArrayStruct. Otherwise, ParquetSchemaPruning won't work either.

Yea, right, current way change the schema and won't work well for all case, I am thinking about add a new class to handle nested ArrayStruct type

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants