[SPARK-25935][SQL] Prevent null rows from JSON parser #22938
Conversation
add to whitelist
ok to test
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
Test build #98439 has finished for PR 22938 at commit
@@ -240,16 +240,6 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
      Seq(Row("1"), Row("2")))
  }
test("SPARK-11226 Skip empty line in json file") {
I removed the test because it is not relevant to the default mode `PERMISSIVE` any more, and `SQLQuerySuite` is not the perfect place for it.
Where is it moved to then? Does that mean we don't have a regression test for SPARK-11226 anymore?
Test build #98444 has finished for PR 22938 at commit
Test build #98448 has finished for PR 22938 at commit
Test build #98452 has finished for PR 22938 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
Test build #98527 has finished for PR 22938 at commit
@HyukjinKwon Are you ok with the changes?
@MaxGekk I have checked out your PR and played a little bit with it: I created a new unit test as a copy of "from_json - input=array, schema=array, output=array" with invalid JSON. I expected to get an InternalRow(null) for an array schema but I got null. After debugging a little I have found the reason. Please fix these and add a small unit test for each.
Yea, looks fine in general. Will take a look within this week or on the weekend.
@attilapiros, mind showing rough small test code for it please? Just want to see if this is something we should fix or not.
@HyukjinKwon Sure, the test would be for an invalid JSON array:

```scala
test("from_json - input=invalid JSON array, schema=array, output=array") {
  val input = """[{"a": 1}, {a": 2}]"""
  val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
  val output = InternalRow(1) :: InternalRow(2) :: Nil
  checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId), InternalRow(null))
}
```

I have corrupted the JSON by removing the opening quote before the second `a`. Running this test fails.
I guess the problem belongs to `FailureSafeParser`:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala, line 36 in 57eddc7

`FailureSafeParser` was created to safely parse structs, not arrays and maps. I think we need to properly prepare `nullResult` for arrays and maps. I will look at it. Thank you @attilapiros for the example.
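The `nullResult` idea discussed here can be illustrated with a small self-contained sketch (the class and names below are hypothetical simplifications, not Spark's actual `FailureSafeParser`): on a bad record, `PERMISSIVE` substitutes a pre-built all-null result instead of dropping the record, while `FAILFAST` rethrows.

```scala
import scala.util.{Failure, Success, Try}

sealed trait ParseMode
case object Permissive extends ParseMode
case object FailFast extends ParseMode

// Hypothetical, simplified model of FailureSafeParser's idea (not the
// real Spark class): wrap a raw parser and decide what a bad record
// becomes based on the mode.
class SafeParser[T](rawParser: String => Seq[T], nullResult: T, mode: ParseMode) {
  def parse(input: String): Seq[T] = Try(rawParser(input)) match {
    case Success(rows) => rows
    case Failure(e) => mode match {
      case Permissive => Seq(nullResult) // e.g. Row(null, null) for `key STRING, value INT`
      case FailFast   => throw e
    }
  }
}

// A toy "parser" that fails when no root JSON token is found.
val raw: String => Seq[Vector[Any]] =
  s => if (s.trim.startsWith("{")) Seq(Vector(s)) else sys.error("no root token")

val permissive = new SafeParser(raw, Vector[Any](null, null), Permissive)
println(permissive.parse(" "))  // List(Vector(null, null))
```

The point of the PR discussion is that the real `nullResult` is built for a struct schema only, which is why array and map schemas need separate handling.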
At least it doesn't fail on the cases https://github.com/apache/spark/pull/22938/files#diff-6626026091295ad8c0dfb66ecbcd04b1R568 and https://github.com/apache/spark/pull/22938/files#diff-6626026091295ad8c0dfb66ecbcd04b1R565 which this PR actually addresses. So, I am getting exactly one row from the parser.
I made a fix for the broken array and map cases.
Test build #98555 has finished for PR 22938 at commit
Test build #98565 has finished for PR 22938 at commit
docs/sql-migration-guide-upgrade.md
@@ -15,6 +15,8 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 3.0, the `from_json` function supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` did not conform to either `PERMISSIVE` or `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`.

- In Spark version 2.4 and earlier, the JSON data source and the `from_json` function produced `null`s if there is no valid root JSON token in the input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to the specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.
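The behavior change in the migration note can be modeled outside Spark with a tiny sketch (the regex-based `parseField` below is a hypothetical stand-in for the real JSON parser, not Spark API): malformed input such as `{"a" 1}` used to collapse to `null`, while in the `PERMISSIVE` mode it now becomes a row of `null`s.

```scala
import scala.util.Try

// Hypothetical stand-in for parsing `{"a": <int>}` against schema `a INT`;
// throws on malformed input such as {"a" 1} (missing colon).
def parseField(json: String): Int =
  """\{\s*"a"\s*:\s*(\d+)\s*\}""".r.findFirstMatchIn(json) match {
    case Some(m) => m.group(1).toInt
    case None    => sys.error(s"Malformed record: $json")
  }

// Spark 2.4-style: malformed input yields no row at all (null result).
def oldFromJson(json: String): Option[Int] = Try(parseField(json)).toOption

// Spark 3.0-style PERMISSIVE: malformed input yields Row(null),
// modeled here as Some(None).
def newFromJson(json: String): Option[Option[Int]] =
  Try(parseField(json)).toOption match {
    case Some(v) => Some(Some(v)) // Row(v)
    case None    => Some(None)    // Row(null)
  }

println(oldFromJson("""{"a" 1}"""))  // None
println(newFromJson("""{"a" 1}"""))  // Some(None)
```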
Just for curiosity, how can the JSON data source return null rows?
When we use the data source, we can specify the schema as `StructType` only. In that case, we get a `Seq[InternalRow]` or `Nil` from `JacksonParser` which is `flatMap`ped, or a `BadRecordException` which is converted to `Iterator[InternalRow]`. It seems there is no way to get `null` rows. The difference between the JSON datasource and the JSON functions is that the former doesn't (and cannot) do flattening. So, the `Nil` case should be handled specially (this PR addresses that case).
> In Spark version 2.4 and earlier, JSON data source and the `from_json` function produced `null`s

Shall we update this? According to what you said, the JSON data source can't produce null.
case _: StructType => (row: InternalRow) => row
case _: ArrayType => (row: InternalRow) =>
  if (row.isNullAt(0)) {
    new GenericArrayData(Array())
I think this is the place where `from_json` is different from the JSON data source. A data source must produce data as rows, while `from_json` can return an array or map.
I think the previous behavior also makes sense. For array/map, we don't have the corrupted column, and returning null is reasonable. Actually I prefer null over an empty array/map, but we need more discussion about this behavior.
I also thought about what is better to return here - `null` or an empty `ArrayData`/`MapData`. In the case of `StructType` we return a `Row` in the `PERMISSIVE` mode. For consistency, should we return an empty array/map in this mode too?
Maybe we can consider a special mode in which we return `null` for the bad record? For now it is easy to do since we use `FailureSafeParser`.
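The two candidate behaviors under discussion - `null` vs an empty array for a bad record with an array schema - can be sketched side by side in plain Scala (illustrative names, not Spark API):

```scala
import scala.util.Try

// Toy parser for a JSON array of ints; throws on malformed input.
def parseArray(json: String): Seq[Int] =
  if (json.matches("""\[\s*(\d+\s*(,\s*\d+\s*)*)?\]"""))
    "\\d+".r.findAllIn(json).map(_.toInt).toSeq
  else sys.error(s"Malformed record: $json")

// Option A (what the reviewers lean toward): a bad record produces null.
def permissiveNull(json: String): Option[Seq[Int]] =
  Try(parseArray(json)).toOption

// Option B (considered above, then discarded): a bad record produces an empty array.
def permissiveEmpty(json: String): Seq[Int] =
  Try(parseArray(json)).getOrElse(Seq.empty)

println(permissiveNull("[1, 2"))   // None, i.e. null
println(permissiveEmpty("[1, 2"))  // List()
```

Option A keeps malformed records distinguishable from genuinely empty arrays, which is the consistency argument made in the thread.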
> but we need more discussion about this behavior.

@cloud-fan Should I send an email to the dev list, or can we discuss this here?
I could revert the recent commits and prepare a separate PR for the behaviour change. WDYT?
I think it's okay to return `null` for map and array. Let's make some changes so that it is `null` for map and array.
I have discarded the recent changes.
Test build #98655 has finished for PR 22938 at commit
Test build #98660 has finished for PR 22938 at commit
Test build #98701 has finished for PR 22938 at commit
  assert(!rows.hasNext)
  castRow(result)
} else {
  throw new IllegalArgumentException("Expected one row from JSON parser.")
This can only happen when we have a bug, right?
Right, it must not happen.
@@ -1115,6 +1115,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
      Row(null, null, null),
      Row(null, null, null),
      Row(null, null, null),
+     Row(null, null, null),
So for the JSON data source, the previous behavior is that we would skip the row even in PERMISSIVE mode. Shall we clearly mention it in the migration guide?
> so for json data source, previous behavior is, we would skip the row even it's in PERMISSIVE mode.

Yes, we skipped such rows if the Jackson parser wasn't able to find any root tokens. So, not only empty strings and gaps fell into that category.

> Shall we clearly mention it in the migration guide?

Sure.
@@ -1813,6 +1817,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
      val path = dir.getCanonicalPath
      primitiveFieldAndType
        .toDF("value")
+       .repartition(1)
Why is the `repartition` required?
As far as I remember, I added the `repartition(1)` here and in other places to eliminate empty files. Such empty files are produced by empty partitions. Probably we could avoid writing empty files, at least in the case of text-based datasources, but in any case let's look at `TextOutputWriter`, for example. It creates an output stream for a file in its constructor:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala, line 151 in 46110a5:
private val writer = CodecStreams.createOutputStream(context, new Path(path))

and closes the empty file in:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala, line 162 in 46110a5:
writer.close()

From the read side, when the Jackson parser tries to read the empty file, it cannot detect any JSON tokens at the root level and returns null from `nextToken()`, for which I throw a bad record exception for now -> `Row(...)` in `PERMISSIVE` mode.
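The empty-file effect described above can be reproduced with a self-contained sketch (plain Scala file I/O, no Spark; the writer model is a deliberate simplification of `TextOutputWriter`): a writer opens one output file per partition, so an empty partition still leaves a zero-byte file behind, which is exactly what `repartition(1)` avoids.

```scala
import java.nio.file.{Files, Path}

// One output file per partition, created even when the partition is
// empty - a simplified model of how TextOutputWriter opens its stream
// in the constructor and closes it unconditionally.
def writePartitions(dir: Path, partitions: Seq[Seq[String]]): Seq[Path] =
  partitions.zipWithIndex.map { case (rows, i) =>
    val file = dir.resolve(f"part-$i%05d.json")
    Files.write(file, rows.mkString("\n").getBytes("UTF-8"))
    file
  }

val dir = Files.createTempDirectory("json-parts")
// Two partitions, one of them empty (as after an uneven repartition).
val files = writePartitions(dir, Seq(Seq("""{"a": 1}"""), Seq.empty))
println(files.count(f => Files.size(f) == 0))  // 1

// After the equivalent of repartition(1), everything lands in one partition:
val single = writePartitions(dir, Seq(Seq("""{"a": 1}""")))
println(single.count(f => Files.size(f) == 0))  // 0
</imports>
```

A reader that treats an empty file as a bad record then sees one extra record per empty partition, which is the test-count change discussed below.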
@cloud-fan @HyukjinKwon Do you agree with the proposed changes, or is there anything that blocks the PR for now?
@HyukjinKwon @cloud-fan May I ask you to look at this PR one more time?
LGTM except the migration guide. The JSON data source can't produce null rows, but skips them even in permissive mode.
Test build #99143 has finished for PR 22938 at commit
thanks, merging to master!
Test build #99144 has finished for PR 22938 at commit
@@ -1892,7 +1898,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
      .text(path)

    val jsonDF = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path)
-   assert(jsonDF.count() === corruptRecordCount)
+   assert(jsonDF.count() === corruptRecordCount + 1) // null row for empty file
Wait, does this mean that it reads an empty record from an empty file after this change?
If that's true, we should not do this. Empty files can be generated in many cases for now and the behaviour is not currently well defined. If we rely on this behaviour, it will cause some weird behaviours or bugs hard to fix.
Shall we skip empty files for all the file-based data sources?
Sorry for the late response. The change looks good to me too in general, but I had two questions (see also #22938 (comment)).
## What changes were proposed in this pull request?

An input without valid JSON tokens on the root level will be treated as a bad record, and handled according to `mode`. Previously such input was converted to `null`. After the changes, the input is converted to a row with `null`s in the `PERMISSIVE` mode according to the schema. This allows removing code in the `from_json` function which could produce `null` as result rows.

## How was this patch tested?

It was tested by existing test suites. Some of them I had to modify (`JsonSuite` for example) because previously bad input was just silently ignored. Now such input is handled according to the specified `mode`.

Closes apache#22938 from MaxGekk/json-nulls.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

This PR reverts apache#22938 per discussion in apache#23325.

Closes apache#23325
Closes apache#23543 from MaxGekk/return-nulls-from-json-parser.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>