DRILL-7979: Self-Closing XML Tags Cause Schema Change Exceptions #2283

cgivre · 2021-08-02T03:26:55Z

DRILL-7979: Self-Closing XML Tags Cause Schema Change Exceptions

Description

Self-closing XML tags are handled strangely by Java's streaming parser. If one row of your data contains a self-closing XML tag such as <foo/>, but in the next row foo contains a map or other nested field, Drill will throw a schema change exception.
This proposed fix causes Drill to ignore self-closing tags unless they have attributes, which allows data like this to be successfully queried.

For instance, prior to this PR, the data below would not work, but now can be successfully queried.

<row>
  <foo/>
  <bar/>
</row>
<row>
  <foo>
     <f1>v1</f1>
     <f2>v2</f2>
   </foo>
   <bar/>
</row>
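
As a rough illustration of the "ignore self-closing tags unless they have attributes" rule, here is a minimal standalone sketch built on the JDK's StAX XMLEventReader. It is not the code from this patch: the class EmptyElementFilter, the method keptElements and the wrapping <rows> root element are assumptions made purely for the example. It relies on the fact that StAX reports an empty element as a start event immediately followed by an end event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of Drill: lists the elements that survive
// when attribute-less empty elements are skipped.
final class EmptyElementFilter {

  /** Returns the names of the start elements that are kept, in document order. */
  static List<String> keptElements(String xml) throws XMLStreamException {
    XMLEventReader reader =
        XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
    List<String> kept = new ArrayList<>();
    try {
      while (reader.hasNext()) {
        XMLEvent event = reader.nextEvent();
        if (!event.isStartElement()) {
          continue;
        }
        StartElement start = event.asStartElement();
        // An empty element is a start event whose very next event is an end event.
        boolean closesImmediately = reader.peek().isEndElement();
        boolean hasAttributes = start.getAttributes().hasNext();
        if (closesImmediately && !hasAttributes) {
          reader.nextEvent(); // consume the matching end event and skip the element
          continue;
        }
        kept.add(start.getName().getLocalPart());
      }
    } finally {
      reader.close();
    }
    return kept;
  }

  public static void main(String[] args) throws XMLStreamException {
    // The sample rows from the description, wrapped in an assumed <rows> root.
    String xml = "<rows><row><foo/><bar/></row>"
        + "<row><foo><f1>v1</f1><f2>v2</f2></foo><bar/></row></rows>";
    System.out.println(keptElements(xml));
  }
}
```

With this rule the empty foo in the first row never reaches the schema, so the first time foo is seen it is already a parent element.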

Documentation

No user facing changes.

Testing

Added additional unit test and tested manually.

jnturton

I started out adding specific, implementation-level comments but I've paused that to back off and ask: is this really a self-closing tag thing, or is the situation the same for any empty element that also occurs as a parent element? In my tests on master, the problem is the same for either of the following, which I believe are also equivalent in the XML spec.

<!-- self-closing -->
<foo/>

<!-- just empty -->
<foo></foo>

If I've got the right end of the stick here then I suggest that we adjust all the naming to refer to the "empty element" case, rather than the "self-closing" case.
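
One way to check the equivalence at the parser level is to dump the StAX events for both forms. The sketch below is only an illustration using the JDK's cursor API (XMLStreamReader); the class EmptyElementEvents and the method events are names invented for the example. Both inputs should produce a start event immediately followed by an end event, with no character data in between.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for comparing how StAX reports the two empty-element forms.
final class EmptyElementEvents {

  /** Records start, end and character events in the order the parser reports them. */
  static List<String> events(String xml) throws XMLStreamException {
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    List<String> seen = new ArrayList<>();
    try {
      while (reader.hasNext()) {
        switch (reader.next()) {
          case XMLStreamConstants.START_ELEMENT:
            seen.add("START " + reader.getLocalName());
            break;
          case XMLStreamConstants.END_ELEMENT:
            seen.add("END " + reader.getLocalName());
            break;
          case XMLStreamConstants.CHARACTERS:
            seen.add("CHARACTERS");
            break;
          default:
            break; // ignore document start/end, comments and so on
        }
      }
    } finally {
      reader.close();
    }
    return seen;
  }

  public static void main(String[] args) throws XMLStreamException {
    System.out.println(events("<foo/>"));      // self-closing
    System.out.println(events("<foo></foo>")); // just empty
  }
}
```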

Next, following on from our comments on Jira and the idea of using maps for this case, what do you think of the following approach?

1. When our first encounter with an element foo is empty, and therefore ambiguous in terms of type, we default to the non-leaf case and make it a map.
2. For subsequent parent foo elements we return populated maps. For subsequent empty foo elements we return empty maps.
3. For subsequent leaf elements <foo>bar</foo>, which we would normally map to varchar but where we find that we've already got a map from step 1, we put the element value into the map under a hardcoded special key, e.g. { '__value__': 'bar' }.

The above will also work in the case when the first element encountered is empty but has attributes, <foo a='b' />, whereas the element-discarding logic in the present patch does not discard such elements. If you're not crazy about this it's no problem and I've probably got a couple more specific remarks to add on the implementation.
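
The proposal above can be sketched with plain Java maps, independent of Drill's vector and tuple-writer APIs. Everything here is hypothetical: the class ElementAsMap, the method asMap and the representation of an element as optional text plus optional children are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of "always treat foo as a map"; not Drill code.
final class ElementAsMap {

  // The hardcoded special key suggested for leaf text.
  static final String VALUE_KEY = "__value__";

  /**
   * Builds the map for one occurrence of an element.
   * text is the element's own character data, or null if it has none;
   * children holds its nested fields, or null if it has none.
   */
  static Map<String, Object> asMap(String text, Map<String, Object> children) {
    Map<String, Object> result = new LinkedHashMap<>();
    if (children != null) {
      result.putAll(children); // parent element: populated map
    }
    if (text != null && !text.isEmpty()) {
      result.put(VALUE_KEY, text); // leaf element: value stored under the special key
    }
    return result; // empty element: empty map
  }

  public static void main(String[] args) {
    Map<String, Object> children = new LinkedHashMap<>();
    children.put("f1", "v1");
    children.put("f2", "v2");

    System.out.println(asMap(null, null));     // <foo/>
    System.out.println(asMap(null, children)); // <foo><f1>v1</f1><f2>v2</f2></foo>
    System.out.println(asMap("bar", null));    // <foo>bar</foo>
  }
}
```

The column type stays a map in all three cases, which is what avoids the schema change; the cost is that leaf values are only reachable through the special key.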

contrib/format-xml/src/main/java/org/apache/drill/exec/store/xml/XMLUtils.java

jnturton · 2021-08-02T12:05:55Z

contrib/format-xml/src/main/java/org/apache/drill/exec/store/xml/XMLReader.java

@@ -72,9 +72,10 @@
  private InputStream fsStream;
  private XMLEventReader reader;
  private ImplicitColumns metadata;
+  private boolean isSelfClosingEvent;


Did you consider adding something like IGNORED_ELEMENT or SELF_CLOSING_TAG state to the xmlState enum? Would that come out any simpler than the new boolean isSelfClosingEvent?

I thought about this; however, the issue is where the self-closing tag occurs. Adding an additional state might mess with the other functionality.

jnturton · 2021-08-02T14:46:23Z

Perhaps we should be trying for consistency with what Drill does with analogous JSON data. Querying this document

[
	{
		"foo": null
	},
	{
		"foo": { "bar": 0 }
	}
]

gives you

foo      |
---------|
{}       |
{"bar":0}|

The null value becomes an empty map, as I proposed for empty XML elements, but things are otherwise different. Adding an object with an int property {"foo": 2} returns an error, not a map with a special key {'__value__' : 2 }. Changing that second object to hold "foo": [ 1, 2, 3 ] makes the foo column an array. Somehow Drill is able to delay its decision on the column type until the occurrence of the first non-null value. Is this something that's possible with Easy format plugins?
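
The behaviour described here, deferring the type decision until the first non-null value, can be modelled in a few lines of plain Java. This is only a toy model of the observed behaviour and makes no claim about how Drill's JSON reader is implemented; the class DeferredColumn and its Kind enum are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy model of a column whose type is fixed by its first non-null value.
final class DeferredColumn {

  enum Kind { UNDECIDED, SCALAR, MAP, ARRAY }

  private Kind kind = Kind.UNDECIDED;
  private final List<Object> values = new ArrayList<>();

  /** Adds a value; nulls never decide or contradict the column type. */
  void add(Object value) {
    if (value != null) {
      Kind seen = kindOf(value);
      if (kind == Kind.UNDECIDED) {
        kind = seen; // first non-null value fixes the type
      } else if (kind != seen) {
        throw new IllegalStateException(
            "schema change: column is " + kind + " but got " + seen);
      }
    }
    values.add(value);
  }

  Kind kind() {
    return kind;
  }

  private static Kind kindOf(Object value) {
    if (value instanceof Map) {
      return Kind.MAP;
    }
    if (value instanceof List) {
      return Kind.ARRAY;
    }
    return Kind.SCALAR;
  }

  public static void main(String[] args) {
    DeferredColumn foo = new DeferredColumn();
    foo.add(null);                                 // "foo": null
    System.out.println(foo.kind());                // still undecided
    foo.add(Collections.singletonMap("bar", 0));   // "foo": { "bar": 0 }
    System.out.println(foo.kind());                // now a map
    try {
      foo.add(2);                                  // "foo": 2
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```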

cgivre · 2021-08-02T15:29:04Z

@dzamo Thanks for the response. The real issue is that we don't know the schema as we're scanning the file, so we have to do the best we can. The issue is that with the empty fields (self-closing or otherwise) we don't really know what they are until we see real data. For instance, if we decide to make them an empty map, we'll get an error if the next record shows up as a scalar. The current approach was to treat empty fields as scalars which then causes issues if we encounter a map in the next row.
You asked in another comment about perhaps treating all empty elements in the same manner. There was a specific challenge with how the self-closing tags are handled, which is why I made this PR. I'm actually working on another project to get the XML reader to download a provided schema (the XSD link), which would actually solve a lot of issues reading XML.

jnturton · 2021-08-02T17:04:19Z

Yes, I think I did grok the unknown schema problem. The thought above, which somehow escaped all the striking out I did to it after thinking a bit more, was to take advantage of the fact that a scalar string can be embedded in a single-element map. The tuple-generating code would need to become aware of when it should do this.

My second comment's comparison of the situation with a JSON property that is first null, then an object, is also a bit dubious because empty XML elements do not represent nulls (from what I read) so much as zero-length strings.

If there is an effort to make querying XML behave in a more similar way to querying equivalent JSON, for some definition of equivalent, it should probably wait for another PR.

jnturton

Looks good if you're happy that my minor inline comments are covered.

cgivre · 2021-08-02T22:31:12Z

I think you're right about that. From what I remember, there is an option for Drill's JSON parser to treat NaN and something else as null. For XML I don't know how you'd distinguish between an empty string and null.

This was also an issue with some data I was working on. The JSON version used empty strings to denote null then subsequent rows would contain maps which would cause SchemaChange exceptions. The only way to fix that was to use the UNION data type.

cgivre · 2021-08-02T23:21:54Z

@dzamo Are we good to go on this PR?

jnturton

+1

jnturton · 2021-08-03T09:25:34Z

Now that I test master again after this merge, both the self-closing and long-form empty XML element cases work perfectly. As with the JSON example above, an empty <foo></foo> followed by a parent <foo><bar>1</bar></foo> results in an empty map for the <foo></foo>, while before the merge I got an error. I have to confess that I couldn't see all of this getting sorted out by this PR, which seemed focussed on self-closing tags only, but perhaps I was testing an old build in the first place. Nice one!

cgivre added 4 commits July 29, 2021 15:34

Initial commit

851cdad

WIP

c0abd9c

Unit test working but attribute not popping off stack

cae08e4

Everything working

97305e7

cgivre added the bug label Aug 2, 2021

cgivre self-assigned this Aug 2, 2021

cgivre added 3 commits August 1, 2021 23:27

Removed Logback.xml

f3bd1c6

Fixed Unit Test

c305099

Fixed corrupt file

9f66f04

cgivre requested a review from jnturton August 2, 2021 06:44

jnturton reviewed Aug 2, 2021

jnturton self-assigned this Aug 2, 2021

jnturton closed this Aug 2, 2021

jnturton reopened this Aug 2, 2021

jnturton self-requested a review August 2, 2021 17:43

jnturton approved these changes Aug 2, 2021

Addressed Review Comments

ca1af92

jnturton approved these changes Aug 3, 2021

cgivre merged commit 129d740 into apache:master Aug 3, 2021
