Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-7979: Self-Closing XML Tags Cause Schema Change Exceptions #2283

Merged
merged 8 commits into from
Aug 3, 2021

Conversation

cgivre
Copy link
Contributor

@cgivre cgivre commented Aug 2, 2021

DRILL-7979: Self-Closing XML Tags Cause Schema Change Exceptions

Description

Self closing XML tags are dealt with strangely by java's streaming parser. If you have data where you have one row containing a self closing XML tag foo () but then in the next row foo contains a map or other nested field, Drill will throw a schema change exception.
This proposed fix causes Drill to ignore self-closing tags unless they have attributes, which allows data like this to be successfully queried.

For instance, prior to this PR, the data below would not work, but now can be successfully queried.

<row>
  <foo/>
  <bar/>
</row>
<row>
  <foo>
     <f1>v1</f1>
     <f2>v2</f2>
   </foo>
   <bar/>
</row>

Documentation

No user facing changes.

Testing

Added additional unit test and tested manually.

@cgivre cgivre added the bug label Aug 2, 2021
@cgivre cgivre self-assigned this Aug 2, 2021
@cgivre cgivre requested a review from jnturton August 2, 2021 06:44
Copy link
Contributor

@jnturton jnturton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started out adding specific, implementation-level comments but I've paused that to back off and ask: is this really a self-closing tag thing, or is the situation the same for any empty element that also occurs as a parent element? In my tests on master. the problem is the same for either of the following, which I believe are also equivalent in the XML spec.

<!-- self-closing -->
<foo/>

<!-- just empty -->
<foo></foo>

If I've got right end of the stick here then I suggest that we adjust all the naming to refer to the "empty element" case, rather than the "self-closing" case.

Next, following on from our comments on Jira and the idea of using maps for this case, what do you think of the following approach?

  1. When our first encounter with an element foo is empty, and therefore ambiguous in terms of type, we default to the non-leaf case and make it a map.
  2. For subsequent parent foo elements we return populated maps. For subsequent empty foo elements we return empty maps.
  3. For subsequent leaf elements <foo>bar</foo>, which we would normally map to varchar but where we find that we've already got a map from step 1, we put the element value into the map under a hardcoded special key, e.g. { '__value__': 'bar' }.

The above will also work in the case when the first element encountered is empty but has attributes <foo a='b' /> while the element discarding logic in the present patch does not discard such elements. If you're not crazy about this it's no problem and I've probably got a couple more specific remarks to add on the implementation.

@@ -72,9 +72,10 @@
private InputStream fsStream;
private XMLEventReader reader;
private ImplicitColumns metadata;
private boolean isSelfClosingEvent;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider adding something like IGNORED_ELEMENT or SELF_CLOSING_TAG state to the xmlState enum? Would that come out any simpler than the new boolean isSelfClosingEvent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, however the issue is where the self-closing tag occurs. Adding an additional state might mess with the other functionality.

@jnturton
Copy link
Contributor

jnturton commented Aug 2, 2021

Perhaps we should be trying for consistency with what Drill does with analogous JSON data. Querying this document

[
	{
		"foo": null
	},
	{
		"foo": { "bar": 0 }
	}
]

gives you

foo      |
---------|
{}       |
{"bar":0}|

. The null value becomes an empty map, as I proposed for empty XML elements, but things are otherwise different. Adding an object with an int property {"foo": 2} returns an error, not a map with a special key {'__value__' : 2 }. Changing that second object hold "foo": [ 1, 2, 3 ] makes the foo column an array. Somehow Drill is able to delay its decision on the column type until the ocurrence of the first non-null value. Is this something that's possible with Easy format plugins?

@cgivre
Copy link
Contributor Author

cgivre commented Aug 2, 2021

I started out adding specific, implementation-level comments but I've paused that to back off and ask: is this really a self-closing tag thing, or is the situation the same for any empty element that also occurs as a parent element? In my tests on master. the problem is the same for either of the following, which I believe are also equivalent in the XML spec.

<!-- self-closing -->
<foo/>

<!-- just empty -->
<foo></foo>

If I've got right end of the stick here then I suggest that we adjust all the naming to refer to the "empty element" case, rather than the "self-closing" case.

Next, following on from our comments on Jira and the idea of using maps for this case, what do you think of the following approach?

  1. When our first encounter with an element foo is empty, and therefore ambiguous in terms of type, we default to the non-leaf case and make it a map.
  2. For subsequent parent foo elements we return populated maps. For subsequent empty foo elements we return empty maps.
  3. For subsequent leaf elements <foo>bar</foo>, which we would normally map to varchar but where we find that we've already got a map from step 1, we put the element value into the map under a hardcoded special key, e.g. { '__value__': 'bar' }.

The above will also work in the case when the first element encountered is empty but has attributes <foo a='b' /> while the element discarding logic in the present patch does not discard such elements. If you're not crazy about this it's no problem and I've probably got a couple more specific remarks to add on the implementation.

@dzamo Thanks for the response. The real issue is that we don't know the schema as we're scanning the file, so we have to do the best we can. The issue is that with the empty fields (self-closing or otherwise) we don't really know what they are until we see real data. For instance, if we decide to make them an empty map, we'll get an error if the next record shows up as a scalar. The current approach was to treat empty fields as scalars which then causes issues if we encounter a map in the next row.
You asked in an other comment about perhaps treating all empty elements in the same manner. There was a specific challenge as to how the self closing tags which is why I made this PR. I'm actually working on another project to get the XML reader to download a provided schema (the XSD link) which would actually solve a lot of issues reading XML.

@jnturton
Copy link
Contributor

jnturton commented Aug 2, 2021

Yes, I think I did grok the unknown schema problem. The thought above, which somehow escaped all the striking out I did to it after thinking a bit more, was to take advantage of the fact that scalar string can be embedded into a single element map. The tuple generating code would need to become aware when it should do this.

My second comment's comparison of the situation with a JSON property that is first null, then an object, is also a bit dubious because empty XML elements do not represent nulls (from I what read) so much as zero length strings.

If there is an effort to make querying XML behave in a more similar way to querying equivalent JSON, for some definition of equivalent, it should probably wait for another PR.

@jnturton jnturton self-assigned this Aug 2, 2021
@jnturton jnturton closed this Aug 2, 2021
@jnturton jnturton reopened this Aug 2, 2021
@jnturton jnturton self-requested a review August 2, 2021 17:43
Copy link
Contributor

@jnturton jnturton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good if you're happy that my minor inline comments are covered.

@cgivre
Copy link
Contributor Author

cgivre commented Aug 2, 2021

Yes, I think I did grok the unknown schema problem. The thought above, which somehow escaped all the striking out I did to it after thinking a bit more, was to take advantage of the fact that scalar string can be embedded into a single element map. The tuple generating code would need to become aware when it should do this.

My second comment's comparison of the situation with a JSON property that is first null, then an object, is also a bit dubious because empty XML elements do not represent nulls (from I what read) so much as zero length strings.

If there is an effort to make querying XML behave in a more similar way to querying equivalent JSON, for some definition of equivalent, it should probably wait for another PR.

I think you're right about that. From what I remember, there is an option for Drill's JSON parser to treat NaN and something else as null. For XML I don't know how you'd distinguish between an empty string and null.

This was also an issue with some data I was working on. The JSON version used empty strings to denote null then subsequent rows would contain maps which would cause SchemaChange exceptions. The only way to fix that was to use the UNION data type.

@cgivre
Copy link
Contributor Author

cgivre commented Aug 2, 2021

@dzamo Are we good to go on this PR?

Copy link
Contributor

@jnturton jnturton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@cgivre cgivre merged commit 129d740 into apache:master Aug 3, 2021
@jnturton
Copy link
Contributor

jnturton commented Aug 3, 2021

Now that I test master again after this merge, both the self-closing and long form empty XML element cases work perfectly. As with the JSON example above, an empty <foo></foo> followed by a parent <foo><bar>1</bar></foo> results in a empty map for the <foo></foo>, while before the merge I got an error. I have to confess that I couldn't see all of this getting sorted about by this PR, which seemed focussed on self-closing tags only, but perhaps I was testing an old build in the first place. Nice one!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants