-
Notifications
You must be signed in to change notification settings - Fork 985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-7979: Self-Closing XML Tags Cause Schema Change Exceptions #2283
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started out adding specific, implementation-level comments but I've paused that to back off and ask: is this really a self-closing tag thing, or is the situation the same for any empty element that also occurs as a parent element? In my tests on master
. the problem is the same for either of the following, which I believe are also equivalent in the XML spec.
<!-- self-closing -->
<foo/>
<!-- just empty -->
<foo></foo>
If I've got right end of the stick here then I suggest that we adjust all the naming to refer to the "empty element" case, rather than the "self-closing" case.
Next, following on from our comments on Jira and the idea of using maps for this case, what do you think of the following approach?
- When our first encounter with an element
foo
is empty, and therefore ambiguous in terms of type, we default to the non-leaf case and make it a map. - For subsequent parent
foo
elements we return populated maps. For subsequent emptyfoo
elements we return empty maps. - For subsequent leaf elements
<foo>bar</foo>
, which we would normally map to varchar but where we find that we've already got a map from step 1, we put the element value into the map under a hardcoded special key, e.g.{ '__value__': 'bar' }
.
The above will also work in the case when the first element encountered is empty but has attributes <foo a='b' />
while the element discarding logic in the present patch does not discard such elements. If you're not crazy about this it's no problem and I've probably got a couple more specific remarks to add on the implementation.
contrib/format-xml/src/main/java/org/apache/drill/exec/store/xml/XMLUtils.java
Outdated
Show resolved
Hide resolved
@@ -72,9 +72,10 @@ | |||
private InputStream fsStream; | |||
private XMLEventReader reader; | |||
private ImplicitColumns metadata; | |||
private boolean isSelfClosingEvent; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider adding something like IGNORED_ELEMENT
or SELF_CLOSING_TAG
state to the xmlState
enum? Would that come out any simpler than the new boolean isSelfClosingEvent
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this, however the issue is where the self-closing tag occurs. Adding an additional state might mess with the other functionality.
Perhaps we should be trying for consistency with what Drill does with analogous JSON data. Querying this document [
{
"foo": null
},
{
"foo": { "bar": 0 }
}
] gives you
. The null value becomes an empty map, as I proposed for empty XML elements, but things are otherwise different. Adding an object with an int property |
@dzamo Thanks for the response. The real issue is that we don't know the schema as we're scanning the file, so we have to do the best we can. The issue is that with the empty fields (self-closing or otherwise) we don't really know what they are until we see real data. For instance, if we decide to make them an empty map, we'll get an error if the next record shows up as a scalar. The current approach was to treat empty fields as scalars which then causes issues if we encounter a map in the next row. |
Yes, I think I did grok the unknown schema problem. The thought above, which somehow escaped all the striking out I did to it after thinking a bit more, was to take advantage of the fact that scalar string can be embedded into a single element map. The tuple generating code would need to become aware when it should do this. My second comment's comparison of the situation with a JSON property that is first null, then an object, is also a bit dubious because empty XML elements do not represent nulls (from I what read) so much as zero length strings. If there is an effort to make querying XML behave in a more similar way to querying equivalent JSON, for some definition of equivalent, it should probably wait for another PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good if you're happy that my minor inline comments are covered.
I think you're right about that. From what I remember, there is an option for Drill's JSON parser to treat This was also an issue with some data I was working on. The JSON version used empty strings to denote |
@dzamo Are we good to go on this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Now that I test master again after this merge, both the self-closing and long form empty XML element cases work perfectly. As with the JSON example above, an empty |
DRILL-7979: Self-Closing XML Tags Cause Schema Change Exceptions
Description
Self closing XML tags are dealt with strangely by java's streaming parser. If you have data where you have one row containing a self closing XML tag foo () but then in the next row
foo
contains a map or other nested field, Drill will throw a schema change exception.This proposed fix causes Drill to ignore self-closing tags unless they have attributes, which allows data like this to be successfully queried.
For instance, prior to this PR, the data below would not work, but now can be successfully queried.
Documentation
No user facing changes.
Testing
Added additional unit test and tested manually.