This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Commit b79b1a9

Add missed other default case when parsing/inferring XML documents
This PR adds support for skipping multiple runs of whitespace around a comment. This should have been added earlier but was missed.

`XMLStreamConstants.COMMENT` is always skipped [here](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/parsers/StaxXmlParser.scala#L51-L52) and [here](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/InferSchema.scala#L85-L86), but it seems a `COMMENT` can sit between two runs of whitespace. In that case, `factory.setProperty(XMLInputFactory.IS_COALESCING, true)` does not coalesce the two runs.

In more detail, given

```xml
<a>
  <!-- comment -->
  <b>...</b>
</a>
```

`<!-- comment -->` is surrounded by whitespace. This produces the events below:

```bash
XMLStreamConstants.CHARACTERS # whitespace
XMLStreamConstants.COMMENT # comment
XMLStreamConstants.CHARACTERS # whitespace
XMLStreamConstants.START_ELEMENT # <b>
```

The current code always filters out `XmlEvent.COMMENT`, so it ends up with

```bash
XMLStreamConstants.CHARACTERS # whitespace
XMLStreamConstants.CHARACTERS # whitespace
XMLStreamConstants.START_ELEMENT # <b>
```

This does not happen in the normal case, because multiple consecutive `XMLStreamConstants.CHARACTERS` events are coalesced into a single one, as below:

```bash
XMLStreamConstants.CHARACTERS # whitespace
XMLStreamConstants.START_ELEMENT # <b>
```

Author: hyukjinkwon <[email protected]>

Closes #166 from HyukjinKwon/missed-other-cases.
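The event sequence described in the commit message can be reproduced with a small standalone sketch. It uses only the JDK's `javax.xml.stream` API; the object name `CommentEventsSketch` and the method `eventTypes` are hypothetical helpers introduced for illustration and are not part of spark-xml.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ListBuffer

// Hypothetical helper for illustration only; not part of spark-xml.
object CommentEventsSketch {

  /** Collects the StAX event type codes for `xml`, optionally dropping COMMENT events. */
  def eventTypes(xml: String, dropComments: Boolean): List[Int] = {
    val factory = XMLInputFactory.newInstance()
    // Same property the commit message mentions: merge adjacent character events.
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val reader = factory.createXMLEventReader(new StringReader(xml))
    val types = ListBuffer.empty[Int]
    while (reader.hasNext) {
      val eventType = reader.nextEvent().getEventType
      val isComment = eventType == XMLStreamConstants.COMMENT
      if (!(dropComments && isComment)) types += eventType
    }
    types.toList
  }
}
```

Calling `CommentEventsSketch.eventTypes("<a> <!-- comment --> <b>x</b> </a>", dropComments = true)` should, on a typical JDK StAX implementation, yield two adjacent `XMLStreamConstants.CHARACTERS` codes before the `START_ELEMENT` for `<b>`, which is the situation the commit handles.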
1 parent 617a31e commit b79b1a9

3 files changed

+6
-0
lines changed

src/main/scala/com/databricks/spark/xml/parsers/StaxXmlParser.scala

Lines changed: 1 addition & 0 deletions
@@ -112,6 +112,7 @@ private[xml] object StaxXmlParser {
     case _: EndElement if data.isEmpty => null
     case _: EndElement if options.treatEmptyValuesAsNulls => null
     case _: EndElement => data
+    case _ => convertField(parser, dataType, options)
   }
 
   case (c: Characters, ArrayType(st, _)) =>
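The added `case _ => convertField(parser, dataType, options)` makes the parser look again when whitespace is followed by something other than a start or end element. The pattern can be sketched in isolation as below. This is a heavily simplified illustration under stated assumptions, not the actual `StaxXmlParser` code: `fieldKind`, `classifyFirstField`, and the returned labels are hypothetical, and comments are dropped with the JDK's `createFilteredReader`.

```scala
import java.io.StringReader
import javax.xml.stream.{EventFilter, XMLEventReader, XMLInputFactory, XMLStreamConstants}
import javax.xml.stream.events.{Characters, EndElement, StartElement, XMLEvent}
import scala.annotation.tailrec

// Simplified, hypothetical sketch of the "default case" pattern; not spark-xml code.
object DefaultCaseSketch {

  /** Classifies the content at the reader's current position. */
  @tailrec
  def fieldKind(parser: XMLEventReader): String = parser.peek match {
    case c: Characters if c.isWhiteSpace =>
      parser.nextEvent() // consume the whitespace run
      parser.peek match {
        case _: StartElement => "nested"
        case _: EndElement => "empty"
        // Default case: e.g. a second whitespace run left behind by a
        // filtered comment. Without it this match would throw a MatchError.
        case _ => fieldKind(parser)
      }
    case _: Characters => "text"
    case _: StartElement => "nested"
    case _: EndElement => "empty"
    case _ =>
      parser.nextEvent() // skip anything else and look again
      fieldKind(parser)
  }

  /** Opens `xml`, filters out comments, enters the root element, and classifies its content. */
  def classifyFirstField(xml: String): String = {
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val noComments = new EventFilter {
      override def accept(event: XMLEvent): Boolean =
        event.getEventType != XMLStreamConstants.COMMENT
    }
    val parser = factory.createFilteredReader(
      factory.createXMLEventReader(new StringReader(xml)), noComments)
    // Advance past the root start element.
    while (!parser.nextEvent().isStartElement) {}
    fieldKind(parser)
  }
}
```

In this sketch, an input such as `<a> <!-- c --> <b>1</b></a>` reaches the inner default case once, because the filtered comment leaves two separate whitespace events in front of `<b>`.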

src/main/scala/com/databricks/spark/xml/util/InferSchema.scala

Lines changed: 1 addition & 0 deletions
@@ -143,6 +143,7 @@ private[xml] object InferSchema {
     case _: EndElement if data.isEmpty => NullType
     case _: EndElement if options.treatEmptyValuesAsNulls => NullType
     case _: EndElement => StringType
+    case _ => inferField(parser, options)
   }
   case c: Characters if !c.isWhiteSpace =>
     // This means data exists

src/test/resources/null-nested-struct.xml

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,7 @@
 <root>
   <item>
     <b>
+      <!-- nested comments -->
       <es>
         <e>1</e>
       </es>
@@ -10,6 +11,9 @@
   <item>
     <!-- Issue 117 - This is where an empty Row would be produced instead of null -->
     <b>
+
+      <!-- nested comments -->
+
       <es></es>
     </b>
   </item>
