Below are some examples for discussion. I am only covering XmlParser to start with. All the examples use this parser and helper:
def parser = new XmlParser()
def encode(s) { s.replaceAll(' ', /_/).replaceAll('\n', /\\n/).replaceAll('\r', /\\r/).replaceAll('\t', /\\t/) }
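To make the expected strings in the examples easier to read, here is a small standalone check of what encode produces. The input string is just an illustration; only the helper itself comes from the examples.

```groovy
// Same helper as above, repeated so this snippet runs on its own.
def encode(s) { s.replaceAll(' ', /_/).replaceAll('\n', /\\n/).replaceAll('\r', /\\r/).replaceAll('\t', /\\t/) }

// Spaces become underscores; newlines and tabs become the two-character
// sequences backslash-n and backslash-t (a slashy string keeps the backslash literal).
assert encode(' a\n\tb') == /_a\n\tb/
```

So an underscore in the asserts stands for one space, and a visible \n stands for one real line break in the parsed text.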
I should also note that I am using a locally modified version of XmlParser that fixes GROOVY-5119, among other things I haven't committed yet. What I want to discuss is what the expected behaviour should be and what the current limitations are. The first example is here:
def xml1 = '''\
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE books [
    <!ELEMENT books (book)*>
    <!ELEMENT book (name, author, comment)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT comment (#PCDATA)>
]>
<books>
    <book>
        <name>Groovy in Action</name>
        <author>
            Dierk et al
        </author>
        <comment> </comment>
    </book>
</books>
'''
def result1 = [:].withDefault{''}
[true, false].each { trim ->
    parser.trimWhitespace = trim
    def books = parser.parseText(xml1)
    books.each { book ->
        book.each {
            result1[trim] += it instanceof Node ? "[${it.name()}:${encode(it.text())}]" : '(' + encode(it) + ')'
        }
    }
}
assert result1 == [
    (true): /[name:Groovy_in_Action][author:Dierk_et_al][comment:]/,
    (false):/[name:Groovy_in_Action][author:\n____________Dierk_et_al\n________][comment:_]/
]
Because we are using a DTD, the underlying Java parser has removed ignorable whitespace for us. I think most people would want the result obtained with trimWhitespace set to false (i.e. the opposite of the current default). However, some people might actually want to preserve the whitespace found in the original document, and currently we don't support this. We could change XmlParser to retain the ignorable whitespace: there is a currently empty ignorableWhitespace(...) method that we could use. The question then remains: do we just drop the "autoTrimming" (i.e. current default) behaviour, or keep a flag to allow that too?
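To illustrate the hook being discussed, here is a minimal sketch that talks to the JDK SAX parser directly rather than to XmlParser. WhitespaceProbe and probe are hypothetical names made up for this example. It only shows which SAX callback receives the whitespace with and without a DTD, and is not a proposal for how XmlParser should implement it.

```groovy
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.helpers.DefaultHandler

// Hypothetical handler: keeps the two SAX text callbacks apart.
class WhitespaceProbe extends DefaultHandler {
    List<String> ignorable = []
    List<String> text = []

    @Override
    void ignorableWhitespace(char[] ch, int start, int length) {
        ignorable << new String(ch, start, length)
    }

    @Override
    void characters(char[] ch, int start, int length) {
        text << new String(ch, start, length)
    }
}

// Hypothetical helper: parse a document and return the filled-in probe.
def probe = { String xml ->
    def handler = new WhitespaceProbe()
    def saxParser = SAXParserFactory.newInstance().newSAXParser()
    saxParser.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')), handler)
    handler
}

def dtd = '''\
<!DOCTYPE books [
    <!ELEMENT books (book)*>
    <!ELEMENT book (name)>
    <!ELEMENT name (#PCDATA)>
]>
'''
def body = '<books>\n    <book><name>Groovy in Action</name></book>\n</books>'

def withDtd = probe(dtd + body)
def withoutDtd = probe(body)

// Without a DTD nothing can be classified as ignorable,
// so the indentation arrives through characters().
assert withoutDtd.ignorable.isEmpty()
assert withoutDtd.text.join('').contains('\n    ')

// Anything reported through ignorableWhitespace() is whitespace only.
assert withDtd.ignorable.every { it.trim().isEmpty() }
assert withDtd.text.join('').contains('Groovy in Action')
```

With the parser bundled in the JDK I would expect the DTD run to deliver the indentation through ignorableWhitespace, but SAX only requires that of validating parsers, so the sketch does not assert it.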
The next example doesn't have a DTD. The underlying parser doesn't attempt to work out ignorable whitespace - it assumes that any part of the document may contain mixed content, so in theory nothing is ignorable. The results are:
def xml2 = '''\
<?xml version="1.0" ?>
<books>
    <book>
        <name>Groovy in Action</name>
        <author>
            Dierk et al
        </author>
        <comment> </comment>
    </book>
</books>
'''
def result2 = [:].withDefault{''}
[true, false].each { trim ->
    parser.trimWhitespace = trim
    def books = parser.parseText(xml2)
    books.each { book ->
        book.each {
            result2[trim] += it instanceof Node ? "[${it.name()}:${encode(it.text())}]" : '(' + encode(it) + ')'
        }
    }
}
assert result2 == [
    (true): /[name:Groovy_in_Action][author:Dierk_et_al][comment:]/,
    (false):/(\n)(_)(_)(_)(_)(\n________)[name:Groovy_in_Action](\n________)[author:\n____________Dierk_et_al\n________](\n________)[comment:_](\n____)(\n)/
]
With trimming set to true, the result is not unreasonable, but the single-space "comment" element has lost information, i.e. its single character. With trimming set to false we now get all of the text nodes containing whitespace. We get the space back for the "comment" element but now have a whole bunch of annoying text nodes that we have to "step over" while processing. I suspect most people would prefer the "false" result from example 1, with the ability to get the extra text nodes in special circumstances. This would be the most breaking part of any change here: people would expect "book.children().size() == 3", not "13". So do we try to mimic what the underlying XML parser would do if the document did have a DTD?
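For reference, "stepping over" the text nodes today looks roughly like the sketch below. elementsOf is a hypothetical helper, not part of XmlParser. The snippet assumes XmlParser is auto-imported from groovy.util, as in the examples above; on Groovy 3 or later it would need import groovy.xml.XmlParser.

```groovy
def parser = new XmlParser()
parser.trimWhitespace = false

def books = parser.parseText('''\
<books>
    <book>
        <name>Groovy in Action</name>
        <comment> </comment>
    </book>
</books>
''')

// Hypothetical helper: keep the element children and skip bare text nodes,
// however many of them the parser chose to retain.
def elementsOf = { Node node -> node.children().findAll { it instanceof Node } }

def book = elementsOf(books)[0]
assert elementsOf(book)*.name() == ['name', 'comment']
```

The filter gives the same element list whether or not whitespace-only text nodes were kept, which is the view I suspect most callers expect from children().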
The final example is a mixed-content example, included for completeness. It might be relevant for future discussions, but I won't say any more about it just yet.
def xml3 = '<p><b>Groovy in Action</b> by <em>Dierk et al</em></p>'
def result3 = [:].withDefault{''}
[true, false].each { trim ->
    parser.trimWhitespace = trim
    def p = parser.parseText(xml3)
    p.each {
        result3[trim] += it instanceof Node ? "[${it.name()}:${encode(it.text())}]" : '(' + encode(it) + ')'
    }
}
assert result3 == [
    (true): /[b:Groovy_in_Action](by)[em:Dierk_et_al]/,
    (false):/[b:Groovy_in_Action](_by_)[em:Dierk_et_al]/
]
XmlParser uses the property trimWhitespace with the (wrong) default true, and XmlSlurper uses the property keepWhitespace with the (wrong) default false. We should have a consistent API.