GROOVY-5086

XmlSlurper extremely slow when parsing HTML with DOCTYPE

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.6.4, 1.8.2, 1.8.3
    • Fix Version/s: None
    • Component/s: XML Processing
    • Labels:
    • Environment: OSX 10.6, OSX 10.7, Ubuntu 10.04
    • Number of attachments: 0

      Description

      When parsing an XHTML document with a DOCTYPE declaration, XmlSlurper takes about two minutes to finish. I reproduced this with several different Groovy versions on OS X and Linux, but not with the Groovy web console on appspot.com. Others ran into the same problem on the Grails user mailing list earlier this year: http://grails.1312388.n4.nabble.com/XMLSlurper-really-slow-reading-parsing-html-xml-file-td3433305.html

      Minimal example to trigger this:

      new XmlSlurper().parseText('''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de"><head><title>slurp</title></head><body /></html>''')

        Activity

        blackdrag added a comment -

        Is it possible that this is caused by the underlying XML parser trying to fetch the DTD from the net?

        Christoph Neuroth added a comment -

        Good point: passing the same XML to new XmlParser().parseText() has the same effect. Fetching that DTD file with curl aborts with "empty reply from server" after a while, whereas fetching it with Chrome works instantly. So this might be a networking problem, but I still think there should be a way to avoid fetching external resources with XmlParser/XmlSlurper...

        Christoph Neuroth added a comment -

        This explains the issue: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/

        Christoph Neuroth added a comment -

        Workaround, thanks to Christian Sonne Jensen [1]: you can disable fetching external DTDs. So this is not really a bug, but maybe the Groovy classes should provide an easier way around this and document the issue, so that people stop DDoSing the W3C servers.

        def s = new XmlSlurper()
        // Tell the underlying parser not to load the external DTD named in the DOCTYPE
        s.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

        [1] http://groovy.329449.n5.nabble.com/XmlParser-XmlSlurper-howto-disable-DTD-validation-td353673.html
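
        For reference, here is a minimal sketch that applies the workaround to the document from the description and checks that the result is still usable. It assumes the underlying SAX parser is Xerces (for example the one bundled with the JDK), since both feature URIs are Xerces-specific and another parser may reject them. The disallow-doctype-decl line is an addition not mentioned in this thread: newer Groovy releases may refuse DOCTYPE declarations by default, and switching that feature off should be harmless on older releases. The variable names and the timing code are illustrative only.

```groovy
// Groovy 3+ import; on Groovy 2.x and older, drop this line
// (groovy.util.XmlSlurper is imported automatically there).
import groovy.xml.XmlSlurper

// The XHTML document from the issue description.
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de"><head><title>slurp</title></head><body /></html>'''

def slurper = new XmlSlurper()

// Assumption: allow the DOCTYPE declaration itself, in case this Groovy
// version disallows it by default.
slurper.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)

// The workaround from this thread: do not load the external DTD.
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

def start = System.currentTimeMillis()
def html = slurper.parseText(xhtml)
def elapsedMs = System.currentTimeMillis() - start

// The parsed tree can be navigated as usual.
assert html.name() == 'html'
assert html.head.title.text() == 'slurp'
println "parsed in ${elapsedMs} ms"
```

        Note that with the external DTD skipped, entities it defines (such as &nbsp;) are not known to the parser, so documents that use them may fail to parse. XmlParser exposes the same setFeature method, so the same two calls should apply there as well.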


          People

          • Assignee: Guillaume Laforge
          • Reporter: Christoph Neuroth
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved: