With Groovy, it’s very easy to parse XML data and extract arbitrary information. This works great as long as the input is well-formed, but you can’t always guarantee that in real-world scenarios. Think of extracting data from HTML pages: they are very often a mess when it comes to XML well-formedness, and that’s where the TagSoup library comes to the rescue.
There are two major problems with HTML input:
- DTD resolution
- Missing closing tags
We are going to build a simple Groovy script that prints the list of questions on StackOverflow’s start page. The straightforward solution looks something like this:
def slurper = new XmlSlurper()
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}
We parse http://stackoverflow.com with XmlSlurper, loop over all tags whose class attribute is ‘question-hyperlink’ and print each one. But when running the script we get the following exception:
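Caught: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/strict.dtd
    at html_parser.run(html_parser.groovy:7)
The parser tries to download the DTD referenced by the page’s DOCTYPE, and the W3C server refuses the request. This is the DTD resolution problem from the list above, and we can avoid it by telling the parser not to load external DTDs: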
def slurper = new XmlSlurper()
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}
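To try the DTD switch without hitting the network, here is a minimal sketch that parses an inline XHTML snippet instead of the live page. The markup and the question titles are made up for illustration. It also assumes a fairly recent Groovy, where XmlSlurper rejects DOCTYPE declarations unless the three-argument constructor allows them, and where the class lives in groovy.xml.

```groovy
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper

// Hypothetical, well-formed XHTML snippet standing in for a real page
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <body>
    <a class="question-hyperlink" href="/q/1">First question</a>
    <a class="other" href="/about">About</a>
    <a class="question-hyperlink" href="/q/2">Second question</a>
  </body>
</html>'''

// Arguments: validating = false, namespaceAware = true, allowDocTypeDeclaration = true
def slurper = new XmlSlurper(false, true, true)
// Keep the parser from downloading the DTD named in the DOCTYPE
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

def page = slurper.parseText(xhtml)
// Depth-first search over all elements, keeping only the question links
def titles = page.'**'.findAll { it.@class == 'question-hyperlink' }.collect { it.text() }
titles.each { println it }
```

The snippet avoids entities such as &nbsp; on purpose: without the DTD, the parser has no definition for them.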
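That takes care of the DTD, but a strict XML parser would still break on HTML that is not well-formed, for example on missing closing tags. TagSoup is a SAX-compliant parser built for exactly that kind of input, and XmlSlurper accepts it as its underlying reader:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}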
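The same approach can be checked offline. The following is a small sketch, not the script from above: it feeds TagSoup an invented HTML fragment with unclosed p and br tags and no closing html tag, then runs the same GPath query. The markup and link targets are assumptions made for the example.

```groovy
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
import org.ccil.cowan.tagsoup.Parser
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper

// Hypothetical markup that is not well-formed XML: <p> and <br> are never closed
def html = '''<html><body>
<p><a class="question-hyperlink" href="/q/1">First question</a>
<p><a class="question-hyperlink" href="/q/2">Second question</a>
<br>
</body>'''

// TagSoup repairs the tag structure and hands SAX events to XmlSlurper
def slurper = new XmlSlurper(new Parser())
def page = slurper.parseText(html)

def links = page.'**'.findAll { it.@class == 'question-hyperlink' }
links.each { println "${it.text()} -> ${it.@href}" }
```

Because TagSoup does not validate against a DTD, no load-external-dtd feature needs to be set here.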




I’ve been using NekoHTML lately. The syntax is a bit different though and I haven’t benchmarked yet if it’s faster or slower than XmlSlurper.
FYI, the syntax is a little like this in Groovy:
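import org.cyberneko.html.parsers.DOMParser
import org.w3c.dom.Document
import org.xml.sax.InputSource
import static groovyx.net.http.ContentType.TEXT
import static groovyx.net.http.Method.GET
// Fetch the page body as plain text
String get(def uri) {
    builder.request(uri, GET, TEXT, {}).text
}
// Parse the fetched HTML into a DOM document with NekoHTML
Document document(def uri) {
    DOMParser parser = new DOMParser()
    parser.parse(new InputSource(new StringReader(get(uri))))
    parser.document
}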
(builder is an HTTPBuilder instance)
Thanks for pointing this out. I know some other libraries aiming at the same goal, e.g. TidyHTML, but I had never heard of NekoHTML.
Looks like this one is the way to go if you would like to use a DOMParser, though I really like XmlSlurper’s syntax in Groovy.
Performance is not really an issue in my projects, so I wouldn’t really care which one is faster. Reliability is more important: how close is the result to what the page author intended?
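For what it’s worth, NekoHTML also ships a SAX parser, so it should be possible to keep XmlSlurper’s syntax and only swap the underlying reader. This is an untested-in-production sketch under a few assumptions: the Maven coordinates and version below, an invented HTML fragment, and NekoHTML’s default of reporting attribute names in lower case.

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.22')
import org.cyberneko.html.parsers.SAXParser
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper

// Hypothetical markup with unclosed <p> and <br> tags
def html = '''<html><body>
<p><a class="question-hyperlink" href="/q/1">First question</a>
<p><a class="question-hyperlink" href="/q/2">Second question</a>
<br>
</body>'''

// NekoHTML's SAXParser is an XMLReader, so XmlSlurper can use it directly
def page = new XmlSlurper(new SAXParser()).parseText(html)

// The query relies only on attribute values and text, not on element names
def titles = page.'**'.findAll { it.@class == 'question-hyperlink' }.collect { it.text() }
titles.each { println it }
```

Since the query is the same as with TagSoup, comparing how the two parsers repair a given page comes down to swapping one constructor argument.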