With Groovy, it’s very easy to parse XML data and extract arbitrary information. This works great as long as the input data is well-formed, but you can’t always guarantee that in real-world scenarios. Think of extracting data from HTML pages. They are very often a mess when it comes to XML validity and that’s where the TagSoup library comes to the rescue.
There are two major problems with HTML input:
- DTD resolution
- Missing closing tags
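To see the second problem in isolation, here is a minimal sketch that feeds XmlSlurper a snippet with an unclosed br tag. The sample markup and variable names are made up for illustration. The import assumes Groovy 3 or later, where XmlSlurper lives in groovy.xml; on older versions the class is in groovy.util and needs no import.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import
import org.xml.sax.SAXParseException

// Typical browser-friendly HTML: the <br> is never closed (hypothetical sample)
def html = '<html><body><p>first line<br>second line</p></body></html>'

def parseError = null
try {
    new XmlSlurper().parseText(html)
} catch (SAXParseException e) {
    // A strict XML parser stops at the first well-formedness violation
    parseError = e
}

println "Strict parsing failed: ${parseError?.message}"
```

A browser would render this snippet without complaint, which is why such markup is so common on real pages.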
We are going to build a simple Groovy script that prints the list of questions on StackOverflow’s start page. The straightforward solution looks something like this:
def slurper = new XmlSlurper()
def htmlParser = slurper.parse("http://stackoverflow.com/")
// '**' walks the whole document depth-first
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}
We parse http://stackoverflow.com with XmlSlurper, loop over all tags whose class attribute is ‘question-hyperlink’ and print each of them. But when running the script we get the following exception:
Caught: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/strict.dtd at html_parser.run(html_parser.groovy:7)
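The parser tries to download the DTD named in the page’s DOCTYPE declaration, and the W3C server answers with a 503 error. We can tell the parser not to load external DTDs:
def slurper = new XmlSlurper()
// Do not fetch the DTD referenced in the DOCTYPE declaration
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}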
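The effect of that feature can be checked without touching the network. The sketch below parses an inline document whose DOCTYPE points at a DTD on a non-existent host (dtd.invalid is a placeholder, and so is the sample markup). It assumes a recent Groovy version. As far as I know, those reject DOCTYPE declarations unless the three-argument constructor is called with allowDocTypeDeclaration set to true, which is why the constructor call differs from the no-argument one used in the script.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import

// The DTD URL is a placeholder that cannot be resolved (hypothetical sample)
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://dtd.invalid/strict.dtd">
<html><body><a class="question-hyperlink" href="/q/1">First question</a></body></html>'''

// Arguments: validating = false, namespaceAware = true, allowDocTypeDeclaration = true
def slurper = new XmlSlurper(false, true, true)
// Same feature as in the script: skip loading the external DTD
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

def page = slurper.parseText(xhtml)
def links = page.'**'.findAll { it.@class == 'question-hyperlink' }
links.each { println it.text() }
```

Note that skipping the DTD also means entities it defines, such as &nbsp;, are unknown to the parser, so this only helps with documents that are otherwise well-formed.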
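That takes care of DTD resolution, but not of missing closing tags. For those we hand XmlSlurper a TagSoup parser, which copes with malformed HTML:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
// XmlSlurper accepts any SAX XMLReader, so TagSoup can replace the default parser
def slurper = new XmlSlurper(tagsoupParser)
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}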
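Since the live page changes all the time, the same approach is easier to try out on a fixed string. This is a small sketch, not the script from the article: the markup, link texts, and variable names are invented, and the import assumes Groovy 3 or later. @Grab fetches TagSoup from Maven Central on the first run, so that run needs network access.

```groovy
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import
import org.ccil.cowan.tagsoup.Parser

// Deliberately sloppy HTML: unclosed <p> and <br> (hypothetical sample)
def messyHtml = '''<html><body>
<p>Latest questions
<a class="question-hyperlink" href="/q/1">How do I parse HTML?</a><br>
<a class="question-hyperlink" href="/q/2">What is TagSoup?</a>
<a class="other" href="/about">About</a>
</body></html>'''

// TagSoup repairs the markup before XmlSlurper sees it
def page = new XmlSlurper(new Parser()).parseText(messyHtml)

def titles = page.'**'.findAll { it.@class == 'question-hyperlink' }.collect { it.text() }
titles.each { println it }
```

Only the two links carrying the question-hyperlink class end up in titles; the third link is filtered out by the class check.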
I’ve been using NekoHTML lately. The syntax is a bit different though and I haven’t benchmarked yet if it’s faster or slower than XmlSlurper.
FYI, the syntax is a little like this in Groovy:
String get(def uri) {
    builder.request(uri, GET, TEXT, {}).text
}
Document document(def uri) {
    DOMParser parser = new DOMParser()
    parser.parse(new InputSource(new StringReader(get(uri))))
    parser.document
}
(builder is an HTTP Builder instance)
Thanks for pointing this out. I know some other libraries with the same goal, e.g. TidyHTML, but I had never heard of NekoHTML.
It looks like the way to go if you want to use a DOMParser, though I really like XmlSlurper’s syntax in Groovy.
Performance is not really an issue in my projects, so I don’t much care which one is faster. Reliability matters more: how close is the result to the real intention of the page?
Great :) This helped me right now with some simple custom HTML testing.
Thank you!
Really useful for some web automation!
Very interesting article, makes what I was doing with Java way shorter. But I was wondering how would I select an element that is, say, all tags after a certain class, or all bold text on the page. What I’m trying to scrape isn’t a class unfortunately…
I’ve found a partial solution for selecting other elements. You can use it.name() == 'p' for all p tags, or replace 'p' with 'h1' for all h1 tags. If anyone knows how to select page elements more specifically, I’d still like to hear about it.
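For what it’s worth, here is a rough sketch of two such selections: all bold text on a page, and the paragraphs inside an element with a given class. The markup and the class name summary are invented for the example, and the import assumes Groovy 3 or later. The input is well-formed so that plain XmlSlurper is enough; with a TagSoup parser passed to the constructor, the same expressions should work on messy HTML.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import

// Well-formed sample page (hypothetical markup)
def xhtml = '''<html><body>
  <h1>Title</h1>
  <div class="summary"><p>Intro with <b>bold</b> text.</p><p>Second <b>strong point</b>.</p></div>
  <p>Footer <b>note</b></p>
</body></html>'''

def page = new XmlSlurper().parseText(xhtml)

// Select by element name: every <b> anywhere in the document
def boldTexts = page.'**'.findAll { it.name() == 'b' }.collect { it.text() }

// Select by class first, then take only the <p> children of that element
def summary = page.'**'.find { it.@class == 'summary' }
def summaryParagraphs = summary.p.collect { it.text() }

println boldTexts
println summaryParagraphs
```

The closure passed to findAll is ordinary Groovy, so name and attribute checks can be combined freely, e.g. it.name() == 'a' && it.@class == 'question-hyperlink'.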