<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MacLovin &#187; groovy</title>
	<atom:link href="http://www.maclovin.de/tag/groovy/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.maclovin.de</link>
	<description>An Apple a day keeps the Windows away</description>
	<lastBuildDate>Tue, 17 Aug 2010 21:45:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Robust HTML parsing the Groovy way</title>
		<link>http://www.maclovin.de/2010/02/robust-html-parsing-the-groovy-way/</link>
		<comments>http://www.maclovin.de/2010/02/robust-html-parsing-the-groovy-way/#comments</comments>
		<pubDate>Thu, 11 Feb 2010 19:50:06 +0000</pubDate>
		<dc:creator>Dennis</dc:creator>
				<category><![CDATA[Scripts]]></category>
		<category><![CDATA[groovy]]></category>
		<category><![CDATA[script]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://www.maclovin.de/?p=445</guid>
		<description><![CDATA[With Groovy, it&#8217;s very easy to parse XML data and extract arbitrary information. This works great as long as the input data is well-formed, but you can&#8217;t always guarantee that in real-world scenarios. Think of extracting data from HTML pages. They are very often a...]]></description>
			<content:encoded><![CDATA[<p>With Groovy, it&#8217;s very easy to parse XML data and extract arbitrary information. This works great as long as the input data is well-formed, but you can&#8217;t always guarantee that in real-world scenarios. Think of extracting data from HTML pages. They are very often a mess when it comes to XML well-formedness, and that&#8217;s where the <a title="TagSoup" href="http://home.ccil.org/~cowan/XML/tagsoup/" target="_blank">TagSoup library</a> comes to the rescue.</p>
<p><span id="more-445"></span></p>
<p>There are two major problems with HTML input:</p>
<ul>
<li>DTD resolution</li>
<li>Missing closing tags</li>
</ul>
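<p>The second problem can be reproduced without a network connection. The following is a minimal sketch, not part of the original script: the sample markup and the <em>rejected</em> variable are made up for illustration, and it assumes a Groovy version where XmlSlurper is available without an import (on Groovy 4, add <em>import groovy.xml.XmlSlurper</em>). The unclosed link tag is legal HTML, but a strict XML parser refuses it.</p>
<pre class="brush: groovy">import org.xml.sax.SAXParseException

// Hypothetical sample page: legal HTML, but the link tag is never closed.
def html = '&lt;html&gt;&lt;head&gt;&lt;link rel="stylesheet" href="style.css"&gt;&lt;/head&gt;&lt;body&gt;&lt;/body&gt;&lt;/html&gt;'

def rejected = false
try {
	new XmlSlurper().parseText(html)
} catch (SAXParseException e) {
	// The XML parser gives up at the first well-formedness error.
	rejected = true
	println "Parse failed: ${e.message}"
}</pre>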
<p>We are going to build a simple Groovy script that prints the list of questions on StackOverflow&#8217;s start page. The straightforward solution looks something like this:</p>
<pre class="brush: groovy">def slurper = new XmlSlurper()
def htmlParser = slurper.parse("http://stackoverflow.com/")

htmlParser.'**'.findAll{ it.@class == 'question-hyperlink'}.each {
	println	it
}</pre>
<p>We parse <a href="http://stackoverflow.com" target="_blank">http://stackoverflow.com</a> with XmlSlurper, loop over all tags whose class attribute is &#8216;question-hyperlink&#8217; and print each one. But when running the script, we get the following exception:</p>
<blockquote>
<div id="_mcePaste">Caught: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/strict.dtd at html_parser.run(html_parser.groovy:7)</div>
</blockquote>
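<div>XmlSlurper&#8217;s underlying XML parser tries to download the DTD named in the page&#8217;s DOCTYPE, and the W3C server answers with a 503. By using the information in <a title="Groovy XmlSlurper and HTTP 503 Response Code" href="http://stevefinck.blogspot.com/2009/12/groovy-xmlslurper.html" target="_blank">this post</a>, we can tell the parser not to load external DTDs and get rid of the exception.</div>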
<div>
<pre class="brush: groovy;highlight: 2">def slurper = new XmlSlurper()
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
def htmlParser = slurper.parse("http://stackoverflow.com/")

htmlParser.'**'.findAll{ it.@class == 'question-hyperlink'}.each {
	println	it
}</pre>
</div>
<div>On to the next try. The DTD exception is gone, but we get another one saying that the closing tag of a link element is missing. This is where TagSoup comes in. It&#8217;s a library that tries to transform invalid HTML data into well-formed XML. And best of all, it works great together with XmlSlurper. Here is the final script:</div>
<div>
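<pre class="brush: groovy">// Fetch the TagSoup jar via Grape when the script starts.
@Grab(group='org.ccil.cowan.tagsoup',
      module='tagsoup', version='1.2' )
// TagSoup's SAX parser turns malformed HTML into well-formed XML events.
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
// Let XmlSlurper read from TagSoup instead of the default XML parser.
def slurper = new XmlSlurper(tagsoupParser)
def htmlParser = slurper.parse("http://stackoverflow.com/")

// Walk the whole tree and print the text of every question link.
htmlParser.'**'.findAll{ it.@class == 'question-hyperlink'}.each {
	println it
}</pre>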
</div>
<div>The first statement uses the @Grab annotation to load the TagSoup library. Next, we create a TagSoup parser instance and pass it as a constructor parameter to XmlSlurper. That&#8217;s all, and we even got rid of the <em>setFeature</em> workaround.</div>
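<div>To try the TagSoup setup without depending on a live site, the same approach can be applied to an inline string. This is a minimal sketch under stated assumptions: the sample markup and the variable names are invented for illustration, and only the TagSoup coordinates and the class name are taken from the script above. It also assumes a Groovy version where XmlSlurper is available without an import (on Groovy 4, add <em>import groovy.xml.XmlSlurper</em>).</div>
<div>
<pre class="brush: groovy">@Grab(group='org.ccil.cowan.tagsoup',
      module='tagsoup', version='1.2' )
import org.ccil.cowan.tagsoup.Parser

// Hypothetical sample page: the link tag and both p tags are never closed.
def html = '''&lt;html&gt;&lt;head&gt;&lt;link rel="stylesheet" href="style.css"&gt;&lt;/head&gt;
&lt;body&gt;&lt;p&gt;First&lt;p&gt;&lt;a class="question-hyperlink" href="/q/1"&gt;How do I parse HTML?&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;'''

// parseText reads from a String instead of a URL.
def page = new XmlSlurper(new Parser()).parseText(html)

// Collect the link texts first, then print them.
def titles = page.'**'.findAll{ it.@class == 'question-hyperlink'}*.text()
titles.each { println it }</pre>
</div>
<div>Collecting the texts into a list before printing makes it easy to check the result or process it further.</div>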
<div>Do you know other tricks to make HTML parsing more robust? Then please leave them in the comments.</div>
]]></content:encoded>
			<wfw:commentRss>http://www.maclovin.de/2010/02/robust-html-parsing-the-groovy-way/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
