Sunday, November 16

Open Source HTML parsers for Java

Simple HTML parsing and text extraction is, well, not so easy to do well. Over two years ago, I wrote a post: Open Source HTML parsers. Since then, I've mainly stuck by TagSoup as the best open-source choice for me, but today there are a few other alternatives that I'm considering for a new project.

HtmlCleaner - A small, lightweight parser that fixes up and re-orders HTML to produce well-formed XML. It won top marks in Ben McCann's comparison of HTML parsers. However, when I tried it on a few Wikipedia pages, the text it returned was not acceptable: it contained snippets of JavaScript and commented-out CDATA content.

The best parsers are those found in the top web browsers. However, it's usually quite challenging (and slow) to use them in external programs.

Java Mozilla Html Parser - A Java wrapper around the Firefox HTML parser that provides a Java API to parse documents. The website is out of date; there was a v0.3 release in October.

Of course, you still have the option to write your own for maximum flexibility and speed. I'm still waiting for a real production-quality parser. We'll need something better than what's available today to deal with those messy billion-document test collections that are coming soon.
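To make the "write your own" option concrete, here is a rough sketch of a hand-rolled text extractor that drops the kind of content that leaked into the output above: script and style bodies, comments, and CDATA sections. It is not taken from any of the parsers mentioned here. The class name TextExtractor and the method extractText are made up for illustration, and it leans on regular expressions from the JDK rather than real parsing, so treat it as a starting point and not as a robust solution for messy pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; regex-based, not a real HTML parser.
public class TextExtractor {
    // Whole <script>...</script> and <style>...</style> elements, including their bodies.
    private static final Pattern SCRIPT_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // HTML comments; DOTALL lets them span several lines.
    private static final Pattern COMMENTS = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // CDATA sections.
    private static final Pattern CDATA = Pattern.compile("<!\\[CDATA\\[.*?\\]\\]>", Pattern.DOTALL);
    // Any remaining tag.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String extractText(String html) {
        // Remove script/style first, since their bodies often contain comment or CDATA markers.
        String s = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        s = COMMENTS.matcher(s).replaceAll(" ");
        s = CDATA.matcher(s).replaceAll(" ");
        s = TAGS.matcher(s).replaceAll(" ");
        // Collapse runs of whitespace left behind by the removals.
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var x = 1;</script></head>"
                + "<body><!-- nav --><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html)); // prints "Hello world"
    }
}
```

This sketch does not decode entities such as &amp;amp;, does not cope with unclosed script tags or a ">" inside an attribute value, and will be confused by plenty of real-world markup. That is exactly the sort of mess a production-quality parser has to handle.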
