Monday, April 10

Free as in Beer Java HTML Parsers

No, I'm not talking parsley, one of my favorite grassy food garnishes. I'm talking parsing. Looking to do some work in HTML? Let's talk options. First, let's face it-- Sun's sux. You'd think Sun would include a decent HTML parser in the core libraries, but they don't. The Swing HTML Parser is a joke. So, we'll be going open source for some industrial strength workhorses.
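
For the curious, here is roughly what the core library hands you. This is a minimal sketch, assuming a stock JDK with the Swing classes available; the class name SwingCallbackDemo and the startTags helper are made up for illustration. Notice that you get a stream of callbacks rather than a DOM tree, and the parser may also report tags it decides are implied.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class, not part of any library.
class SwingCallbackDemo {

    // Collects the names of the start tags the Swing parser reports, in order.
    static List<String> startTags(String html) {
        final List<String> tags = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString());
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(startTags("<html><body><p>Hello</p></body></html>"));
    }
}
```

If you want a tree out of this, you have to build it yourself from the callbacks, which is a big part of why I'm looking elsewhere.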

There's something for (almost) everyone's parsing needs. Does it need to be fast, work on anything, and have access to the DOM? You can't have everything! If it's a DOM you want, you are going to take a hit -- it's going to be slower and less reliable, because HTML docs are messy. For my current situation I need a DOM tree, so I am going to focus primarily on those solutions. For those of you who want a fast, non-tree-based tag parser, you might take a look at the Jericho HTML Parser, though I haven't used it. Now, on to the DOM parsers!
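
To see why messy HTML is the problem, try feeding it to the strict XML parser that ships with the JDK. This is a minimal sketch, not anything from the parsers discussed here; the class name StrictDomDemo and the tryParse helper are made up for illustration. Well-formed markup comes back as a DOM, while typical tag soup (an unclosed br, a missing closing html tag) gets rejected outright.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class StrictDomDemo {

    // Returns a DOM if the markup is well-formed XML, or null if the
    // strict parser rejects it.
    static Document tryParse(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            return null; // not well-formed: no tree for you
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String tidy = "<html><body><p>Hello</p></body></html>";
        String messy = "<html><body><p>Hello<br></body>";
        System.out.println("tidy parsed:  " + (tryParse(tidy) != null));
        System.out.println("messy parsed: " + (tryParse(messy) != null));
    }
}
```

A tolerant HTML parser has to guess at a repair for the second case instead of giving up, and that guessing is where the extra time and the occasional surprise come from.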

Here are the top three free, tree-based Java HTML parsers, to my knowledge: NekoHTML, TagSoup, and HTML Parser.

Let's start with the basics. Nutch (The "NPR of Search Engines") uses NekoHTML and TagSoup as its primary supported parsers. Here are some recent benchmarks (March 06) I ran across on the Nutch user mailing list (they were running into problems with Neko hanging). The benchmarks used recent, but not the latest, versions of these two parsers. Here's the summary:

NekoHTML is faster than TagSoup, but TagSoup parses almost anything and is generally more reliable.

I think it's fair to say that Neko and TagSoup are the two most popular. I'm not sure who actually uses HTML Parser... I haven't run into it in production yet. According to the above benchmarks it didn't distinguish itself in either speed or memory, consistently taking longer to parse and using more memory than its competitors.

Another project worth mentioning is JTidy. JTidy is an HTML cleaner and formatter -- and a gosh darn good one at that. In the process of cleaning it also happens to parse the HTML, so it can be used as a parser too. It is in the Nutch Sandbox, but it isn't "supported." I have heard of projects using it (mostly for testing) and it is reportedly a very reliable tool that produces DOM / XHTML from otherwise rubbish code. However, I haven't used it -- so you'll have to do your own research on that one.

Happy HTML chopping-- keep those knives sharp and practice safe error handling!