There's something for (almost) everyone's parsing needs. Does it need to be fast, work on anything, and give you access to the DOM? You can't have everything! If it's the DOM you want, you are going to take a performance hit: it's going to be less reliable and slower because HTML docs are messy. For my current situation I need a DOM tree, so I am going to focus primarily on those solutions. For those of you who want a fast, non-tree-based tag parser, you might take a look at the Jericho HTML parser, but I haven't used it. Now, on to the DOM parsers!
Here are the top three free Java HTML parsers (tree based), to my knowledge:
- NekoHTML (aka CyberNeko HTML parser)
- TagSoup (sometimes mistakenly called Tag Soup)
- HTML Parser (the most creative name of all!)
NekoHTML is faster than TagSoup, but TagSoup parses almost anything and is generally more reliable.
I think it's fair to say that Neko and TagSoup are the two most popular. I'm not sure who actually uses HTML Parser; I haven't run into it in production yet. According to the above benchmarks it didn't distinguish itself in either speed or memory, consistently taking longer to parse and using more memory than its competitors.
Another project worth mentioning is JTidy. JTidy is an HTML cleaner and formatter, and a gosh darn good one at that. In the process of cleaning it also parses the HTML, so it can be used as a parser too. It is in the Nutch Sandbox, but it isn't "supported." I have heard of projects using it (mostly for testing) and it is reportedly a very reliable tool that produces DOM / XHTML from otherwise rubbish code. However, I haven't used it, so you'll have to do your own research on that one.
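If you're wondering why any of these tolerant parsers are needed in the first place, here is a minimal sketch of what happens when you hand typical "rubbish" markup to a strict parser. It uses only the JDK's built-in XML `DocumentBuilder`, not NekoHTML, TagSoup, or JTidy, and the class name `StrictDomDemo` and helper `tryParse` are made up for illustration. The point is simply that a strict parser gives up on the first unclosed tag, which is exactly the gap the libraries above are meant to fill.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows a strict XML DOM parser rejecting messy HTML.
class StrictDomDemo {

    // Returns a DOM if the strict parser accepts the markup, or null if it rejects it.
    static Document tryParse(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler ignores warnings and recoverable errors,
            // but still rethrows fatal (well-formedness) errors.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            return null; // markup is not well-formed XML
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        String clean = "<html><body><p>Hello</p></body></html>";
        // Unclosed <br> and <b>: common in real pages, fatal for a strict parser.
        String messy = "<html><body><p>Hello<br><b>world</p></body></html>";

        System.out.println("clean parsed: " + (tryParse(clean) != null));
        System.out.println("messy parsed: " + (tryParse(messy) != null));
    }
}
```

A tolerant parser would instead repair the second document (closing the stray tags) and still hand you a tree, which is where the extra time and the occasional surprise come from.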
Happy HTML chopping! Keep those knives sharp and practice safe error handling!
I went from TagSoup to NekoHTML a few weeks ago after using TagSoup for about 2 years in a project. The NekoHTML 1.9.12 release seems to be a lot more correct than TagSoup. TagSoup now seems to be a dead project, as there haven't been any changes since 2007.