Simple HTML parsing and text extraction is, well, not so easy to do well. Over two years ago, I wrote a post: Open Source HTML parsers. Since then, I've mainly stuck by TagSoup as the best open-source choice for me, but today there are a few other alternatives that I'm considering for a new project.
The best parsers are those found in the top web browsers. However, it's usually quite challenging (and slow) to use them in external programs.
Java Mozilla Html Parser - A Java wrapper around the Firefox HTML parser that provides a Java API to parse documents. The website is out of date, though there was a v0.3 release in October.
Of course, you still have the option to write your own for maximum flexibility and speed. I'm still waiting for a real production-quality parser. We'll need something better than what's available today to deal with those messy billion-document test collections that are coming soon.
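To give a feel for what "write your own" involves, here is a minimal sketch of a hand-rolled text extractor in Python. It is an illustration only, not a production parser: the function name `extract_text` and the element sets `SKIP_ELEMENTS` and `BREAK_ELEMENTS` are assumptions made for this example, and the scanner ignores many real-world cases (for instance, a `>` inside a quoted attribute value will end the tag early).

```python
import html
import re

# Elements whose contents are not visible text (an assumption of this sketch).
SKIP_ELEMENTS = {"script", "style"}
# Elements treated as word boundaries (a small, illustrative subset).
BREAK_ELEMENTS = {"p", "br", "div", "li", "tr", "td", "th", "title",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

TAG_NAME = re.compile(r"(/?)\s*([a-zA-Z][a-zA-Z0-9]*)")


def extract_text(markup: str) -> str:
    """Return the visible text of `markup`, tolerating unclosed or stray tags."""
    pieces = []
    skip_until = None  # name of the script/style element we are inside, if any
    pos, n = 0, len(markup)
    while pos < n:
        lt = markup.find("<", pos)
        if lt == -1:
            lt = n
        # Keep the text run before the next "<" unless we are inside script/style.
        if lt > pos and skip_until is None:
            pieces.append(markup[pos:lt])
        if lt == n:
            break
        # A "<" that does not start a tag, comment, or declaration is literal text.
        nxt = markup[lt + 1:lt + 2]
        if not (nxt.isalpha() or nxt in ("/", "!", "?")):
            if skip_until is None:
                pieces.append("<")
            pos = lt + 1
            continue
        # Skip comments as a unit; an unterminated comment swallows the rest.
        if markup.startswith("<!--", lt):
            end = markup.find("-->", lt + 4)
            pos = n if end == -1 else end + 3
            continue
        gt = markup.find(">", lt + 1)
        if gt == -1:
            break  # unterminated tag at end of input: drop it
        match = TAG_NAME.match(markup[lt + 1:gt])
        if match:
            closing = match.group(1) == "/"
            name = match.group(2).lower()
            if skip_until is None:
                if not closing and name in SKIP_ELEMENTS:
                    skip_until = name
                elif name in BREAK_ELEMENTS:
                    pieces.append(" ")
            elif closing and name == skip_until:
                skip_until = None
        pos = gt + 1
    # Decode character references, then collapse runs of whitespace.
    return " ".join(html.unescape("".join(pieces)).split())


if __name__ == "__main__":
    messy = "<p>Fish &amp; chips<p>a < b <!-- hidden --><script>var x = '<b>';</script>done"
    print(extract_text(messy))
```

Even this toy version has to make decisions about stray `<` characters, unclosed tags, comments, and script content, which is a fair hint at why robust parsing of messy pages is hard.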