Using a few simple tools, it is easy to create wrappers that reliably extract structured content from semi-structured HTML web pages.
First, there is the open source project Web-Harvest.
Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.
Another similar project is JScrape (Alpha). JScrape is very similar to Web-Harvest in technique and technology. From the website:
JScrape using the HttpClient API to get an input stream to a web page, then using the TagSoup API to turn the HTML into an acceptable DOM object and then from their saxon is used to apply the XQuery.
Simple Alternative
Alternatively, it is easy to roll your own XML-based extractor. You will need an HTML-to-XHTML converter such as NekoHTML or TagSoup (see my post last year on these parsers) and an XML library with XPath support such as XOM or JDOM.
Here is a quick example using XOM and TagSoup. You will need a Java JDK, XOM 1.1, and TagSoup 1.1. A simple data extractor:
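import java.io.ByteArrayInputStream;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.XPathContext;
import org.xml.sax.XMLReader;

// Set up your HTML to XML parser. TagSoup reports elements in the XHTML namespace.
XMLReader htmlToXmlParser = new org.ccil.cowan.tagsoup.Parser();
// Bind the "html" prefix to that namespace so XPath expressions can use it.
XPathContext xpathContext = new XPathContext("html", "http://www.w3.org/1999/xhtml");
// Build / parse your HTML document (here bytes is a byte[] obtained from a String, e.g. via getBytes())
Document doc = new Builder(htmlToXmlParser).build(new ByteArrayInputStream(bytes));
// Query your document using XPath expressions
Nodes nodes = doc.query("//html:span[@class='headline1']", xpathContext);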
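If you want to try the same pattern before downloading XOM and TagSoup, here is a minimal sketch that swaps in the JDK's built-in javax.xml APIs instead. It assumes the input is already well-formed XHTML, so it skips the TagSoup clean-up step entirely. The class name HeadlineExtractor, the extract method, and the sample page are hypothetical names made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HeadlineExtractor {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Binds the "html" prefix to the XHTML namespace, the job XPathContext does in XOM.
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "html" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
        }
    };

    /** Returns the trimmed text of every node matched by the XPath expression. */
    static List<String> extract(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the html: prefix to match
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(XHTML_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample page, already valid XHTML.
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<span class=\"headline1\">First story</span>"
            + "<span class=\"other\">Not a headline</span>"
            + "<span class=\"headline1\">Second story</span>"
            + "</body></html>";
        System.out.println(extract(page, "//html:span[@class='headline1']"));
    }
}
```

Real-world pages are rarely well-formed, which is why the XOM version above puts TagSoup in front of the parser; this sketch will simply throw on tag soup.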
You may need some help getting started with XQuery / XPath expressions. A great way to start is to download the freely available first chapter, "XQuery: A Guided Tour".
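To give a feel for the kinds of XPath expressions a wrapper typically uses, here is a small sketch that evaluates a few of them with the JDK's javax.xml.xpath package. The XPathTour class, its eval helper, and the sample page are hypothetical, and the page deliberately has no namespace so the expressions stay short. Against TagSoup output, each element step would need a prefix bound to the XHTML namespace (for example //html:li[2]/html:a).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathTour {
    // Hypothetical sample page, without namespaces to keep the expressions short.
    static final String PAGE =
        "<html><body>"
        + "<span class='headline1'>Top story</span>"
        + "<ul><li><a href='/a'>Alpha</a></li><li><a href='/b'>Beta</a></li></ul>"
        + "</body></html>";

    /** Evaluates an XPath 1.0 expression against PAGE and returns its string value. */
    static String eval(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Select an element by attribute value; the string value is its text.
        System.out.println(eval("//span[@class='headline1']"));
        // Select by position; XPath positions start at 1, not 0.
        System.out.println(eval("//li[2]/a"));
        // Select an attribute rather than an element.
        System.out.println(eval("//li[2]/a/@href"));
        // Call a function; the numeric result is converted to a string.
        System.out.println(eval("count(//li)"));
    }
}
```

When an expression matches several nodes, the string form of evaluate returns the value of the first one in document order, so ask for a node set instead when you need every match.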
That's all there is to it, at least for simple wrappers. The hard part is scale and maintenance of large numbers of wrappers over time... and there are some commercial engines that help to manage this. More on these in a future post.