Friday, May 4

Open Source Scraping (Wrapper Generation) Tools

Web information extraction, sometimes also called 'screen scraping' or 'web scraping', is the process of converting unstructured or semi-structured web content intended for human consumption into structured data suitable for computers.

Using a few simple tools it is easy to create wrappers that reliably extract structured content from semi-structured HTML web pages.

First, there is the open source project Web-Harvest.
Web-Harvest is an open source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of Web content. It can also be easily supplemented with custom Java libraries to augment its extraction capabilities.
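
To give a feel for the simplest of those techniques, here is a minimal sketch of regular-expression extraction in plain Java. It is not Web-Harvest's API or configuration format; the class name RegexHeadlineExtractor and the 'headline1' class attribute are assumptions made up for illustration, and the pattern only handles spans without nested tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Web-Harvest.
class RegexHeadlineExtractor {
    // Matches <span class="headline1">text</span> and captures the inner text.
    // Assumes the span has exactly one attribute and contains no nested tags.
    private static final Pattern HEADLINE = Pattern.compile(
        "<span\\s+class=[\"']headline1[\"']\\s*>([^<]*)</span>",
        Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of every matching span, in document order.
    static List<String> extract(String html) {
        List<String> headlines = new ArrayList<>();
        Matcher matcher = HEADLINE.matcher(html);
        while (matcher.find()) {
            headlines.add(matcher.group(1).trim());
        }
        return headlines;
    }

    public static void main(String[] args) {
        String html = "<div><span class=\"headline1\"> First story </span>"
            + "<span class='other'>skip</span>"
            + "<SPAN class='headline1'>Second story</SPAN></div>";
        System.out.println(extract(html)); // prints [First story, Second story]
    }
}
```

Regular expressions like this break as soon as the markup gains an extra attribute or a nested tag, which is one reason the XPath/XQuery approaches are usually more robust.
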
Another similar project is JScrape (Alpha). JScrape is very similar to Web-Harvest in technique and technology. From the website:
JScrape uses the HttpClient API to get an input stream to a web page, then uses the TagSoup API to turn the HTML into an acceptable DOM object, and from there Saxon is used to apply the XQuery.
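
The three stages in that description (get a stream, build a DOM, run a query) can be sketched with nothing but the JDK. This is not JScrape's code: the class StreamToDomToQuery is hypothetical, the JDK's DOM parser stands in for TagSoup (so it only accepts well-formed markup), and the JDK's XPath 1.0 engine stands in for Saxon's XQuery.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical class for illustration; JScrape's real API is not shown in this post.
class StreamToDomToQuery {
    // Stage 1 (HttpClient in JScrape) is represented by any InputStream.
    // Stage 2 (TagSoup in JScrape): parse the stream into a DOM.
    // This stand-in rejects malformed HTML, which TagSoup would repair.
    static Document toDom(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }

    // Stage 3 (Saxon XQuery in JScrape): here an XPath 1.0 query from the JDK.
    static List<String> query(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><ul>"
            + "<li class='item'>Alpha</li><li class='item'>Beta</li><li>Other</li>"
            + "</ul></body></html>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        System.out.println(query(toDom(in), "//li[@class='item']")); // prints [Alpha, Beta]
    }
}
```
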
Simple Alternative
Alternatively, it is easy to roll your own XML-based extractor. You will need an HTML-to-XHTML converter such as NekoHTML or TagSoup (see my post last year on these parsers) and an XML library with XPath support such as XOM or JDOM.

Here is a quick example using XOM and TagSoup. You will need: a Java JDK, XOM 1.1, and TagSoup 1.1. A simple data extractor:

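import java.io.ByteArrayInputStream;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.XPathContext;
import org.xml.sax.XMLReader;

// Set up your HTML-to-XML parser (TagSoup emits elements in the XHTML namespace).
XMLReader htmlToXmlParser = new org.ccil.cowan.tagsoup.Parser();
htmlToXmlParser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
// Bind the "html" prefix to the XHTML namespace for use in XPath expressions.
XPathContext xpathContext = new XPathContext("html", "http://www.w3.org/1999/xhtml");

// Build / parse your HTML document (here, "bytes" holds the page content as a byte[])
Document doc = new Builder(htmlToXmlParser).build(new ByteArrayInputStream(bytes));

// Query your document using XPath expressions
Nodes nodes = doc.query("//html:span[@class='headline1']", xpathContext);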
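
The "html" prefix in that query matters because the parsed elements live in the XHTML namespace. Here is a rough, JDK-only sketch of the same idea that runs without XOM or TagSoup on the classpath. It assumes the input is already well-formed XHTML, and the class NamespacedHeadlines and its texts method are hypothetical names. It shows the prefixed query matching and an unprefixed one finding nothing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class for illustration; uses the JDK's XPath instead of XOM.
class NamespacedHeadlines {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Plays the role of XOM's XPathContext: binds the "html" prefix to XHTML.
    static final NamespaceContext HTML_PREFIX = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "html" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                ? Collections.singletonList("html").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    // Returns the trimmed text of every node selected by the expression.
    static List<String> texts(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for prefix-based matching
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(HTML_PREFIX);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<span class=\"headline1\">Top story</span>"
            + "<span class=\"byline\">Staff</span></body></html>";
        System.out.println(texts(page, "//html:span[@class='headline1']")); // prints [Top story]
        System.out.println(texts(page, "//span")); // prints [] because span is namespaced
    }
}
```
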


You may need some help getting started with XQuery / XPath expressions. A great way to start is to download the freely available first chapter of XQuery: A Guided Tour.
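
In the meantime, here are a few starter XPath 1.0 expressions you can experiment with, wrapped in a small JDK-only harness. The sample page, the class XPathStarter and the eval helper are all made up for illustration. Each expression is evaluated to its string value, which for a node set is the text of the first matching node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical harness for trying out XPath expressions against a sample page.
class XPathStarter {
    // Made-up, well-formed sample page with no namespaces.
    static final String PAGE =
        "<html><body><h1>  Daily   News </h1><ul>"
        + "<li class='story new'><a href='/a'>Alpha</a></li>"
        + "<li class='story'><a href='/b'>Beta</a></li>"
        + "</ul></body></html>";

    // Evaluates an expression and returns its XPath string value.
    static String eval(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "normalize-space(//h1)",                 // collapse whitespace in the heading
            "//li[1]/a",                             // link text in the first list item
            "//li[last()]/a",                        // link text in the last list item
            "//li[contains(@class,'new')]/a/@href"   // href of the item whose class contains 'new'
        };
        for (String expression : expressions) {
            System.out.println(expression + " => " + eval(PAGE, expression));
        }
    }
}
```
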

That's all there is to it, at least for simple wrappers. The hard part is scale and maintenance of large numbers of wrappers over time... and there are some commercial engines that help to manage this. More on these in a future post.

13 comments:

  1. It's arguable whether web information extraction (IE) can be referred to as screen scraping. Screen scraping is only one frequently used technique, whereas IE is a broader concept that includes tasks like entity recognition, template elements and coreference.

    Anyway, I'd like to recommend some other screen scraping tools:

    - Piggy Bank is a Firefox extension that lets you extract data from web pages, store that information and integrate it with other tools, like Google Maps.

    - BeautifulSoup is a screen scraping library for Python. I've used it for a long time and I find it really useful. The API is really simple and effective.

  2. Anonymous, 1:06 AM EDT

    Hi, I need a couple of automatic data extraction systems (wrappers) to use for my research. I couldn't find any. Can you help?

  3. Anonymous, 11:10 AM EDT

    Have you tried irobotsoft? It is a free tool for web scraping with easy visual interfaces. It is also an integrated computation system, so you don't need to install other language tools.

  4. Anonymous, 9:20 PM EDT

    Does it scrape dynamic websites?

    Is there any good open source API for parsing dynamic sites?

    I have tried Web-Harvest. It fails sometimes.
    How about HtmlUnit? Is it better than Web-Harvest?

  5. Anonymous, 7:36 PM EST

    Do you know of a tool or library for doing something like Amazon Universal Wish List?

    Thanks

  6. Do you know of any new tools in this field?
