Wednesday, October 28

CMU Read the Web Project on M45

Jon points out a post on the Y! Developer Network Blog detailing their use of the M45 cluster for Information Extraction.

The post is by Andy Carlson and Justin Betteridge, PhD students working on the Read the Web project. The goal of the project is to build a knowledge base from web documents.

They ran MapReduce jobs over a large web crawl to find:
  1. Given a list of patterns, what noun phrases fill in the blanks of those patterns?
  2. Given a list of noun phrases, what patterns do those noun phrases occur with?
  3. Given a list of patterns and noun phrases, how many times does each pattern co-occur with each noun phrase (or pair of noun phrases)?
They are currently scaling their techniques up to ClueWeb09 and adding features from dependency parses produced by MaltParser.
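
To make the three queries above concrete, here is a minimal sketch in plain Python that mimics the map and reduce steps on a handful of sentences. It is not the authors' code and does not use Hadoop. The function names, the "_" blank convention, and the toy sentences are all assumptions for illustration, and the regular expression is only a crude stand-in for a real noun-phrase chunker.

```python
import re
from collections import defaultdict

# Assumption: a "noun phrase" is approximated as a run of capitalized words.
NP = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def compile_pattern(pattern):
    """Turn a textual pattern with one "_" blank into a regex (hypothetical helper)."""
    return re.compile(NP.join(re.escape(part) for part in pattern.split("_")))


def map_phase(sentences, patterns):
    """Map step: emit ((pattern, noun_phrase), 1) for every filled blank."""
    compiled = {p: compile_pattern(p) for p in patterns}
    for sentence in sentences:
        for pattern, regex in compiled.items():
            for match in regex.finditer(sentence):
                yield (pattern, match.group(1)), 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts per (pattern, noun_phrase) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def noun_phrases_for(counts, pattern):
    """Query 1: which noun phrases fill the blank of this pattern?"""
    return sorted(np for (p, np) in counts if p == pattern)


def patterns_for(counts, noun_phrase):
    """Query 2: which patterns does this noun phrase occur with?"""
    return sorted(p for (p, np) in counts if np == noun_phrase)


# Toy data, invented for this example.
sentences = [
    "We visited cities such as Pittsburgh last year.",
    "Large cities such as New York are crowded.",
    "He moved between cities such as Pittsburgh.",
    "Sue is mayor of Boston.",
]
patterns = ["cities such as _", "mayor of _"]

# Query 3: co-occurrence counts for every (pattern, noun phrase) pair.
counts = reduce_phase(map_phase(sentences, patterns))
```

On a real cluster the map step would run over shards of the crawl and the framework would group keys before the reduce step; here a single dictionary plays both roles.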

See their upcoming paper at WSDM 2010, Coupled Semi-Supervised Learning for Information Extraction.

You can also see the Read the Web course wiki page.

My group here at the CIIR uses M45 for large-scale extraction and organization work on the Million Book Project data. More on that work as it develops.