The post is by Andy Carlson and Justin Betteridge, PhD students working on the Read the Web project. The project's goal is to generate a knowledge base from web documents.
They ran MapReduce jobs over a large web crawl to answer three questions:
- Given a list of patterns, what noun phrases fill in the blanks of those patterns?
- Given a list of noun phrases, what patterns do those noun phrases occur with?
- Given a list of patterns and noun phrases, how many times does each pattern co-occur with each noun phrase (or pair of noun phrases)?
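To make the third question concrete, here is a minimal sketch of a co-occurrence count written as a map step and a reduce step in plain Python. It is not the authors' code: the `_` blank marker, the function names, and the naive case-insensitive substring matching are all assumptions made for illustration, and a real job would run over a distributed crawl rather than an in-memory list.

```python
from collections import defaultdict


def map_cooccurrences(sentence, patterns, noun_phrases):
    """Map step: emit ((pattern, noun_phrase), 1) once per match in a sentence.

    Assumption: a pattern is a string with "_" marking the blank, and a match
    is a plain case-insensitive substring hit (no tokenization or parsing).
    """
    text = sentence.lower()
    for pattern in patterns:
        for noun_phrase in noun_phrases:
            filled = pattern.replace("_", noun_phrase).lower()
            for _ in range(text.count(filled)):
                yield (pattern, noun_phrase), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (pattern, noun_phrase) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_cooccurrences(sentences, patterns, noun_phrases):
    """Run the map step over every sentence, then reduce the emitted pairs."""
    emitted = (
        pair
        for sentence in sentences
        for pair in map_cooccurrences(sentence, patterns, noun_phrases)
    )
    return reduce_counts(emitted)


if __name__ == "__main__":
    # Hypothetical toy corpus, standing in for the web crawl.
    sentences = [
        "Cities such as Pittsburgh are known for bridges.",
        "He visited cities such as Boston and cities such as Pittsburgh.",
        "Pittsburgh is a city.",
    ]
    counts = count_cooccurrences(
        sentences, ["cities such as _"], ["Pittsburgh", "Boston"]
    )
    for (pattern, noun_phrase), n in sorted(counts.items()):
        print(pattern, noun_phrase, n)
```

The same map output also covers the first two questions: grouping the emitted keys by pattern lists the noun phrases that fill its blank, and grouping by noun phrase lists the patterns it occurs with.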
See their upcoming WSDM 2010 paper, Coupled Semi-Supervised Learning for Information Extraction.
You can also see the Read the Web course wiki page.
My group here at the CIIR uses M45 for large-scale extraction and organization work on the Million Book Project data. More on that work as it develops.