Saturday, May 6

Rexa, Automated Information Extraction Meets CiteSeer

Andrew McCallum's team at UMass Amherst recently announced the release of Rexa, a new CiteSeer killer. So, what's so special? Well, if you are familiar with Andrew's work, you know it must use automated Information Extraction (IE) technology. The Rexa team has a blog, where Andrew describes how Rexa aims to be the next-generation CiteSeer:
Rexa's chief enhancement is that we use advanced statistical machine learning algorithms to perform information extraction, and then make first-class, deduplicated, cross-referenced objects not only of research papers, but also people and grants--and in the future, more objects types.
In the process of learning about Rexa I ran across a new and quite interesting blog: Machine Learning Theory. I found Rexa and the blog through Data Mining, a blog by Matthew Hurst, a British expat and another former WhizBang employee who is now director of research at Intelliseek.

Machine Learning Theory is an awesome blog. John, its principal author (others contribute as well), has a write-up on Rexa that is also quite good. The best thing was the discussion about Rexa, which brought up several areas for "future work."

Rexa is a fully automated solution. In my opinion, one of the problems with fully automated ML solutions is that it is well-nigh impossible to get near 100% precision. Reading the user comments, I saw that one user, Andrej, had noticed a misattributed paper. Charles, one of Rexa's creators, responds: "The automatic extraction and author merging performed by Rexa has accuracy in the 90s, but inevitably there are errors." Accuracy above ninety percent is quite good, quite remarkable in fact. However, for important decisions, is this error rate acceptable?

More important than the percentage of errors is the type of errors made by the system. Last year, I saw a presentation given by David Hull entitled "Commercializing Information Extraction: Lessons from WhizBang Labs." One of the major takeaways was this: the problem with fully automated ML-based extraction is not the percentage of errors, but the "stupid" errors it makes that a human wouldn't. Even a modest number of "stupid" errors in author attribution in Rexa could leave users like Andrej unsure about the quality of the attributions in the system. In his presentation, Hull outlines some of the issues that WhizBang had with its FlipDog jobs database. My boss, Steve, once had his job (and other major exec positions at Globalspec) listed as postings on FlipDog because it mistook the executive information page for job postings. Both of these cases (FlipDog and Rexa) illustrate the need for an easy way for people to intervene and make corrections when the ML system is uncertain. In Rexa's case, coreference analysis, also known as record linkage or identity uncertainty, is a very difficult research problem, and one on which the computer will certainly not achieve 100% precision. So, what do you do? Do we abandon fully automated solutions or merely accept their error rate and "stupid" mistakes?

For difficult ML problems like this, is there a balance between a fully automated ML solution and a fully manual solution that achieves high precision with minimal user interaction? Could the system, for example, let the machine figure out the clear linkages but defer to a human editor for decisions that are more dubious? Or at least make it clear to the user that the linkage is weak or uncertain? Then, in these less certain cases, can the machine learning algorithm simplify the problem enough to make manual review tractable? For example, the algorithm might surface that, out of many choices, there are two highly probable author matches for a paper by "A. Bauer": "Andreas Bauer" and "Andrej Bauer". It could then surface some rationale for the judgment and let the user confirm. This semi-automated approach has clear speed advantages over a completely manual process and achieves a higher level of perceived "intelligence" than a completely automated process. However, it does involve the user, and therefore time and money. Does the increased precision, and more importantly the lack of "stupid mistakes," justify the added cost?
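
To make the idea concrete, here is a minimal sketch of confidence-based triage in Python. It is not how Rexa works; the function name `triage`, the thresholds, and the scores are all made up for illustration. It assumes the linkage model can produce a score between 0 and 1 for each candidate author, links automatically only when the best candidate is both strong and well ahead of the runner-up, and otherwise hands a short list to a human.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Decision:
    action: str                          # "auto_link" or "needs_review"
    candidates: List[Tuple[str, float]]  # best-first (name, score) pairs


def triage(scores: Dict[str, float],
           accept_at: float = 0.9,
           min_margin: float = 0.3,
           show_top: int = 2) -> Decision:
    """Auto-link only when the best candidate is strong and well ahead.

    The thresholds are illustrative assumptions, not tuned values.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return Decision("needs_review", [])
    best_score = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score >= accept_at and best_score - runner_up >= min_margin:
        return Decision("auto_link", ranked[:1])
    # Too close to call: show the reviewer only the top few candidates.
    return Decision("needs_review", ranked[:show_top])


# Made-up scores for a paper signed "A. Bauer".
ambiguous = triage({"Andreas Bauer": 0.48, "Andrej Bauer": 0.45, "Anna Bauer": 0.07})
clear = triage({"Andrej Bauer": 0.97, "Andreas Bauer": 0.03})
print(ambiguous.action, [name for name, _ in ambiguous.candidates])
print(clear.action, [name for name, _ in clear.candidates])
```

The interesting knob is `min_margin`: raising it sends more cases to a human, which costs time and money but should cut down on confident "stupid" mistakes.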

Let me know what you think of this trade-off. Are fully automated solutions viable, or are semi-automated solutions still necessary because of user trust issues?
