Friday, June 1

IBM Avatar: Combining Structured and Unstructured Data

I came across the blog of Anant Jhingran, a Distinguished Engineer and CTO of IBM's Information Management Division, via Alon Halevy. In the process I learned what IBM is doing to combine structured and unstructured information. He highlighted two projects: Avatar, out of IBM Almaden, and TAKMI, out of IBM Tokyo.


From the Avatar project website:
The goal of the Avatar project is two fold: (i) to enable the discovery and extraction of structured information buried in volumes of unstructured text (such as emails, web pages, and blogs), and (ii) to exploit this information to drive the next generation of search and business intelligence applications.
Anant describes the project in his post on Semantics:
What Avatar does is that it looks at a corpus of documents. And based on the analysis of documents, and knowing that there are only 6 different ways (oki, i am making it up, but you get the idea) in which people give their phone numbers in email ("Anant Jhingran, 408-xxx-xxxx:, "you can reach me at 408-xxx-xxx", "call me at 408-xxx-xxxx", ...) one can build the regular expression patterns, and voila, without any deep natural language processing, or building an OWL ontology, one can reliably derive people's phone numbers. I am grossly simplifying it, but you get the idea...
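To make the idea concrete, here is a minimal sketch in Python of that kind of pattern-based extraction. It is not Avatar's code: the three patterns, the `PHONE` sub-pattern, and the `extract_phone_numbers` function are my own illustrative assumptions, loosely modeled on the phrasings Anant lists.

```python
import re

# Assumed phone format for illustration: 3 digits, 3 digits, 4 digits.
PHONE = r"(\d{3}-\d{3}-\d{4})"

# Hypothetical patterns, one per phrasing people tend to use in email.
PATTERNS = [
    re.compile(r"reach me at\s+" + PHONE, re.IGNORECASE),
    re.compile(r"call me at\s+" + PHONE, re.IGNORECASE),
    # "Firstname Lastname, 408-555-0100" style signature line.
    re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+,\s*" + PHONE),
]


def extract_phone_numbers(text):
    """Return the distinct phone numbers matched by any pattern."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            number = match.group(1)
            if number not in found:
                found.append(number)
    return found


if __name__ == "__main__":
    email = "You can reach me at 408-555-0100 or call me at 408-555-0101."
    print(extract_phone_numbers(email))
```

No parsing or ontology is involved; the patterns only cover the handful of phrasings they were written for, which is the trade-off the quote describes.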
Avatar has three main components:
  1. Information Extraction System - Its purpose is to allow relatively unsophisticated users to build rule-based document annotators that can operate over very large corpora.

  2. Semantic Search - This takes a user's keyword search and transforms it into a structured query over the extracted concepts using statistical analysis (see the related paper Web-scale Data Integration: You can only afford to Pay As You Go).

  3. Managing Uncertainty and Probabilistic Databases - The extracted annotations have a probability of being correct based on the precision of the extraction rules. Having a system that can deal with this uncertainty during querying and processing can improve the performance of the system.
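The third component can be sketched as follows. This is only an illustration under my own assumptions, not how Avatar stores or queries its annotations: the rule names, the precision values in `RULE_PRECISION`, and the `annotate` and `query` helpers are all hypothetical. Each annotation carries the precision of the rule that produced it, and a query keeps only annotations above a chosen probability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    value: str          # the extracted text, e.g. a phone number
    rule: str           # name of the rule that produced it
    probability: float  # chance the annotation is correct


# Hypothetical precision measured for each extraction rule.
RULE_PRECISION = {
    "reach_me_at": 0.95,
    "name_comma_number": 0.70,
}


def annotate(value, rule):
    """Attach the rule's precision to the extracted value."""
    return Annotation(value, rule, RULE_PRECISION[rule])


def query(annotations, min_probability):
    """Keep annotations at or above the threshold, most probable first."""
    kept = [a for a in annotations if a.probability >= min_probability]
    return sorted(kept, key=lambda a: a.probability, reverse=True)


if __name__ == "__main__":
    annotations = [
        annotate("408-555-0100", "name_comma_number"),
        annotate("408-555-0101", "reach_me_at"),
    ]
    for a in query(annotations, 0.9):
        print(a.value, a.probability)
```

A real probabilistic database would do much more (for example, propagate probabilities through joins), but even this threshold shows why keeping the uncertainty around is useful at query time.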
TAKMI (Text Analysis and Knowledge Mining)
Fewer details are available on this project, but its description says:
Although TAKMI was originally created for analyzing call center logs, it can be applicable for any type of large text data in general. In particular, we have offered a medical version of TAKMI system (called MedTAKMI) for analyzing medical publications.