Thursday, July 29

Recorded Future: Trend and event spotting from real-time news data

Yesterday Wired featured an article, Google, CIA Invest in ‘Future’ of Web Monitoring. The article stretches the truth a bit when it says that Google is doing business with the CIA. The link is tenuous: both organizations are interested in predictive analytics on news and real-time data. The subject of the article is a small Cambridge-based company, Recorded Future. From the article's description:
Recorded Future strips from web pages the people, places and activities they mention. The company examines when and where these events happened (“spatial and temporal analysis”) and the tone of the document (“sentiment analysis”)... Recorded Future maintains an index with more than 100 million events, hosted on Amazon.com servers.
For a more detailed look at what the company is doing, take a look at the white paper published on the company blog, A whitepaper on temporal analytics. You can also read the Predictive Signals blog by Bill Ladd, the Chief Analytic Officer at Recorded Future.

Recorded Future is not alone in this field. For example, the Living Knowledge Project is also working on future prediction of news events from web data.

The people working in this field should be aware of the wealth of previous research analyzing event data in news. One example is the DARPA TIDES program on Topic Detection and Tracking (TDT). See James Allan's book, Topic Detection and Tracking, for an overview. You can also look at some of Victor Lavrenko's work, specifically on TDT and AEnalyst for financial market prediction from news.

Quick Links of the Day: KDD Cup, Task Oriented Search, ScalaNLP, SIGIR

Any of these stories could be a full blog post. But for now, I'll just have to give you a few quick pointers:

SIGIR 2010 Industry day videos - complete videos of all the talks, via Noisy Channel.

ScalaNLP - A new NLP package in Scala from the Berkeley and Stanford NLP teams. Scala is a hip new language for NLP that runs inside the JVM. See also the factorie project from UMass's IESL lab.

KDD Cup Challenge Results - This year's competition asked participants to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems.

TabCandy - from Matthew Hurst. Create groups of tabs for task-oriented search. Create a "save for later" group of tabs. Share "groups of tabs" across platforms and with your friends - "group browsing".

How Google Builds APIs from Google I/O

Research vs. Reality - Discuss.

Tuesday, July 27

KDD 2010 Coverage, Best Paper Awards

KDD 2010 is being held in Washington D.C. this week. I'm not attending, but everyone can participate because the keynotes are being streamed live. The keynote at 9am EST is from David Jensen from UMass Amherst, giving a talk on Computational Social Science.

Yesterday was the first day of papers. Two that garnered lots of discussion on Twitter are:

Suggesting Friends Using the Implicit Social Graph
In this paper, we describe the implicit social graph which is formed by users' interactions with contacts and groups of contacts, and which is distinct from explicit social graphs in which users explicitly add other individuals as their "friends".
It won an honorable mention in the industry paper category. Look for the "Got the wrong Bob" and "Don't forget Bob" features in Gmail Labs.
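
To make the idea of an implicit social graph concrete, here is a minimal sketch. It is not the paper's algorithm: the co-occurrence weighting, the function names, and the toy message log are all assumptions made for illustration. It counts how often pairs of contacts appear as recipients of the same message, then suggests contacts who frequently co-occur with the ones you have already typed.

```python
from collections import Counter
from itertools import combinations


def build_implicit_graph(messages):
    """Count how often each pair of contacts appears on the same message.

    `messages` is a list of recipient lists (a hypothetical log format).
    """
    weights = Counter()
    for recipients in messages:
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(recipients)), 2):
            weights[(a, b)] += 1
    return weights


def suggest_contacts(weights, seed, k=3):
    """Rank contacts outside `seed` by total co-occurrence with `seed`."""
    seed = set(seed)
    scores = Counter()
    for (a, b), w in weights.items():
        if a in seed and b not in seed:
            scores[b] += w
        elif b in seed and a not in seed:
            scores[a] += w
    return [name for name, _ in scores.most_common(k)]


# Toy data, invented for this example.
messages = [
    ["alice", "bob", "carol"],
    ["alice", "bob"],
    ["alice", "bob", "carol"],
    ["dave", "erin"],
]
weights = build_implicit_graph(messages)
print(suggest_contacts(weights, ["alice", "bob"]))
```

With this toy log, typing alice and bob suggests carol, which is the flavor of a "Don't forget Bob" feature. A real system would presumably also account for recency and the direction of each interaction.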

Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
In this paper, we describe Google’s overlapping experiment infrastructure that is a key component to solving these problems. In addition, because an experiment infrastructure alone is insufficient, we also discuss the associated tools and educational processes required to use it effectively.
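
One common way to let experiments overlap is to split them into independent layers and hash each user separately per layer. The sketch below illustrates that general idea only; the layer names, bucket ranges, and hashing scheme are hypothetical and are not taken from the paper.

```python
import hashlib

NUM_BUCKETS = 1000

# Hypothetical layers: each maps half-open bucket ranges to an experiment.
LAYERS = {
    "ui": [(0, 100, "blue_links"), (100, 200, "green_links")],
    "ranking": [(0, 500, "new_ranker")],
}


def bucket(user_id, layer):
    """Hash the user together with the layer name into a stable bucket."""
    digest = hashlib.sha256(f"{layer}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign(user_id):
    """Return at most one experiment per layer for this user."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(user_id, layer)
        for lo, hi, name in experiments:
            if lo <= b < hi:
                assignment[layer] = name
                break
    return assignment


print(assign("user-42"))
```

Because the layer name is part of the hash input, a user's bucket in one layer is effectively independent of their bucket in another, so the same user can be in a UI experiment and a ranking experiment at once while experiments within a layer stay mutually exclusive.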
The awards were also announced; see the KDD awards for the full list.

Best Research Paper:
Connecting the Dots Between News Articles
In this paper, we investigate methods for automatically connecting the dots - providing a structured, easy way to navigate within a new topic and discover hidden connections. We focus on the news domain: given two news articles, our system automatically finds a coherent chain linking them together. For example, it can recover the chain of events starting with the decline of home prices (January 2007), and ending with the ongoing health-care debate.
Best Industry/Government Paper:
Optimizing Debt Collections Using Constrained Reinforcement Learning
In this paper, we propose and develop a novel approach to the problem of optimally managing the tax, and more generally debt, collections processes at financial institutions... We report on our experience in an actual deployment of a tax collections optimization system based on the proposed approach, at New York State Department of Taxation and Finance.

SIGIR 2010 Workshops: CrowdSourcing for Search Evaluation

Last Friday was SIGIR workshop day. First up is the workshop on CrowdSourcing for Search Evaluation. It focuses on using Amazon's Mechanical Turk (MT) and similar services to provide judgments. I did not attend this workshop, but heard positive things from the attendees. The workshop is organized by Matt Lease, Vitor Carvalho, and Emine Yilmaz.

The presentations and papers in the program are available online. Here are a few I want to highlight:

A main highlight was the CrowdFlower keynote:
Better Crowdsourcing through Automated Methods for Quality Control
CrowdFlower provides commercial support for companies performing tasks on Mechanical Turk. Everyone had great things to say about this talk that kept people enthralled even though it was the end of the day; some said it was the best talk of the conference.

The other keynote was:
Design of experiments for crowdsourcing search evaluation: challenges and opportunities by Omar Alonso. Don't miss the slides from Omar's ECIR tutorial. Omar and his co-authors also had a paper at the workshop,

Detecting Uninteresting Content in Text Streams, which looked at using crowdsourcing to evaluate the 'interestingness' of tweets. They found that most tweets (57%) were not interesting. They also found that tweets containing links tend to be interesting (81% accuracy), and that interesting tweets without links generally contained named entities.
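
The two signals mentioned above (links and named entities) can be turned into a toy rule-based filter. This is only a sketch under loud assumptions: it is not the paper's method, the function names are invented, and the "named entity" check is a crude capitalization heuristic standing in for a real NER system. It also treats a named entity as sufficient evidence of interest, which is stronger than what the finding states.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Crude stand-in for NER: a capitalized word of at least two letters.
CAPITALIZED_RE = re.compile(r"[A-Z][a-z]+")


def has_link(tweet):
    """True if the tweet contains an http(s) URL."""
    return URL_RE.search(tweet) is not None


def has_named_entity(tweet):
    """True if any word after the first starts with a capitalized token."""
    words = URL_RE.sub("", tweet).split()
    return any(CAPITALIZED_RE.match(w) for w in words[1:])


def looks_interesting(tweet):
    """Flag a tweet as potentially interesting using the two signals."""
    return has_link(tweet) or has_named_entity(tweet)


for tweet in [
    "New paper on crowdsourcing http://example.com/paper",
    "heading to the keynote by Omar Alonso",
    "so bored right now",
]:
    print(looks_interesting(tweet), "-", tweet)
```

The first word is skipped because sentence-initial capitalization says nothing about entities; a real pipeline would use a trained tagger and would be evaluated against crowdsourced labels like those in the paper.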

Omar, Gabriella Kazai, and Stefano Mizzaro are working on a book on crowdsourcing that will be published by Springer in 2011.

My labmate, Henry Feild, presented a paper, Logging the Search Self-Efficacy of Amazon Mechanical Turkers.

Be sure to read over the rest of the program, because there are other great papers that I haven't had a chance to feature here.