Tuesday, February 3

Google Research Entity Annotations of the KBA Stream Corpus (FAKBA1)

I'm happy to announce that Google Research is releasing the largest collection of entity-linked data ever made publicly available. The dataset can be used for a wide variety of information retrieval and information extraction tasks.

The Freebase Annotations of the TREC KBA Stream Corpus 2014 (FAKBA1) contains over 9.4 billion entity annotations from over 496 million documents. More details, including a link to download the data, are available at:

This is an important data release because entity linking can be an expensive process that is difficult for researchers to perform at scale.

The KBA Stream Corpus was designed to help track and filter important updates about entities as they change over time. The goal of KBA was to recommend edits to Wikipedia editors based on incoming streams of news and social media. One of the tasks in the track is the "Cumulative Citation Recommendation" (CCR) task, whose goal is to recommend cite-worthy articles to editors. There is also an extraction task, Streaming Slot Filling, which suggests changes to an entity profile (similar to updating a Wikipedia infobox).

To facilitate research in this field, we annotated all of the English documents from the TREC KBA Stream Corpus 2014 (http://trec-kba.org/kba-stream-corpus-2014.shtml) with entity links to Freebase. The entity links are resolved automatically and are imperfect. For each named entity recognized, we provide the mention text, begin and end byte offsets, Freebase MID, and confidence scores. The dataset also includes manual annotations of the TREC KBA CCR 2014 entity queries (in TSV format) that I performed.
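As a rough illustration of working with these fields, here is a minimal Python sketch. The class and field names, the single confidence value, the exclusive end offset, and the placeholder MID are all my assumptions for illustration and not the release's actual schema, so check the documentation that ships with the data. The point it shows is that the offsets count bytes, so the mention should be sliced from the raw document bytes and not from the decoded string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityAnnotation:
    """One entity mention. Field names are illustrative, not the release schema."""
    mention: str       # surface text of the mention
    begin: int         # byte offset where the mention starts
    end: int           # byte offset where the mention ends (assumed exclusive)
    mid: str           # Freebase MID
    confidence: float  # one linker confidence score (assumed to lie in [0, 1])


def mention_from_bytes(doc: bytes, ann: EntityAnnotation) -> str:
    """Slice the raw bytes first, then decode: offsets count bytes, not characters."""
    return doc[ann.begin:ann.end].decode("utf-8")


def confident(annotations, threshold=0.5):
    """Keep annotations at or above an (arbitrary) confidence threshold."""
    return [a for a in annotations if a.confidence >= threshold]


# A made-up document containing multi-byte characters before the mention.
doc = "José met Barack Obama in Zürich.".encode("utf-8")

# "é" takes two bytes in UTF-8, so the mention starts at byte 10 (character 9).
ann = EntityAnnotation(
    mention="Barack Obama",
    begin=10,
    end=22,
    mid="/m/placeholder",  # placeholder, not a real MID
    confidence=0.9,
)

recovered = mention_from_bytes(doc, ann)
kept = confident([ann], threshold=0.5)
```

Slicing the decoded string with the same offsets would be off by one character here, which is an easy mistake to make on documents with non-ASCII text.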

FAKBA1 has 394,051,027 documents with at least one entity annotated. There are over 9.4 billion entity mentions with links to Freebase. 

Although it's early, the dataset has a variety of possible applications, including:
  • TAC Knowledge Base Population Tasks - The goal is to construct a knowledge base, including tasks such as entity linking.  There is a new Tri-lingual track (Spanish, English, and Chinese) being planned for 2015.
  • TREC Temporal Summarization - A track focused on summarizing major world events.
  • TREC Dynamic Domain - A track focused on high-recall filtering. The KBA annotations could be used with the Local Politics vertical.
I hope the dataset proves broadly useful to many researchers!

You can also stay up-to-date about future releases by reading: