Thursday, October 27

CIKM 2011 Industry: Freebase: A Rosetta Stone for Entities

John Giannandrea, Google

What is Freebase?
 -> A machine representation of things in the world (a person, a place, a thing).
 -> Instead of working in the domain of text, we work in the domain of strongly identified things.
 -> Each object has an identifier; once you have the identifier, it always refers to the same identity.

Properties - relationships between objects
 - edges between the entity ids
 - edges are directional
 - properties create meaning

 - encode knowledge
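
A minimal sketch of the property model above, assuming a toy in-memory store. The ids and the property name are made up for illustration and are not Freebase's actual identifiers.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph: entity id -> property -> set of target ids."""

    def __init__(self):
        self.out_edges = defaultdict(lambda: defaultdict(set))

    def add(self, source, prop, target):
        # Edges are directional: this records source --prop--> target only.
        self.out_edges[source][prop].add(target)

    def get(self, source, prop):
        # .get avoids creating empty entries on lookup.
        return set(self.out_edges.get(source, {}).get(prop, set()))


g = Graph()
# Hypothetical ids and property name, for illustration only.
g.add("entity:arnold", "acted_in", "entity:the_terminator")

print(g.get("entity:arnold", "acted_in"))          # the forward edge
print(g.get("entity:the_terminator", "acted_in"))  # no reverse edge was stored
```

The property name is what gives the edge its meaning; the same two ids joined by a different property would state a different fact.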

Types
 - a categorization of an entity
 - An entity can have multiple Types in Freebase
 - "Co-types" - Types are mix-ins
 - e.g. Arnold (politician, actor, athlete)
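
One way to picture types as mix-ins, as a rough sketch: each type contributes its own properties, and an entity gets the union of whatever its types bring. The type and property names below are hypothetical.

```python
# Hypothetical type names; each type contributes its own set of properties.
TYPE_PROPERTIES = {
    "politician": {"office_held"},
    "actor": {"film_performances"},
    "athlete": {"sport"},
}


def properties_for(types):
    """Union of the properties contributed by every type an entity carries."""
    props = set()
    for type_name in types:
        props |= TYPE_PROPERTIES.get(type_name, set())
    return props


# An entity is not forced into a single category; types are simply combined.
arnold_types = {"politician", "actor", "athlete"}
print(sorted(properties_for(arnold_types)))
```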

The real world is extremely messy.

Knowledge you can use
 - the current state
 - 25 Million topics (entities)
 - 500 million connections
 - 2x the size it was last year

>= 10 instances, 5790 types
1772 commons (survived community scrutiny)
4019 bases (people created)

Identity matching
 - reconciliation at scale
 - Wikipedia, WordNet, Library of Congress terms, Stanford library
 - they have tried to import any large open-source set of terms into Freebase

-> How? Whatever works.  MapReduce, Google Refine, and human judgment
-> This is possible if you know what an entity is.  (IBM example)

Freebase as a rosetta stone
 - keys
 - behind the websites, there is a structured database with keys (relational db with tables that have primary keys)
 - all of these keys leak out onto the web, e.g. "shakira" in a URL
 - Freebase tries to collect these keys to link each entity to external websites

URLs and Freebase keys
 - accrete the URLs and keys onto the object
 - Names are just another key (the entities themselves are the same across languages)
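
The Rosetta-stone idea can be sketched as an index from (namespace, key) pairs to one entity id. This is only an illustration: the class, the namespaces, and the keys below are assumptions, not Freebase's real key scheme.

```python
class KeyIndex:
    """Maps (namespace, key) pairs to a single entity id."""

    def __init__(self):
        self._entity_by_key = {}
        self._keys_by_entity = {}

    def attach(self, entity_id, namespace, key):
        # Accrete a key onto an entity; one key may not point at two entities.
        owner = self._entity_by_key.setdefault((namespace, key), entity_id)
        if owner != entity_id:
            raise ValueError(f"{namespace}:{key} already belongs to {owner}")
        self._keys_by_entity.setdefault(entity_id, set()).add((namespace, key))

    def resolve(self, namespace, key):
        return self._entity_by_key.get((namespace, key))

    def translate(self, namespace, key, target_namespace):
        # Go from one site's key to another site's key through the entity.
        entity_id = self.resolve(namespace, key)
        keys = self._keys_by_entity.get(entity_id, ())
        return sorted(k for ns, k in keys if ns == target_namespace)


idx = KeyIndex()
# Hypothetical namespaces and keys, for illustration only.
idx.attach("entity:1", "site_a", "shakira")
idx.attach("entity:1", "site_b", "artist/12345")
idx.attach("entity:1", "name/en", "Shakira")  # a name is just another key
idx.attach("entity:1", "name/es", "Shakira")

print(idx.translate("site_a", "shakira", "site_b"))
```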

Schema
 - Freebase is schema-less
 - It is fundamentally based on a graph store
 - The schema is described in the graph itself, just like the data ("Type:type")
 - The person "type" is an entity with an id, "Type:type:person"
 - The predicates are put into the graph system so that the schema can be updated
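
A small sketch of what "schema in the graph" can look like, assuming a plain set of triples. The ids and predicate names are invented for the example and do not follow Freebase's actual naming.

```python
# Schema and data live in the same set of (subject, predicate, object) triples.
triples = set()


def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))


def subjects(predicate, obj):
    return {s for (s, p, o) in triples if p == predicate and o == obj}


# Schema: types and properties are ordinary entities (hypothetical ids).
add("type:type", "has_type", "type:type")
add("type:person", "has_type", "type:type")
add("prop:date_of_birth", "has_type", "type:property")
add("prop:date_of_birth", "belongs_to", "type:person")

# Data: stored exactly the same way as the schema.
add("entity:arnold", "has_type", "type:person")


def properties_of(type_id):
    """Look up a type's properties by querying the graph itself."""
    return subjects("belongs_to", type_id)


# Updating the schema is just another write to the graph.
add("prop:place_of_birth", "has_type", "type:property")
add("prop:place_of_birth", "belongs_to", "type:person")

print(sorted(properties_of("type:person")))
```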

Google API to Schema Data
 - MQL read (MQL, the Metaweb Query Language, is a query language for inspecting the Freebase graph)
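
Assuming the query language meant here is MQL (Metaweb Query Language), queries are JSON templates written query-by-example style: fixed values constrain the match and empty slots are filled in by the service. The sketch below only builds such a template and makes no network call; the id is illustrative.

```python
import json

# Query-by-example: fixed values constrain the match, while None (JSON null)
# and [] mark slots that the service is expected to fill in.
query = {
    "id": "/en/arnold_schwarzenegger",  # illustrative id
    "name": None,                       # ask for the single name
    "type": [],                         # ask for the list of all types
}

query_json = json.dumps(query, sort_keys=True)
print(query_json)
```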

How does Google use Freebase?
 - "I work in the search division"

Time in Freebase
 - everything has a start date and end date
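
A rough sketch of a time-bounded fact, assuming each assertion carries an optional start and end date. The Fact class and its field names are hypothetical, and the example dates are included only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    prop: str
    value: str
    start: Optional[date] = None  # None means unknown or open-ended
    end: Optional[date] = None

    def holds_on(self, day: date) -> bool:
        after_start = self.start is None or self.start <= day
        before_end = self.end is None or day <= self.end
        return after_start and before_end


# Hypothetical ids; the dates are for illustration.
fact = Fact(
    "entity:arnold",
    "office_held",
    "Governor of California",
    start=date(2003, 11, 17),
    end=date(2011, 1, 3),
)

print(fact.holds_on(date(2005, 1, 1)))
print(fact.holds_on(date(2011, 10, 27)))  # the day of this talk
```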

How good is the quality?
 - varies depending on the entities (e.g. presidents are high quality, but for an obscure book there may be some duplicates)
 -> 99% accuracy
 -> curate the top 100k entries
 -> we'd rather not import data than import data that is bad
 -> (We imported the Open Library catalog, which has lots of duplicates.  Never again.)

 - 25 M entities, 2x from last summer, 100M by next year
 - It depends on the domain
 - For common queries in search engines, it does very well
 - search engines handle lots of queries for celebrities, common places on earth

Confidence on facts
 - common criticisms: 1) it's not a real database; 2) the assertions are not given weights, so it doesn't capture uncertain facts
 - you can create a mediated way of doing that in the schema
 - how do you deal with controversial facts?  Be careful with type definitions.  Countries are hard (use the UN definition).  Watch for unusual categories: FIFA has its own definition, and World Cups have been played with countries that don't exist.
 - for head entities, there are a large number of people arguing

Ian - on quality: 99% accuracy is still 1 million incorrect out of 100M.
 - it depends on the sampling rate for how you draw entities; you get a probability of confidence.
  (two kinds of sampling: 1) random sampling, 2) traffic-weighted sampling based on popularity)
 - 99% at the 95th percentile
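
The arithmetic and the two sampling schemes above can be sketched as follows. The entity names, traffic numbers, and correctness labels are invented for the example; this is not how Freebase actually measured quality.

```python
import random


def expected_errors(total_entities, accuracy_percent):
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_entities - total_entities * accuracy_percent // 100


def uniform_sample(entities, k, rng):
    # Every entity is equally likely, however rarely it is looked up.
    return rng.sample(entities, k)


def traffic_weighted_sample(entities, traffic, k, rng):
    # Drawn with replacement, in proportion to how often each entity is seen.
    weights = [traffic[e] for e in entities]
    return rng.choices(entities, weights=weights, k=k)


def estimated_accuracy(sample, is_correct):
    return sum(1 for e in sample if is_correct(e)) / len(sample)


print(expected_errors(100_000_000, 99))  # 1% of 100M

# Hypothetical data: one heavily curated head entity, three obscure ones.
rng = random.Random(0)
entities = ["popular", "obscure_1", "obscure_2", "obscure_3"]
traffic = {"popular": 97, "obscure_1": 1, "obscure_2": 1, "obscure_3": 1}
correct = {"popular"}

uniform = estimated_accuracy(uniform_sample(entities, 4, rng), correct.__contains__)
weighted = estimated_accuracy(
    traffic_weighted_sample(entities, traffic, 1000, rng), correct.__contains__
)
print(uniform, weighted)
```

The same data gives very different accuracy figures depending on the sampling scheme, which is why a quoted accuracy number needs the scheme stated alongside it.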