Tuesday, November 8

Notes on Strata 2011: Entities, Relationships, and Semantics: the State of Structured Search


Entities, Relationships, and Semantics: the State of Structured Search


I didn't attend the talk, but I watched the video and took notes for future reference.


Andrew Hogue (Google NY)
 - worked on Google Squared
 - QA at Google, NER, local search
 - Extraction is never perfect, even with a clean database like Freebase. Coverage isn't good, e.g. only 20 of 200 dog breeds.
 - If you try to build a search engine on top of an incomplete database, users hit its limits, fall off the cliff, and get frustrated
 - Tried to build user models of what people like (for Google+). Do you like Tom Hanks? The movie Big? Real-world entities.
   (Coincidentally, Google just rolled out Google+ Pages that represent entity pages)
    --> if the universe of people and entities isn't complete, users get frustrated
    --> fixes: 1) get a bigger database, or 2) fall back gracefully to a world of strings (hybrid systems)

Breck Baldwin (Alias-i)
 - see his blog post (March 8, 2009) on how to approach new NLP projects
 - the biggest problem is the gap between the NLP system in your head and reality
 - three steps: 1) take some data and annotate it, even just 10 examples; it forces the fights to happen early (the #1 best thing you can do). 2) Build simple prototypes; information flow is hard. 3) Pick an eval metric that maps to the business need.

Evan Sandhaus (NY Times)
 - on the semantic web (3.0)
 - the semantic web is a complex implementation of good, simple ideas
 - get your toes wet in a few areas: 1) linked data, and 2) semantic markup
 - 1) linked data - all articles get categorized from a controlled vocabulary (strong IDs tied to all docs). BUT there is no context for what those IDs mean, e.g. that Barack Obama is the president of the United States, or that Kansas City is the capital... You need to link to external data to add new understanding.
   -- e.g. find all articles in A1, P1 that mention presidents of the United States
   -- e.g. find all articles that occur near Park Slope, Brooklyn
 - 2) semantic markup (RDFa, microformats, rich snippets). They use the rNews vocabulary as part of schema.org.

Wlodek Zadrozny (IBM, Watson)
 - what are the open problems in QA?
 - Trying to detect relations that occur in the candidate passages retrieved for the question
 - Then score and rank the candidate answers, some of it over RDF data. Confidences are important because wrong answers are penalized.

Keys to success: 1) data, 2) methodology, and 3) testing often.  The data: 1) QA answer sets from historical archives (200k QA pairs), 2) collection data sources, and 3) test (trace) data (7k experiments, 20-700 MB per experiment, with lots of error analysis).
 - medical, legal, education

Questions
Q (NYT R&D): On trends in NLP - certain things graduate to reliability. What will these be over the next decade?
  -- Andrew: The most interesting thing is QA. Surface answers to direct questions. (e.g. [harvard college] vs. [lebron james college])
  -- Statistical approaches to language (knowing when we have a good parse vs. when we don't)
  -- Breck: classifiers are getting robust on sentiment and topic classification. Breakthroughs will come in highly customized systems, finely tuned to a domain in ways that bring lots of value.

Query vs. Document centric
  -- reason across documents at a meta-level.  What can you do when you have great meta-data? (we have hand-checked, clean, data)
  -- in Watson, an alternative to high-quality hand curated data is to augment existing sources with data from the web
     (see Statistical Source Expansion for Question Answering from Nico Schlaefer at CIKM 2011)

QA on the open web
 - Problem - not enough information from users.  People don't ask full NLP questions (30 to 1)

- Is there an answer?  (Google wins by giving people documents and presenting many possible answers)

Evan - real-time metadata is needed for the website.  They use a rule-based information extraction system that suggests terms, and then the librarians review the producers' tags.

Breck - Recall is hard, in NER and elsewhere.

Overall Summary
 - Wlodek - QA depends on having the data: 1) training/test data, 2) sources, and 3) system tests
 - Evan - Structured data is valuable to get out there, rNews and schema.org.  Publishers should publish it!  It will be a game changer.
 - Breck - 1) annotate your data before you build, 2) have an eval metric, and 3) LingPipe is free, so use it.
 - Andrew - (involved in schema.org, Freebase).  Share your data.  Get it out there.  And -- ask longer queries!

Thursday, October 27

CIKM 2011 Industry: Toward Deep Understanding of User Behavior on the Web

Toward Deep Understanding of User Behavior on the Web
Vanja Josifovski, Yahoo! Research

Where is user understanding going?

What is the future of the web?
 - prevalent - everyone and everything
 - mutual understanding

Personalized laptops

Personalization today
 - Search personalization.  low entropy of intent.  Difficult to improve over the baseline
 --> effects are small in practice

Content recommendation and ad targeting
 - High entropy of intent
 - Still very crude with relatively low success rates

How do we move to the next level?
 - more data, better reasoning, and scale

Data today
 - searches, page views
 - connections: friends, followers, and others
 - tweets

The data we don't have
 - jetlagged? need a run? need a pint? worried about government debt?
 - the observable state is very thin

How to get more user data?
 - Only with added value to the user
 - Must be motivated to provide their data

Privacy is not dead, it's hibernating
 - the impact of data leaks online is relatively small

Methods
  - State of the art as we know it: popular methods that seem to work well in practice (see the sketch below)
  --> Learn relationships between features: r_ij = x_i' C z_j
  --> Dimensionality reduction (random projections, topic models, recommender systems: r_ij = u_i' v_j)
  --> Use of external knowledge: smoothing
      --> taxonomies
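
To make the two model families concrete, here is a minimal numpy sketch (my own illustration with random stand-in data, not something from the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, d_user, d_item, k = 100, 50, 20, 30, 5

    X = rng.normal(size=(n_users, d_user))   # user feature vectors x_i
    Z = rng.normal(size=(n_items, d_item))   # item feature vectors z_j

    # 1) Bilinear feature model: r_ij = x_i' C z_j
    C = rng.normal(size=(d_user, d_item))    # learned from interactions in practice
    R_bilinear = X @ C @ Z.T                 # (n_users, n_items) score matrix

    # 2) Dimensionality reduction: r_ij = u_i' v_j (low-rank latent factors)
    U = rng.normal(size=(n_users, k))
    V = rng.normal(size=(n_items, k))
    R_lowrank = U @ V.T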

An elaborate user topic model (Ahmed et al., KDD 2011; Smola et al., VLDB 2010), yet so simple
 - the user's behavior at time t is a mixture of his behavior at time t-1 plus the global overall behavior
 - A very simple model (a rough sketch follows)
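
A rough reading of that mixture as code; the interpolation weight lam is my assumption, not a value from the talk:

    # Hedged sketch: the user's topic distribution at time t interpolates the
    # previous estimate with the global distribution.
    def update_user_profile(prev_profile, global_profile, lam=0.8):
        topics = set(prev_profile) | set(global_profile)
        return {t: lam * prev_profile.get(t, 0.0)
                   + (1 - lam) * global_profile.get(t, 0.0)
                for t in topics}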

Using External Knowledge
 - Agrawal et al., KDD 2007 and KDD 2010

Is there more to it?
 -> What is the relative merit of the methods?
 -> They use the data in the same way and are mathematically very similar

Where is the limit? 
  -> what is the upper bound on the performance increase on a given dataset with this family of algorithms?

Scale
 - Today - MapReduce is a limiting barrier for many algorithms
 - Need the right abstractions in parallel environments
 - Move towards shared-memory, message-passing models (like Giraph); a vertex-centric sketch follows below
 -- (we'll work this out)
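
A hedged, plain-Python sketch of what one vertex-centric superstep looks like in the Pregel/Giraph style (this is not the actual Giraph API):

    # vertices: {vid: {"value": v, "edges": [neighbor ids]}}
    # inbox: {vid: [messages received this superstep]}
    def superstep(vertices, inbox):
        outbox = {}
        for vid, v in vertices.items():
            msgs = inbox.get(vid, [])
            if msgs:
                v["value"] = min([v["value"]] + msgs)  # e.g. label propagation
            for nbr in v["edges"]:
                outbox.setdefault(nbr, []).append(v["value"])
        return outbox

    # Run supersteps until no values change (loop omitted). Each vertex sees
    # only its own state plus incoming messages, which is what makes the
    # model easy to parallelize.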

Workflow complexity
 - Reality bites (Hatch et al., CIKM 2011): massive workflows that run for hours.

Summary


CIKM
1) Deep user understanding - the tale of three communities

IR:
 - Good formalisms that function in practice
 - emphasis on metrics and standard collections

DB
 - seamless running of complex algorithms
 - new parallel computation paradigms


Towards deeper understanding
1) get users to give you more data by providing value
2) significantly increase the complexity of the models
3) scale in terms of data and system complexity


CIKM 2011 Industry: Model-Driven Research in Social Computing


Model-Driven Research in Social Computing
Ed Chi

Google Social Stats
250k words per minute on Blogger, 360 million words per day
100M+ people take a social action on YouTube

Google+ Stats
40 million joined since launch
2x-3x more likely to share content with one of their circles than to make a public post

Hard to talk about because the systems are changing quite rapidly
Ed joined Google to work on Google+

Social Stream Research
Analytics
 - Factors impacting retweetability (IEEE Social computing)
 - Location field of user profiles

Motivation for studying languages
 - Twitter is an international phenomenon
 - How do users of different languages use Twitter?
 - How do bilingual users spread information across languages?

Data Collection & Processing
 - 62M tweets (4 weeks) from the Spritzer feed, April-June 2010
 - Language detection with Google language API + LingPipe
 - 104 languages
 - Top 10 languages

English - 51%
Japanese - 19 %
Portuguese - 9.6% (mostly Brazil)
Indonesian - 5.6%
Spanish - 4.7%

Sampled 2000 random tweets
 - 2 human judges for each of the top 10 languages

Problems with French, German, and Malay.
Accuracy of Language Detection
 - Two types of errors: poor recognition of "tweet English", and tweets with only 1-2 words

Korean - used more for conversation tweets
German - more tweets with URLs

English serves as a hub language

Implications - when building a global network, you need to understand language barriers
 - building a global community
 - the need for brokers of information between languages

Visible Social Signals from Shared Items (Chen et al., CHI 2010/CHI 2011)
- After a full day without WiFi, he would like a summary of what's happening in his social stream
- Eddi - Summarizing Social Streams
  --> What's happened since you last logged in
  --> A tag cloud of entities that were mentioned
  --> A topic dashboard where tweets are organized into categories to drill into

Information Gathering/Seeking
 - The Filtering problem - I get 1,000+ things in my stream, but only have time for 10.  Which ones should I read?

 - The Discovery Problem
 -- millions of URLs are posted,

Zerozero88.com
 - twitter as the platform
 - URLs as the medium
 - a personal newspaper that produces personal headlines

URL Sources (from tweets) -> Topic  Relevance Model, and Social Network Model

URL Sources
 - Considering all URLs is impossible
 -- FoF URLs from followees-of-followees
  --> socially local news is better
- Popular - URLs that are popular across the whole of Twitter
   --> popular news is better

Topic Relevance Model
 - A user tweets about things, which creates a term vector profile.
 - Cosine similarity between the profile and URLs (see the sketch below)
 - Topic profile of URLs - built from the tweets that contain the URL; but tweets are short, and RTs make word frequencies goofy.
 - So they adopt a term expansion technique: extract nouns from the tweet and feed them into a Wikipedia search engine as topic detection.
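
A minimal sketch of the cosine-similarity matching between a user's term-vector profile and a URL's profile; the toy profiles are my own illustration, not from the paper:

    import math
    from collections import Counter

    def cosine(a: Counter, b: Counter) -> float:
        # dot product over shared terms, normalized by vector lengths
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    user_profile = Counter("search ranking twitter recommendation".split())
    url_profile = Counter("twitter url recommendation study".split())
    print(cosine(user_profile, url_profile))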

Topic Profile of User
 - Self-topic
 - Information producer - the things they tweet about
 - Information gatherer - what they like to read
 - Build profiles from both and aggregate them.

Social Module
 - Take the FoF neighborhood and count the votes for a URL
 - Simple counting doesn't work very well.
 - Votes are weighted using the social network structure (sketched below)
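
A toy sketch of weighted social voting; the 1.0/0.5 weights are invented for illustration and are not the paper's actual weighting scheme:

    def social_score(url_voters, followees, followees_of_followees):
        # a directly followed voter counts more than a FoF voter
        score = 0.0
        for voter in url_voters:
            if voter in followees:
                score += 1.0
            elif voter in followees_of_followees:
                score += 0.5
        return score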

Study Design
 - Each subject evaluated 5 URL recommendations from each of the 12 algorithms: 60 URLs shown in random order, with a binary relevance rating for each.

Summary of Results
 - Global popularity (top 1%) -- 32.5% relevant; not bad, but not good enough
 - FoF only - 33% - naive FoF by itself, without voting, doesn't work great
 - FoF voting method - 65% (social voting only)
 - Popularity voting - 67%
 - FoF Self-Vote - 72%, the best performing

Algorithms differ not only in accuracy!
 - Relevance vs. Serendipity in recommendations (tension between discovery and affirming aspect)
 -> "What I crave is surprising, interesting, whimsy" this is where the value is
 -> Two elements to surprise: 1) have I seen this before? 2) non-obvious relationships between things

Design Rule
 - Interaction costs determine the number of people who participate
 - Reduce the interaction costs and you can get a lot more people into the system
 - For Google+, this is key to delivering it to people

Q&A:
Japanese crams more information into a tweet.  It is used more for conversation than broadcast in these environments

CIKM Industry talk: Jeff Hammerbacher on Analytical Platforms


Experiences Evolving a New Analytical Platform: What Works and What's Missing
Jeff Hammerbacher, Cloudera

Built the infrastructure team at Facebook, 0 to 2 PB of data

Take the infrastructure and make it available as open source.

Philosophy
The true challenges in the task of data mining: creating a data set with the relevant and accurate information, and determining the appropriate analysis techniques.

Exploratory data processing (IBM)

Taught the data science course at Berkeley earlier this year

1) Store all your organization's data in one place
  - data first, questions later
  - store first, structure later

Engineers are constrained when you force them to stop and model the data, which is constantly evolving.

Raw storage: roughly $0.03/GB ($67 for a 2 TB disk).  A single HDFS instance > 50 PB on commodity hardware in one data center.

Enable everyone to party on the data.  Use files, because developers are not analysts.

Like the LAMP stack, a coherent open source stack for analytical data management is emerging.

Better underlying abstractions

Platform - Substrate
 - commodity servers (a big warehouse)
 -- open compute project (FB open source)
 - open source OS
  -- Linux
 - Open source config management
  -- Puppet, Chef
 - Coordination service
  -- ZooKeeper

Platform - Storage
 - Distributed, schemaless storage
 --> HDFS, Ceph (UCSC), MapR
 - Append-only table storage and metadata
  --> Avro, RCFile, HCatalog (also: Thrift, Protocol Buffers)
 - Mutable table storage and metadata
 -- HBase

Compute
- Cluster resource management
 -- YARN (inter-job scheduling, like grid engine, for data intensive computing)
 -- Mesos
- Processing Frameworks
 -- MapReduce, Hamster (MPI), Spark, Dryad, Pregel (Giraph), Dremel
- High level interfaces
  -- Crunch (like Google's FlumeJava), DryadLINQ, Pig, Hive

Platform
Integration
- Tool access
- Data ingest
 -- Sqoop, Flume
(Document ingest is still an area that needs work.  There are crawlers, but they're still immature.)

Trends
fat servers with fat pipes
 - 2U, 24 GB RAM, 12 drives (bigger nodes)
OS support for isolation (VMs have downsides)
Linux containers
  -- Google contributed the initial patches; used for Borg
Local file system improvements
 -- btrfs

Languages
 - ScalaQL

CIKM 2011 Industry: Freebase: A Rosetta Stone for Entities

John Giannandrea, Google

What is Freebase?
 -> A machine representation of things in the world (people, places, things)
 -> Instead of working in the domain of text, they work in the domain of strongly identified things
 -> Each object has an identifier; once you have it, it will always refer to the same entity

Properties - relationships between objects
 - edges between the entity ids
 - edges are directional
 - properties create meaning

Graphs
 - encode knowledge

Types
 - a categorization of an entity
 - An entity can have multiple Types in Freebase
 - "Co-types" - Types are a mix-in
 - e.g. Arnold (politician, actor, athlete)

The real world is extremely messy.

Knowledge you can use
 - the current state
 - 25 Million topics (entities)
 - 500 million connections
 - 2x the size it was last year

5,790 types have >= 10 instances
1,772 commons types (survived community scrutiny)
4,019 base types (user-created)

Identity matching
 - reconciliation at scale
 - Wikipedia, WordNet, Library of Congress terms, Stanford Library
 - any large open source term set, they have tried to import into Freebase

-> How? Whatever works.  MapReduce, Google Refine, and human judgment
-> This is possible if you know what an entity is.  (IBM example)

Freebase as a Rosetta Stone
 - keys
 - behind the websites, there is a structured database with keys (a relational db with tables that have primary keys)
 - all of these keys leak out onto the web, e.g. "shakira" in the URL
 - In the Freebase system they try to collect these keys to link the entity to external websites

URLs and Freebase keys
 - accrete the URLs and keys onto the object
 - Names are just another key (the entities themselves are the same across languages)

Schema
 - Freebase is schemaless
 - It is fundamentally based on a graph store
 - The schema is described in the graph itself, just like the data ("Type: type")
 - The person "type" is an entity with an id ("Type: type: person")
 - Putting the predicates into the graph means the schema can be updated like data (see the sketch below)
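
As a hedged illustration of what "schema in the graph" means: the type system is stored as ordinary edges, so it can be updated like any other data. The ids below follow Freebase's style, but the specific triples are my own example:

    # (subject, property, object) edges; the schema lives in the same graph
    triples = [
        ("/en/arnold_schwarzenegger", "/type/object/type", "/people/person"),
        ("/en/arnold_schwarzenegger", "/type/object/type", "/government/politician"),
        # the person type is itself just another node in the graph:
        ("/people/person", "/type/object/type", "/type/type"),
    ]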

Google API to Schema Data
 - MQL read (a query language for inspecting the Freebase graph)

How does Google use Freebase?
 - "I work in the search division"

Time in Freebase
 - everything has a start date and end date

How good is the quality?
 - varies depending on the entities (e.g. presidents are high quality, but for an obscure book there may be some duplicates)
 -> 99% accuracy
 -> curate the top 100k entries
 -> we'd rather not import data than import data that is bad
 -> (We imported the open library catalog, which has lots of duplicates.  never again.)

Q&A:
Recall
 - 25 M entities, 2x from last summer, 100M by next year
 - It depends on the domain
 - For common queries in search engines, it does very well
 - search engines handle lots of queries for celebrities, common places on earth

Confidence on facts
 - common criticisms: 1) it's not a real database, 2) the assertions are not given weights; it doesn't capture uncertain facts
 - you can create a mediated way of doing that in the schema
 - how do you deal with controversial facts?  1) Be careful with type definitions.  Countries are hard (use the UN definition).  2) Unusual categories: FIFA has its own definition, and World Cups have been played by countries that no longer exist.
 - for head entities, there are large numbers of people arguing

Ian - on quality: 99% accuracy is still 1 million incorrect facts at 100M entities.
 - it depends on the sampling rate for how you draw entities; you get a probability of confidence
  (two kinds of sampling: 1) a random sample, 2) traffic-weighted sampling based on popularity)
 - 99% at the 95th percentile


Wednesday, October 26

CIKM 2011 Keynote II: Justin Zobel on Biomedicine


Data, Health, and Algorithmics: Computational Challenges for Biomedicine
by Justin Zobel

(I missed the first part of this talk)

The Central Dogma
- DNA consists of a sequence of four bases: A, C, G, T
- The concept of a gene is now uncertain

SNP analysis (Nature 00)
 - used PCA

Revolution
 - Read DNA directly
 - $1000 by the end of 2012

The data is errorful, incomplete, voluminous, and ambiguous.

Also, reads are not very random

Within a few years there will be DNA databases of 10-100 terabases, which we will use to find matches to short read data


Challenge: Assembly
 - Imagine a million copies of a phone book, a million pages long
 - Shredded into tiny pieces, each no more than 20 or 30 characters
 - 99.999% are thrown away.
 - The task: reconstruct the phone book from the billion remaining pieces

The Problem of assembling short reads.

Genomes are a combinatorial minefield
 - vast quantities of repeated material

The genome is cheap, but the analysis is expensive

de Bruijn graph
 - Divide 7-base reads into kmers (3mers)
 - each node is a kmer, each arc is an overlap

The graph is about 4 terabytes, and it needs to be in memory.
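
A toy Python sketch of the construction under the talk's framing (nodes are k-mers, arcs are overlaps of k-1 bases); this is my own illustration, and real assemblers are vastly more careful:

    def de_bruijn(reads, k=3):
        # collect all k-mers, then connect k-mers whose suffix/prefix overlap
        kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
        return {km: sorted(o for o in kmers if km[1:] == o[:-1]) for km in kmers}

    # e.g. a 7-base read divided into 3-mers, as in the talk
    for node, arcs in sorted(de_bruijn(["ACGTACG"]).items()):
        print(node, "->", arcs)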

Succinct 'Gossamer' representation
 - fast access with simple index
 - space down by a factor > 10
 - cuts the storage down to 32 GB

DNA dictionaries
 - there is no grammar for DNA that would allow construction of a parser
 - a dictionary of all possible tokens would be impossibly large

Dictionary - any representative string
 -> solves the text compression problem for DBs

Genetics for diagnosis
 -> diagnosis inferred from symptoms replaced by diagnosis based on DNA analysis
 -> drug effects and health outcomes determined directly from historical health records
 - Built to simplify, improve, and automate bureaucratic decisions

'Guardian Angel' clinical decisions
 - Electronic health records analyzed on the fly to check whether a mistake is about to be made.

Health at Home
 -> health monitoring deeply embedded in our e-lifestyle activities
 -> a webcam that determines how well you are based on your skin
 -> an iPhone app that tells if you're drunk

Computer Science vs. health research
 - many algorithmic solutions are not biologically meaningful
 - spend money on IT and the number of errors decreases (it saves lives)

13 of the top 25 questions (Science mag July 2005) are about DNA

The real way to have an impact on medicine is in the clinic -> text mining records.  Helping doctors make decisions.

Tuesday, October 25

CIKM 2011 Keynote: David Karger


Creating User Interfaces that Entice People to Manage Better Information
By David Karger (MIT)

History:
Haystack - Per-user Information Environments (1999)

Current State of IKM (Information and Knowledge Management)
  - We take users with extremely rich landscapes of information and give them keyboards to barely sketch their interests.  Algorithms then work really, really hard on that sketch.

 - We work hard to make computers do IKM well
 - People are better than computers at IKM

Questions:
 - In what ways can we give people the ability to manage more or better information?
 - How do we make them want to?

1) Capture more data digitally
2) Collaborate to understand lecture notes

Capture of Information Scraps
 - The state of PIM
 - The desks all have computers, but we have huge piles of paper that never get put into them
 - 27 participants, 5 orgs, 1-hour in-situ interviews
#1: using the computer is distracting / impossible
  -- people instead just grab random scraps of paper to write things down
  -- Interfaces for Staying in the Flow (Ben Bederson, Ubiquity 2004)
  -- (Being "in the zone", in the flow)

#2: chimeras - notes that fight the boundaries between apps
  -- e.g. meeting notes with TODOs and follow-up meetings
#3: diverse information forms don't fit apps
Types of information:
  TODOs, meeting notes, names and contact information
#4: want it in view at the right time -- workflow integration

Costs of digital capture
 - effort to choose a place, an imposed schema, entry time that is a distraction
Fixes: no organization, plain text, in the browser, cross-computer sync offline and online

list.it (an open source micro-note tool for Firefox)
 --> 25,000 downloads, 16,625 registered users, 920 volunteers, 116k contributed notes

Types of notes:
TODOs, web bookmarks, contact information
median time to write something is 7.4s
median number of lines is 4

35% - ease/speed
20% simplicity
20% direct replacement for post-its

Detour: Note Science
  -- How do people keep and access information in list.it?

3 coders
first clustered, then identified 4 archetypes

MISC - MIT Open Scrap Corpus (available online)

NB: Classroom Discussion
Stellar classroom discussion tool
 - the 50 most active classes made 3,275 posts
  -- no heavily populated discussions
- NB: a forum in context (discussions happen in the margin of lecture notes)
 - Highlight a section of the notes, write a comment
 --> Implicit context (e.g. "how do I get 3 from 1?")

Benefits - Discuss as you read, without exiting the note view
 -- Context is clear because the PDF content is right there
 -- annotations create a heat map of the lecture notes

15 classes, 4 different universities
(Annotation was required.)  Usage of the tool doubled over the term.
 --> they liked seeing that they weren't the only ones confused
  --> rich interaction

NB-specific benefits
 --> "Why?"
 --> The social benefits outweighed those of paper

FEEDME
 - Artificial Collaborative Filtering

Vast amounts of content - how do we get the good stuff?
Machine learning recommenders - users rate what they read; content recommendation; collaborative filtering (find people with similar likes, predict what they will like)

Effort
 - have to read lots of junk to train system
 - have to spend energy now for future benefit
 - many users won't ever get started

Quality
 - ML algorithms are imperfect
 - Readers get irrelevant content and worry about what they missed

Alternative: People

Email is dominant in information sharing
Median 6 - people do want more relevant links
Sharers are reluctant to spam their friends
 (unsure of relevance, may have seen it already, too much effort)

Fixes
 -> let them use email; reassure the sender that the content is relevant and that the recipient isn't overloaded; one-click sharing

Firefox plugin
1. recommend recipients to reduce the time and effort of sharing
 (uses ML to find people to recommend)

One-click thanks

Recommendation Algorithm
 -- a Rocchio classifier (sketched below)
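
A hedged sketch of a Rocchio-style classifier for this setting; the alpha/beta weights are textbook defaults, not FeedMe's actual parameters:

    from collections import Counter

    def rocchio_prototype(liked, disliked, alpha=1.0, beta=0.75):
        # prototype vector = weighted centroid of liked posts minus disliked
        proto = Counter()
        for doc in liked:
            for term, tf in doc.items():
                proto[term] += alpha * tf / len(liked)
        for doc in disliked:
            for term, tf in doc.items():
                proto[term] -= beta * tf / len(disliked)
        return proto

    def score(post, proto):
        return sum(proto[t] * tf for t, tf in post.items())  # dot product

New posts scoring above a threshold against the prototype would be surfaced as candidates to share.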

Assessment
 - two-week study for $30
 - 60 Google Reader users recruited on blogs
 - Viewed 85k posts, shared 713 posts
 - Significant increase in sharing

Recipients were happy - 80.4% of the posts contain novel content

Recommendations Useful

Do overload indicators help?
 - 1/3 of subjects with them said they were their favorite feature
 - 30% of shares resulted in a thanks

Machine filtering
 - have to read stuff

Structured Data
We all know structured data is good.
It supports:
 -> rich visualizations, filtering, sorting, queries, merging data

Epicurious (old version)
 -> filter by ingredient, cuisine, part of meal

Mere mortals just write text or HTML

Structured data takes skill
 - design a data model,

Plain authors are left behind
 -> less power to communicate effectively

Coping: Information Extraction
 - Entity recognition, coreference, relationship extraction

Imperfect, so errors creep in.

Alternative: Give regular people tools that let them author structured data
 -> to communicate well

Do we need this? Yes.

Approach
- HTML is the language of the web
 - Extend it to talk about data
 - Anyone authoring HTML should be able to author data and interactive visualizations
- Edit data-HTML in web pages, blogs, wikis

(like spreadsheets)

Publishing data is easy: just put a spreadsheet online.  Rows are items, columns are properties.

Data
 - Items (recipes)
 - Each has properties: title, source magazine, publication date, etc.
 - Visualization - a view of a collection of data items (see the sketch below)
     -- bar chart, sortable list, map, thumbnail set
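
A hedged sketch of that items-and-properties model in plain Python; Exhibit actually consumes JSON/HTML, and the recipes below are invented:

    items = [
        {"label": "Beef Bourguignon", "source": "Gourmet", "year": 2004},
        {"label": "Miso Soup", "source": "Bon Appetit", "year": 2009},
    ]

    # a facet is just a property you group and filter by:
    by_source = {}
    for item in items:
        by_source.setdefault(item["source"], []).append(item["label"])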

Bound to properties
 - sort by property

Facets for filtering information
 -> specify a property; the user clicks to select
 -> templates -> a format per item
 - HTML with "fill in the blanks"

Key primitives of a data page
Data - a spreadsheet

Exhibit javascript library

1,800 websites are using Exhibit
hobby stores, science
(lots of strange hobbyists)
Veggie guide to Glasgow

Not very scalable (fast for < 100 items)

Side effects - the data is out there.  (structured data is the side effect)

Wibit
Datapress - data visualization inside the blog
DIDO - WYSIWYG editor

Conclusion
- People are powerful information managers
In each case, it's about giving people the tools to be information managers

Wait, There's more
 --> manage structured data by making it look like a spreadsheet
--> Atomate -> helps users translate incoming data into structured data

We work hard to make computers do IKM well,
Don't assume people are passive IK consumers
Give people tools that can encourage active engagement in IKM

All the links are at haystack.csail.mit.edu/blog

Questions:
The success of Exhibit came out of why Haystack didn't succeed.  Lots of people using a tool is not the only measure of success; Haystack is still an interesting piece of research.

Wednesday, September 21

Twitter Acquires Julpan Real-Time Search Engine

Today, Julpan (a stealth-mode search engine) announced it is being acquired by Twitter (see the TechCrunch coverage).

There are not a lot of public details, but here are a few.

Julpan is a real-time search engine based in NYC.  It focuses on analyzing social and real-time information from news, Twitter, and other sources.  It was founded in mid-2010 by Ori Allon.  Ori is an ex-Googler from the search quality team (Google acquired his patented thesis work, called Orion; see the Google feature post).


Sadly, pending the integration with Twitter,  the Julpan search products (Newsgrep and LiveBite) are no longer available online.

Tuesday, July 26

SIGIR 2011 Best Paper Award

The SIGIR 2011 best paper awards were announced.

The winner is:
Find It If You Can: A Game for Modeling Different Types of Web Search Success Using Interaction Data by M. Ageev, Q. Guo, D. Lagun, and E. Agichtein

Honorable mention goes to:
Enhanced Results for Web Search by Kevin Haas, Peter Mika, Paul Tarjan, and Roi Blanco

See also the notes from the SIGIR 2011 keynote addresses by Qi Lu and ChengXiang Zhai.

SIGIR 2011 Keynote ChengXiang Zhai: Beyond Search: Statistical topic models for text analysis

ChengXiang Zhai gave the second keynote address at SIGIR 2011 held this week in Beijing.

Here are the notes from my friend and fellow UMass grad student Michael Bendersky (follow him on @bemikelive). Also, be sure to check out his workshop on Query Representation and Understanding.

Be sure to read Michael's notes from Qi Lu's first keynote talk on the Future of the Web & Search.

Beyond Search: Statistical topic models for text analysis
  • Complex Task Completion Flow
    - Multiple Searches → Information Synthesis & Analysis → Task Completion
    - Sometimes the process above is iterative

    Examples of complex tasks
    • What laptop to buy?
    • What’s hot in database research?
    • What do people say in blogs on certain topics? How does the topic coverage change over time?
    • What do people like/dislike about “Da Vinci Code”?

  • Can we model complex tasks in a general way?
  • Can we solve them in a unified framework?
  • How do we bring users into the loop?

  • Proposed solution – Statistical Topic Models
    - Generative model
    - Captures language model shifts based on topics
    - Language model serves as a convenient topic representation
    - Every document has a lot of contextual data (metadata)
    o Author
    o Communities
    o Location
    o Author’s occupation
    o User labels
  • Any combination of contextual data can induce a partition over the documents

  • We should make topics depend on context variables
    o Text is generated from a contextualized PLSA model
    o Fitting such a model enables a wide range of analysis tasks on a document

  • Applications of contextual topic models
    o Social Network Analysis can help derive more coherent topic models
    o Opinion mining – integration of expert reviews and personal opinions
    • Take into account the well-formed and faceted design of expert reviews to impose context on personal opinions, which come from a variety of unstructured sources (blogs, micro-blogs, review sites, comments)
    • Derive integrated expert/personal opinions on different aspects
    • Infer aspect ratings and weights

  • Using topic models to go from search engine to analysis engine
    o Tasks
    • What is a task?
    • How is task different from information need/intent?
    • How do we help users to express tasks
    o What does ranking mean in analysis engine?
    o How to evaluate the output of the analysis engine?
    o Operators to allow analysis of search results
    -- Select, Split, Intersection/Union, Interpret, Rank, Compare
    • Operators can be combined, similar to SQL/InQuery languages

SIGIR 2011 Keynote Talk: Qi Lu and The Future of the Web & Search

Qi Lu, the president of Microsoft's Online Services Division gave the first keynote address at SIGIR 2011 happening this week in Beijing. He laid out Microsoft's vision for the future.

I am in San Francisco at Twitter, but luckily my friend and fellow UMass grad student Michael Bendersky is taking notes (follow him on @bemikelive). Also, be sure to check out his workshop on Query Representation and Understanding.

Future of the Web & Search
  • Agenda
    - Perspective of the web/IT industry
    - Future of search
    - Role of IR
    - Challenges
    - Opportunity

  • The heritage: web of documents
    The future:
    - Social web - Facebook profiles, like buttons
    - Geospatial web: Mobile devices
    - Temporal web: Collection of information over time, real-time microblogging
    - Application web: Fundamental design of the browser doesn’t support new application models

  • IT industry of the future
    - Devices + cloud services
    - Changing the user intent capturing from rigid keyboard/mouse/keywords combination to more natural modalities
    • Understanding the natural language
    • Voice recognition
    - On mobile devices
    - In living room products
    • Body gestures - Microsoft Kinect
    • Image/Audio/Video capturing

  • Vision: of the future of search
    o Empower people with knowledge
    o Re-organize the web for search to unlock the full potential of the web
    • Better discovery
    • More informed decisions
    • Easier task completions

  • Role of IR
    o Understanding user intent
    o Modeling web of the world
    • People/places/things
    • Relations
    o Task completion & decision making
    o Incentive engineering for making people do more things on the web

  • Challenges
    o Measurement, evaluation & self-correction
    • Some things are inherently hard to evaluate: objectiveness, design, opinions
    • Search results have profound influence on the way people perceive the world
    • It is important that they have no inherent bias or skew

    o Privacy

    o Lack of:
    • Tools & understanding in existing disciplines
    • Training & development of cross-disciplinary talent

    o Barriers for academia research
    • Access to data
    • Computing infrastructure
    • Funding
    • Not just based on company agenda
    • Funding projects based on pure creativity

  • Opportunities
    • Opportunities for key breakthroughs in the areas of
    • Serendipitous discovery (e.g. Hunch.com)
    • Information theory for the age of the web and social networks
    • Science of big data

    • Broadening collaborations
    • Research
    • Development (API/tools)
    • Investment (Training & Development)

    • Vibrant community
Follow #sigir2011 for more news, although given the censorship in China, the results are very sparse.

Tuesday, June 21

Inside ACL: Building Watson DeepQA keynote Address by David Ferrucci

This morning David Ferrucci gave the Association for Computational Linguistics (ACL) 2011 keynote talk. Michael Bendersky is attending the conference and was very generous to send me his notes on the first keynote talk. Be sure to read his paper, Joint Annotation of Search Queries. Here are his notes from the talk,

Building Watson: An Overview of the DeepQA Project
  • What’s the difference between playing chess and understanding human language?
    - People find chess difficult and natural language easy
    - Many non-scientists don’t realize how difficult human language understanding really is

  • Computers are good at
    - Understanding formulas
    - Understanding structured query languages

  • Computers are bad at
    - Parsing ambiguous natural language

  • The system challenges
    - Open domain
    - Complex language
    - High precision
    - Accurate confidence – only buzz in when you’re very confident
    - High speed

  • Core technologies
    - Deep parsing – using a proprietary IBM technology that has been developed over the last 20 years
    - Relation detection
    - Multiple parse interpretations
    - Multiple query formulations per parse

  • Co-reference resolution
    - The entire research was driven by a single end-to-end metric – how much the proposed solution improves the Jeopardy game
    - Some improvements on a single algorithm might be redundant or harmful in the overall solution

  • Jeopardy is open-domain – not using ontologies that were crafted specifically for Jeopardy
    - Using general resources: Wordnet, YAGO

  • Learning from Reading
    - Parsing sentences in the text
    - Generalization and Statistical Aggregation

  • Some questions require decomposition and synthesis
    - Using techniques to decompose questions into parts
    - Synthesis of answers from different parts
    - Helps in answering questions that involve puns/rhyming

  • Some questions require finding a missing link between concepts
    - Using spreading activation to find links
    - e.g., the link between “shirt”, “tv remote”, “telephone” -> buttons

  • Metrics for performance evaluation
    - Plot x- % answered, y – Precision
    - Winners cloud – answered at least 50% of the questions with precision 80-92%
    - The goal was to get Watson into the winners cloud – achieved, and Watson went above the cloud by the time of the Jeopardy game

  • Great leaps in performance from 2007. In the beginning, breaking even in the game seemed like an accomplishment

  • Watson is self-contained. Deciding what content to use is very hard – the amount of hardware is limited.

  • Guidelines
    - Specific large hand-crafted methods won’t cut it
    - Combining intelligence from diverse methods using machine learning techniques
    - Massive Parallelism is a Key Enabler

  • DeepQA – QA system underlying Watson
    - Many components for parsing and multiple answer generation
    - Logistic regression to weight the different features and rank the answers (see the sketch below)
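
A hedged sketch of that final ranking step: logistic regression over per-candidate evidence scores. The feature names, weights, and candidates are invented for illustration and are not DeepQA's:

    import math

    def confidence(features, weights, bias=-2.0):
        # logistic regression: weighted sum of evidence squashed to [0, 1]
        z = bias + sum(weights[f] * v for f, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    weights = {"passage_match": 2.0, "type_match": 3.0, "popularity": 0.5}
    candidates = {
        "Toronto": {"passage_match": 0.2, "type_match": 0.1, "popularity": 0.9},
        "Chicago": {"passage_match": 0.8, "type_match": 0.9, "popularity": 0.7},
    }
    ranked = sorted(candidates, key=lambda c: confidence(candidates[c], weights),
                    reverse=True)
    # only "buzz in" when the top answer's confidence clears a threshold
    print(ranked[0], confidence(candidates[ranked[0]], weights))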

  • Search systems used: Indri & Lucene. Both were modified to reduce run-time

  • Work process
    - All group members working in the same open space room
    - NLP researchers, IR researchers, ML researchers, linguists, statisticians
    - 8,000 experiments – all documented with tools that allow analysis by question/algorithm/features

  • Run-time
    - Single CPU time for answering a question – 2 hours
    - Scaled out to 3,000 CPUs – 2-3 seconds
    - Enabled by the built-in parallelization of the algorithms
What I find particularly striking is the deep analysis of a contained corpus, particularly the analysis to find various kinds of missing links. The hardware is limited and the corpus is very circumscribed in order to run complex and expensive algorithms - and it results in significant improvements!
  • How would you develop a system for the real-time web where what's meaningful is constantly in flux?
Ultimately, the true test of DeepQA will be how it generalizes to domains beyond Jeopardy. I hope this is just the beginning for Watson.
Thanks again to Michael for his notes. Look for more highlights from ACL coming soon!

Tuesday, June 14

Google Inside Search Event today

Today is a big press event on search at Google, Inside Search. Be sure to check out the "Evolution of Search" timeline at the bottom of the page.

Check out the new Search Globe, a visualization of worldwide search.

A big theme of the event is:

"Breaking down the barriers to knowledge"
- Make it faster and easier to enter queries across all platforms - especially mobile. Combining voice search and translation across all search platforms.

On Search and Knowledge re-org.

Amit - the classic data hierarchy:
Data
Information
Knowledge

Search has done an amazing job of taking the billions and billions of pages of data and turning them into information. We are now setting our sights on knowledge - the relationship of things to one another.

You can watch the live stream and Danny Sullivan is live blogging.

More as it develops.

Monday, June 6

Watch me Compete on MasterChef Season 2 TONIGHT

Set your DVRs and tune in to the premiere of Fox's MasterChef today, Monday, at 8pm. I am competing to be America's next MasterChef! For more on my cooking, read my modernist cooking blog, CookingPhD.

MasterChef is cooking meets American Idol for amateur cooks. I was selected as one of the 100 final contestants flown to LA, out of the 40,000 people who auditioned for the show. Watch me cook my signature dish for Gordon Ramsay, Graham Elliot, and Joe Bastianich.

Here is a quick promo video, with me searing my signature smoked duck at 2:11:


A bunch of the cast will be tweeting on #masterchef.

The second part of the premiere will air tomorrow, Tuesday, at 8pm!

(The episodes should also be on Hulu at some future date)

Jeff, aka @cookingphd

Wednesday, June 1

Twitter Releases Search+ Relevance based search

Today marks a significant milestone for real-time search. Twitter search results are now ranked based on relevance instead of purely on recency. The announcement was made by CEO Jack Dorsey at the All Things D conference earlier today. A key feature is that the new search incorporates rich media results, as mentioned in their blog announcement,
Not only will it deliver more relevant Tweets when you search for something or click on a trending topic, but it will also show you related photos and videos, right there on the results page. It's never been easier to get a sense of what's happening right now, wherever your curiosity takes you.
Danny Sullivan has an article covering the release and what "relevance" means in the context of real-time search:
“Relevance for us today is using a combination of signals: your follower graph, who you follow, who’s following you. Another aspect is just looking at the content itself and the resonance of the content,” says Mike Abbott, Twitter’s vice president of engineering.
The Twitter Engineering blog post on the update has more detail on the evolution of Twitter search since the original Summize days. Here is a small excerpt on what is needed to provide personalized relevance and filtering:
  • Static signals, added at indexing time
  • Resonance signals, dynamically updated over time
  • Information about the searcher, provided at search time
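My own hedged sketch (not Twitter's code) of how those three signal classes might combine into a single query-time score; the weights are illustrative:

    def tweet_score(static_quality, resonance, searcher_affinity,
                    w=(0.3, 0.4, 0.3)):
        # static_quality: set at indexing time; resonance: engagement over time;
        # searcher_affinity: closeness in the searcher's follow graph
        return (w[0] * static_quality
                + w[1] * resonance
                + w[2] * searcher_affinity)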
The post has more details worth reading on the infrastructure that goes into the search.
For more news on Twitter search, be sure to follow @twittersearch.

Wednesday, May 18

Inside Search: New Official Google Search Quality Blog

Today, Amit Singhal, Google's head of search quality, announced the creation of a new Inside Search blog. It is an extension of the This Week in Search column that highlights new features and product announcements. As Amit writes,
...we got feedback that people wanted their search news and information as it happens, not just weekly. So, we’re starting Inside Search as a place where you can find regular updates on the intricacies of search and our team. We have more engineers working on search than any other product, and each one of us has stories to tell.
I look forward to hearing from a wide variety of voices on the team. With more than 500 improvements last year, it can be difficult to keep up with the changes; it's sometimes useful to have them pointed out more explicitly.

Perhaps it will also lead to a bit more transparency in search ranking and quality at Google. At least there is an official place for voices to speak publicly.

Saturday, May 7

NY Times: The Stanford Facebook App Class

The NY Times today has an article, The Class That Built Apps, and Fortunes. The Stanford FB app class is CS377W: Creating Engaging Facebook Apps. The class was taught by BJ Fogg, David McClure, and Dan Greenberg.

One takeaway from the class is an important reminder from BJ Fogg,
What smart people do, what engineers tend to do, is overthink, and from the beginning we said to do simple things. But the inclination is to do something fancier, more complicated. What the student teams discovered over time was that the complicated things never worked, and that simple things took off.
It was a hugely popular class, with hundreds of people interested in it. From the NYT article,
Working in teams of three, the 75 students created apps that collectively had 16 million users in just 10 weeks.
A key component of the class is the social aspect of the applications being built. The class is part of the Stanford Persuasive Tech Lab. From the lab's description,
Our lab specializes in persuasion via technology, so this is naturally our focus when studying Facebook. We want to understand how motivation and influence operate on Facebook.
It was a big experiment in virality. As student Johnny Win describes it,
The hardest part of any project is to find the initial traction that will get you users and engagement and build from there. Rather than building a road to the moon, build the first step.
The students learned important lessons. You need people to use a product you build. It's great to capture attention, but it's more important to do something meaningful. There are enough punch-the-monkey, hot-or-not, and similar apps to waste your time on. Create apps that solve a problem that matters. For example, an application like reCAPTCHA helps to digitize books and fight spam.

The Stanford Persuasive Tech Lab has taken on important projects: Health, Peace Dot, and others. I hope that the students were also taught the principles that motivate these projects as part of the app course, in addition to how to create popular apps.

From Search to Knowledge at Google

Techcrunch reports that Search no longer exists as a high-level product group in Google. As part of Google's re-org under new CEO Larry Page, the search group has been renamed the "Knowledge Group". Search Engine Land is reporting on the promotion of Alan Eustace from SVP of Engineering and Research to Google's Senior Vice President, Knowledge. My understanding is that this represents an expanded view of the products in search. Beyond helping people find information, the group's goals include,
... enhancing people’s understanding and facilitating the creation of knowledge.
Although all the details are not public, it sounds as if Udi Manber leads the engineering team on information products that are not core search. The details are speculation on my part, but his responsibilities may include products like Knol, Freebase, and Aardvark. It might also include some of Google's data management tools: Google Refine, Fusion Tables, and Public Data Explorer.

Does this mean that search is no longer a "core" product at Google? I don't think so. Instead, it indicates an astute awareness that search needs to be tied to other projects that manage information - social QA, Wikipedia-like knowledge bases, structured data, and other information tools.

In the academic world, we should also consider information retrieval in the context of tools and communities that create and share information: digital libraries, NLP, information (and relation) extraction, and the semantic web. These are all inter-connected components of the information processing and knowledge management ecosystem. As Google's re-org to create a "Knowledge" team indicates, these communities need to communicate and coordinate effectively towards a broader vision of helping people find, create, analyze, and share information.


Wednesday, May 4

Watch me on Fox's MasterChef USA Season 2

Fox announced that I am one of the 100 contestants chosen for the new season of MasterChef!

Thursday, April 28

Greplin: Personal Search for the Social Network Era

Greplin is a cloud search service that indexes your social network and personal information stored in web services. It provides a central hub for searching all your online data. Greplin is a small startup company with six engineers. Instead of building its own cluster, it leverages Amazon EC2 for indexing capacity. The TechCrunch article reports today that:
They’ve now indexed some 1.5 billion documents. And they’re indexing about 30 million new documents per day.
The TechCrunch article exaggerates the scale issue. The more significant scale issues relate to query volume, and the article does not report on those numbers. Furthermore, a large component of the documents Greplin indexes are short FB and Twitter updates. Greplin has more relaxed indexing requirements than real-time search: in the FAQ Greplin says it can take up to 20 minutes or even up to a day to index your documents.

My current Greplin index has approximately 54,000 documents. It has 30k from Gmail, 7k from Facebook, 17k from Twitter, and around 500 from LinkedIn. The basic search functionality seems reasonable enough. It is very snappy with search as you type. The advanced search capabilities are a bit limited. For example, search by date is missing.

Greplin is still in its infancy. The search interface could benefit from blending document results from different sources into a more unified result list. For example, see the recent work on "aggregated" and "federated" search [e.g. A Methodology for Evaluating Aggregated Search Results from ECIR 2011]. Furthermore, I would like a faceted search UI to support exploratory search. They could learn a lot by looking at the extensive research on Personal Information Management (PIM) and desktop search, like Jamie Teevan's research along with Sue Dumais' work on Landmarks and Stuff I've Seen. (For more on PIM, you can also read Jinyoung Kim, one of my labmates.)

I have significant reservations concerning my data privacy. Do I trust Greplin with my indexed data? It needs at least partial copies to show snippets of results. At least it claims I can delete my indices for a service at any time. However, it is a very coarse mechanism. There is no version of a robots.txt for my personal data so that I can specify mechanisms for "do not index" or "do not cache" at a granular level.

I have a few invites. If you want to try it out leave a request in the comments.

Tuesday, April 19

ECIR 2011 Best Paper Awards and Other Highlights

First up are the best paper awards, which were announced tonight. There were two awards, one for best paper and one for best student paper.
The paper Learning Models for Ranking Aggregates by Craig MacDonald and Iadh Ounis was also nominated.

The best poster prize was A Novel Reranking Approach Inspired by Quantum Measurement by Zhao et al. (via Owen Phelan).

A trend at the conference is the handling of ranking and evaluating "aggregate" results, aka "blended" results or "universal search", where results from multiple verticals are blended into a single presentation. In addition to the above two papers, there is:
Other trends in the conference appear to be:
  • Crowd sourcing evaluation (an entire session)
  • Realtime and Microblog (Twitter) applications (multiple papers across tracks)
The DDR 2011 workshop on diversity in document retrieval also proved popular. The proceedings are available for download. There is a fair bit of discussion on Twitter, #ddr2011.

There are two other papers from UMass that I want to highlight:

Thursday, April 7

SIGIR 2011 Results

Today, the SIGIR paper acceptance/rejections were sent out. What was your result? Let me know in the comments. What did you think of the review quality? Will there be a new influx of new submissions to online journals for rejected papers?

This year there were 545 papers submitted and 108 were accepted (19.8%). Despite controversy that some papers might not receive oral presentations, all papers will have full presentations.

Instead of complaining about the reviewers of my rejected paper, I would instead like to thank the reviewers for their time and consideration, regardless of the outcome, because writing reviews takes a lot of time and effort.

My congratulations to the accepted authors. I look forward to the papers.

Tuesday, March 29

WWW 2011 Day 2: More Workshops and Tutorials

This is more on the WWW conference in India this week. I'm not attending, but I wanted to point out a few things that caught my attention. Today there are more tutorials and workshops. See also the workshops and tutorials from Day 1.

Workshops

Social Media Engagement (SoME 2011) - It focuses on how to measure user engagement (captivated and motivated to participate) and satisfaction with social media.

SemSearch 2011 - The fourth workshop on Semantic Search. The most interesting aspect of the workshop is the data challenge. What I find most compelling is the manually constructed "List" or "Type" queries that are more complex than the other entity queries. The manually constructed queries utilize the attributes and relationships, which make semantic data unique, e.g. [Japanese-born players who have played in MLB where the British monarch is also head of state].

Tutorials

Social Media Analytics - is being taught by Jure Leskovec from Stanford. The slide materials are available online. Last fall he also taught a related class, Social and Information Network Analysis. He is currently teaching a class on Mining Massive Datasets.

Web-Based Open Domain Information Extraction is being presented by Marius Pasca from Google. He will be giving a related tutorial, Web Search Queries as a Corpus at the upcoming ACL conference in Portland.

Latent Variable Models on the Internet - Amr Ahmed and Alex Smola are presenting work on using graphical models on web data. From the description: "we will describe inference algorithms for collaborative filtering, recommendation, latent dirichlet allocation, and advanced clustering models."

Social Recommender Systems - Ido Guy and David Carmel from IBM Research are giving a tutorial on social recommender systems. See also the recent SRS2011 workshop which was also organized by Ido.



Monday, March 28

WWW 2011 this week

The WWW 2011 conference is happening this week in Hyderabad India. I'm not attending, so drop me a message or an email with notes or highlights.

Be sure to check out the full program.

Today is Tutorial and Workshop day,

Relevant Tutorials

Distributed Web Retrieval by Ricardo Baeza-Yates from Yahoo! Research Barcelona.

Selected Workshops

USEWOD2011 - 1st International Workshop on Usage Analysis and the Web of Data
This workshop will investigate the synergy between semantics and semantic-web technology on the one hand and analysis and mining of usage data on the other hand.

It is encouraging to see a discussion beyond blatant spam to more subtle issues of authority, credibility, and reputation.

Temporal Web Analytics Workshop
TWAW focuses on temporal data analysis along the time dimension for Web data that has been collected over extended time periods.

Stay tuned for more news from WWW later this week!

Friday, March 4

Evgeniy Gabrilovich wins 2010 Karen Spärck Jones Award

The British Computer Society IRSG announced that the 2010 Karen Spärck Jones Award goes to Evgeniy Gabrilovich. Evgeniy is a Senior Research Scientist and Manager of the NLP & IR Group at Yahoo! Research.

He will present a keynote talk at the upcoming ECIR 2011 conference later this month. His presentation will be Ad Retrieval Systems in vitro and in vivo: Knowledge-Based Approaches to Computational Advertising.

Congratulations Evgeniy! I have heard a lot of great things from Yahoo! Research interns who worked under his guidance.

Thursday, March 3

Google's War on Content Farms: Project Big Panda

In late February Google launched a significant update to its ranking algorithm to address "shallow content" pages. The change has been referred to as the "Farmer" update externally and internally it is known as "Panda".

Amit Singhal and Matt Cutts posted about the change on the Google blog, Finding more high quality sites in search. It reduced the rankings of "low quality sites" that aggregate content from other websites without adding significant value for users. According to the post, the update affected 11.8% of queries. They also launched the Chrome Blocklist Extension to let people block websites from their Google results. The O'Reilly Radar published an article with a very good overview of the discussion.

What is behind the change? The most informative article is a recent Wired interview by Steven Levy, The ‘Panda’ That Hates Farms. It interviews Matt Cutts and Amit Singhal, who managed the update.

What was the answer? In short, they built a document quality classifier trained on lots of rater data. Here are some of the questions they asked raters from the article:
  • Would you be comfortable giving this site your credit card?
  • Would you be comfortable giving medicine prescribed by this site to your kids?
  • Do you consider this site to be authoritative?
  • Would it be okay if this was in a magazine?
  • Does this site have excessive ads?
These questions seem to probe the authoritativeness and trustworthiness of the content on a page. The results were also confirmed by an 84% overlap between sites downgraded in the change and those that people blocked using the Chrome extension, even though the extension's data is not used as a feature in the update.

How did Google become overrun with almost-spam content? Amit sheds a bit of light on the question in one of his answers:
So we did Caffeine in late 2009. Our index grew so quickly, and we were just crawling at a much faster speed. When that happened, we basically got a lot of good fresh content, and some not so good. The problem had shifted from random gibberish, which the spam team had nicely taken care of, into somewhat more like written prose. But the content was shallow.
The interview then gets bogged down in bigger issues around editorial process and transparency, which are important but not as technically interesting.

Wednesday, March 2

HeyStaks launches: Social and Collaborative Web Search App

Heystaks is a collaborative search startup that launched publicly yesterday at DemoCon. Heystaks has a browser / iPhone app that lets you share your search experiences. It lets you save searches and pages you find into "Staks" and then share them with your "Search Buddies".

VentureBeat has an article covering their launch which you should probably check out. Here is the video from their website:

The chief scientist at the company is Barry Smyth, a professor at University College Dublin.

It's a bit early for a full review, but I tried it out and it seems promising. I have some privacy concerns about browser toolbars that save and share my search history, especially when the service is oriented towards public sharing of the information.

HeyStaks reminds me of the failed Yahoo! Search Pad, but with a more social focus, and it works across search engines. I hope it has better luck.

I would like to see the service evolve to support collaboration in the search itself, beyond saving and sharing results; for example, the deeper integration that Gene Golovchinsky, Jeremy Pickens, and others have been advocating. See their paper, Algorithmic mediation for collaborative exploratory search, which won the best paper award at SIGIR 2008.

My congratulations to HeyStaks on the launch. I look forward to the Chrome and Android versions that I hope will soon follow.

News Highlights: Bing Price search, Yahoo! Boss, Google Data Publishing, and more

Here is a round up of news from around the web:
  • Bing adds price recognition to its query support. You can now search for "digital camera under $200" and it will automatically add the price filter. It is a step in the right direction. How about something a bit harder, such as "Canon 12 MP Camera under $200", with manufacturer and megapixel attribute restrictions? (A toy sketch of this kind of query parsing follows this list.)
  • Google announced a new format for publishing datasets to its Public Data Explorer. From the announcement: "We created this format to address a key problem in the Public Data Explorer and other, similar tools, namely, that existing data formats don’t provide enough information to support easy yet powerful data exploration by non-technical users."
  • Scala tip: check out the REPL for interactive debugging.
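
Here is the toy sketch of extracting a price restriction from a free-text query, as promised above. It is my own regex-based illustration, not Bing's implementation:

import re

# Matches phrases like "under $200" or "less than $50".
PRICE_PATTERN = re.compile(r"\b(under|below|less than)\s*\$(\d+)", re.IGNORECASE)

def extract_price_filter(query):
    """Return (remaining query, max price) if a price restriction is found."""
    match = PRICE_PATTERN.search(query)
    if not match:
        return query, None
    max_price = int(match.group(2))
    remaining = (query[:match.start()] + query[match.end():]).strip()
    return remaining, max_price

print(extract_price_filter("digital camera under $200"))
# -> ('digital camera', 200)

The harder query is harder because manufacturer and megapixel restrictions require an attribute lexicon or product catalog to recognize "Canon" and "12 MP" as structured constraints rather than plain keywords.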

Tuesday, March 1

WhistlePig: A minimalist real-time search engine

William Morgan recently announced the release of Whistlepig, a real-time search engine written in C with Ruby bindings. It is now up to release 0.4. Whistlepig is a minimalist in-memory search system with ranking by reverse date. You can read William's blog post for his motivations for writing it. Here is a description from the current readme:
Roughly speaking, realtime search means:
- documents are available to queries immediately after indexing, without any reindexing or index merging;
- later documents are more important than earlier documents.

Whistlepig takes these principles to an extreme.
- It only returns documents in the reverse (LIFO) order to which they were added, and performs no ranking, reordering, or scoring.
- It only supports incremental indexing. There is no notion of batch indexing or index merging.
- It does not support document deletion or modification (except in the special case of labels; see below).
- It only supports in-memory indexes.

Features that Whistlepig does provide:
- Incremental indexing. Updates to the index are immediately available to readers.
- Fielded terms with arbitrary fields.
- A full query language and parser with conjunctions, disjunctions, phrases, negations, grouping, and nesting.
- Labels: arbitrary tokens which can be added to and removed from documents at any point, and incorporated into search queries.
- Early query termination and resumable queries.
- A tiny implementation.
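
To make the model concrete, here is a toy Python sketch of the core idea: an in-memory inverted index with incremental indexing and strictly reverse-chronological results. This is my own illustration, not WhistlePig's actual C or Ruby API:

from collections import defaultdict

class ToyRealtimeIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # term -> doc ids, oldest first
        self.next_doc_id = 0

    def add(self, text):
        """Index a document; it is queryable as soon as this returns."""
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, *terms):
        """Conjunctive query; newest documents first, no scoring or reordering."""
        sets = [set(self.postings[t.lower()]) for t in terms]
        hits = set.intersection(*sets) if sets else set()
        return sorted(hits, reverse=True)  # LIFO: later doc ids first

idx = ToyRealtimeIndex()
idx.add("realtime search in c")
idx.add("ruby bindings for realtime search")
print(idx.search("realtime", "search"))  # [1, 0]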

Monday, February 28

Palantir: Next Gen Platform for Information Analysis

Palantir is a very ambitious new tech company building a high-powered information analysis platform. They currently have products targeted for the government and the financial industries. Their product is a highly specialized enterprise data system to support intelligence and business analysts.

What does Palantir do?
... the most central hard problem that we address in trying to enable the analyst is data modeling, the process of figuring out what data types are relevant to a domain, defining what they represent in the world, and deciding how to represent them in the system. At Palantir we make sure our data model (ontology) is both flexible and dynamic, and that it mirrors the concepts people naturally use when reasoning about the domain.
The platform handles both structured and unstructured information and performs extraction and data integration. See their infrastructure page and their white papers and videos for a few more details.

Their data platform is built around objects. An object in their platform has four components:
- Properties: text object attributes
- Media: images, video, and binary data
- Notes: free text
- Relationships: links between objects

Clients can specialize this generic object into specific types using their "Dynamic Ontology" tool to define the semantics. The platform has one fixed schema with 5 tables: object, property, notes, media, and object-object. An object is linked to one or more data sources, which is critical for data lineage and access controls.
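
For illustration, here is a rough sketch of what such a generic object model might look like in code. The names and structure are my guesses based on the description above, not Palantir's actual implementation:

from dataclasses import dataclass, field

@dataclass
class AnalysisObject:
    object_type: str                                   # resolved against a client-defined ontology
    properties: dict = field(default_factory=dict)     # text attributes
    media: list = field(default_factory=list)          # images, video, binary data
    notes: list = field(default_factory=list)          # free text
    relationships: list = field(default_factory=list)  # links to other objects
    sources: list = field(default_factory=list)        # data lineage / access control

person = AnalysisObject(
    object_type="Person",                  # a type defined via the "Dynamic Ontology"
    properties={"name": "Jane Doe"},
    sources=["wire_report_2011_02_28"],    # hypothetical data source id
)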

A key component of the platform is search over the objects. According to their blog, their scenario differs from web search in two key ways:
  • Realtime indexing and querying – we need information to be available immediately as it changes in the system.
  • Leak-proof access controls – we need the search engine to help us make sure that we don’t have information leaking across access control boundaries.
They go into more detail on their modifications to Lucene for their use cases in two blog posts, Search with a Twist Part I and Part II. From the comments, it sounds like they are using a custom branch of Lucene 2.4.
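
To illustrate the leak-proof requirement, here is a toy sketch in which access controls are enforced during matching rather than by filtering results afterwards, so documents a user cannot see never influence the result set. This is only a conceptual illustration; Palantir's actual approach lives in their custom Lucene branch:

# Each document carries an ACL token (hypothetical data).
docs = [
    {"id": 1, "text": "acquisition target memo", "acl": "finance"},
    {"id": 2, "text": "acquisition press release", "acl": "public"},
]

def search(query_terms, user_acls):
    results = []
    for doc in docs:
        if doc["acl"] not in user_acls:
            continue  # enforced before matching, so nothing can leak
        if all(t in doc["text"].split() for t in query_terms):
            results.append(doc["id"])
    return results

print(search(["acquisition"], {"public"}))             # [2]
print(search(["acquisition"], {"public", "finance"}))  # [1, 2]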

Palantir's platform combines data processing over large heterogeneous datasets, filtering, mapping, visualization, and search in unique ways to create a compelling toolset. They built an intelligence platform that the government could not have built itself, by recruiting a team of uber-geek talent lured by hip Silicon Valley panache worthy of James Bond.

Friday, February 25

Google "Recipe View" Search Disappointing and Dangerous

Today Google announced Recipe View in a blog post. It is a specialized view of search results restricted to recipes. Recipe View lets you search for recipes without adding text to your query. It searches over recipes from most of the major recipe websites. Google is using semantic data that is marked up using the rich snippets format. I'm very excited by the idea. I want to like it, but I don't. Let me explain.

It is exciting to see Google leverage structured data for recipe search. Exploratory search and faceted metadata offer a lot of potential to improve food search. However, I'm disappointed by Google's incarnation. The biggest feature the interface adds is the ability to restrict results by whether or not a recipe contains a particular ingredient. I don't find this very interesting or useful; did anyone who actually cooks try it? The other facets are similarly lacking in utility: calories aren't as meaningful as sodium, sugar, and fat content. They could have offered genuinely useful facets, such as chef/publisher, cuisine, vegan/vegetarian, gluten-free, cooking technique, and complexity, but they ignored these. Clearly, they didn't put much effort or thought into this release.

More importantly, I think the Google Recipe View vertical is currently dangerous and detrimental. When activated, it effectively excludes content from blogs and small website publishers, because these websites do not use the rich snippet format. Rich snippet markup provides additional metadata, but it should not be a requirement for inclusion in Recipe View. It is pretty easy to automatically identify whether or not a page contains a recipe using a text classifier and search logs (see the sketch below). Personally, I often find the content from these websites to be the most useful and interesting. Until Google fixes this issue, webmasters and publishers should consider whether it is worth their effort to adopt the markup.
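
To back up the claim that recipe detection is tractable, here is a back-of-the-envelope text classifier. The training data is invented and tiny; a real system would train on far more pages and incorporate search log signals:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "preheat oven to 350 combine flour sugar and butter bake 25 minutes",
    "whisk eggs add 2 cups milk simmer until thickened serve warm",
    "the senate voted today on the budget bill after lengthy debate",
    "our review of the latest digital camera covers lens and battery life",
]
is_recipe = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(pages, is_recipe)

# Cooking vocabulary should push unseen recipe text toward class 1.
print(clf.predict(["saute the onions add garlic and simmer 10 minutes"]))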

I would send Google Recipe View back to the kitchen... it's undercooked and lacks seasoning.

Note: I recently started a food blog, which does not use rich snippet markup (yet).

Friday, February 11

WSDM 2011 - Best paper awards

Best paper award: Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms by Lihong Li, Wei Chu, John Langford, and Xuanhui Wang

Best student paper award: Correcting for Missing Data in Information Cascades by Eldar Sadikov, Montserrat Medina, Jure Leskovec and Hector Garcia-Molina


Thursday, February 10

"Bing Dialogue Model"by Harry Shum - Second WSDM 2011 Keynote


The second keynote talk at WSDM 2011 was an intriguing peek at Bing's model of user intent, by Harry Shum, VP of Search Product Development at Microsoft.



* Challenges at launch
* Google market share has been steadily growing from 2005-2008 (Bing launch)
* Google is a consumer brand and a habit

* Bing gained 5.1% query traffic share (worldwide) since launch

* 3 elements of Search Quality
1) Relevance
  • Ranking based on meaning not keywords
  • Direct answers
2) Speed
  • Reduce effort to complete tasks
  • Direct answers
  • Fewer clicks
3) Ease of Use (User experience)
  • Intuitive query interface
  • Relevance is hard

* Demo of Bing features

1) “Quick access” – surfacing customer service phone # for query {delta airline}

2) Enhanced movie results for queries that match movie titles
* Rent, buy, watch online, reviews, posters

3) Microsoft Academic Search
* Faceted interface: filter results by author, venue
* Summary pages for author information
* Academic activity
* Co-author graph
* Disambiguation of author names

4) Summary of important information for queries that match geo-locations
* Weather
* Overview of tourist destinations
* Maps

5) Parsing natural language queries
* Parse the query {flight to Taipei feb 12 returning feb 13} to provide fast access to Bing Travel
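
As a toy illustration of what this kind of parsing involves (my own regex sketch, nothing to do with Bing's actual query understanding):

import re

# Recognizes queries shaped like "flight to <city> <date> returning <date>".
TRAVEL = re.compile(
    r"flight to (?P<dest>\w+) (?P<depart>\w+ \d+) returning (?P<ret>\w+ \d+)",
    re.IGNORECASE,
)

def parse_travel_query(query):
    m = TRAVEL.search(query)
    return m.groupdict() if m else None

print(parse_travel_query("flight to Taipei feb 12 returning feb 13"))
# -> {'dest': 'Taipei', 'depart': 'feb 12', 'ret': 'feb 13'}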

6) Music search
* Enhanced results for queries that match musician names
* Preview songs, lyrics, bio

7) Facebook Integration
* Results that were liked by Facebook friends
* Surface Facebook profiles in searches of matching friend names

* Internet searchers are becoming more Task Centric
* Decision making: 66% of people use search to make decisions
* Top search tasks: Entertainment, Games, Health, Travel, Shopping, Directions,…

* Tasks are becoming more sophisticated
* Longer queries
* Longer sessions

* 10 blue links are no longer sufficient
* Instead there’s a need for organized “Whole Page” experience
* Search Paradigm Shift
* From “hit or miss” model to “dialogue” model
* Understanding query intent
* Incorporating structured data into search results
* Relevance on the session level
* Minimize the effort to complete a task

* Bing Dialogue Model (see the image above)

* Four levels of dialogue

1) Query level
* Query auto-completion
* Spelling correction
* Interaction with user in mobile devices with touch screens
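
As a toy illustration of query-level dialogue, here is a minimal auto-completion sketch that ranks logged queries extending the user's prefix by frequency. The query log is invented; production systems use tries plus much richer signals such as freshness and personalization:

from collections import Counter

query_log = ["bing travel", "bing maps", "bing maps", "bing dialogue model"]
counts = Counter(query_log)

def complete(prefix, k=3):
    matches = [q for q in counts if q.startswith(prefix)]
    return sorted(matches, key=counts.get, reverse=True)[:k]

print(complete("bing ma"))  # ['bing maps']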

2) Document level
* Title, snippet, deep links presentation
* Extended document preview on hover over the result

3) Page level
* Quick Tabs for relevant verticals
* Entity-based result summary
* Algorithmic results
* Related query suggestions
* Search history

4) Session level
* History-aware results comparisons