Thursday, October 27

CIKM 2011 Industry: Toward Deep Understanding of User Behavior on the Web

Toward Deep Understanding of User Behavior on the Web
Vanja Josifovski, Yahoo! Research

Where is user understanding going?

What is the future of the web?
 - prevalent - everyone and everything
 - mutual understanding

Personalized laptops

Personalization today
 - Search personalization.  low entropy of intent.  Difficult to improve over the baseline
 --> effects are small in practice

Content recommendation and ad targeting
 - High entropy of intent
 - Still very crude with relatively low success rates

How do we move to the next level?
 - more data, better reasoning, and scale

Data today
 - searches, page views
 - connections: friends, followers, and others
 - tweets

The data we don't have
 - jetlagged? need a run? need a pint? worried about government debt?
 - the observable state is very thin

How to get more user data?
 - Only with added value to the user
 - Must be motivated to provide their data

Privacy is not dead, it's hibernating
 - the impact of data leaks online is relatively small

  - State of the art as we know it.
  - Popular methods that seem to work well in practice
  --> Learn relationship between features: r_ij = x_i C z_j
  --> Dimensionality reduction (random, topic models, recommender systems: r_ij = u_i v_j)
  --> Use of external knowledge: smoothing
      --> taxonomies
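
As a rough illustration of the two score forms above, the sketch below computes a bilinear score r_ij = x_i C z_j from user features x_i and item features z_j, and a factorized score r_ij = u_i v_j from low-dimensional latent vectors. All feature values, the matrix C, and the function names are made-up for illustration; they are not taken from the talk.

```python
def bilinear_score(x, C, z):
    """r_ij = x_i C z_j: user features x, item features z, weight matrix C."""
    return sum(x[a] * C[a][b] * z[b]
               for a in range(len(x))
               for b in range(len(z)))


def factor_score(u, v):
    """r_ij = u_i v_j: dot product of two latent vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy values.
x_i = [1.0, 0.0]        # user features
z_j = [0.5, 2.0, 1.0]   # item features
C = [[0.2, 0.0, 1.0],   # C[a][b]: interaction of user feature a with item feature b
     [0.0, 0.3, 0.0]]

r_bilinear = bilinear_score(x_i, C, z_j)
r_factor = factor_score([0.5, 1.0], [2.0, 0.1])
```

In the bilinear form the learned object is the matrix C; in the factorized form it is the pair of latent vectors, which is where the dimensionality reduction comes from.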

An elaborate user topic model (Ahmed et al., KDD 2011; Smola et al., VLDB 2010), yet so simple
 - the user's behavior at time t is a mixture of their behavior at time t-1 and the global overall behavior
 - Very simple model
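
The one-line intuition above can be sketched as a convex combination of two topic distributions. This is only a toy reading of the note, not the model from the cited papers (which is considerably richer); the mixing weight and the topic names are assumptions.

```python
def mix_behavior(previous, global_dist, weight_previous=0.7):
    """Blend the user's previous topic distribution with the global one.

    weight_previous is a hypothetical mixing weight between 0 and 1.
    """
    topics = set(previous) | set(global_dist)
    return {
        topic: weight_previous * previous.get(topic, 0.0)
        + (1.0 - weight_previous) * global_dist.get(topic, 0.0)
        for topic in topics
    }


# Hypothetical distributions over topics (each sums to 1).
user_at_t_minus_1 = {"sports": 0.8, "finance": 0.2}
global_behavior = {"sports": 0.3, "finance": 0.3, "travel": 0.4}

user_at_t = mix_behavior(user_at_t_minus_1, global_behavior)
```

Because both inputs sum to 1 and the weights sum to 1, the result is again a distribution, and topics the user has never touched (here "travel") get some mass from the global behavior.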

Using External Knowledge
 - Aggrawal et al., KDD 2007, KDD 2010

Is there more to it?
 -> What is the relative merit of the methods?
 -> They use the data in the same way and are mathematically very similar

Where is the limit? 
  -> what is the upper bound on the performance increase on a given dataset with this family of algorithms?

 - Today - MapReduce is a limiting barrier for many algorithms
 - Need the right abstractions in parallel environments
 - Move towards shared in-memory, message-passing models (like Giraph)
 -- (we'll work this out)
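
To make the message-passing idea concrete, here is a minimal single-machine sketch of a vertex-centric computation in the style of Pregel/Giraph: in each superstep, vertices send their value to their neighbors and keep the maximum they have seen. It is a toy loop, not Giraph's API; the function name and the example graph are assumptions.

```python
def propagate_max(edges, initial_values):
    """Spread the maximum vertex value through a directed graph in supersteps."""
    values = dict(initial_values)
    active = set(values)  # vertices whose value changed in the last superstep
    while active:
        # Each active vertex sends its current value along its outgoing edges.
        inbox = {vertex: [] for vertex in values}
        for vertex in active:
            for neighbor in edges.get(vertex, []):
                inbox[neighbor].append(values[vertex])
        # Each vertex reads its messages and updates its own state.
        active = set()
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                active.add(vertex)
    return values


# Hypothetical three-vertex cycle a -> b -> c -> a.
edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
result = propagate_max(edges, {"a": 3, "b": 6, "c": 2})
```

The state stays in memory across supersteps, and only vertices that changed keep sending messages; an iterative algorithm like this would otherwise need one MapReduce job per iteration.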

Workflow complexity
 - Reality bites (Hatch et al., CIKM 2011): massive workflows that run for hours.


1) Deep user understanding - the tale of three communities

 - Good formalisms that function in practice
 - emphasis on metrics and standard collections

 - seamless running of complex algorithms
 - new parallel computation paradigms

Towards deeper understanding
1) get users to give you more data by providing value
2) significantly increase the complexity of the models
3) scale in terms of data and system complexity

CIKM 2011 Industry: Model-Driven Research in Social Computing

Model-Driven Research in Social Computing
Ed Chi

Google Social Stats
250k words per minute on blogger, 360 million words per day
100M+ people take a social action on YouTube

Google+ Stats
40 million joined since launch
2x-3x more likely to share content with one of their circles than to make a public post

Hard to talk about because the systems are changing quite rapidly
Ed joined Google to work on Google+

Social Stream Research
 - Factors impacting retweetability (IEEE Social computing)
 - Location field of user profiles

Motivation for studying languages
 - Twitter is an international phenomenon
 - How do users of different languages use Twitter?
 - How do bilingual users spread information across languages?

Data Collection & Processing
 - 62M tweets (4 weeks), Spritzer feed, April-May/June 2010
 - Language detection with Google language API + LingPipe
 - 104 languages
 - Top 10 languages

English - 51%
Japanese - 19%
Portuguese - 9.6% (mostly Brazil)
Indonesian - 5.6%
Spanish - 4.7%

Sampled 2000 random tweets
 - 2 human judges for each of the top 10 languages

Problems with French, German, and Malay.
Accuracy of Language Detection
 - Two types of errors: poor recognition of "tweet English", and of tweets with only 1-2 words

Korean - recommend for conversation tweets
German - promote tweets with URLs

English serves as a hub language

Implications - need to understand language barriers when building a global network
 - building a global community
 - the need for brokers of information between languages

Visible Social Signals from Shared Items (Chen, et al. CHI 2010/CHI 2011)
- After all day without WIFI, he would like a summary of what's happening in his social stream
- Eddi - Summarizing Social Streams
  --> What's happened since you last logged in
  --> A tag cloud of entities that were mentioned
  --> A topic dashboard where tweets are organized into categories to drill into

Information Gathering/Seeking
 - The Filtering problem - I get 1,000+ things in my stream, but only have time for 10.  Which ones should I read?

 - The Discovery Problem
 -- millions of URLs are posted,
 - twitter as the platform
 - URLs as the medium
 - a personal newspaper that produces personal headlines

URL Sources (from tweets) -> Topic  Relevance Model, and Social Network Model

URL Sources
 - Considering all URLs was impossible
 -- FoF: URLs from followee-of-followers
  --> Social local news is better
- Popular - URLs that are popular across whole of Twitter
   --> popular news is better

Topic Relevance Model
 - A user Tweets about things, which creates a term vector profile.
 - Cosine similarity with URLs
 - Topic Profile of URLs - Built from tweets that contain the URL, but tweets are short and RT makes word frequencies goofy.
 - Adopt a term expansion technique, extract nouns from tweet and feed it into a Wikipedia search engine as a topic detection technique

Topic Profile of User
 - Self-topic
 - Information producer - the things they tweet about
 - Information gatherer - what they like to read
 - Build profiles from froms and aggregate them.

Social Module
 - Take FoF neighborhood, and count the votes for a URL
 - Simple counting doesn't work very well.
 - Votes are weighted using social network structure
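
The notes do not say how the votes were weighted, so the sketch below just contrasts simple counting with weighted counting under an assumed per-person weight (for instance, how strongly each person is connected to the user's neighborhood). The names, the weights, and `rank_urls` are hypothetical.

```python
from collections import defaultdict


def rank_urls(posts, vote_weight=None):
    """Rank URLs by summed votes.

    posts: list of (poster, url) pairs from the FoF neighborhood.
    vote_weight: optional mapping poster -> weight; missing means weight 1.
    """
    scores = defaultdict(float)
    for poster, url in posts:
        weight = 1.0 if vote_weight is None else vote_weight.get(poster, 0.0)
        scores[url] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical posts and weights.
posts = [("ann", "url1"), ("bob", "url1"), ("cat", "url2")]
weights = {"ann": 1.0, "bob": 1.0, "cat": 3.0}

simple_ranking = rank_urls(posts)
weighted_ranking = rank_urls(posts, weights)
```

Under simple counting url1 wins with two votes; once the single vote from a strongly connected person counts for more, url2 moves to the top.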

Study Design
 - Each subject evaluated 5 URL recommendations from each of the 12 algorithms.  Show 60 URLs in a random order and ask for a binary rating.

Summary of Results
 - Global popularity (1%) -- 32.50% are relevant, not bad, but not good enough
 - FoF only - 33% - naive by itself without voting doesn't work great
 - FoF voting method - 65% (social voting only)
 - Popularity voting - 67%
 - FoF Self-Vote - 72% best performing

Algorithms differ not only in accuracy!
 - Relevance vs. Serendipity in recommendations (tension between discovery and affirming aspect)
 -> "What I crave is surprising, interesting, whimsy" this is where the value is
 -> Two elements to surprise: 1) have I seen this before? 2) non-obvious relationships between things

Design Rule
- Interaction costs determine number of people who participate
 - Reduce the interaction costs, then you can get a lot more people into the system
 - For Google+ this is key to deliver this to people

Japanese crams more information into a tweet.  It is used more for conversation than broadcast in these environments

CIKM Industry talk: Jeff Hammerbacher on Analytical Platforms

Experiences Evolving a New Analytical Platform: What Works and What's Missing
Jeff Hammerbacher, Cloudera

Built the infrastructure team at Facebook, 0 to 2 PB of data

Take the infrastructure and make it available as open source.

The true challenges in the task of data mining.  Creating a data set with the relevant and accurate information, determining the appropriate analysis techniques

Exploratory data processing (IBM)

Taught the data science course at Berkeley earlier this year

1) Store all your organization's data in one place
  - data first, questions later
  - store first, structure later

Engineers are constrained when you force them to stop and model the data, which is constantly evolving.

Raw storage: $0.04 / GB ($67 for a 2 TB disk). Single HDFS instance > 50 PB on commodity hardware in one center

 Enable everyone to party on the data.  Use files because developers are not analysts.

Like the LAMP stack, there is a coherent analytical data management

Better underlying abstractions

Platform - Substrate
 - commodity servers (a big warehouse)
 -- open compute project (FB open source)
 - open source OS
  -- Linux
 - Open source config management
  -- Puppet, Chef
 - Coordination service
  -- ZooKeeper

Platform - Storage
 - Distributed, schemaless storage
 --> HDFS, Ceph (UCSC), MapR
 - Append-only table storage and metadata
  --> Avro, RCFile, HCatalog (Also: Thrift, Protocol Buffers)
 - Mutable table storage and metadata
 -- HBase

- Cluster resource management
 -- YARN (inter-job scheduling, like grid engine, for data intensive computing)
 -- Mesos
- Processing Frameworks
 -- MapReduce, Hamster (MPI), Spark, Dryad, Pregel (Giraph), Dremel
- High level interfaces
  -- Crunch (like Google's FlumeJava), DryadLINQ, Pig, Hive

- Tool access
- Data ingest
 -- Sqoop, Flume
(Document ingest is still an area that needs work.  There are crawlers, but they're still immature)

fat servers with fat pipes
2U, 24 GB RAM, 12 drives (bigger nodes)
OS support for isolation (VMs have downsides)
Linux containers
  -- Google contributed initial patches, used for Borg
Local file system improvements
 -- btrfs

 - scalaql

CIKM 2011 Industry: Freebase: A Rosetta Stone for Entities

John Giannandrea, Google

What is Freebase?
 -> A machine representation of things in the world. (Person, place, thing in the world)
 -> Instead of working in the domain of text, we work in the domain of strongly identified things
 -> Each object has an identifier, once you have it, it will also refer to an identity

Properties - relationships between objects
 - edges between the entity ids
 - edges are directional
 - properties create meaning

 - encode knowledge

 - a categorization of an entity
 - An entity can have multiple Types in Freebase
 - "Co-types" - Types are a mix-in
 - e.g. Arnold (politician, actor, athlete)
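
A minimal sketch of the data model described in these notes: entities with ids, directed property edges between ids, and several types mixed into one entity. The class, the ids, and the property names are hypothetical; this is not Freebase's storage format or API.

```python
from collections import defaultdict


class EntityGraph:
    """Toy entity graph: typed nodes and directed, labeled edges."""

    def __init__(self):
        self.types = defaultdict(set)   # entity id -> set of type names
        self.edges = defaultdict(set)   # (subject id, property) -> object ids

    def add_type(self, entity, type_name):
        # An entity may carry any number of types ("co-types").
        self.types[entity].add(type_name)

    def add_edge(self, subject, prop, obj):
        # Directed: subject --prop--> obj. The reverse edge is not implied.
        self.edges[(subject, prop)].add(obj)

    def objects(self, subject, prop):
        return self.edges.get((subject, prop), set())


graph = EntityGraph()
for type_name in ("politician", "actor", "athlete"):
    graph.add_type("entity:arnold", type_name)
graph.add_type("entity:film_1", "film")
graph.add_edge("entity:arnold", "acted_in", "entity:film_1")
```

Looking up `acted_in` from the film back to the person returns nothing, which is the point of directional edges: a reverse property would have to be stored or declared separately.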

The real world is extremely messy.

Knowledge you can use
 - the current state
 - 25 Million topics (entities)
 - 500 million connections
 - 2x the size it was last year

>= 10 instances, 5790 types
1772 commons (survived community scrutiny)
4019 bases (people created)

Identity matching
 - reconciliation at scale
 - Wikipedia, Wordnet, Library of congress terms, Stanford library
 - any large open-source term collection, they have tried to import into Freebase

-> How? Whatever works.  MapReduce, Google Refine, and human judgment
-> This is possible if you know what an entity is.  (IBM example)

Freebase as a rosetta stone
 - keys
 - behind the websites, there is a structured database with keys (relational db with tables that have primary keys)
 - all of these keys leak out onto the web, "shakira" in the url
 - In the Freebase system they try to collect these keys to link the entity to external websites

URLs and Freebase keys
 - accrete the URLs and keys onto the object
 - Names are just another key (the entities themselves are the same across languages)
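
The "Rosetta stone" idea can be sketched as an index from (namespace, key) pairs to one entity id, so that keys leaking out of different websites, and names in different languages, all resolve to the same object. The namespaces, keys, and ids below are invented for illustration.

```python
class KeyIndex:
    """Toy index that accretes external keys onto entity ids."""

    def __init__(self):
        self._entity_by_key = {}    # (namespace, key) -> entity id
        self._keys_by_entity = {}   # entity id -> list of (namespace, key)

    def attach(self, entity, namespace, key):
        self._entity_by_key[(namespace, key)] = entity
        self._keys_by_entity.setdefault(entity, []).append((namespace, key))

    def resolve(self, namespace, key):
        """Return the entity id for an external key, or None if unknown."""
        return self._entity_by_key.get((namespace, key))

    def keys_for(self, entity):
        return list(self._keys_by_entity.get(entity, []))


index = KeyIndex()
# Hypothetical keys taken from URLs on two sites, plus names as keys.
index.attach("entity:shakira", "site_a", "shakira")
index.attach("entity:shakira", "site_b", "artist/12345")
index.attach("entity:shakira", "name:en", "Shakira")
index.attach("entity:shakira", "name:es", "Shakira")

same_entity = index.resolve("site_a", "shakira") == index.resolve("site_b", "artist/12345")
```

Once two sites' keys resolve to the same id, their records can be joined without any string matching on names.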

 - Freebase is schemaless
 - It is fundamentally based on a graph store
 - Schema is described in the graph itself, just as the data ("Type: type")
 - The person "type" is an entity with an id, "Type:type:person"
 - Put the predicates into the graph system so that it can be updated

Google API to Schema Data
 - MQL read (a query language for inspecting the Freebase graph)

How does Google use Freebase?
 - "I work in the search division"

Time in Freebase
 - everything has a start date and end date

How good is the quality?
 - varies depending on the entities (e.g. presidents are high quality, but for obscure books there may be some duplicates)
 -> 99% accuracy
 -> curate the top 100k entries
 -> we'd rather not import data than import data that is bad
 -> (We imported the open library catalog, which has lots of duplicates.  never again.)

 - 25 M entities, 2x from last summer, 100M by next year
 - It depends on the domain
 - For common queries in search engines, it does very well
 - search engines handle lots of queries for celebrities, common places on earth

Confidence on facts
 - common criticisms 1) it's not a real database, 2) the assertions are not given weight, it doesn't capture uncertain facts
 - you can create a mediated way of doing that in the schema
 - how do you deal with controversial facts?  1) careful with type definitions.  countries are hard. (use UN definition)  unusual categories.  FIFA has its own definition.  World cups have been played with countries that don't exist.
 - for head entities, there are large number of people arguing

Ian - quality 99% accuracy is still 1 million incorrect for 100M.
 - sampling rate for how you draw entities.  you have a probability of confidence.
  (two kinds of sampling: 1) random sample, 2) traffic weighted sampling based on popularity)
 - 99% at the 95th percentile
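
The arithmetic in the question, and the difference between the two kinds of sampling, can be sketched as follows. The population is entirely made up (a small curated head with high traffic and a noisier tail), and the seed, sizes, and field names are assumptions.

```python
import random


def expected_errors(total_entities, accuracy):
    """Number of incorrect entities implied by an accuracy level."""
    return round(total_entities * (1 - accuracy))


def estimate_accuracy(sample):
    """Fraction of sampled entities judged correct."""
    return sum(1 for entity in sample if entity["correct"]) / len(sample)


# Hypothetical population: 10 heavily queried, well-curated head entities
# and 990 rarely queried tail entities, every tenth of which is wrong.
entities = [{"id": i, "traffic": 1000, "correct": True} for i in range(10)]
entities += [{"id": i, "traffic": 1, "correct": i % 10 != 0} for i in range(10, 1000)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 1) Random sample: every entity is equally likely to be drawn.
uniform_sample = rng.sample(entities, 500)

# 2) Traffic-weighted sample: popular entities are drawn more often
#    (sampling with replacement).
weighted_sample = rng.choices(
    entities, weights=[entity["traffic"] for entity in entities], k=500
)

uniform_accuracy = estimate_accuracy(uniform_sample)
weighted_accuracy = estimate_accuracy(weighted_sample)
errors_at_scale = expected_errors(100_000_000, 0.99)
```

With a population shaped like this, the traffic-weighted estimate reflects what users actually see, while the uniform estimate reflects the catalog as a whole; which number counts as "the accuracy" depends on which question is being asked.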

Wednesday, October 26

CIKM 2011 Keynote II: Justin Zobel on Biomedicine

Data, Health, and Algorithmics: Computational Challenges for Biomedicine
by Justin Zobel

(I missed the first part of this talk)

The Central Dogma
- DNA consists of a sequence of four bases, A, C, G, T
- The concept of a gene is now uncertain

SNP analysis (Nature 00)
 - used PCA

 - Read DNA directly
 - $1000 by the end of 2012

The data is errorful, incomplete, voluminous, ambiguous.

Also, reads are not very random

Within a few years there will be DNA databases of 10-100 terabases, which we will use to find matches to short read data

Challenge: Assembly
 - Imagine a million copies of a phone book, a million pages long
 - Shredded into tiny pieces, each no more than 20 or 30 characters
 - 99.999% are thrown away.
 - The task: reconstruct the phone book from the billion remaining pieces

The Problem of assembling short reads.

Genomes are a combinatorial minefield
 - vast quantities of repeated material

The genome is cheap, but the analysis is expensive

de Bruijn graph
 - Divide 7-base reads into kmers (3mers)
 - each node is a kmer, each arc is an overlap
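
A small sketch of the construction as described in the notes: split each read into overlapping 3-mers, make each 3-mer a node, and add an arc between consecutive 3-mers (which overlap by two bases). The example read is invented; note that many assemblers use a slightly different convention, with (k-1)-mers as nodes and k-mers as arcs.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k=3):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        parts = kmers(read, k)
        for left, right in zip(parts, parts[1:]):
            graph[left].add(right)
    return graph


# Hypothetical 7-base read containing a repeat ("ACG" occurs twice).
read = "ACGTACG"
parts = kmers(read, 3)
graph = build_graph([read], k=3)
```

The repeated 3-mer collapses into a single node, so the path through the read becomes a cycle (ACG, CGT, GTA, TAC, back to ACG). That is a miniature version of why repeated material makes assembly hard.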

The graph is about 4 terabytes, and it needs to be in memory.

Succinct 'Gossamer' representation
 - fast access with simple index
 - space down by a factor > 10
 - cuts the storage down to 32 GB

DNA dictionaries
 - there is no grammar for DNA that would allow construction of a parser
 - a dictionary of all possible tokens would be impossibly large

Dictionary - any representative string
 -> solves the text compression problem for DBs

Genetics for diagnosis
 -> inference diagnosis based on symptoms replaced by ones based on DNA analysis
 -> drug effect and health outcome determined directly from historical health records
 - Built to simplify, improve, and automate bureaucratic decisions

'Guardian Angel' clinical decisions
 - Electronic health records analyzed on the fly to check whether a mistake is about to be made.

Health at Home
 -> health monitoring deeply embedded in our e-lifestyle activities
 -> webcam that determines how well you are based on your skin
 -> iPhone app that tells if you're drunk

Computer Science vs health research
 - many algorithmic solutions are not biologically meaningful
 - spend money on IT and the number of errors is decreased (It saves lives.)

13 of the top 25 questions (Science mag July 2005) are about DNA

The real way to have an impact on medicine is in the clinic -> text mining records.  Helping doctors make decisions.

Tuesday, October 25

CIKM 2011 Keynote: David Karger

Creating User Interfaces that Entice People to Manage Better Information
By David Karger (MIT)

Haystack - Per user Information Environments (1999)

Current State of IKM (Information and Knowledge Management)
  - We take users with extremely rich landscapes of information and we give them keyboards to barely sketch their interests.  Algorithms work really really hard on that sketch.

 - We work hard to make computers do IKM well
 - People are better than computers at IKM

  - In what ways can we give people the ability to manage more or better information?
 - How do we make them want to?

1) Capture more data digitally
2) Collaborate to understand lecture notes

Capture of Information Scraps  
 - The state of PIM
 - The desks all have computers, but we have huge piles of paper (never put into it)
 - 27 participants, 5 Orgs, 1 hour in situ interviews
#1 using computer is distracting / impossible
  -- people instead just grab random notes to write things down
  -- Interfaces for Staying in the Flow (Ben Bederson, Ubiquity 2004)
  -- (Being "in the zone", in the flow)

#2 chimeras fight between apps
  -- Meeting notes with TODOs and follow up meetings
#3 Diverse information forms don't fit apps
Types of information
  TODOs, meeting Notes, Name and Contact information

#4 Want in view at right time --workflow integration

Costs to digital capture
 - costs: effort to choose place, imposed schema, entry time is a distraction
Fixes: no organization, plain text, in the browser, cross-computer sync offline+online (open source micro-note tool for Firefox).
 --> 25,000 downloads, 16,625 registered users, 920 volunteers, 116k contributed notes

Types of notes:
TODOs, Web bookmarks, Contact information
median time to write something is 7.4s
median number of lines is 4

35% - ease/speed
20% simplicity
20% direct replacement for post-its

Detour: Note Science
  -- How do people keep and access information in List-it?

3 coders
first clustered, identified 4 archetypes

MISC - MIT Open Scrap Corpus (available online)

NB: Classroom Discussion
Stellar Classroom discussion tool
 - 50 most active classes made 3275 posts
  -- no heavily populated posts
- Nb: forum in context  (happen in the margin of lecture notes)
 - Highlight a section of the post, write a comment
 --> Implicit context (how do I get 3 from 1)

Benefits - Discuss as you read without exiting the note view
 -- Context is clear because the PDF content is there
 -- annotations create a heat map of lecture notes

15 classes, 4 different universities
(Annotation required), usage of the tool doubled over the term.
 --> they liked seeing that they weren't the only one that was confused.
  --> rich interaction

NB specific benefits
 --> "Why?"
 --> The social benefits outweighed the use of paper

 - Artificial Collaborative Filtering

Vast amounts of content, how do we get the good stuff
machine learning recommenders - users rate what they read, content recommendation, collaborative filtering (find people with similar likes, predict what they will like)

 - have to read lots of junk to train system
 - have to spend energy now for future benefit
 - many users won't ever get started

 - ML algorithms imperfect
 - Deliver reading irrelevant content, worry about what is missed

Alternative: People

Email is dominant in information sharing
Median 6 - people do want more relevant links
Sharers are reluctant to spam their friends
 (unsure of relevance, may have seen it already, too much effort)

-> let them use email, reassure the sender that the content is relevant and that the recipient isn't overloaded. One-click sharing

Firefox plugin
1. recommend recipients to reduce time and effort for sharing
 (uses ML to find people to recommend)

One-click thanks

Recommendation Algorithm
 -- Rocchio classifier
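
For readers unfamiliar with it, a Rocchio-style classifier represents each class by the centroid of its training vectors and assigns a new document to the most similar centroid. The sketch below applies that idea to recipient recommendation; the features (raw word counts), the training examples, and all names are assumptions, since the notes do not describe what the plugin actually used.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words vector from lowercase whitespace tokens."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average of a list of sparse vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """examples: list of (recipient, text of a post previously shared with them)."""
    by_recipient = defaultdict(list)
    for recipient, text in examples:
        by_recipient[recipient].append(vectorize(text))
    return {r: centroid(vectors) for r, vectors in by_recipient.items()}


def recommend(centroids, text):
    """Recipient whose centroid is closest to the new post."""
    vector = vectorize(text)
    return max(centroids, key=lambda r: cosine(centroids[r], vector))


# Hypothetical sharing history.
history = [
    ("alice", "python web framework release"),
    ("alice", "new python tutorial"),
    ("bob", "marathon training plan"),
    ("bob", "running shoes review"),
]
centroids = train(history)
suggested = recommend(centroids, "python release notes")
```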

 - two week study for $30
 - 60 google reader users recruited on blogs
 - Viewed 85k posts, shared 713 posts
 - Significant increase in sharing

Recipients were happy - 80.4% of the posts contain novel content

Recommendations Useful

Do overload indicators help
 - 1/3 of subjects with them said they were favorite feature
 - 30% of shares resulted in a thanks

Machine filtering
 - have to read stuff

Structured Data
We all know structured data is good.
it supports
 -> rich visualizations, filtering, sorting, queries, merge data

Epicurious (old version)
 -> filter by ingredient, cuisine, part of meal

Mere mortals just write text or HTML

Structured data takes skill
 - design a data model,

Plain authors are left behind
 -> less power to communicate effectively

Coping: Information Extraction
 - Entity Recognition, Coreference, relationship extraction

Imperfect, so errors creep in.

Alternative: Give regular people tools that let people author structured data
 -> to communicate well

Do we need this? Yes.

- HTML is the language of the web
 - Extend it to talk about data
 - Anyone authoring HTML should be able to author data and interactive visualization
- Edit data-html in web, blogs, wikis

(like spreadsheets)

Publishing data is easy, just put a spreadsheet online.  rows are items, columns are properties

 Items (recipes)
 - Each has properties: Title, source magazine, publication date, etc.
 - Visualization - a collection of views of data items
     -- bar chart, sortable list, map, thumbnail set

Bound to properties
 - sort by property

Facets for filtering information
 -> specify a property, user clicks to select
 -> templates -> format per item.
 - HTML with "fill in the blanks"
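
The three ideas above (items as rows of properties, facets that filter on a property, and per-item "fill in the blanks" templates) can be sketched in a few lines. This is a Python analogue for illustration only, not Exhibit's API, which is JavaScript and HTML; the recipe data and function names are made up.

```python
from collections import Counter

# Items are rows; properties are columns (hypothetical recipe data).
items = [
    {"title": "Pad Thai", "cuisine": "Thai", "course": "main"},
    {"title": "Green Curry", "cuisine": "Thai", "course": "main"},
    {"title": "Tiramisu", "cuisine": "Italian", "course": "dessert"},
]


def facet_values(items, prop):
    """Values of one property with counts, as a facet would list them."""
    return Counter(item[prop] for item in items)


def filter_items(items, selections):
    """Keep items matching every facet selection (property -> allowed values)."""
    return [
        item for item in items
        if all(item[prop] in allowed for prop, allowed in selections.items())
    ]


def render(items, template, sort_by):
    """Sort by a property, then fill in the template once per item."""
    ordered = sorted(items, key=lambda item: item[sort_by])
    return [template.format(**item) for item in ordered]


cuisine_facet = facet_values(items, "cuisine")
thai_mains = filter_items(items, {"cuisine": {"Thai"}, "course": {"main"}})
lines = render(thai_mains, "{title} ({cuisine})", sort_by="title")
```

The author only supplies the data and the template; the filtering, sorting, and rendering are generic, which is what lets a plain HTML author get an interactive page.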

Key primitives of a data page
Data - a spreadsheet

Exhibit JavaScript library

1800 websites using Exhibit
hobby stores, science
(lots of strange hobbyists)
Veggie guide to Glasgow

Not very scalable (fast for < 100 items)

Side effects - the data is out there.  (structured data is the side effect)

Datapress - data visualization inside the blog

- People are powerful information managers
In each case, it's about giving people the tools to be information managers

Wait, There's more
 --> manage structured data by making it look like a spreadsheet
--> Atomate -> help users translate incoming data into structured data

We work hard to make computers do IKM well,
Don't assume people are passive IK consumers
Give people tools that can encourage active engagement in IKM

All the links are at

The success of Exhibit came from understanding why Haystack didn't succeed.  Having lots of people use a tool is not the only measure of success.  It's still an interesting piece of research.