Thursday, October 27

CIKM 2011 Industry: Model-Driven Research in Social Computing

Model-Driven Research in Social Computing
Ed Chi

Google Social Stats
250k words per minute on Blogger, 360 million words per day
100M+ people take a social action on YouTube

Google+ Stats
40 million joined since launch
2x-3x more likely to share content with one of their circles than to make a public post

Hard to talk about because the systems are changing quite rapidly
Ed joined Google to work on Google+

Social Stream Research
 - Factors impacting retweetability (IEEE Social Computing)
 - Location field of user profiles

Motivation for studying languages
 - Twitter is an international phenomenon
 - How do users of different languages use Twitter?
 - How do bilingual users spread information across languages?

Data Collection & Processing
 - 62M tweets (4 weeks) from the Spritzer feed, April-June 2010
 - Language detection with Google Language API + LingPipe
 - 104 languages
 - Top 10 languages

English - 51%
Japanese - 19%
Portuguese - 9.6% (mostly Brazil)
Indonesian - 5.6%
Spanish - 4.7%

Sampled 2000 random tweets
 - 2 human judges for each of the top 10 languages

Accuracy of Language Detection
 - Problems with French, German, and Malay
 - Two types of errors: poor recognition of "tweet English", and of tweets with only 1-2 words

Korean - recommend for conversation tweets
German - promote tweets with URLs

English serves as a hub language

Implications - need to understand language barriers when building a global network
 - building a global community
 - the need for brokers of information between languages

Visible Social Signals from Shared Items (Chen, et al. CHI 2010/CHI 2011)
- After a whole day without Wi-Fi, he would like a summary of what's happening in his social stream
- Eddi - Summarizing Social Streams
  --> What's happened since you last logged in
  --> A tag cloud of entities that were mentioned
  --> A topic dashboard where tweets are organized into categories to drill into

Information Gathering/Seeking
 - The Filtering problem - I get 1,000+ things in my stream, but only have time for 10.  Which ones should I read?

 - The Discovery Problem
 -- millions of URLs are posted
 - Twitter as the platform
 - URLs as the medium
 - a personal newspaper that produces personal headlines

URL Sources (from tweets) -> Topic Relevance Model and Social Network Model

URL Sources
 - Considering all URLs was impossible
 - FoF - URLs from followees-of-followees
   --> socially local news is better
 - Popular - URLs that are popular across the whole of Twitter
   --> popular news is better

Topic Relevance Model
 - A user tweets about things, which creates a term-vector profile.
 - Cosine similarity between the user profile and URL profiles
 - Topic profile of URLs - built from tweets that contain the URL, but tweets are short and retweets (RT) make word frequencies goofy.
 - Adopt a term expansion technique: extract nouns from the tweet and feed them into a Wikipedia search engine as a topic detection technique
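
A minimal sketch of the cosine-similarity step, to make the term-vector idea concrete. This is not the speaker's implementation: the function names, the whitespace tokenizer, and the sample texts are all assumptions, and the Wikipedia-based term expansion is left out.

```python
import math
from collections import Counter

def term_vector(texts):
    """Build a raw term-frequency vector from a list of short texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())  # naive tokenizer (assumption)
    return counts

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical user profile (from the user's tweets) and URL profiles
# (from tweets that mention each URL).
user_profile = term_vector(["search interfaces and social search",
                            "social computing research"])
url_profile_a = term_vector(["new research on social search"])
url_profile_b = term_vector(["football scores tonight"])

# Rank candidate URLs by similarity to the user's profile.
ranked = sorted(
    [("url_a", cosine_similarity(user_profile, url_profile_a)),
     ("url_b", cosine_similarity(user_profile, url_profile_b))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

With profiles this short, a single shared word dominates the score, which is one reason the notes mention expanding terms before comparing.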

Topic Profile of User
 - Self-topic
 - Information producer - the things they tweet about
 - Information gatherer - what they like to read
 - Build profiles from followees and aggregate them.

Social Module
 - Take FoF neighborhood, and count the votes for a URL
 - Simple counting doesn't work very well.
 - Votes are weighted using social network structure
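
A rough sketch of weighted voting over a neighborhood. The notes do not say how the weights are derived from the network structure, so the weighting used here (how many of the user's followees follow each voter) is purely an assumption for illustration, as are all names and data.

```python
from collections import defaultdict

def weighted_votes(posts, voter_weight):
    """Sum voter weights per URL; each voter counts once per URL."""
    scores = defaultdict(float)
    seen = set()
    for voter, url in posts:
        if (voter, url) in seen:
            continue  # ignore repeat posts of the same URL by the same voter
        seen.add((voter, url))
        scores[url] += voter_weight.get(voter, 0.0)
    return dict(scores)

# Hypothetical network: the user's followees -> accounts those followees follow.
follows = {
    "alice": {"dave", "erin"},
    "bob": {"dave"},
    "carol": {"dave", "frank"},
}

# Assumed weight: number of the user's followees who follow each voter.
voter_weight = defaultdict(float)
for followee, targets in follows.items():
    for target in targets:
        voter_weight[target] += 1.0

# (voter, url) pairs observed in the neighborhood.
posts = [("dave", "u1"), ("erin", "u2"), ("frank", "u2"), ("dave", "u1")]
scores = weighted_votes(posts, voter_weight)
```

In this toy data a plain count favors `u2` (two voters against one), while the weighted score favors `u1` because its single voter is followed by three of the user's followees.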

Study Design
 - Each subject evaluates 5 URL recommendations from each of the 12 algorithms: 60 URLs shown in random order, each given a binary rating.
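
The survey assembly and scoring could look roughly like the sketch below. Algorithm names, URL names, and the ratings are placeholders, not data from the talk; only the 5 x 12 = 60 layout and the binary rating come from the notes.

```python
import random

def build_survey(recs_by_algorithm, per_algorithm=5, seed=0):
    """Take the top recommendations from each algorithm and shuffle them together."""
    items = [(algo, url)
             for algo, urls in recs_by_algorithm.items()
             for url in urls[:per_algorithm]]
    random.Random(seed).shuffle(items)  # subjects see URLs in random order
    return items

def relevance_rate(ratings):
    """ratings: iterable of (algorithm, is_relevant) -> fraction relevant per algorithm."""
    totals, hits = {}, {}
    for algo, is_relevant in ratings:
        totals[algo] = totals.get(algo, 0) + 1
        hits[algo] = hits.get(algo, 0) + int(is_relevant)
    return {algo: hits[algo] / totals[algo] for algo in totals}

# Placeholder recommendations: 12 hypothetical algorithms, 5 URLs each.
recs = {f"algo_{i}": [f"url_{i}_{j}" for j in range(5)] for i in range(12)}
survey = build_survey(recs)

# Placeholder binary ratings (one "relevant" URL per algorithm), not real results.
ratings = [(algo, url.endswith("_0")) for algo, url in survey]
rates = relevance_rate(ratings)
```

Keeping the algorithm label attached to each shuffled item is what lets the per-algorithm percentages be recovered after the subject rates the mixed list.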

Summary of Results
 - Global popularity (1%) -- 32.50% are relevant, not bad, but not good enough
 - FoF only - 33% - naive by itself, without voting, doesn't work great
 - FoF voting method - 65% (social voting only)
 - Popularity voting - 67%
 - FoF Self-Vote - 72% best performing

Algorithms differ not only in accuracy!
 - Relevance vs. Serendipity in recommendations (tension between discovery and affirming aspect)
 -> "What I crave is surprising, interesting, whimsy" this is where the value is
 -> Two elements to surprise: 1) have I seen this before? 2) non-obvious relationships between things

Design Rule
- Interaction costs determine number of people who participate
 - Reduce the interaction costs, then you can get a lot more people into the system
 - For Google+, delivering this to people is key

Japanese crams more information into a tweet.  It is used more for conversation than broadcast in these environments