Facebook is a generous sponsor of NESCAI this year. On Friday, Jonathan Chang from Facebook Data Science gave the introductory keynote address. Even if you don't read the rest of this post, you should check out Jonathan's blog, pleasescoopme.com. Here are my notes from his talk. Most of the talk was interesting charts and graphs, so my text notes do not do it justice.
Reading the Social Pages: Understanding and predicting the demographics and behavior of Facebook users
Scale at Facebook - 350 M 30-day active users (over 400 M now) - 3.5 billion pieces of content shared every week - 2.5 billion photos uploaded each month - 80,000 Facebook Connect implementations
Scaling challenges - How to partition data flowing at 100 TB/day? - How to serve content quickly?
Data Mining - What are users talking about?
Advertisement - Scaling: billions of ad impressions/day
Keyword Insight Extraction for Advertisers - Do users who click on an ad share a common set of keywords in their profiles and feeds? - Learn a click-through model using boosted trees with user/ad features - Propagate positive labels to top-ranked users
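To make the first question concrete, here is a rough sketch of one way to check whether clickers share keywords. It is not the boosted-tree click-through model from the talk; it swaps in a much simpler "lift" statistic (how much more common a keyword is among clickers than among all users). The function name `keyword_lift`, the `min_users` cutoff, and the toy profiles are all made up for illustration.

```python
from collections import Counter

def keyword_lift(profiles, clicked, min_users=2):
    """Rank keywords by how over-represented they are among users who clicked.

    profiles: dict of user id -> set of profile keywords (hypothetical input shape)
    clicked:  set of user ids who clicked the ad
    Returns (keyword, lift) pairs, highest lift first.
    """
    clicker_counts = Counter()
    overall_counts = Counter()
    for user, words in profiles.items():
        overall_counts.update(words)
        if user in clicked:
            clicker_counts.update(words)

    n_all = len(profiles)
    n_click = sum(1 for user in profiles if user in clicked)
    lifts = []
    for word, k in clicker_counts.items():
        # Skip very rare keywords; their lift is too noisy to mean much.
        if overall_counts[word] < min_users:
            continue
        rate_clickers = k / n_click
        rate_overall = overall_counts[word] / n_all
        lifts.append((word, rate_clickers / rate_overall))
    # Highest lift first; break ties alphabetically for a stable order.
    return sorted(lifts, key=lambda pair: (-pair[1], pair[0]))

# Toy data, invented for this example.
profiles = {
    "u1": {"running", "marathon", "coffee"},
    "u2": {"running", "trail"},
    "u3": {"coffee", "movies"},
    "u4": {"movies", "gaming"},
}
clicked = {"u1", "u2"}
ranking = keyword_lift(profiles, clicked)
print(ranking)
```

A lift above 1 means the keyword shows up more often among clickers than in the population as a whole, which is the kind of signal an advertiser could use to pick targeting keywords.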
What users are talking about (Lexicon) - Extract popular words from status updates and slice by demographic features (age, region) - Sentiment analysis - Keyword association - (old people like cranberry and lime with vodka… young people get drunk)
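The "extract popular words and slice by demographic" step could look something like the sketch below. This is my own minimal version, not Lexicon's implementation; the record fields (`age_band`, `text`), the tiny stopword list, and the sample updates are assumptions invented for the example.

```python
from collections import Counter, defaultdict
import re

# A deliberately tiny stopword list, just enough for the toy data below.
STOPWORDS = frozenset({"the", "a", "and", "with", "to", "i", "at", "so"})

def top_words_by_group(updates, group_key, n=3):
    """Count words in status updates separately for each demographic group.

    updates:   list of dicts with a "text" field and a demographic field
    group_key: name of the demographic field to slice by (e.g. "age_band")
    Returns a dict of group -> list of the n most frequent words.
    """
    counts = defaultdict(Counter)
    for update in updates:
        words = re.findall(r"[a-z']+", update["text"].lower())
        counts[update[group_key]].update(w for w in words if w not in STOPWORDS)
    return {group: [w for w, _ in c.most_common(n)] for group, c in counts.items()}

# Toy status updates, invented for this example.
updates = [
    {"age_band": "18-24", "text": "So drunk at the party"},
    {"age_band": "18-24", "text": "drunk again"},
    {"age_band": "55+", "text": "Vodka with cranberry and lime"},
    {"age_band": "55+", "text": "cranberry sauce"},
]
top = top_words_by_group(updates, "age_band", n=1)
print(top)
```

Slicing by region instead would just mean passing a different `group_key`, assuming the records carry that field.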
Ethnicities on Facebook - LDA on user names - Scraped 10k users from Myspace where they have ethnicity labels
Politics on Facebook - An “ideal point model” assigns each person a real number representing their political position (toward negative infinity is conservative, toward positive infinity is liberal) - Page fanning network (users express positive sentiment by becoming a fan of a page) - People and pages get mapped to political positions
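To give a feel for how people and pages can end up on the same one-dimensional scale, here is a toy sketch. It is not a real ideal point model (those are fit statistically); it is a much simpler averaging scheme I am substituting for illustration: a few pages get fixed "anchor" positions, and everyone else's position is repeatedly set to the average of their neighbors in the fan network. The function name, the anchor values, and the sample data are all hypothetical.

```python
def propagate_positions(fans, anchors, iterations=50):
    """Spread positions over a user-page fan network by repeated averaging.

    fans:    dict of user -> set of pages the user is a fan of
    anchors: dict of page -> fixed position (negative = conservative,
             positive = liberal); these values are assumptions
    Returns (user_positions, page_positions).
    """
    pages = {page for liked in fans.values() for page in liked}
    page_pos = {page: anchors.get(page, 0.0) for page in pages}
    user_pos = {user: 0.0 for user in fans}
    for _ in range(iterations):
        # A user sits at the average position of the pages they fan.
        for user, liked in fans.items():
            if liked:
                user_pos[user] = sum(page_pos[p] for p in liked) / len(liked)
        # A non-anchor page sits at the average position of its fans.
        for page in pages:
            if page in anchors:
                continue
            members = [user_pos[u] for u, liked in fans.items() if page in liked]
            page_pos[page] = sum(members) / len(members)
    return user_pos, page_pos

# Toy fan network, invented for this example.
fans = {
    "alice": {"page_L", "page_x"},
    "bob": {"page_R"},
    "carol": {"page_L", "page_x"},
}
anchors = {"page_L": 1.0, "page_R": -1.0}
user_pos, page_pos = propagate_positions(fans, anchors)
print(user_pos, page_pos)
```

In this toy network the unlabeled `page_x` drifts toward the liberal anchor because all of its fans also fan `page_L`.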
Geography Identification of US FB Population (Backstrom et al.) - From IP addresses (lots of inaccurate data) - User-provided address - Population density: low, medium, high - What is the probability that you know someone X miles away? - Using just friends and reported information, you can predict a user's location more precisely than from their IP address.
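Here is a rough sketch of how the last two bullets might fit together: if the probability of knowing someone falls off with distance, you can guess a user's location by picking the spot that makes their friends' reported locations most likely. This is a simplification under my own assumptions, not the method from the talk. The decay curve and its parameters in `friend_probability` are hypothetical, only friends' locations are tried as candidates, and non-friends are ignored entirely.

```python
import math

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))

def friend_probability(distance_miles, scale=0.2, offset=1.0, exponent=1.0):
    """Hypothetical decay curve: friendship gets less likely with distance.

    The functional form and all three parameters are assumptions for
    illustration, not fitted values.
    """
    return scale * (offset + distance_miles) ** -exponent

def predict_location(friend_locations):
    """Pick the friend location that maximizes the log-likelihood of all friends."""
    best, best_score = None, -math.inf
    for candidate in friend_locations:
        score = sum(
            math.log(friend_probability(haversine_miles(candidate, loc)))
            for loc in friend_locations
        )
        if score > best_score:
            best, best_score = candidate, score
    return best

# Toy data: three friends around Boston, one in San Francisco.
boston = (42.36, -71.06)
friend_locations = [boston, (42.37, -71.11), (42.39, -71.10), (37.77, -122.42)]
predicted = predict_location(friend_locations)
print(predicted)
```

With most friends clustered in one metro area, the single far-away friend barely moves the estimate, which matches the intuition behind using friends rather than IP addresses.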
In the Q&A session, multiple people raised provocative concerns about privacy. These veered a bit off-topic from the main thrust of the talk into a larger policy discussion, which I thought was a bit unfair to the speaker.
Discussion and thoughts
The keynote gave a few instances where the wealth of data Facebook collects opens up interesting opportunities for study and analysis. Most of the applications they outlined used simple keyword analysis of users' status updates and a few fields. The interesting part is how these techniques are combined with the social network graph. I would have liked more technical detail about the scaling challenges.
In the discussion afterwards, we came up with two areas that make search and data analysis challenging:
1) Each user has a unique social context. When you search, this context should be incorporated into the ranking. This personalized context makes search harder.
2) Many of the interesting questions involve traversing the social graph to incorporate friend-of-a-friend (FOAF) information. The FOAF set grows rapidly and creates a scaling challenge.
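Both points can be illustrated with a small sketch. The first function walks the graph out to two hops to collect friends and friends-of-friends; with an average of d friends per user, that second hop touches on the order of d squared people, which is where the scale problem comes from. The second function re-ranks search results using those hop distances as the user's "social context". The boost values, function names, and the toy graph are all assumptions of mine, not anything presented in the talk.

```python
from collections import deque

def hop_distances(graph, start, max_hops=2):
    """Breadth-first search returning hop distance for everyone within max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def personalized_rank(results, dist, boost=None):
    """Re-rank (author, base_score) results, boosting socially close authors.

    boost maps hop distance -> multiplier; the default values are arbitrary.
    Authors outside the searcher's two-hop neighborhood keep their base score.
    """
    if boost is None:
        boost = {1: 2.0, 2: 1.5}
    def score(item):
        author, base = item
        return base * boost.get(dist.get(author), 1.0)
    return sorted(results, key=score, reverse=True)

# Toy friendship graph, invented for this example.
graph = {
    "me": ["ann", "bo"],
    "ann": ["me", "cy", "di"],
    "bo": ["me", "di", "ed"],
    "cy": ["ann"],
    "di": ["ann", "bo"],
    "ed": ["bo", "fay"],
    "fay": ["ed"],
}
dist = hop_distances(graph, "me")
friends = sorted(n for n, d in dist.items() if d == 1)
foaf = sorted(n for n, d in dist.items() if d == 2)

# The same query results would be ordered differently for a different searcher.
results = [("fay", 1.0), ("cy", 0.8), ("ann", 0.7)]
ranked = personalized_rank(results, dist)
print(friends, foaf, ranked)
```

Because the ranking depends on `dist`, which is different for every user, results cannot simply be computed once and cached for everyone, which is the heart of the first challenge.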