Wednesday, April 21

Bixo Labs Makes Web Crawl Data Available

Bixo Labs announced today that the first data from the Public Terabyte Dataset Project is available. The project uses the Bixo crawler to collect a large set of webpages and make them publicly available.

A sample of the data to get started is available on Amazon S3 at bixolabs-ptd-demo. The data is serialized with Avro; the very simple schema is available on their website.

I look forward to seeing more from the project!

Monday, April 19

Facebook Data Science at NESCAI: Understanding and predicting demographics and behavior

Facebook is a generous sponsor of NESCAI this year. On Friday, Jonathan Chang from Facebook Data Science gave the introductory keynote address. Even if you don't read the rest of this post, you should check out Jonathan's blog. Here are my notes from his talk. Most of the talk was interesting charts and graphs, so my text notes do not do it justice.

Reading the Social Pages: Understanding and predicting the demographics and behavior of Facebook users

Scale at Facebook
350 million 30-day active users (over 400 million now)
3.5 billion pieces of content shared every week
2.5 billion photos uploaded each month
80,000 Facebook Connect implementations


Scaling challenges
- How to partition data flowing at 100 TB/day
- How to serve content quickly?

Data Mining
- What are users talking about?

- Scaling: billions of ad impressions/day

Keyword Insight Extraction for Advertisers
- Do users who click on an ad share a common set of keywords in their profiles and feeds?
- Learn a click-through model using boosted trees with user/ad features
- Propagate positive labels to top-ranked users

What users are talking about (Lexicon)
- Extract popular words from status updates and slice by demographic features (age, region)
- Sentiment analysis
- Keyword association
- (old people like cranberry and lime with vodka… young people get drunk)
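
As a rough illustration of the "extract popular words and slice by demographic" idea, here is a minimal sketch of my own, not anything shown in the talk. The function name `top_words_by_group`, the age buckets, and the sample updates are all made up for the example, and a real system would also need stop-word removal and normalization by group size.

```python
from collections import Counter, defaultdict
import re


def top_words_by_group(updates, k=2):
    """Count words per demographic group and return the k most common.

    updates: iterable of (group, text) pairs, e.g. ("18-24", "party tonight").
    Returns a dict mapping group -> list of (word, count) pairs.
    """
    counts = defaultdict(Counter)
    for group, text in updates:
        # Lowercase and keep simple alphabetic tokens only.
        counts[group].update(re.findall(r"[a-z']+", text.lower()))
    return {group: counter.most_common(k) for group, counter in counts.items()}


# Hypothetical status updates, invented for illustration.
sample = [
    ("18-24", "party tonight party"),
    ("18-24", "exam tomorrow then party"),
    ("45-54", "cranberry and lime"),
    ("45-54", "lime season"),
]
result = top_words_by_group(sample, k=1)
print(result)
```

Raw counts favor words that are common everywhere, so comparing each group's frequency against the overall frequency would probably surface more distinctive keywords.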

“Happiness” (Kramer et al.)
- Read the team notes post
- Happiness is a broad concept
- Frequently measured using self-report
- Gross national happiness (GNH)

Ethnicities on Facebook
- LDA on user names
- Scraped 10k users from Myspace where they have ethnicity labels

Politics on Facebook
- An “ideal point model” assigns each person a real number representing their political position (negative infinity is conservative, positive infinity is liberal)
- Page fanning network (users express this as positive sentiment)
- People and pages get mapped to political positions

Geography Identification of US FB Population
(Backstrom et al.)
- From IP Addresses (lots of inaccurate data)
- User provided address
- Population density – low, med, high density
- What is the probability that you know someone X miles away?
- Using just friends and reported information, you can predict location more precisely than with IP addresses.
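
To make the friend-based location idea concrete, here is a simplified sketch of my own. It assumes the probability of knowing someone falls off as a power of distance, and the constants in `friend_prob` are placeholders I invented, not values from the talk or the paper. It also only scores existing friendships and ignores the non-friend terms a full likelihood would include.

```python
import math


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))


def friend_prob(d, a=0.002, b=1.0, c=1.0):
    """Assumed probability of friendship at distance d miles (placeholder constants)."""
    return a * (b + d) ** -c


def guess_location(friend_locations):
    """Pick the friend's location that makes the observed friendships most likely."""
    def score(candidate):
        return sum(math.log(friend_prob(haversine_miles(candidate, f)))
                   for f in friend_locations)
    return max(friend_locations, key=score)


# Hypothetical user: three friends around Boston and one in San Francisco.
friends = [(42.36, -71.06), (42.37, -71.11), (42.39, -71.10), (37.77, -122.42)]
guess = guess_location(friends)
print(guess)
```

The single far-away friend barely moves the estimate, because the log-probability penalty for distance is paid by every candidate near the main cluster as well.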

In the Q&A session, multiple people raised provocative concerns about privacy. The questions veered a bit off-topic from the main thrust of the talk into a larger policy discussion, which I thought was a bit unfair to the speaker.

Discussion and thoughts
The keynote gave a few instances where the wealth of data Facebook collects creates interesting opportunities for study and analysis. Most of the applications they outlined used simple keyword analysis of users' status updates and a few profile fields. The interesting part is how these techniques are combined with the social network graph. I would have liked more technical detail about the scaling challenges.

In the discussion afterwards, we came up with two areas that make search and data analysis challenging:
1) Each user has a unique social context. When you search, this context should be incorporated into the ranking. This personalized context makes search harder.
2) Many of the interesting questions involve traversing the social graph to answer questions that incorporate friends-of-friends information. The friends-of-friends set grows rapidly and creates a scale challenge.
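
A small sketch of why the second point bites, using a toy in-memory graph (the `graph` dict and user names are invented for illustration, and a real social graph would be partitioned across many machines rather than held in one dict):

```python
def friends_of_friends(graph, user):
    """Return users exactly two hops from `user`.

    graph: dict mapping each user to the set of their direct friends.
    Direct friends and the user themselves are excluded from the result.
    """
    direct = graph.get(user, set())
    two_hop = set()
    for friend in direct:
        two_hop |= graph.get(friend, set())
    return two_hop - direct - {user}


# Toy undirected graph, invented for the example.
graph = {
    "a": {"b", "c"},
    "b": {"a", "d", "e"},
    "c": {"a", "e", "f"},
    "d": {"b"},
    "e": {"b", "c"},
    "f": {"c"},
}
fof = friends_of_friends(graph, "a")
print(sorted(fof))
```

With an average of k friends per user, the two-hop set can approach k squared people, so a hypothetical user with 100 friends could have up to roughly 10,000 friends-of-friends to consider on every query.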