Friday, September 11

Yahoo! Key Scientific Challenges Summit: Machine Learning

Yesterday I posted the notes from Andrew Tomkins' presentation on challenges in search. Today brings more of Henry's notes, this time from the machine learning session presented by Sathiya Keerthi Selvaraj, Senior Research Scientist. It covers the use of ML at Y! and some current challenges.

Application view
  • understanding user behavior
  • choosing best content for presentation
  • serving right ads
  • extracting/semantic tagging of content
  • dealing with spam
    - rich data makes solutions for these possible
ML problems view
  • standard ML problems
    - regression/classification/clustering/feature selection/etc.

  • statistics

  • scale
    - dealing with large data sets
    - discovering faster algorithms
    - fast surround (?)

  • structure/signal
    - adversarial learning
    - budget on real-time
    - preserving privacy
    - multi-task and transfer learning
    - graph transduction w/ many types of info
    - injecting knowledge into models (non-traditional training data)
  • experimental design/quality metrics
  • estimating CTR
  • rare events/anomaly detection
  • forecasting (page views for displaying advertising)
    Ex: content optimization (COKE)
    - matching content to user intent
    - maximize "long-term utility" (satisfaction)
  • online tracking of content affinity
    - multi-armed bandits and time series analysis
  • SVD for user modeling
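The multi-armed bandit bullet above is easy to make concrete. Below is a minimal epsilon-greedy sketch for picking which content item to show (my own illustration, not Yahoo!'s system; the item click-through rates are made up):

```python
import random

def epsilon_greedy(n_arms, reward_fn, rounds=10000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: explore with probability epsilon, else
    exploit the arm with the best running mean reward."""
    rng = random.Random(seed)
    counts = [0] * n_arms    # pulls per arm
    values = [0.0] * n_arms  # running mean reward per arm
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
    return counts, values

# Hypothetical content items with fixed click probabilities.
ctrs = [0.02, 0.05, 0.10]
counts, values = epsilon_greedy(
    3, lambda arm, rng: 1.0 if rng.random() < ctrs[arm] else 0.0)
```

With enough rounds the estimated value of the best item approaches its true CTR, which is the "online tracking of content affinity" idea in miniature.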
Clustering documents
  • one document, many topics
  • using graphical model representation
  • speed up algorithms
    - parallel implementation via pipeline
    (fastest LDA code)
  • uses many tricks
  • 1000 iterations (near convergence) of 1M docs in a few hours
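For context on what an LDA-style topic model computes, here is a toy single-threaded collapsed Gibbs sampler (my own sketch; the fast parallel pipeline mentioned above uses many more tricks):

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    z = []                                  # topic assignment per token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # standard collapsed conditional, up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# Two toy "topics": word ids 0-2 vs 3-5.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 0], [4, 5, 3, 5]]
ndk, nkw = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

The inner loop is what the parallel implementation pipelines across machines to reach 1M documents in hours.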
Vowpal Wabbit
  • online learning (linear regression)
  • optimized to get the fastest possible speed
    - open source
    - available on github
  • can use hashing techniques
    - allows for very large feature space
  • modularized
    - can swap out linear regression for other ML models
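The hashing trick mentioned above is easy to sketch: features are mapped to indices by a hash function instead of a dictionary, so the model fits a huge feature space in fixed memory. Here is a toy version with an online least-squares update, loosely in the spirit of VW (my own illustration, not its actual code; the feature strings are made up):

```python
import zlib

def hash_features(tokens, n_bits=18):
    """Hashing trick: map sparse string features into a fixed-size
    index space without a dictionary; collisions just add together."""
    dim = 1 << n_bits
    x = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode()) % dim   # hashed feature index
        x[idx] = x.get(idx, 0.0) + 1.0
    return x

def sgd_update(w, x, y, lr=0.1):
    """One online least-squares SGD step on a hashed example (w is a dict)."""
    pred = sum(w.get(i, 0.0) * v for i, v in x.items())
    err = pred - y
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - lr * err * v
    return pred

x = hash_features(["user=alice", "query=shoes", "ad=123"])
```

Because the weight vector is indexed by hash, swapping in another linear model only means changing the update rule, which mirrors VW's modular design.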
Email spam
  • spammers use Yahoo! accounts
    - pay people to solve captchas
  • very lucrative
  • >80% of email is spam
  • classifiers have to be quick
  • users hate good mail being classified as spam (FPR)
  • must protect privacy
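A classic example of a classifier quick enough for mail filtering is multinomial Naive Bayes. The sketch below is purely illustrative (not Yahoo!'s actual filter), with Laplace smoothing and made-up training mail:

```python
import math
from collections import Counter

class NaiveBayesSpam:
    """Multinomial Naive Bayes with Laplace smoothing.
    Labels: 1 = spam, 0 = good mail."""

    def fit(self, docs, labels):
        self.counts = {0: Counter(), 1: Counter()}  # word counts per class
        self.priors = {0: 0, 1: 0}                  # doc counts per class
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
            self.priors[y] += 1
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        n = len(labels)
        self.log_prior = {y: math.log(self.priors[y] / n) for y in (0, 1)}
        return self

    def predict(self, doc):
        scores = {}
        for y in (0, 1):
            total = sum(self.counts[y].values()) + len(self.vocab)
            scores[y] = self.log_prior[y] + sum(
                math.log((self.counts[y][w] + 1) / total) for w in doc)
        return max(scores, key=scores.get)

docs = [["win", "money", "now"], ["meeting", "at", "noon"],
        ["free", "money"], ["lunch", "meeting"]]
labels = [1, 0, 1, 0]
clf = NaiveBayesSpam().fit(docs, labels)
```

In practice the decision threshold would be tuned to keep false positives very low, since users hate losing good mail far more than they hate seeing spam.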
Search Ranking (MLR)
  • features
    - queries only
    - documents only
    - queries AND documents
  • approaches
    - pointwise
    - pairwise
    - listwise
  • directly optimize a metric of interest
  • using click data for auto labeling
  • transfer learning
  • diversity
  • cascaded learning
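Of the three approaches, the pairwise one is the easiest to sketch: learn a weight vector so that, for a given query, relevant documents score above irrelevant ones, using a logistic loss on score differences (RankNet-style). The feature values below are invented for illustration:

```python
import math
import random

def train_pairwise(pairs, dim, epochs=200, lr=0.1, seed=0):
    """Pairwise learning-to-rank sketch: SGD on logistic loss over
    score differences. Each pair is (better_doc, worse_doc) features."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            s = sum(wi * di for wi, di in zip(w, diff))
            g = 1.0 / (1.0 + math.exp(-s)) - 1.0   # d(log-loss)/d(s)
            for i, di in enumerate(diff):
                w[i] -= lr * g * di
    return w

def score(w, x):
    return sum(a * b for a, b in zip(w, x))

# Toy 2-d features (e.g. a text-match score and a click feature);
# the "better" document in each pair has higher relevance.
pairs = [([0.9, 0.8], [0.2, 0.1]), ([0.7, 0.6], [0.3, 0.4]),
         ([0.8, 0.9], [0.1, 0.2])]
w = train_pairwise(pairs, dim=2)
```

Click data provides exactly these preference pairs for free (a clicked result was preferred over a skipped one), which is one way "click data for auto labeling" feeds into training.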
Bid Generation
  • ML techniques suggest queries that bidders hadn't thought of bidding on
Domain-centric IE (PSOX)
  • wrappers
    - info extraction algorithms for pages with same/similar format
    - requires supervision
    - not scalable

  • web tables
    - looks for clean HTML tables
    - not scalable
    - needs some supervision

  • NLP
    - uses language signals
    - hard

  • domain-centric extraction
    - located somewhere between the above methods
    - look at one domain at a time
    e.g. blogs
    - what's the title? post time? etc.

  • schema
  • domain knowledge (weak labeling signals)
  • local presentation consistencies => accurate extraction
  • complex graphical models
  • domain-centric approach to deep web

A more interactive Bing 2.0 coming next week?

Yesterday news broke on Twitter about the demo of Bing 2.0 at the Microsoft annual meeting. Mary Jo Foley on ZDNet has coverage of the story.

The tweets say it's slotted for this fall and could be released as early as next week.

The exact details are sketchy, but one feature that came out was the use of interactive maps using Silverlight integrated into the results. Other "visualization" features were also alluded to. For example, one tweet:
Bing 2.0, out this month, has some exciting new features. Imagine seeing maps plus pics from the neighborhood of a restaurant to try.
We'll have to wait and see for ourselves...

Thursday, September 10

Yahoo! Key Scientific Challenges Coverage I: Challenges in Search

Yahoo! put out a press release on the Key Scientific Challenges Summit. My labmate, Henry, attended the summit, presenting work on detecting searcher frustration. He took some great notes to share. First up are the notes from Andrew Tomkins, Chief Scientist, Yahoo! Search talking about key challenges in search.

Three key challenges
  1. Optimizing task-aware relevance (model long-running user tasks)
  2. Grid-based content analysis (new computing algorithms)
  3. Measure/predict/generate engagement
Many non-issues:
  1. Anything involving PageRank
  2. Algorithms for supercomputers
  3. Folksonomies and tag analysis (at least not yet)
  4. Friend of a friend
  5. Unsupervised user-facing techniques (showing result clusters)

Challenges (details)
1. Task-aware
o Dawn of search:
- navigation and packets of information

o Today:
- increasing migration of content online
- new forms of media available online
- infrastructure for payment more comfortable for users

o Moving away from 2.7 words and 10 blue links
- more structured results
- more satisfaction without clicking
- more interaction with web services
- much richer page structure

o The resources people search are changing
- search engines may or may not be the hub

2. Storage trends
o Storage is cheap: any company with tens of employees can store all
text produced by all humans on the planet
- multimedia is another story
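A rough back-of-envelope supports the "storage is cheap" claim. All the numbers below are my own assumptions for illustration, not figures from the talk:

```python
# How much raw text do all humans produce? (assumed figures)
people = 7e9            # world population, roughly
bytes_per_day = 10_000  # assume ~10 KB of text typed per person per day
days = 365

total_bytes = people * bytes_per_day * days  # one year of everyone's text
total_pb = total_bytes / 1e15                # in petabytes
# ~25 PB/year: a lot, but within a small company's storage budget.
# Multimedia is orders of magnitude larger, hence "another story".
```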

o Move away from scale to deep understanding

o Richer models about what's on a page
- page semantics
- user consumption patterns
- aggregate properties
- how do we search it??

3. The problem is bigger than search -- Understanding the user
o why do people lurk versus participate?

o why do people create new personas?

o why are Facebook/YouTube/etc. so successful?

o what new genres are emerging?
- for content creation?
- participation?
- what tools are appropriate?

o haven't really gotten started
- many proxy measures based on views/clicks/etc.

o too low level

o some contributions
- click prediction
- dynamics of social network analysis
- models of viral marketing

o predictions of engagement still "embryonic"
- generation of engagement remains an art form

o need new science of engagement
- this is not a substitute for creativity
- scientific basis

Stay tuned for coverage of the machine learning challenges...

Wednesday, September 9

Getting started with Mahout on DeveloperWorks

Grant Ingersoll has an article, Introducing Apache Mahout, published on IBM developerWorks.

The article gives an introduction to different ML tasks and Mahout's implementations. Mahout currently has the Taste recommendation system developed by Sean Owen; clustering implementations including k-Means, fuzzy k-Means, Canopy, Dirichlet, and Mean-Shift; and a Naive Bayes text classifier.

Grant covers the basics of getting these working in the article.
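As a quick illustration of what the k-Means implementation computes, here is a plain, non-distributed sketch in Python (Mahout itself implements the assign/update steps in Java as Map-Reduce jobs over Hadoop; the points here are made up):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's k-means: alternate assigning points to the nearest
    center (the map step in Mahout) and recomputing each center as
    its cluster mean (the reduce step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):  # update step
            if cl:
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers

# Two well-separated toy clusters.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers = kmeans(points, k=2)
```

The appeal of the Map-Reduce formulation is that both steps parallelize over points, which is how Mahout scales these algorithms to large data.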

At the end he comments on what's next for Mahout:
On the immediate horizon are Map-Reduce implementations of random decision forests for classification, association rules, Latent Dirichlet Allocation for identifying topics in documents, and more categorization options using HBase and other backing storage options...

CIKM 2009 Papers

The CIKM 2009 accepted papers list is available. Some of the papers are starting to appear online.

The CIIR has several papers this year.

Jinyoung posted a list of papers he thinks look interesting.