Friday, September 11

Yahoo! Key Scientific Challenges Summit: Machine Learning

Yesterday I posted the notes from Andrew Tomkins' presentation on challenges in search. Today's post has more of Henry's notes, this time from the Machine Learning session presented by Sathiya Keerthi Selvaraj, Senior Research Scientist. It covers the use of ML at Y! and some current challenges.

Application view
  • understanding user behavior
  • choosing best content for presentation
  • serving right ads
  • extracting/semantic tagging of content
  • dealing with spam
    - rich data makes solutions for these possible
ML problems view
  • standard ML problems
    - regression/classification/clustering/feature selection/etc.

  • statistics

  • scale
    - dealing with large data sets
    - discovering faster algorithms
    - fast surround (?)

  • structure/signal
    - adversarial learning
    - budget on real-time
    - preserving privacy
    - multi-task and transfer learning
    - graph transduction w/ many types of info
    - injecting knowledge into models (non-traditional training data)
  • experimental design/quality metrics
  • estimating CTR
  • rare events/anomaly detection
  • forecasting (page views for displaying advertising)
    Ex: content optimization (COKE)
    - matching content to user intent
    - maximize "long-term utility" (satisfaction)
  • online tracking of content affinity
    - multi-armed bandits and time series analysis
  • SVD for user modeling
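
To make the multi-armed bandit idea for tracking content affinity concrete, here is a minimal epsilon-greedy sketch. It is only an illustration, not the method described in the talk: the function name, the simulated click probabilities, and the parameter values are all assumptions.

```python
import random


def epsilon_greedy(true_ctr, rounds=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy selection over a set of content items.

    true_ctr holds hypothetical per-item click probabilities. The learner
    never reads them directly; they only drive the simulated clicks.
    Returns (shows, clicks), one count per item.
    """
    rng = random.Random(seed)
    n = len(true_ctr)
    shows = [0] * n
    clicks = [0] * n
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: pick a random item (also used until every item
            # has been shown at least once).
            arm = rng.randrange(n)
        else:
            # Exploit: pick the item with the best observed click rate.
            arm = max(range(n), key=lambda i: clicks[i] / shows[i])
        shows[arm] += 1
        if rng.random() < true_ctr[arm]:
            clicks[arm] += 1
    return shows, clicks


shows, clicks = epsilon_greedy([0.02, 0.05, 0.10])
```

The observed rate clicks[i] / shows[i] is the running affinity estimate for item i. A production system would also have to handle affinity that drifts over time, which is where the time series analysis in the notes comes in.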
Clustering documents
  • one document, many topics
  • using graphical model representation
  • speed up algorithms
    - parallel implementation via pipeline
    (fastest LDA code)
  • uses many tricks
  • 1000 iterations (near convergence) of 1M docs in a few hours
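
For readers who have not seen LDA, the "one document, many topics" model can be sketched with a textbook collapsed Gibbs sampler. This is a single-threaded toy, not the pipelined Yahoo! implementation mentioned above; the function name, hyperparameters (alpha, beta), and the tiny corpus are assumptions for illustration.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns theta, the estimated topic mixture of every document.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]          # topic counts per doc
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []                                         # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 3, 4]]
theta = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Each row of theta is a mixture over topics, so a document is not forced into a single cluster. The inner resampling loop is the part that speedups and parallel implementations would target.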
Vowpal Wabbit
  • online learning (linear regression)
  • optimized to get the fastest speedup of algorithms
    - open source
    - available on github
  • can use hashing techniques
    - allows for very large feature space
  • modularized
    - can swap out linear regression for other ML models
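
The combination of online learning and hashing can be sketched in a few lines. This is not Vowpal Wabbit's code or API; the class name, the use of CRC32 as the hash, and the parameter values are assumptions chosen to keep the example self-contained.

```python
import zlib


class HashedLinearRegressor:
    """Online linear regression over hashed feature names.

    Feature names are hashed into a fixed-size weight table, so the model
    never needs a dictionary of every feature it has seen.
    """

    def __init__(self, n_bits=18, learning_rate=0.1):
        self.size = 1 << n_bits              # number of weight slots
        self.weights = [0.0] * self.size
        self.learning_rate = learning_rate

    def _index(self, feature):
        # Map an arbitrary feature name to a slot; collisions are possible.
        return zlib.crc32(feature.encode("utf-8")) % self.size

    def predict(self, features):
        return sum(self.weights[self._index(f)] * v for f, v in features.items())

    def update(self, features, target):
        """One stochastic gradient step on squared error; returns the error."""
        error = self.predict(features) - target
        for f, v in features.items():
            self.weights[self._index(f)] -= self.learning_rate * error * v
        return error


model = HashedLinearRegressor()
for _ in range(200):
    model.update({"a": 1.0}, 2.0)
    model.update({"b": 1.0}, 1.0)
```

Memory use is fixed by n_bits no matter how many distinct features appear, which is what makes a very large feature space practical. The price is that two features can occasionally share a weight.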
Spam
  • spammers use Yahoo! accounts
    - spammers pay people to solve captchas
  • very lucrative
  • >80% of email is spam
  • classifiers have to be quick
  • users hate good mail being classified as spam (FPR)
  • must protect privacy
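
Since the false positive rate (good mail flagged as spam) is the metric users care most about, one simple way to act on it is to choose the classifier threshold under an FPR budget. The sketch below assumes a classifier that outputs a spam score per message; the function names and the sample scores are made up for illustration.

```python
def false_positive_rate(scores, labels, threshold):
    """Fraction of good mail (label 0) scored at or above the threshold."""
    good = [s for s, y in zip(scores, labels) if y == 0]
    if not good:
        return 0.0
    return sum(s >= threshold for s in good) / len(good)


def pick_threshold(scores, labels, max_fpr):
    """Lowest observed score usable as a threshold within the FPR budget.

    FPR can only fall as the threshold rises, so scanning the scores in
    ascending order finds the most aggressive threshold that is allowed.
    """
    for t in sorted(set(scores)):
        if false_positive_rate(scores, labels, t) <= max_fpr:
            return t
    return float("inf")  # no observed score satisfies the budget


# Hypothetical spam scores; label 1 = spam, 0 = good mail.
scores = [0.1, 0.2, 0.6, 0.7, 0.9, 0.95]
labels = [0, 0, 0, 1, 1, 1]
threshold = pick_threshold(scores, labels, max_fpr=0.0)  # 0.7
```

With these sample scores, a threshold of 0.5 would flag one of the three good messages, while 0.7 flags none of them and still catches all three spam messages.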
Search Ranking (MLR)
  • features
    - queries only
    - documents only
    - queries AND documents
  • approaches
    - pointwise
    - pairwise
    - listwise
  • directly optimize a metric of interest
  • using click data for auto labeling
  • transfer learning
  • diversity
  • cascaded learning
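
Of the three approaches, pairwise is the easiest to show in a few lines: the model is trained so that, for the same query, the preferred document scores higher than the other one. The sketch below uses a linear scorer with a logistic loss on score differences. It is a generic illustration, not Yahoo!'s ranker, and the feature vectors are invented.

```python
import math


def score(w, x):
    """Linear relevance score of a query-document feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(pairs, n_features, epochs=100, learning_rate=0.1):
    """Fit a linear scorer from (better, worse) feature-vector pairs.

    Minimizes log(1 + exp(-(score(better) - score(worse)))) by
    stochastic gradient descent.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = score(w, diff)
            # The gradient of the loss w.r.t. w is -diff / (1 + exp(margin)).
            scale = 1.0 / (1.0 + math.exp(margin))
            w = [wi + learning_rate * scale * di for wi, di in zip(w, diff)]
    return w


# Hypothetical preference pairs; each vector is [feature_1, feature_2].
pairs = [
    ([1.0, 0.2], [0.3, 0.9]),
    ([0.8, 0.1], [0.2, 0.4]),
]
w = train_pairwise(pairs, n_features=2)
```

A pointwise method would instead regress each document's label on its own, and a listwise method would define the loss over the whole ranked list. Preference pairs like these are also the kind of label that click data can supply automatically.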
Bid Generation
  • ML techniques suggest what bidders should bid for, using queries they hadn't thought of using
Domain-centric IE (PSOX)
  • wrappers
    - info extraction algorithms for pages with same/similar format
    - requires supervision
    - not scalable

  • web tables
    - looks for clean HTML tables
    - not scalable
    - needs some supervision

  • NLP
    - uses language signals
    - hard

  • domain-centric extraction
    - located somewhere between the above methods
    - look at one domain at a time
    e.g. blogs
    - what's the title? post time? etc.

  • schema
  • domain knowledge (weak labeling signals)
  • local presentation consistencies => accurate extraction
  • complex graphical models
  • domain-centric approach to deep web
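
As a point of reference for the simplest end of this spectrum, here is a wrapper-style sketch that fills the blog schema fields mentioned above (title, post time) from pages sharing one format. The class names "post-title" and "post-time" are a hypothetical page template, and the extractor class is my own; PSOX itself was not shown in code.

```python
from html.parser import HTMLParser


class BlogPostExtractor(HTMLParser):
    """Pull schema fields out of blog pages that share one template."""

    # Hypothetical mapping from CSS class in the template to schema field.
    FIELD_BY_CLASS = {"post-title": "title", "post-time": "post_time"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None  # schema field waiting for its text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for css_class, field in self.FIELD_BY_CLASS.items():
            if css_class in classes:
                self._capture = field

    def handle_data(self, data):
        text = data.strip()
        if self._capture and text:
            # Keep the first non-empty text after the marker tag.
            self.fields.setdefault(self._capture, text)
            self._capture = None


page = (
    '<div><h3 class="post-title"><a href="#">Hello</a></h3>'
    '<span class="post-time">9:00 AM</span></div>'
)
extractor = BlogPostExtractor()
extractor.feed(page)
fields = extractor.fields
```

Hand-written rules like these break as soon as a site uses a different template, which is the scalability problem noted for wrappers. The domain-centric approach keeps the schema fixed per domain but aims to learn where the fields are from weak signals instead of hard-coding them.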
