Application view
- understanding user behavior
- choosing best content for presentation
- serving right ads
- extracting/semantic tagging of content
- dealing with spam
- rich data makes solutions for these possible
- standard ML problems
- regression/classification/clustering/feature selection/etc.
- statistics
- scale
- dealing with large data sets
- discovering faster algorithms
- fast surround (?) - structure/signal
- adversarial learning
- budget on real-time
- preserving privacy
- multi-task and transfer learning
- graph transduction w/ many types of info
- injecting knowledge into models (non-traditional training data)
- experimental design/quality metrics
- estimating CTR
- rare events/anomaly detection
- forecasting (page views for display advertising)
Ex: content optimization (COKE)
- matching content to user intent
- maximize "long-term utility" (satisfaction)
- online tracking of content affinity
- multi-armed bandits and time series analysis
- SVD for user modeling
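Sketch of the multi-armed bandit idea for online content selection, assuming the simplest epsilon-greedy policy (the notes do not say which bandit algorithm was actually used). Each "arm" is a piece of content and the reward is a click. The names `EpsilonGreedy` and `simulate` and the click-through rates passed in are made up for illustration.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit: explore with probability epsilon, else exploit."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # times each content item was shown
        self.values = [0.0] * n_arms  # running mean reward (estimated CTR)
        self.rng = random.Random(seed)

    def select(self):
        # Explore: pick a random item; exploit: pick the best estimate so far.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental mean, so the estimate is tracked online.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(true_ctrs, rounds=5000, seed=0):
    """Simulate clicks drawn from hypothetical true CTRs (illustration only)."""
    rng = random.Random(seed)
    bandit = EpsilonGreedy(len(true_ctrs), epsilon=0.1, seed=seed + 1)
    for _ in range(rounds):
        arm = bandit.select()
        reward = 1.0 if rng.random() < true_ctrs[arm] else 0.0
        bandit.update(arm, reward)
    return bandit
```

Usage: `bandit = simulate([0.02, 0.05, 0.30])`, then inspect `bandit.counts` and `bandit.values` to see which item was shown most and its estimated CTR.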
- one document, many topics
- using graphical model representation
- speed up algorithms
- parallel implementation via pipeline (fastest LDA code)
- uses many tricks
- 1000 iterations (near convergence) of 1M docs in a few hours
- online learning (linear regression)
- optimized to get the fastest speedup of algorithms
- open source
- available on github
- can use hashing techniques
- allows for very large feature space
- modularized
- can swap out linear regression for other ML models
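Sketch of how hashing allows a very large feature space in online linear regression. This is not the code of the open-source tool mentioned above; the class name, the use of `zlib.crc32` as the hash, and the parameter defaults are all assumptions. Feature names are hashed straight into a fixed-size weight table, so no feature dictionary has to be kept in memory.

```python
import zlib


class HashedLinearRegressor:
    """Online linear regression (squared loss, SGD) with the hashing trick."""

    def __init__(self, n_bits=18, learning_rate=0.1):
        self.size = 1 << n_bits            # table size is a power of two
        self.weights = [0.0] * self.size
        self.learning_rate = learning_rate

    def _index(self, name):
        # Map an arbitrary feature name to a slot; collisions are tolerated.
        return zlib.crc32(name.encode("utf-8")) & (self.size - 1)

    def predict(self, features):
        # features: dict of feature name -> value (sparse representation)
        return sum(self.weights[self._index(n)] * v for n, v in features.items())

    def update(self, features, target):
        # One SGD step on the squared error of a single example.
        error = self.predict(features) - target
        for n, v in features.items():
            self.weights[self._index(n)] -= self.learning_rate * error * v
        return error
```

Usage: call `model.update({"word=cheap": 1.0, "sender=unknown": 1.0}, 1.0)` once per incoming example. Swapping in another model means replacing the loss gradient in `update` while keeping the hashed weight table.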
- use Yahoo! accounts
- spammers pay people to solve captchas - very lucrative
- >80% of email is spam
- classifiers have to be quick
- users hate good mail being classified as spam (FPR)
- must protect privacy
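Sketch of the false-positive concern above (good mail classified as spam): measure the false positive rate (FPR) at a score threshold, then pick the lowest threshold that keeps FPR under a target. The function names and the label convention (1 = spam, 0 = good mail) are assumptions for illustration.

```python
def false_positive_rate(scores, labels, threshold):
    """Fraction of good mail (label 0) scored at or above the spam threshold."""
    negatives = sum(1 for y in labels if y == 0)
    false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return false_pos / negatives if negatives else 0.0


def threshold_for_fpr(scores, labels, max_fpr):
    """Lowest observed score usable as a threshold with FPR <= max_fpr."""
    # FPR never increases as the threshold rises, so the first hit is the lowest.
    for t in sorted(set(scores)):
        if false_positive_rate(scores, labels, t) <= max_fpr:
            return t
    return float("inf")  # no observed score meets the target
```

A lower threshold catches more spam but misfiles more good mail, so the target FPR is usually set very small.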
- features
- queries only
- documents only
- queries AND documents
- approaches
- pointwise
- pairwise
- listwise - directly optimize a metric of interest
- using click data for auto labeling
- transfer learning
- diversity
- cascaded learning
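Sketch contrasting the pointwise, pairwise, and listwise approaches listed above, for the documents returned for a single query. The specific losses (squared error, pairwise logistic loss) and the metric (NDCG) are common choices assumed for illustration; the notes do not say which ones were used.

```python
import math


def pointwise_squared_loss(scores, relevance):
    """Pointwise: each document's score is compared to its own label."""
    return sum((s - r) ** 2 for s, r in zip(scores, relevance)) / len(scores)


def pairwise_logistic_loss(scores, relevance):
    """Pairwise: penalize pairs where the more relevant doc is not scored higher."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if relevance[i] > relevance[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


def dcg(relevance_in_ranked_order):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum((2 ** r - 1) / math.log2(pos + 2)
               for pos, r in enumerate(relevance_in_ranked_order))


def ndcg(scores, relevance):
    """Listwise metric of interest: DCG of the induced ranking over the ideal DCG."""
    ranked = [r for _, r in sorted(zip(scores, relevance), key=lambda p: -p[0])]
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0
```

Listwise methods try to optimize a quantity like `ndcg` directly, or a smooth surrogate of it, since the metric is not differentiable in the scores.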
- ML techniques suggest what bidders should bid for: queries they hadn't thought of using
- wrappers
- info extraction algorithms for pages with same/similar format
- requires supervision
- not scalable
- web tables
- looks for clean HTML tables
- not scalable
- needs some supervision
- NLP
- uses language signals
- hard
- domain-centric extraction
- located somewhere between the above methods
- look at one domain at a time, e.g. blogs
- what's the title? post time? etc.
- schema
- domain knowledge (weak labeling signals)
- local presentation consistencies => accurate extraction
- complex graphical models
- domain-centric approach to deep web