Ilya Segalovich, Yandex
Russian Search Market
- Yandex has 60+% market share
- It's all about attention to the small details of search
A Yandex overview
- started in 1997
- No. 7 search engine in the world by number of queries
- 150 million queries per day
Variety of Markets
- 15 countries with the Cyrillic alphabet
- 77 regions in Russia
-> different cultures, standards of living, and average incomes, for example: Moscow vs. Magadan
-> large semi-autonomous ethnic groups (Tatar, Chechen, Bashkir)
-> neighbouring bilingual markets
Geo-specific queries
- Relevant result sets vary significantly across regions and countries
pFound
- a probabilistic measure of user satisfaction
- optimization goal at Yandex since 2007
- Similar to ERR, Chapelle 2009 --> hopefully someone can fill in the exact formula
- pFound, pBreak, pRel
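
A minimal sketch of how pFound, pBreak and pRel fit together, assuming the cascade form usually given in Yandex's published descriptions: the user reads results top to bottom, stops when a result satisfies them (probability pRel at that position) and otherwise gives up with probability pBreak. The function name and the default pBreak = 0.15 are assumptions for illustration, not taken from the talk.

```python
def p_found(p_rel, p_break=0.15):
    """Probability that the user finds a relevant result in a ranked list.

    p_rel   -- per-position relevance probabilities, top result first
    p_break -- probability of abandoning the list after an unsatisfying
               result (0.15 is an assumed default, not from the talk)
    """
    p_look = 1.0  # the user always looks at the first result
    total = 0.0
    for rel in p_rel:
        total += p_look * rel
        # The next position is reached only if this result did not
        # satisfy the user and the user did not give up.
        p_look *= (1.0 - rel) * (1.0 - p_break)
    return total


if __name__ == "__main__":
    # A relevant result at position 2 is worth less than at position 1.
    print(p_found([1.0, 0.0]))
    print(p_found([0.0, 1.0]))
```

With pBreak set to 0 this reduces to the probability that at least one viewed result is relevant, which is why the measure rewards putting relevant documents early.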
Geo-specific Ranking
query -> query + user's region
- may need to build a specific formula for each country/region because of the variance and the missing or sparse features in some of them.
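
One way to picture "query + user's region" with per-region formulas is a lookup that falls back from region to country to a global formula when no specific one has been trained. This is only a sketch under that assumption; `pick_formula`, the key layout and the formula names are hypothetical and not Yandex's design.

```python
def pick_formula(formulas, country, region=None):
    """Return the most specific ranking formula available.

    formulas -- dict keyed by (country, region); (country, None) is the
                country-wide formula and (None, None) the global fallback
    """
    for key in ((country, region), (country, None)):
        if key in formulas:
            return formulas[key]
    return formulas[(None, None)]


if __name__ == "__main__":
    # Hypothetical formula registry; values would be trained rankers.
    formulas = {
        ("RU", "Moscow"): "ru_moscow_formula",
        ("RU", None): "ru_formula",
        (None, None): "global_formula",
    }
    print(pick_formula(formulas, "RU", "Magadan"))
```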
Alternatives in Regionalization
- separate local indices or a unified index with geo-coded pages
- one query or region specific query
- query based local intent detection vs. results based local intent detection
- single ranking function vs. co-ranking and re-ranking of local results
- train one formula or train many formulas on local pools
Why use MLR?
Machine learning as a conveyor
- Some query classes require specific ranking
- many features
MatrixNet
A learning method
- boosted decision trees, "oblivious" trees.
- optimize for pFound
- solve regression tasks, train classifiers
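
To make "oblivious" concrete: in an oblivious tree every node on the same level tests the same feature against the same threshold, so a tree of depth d is just d comparisons that form a d-bit index into a table of 2^d leaf values. The sketch below shows generic oblivious-tree evaluation and an additive ensemble; it is not MatrixNet's actual implementation, and all names are made up for illustration.

```python
def oblivious_tree_predict(features, splits, leaf_values):
    """Evaluate one oblivious tree.

    features    -- list of feature values for one document
    splits      -- one (feature_index, threshold) pair per tree level
    leaf_values -- 2 ** len(splits) values, indexed by the comparison bits
    """
    leaf = 0
    for feature_index, threshold in splits:
        bit = 1 if features[feature_index] > threshold else 0
        leaf = (leaf << 1) | bit  # append this level's outcome to the index
    return leaf_values[leaf]


def ensemble_predict(features, trees):
    """Additive (boosted) model: sum the outputs of all trees."""
    return sum(
        oblivious_tree_predict(features, splits, leaf_values)
        for splits, leaf_values in trees
    )


if __name__ == "__main__":
    tree = ([(0, 0.5), (1, 2.0)], [10.0, 20.0, 30.0, 40.0])
    print(ensemble_predict([0.7, 1.0], [tree, tree]))
```

Because a tree is a handful of comparisons plus a table lookup, evaluation has no data-dependent branching on tree shape, which helps when a formula contains a very large number of trees.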
Complexity of ranking formulas
20 bytes - 2006
14 kb - 2008
220 kb - 2009
120 MB - 2010
A sequence of more and more complex rankers
- pruning with the static rank (static features)
- use of simple dynamic features (such as BM25)
- complex formula that uses all the features available
- potentially up to a million matrices/trees for the very top documents
- see Cambazoglu et al., 2010, early exit optimizations
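
The sequence above can be sketched as a cascade: each stage scores the surviving candidates with a more expensive function and keeps only the best few for the next stage. This is a simplified illustration with hypothetical names and cutoffs; it is not the early-exit technique from the cited paper, which stops scoring a document partway through an additive ensemble.

```python
def cascade_rank(docs, stages):
    """Run documents through progressively more expensive rankers.

    docs   -- list of candidate documents (any objects)
    stages -- list of (score_function, keep) pairs, cheapest first;
              after each stage only the top `keep` documents survive
    """
    candidates = list(docs)
    for score, keep in stages:
        candidates = sorted(candidates, key=score, reverse=True)[:keep]
    return candidates


if __name__ == "__main__":
    # Hypothetical documents with precomputed scores for each stage.
    docs = [
        {"id": "a", "static": 0.9, "bm25": 1.0, "full": 0.20},
        {"id": "b", "static": 0.8, "bm25": 3.0, "full": 0.90},
        {"id": "c", "static": 0.1, "bm25": 9.0, "full": 1.00},
        {"id": "d", "static": 0.7, "bm25": 2.0, "full": 0.95},
    ]
    stages = [
        (lambda d: d["static"], 3),  # prune by static rank
        (lambda d: d["bm25"], 2),    # cheap dynamic features
        (lambda d: d["full"], 2),    # full formula on the survivors
    ]
    print([d["id"] for d in cascade_rank(docs, stages)])
```

Note the trade-off the example exposes: document "c" would have won under the full formula but is pruned by the static stage, so the cutoffs have to be tuned against the quality lost.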
Geo-dependent queries: pFound
- a big jump in quality in 2009
- 3x more local results than the #2 player in Russia
Lessons
- MLR is the only way to do regional search: it gives us the possibility of tuning many geo-specific models at the same time.
Challenges
Complexity of the models is increasing rapidly
-> don't fit into memory!
MLR in its current setting does not fit time-specific queries well
-> features of the fresh content are very sparse and temporal
Opacity of results of the MLR
- The flip side of ML
Number of features grows faster than the number of judgments
-> hard to train ranking
Learning from clicks and user behavior is hard
Tens of GB of data per day!
Yandex and IR
- Participation and Support
- Yandex MLR at IR context