Thursday, July 22

SIGIR 2010 Industry Day: Machine Learning in Search Quality at Yandex

Machine Learning in Search Quality at Yandex
Ilya Segalovich, Yandex

Russian Search Market
- Yandex has 60+% market share
- It's all about attention to the small details of search

A Yandex overview
- started in 1997
- No. 7 search engine in the world by number of queries
- 150 million queries per day

Variety of Markets
- 15 countries with Cyrillic alphabet
- 77 regions in Russia
-> different culture, standard of living, average income, for example: Moscow, Magadan
-> large semi-autonomous ethnic groups (Tatar, Chechen, Bashkir)
-> neighbouring bilingual markets

Geo-specific queries
- Relevant result sets vary significantly across regions and countries

pFound
- a probabilistic measure of user satisfaction
- optimization goal at Yandex since 2007
- Similar to ERR, Chapelle 2009 --> hopefully someone can fill in the exact formula
- pFound, pBreak, pRel
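
The exact formula was not captured in these notes. As a hedged sketch, the recursive form usually given for pFound is: pLook[1] = 1, pLook[i] = pLook[i-1] * (1 - pRel[i-1]) * (1 - pBreak), and pFound = sum over i of pLook[i] * pRel[i]. The function name, argument names, and the default pBreak of 0.15 below are assumptions for illustration, not values taken from the talk.

```python
def pfound(p_rel, p_break=0.15):
    """Probability that the user finds a relevant result in a ranked list.

    p_rel   -- per-position relevance probabilities, top result first
    p_break -- probability of abandoning the list after an unsatisfying
               result (0.15 is an assumed value, not from the talk)
    """
    p_look = 1.0  # the first result is always examined
    total = 0.0
    for rel in p_rel:
        total += p_look * rel
        # The user reaches the next position only if this result did not
        # satisfy them and they did not give up.
        p_look *= (1.0 - rel) * (1.0 - p_break)
    return total


# A perfect first result makes everything below it irrelevant to the score.
perfect_top = pfound([1.0, 0.5])
# A useless first result costs the list the abandonment probability.
useless_top = pfound([0.0, 1.0])
```

Because pLook is built up position by position, the probability of reaching a result depends on all results above it, even though each step of the recursion only mentions the previous one.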

Geo-specific Ranking
query -> query + user's region
- may need to build a specific formula for each country/region because of the variance and the missing or sparse features in some of them.
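
One way to picture "query + user's region" with per-region formulas is a lookup that falls back from the most specific region to a broader one. This is only an illustrative sketch: `pick_ranker`, the toy scoring functions, and the feature names `text_match` and `is_local` are hypothetical and do not describe Yandex's actual system.

```python
def pick_ranker(rankers, region_path, default="default"):
    """Return the most specific ranker available for a user's region.

    rankers     -- dict mapping a region name to a ranking function
    region_path -- regions from most to least specific,
                   e.g. ["Magadan", "Russia"]
    """
    for region in region_path:
        if region in rankers:
            return rankers[region]
    return rankers[default]


# Toy rankers: each scores a document given as a dict of features.
rankers = {
    "default": lambda doc: doc["text_match"],
    "Russia": lambda doc: doc["text_match"] + 0.5 * doc["is_local"],
    "Moscow": lambda doc: doc["text_match"] + 1.0 * doc["is_local"],
}

doc = {"text_match": 0.4, "is_local": 1.0}
moscow_score = pick_ranker(rankers, ["Moscow", "Russia"])(doc)
# No Magadan-specific formula exists here, so the Russia-wide one is used.
magadan_score = pick_ranker(rankers, ["Magadan", "Russia"])(doc)
```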

Alternatives in Regionalization
- separate local indices or unified index with geo-coded pages
- one query or region-specific query
- query-based local intent detection vs. results-based local intent detection
- single ranking function vs. co-ranking and re-ranking of local results
- train one formula or train many formulas on local pools

Why use MLR?
Machine learning as a conveyor
- Some query classes require specific ranking
- many features

A learning method
- boosted decision tree, "oblivious" trees.
- optimize for pFound
- solve regression tasks, train classifiers
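
For readers unfamiliar with the term: an oblivious tree applies the same (feature, threshold) test to every node at a given depth, so a tree of depth d is just d comparisons that together pick one of 2^d leaves. The sketch below shows only how such a tree and a boosted sum of them could be evaluated; the function names and data layout are assumptions, and training (including optimizing for pFound) is not shown.

```python
def oblivious_tree_predict(splits, leaf_values, features):
    """Evaluate one oblivious decision tree.

    splits      -- list of (feature_index, threshold), one per tree level
    leaf_values -- list of 2 ** len(splits) leaf outputs
    features    -- feature vector of the document being scored
    """
    leaf = 0
    for feature_index, threshold in splits:
        # Each level contributes one bit to the leaf index.
        bit = 1 if features[feature_index] > threshold else 0
        leaf = (leaf << 1) | bit
    return leaf_values[leaf]


def ensemble_predict(trees, features):
    """Sum the outputs of boosted trees; each tree is (splits, leaf_values)."""
    return sum(
        oblivious_tree_predict(splits, leaves, features)
        for splits, leaves in trees
    )


# A depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
tree = ([(0, 0.5), (1, 2.0)], [0.0, 0.1, 0.2, 0.3])
score = oblivious_tree_predict(*tree, [0.7, 1.0])  # bits 1, 0 -> leaf 2
```

Since every level shares one test, evaluation needs no branching on tree structure, which is part of why such trees are cheap to apply to many documents.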

Complexity of ranking formulas
20 bytes - 2006
14 KB - 2008
220 KB - 2009
120 MB - 2010

A sequence of more and more complex rankers
- pruning with the static rank (static features)
- use of simple dynamic features (such as BM25)
- complex formula that uses all the features available
- potentially up to a million matrices/trees for the very top documents
- see Cambazoglu et al., 2010, on early exit optimization
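
The sequence above can be read as a cascade: cheap scorers prune the candidate set so that expensive ones only see the survivors. The following is a minimal sketch of that idea under assumed names (`cascade_rank`, the `static`/`bm25`/`full` fields, and the cut-off sizes are all hypothetical); it is a plain staged cut-off, not the early-exit technique from the cited paper.

```python
def cascade_rank(docs, stages):
    """Rank documents with a sequence of increasingly expensive scorers.

    docs   -- list of candidate documents
    stages -- list of (score_fn, keep) pairs, cheapest first; after each
              stage only the `keep` best documents move on
    """
    survivors = list(docs)
    for score_fn, keep in stages:
        survivors = sorted(survivors, key=score_fn, reverse=True)[:keep]
    return survivors


# Precomputed toy scores stand in for the three kinds of rankers.
docs = [
    {"id": "a", "static": 0.9, "bm25": 0.2, "full": 0.1},
    {"id": "b", "static": 0.8, "bm25": 0.9, "full": 0.7},
    {"id": "c", "static": 0.7, "bm25": 0.8, "full": 0.9},
    {"id": "d", "static": 0.1, "bm25": 1.0, "full": 1.0},
]
stages = [
    (lambda d: d["static"], 3),  # prune by static rank
    (lambda d: d["bm25"], 2),    # simple dynamic feature
    (lambda d: d["full"], 2),    # complex formula on the survivors only
]
top_ids = [d["id"] for d in cascade_rank(docs, stages)]
```

In this toy data document "d" has the best full score but is dropped by the static-rank stage, which shows the trade-off a cascade makes between cost and recall.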

Geo-dependent queries: pFound
- a big jump in 2009 in Quality
- 3x more local results than competitors in Russia (i.e., than the #2 player)

- MLR is the only key to regional search: it gives us the possibility of tuning many geo-specific models at the same time.

Complexity of the models is increasing rapidly
-> don't fit into memory!

MLR in its current setting does not fit time-specific queries well
-> features of the fresh content are very sparse and temporal

Opacity of results of the MLR
- The flip side of ML

Number of features grows faster than the number of judgments
-> hard to train ranking

Learning from clicks and user behavior is hard
Tens of GB of data per day!

Yandex and IR
- Participation and Support
- Yandex MLR at IR context


  1. Nice description, thank you!
    In regard to the exact pFound formula, it is interesting that I can see two versions of it.
    In their older paper (it's in Russian, but one can figure it out, just see p. 4):
    it uses a probability that a user is not satisfied with just the previous link.
    In their SIGIR presentation, pFound uses the probability that the user is not satisfied with all previous results. Looks strange, but hopefully they will clarify this.

  2. Sorry for the confusion, it is indeed the same formula. I just overlooked the recursion.