Wednesday, February 24

Yahoo! Learning to Rank Challenge at ICML 2010

Yahoo! has announced a Learning to Rank challenge as part of the Learning to Rank Workshop at ICML 2010.

They are releasing (to participants) two large real-world datasets. The first dataset has:
29,921 queries
744,692 URLs
519 features

For details on the second set, see the website.

The URLs are rated on a graded scale, 0 (irrelevant) to 4 (perfect). The evaluation will use Normalized Discounted Cumulative Gain (NDCG) and Expected Reciprocal Rank (ERR).
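
For readers who haven't used these two metrics, here is a minimal sketch of how they are commonly computed for a single query. It is not the challenge's official scoring code. I'm assuming the usual formulations: a gain of 2^grade - 1 with a log2 rank discount for NDCG, and the cascade-style definition for ERR. The cutoff k and the max_grade value are parameters I've added for illustration; set max_grade to the top of whatever grade scale the labels use.

```python
import math


def dcg(grades, k=None):
    """Discounted cumulative gain: gain 2^g - 1, discounted by log2(rank + 1)."""
    top = grades if k is None else grades[:k]
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(top, start=1))


def ndcg(grades, k=None):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0


def err(grades, max_grade, k=None):
    """Expected reciprocal rank under a cascade model of the user.

    A result with grade g satisfies the user with probability
    (2^g - 1) / 2^max_grade; otherwise the user moves on to the next rank.
    """
    top = grades if k is None else grades[:k]
    p_continue = 1.0  # probability the user reaches the current rank
    score = 0.0
    for rank, g in enumerate(top, start=1):
        p_satisfied = (2 ** g - 1) / (2 ** max_grade)
        score += p_continue * p_satisfied / rank
        p_continue *= 1.0 - p_satisfied
    return score


if __name__ == "__main__":
    # Hypothetical grades for one query, listed in the order a ranker returned them.
    ranked_grades = [3, 2, 0, 1]
    MAX_GRADE = 4  # assumption: top of the grade scale; adjust to the data
    print("NDCG:", ndcg(ranked_grades))
    print("ERR: ", err(ranked_grades, MAX_GRADE))
```

Both functions take the editorial grades in ranked order, so scoring a submission comes down to sorting each query's URLs by predicted score and averaging the per-query values.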

The set includes only query and URL identifiers, not the original queries or pages, so engineering new features seems unlikely.

The competition begins March 1st and goes through May 31st.

Wired Article on Google's Algorithm: Thoughts on Synonyms

I haven't been writing much recently. I was a bit burnt out after paper season. I submitted a short paper on synonym recognition to ACL 2010. I hope to share more on that in the future. On the topic of synonyms, the recent Wired article, "How Google's Algorithm Rules the Web," briefly mentions Google's synonym recognition algorithm.

Towards the middle of the article, Amit Singhal talks about synonyms. The first part covers the straightforward mappings identified from query reformulations. I think the more interesting case is when you don't have millions of those to learn from. Then you can use the information in the web documents themselves. Here's the relevant section:
Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context... “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
This type of query-sensitive synonym usage is quite important for web retrieval.

See also my previous post on Google's synonym effectiveness and their recent patent on using query context for determining synonyms.