Brady Forest has a two-part interview with lead Bing PM Sanaz Ahari. Part I introduces the query categorization system; in Part II, Sanaz goes into more detail on how the systems work. Both are high-level and not very technical, but they offer an interesting look at practical applications of query classification and clustering.
One interesting area that I want to highlight is that they are taking a broad category and starting to model the classes of entities that apply to the domain:
And we already have abilities to classify queries into domains and understand, okay, this query is a music query or this query is health query, et cetera, et cetera. And so the other problems that fall out of that is, okay, when people do do health queries, what are the categories that fall out of that? Like how do we know that people are going to care about diseases and symptoms, et cetera, et cetera. And then the next problem after that is how do we know that we have a comprehensive understanding of all diseases?
It sounds like they're doing some of it by hand, or at least in a semi-supervised manner. They don't go into details, but they mention the usual suspects: Wikipedia, query logs, and document extraction.
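To make the two-step idea from the quote concrete (first pick a domain, then find the classes of entities within it), here is a minimal Python sketch. It is only an illustration and not how Bing does it: the keyword sets, the gazetteer, and the function names `classify_domain`, `tag_entities`, and `categorize` are all hypothetical, and a real system would presumably learn these lists from sources like query logs and Wikipedia rather than hard-code them.

```python
# Toy two-stage query categorizer. All lexicons below are made-up examples,
# not data or methods from Bing.

# Stage 1 lexicon (hypothetical): words that hint at a domain.
DOMAIN_KEYWORDS = {
    "health": {"symptoms", "treatment", "diabetes", "flu", "rash"},
    "music": {"lyrics", "album", "song", "concert", "band"},
}

# Stage 2 gazetteer (hypothetical): entity classes within each domain.
ENTITY_CLASSES = {
    "health": {
        "disease": {"diabetes", "flu"},
        "symptom": {"rash", "fever", "headache"},
    },
    "music": {
        "artist": {"radiohead", "beyonce"},
    },
}


def classify_domain(query):
    """Return the domain whose keyword set overlaps the query most, or None."""
    tokens = set(query.lower().split())
    best, best_score = None, 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        score = len(tokens & keywords)
        if score > best_score:
            best, best_score = domain, score
    return best


def tag_entities(query, domain):
    """Map each query token found in the domain's gazetteer to its entity class."""
    tokens = query.lower().split()
    classes = ENTITY_CLASSES.get(domain, {})
    return {
        tok: cls
        for tok in tokens
        for cls, names in classes.items()
        if tok in names
    }


def categorize(query):
    """Run both stages; queries with no domain match stay uncategorized."""
    domain = classify_domain(query)
    if domain is None:
        return None, {}
    return domain, tag_entities(query, domain)
```

Calling `categorize("flu symptoms and fever")` picks the health domain and tags "flu" as a disease and "fever" as a symptom, while a query like "cheap flights" matches no domain and is left uncategorized. The gazetteer is exactly where the coverage problem from the quote shows up: any disease missing from the list simply goes untagged.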
One interesting note is that currently 20 percent of Bing's queries have a categorized experience. It sounds like there is still a long way to go.