Friday, December 10

Seeking Summer Internship Opportunities

I am beginning to explore internship opportunities for the summer of 2011.

I am an applied researcher whose interests are search in specialized domains and building search tools to solve complex information needs. My experience includes search in the engineering domain, medical search, local business objects, food and recipes, and information extraction on book data. My work often involves processing of large datasets using distributed processing frameworks such as Hadoop and PIG.

I am looking for opportunities that fit with my background and preferably include research that could lead to a publishable paper in a major conference.

If you know of any opportunities that would be appropriate, please contact me via email. My CV is available from my website (Word, PDF).

Several of my fellow PhD students here in the CIIR are also seeking internships, so I would be happy to pass along any appropriate opportunities.

Wednesday, December 8

New Book: Mining of Massive Datasets

Anand Rajaraman and Jeffrey D. Ullman have put together a new ebook, Mining of Massive Datasets. The book builds on the course materials for the Stanford CS345 course "Web Mining" and the CS246 class, Mining Massive Data Sets.

From the ToC, the book covers:
  1. An introduction to data mining
  2. Large-scale processing with distributed file systems and MapReduce
  3. Similarity search: nearest neighbor, minhashing, LSH, etc...
  4. Algorithms for mining streaming data
  5. (Web) Graph analysis: Pagerank, HITS, and spam detection
  6. Frequent Itemset algorithms
  7. Clustering Algorithms
  8. Advertising on the web
  9. Recommendation Systems
It is an interesting blend of materials that are not usually taught together. I look forward to examining it in more detail.
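As a taste of the similarity search chapter, the core minhashing trick can be sketched in a few lines of Python (a toy illustration under my own assumptions, not code from the book):

```python
import random

def minhash_signature(shingles, num_hashes=100, seed=0):
    """Approximate a set with num_hashes min-hash values. The fraction of
    matching positions between two signatures estimates the Jaccard
    similarity of the underlying sets."""
    rng = random.Random(seed)
    p = 2**31 - 1  # a prime larger than any hashed value we produce
    # One random linear hash function h(x) = (m*x + c) mod p per position.
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((m * (hash(s) % p) + c) % p for s in shingles)
            for m, c in params]

def estimated_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"the quick", "quick brown", "brown fox"}
b = {"the quick", "quick brown", "brown dog"}
# True Jaccard similarity is 2/4 = 0.5; the estimate should be close.
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

With more hash functions the estimate tightens; LSH then bands these signatures to find candidate pairs without comparing everything to everything.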

Tuesday, December 7

Barriers to Entry in Search Getting Lower

The Mims's Bits column in the MIT Tech Review has an article, You, Too Can Be the Next Google. In the article, Tom Annau, the CTO of blekko (see my previous post), argues that computing power is growing faster than the amount of 'useful' and 'interesting' content on the web.
"Web search is still an application that pushes the boundaries of current computing devices pretty hard," says Annau. But Blekko accomplishes a complete, up-to-the-minute index of the Web with less than 1000 servers...
To be more efficient, they are more careful about what they crawl by:
  1. Avoiding crawling spam and splog content
  2. Using a "split-crawl" strategy that refreshes different genres of content at different rates to ensure that blogs and news are refreshed often.
I'm not sure blekko's "efficiency" techniques are particularly interesting or novel. However, I do think that overall the ability to crawl and index the entire web is getting easier, especially with distributed crawlers (like Bixo).
"Whether we succeed or fail as a startup, it will be true that every year that goes by individual servers will become more and more powerful, and the ability to crawl and index the useful info on the Web will actually become more and more affordable," says Annau.
The 2008 paper by Mei and Church, Entropy of search logs: how hard is search? with personalization? with backoff?, analyzed a large search engine log to determine the size of this 'interesting' part of the web. They find that they can encode the URLs from search logs using approximately 22 bits, that is, on the order of millions of pages. As they say,
Large investments in clusters in the cloud could be wiped out if someone found a way to capture much of the value of billions with a small cache of millions.
In principle, if you knew these pages and had a way of accurately predicting which ones change, then the price of search could be significantly reduced. In the paper, they go on to highlight that a personalized page cache, or one based on profiles of similar users, offers an even greater opportunity. In short, there is great opportunity for small, very personalized verticals.
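The arithmetic behind that claim is worth spelling out: 22 bits of entropy corresponds to roughly 2^22 distinct pages, a tiny fraction of a multi-billion-page index. (The 10-billion-page figure below is my own illustrative assumption.)

```python
bits = 22                 # Mei & Church's estimate for encoding clicked URLs
pages = 2 ** bits
print(f"2^{bits} = {pages:,} pages")  # 2^22 = 4,194,304 pages

full_index = 10 ** 10     # assume a 10-billion-page web index for comparison
ratio = full_index // pages
print(f"the cache would be ~1/{ratio:,} the size of the full index")  # ~1/2,384
```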

I think the main reason that blekko needs a modest number of servers is that its query volume is small. One of the key reasons that Google and other web search engines need thousands and thousands of computers is to support very fast query latency for billions of queries per day from hundreds of millions of users around the world. To pull this off, Google keeps its search index in memory (see Jeff Dean's WSDM 2009 keynote).

Wednesday, December 1

Google Fixes Problem

This afternoon Amit Singhal, from Google Search Quality, wrote a blog post about how Google fixed the recent fiasco. The story was broken by the NY Times article exposing how a disreputable merchant gained high ranking by being mean to his customers. He gained links and reputation by being written up negatively on many popular and important sites.

What's interesting about Amit's post is the insight into how Google's team approached solving the problem.

1) Simply blocking the site would not solve the underlying problem.
2) Sentiment analysis wouldn't solve the problem because the reputation was coming from neutral news sites with solid reputations. Google has not yet found a useful way to incorporate sentiment into ranking.
3) Exposing the reviews and ratings next to the results would not actually alter the ranking.

Their new, undisclosed fix detected the problematic merchant and several hundred other bad apples.

Tuesday, November 30

Lectures on User Behavior Modeling and Implicit Feedback from Query Logs

I am one of the TAs for the graduate IR course at UMass this semester. I recently gave two lectures on modeling user behavior and utilizing implicit user feedback from logs.

User Behavior Modeling. I covered models of information seeking behavior. Then, I went over the Google 3M (micro-, meso-, and macro-) characterizations of interactions. We looked at how we learn about these various levels of interactions through field and lab studies, instrumented panels, and query logs.

Implicit User Feedback. We finished up query log analysis, including query classification and applications like disambiguation and trend detection. Most of the time was spent on interpreting clickthrough and browsing behavior to generate preference and relevance data.

If you want to learn more, much of the lecture material builds on Eugene Agichtein's WWW 2010 tutorial on Inferring User Intent. For more detail, their intent project is a good place to start.

Call for participation of Academic IR community in Lucene

Otis Gospodnetic, a committer on the Lucene project, put out a call on the SemaText blog for greater engagement of academia with the open source Solr/Lucene community. In particular, he is seeking ideas for advanced topics that would be worthy of an MS/PhD thesis and that could be implemented and contributed to the community.

If you have ideas, please add them to the public idea spreadsheet he started. I strongly encourage you to go there and contribute.

Lucene is the most widely used search engine library. If important new academic ideas that improve retrieval get incorporated, the impact would be huge.

However, historically, the Lucene community and academia have been kept very separate. Instead, research teams have developed their own systems; the fragmentation is apparent if you look at my list of open source search libraries.

Lucene's ranking algorithms are dated, and the system is inflexible and difficult to change. Because it is so widely adopted, it is hard to modify and extend in radical ways. If academia is going to get involved, some of these issues need to be addressed; a lot of it is straightforward engineering work that would make Lucene a better research platform.

Thursday, November 4

Susan Dumais CIKM 2010 Keynote: Temporal Dynamics in Information Retrieval

I am still catching up on a backlog of items from last week.

Here are more of Michael's notes from Susan Dumais' keynote presentation at CIKM 2010 that addressed the impact of time on web search. Gene also has his notes from the presentation.
  • Change in IR
    • New documents and queries
    • Query volume changes seasonally/periodically
    • Document content changes over time
    • User interactions change over time (e.g., anchor text, page visits)
    • Relevant documents for a query change over time, e.g., “Hurricane Earl” (Sept. 2010 vs. before/after)
    • But -> evaluation corpora are usually static

  • Digital dynamics are relatively easy to capture, however tools for interacting with information are static (Browsers/search engines)

  • Characteristics of Web page change
    • Measuring web page change in a large web crawl
    • 33% of web pages changed over a period of 11 weeks
    • 66% of visited pages changed over 5 weeks, 63% changed every hour
    • Avg. time between changes – 123 hr.
    • .com pages change more often than .gov and .org pages
    • Knot point – the place on the change curve where the page stabilizes over time; Characterizes the way pages change
    • Term-level changes
      • Looking at characteristic terms for the page and their “staying power”, e.g. “cookbooks” & “ingredients” have high staying power, while “barbeque” is more transient

  • Revisitation Patterns on the Web
    • 60-80% of the pages you visited, you’ve already seen before
    • 4 revisit patterns:
      • Fast - Navigation within site
      • Hybrid - High quality fast pages
      • Medium - Popular homepages/mail & web applications
      • Slow - Entry pages, bank pages, accessed via search engines

  • Revisitations & Search (Teevan et al, SIGIR 2007, Tyler et al., WSDM 2010)
    • Repeat query 33%
    • Repeat click 39%

  • Relationships between revisits and change (Adar et al., CHI 2009)
    • Monitor change
    • Revisitation is not necessarily related to change
    • Change can interfere with re-finding
    • The more visitors the page has, the more often it changes
    • Three example pages:
      • Similar change patterns, but different revisit patterns:
      • NYT – fast revisit
      • Woot – medium revisit
      • Costco – slow revisit

  • Diff-IE – Building support for understanding change
    • Browser toolbar that highlights content that was changed since the last visit
    • Non-intrusive and personalized --- changes that are of interest to you, not to the publisher of the page
    • Helps to uncover unexpected important content
    • Facilitates serendipitous encounters
    • Helps to understand page dynamics
    • Will be publicly available later this month from
    • Research surveys show that Diff-IE drives more revisitation
      • Driving visits to pages that change frequently

  • Leveraging Temporal Dynamics for IR (Elsas & Dumais, WSDM 2010)
    • Use document change rate to set document priors
    • Use term longevity to weight terms
    • Evaluation using static data
      • Using 2k navigational queries
      • Dynamic model outperforms the static baseline

    • Ongoing evaluation collection (Understanding Temporal Query Dynamics, to appear in WSDM 2011)
      • Collect relevance judgments over time, e.g. “march madness” query
      • Document relevance changes over time
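The Elsas & Dumais idea of using change rate to set a document prior can be sketched as a simple log-linear score. This is a hedged sketch of the intuition only; the `alpha` weight and prior form are my assumptions, not their actual model:

```python
import math

def dynamic_score(query_terms, doc_term_probs, change_rate, alpha=0.1):
    """Score = prior + likelihood, where the document prior rises with the
    document's change rate (a sketch of the idea in Elsas & Dumais,
    WSDM 2010; the functional form here is illustrative)."""
    log_prior = alpha * math.log(1.0 + change_rate)
    log_likelihood = sum(math.log(doc_term_probs.get(t, 1e-6)) for t in query_terms)
    return log_prior + log_likelihood

# Two documents with identical term statistics; the frequently
# updated one receives a higher prior and thus a higher score.
probs = {"march": 0.02, "madness": 0.01}
static_page = dynamic_score(["march", "madness"], probs, change_rate=0.1)
news_page = dynamic_score(["march", "madness"], probs, change_rate=5.0)
print(news_page > static_page)  # True
```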

Wednesday, November 3

Yahoo! Open Sources S4 Real-Time MapReduce framework

Today Yahoo! announced the release of a new real-time MapReduce framework written in Java called S4. From the website,
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
For more technical details you can read the technical overview or check out the code on github. The example application keeps counts of hash tags in a Twitter stream.
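S4's example application (counting hashtags in a tweet stream) boils down to keyed counting over an unbounded stream. Below is a single-machine Python analogue of that logic, not the S4 API itself, which is Java and event-driven:

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def count_hashtags(tweet_stream):
    """Consume an iterator of tweets, maintaining running hashtag counts.
    In S4 this logic would be split across keyed Processing Elements;
    here a single Counter stands in for the distributed state."""
    counts = Counter()
    for tweet in tweet_stream:
        for tag in HASHTAG.findall(tweet.lower()):
            counts[tag] += 1
        yield counts  # emit the running totals after each event

stream = ["Loving #Hadoop today", "#hadoop and #s4 scale", "more #S4"]
for snapshot in count_hashtags(stream):
    pass
print(snapshot.most_common(2))  # [('#hadoop', 2), ('#s4', 2)]
```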

The framework was previously announced at a Y! lab event which discussed processing in Y!'s advertising platform.

Google Open Sources Sawzall

Google today open sourced sawzall, see the original publication. From its description,
Sawzall is a procedural language developed for parallel analysis of very large data sets (such as logs). It provides protocol buffer handling, regular expression support, string and array manipulation, associative arrays (maps), structured data (tuples), data fingerprinting (64-bit hash values), time values, various utility operations and the usual library functions operating on floating-point and string values. For years Sawzall has been Google's logs processing language of choice and is used for various other data analysis tasks across the company.
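Sawzall programs process one record at a time and emit values into aggregator tables that the runtime combines across machines. Here is a single-machine Python analogue of that emit/aggregate style, over a hypothetical log format (not Sawzall syntax):

```python
from collections import defaultdict

def analyze(log_lines):
    """Process one record at a time, 'emitting' into aggregators.
    Each hypothetical log line is: '<hour> <path> <bytes>'."""
    hits_per_hour = defaultdict(int)   # like: table sum[hour: int] of count: int
    bytes_total = 0                    # like: table sum of bytes: int
    for line in log_lines:
        hour, _path, nbytes = line.split()
        hits_per_hour[int(hour)] += 1  # emit hits_per_hour[hour] <- 1
        bytes_total += int(nbytes)     # emit bytes_total <- bytes
    return dict(hits_per_hour), bytes_total

logs = ["13 /index.html 512", "13 /about.html 1024", "14 /index.html 256"]
print(analyze(logs))  # ({13: 2, 14: 1}, 1792)
```

The appeal of the real thing is that the per-record logic is trivially parallelizable, with the runtime handling the cross-machine aggregation.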

Components of Compelling Vertical Search

In this post, I will discuss key components of successful topic-specific vertical search. I was motivated to write it by the launch of blekko earlier this week.

Blekko is marketing its ability to slice the web up into verticals using slashtags. Blekko's slashtags define a list of hosts or pages used to focus a search. But that alone is not enough to be successful. Search in a vertical needs to provide a significantly different experience from general web search. A compelling vertical search engine has the following key components:
  1. Vertical-specific ranking. A focused topic should define and utilize ranking features unique to the vertical. It may be as simple as the topical classification score for a page. It often requires applying information extraction to identify meaningful document fields. It should also leverage vertical-specific static rank features, for example, a technique like topic-specific PageRank, an author/source popularity score, or other features.

  2. Rich results. The result objects should be presented in a way that uses the structured and semantic information from the topic. For example, simple examples of this include presentations that use data from Google Rich Snippets and SearchMonkey. This may include topic-specific metadata like authors, political perspectives, addresses, or aggregated user rating scores.

  3. Faceted UI. A vertical should exploit structured metadata for exploratory search. It should allow you to flexibly combine keyword search and structured attribute restriction to limit the search space by: price, airline, manufacturer, genre, date, etc... See the CHI 2006 tutorial and the relevant section from Marti Hearst's Search UI book on eBay Express.

  4. Domain knowledge. A restricted topical domain should model important relationships between objects and concepts to improve retrieval. For example, it should use a Freebase-like knowledge base of objects and their attributes. In a recipe search engine, it would model ingredients and relationships such as contains:gluten or is kosher.

  5. Task Modeling. A key benefit of a narrow domain is that it should allow users to accomplish complex tasks more easily. It should provide tools and interfaces to more directly allow users to get things done.
Of course, it needs to keep up with web search engines in ranking, comprehensiveness, and freshness, which are all key components of search quality.
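To make component 1 concrete, a vertical ranker might blend the base retrieval score with a topical classification score and a vertical-specific static rank. The linear form and weights below are purely illustrative assumptions, not any engine's actual formula:

```python
def vertical_score(base_score, topic_score, static_rank, w_topic=0.3, w_static=0.2):
    """Blend a base retrieval score with vertical-specific features.
    Weights are illustrative; in practice they would be learned from
    vertical-specific relevance judgments."""
    w_base = 1.0 - w_topic - w_static
    return w_base * base_score + w_topic * topic_score + w_static * static_rank

# A recipe page with a strong topical classification score can outrank
# a generic page with a slightly better text-match score.
generic = vertical_score(base_score=0.80, topic_score=0.10, static_rank=0.50)
recipe = vertical_score(base_score=0.75, topic_score=0.95, static_rank=0.60)
print(recipe > generic)  # True
```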

For more of my thoughts on these issues, you can see the slides from my ECIR 2008 Industry Day talk, The Challenge of Engineering Vertical Search.

Overall, creating a compelling vertical experience currently requires a lot of hard work and painstaking curation. It requires a deep understanding of the tasks that users perform. It requires modeling the topic and domain objects in meaningful ways. Combining these elements together is difficult to do well. It is extremely hard to do at the scale of the entire web across all topics.

Monday, November 1

Blekko Launches: Brings Transparency to Relevance Ranking

blekko launched its public beta on Monday. blekko is a new web search engine that focuses on creating an open and transparent process around search engine relevance ranking. blekko is attempting to differentiate itself as the open alternative to "closed" search engines and involve greater public participation in the ranking process.

Creating blekko is an impressive feat because they built their own system from the ground up to crawl, index, and rank a multi-billion page search index. This is hard to do well. They have accomplished a lot in a short period of time, so I am excited about the changes we'll see as they evolve. I hope that they will take the risks that other search engines can't afford. One of their risky moves is opening up their ranking features.

Open vs. Closed Ranking
Google's "closed policy" is a difficult issue that has garnered significant criticism. For example, at ECIR 2008, in a Q&A with Amit Singhal, Daniel Tunkelang questioned the need to rely on security through obscurity. (For an updated perspective now that Daniel works at Google, read his recent post on Google and Transparency.) In response to an EU inquiry, Amit Singhal laid out the underlying philosophy of Google's ranking,
  1. Algorithmically-generated results.
  2. No query left behind.
  3. Keep it simple.
Although Google uses signals from humans in the form of links and click through on search results, it does not actively involve humans in the search process. Blekko is going to be different.

As a first step toward involving users in ranking, blekko allows users to define their own search experience using "slashtags". Founder Rich Skrenta describes this in a recent blog post on crowdsourcing relevance,
We're starting by letting users define their own vertical search experiences, using a feature we call slashtags. Slashtags let all of the vertical engines that people define on blekko live within the same search box. They also let you do a search and quickly pivot from one vertical to another.
You can contrast Google's philosophy with Blekko's; here are the first 3 of 11 points in Rich Skrenta's post,
  1. Search shall be open
  2. Search results shall involve people
  3. Ranking data shall not be kept secret
  4. ...
A philosophy is great, but it doesn't matter if your results suck. Blekko just launched, so let's take a closer look.

Blekko's ranking
I tried blekko and it is a very solid first effort. To experiment, I re-ran a variety of searches from my Google web history. I didn't conduct thorough experiments, but my impression is that the ranking and coverage are very reasonable, though not as good as Google's or Bing's. SELand has a more comprehensive review with a side-by-side comparison with Google.

One frustration I encountered using blekko is that slashtags auto-fired, automatically restricting my query to a vertical that was too narrow. This limited scope caused key relevant results to be missed, and I manually backed off to /web several times. Slashtags add complexity, and that complexity leads to problems.

I'd like to point out a few queries where blekko's relevance particularly stumbled and could improve: [carbonell mmr] and [iron cook northampton]. Neither of these is an easy query. The first is somewhat ambiguous and the second is about a small local event. What I find hopeful about blekko is that I can begin to understand the underlying reason for failure. I clicked on "rank stats", or you can use the /rank tag, e.g. [carbonell mmr /rank]. For each result blekko also provides an "seo" link to see static rank features. As an IR researcher I find open access to this feature data very exciting. However, for the average searcher this level of detail is distracting and unnecessary. The "openness" features need to earn their real estate by being actionable, but right now they don't do that.

Instead of cluttering the search UI, I would like to see blekko be more open by providing the data through an API. It would let academics and searchers use this raw material to rerank results in new and novel ways.

On "Crowdsourcing relevance"
Slashtags are operators that can both restrict a search and change its ranking. They currently allow you to sort the results by /relevance or /date. Users can define slashes that tag hosts as relevant to a topic. I started a slashtag on information-retrieval. However, restricting a query to a set of hosts using slashtags is a bit like performing surgery with a chainsaw: in the end you are missing key bits. This approach has several problems:
  1. The granularity of hosts is too coarse. The amount of content relevant to a topic could be a single page or section of website.

  2. Recall. A slashtag cannot be maintained by people in real-time and will miss relevant content.

  3. The semantics of a slashtag are not well defined and it is not obvious how to combine them, e.g., combining a topic slashtag with a /date ranking.
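At its core, a slashtag restriction is a host whitelist applied as a filter, which makes the granularity and recall problems easy to see. The hosts and URLs below are my own illustrative picks, not blekko's actual data:

```python
from urllib.parse import urlparse

# A slashtag is, at its core, a curated set of hosts used as a filter.
IR_SLASHTAG = {"sigir.org", "trec.nist.gov", "lemurproject.org"}

def apply_slashtag(results, allowed_hosts):
    """Keep only results whose host is in the slashtag's list.
    Anything relevant on an unlisted host is silently lost (the recall
    problem), and whole hosts are kept or dropped (the granularity problem)."""
    return [url for url in results if urlparse(url).netloc in allowed_hosts]

results = [
    "http://sigir.org/sigir2010/",
    "http://example.edu/~student/ir-notes.html",   # relevant, but host not listed
    "http://trec.nist.gov/pubs.html",
]
print(apply_slashtag(results, IR_SLASHTAG))
# drops the example.edu page even though it is on-topic
```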
The claim is that slashtags reduce spam by limiting search to a restricted set of trusted websites. However, I don't find this argument very compelling. Search engines are quite good at identifying and incorporating implicit user feedback to reduce the impact of irrelevant (spammy) results. There needs to be a more compelling reason.

Using slashtags doesn't address several key issues in crowdsourcing ranking. First, they don't address the obvious need to involve people in making relevance assessments of the results in a systematic way. Second, the core of search ranking is determining which features indicate relevance for a query and how they should be combined; blekko is not currently surfacing a way to change either of those aspects.

It remains to be seen how you could really let users change the relevance in meaningful ways and more importantly, measure the utility to everyone. It may be that academia could play some role in creating and testing features.

blekko's "10 blue links" search UI feels outdated. Modern search engines are incorporating rich results into SERPs. The "Universal Search" results blend images, videos, maps, even places and people. I hope that this is an area where we see blekko evolve quickly to catch up.

I won't be switching to blekko for regular use. Still, I find the level of ranking information and features that they share very exciting and compelling. Because I can see the ranking pieces, I am compelled to jump in and help make things better. However, I question the utility of the information for average users and the ability to deeply engage the public in useful ways to improve ranking.

Slashtags are fun to play with, but are they useful? Slicing the web into groups creates mini vertical search experiences. However, using the tags adds complexity that may not be necessary most of the time. The value offered by the slashtag verticals is quite limited right now. I hope that slashtags will evolve to allow users to do more curation and add more value as blekko matures.

Wednesday, October 27

CIKM 2010 Jamie Callan Keynote: Search Engine Support for Software Applications

I am not at CIKM, but Michael Bendersky sent me his notes from Jamie Callan's keynote address. Gene also gave his writeup on the FXPal Blog.

Jamie Callan: Search Engine Support for Software Applications

  • Motivation: SE (search engine) as a "language DB"
    • Computer Assisted Language Learning
    • Q&A
    • Read-the-Web

  • IR typically assumes a "user" is a person

  • Software applications are a new challenging class of SE users

  • There are very low expectations from a SE from an application "user" perspective
    • E.g., SE's are mostly used for keyword search

  • The recall-precision tradeoff keeps SEs from using a highly structured query language (like Indri's)
    • BOW query - high recall/low precision
    • Structured query - low recall/high precision

  • Motivation II: using rich language/information resources
    • Wordnet, Freebase, Dbpedia, ...
    • SE's are not very good at using them

  • Structured queries and documents are well-studied IR topics, but
    • Do we really understand them?
    • Maybe the basic structures, but not the more advanced ones

  • Document = structured object
    • Metadata:
    • Fielded text: title, chapters, sections, references
    • Relations to other documents

  • Example application: REAP Project: Computer Assisted Language Learning
    • Find interesting documents/passages for students based on their language level
    • Use a structured Indri query language to find relevant documents or document parts

  • A typical approach to fields
    • Exact Boolean match on the attributes
    • Can be brittle.

  • Another type of document structure
    • Text annotations in documents (POS, semantic labeling, co-referencing)
    • Annotations can be considered to be "small fields"

  • Problems with retrieval with text annotations
    • Annotations are not always 100% accurate / ambiguous
      • Missing annotations
      • Wrong annotation boundaries
      • Conflated annotations: white/JJ house/NN should be white/NP house/NP

    • Term weighting in short fields is hard - need to take field length normalization into account.

    • Problem of multiple matches: combining evidence from different fields from the same type is not a solved problem.

  • Relations among documents/entities
    • Hyperlinks & RDF
    • XML

  • Relational Retrieval (Lao & Cohen 2010)
    • Example for use: journal recommendations, expert finding
    • Some parts of metadata are "domain knowledge" --- they really reside outside the documents.

    • How to model domain knowledge as an integral part of the documents
      • Have different types of documents: paper, journal, authors...
      • Have typed relations between the documents: transcribes, appears in, ...
      • Have an Indri-like query language to match documents and relations

  • Inferred knowledge: Read-the-Web project
    • How to integrate the accumulated knowledge in SE's
    • Entity search is one example
    • General purpose solutions are still in progress.
More CIKM coverage soon.

Monday, October 25

Ray Ozzie on the future of computing

Ray Ozzie is leaving Microsoft as Chief Architect. In a farewell memo, dawn of a new day, he points to the future,
Instead, to cope with the inherent complexity of a world of devices, a world of websites, and a world of apps & personal data that is spread across myriad devices & websites, a simple conceptual model is taking shape that brings it all together. We’re moving toward a world of 1) cloud-based continuous services that connect us all and do our bidding, and 2) appliance-like connected devices enabling us to interact with those cloud-based services....

It’s the dawn of a new day – the sun having now arisen on a world of continuous services and connected devices.

What does this shift imply for search? We are already seeing growth in mobile search. People are searching more because they have the capability. And these searches tend to be more local in nature because people are more often looking for actionable information now.

One possibility is what Eric Schmidt described as autonomous search. In this model the retrieval system is proactive, responding to queries but also actively notifying the user of changes in the environment. One might describe such a system as an "intelligent information agent".

Thursday, October 7

Twitter Launches Lucene Real-Time Search Architecture

The Twitter Search engineering team announced that they launched a New Twitter search Architecture. The previous system was based on the original Summize search system that Twitter acquired in 2008. The old technology was a MySQL-based system that became difficult to scale. I'm really amazed that they were able to make a MySQL-based search system work for this long.

According to the blog, the new search system was designed to handle over 1,000 TPS (tweets/sec) and 12,000 QPS (queries/sec), which is over 1 billion queries per day. Besides the challenging query volume, the data needs to be available quickly: a tweet needs to be searchable in less than 10 seconds.

They turned to Lucene, a popular open source search library. To meet their latency and query serving requirements they needed to make extensive modifications to Lucene's core,
That’s why we rewrote big parts of the core in-memory data structures, especially the posting lists, while still supporting Lucene’s standard APIs. This allows us to use Lucene’s search layer almost unmodified. Some of the highlights of our changes include:
  • significantly improved garbage collection performance
  • lock-free data structures and algorithms
  • posting lists that are traversable in reverse order
  • efficient early query termination
Their post says that these contributions will be rolled into Lucene and the Lucene realtime branch. It is a tantalizing overview and I would really like to see pointers to the details (e.g. JIRA issues).
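The reverse-traversable posting lists are the interesting part for real-time search: new tweets append at the end, and walking the list backwards yields newest-first results with early termination. A toy Python sketch of the idea, not Twitter's actual data structure:

```python
class RealtimePostings:
    """Toy append-only posting list traversable newest-first.
    Twitter's real implementation modifies Lucene's in-memory structures;
    this only illustrates why reverse traversal matters for
    recency-ordered search."""
    def __init__(self):
        self._doc_ids = []  # doc ids arrive in increasing (time) order

    def add(self, doc_id):
        self._doc_ids.append(doc_id)  # O(1) append, no re-sorting

    def newest_first(self, limit=None):
        # Walk the list backwards so the most recent tweets come out
        # first, enabling early termination after `limit` hits.
        out = []
        for i in range(len(self._doc_ids) - 1, -1, -1):
            out.append(self._doc_ids[i])
            if limit is not None and len(out) >= limit:
                break
        return out

p = RealtimePostings()
for doc in [101, 205, 309, 412]:
    p.add(doc)
print(p.newest_first(limit=2))  # [412, 309]
```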

The main benefit to users is that the new system is much more scalable and can support an index that is twice as large as previous versions which means that you can search for tweets further back in time.

An interesting parallel is that LinkedIn recently released LinkedIn Signal, which is a mashup of Twitter data with LinkedIn social network data for professionals. For details on how that system works, see their Signal Under the Hood post. One of the key components is the Zoie real-time search system built on top of Lucene.

Tuesday, September 28

Twitter Talk at UMass: Discovery and Emergence

Today Abdur, the Chief Scientist at Twitter, gave a talk, Discovery and Emergence, here at UMass. The talk was interactive with lots of questions. It was similar to the one he presented at the SIGIR Social Media Workshop last year. If you read it, be sure to read Jon's comment. Here are a few of my notes from the talk, which focused heavily on the Trending Topics feature.

A few key points to remember. First we have to keep in the forefront:
It's not about the technology, it's how it enriches our lives and makes it better.

The Data
160 million accounts. 90 million tweets per day. 16.7 GB of tweets. Over 1,000 tweets per second (TPS).
200,000 timeline requests per second, 3 GB/s of outbound data, 1 billion queries per day.

Tweets are searchable within seconds and the data is kept forever.

About 30% of search traffic is generated by clicks from trending topics.

In 1ms answer the following about a tweet:
- what language is this tweet?
- where was this tweet posted from?
- what are the entities in this tweet?

Every X min answer the following:
- Which tweets should you ignore?
- What topics are trending and where?

A key problem is how to evaluate the quality of trending topics. What makes one topic 'better' than another?

One of the coolest things I saw from the talk was the visualization of the World Cup tweets, which was on their blog, World Cup 2010: A Global Conversation. It was created by Miguel Rios, whose work you can check out on his website.

Abdur ended with an admonition to researchers to think about the impact of their work,
Why does your research matter? Will it make the world a better place?

Tuesday, September 21

ECML PKDD 2010 Data Challenge: Measuring Web Data Quality

Yesterday the ECML PKDD Discovery Challenge results were presented. See the website for the papers of the winning participants. The winning team used a bagged C4.5 decision tree trained on the provided features.
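Bagging itself is simple to sketch: train each learner on a bootstrap resample and majority-vote the predictions. Below, a toy one-feature stump stands in for C4.5, and the data is fabricated; this only shows the shape of the winning approach, not the team's features or code:

```python
import random

def train_stump(sample):
    """A one-level 'tree': threshold at the midpoint of the two class means
    (a toy stand-in for a real C4.5 information-gain split search)."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    mean0 = sum(zeros) / len(zeros) if zeros else 0.0
    mean1 = sum(ones) / len(ones) if ones else 0.0
    t = (mean0 + mean1) / 2
    hi = 1 if mean1 >= mean0 else 0
    return lambda x: hi if x > t else 1 - hi

def bagged_predict(data, x, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample, then majority-vote."""
    rng = random.Random(seed)
    stumps = [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]
    votes = sum(s(x) for s in stumps)
    return 1 if 2 * votes >= n_trees else 0

# Toy 'host quality' feature: low values ham (0), high values spam (1).
data = [(v / 10, 0) for v in range(1, 5)] + [(v / 10, 1) for v in range(6, 10)]
print(bagged_predict(data, 0.95), bagged_predict(data, 0.15))
```

The bootstrap resampling is what makes the ensemble more stable than any single (high-variance) tree.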

A high-level overview from the website describes the challenge:
In this year's Discovery Challenge we target at more and different aspects. We want to develop site-level classification for the genre of the web sites (editorial, news, commercial, educational, "deep web", or Web spam and more) as well as their readability, authoritativeness, trustworthiness and neutrality.
The challenge dataset consists of 23M pages from 99K hosts in the .eu domain. Read the assessment guidelines.

The competition involves three tasks, see the full description of tasks. Here is a summary:

1. Classification task (English)
  • Web Spam
  • News/Editorial
  • Commercial
  • Educational/Research
  • Discussion
  • Personal/Leisure
  • Neutrality: from 3 (normal) to 1 (problematic)
  • Bias: 1 flags significant problems
  • Trustiness: from 3 (normal) to 1 (problematic)

2. Quality task (English)
Quality is measured as an aggregate function of genre, trust, factuality, and bias; spam has the lowest (0) quality.

3. Multilingual quality task (German and French)
Same as task 2, but for non-English.

The interesting aspect of the challenge is that it moves away from spam/not spam labels to assessing more complex aspects of the quality of information.

Friday, September 17

A lesson in defining topic-based communities

There is a post on the Stack Overflow blog on how they are managing communities, Merging Season. At the heart of the discussion: what is the right size of domain for a topic-based community? They are against one giant community, as they say:
Yahoo! Answers. Monumentally popular, enormous traffic, and containing absolutely no useful information, Yahoo! Answers is actually more of a teenage chat room than a place to get real answers.
They also highlight failed attempts to bring the Ubuntu and Unix community sites together to make a single community. The process of defining a "topical community" reminds me of the problems we have in IR when we define a "topic based vertical" to apply domain knowledge in retrieval. From their blog:
Communities consist of concentric circles. You share more with people in the inner circle than you do with people in the outer circles, but if you were in a strange place, you’d seek out people even from the larger circles. If you’re building a community (or a Stack Exchange site), it’s not immediately obvious which level is going to work...
They are developing rules that use the size and degree of overlap between communities to guide the process. It will be interesting to see how this plays out and what lessons we can apply to IR.

Monday, September 13

Google Scribe: Autocomplete beyond queries

Overshadowed by Google Instant last week, a labs project called Google Scribe was launched. See some information on the help page.

Here is an example of what it did with the initial words, accepting all further suggestions:

Jeff Dalton is a researcher at the University of California at Berkeley...

An amusing example. I'm actually quite surprised it autocompleted "researcher" correctly. However, Scribe got the university wrong. It looks like UC Berkeley wins the web popularity contest.

Overall, Scribe appears to be a straightforward application of web n-gram language models wrapped in an AJAX interface. Some of its mistakes demonstrate the drawbacks of ignoring long-range word dependencies and topical context. Still, it is an interesting step toward richer autocompletion. I think there may be interesting opportunities to improve effectiveness by leveraging custom language models built from my other documents and web history.
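To illustrate the n-gram idea at toy scale, here is a hedged sketch of greedy bigram autocompletion; a real system would use web-scale n-grams with smoothing, and the corpus here is invented for illustration.

```python
from collections import defaultdict, Counter

# Toy bigram language model: always extend the text with the most
# probable next word, Scribe-style (but at a vastly smaller scale).
corpus = ("jeff dalton is a researcher at the university of massachusetts "
          "amherst and he is a student at the university of massachusetts").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def autocomplete(words, n=5):
    """Greedily append the most probable next word, n times."""
    words = list(words)
    for _ in range(n):
        candidates = bigrams.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)
```

Because each step conditions only on the previous word, the model happily commits to whatever is locally most frequent, which is exactly the "wrong university" failure mode described above.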

Wednesday, September 8

AJAX Results as you Type: Google Instant

Google announced Google Instant on their official blog.

Google Instant - "Search at the speed of thought". Google Instant is results as you type with a new AJAX view of Google's search results. It takes the autocomplete feature a step further and sends the most probable search to the server to fetch the results. A few optimizations to make it possible:
  • Prioritizing searches - the biggest optimization is to run only the most probable searches.
  • User state - cut short in-flight searches that become obsolete, to avoid running every search to completion.
  • Result caches - improved result caching.
Overall, Google claims that instant search saves 2-5 seconds per query. It took some really committed engineers at Google, including Ben Gomes, to make it possible. Kudos!
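The "run only the most probable search" optimization can be sketched in a few lines. The query log and its frequencies below are invented for illustration, and `fetch` stands in for a real results backend.

```python
from collections import Counter

# Invented query log with frequencies.
query_log = Counter({
    "weather boston": 120,
    "weather berlin": 40,
    "web search": 30,
    "google instant": 25,
})

def most_probable_query(prefix):
    """Highest-frequency logged query completing the prefix (else the prefix)."""
    matches = [(n, q) for q, n in query_log.items() if q.startswith(prefix)]
    return max(matches)[1] if matches else prefix

def instant_results(prefix, fetch):
    # One backend search per keystroke, for the most probable completion
    # only, rather than one search per candidate completion.
    return fetch(most_probable_query(prefix))
```

The caching and in-flight cancellation optimizations then layer on top of this: consecutive keystrokes often map to the same most probable query, so many "searches" never reach the backend at all.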

It is worth noting that many of these ideas have been in the community for a while. For example, it reminds me of the CompleteSearch system, which has been around since SIGIR 2006. CompleteSearch has novel prefix-based search capabilities that are still beyond what Google rolled out today.

Tuesday, September 7

Autonomous Search: Did you know?

Eric Schmidt, CEO of Google, gave a keynote address at IFA, a consumer electronics show in Germany. The keynote was covered in an article by paidContent. He emphasized "mobile first" as very important. According to him, the new and most interesting applications are happening on smartphones.

This leads to what Eric describes as "autonomous search",
Ultimately, search is about finding what you want right now, and the next step of search is doing this automatically. And so when I'm walking down Berlin and I like history, my smartphone is doing searches constantly - did you know? did you know? did you know? This occurred here, this occurred there.

Because it knows who I am, it knows what I care about, and it knows roughly where I am. This notion of autonomous search, the ability to tell me things I didn't know but am probably very interested in, is the next great stage in search.
See also an earlier interview with Amit Singhal on the Evolution of Search. Many of these future search applications share common ground with the field of agent planning in AI. One company taking an initial step in this direction is Siri (read my previous post on Siri and DARPA's CALO project), a task-oriented virtual assistant for your iPhone. You could reframe much of what Eric described as "Intelligent Search". Tom Gruber, the CTO of Siri, describes some principles from Intelligence at the Interface:
  • It knows a lot about you.
  • It understands you in context.
  • It is proactive.
  • It gets better with experience.
I think one of the key things that is new in "autonomous" or "intelligent" search is that the system proactively surfaces interesting information to the user and assists the user in performing actions. A key challenge is how to perform rigorous evaluation in such an immature and developing area. The task is a significant departure from the more traditional ad hoc search tasks and requires a much richer user model.

Friday, August 20

GraphLab: Beyond MapReduce for Parallel Machine Learning

A team at the CMU Select Lab recently released a new software package called GraphLab that provides an alternative to the MapReduce paradigm for developing machine learning algorithms. The work is described in the paper,

Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein (2010). "GraphLab: A New Parallel Framework for Machine Learning." Conference on Uncertainty in Artificial Intelligence (UAI). (PPT slides)

From the description on the website,
GraphLab provides a similar analog to the Map in the form of an Update Function. The Update Function however, is able to read and modify overlapping sets of data... In addition the update functions can be recursively triggered with one update function spawning the application of update functions to other vertices in the graph enabling dynamic iterative computation...

The GraphLab analog to Reduce is the Sync Operation. The Sync Operation also provides the ability to perform reductions in the background while other computation is running. Like the update function sync operations can look at multiple records simultaneously providing the ability to operate on larger dependent contexts.
Other than the paper, you can read the details page for more information.
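To make the update-function idea concrete, here is a miniature, purely illustrative analogue (this is not the GraphLab API): a PageRank-style vertex update that reads its in-neighbors' data and reschedules its out-neighbors only when its own value changes, giving the dynamic iterative computation the quote describes. The graph is invented.

```python
from collections import deque

graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}   # out-links
in_links = {"a": ["c"], "b": ["a", "c"], "c": ["b"]}
rank = {v: 1.0 / len(graph) for v in graph}

def update(v, damping=0.85, tol=1e-6):
    """Recompute v's rank; return vertices to reschedule if it moved."""
    new = (1 - damping) / len(graph) + damping * sum(
        rank[u] / len(graph[u]) for u in in_links[v])
    moved = abs(new - rank[v]) > tol
    rank[v] = new
    return graph[v] if moved else []    # spawn updates on out-neighbors

pending = deque(graph)                  # initial schedule: every vertex
while pending:
    v = pending.popleft()
    for w in update(v):
        if w not in pending:
            pending.append(w)
```

The contrast with MapReduce is visible even at this scale: updates read overlapping neighborhoods and the schedule is data-driven, so computation stops where ranks have converged instead of sweeping the whole graph every iteration.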

I need to think about this more.

Wednesday, August 11

Yahoo! Labs Interview: Towards Web Object Search

The Yahoo! Search blog has an interview with Dr. Ben Shahshahani of Y! labs on search. The questions in the interview cover real-time search, the use of social data, and object retrieval.

The interview begins with an introduction: search is moving beyond 10 links to a federated model that blends objects from different verticals, also known as "Universal Search" at Google. He then continues about the increasing role that structured data is playing on the web. He says,
Now, the other thing that has been happening is an integration of structured data and unstructured data, so structured meaning that there are particular attributes to different entities. We have a pretty active technology and science effort in trying to understand the main object, attributes, and relationships – not just the text on a web page...
Later, he continues the thread when it comes to answering specific user intents,
Once a query comes in, the question is: “what is the intent” or “what are the common intents of the users submitting this query?” To answer that question, we use a variety of ways to understand the query – a lot of the queries are about objects... Objects are things in the real-world. They can be events, a location, a person or a product. Our active effort in understanding attributes and their relationships helps us find out the things you can do with those objects.
The last quote reminded me of the great presentation given by Eugene Agichtein at the Query Representation and Understanding workshop at SIGIR, Inferring User Intent from Interactions with the Search Results. As I recall, Eugene used search logs gathered from toolbar data to analyze different object attributes and tasks associated with different types of objects from the log. However, I don't recall all the details and the slides are not online.

Disclosure: I am an intern at Y! this summer working on object retrieval, so I'm a bit biased.

Tuesday, August 10

Quick Links of the Day: Blekko, Silicon Valley History, I Hate Your Paper, and P != NP

  • Blekko - Daniel has a Blekko preview from the beta. See also Michael Arrington's post on SE Land. From Daniel's post:
    Rather, they are a way for users to “spin” their search results using a variety of filters. For example, [climate /liberal] and [climate /conservative] return very different results, because they are restricted to different sets of sites... In addition to providing a set of curated slashtags, Blekko allows users to define their own slashtags by specifying the sets of sites to be included.
    This is a very primitive means of creating mini vertical search engines. My first instinct is that slashtags remind me of Rollyo where you can "roll your own" by restricting search to a group of websites.

  • I Hate Your Paper - an article by New Scientist that looks at how the reviewing process is broken and some ways that journals are exploring possible reforms. (thanks Hany)

  • A historical perspective on the evolution of Silicon Valley by Russell Jurney

  • And in case you've been living under a rock, the P != NP proof attempt from HP Labs; see #pnp.

Monday, August 2

NY Times Article: Bing Kicks Google in the Pants

The NY Times has an article on a new search "cold war" between Bing and Google. I think the article is pretty poorly done. The author paints Bing as an innovative upstart giving Google a "kick in the pants" and forcing it to play catch-up. It selectively uses facts to misrepresent reality for the sake of a good story. The author fails to mention areas where Google is innovating and Microsoft is playing catch-up: mobile search, visual search, real-time search, social search, and others.

One valid point: where I think Microsoft has done well is its vertical strategy as a means of differentiation. As the article says,
Microsoft has tried to attract people like Mr. Callan by excelling at answering frequently asked questions, like those related to travel, health, shopping, entertainment and local businesses. For example, Bing has flight search and prediction tools that reveal price fluctuations for certain routes, and advises customers whether to buy or wait. Bing Health uses data from sources like the Mayo Clinic and Healthwise...

Thursday, July 29

Recorded Future: Trend and event spotting from real-time news data

Yesterday Wired featured an article, Google, CIA Invest in ‘Future’ of Web Monitoring. The article stretches the truth a bit when it says that Google is doing business with the CIA. The link is tenuous: both companies are interested in predictive analytics on news and real-time data. The subject of the article is a small Cambridge-based company, Recorded Future. From the article's description,
Recorded Future strips from web pages the people, places and activities they mention. The company examines when and where these events happened (“spatial and temporal analysis”) and the tone of the document (“sentiment analysis”)... Recorded Future maintains an index with more than 100 million events, hosted on servers.
For a more detailed look at what the company is doing, take a look at the white paper published on the company blog, A whitepaper on temporal analytics. You can also read the Predictive Signals blog by Bill Ladd, the Chief Analytic Officer at Recorded Future.

Recorded Future is not alone in this field. For example, the Living Knowledge Project is also working on future prediction of news events from web data.

The people working in this field should be aware of the wealth of previous research analyzing event data in news. For example, the DARPA TIDES program on Topic Detection and Tracking (TDT). See James Allan's book, Topic Detection and Tracking for an overview. You can also look at some of Victor Lavrenko's work, specifically on TDT and AEnalyst for financial market prediction from news.

Quick Links of the Day: KDD Cup, Task Oriented Search, ScalaNLP, SIGIR

Any of these stories could be a full blog post. But, for now I'll just have to give you a few quick pointers:

SIGIR 2010 Industry day videos - complete videos of all the talks, via Noisy Channel.

ScalaNLP - A new NLP package in Scala from the Berkeley and Stanford NLP teams. Scala is a hip new language for NLP that runs inside the JVM. See also the factorie project from UMass's IESL lab.

KDD Cup Challenge Results - This year's competition asked participants to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems.

TabCandy - from Matthew Hurst. Create groups of tabs for task-oriented search. Create a "save for later" group of tabs. Share "groups of tabs" across platforms and with your friends - "group browsing".

How Google Builds APIs from Google I/O

Research vs. Reality - Discuss.

Tuesday, July 27

KDD 2010 Coverage, Best Paper Awards

KDD 2010 is being held in Washington D.C. this week. I'm not attending, but everyone can participate because the keynotes are being streamed live. The 9am EST keynote is from David Jensen of UMass Amherst, on Computational Social Science.

Yesterday was the first day of papers. Two that garnered lots of discussion on Twitter are:

Suggesting Friends Using the Implicit Social Graph
In this paper, we describe the implicit social graph which is formed by users' interactions with contacts and groups of contacts, and which is distinct from explicit social graphs in which users explicitly add other individuals as their "friends".
It won honorable mention in the industry paper category. Look for "Got the wrong Bob" and "Don't forget Bob" features in GMail labs.

Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
In this paper, we describe Google’s overlapping experiment infrastructure that is a key component to solving these problems. In addition, because an experiment infrastructure alone is insufficient, we also discuss the associated tools and educational processes required to use it effectively.
The awards were also announced, see the KDD awards for the full list.

Best Research Paper:
Connecting the Dots Between News Articles
In this paper, we investigate methods for automatically connecting the dots - providing a structured, easy way to navigate within a new topic and discover hidden connections. We focus on the news domain: given two news articles, our system automatically finds a coherent chain linking them together. For example, it can recover the chain of events starting with the decline of home prices (January 2007), and ending with the ongoing health-care debate.
Best Industry/Government Paper
Optimizing Debt Collections Using Constrained Reinforcement Learning
In this paper, we propose and develop a novel approach to the problem of optimally managing the tax, and more generally debt, collections processes at financial institutions... We report on our experience in an actual deployment of a tax collections optimization system based on the proposed approach, at New York State Department of Taxation and Finance.

SIGIR 2010 Workshops: CrowdSourcing for Search Evaluation

Last Friday was SIGIR workshop day. First up is the workshop on Crowdsourcing for Search Evaluation, which focuses on using Amazon's Mechanical Turk (MT) and similar services to provide judgments. I did not attend this workshop, but heard positive things from the attendees. The workshop was organized by Matt Lease, Vitor Carvalho, and Emine Yilmaz.

The presentations and papers in the program are available online. Here are a few I want to highlight:

A main highlight was the CrowdFlower keynote:
Better Crowdsourcing through Automated Methods for Quality Control
CrowdFlower provides commercial support for companies performing tasks on Mechanical Turk. Everyone had great things to say about this talk that kept people enthralled even though it was the end of the day; some said it was the best talk of the conference.

The other keynote was:
Design of experiments for crowdsourcing search evaluation: challenges and opportunities by Omar Alonso. Don't miss the slides from Omar's ECIR tutorial. They also had a paper at the workshop,

Detecting Uninteresting Content in Text Streams, which looked at using crowdsourcing to evaluate the 'interestingness' of tweets. They found that most tweets (57%) were not interesting. Generally, tweets that contain links tend to be interesting (81% accuracy), and interesting tweets without links generally contained named entities.

Omar, Gabriella Kazai, and Stefano Mizzaro are working on a book on crowdsourcing that will be published by Springer in 2011.

My labmate, Henry Feild, presented a paper, Logging the Search Self-Efficacy of Amazon Mechanical Turkers.

Be sure to read over the rest of the program, because there are other great papers that I haven't had a chance to feature here.

Thursday, July 22

SIGIR 2010 Industry Day: Being Social: Context-aware and Personalized Info. Access

Being Social: Research in Context-aware and Personalized Information Access @ Telefonica
Xavier Amatriain, Karen Church and Josep M. Pujol, Telefónica

Context overload
- the device of the future for information seeking is no longer the desktop
- it is mobile: iPad, mobile phone.
- Mobile phones are "personal"
- Mobile users tend to seek "fresh" content

Where is the nearest florist?
-- this is pretty easy
-- where is that really cool cocktail bar I went to last month? (harder)
-- What about discovery?
-- Interesting things close to me? Events?

Can we improve the search and discovery experience of mobile users using social information?

Social Search Browser - SSB
- Karen Church
- iPhone web application + Facebook app
- displays queries/questions by other users in that location
- users can post and interact with queries from others

SSB was a tool for helping and sharing....
A tool for supporting curiosity... an extension to my social network


Crowds are not always wise. Predictions based on large datasets that are sparse and noisy.

User feedback is noisy
- you can trust feedback that something is excellent, but not necessarily the other way around.

"Trust Us - We're Experts"
- "It is really only experts who can reliably account for the decisions"
- The Wisdom of the Few - SIGIR '09

Expert-based CF
An expert = an individual that we can trust to have produced thoughtful, consistent, and reliable evaluations (ratings) of items in a given domain.

Working prototypes
- Music recommendations, mobile geo-located recommendations...

- Sometimes the experts are better than your direct social network.
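A hedged toy of the expert-based CF idea ("Wisdom of the Few"-style): predict a user's rating from a small expert pool, weighting each expert by how closely their past ratings match the user's. All items and ratings below are invented, and the similarity function is a simple stand-in, not the paper's.

```python
def similarity(a, b):
    """Inverse mean absolute rating difference over commonly rated items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    mean_diff = sum(abs(a[i] - b[i]) for i in common) / len(common)
    return 1.0 / (1.0 + mean_diff)

def predict(user, experts, item):
    """Similarity-weighted average of expert ratings for the item."""
    pairs = [(similarity(user, e), e[item]) for e in experts if item in e]
    total = sum(w for w, _ in pairs)
    if total == 0:
        return None
    return sum(w * r for w, r in pairs) / total

experts = [{"album_a": 5, "album_b": 2, "album_c": 4},
           {"album_a": 2, "album_b": 5, "album_c": 1}]
user = {"album_a": 5, "album_b": 1}
```

The appeal of the approach is exactly the point made above: a handful of consistent expert profiles can be less noisy than a large, sparse crowd of neighbors.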

SIGIR 2010 Industry Day: Lessons and Challenges from Product Search

Lessons and Challenges from Product Search
Daniel Rose, A9

Different Domains, Different Solutions
- Traditional IR,
- Enterprise search
- Web search
- Product Search
How are the issues different? Let's go back to user goals...

The Goals of Web Search
- Understanding user goals in web search paper (WWW 2004).  Manually clustered queries until they were stable.
- Done at AltaVista in 2003 (not completely representative queries)
- Most product queries fell into other categories

Why do people search on Amazon?
- When they want to buy something?

Even ignoring the non-buying issues..

The Goals of Product Search
- Depends on where you are in the buying funnel.
-- Top: Awareness, then Interest, then Desire, finally Action
St. Elmo Lewis, 1898
- Provide the right tools at the right stage in the process.

[roller coaster]
- toys and games
- sort by average customer review
- sort by price (is actually hard: new vs. used, amazon vs. third-party, etc...)

Different Tools for Different Stages
- Product search shows more fluid movement between searching and browsing behavior (relying on faceted metadata)
- Because of the nature of the search task?
- Because of the interfaces?

What Amazon Queries Look Like
- [which old testament book best represent the chronological structure]
- [shipping rates for amazon]
- [long black underbust corset] - still looking
- vs ISBN number -> about to buy it

(mostly one word, mostly the name of a thing.  except "generator")
top 10 across the us
(kindle, kindle fire, skyrim, mw3, sonic generations, cars 2)

queries in frequency deciles, by category
US, books, electronics, apparel
 --> very diverse, misspellings, miscategorization, all levels of the buying funnel

Context is King
- Some facets for Dresses vs. Digital Cameras
- The problem of facet selection
- Not a one size fits all UI solution for different facet types
- We can interpret your query in a smarter way: [timberland] boots inside shoes is a brand
- Timberland in music -> Timbaland the band (context dependent spelling correction)

Amazon is a MarketPlace...
- So search must be realtime
-- new products
-- new merchants
-- prices being changed all the time
-- items going in and out of stock all the time

Structured Data: "It's a gift... and a curse"
- Unlike web search, we know the semantics of different bits of text
- We know what fields are important for customers (e.g. brand)
- A large degree of quality control (less adversarial problems)
- We don't have to do sentiment analysis to know if a review is positive/negative

A Curse
- Search engine needs to have both DBMS-like "right answer" behavior and IR-like "best answer" behavior
- Traditional IR mechanisms don't always work well for structured data
-- e.g. naive tf x idf doesn't work well (see BM25F)
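The BM25F fix for this is to pool field-weighted, length-normalized term frequencies across fields before applying BM25's tf saturation, instead of scoring each field independently. A hedged sketch of the per-term score; the field names, weights, and numbers are invented for illustration.

```python
import math

def bm25f_term_score(tfs, weights, lens, avg_lens, df, n_docs, k1=1.2, b=0.75):
    """BM25F-style score of one term: pool across fields, then saturate."""
    pooled = 0.0
    for field, tf in tfs.items():
        norm = 1 - b + b * (lens[field] / avg_lens[field])   # length norm
        pooled += weights[field] * tf / norm                 # field weight
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * pooled / (k1 + pooled)                      # tf saturation

score = bm25f_term_score(
    tfs={"title": 1, "body": 3},
    weights={"title": 3.0, "body": 1.0},
    lens={"title": 5, "body": 400},
    avg_lens={"title": 6, "body": 350},
    df=50, n_docs=10_000)
```

Because saturation happens after pooling, a title match cannot be drowned out by raw term counts in a huge field like book text, which is precisely the failure mode of naive per-field tf x idf.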

What happens when one of the fields is orders of magnitude bigger than the others?
-- Search inside the book vs. brand name
- What happens when you don't have all the fields all the time? (missing data)
-- ratings, reviews correlate with user satisfaction, but it may not be there

Search Inside the Book
 - how often do you want to surface full-text matches vs. filter them out
 - (example query:  [byte-aligned compression])

Using Behavioral Data
- Powerful source of information for any search engine
- When is using behavioural data an invasion of privacy (or just plain creepy), and when is it better for users?
- Customers of a business seem more comfortable with that business learning from past behavior.

Interpreting Behavioral Signals
Example: Are search result clicks good or bad?
- How many clicks are best?
-- 1: the customer found what they are looking for right away
-- many: comparison shopping, looking around at multiple items
-- zero: the search result page contained all the information necessary
Also, some items are inherently "click attractive", e.g. a book with a sexy cover

- "Why is the web so hard... to evaluate" (from snippet evaluation at Yahoo!) 2004

Evaluating Product Search Relevance
Common argument
-- Customers go to a shopping site to buy stuff
-- if a search engine change leads to customers buying more stuff, they must have had their search need met more effectively.
-- Therefore, relevance can be measured by how much customers buy.
What's wrong with this argument?
-- besides ignoring the rest of the buying funnel, it assumes the customer is ready to buy.

The A/B Test Mystery
- Compare ranking algorithms A and B
- Assign half of users to A and half to B
- At the end, the avg. revenue is higher in A than B.
-> Algorithm A could be better than B, or algorithm A could be recommending higher-priced items than B
-> Algorithm A could be recommending completely unrelated, but very popular, items.

So How to do Evaluation?
 - A/B tests, automated metrics, editorial relevance assessments (possibly crowdsourced).
 - Use all of them!

Lessons from IR
One idea: Generalizing the buying funnel
- The information seeking funnel
- Wandering: no information seeking goal in mind
- Exploring: have a general goal, but no plan for achieving it
- Seeking: have started to identify information needs that must be satisfied, but the needs are open-ended
- Asking: have a very specific information need corresponding to a closed-class question
- Published in: The information seeking funnel, in Information-seeking support systems workshop 2008.

- Start thinking about how to meet user needs before user knows she has a need
- Offer different interaction mechanisms for different parts of the information seeking process
- Let type of content influence the way search works
- Design for realtime
- Interpret behavioral data carefully
- Exploit structure when you have it
- Exploit context when you have it

(My Thoughts and Questions)
 - The world is not only Amazon.  What about linking the products to external sources, like consumer reports, dpreview and other sites?
  --> Amazon enhanced Wikipedia (e.g. Orson Scott Card)
 - Social, how is amazon incorporating social search?
 --> delicate balancing act with Facebook and other sources

 - Do you try and leverage mentions of products on book review sites? or within other books?
 - I recently went to barnes and noble and saw the new Orson Scott Card book, one of my favorite authors.  Why didn't Amazon surface that to me? (support for subscribing to authors)  Or, "buy the new top picks from this month's Cook's Illustrated"...
 - From my perspective, the recommendation quality of Amazon has decreased over time despite more of my data.  Does this reflect a shift in emphasis?

Microsoft Releases Learning to Rank Datasets

Microsoft Research announced that it is releasing a new MS LTR dataset.
We release two large scale datasets for research on learning to rank: MSLR-WEB30k with more than 30,000 queries and a random sampling of it MSLR-WEB10K with 10,000 queries.

136 features have been extracted for each query-url pair.
The dataset is retired. What makes this quite interesting is that the features have been released. You can see the feature list.

See also the Y! LTR datasets.

SIGIR 2010 Industry Day: Machine Learning in Search Quality at Yandex

Machine Learning in Search Quality at Yandex
Ilya Segalovich, Yandex

Russian Search Market
- Yandex has 60+% market share
- It's all about careful attention to the details of search

A Yandex overview
- started in 1997
- No. 7 search engine in the world by # of queries
- 150 million queries per day

Variety of Markets
- 15 countries with cyrillic alphabet
- 77 regions in Russia
-> different culture, standard of living, average income, for example: Moscow, Magadan
-> large semi-autonomous ethnic groups (tatar, chech, bashkir)
-> neighbouring bilingual markets

Geo-specific queries
- Relevant result sets vary significantly across regions and countries

pFound
- a probabilistic measure of user satisfaction
- optimization goal at Yandex since 2007
- Similar to ERR, Chapelle 2009 --> hopefully someone can fill in the exact formula
- pFound, pBreak, pRel
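For reference, here is the pFound cascade model as it is usually published (a hedged reconstruction, not from the talk): the user scans results top-down, is satisfied at position i with probability pRel[i], and abandons after each position with probability pBreak. The 0.15 default is the commonly quoted value, not one given in the slides.

```python
def pfound(p_rel, p_break=0.15):
    """Cascade model: sum over positions of P(user looks) * P(relevant)."""
    p_look, total = 1.0, 0.0
    for r in p_rel:
        total += p_look * r                  # satisfied at this position
        p_look *= (1 - r) * (1 - p_break)    # not satisfied, keeps scanning
    return total
```

Like ERR, the metric rewards putting relevant results early, since a relevant result near the top sharply reduces the probability the user ever looks further down.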

Geo-specific Ranking
query -> query + user's region
- may need to build a specific formula for countries/region because of the variance and missing/lacking features in some of them.

Alternatives in Regionalization
- separate local indices or a unified index with geo-coded pages
- one query or region specific query
- query based local intent detection vs. results based local intent detection
- single ranking function vs. co-ranking and re-ranking of local results
- train one formula or train many formulas on local pools

Why use MLR?
Machine learning as a conveyor
- Some query classes require specific ranking
- many features

A learning method
- boosted decision tree, "oblivious" trees.
- optimize for pFound
- solve regression tasks, train classifiers

Complexity of ranking formulas
20 bytes - 2006
14 kb - 2008
220 kb - 2009
120 MB - 2010

A sequence of more and more complex rankers
- pruning with the static rank (static features)
- use of simple dynamic features (such as BM25)
- complex formula that uses all the features available
- potentially up to millions of matrices/trees for the very top documents
- see Cambazoglu et al., 2010, on early-exit optimization
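The cascade described above can be sketched as a generic rank-and-prune pipeline; the scorers, feature values, and cutoffs below are all invented for illustration.

```python
def cascade(docs, stages):
    """stages: list of (scoring_fn, keep_top_k); each stage reranks and prunes."""
    for score, k in stages:
        docs = sorted(docs, key=score, reverse=True)[:k]
    return docs

# invented documents with a cheap static feature and a dynamic one
docs = [{"id": i, "static": i % 7, "bm25": (i * 31) % 13} for i in range(1000)]
survivors = cascade(docs, [
    (lambda d: d["static"], 100),                 # cheap static-rank pruning
    (lambda d: d["bm25"], 10),                    # simple dynamic feature
    (lambda d: d["static"] + 2 * d["bm25"], 3),   # "complex" final formula
])
```

The design point is cost: the expensive model only ever sees the handful of documents the cheap stages let through, which is what makes million-tree rankers affordable at query time.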

Geo-dependent queries: pFound
- a big jump in 2009 in Quality
- 3x more local results than the #2 player in Russia

- MLR is the only way to do regional search: it gives us the possibility of tuning many geo-specific models at the same time.

Complexity of the models is increasing rapidly
-> don't fit into memory!

MLR in its current setting does not fit time-specific queries well
-> features of the fresh content are very sparse and temporal

Opacity of the results of MLR
- the downside of ML

Number of features grows faster than the number of judgments
-> hard to train ranking

Learning from clicks and user behavior is hard
Tens of GB of data per day!

Yandex and IR
- Participation and Support
- Yandex MLR in the IR context

SIGIR 2010 Industry Day: Query Understanding at Bing

Query Understanding at Bing
Jan Pederson

Standard IR assumptions
- Queries are well-formed expressions of intent
- Best effort response to the query as given
Reality: queries contain errors
- 10% of queries are mispelled
- incorrect use of terms (large vocabulary gap)

Users will reformulate
- if results do not meet information need
Reality: if you don't understand what's wrong, you can't reformulate. You miss good content and go down dead ends.

- Take the query, understand what is being said, and modify the query to get better results

Problem Definitions
- Best effort retrieval
-- Find the most relevant results for the user query
-- Query segmentation
-- Stemming and synonym expansion
-- Term deletion

Automated Query Reformulation
- Modify the user query to produce more relevant results for the inferred intent
-- spell correction
-- term deletion
-- This takes more liberty with the user's intent

Spelling correction
Example: blare house
- corrected to "blair house". There is a "recourse link" because the query was changed to back out.

- resturants -> restaurant
- sf -> san francisco

-> un jobs -> united nations (may already be there in anchor text)
- utilize co-click patterns to find un/united nations for that page
- it is especially important for long queries, tail queries
- not so good: federated news results for the same query. Is the same query interpretation being used consistently? The news vertical did not perform the expansion, and that is a problem.

Term Relaxation
[what is a diploid chromosome] -> "what is a" is not important for matching; it introduces noise

[where can I get an iPhone 4] -> where is an important part of the query. Removing "where" misses the whole point of the query

[bowker's test of change tutorial] -> "test of symmetry" is the correct terminology. How do you know the user has the incorrect terminology? If you relax the query to [bowker's test] you get better results

Key Concepts
- Win/Loss ratios
-- wins are queries whose results improve
-- losses are queries whose results degrade
- Related to precision
-- but not all valid reformulations change results

- Pre vs. Post result analysis
-- Query alternatives generated pre-results
-- Blending decisions are post results

Query Evaluation
"Matching" layers L0/L1/L2 -> inverted index, matching, and ranking. Reduce billions of pages to hundreds of thousands. Much of the loss can occur here, because a relevant page that never makes it into the candidate set is gone for good. Assume that the higher layers, which use ML, etc., will bubble the correct results to the top.

"Merge" layer L3 -> results from the multiple backend queries are blended together

Federation layer L4 -> top layer coordinating activity

An important component is the query planner in L3, which performs annotation and rewriting.

Matching and Ranking
L0/L1 - 10^10 docs. L0: boolean set operations; L1: IR score (a linear function over simple features like BM25; simple and fast, but not very accurate)

L2 reranking - 10^5 docs - ML heavy lifting: ~1500 features, proximity

L3 reranking - 10^3 - federation and blending

L4 -> 10^1
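The funnel above can be sketched as nested filtering steps. Everything here (the toy index, scoring lambdas, and cutoffs) is illustrative, not the actual production implementation:

```python
def tiered_retrieval(index, query_terms, rank_l1, rank_l2, k1=1000, k2=10):
    """Sketch of the tiered funnel: L0 boolean match over posting lists,
    L1 cheap linear score over the candidates, L2 expensive rerank over
    the much smaller survivor set."""
    # L0: boolean AND (set intersection of posting lists).
    postings = [index.get(t, set()) for t in query_terms]
    candidates = set.intersection(*postings) if postings else set()
    # L1: cheap score, keep the top-k1 candidates.
    survivors = sorted(candidates, key=rank_l1, reverse=True)[:k1]
    # L2: expensive model, keep the top-k2 results.
    return sorted(survivors, key=rank_l2, reverse=True)[:k2]

# Toy index mapping terms to doc-id posting sets; the lambdas stand in
# for a fast linear scorer and a heavy learned model.
index = {"un": {1, 2, 3, 5}, "jobs": {2, 3, 4, 5}}
top = tiered_retrieval(index, ["un", "jobs"],
                       rank_l1=lambda d: d % 4,   # fake cheap score
                       rank_l2=lambda d: -d,      # fake expensive score
                       k1=3, k2=2)
print(top)
```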

Learning to rank using gradient descent (for the L2 layer)

Query Annotation

NLP Query annotation
- offline analysis
- Think of the annotations as a parse tree

Ambiguity preserving
- multiple interpretations

Backend independent
- shared

Structure and Attributes
- Syntax and semantics (how to handle leaf nodes in the tree)

Query Planning

[{un | "united nations"} jobs] -> l3-merge(l2-rank([un jobs]), l2-rank(["united nations" jobs]))
[{un | "united nations"} jobs] -> l3-cascade(threshold, l2-rank([un jobs]), l2-rank(["united nations" jobs]))
-- the second is less certain and conditional
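A rough sketch of the two plan shapes, with hypothetical (doc, score) result lists; the cascade only falls back to the full (costlier) merge when the primary query's top score is below the confidence threshold:

```python
def l3_merge(*result_lists, k=10):
    """Blend results from all backend queries (maximal information)."""
    best = {}
    for results in result_lists:
        for doc, score in results:
            best[doc] = max(best.get(doc, float("-inf")), score)
    return sorted(best, key=best.get, reverse=True)[:k]

def l3_cascade(threshold, primary, fallback, k=10):
    """Use the rewrite only when the primary query's results are weak."""
    if primary and max(score for _, score in primary) >= threshold:
        return [d for d, _ in sorted(primary, key=lambda p: -p[1])][:k]
    return l3_merge(primary, fallback, k=k)

# Invented ranked lists for [un jobs] and the ["united nations" jobs] rewrite.
un = [("careers.un.org", 0.9), ("un.org", 0.4)]
united_nations = [("unjobs.org", 0.8)]
print(l3_merge(un, united_nations, k=3))
print(l3_cascade(0.95, un, united_nations, k=3))  # primary weak -> merge
```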

Design Considerations
- one user query may generate multiple backend queries that are merged in L3
- Some queries are cheaper than others
-- query reduction can improve performance

- L3 merging has maximal information, but is costly

Multiple query plan strategies
- Depending on query analysis confidence

Query Analysis Models

Noisy Channel Model
argmax{ P(rewrite | query) } = argmax{ P(rewrite) P(query | rewrite) }

-- Bayes inversion

- example: spelling
-- language model: likelihood of the correction
-- translation model: likelihood of the error occurring
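The argmax above can be sketched directly. The language-model and channel probabilities below are invented for the "blare house" example; a real system would derive them from query logs and error models:

```python
def noisy_channel_correct(query, lm, channel):
    """Return argmax over rewrites r of P(r) * P(query | r).

    lm: P(rewrite), the language model; channel: P(query | rewrite),
    the likelihood of this error. Both tables here are made up."""
    candidates = channel.get(query, {query: 1.0})
    return max(candidates, key=lambda r: lm.get(r, 1e-9) * candidates[r])

# Hypothetical probabilities for the example from the talk.
lm = {"blair house": 1e-5, "blare house": 1e-9}
channel = {"blare house": {"blair house": 0.4, "blare house": 0.5}}
print(noisy_channel_correct("blare house", lm, channel))
```

The common correction wins despite its lower channel probability, because the language model strongly prefers the query it has seen before.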

Language Models
Based on Large-scale text mining
-- unigrams and N-grams (to favor common previously seen things, they make sense)
-- Probability of query term sequence
-- favor queries seen before
-- avoid nonsensical combinations

1T n-gram resource
-- see the Microsoft n-gram work here at SIGIR
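A toy add-alpha smoothed bigram model showing how previously seen term sequences are favored and nonsensical combinations penalized; the counts are made up, standing in for a web-scale n-gram resource:

```python
import math
from collections import Counter

def bigram_logprob(terms, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a query term
    sequence; unseen bigrams get a small smoothed probability."""
    logp = 0.0
    for prev, cur in zip(terms, terms[1:]):
        num = bigrams[(prev, cur)] + alpha
        den = unigrams[prev] + alpha * vocab_size
        logp += math.log(num / den)
    return logp

# Tiny invented counts in place of web-scale n-gram statistics.
uni = Counter({"new": 100, "york": 60, "hotels": 40})
bi = Counter({("new", "york"): 50, ("york", "hotels"): 20})
seen = bigram_logprob(["new", "york", "hotels"], uni, bi, vocab_size=1000)
scrambled = bigram_logprob(["york", "new", "hotels"], uni, bi, vocab_size=1000)
print(seen > scrambled)  # True: the familiar order scores higher
```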

Translation Models
- training sets of aligned pairs (misspelling/correction; surface/stemmed)

Query log analysis
-- session reformulations
-- co-click -> associated queries
-- manual annotation

(missed the references, but see: Wei et al, Pang et al., Craswell)

- 60-70% of queries are reformulated
- Can radically improve results

Trade-off between relevance and efficiency
- rewrites can be costly
- win/loss ratio is the key

Especially important for tail queries
- no metadata to guide matching and ranking

SIGIR 2010 Industry Day: Search Flavours at Google

Search Flavours: Recent updates and Trends
Yossi Matias
Director of Israel R&D Center, Google

Solution for the search problem: imitate a person

Wish list
- knows everything
- language agnostic
- always up to date
- context sensitive
- understands me
- Good sense of timing
- Good sense of scope
- Smart about interaction

(Suggest answers to questions I didn't ask, or didn't ask accurately)
In short, things we expect from people when we interact with experts or friends. This is subtle.

Demo of things
- auto suggest of weather; an intelligent guess at what the user will ask
- flight information for ua 101
- weather in the suggestion
- This is new because the user does not have a chance to finish the question
- How do we interpret user feedback when they don't give any (except maybe to stop typing)?

- Being local [restaurant] (implicit context)
- world cup (now is a general answer, but a week or two ago it was very different)
- new forms of information: user generated content in real-time, Twitter
- [whale] it turns out there was a whale jumping onto a ship
- Google trends shows hot topics

Greater Depth With Real-Time
- Example of an earthquake.
- Two minutes after an earthquake, tweets were surfacing in the results before a formal announcement

-- quick slide showing a chart, which he's not going into

Social Circle Personalization
- someone I know blogs about something or a picture, surface it

Understanding: What does Change mean?
- change = to adjust (adjust the brightness) , or convert, or switch all depending on the context

Paul McCartney Concert
- uploading real-time video from the concert
- A few may be good, but we don't want 300 clips all from the same concert

Web translation
- language agnostic
- NY Times translated into Chinese
- translated search
- automatic captioning (translating an Obama speech to add Arabic captions)

Search by voice... any Voice
- People are starting to use it.
- How do you do it for any person, any language?
- The combination of voice search and translation is almost like science fiction
- This is a significant technology worth paying attention to

Search by Sight
- Google Goggles
- Mobile is important for contextual understanding (location)
- Phones are starting to take on behavior of smart agents
- 10 or 20 results are not useful on a smartphone; "I'm feeling lucky" is important

The power of data
- 1.6 billion Internet users
- A billion searches a day on Google worldwide
- He started working in ML and data mining
- From a research perspective there is a massive benefit of working with it


- how to leverage trends of data, such as user search to derive insights
- Trends over time, location, etc.
- Identify outbreaks of flu: find queries that correlate with CDC reports
- Google could predict the outbreak two weeks ahead of the CDC: a heads-up of something happening now
- Nowcasting: forecasting the present based on information from the past
- Hal Varian: Predict economic indicators before they were published by the Govt.
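The flu-trends idea (find queries whose volume correlates with CDC reports) can be sketched with a plain Pearson coefficient. The weekly counts below are fabricated, with the official series lagging the query series by roughly two weeks:

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Fabricated weekly counts: a flu query's volume vs. later CDC reports.
query_volume = [12, 15, 30, 55, 80, 60, 35, 20]
cdc_reports  = [10, 11, 14, 18, 33, 58, 83, 63]  # lags ~2 weeks behind
# Align the series by the lag: correlate q[t] with CDC[t+2].
r = pearson(query_volume[:-2], cdc_reports[2:])
print(round(r, 2))
```

A high lagged correlation is exactly what makes the query series useful for nowcasting: today's searches anticipate the report published weeks later.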

Real Estate
- Using statistical models to provide up-to-the-minute information on where economic indicators stand for sectors of real estate.
- It doesn't always work, but it's helpful

2010 World Cup - new Search
- popularity of David Villa, etc...
- South Africa, and sponsors getting attention

Researching Search Trends Time-Series
- Forecasting. Seasonality is a common case. Many queries have strong seasonal components (yearly/ weekly cycles)
- we can use time-series prediction models to forecast
- (e.g. skiing, sports)

- Define notions of how predictable and regular search queries are
- About half of search queries are predictable in a 12-month-ahead forecast, with a mean absolute prediction error of 12% on average
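A seasonal-naive baseline (forecast each future month as the value from one season earlier) together with the mean-absolute-percentage-error metric behind the ~12% figure; the monthly query volumes below are invented:

```python
def seasonal_naive_forecast(history, season=12, horizon=12):
    """Forecast each future step as the value one season earlier."""
    return [history[-season + (h % season)] for h in range(horizon)]

def mape(actual, forecast):
    """Mean absolute percentage error over the forecast horizon."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Made-up monthly query volume with a yearly cycle plus mild growth.
history = [100, 90, 80, 70, 60, 50, 60, 70, 80, 90, 100, 110]
future  = [105, 95, 85, 72, 63, 52, 61, 73, 84, 92, 103, 115]
fc = seasonal_naive_forecast(history)
print(round(mape(future, fc), 3))  # small error: the cycle repeats
```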

Health, Food & Drink, and .. are quite seasonal.

- Categories are more predictable than individual queries

Deviation from modeled prediction
- US automotive industry, forecasting: August 08 - July 09
- Maintenance and parts were ahead of forecast; new sales were below

See papers (a big long list...)
What can search predict? many publications by Hal Varian

There is no API, but it is possible to download. They are encouraging collaboration with researchers.

Big themes of the talk:
- real-time is expected ('local'), mobile access

SIGIR 2010 Best Paper Award Winners

The best paper awards were awarded last night at the banquet.

Best Paper
Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs, R. W. White, J. Huang
In this paper, we present a log-based study estimating the user value of trail following. We compare the relevance, topic coverage, topic diversity, novelty, and utility of full trails over that provided by sub-trails, trail origins (landing pages), and trail destinations (pages where trails end). Our findings demonstrate significant value to users in following trails, especially for certain query types. The findings have implications for the design of search systems, including trail recommendation systems that display trails on search result pages.
Best Student Paper
A comparison of general vs personalized affective models for the prediction of topical relevance, I. Arapakis, K. Athanasakos, J. Jose
The main goal is to determine whether the behavioural differences of users have an impact on the models' ability to determine topical relevance and if, by personalising them, we can improve their accuracy. For modelling relevance we extract a set of features from the facial expression data and classify them using Support Vector Machines. Our initial evaluation indicates that accounting for individual differences and applying personalisation introduces, in most cases, a noticeable improvement in the models' performance.

SIGIR Industry Day: Baidu on Future Search

Future Search: From Information Retrieval to Information Enabled Commerce
William Chang, Baidu

Two commerce revolutions
- 1995: the first web search engines and commerce sites (eBay, Amazon, etc...)
- China miracle

Early History of IEC
- Early shippers: created corporations, but more importantly, a futures market
- Commerce: coming together to trade: trading goods and information
- Local: Yellow pages created in 1886
- Local classified ads in papers
- Mail order: Sears catalogue in 1888 for farming supplies (enabled by efficient postal service)
- Credit cards: consumer production and data mining
- Development of "advertising science": print, radio, tv

I.E.C in our Daily Lives
- Restaurant menus
- Zagat, Michelin
- Shopping guides, supermarket aisles

Technology and Internet
- Walmart: real-time transaction tracking and inventory management; scale and speed
- Amazon: user generated reviews and recommendations, common business platform
- eBay
- Craigslist

Search Engines
- Y! Directory
- Lycos Crawler, Altavista big index, Excite HotBot
- Infoseek (1996-1999) where he worked
- OR queries
- Phrase inference and query rewriting
- Banner ads tied to search keywords
- real-time addurl
- anti-spam (adversarial IR)
- hyperlink voting and anchor text indexing
- log analysis and query suggestion
- / Overture paid placement
- Google ad platform: AdWord, AdSense

Search as Media

Working definition of a media company (1997)
"A media company's business is to help other businesses build brands, and a brand is the total loyalty of the company's customers. A 'new media company' does this by leveraging the interactive nature of the Internet to enable users to communicate with one another..."
China Economics

China Background
- reality: only 15% of Internet users earn $5000/year
- inflation at 5% spurts of hyper-inflation
- education and personal aspiration: virtually no illiteracy, but there is a problem with brain drain (about a million of the best and brightest left and never came back)
- competition is fierce in school and in work
- gender equality: one child policy
- entrepreneurial spirit

The Economy
- GDP is growing 10% annually
- Despite a tradition of honoring "old brands" there are few new domestic brands and little marketing know-how
- Domestic commerce is still nascent, lacking IEC tools (no yellow pages or directories that work)

The Prize
- Highly developed Internet in user and usage count: 420 million users, 85% broadband; the average user spends 20 hours/week on the Internet
- Sitra (sp?) the expedia of China
- Micropayments are made via phone bills: even children use it to buy games online

The Money
- Half the Internet population is under 25
- Tencent QQ is an IM used by everyone, virtual currency, with real economics
- Online games: Shanda, Giant etc:
- Taobao/Alibaba already 1% of GDP, dominates B2C goods
- Baidu web search dominates B2C services (health, education... help on cramming)
- China mobile: everyone uses it, and for almost everything
- Ctrip: integration of online, mobile, offline services

- Aladdin: Open Search Platform (allows webmasters to submit query and content pairs)
-- rich results that form an application

- iKnow (2005) an open Q&A platform: the largest in the world
-- has many partner websites, all with a Q&A panel on their website

- Ark: Open product database

- Map++ (embed yellow page like information on a map)

Baidu Aladdin:
- On the result page, there is a full panel with airline reservations
- Housing, shopping

A few more ideas:
- The average Chinese worker spends 2-3 hours per day on public transportation. They spend the time playing games or reading "new literature". This is an opportunity for mobile shopping recommendations
- Shopping malls are almost impossible to navigate. There are no directories or ways to find things

- Depends critically on information quality and security: spam
- Users demand quality, but there are still not solid reliable brands
- There are new novel business models to explore: a trillion dollar opportunity