Thursday, November 4

Susan Dumais CIKM 2010 Keynote: Temporal Dynamics in Information Retrieval

I am still catching up on a backlog of items from last week.

Here are more of Michael's notes from Susan Dumais' CIKM 2010 keynote, which addressed the impact of time on web search. Gene also has his notes from the presentation.
  • Change in IR
    • New documents and queries
    • Query volume changes seasonally/periodically
    • Document content changes over time
    • User interactions change over time (e.g., anchor text, page visits)
    • Relevant documents for a query change over time, e.g., “Hurricane Earl” (Sept. 2010 vs. before/after)
    • But evaluation corpora are usually static

  • Digital dynamics are relatively easy to capture; however, the tools for interacting with information (browsers, search engines) are static

  • Characteristics of Web page change
    • Measuring web page change in a large web crawl
    • 33% of web pages changed over a period of 11 weeks
    • 66% of visited pages changed over 5 weeks; 63% of those changed every hour
    • Avg. time between changes – 123 hr.
    • .com pages change more often than .gov and .org pages
    • Knot point – the place on the change curve where the page stabilizes over time; characterizes the way pages change
    • Term-level changes
      • Looking at characteristic terms for the page and their “staying power”, e.g. “cookbooks” & “ingredients” have high staying power for allrecipes.com, while “barbeque” is more transient
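A rough way to operationalize “staying power” (my sketch, not the method from the talk): treat it as the fraction of crawl snapshots of a page in which a term appears.

```python
from collections import Counter

def staying_power(snapshots):
    """Approximate per-term staying power for one page.

    `snapshots` is a list of token lists, one per crawl of the page.
    Terms near 1.0 persist across crawls; terms near 0.0 are transient.
    """
    appearances = Counter()
    for tokens in snapshots:
        for term in set(tokens):
            appearances[term] += 1
    return {t: n / len(snapshots) for t, n in appearances.items()}

# Hypothetical crawls of a recipe page:
crawls = [["cookbooks", "ingredients", "barbeque"],
          ["cookbooks", "ingredients", "salad"],
          ["cookbooks", "ingredients"]]
print(staying_power(crawls))  # cookbooks, ingredients -> 1.0; barbeque -> ~0.33
```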

  • Revisitation Patterns on the Web
    • 60-80% of the pages you visit are pages you’ve seen before
    • 4 revisit patterns:
      • Fast - Navigation within site
      • Hybrid - High quality fast pages
      • Medium - Popular homepages/mail & web applications
      • Slow - Entry pages, bank pages, accessed via search engines

  • Revisitations & Search (Teevan et al, SIGIR 2007, Tyler et al., WSDM 2010)
    • 33% of queries are repeat queries
    • 39% of clicks are repeat clicks

  • Relationships between revisits and change (Adar et al., CHI 2009)
    • Monitor change
    • Revisitation is not always driven by change
    • Change can interfere with re-finding
    • The more visitors the page has, the more often it changes
    • Three pages: nytimes.com, woot.com, costco.com
      • Similar change patterns, but different revisit patterns:
      • NYT – fast revisit
      • Woot – medium revisit
      • Costco – slow revisit

  • Diff-IE – Building support for understanding change
    • Browser toolbar that highlights content that was changed since the last visit
    • Non-intrusive and personalized: highlights changes that are of interest to you, not to the publisher of the page
    • Helps to uncover unexpected important content
    • Facilitates serendipitous encounters
    • Helps to understand page dynamics
    • Will be publicly available later this month from http://research.microsoft.com/en-us/projects/diffie/default.aspx
    • Research surveys show that Diff-IE drives more revisitation
      • Driving visits to pages that change frequently

  • Leveraging Temporal Dynamics for IR (Elsas & Dumais, WSDM 2010)
    • Use document change rate to set document priors
    • Use term longevity to weight terms (a rough sketch of both ideas follows these notes)
    • Evaluation using static data
      • Using 2k navigational queries
      • Dynamic model outperforms the static baseline

    • Ongoing evaluation collection (Understanding Temporal Query Dynamics, to appear in WSDM 2011)
      • Collect relevance judgments over time, e.g. “march madness” query
      • Document relevance changes over time
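A minimal sketch of those two ideas, under my own simplifying assumptions (Elsas & Dumais learn the prior and weights from data; the input names and the Dirichlet-smoothed form below are mine):

```python
import math

def dynamic_score(query_terms, doc, longevity, w_change=0.1, mu=2000):
    """Query-likelihood score with a change-rate document prior and
    per-term longevity weights.

    Hypothetical inputs:
      doc       - {"tf": {term: count}, "len": int, "change_rate": float}
      longevity - term -> weight in [0, 1], higher for long-lived terms
    """
    p_coll = 1.0 / 100000  # stand-in uniform collection language model
    # Document prior as a function of change rate; the sign and size
    # of w_change would be learned from data, not hand-set.
    score = w_change * doc["change_rate"]
    for t in query_terms:
        p = (doc["tf"].get(t, 0) + mu * p_coll) / (doc["len"] + mu)
        score += longevity.get(t, 0.5) * math.log(p)
    return score
```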

Wednesday, November 3

Yahoo! Open Sources S4 Real-Time MapReduce framework

Today Yahoo! announced the release of a new real-time MapReduce framework written in Java called S4. From the website,
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
For more technical details you can read the technical overview or check out the code on github. The example application keeps counts of hash tags in a Twitter stream.
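S4 applications are written in Java against S4's processing-element abstractions; as a language-neutral illustration of what that example app computes, here is a tiny Python sketch of hashtag counting over a stream (not S4 code):

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")
counts = Counter()

def on_tweet(text):
    """Handle one event from the (unbounded) tweet stream."""
    for tag in HASHTAG.findall(text.lower()):
        counts[tag] += 1

# In S4, this logic would run in processing elements keyed by hashtag
# and partitioned across the cluster; here we just feed sample events.
on_tweet("Reading the #S4 technical overview")
on_tweet("#s4 example counts hashtags")
print(counts.most_common(5))
```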

The framework was previously announced at a Y! Labs event, which discussed processing in Y!'s advertising platform.

Google Open Sources Sawzall

Google today open sourced Sawzall; see the original publication. From its description,
Sawzall is a procedural language developed for parallel analysis of very large data sets (such as logs). It provides protocol buffer handling, regular expression support, string and array manipulation, associative arrays (maps), structured data (tuples), data fingerprinting (64-bit hash values), time values, various utility operations and the usual library functions operating on floating-point and string values. For years Sawzall has been Google's logs processing language of choice and is used for various other data analysis tasks across the company.
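The core Sawzall pattern is per-record analysis whose results are emitted into aggregator "tables" that the system combines in parallel. A rough Python analogue of that pattern (not Sawzall syntax; the paper has real examples):

```python
# Per-record analysis emits into aggregators ("tables" in Sawzall);
# the system combines emissions across shards in parallel.
tables = {"count": 0, "total": 0.0, "sum_of_squares": 0.0}

def emit(table, value):
    tables[table] += value  # a "sum" table; Sawzall offers many kinds

def analyze(record):
    """Analogue of a Sawzall script body, run once per input record."""
    x = float(record)
    emit("count", 1)
    emit("total", x)
    emit("sum_of_squares", x * x)

for rec in ["1.5", "2.0", "0.5"]:  # stand-in for a sharded log
    analyze(rec)
print(tables)
```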

Components of Compelling Vertical Search

In this post, I will discuss key components of successful topic-specific vertical search. I was motivated to write it by the launch of blekko earlier this week.

Blekko is marketing its ability to slice the web up into verticals using slashtags. Blekko's slashtags define a list of hosts or pages to focus a search. But that is not enough to be successful. Search in a vertical needs to provide a significantly different experience from general web search. A compelling vertical search engine has the following key components:
  1. Vertical-specific ranking. A focused topic should define and utilize ranking features unique to the vertical. It may be as simple as the topical classification score for a page. It often requires applying information extraction to identify meaningful document fields. It should also leverage vertical-specific static rank features, for example topic-specific PageRank, an author/source popularity score, or other features (see the sketch after this list).

  2. Rich results. The result objects should be presented in a way that uses the structured and semantic information from the topic. Simple examples include presentations that use data from Google Rich Snippets and SearchMonkey. This may include topic-specific metadata like authors, political perspectives, addresses, or aggregated user rating scores.

  3. Faceted UI. A vertical should exploit structured metadata for exploratory search. It should allow you to flexibly combine keyword search and structured attribute restriction to limit the search space by price, airline, manufacturer, genre, date, etc. See the CHI 2006 tutorial and the relevant section on eBay Express from Marti Hearst's Search UI book.

  4. Domain knowledge. A restricted topical domain should model important relationships between objects and concepts to improve retrieval. For example, it should use a Freebase-like knowledge base of objects and their attributes. In a recipe search engine, it would model ingredients and relationships such as contains:gluten or is kosher.

  5. Task Modeling. A key benefit of a narrow domain is that it should allow users to accomplish complex tasks more easily. It should provide tools and interfaces to more directly allow users to get things done.
Of course, it needs to keep up with web search engines in ranking, comprehensiveness, and freshness, which are all key components of search quality.
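As a sketch of component 1 (my illustration; the feature names are made up, and in practice the weights would be learned from vertical-specific relevance judgments rather than hand-set):

```python
def vertical_score(doc, query_score, w=None):
    """Blend a generic query-dependent score with vertical-specific
    features for ranking inside a topical vertical.

    Hypothetical per-document features:
      topic_score    - topical classifier confidence in [0, 1]
      topic_pagerank - topic-specific static rank
      source_pop     - author/source popularity within the vertical
    """
    w = w or {"query": 1.0, "topic": 0.5, "static": 0.3, "source": 0.2}
    return (w["query"] * query_score
            + w["topic"] * doc["topic_score"]
            + w["static"] * doc["topic_pagerank"]
            + w["source"] * doc["source_pop"])
```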

For more of my thoughts on these issues, you can see the slides from my ECIR 2008 Industry Day talk, The Challenge of Engineering Vertical Search.

Overall, creating a compelling vertical experience currently requires a lot of hard work and painstaking curation. It requires a deep understanding of the tasks that users perform. It requires modeling the topic and domain objects in meaningful ways. Combining these elements together is difficult to do well. It is extremely hard to do at the scale of the entire web across all topics.

Monday, November 1

Blekko Launches: Brings Transparency to Relevance Ranking

blekko launched its public beta on Monday. blekko is a new web search engine that focuses on creating an open and transparent process around search engine relevance ranking. blekko is attempting to differentiate itself as the open alternative to "closed" search engines and involve greater public participation in the ranking process.

Creating blekko is an impressive feat because they built their own system from the ground up to crawl, index, and rank a multi-billion page search index. This is hard to do well. They have accomplished a lot in a short period of time, so I am excited about the changes we'll see as they evolve. I hope that they will take the risks that other search engines can't afford. One of their risky moves is opening up their ranking features.

Open vs. Closed Ranking
Google's "closed policy" is a difficult issue that has garnered significant criticism. For example, at ECIR 2008 in a QA with Amit Singhal, Daniel Tunkelang questioned the need to rely on security through obscurity. (For an updated perspective now that Daniel works at Google read his recent post on Google and Transparency). In response to a EU inquiry, Amit Singhal laid out the underlying philosophy of Google's ranking,
  1. Algorithmically-generated results.
  2. No query left behind.
  3. Keep it simple.
Although Google uses signals from humans in the form of links and clickthrough on search results, it does not actively involve humans in the search process. Blekko is going to be different.

As a first step toward involving users in ranking, blekko allows users to define their own search experience using "slashtags". Founder Rich Skrenta describes this in a recent blog post on crowdsourcing relevance,
We're starting by letting users define their own vertical search experiences, using a feature we call slashtags. Slashtags let all of the vertical engines that people define on blekko live within the same search box. They also let you do a search and quickly pivot from one vertical to another.
You can contrast Google's philosophy with Blekko's; here are the first 3 of 11 points in Rich Skrenta's post,
  1. Search shall be open
  2. Search results shall involve people
  3. Ranking data shall not be kept secret
  4. ...
A philosophy is great, but it doesn't matter if your results suck. Blekko just launched, so let's take a closer look.

Blekko's ranking
I tried blekko and it is a very solid first effort. To experiment, I re-ran a variety of searches from my Google web history. I didn't conduct thorough experiments, but my impression is that the ranking and coverage are very reasonable, though not as good as Google's or Bing's. SELand has a more comprehensive review with a side-by-side comparison with Google.

One frustration I encountered using blekko is that slashtags auto-fired: my query was automatically restricted to a vertical when that scope was too narrow. The limited scope caused key relevant results to be missed, and I manually backed off to /web several times. Slashtags create added complexity, which leads to problems.

I'd like to point out a few queries where I found that blekko's relevance particularly stumbled and could improve: [carbonell mmr] and [iron cook northampton]. Neither of these is an easy query. The first is somewhat ambiguous and the second is about a small local event. What I find hopeful with blekko is that I can begin to understand the underlying reason for failure. You can click on "rank stats" or use the /rank tag, e.g. [carbonell mmr /rank]. For each result blekko also provides an "seo" link to see static rank features, e.g. for http://www.ml.cmu.edu/mlpeople/affiliatedfaculty.html. As an IR researcher I find open access to this feature data very exciting. However, for the average searcher this level of detail is distracting and unnecessary. The "openness" features need to earn their real estate by being actionable, and right now they don't do that.

Instead of cluttering the search UI, I would like to see blekko be more open by providing the data through an API. It would let academics and searchers use this raw material to rerank results in novel ways.

On "Crowdsourcing relevance"
Slashtags are operators that can both restrict a search and change its ranking. They currently allow you to sort the results by /relevance or /date. Users can define slashtags that mark hosts as relevant to a topic; I started a slashtag on information-retrieval. However, restricting a query to a set of hosts using slashtags is a bit like performing surgery with a chainsaw: in the end you are missing key bits. This approach has several problems:
  1. The granularity of hosts is too coarse. The amount of content relevant to a topic could be a single page or a section of a website (see the sketch after this list).

  2. Recall. A slashtag cannot be maintained by people in real time and will miss relevant content.

  3. The semantics of a slashtag are not well defined, and it is not obvious how to combine slashtags, e.g. a topic slashtag with a /date ranking.
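To see why host granularity is the sticking point, note what a host-based slashtag restriction boils down to (my sketch of the behavior, not blekko's implementation):

```python
from urllib.parse import urlparse

# Hypothetical host list behind an /information-retrieval slashtag.
IR_HOSTS = {"trec.nist.gov", "www.sigir.org"}

def apply_slashtag(results, hosts=IR_HOSTS):
    """Keep only results whose host appears on the slashtag's list.

    Admitting a host admits the whole site, while a single relevant
    page on an unlisted host is dropped entirely: the granularity
    and recall problems above.
    """
    return [r for r in results if urlparse(r["url"]).netloc in hosts]

sample = [{"url": "http://trec.nist.gov/pubs.html"},
          {"url": "http://www.cs.umass.edu/some-ir-paper.html"}]
print(apply_slashtag(sample))  # the second page is filtered out
```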
The claim is that slashtags reduce spam by limiting search to a restricted set of trusted websites. However, I don't find the spam argument very compelling. Search engines are quite good at identifying and incorporating implicit user feedback to reduce the impact of irrelevant (spammy) results. There needs to be a more compelling reason.

Using slashtags doesn't address several key issues in crowdsourcing ranking. First, slashtags don't address the obvious need to involve people in making relevance assessments on the results in a systematic way. Second, the core of search ranking is determining which features indicate relevance for a query and how they should be combined; blekko is not currently surfacing a way to change either of those aspects.

It remains to be seen how you could really let users change the relevance in meaningful ways and more importantly, measure the utility to everyone. It may be that academia could play some role in creating and testing features.

Presentation
blekko's "10 blue links" search UI feels outdated. Modern search engines are incorporating rich results into SERPs. The "Universal Search" results blend images, videos, maps, even places and people. I hope that this is an area where we see blekko evolve quickly to catch up.

Overall
I won't be switching to blekko for regular use. Still, I find the level of ranking information and features that they share very exciting and compelling. Because I can see the ranking pieces, I'm compelled to jump in and help make things better. However, I question the utility of the information for average users and the ability to deeply engage the public in useful ways to improve ranking.

Slashtags are fun to play with, but are they useful? Slicing the web into groups creates mini vertical search experiences. However, using the tags adds complexity that may not be necessary most of the time. The value offered by the slashtag verticals is quite limited right now. I hope that slashtags will evolve to allow users to do more curation and add more value as blekko matures.