Saturday, December 6

Part II of Erik Selberg Interview on Meta-Search

The second installment of the Federated Search Blog's three-part interview with Erik Selberg is online.

I appreciate Erik's in-depth responses to the questions. In the process he shares some wisdom from his grad student days at UW, in particular how the research focus of MetaCrawler evolved. For example, the mechanics of distributed querying weren't an interesting research problem in themselves:
However, that tool could be used to collect a large number of web pages about a topic from “knowledgeable sources” and thus we could do something to analyze semantic structure. However, this wasn’t terribly well defined, and by the time we had MetaCrawler, we still weren’t sure what structure we’d want to investigate and even what kinds of semantics we were interested in. So, that part of the project was dropped, and we focused more on the research of MetaCrawler itself.
Things don't always work out as planned, but good researchers adapt and shift focus. One last nugget of wisdom for researchers from the interview:
Oren’s advice on the matter was to always investigate surprises with great vigor. Predictable things are, well, predictable, and the research that comes from steady improvement, while beneficial, tends to be rather boring. However, when you discover something that was unexpected, the results and explanations are almost always exciting and fascinating.
I can't help but notice two connections between meta-search and current search engines.
  1. The decision to perform 'deep web surfacing' rather than federating results from third-party data sources. For example, Google has started crawling the data behind forms. See the recent paper, Google's Deep-Web Crawl, and the sketch after this list.

  2. The rise of "Universal Search", the process of blending results from multiple vertical search indices, is an interesting application of meta-search. Is there research focused on the unique challenges of this use case? Considering its importance to industry, it's surprising to see the dearth of recent work in this area.
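To make the surfacing idea in item 1 concrete, here's a minimal sketch of the core trick behind work like the Deep-Web Crawl paper: instead of brokering queries to a form-backed database at query time, pre-generate GET URLs for combinations of form-input values so the result pages can be fetched and indexed like ordinary static pages. The form URL, field names, and candidate values below are all hypothetical.

```java
import java.net.URLEncoder;
import java.util.*;

/**
 * Toy deep-web surfacing sketch: enumerate the Cartesian product of
 * candidate values for a form's inputs and emit one crawlable GET URL
 * per combination. Real systems must also discover forms, choose
 * promising input values, and prune combinations that return no data.
 */
public class SurfacingSketch {
    public static void main(String[] args) throws Exception {
        String formAction = "http://example.com/used-cars/search"; // hypothetical form
        Map<String, List<String>> candidates = new LinkedHashMap<>();
        candidates.put("make", Arrays.asList("honda", "toyota", "ford"));
        candidates.put("zip", Arrays.asList("01003", "98195"));

        // Build query strings field by field (Cartesian product of values).
        List<String> queryStrings = new ArrayList<>();
        queryStrings.add("");
        for (Map.Entry<String, List<String>> field : candidates.entrySet()) {
            List<String> extended = new ArrayList<>();
            for (String prefix : queryStrings) {
                for (String value : field.getValue()) {
                    String pair = field.getKey() + "=" + URLEncoder.encode(value, "UTF-8");
                    extended.add(prefix.isEmpty() ? pair : prefix + "&" + pair);
                }
            }
            queryStrings = extended;
        }
        // Each printed URL is a candidate page for an ordinary crawler.
        for (String qs : queryStrings) {
            System.out.println(formAction + "?" + qs);
        }
    }
}
```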

Friday, December 5

Should we use Amazon Public Data Sets for test collections?

Amazon has a new service, Public Data Sets, which provides free hosting for collections of public data across different domains. This makes it simple to download the data or perform computation on it using Amazon's EC2 service.

Should IR groups be using it, or a similar model, to distribute and process test collections?

For example, there will likely be a billion-document web corpus for TREC 2009. However, there's concern over how many groups have the resources to handle a collection that large.

Thursday, December 4

Federated Search Blog Part I of Erik Selberg Interview

Last Friday Federated Search Blog posted part I of an interview with Erik Selberg. Erik created MetaCrawler, one of the first meta-search engines. He wrote it for his master's project at the University of Washington (UW) and continued to work on meta-search for his dissertation.

The interview reminds me of the article I wrote back in 2006 on the beginnings of meta-search, featuring MetaCrawler. Maybe sometime I'll get around to part II.

One quote from the interview struck me because it deals with the problem of extracting interesting research questions from engineering tasks. Erik writes,
Fundamentally, a Web service that simply sends a query to a number of search engines and brings back results isn’t all that interesting for a researcher. That’s an engineering problem, and not a difficult one. But there are a number of questions that ARE interesting — such as how do you optimally collate results? How do you scale the service?... Oren pushed me to answer those questions.
The ability to abstract the interesting problems in a system and focus on those is a skill I'm still in the process of acquiring.
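Erik's collation question remains a nice, self-contained problem. As a flavor of the simplest answers, here's a sketch of CombSUM-style score fusion (Fox and Shaw's classic heuristic): normalize each engine's scores to [0, 1], then sum a document's normalized scores across engines. The engines, URLs, and scores below are invented for illustration; this is not MetaCrawler's actual algorithm.

```java
import java.util.*;

/** Minimal CombSUM-style fusion of several (url -> score) result lists. */
public class CombSum {
    static List<Map.Entry<String, Double>> fuse(List<Map<String, Double>> engines) {
        Map<String, Double> fused = new HashMap<>();
        for (Map<String, Double> results : engines) {
            // Min-max normalize so engines with different score scales are comparable.
            double max = Collections.max(results.values());
            double min = Collections.min(results.values());
            for (Map.Entry<String, Double> e : results.entrySet()) {
                double norm = (max == min) ? 1.0 : (e.getValue() - min) / (max - min);
                fused.merge(e.getKey(), norm, Double::sum); // sum across engines
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(fused.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Double> engineA = Map.of("a.com", 9.0, "b.com", 7.5, "c.com", 2.0);
        Map<String, Double> engineB = Map.of("b.com", 0.9, "d.com", 0.4);
        for (Map.Entry<String, Double> e : fuse(List.of(engineA, engineB))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Documents returned by multiple engines naturally float toward the top, which is one crude signal of quality; variants like CombMNZ weight that agreement even more heavily.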

Erik solved the problem of combining a bunch of unreliable search engines into one that was very useful, and in the process he pioneered early research on meta-search. It's amazing how far web search engines have come: from unreliable early prototypes developed by grad students to today's multi-billion-dollar industry.

I look forward to reading part II.

Wednesday, December 3

Large Scale Cluster Computing Course at the University of Washington

Yesterday I did a quick roundup of the IR courses on offer. Today, I'd like to highlight the UW course on large-scale cluster computation, which is being offered again this fall.

CSE 490H: Scalable Systems: Design, Implementation and Use of Large Scale Clusters
The topics covered include MapReduce and MapReduce algorithms, distributed file systems like the Google File System, cluster monitoring, and power and availability issues. The course is taught by Ed Lazowska and Aaron Kimball. The class uses Hadoop, the widely used open-source MapReduce framework created by Doug Cutting and Yahoo!, to give students hands-on experience.

The four class assignments help students become familiar with real-world tools and tasks:
  1. Set up and test Apache Hadoop, using it to count words in a corpus and build an inverted index (a word-count sketch follows below).
  2. Run PageRank on Wikipedia to find the most highly cited articles.
  3. Create maps and map tiles of the US from geographic survey data; assignments 3 and 4 together build a rudimentary version of Google Maps.
  4. Use Amazon S3 storage and an EC2 compute cluster to look up addresses on the maps created in assignment three, and connect the lookup to a web front end.
The assignments provide great step-by-step instructions for anyone interested in becoming familiar with Hadoop and getting a basic installation set up and working.
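For a taste of assignment 1, here is the canonical Hadoop word count, written against the org.apache.hadoop.mapreduce Java API found in current Hadoop releases; the course materials may target an older API version, so treat this as a sketch rather than the assignment's exact starter code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    /** Emits (word, 1) for every token in each input line. */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Sums the counts emitted for each word. */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The inverted-index half of the assignment has the same shape: the mapper emits (term, documentId) pairs and the reducer concatenates each term's posting list.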

The videos and slides of the lectures are also available to view and download. This is fantastic, because the guest speakers look really interesting: Jeff Dean from Google and Werner Vogels from Amazon speak about the tools and their future directions.

The class is a great quick-start on using Hadoop for cluster computation.

On a related note, you may also want to look at the lectures and materials from a mini-course on cluster computing for Google interns.

Here at UMass we do large-scale indexing using a MapReduce-like framework called TupleFlow, which powers the Galago search engine; both were written by Trevor Strohman (now at Google).

Tuesday, December 2

Fall 2008 Information Retrieval Courses

Here is a selection of the Fall 2008 IR and search engine courses from around the web.

CS276 (updated for fall 2008) - The Stanford graduate IR course, taught by Christopher Manning and Prabhakar Raghavan. This is the standard IR course. Their new book Introduction to Information Retrieval is quickly becoming one of the standard texts.

CS572: Information Retrieval and Web Search
At Emory, taught by Eugene Agichtein.

CS 4300 / INFO 4300 Information Retrieval
At Cornell, taught by William Arms.

CSI550: Information Retrieval
At the University at Albany, taught by Prof. Tomek Strzalkowski.

In addition to the aforementioned Stanford IR book, the new IR book from the UMass IR lab, Search Engines: Information Retrieval in Practice by Bruce Croft, Donald Metzler, and Trevor Strohman, seems to be gaining adoption.

See also my previous post on IR courses.

Monday, December 1

Sparse information on TREC 2008

The annual TREC conference was held a little over a week ago in Maryland. So far, there have been no public reports on how it went. It would be useful to have the sessions videotaped and broadcast over the web, along with other channels for non-attendees to interact.

Rodrygo from Glasgow has a post covering the blog track workshop, focusing mainly on the discussion around the 2009 track. Notably, the opinion finding and polarity tasks are being discontinued.

Rodrygo writes,
It was a consensus among the attendees that opinion retrieval and polarity detection are still open, relevant problems. Yet, while a few groups managed to deploy interesting techniques that achieved consistent opinion retrieval performances across several strongly performing baselines in the track this year, polarity detection approaches looked rather naive.