Saturday, August 2

SIGIR 2008 Workshop Proceedings

Most of the SIGIR workshop proceedings are now online. There's an overwhelming amount of material, but over the next week I hope to pick out a few of the highlights. I would love to hear what attendees learned from each workshop and what they found exciting or disappointing: blog about it or drop me an e-mail.

Learning to Rank (full proceedings)
The main purpose of this workshop is to bring together information retrieval researchers and
machine learning researchers... The goal is to design and apply methods to automatically learn a function from training data, such that the function can sort objects (e.g., documents) according to their degrees of relevance, preference, or importance as defined in a specific application.

See Jon's coverage.

Focused Retrieval - (full proceedings)
Focused retrieval has been used to extract relevant sections from academic documents; and the application to text book searching is obvious (such commercial systems already exist). The purpose of this workshop is to raise issues and promote discussion on focused retrieval - that is, Question Answering (QA), Passage Retrieval, and Element Retrieval (XML-IR).

Information Retrieval for Advertising - (full proceedings)
Online advertising systems incorporate many information retrieval techniques by combining content analysis, user interaction models, and commercial constraints. Advances in online advertising have come from integrating several core research areas: information retrieval, data mining, machine learning, and user modeling. The workshop will cover a range of topics on advertising, with a focus on application of information retrieval techniques.

Mobile Information Retrieval (MobIR '08) - (full proceedings)
Mobile Information Retrieval (MobIR'08) is a timely workshop concerned with the indexing and retrieval of textual, audio and visual information such as text, graphics, animation, sound, speech, image, video and their various possible combinations for use in mobile devices with wireless network connectivity.

Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments - (full proceedings)
New methods like preference judgments or usage data require learning methods, evaluation measures, and collection procedures designed for them. This workshop will address research challenges at the intersection of novel measures of relevance, novel learning methods, and core evaluation issues.

Future Challenges in Expertise Retrieval - (full proceedings and slides)
The main theme of the workshop concerns future challenges in Expertise Retrieval. Instead of focusing on core algorithmic aspects of a specific expert finding scenario (as is the case for the TREC Expert Finding task), our aim is to broaden the topic area and to seek for potential connections with other related fields.

Analytics for Noisy Unstructured Text Data (full proceedings behind ACM web login)
Noise in text can be defined as any kind of difference between the surface form of a coded
representation of the text and the intended, correct, or original text. The goal of the AND workshops is to focus on the problems encountered in analyzing noisy documents coming from various sources.

Best student paper award: Latent Dirichlet Allocation Based Multi-Document Summarization by Rachit Arora and Balaraman Ravindran

Workshops without proceedings online (yet):
Aggregated Search
Speech Search (SSCS)

Lastly, an older, but highly related workshop from WWW 2008:
Adversarial Information Retrieval (AIRWeb) - (program with papers and slides)
The program is structured around 3 sessions with presentations of peer-reviewed papers on Adversarial IR on the Web, covering usage analysis, network analysis and content analysis; followed by one session with the Web Spam Challenge results and a panel on the future of Adversarial IR on the Web.

Thursday, July 31

SIGIR 2008 coverage from around the web

See also the earlier coverage of the learning to rank workshop and Greg Linden's keynote coverage.

First up, Paul Heymann on the Stanford InfoLab Blog has some of the best coverage of the conference I've read yet.

In case you missed Jon's earlier comment, he has coverage of the Learning to rank sessions and workshop.

Paraic Sheridan covers the keynote on Google China and its future in Africa. Paraic is a computational linguist at the Centre for Next Generation Localisation (CNGL) at Dublin City University.

Pranam Kolari, from Yahoo!'s web spam team, has coverage of Kai-Fu Lee's keynote.

Best paper awards
I couldn't find the award information on the SIGIR 2008 site, but here's what I pieced together; please correct me if I'm wrong:
Algorithmic Mediation for Collaborative Exploratory Search (best paper award)
BrowseRank: Letting Web Users Vote for Page Importance (best student paper award)

Also, I look forward to reading Peter Bailey's paper Relevance Assessment: Are Judges Exchangeable and Does it Matter?
In the paper, the authors examine the impact of assessor expertise on the quality of relevance judgments. In the end, they conclude:
...the Cranfield method of evaluation is somewhat robust to variations in relevance
judgements. Having controlled for task and topic expertise, system performance measures show statistically significant, but not large, differences. Similarly, system orderings allow us to identify “good” and “bad” IR systems at a broad-brush level.
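To make the system-ordering comparison concrete, here is a sketch of how such robustness is commonly measured: rank the systems by a performance metric under two different sets of judges, then compare the two orderings with Kendall's tau. The system names and scores below are invented for illustration, not taken from the paper.

```python
# Compare system orderings produced by two judge sets using Kendall's tau.
# All scores here are hypothetical; this is not the paper's actual data.

def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items.

    Each ranking is a list of items ordered best-to-worst. Tau is
    (concordant - discordant) / total pairs: 1.0 means identical
    orderings, -1.0 means fully reversed.
    """
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    n = len(ranking_a)
    for i in range(n):
        for j in range(i + 1, n):
            # The pair is concordant if ranking B preserves A's order.
            if pos_b[ranking_a[i]] < pos_b[ranking_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical MAP scores for five systems under expert vs. novice judges.
expert = {"S1": 0.31, "S2": 0.28, "S3": 0.22, "S4": 0.19, "S5": 0.11}
novice = {"S1": 0.29, "S2": 0.30, "S3": 0.21, "S4": 0.18, "S5": 0.12}

rank_expert = sorted(expert, key=expert.get, reverse=True)
rank_novice = sorted(novice, key=novice.get, reverse=True)
print(kendall_tau(rank_expert, rank_novice))  # prints 0.8: orderings mostly agree
```

A high tau despite differing absolute scores is exactly the "broad-brush" robustness the quote describes: the judges disagree on numbers, but good systems still rank above bad ones.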

Tuesday, July 29

What makes Cuil different: Index Infrastructure

I had a brief post yesterday on Cuil's launch, along with seemingly every other author in the blogosphere. My question is: What makes Cuil different from GYM (Google, Yahoo!, and Microsoft)? Here is what I have managed to glean from all the press coverage yesterday and my own experimentation with the engine.

Cuil's plans to differentiate itself

1) It's about the infrastructure, of course.
From a recent interview with GigaOm, Anna Patterson, formerly one of Google's infrastructure designers, reportedly said:
How it works is that the company has an index of around 120 billion pages that is sorted on dedicated machines, each one tasked with conducting topic-specific search — for instance, health, sports or travel. This approach allows them to sift through the web faster (and probably cheaper) than Google...
The Forbes article has a little more detail on their query serving architecture:

Patterson and Costello's impressive feat is that they've done this with a total of 1,400 eight-CPU computers (1,000 find and data-mine Web pages, the remaining 400 serve up those pages) [JD: Even assuming there is no redundancy, 120 billion docs / 400 servers = 300 million documents per node. This seems unrealistically high, especially considering that Lucene, a widely used search library, can realistically handle 10-20 million.] ...

Cuil attempts to see relationships between words and to group related pages in a single server. Patterson says this enables quicker, more efficient searching: "While most queries [at competitors] go out to thousands of servers looking for an answer, 90% of our queries go to just one machine."

Finally, to compare with Google's architecture, a quote from Danny Sullivan's interview with Anna:
If they [Google] wanted to triple the size of their index, they'd have to triple the size of every server and cluster. It's not easy or fast...increasing the index size will be a 'non-trivial' exercise
According to the news, Cuil's index serving infrastructure is a key competitive advantage over Google and the other major players. It remains to be seen if they can leverage this platform to produce world-class results.
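The topic-sharded design described in these quotes can be sketched roughly as follows. Everything here (shard names, topic keyword lists, the routing heuristic) is my own illustrative guess at the general idea, not Cuil's actual implementation: documents are partitioned by topic onto dedicated machines, so a query that clearly belongs to one topic can be answered by a single shard instead of fanning out to the whole cluster.

```python
# Hypothetical sketch of topic-sharded query routing. A query matching a
# known topic is sent to that topic's one shard; anything else falls back
# to fanning out across all general shards. Names are illustrative only.

TOPIC_SHARDS = {
    "health": "shard-health.example.internal",
    "sports": "shard-sports.example.internal",
    "travel": "shard-travel.example.internal",
}
GENERAL_SHARDS = ["shard-general-%d.example.internal" % i for i in range(10)]

TOPIC_KEYWORDS = {
    "health": {"flu", "symptoms", "diet"},
    "sports": {"score", "league", "playoffs"},
    "travel": {"flights", "hotel", "itinerary"},
}

def route_query(query):
    """Return the list of shards that must be consulted for this query."""
    terms = set(query.lower().split())
    for topic, keywords in TOPIC_KEYWORDS.items():
        if terms & keywords:
            # Topic match: the whole answer lives on one machine.
            return [TOPIC_SHARDS[topic]]
    # No topic match: fan the query out to every general shard.
    return list(GENERAL_SHARDS)

print(route_query("cheap flights to dublin"))  # hits only the travel shard
print(len(route_query("barack obama")))        # fans out to all 10 general shards
```

If the "90% of our queries go to just one machine" claim holds, the win is that per-query fan-out (and thus hardware cost) stays nearly constant as the index grows, since adding a topic adds a shard rather than enlarging every server.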

On their size claims

Last I heard, Google's index is rumored to be in the 40 billion range, and Microsoft's in the 10-20+ billion range. Cuil claims its architecture allows at least a 3x increase in index size over Google's. However, this is hard to verify because Cuil's hit counts are badly broken: a search for [the] returns an estimated 250 documents. The lack of support for advanced search operators, such as site:, also makes it difficult to compare coverage of individual sites, such as Wikipedia.

Other differentiating features:
  • Topic-specific ranking
    From Danny's interview, it sounds like Cuil is doing post-retrieval analysis of document content, analyzing phrase co-occurrence and extracting 'concepts'. From the interview:
    It figures out these relationships by seeing what type of words commonly appear across the entire set of pages it finds. Since "gryffindor" appears often on pages that also say "harry potter," it can tell these two words (well, three words -- but two different query terms) are related.

    Cuil then reportedly computes a topic-specific link score. It sounds very similar to Teoma's HITS technology. Again, there is no evidence (yet) for Cuil's claim that this is superior to other search approaches.

  • UI and exploration
    Cuil has a non-standard two- or three-column layout of results, which attempts to feel more like a newspaper, with images associated with many results.

    It appears to use the information from the content analysis to create the 'Explore by Category' box to drill down into specific topics, as well as offering related searches as tabs across the top of the page.
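The co-occurrence analysis behind the "gryffindor"/"harry potter" example above can be illustrated with a toy sketch: count how often pairs of terms appear together across a set of retrieved pages, and treat strongly co-occurring pairs as related concepts. This is my own minimal illustration of the general technique, not Cuil's algorithm, and the sample pages are invented.

```python
# Toy co-occurrence counting over a set of retrieved pages: term pairs that
# appear together on many pages are candidate "related concepts".

from collections import Counter
from itertools import combinations

pages = [
    "harry potter gryffindor hogwarts quidditch",
    "gryffindor harry potter sorting hat",
    "quidditch world cup ireland",
    "harry potter gryffindor house points",
]

pair_counts = Counter()
for page in pages:
    # Deduplicate and sort terms so each unordered pair is counted
    # at most once per page, in a canonical order.
    terms = sorted(set(page.split()))
    pair_counts.update(combinations(terms, 2))

# Pairs that co-occur on at least 3 of the 4 pages are flagged as related.
related = sorted(pair for pair, n in pair_counts.items() if n >= 3)
print(related)  # [('gryffindor', 'harry'), ('gryffindor', 'potter'), ('harry', 'potter')]
```

A production system would normalize these counts (e.g. against each term's overall frequency) so that common words don't dominate, but the core signal is the same: terms sharing many pages are likely about the same topic.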

Closing thoughts: The 120 billion pages people care about

Size matters, but it's more important to get the right content into the index. Usefulness is subjective since the web is effectively infinite, and only a subset of it is useful. Google tracks at least a trillion distinct URLs, while Cuil has crawled a mere 186 billion (SE Land reference). It's critical that crawling and indexing be prioritized correctly. For example, despite the reported massive index size, this blog is not indexed by Cuil. While on its own this doesn't mean much, Daniel reports similar experiences with lack of coverage of his content.

I am unimpressed with Cuil's current coverage and relevance, but it's still early. Despite all the criticism (much of it justified), launching a search engine of this scale is an impressive feat. I think what Cuil is doing is exciting, and I'm withholding judgment until it has time to mature. Once again, congratulations to the Cuil team and good luck with the long road ahead.

Monday, July 28

SIGIR Learning to Rank (LR4IR) 2008 Proceedings

The LR4IR proceedings are available for download from the LR4IR 2008 website.

Highlights to come...

I'm still waiting for more blog coverage of SIGIR 2008, and I'm not alone.

Cuil launches with 120 billion page index

Cuil launched.

Read the good coverage from GigaOm and Danny Sullivan over at SELand. It's an early stage product, but maybe I'll have a review later today.

Cuil has a formidable group of search veterans, including Louis Monier from AltaVista/eBay, Tom Costello from IBM WebFountain, and Google veterans Anna Patterson and Russell Power; see their bios on the Cuil management page. Congratulations to the entire team, launching a product of this scale is a monumental undertaking!