Thursday, December 18

December SIGIR Forum

For a little light holiday reading, the December issue of the SIGIR Forum is available online.

What should grad students be learning?

Harvard professor Michael Mitzenmacher has a post on his blog titled What Else Should Grad Students Be Learning?

He highlights skills such as: time management, writing/speaking, leadership, and entrepreneurship. It's a good list. However, I would be interested to know what skills grad students should be developing that differ from what people should be developing in the 'real world'. In fact, I think industry can teach you these skills faster in many cases because the environment can be more demanding with near-term revenue and jobs on the line.

Personally, I think grad students should learn about funding and the grant process. This is similar to the reason that programmers in industry should know the basics of business and managerial accounting. The bottom line is that techies need to be able to live and communicate effectively with a non-technical audience, some of whom may be their bosses.

TREC 2009 Call for Participation

The TREC 2009 call for participation went out yesterday.

It should be an interesting year with new and larger web and blog collections.

The registration deadline is in February.

Tuesday, December 16

Synthese Article Recommendation System

Andre announced a first release of CISTI's Synthese Recommender system for journal articles. It uses Taste and the article citations as a substitute for user preference data.

You can search the 1.5 million BioMed articles and add them to your 'basket'. The system then recommends new articles you should read.
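To illustrate how citations can stand in for user preference data, here is a minimal sketch. It is not Synthese's actual implementation (which uses Taste); it assumes each citing article acts like a "user" whose reference list is its set of liked "items", and it scores candidates by how often they are co-cited with the articles in your basket. The article IDs and the `recommend` function are hypothetical.

```python
from collections import Counter

# Hypothetical citation data: citing article -> set of articles it cites.
# Each citing article plays the role of a "user"; its references are "items".
CITATIONS = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b"},
    "p3": {"b", "d"},
    "p4": {"c", "d", "e"},
}


def recommend(basket, citations, top_n=3):
    """Rank articles by how often they are co-cited with basket articles."""
    scores = Counter()
    for refs in citations.values():
        overlap = len(refs & basket)
        if overlap == 0:
            continue
        # Every non-basket reference earns one point per shared basket item.
        for article in refs - basket:
            scores[article] += overlap
    # Sort by score (descending), then by ID so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]


print(recommend({"a"}, CITATIONS))
```

A real collaborative filtering library would normalize these counts (for example with a similarity measure) so that heavily cited articles do not dominate every recommendation list.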

Searching for the longest path...

A large group of us in the IR lab recently finished a class in Advanced Algorithms. Here's an algorithms tribute, Find The Longest Path [lyrics] to the tune of Billy Joel's The Longest Time. It was reportedly written in 1988 by a grad student at Johns Hopkins during a particularly hard algorithms final.

Now that the semester is over, it's good to be able to focus on research again. After all, the SIGIR 2009 deadline is only about a month away!

Friday, December 12

Erik Selberg Interview Part III

Sol has part III of his interview with Erik Selberg about meta-search.

You can read about his latest work at Amazon, dealing with data integration issues.

Monday, December 8

Off-topic: Let the Dinosaurs Die

Major newspapers and the big three car companies are dinosaurs and it's time for them to face extinction. Today, the Tribune group, which includes the LA Times, Chicago Tribune, and other national newspapers, filed for bankruptcy. Even the New York Times, which is one of the more tech-savvy, could be next. As Nick Bilton, a NY Times R&D researcher, writes in a recent O'Reilly article,
For the past 157 years (that's how old the newspaper is) we've essentially delivered 'dumb content' to people's doorsteps. You and I, irrespective of interests, location etc. have received the same newspaper on our doorsteps every morning. We're beginning to explore ways to make content smarter, to understand what you've read, which device you've read it on and your micro level interests—making the most important news find you, instead of you having to find it.
However, it may be too little, too late. Newspapers could have adapted to the Internet and the realities of the web. Instead, they chose to fight tooth and nail, trying to forestall the inevitable change in business models. For example, they could have developed really good online classified services, but they didn't, and Craigslist is eating them alive. Despite owning some of the best content and writers, they can't seem to figure out what to do with it all. Many of them have largely ignored the reality of content navigation via search engines and SEO. In short, they don't get the web. There are a few examples of individuals trying to change this. For example, Marshall Simmonds, formerly of (now owned by the NY Times), has been leading the charge to open up the NY Times archives. However, these companies have a lot of sunk costs in outdated technology and infrastructure, and most of them won't survive the recession.

I'm even less sympathetic to the big three automakers whose short-sightedness building gas-guzzling trucks and SUVs has finally caught up with them.

If these companies can't adapt to the reality of the market then they should fail. There should be no government bailouts. I don't want to loan them my money when they ruined their companies and there do not appear to be concrete, viable turnaround plans.

Standing by and letting car companies and newspapers fail will certainly hurt in the short run, but ultimately it will make our economy stronger and more competitive by getting rid of the weak and corrupt. Both newspapers and car companies will be replaced by new breeds of more nimble businesses that figure out ways to adapt and thrive in a new world.

Just a last note: This does not mean that I am unsympathetic to the hard-working blue collar workers at these companies. I have friends and relatives involved in these businesses. I think that if anything is done, the government should extend education money, health care, and other benefits to help these workers make ends meet and re-train. We should all chip in and help our neighbors get by in tough times; I'd want others to do the same for me.

Saturday, December 6

Part II of Erik Selberg Interview on Meta-Search

The second in the three-part series of Federated Search's interview with Erik Selberg is online.

I appreciate Erik's in-depth responses to the questions. In the process he shares some wisdom from his grad student days at UW, in particular how the research focus of MetaCrawler evolved. For example, the mechanics of distributed querying weren't an interesting research problem,
However, that tool could be used to collect a large number of web pages about a topic from “knowledgeable sources” and thus we could do something to analyze semantic structure. However, this wasn’t terribly well defined, and by the time we had MetaCrawler, we still weren’t sure what structure we’d want to investigate and even what kinds of semantics we were interested in. So, that part of the project was dropped, and we focused more on the research of MetaCrawler itself.
Things don't work out as planned, but good researchers adapt and shift focus. One last nugget of wisdom for researchers from the interview:
Oren’s advice on the matter was to always investigate surprises with great vigor. Predictable things are, well, predictable, and the research that comes from steady improvement, while beneficial, tends to be rather boring. However, when you discover something that was unexpected, the results and explanations are almost always exciting and fascinating.
I can't help but notice two connections of meta-search to current search engines.
  1. The decision to perform 'deep web surfacing' rather than federating results from third-party data sources. For example, Google has started crawling the data behind forms. See the recent paper, Google's Deep-Web Crawl.

  2. The rise of "Universal Search", the process of blending results from multiple vertical search indices, is an interesting application of meta-search. Is there research that focused on the unique challenges of this use case? Considering the importance to industry, it's surprising to see the dearth of recent work in this area.

Friday, December 5

Should we use Amazon Public Data Sets for test collections?

Amazon has a new service, Public Data Sets, where it provides free hosting on AWS for collections of public data across different domains. This makes it simple to download them or perform computation on them using Amazon's EC2 service.

Should IR groups be using it or a similar model to distribute and perform processing of test collections?

For example, there will likely be a billion document web corpus for TREC 2009. However, there's concern over the number of groups with the resources to handle a collection that large.

Thursday, December 4

Federated Search Blog Part I of Erik Selberg Interview

Last Friday Federated Search Blog posted part I of an interview with Erik Selberg. Erik created MetaCrawler, one of the first meta-search engines. He wrote it for his master's project at the University of Washington (UW) and continued to work on meta-search for his dissertation.

The interview reminds me of the article I wrote back in 2006 on the beginning of metasearch featuring MetaCrawler. Maybe sometime I'll get around to part II.

One quote from the interview struck me because it deals with the problem of extracting interesting research questions from engineering tasks. Erik writes,
Fundamentally, a Web service that simply sends a query to a number of search engines and brings back results isn’t all that interesting for a researcher. That’s an engineering problem, and not a difficult one. But there are a number of questions that ARE interesting — such as how do you optimally collate results? How do you scale the service?... Oren pushed me to answer those questions.
The ability to abstract the interesting problems in a system and focus on those is a skill I'm still in the process of acquiring.

Erik solved the problem of combining a bunch of unreliable search engines to create one that was very useful, and in the process he pioneered early research on meta-search. It's amazing how far web search engines have come, from unreliable early prototypes developed by grad students to today's multi-billion dollar industry.
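To make the "how do you collate results?" question concrete, here is a minimal sketch of one simple merging scheme. It is not MetaCrawler's method; it assumes each engine returns a ranked list of URLs and lets every engine vote 1/rank for each result it returns. The engine lists and the `merge_results` name are hypothetical.

```python
from collections import defaultdict


def merge_results(result_lists, top_n=5):
    """Merge ranked URL lists: each engine votes 1/rank for its results."""
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            if url in seen:  # ignore duplicates within one engine's list
                continue
            seen.add(url)
            scores[url] += 1.0 / rank
    # Highest combined score first; break ties alphabetically.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return ranked[:top_n]


# Hypothetical result lists from three unreliable engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "x.example", "w.example"]
engine_c = ["y.example", "z.example"]

print(merge_results([engine_a, engine_b, engine_c]))
```

Rank-based voting sidesteps the fact that raw scores from different engines are not comparable, which is one reason collation is a research question rather than pure engineering.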

I look forward to reading part II.

Wednesday, December 3

Large Scale Cluster Computing Course at the University of Washington

Yesterday I did a quick roundup of IR courses that were offered. Today, I'd like to highlight the UW course on large-scale cluster computation that is being offered again this fall.

CSE 490H: Scalable Systems: Design, Implementation and Use of Large Scale Clusters
The topics covered are MapReduce, MapReduce algorithms, distributed file systems like the Google File System, cluster monitoring, and power and availability issues. The course is taught by Ed Lazowska and Aaron Kimball. The class uses the popular Hadoop MapReduce framework created by Doug Cutting and Yahoo! to give students hands-on experience.

The four class assignments help students become familiar with real-world tools and tasks:
  1. Set up and test Apache Hadoop, using it to count words in a corpus and build an inverted index.
  2. Run PageRank on Wikipedia to find the most highly cited articles.
  3. Assignments 3-4 build a rudimentary version of Google Maps.
    Assignment 3 creates maps and tiles of the US from geographic survey data.
  4. Use Amazon S3 storage and an EC2 compute cluster to look up addresses on the maps created in assignment three and connect it to a web front end.
The assignments provide great step-by-step instructions for anyone interested in getting familiar with Hadoop and getting a basic version set up and working.
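For a feel of what the first assignment involves, here is a minimal single-process sketch of the map, shuffle, and reduce steps behind an inverted index. It is plain Python rather than Hadoop code and is not the course's solution; the function names and the toy corpus are assumptions for illustration.

```python
from collections import defaultdict
import re


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per token, like a MapReduce mapper."""
    for term in re.findall(r"[a-z0-9]+", text.lower()):
        yield term, doc_id


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle/sort."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Collapse a term's doc IDs into a sorted posting list with counts."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, sorted(counts.items())


def build_index(corpus):
    """Run map, shuffle, and reduce over a {doc_id: text} corpus."""
    pairs = (pair for doc_id, text in corpus.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(t, ids) for t, ids in shuffle(pairs).items())


# Toy corpus (hypothetical documents).
corpus = {
    "d1": "to be or not to be",
    "d2": "to search is to find",
}
index = build_index(corpus)
print(index["to"])
```

Summing the counts in a posting list gives the corpus-wide word count, so the same job covers both halves of the assignment. On a real cluster the framework performs the shuffle and runs mappers and reducers in parallel across machines.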

Also the videos and slides of the lectures are available to view/download. This is fantastic because the speakers in the class look really interesting, such as Jeff Dean from Google and Werner Vogels from Amazon speaking about the tools and their future directions.

The class is a great quick-start on using Hadoop for cluster computation.

On a related note, you may also want to look at the lectures and materials for a mini-course on cluster computing for the Google interns.

Here at UMass we do large-scale indexing using a MapReduce-like framework called TupleFlow that powers the Galago search engine; both were written by Trevor Strohman (now at Google).

Tuesday, December 2

Fall 2008 Information Retrieval Courses

Here is a selection of the Fall 2008 IR and search engine courses from around the web.

CS276 (updated for fall 2008) - The Stanford graduate IR course, taught by Christopher Manning and Prabhakar Raghavan. This is the standard IR course. Their new book Introduction to Information Retrieval is quickly becoming one of the standard texts.

CS572: Information Retrieval and Web Search
At Emory taught by Eugene Agichtein.

CS 4300 / INFO 4300 Information Retrieval
At Cornell, taught by William Arms.

CSI550: Information Retrieval
At University of Albany, taught by Prof. Tomek Strzalkowski.

In addition to the aforementioned Stanford IR book, the new IR book from the UMass IR lab, Search Engines: Information Retrieval in Practice by Bruce Croft, Donald Metzler, and Trevor Strohman, seems to be gaining adoption.

See also my previous post on IR courses.

Monday, December 1

Sparse information on TREC 2008

The annual TREC conference was held a little over a week ago in Maryland. So far, there have been no public reports on how it went. It would be useful to have the meetings videotaped and broadcast over the web, as well as other ways of interacting with non-attendees.

Rodrygo from Glasgow has a post covering the blog track workshop, focusing mainly on the discussion around the 2009 track. Notably, the opinion finding and polarity tasks are being discontinued.

Rodrygo writes,
It was a consensus among the attendees that opinion retrieval and polarity detection are still open, relevant problems. Yet a few groups managed to deploy interesting techniques that achieved consistent opinion retrieval performances across several strongly performing baselines in the track this year, polarity detection approaches looked rather naive.

Tuesday, November 25

Towards task-centric social search engines

At the Demo 08 conference there was a panel on "Where the Web is Going". Greg focused on a section where Peter Norvig from Google and Prabhakar Raghavan from Yahoo! talk about "task-centric search" or "wish fulfillment", which helps people accomplish tasks instead of simply acting as an information retrieval engine. Prabhakar says,
Just as today we recognize keywords, we should recognize tasks and intent. The search engine should 'read' and synthesize the information to solve the intent...
Here are some of the complex tasks that I've performed or worked on in the last few months where the current retrieval technology did not solve my underlying task:
  • Find me the 'best' used car within 100 miles of Amherst, MA for less than $10,000.
  • Create a gourmet Thanksgiving menu that is gluten-free.
  • Plan a romantic weekend away to the Finger Lakes that involves hiking and wine tasting for less than $150 per day.
  • Create a homemade Christmas pack of spice mixes and drinks as presents for friends.
These are all complex and highly subjective tasks that require personalization based on my budget, time constraints, skills, and personal preferences. To be able to answer these questions the search engine needs to be able to decompose the tasks to sub-tasks as well as "know me" at a much deeper level than a few keywords. Solving a task requires a dialogue with the engine to clarify ambiguity in the task and sub-tasks. It requires closer "collaboration" between the search engine and the user.

One intriguing thread of many of my search tasks is that I would like to poll my friends and family with knowledge of a specific domain (machine learning) or a geographic area (Finger Lakes). I then want to synthesize their opinions weighted by their expertise and preference similarity. What is the best way to leverage their expertise to help me solve my task?

What is the way forward from keyword-centric search to task-centric search? How will search engines help me leverage my social network and the knowledge of others to solve my task?

Friday, November 21

Google harnesses "Wisdom of Crowds" with Wiki of Search

See the official blog post. Here is Googlers Cedric and Corin's description of "SearchWiki",
...SearchWiki, a way for you to customize search by re-ranking, deleting, adding, and commenting on search results. With just a single click you can move the results you like to the top or add a new site... The changes you make only affect your own searches. But SearchWiki also is a great way to share your insights with other searchers. You can see how the community has collectively edited the search results by clicking on the "See all notes for this SearchWiki" link.
This has been in testing for a while. See my previous post on Google's Wiki of Search and Eric Schmidt's notes from 2006.

I think getting users more involved is a good idea, especially when you have an audience as big as Google. However, I'm skeptical about the current system's utility. For example, it is disconnected from Google's related products, such as Google Notebook and Google Bookmarks. It doesn't allow me to incorporate my social network. I don't think that most people have a compelling desire to edit or comment on their own search results. A few common queries may get edits, but what about the long tail of search? That said, maybe I'll go change the ranking for a few vanity searches anyway ;-).

SE Land also has informative coverage.

Thursday, November 20

MIT Mathematics for Computer Science is a godsend

This semester I'm taking a class in Advanced Algorithms. We are currently investigating randomized algorithms.

I've found the MIT open courseware material a godsend. MIT offers a course, Mathematics for Computer Science (2002) with a significant section on probability theory, including the bounding techniques we've been studying. If you want a good crash course in stats I highly recommend reading the notes on lectures 10-14. The notes are clear and the examples fascinating. I'll share one of my favorites. Professor Chernoff did an investigation of the Mass. lottery, described in the notes for lectures 13-14:
There is a lottery game called Pick 4. In this game, each player picks 4 digits, defining a number in the range 0 to 9999. A winning number is drawn each week. The players who picked the winning number win some cash. A million people play the lottery, so the expected number of winners each week is 100... In this case, a fraction of all money taken in by the lottery was divided up equally among the winners. A bad strategy would be to pick a popular number. Then, even if you pick the winning number, you must share the cash with many other players. A better strategy is to pick a lot of unpopular numbers. You are just as likely to win with an unpopular number, but will not have to share with anyone. Chernoff found that peoples’ picks were so highly correlated that he could actually turn a 7% profit by picking unpopular numbers!
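The arithmetic behind that excerpt is easy to sketch. The player and number counts below come from the notes; the payout fraction and ticket price are assumptions chosen only for illustration, so this does not reproduce Chernoff's 7% figure. It only shows why an unpopular number pays more when the pot is split among winners.

```python
NUM_PLAYERS = 1_000_000   # from the lecture notes
NUM_NUMBERS = 10_000      # Pick 4: numbers 0000-9999
PAYOUT_FRACTION = 0.5     # assumption: share of ticket revenue paid out
TICKET_PRICE = 1.0        # assumption: one dollar per ticket


def expected_winners(players=NUM_PLAYERS, numbers=NUM_NUMBERS):
    """Expected winners per draw if picks are spread uniformly."""
    return players / numbers


def expected_return(relative_popularity):
    """Expected payout per ticket for a number picked `relative_popularity`
    times as often as average (1.0 = average, 0.5 = half as popular)."""
    pot = PAYOUT_FRACTION * NUM_PLAYERS * TICKET_PRICE
    # Expected number of people splitting the pot if this number wins.
    sharers = expected_winners() * relative_popularity
    # Win probability is 1 / NUM_NUMBERS regardless of popularity.
    return pot / sharers / NUM_NUMBERS


print(expected_winners())        # 100.0
print(expected_return(1.0))      # 0.5
print(expected_return(0.25))     # 2.0
```

Under these assumed numbers, a pick of average popularity returns half the ticket price, while a pick chosen a quarter as often returns double it. The chance of winning never changes; only the number of people sharing the pot does.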
Most state-of-the-art retrieval algorithms are based on statistics and the probability of a word's occurrence in a document w.r.t. a collection of documents. So, even if you aren't taking a class in algorithms, it's useful background to study for search.

Thank you MIT!

Wednesday, November 19

Berry picking your way through search

Gord Hotchkiss has a writeup titled Berrypicking Your Way Through Search where he looks at information seeking behavior.

In the article, he jumps off from the insights of an older pre-web paper on information seeking behavior: The design of browsing and berrypicking techniques for the online search interface by Marcia Bates.

Here is a brief excerpt from the original article:

So throughout the process of information retrieval evaluation under the classic model, the query is treated as a single unitary, one-time conception of the problem. Though this assumption is useful for simplifying IR system research, real-life searches frequently do not work this way... At each stage they are not just modifying the search terms used in order to get a better match for a single query. Rather the query itself (as well as the search terms used) is continually shifting, in part or whole. This type of search is here called an evolving search.

Another reminder that search is an inherently interactive process and classical models that do not account for this are very limiting. On a related note, see previous coverage of Nick Belkin's ECIR 2008 keynote address (and Daniel's notes).

Tuesday, November 18

Yahoo! BOSS API updated with Prisma document terms

Yahoo! Search announced an interesting extension to their public BOSS API on their blog today. From their description:
Key Terms is derived from a Yahoo! Search capability we refer to internally as "Prisma."... Key Terms is an ordered terminological representation of what a document is about. The ordering of terms is based on each term's frequency and its positional and contextual heuristics...Each result contains up to 20 terms describing the document.

Add the parameter view=keyterms to the BOSS request to see the new functionality.
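As a rough illustration, a request URL carrying that parameter could be assembled like this. Only `view=keyterms` comes from the announcement; the base URL, the `appid` parameter name, and the `keyterms_url` helper are hypothetical placeholders, so check Yahoo!'s BOSS documentation for the real endpoint.

```python
from urllib.parse import quote, urlencode

# Hypothetical placeholders, not the real BOSS endpoint or credentials;
# substitute the base URL and application ID from the BOSS documentation.
BASE_URL = "https://boss.example.invalid/web"
APP_ID = "YOUR_APP_ID"


def keyterms_url(query):
    """Build a search URL that asks for key terms via view=keyterms."""
    # "appid" is an assumed parameter name; "view=keyterms" is from the post.
    params = {"appid": APP_ID, "view": "keyterms"}
    return f"{BASE_URL}/{quote(query)}?{urlencode(params)}"


print(keyterms_url("information retrieval"))
```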

I wonder if this is at all related to the Key Term Extraction API that Yahoo! provides.

Monday, November 17

Symposium on Semantic Knowledge Discovery, Organization, and Use

Over the weekend NYU hosted an NSF sponsored symposium on semantic knowledge. To summarize the description, the conference tackles the issue of extraction of knowledge from large corpora using automatic or semi-automatic methods. It is a forum to discuss research and provide a high-level picture of the field.

Daniel attended the symposium and has notes from Day 1 and Day 2. His notes are a good start, but I'm really disappointed by the dearth of information available for those who could not attend. The IRF symposium provides a good model for how to do this; there was a live stream of the presentations and the videos and slides are available after the conference.

Beyond basics, in the future we should enable remote audience registration and participation. We should be able to watch presentations and have online discussion. After all, traveling to conferences is expensive and often infeasible.

Sunday, November 16

Open Source HTML parsers for Java

Simple HTML parsing and text extraction is, well, not so easy to do well. Over two years ago, I wrote a post: Open Source HTML parsers. Since then, I've mainly stuck by TagSoup as the best open-source choice for me, but today there are a few other alternatives that I'm considering for a new project.

HtmlCleaner - A small, lightweight parser that fixes up and re-orders HTML to produce well-formed XML. It won top marks in Ben McCann's comparison of HTML parsers. However, I tried it out on a few Wikipedia pages and the text it returned was not acceptable: it contained snippets of JavaScript and commented CDATA content.
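To show the kind of filtering a text extractor needs, here is a minimal sketch using Python's standard `html.parser` module (not HtmlCleaner or any of the Java libraries discussed here). It keeps visible text and drops the bodies of script and style elements as well as comments. The `TextExtractor` class and the sample markup are hypothetical, and a production parser would also have to cope with malformed markup, encodings, and much more.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> bodies and comments."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data; they go to handle_comment,
        # which this class leaves as a no-op.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """<html><head><script>var x = "<b>not text</b>";</script></head>
<body><h1>Title</h1><!-- hidden note --><p>Hello <b>world</b></p></body></html>"""
print(extract_text(sample))
```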

The best parsers are those found in the top web browsers. However, it's usually quite challenging (and slow) to use them in external programs.

Java Mozilla Html Parser - A Java wrapper around the Firefox HTML parser that provides a Java API to parse documents. The website is out of date; there was a v0.3 release in October.

Of course, you still have the option to write your own for maximum flexibility and speed. I'm still waiting for a real production quality parser. We'll need something better than what's currently available today to deal with those messy billion document test collections that are coming soon.

Tuesday, November 11

Practical AI Programming in Java available free online

Mark Watson published a new book, Practical Artificial Intelligence Programming in Java, 3rd edition. You can buy it in print, or download a PDF version for free, with the code.

Chapters 8 (Machine Learning with WEKA), 9 (Statistical NLP), and 10 (Information Gathering) are all highly relevant for those of us in IR/text processing fields.

The code is designed to be easy to learn and to teach concepts rather than being the most efficient, or latest state-of-the-art. One good benefit is that it recommends other implementations for those seeking more depth.

Thanks Mark.

Friday, November 7

IRF Symposium on patent search wrap-up session notes

I watched the livestream of the closing session at the IRF Symposium on patent retrieval. The videos should be available next weekend. Here are my notes from the session, which included both Mark Sanderson and Steve Adams.

Mark Sanderson (Academic retrieval perspective)

- IRF Symposium 2007 was an introduction of IR people to IP people. What was striking then was an example given of a deliberate misspelling in a document because someone was trying to make sure their patent wasn't found. This exposed the adversarial nature of some aspects of IP retrieval, which has parallels in the web retrieval community in the opposite direction.

- In 2008 academics drew on the experience, but much of this has been based on newspaper and web test collections. There is still a disconnect between what academics solved and what is relevant to the IP community. For example, academic groups are still evaluating using Precision@5 and MAP which focus on precision, instead of recall which matters more for IP. We need to look at new ways of assessing results.
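The gap between these measures is easy to see in a small sketch. The definitions below are the standard textbook ones; the patent IDs and the ranking are made up for illustration.

```python
def precision_at_k(ranking, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def recall(ranking, relevant):
    """Fraction of all relevant documents that were retrieved at all."""
    if not relevant:
        return 0.0
    return len(set(ranking) & relevant) / len(relevant)


def average_precision(ranking, relevant):
    """Mean of precision values at the rank of each relevant hit; MAP is
    this value averaged over a set of queries."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical prior-art search: 10 relevant patents exist, but the
# system only surfaces 4 of them, all near the top of the list.
relevant = {f"pat{i}" for i in range(10)}
ranking = ["pat0", "pat1", "pat2", "x1", "pat3", "x2", "x3"]

print(precision_at_k(ranking, relevant))  # 0.8
print(recall(ranking, relevant))          # 0.4
```

A Precision@5 of 0.8 looks strong, yet the search missed six of the ten relevant patents. For a patent searcher, a single missed document can matter more than a clean first page of results.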

Projects - Matrixware contributions
Alexandria System - a large-scale global archive of IP data
Leonardo System - an application development platform to access the data repositories. There is potential here for information studies specialists to study how IP searchers work and analyze their interactions.

He encouraged academics to take part in the CLEF IP and TREC chemical 2009 tracks this coming year. He drew parallels to the TREC legal track and the new and interesting understanding developed from that relationship. For example, legal track people are wedded to boolean retrieval and it was a big shock when ranked retrieval systems found documents that boolean search missed.

Steve Adams (IP Industry)

- He characterized the theme of this year as "hybrid".

Hybrid Documents
A patent document embodies a fundamental tension. In the end the patent office delivers a doc that serves both the legal community and the technical community, and those two functions are often in tension. Writing a single document that performs both functions takes a lot of practice. Patents are also hybrid because they contain both text and non-text data. We need retrieval systems to pull out the non-text part of the documents.

Hybrid approaches to IR
No single system or paradigm is going to deliver all the results on all occasions for every search.

Multi-linguality - we were reminded there are multiple methods to retrieve documents: query translation and document translation and both are useful.

Annotation (Eric) – the basic question is: Do we get good retrieval based solely on the original document, or do we need some form of enriched documents to give better retrieval? As we face ever-expanding corpora, it will be very helpful if we can continue automatically or semi-automatically enriching the documents. Semantic annotation currently requires a stable ontology, but we have a very dynamic vocabulary that develops over time.

Boolean vs ranked - Leif's findability index was very interesting. It could be the beginning of evaluation tools. Both boolean and best match ranking have their place.

Hybrid Responsibilities
Pierre identified the fact that getting to the bottom of each player's role is an important preliminary step: who does what? IP: Mark referred to ‘dirty data’; we need to improve our data at the early stage of document production, not after it has been published.

Monika’s paper, Multimedia challenge – the patent application of 20 years in the future. It may not be text at all. Send us the CAD/CAM files, send us the 3D crystallographic model, send us the chip mask. We are light years away from being able to search these types of documents.

Some of the Highlights seem to have been:
Mapping how easily Documents can be found - by Leif Azzopardi
Annotations and Ontologies in the Context of Patent Retrieval - Eric Gaussier
Also the Alexandria and Leonardo systems from Matrixware.

Thursday, November 6

Paul Ogilvie is blogging again

Paul Ogilvie from CMU has a blog, Information Retrieval on the Live Web. He recently started blogging again after a period of absence. He also posts on the official blog of his company, mSpoke. His post on mSpoke examines which features of blog posts correlate with their popularity.

Welcome back Paul! I look forward to more interesting posts ;-).

I meant to write about this sooner, but Jon and Daniel beat me to it.

Paul's blog is a nice addition to my blogroll.

CIKM 2008 coverage and best interdisciplinary paper award

Greg has some of the best reporting including:
Matthew Hurst has rough notes from Andrew Tomkins's keynote.

Xing Yi returned and told us that the best interdisciplinary award went to:
Structural Relevance: A Common Basis for the Evaluation of Structured Document Retrieval by Sadek Ali, Mariano Consens, Gabriella Kazai, Mounia Lalmas.

Other highlights from Xing include:
Both of these look really interesting to me; I'll try to write something up on them this weekend.

Does anyone know who won the best poster award? The website hasn't been updated and we have a few people at the lab who would be interested.

I am also looking forward to the video lectures being available online.

Information Retrieval Facility Symposium 2008

The IR Facility works on patent search, bringing together IR researchers and professional patent examiners. The annual IR Facility Symposium 2008 is underway in Vienna. They are live streaming the event, if you want to watch presentations. Unfortunately, Thursday is over, but you can still catch tomorrow's presentations.

Here is a description from the programme,
The main themes of this year’s speeches are multilingual retrieval, annotation and ontology, retrieval in non-textual documents and the improvement of user interfaces. The latest scientific projects from the fields of semantic and linguistic retrieval, text mining, automated quality control and machine translation will be presented for the first time.
The CIIR here collaborates with the IRF. We have researchers there presenting work on using retrieval methods to detect errors in OCRed patent documents. I hope to have more details to follow.

The IRF also recently hosted the Patent Information Retrieval workshop at CIKM '08. The papers should be available through the ACM.

Wednesday, October 29

Google tests "Wiki of Search"

SE Land has coverage of Google testing more integration of user feedback into search results,
Google Rolling out "SearchWiki". Calling this a "Search Wiki" is exaggerating since users can't enter notes. It points to Google trying to leverage their large audience more effectively using explicit user feedback.

It's a step in the direction Eric Schmidt outlined back in the 2006 analyst day presentation. From the notes on Slide 8:
  • Encourage our large user base to actively contribute metadata that leads to better search results
  • Wiki of search: empower users/experts to improve search results in their domains of expertise — create a million verticals
  • Effectively integrate user feedback (ratings, comments, tags) into search
It's exciting to see some innovation in this area. It's a logical next step for Google to collect explicit feedback from users on the quality of results. For example, not that I dislike my blog ranking highly, but a search for [Kleinberg memetracker] could be improved by moving higher.

In somewhat related news, Microsoft recently showed a research prototype U Rank, see their blog, that leverages your social network and lets you organize and edit search results and share them with others. This is closer to a "wiki of search", but in the limited context of a social network.

Sparse CIKM coverage

In case you missed it, Greg had some brief coverage of Rakesh Agrawal's keynote.

Bruce Croft from the CIIR here at UMass is giving the keynote today: Unsolved Problems in Search (and how we might approach them). I'm afraid I don't have any inside information on the content of his presentation.

I look forward to watching the videos on

Monday, October 27

CIKM 2008 this week

CIKM 2008 started yesterday with tutorials. Unfortunately, I'm not there this year.

I can't find much online for most of them. I agree with Daniel who says, Please Blog!. Live blogging anyone? Here's a little bit that I could find on one of the tutorials:

Large graph mining: patterns, tools and case studies
(See also the tutorial from WWW 2008 and ECML 2007 on VideoLectures).

Elif and Jangwon from our lab here at UMass are attending, so hopefully I'll have their highlights when they return.

You can read Jangwon's paper on blog search Blog Site Search Using Resource Selection.

Napa is gorgeous. It also happens to have some of the best restaurants in the world. I would love any coverage of those too ;-).

Friday, October 24

Yahoo! Inquisitor update: new platforms and personalized results

Inquisitor is a browser plugin (originally for Safari) that provides search-as-you-type and suggested searches capabilities. Yahoo! just announced it is now available for Firefox and IE.
Building on the work by the Yahoo! Research team in the paper "Information Re-Retrieval: Repeat Queries in Yahoo! Logs," the algorithm that generates the personalized results has been enhanced to return more targeted results.
Inquisitor will also search the contents of your bookmarks to help you re-discover old content.

I've been trying it out and I really like it.

See ReadWriteWeb's coverage.

Thursday, October 23

Kleinberg launches new memetracker using Spinn3r

Kevin Burton from Spinn3r announced that Kleinberg's team at Cornell are launching a new memetracker. Check it out.

(Spinn3r also recently announced they were providing the data for the ICWSM data challenge.)

Monday, October 20

ICWSM 2009 Data Challenge

Ashkay announced that the data for the ICWSM 2009 data challenge is available.

The dataset consists of 44 million blog posts (27 GB compressed) crawled by Spinn3r between August 1st and October 1st 2008. The paper deadline is in January, so get to work!

TREC Blog Track approved for 2009

Congratulations to Iadh and team at Glasgow. Their proposal for the 2009 TREC blog track was approved.

It's exciting to see that in at least one task the relevance will include not only topical relevance, but also the 'quality' of the content. This is one of my major criticisms of the current Cranfield/TREC paradigm and most current academic experiments.

I find the blog track interesting, and not just because I have a blog. I'm interested in utilizing the highly temporal nature of blog posts to study the importance of temporal relevance. For example, to study the trade-off between authority and recency in ranking.
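A minimal sketch of one way to explore the authority/recency trade-off, assuming each post carries an authority score in [0, 1] and an age in days. The scoring function, the half-life, the weight parameter, and the sample posts are all hypothetical illustrations, not something taken from the blog track itself.

```python
def blended_score(authority, age_days, half_life_days=7.0, recency_weight=0.5):
    """Linear blend of an authority score and an exponentially decaying recency score.

    recency_weight=0 ranks purely by authority; recency_weight=1 purely by freshness.
    """
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 when brand new, 0.5 after one half-life
    return (1.0 - recency_weight) * authority + recency_weight * recency


def rank(posts, recency_weight):
    """Return post names ordered by blended score, best first."""
    ordered = sorted(
        posts,
        key=lambda p: blended_score(p[1], p[2], recency_weight=recency_weight),
        reverse=True,
    )
    return [name for name, _authority, _age in ordered]


# Hypothetical posts: (name, authority, age in days)
posts = [
    ("old-authoritative", 0.9, 60.0),
    ("fresh-unknown", 0.2, 0.0),
    ("recent-solid", 0.6, 7.0),
]

print(rank(posts, recency_weight=0.0))
print(rank(posts, recency_weight=1.0))
```

Sweeping `recency_weight` between 0 and 1 and measuring retrieval quality at each setting is one simple way to study where the balance should sit for a given task.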

See also my previous discussion of the 2009 blog track.

LETOR 3.0 beta released

Microsoft Research Asia announced the beta 3.0 release of its Learning to Rank (LETOR) benchmarking platform.

You can read the announcement on the website for the full list of updates and changes. As an example, there are new document features for ranking:
In LETOR3.0 we added in-link number, out-link number, length of URL, number of slashes in the URL, etc. as new features. Also, we extracted those existing features in all streams (URL, title, anchor and body), while features in some streams are missing in LETOR2.0. Overall, there are 64 features (Table. 2) which can be directly used by learning algorithms.
Also of note, the document parser used to index the documents changed (different tf counts) and the definition of some of the document fields differs slightly from version 2.0. Furthermore, the IDF calculation changed significantly.
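To make the new URL features concrete, here is a rough sketch of how two of them (length of URL and number of slashes) could be computed. This is not LETOR's extraction code; the exact definitions are not given in the announcement, so counting slashes only in the path portion is an assumption, and the function name is made up for illustration.

```python
from urllib.parse import urlparse


def url_features(url):
    """Compute two simple URL-based ranking features.

    Assumption: slashes are counted in the path only, so the '//' after
    the scheme does not inflate the count.
    """
    path = urlparse(url).path
    return {
        "url_length": len(url),
        "num_slashes": path.count("/"),
    }


features = url_features("http://example.com/a/b/c.html")
print(features)
```

Features like these are cheap, query-independent signals; shorter URLs with fewer slashes tend to be closer to a site's root.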

Saturday, October 18

Thanks Daniel: featured on 'blogs I read'

Daniel has started a new series covering blogs he reads. I'm honored to be the first blog featured.

I've been blogging about search since late 2005. Early on I focused on domain-specific search engines, like Globalspec. More recently, it's become more technical and research oriented. I look forward to using it to share not only research highlights, but practical guides and code for practitioners. This is one area where my previous job as a developer limited my ability to be open and participate in these types of discussions. Here are a few highlights from the past few years.

The most popular posts:
  1. Open source Information Extraction and Text Mining tools
  2. Open source search libraries
  3. Vertical search definition and context
  4. Open source Collaborative Filtering and recommendation engines
    (and the more recent web-scale recommendation engines).
  5. How to quickly reset a Java array
Clearly people are looking for open source tools to solve their IR-related problems. They are also looking for tutorials and help using them.

My favorite posts:
  1. The Google Strategic Server Force
  2. 11 Myths of Computer Science
  3. Integrating a database of everything with web search
  4. Faceted Search at Ebay: Ebay Express
  5. My posts about research challenges in search:
    World Changing Research Opportunities and
    IR research challenges for 2008 and beyond
Blogging has been quite rewarding, and I hope to continue it for a long time to come.

Monday, October 6

Truevert: Green semantic search

Via Noisy Channel.

Truevert is a new semantic search engine built to demo Orcatec's semantic technology. They built a 'green search engine' using Yahoo! BOSS (see also my recent post on the BOSS-U workshop).

From their blog:
Truevert has solved the problem of semantic search by learning the meaning of words directly from the documents that it reads rather than by relying on a prebuilt taxonomy, ontology, dictionary, or thesaurus...
In conjunction with an excerpt from a more recent post:
Delivering focused search results depends on the ability to understand the meaning of words to a detailed level. This understanding will not come from syntactic analysis or from the construction of elaborate ontologies. It will come from using human-like processes on the documents themselves.
Interesting. I'd love to learn more about their semantic analysis technology.

Yahoo! BOSS-U Workshop

Recently, Yahoo! invited a group of academics from a handful of universities to talk about Yahoo! BOSS, including participants from the CIIR here at UMass. Yahoo! has a writeup, BOSS Goes to College. I think that's my advisor, James Allan, standing towards the back in the first picture. As Yahoo! writes:
The workshop consisted of a series of all-day sessions in which academics from MIT, Stanford, UIUC, UMass, and Purdue, and experts from the Yahoo! Search Team and Yahoo! Research brainstormed and discussed ways to incorporate BOSS-U into academic research and teaching programs.
Getting more academic involvement in what has traditionally been a very closed industrial environment is very encouraging. One of the goals is to provide academic researchers with access to web-scale data. To start the process, Yahoo! is being quite generous with access to their API for academic researchers. It will be interesting to see what research and ideas emerge from the collaboration.

Tuesday, September 30

Google Time Machine: 2001 edition

Search Google's oldest available index from 2001; aka search results without Wikipedia.

Ten Myths of Computer Science Research

Dave Jensen and David Smith recently gave a presentation here at UMass titled Myths of Research in Computer Science. A copy of their slides is now available online. The talk began with an ice breaker and small group discussion on spam filtering.

What would it take to convince you:
  • ... to buy a spam filter?
  • ... to win a Nobel Prize for spam filtering?
  • ... to publish a paper?
  • ... to grant a PhD?
To open they posed the following question: What is a 'major finding' in spam filtering?

Here are a few of the many ideas that surfaced:
  • develop a system that generates undetectable spam
  • create a high-accuracy system that performs automatic unsupervised learning so that the user is never bothered with spam again.
  • prove the problem of spam is the same as a currently known solvable or unsolvable problem
One of the interesting takeaways was that a system usually doesn't get awarded the highest prize; the theory/knowledge behind it is the key.

Myth - a widely held, but false belief or idea (in this context). Myths get us into trouble when we say they are false, but we act like they are true.

Eleven Myths of Computer Science Research
  1. Computer Science isn't science, it's just processing.

  2. The right questions and their possible answers are obvious.

  3. To find good research problems just look at what everyone else is doing.
    "I skate to where the puck is going to be, not to where it has been." - Wayne Gretzky

  4. Science is just common sense.
    Myth: Good research is based on what your undergraduate degree trained you to do well.

  5. All findings in major journals are true.

  6. Failure is bad.
    Design an experiment to learn regardless of the outcome.

  7. Great researchers are born, not made.

  8. To be successful I just need to show my system is better.

  9. To be successful I have to work all the time.
    Focus on productivity.

  10. To be successful, I just need to do more of what I'm already doing
    1) think harder or 2) code more

  11. Applied Math/CS is not as good as theory
One of the great quotes that I took away from the talk was from Dave Jensen quoting Paul Cohen:
"The code you write today won't run in five years. Get over it. What will be used? It is the understanding derived from running the code."
They also referenced two great books: The Structure of Scientific Revolutions and Sciences of the Artificial.

See the website for last year's version.

If you want to learn more about methods to conduct constructive Computer Science research, I recommend David Jensen's Research Methods Class. The notes from Spring 2008 are available.

Upon reflection, what struck me is that sometimes I have a tendency to follow what's hot right now rather than looking ahead to the future. Don't fall into this trap.

Thursday, September 18

Stanford NLP and Machine Learning courses online

Via Brendan O'Connor.

Instead of watching The Office or other mind rotting television this fall, you may want to consider watching NLP and ML lectures courtesy of Stanford.

Stanford's Engineering Everywhere is offering some course materials for free, including lecture videos and course notes. I hope CMU and other top CS programs do likewise.

There are two interesting courses, including lectures that are relevant for IR people:

Natural Language Processing (CS224N)
by Chris Manning (course site) (SEE link).

Machine Learning (CS229)
by Andrew Ng (course website) (SEE link).

(A small plug for the Machine Learning course, CMPSCI 689 here at UMass, which I look forward to taking.)

Monday, September 15

Beyond Relevance in evaluation

Over at the Google Blog, Scott Huffman writes an entry on Search Evaluation at Google.
Traditional search evaluation has focused on the relevance of the results, and of course that is our highest priority as well. But today's search-engine users expect more than just relevance. Are the results fresh and timely? Are they from authoritative sources? Are they comprehensive? Are they free of spam? Are their titles and snippets descriptive enough? Do they include additional UI elements a user might find helpful for the query (maps, images, query suggestions, etc.)? Our evaluations attempt to cover each of these dimensions where appropriate.
One of my biggest issues with TREC and similar environments is the single-minded focus on topical relevance. See my previous post on the TREC blog track. For example, a spam post that is relevant to a topic would be acceptable, even if you would never want to read it in real life. It's time we move beyond the basics and find ways to tackle the more challenging retrieval quality aspects in a way that is still amenable to cost-effective measurement.

Note: I also highly recommend What People Think About When Searching by Daniel Russell who analyzes user intent and behavior at Google.

Friday, September 12

New Information Retrieval Group Blog: Probably Irrelevant

Jon introduced a new group blog for readers interested in Information Retrieval R&D: Probably Irrelevant. I like the name ;-)

Fernando Diaz, a recent CIIR alumnus has the first post: Blogs, queries, corpora. He's continuing the discussion that Iadh started on tasks for the TREC 2009 blog track (see my earlier post in response). Fernando focuses on the origins of the current TREC tasks and deriving future tasks from the behavior of real-world users of blog search engines. Fernando writes,
One question I hope will be resolved in the comments is where these query types came from. Are they derived from actual blog searchers?... One approach would be to inspect query logs to blog search engines for different retrieval scenarios and then improve performance for those scenarios.
He poses a very good question. I don't recall seeing any published research analyzing the behavior of users with blog search log data. Ultimately, the problem comes back to a fundamental issue: academia struggles to create relevant and realistic test scenarios without access to log data from real-world systems. However, hopefully we can at least try to improve what we have today.

I would like to see TREC topics begin to model the interactive nature of search. A starting point is acknowledging that users enter multiple queries in order to find information. Today, TREC topics are only a single query, which is unrealistic and overly simplistic. As a first step, I advocate the development of multi-query topics derived from query refinement chains. Evaluation would be performed on each query in the chain and the results for the query chain combined. Thoughts?
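A rough sketch of what evaluating a query chain might look like, assuming one shared set of relevant documents per topic and a ranked result list for each query in the chain. Scoring each query with average precision and combining the chain with an unweighted mean are assumptions made here for illustration; the document IDs are made up.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant document IDs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def chain_score(chain_rankings, relevant_ids):
    """Score every query in a refinement chain, then combine with an unweighted mean."""
    per_query = [average_precision(ranking, relevant_ids) for ranking in chain_rankings]
    return per_query, sum(per_query) / len(per_query)


# Hypothetical topic: two relevant documents, a chain of two queries.
relevant = {"d1", "d2"}
chain = [
    ["d3", "d1", "d4"],  # results for the initial query
    ["d1", "d2", "d3"],  # results for the refined query
]

per_query, combined = chain_score(chain, relevant)
print(per_query, combined)  # [0.25, 1.0] 0.625
```

Other combinations are just as plausible, for example weighting later queries more heavily or only crediting documents not already seen earlier in the chain.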

Wednesday, September 10

TREC 2009 Blog Track Thoughts

Iadh asked for ideas and comments for the 2009 TREC blog track.

First, I'm looking forward to the new blog corpus. The 2006 blog corpus is small and only covers eleven weeks. Hopefully, the new 2008 corpus will be much larger over a longer time frame that includes the upcoming US presidential election and all of the controversy surrounding it.

I read What Should Blog Search Look Like by Hearst, Hurst, and Dumais. The paper has three key tasks it goes over:
1. Find out what are people thinking or feeling about X over time.
2. Find good blogs/authors to read.
3. Find useful information that was published in blogs sometime in the past.
The paper focuses heavily on the search features needed to support these tasks. Its main criticism of the current blog distillation task (roughly task 2 above) is that the current task focuses only on relevance and does not incorporate information about the quality of the content or authority of the blog discovered.

I also read On the Trec Blog Track which summarizes the last two years of the blog track. It talks about an extension to the existing opinion finding track that I think could be really interesting:
For example, for a given product, one might wish to have a list of its positive and negative features, supported by a set of opinionated sentences extracted from blogs (Popescu & Etzioni 2005). Such a task complements work in the TREC Question Answering track.
An interesting extension to this would be to try and summarize the positive and negative opinions on individual features.

I focused on the section Lessons Learnt and future tasks. The paper outlines three new possible tasks:
  • Feed/Information Filtering - Inform me of new feeds or new blog posts about X.
  • Story Detection - Identify all posts related to story X. A possible variant is to ask the participating systems to provide the top important stories or events for a given date or a given range of dates.
  • Information Leaders - Identify all information leaders about topic X
The first two sound very interesting. They sound similar to some of the tasks in the older Topic Detection and Tracking community (TDT) that was done with news data.

Personally, I really like the first two because I spend a lot of my time reading the blogs of leading technologists and researchers to stay on top of interesting topics in information retrieval and related areas. The current alert systems (e.g., Google Alerts) are inadequate; they don't find all of the new information and often find many duplicates. A subtask here could be linking and deduping different versions of the same story.
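As an illustration of the deduping idea, here is a minimal sketch using word shingles and Jaccard similarity, a common near-duplicate detection technique. The shingle size, the 0.5 threshold, and the sample posts are arbitrary assumptions; a real system would need something more scalable (for example MinHash) and better text normalization.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) for a piece of text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(posts, threshold=0.5):
    """Keep a post only if it is not too similar to any post already kept."""
    kept = []  # list of (post, shingle set)
    for post in posts:
        signature = shingles(post)
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((post, signature))
    return [post for post, _ in kept]


# Hypothetical posts: the second is a near-copy of the first.
posts = [
    "Google launches new search wiki feature today",
    "Google launches new search wiki feature today !",
    "Cuil claims an index of 120 billion pages",
]

print(dedupe(posts))
```

The pairwise comparison is quadratic in the number of posts, which is fine for a sketch but not for a corpus of millions of blog posts.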

For the second task, it's interesting to find all posts about a story. However, it isn't very realistic to find ALL posts. A primary reason is that not all posts are useful or worth reading. For example, a post may simply be a link to the story without any other content; this is relevant but not very useful. Again, I would like to incorporate some sense of quality: find the highest quality posts on story X.

To me the third task is slightly less interesting. However, it would be interesting to try and link the conversations together and track the discussion across blogs (including both comments and posts). The end goal might be to discover novel subgroups off the original story.

Through all of these one of the themes that sticks out is the need to find not just relevant information, but the need to discover posts or blogs that contain quality or authoritative content.

TREC Blog Search: 2008 and beyond

Iadh over on Terrier Team has an update on the TREC 2008 blog track and is asking for thoughts and comments for the proposed 2009 edition. Please go comment or e-mail him.

To start, he gives a brief history of the track over the three years it has run.
Our main findings and conclusions from the first two years of the Blog track at TREC are summarised in the ICWSM 2008 paper, entitled On the Trec Blog Track. The Blog track 2006 and 2007 overview papers provide further detailed analysis and results.

He also points to a position paper by Marti Hearst et al., What Should Blog Search Look Like?, that will be presented at CIKM 2008.

I will read both papers and give it some thought. You will hear from me soon ;-).

Monday, September 8

Solving Search: A Game of Guess the Magic Words

Maybe sometimes when you are searching for information you feel like Popeye trying to open the magic cave door to rescue Olive Oyl from the forty thieves:
I wonder what words he used when he opened this door? Open sissy, open cecil, no that can't be it! whoop, it's giving way, it's giving way...
- Popeye (watch the full episode via the IA, or jump to the second half on YouTube)
Unfortunately, Popeye's super strength won't help you with your search tasks. Or perhaps you feel like the bumbling fool Hasan attempting to prevent Daffy from stealing the master's treasure:

Using today's search engines is playing the game of Guess the Magic Words. Guess the right words, and Open Sesame! Guess poorly and you bang your head against the cave door for minutes or hours. How good are you?

A current problem for search engines today is this: If you type a long query and give the search engine more information about your information need you are likely to get worse results than if you entered only a few brief keywords.

Barney Pell from PowerSet describes the current language of search engines as keywordese. If you enter too few of these keywords your query is likely too vague; too many keywords and relevant documents are mistakenly filtered out. And too often we don't know the 'magic words' to find the desired information.

Until search engines utilize the information in long queries without being overwhelmed by the 'noise' search will remain broken.

Marissa Mayer said in a recent LA Times Article on Google's 10 year anniversary:
I think there will be a continued focus on innovation, particularly in search. Search is an unsolved problem. We have a good 90 to 95% of the solution, but there is a lot to go in the remaining 10%.
Search isn't 90% solved. It's not easy to quantify because search is constantly evolving. Regardless, beating the Guess the Magic Words level in the game of search is still a long way off.

Thursday, August 21

Live Labs Launches Photosynth: Still Mind Blowing

Photosynth launched. I've been following Photosynth since the awesome demo at TED in March. Getting to play with it yourself is almost as incredible as watching it on the big screen. Incredible.

O'Reilly Radar has good coverage.

Don't forget to subscribe to the Photosynth Team Blog.

Wednesday, August 20

Light on posting.... in transition.

I haven't had time to post anything significant recently; I've been packing and wrapping up loose ends.

My wife and I are moving to Northampton to begin the long road to a PhD in CS from UMass Amherst at the CIIR. Orientation starts Sept. 2nd; my last day at Globalspec is Friday.

Needless to say, things have been quite hectic. More to come once we get settled.

Tuesday, August 19

Unlock your intellectual potential: Break it down and get started!

A great post from Daniel Lemire on The secret to intellectual productivity.

It reminds me of some of the best advice on writing a good essay: create a rough outline and then start writing! Plan on rewriting or revising most of the first draft.

This same principle also translates quite successfully to software development. Break the project down into small, well-defined chunks with clear responsibilities and functions. Getting these pieces built creates code that kind-of, sort-of works. Then you can redesign and reshape, adding layers of functionality and fleshing out details based on what you learned building the 'draft'. This follows the principle of "Plan to throw the first one away, you will anyhow" from Brooks's Mythical Man-Month. Just don't put too much effort into the first one. ;-).

Friday, August 15

Information Seeking Support Systems Workshop Position Papers

The NSF Information Seeking Support Systems Workshop (IS3) position papers are now available.

via Daniel.

It's been really busy recently, so I haven't had time to read through the papers in depth yet. I am in the frenzied process of packing for the imminent move to western Mass. to begin the PhD program at UMass.

FreeBase Parallax

Parallax is a way to visualize and navigate FreeBase data by David Huynh.

Daniel has in-depth coverage and covers some of David's previous projects, of course tying it back to exploratory search.

See also Max Wilson and Tim Finin's coverage.

Light Information Retrieval Summer Reading

Summer is almost over, but if you haven't done your summer reading yet, get started! Here is some material to catch up on.

I previously mentioned the final publication of the new Intro to IR book from Manning, et al. It will be a classic.

There is also background material for the MPhil in Computer Speech, Text and Internet Technology (CSTIT) from the CL at Cambridge.

An Introduction to linguistics by Ted Briscoe.

Background on basic maths: set theory and logic, linear algebra, and probability.

The main book for the one-year CSTIT course is Speech and Language Processing by Daniel Jurafsky and James Martin. A new second edition of the book was released this Spring. It's at the top of my reading list. The first edition was a classic and I'll try to write up a review of the second edition when I get through it.

If that doesn't keep you busy, there is always Moffat, Zobel, and Hawking's Recommended Reading for IR Research Students.

Yahoo! Research at SIGIR 2008

Yahoo! Research has a post on SIGIR 2008, including a list of papers from Yahoo! presented at the conference.

Prabhakar Raghavan gave a talk, New Sciences for a New Web at the Infocomm Development Authority (IDA) of Singapore.

They also celebrated the release of Raghavan's new book: Introduction to Information Retrieval. The website now has a new PDF for online reading with hyperlinks.

Thursday, August 14

Microsoft's arrogance is annoying me: Firefox support

At Globalspec, we use Sharepoint as a collaboration tool. I have a big issue with it: it doesn't support Firefox. You get the 'down graded' experience. This means all the spiffy features are gone. For example, if you want to use the rich WYSIWYG text editor, you are out of luck. I'm not the only one complaining.

I have the same problem with Firefox using my old Hotmail account. The 'full' version of Hotmail prompts me that my Firefox 3.0 browser is unsupported and that the product may not function correctly.

Perhaps there are more complicated 'platform support considerations', but I am sick of the BS. Firefox has about 20% of the browser market. Microsoft, stop pretending that Firefox users don't exist. Real developers write code that supports rich experiences in all of the modern (post IE6) browsers.

In the meantime, I recommend Confluence as an alternative to Sharepoint's Wiki component. The downside is it doesn't come cheap...

Tuesday, August 5

Welcome to the Terrier Team

The Terrier guys at the University of Glasgow have a new blog, TerrierTeam.

Iadh's posted the team's research presented at SIGIR 2008.

Welcome to the blogosphere Iadh, Craig, and the rest of the Glasgow IR team. It's about time, considering you organize the TREC blog track! ;-)

Sunday, August 3

Learn how to concentrate

I want to learn How To Concentrate.

Later, the author writes:
Lackadaisical attention--slipshod thinking—diffused interest—scattering of mental forces, all these get us nowhere. The Jack-of-all-trades is a failure in the twentieth century. The big rewards of this age go to the man who can do one thing supremely well. That means concentration.
I need to practice more:

Try to hold a chosen topic of thought for a fixed period of time. Do not be too ambitious... Try it on the following topics: "How I spent my last birthday." "My favorite book." "The best moving-picture I ever saw." "The most inspiring lecture I ever heard"... Be content with a minute, at the outset less, perhaps, if your pride permit.
Via Feld.

Saturday, August 2

SIGIR 2008 Workshop Proceedings

Most of the SIGIR workshop proceedings are now online. There's an overwhelming amount of material, but over the next week I hope to pick out a few of the highlights. I would love to hear what attendees learned from each workshop and what they found exciting or disappointing; blog about it or drop me an e-mail.

Learning to Rank (full proceedings)
The main purpose of this workshop is to bring together information retrieval researchers and machine learning researchers... The goal is to design and apply methods to automatically learn a function from training data, such that the function can sort objects (e.g., documents) according to their degrees of relevance, preference, or importance as defined in a specific application.

See Jon's coverage.

Focused Retrieval - (full proceedings)
Focused retrieval has been used to extract relevant sections from academic documents; and the application to text book searching is obvious (such commercial systems already exist). The purpose of this workshop is to raise issues and promote discussion on focused retrieval - that is, Question Answering (QA), Passage Retrieval, and Element Retrieval (XML-IR).

Information Retrieval for Advertising - (full proceedings)
Online advertising systems incorporate many information retrieval techniques by combining content analysis, user interaction models, and commercial constraints. Advances in online advertising have come from integrating several core research areas: information retrieval, data mining, machine learning, and user modeling. The workshop will cover a range of topics on advertising, with a focus on application of information retrieval techniques.

Mobile Information Retrieval (MobIR '08) - (full proceedings)
Mobile Information Retrieval (MobIR'08) is a timely workshop concerned with the indexing and retrieval of textual, audio and visual information such as text, graphics, animation, sound, speech, image, video and their various possible combinations for use in mobile devices with wireless network connectivity.

Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments - (full proceedings)
New methods like preference judgments or usage data require learning methods, evaluation measures, and collection procedures designed for them. This workshop will address research challenges at the intersection of novel measures of relevance, novel learning methods, and core evaluation issues.

Future Challenges in Expertise Retrieval - (full proceedings and slides)
The main theme of the workshop concerns future challenges in Expertise Retrieval. Instead of focusing on core algorithmic aspects of a specific expert finding scenario (as is the case for the TREC Expert Finding task), our aim is to broaden the topic area and to seek for potential connections with other related fields.

Analytics for Noisy Unstructured Text Data (full proceedings behind ACM web login)
Noise in text can be defined as any kind of difference between the surface form of a coded representation of the text and the intended, correct, or original text. The goal of the AND workshops is to focus on the problems encountered in analyzing noisy documents coming from various sources.

Best student paper award: Latent Dirichlet Allocation Based Multi-Document Summarization by Rachit Arora and Balaraman Ravindran

Workshops without proceedings online (yet):
Aggregated Search
Speech Search (SSCS)

Lastly, an older, but highly related workshop from WWW 2008:
Adversarial Information Retrieval (AIRWeb) - (program with papers and slides)
The program is structured around 3 sessions with presentations of peer-reviewed papers on Adversarial IR on the Web, covering usage analysis, network analysis and content analysis; followed by one session with the Web Spam Challenge results and a panel on the future of Adversarial IR on the Web.

Thursday, July 31

SIGIR 2008 coverage from around the web

See also the earlier coverage of the learning to rank workshop and Greg Linden's keynote coverage.

First up, Paul Heymann on the Stanford InfoLab Blog has some of the best coverage of the conference I've read yet.

In case you missed Jon's earlier comment, he has coverage of the Learning to rank sessions and workshop.

Paraic Sheridan covers the keynote on Google China and its future in Africa. Paraic is a computational linguist at the Centre for Next Generation Localisation (CNGL) at Dublin City University.

Pranam Kolari, from Yahoo!'s web spam team, has coverage of Kai-Fu Lee's keynote.

Best paper awards
I couldn't find the award information on the SIGIR 2008 site, but here's what I pieced together. Please correct me if I'm wrong:
Algorithmic Mediation for Collaborative Exploratory Search (best paper award)
BrowseRank: Letting Web Users Vote for Page Importance (best student paper award)

Also, I look forward to reading Peter Bailey's paper Relevance Assessment: Are Judges Exchangeable and Does it Matter?
In the paper they examine the impact of assessor expertise on the quality of relevance judgments. In the end, they conclude:
...the Cranfield method of evaluation is somewhat robust to variations in relevance judgements. Having controlled for task and topic expertise, system performance measures show statistically significant, but not large, differences. Similarly, system orderings allow us to identify “good” and “bad” IR systems at a broad-brush level.

Tuesday, July 29

What makes Cuil different: Index Infrastructure

I had a brief post yesterday on Cuil's launch, along with seemingly every other author in the blogosphere. My question is: What makes Cuil different from GYM? Here is what I have managed to glean from all the press coverage yesterday and my own experimentation with the engine.

Cuil's plans to differentiate itself

1) It's about the infrastructure, of course.
From a recent interview with GigaOm, Anna Patterson, formerly one of Google's infrastructure designers, reportedly said:
How it works is that company has an index of around 120 billion pages that is sorted on dedicated machines, each one tasked with conducting topic-specific search — for instance, health, sports or travel. This approach allows them to sift through the web faster (and probably cheaper) than Google...
The Forbes article has a little more detail on their query serving architecture:

Patterson and Costello's impressive feat is that they've done this with a total of 1,400 eight-CPU computers (1,000 find and data-mine Web pages, the remaining 400 serve up those pages) [JD: Even assuming there is no redundancy, 120 billion docs / 400 servers = 300 million documents per node. This seems unrealistically high, especially considering that Lucene, a widely used search library, can realistically handle 10-20 million documents per node.] ...

Cuil attempts to see relationships between words and to group related pages in a single server. Patterson says this enables quicker, more efficient searching: "While most queries [at competitors] go out to thousands of servers looking for an answer, 90% of our queries go to just one machine."
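To make the arithmetic in my bracketed note concrete, here is a minimal back-of-envelope sketch. It assumes no replication and an even split of documents across the serving machines, and the Lucene range is my own rough estimate rather than a hard limit.

```python
# Back-of-envelope check of documents per serving node.
# Assumptions (mine, not Cuil's): no replication, documents spread evenly.

INDEX_SIZE_DOCS = 120_000_000_000  # Cuil's claimed index size
SERVING_NODES = 400                # serving machines, per the Forbes article
LUCENE_DOCS_PER_NODE = (10_000_000, 20_000_000)  # rough estimate from my note


def docs_per_node(total_docs: int, nodes: int) -> int:
    """Documents each node must hold if the index is split evenly."""
    return total_docs // nodes


per_node = docs_per_node(INDEX_SIZE_DOCS, SERVING_NODES)
low, high = LUCENE_DOCS_PER_NODE

print(f"{per_node:,} documents per node")
print(f"{per_node // high}x to {per_node // low}x the Lucene estimate")
```

That works out to 300 million documents per node, or 15 to 30 times the Lucene range. Any replication would only push the per-node load higher.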

Finally, to compare with Google's architecture, a quote from Danny Sullivan's interview with Anna:
If they [Google] wanted to triple size of their index, they'd have to triple the size of every server and cluster. It's not easy or fast...increasing the index size will be 'non-trivial' exercise
According to the news, Cuil's index serving infrastructure is a key competitive advantage over Google and the other major players. It remains to be seen if they can leverage this platform to produce world-class results.
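Cuil has not published how its routing works, so the following is only a hypothetical sketch of the idea in the Forbes description: send a query to a single topic shard when possible and fall back to querying every shard otherwise. The topics, vocabularies, and the `route` function are all invented for illustration.

```python
# Hypothetical sketch of topic-sharded query routing. Cuil has not published
# its design; the topics, vocabularies, and shard names here are invented.
from typing import Dict, List, Set

TOPIC_VOCAB: Dict[str, Set[str]] = {
    "health": {"flu", "symptoms", "diet", "vaccine"},
    "sports": {"football", "score", "olympics", "league"},
    "travel": {"flights", "hotel", "visa", "singapore"},
}

ALL_SHARDS: List[str] = sorted(TOPIC_VOCAB)


def route(query: str) -> List[str]:
    """Return the shards that should receive the query.

    If the query's terms match exactly one topic vocabulary, only that
    shard is returned. Otherwise (no match, or several topics match) the
    query is sent to every shard, as in a conventional scatter-gather.
    """
    terms = set(query.lower().split())
    matched = [topic for topic in ALL_SHARDS if terms & TOPIC_VOCAB[topic]]
    if len(matched) == 1:
        return matched
    return ALL_SHARDS


print(route("flu symptoms"))
print(route("singapore football"))
```

In this toy version the first query goes to one shard and the second, which spans two topics, goes to all of them. The share of queries that can be answered by a single machine depends entirely on how well pages and queries partition by topic, which is the part of Cuil's claim I can't verify from the outside.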

On their size claims

Last I heard, Google's index is rumored to be in the 40 billion range, and Microsoft's is in the 10-20+ billion range. Cuil claims their architecture allows at least a 3x increase in index size over Google. However, it's hard to verify this because Cuil's hit counts are badly broken: a search for [the] returns an estimated 250 documents. The lack of support for advanced search operators, such as site:, also makes it difficult to compare coverage of individual sites, such as Wikipedia.

Other differentiating features:
  • Topic-specific ranking
    From Danny's interview, it sounds like Cuil is doing post-retrieval analysis of document content, analyzing phrase co-occurrence and extracting 'concepts'. From the interview:
    It figures out these relationships by seeing what type of words commonly appear across the entire set of pages it finds. Since "gryffindor" appears often on pages that also say "harry potter," it can tell these two words (well, three words -- but two different query terms) are related.

    Cuil then reportedly computes a topic-specific link score. It sounds very similar to Teoma's HITS technology. Again, there is no support (yet) for Cuil's claim that this is superior to other search approaches.

  • UI and exploration
    Cuil has a non-standard two- or three-column layout of results, which attempts to feel more like a newspaper, with images associated with many results.

    It appears to use the information from the content analysis to create the 'Explore by Category' box to drill down into specific topics, as well as offering related searches as tabs across the top of the page.
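The co-occurrence idea from Danny's interview can be sketched in a few lines. This is a minimal illustration of page-level co-occurrence counting, not Cuil's actual method: the sample pages, the `min_count` threshold, and both function names are my own assumptions, and each page is assumed to be already reduced to a set of terms or phrases.

```python
# Minimal sketch of page-level term co-occurrence; not Cuil's actual method.
# Each page is assumed to be pre-processed into a set of terms/phrases.
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def cooccurrence_counts(pages: Iterable[Set[str]]) -> Counter:
    """Count how many pages each unordered pair of terms appears on together."""
    counts: Counter = Counter()
    for terms in pages:
        # Sort so each pair is always stored in the same (a, b) order.
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts


def related_terms(term: str, counts: Counter, min_count: int = 2) -> List[Tuple[str, int]]:
    """Terms sharing at least min_count pages with `term`, most frequent first."""
    related = []
    for (a, b), n in counts.items():
        if n >= min_count and term in (a, b):
            related.append((b if a == term else a, n))
    return sorted(related, key=lambda item: (-item[1], item[0]))


# Invented sample data for illustration only.
PAGES = [
    {"harry potter", "gryffindor", "hogwarts"},
    {"harry potter", "gryffindor"},
    {"harry potter", "quidditch"},
    {"gryffindor", "lion"},
]

counts = cooccurrence_counts(PAGES)
print(related_terms("gryffindor", counts))
```

With this toy data, "harry potter" is the only term that shares at least two pages with "gryffindor". A real system would presumably normalize by term frequency so that very common words don't look related to everything.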

Closing thoughts: The 120 billion pages people care about

Size matters, but it's more important to get the right content in the index. What counts as 'right' is subjective: the web is effectively infinite, but only a subset of it is useful. Google tracks at least a trillion distinct URLs, while Cuil crawls a mere 186 billion (SE Land reference). It's critical that crawling and indexing be prioritized correctly. For example, despite the reported massive index size, this blog is not indexed by Cuil. While on its own this doesn't mean much, Daniel reports a similar lack of coverage of his content.

I am unimpressed with Cuil's current coverage and relevance, but it's still early. Despite all the criticism (much of it justified), launching a search engine of this scale is an impressive feat. I think what Cuil's doing is exciting and I'm withholding judgment until it has time to mature. Once again, congratulations to the Cuil team and good luck with the long road ahead.

Monday, July 28

SIGIR Learning to Rank (LR4IR) 2008 Proceedings

The LR4IR proceedings are available for download from the LR4IR 2008 website.

Highlights to come...

I'm still waiting for more blog coverage of SIGIR 2008, and I'm not alone.

Cuil launches with 120 billion page index

Cuil launched.

Read the good coverage from GigaOm and Danny Sullivan over at SELand. It's an early stage product, but maybe I'll have a review later today.

Cuil has a formidable group of search veterans including Louis Monier from AltaVista/eBay, Tom Costello from IBM WebFountain, and Google veterans Anna Patterson and Russell Power; see their bios on the Cuil management page. Congratulations to the entire team: launching a product of this scale is a monumental undertaking!

Wednesday, July 23

Google Knol: It's MY article, not yours!

Google Knol launches as a platform for creating Wikipedia-like articles; see the official Google blog entry.

However, there is a major difference from Wikipedia:
The key principle behind Knol is authorship. Every knol will have an author (or group of authors) who put their name behind their content. It's their knol, their voice, their opinion. We expect that there will be multiple knols on the same subject, and we think that is good... People can submit comments, rate, or write a review of a knol.

News from SIGIR 08: Kai-Fu Lee keynote and first slides

SIGIR is happening this week in Singapore and I have to admit, I'm jealous of the attendees. I can't attend this year, and I've been waiting for coverage. There are fantastic pictures and stories about the incredible food I'm missing on the SIGIR 08 Singapore Traveler blog. However, a few stories are trickling out:

Greg's coverage of Kai-Fu Lee's keynote on Google China.

Jon's posted his slides from his presentation:

I'm going to be taking a long weekend for vacation with the family. I will see you Sunday or Monday. Hopefully there will be more coverage when I return!

Yahoo!'s Peter Mika on Semantic Search

A recent article on Semantic Search by Yahoo! researcher Peter Mika, a key member of Yahoo!'s SearchMonkey team.

Via the Yahoo! search blog: Paving the way to Semantic Search.