Tuesday, July 29

What makes Cuil different: Index Infrastructure

I had a brief post yesterday on Cuil's launch, along with seemingly every other author in the blogosphere. My question is: What makes Cuil different from GYM (Google, Yahoo, Microsoft)? Here is what I have managed to glean from all the press coverage yesterday and my own experimentation with the engine.

Cuil's plans to differentiate itself

1) It's about the infrastructure, of course.
From a recent interview with GigaOm, Anna Patterson, formerly one of Google's infrastructure designers, reportedly said:
How it works is that [the] company has an index of around 120 billion pages that is sorted on dedicated machines, each one tasked with conducting topic-specific search — for instance, health, sports or travel. This approach allows them to sift through the web faster (and probably cheaper) than Google...
The Forbes article has a little more detail on their query serving architecture:

Patterson and Costello's impressive feat is that they've done this with a total of 1,400 eight-CPU computers (1,000 find and data-mine Web pages, the remaining 400 serve up those pages) [JD: Even assuming there is no redundancy, 120 billion docs / 400 servers = 300 million documents per node. This seems unrealistically high, especially considering that Lucene, a widely used search library, can realistically handle 10-20 million documents per node.] ...
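The back-of-envelope math in my bracketed note above is easy to reproduce. The 120 billion and 400 figures come from the press coverage; the 20 million docs/node Lucene ceiling is my own rough estimate:

```python
# Back-of-envelope check on Cuil's claimed serving density.
TOTAL_DOCS = 120e9       # index size claimed in press coverage
SERVING_NODES = 400      # serving machines, per the Forbes article
LUCENE_PER_NODE = 20e6   # rough practical ceiling for one Lucene node (my estimate)

docs_per_node = TOTAL_DOCS / SERVING_NODES
print(f"Documents per serving node: {docs_per_node:,.0f}")    # 300,000,000

# How many nodes would a Lucene-like density require for the same index?
nodes_needed = TOTAL_DOCS / LUCENE_PER_NODE
print(f"Nodes needed at 20M docs/node: {nodes_needed:,.0f}")  # 6,000
```

Even granting generous assumptions, 300 million documents per node is an order of magnitude beyond what off-the-shelf search libraries sustain, which is why the claim deserves scrutiny.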

Cuil attempts to see relationships between words and to group related pages in a single server. Patterson says this enables quicker, more efficient searching: "While most queries [at competitors] go out to thousands of servers looking for an answer, 90% of our queries go to just one machine."
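To make the "90% of queries go to one machine" claim concrete, here is a minimal sketch of what topic-based sharding could look like. The topic names, keyword lists, and shard hostnames are my own invented illustrations, not Cuil's actual design:

```python
# Illustrative topic-sharded query router: each shard holds the pages for one
# topic, so a query that matches a topic touches a single machine instead of
# fanning out to every node. All names below are hypothetical.
TOPIC_SHARDS = {
    "health": "shard-health.example",
    "sports": "shard-sports.example",
    "travel": "shard-travel.example",
}
TOPIC_KEYWORDS = {
    "health": {"flu", "symptoms", "diet"},
    "sports": {"score", "league", "match"},
    "travel": {"flight", "hotel", "itinerary"},
}

def route_query(query: str) -> list[str]:
    """Return the shard(s) that should serve this query.

    A query matching a known topic goes to that topic's shard alone;
    anything unrecognized falls back to fanning out across all shards.
    """
    terms = set(query.lower().split())
    shards = [TOPIC_SHARDS[t] for t, kw in TOPIC_KEYWORDS.items() if terms & kw]
    return shards or list(TOPIC_SHARDS.values())

print(route_query("flu symptoms"))      # one shard: ['shard-health.example']
print(route_query("weather forecast"))  # no topic match: fans out to all three
```

The interesting design consequence is the fallback path: the 10% of queries that straddle topics or match nothing still pay the full fan-out cost, so the topic classifier's coverage determines how much of the efficiency win actually materializes.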

Finally, to compare with Google's architecture, a quote from Danny Sullivan's interview with Anna:
If they [Google] wanted to triple [the] size of their index, they'd have to triple the size of every server and cluster. It's not easy or fast...increasing the index size will be [a] 'non-trivial' exercise
According to the news, Cuil's index serving infrastructure is a key competitive advantage over Google and the other major players. It remains to be seen if they can leverage this platform to produce world-class results.

On their size claims

Last I heard, Google's index is rumored to be in the 40 billion range, and Microsoft's is in the 10-20+ billion range. Cuil claims their architecture allows at least a 3x increase in index size over Google. However, it's hard to verify this because Cuil's hit counts are badly broken: a search for [the] returns an estimated 250 documents. The lack of support for advanced search operators, such as site:, also makes it difficult to compare coverage of individual sites, such as Wikipedia.

Other differentiating features:
  • Topic-specific ranking
    From Danny's interview, it sounds like Cuil is doing post-retrieval analysis of document content, analyzing phrase co-occurrence and extracting 'concepts'. From the interview:
    It figures out these relationships by seeing what type of words commonly appear across the entire set of pages it finds. Since "gryffindor" appears often on pages that also say "harry potter," it can tell these two words (well, three words -- but two different query terms) are related.

    Cuil then reportedly computes a topic specific link score. It sounds very similar to Teoma's HITS technology. Again, there is no support (yet) for Cuil's claim that this is superior to other search approaches.

  • UI and exploration
    Cuil has a non-standard two or three column layout of results, which attempts to feel more like a newspaper, with images associated with many results.

    It appears to use the information from the content analysis to create the 'Explore by Category' box to drill down into specific topics, as well as offering related searches as tabs across the top of the page.
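The kind of co-occurrence analysis described in the first bullet can be sketched in a few lines. This is a toy illustration of the general idea, not Cuil's algorithm; the documents and the "appears together in at least two pages" threshold are invented for the example:

```python
from collections import Counter
from itertools import combinations

# Toy co-occurrence counter: terms that frequently appear on the same page
# are treated as related concepts. The documents are invented examples.
docs = [
    "harry potter attends hogwarts and joins gryffindor",
    "gryffindor is one house at hogwarts in harry potter",
    "quidditch is played at hogwarts by harry potter",
    "the weather today is sunny",
]

pair_counts = Counter()
for doc in docs:
    terms = sorted(set(doc.split()))          # dedupe, fix pair ordering
    pair_counts.update(combinations(terms, 2))

# Term pairs that co-occur in at least two documents look "related".
related = {pair for pair, n in pair_counts.items() if n >= 2}
print(("gryffindor", "hogwarts") in related)  # True
```

A real engine would work over billions of pages and weight pairs statistically rather than with a raw count threshold, but the core signal — which terms keep showing up together — is the same.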

Closing thoughts: The 120 billion pages people care about

Size matters, but it's more important to get the right content into the index. "Useful" is subjective, and since the web is effectively infinite, only a subset of it is worth indexing. Google tracks at least a trillion distinct URLs, while Cuil's crawl covers a mere 186 billion (SE Land reference). It's critical that crawling and indexing be prioritized correctly. For example, despite the reported massive index size, this blog is not indexed by Cuil. On its own that doesn't mean much, but Daniel reports similar gaps in coverage of his content.

I am unimpressed with Cuil's current coverage and relevance, but it's still early. Despite all the criticism (much of it justified), launching a search engine at this scale is an impressive feat. I think what Cuil is doing is exciting, and I'm withholding judgment until the engine has time to mature. Once again, congratulations to the Cuil team, and good luck on the long road ahead.


  1. My scathing review of Cuil notwithstanding, I am impressed with their technical accomplishment, and I am willing to believe that they've built a more efficient index than Google. But all is for naught if they deliver a sub-par experience. They have to put more work into their crawler and/or ranking algorithms if they're going to compete head-to-head with the big dogs.

  2. I think they probably got too much press hype for what's clearly still an early stage product.

    If you believe their index size claims, they've laid an impressive groundwork for future growth. We'll see how they leverage that platform.

    It's taken Microsoft years to build out and become competitive. Even with their experience, software takes time, and I expect it will take Cuil a year or two of hard work to narrow the gap.

    Is it a real competitor to Google? Maybe; we'll see how they execute.

    In reality, with the VC interests involved, my bet is that they'll exit for a few hundred million when the time is right.