Tuesday, January 31

Globalspec 2005 Results

GlobalSpec, my employer, today made its 2005 year-end announcement. It is a privately held company, so there is no earnings information, but the numbers are interesting nonetheless. It will probably be overshadowed in the press today, though, because Google releases its own year-end numbers later this afternoon.

Here is a sample from the press release:
During 2005, GlobalSpec sales volume grew 51% over 2004, while the company’s worldwide user base expanded to more than 2 million registered users. GlobalSpec continues to add new registered users at the current rate of 20,000 per week. SpecSearch®, GlobalSpec’s trademarked technology allowing engineers and other scientific and technical professionals the ability to search by product specification, now includes more than 95 million parts in 1,400,000 product families from over 17,000 catalogs.
One thing the release doesn't say, which I will add, is that The Engineering Web grew significantly in 2005, both in the amount of content and in its quality. For instance, in 2005 GlobalSpec created many new partnerships, adding important and useful content from partners like the IEEE and Knovel. We cracked down on spam, which is a continuous process, and made significant investments in resources to grow the breadth and depth of content. We also made significant changes to improve the freshness of the content.

But, that's just my opinion. Try some searches on The Engineering Web.

I would be interested in hearing people's opinions about its strengths and weaknesses, and what we could do to make it better. Have people noticed the improvements made in 2005?

Meta-Search Part I: The Beginning

Today, I start the first in what I hope will be a series of posts on meta-search. Meta-search engines have been around since almost the beginning of web search, and they continue to spring up almost daily, like Gravee, which I reviewed earlier. They remain controversial and interesting, as evidenced by the articles written about them, such as this recent article on Search Engine Watch. I've always enjoyed looking at history and the development of technology, so I thought a fun way to enter the fray would be to look at the technical innovators of yesteryear. Although there have been meta-search engines that searched databases for a long time, I am going to restrict my discussion to meta-search on the web. I am going to take a look at the first two web meta-search engines, SavvySearch and MetaCrawler. The problems they attempted to solve are many of the same problems facing search today.

Background: The beginning

I have seen several resources, such as this history of search engines, that wrongly identify MetaCrawler as the first web meta-search engine. The first was in fact SavvySearch, though not by a wide margin. SavvySearch and MetaCrawler were both university research projects released within months of one another, SavvySearch in March 1995[2] and MetaCrawler in July of the same year[1]. The bottom line is that they were both in development at the same time, in 1994, and were released to the public in '95.


SavvySearch was a research project out of Colorado State University that attempted to provide a centralized interface to web search engines and specialized databases through intelligent service selection. It provided users with a "search plan" to execute their query against the most relevant subset of engines and databases. Its goal was to balance two conflicting objectives: "minimizing resource consumption and maximizing search quality."[2] SavvySearch attempted to do back in 1994 what Gary Price describes as the future in the aforementioned SEWatch article:
For a long time I've said verticals will continue to grow in popularity and importance as meta search tools which are getting better all of the time will allow various database and content publishers to offer material (free or fee) to end users who will select these databases at the time of their search based on their information need.
Like Gary's vision, SavvySearch searched not only web content but also specialized databases like Roget's Thesaurus, CNet's Shareware.com, and the IMDb. SavvySearch's raison d'être was that no search engine, or even a group of engines, was large enough to contemplate crawling the entire web. The major engines of the day (Aliweb, WebCrawler, Lycos, Yahoo, etc.) lacked good coverage of the web. Furthermore, this was before major specialty sites started creating crawler-friendly database-driven pages or providing database feeds directly for indexing. SavvySearch's sibling, MetaCrawler, also addressed the coverage issue, but instead of focusing on intelligent service selection, it focused on problems of freshness and relevance.
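The service-selection idea is easy to sketch. Here is a toy Python version (the engine names, term profiles, and scoring scheme are my own illustration, not SavvySearch's actual algorithm): each engine carries a profile of terms it has historically answered well, and a query's "search plan" is simply the few engines whose profiles best match it.

```python
# Toy sketch of intelligent service selection. Each engine has a
# "profile" of query terms it has historically answered well; a query
# is routed only to the engines whose profiles best match it.
def build_search_plan(query, engine_profiles, max_engines=3):
    terms = query.lower().split()
    scores = {}
    for engine, profile in engine_profiles.items():
        # Score = summed historical usefulness of this engine per term.
        scores[engine] = sum(profile.get(t, 0.0) for t in terms)
    ranked = sorted(scores, key=scores.get, reverse=True)
    # The "plan": query only the most promising subset, saving resources.
    return ranked[:max_engines]

# Hypothetical engine profiles for illustration.
profiles = {
    "movie_db":  {"film": 0.9, "actor": 0.8},
    "thesaurus": {"synonym": 0.9, "word": 0.7},
    "web":       {"film": 0.3, "synonym": 0.2, "word": 0.3},
}
plan = build_search_plan("film actor", profiles, max_engines=2)
```

A query about films is sent to the movie database and the general web engine, while the thesaurus is skipped entirely, which is exactly the resource-versus-quality trade-off SavvySearch tried to optimize.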

MetaCrawler was a project out of the University of Washington by graduate student Erik Selberg (advised by Oren Etzioni) that tackled not only recall but also the staleness and (ir)relevance of the search results of the day's engines. Unlike SavvySearch, it was a pure web search engine, although one proposed future project was to extend it to include databases[2]. While it did attempt to solve recall problems, its primary aim was to "verify" pages by fetching them to ensure their existence and freshness.

MetaCrawler saved the user time and work by querying six search engines: Galaxy, Infoseek, Lycos, Open Text, WebCrawler, and Yahoo!. It then de-duped and fetched all of the pages returned. That's right, it fetched every page returned by every engine! It “verified” results, eliminating dead pages and pages that had been modified and were no longer relevant. According to their paper, on average 14.88% of search results were removed because they were "dead." In addition, pages were re-scored against the query terms, and pages that had changed since the search engine indexed them were removed.
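The verification pipeline boils down to three steps: merge the engines' result lists, de-duplicate by URL, then fetch each survivor and drop the dead. A minimal sketch of that logic follows; the fetch function is injected (here a plain dictionary lookup standing in for HTTP) so the flow is testable without the network, and MetaCrawler's real pipeline was of course far more elaborate.

```python
# Sketch of MetaCrawler-style result verification: merge per-engine
# result lists, de-dupe by URL, fetch each page, and drop dead ones.
def verify_results(result_lists, fetch):
    seen, merged = set(), []
    for results in result_lists:          # one list per engine
        for url in results:
            if url not in seen:           # de-dupe across engines
                seen.add(url)
                merged.append(url)
    alive = []
    for url in merged:
        page = fetch(url)                 # returns None for dead pages
        if page is not None:
            alive.append((url, page))
    return alive

# Fake fetcher (dict lookup) standing in for real HTTP requests.
pages = {"http://a.example/": "spam ham", "http://b.example/": "eggs"}
fetched = verify_results(
    [["http://a.example/", "http://dead.example/"],
     ["http://b.example/", "http://a.example/"]],
    fetch=pages.get,
)
```

The duplicate of `http://a.example/` from the second engine is collapsed, and the dead URL is silently dropped, mirroring the ~15% removal rate the paper reports.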

In the process of re-scoring the pages it fetched, MetaCrawler also generated query-sensitive page snippets. Back in this Precambrian era of search, there were no query-sensitive summaries; it was too expensive to store cached content. Instead, search engines provided a list of URLs with a query-independent description of each page. Users were left to hunt and poke to discover how a page related to their query. MetaCrawler improved the perceived relevance of search because users could more easily understand why a result was returned. Selberg describes this list of "references": "Each reference contains a clickable hypertext link to the reference, followed by local page context (if available), a confidence score, verified keywords, and the actual URL of the reference. Each word in the search query is automatically boldfaced." This feature would not be duplicated (to my knowledge) until 1999, when Google released its search engine.
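The snippet idea fits in a few lines: given the fetched page text, take a window of text around the first query-term hit and boldface the matches. The window size and HTML bolding below are my own simplification, not Selberg's actual implementation.

```python
import re

# Sketch of a query-sensitive snippet: excerpt a window of text around
# the first query-term hit, then boldface every occurrence of the terms.
def make_snippet(text, query, window=40):
    terms = [t for t in query.lower().split() if t]
    lowered = text.lower()
    # Locate the first occurrence of any query term.
    positions = [lowered.find(t) for t in terms if lowered.find(t) != -1]
    if not positions:
        return text[:window]              # fall back to the page prefix
    start = max(0, min(positions) - window // 2)
    excerpt = text[start:start + window]
    # Boldface each term, case-insensitively.
    for t in terms:
        excerpt = re.sub(re.escape(t),
                         lambda m: "<b>%s</b>" % m.group(0),
                         excerpt, flags=re.IGNORECASE)
    return excerpt

snippet = make_snippet(
    "MetaCrawler verified pages and generated query sensitive summaries.",
    "query summaries",
)
```

Contrast this with the query-independent descriptions of the day: the same page always showed the same blurb, no matter what the user searched for.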

MetaCrawler's biggest drawback was that it didn't store cached pages. In contrast with SavvySearch's emphasis on economy of resources, MetaCrawler was very bandwidth- and time-intensive. Its fetcher was highly optimized, able to download over 4,000 pages simultaneously[1]. Quite an accomplishment! However, fetching every result for every query doesn't scale, even with the benefits of caching. Instead, Selberg proposed that MetaCrawler could run as a client-side application, with ISPs providing caching to speed page fetches. However, this approach would have required substantial client-side bandwidth in the pre-broadband era. Even with MetaCrawler's highly optimized fetcher, a query took on average over two minutes to verify all of the pages[1] (page 8, table 4).

A brief comparison
SavvySearch searched up to 20 engines at once, while MetaCrawler queried only six. SavvySearch included topic-specific directories and databases, while MetaCrawler only searched web search engines. MetaCrawler was slower, but more reliable, than SavvySearch.[4] Because MetaCrawler fetched all pages, it could support more advanced query functionality, such as the minus operator and restriction to a country or a particular domain name extension. SavvySearch, on the other hand, did not support advanced query formats: because it did no processing of pages itself, it was reduced to the lowest common denominator. Neither provided a way to leverage the full advanced query power offered by most engines.

A primary reason both of these engines were born was that in the pre-Google, even pre-AltaVista era, having a single engine that provided even modest coverage of the web seemed impossible. MetaCrawler's creators Selberg and Etzioni write:
Skeptical readers may argue that service providers could invest in more resources and provide more comprehensive indices to the web. However, recent studies indicated the rate of Web expansion and change makes a complete index virtually impossible.
They cite an interesting source: two researchers at CMU who helped develop the Lycos search engine. Mauldin and Leavitt write in their paper Web Agent Related Research at the Center for Machine Translation: "First, information discovery on the web (including gopher-space and ftp-space) is now (and will remain) too large a task...the scale is too great for the use of a single explorer agent to be effective." Advances in storage technology and cheap bandwidth now allow GoogleBot and other crawlers to do just that.

SavvySearch and MetaCrawler paved the way both for today's search engines and for the next generation of meta-search engines. MetaCrawler was purchased by InfoSpace and continues to operate as a meta-search engine, though it bears little resemblance to its former self. It provided a platform for research on the next generation of meta-search engines: HuskySearch, which researched AI approaches to query refinement, and Grouper, which explored document clustering in meta-search. MetaCrawler's dynamic summaries are now the de facto standard, with Google a primary pioneer in bringing them to the masses. The problems these engines attempted to address (the proliferation of search engines and the lack of stability and coverage in their results) continue today. There are more engines than ever, and there is still a significant amount of difference between the results of even the major engines. In future parts of this series we'll take a closer look at some of these problems and how meta-search engines today try to address them.

References and Resources
[1] E. Selberg and O. Etzioni. Multi-Service Search and Comparison Using the MetaCrawler, 1995.
[2] D. Dreilinger and A. Howe. Experiences with Selecting Search Engines using Meta-Search, 1997.
[3] MetaCrawler, HuskySearch, and Grouper.
[4] Sonnenreich. A History of Search Engines.