Tuesday, January 31

Meta-Search Part I: The Beginning

Today, I start the first in what I hope will be a series of posts on meta-search. Meta-search engines have been around since almost the beginning of web search, and they continue to spring up almost daily, like Gravee, which I reviewed earlier. They remain controversial and interesting, as evidenced by articles like this recent one on Search Engine Watch. I've always enjoyed looking at the history and development of technology, so I thought a fun way to jump into the fray would be to look at the technical innovators of yesteryear. Although meta-search engines that searched databases have existed for a long time, I am going to restrict my discussion to meta-search on the web. I will take a look at the first two web meta-search engines, SavvySearch and MetaCrawler. The problems they attempted to solve are many of the same problems facing search today.

Background: The beginning

I have seen several resources, such as this history of search engines, that wrongly identify MetaCrawler as the first web meta-search engine. The first was in fact SavvySearch, though not by a wide margin. SavvySearch and MetaCrawler were both university research projects released within months of one another, SavvySearch in March 1995[2] and MetaCrawler in July of the same year[1]. The bottom line is that both were in development at the same time, in 1994, and both were released to the public in '95.

SavvySearch


SavvySearch was a research project out of Colorado State University that attempted to provide a centralized interface to web search engines and specialized databases through intelligent service selection. It presented users with a "search plan" to execute their query against the subset of engines and databases most relevant to it. Its goal was to optimize two conflicting objectives: "minimizing resource consumption and maximizing search quality."[2] SavvySearch attempted to do back in 1994 what Gary Price describes as the future in the aforementioned SEWatch article,
For a long time I've said verticals will continue to grow in popularity and importance as meta search tools which are getting better all of the time will allow various database and content publishers to offer material (free or fee) to end users who will select these databases at the time of their search based on their information need.
Like Gary's vision, SavvySearch searched not only web content but also specialized databases like Roget's Thesaurus, CNet's Shareware.com, and the IMDb. SavvySearch's raison d'être was that no search engine, or even group of engines, was large enough to contemplate crawling the entire web. The major engines (Aliweb, WebCrawler, Lycos, Yahoo, etc.) lacked good coverage of the web. Furthermore, this was before major specialty sites started creating crawler-friendly database-driven pages or providing database feeds directly for indexing. SavvySearch's sibling, MetaCrawler, also addressed the coverage issue, but instead of focusing on intelligent service selection, it focused on problems of freshness and relevance.

MetaCrawler
MetaCrawler was a project out of the University of Washington by graduate student Erik Selberg (advised by Oren Etzioni) that tackled not only recall but also the staleness and (ir)relevance of the search results of the day's engines. Unlike SavvySearch, it was a pure web search engine, although one proposed future project was to extend it to include databases[2]. While it did attempt to solve recall problems, its primary aim was to "verify" pages by fetching them to ensure their existence and freshness.


MetaCrawler saved the user time and work by querying six search engines: Galaxy, Infoseek, Lycos, Open Text, WebCrawler, and Yahoo!. It then de-duped and fetched all of the pages returned. That's right, it fetched every page returned by every engine! It "verified" results, eliminating dead pages and pages that were no longer relevant. According to their paper, on average 14.88% of search results were removed because they were "dead." In addition to removing dead pages, MetaCrawler re-scored the fetched pages against the query terms and removed pages that had changed since the search engine originally indexed them.
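The verification pipeline described above can be sketched in a few lines. This is a minimal illustration, not MetaCrawler's actual code: the function name and the injected `fetch` callable (which returns a page body, or None for a dead link) are my own assumptions, and the real system fetched thousands of pages in parallel and re-scored them as well.

```python
def verify_results(result_lists, fetch):
    """Merge ranked URL lists from several engines, de-dupe by URL,
    and keep only pages that can still be fetched ("verified").
    `fetch(url)` returns the page body, or None for a dead link."""
    seen, verified = set(), []
    for results in result_lists:
        for url in results:
            if url in seen:          # de-dupe across engines
                continue
            seen.add(url)
            if fetch(url) is not None:
                verified.append(url)  # page still exists
            # dead links are silently dropped (~15% on average, per [1])
    return verified
```

In practice `fetch` would be an HTTP client with a timeout; injecting it keeps the merge-and-verify logic separate from the networking.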

In the process of re-scoring the pages it fetched, MetaCrawler also generated query-sensitive page snippets. Back in this pre-Cambrian era of search, there were no query-sensitive summaries; it was too expensive to store cached content. Instead, search engines provided a list of URLs with a query-independent description of each page. Users were left to hunt and poke to discover how a page related to their query. MetaCrawler improved the perceived relevance of search because users could more easily understand why a result was returned. Selberg describes this list of "references": "Each reference contains a clickable hypertext link to the reference, followed by local page context (if available), a confidence score, verified keywords, and the actual URL of the reference. Each word in the search query is automatically boldfaced." This feature would not be duplicated again (to my knowledge) until 1999, when Google released its search engine.
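The idea behind those query-sensitive snippets with boldfaced query words can be sketched as follows. The function name, the `width` parameter, and the excerpt heuristic are illustrative assumptions, not MetaCrawler's actual implementation:

```python
import re

def snippet(text, query, width=60):
    """Return a short excerpt starting near the first query-word hit,
    with each query word boldfaced -- in the spirit of MetaCrawler's
    "local page context" plus automatic boldfacing."""
    lower = text.lower()
    hits = [h for h in (lower.find(w) for w in query.lower().split()) if h != -1]
    start = max(0, min(hits) - width // 2) if hits else 0
    excerpt = text[start:start + width]
    for w in query.split():
        # boldface every occurrence of the query word, case-insensitively
        excerpt = re.sub(re.escape(w),
                         lambda m: "<b>%s</b>" % m.group(0),
                         excerpt, flags=re.IGNORECASE)
    return excerpt
```

Because MetaCrawler had the full page text in hand after verification, generating this context was essentially free; engines that stored only an index could not do it.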

The biggest drawback to MetaCrawler was that it didn't store cached pages: in contrast with SavvySearch's emphasis on economy of resources, MetaCrawler was very bandwidth- and time-intensive. Its fetcher was highly optimized so that it could simultaneously download over 4,000 pages at a time[1]. Quite an accomplishment! However, fetching every result for every query doesn't scale, even with the benefits of caching. Instead, Selberg proposed that MetaCrawler could become a client-side application, with ISPs providing caching to speed page fetches. However, this approach would have required substantial client-side bandwidth in the pre-broadband era. Even with MetaCrawler's highly optimized fetcher, a query took on average over two minutes to verify all of the pages[1] (page 8, table 4).

A brief comparison
SavvySearch searched up to 20 engines at once, while MetaCrawler queried only six. SavvySearch included topic-specific directories and databases, while MetaCrawler searched only web search engines. MetaCrawler was slower, but more reliable, than SavvySearch.[4] Because MetaCrawler fetched all pages, it could support more advanced query functionality, such as the minus operator and restriction to a country or a particular domain-name extension. SavvySearch, on the other hand, did not support advanced query formats: because it did no processing of pages itself, it was reduced to the lowest common denominator. Neither provided a way to leverage the full advanced-query power offered by most engines.

Conclusion
A primary reason for both of these engines' birth was that in the pre-Google, and even pre-AltaVista, era, having a single engine that provided even modest coverage of the web seemed impossible. Creators Selberg and Etzioni write,
Skeptical readers may argue that service providers could invest in more resources and provide more comprehensive indices to the web. However, recent studies indicated the rate of Web expansion and change makes a complete index virtually impossible.
They cite an interesting source: two researchers at CMU who helped develop the Lycos search engine. Mauldin and Leavitt write in their paper Web Agent Related Research at the Center for Machine Translation: "First, information discovery on the web (including gopher-space and ftp-space) is now (and will remain) too large a task...the scale is too great for the use of a single explorer agent to be effective." Advances in storage technology and cheap bandwidth now allow GoogleBot and other crawlers to do just that.

SavvySearch and MetaCrawler paved the way both for today's search engines and for the next generation of meta-search engines. MetaCrawler was purchased by InfoSpace and continues to operate as a meta-search engine, though it bears little resemblance to its former self. It provided a platform for research on the next generation of meta-search engines: HuskySearch, which researched AI applications to query refinement, and Grouper, which explored document clustering in meta-search. MetaCrawler's dynamic summaries are now the de facto standard, with Google a primary pioneer in bringing them to the masses. The problems these search engines attempted to address, the proliferation of search engines and the lack of stability and coverage in their results, continue today. There are more engines than ever, and there is still a significant amount of difference between the results of even the major engines. In future parts of the series we'll take a closer look at some of these problems and at how today's meta-search engines try to address them.

References and Resources
[1] E Selberg and O Etzioni. Multi-Service Search and Comparison Using the MetaCrawler, 1995.
[2] D Dreilinger and A Howe. Experiences with Selecting Search Engines using Meta-Search, 1997.
[3] MetaCrawler, HuskySearch, and Grouper.
http://www.cs.washington.edu/research/projects/WebWare1/www/metacrawler/
[4] Sonnenreich. A History of Search Engines.
http://www.wiley.com/legacy/compbooks/sonnenreich/history.html

Comments:

  1. Anonymous, 2:31 PM EST

    Ah, thanks for the trip down memory lane. I remember back in early 1996, when I first discovered it, SavvySearch was my favorite search engine. Relative to the web of the time, I recall it being as good as, if not better than, Google. (i.e., SavvySearch to 1996's web was >= Google to 2001's web.)

    I've always found it quite annoying that Google's policy is to not allow metacrawlers. It makes its living off of the content of so many other data sources on the web. Why not allow someone else to make their living off of it?
