Thursday, November 10

Vertical search definition and context

Vertical Search Engine – A program that lets users search a specialized collection of data, harvested from the Internet or a local machine by a piece of software called a spider or robot, using keywords, boolean logic, or more advanced criteria. The specialized collection may be limited to a specific topic, media format, genre, purpose, location, or other differentiating feature.

Further decomposing it: Vertical (as in a vertical market) means specialized, narrow, focused, specific, niche. Vertical search engines are contrasted with broad “horizontal” search engines that are general purpose and all-encompassing.

John Battelle does a great job of describing one class of vertical search engines, domain-specific search engines, in his book, The Search:

Domain-specific search solutions focus on one area of knowledge, creating customized search experiences, that because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers.

There has been quite a bit of controversy over what defines a vertical search engine. My foray into the controversy started when I heard that Niki Scevak from Jupiter Research had written a market report about “Vertical Search,” specifically focused on marketing and advertising, in February of 2005. This sparked a debate between Tom Evslin, a retired technology CEO turned author and blogger, and Fred Wilson, a venture capitalist.

What’s this controversy all about? Well, the crux of the disagreement is something fundamental: How do you define “search engine”?

First, let’s hear from some people who are supposed to know, the good old Oxford English Dictionary. It says a search engine is “a program for the retrieval of data, files, or documents from a database or network, especially the Internet.” And for a more populist perspective, let’s compare with Wikipedia. According to Wikipedia, a search engine is “a program designed to help find information stored on a computer system such as the World Wide Web, or a personal computer.” For even more variety, try define:search engine on Google. One thing becomes apparent when looking at these definitions: when most people refer to a “search engine,” they usually mean an “internet search engine” such as Google, Yahoo, MSN, Altavista, etc. Most people don’t associate “search engine” with searching their local computer or a database. This difference in definition results in the controversy between Niki, Tom, and Fred.

Are Travelocity, Expedia, Orbitz, Mobissimo, etc. vertical search engines? According to the technical definition, yes. However, under this definition every database-driven website is a “search engine.” I find this difficult to swallow, so I argue that they are not. In doing so I agree with Tom Evslin’s argument. These travel sites answer a query from highly structured proprietary data. In contrast, a web search engine typically searches a large database, an inverted text index of unstructured data – words from web pages. As one definition puts it, search engines are like “card catalogs for the internet.”
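To make the "inverted text index" idea concrete, here is a minimal sketch in Python. It is only an illustration of the data structure, not how any particular engine is built; the example.com URLs, the page text, and the function names (tokenize, build_index, search) are all made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical "crawled" pages: URL -> unstructured page text.
pages = {
    "http://example.com/pumps": "Centrifugal pumps for industrial engineering",
    "http://example.com/valves": "Industrial valves and valve actuators",
    "http://example.com/blog": "My weekend hiking blog",
}
index = build_index(pages)
print(search(index, "industrial pumps"))
```

The point of the structure is that a query never scans the pages themselves. It looks up each word in the index and intersects the resulting sets of URLs, which is the "card catalog" in miniature.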

It seems that the real differentiator between a web search engine and a database-powered website boils down to the source of the data. If it is derived at least in part from data provided by websites, it is a "search engine". If it is not, then it's not. It may be a travel site, a desktop application (GDS, Yahoo desktop), or any website on a specialized topic, but these aren't "vertical search engines". In my opinion, the bottom line is that sites like Kayak, etc. are not "vertical search engines" because they don't derive their content from other, external websites through an automated process, like crawling.

In the past one could make the argument that it was about structured database content vs. unstructured web pages. However, today that line is becoming blurred. Google and other search engines use information extraction to create structure from the unstructured mess, allowing product search, job search, and other more specialized “advanced” search functionality. The old structured vs. unstructured distinction no longer works. However, that is only because search engines are now smart enough to turn unstructured content into structured content. The source of the content hasn’t changed; the engines are simply smarter in their processing. The content is still derived from crawling, harvesting the information from the web.
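As a toy illustration of creating structure from unstructured text, the sketch below pulls a job title, city, and salary out of a sentence with a regular expression. The sentence, the pattern, and the extract_job function are invented for this example; real information extraction systems rely on far more robust techniques than a single hand-written pattern.

```python
import re

# Hypothetical pattern for postings phrased like
# "<Title> wanted in <City>, salary $<amount>".
JOB_PATTERN = re.compile(
    r"(?P<title>[A-Z][A-Za-z ]+?) wanted in "
    r"(?P<city>[A-Z][A-Za-z ]+?), salary \$(?P<salary>[\d,]+)"
)


def extract_job(text):
    """Return a structured record for the first job found, or None."""
    match = JOB_PATTERN.search(text)
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "city": match.group("city"),
        # Strip thousands separators so the salary can be compared numerically.
        "salary": int(match.group("salary").replace(",", "")),
    }


page_text = "Posted today: Senior Engineer wanted in Troy, salary $85,000. Apply now!"
print(extract_job(page_text))
```

Once the text has been reduced to fields like these, a search engine can offer filters such as "jobs in Troy paying over $80,000" even though the source was an ordinary crawled web page.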

Examples of Vertical Search Engines
Domain Specific:
Globalspec – 200+ Million pages of "Engineering" web content. (And my current employer.)
HealthLine -- A consumer health research site
KnowledgeStorm -- An IT search engine for IT decision makers.
Scirus -- A science only search engine based on FAST search technology.

Media Specific (Audio and Video)
Truveo -- A video and multimedia search engine.
Singing Fish -- An audio search engine

Genre Specific -- Blogs & News
Technorati -- A Lucene-based blog search engine.
-- A news and blog search engine.

Task-Specific Search utilizing information extraction
Trulia -- Real estate search
Indeed -- Job search
SimplyHired -- Job search
Oodle -- A local classified search engine.

Many of the above engines use specialized knowledge derived from extracting very specific information from parts of the web, a technique called “information extraction.” One of the companies that pioneered these advances in search engine technology was WhizBang Labs. It was one of the first major companies (created from Carnegie Mellon) to develop commercial information extraction technology, which is now used in many of the above. (More on WhizBang and information extraction in a future post.)

The bottom line is that sites like Kayak, etc. are not "vertical search engines" because they don't derive their content from other, external websites through an automated process, like crawling. Desktop search, local search, etc. can count as "vertical search engines" because they are specialized search engines whose content is derived from "crawling" the web or a local desktop.