Tuesday, December 12

Search startups are dead

After a long hiatus (for many reasons), I found something that prompted me to write again.


Tim O'Reilly wrote an article, Search Startups are Dead, Long Live Search Startups. His article is itself a response to Bill Burnham's article of the same title.


The post garnered lots of interesting discussion on Tim's blog, and I really enjoyed reading it. Highlights include Tim's comments along with those of Ken Krugler (Krugle.com), Otis Gospodnetic (Simpy), Markl (Google Custom Search Engine), and Jeff Chan (TopicBlogs).

To summarize Burnham's original article:
The hardware and bandwidth needed to create a large web-scale search engine (high-performance crawling, indexing, query serving, link analysis, etc...) provide a significant barrier to entry for new startups. In short, startups are effectively priced out of the market, and VCs and entrepreneurs should think hard before (read: don't even bother) attempting to displace one of GYM.

His message is that it won't be long before web search is a commodity platform. This is signaled by fledgling services such as Alexa and Google Custom Search Engine. The idea is that startups could switch search providers as easily as they could switch between RDBMSs today. In other words, developers should focus on building innovative applications that use these new web search services rather than attempting to write SQL Server from scratch.

Tim agrees and writes in his comments:
I'm not suggesting that there isn't the possibility of radical disruption to the current model. But I still believe that we're entering the centralization phase of the web, in which the big get bigger, and put up barriers to entry to the new guys.
One of the best comments was from Jeff Chan who astutely writes:

It appears to be trendy to point to infrastructure as a barrier to entry for almost any new (or existing) category of services. A few hundred servers suffice for a small number of copies of a web-scale crawl and index. Google's scale is necessary for the volume of queries it serves, not the size of the corpus.

The real problems potential entrants have to deal with are technology and distribution. It is an open question whether anyone can significantly improve on the existing approach (perhaps mining and summarizing facts?), and even if they could, whether they can acquire a following before an existing service replicates it or acquires them.

I agree with Tim and Jeff. You can't create just another search engine and expect to succeed. There will need to be a radical disruption in order to displace Google. In my opinion, search is not yet a commodity platform; there are too few vendors, and their services are unproven and immature. I wouldn't bet my (search engine) business on them, yet.

Tuesday, May 16

Yahoo Answers, Naver, and the future of search

Greg Linden has a nice follow-up to Yahoo Search's post on its usage statistics for Yahoo Answers. Reading his post prompted me to write this as a response.

I found a recent interview of Terry Semel on May 10th by Ken Auletta and Jeff Jarvis with a few particularly interesting tidbits. Of course Terry trumpets social search, and says: "Machines don’t answer the questions of people. People answer the questions of people." In another version of the interview Semel is quoted as saying, "Is web search the killer application or just the first? Knowledge search, as they call it in Korea, or social search, as we call it, has blown through the roof. There may be changing dynamics."

After doing a little more digging, I found Naver, the wildly successful "social search" engine in Korea that Semel refers to. This is what Yahoo holds up as a model for "social search." It was a bit difficult for me to use since it is in Korean, but there is a lot of interesting reading about it in the search engine forums.

What I found fascinating is that Yahoo points to the success of answer-based engines in Asia, specifically in Korea (Naver), as a harbinger of social search's success in the US. Asia tends to lead in tech adoption, so perhaps what we see happen there could happen here in a few years? Or is Yahoo wrong, and will Yahoo Answers go the way of LookSmart Live and Ask Jeeves' AnswerPoint? Having studied history, I like to see the context of what's happened in the past with previous answer services. Danny Sullivan provides some great background in "Web Search History: Before Google Answers and Yahoo Answers...". In a future post I would like to dig into a more in-depth then-and-now comparison.

Naver's competitive advantage has proved to be a major barrier to entry in the Korean market -- for Yahoo (although technically Yahoo Korea was there first) and Google. There was a recent AP article on the topic (MSNBC link). Yahoo Answers feels like a port of Naver to the American market. It's still too early to see if it will work, so I am going to withhold judgement. Yahoo would like to create a similar barrier of content in search in the US.

Here's a different (my) take on a possible social search / answers service: provide a Google-like interface where users can add their content to the search results list. It becomes less a list of sites and more a collaborative Squidoo-like guide, with web search results as one of the many resources. For example, do a search for "edinburgh restaurants" and get a list of the best restaurants with a link to "The List", the local arts and entertainment magazine that is the authoritative source. Make the page editable. Like Yahoo Answers, use a reputation system and points to reward good content and let the community police bad (spam) content. Crazy, perhaps, but an interesting thought.

In other words, what if you could benefit from other searchers' past experience? For example, in researching our wedding, I found the definitive local site on Edinburgh restaurants (The List), yet it is not in Google's top 20 for "edinburgh restaurants." Chances are pretty good that another user with the same query is looking for restaurant information on Edinburgh, and The List would be extremely helpful. However, there is no way to add it or to tell Google they might want to fix their ranking or add this resource. Instead, the searcher has to learn what I did -- the hard way. What if search were more interactive (a bit like a better Eurekster?) and we let the community of humans help make results more relevant? Google Co-op might be a start towards this "wiki of search," but it has a LONG way to go before it hits the mainstream (it currently requires development knowledge to use; more on Co-op in the future).

In other news, I'm off to the WWW conference -- not to mention getting married and going on my honeymoon. I'll be returning the first week in June. For the honeymoon I will be enjoying Scotland and the Languedoc in the south of France (at a nice little B&B in the country). More on the WWW conference, but not the honeymoon, when I return!

Saturday, May 6

Rexa, Automated Information Extraction Meets CiteSeer

Andrew McCallum's team at UMass Amherst recently announced the release of a new CiteSeer killer. So, what's so special? Well, if you are familiar with Andrew's work, you know it must utilize automated Information Extraction (IE) technology. The Rexa team has a blog, where Andrew describes how Rexa aims to be the next-generation CiteSeer:
Rexa's chief enhancement is that we use advanced statistical machine learning algorithms to perform information extraction, and then make first-class, deduplicated, cross-referenced objects not only of research papers, but also people and grants--and in the future, more object types.
In the process of learning about Rexa I ran across a new and quite interesting blog: Machine Learning Theory. I found Rexa and the blog through Data Mining, a blog by Matthew Hurst, a British expat and another former WhizBang employee who is now director of research at Intelliseek.

Machine Learning Theory is an awesome blog. John, its primary author (along with other contributors), has a write-up on Rexa which is also quite good. The best part was the discussion about Rexa, which brought up several areas for "future work."

Rexa is a fully automated solution. In my opinion, one of the problems with fully automated ML solutions is that it is well-nigh impossible to get near 100% precision. Reading the user comments, one user, Andrej, noticed a mis-attributed paper. Charles, one of Rexa's creators, responds: "The automatic extraction and author merging performed by Rexa has accuracy in the 90s, but inevitably there are errors." Ninety-plus percent accuracy is quite good, quite remarkable in fact. However, for important decisions, is this error rate acceptable?

More important than the percentage of errors is the type of errors made by the system. Last year, I saw a presentation given by David Hull entitled "Commercializing Information Extraction: Lessons from WhizBang Labs." One of the major takeaways was this: the problem with fully automated ML-based extraction is not the percentage of errors, but the types of "stupid" errors that a human wouldn't make. Even a modest number of "stupid" errors in author attribution in Rexa could leave users like Andrej unsure about the quality of the attributions in the system. In his presentation, Hull outlines some of the issues that WhizBang had with its FlipDog jobs database. My boss, Steve, once had his job (and other major exec positions at Globalspec) listed as postings on FlipDog because it mistook the executive information page for job postings. Both of these cases (FlipDog and Rexa) illustrate the need for an easy way for people to intervene and make corrections when the ML certainty is ambiguous. In Rexa's case, the problem of coreference analysis, also known as record linkage or identity uncertainty, is a very difficult research problem and one where the computer will certainly not achieve 100% precision. So, what do you do? Do we abandon fully automated solutions or merely accept their error rate and "stupid" mistakes?

For difficult ML problems like this, is there a balance between a fully automated ML solution and a fully manual human solution that achieves high precision with minimal user interaction? Could such a system, for example, let the machine figure out the clear linkages but defer to a human editor for decisions that are more dubious? Or at least make it clear to the user that the linkage is weak or uncertain. Then, in these less certain cases, can the machine learning algorithm simplify the problem for the user to make it tractable for manual review? For example, the algorithm might surface that, out of many choices, there are two highly probable author matches for a paper by "A. Bauer" -- "Andreas Bauer" and "Andrej Bauer" -- along with some rationale for its judgment, and let the user confirm. This semi-automated approach has clear speed advantages over a completely manual process and achieves a higher level of perceived "intelligence" than a completely automated process. However, it does involve the user, and therefore time and money. Does the increased precision, and more importantly the lack of "stupid" mistakes, justify the added cost?
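
To make that middle ground concrete, here is a minimal sketch of the kind of confidence-based triage I have in mind (the thresholds, scores, and class names are hypothetical, not anything Rexa actually does): linkages the matcher is very sure about are accepted automatically, clearly bad ones are discarded, and the ambiguous band in between is queued for a human editor along with the competing candidates.

  import java.util.*;

  // Sketch of confidence-based triage for author coreference decisions.
  // The thresholds and scores are made-up placeholders, not Rexa's.
  class AuthorMatchTriage {
      static final double AUTO_ACCEPT = 0.95;   // above this: link without asking
      static final double AUTO_REJECT = 0.50;   // below this: silently discard

      static void triage(Map<String, Double> candidateScores,
                         List<String> autoLinked, List<String> needsReview) {
          for (Map.Entry<String, Double> c : candidateScores.entrySet()) {
              if (c.getValue() >= AUTO_ACCEPT) {
                  autoLinked.add(c.getKey());        // machine is confident: link automatically
              } else if (c.getValue() > AUTO_REJECT) {
                  needsReview.add(c.getKey());       // ambiguous: defer to an editor
              }
          }
      }

      public static void main(String[] args) {
          // Two plausible expansions of "A. Bauer" with middling scores -> both go to review.
          Map<String, Double> scores = new LinkedHashMap<>();
          scores.put("Andreas Bauer", 0.62);
          scores.put("Andrej Bauer", 0.58);
          scores.put("Anne Bauer", 0.10);
          List<String> autoLinked = new ArrayList<>(), review = new ArrayList<>();
          triage(scores, autoLinked, review);
          System.out.println("Auto-linked: " + autoLinked);
          System.out.println("Queued for editor: " + review);
      }
  }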

Let me know what you think of this trade-off. Are fully automated solutions viable, or are semi-automated solutions still necessary because of user trust issues?

Tuesday, May 2

Sphere Launches

Sphere is vaporware no more. It officially launched this morning, and it received another $3.75M in funding.

More coverage on TechCrunch. Congratulations, even if I never got a beta invite!

Sphere's UI was designed by Adaptive Path, a nice User Experience (UE) design firm.

Tuesday, April 25

Desktop Search is Dead, Long Live Desktop Search

Greg Linden wrote in a recent post, "Using the desktop to improve search." In it, he writes:
But is it really natural to go to a web browser to find information? Or should information be readily available from the desktop and sensitive to the context of your current task?

I expect we some day will see information retrieval become a natural part of workflow. Search will be integrated into the desktop.
I disagree, but not because I think his premise is wrong. Instead, I believe the browser will become the platform for most of our tasks, and search will be integrated into these browser-based applications. As more and more applications migrate to the web (calendar, e-mail, word processing, multimedia editing and sharing, file storage, etc...), contextual and implicit-search-enabled web applications and services will become more prevalent in the way Greg describes. A first step in this direction is Yahoo! Contextual Search, which uses the surrounding paragraphs of text on a webpage to find "Related Info."
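
As a crude sketch of the underlying idea (this is not Yahoo's implementation, just an illustration with an arbitrary stop list and a "top three terms" heuristic): pull the most frequent meaningful terms from the surrounding text and use them to bias the query toward the current context.

  import java.util.*;
  import java.util.stream.*;

  // Sketch: augment a query with context terms mined from the surrounding text.
  // The stop-word list and the top-3 cutoff are arbitrary choices for illustration.
  class ContextualQuery {
      static final Set<String> STOP = Set.of("the", "a", "of", "and", "to", "in", "is", "for");

      static String augment(String query, String surroundingText) {
          Map<String, Long> freq = Arrays.stream(surroundingText.toLowerCase().split("\\W+"))
              .filter(t -> t.length() > 2 && !STOP.contains(t))
              .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
          String contextTerms = freq.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(3)
              .map(Map.Entry::getKey)
              .collect(Collectors.joining(" "));
          return query + " " + contextTerms;
      }

      public static void main(String[] args) {
          String paragraph = "The Eclipse IDE supports plug-ins for Java development, "
              + "and the plug-in registry makes Eclipse easy to extend.";
          // Prints something like "debugger eclipse plug java" -- biased toward the page's context.
          System.out.println(augment("debugger", paragraph));
      }
  }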

I think Microsoft and others will skip a generation of desktop search software and instead focus on better search in their web service applications. See my previous post on Live Drive and GDrive. Why invest lots of money creating a search enabled desktop file system when the goal is to get users to switch to Live Drive? Instead, invest the resources in the new system to make the service more appealing and drive user adoption.

One problem I see with desktop search is that a lot of the really cool contextual features require tight application integration without impacting application performance. Desktop search today is hindered by OS and application reliability and performance.

Application and OS metadata reliability was another major issue highlighted in SiS (Stuff I've Seen), which Greg references. I've seen Susan Dumais give her SiS presentation twice: once in the UC Berkeley lecture series, and once in real life at a search conference (she must be so sick of it; she's probably done it hundreds of times!). Incorrect metadata was a major theme in her presentation. One of her examples related to Microsoft Word documents and searching by author. The problem they found was that one person appeared to have written a large share of the company's documents. It turns out that this person wrote the Word templates everyone used and so became listed as their author! And then there are bigger issues like files and e-mail moving (there is still a GDS bug where it doesn't track moved Outlook e-mails or files).

In an age of spyware and adware, people are sick of software hogging system resources. Correctly or not, desktop search has a bad rep when it comes to performance. Microsoft's indexing service is notorious for slowing systems down. I have had co-workers complain about even GDS hogging resources. Moving these systems to the network allows indexing to be handled by independent resources, so user applications aren't hindered by the indexing overhead and users can focus on doing what they want instead of waiting for the computer to respond. Searching on a highly parallelized grid is faster than it ever could be on a single desktop hard drive.

However, it will be some time before we have fully functional web platforms. There are still major barriers in terms of cost and resources, namely bandwidth. In the meantime, desktop search will continue to be important. Apple has shown this in Tiger with Spotlight, and Google, Yahoo, and Microsoft will all continue to release desktop search applications. It is clear that "implicit queries" and push-oriented search systems will be an interesting and growing part of search in the next generation of web applications.

Wednesday, April 19

Live Drive and GDrive, The Ferraris of cluster-based storage services on the information superhighway

Web Scale Grid Computing Platforms
Google has TeraGoogle. Microsoft has The Platform. Amazon has S3. Think grid computing a la IBM's vision of massive Linux clusters running on blade servers. Now begin thinking about the Google File System (GFS) and its ability to adapt and re-configure itself in light of hardware failures. It is all very reminiscent of IBM's vision of "Autonomic" and "On Demand" computing. While IBM is focused on multi-million dollar business service contracts, Google is focused on advertising-supported consumer services, and Microsoft is creating its next generation of applications to be more web-service centric with a greater emphasis on advertising-supported business models. Perhaps IBM should take a hard look.

The aforementioned companies are all making massive capital expenditures on computing hardware and bandwidth to build extremely large computing grids out of commodity hardware. IBM is doing it for business in its "On Demand" initiative, and Google and Microsoft are (or will be) doing it for the consumer and, to some extent, business markets. These massive-scale distributed computing platforms will power the next generation of internet services. Google and Microsoft are each spending massively: billions of dollars on millions of servers! I am not exaggerating; it is in the financials -- see the Google Analyst Day presentation.

We are seeing the Walmart-ization of distributed computing. Google, in particular, has created one of the lowest-cost (to build and maintain), highest-performing grids in the world. This is one of its real innovations. The growth of these large clusters of commodity hardware significantly lowers the cost per gigabyte of storage to quite reasonable dollar amounts -- lower than what consumers could buy themselves; there are significant economies of scale at work. Software infrastructure like GFS and IBM's General Parallel File System (GPFS) is making TeraGoogle and IBM's grid computing a reality.

Free disks for everyone!
Sometime in the near future -- the end of 2006, or maybe sometime in 2007 -- I predict that both Google and Microsoft will release online storage services: GDrive and Live Drive, respectively. Amazon is already doing this for software developers through its S3 service, which it launched in March. IBM already has a product for large businesses.

However, MS and Google will bring mass network storage to the masses. Their goal: store all of users' information on their platforms. Not a new concept (think mainframes and thin clients), but one that is finally starting to gain traction on the internet. We are talking about hundreds of gigabytes per user -- petabytes and petabytes of storage. A recent article on CNN Money, Microsoft's new brain, lays it out:
Microsoft is planning to use its server farms to offer anyone huge amounts of online storage of digital data. It even has a name for that future service: Live Drive. With Live Drive, all your information - movies, music, tax information, a high-definition videoconference you had with your grandmother, whatever - could be accessible from anywhere, on any device.

Google apparently has similar plans. An internal memo [analyst day slides] accidentally posted online in March spoke of company efforts to "store 100 percent of user data" and mentions an unannounced Net-storage system called GDrive.
Searching your data would be faster, more reliable, and more relevant if it were stored on Google's or Microsoft's high-speed grid. Storing your information online would solve many problems inherent in today's "desktop search." The metadata on your files (author, date of creation, modification, etc...) won't be changed or corrupted, a major problem in today's desktop search. Personal search is no longer limited by the hardware and software constraints of the desktop PC. The result is faster and more reliable access to your information. And just maybe it's encrypted so there are no prying eyes elsewhere.

Because these services will have access to your information, they will be able to better target advertising to your interests, also referred to as personalization. Do you have pictures of Paris (the city, not Hilton)? Perhaps you want to plan your next vacation there -- and Google's advertising can help. Of course there are major privacy concerns, but these seem to be inherent in all targeted, contextual advertising. After all, it's contextual! In the past consumers have been willing to sacrifice privacy for free services, and I believe they will continue to do so. After all, the offer of unlimited storage space and free wi-fi access is pretty compelling!

The growth of web-scale services and different audiences

Only a very few companies have the resources necessary to build these platforms and deploy services like the aforementioned at scale. IBM does, but it seems content to build the software (WebSphere) that powers the "On Demand" economy, rent computing grids to large Fortune 500 companies, and provide integration and support via Global Services rather than create consumer-facing applications. It was the same back in the 90s when it came to search.

IBM had the technology and resources to build Google with its WebFountain project, but the company lacked the will to create a consumer-oriented service. Instead, it turned WebFountain into a business analytics engine, which was never really profitable. Consumer services weren't profitable back then because the business model (targeted ads) hadn't matured. Right now IBM is facing a similar problem with its On Demand platform as it did in search: businesses aren't interested in renting computational power from IBM for large sums of money under long contracts. IBM doesn't seem to have the will to create consumer-facing services with advertising- and subscription-based business models. Microsoft has traditionally taken the same stance, but times are changing in Redmond with Ray Ozzie. Google's success has given them a wake-up call.

New Business Models (at least for MS)
In a recent memo, Ray Ozzie, the new Microsoft CTO and creator of Lotus Notes, lays out the first of the key tenets at the "new" Microsoft:
The power of the advertising-supported economic model.
From my (limited) experience, consumers don't like to pay for software (or services), but they seem to be willing to watch targeted, relevant advertising. Google has proven this with its success. Until recently, Microsoft had left this massive revenue stream virtually untapped. Now, instead of charging for some of its services, more of them will be paid for through advertising, with premium ad-free versions, of course. Perhaps someday they'll stop charging for Encarta Online! This is a new strategy for Microsoft and one that is still hotly debated internally and externally.

After all, according to the previously mentioned Fortune article, more than three times as much is spent each year on advertising as on software, and a growing proportion of that money is being spent online. The key is accountability. Google's CPC ads are the most relevant and targeted advertising system seen to date, and its auction system sets prices based on supply and demand. For those of you who remember the late 90s, advertising has come a long way since 468x60 banner ads at $2 CPM. Google has pioneered this and been successful; now Microsoft is starting to wake up to the reality that well-targeted advertising makes money.

New Technology
There are two major technical forces driving this new era of innovation (sometimes referred to as Web 2.0): broadband and browser adoption. The biggest driver behind this shift towards web-based services is the growth of broadband. After all, what good is having unlimited storage on the network if you can't access it quickly? To this end, Google and Earthlink are partnering to make broadband truly ubiquitous (at least in large urban areas). In San Francisco, San Diego, Philadelphia, New York, and other major metropolitan areas, Google is providing the advertising and Earthlink is providing the wireless infrastructure. Meanwhile, Verizon is rolling out fiber to the home and spending billions on broadband infrastructure. Broadband isn't there yet, but it has reached a critical mass (at least in the US) of more than 100M subscribers (from the Google Analyst Day notes) and continues to grow.

Lastly, semi-modern browsers have finally become widely adopted. For all its flaws, IE6 is a semi-capable browser, and IE7 promises to be much better. This technology adoption has led to the growth of DHTML and CSS, enabling dynamic AJAX-based applications. Both Microsoft and IBM are leading the way here with the next generation of web development tools: Microsoft recently released Atlas for Visual Studio / ASP.NET, and IBM just open-sourced its AJAX toolkit as part of the Eclipse ATF project. Google has the head Firefox developers; Microsoft has IE. Both teams will work hard to ensure a good "it just works" experience for consumers with their next-gen services.

Monday, April 10

Free as in Beer Java HTML Parsers

No, I'm not talking parsley, I'm talking parsing -- not to be confused with one of my favorite grassy food garnishes. Looking to do some work with HTML? Let's talk options. First, let's face it: Sun's parser sux. You'd think Sun would include a decent HTML parser in the core libraries, but they don't. The Swing HTML parser is a joke. So, we'll be going open source for some industrial-strength workhorses.

There's something for (almost) everyone's parsing needs. Does it need to be fast, work on anything, and give you access to the DOM? You can't have everything! If it's a DOM you want, you are going to take a performance hit -- it's going to be slower and less reliable because HTML docs are messy. For my current situation I need a DOM tree, so I am going to focus primarily on those solutions. For those of you who want a fast, non-tree-based tag parser, you might take a look at the Jericho HTML parser, but I haven't used it. Now, on to the DOM parsers!

Here are the top three free Java HTML parsers (tree-based), to my knowledge.
Let's start with the basics. Nutch (the "NPR of search engines") uses NekoHTML and TagSoup as its primary supported parsers. Here are some recent benchmarks (March '06) I ran across on the Nutch user mailing list (they were running into problems with Neko hanging). The benchmarks are from recent, but not the latest, versions of these two parsers. Here's the summary:

NekoHTML is faster than TagSoup, but TagSoup parses almost anything and is generally more reliable.
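
To give a feel for what using them looks like, here is a minimal sketch of getting a DOM out of each (these are the standard NekoHTML and TagSoup entry points; error handling omitted, and the test markup is deliberately broken):

  import java.io.StringReader;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.dom.DOMResult;
  import javax.xml.transform.sax.SAXSource;
  import org.w3c.dom.Document;
  import org.xml.sax.InputSource;

  class HtmlToDom {
      // NekoHTML: a DOM parser out of the box.
      static Document withNeko(String html) throws Exception {
          org.cyberneko.html.parsers.DOMParser parser = new org.cyberneko.html.parsers.DOMParser();
          parser.parse(new InputSource(new StringReader(html)));
          return parser.getDocument();
      }

      // TagSoup: a SAX parser, so pipe it through the identity transform to build a DOM.
      static Document withTagSoup(String html) throws Exception {
          org.xml.sax.XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
          DOMResult result = new DOMResult();
          TransformerFactory.newInstance().newTransformer()
              .transform(new SAXSource(reader, new InputSource(new StringReader(html))), result);
          return (Document) result.getNode();
      }

      public static void main(String[] args) throws Exception {
          String messy = "<html><body><p>unclosed tags <b>everywhere";
          System.out.println(withNeko(messy).getDocumentElement().getTagName());
          System.out.println(withTagSoup(messy).getDocumentElement().getTagName());
      }
  }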

I think it's fair to say that Neko and TagSoup are the two most popular. I'm not sure who actually uses HTML Parser... I haven't run into it in production yet. According to the above benchmarks, it didn't distinguish itself in either speed or memory, consistently taking longer to parse and using more memory than its competitors.

Another project worth mentioning is JTidy. JTidy is an HTML cleaner and formatter -- and a gosh-darn good one at that. In the process of cleaning, it also happens to parse the HTML, so it can be used as a parser too. It is in the Nutch sandbox, but it isn't "supported." I have heard of projects using it (mostly for testing), and it is reportedly a very reliable tool that produces a DOM / XHTML from otherwise rubbish markup. However, I haven't used it -- so you'll have to do your own research on that one.
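
For completeness, usage is reportedly along these lines -- treat it as an unverified sketch of JTidy's parseDOM API rather than tested code, since (as I said) I haven't used it:

  import java.io.FileInputStream;
  import org.w3c.dom.Document;
  import org.w3c.tidy.Tidy;

  class TidyParse {
      public static void main(String[] args) throws Exception {
          Tidy tidy = new Tidy();
          tidy.setQuiet(true);    // suppress the (often lengthy) warning chatter
          tidy.setXHTML(true);    // clean up to XHTML while parsing
          // parseDOM cleans the input and hands back a DOM in one step.
          Document doc = tidy.parseDOM(new FileInputStream("page.html"), null);
          System.out.println(doc.getDocumentElement().getNodeName());
      }
  }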

Happy HTML chopping-- keep those knives sharp and practice safe error handling!

Saturday, April 8

Between a Lego Printer and the Cover of Time

Life isn't too bad when you go from creating printers out of Legos to playing with Legos on the cover of Time magazine. I'd heard rumors that some of the early Google computer cases were made out of Legos. Well, thanks to The Search Guy (Stephen Green at Sun), I finally found some interesting pictures of Google hardware, circa 1998.

One thing that has always impressed me about Google is what they managed to accomplish considering their meagre hardware and software constraints. This simple lesson is one a lot of startups could learn: do more with less. The best of the best are bred from adversity, like the Sardaukar in Frank Herbert's Dune. If you are smart and creative enough to survive the Emperor's prison planet, then you are smart, resourceful, and have little compunction about killing.

Although no prison planet, Stanford's CS department certainly did not provide a wealth of free computational power. Computers and parts were reportedly borrowed, stolen, salvaged, and otherwise appropriated. Google's infrastructure needed to be robust enough to handle even the most Frankensteinian hardware and its fickleness. This robust software (TeraGoogle?) allowed Google to create a blazing-fast, reliable service out of cheap bits and bobs. Although, according to some recent reports (Google: A behind-the-scenes look), they are going soft with their new-found wealth.

If you haven't yet, check out Stephen's blog. It's a great read from another programmer who has been in the business longer than I have.

Thursday, March 30

Sphere beta via TypePad widget

I've been waiting to try Sphere for months -- what seems like forever (see my past posts here and here). Well, my wait is over.

Sphere is opening the gates a bit wider after its long closed beta. TypePad today announced a new "widget gallery" to add features to your TypePad blog. You should go check out TypePad Widgets. I wish Blogger had something similar; kudos to Six Apart for the innovation. What immediately got my attention was the Sphere Blog Search widget.

Michael Arrington, from TechCrunch, posted a link to a blogger, Ouriel, who put the Sphere widget up on his TypePad blog. Cool!

If you do a search and click "more results," you are directed to a Sphere search results page. Now feel free to try out the beta -- without an invite! I'm still a little sore about signing up for the beta mailing list after hearing Mary Hodder's UC SIMS presentation and never hearing anything back. But hey, it's out there and I can use it... which is great! More info and a review coming soon.

Wednesday, March 29

Who's your BigDaddy

Or, perhaps more accurately, what's your BigDaddy? (And, perhaps more interestingly, why is it called BigDaddy?) According to Matt, BigDaddy is:
a software upgrade to Google’s infrastructure that provides the framework for a lot of improvements to core search quality in the coming months (smarter redirect handling, improved canonicalization, etc.). A team of dedicated people has worked very hard on this change; props to them for the code, sweat, and hours they’ve put into it.
It started out at one data center and is now live at all of Google's data centers. One of the biggest advantages of this upgrade is improved URL canonicalization -- www vs. non-www, redirects, duplicate URLs, 302 "hijacking," etc... The biggest improvement is most likely that they are better at picking the www vs. non-www versions of URLs. Secondly, they are coming up to speed with Yahoo in regard to problems with redirects and "hijacking." Danny Sullivan over at SE Watch wrote a great article last August on Yahoo's policy regarding redirects and hijacking, including contrasting it with Google's (old) policy.
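
To make "canonicalization" concrete, here is a toy sketch of the kind of normalization involved (real engines do far more, including following redirects and comparing page content; the rules below are illustrative only):

  import java.net.URI;

  // Toy URL canonicalizer: lower-case the scheme and host, drop a leading "www.",
  // drop the port and fragment, and strip a trailing slash.
  class Canonicalizer {
      static String canonicalize(String url) throws Exception {
          URI u = new URI(url).normalize();
          String host = u.getHost().toLowerCase();
          if (host.startsWith("www.")) host = host.substring(4);
          String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
          if (path.length() > 1 && path.endsWith("/")) path = path.substring(0, path.length() - 1);
          String query = (u.getQuery() == null) ? "" : "?" + u.getQuery();
          return u.getScheme().toLowerCase() + "://" + host + path + query;
      }

      public static void main(String[] args) throws Exception {
          // Prints: http://example.com/a/c
          System.out.println(canonicalize("HTTP://WWW.Example.com:80/a/b/../c/"));
      }
  }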

What caught my eye is what I interpret to be an entirely new crawler engine in BigDaddy. Here is the snippet from Matt's most recent post:
Q: “What’s the story on the Mozilla Googlebot? Is that what Bigdaddy sends out?”
A: Yes, I believe so. You will probably see less crawling by the older Googlebot, which has a User-Agent of “Googlebot/2.1 (+http://www.google.com/bot.html)”. I believe crawling from the Bigdaddy infrastructure has a new User-Agent, which is “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
Webmasters are reporting that the new crawler seems to support CSS, JavaScript, and other features of modern browsers (such as form support). It sounds like a Mozilla-engine-based crawler. Truly an innovation in web crawling, considering most crawlers, including the open source Nutch and Heritrix, are text based. This is huge! Google must be taking a performance hit, but I guess that's part of the reason for all the new hardware they've been buying. It looks like they've been putting those Mozilla/Firefox developers (Ben and Ryan, are you out there?) to work!

For starters, quality will improve because Google can tell more accurately what users actually see on the page. No more hiding DIVs with CSS or JS to stuff keywords!

Secondly, there will be better coverage of sites using JavaScript-based navigation and content rendering. Even in the engineering world, there are sites like McMaster that base their entire site on JavaScript, with no crawler-friendly version. Text-based crawlers, such as the old Googlebot, don't do very well on such a site -- see for yourself. It's too early to tell whether or not the new Googlebot will improve this, but I would imagine so. Couple these crawling improvements with improved URL canonicalization and you have a higher-quality index.

Some people have claimed that the new crawler is "blazing fast" compared to the old Googlebot. While I believe it may seem that way to webmasters because Google is crawling more aggressively, I find it highly unlikely that the software itself is faster. If the new crawler is using a Mozilla-based engine, it MUST be slower than the text-based crawler because of all the new work -- JavaScript parsing, CSS rendering, etc... -- which it hasn't done in the past.

I believe Google is crawling more aggressively because it is trying to re-crawl a large portion of the web very quickly. If you think about the impact of modifying the way URL canonicalization works along with a new crawler engine, it follows that you will probably need to re-compute PageRank. Crawling gently is not something you can do at this scale if you want to propagate these changes quickly. In the process, Google is generating some webmaster complaints. Along this line, Search Engine Journal has a recent article on the topic entitled Mozilla Googlebot: Mozilla or Godzilla.

Thursday, March 23

Diffing search engine stop words lists

Here is an interesting question: what words does Google ignore? I believe it must be query dependent. Try a search for "where how" and then "where how computers work". The "where" and "how" are not ignored in the first query, but they are in the second.

Nonetheless, there might be a "standard" set of words that Google ignores, along with perhaps query-dependent words in some cases.

I have yet to find a complete or up-to-date list of stop words (a.k.a. words Google "ignores"). It doesn't really ignore them for the purposes of ranking, but that's a whole 'nother story.

What about Yahoo or MSN? Stop words are application/domain specific, so what words are the search engines using? Do they differ, and if so, how?

Perhaps I'll experiment with this; it shouldn't be hard to figure out. In the meantime, has anyone already done this?
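
The comparison itself is trivial once the lists are collected; a sketch (the word lists below are made-up placeholders, not any engine's actual stop words):

  import java.util.*;

  // Diff two stop-word lists. The example lists are placeholders, not real engine data.
  class StopWordDiff {
      public static void main(String[] args) {
          Set<String> engineA = new TreeSet<>(List.of("a", "an", "the", "of", "how", "where"));
          Set<String> engineB = new TreeSet<>(List.of("a", "an", "the", "of", "www", "com"));

          Set<String> shared = new TreeSet<>(engineA); shared.retainAll(engineB);
          Set<String> onlyA = new TreeSet<>(engineA);  onlyA.removeAll(engineB);
          Set<String> onlyB = new TreeSet<>(engineB);  onlyB.removeAll(engineA);

          System.out.println("Shared: " + shared);
          System.out.println("Only engine A: " + onlyA);
          System.out.println("Only engine B: " + onlyB);
      }
  }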

One thing that continually impresses me about Google is their attention to the "little things" which make search incrementally better. Dynamic stopping is one of those things. Another is abbreviation identification (try a search for "ACM" and see that Google highlights "Association for Computing Machinery" in the search results).

Tuesday, March 14

Vortals -- Spiraling upwards, or down the drain

Raul Valdes-Perez, from Clusty / Vivisimo, has been making the rounds claiming that vortals are making a comeback. While I have seen specialized search take off, I have to disagree: I haven't seen vortals take off. He is going to give a talk on vortals at the upcoming InfoNortics search technology conference. However, the talk hardly looks like anything new for the conference.

Raul outlines the five reasons vortals are making a comeback in several recent media outlets, and he released a short document on the Clusty website. Let me summarize the five reasons:
  1. Bandwidth and hardware were expensive, now they're cheap.
  2. Web crawling was hard (and expensive), now it's easy (and cheap).
  3. Search engines were unreliable, so you had to create your own in-house index. Now that they are reliable (most of the time), all you need is metasearch. (Ironically, one reason metasearch originated was because search engines were unreliable, as I discuss in my meta-search history post.)
  4. You don't have to build a big, complex, directory because web search is now relevant.
  5. CPC ads will pay for all this, so it all works out.
And of course, don't forget to cluster the results using Vivisimo's whiz-bang technology. As cool as clustering technology is, I don't think it's as much help here as is claimed. Don't get me wrong, Clusty is really cool and has some interesting technology; if you haven't tried it, you should. I hope to write more about it in my series on metasearch. But I wouldn't put as much importance on it as he predictably does.

The problem I see is this: people like Google, and getting them to switch is hard. To accomplish this you need something new, remarkable, and very compelling in order to compete. The above isn't. At best, what he proposes is an incremental improvement over Google -- not enough to get people to switch, in my opinion.

Oh, and creating a really good vortal isn't as easy as it sounds! According to Raul, "The entire process can take as little as a week or two." I don't buy it. It takes blood, sweat, and tears. It also takes time, time to perfect the product and build partnerships with information vendors to get "hidden web" content. Creating something unique and useful, real innovation, isn't something that is done overnight.

So, what is the future? One possible set of answers is personalized search and topical search, but not mashed up as simply as Raul's solution. Google has been rolling out personalized and topical search quietly but aggressively. An old demo of topical search is "site-flavored search." Take a look. It correctly identifies my page as Internet- and Programming-related (out of the DMOZ categories?). Not too shabby! It is very coarse-grained, as Greg Linden has noted, but then again so are most topical search engines. However, looking at the results of my first search, I'm not delighted. My query for "eclipse sunshade" didn't produce the most relevant results, confusing my search for an Eclipse IDE plug-in with a car windshield accessory.

This technology, and much of the rest of Google's personalized search tech, as Greg Linden points out, leads back to Kaltix. He points to a fascinating paper, "Scaling Personalized Web Search," written by Kaltix's founders at Stanford; they most probably work at Google now, since Kaltix's acquisition. Their paper also references a very interesting paper on "Topic-Sensitive PageRank," which is relevant to the discussion; some readers may find it interesting as well -- I did.

Perhaps I'm wrong about all of this; perhaps Clusty-inspired vortals shall conquer the land. However, for now, I'll stay alert and keep looking for something truly remarkable, like a purple cow.

Wednesday, March 8

Microsoft Livens Up Competition with New Windows Live.com

Today, Microsoft released major upgrades to Windows Live.com. Apparently, Microsoft wasn't all bluster last week when they were talking about major upgrades to search. You can read it from their perspective on the Live.com blog on MSN Spaces. This also coincides with their push to get Hotmail users to jump on board the Windows Live Mail beta, which I mentioned last night.


It is breaking on Read/WriteWeb, Scoble, and in the mainstream media as well. The new Windows Live Search features new capabilities for image search, news search, RSS feeds, mail, local search, and shopping. Ya, ya, ya, we know all that; it's finally coming up to standard with Yahoo / Google. Those aren't as interesting to me as some of the new UI features and cool new "Web 2.0" tools.

In addition to the standard fare, MS has rolled out some innovative UI features, such as an Ajaxy "Smart Scroll" that allows users to have an infinite scroll experience. Maybe 10 results per page won't be the magic number anymore? Another innovation is the "slider bar," which controls the amount of content displayed for a given search result. Toggle it down and all you see is the title and URL. Toggle it all the way up and you get a search box that says "Search this site:"

Now, here's where it gets interesting. Microsoft also announced it purchased Onfolio. Onfolio lets people clip and save text and links from the web. It sounds like a product that should be a part of the Yahoo MyWeb platform and the Yahoo toolbar. Interesting.

According to the Onfolio press release, dated yesterday, March 7:
Today at the O'Reilly Emerging Technology Conference, Microsoft Corp. announced the acquisition of the assets of Onfolio Inc., a privately held, Cambridge, Mass.-based Internet research and information management provider. Onfolio's technology has been incorporated into the Windows Live™ Toolbar to enhance the way people discover, save and reuse their personal and professional Web research.
Apparently, things are running a bit behind schedule because the link in the press release is not Live yet. Sorry, the bad puns from some of my co-workers must be rubbing off. I was digging around Onfolio and Live.com and couldn't find the download link for the toolbar. Then I ran across this quote on Onfolio's download page saying that it will be available on the afternoon of March 8th. Looks like things slipped a little. Oops!

Update 3/15: The toolbar is now available for download. I've downloaded it, but haven't had a chance to try it out much, yet.

The other thing that caught my attention, which no one seems to have made a big deal about yet, is the feature reported by the Seattle P-I: "In addition, the company says Windows Live Search will offer the ability to quickly conduct a search within a limited set of a person's favorite sites."

Update:
After reading the SE Watch article, it appears this feature is search macros. According to Chris Sherman, "Macros are flexible, easy to create and share. As an example, you could create a Macro that searched for terms across a set of web sites that you specify, in effect, creating your own personalized vertical search tool." This looks like Microsoft's answer to the Yahoo-based Rollyo, which I have discussed previously. Fascinating! In the hidden drop-down (click the down arrow) after the "Feeds" link on the search results, there is an option to "Find Macro," but clicking on it currently redirects you to the Live.com homepage. Looks like we will have to wait a little longer for this feature.

Update 3/15: Macros are now finally available as well, under the Gadget Gallery, and can be installed on your homepage. They allow people to create lists of sites like (prefer:site:joelonsoftware.com OR prefer:site:jeremy.zawodny.com). However, the query is very verbose compared to a simple list of sites like Rollyo's. Users shouldn't have to read something that looks like SQL in order to use it. Still, it's a good start, and I look forward to seeing how they expand on it. Meanwhile, I hope they improve the Gadget Gallery's usability -- perhaps search functionality (isn't that ironic?) and some categories?

Go take a look and let me know what you think of the new Live.com.

Tuesday, March 7

Victim of Yahoo

My fiancée, Krystle, is a victim of Yahoo -- of a live site test, that is. Below is a screenshot of Yahoo testing a new homepage format:


It feels much cleaner and less cluttered than their current homepage. Below is the current homepage:


The new homepage has less text and more icons. I like the single column navigation bar on the left. The new UI also has fewer broken image links than the current page. C'mon guys, what gives?

More and more sites are doing live site testing, diverting a portion of users to new features. If you've been watching Globalspec.com, you might have noticed similar happenings (TIP: Check back very soon!). Like Yahoo and Google, we've been doing a lot of UI testing this way. It's a great way to try out new features and see what happens to important metrics.

Good times.

There has been a lot going on, which I've been digesting. For example, Microsoft invited Hotmail users to participate in the Windows Live Mail beta. Now if only I would finally get an invite to test out the Yahoo Mail beta. I've been waiting for at least four months, guys -- it's about time!

I look forward to writing more soon, and maybe getting back to my series on meta-search.

Tuesday, February 21

Yahoo -Advanced -Search AND MSN anti-contextual ads

I was using Yahoo today and discovered what I believe to be a bug in their advanced search functionality. The minus (-) operator does not appear to function correctly.

See for yourself.
  1. First, try a search for "valves".

  2. Next do a search for "valves -pumps". The first result is for www.breastfeeding.com, which does not have any on-page mentions of the word valve. Furthermore, it contains the word "pumps" very prominently in relation to breast feeding "pumps & pumping."


  3. Finally, just to see the extent of the brokenness, do a search for "valves -12345678910111213". Hopefully, I have picked a string unlikely to occur in any common search result. However, instead of returning the original set of results, we are back to breast pumps and motor sports.
It's very surprising to see this from Yahoo. Has anyone else noticed this? None of the other major SEs have this problem. Time to file a bug report. However, Yahoo appears to be missing something I think is important:



In light of Yahoo's "Social Search" initiative, I hope I'm not the only one who sees the irony in this.

In the process of showing this to co-workers, we stumbled upon another interesting "feature", this time in MSN. MSN ads do not handle the minus operator. Try a search for "valves -pumps".



The "contextual" ad results are exactly what we subtracted from the query, with some stemming! Try just "valves" to see the difference. Thanks to my co-worker, Joan, for pointing this one out to me. Possibly a result of the way their ad code is parsing the query terms?

That about does it for today's search engine "feature" round-up.

Tuesday, February 14

The Blog Authority Hustle

Mike over at TechCrunch was blogging about Technorati rolling out a new Authority Search Filter (a "slider" pattern, for those who read my previous post). He asks the question: what is a better way to determine authority than links into a blog? Well, one way is to go beyond link counting and use a topic-specific, dare I say "vertical," community approach. From what I understand, Sphere is doing just this. Speaking of which...

What is going on with Sphere? I signed up for the beta waiting list and haven't heard a peep in months. Furthermore, the last coverage I saw in the press was in October. Are they planning on launching soon? Can we get a status report, guys?

Mary Hodder's research on topic communities is one of the bases of Sphere. Here is more information from her blog, Napsterization:
In particular, I spent a couple of years working on the topic browsing of blog data, developing a front-end php-based system, 150 pages of usability, user and design research, preliminary algorithm design for determining topic communities through multiple metrics, scoping search up and down those communities, as well as weighting bloggers through other combinations of those same metrics (multiple metrics reduce the power law effects that a single metric can have on a community), as well as blogger profiling.
I would still love to see Mary's research; it sounds fascinating. Incidentally, she gave a great presentation on blog search to the UC SIMS 141 class. I highly recommend the webcast.

New AJAX Libraries, tools, and design patterns

Yahoo released two cool pieces of code yesterday. The Yahoo UI Blog announced the release of an AJAX UI library and also a UI design patterns library.

If I get a chance, I want to look at the drag-and-drop code as well as the DOM tools. I hope it's well documented! One of the "hidden costs" of free software is that oftentimes you have to figure it out yourself because of a dearth of up-to-date documentation.

On a related topic, I was digging around yesterday looking for a JavaScript debugger for Eclipse. I'm working on my wedding website, which I am writing in PHP using Eclipse, and I was doing some JavaScript for my online RSVP form. I think it's pretty neat -- the menu selector, the ability to add people, etc... use a fair bit of JS. A good debugger is worth its weight in gold.

I ran across the news that IBM will be contributing the AJAX Tools Framework to the Eclipse project. Some of the cool features are an integrated JavaScript debugging environment and a DOM inspector. It will run Mozilla inside Eclipse. Sounds pretty cool. They hope to have a prototype for feedback in Q1 and hopefully something more robust by Q2.

If you are looking for a good JavaScript debugger now, your choices are MyEclipse (an Eclipse plug-in), the Venkman Firefox plug-in, and of course good ol' MS Script Debugger. I don't really like Venkman; the UI is just a bit clunky compared to what I am used to in VS / Eclipse. I haven't tried the MyEclipse editor because I haven't wanted to shell out the cash -- after all, I am saving for a wedding here!

It's nice to see some of these technologies taking off now that we have crossed some major browser adoption milestones for CSS capable browsers. It's a fun time to be a developer.

Tuesday, February 7

Search for Dorks

No, it's not a geek dating service -- that's already been done. I thought I would mention two new "vertical" search engines targeted at coders and CS geeks.

For a coder's needs, Google and the other major search engines have major shortcomings. For starters, they strip out important punctuation characters like semicolons, slashes, asterisks, parentheses, etc... What stinks even worse is when you are looking for a variable name or some other specific piece of code but only know part of the name -- GYM only match complete words. Taking it one step further, us CS dorks like to search using the power of regular expressions! And don't even get me started about their case (in)sensitivity. The bottom line is that there is major room for improvement, and that's what several new search engines are trying to do.
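
To make the contrast concrete, here is the kind of query a programmer actually wants, sketched as a plain case-sensitive, punctuation-aware regex grep over a source tree (the pattern and directory are made up for illustration):

  import java.io.IOException;
  import java.nio.file.*;
  import java.util.regex.Pattern;
  import java.util.stream.Stream;

  // Sketch: case-sensitive, partial-identifier, punctuation-aware search over code --
  // here, any call to a method whose name starts with "parse".
  class CodeGrep {
      public static void main(String[] args) throws IOException {
          Pattern pattern = Pattern.compile("\\bparse\\w*\\s*\\(");
          try (Stream<Path> files = Files.walk(Paths.get("src"))) {
              files.filter(p -> p.toString().endsWith(".java"))
                   .forEach(p -> grep(p, pattern));
          }
      }

      static void grep(Path file, Pattern pattern) {
          try {
              int lineNo = 0;
              for (String line : Files.readAllLines(file)) {
                  lineNo++;
                  if (pattern.matcher(line).find()) {
                      System.out.println(file + ":" + lineNo + ": " + line.trim());
                  }
              }
          } catch (IOException e) {
              // skip unreadable or non-text files
          }
      }
  }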

Krugle
Krugle was announced yesterday at the DEMO conference. John Battelle has his coverage here. "Krugle is a search engine for programmers," according to co-founder and CEO Steve Larsen. Ken Krugler, co-founder and CTO, writes, "While current search engines are OK at finding Web pages, they don't crawl source code repositories, archives or knowledge bases, and they don't leverage the inherent structure of code to support the types of searches programmers need."

Exploiting the structural properties of code would allow programmers to find code more like they would in their IDE than on a web page. More on this later. Exploiting the structure of code opens up a whole new realm of visualization and UI possibilities that are very exciting! Instead of searching within this site, search within this project, or within this logical grouping (like a package). Imagine being able to specify language-dependent match types -- is it the name of a class, a variable, a method, or only in the comments? Could the search engine perform "code translation," translating Java into C# and vice versa? That would be very awesome, especially if you were trying to learn a new language.

In addition to code, Krugle also searches technical articles, bug reports, documentation, standards, etc... After all, when you've got to navigate this huge API or, even worse, this one, there's got to be a better way to find the information you need.

The last major feature is the social aspect of search, which I think is more hype than substance at this point, but I could be wrong. It's an idea that could open up some very interesting possibilities. In Krugle, users can comment on and tag results and share them -- in their words: "save, annotate, and share your search results with others."

It is currently in a closed beta but is scheduled for release at the O'Reilly ETech conference in March. You can sign up for the beta if you are really interested in getting a sneak peek.

Koders
Koders is a code search engine that has been around for about a year. Recently, in December, it released plug-ins that integrate it with Eclipse and Visual Studio. The plug-ins use its SmartSearch™ technology to find and recommend code similar to the code you are currently writing or viewing in your editor. In their words:
Koders.com helps developers navigate the rich but fragmented open source landscape by indexing thousands of open source software projects and more than 190 million lines of code at leading universities, consortiums and organizations including Apache, Mozilla, Novell Forge, SourceForge, and others.
There is still lots of opportunity for improvement. I'm not too impressed by the results for my test query, "Lucene" -- a popular open source IR library. Lucene's primary language is Java, and yet the first page of results is exclusively C. Even using a nice feature that allows you to restrict by language, restricting to Java still does not bring up any of the code from the Lucene project site. Instead, the first result is from some person's thesis: br.ufpe.liber.theses.examples.lucene.

The Eclipse integration is a great step in the right direction; I'll have to give the plug-in a try and give it a fair shot. Search inside the application is where the future of search is heading. After all, it provides a better grasp of the ever-elusive "context" which SEs always seem to lament. This is a great feature to keep an eye on; there will be a lot more of this in the future.

Prospector Tool
I mentioned previously that it would be really cool if a search engine could exploit the structural properties of code to provide better SERPs. While not technically a search engine, one software engineering tool that I have been fascinated with is Prospector, which allows developers to mine code for snippets and examples. Have a File and need to read its input? Easily find and display other uses of that class as you type within Eclipse. In the authors' words, "Prospector scans and analyzes APIs and bodies of existing application code in advance, and then synthesizes code snippets on the fly in response to programmer queries, solving problems in seconds that otherwise take hours of searching documentation."

I highly recommend checking out David Mandelin's homepage for papers and presentations on code mining. Very neat research. Might be good for the developers of Krugle to look at.

It's an exciting time to be a programmer. The search engines described above and the Prospector tool are new and still immature. They are still in beta (what isn't these days!). It will be exciting to see how they develop over the coming year or two. I'm looking forward to some real innovation.

That ends our search for dorks. It's getting late, and my supply of Penguins has run out. Time to go into Suspend Mode.

Tuesday, January 31

Globalspec 2005 Results

Globalspec, my employer, today made its 2005 year-end announcement. It is a privately held company, so there is no earnings information, but the numbers are interesting nonetheless -- although they will probably be overshadowed in the press today because Google releases its own year-end numbers later this afternoon.

Here is a sample from the press release:
During 2005, GlobalSpec sales volume grew 51% over 2004, while the company's worldwide user base expanded to more than 2 million registered users. GlobalSpec continues to add new registered users at the current rate of 20,000 per week. SpecSearch®, GlobalSpec's trademarked technology allowing engineers and other scientific and technical professionals the ability to search by product specification, now includes more than 95 million parts in 1,400,000 product families from over 17,000 catalogs.
One thing it doesn't say, which I will add, is that The Engineering Web grew significantly in 2005, both in the amount of content and in its quality. For instance, in 2005 Globalspec created lots of new partnerships, adding important and useful content from partners like the IEEE and Knovel. We cracked down on spam, which is a continuous process, and made significant investments in resources to grow the breadth and depth of content. We also made significant changes to improve the freshness of the content.

But, that's just my opinion. Try some searches on The Engineering Web.

I would be interested in hearing people's opinions about its strengths, weaknesses, and what we could do to make it better. Has anyone noticed the improvements made in 2005?

Meta-Search Part I: The Beginning

Today, I start the first in what I hope will be a series of posts on meta-search. Meta-search engines have been around almost since the beginning of web search, and they continue to spring up almost daily, like Gravee, which I reviewed earlier. They continue to be controversial and interesting, as evidenced by articles such as this recent one on Search Engine Watch. I've always enjoyed looking at history and the development of technology, so I thought a fun way to jump into the fray would be to look at the technical innovators of yesteryear. Although there have been meta-search engines that searched databases for a long time, I am going to restrict my discussion to meta-search on the web. I am going to take a look at the first two web meta-search engines, SavvySearch and MetaCrawler. The problems they attempted to solve are many of the same problems facing search today.

Background: The beginning

I have seen several resources, such as this history of search engines, that wrongly identify MetaCrawler as the first web meta-search engine. The first was in fact SavvySearch, though not by a wide margin. SavvySearch and MetaCrawler were both university research projects released within months of one another, SavvySearch in March 1995 [2] and MetaCrawler in July of the same year [1]. The bottom line is that they were both in development at the same time, in 1994, and were released to the public in '95.

SavvySearch


SavvySearch was a research project out of Colorado State University that attempted to provide a centralized interface to web search engines and specialized databases through intelligent service selection. It gave users a "search plan" to execute their query against the most relevant subset of engines and databases for that query. The goal was to balance two conflicting objectives: "minimizing resource consumption and maximizing search quality."[2] SavvySearch attempted to do back in 1994 what Gary Price describes as the future in the aforementioned SEWatch article,
For a long time I've said verticals will continue to grow in popularity and importance as meta search tools which are getting better all of the time will allow various database and content publishers to offer material (free or fee) to end users who will select these databases at the time of their search based on their information need.
Like Gary's vision, SavvySearch searched not only web content, but also specialized databases like Roget's Thesaurus, CNet's Shareware.com, and the IMDB. SavvySearch's raison d'être was that no search engine, or even group of engines, was large enough to contemplate crawling the entire web. The major engines (Aliweb, WebCrawler, Lycos, Yahoo, etc.) lacked good coverage of the web. Furthermore, this was before major specialty sites started creating crawler-friendly, database-driven pages or providing DB feeds directly for indexing. SavvySearch's sibling, MetaCrawler, also addressed the coverage issue, but instead of focusing on intelligent service selection, it focused on problems of freshness and relevance.
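To give a feel for what intelligent service selection involves, here is a toy sketch of the idea: score each engine or database for the query's topic and dispatch the query only to the top few. The engine names and scores below are made up for illustration, and this is my own sketch, not SavvySearch's actual ranking algorithm:

import java.util.*;

public class ServiceSelectionSketch {
    // Hypothetical per-topic scores for each service -- in a real system these
    // would come from past performance (which engines returned useful results).
    static Map<String, Map<String, Double>> topicScores = Map.of(
        "movies",   Map.of("IMDB", 0.9, "WebCrawler", 0.4, "Shareware.com", 0.1),
        "software", Map.of("Shareware.com", 0.8, "WebCrawler", 0.5, "IMDB", 0.1));

    // Build a "search plan": the k services most likely to give good results,
    // balancing quality against the cost of querying every engine.
    static List<String> searchPlan(String topic, int k) {
        Map<String, Double> scores = topicScores.getOrDefault(topic, Map.of());
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(searchPlan("movies", 2)); // e.g., [IMDB, WebCrawler]
    }
}

The hard part, of course, is coming up with good scores and deciding how many engines are worth the extra load -- exactly the resource-versus-quality tradeoff the SavvySearch authors describe.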

MetaCrawler
MetaCrawler was a project out of the University of Washington by graduate student Erik Selberg (advised by Oren Etzioni) that tackled not only recall but also the staleness and (ir)relevance of search results from the engines of the day. Unlike SavvySearch, it queried only web search engines, although one of the proposed future projects was to extend it to include databases [2]. While it did attempt to solve recall problems, its primary aim was to "verify" pages by fetching them to ensure their existence and freshness.


MetaCrawler saved the user time and work by querying six search engines: Galaxy, Infoseek, Lycos, Open Text, WebCrawler, and Yahoo!. It then de-duped and fetched all of the pages returned. That's right, it fetched every page returned by every engine! It "verified" results, eliminating dead pages and modified pages that were no longer relevant. According to their paper, on average 14.88% of search results were removed because they were "dead." In addition to removing dead pages, MetaCrawler re-scored pages against the query terms and dropped pages that had changed since the search engine indexed them.
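Here is a minimal sketch of what that verify step looks like, assuming you already have the merged result URLs in hand. It only checks that pages still exist (MetaCrawler also fetched the full content for re-scoring), and it is my own illustration, not MetaCrawler's code:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.*;
import java.util.concurrent.*;

public class VerifySketch {
    // De-dupe the merged result lists, then check each URL concurrently
    // and keep only the pages that still respond.
    static List<String> verify(List<String> mergedResults) throws InterruptedException {
        Set<String> unique = new LinkedHashSet<>(mergedResults);
        ExecutorService pool = Executors.newFixedThreadPool(16);
        Map<String, Future<Boolean>> checks = new LinkedHashMap<>();
        for (String url : unique) {
            checks.put(url, pool.submit(() -> isAlive(url)));
        }
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Future<Boolean>> e : checks.entrySet()) {
            try {
                if (e.getValue().get()) live.add(e.getKey());
            } catch (ExecutionException ignored) { /* treat as dead */ }
        }
        pool.shutdown();
        return live;
    }

    static boolean isAlive(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode() < 400;
        } catch (IOException e) {
            return false;
        }
    }
}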

In the process of re-scoring the pages it fetched, it also generated query-sensitive page snippets. Back in this pre-Cambrian era of search, there were no query-sensitive summaries; it was too expensive to store the cached content. Instead, search engines provided a list of URLs with a query-independent description of each page. Users were left to hunt and poke to discover how a page related to their query. MetaCrawler improved the perceived relevance of search because users could more easily understand why a result was returned. Selberg describes this list of "references": "Each reference contains a clickable hypertext link to the reference, followed by local page context (if available), a confidence score, verified keywords, and the actual URL of the reference. Each word in the search query is automatically boldfaced." This feature would not be duplicated again (to my knowledge) until 1999, when Google released their search engine.
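A toy version of the idea -- pick the window of page text with the most query-term hits and boldface the matches -- looks something like this. It's a sketch under my own simplifying assumptions (single window, naive whitespace tokenization), not how MetaCrawler or Google actually did it:

import java.util.*;

public class SnippetSketch {
    // Find the window of words containing the most query terms and
    // boldface the matches -- a toy query-sensitive summary.
    static String snippet(String pageText, Set<String> queryTerms, int windowSize) {
        String[] words = pageText.split("\\s+");
        int bestStart = 0, bestHits = -1;
        for (int start = 0; start + windowSize <= words.length; start++) {
            int hits = 0;
            for (int i = start; i < start + windowSize; i++) {
                if (queryTerms.contains(words[i].toLowerCase())) hits++;
            }
            if (hits > bestHits) { bestHits = hits; bestStart = start; }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = bestStart; i < Math.min(bestStart + windowSize, words.length); i++) {
            String w = words[i];
            sb.append(queryTerms.contains(w.toLowerCase()) ? "<b>" + w + "</b>" : w).append(' ');
        }
        return sb.toString().trim() + " ...";
    }

    public static void main(String[] args) {
        System.out.println(snippet(
            "MetaCrawler verified pages and generated query sensitive summaries for users",
            Set.of("query", "summaries"), 8));
    }
}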

The biggest drawback to MetaCrawler was that it didn't store cached pages. In contrast with SavvySearch's emphasis on economy of resources, MetaCrawler was very bandwidth- and time-intensive. Its fetcher was highly optimized so that it could simultaneously download over 4,000 pages at a time [1]. Quite an accomplishment! However, fetching every result for every query doesn't scale, even with the benefits of caching. Instead, Selberg proposed that MetaCrawler could be a client-side application and that ISPs could provide caching to speed page fetch time. However, this approach would have required substantial client-side bandwidth in the pre-broadband era. Even with MetaCrawler's highly optimized fetcher, a query took on average over two minutes to verify all of the pages [1] (page 8, table 4).

A brief comparison
SavvySearch searched up to 20 engines at once, while MetaCrawler queried only six. SavvySearch included topic-specific directories and databases, while MetaCrawler only searched web search engines. MetaCrawler was slower, but more reliable, than SavvySearch [4]. Because MetaCrawler fetched all pages, it could support more advanced query functionality, such as the minus operator and restriction to a country or a particular domain name extension. SavvySearch, on the other hand, did not support advanced query formats: because it did no processing of pages itself, it was reduced to using the lowest common denominator. Neither provided a way to leverage the full advanced query power offered by most engines.

Conclusion
A primary reason both of these engines came into being was that in the pre-Google, even pre-AltaVista era, a single engine providing even modest coverage of the web seemed impossible. MetaCrawler's creators, Selberg and Etzioni, write:
Skeptical readers may argue that service providers could invest in more resources and provide more comprehensive indices to the web. However, recent studies indicated the rate of Web expansion and change makes a complete index virtually impossible.
They cite an interesting source, two researchers at CMU who helped develop the Lycos search engine. Mauldin and Leavitt write in their paper Web Agent Related Research at the Center for Machine Translation: "First, information discovery on the web (including gopher-space and ftp-space) is now (and will remain) too large a task...the scale is too great for the use of a single explorer agent to be effective." Advances in storage technology and cheap bandwidth now allow GoogleBot and other search engines to do just that.

SavvySearch and MetaCrawler paved the way both for today's search engines and for the next generation of meta-search engines. MetaCrawler was purchased by Infospace and continues to operate as a meta-search engine, though it bears little resemblance to its former self. It provided a platform for research on the next generation of meta-search engines: HuskySearch, which researched AI approaches to query refinement, and Grouper, which explored document clustering in meta-search. MetaCrawler's dynamic summaries are now the de facto standard, with Google being a primary pioneer in bringing them to the masses. The problems these search engines attempted to address -- the proliferation of search engines and the lack of stability and coverage in their results -- continue today. There are more engines than ever, and there is still a significant amount of difference between the results of even the major engines. In future parts of the series we'll take a closer look at some of these problems and at how today's meta-search engines try to address them.

References and Resources
[1] E. Selberg and O. Etzioni. Multi-Service Search and Comparison Using the MetaCrawler, 1995.
[2] D. Dreilinger and A. Howe. Experiences with Selecting Search Engines Using Meta-Search, 1997.
[3] MetaCrawler, HuskySearch, and Grouper.
http://www.cs.washington.edu/research/projects/WebWare1/www/metacrawler/
[4] Sonnenreich. A History of Search Engines.
http://www.wiley.com/legacy/compbooks/sonnenreich/history.html

Friday, January 27

How Old Are You Now: Search Engine Age Demographics

My post yesterday on the seeming dominance of Google among the younger generation prompted me to do some more thinking and research. It piqued my interest in the demographic breakdown of the major SEs. Is Google really THE choice of the youth, or do kids use other search engines as well? My own observation is that Google dominates the younger generation much more so than the older generation, but is my hypothesis correct? What I found in my research was surprising, both in its findings and in the dearth of information. Here are the three studies I found that I believe are the most current and relevant:

Search Engine Usage in North America by Enquiro. (April 2004)
Google Gains Overall, Competition Builds Niches by ClickZ Network (June 2004)
Search Marketing Benchmark Guide 2005-2006 by Marketing Sherpa (September 2005)

The first study has an interesting conclusion on SE usage and age:
We expected to see a trend in search engine usage according to age, but this wasn’t the case. High usage of Google was relatively consistent (at about 70%) and showed no specific trends. (Hotchkiss, Garrison, and Jensen, p 53.)
However, I notice that only 4.4% of the 425 participants sampled were under 20 (p23). This is a very small sample. I'm not an expert in statistics, but 18.7 youngsters doesn't seem like a statistically significant sample -- by my back-of-the-envelope math, the 95% margin of error on any proportion from roughly 19 respondents is in the neighborhood of ±22 percentage points -- certainly not enough to refute my hypothesis. It does cast a hint of doubt, though. Does the result hold up with a larger sample?

The second article, on Google's gains, is interesting because it claims that Yahoo has the largest share of the young audience: 48.23% vs. Google's 43.57% for users 18-34. This isn't that surprising, because this group really came onto the Internet at a time when Yahoo was a dominant player. However, the 18-34 bracket doesn't isolate the college group (18-22) -- the "Google Generation" -- from the pre-Google generation. In short, it doesn't really provide enough data to address the issue.

The last report could prove illuminating if someone has a copy. I would be interested in hearing what it has to say.

To my knowledge, no one has published a detailed look at search engine usage, loyalty, etc. in the younger generation (ages 13-23). If someone did, perhaps we could really validate whether or not this group is the "Google Generation."

Thursday, January 26

The Google Generation and the future of search

I had the opportunity to help out at a local high school recently. I was helping -- OK, more like watching in astonishment as the teens did the work and I bumbled about trying to catch up. They are competing in the US First robotics competition. Very cool competition, with some serious robots -- and some serious controller programming, in C. I never had anything like that in my school. Another thing these kids have that I didn't in high school is Google.

These kids were even more Google-obsessed than I was in college -- they used it for everything. The graphics/animation team was playing with Google Earth, having fun panning around and zooming in on their houses. The team programming the controllers was even more manically obsessed. One student, busy defining hex constants, fired up Google and searched "0xFF - 0x1a". It was their Swiss Army knife; in their words, "it does everything." These kids -- the generation currently in high school and college -- have been dubbed the Google Generation by Kathy Hirsh-Pasek in USA Today.

Being in the search business, I stood back and took it all in. These kids are the future. They are talented programmers, artists, and leaders who are in love with their high school sweetheart -- Google. My generation was the first to use Google in higher education -- it came of age my freshman and sophomore years. For us it is new, cool, and innovative; for these kids it's like breathing air.

If I were at Yahoo or MSN, I would be very very concerned. I would see my kids and their friends coming home from school and doing their homework with Google. It would be frustrating and infuriating! To these kids Yahoo is the "old geezer's" search engine. Today's high schoolers missed the dot com boom. They don't really remember the hype or the Super Bowl commercials. "Do You YAHOOO!!" doesn't mean a thing to them. It is for us old fogies and our elders.

Another illustration of Google's younger demographic is the list of biggest gainers among Google's search queries in 2005. The big winner is MySpace, the teenage social networking site. Next up are Ares, a P2P file-sharing application, Wikipedia, and iTunes. Definitely all hip, cool, and popular with the next gen. Contrast this with the top searches on Yahoo -- MySpace is nowhere to be found. Johnny Carson, Pope John Paul II, and March Madness definitely sound like the young crowd to me, really.

Bill Gates recently announced a media partnership with MTV at CES 2006 to try and capture the young crowd. Yahoo is betting on Flickr and company to regain some street cred. I don't think this will be enough, though I could be wrong. If the robotics team is any indication, they've got to do something -- and fast. If they don't, they will be ignored by the next generation of web searchers. One thing is for sure: as I walked out of the high school, all of the talk about Google's stock price being inflated seemed ridiculous. I would have written a check to Sergey right there on the spot.

Thursday, January 19

Major sites say: "No one home, try next door"

Redirects on your homepage are bad, they are just plain rude -- to visitors, to other webmasters, and to search engines.

Two examples of sites using redirects on their homepages are: eFunda and the IEEE. At least the ACM knows better! These are two of the largest and most important engineering sites on the web and they are just being lazy. Don't get me wrong, lazy can be powerful, but this isn't one of those times. In these cases, lazy is just plain sloppy.

What is the purpose of having http://www.efunda.com/ redirect to http://www.efunda.com/homepage.cfm? C'mon guys, configure your webserver! This is simple to fix. The IEEE's redirect looks like it is using some kind of portal software. Again, fix your config.

Why am I so worked up? Redirects on your homepage are just plain rude. A 3xx-level response indicates a redirection. Specifically, the 302 response these sites use is a temporary redirect. From the specification, "Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests." This is like putting a sticky note on your door saying, "I'm in Olin 110 -- go downstairs, around the corner and knock on the door." You can't count on the person being in the same place, so you have to go check their door again first and read another note. Here is my bottom line: Be polite, meet your guests at the door, and maybe even offer them a cold one.
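If you want to see the sticky note for yourself, a few lines of throwaway code will show you exactly what the server sends back for a homepage request (using eFunda as the example; the exact status and Location you get obviously depend on what the site is doing that day):

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectCheck {
    public static void main(String[] args) throws Exception {
        // Don't follow redirects automatically -- we want to see the 3xx response itself.
        HttpURLConnection conn = (HttpURLConnection) new URL("http://www.efunda.com/").openConnection();
        conn.setInstanceFollowRedirects(false);
        System.out.println("Status: " + conn.getResponseCode());             // e.g., 302
        System.out.println("Location: " + conn.getHeaderField("Location"));  // where it sends you instead
    }
}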

Thankfully, most search engines can traverse these pointers, but it wasn't always the case and you never know. After all, not everyone gets taught pointers.

Friday, January 13

To stop or not to stop that is the question

In my spare time at home, I am writing a parser to extract interesting phrases from search result snippets. In other words, I am looking for phrases with a high degree of co-occurrence across search results. For now, I'm using the Google API for the snippets. I think co-occurrences have lots of interesting applications: query narrowing, clustering, information exploration, etc. Not to mention it's a fun way to play with the Google API and an excuse to learn J2EE/JSP/servlets.

In my program, I am using a stop word list to help narrow the list of candidate phrases. I was inspired by Manning and Schutze's approach to finding collocations (Statistical NLP, page 157), where they use a stop list to exclude words that are not verbs, nouns, or adjectives. This can be contrasted with Justeson and Katz, who use a part-of-speech tagging approach and limit the n-grams to some common grammatical patterns.

POS tagging is nice, but it's a bit overkill for a first pass -- so I am sticking with the stop list approach. I've started compiling a stop word list from various sources. The question is this: what is the best way to apply it?

Do stop words mark the end of a phrase/collocation, or can they occur within one? Words in the stop list are not one of the three grammatical types listed by Manning and Schutze and so, by that logic, should not be included in the phrases. However, Oren Zamir, Etzioni, and Selberg took a slightly different approach when they added clustering to MetaCrawler in the Grouper project. Grouper removed stop words from the beginning and end of the cluster names, but left them in the middle. This required post-processing the phrase/cluster list to remove leading and trailing stop words.

I started out taking the approach that stop words mark the end of my co-occurrence (consistent with Manning and Schutze's POS restrictions), but now I'm not so sure. This might limit the phrases that I discover, but perhaps not by much. Look at the clusters on Clusty (try "food") -- there aren't many with stop words, except when they are part of an entity name, like the "Food and Drug Administration." What do you think? How many useful clusters do you find with stop words in the middle of them? I'm beginning to think the Grouper approach is better for this application, since I am not strictly looking for simple collocations. POS patterns, like those used by Justeson and Katz, would be an elegant solution, but I'm going to blow that one off for V1.0.
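For the curious, here's roughly what the Grouper-style post-processing looks like in code -- keep stop words inside a candidate phrase but trim them off the ends. The stop list here is a tiny made-up one for illustration, and this is my own toy sketch, not Grouper's implementation:

import java.util.*;

public class PhraseTrimSketch {
    // A tiny, made-up stop list for illustration; a real one would be much larger.
    static final Set<String> STOP = Set.of("the", "and", "of", "a", "to", "in", "for");

    // Grouper-style: allow stop words in the middle of a phrase,
    // but strip them from the beginning and the end.
    static List<String> trim(List<String> phraseWords) {
        int start = 0, end = phraseWords.size();
        while (start < end && STOP.contains(phraseWords.get(start).toLowerCase())) start++;
        while (end > start && STOP.contains(phraseWords.get(end - 1).toLowerCase())) end--;
        return phraseWords.subList(start, end);
    }

    public static void main(String[] args) {
        System.out.println(trim(Arrays.asList("the", "Food", "and", "Drug", "Administration", "of")));
        // -> [Food, and, Drug, Administration]
    }
}

The nice property is that "Food and Drug Administration" survives intact, while a phrase that merely starts or ends with filler gets cleaned up.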

I am also trying to decide whether or not to take what I've built so far and integrate it with LingPipe. A reader recommended it to me for this project a while back, and it looks really cool. It has built-in filters, word tagging, sentence boundary detection, and more. My concern is that it might be a little overkill. In addition, it can be more instructive to do things yourself. On the other hand, in the process you often end up re-inventing the wheel.

I plan on putting a JSP/servlet front-end on the system. If anyone has a stable Tomcat server they'd be willing to provide access to, that might help things along. In the meantime, I'll run it off my machine and see if I can't find something better. Weddings have a wonderful way of sucking up spare cash, so a second machine isn't an option for me right now.

That's all for now. Now, back to my coding... and the world of Eclipse!

Wednesday, January 11

Globalspec Most Memorable Spam of 2005: Dogpile Cloakers

I am going to steal a page from Matt Cutts and talk about spam. Globalspec's Engineering Web is a gated community covering only the engineering domain. Considering the limited scope, one might think that fighting spam would be easier than out there in the wilder "horizontal web" -- the world of GYM, right? Wrong. While GlobalSpec may not have to deal with Britney Spears and company (as much), it may surprise you to find that there are lots of people out there targeting the engineering domain with spam.

Over the past year, I've seen pretty much every type of spam the major SEs deal with, and I've seen major improvements in fighting it in 2005. There was spam of every kind: domain hijacking, re-purposed content (ODP/Wikipedia), dynamic content generators, link spam, splogs, etc. (FYI, if you want a good overview of web spam, check out Marc Najork's WebSpam presentation to the UC Sims 141 class.) However, one of the most difficult and insidious spam techniques from a search engine's perspective is cloaking: sending the search engine spiders different content than users see when they visit the page. Search Engine World has a good overview of the different types of cloaking spam.

My two most memorable spammers of 2005 are two cloaking sites that we caught: http://www.cold-forming-company.com and http://www.metal-cold-forming.com. Both of these sites use referrer-based cloaking. It's really sneaky. Let me illustrate:

The URL: www.cold-forming-company.com/coldforgedsteels.htm
Referrer: http://www.cold-forming-company.com/links.htm

Now look at the cloaked version served when the visit comes in off of MSN search results:
Referrer: http://search.msn.com/results.aspx?q=site%3Acold-forming-company.com+cold+forged&FORM=QBNO
Now, I won't pass judgment on Dogpile, but note that unlike Google AdSense spammers, the webmaster doesn't seem to get any tangible benefit here. Dogpile, on the other hand, gets to leech off of other search engines' results and gain traffic, and therefore money. I'll let you investigate and draw your own conclusions.

Here's how the spammers in this network operate. They use referrer-based cloaking: when you come in from a search engine, they detect the external referrer (or direct navigation with no referrer) and inject a few seemingly innocuous lines of HTML into the page. Here is the code they inject:

<frameset>
<frame src="http://www.dogpile.com/info.rawhd/redirs_all.htm?pgtarg=wbsdogpile&qkw=site%3acold%20forming%20company%20cold%20forged&qcat=web">
<noframes>
<script language="JavaScript">location.replace("http://www.dogpile.com/info.rawhd/redirs_all.htm?pgtarg=wbsdogpile&qkw=site%3acold%20forming%20company%20cold%20forged&qcat=web")</script>
</noframes>
</frameset>
Whoa. The original content is still there, but it is effectively hidden by the frameset and JavaScript, neither of which was previously in the page. If you look at the bottom of the rendered page, there is a one-row-high strip, which is where all the old content is displayed.

What's interesting is to examine which SEs have caught the above sites and which ones haven't. How good are the major SEs at detecting referrer cloaking spam? Google has not indexed cold-forming-company.com; MSN has pages indexed, and Yahoo has only the homepage. On the other hand, Google has indexed metal-cold-forming.com, Yahoo again has only the homepage, and MSN does not have it indexed at all. Clearly, even the big three search engines have mixed success dealing with this type of spam.
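Detecting this stuff isn't magic, by the way. A crude check is to fetch the same page twice -- once with an internal referrer and once pretending to arrive from a search results page -- and compare what comes back. Here's a quick sketch of the idea, my own illustration using the URLs from above (a real system would use a fuzzier comparison than exact equality, since plenty of legitimate pages have dynamic bits):

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class CloakCheckSketch {
    static String fetch(String page, String referrer) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(page).openConnection();
        conn.setRequestProperty("Referer", referrer); // yes, the header really is spelled "Referer"
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String page = "http://www.cold-forming-company.com/coldforgedsteels.htm";
        // Same page, two referrers: one internal, one claiming to come from MSN search results.
        String internal = fetch(page, "http://www.cold-forming-company.com/links.htm");
        String fromSearch = fetch(page, "http://search.msn.com/results.aspx?q=site%3Acold-forming-company.com+cold+forged&FORM=QBNO");
        System.out.println(internal.equals(fromSearch)
                ? "Same content -- no referrer cloaking detected"
                : "Different content -- possible referrer cloaking");
    }
}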

FYI, the two sites mentioned above are a part of a much larger spam network. Here is a small sampling of the sites:
http://www.dirt-bike-parts.com/index.htm
http://www.aftermarket-motorcycle-parts.com/index.htm
http://www.discount-motorcycle-parts.com/index.htm
http://www.vintage-motorcycle-parts.com/index.htm
http://www.pump-repair.com/index.htm
http://www.fountain-supplies-pumps.com/index.htm

And the list is much longer than that. My, what a nice little spam network they've got there.

My prediction for 2006 is that cloaking and its kin will become an even bigger problem for SEs. Search engines of every kind will need to devote more resources to shoring up their defenses and weeding out crap like the above to stay relevant. I hope to see blacklists of known referrer cloakers emerge to make it easier to filter these sites out of results.