Wednesday, March 29

Who's your BigDaddy

Or perhaps more accurately, What's your BigDaddy? (And perhaps more interestingly, why is it called BigDaddy?) According to Matt, BigDaddy is:
a software upgrade to Google’s infrastructure that provides the framework for a lot of improvements to core search quality in the coming months (smarter redirect handling, improved canonicalization, etc.). A team of dedicated people has worked very hard on this change; props to them for the code, sweat, and hours they’ve put into it.
It started out at one data center and is now live at all of Google's data centers. One of the biggest advantages of this upgrade is improved URL canonicalization -- www vs. non-www, redirects, duplicate URLs, 302 “hijacking,” etc. The biggest improvement is most likely that they are better at picking between the www and non-www versions of URLs. Secondly, they are catching up to Yahoo on handling redirects and "hijacking". Danny Sullivan over at SE Watch wrote a great article last August on Yahoo's policy regarding redirects and hijacking, including contrasting it with Google's (old) policy.
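To make the canonicalization idea concrete, here's a rough Python sketch of collapsing www/non-www and trailing-slash duplicates down to one preferred URL. The preference rules (and the prefer_www flag) are my own invention for illustration, not Google's actual logic:

```python
# A rough sketch of the kind of canonicalization BigDaddy reportedly improves:
# collapsing trivially duplicate URLs (www vs. non-www, trailing slashes,
# uppercase hosts) to one preferred form. The rules here are assumptions.
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url, prefer_www=True):
    """Return a single canonical form for trivially duplicate URLs."""
    scheme, netloc, path, query, _fragment = urlsplit(url)

    # Lowercase the host and drop any explicit port.
    host = netloc.split(":")[0].lower()

    # Pick one of the www / non-www variants as the canonical host.
    if prefer_www and not host.startswith("www."):
        host = "www." + host
    elif not prefer_www and host.startswith("www."):
        host = host[len("www."):]

    # Trailing slashes only create duplicates, so strip them (keep the root).
    path = path.rstrip("/") or "/"

    return urlunsplit((scheme or "http", host, path, query, ""))

# Both of these collapse to http://www.example.com/page
print(canonicalize("http://example.com/page/"))
print(canonicalize("http://WWW.Example.com:80/page"))
```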

What caught my eye is what I interpret to be an entirely new crawler engine in BigDaddy. Here is the snippet from Matt's most recent post:
Q: “What’s the story on the Mozilla Googlebot? Is that what Bigdaddy sends out?”
A: Yes, I believe so. You will probably see less crawling by the older Googlebot, which has a User-Agent of “Googlebot/2.1 (+http://www.google.com/bot.html)”. I believe crawling from the Bigdaddy infrastructure has a new User-Agent, which is “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
Webmasters are reporting that the new crawler seems to support CSS, JavaScript, and other features of modern browsers (such as form support). It sounds like a Mozilla-engine-based crawler. That's truly an innovation in web crawling, considering most crawlers, including the open-source Nutch and Heritrix, are text-based. This is huge! Google must be taking a performance hit, but I guess that's part of the reason for all the new hardware they've been buying. It looks like they've been putting those Mozilla/Firefox developers (Ben and Ryan, are you out there?) to work!
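If you're curious how much of your own crawl traffic is already coming from the new infrastructure, a quick pass over your access log for the two User-Agent strings Matt quotes will tell you. A minimal sketch, assuming a standard access log at access.log that includes the User-Agent field:

```python
# Count hits from the old vs. new Googlebot User-Agents in a web server
# access log. The log path/format is an assumption (any log that records
# the User-Agent string will do); the two UA strings are from Matt's post.
from collections import Counter

OLD_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
NEW_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

counts = Counter()
with open("access.log") as log:
    for line in log:
        if NEW_UA in line:
            counts["Bigdaddy (Mozilla) Googlebot"] += 1
        elif OLD_UA in line:
            counts["old Googlebot"] += 1

print(counts.most_common())
```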

For starters, quality will improve because Google can tell more accurately what users actually see on the page. No more hiding DIVs with CSS or JS to stuff keywords!
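Just to illustrate the kind of thing a rendering-aware crawler could catch, here's a toy Python sketch that flags inline display:none blocks with BeautifulSoup. This is purely my own illustration -- I have no idea how (or whether) Googlebot actually detects hidden text:

```python
# Toy illustration: flag text hidden with inline CSS, which a text-only
# crawler would happily index but a rendering crawler could notice is never
# shown to users. My own sketch -- not how Googlebot actually does it.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

HTML = """
<html><body>
  <p>Visible copy about widgets.</p>
  <div style="display: none">cheap widgets best widgets buy widgets now</div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")
for tag in soup.find_all(style=True):
    style = tag["style"].replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        print("Possible hidden-text stuffing:", tag.get_text(strip=True))
```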

Secondly, better coverage of sites using JavaScript-based navigation and content rendering. Even in the engineering world, there are sites like McMaster that base their entire site on JavaScript, with no crawler-friendly version. Text-based crawlers, such as the old Googlebot, don't do very well on that site -- see for yourself. It's too early to tell whether the new Googlebot will improve this, but I would imagine so. Couple these crawling improvements with improved URL canonicalization and you have a higher-quality index.
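Here's a toy example of why JavaScript-built navigation is invisible to a text-based crawler: the links only exist after the script runs, so a simple href extraction (roughly what an old-style fetch-and-parse crawler does) comes up empty. The page below is invented for the example:

```python
# Toy illustration: the navigation below is created entirely by script, so a
# fetch-and-parse crawler that just scans the raw HTML for anchor tags finds
# nothing. A crawler that executes the JavaScript would see three links.
# The page is invented for this example.
import re

JS_ONLY_PAGE = """
<html><body>
<script>
  var cats = ["fasteners", "bearings", "tubing"];
  for (var i = 0; i < cats.length; i++) {
    var a = document.createElement("a");
    a.href = "/catalog/" + cats[i];
    a.appendChild(document.createTextNode(cats[i]));
    document.body.appendChild(a);
  }
</script>
</body></html>
"""

# Roughly what a text-based crawler does: scan the raw source for anchors.
links = re.findall(r'<a\s+[^>]*href="([^"]+)"', JS_ONLY_PAGE)
print(links)  # [] -- the links never exist until the script runs
```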

Some people have claimed that the new crawler is "blazing fast" compared to the old Googlebot. While I believe it may seem that way to webmasters because Google is crawling more aggressively, I find it highly unlikely that the software itself is faster. If the new crawler is using a Mozilla-based engine, it MUST be slower than the text-based crawler because of all the new work -- JavaScript parsing, CSS rendering, etc. -- that it hasn't done in the past.

I believe Google is crawling more aggressively because it is trying to re-crawl a large portion of the web very quickly. If you think about the impact of changing the way URL canonicalization works along with a new crawler engine, it follows that you will probably need to re-compute PageRank. Crawling gently is not something you can do at this scale if you want to propagate these changes quickly. In the process, Google is generating some webmaster complaints. Along this line, Search Engine Journal has a recent article on the topic entitled Mozilla Googlebot: Mozilla or Godzilla.
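For a sense of what "re-compute PageRank" means, here's a minimal power-iteration sketch on a tiny made-up link graph. The 0.85 damping factor comes from the original PageRank paper; everything else is just for illustration:

```python
# A minimal power-iteration PageRank over a tiny made-up link graph, just to
# show why changing the crawl and the canonical URLs forces a recompute over
# the whole graph. Damping factor 0.85 is from the original PageRank paper.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in graph.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly across the graph.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Imagine canonicalization just merged two duplicate pages into "c" -- all of
# their inbound links now point at one URL, so every score shifts.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))
```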
