Friday, November 18

The Google Strategic Server Force

In its Cold War with Microsoft, Google is readying a new weapon: The Google Strategic Server Force (GSSF). This new elite mobile strike force is emerging as a main component in Google's strategic arsenal. The Government reports that Google is readying the first deployment of the G-36M series mobile data centers and predicts that they will be online in time for Santa.

The new arsenal is primarily targeted at The Enemy's capital, Redmond. However, other targets include Walmart, eBay, and the Association of American Publishers, whose recent refusal to comply with Google's order to turn over all books and databases for indexing as part of Google Base and Books has escalated tensions between the major superpowers.

The new mobile data centers are reportedly running Google OS 3.7M Cheetah with its new autonomic load balancing and data redundancy features, courtesy of GFS II.x. Reportedly, the tractor-trailer-based centers are semi-autonomous agents based on K.I.T.T., whose prototype was designed by Mr. Norvig himself and utilizes the latest AI research. The previous model, the G-33M, was wildly successful and recently conquered the Government's toughest test. Google reportedly has 10-20 of these new data centers in its underground garage-bunkers, each with an effective range of 1,000 miles. Yahoo CEO Terry Semel described the level of this new threat:
We're talking about 5,000 Opteron processors and 3.5 petabytes of disk storage that can be dropped off overnight by a tractor-trailer rig. The idea is to plant one of these puppies anywhere Google owns access to fiber, basically turning the entire Internet into a giant processing and storage grid.
The new GSSF is under the direct authority of the Google Supreme High Command. The control of the troops is effected directly by the Supreme Commander in Chief through the central command headquarters of the General Staff and the main headquarters of the GSSF, using a multi-level extended network of command posts operating in alert-duty mode.

Bill Gates and other world internet leaders condemned this new technology and warned Google that this new threat could escalate the already tense conflict. They called on Google to destroy its weapons of mobile information, to cease development of all such weapons, and to stop support for open source terrorist threats.

In another recent development, John Battelle and Bill O'Reilly returned from negotiations in Munich, where they reportedly negotiated a partial disarmament of the WMIs. In a recent presentation before the W3C Security Council and Microsoft CEO Steve Ballmer, they announced, "We believe it is peace for our time."

In their presentation to the council, Battelle and O'Reilly revealed some of the terms of the agreement, including Term 6:
The final determination of the mobile data center based Wi-Fi frontiers will be carried out by the international commission. The commission will also be entitled to recommend to the four Powers Microsoft, Yahoo, IBM, Ask and Looksmart, in certain exceptional cases, minor modifications in the strictly ethnographical determination of the zones which are to be transferred without plebiscite.
Not all of the terms of the agreement were revealed, but undisclosed sources reported that several concessions were made to appease Brin, Page, and Company. According to these sources, the collective databases of leading publisher Reed Elsevier will be ceded to Google. Also under the terms, Google will acquire Scirus and integrate it into Google Scholar.

In an attempt to assuage public concern over its recent aggression, Google today announced it would be using its G-36M and Strategic Server Force (GSSF) to provide free WiFi to the city of Mountain View. Whether the strategy will prove effective in swaying public opinion remains to be seen, but the first signs appear promising.

Ongoing coverage of this breaking news story:
Google data centers and dark fiber connections
Google Announces Plan to Destroy All Information It Can't Index

Thursday, November 17

Evolution of technological progress through queries

There is a new paper up on Google's research website:

Interpreting the Data: Parallel Analysis with Sawzall (Draft)
Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on.

Go to the site for the full paper and abstract. I'll read it later today. It's on the same list with GFS and MapReduce, so I hope it lives up to the same standard.
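To make the abstract concrete: the three operations it names (filtering, aggregation, extraction of statistics) are all simple per-record passes. Here's a minimal Python sketch of that pattern -- the records and field names are invented for illustration, and this is plain sequential Python, not Sawzall's distributed execution:

```python
from collections import Counter

# Hypothetical log records with the "flat but regular" shape the paper
# describes (think network logs or query logs).
records = [
    {"country": "US", "query": "google base", "bytes": 1200},
    {"country": "DE", "query": "sawzall", "bytes": 800},
    {"country": "US", "query": "gfs", "bytes": 950},
]

# Each analysis is a single independent pass over the records, which is
# exactly what makes this style easy to distribute across machines.
us_records = [r for r in records if r["country"] == "US"]        # filtering
queries_per_country = Counter(r["country"] for r in records)     # aggregation
total_bytes = sum(r["bytes"] for r in records)                   # statistics

print(queries_per_country)  # Counter({'US': 2, 'DE': 1})
print(total_bytes)          # 2950
```

The point of Sawzall, as I read the abstract, is that because none of these passes depends on seeing the whole data set at once, the runtime can split the records across thousands of machines and merge the partial results.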

The coolest thing in it so far is the movie showing the distribution of requests to Google's servers over the course of a day.

This is very interesting. Can you track the technical progress of a society (sorry for waxing philosophical like John Battelle) by the volume and type of queries executed? It would be interesting to track this over the course of years to see the growth of technology in rapidly emerging countries like China and parts of South America.

So now we have the volume distribution, but can we mine trends at a global level? For example, commercial queries have eclipsed sex-related queries in North America. Will this trend repeat itself in Europe? Fascinating.

Thanks to Digg for the tip.

Sitemap statistics are not like a bikini...

"Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital." - Aaron Levenstein. At least in this case, I think the Google sitemap statistics reveal more vital information than they conceal. Maybe statistics aren't so evil after all...

Google announced today on its official blog and on the Sitemaps Blog that it is going to provide more statistics to webmasters via the sitemap service.

What's even cooler is that the Google Sitemaps blog reports that you can get site indexing statistics even if you don't have a sitemap! Now, if only it were integrated better with Google Analytics (if you missed it, here is the official blog post on it being free).

What's really awesome is the ability it gives you to fix problems on your site. The statistics show the fetch details for every page in the Sitemap. In my opinion, the two most interesting are the HTTP request details and the crawl date for individual pages. Did half your pages drop out of Google because one of your important pages 404ed? Was your site down when Google tried to crawl it? Now at least you are more empowered to do something about the problem. To my knowledge, no other SE provides this level of transparency about its crawling -- Globalspec, MSN, Yahoo, nobody.
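As a rough sketch of the kind of check these reports enable, here's how you might cross-reference a sitemap against fetch statuses to flag the 404s. Everything here is made up for illustration -- the URLs, the statuses (in practice you'd read those out of the Sitemaps report or your own server logs), and the namespace, which you'd adjust to whatever your sitemap file declares:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap in the XML format Google published.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/products</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

NS = "{http://www.google.com/schemas/sitemap/0.84}"

def sitemap_urls(xml_text):
    """Pull every <loc> URL out of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(NS + "loc")]

def broken_pages(status_by_url):
    """Return URLs whose last fetch came back as an HTTP error (4xx/5xx)."""
    return [url for url, status in status_by_url.items() if status >= 400]

# Pretend one important page 404ed on the last crawl.
statuses = {u: 200 for u in sitemap_urls(SITEMAP)}
statuses["http://example.com/products"] = 404
print(broken_pages(statuses))  # ['http://example.com/products']
```

A report like that, run after each crawl, would tell you exactly which pages to fix before they drop out of the index.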

I think it would be cool if there were a way to suggest that Google retry crawling errored pages. If there was a 404 or some sort of logic error on your site, you could see it, fix it, and tell Google so they can re-crawl it. I suppose if Google crawls you very frequently this may not be a big issue, but if major portions of your site errored out repeatedly and dropped out of the index, it could be devastating to a business that gets a lot of traffic from search engines (most do), especially small retailers in the holiday season!

Now here is an interesting experiment: add a new page to my site (and sitemap) and then monitor its appearance in the Google index. Then, compare the index date with the crawl date. What is the delay between crawling and appearance in the search index? Just how fast can Google get crawled content into its live search index?
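The measurement itself is trivial once you have the two dates. A sketch, with both dates invented -- in practice the crawl date comes from the Sitemaps report, and the index date is whenever the page first shows up in search results:

```python
from datetime import date

# Hypothetical dates for one new page.
crawled = date(2005, 11, 17)   # from the Sitemaps fetch report
appeared = date(2005, 11, 19)  # first seen in live search results

delay = (appeared - crawled).days
print(f"crawl-to-index delay: {delay} days")  # crawl-to-index delay: 2 days
```

Repeat over a handful of new pages and you'd have a rough distribution of the crawl-to-index lag rather than a single anecdote.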

Providing this extra value through sitemap statistics is very smart, because it is a compelling incentive for webmasters to sign up for Sitemaps and should also spread adoption of the Sitemap format (it's still just a Google thing, after all).

The problem, for me at least on Blogger, is that you need to "Verify" site ownership by placing a file in your root directory so that Google can fetch it. What sucks: no support for Blogger sites. And I'm not the only one (see the Google Sitemaps group) who thinks this SuXors. It's ironic that the Google Sitemaps blog is on Blogspot, and yet I have no way of verifying with Google that I own this blog on Blogspot. Another step on the way to setting up my own web server and WordPress.

Wednesday, November 16

Excuse me, I believe Google has my stapler...

We knew it was coming, but somebody at Google was working late last night (until at least 9:51 pm PST). It's official: Google Base is online.

So the jokes in the geek world will, of course, include the obligatory: All your base are belong to Google. In fact, I bet that's what they would have titled the press release if the developers had written it and not been forced to un-geekify it.

It looks like Sergey and Larry have finally eclipsed one of their advisors, Mr. Hector Garcia-Molina, in their attempt to gain supreme power.

The Google Blog has been updated with a post on Google Base.

I'm sure there is going to be a flurry of buzz over this one. I, for one, think this is a really cool idea. For example, I can just search recipes, job postings, anything.

At the risk of sounding stupid: why is it called Google Base?

It's really cool: I can create any arbitrary object with a collection of attributes, a description, keywords (*cough* tags *cough*), and possibly pictures.

I think I'll put my collection of Red Swingline staplers online.

Excuse me, Mr. Brin, I believe you have my stapler...

Tuesday, November 15

How Italian hotels and villas need to get hip to SEO

One thing to ponder is popularity vs. authority on a subject.

Let me show you an example. Try a search on Google for "jeff's search cafe" with the quotes, so it is a phrase search. There are only 49 matches in Google's index; clearly this is not a popular or ambiguous query -- you are searching for this site (or for my non-existent real-life cafe).

So, what is the first result? Findory's link to my feed! Second result: here I am. Popularity and authority. No doubt Findory has a higher PageRank than my pathetic site on Blogger.

I have noticed this more and more recently. I have been doing a lot of travel research for my honeymoon (next May is coming too soon!), exploring hotels and cities and things. What I find quite often is that hotels and cities in Europe (Italy, France, Greece, etc.) obviously don't know much about SEO! Many of the websites for these places are one of two types: fancy art deco Flash that looks very expensive but lacks any substantive content, or a quick mom & pop homepage with simple information and maybe a couple of pictures, if I am lucky. Neither ranks well in search engines.

So what do I see most often? I see the travel sites that review those hotels, like Yahoo Travel, TripAdvisor, etc. My favorite is TripAdvisor, which I actually find quite useful for its fantastic user community. It would be great if TripAdvisor linked to the hotel website, but it doesn't! In fact, most of these types of travel guides / sellers do not. It is very frustrating sometimes.

These are two examples where link popularity breaks down. First, my blog: I can't compete with Findory in link quantity or quality. In the travel / hotel case, the hotels are in a similar position, competing for link text with major sites like TripAdvisor, Fodors, and their peers, many of which are definitely borderline spammish.

How are search engines dealing with this problem? Good question. I know there was some discussion a while back about TrustRank. Teoma / Clusty try to help with clustering and refinements. (See also: DiscoWeb, whose ranking was based roughly on Kleinberg's HITS algorithm.) I'll think on this some more later -- it's time for some sleep. There must be something better we can come up with.

For now it is just an interesting lesson (and frustrating as I try to plan my honeymoon!). On a side note, I have this intuition that Google feels "spammier" -- perhaps what I mean is much more commercial -- for travel searches than for some other types of searches I generally run.

Information retrieval (IR) and NLP resources, continued

So I guess it's been a busy day for me posting. I tend to go in spurts, really.

Here are some resources I wanted to make sure people knew about:

I found something pretty cool today: an online draft of Introduction to Information Retrieval by Manning, Raghavan, and Schütze -- two of the same authors as the Foundations of Statistical NLP book I recommended yesterday. According to the site, the book is scheduled to be published in 2007. It looks like they have drafts of about half the chapters online right now. Very interesting reading from what I've seen so far.

For IR practitioners I ran across a relatively new book:
Information Retrieval: Algorithms and Heuristics by Grossman and Frieder (2004).

I mentioned yesterday the CS276A course on IR; well, there is also CS276B on Text Mining, which is very relevant to what we do here at GS (as well as at the other SEs) and which people shouldn't miss out on.

And lastly if you want yet more resources there is always

There is some fodder for a future post in Lecture 7 of CS276B, specifically slide 3, the "Evolution of Search Engines". Stay tuned.

So much to read, so little time!

Human Vs. Computer RTS Game

Yahoo (Human) vs. Google (Computer) Real Time Search Game. Who will win?

Here is an interesting article in Business 2.0 about Flickr's acquisition and how Yahoo is betting on social networks, tagging, etc...

That upstart in neighboring Mountain View may have a better reputation for search, it may dominate online advertising, and it may always win when it comes to machines and math. But Yahoo has 191 million registered users. What would happen if it could form deep, lasting, Flickr-like bonds with them -- and get them to apply tags not just to photos, but to the entire Web?
So, just how do you get people involved in tagging the web? What does that look like? Perhaps Rollyo is a start?

The real question of the day is: is Google playing on easy, average, or super ultimo death mode?

You know, I bet those geeks at Google / Yahoo have some killer private HL2 Counter-Strike / Quake III servers. What DO you do with a few thousand servers and almost unlimited bandwidth... Hmm... really now. Next up: Yahoo's social networking / bookmarking / life MMORPG where you play as yourself. ;-) Imagine the possibilities.

Any Googlers / Yahooers / MSNers / etc. want to chime in? ... not that they actually read this, but still! Any search engine death matches going on after hours?

Kudos to SearchBlog for the tip and Westwood Studios (EA) / Sony / Id for the idea.

Google Research from 2000

Google did some research in 2000 in which they spidered the top 100 pages for popular queries. They then compiled a variety of interesting metrics about the pages (such as document size, URL length, average title length, file types, etc.).

The research was published as part of the Search Engine World Quarterly (Q3 2000).

Anyone know of follow-up research? The article mentioned a phase 2 in which they used the same data and methods but analyzed links. I'll try to track that down, but somehow I have doubts about finding it.

It might be interesting (and feasible) to update this study. It would be even more interesting to run it on Globalspec and other verticals to see if the results hold in a domain-centric search engine. Do the same properties hold over time and across different types of web pages (i.e., blogs or engineering pages)?
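Most of the per-page metrics in that study are easy to recompute. A sketch of the measurement step, with the caveat that the article doesn't give the study's exact definitions, so these are my guesses (and the sample page is invented):

```python
import re

def page_metrics(url, html):
    """Per-page measurements like those in the 2000 study: document size,
    URL length, and title length (definitions are my own approximations)."""
    title_match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    title = title_match.group(1).strip() if title_match else ""
    return {
        "url_length": len(url),
        "document_size": len(html),  # raw HTML bytes/chars, headers excluded
        "title_length": len(title),
    }

html = "<html><head><title>Widget Catalog</title></head><body>...</body></html>"
metrics = page_metrics("http://example.com/widgets", html)
print(metrics)
```

Run that over the top 100 results for a set of popular queries (on Google, then on Globalspec or another vertical) and you could compare the distributions directly against the 2000 numbers.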

Monday, November 14

Reading List on NLP / Information Retrieval

So I ran across some reviews by Peter Norvig on Amazon. On a side note, judging by Peter's wish list it looks like he's really getting into photography.

Some of the books caught my attention and I'm going to dig into some of them in more detail when I get the chance.

First some blatant ripping of selections from his list:

Statistical Learning and NLP

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Jurafsky & Martin. Peter says this is a good general NLP / theory book; Foundations of Statistical Natural Language Processing is more focused on algorithms.
Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze
Neural Networks for Pattern Recognition by Christopher Bishop

Information Retrieval (IR)
Managing Gigabytes by Witten, Moffat, and Bell. Definitely a must read in the genre!

-- and now for my appendix to Peter's very nice list:
Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
Mining the Web by Chakrabarti

The interested reader should also refer to: (and its newer sibling)

Other Miscellaneous Books / Proceedings
Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization by Jackson and Moulinier

I think I might also give one of the others from Peter's list a try:
Selected Papers on Computer Science (Csli Lecture Notes) looks very interesting to me.

Also I had a friend point this one out to me which I'm sure I will find stimulating:
Survey of Text Mining: Clustering, Classification, and Retrieval by Michael Berry.

Did I miss anything? Anyone want to add some? More importantly, it looks like I need a crash course in statistics, one thing they didn't teach us at Union -- does anyone know any good books?

Google Spell-Check Snafu


Some of you may have noticed that the HTML in my last post was slightly fuzzled for a few days. I started using the Google spell check that is built into the toolbar. To make a long story short, I accidentally hit "publish" while I was in spell-check mode. Oops!

Has this happened to anyone else? I resorted to manual HTML editing: paste into Notepad, edit the HTML, paste back into Firefox, and re-post. The HTML was a big mess, almost as bad as Word HTML. All of the span highlighting was a pain to get out! I wish there were an "undo" or "clean" feature. I could probably write one; it shouldn't be too hard.

All of the span tags follow a common ID convention: gtbmisp_xx, where xx is the number of the spell-check correction. The toolbar also adds a bunch of divs at the end, where the actual spell-check corrections are rendered; they have IDs like gtbspellmenu_xx.
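Given those ID conventions, a "clean" pass is a couple of regexes. A rough sketch in Python (the sample markup is made up to match the pattern described above; a real cleaner would want proper HTML parsing, since regexes will trip over nested spans):

```python
import re

# Unwrap gtbmisp_NN highlight spans (keeping their text) and drop the
# gtbspellmenu_NN correction-menu divs entirely.
SPAN_RE = re.compile(r'<span[^>]*id="gtbmisp_\d+"[^>]*>(.*?)</span>', re.S)
MENU_RE = re.compile(r'<div[^>]*id="gtbspellmenu_\d+"[^>]*>.*?</div>', re.S)

def clean(html):
    """Strip the toolbar's spell-check markup, leaving the original text."""
    html = SPAN_RE.sub(r"\1", html)
    return MENU_RE.sub("", html)

messy = ('I <span id="gtbmisp_1" style="background:yellow">beleive</span> so.'
         '<div id="gtbspellmenu_1">believe</div>')
print(clean(messy))  # I beleive so.
```

Paste the botched post through that and you'd get back the pre-spell-check HTML (typos and all) without the manual Notepad surgery.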

You know, it's really amazing how simple the spell-checking tool really is. It's not that hard to implement with a little XMLHttp and some CSS / JavaScript tricks.

Here's hoping for a proper clean button!