Saturday, December 31

Yahoo Search New Year's Eve Outage

So far, Yahoo's search results have been offline for at least an hour. Perhaps a hard disk failure(s)? As the tech team here at Globalspec would say, "SCSI go pooh", in technical terms.

Looks like the folks watching the search server clusters at Yahoo are out enjoying New Year's Eve! I bet somewhere, someone has a beeper going off and is headed back to the office. Hopefully the partying hasn't gotten too far along, or this could take a while ;-). Remember, geeks don't let other geeks drive drunk.

Also of note: this means that Rollyo is down, because it uses Yahoo search.

Since I don't have a beeper (sorry, Kathleen!), I'm heading out to celebrate.

Happy New Year!

Tuesday, December 27

All aboard the Gravee train

Time for some Gravee and biscuits. Gravee is a new meta-search engine whose gimmick is to reward webmasters for their content.

They share revenue with the sites whose pages are viewed in the search results. Gravee shares 70% of the ad revenue directly with the website owners. If your site appears in the top 10, you get 1/10 of that money, or a total of 7% of the ad money from that page view. The position your site appears in doesn't matter; everyone gets the same amount.

The name Gravee is a reference to gravy, as in the "gravy train," meaning money with little or no effort -- cushy. The money that they share with web content owners is "pure gravee." "Gravy train" is an example of a collocation. It has an interesting history; its origins aren't completely certain. I found this interesting Q&A on Google on the meaning of "gravy train", which I think offers a pretty good treatment of some possibilities.

The theory behind Gravee is very interesting and unique. Here is their manifesto, from their blog:
We view ourselves as much as a distributor of content as we do a search engine. Search advertising is content driven. It’s an economy of attention – i.e. one could posit that the content showing up most frequently in search results is driving the most ad revenue. As a result, our economic model is so, as well – meaning, whatever content gets more attention (i.e. shows up in more searches) will get paid more.
In theory, rewarding web content owners for content directly from a search engine is a cool idea. In practice it could turn into a nightmare. Perhaps it's just my cynicism from firsthand experience with spam here at Globalspec, but paying people for merely appearing in search results is Naive, with a capital N.

For starters, this fails to account for user intent -- clicks on search results. It's like suddenly we're back in the dot-bust era with the pay-per-impression (CPM) model. To heck with your performance, we'll pay you anyway, even if not a single user clicks on your search result because it's not relevant.

Now that Gravee has provided an incentive to scam the system with their checkbooks, they hand scammers the keys to the kingdom -- an affiliate network. Gravee search affiliates make 35% of the ad revenue from searches that originate from their site. Gee, I wonder what happens if I do a site: query from my own affiliate search box -- woohoo!! Every result is mine, so I've now made about 81% of the ad revenue (70% of the remaining 65% is roughly 46%, plus the 35% affiliate cut). Whoop de da. Ok, ok, let's say they put the kibosh on those shenanigans.
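To make the arithmetic concrete, here is a small sketch of the payout math using only the percentages quoted above. The function and constant names are made up for illustration, and it assumes (as the 81% figure implies) that the affiliate is paid first and the 70% owner share comes out of what is left.

```python
# Percentages quoted in the post; all names here are hypothetical.
OWNER_SHARE = 0.70       # share of ad revenue paid out to site owners
AFFILIATE_SHARE = 0.35   # share paid to the affiliate that originated the search
RESULTS_PER_PAGE = 10    # the payout is split evenly across the top 10


def owner_cut(results_owned, via_affiliate=False):
    """Fraction of one page view's ad revenue paid to a single site owner."""
    # Assumption: the affiliate is paid first, and the 70% is taken from the rest.
    pool = 1.0 - AFFILIATE_SHARE if via_affiliate else 1.0
    return pool * OWNER_SHARE * results_owned / RESULTS_PER_PAGE


# One result in the top 10, no affiliate involved.
print(f"{owner_cut(1):.1%}")

# A site: query sent through your own affiliate box: all 10 results are yours.
self_dealing = AFFILIATE_SHARE + owner_cut(10, via_affiliate=True)
print(f"{self_dealing:.1%}")
```

The first line prints 7.0% and the second 80.5%, which is the "81%" above before rounding.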

Jeremy, in his reply on TechCrunch's coverage, echoes my thoughts exactly:
So now a spammer (err, I mean “SEO”) can get money from a search engine by being in the top 10 even if they’re never clicked on.

I’m in the wrong end of this business!

If Gravee had a novel way of ensuring that a site didn't SEO and that all their results were 100% relevant, this would be a great model. However, this is Impossible, as I state in my comment on their blog.

One place they could improve is their means of verification for the AdShare program. The site owner verification mechanism requires that your e-mail match the ICANN owner. This may be fine for large sites, but it doesn't work for bloggers whose content is on a shared provider, like my blog. I don't own blogspot, and therefore I can't sign up. So much for "empowering the little guy."

So, does Gravee have a new and innovative product here? In other words, are their search results any better? The website is pretty mum on the exact details, except for a comment by William on their blog:
We have an algorithm on the back calculating clickthrough rates and other factors to re-rank the existing relevance rank of today’s search engines.
In other words, the only new thing they are adding is click-through data -- which is nothing new. The major SEs are very careful when dealing with click-through data. It can be difficult to interpret, and you need a lot of it. Using it in ranking also leads to something of a positive feedback loop, because higher-ranked results are clicked on more frequently and clicking on a result pushes it higher. Don't get me wrong, click-through data has potential, but only on a large-scale system with lots of data, which Gravee does not yet have, to my knowledge.
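The feedback loop is easy to see in a toy model. The sketch below is not Gravee's algorithm (they haven't published one). It is a deliberately naive re-ranker that sorts two results by raw accumulated clicks, with made-up numbers for how often users look at each position and how relevant each result really is.

```python
# All numbers are illustrative assumptions, not measured data.
EXAMINE_PROB = [1.0, 0.3]          # chance a user even looks at rank 1, rank 2
RELEVANCE = {"A": 0.4, "B": 0.6}   # chance of a click once a result is seen


def simulate(rounds, ranking):
    """Re-rank by raw accumulated clicks each round; return the final state."""
    clicks = {doc: 0.0 for doc in ranking}
    for _ in range(rounds):
        for position, doc in enumerate(ranking):
            # Expected clicks depend on position as well as on relevance.
            clicks[doc] += EXAMINE_PROB[position] * RELEVANCE[doc]
        ranking = sorted(ranking, key=lambda d: clicks[d], reverse=True)
    return ranking, clicks


# "A" starts on top only because the underlying engine put it there.
final_ranking, clicks = simulate(rounds=100, ranking=["A", "B"])
print(final_ranking, clicks)
```

Even though "B" is the more relevant result in this toy setup, "A" never loses the top spot: its position earns it more clicks, and the clicks keep it in position. Correcting for that bias takes a lot of data, which is the point made above.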

Perhaps Gravee has some other secret sauce, but I haven't seen anything worth writing home about, yet (maybe that will change). Perhaps I'm being cynical, but they seem to be excited about touting their novel business model rather than focusing on their product, the quality of their search results. That bothers and worries me.

Here's my prediction: if it appears too good to be true, it probably is -- or at least probably won't be for long! Still, I've got to give them kudos for having their heart in the right place; maybe they can find a way to make it work.

Friday, December 23

Roller and JRoller -- Blogging for the Java community

So, I have my blog on Blogger, but I am keeping my eye on Roller. According to the site:
Roller is a blog server, a Java-based web application that is designed to support multiple simultaneous weblog users and visitors. Roller supports all of the latest-and-greatest weblogging features such as comments, WYSIWYG HTML editing, page templates, RSS syndication, trackback, blogroll management, and provides an XML-RPC interface for blogging clients such as w:bloggar, MarsEdit, Ecto, and nntp//rss.
Hmm, sounds cool. Sun uses it on its website and it has integrated spell checking using Jazzy.

There is also a free Roller blog hosting service called JRoller. It is "the catalyst of the java blogging community." JRoller has a lot of promise. It has a pretty sizable blogging community, purportedly 10,000+ Java developers. Wow! That's a very sizable Java geek squad in one place. Does Microsoft have a .Net blogging community? ;-) One area where there is opportunity for improvement is the URL you get for your blog. Apparently, they are considering offering sub-domains (like Blogger). Please go over and add your voice for this feature!

Overall, Roller looks like a fairly mature platform and the JRoller community looks to be growing... I'm definitely a fan of Java, so I have a feeling I'll be reading more geek blogs over there.

On that note, instead of going on a GYM diet, perhaps I should go on an A-list-free blogger diet. No reading Scoble, Greg, Jeremy, Matt, Battelle, Om, etc. Rather, I'll read less well-known and geekier blogs. Perhaps I'll start by exploring JRoller...

It's a slow Friday... so congrats to whoever reads this ;-).

Wednesday, December 21

Beyond Google: specialty and alternative SEs

Kevin Delaney wrote an article that ran Monday in the Wall Street Journal entitled "Beyond Google" where he looks at specialty search engines and databases. Here is the public link for the article. It's a good survey of vertical search engines broken down by category. Of note, Globalspec got a mention along with Scirus and LawCrawler as "industry" search tools:
GlobalSpec Inc., for one, offers an engineering search engine that searches about 200 million engineering and technical Web pages. GlobalSpec, of Troy, N.Y., also allows users to search within specialized databases, such as published technical standards and patent filings in the U.S. and internationally.
Perhaps it's a pet peeve, but some of the sites Delaney lists are specialized databases, not true vertical search engines. When people say "search engine" today there is an implicit reference to Google, and therefore to web search engines. A web search engine is a subclass of database, a very large database (VLDB). I think it is important to differentiate between these specialized, vertical web search engines and specialized database-driven websites. Gary Price at SE Watch gets this right in his follow-up to the story, Specialty Databases (Verticals) The Focus of a WSJ article. Apparently, Gary was one important source for Kevin's story.

Since we are talking about search engines beyond Google, LifeHacker also recently posted a list of top ten alternative search engines.

One of my favorite new vertical search engines is FoodieView - "The Recipe Search Engine". It integrates a "Recipe Box" where you can save search results. Pretty neat. My one big beef with FoodieView is the ads. I really dislike the way FoodieView integrates their ads in line with the organic search results. It makes the search results difficult to read and starts to blur the line between paid and organic results. Two thumbs down on that aspect. For now, I'll stick with Rollyo, where I can accomplish the same thing by limiting the sites I search (without the annoying ad format). In addition, the Rollyo results are even a little more comprehensive and better ranked. However, I have to say cheers to the creators of FoodieView for an interesting idea -- they beat me to the punch. Maybe someday I'll start a competitor in my spare time. ;-)

One hole in the WSJ article is blog search. There are lots of cool and interesting blog search engines -- Technorati, IceRocket, Feedster, etc... and then newcomers such as OpinMind and hopefully soon, Sphere.

Tuesday, December 20

The Spirit of Search, Past, Present and Future

Spirits don't just come in the Christmas variety to visit holiday Scrooges. They also come in the form of users past, present, and future. And this is the time of year that search engines are heralding in the Spirit of Search Present. For example, today Google announced its take on the "interesting" search movers and shakers of 2005. Many of the other major search engines (with the notable exception of MSN) are doing the same. Here is a quick round-up:

  • Google zeitgeist 2005 -- A selective look at the 'interesting' queries of 2005. It is organized into categories: World Affairs, Nature, Movies, Celebrities, and Phenomena. There is no "top queries" list. The top gainers of 2005 are: MySpace, Ares, Baidu, and Wikipedia.
  • Yahoo! 2005 Top Searches -- Top queries include: Britney, 50 Cent, and Cartoon Network.
  • Lycos Top 100 of 2005 -- Paris Hilton, Pamela, and Britney top the list.
  • AOL's 2005 Year in Review -- The most popular: lottery, horoscopes, and tattoos.
Lycos' Daily Report and the Yahoo Buzz Index both have interesting features on the holiday season. Look at the queries. So, what do people want for Christmas?

  • MP3 Players -- specifically the iPod (Nano, Video, Shuffle, Mini, etc.)
  • Sony PSP
  • Xbox 360
  • Laptops
However, search terms have greater meaning beyond telling the latest fads in pop culture and what to buy your nephew for Christmas.

Search terms are fascinating things -- they represent the Spirit of User Intent. John Battelle has a lot to say about user queries and the power of user intent on his blog and in his very interesting book, The Search. Personally, I find queries absolutely fascinating. I would really like to spend some more time researching their deltas from season to season and year to year across the major engines (although a detailed analysis is impossible because much of the data is private, of course).

Yahoo, Google, and other industry leaders use search terms as one important metric in deciding what "verticals" to enter next. I was watching a presentation given by Brad Horowitz from Yahoo, and he had an interesting tidbit: Yahoo decides what verticals to enter based on what users search for. They also order their "tabs" based on the search frequency of that particular "vertical" market.

If you want to create a vertical search engine, of any kind, you should be asking yourself, "What is your audience searching for?" Start analyzing the popular queries across the major search engines. If you work for a search engine with access to a data warehouse, then you are truly blessed. If not, then maybe the most popular queries are a good starting point. At least some of this data is publicly accessible -- the above links are great starting places. Once you've looked at this year, compare it with past years and look for trends. Then maybe dive into keyword analysis tools for more depth. One thing I think you will notice is that Pamela, Britney, wrestling (WWE) and similar entertainment and celebrity queries consistently float to the top.

Another approach might be to find the important and the popular categories in the query stream. In fact, the search engines have started the process along. Yahoo and others have organized their popular queries into categories like: Music, Sports, TV, Kids Stuff, Movies, Video Games, and News. Staring at these lists, it is no surprise that Google recently launched enhanced music search features; more details can be found in Google's blog post on the new features. Hmm, I wonder what will be next!

The Spirit of Search Future: 2006 and beyond. Video, music, movie, and news search (like Topix) have all already seen some attention in 2005, but look for them to really take off next year. There are also gaps in search that aren't covered. From the above list I see sports, kids stuff, and video games -- and that's just for starters. Look for more entertainment-centric vertical search in 2006. But, hey, don't take my word for it.

Go check out the top queries of 2005 and make your own predictions about search in 2006. I would love to hear your thoughts.

Up Next: One of the top gainers of 2005 in the Google Zeitgeist (and others), MySpace. And Google's ban on Kozoru.

Thursday, December 15

Information Retrieval and other interesting collocations

In my spare time, on evenings and weekends, I have been doing some research (programming) on collocations and term co-occurrence, especially in relation to their applications in web search. Let me begin with some definitions.

According to Manning and Schutze, "a collocation is an expression consisting of two or more words that correspond to some conventional way of saying things" (Manning and Schutze, 1999). Another definition, used by Columbia NLP researcher Frank Smadja, is: "a collocation is an arbitrary and recurrent word combination" (Benson 1990). Term co-occurrence is less strict: it is simply words that happen to occur in the same document.

Collocations are very interesting. They are unique combinations of words that have a special meaning, like "a stiff wind." In a technical context they are often technical terms. For example, the title of this article includes a CS/IR collocation, "information retrieval"; another might be "information extraction." For a full discussion of collocations you can read Chapter 5 of Manning & Schutze's book, which happens to be available online.

Collocations can be used in web search in some interesting ways. One way they are used is to guide the discovery of information. They give you ideas for keywords you might want to add to your query to make it more specific. One example of this is Ask Jeeves' "narrow your search" suggestions. Collocations are also used in clustering, but that is a story for another day.

Specifically, I have been writing a Java program that uses the Google API to extract interesting collocations from web search result summaries. It utilizes N-grams (sequences of words or characters) to identify interesting phrases. Hopefully, I'll have it working well enough to post online soon.

My current approach is not too sophisticated: I am using a list of stop words (a, the, is, of, etc.) to remove "uninteresting" phrases. If I have time, I would like to pursue a more sophisticated approach that utilizes part-of-speech (POS) tagging.
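For a flavor of the stop-word approach, here is a minimal sketch in Python rather than Java, and without any search API call. It counts word bigrams across a few result summaries and throws away any bigram that starts or ends with a stop word. The stop list, the threshold, and the sample summaries are all invented for illustration; this is not the actual program.

```python
import re
from collections import Counter

# A tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to", "for"}


def candidate_phrases(summaries, n=2, min_count=2):
    """Count word n-grams, skipping any that start or end with a stop word."""
    counts = Counter()
    for text in summaries:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue  # "uninteresting" phrase such as "of searching"
            counts[" ".join(gram)] += 1
    # Keep only phrases that recur across the summaries.
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


# Made-up summaries standing in for search result snippets.
summaries = [
    "Information retrieval is the science of searching for documents.",
    "An introduction to information retrieval and web search.",
    "Web search engines apply information retrieval at scale.",
]
phrases = candidate_phrases(summaries)
print(phrases)
```

On these three snippets the only bigrams that recur are "information retrieval" (three times) and "web search" (twice). Raw frequency plus a stop list is crude, which is why POS tagging is the natural next step.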

One example of this POS tagging approach is the Xtract program by Frank Smadja. I managed to find a copy of his paper online (not through the ACM portal): Retrieving Collocations from Text: Xtract. When I get around to it, I think it would be really interesting to combine Stanford's POS tagger with my program to see if I can produce better results.

Finally, I would also like to track down and read Justeson and Katz's paper: Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text, which I had some difficulty locating through common online sources.

I'd like to deploy my program as a web app so that people can try it out and see the results in real time without having to download and run the program.

Stay tuned.

Tuesday, December 13

Google Active Homepage

Google announced today on its blog that, in addition to RSS feeds, it will start supporting "richer web apps as well." In other words, if you can create some HTML / JavaScript, you can include it as a "widget" on your homepage.

Creating an application with Google Homepage Developer APIs is a snap. To create a widget all you need to do is wrap some simple XML around the HTML of your choice. As a demo, I created a widget to search Globalspec's Engineering Web search engine. If you want to check it out all you need to do is:
  1. Copy this link location.
  2. Open the "Add content" pane of your Google personalized homepage.
  3. Paste into the Google "Create Section" text box.
  4. Learn how to work for Intel: Try a search for "Britney Spears", or maybe learn to build a lego robot by checking out "Lego Mindstorms".
Now, was that so incredibly difficult? Making a widget is almost as hard!
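To give a feel for the "wrap some simple XML around the HTML" step, here is a hedged sketch that builds a module file from an HTML snippet. The Module / ModulePrefs / Content layout follows the gadget module format as I understand it, so check Google's developer documentation before relying on it. The search form and its example.com URL are placeholders, not the real Globalspec widget.

```python
from xml.sax.saxutils import quoteattr


def make_module(title, html):
    """Wrap an HTML snippet in a minimal module definition (assumed format)."""
    # CDATA passes the HTML through unescaped, so it must not contain "]]>".
    if "]]>" in html:
        raise ValueError("HTML must not contain ']]>'")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Module>\n"
        f"  <ModulePrefs title={quoteattr(title)} />\n"
        '  <Content type="html"><![CDATA[\n'
        f"{html}\n"
        "  ]]></Content>\n"
        "</Module>\n"
    )


# Hypothetical search form; example.com stands in for the real search URL.
SEARCH_FORM = (
    '<form action="http://www.example.com/search" method="get" target="_top">\n'
    '  <input type="text" name="query" size="20">\n'
    '  <input type="submit" value="Search">\n'
    "</form>"
)

module_xml = make_module("Engineering Search", SEARCH_FORM)
print(module_xml)
```

Save the output somewhere publicly reachable, and its URL is what you would paste into the "Create Section" box.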

It's nice to see some action from Google in this area. I use Google as my personalized homepage, but Google is playing catch-up in this arena. Microsoft is building its "Windows Live" platform, which includes your own personalized, active homepage. Microsoft's widget gallery has been around longer and is richer, with a mind-blowing 110 widgets.

The other player, Yahoo, has its widget system from its Konfabulator acquisition. It has a hefty 1,600+ widgets. The downside: you have to download an engine that runs on your desktop. Perhaps it's just me, but isn't this what I have a browser for? You want me to download a desktop app to use widgets? Sorry, I'm not convinced, yet.

Right now Google's directory only has five widgets, but I look forward to seeing what is added as time goes on.

Did you get the Memeorandum?

I was watching the news of the Delicious acquisition over the weekend and I came across a website, mentioned more than a few times, called Memeorandum. Up until today I thought it was spelled MEMOrandum. I was wrong, but it appears to be a not uncommon mistake.

"Memeorandum is like an automated “New York Times” for the Web," according to John Furrier, Founder & CEO, in a recent interview. What's so cool? As Scoble writes in his blog on the topic, "Memeorandum chews through thousands of blogs in minutes and tells you what's important." Currently the paper only has two sheets: politics & technology.

As you've probably already figured out, Memeorandum is a play on the words meme + memorandum. I didn't know what a meme was, so I looked it up:
Meme. As defined by Richard Dawkins in The Selfish Gene (1976): "a unit of cultural transmission, or a unit of imitation." "Examples of memes are tunes, ideas, catch-phrases, clothes fashions, ways of making pots or of building arches. Just as genes propagate themselves in the gene pool by leaping from body to body via sperms or eggs, so memes propagate themselves in the meme pool by leaping from brain to brain via a process which, in the broad sense, can be called imitation."
In this case, the meme is a news story. Memeorandum tracks a news story's viral-like propagation throughout the blogosphere and news media. It tracks and rates stories based on this propagation of ideas.

Memeorandum was founded by Gabe Rivera, a former Intel compiler designer. Clearly, he has some experience designing complex software. According to his blog, he founded Memeorandum on three basic principles:
  1. Recognize the Web as Editor: There's this notion that blogs collectively function as news editor.
  2. Rapidly uncover new sources: Sometimes breaking news is posted to a blog created just to relate that news.
  3. Relate the conversation: Communication on the web naturally tends toward conversation.
The last point is what I find really interesting. The coolest thing about Memeorandum is that it groups stories into headlines and collects the most relevant discussions of that story into a thread that you can navigate like a forum. It's a one-click way to see the repercussions and discussion of a story throughout the "Live Web."

One of the most interesting things about Memeorandum is that it excels at filtering out the noise present in other services like Digg or even Slashdot. It is great for busy people who need to quickly see what is going on on the web and what people are saying. Memeorandum allows you to do this without visiting sites or even plowing through lots of posts in an RSS reader. So how does it select who is shown on the front page?

Gabe has a blog post which is particularly enlightening on the topic. Here is an excerpt:

I'll start with the most common question: how are sources selected for inclusion?

To answer that, I'll begin with my philosophy: I want writers to be selected by their peers. That is, I want the writers in each topic area to select which of their peers show up on the site. Not deliberately, but implicitly, by whom they link to and in what context they link.

The source-picking algorithm is based on this philosophy and works roughly as follows: I feed it a number of sites representative of the topic area I want coverage. It then scans text and follows links to discover a much larger corps of writers within that area.

The decisions for including sources are continually reevaluated, in such a way that new sources can be included in real time. Think about that for a second.
Wow! Not very Web 2.0y, that's for sure. Gabe is saying that there must be hierarchy! Heresy, burn him! In fact, he says that the initial seed sites are ones that he selects on a topic. So much for "radical decentralization and harnessing the collective intelligence." This is centralized authority harnessing the elite collective intelligence. This elite then selects other relevant writers / authors based on their votes -- their links. It sounds to me a bit like hiring by committee at Google, done on a story-by-story basis. It isn't susceptible to spam because, as Gabe says in the PodTech article, "They won’t link to what isn’t relevant because it will spam up their own blogs." In other words, these guys are important, they have a reputation to protect, a bit like 'real journalists', and they are more careful about what and who they link to.
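Gabe doesn't publish the details, so the following is only a guess at the general shape of "seed sites plus links as votes," not his algorithm. It starts from hand-picked seeds and admits a new writer only when enough already-trusted writers link to them. The link graph, the two-vote threshold, and all the names are invented, and it ignores the text scanning and link context that Gabe mentions.

```python
from collections import Counter

# Hypothetical link graph: each writer maps to the writers they link to.
LINKS = {
    "seed-a": ["blog-1", "blog-2"],
    "seed-b": ["blog-1", "blog-3"],
    "blog-1": ["blog-2", "blog-4"],
    "blog-2": ["blog-1"],
    "blog-3": ["spam-1"],
    "spam-1": ["spam-1", "blog-1"],
}


def pick_sources(seeds, links, min_votes=2, max_rounds=5):
    """Grow the source set from the seeds, using links from sources as votes."""
    sources = set(seeds)
    for _ in range(max_rounds):
        votes = Counter()
        for writer in sources:
            # Each trusted writer gets one vote per distinct site they link to.
            for target in set(links.get(writer, [])):
                if target not in sources:
                    votes[target] += 1
        admitted = {site for site, n in votes.items() if n >= min_votes}
        if not admitted:
            break  # nobody new has enough peer support
        sources |= admitted
    return sources


selected = pick_sources(["seed-a", "seed-b"], LINKS)
print(sorted(selected))
```

In this toy graph blog-1 and blog-2 get voted in by their peers, while spam-1 never does: it can link out to everyone it likes, but only one trusted writer ever links to the site that links to it.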

Gabe makes an interesting distinction between Memeorandum and other commercial news sites in the same PodTech interview:
One way that you should look at this different to the NY Times or Cnet news is that - it’s open to you! If you have something to say on a story and if you’re a blogger may get a link with ease, but if you don’t show your work to someone else and get them to link to you and you may find that you’ll be added in minutes.
If you haven't read the PodTech interview yet, check it out!

Another site that is similar, but limited strictly to blog sources, is Blogniscient. In my opinion, Memeorandum is a lot more interesting!

Very smart guy, very cool technology. I'm not the only one who is smitten. The folks over at Tech Crunch wrote about it here, here, and again here! We also agree -- we both prefer Memeorandum over Blogniscient.

Alexa Web Search Platform: IBM WebFountain 2.0

John Battelle and others are reporting about the launch of the Alexa Web Platform.

Is it economically feasible for a new vertical search engine to build its own web crawler and build a multi-terabyte data storage system? This presents a sizable barrier to entry into the vertical search arena. This is one of the main reasons there are still so few vertical search engines. Alexa hopes to change that by offering a hosted platform for companies and users to create their own custom search engines, or perhaps just get some meta-data.

From Alexa's website:
The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools. Alexa provides compute and storage resources that allow users to quickly process and store large amounts of web data. Users can view the results of their processes interactively, transfer the results to their home machine, or publish them as a new web service.
My initial question was: How did they get around the legal problems associated with this? After all, they are essentially charging for access to my and other users' copyrighted content.

They got around this the same way IBM did with WebFountain. I had the opportunity to talk to the head of IBM's WebFountain project at a search engine conference earlier in the year. One of the reasons that WebFountain was never a runaway hit was that they couldn't provide direct access to the content. The reason the IBM employee gave was that it wasn't legal to charge for access to others' copyrighted works. Imagine me charging you to download all of the Star Trek episodes as a service off my website. IBM got around this by providing access to a derivative work: metadata. WebFountain aggregated the knowledge of the web to create a new product that they could sell. IBM would mine the web for you and provide answers to your business intelligence questions. However, IBM had to write the software to run on its cluster. You paid per question, and each one was expensive because it required custom programming.

The Alexa platform is the next evolution of this business model. I call it WebFountain 2.0. Instead of approaching IBM and asking them to design a program to answer your question, now you can create your own program and have Amazon, err Alexa, run it.

So, what exactly is the platform they are offering? According to the FAQ:
This store contains both the raw document content and the metadata extracted by Alexa's internal processes. All Platform users have access to the data in this store... Alexa maintains three Web snapshots in the Data Store. Each Web snapshot represents two months of crawling activity. Each snapshot contains about 100 Terabytes of uncompressed data so at any time, the Data Store contains 200 - 300 Terabytes of data.
In other words, they will give you access to run programs against their 5-billion-page web crawl. The Alexa Web Platform allows you to run code on their cluster to process web data. At the end you can download your results (metadata) or you can publish your own private search engine hosted on their cluster. I don't think you can actually download the content directly from the repository.

The pricing model is pretty simple and straightforward: you pay for CPU time, bandwidth, and storage space.

Two things are innovative about the Alexa platform compared with WebFountain: 1) the ability to write your own code against the system, and 2) the end product can be a private / custom search engine instead of just some meta-data.

We'll see what happens!

Saturday, December 10

Yahoo Having a Wiki Good Time

With all the major buzz over Yahoo's acquisition of social sites, I thought it would be interesting to look at one that Yahoo hasn't acquired yet -- Wikimedia. I guess it's not surprising, but what I found was that the two are already collaborating.

Back in April, Yahoo and Wikimedia announced that Yahoo would be donating server and bandwidth resources in Asia. You can read the full press release on Wikimedia's website. According to the press release, "Yahoo! is one of the Wikimedia Foundation's earliest corporate supporters."

At the time, Jimmy Wales, the creator of Wikipedia, was a guest blogger on Yahoo's search blog. He clearly says that Yahoo's relationship is purely "charitable." Jimmy says in his post:
As our relationship with Yahoo has grown over the past year, we began to talk about other ways that Yahoo could help us. One theme that made sense for both of us was to think about Yahoo's global reach and Wikipedia's global goals.
In what ways are Yahoo and Wikimedia going to collaborate on global expansion? What are other ways that Yahoo will help Wikimedia? I wonder how this relationship will evolve.

Yahoo wants Wikipedia to succeed because it has a lot at stake in this new social movement -- Yahoo has a large amount of capital, tens of millions, invested in seeing it succeed. Wikipedia's continued prosperity and growth lends credence to Yahoo's new community initiatives.

More thoughts on this to come...

Friday, December 9

Yahoo acquires Delicious -- The $30 Mln Firefox extension

The news broke this afternoon on Tech Crunch, Yahoo's Search Blog, the Delicious Blog, Jeremy's blog, Om -- and then all over the place -- that Yahoo has purchased Delicious.

According to Battelle, this didn't come cheap. He says that, according to his sources, the deal is pegged at around $30-35 million. There are other rumors that the figure was closer to $40 million. Update: Om and others estimate this to probably be lower. Personally, if I had to wager a guess, it would be something on the order of $15-20 million. My estimate is in line with Umair's, which is based on a street value in the range of $10 million and the supposition that Delicious could command a premium well over and above that.

Yahoo was acquiring the one thing My Web 2.0 lacked -- a good Firefox extension. The Firefox extension converted at least one user, Nick Wilson, who defected back to Delicious because of this dearth in My Web. Is Delicious' user community and tag meta-data worth $10-40 million? How many Firefox extensions did they buy for this chunk of change? Now, in all seriousness, if not for the Firefox extension, Why Did Yahoo Really Buy Delicious?

The rationale for this acquisition is obvious: the community, of course! It's not about acquiring cool technology. It's about increasing the value of My Web 2.0 through the network effect, which holds that the perceived value of a service depends on the number of users already using it. See more about the network effect and Metcalfe's law on Wikipedia.

One thing to note is that Delicious' traffic curve is going in the right direction. According to Alexa, its reach has tripled since October!
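Metcalfe's law says the value of a network grows roughly with the square of its users, because that is how the number of possible user-to-user connections grows. A quick sketch, with the user counts chosen purely for illustration:

```python
def possible_connections(users):
    """Distinct user-to-user pairs in a network of the given size."""
    return users * (users - 1) // 2


# Illustrative sizes only: what doubling a community does to the pair count.
small = possible_connections(300_000)
doubled = possible_connections(600_000)
print(small, doubled, doubled / small)
```

Doubling the community roughly quadruples the number of possible connections. That is the argument for buying users rather than waiting to grow them, to the extent that you believe the value really scales that way.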

Update: BusinessWeek and others estimate the size of Delicious' user community to be 300,000 users. Depending on who you believe, this means that Yahoo paid between $30 and $100 for every user's bookmarks. Maybe I should try auctioning my bookmarks on eBay to pay for some Christmas gifts. Still, three hundred thousand users is a drop in the proverbial bucket to someone as big as Yahoo. However, I would wager that Yahoo's My Web 2.0 doesn't have the same exponential traffic growth curve.

Let me try and provide an inside look at a hypothetical meeting at Yahoo HQ. Sitting around the table are Terry Semel, Jeremy Zawodny, Jeff Weiner, Jan Pedersen, and other Yahoo execs. Terry asks, "So, how long for My Web 2.0 to reach critical mass?" The project manager gives a presentation with several different projections based on varying marketing spends, etc., but the bottom line is that it is still a ways out there -- he estimates 1-2 years and millions of dollars. The next big question Terry asks is, "How can you make this go faster?" The PM replies, "Well, if we buy a community like Delicious, it will cut the time in half." They talk it over and then someone asks, "What would it cost us if that other company purchases Delicious and uses it to start their own community initiative?" Ouch. They look at each other for a minute and Terry breaks the silence: "Buy it."
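The per-user figure is just the rumored price divided by the estimated user count. Here is that division for the price points floated in this post; the function name is made up.

```python
USERS = 300_000  # BusinessWeek's estimate, as quoted above


def price_per_user(price_millions, users=USERS):
    """Dollars paid per user for a given purchase price in millions."""
    return price_millions * 1_000_000 / users


# Price points mentioned in this post, in millions of dollars.
for price in (10, 15, 20, 30, 35, 40):
    print(f"${price}M -> ${price_per_user(price):.0f} per user")
```

A $10 million price works out to about $33 per user and $30 million to $100. The $40 million rumor would push it past $130.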

The Yahoo leadership recognizes that Delicious in and of itself may not be worth $30-40 million, but being the first to reach critical mass with the "community thing" is critical. As it stands, Yahoo can use Delicious' amazing growth rate and existing user base to accelerate the growth of My Web 2.0 and take an early lead in the market. If I were them, I would hope that the time it takes to grow a community would provide a barrier to entry for competitors.

Overall, the acquisition isn't surprising, but I can honestly say that it still caught me a bit off guard. I expected Yahoo to try and do it themselves; after all, they had the technology and the marketing money. As I said earlier, I think it really comes down to time and expertise. Joshua and the Delicious users can help Yahoo accomplish their goal faster.

Let's take a moment to compare this with the Flickr purchase. More than the traffic differences, the biggest difference I see is in the audience of the two sites. The Delicious crowd is really super geeky -- look at the "Popular" pages and you can see what I mean. Flickr appeals to a wider audience that is more in line with Yahoo's audience. Perhaps Yahoo can bring Delicious to a wider crowd, but we'll have to see what happens to it. Nick Wilson has a really catchy quote in his post on the acquisition that I wish I had thought of:
So What Happens as Delicious Leaves the Geekerati and Joins the Mainstream?
This acquisition will surely alienate some cyber geeks in the tech community. If you check out the comments on Digg, many are not happy and accuse Delicious of "selling out."

Ho John Lee posted some very interesting ideas in his response to the news. Here is what he had to say that I found the most interesting:
Tagged bookmarking sites such as [del.icio.us] can provide a rich source of input data for developing contextual and topical search. The early adopters that have used [del.icio.us] up to this point are unlikely to bookmark spam or very uninteresting pages, and the aggregate set of bookmarks and tags is likely to expose clustering of links and related tags which can be used to refine search results by improving estimates of user intent. Individuals are becoming their own search engine in a very personal, narrow way, which could be coupled to general purpose search engines such as Yahoo or Google.
Think of millions of users bookmarking sites. The early adopters might not bookmark spam, but will a wider audience? What about all of the SEOs who realize that creating accounts and bookmarking pages gets them more traffic in the context of a larger Yahoo audience? Finding interesting relationships in the user data is a veritable mountain of gold. The question is will this gold tarnish as it grows?

All of these acquisitions present Yahoo with some really cool properties, but also some interesting problems. How is Yahoo going to integrate Delicious and Flickr with My Web 2.0? How do you keep the fans happy while integrating all of these pieces into the bigger platform?

In a broader context, this acquisition will likely have ripple effects throughout the Web 2.0 community. Will it be a boon or a bane? If Yahoo is smart, they will provide ways for new services to leverage the Web 2.0 / Delicious platform to layer services on top of it. Delicious' lavish reward should also spur growth in the number of "Web 2.0" startups that try to jump on the bandwagon. If some of them become successful, will Yahoo continue to interoperate with them, squash them, or buy them too?

I think this acquisition is a good thing for both Yahoo and Delicious. It's a win-win. Yahoo gets a lot of users and tag content to bootstrap their platform. Delicious gets cash, but it also gets the resources to take their business to a whole new level inside the Yahoo network.

My congratulations to Joshua, Fred, and the rest of the Delicious team.

Update: Greg also raises a lot of the same points as I do, albeit a bit more eloquently and succinctly. He also raises some even more interesting questions: Did they sell too soon? If this whole thing works out, Delicious would have been in a great position to do better later on. A lot of search companies back in the dot-com boom were bought up by large media companies only to be neglected and later abandoned. If one of those had stuck it out, it might have been the next Google. Another interesting point he makes is that this acquisition might lead to the perception that Web 2.0 companies are "in it to flip it." What are the long term business models?

Monday, December 5

RawSugar: the first tag-based web search engine

A colleague asked the question: What lies beyond link text when it comes to search engine relevance? One possible solution is tagging. John Battelle recently posted on his blog: Will Tagging Work? I have started to think about this in a web search context and I'm not sure I have any answers, but here is at least an introduction...

There are some that think the "next big thing" is tagging. This is all part of the "Web 2.0" way of doing things where users generate content. The most famous examples of these models are Wikipedia, Flickr, and Delicious.

The question is, can this be extended to the web as a whole? Search engines crave high quality meta-data about web pages. First, they use sophisticated computer algorithms, like clustering, to derive meta-data. However, sometimes humans can provide more insightful data. Users can generate this data explicitly by tagging URLs directly, or implicitly through some by-product of using the service, even by playing a game. One of the coolest examples I have seen of this type of system is the ESP Game. The ESP Game is an attempt by CMU researchers to get users to label image data. In fact it is entitled "The ESP Game: Labeling the web". The incentive is very compelling -- addictive fun.

One group trying to build a high quality social network-tag-based search engine is RawSugar. There is an interesting interview with its founder over on Free Internet Radio (Thanks to's Weblog). At first glance RawSugar may appear to be another Delicious rip-off. However, it is more than a social bookmarking platform -- it is the first real tag-based social search engine. A RawSugar employee provides a good description of this differentiation over on TechCrunch:

Most importantly, our search is not the same as [del.icio.us] and most (though not all) of the other sites in the tagging space–we search the tags, notes and full text of pages saved into our system while, at least for now, [del.icio.us] only searches tags and, i think, notes.
RawSugar is angel funded with about ten engineers working on the engine. They have just made some very interesting service upgrades; check out their blog for details. According to a recent interview with CEO Ben-Shachar they are using an interesting mix of technologies, including PostgreSQL and Lucene. Lucene is an Apache project -- a very popular open source indexing library, in Java and other languages.

Right now I would say RawSugar is more of an experiment than anything else -- it only has about 135,000 pages indexed (based on stop word tests my estimate is about 170k) and an undisclosed number of users. If it can scale and attract a sizable user base it could be something to watch. At the very least, it is an experiment to learn from.

Rollyo is another search engine using a more implicit approach to tagging. It allows people to create their "own custom verticals" by performing restricted searches across a collection of sites organized into a "Roll". One of the by-products of creating a roll is a human-created cluster of sites organized under an informative title and keywords. One of the biggest questions I have about Rollyo is: Can it scale? Users are currently limited to 20 sites in a roll and you can only search one roll at a time. Is being able to restrict a keyword search to a list of websites enough incentive to use the service? I'm not convinced -- I think there is a lot of potential, but will it catch on? What compelling new features does it offer to get people to switch?
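To make the idea of a "restricted search" concrete, here is a minimal Python sketch of how a roll could be turned into a single keyword query using the common `site:` operator. This is purely my assumption about one way to do it, not a description of how Rollyo actually works; `roll_query` and `MAX_SITES` are hypothetical names.

```python
MAX_SITES = 20  # the per-roll limit mentioned above


def roll_query(keywords: str, sites: list[str]) -> str:
    """Build one query that restricts `keywords` to the sites in a roll."""
    if len(sites) > MAX_SITES:
        raise ValueError(f"a roll holds at most {MAX_SITES} sites")
    if not sites:
        return keywords
    # Join the sites with OR so a match on any site in the roll counts.
    restriction = " OR ".join(f"site:{site}" for site in sites)
    return f"{keywords} ({restriction})"
```

For example, `roll_query("hiking boots", ["a.com", "b.org"])` returns `hiking boots (site:a.com OR site:b.org)`.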

The question that these and others are trying to answer is: How can search engines get users to tag web pages with usable content as a by-product of their daily surfing? What incentives motivate users to provide reliable and useful tags? And lastly, how can search engines handle spam in these tagging systems?

To sum it all up, I'm not sure if tagging will work. Right now I have more questions than answers -- and the questions are still fairly nebulous. I hope to refine these questions when I attend the WWW 2006 conference and hopefully attend the Collaborative Web Tagging workshop on May 21st. RawSugar, Yahoo, and other major players will be taking part, so I have high hopes for an interesting discussion. More on the confirmed speakers at the RawSugar Blog...

More reading:
Social Consequences of social tagging
There is also a paper available via the ACM on the ESP game:
"Labeling images with a computer game"

Become.com: a search engine for the holidays

Search Engine Watch recently ran an article on what's new for the 2005 shopping season. One important site that I think the author missed is Become.com. Incidentally, SEWatch does not provide a means to comment on this story, which is somewhat frustrating.

Become.com is a vertical search engine specialized in shopping. The growth of a shopping vertical isn't that surprising considering the explosion of e-commerce on the web, but there is lots of competition, especially from new services like Froogle (which, on a side note, just started offering geo-targeted results) and Yahoo Shopping. Become.com has some experience from its start-up veteran founders: Michael Yang and Yeogirl Yun. They founded the comparison shopping site MySimon and sold it at the height of the dot-com boom to CNet for some serious cash -- smart people. If two people are going to take on shopping, at least they have some experience with comparison shopping -- but this time they decided to take their idea to the next level and combine research with comparison shopping.

Become.com has two modes -- shopping and research. As soon as I saw this it immediately said to me: Yahoo! Mindset. However, instead of having a slider that dynamically filters results, they have created something much simpler. KISS -- not to mention that we CS geeks like binary choices. Interestingly enough, Globalspec also has the same two modes, although not as explicitly defined: research (The Engineering Web) and product search (SpecSearch).

The research side of Become is a 3+ billion page web index, purportedly emphasizing review and informational sites. The shopping mode is very much like MySimon, finding products from store feeds. I think the best new "cool" feature they have is the ability to start doing other types of Faceted Search, allowing filtering by features beyond price, on things like brand name. It would be cool if they could extend this to something more powerful -- like SpecSearch for consumers -- which would allow me to perform very precise product searches based on precise specifications. However, I may not be the typical user and perhaps a text search is good enough for most people.
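For anyone who hasn't run into faceted search before, here is a minimal Python sketch of the idea: every product carries structured attributes, and each facet the user picks narrows the result set. The catalog, the `Product` fields, and the function names are all made up for illustration; this is not how Become or SpecSearch is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    brand: str
    price: float
    megapixels: float


def facet_filter(products, brand=None, max_price=None, min_megapixels=None):
    """Keep only products matching every facet that was supplied."""
    results = []
    for p in products:
        if brand is not None and p.brand != brand:
            continue
        if max_price is not None and p.price > max_price:
            continue
        if min_megapixels is not None and p.megapixels < min_megapixels:
            continue
        results.append(p)
    return results


def facet_counts(products):
    """Count products per brand, as a faceted UI might show next to each choice."""
    counts = {}
    for p in products:
        counts[p.brand] = counts.get(p.brand, 0) + 1
    return counts


# Made-up catalog for illustration only.
CATALOG = [
    Product("Camera A", "Acme", 199.0, 5.0),
    Product("Camera B", "Acme", 349.0, 8.0),
    Product("Camera C", "Zenith", 279.0, 7.1),
]
```

Filtering by brand is the simple case; the "precise specifications" version is the same loop with more numeric facets like `min_megapixels`.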

I am going to be keeping my eye on Become. They have been hiring people with search engine expertise to enter the market. One thing the founders of Become.com have is chutzpah -- it takes guts to go head to head against Google's main product offerings. However, Become.com seems like it is heading in the right direction by assembling an experienced leadership team culled from veterans of eBay, AltaVista, Overture, Yahoo, and Sun. For example, they recently hired Jon Glick as "Senior Director of Product Search and Comparison Shopping" (interesting job title, what does he really do?). From their press release:
Glick joins from Yahoo! (NASDAQ: YHOO) where he headed Product Management for Yahoo!'s web crawling, indexing, search relevancy, and assisted search initiatives. He was an instrumental part of the team that launched Yahoo!'s in-house web search in 2004, displacing Google (NASDAQ: GOOG).
While I was working on some side programming projects (more on these soon) I made the jump to Java 1.5. While perusing Sun's site, I ran across a very interesting article about Become.com and their creation of a large scale web crawler using Java 1.5. I had never heard of Become.com before, but the article was very informative and impressed me. From the article:
The company has successfully created a Java technology web crawler that may be the most sophisticated, massively scaled Java technology application in existence, obtaining information on over 3 billion web pages and writing well over 8 terabytes of data (and growing) on 30 fully distributed servers in seven days.
I honestly don't believe their crawler is the most sophisticated massively scaled Java technology app in existence, but I won't start a rant on it. I would highly recommend the Sun article to everyone interested in Java 1.5 or web crawlers. Interestingly enough, I believe the Internet Archive is also using Java 1.5 for their Heritrix crawler on an AMD Opteron platform... but that's also another story. The real story is that this article prompted me to check out Become.com, and I thought it was especially relevant to share because of the holiday season.

It looks like they are serious, and they have the cash and the courage to do it. For now, I am going to go give it another go as I do my Christmas shopping.

Further reading:
Silicon Beat's coverage of the new SE from Feb 05.
SE Watch's Coverage of the April launch
Geeking with Greg on Yahoo! MindSet

Wednesday, November 30

Weirdness in Google search result dates

Here is a query that illustrates an interesting phenomenon in Google's search result dates:

Try the query:

The second result is for the url: - 65k - Nov 28, 2005. However, if you look at the cached copy it says the page was retrieved on November 26th.

Now, I noticed that the leading story of the page is "The race is on". So I try my second query: "the race is on". The fourth result is for the same url: - 63k.

Quite interesting -- the size of the page is 2k smaller and the date is not displayed. These results are different, despite the url being exactly the same.

It is also interesting that Google seems to only show dates next to content that is very recent; in these results, that means anything purportedly indexed in the past two days. Therefore, it appears that when I execute the second query Google recognizes the indexed date corresponds with the cached copy date of Nov 26th because it does not display the date.

The question is, why are these two queries inconsistent with the date of the content?

One hypothesis that I have is that the first site: query without a query term returned a generic summary that was generated from different content than the second query. Perhaps the first summary was generated from data stored in the recently updated index? The second query contained a term and therefore required a dynamic snippet to be generated, perhaps from the same server that generated the cached copy? This would indicate that there is a significant amount of lag in the latest copy of content being propagated to cached content servers, at least in some cases.

Anyone else have a theory they would like to share?

Friday, November 18

The Google Strategic Server Force

In its Cold War with Microsoft, Google is readying a new weapon: The Google Strategic Server Force (GSSF). This new elite mobile strike force is emerging as a main component in Google's strategic arsenal. The Government reports that Google is readying the first deployment of the G-36M series mobile data centers and predicts that they will be online in time for Santa.

The new arsenal is primarily targeted at The Enemy's primary capital, Redmond. However, other targets include Walmart, eBay, and the Association of American Publishers, whose recent refusal to capitulate to Google's order to turn over all books and databases for indexing as part of Google Base and Books has resulted in the escalation of tensions between the major superpowers.

The new mobile data centers are reportedly running Google OS 3.7M Cheetah with its new autonomic load balancing and data redundancy features courtesy of GFS II.x. Reportedly, the tractor trailer based centers are semi-autonomous agents based on K.I.T.T., whose prototype was designed by Mr. Norvig himself and utilizes the latest AI research. The previous model, the G-33M, was widely successful and recently conquered the Government's toughest test. Google reportedly has 10-20 of these new data centers in its underground garage-bunkers, each with an effective range of 1,000 miles. Yahoo CEO Terry Semel described the level of this new threat:
We're talking about 5000 Opteron processors and 3.5 petabytes of disk storage that can be dropped-off overnight by a tractor-trailer rig. The idea is to plant one of these puppies anywhere Google owns access to fiber, basically turning the entire Internet into a giant processing and storage grid.
The new GSSF is under the direct authority of the Google Supreme High Command. The control of the troops is effected directly by the Supreme Commander in Chief through the central command headquarters of the General Staff and the main headquarters of the GSSF, using a multi-level extended network of command posts operating in alert-duty mode.

Bill Gates and other world internet leaders condemned this new technology and warned Google that this new threat could escalate the already tense conflict. They called on Google to destroy its weapons of mobile information, to cease development of all such weapons, and to stop support for open source terrorist threats.

In another recent development, John Battelle and Bill O'Reilly returned from negotiations in Munich where they reportedly negotiated a partial disarmament of the WMIs. In a recent presentation before the W3C security council and Microsoft CEO, Steve Ballmer, they announced "We believe it is peace for our time."

In their presentation to the council Battelle and O'Reilly revealed some of the terms of the agreement, including Term 6:
The final determination of the mobile data center based Wi-Fi frontiers will be carried out by the international commission. The commission will also be entitled to recommend to the four Powers Microsoft, Yahoo, IBM, Ask and Looksmart, in certain exceptional cases, minor modifications in the strictly ethnographical determination of the zones which are to be transferred without plebiscite.
Not all of the terms of the agreement were revealed, but undisclosed sources reported that several concessions were made to appease Brin, Page, and Company. According to these sources, the collective databases from leading publisher Reed Elsevier will be ceded to Google. Also under the terms, Google will acquire Scirus and integrate it into Google Scholar.

In an attempt to assuage public concern over its recent aggression into Google today announced it would be using its G-36M and Strategic Server Force (GSSF) to provide free Wi-Fi to the city of Mountain View. Whether the strategy will prove effective in swaying public opinion remains to be determined, but first signs appear optimistic.

Ongoing coverage of this breaking news story:
Google data centers and dark fiber connections
Google Announces Plan to Destroy All Information It Can't Index

Thursday, November 17

Evolution of technological progress through queries

There is a new paper up on Google's research website:

Interpreting the Data: Parallel Analysis with Sawzall (Draft)
Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on.

Go to the site for the full paper and abstract. I'll read it later today. It's on the same list with GFS and MapReduce, so I hope it lives up to the same standard.
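To get a feel for the "simple, easily distributed computations" the abstract mentions, here is a minimal Python sketch of the filter-then-aggregate pattern: each shard of records is processed independently, and the partial results are merged afterwards. This is not Sawzall and not Google's implementation; the log format, the shard data, and the function names are all assumptions of mine.

```python
from collections import Counter


def process_shard(log_lines):
    """Filter and aggregate one shard of records independently."""
    counts = Counter()
    for line in log_lines:
        country, _, query = line.partition("\t")
        if not query:          # filtering: skip malformed or empty records
            continue
        counts[country] += 1   # aggregation: queries per country
    return counts


def merge(partials):
    """Combine per-shard results; shard order does not matter for a sum."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical log shards, one per machine, in an invented
# "country<TAB>query" format.
SHARDS = [
    ["US\tdigital camera", "CN\tmp3 player", "broken line"],
    ["US\tcheap flights", "BR\tfutebol", "US\t"],
]

totals = merge(process_shard(shard) for shard in SHARDS)
```

Because no shard needs to see another shard's records, the per-shard step can run on as many machines as you have, which is the property that makes this style of analysis scale.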

The coolest thing in this so far is the movie showing the distribution of requests to Google's servers over the course of a day.

This is very interesting. Can you track the technical progress of a society (sorry for waxing philosophical like John Battelle) by the volume and type of queries executed? It would be interesting if we could track this over the course of years to see the growth of technology in rapidly emerging countries like China and parts of South America.

So now we have the volume distribution, but can we mine trends at a global level? For example, commercial queries have eclipsed sex related queries in North America. Will this trend repeat itself in Europe? Fascinating.

Thanks to Digg for the tip.

Sitemap statistics are not like a bikini...

"Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital." - Aaron Levenstein. At least in this case, I think the Google Sitemaps statistics reveal more vital information than they hide. Maybe statistics aren't so evil after all...

Google announced today on its official blog and on the Sitemaps Blog that it is going to provide more statistics to webmasters via the sitemap service.

What's even cooler is that the Google Sitemaps blog reports that you can get site indexing statistics even if you don't have a sitemap! Now, if only it was integrated better with Google Analytics (If you missed it, here is the official blog post on it being free).

What's really awesome is the ability it gives you to fix problems on your site. The statistics show the fetch details for every page in the Sitemap. In my opinion the two most interesting are the HTTP request details and the crawled date for individual pages. Did half your pages drop out of Google because one of your important pages 404ed? Was your site down when Google tried to crawl it? Now at least you are more empowered to do something about the problem. To my knowledge no other SE is providing this level of transparency with their crawling -- Globalspec, MSN, Yahoo, nobody.

I think it would be cool if there was a way you could suggest that Google retry crawling errored pages. When there was a 404 or some sort of logic error on your site, you could see it, fix it, and tell Google so that they can re-crawl it. I suppose if Google crawls you very frequently, this may not be a big issue, but if major portions of your site errored out repeatedly and dropped out of the index this could be devastating to a business that gets a lot of traffic from search engines (most do), especially small retailers in the holiday season!
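Until something like that exists, you can approximate the check yourself by walking your own sitemap and looking for pages that don't return 200. Below is a minimal Python sketch under a few assumptions of mine: the sitemap uses `<loc>` elements, and `fetch_status` is a hypothetical callable you supply (it is injected so the example runs without a network). The sample sitemap and the canned responses are invented.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(sitemap_xml: str):
    """Return every <loc> value in a sitemap, ignoring the XML namespace."""
    root = ET.fromstring(sitemap_xml)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def crawl_errors(sitemap_xml: str, fetch_status):
    """Map each URL whose status is not 200 to that status.

    `fetch_status` is any callable taking a URL and returning an HTTP
    status code; a real one would issue an HTTP request.
    """
    errors = {}
    for url in sitemap_urls(sitemap_xml):
        status = fetch_status(url)
        if status != 200:
            errors[url] = status
    return errors


# Hypothetical sitemap and canned responses, for illustration only.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/products</loc></url>
  <url><loc>http://example.com/old-page</loc></url>
</urlset>"""

CANNED = {"http://example.com/old-page": 404}
report = crawl_errors(SITEMAP, lambda url: CANNED.get(url, 200))
```

Here `report` flags only the one page that 404s, which is exactly the page you would want to fix and have re-crawled.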

Now here is an interesting experiment: Add a new page to my site (and sitemap) and then monitor when it appears in the Google index. Then, compare the index date with the crawl date. What is the delay between crawling and appearance in the search index? Just how fast can Google get content that is crawled into its live search index?

The extra value provided by these sitemaps statistics is very smart because it is a very compelling incentive for webmasters to sign up for Sitemaps and also to spread adoption of the Sitemap format (it's still just a Google thing, after all).

The problem, for me at least on Blogger, is that you need to "Verify" site ownership by placing a file in your root directory so that Google can fetch it. What sucks: no support for Blogger sites. And I'm not the only one on the Google Sitemaps group who thinks this SuXors. It's ironic that the Google Sitemaps blog is on Blogspot and yet I have no way of verifying with Google that I own this blog on Blogspot. Another step on the way to setting up my own web server and WordPress.

Wednesday, November 16

Excuse me, I believe Google has my stapler...

We knew it was coming, but somebody at Google was working late last night (until at least 9:51pm PST). It's official, Google Base is online.

So the jokes in the geek world of course will include the obligatory: All your base are belong to Google. In fact, I bet that's what they would have titled the press release if the developers could have written it and not been forced to un-geekify it.

It looks like Sergey and Larry have finally eclipsed one of their advisors, Mr. Hector Garcia-Molina, in their attempt to gain supreme power.

The Google Blog has been updated with a post on Google Base.

I'm sure there is going to be a flurry of buzz over this one. I for one think this is a really cool idea. For example, I can just search recipes, job postings, anything.

At the risk of sounding stupid, why is it called Google Base?

It's really cool, I can create any arbitrary object with a collection of attributes, description, keywords (*cough* tags *cough*), and possibly pictures.

I think I'll put my collection of Red Swingline staplers online.

Excuse me Mr Brin, I believe you have my stapler...

Tuesday, November 15

How Italian hotels and villas need to get hip to SEO

One thing to ponder is popularity vs. authority on a subject.

Let me show you an example. Try a search on Google for "jeff's search cafe" with the quotes so it is a phrase search. There are only 49 matches in Google's index; clearly this is not a popular or ambiguous query -- you are searching for this site (or searching for my non-existent real life cafe).

So, what is the first result? Findory's link to my feed! Second result, here I am. Popularity and Authority. No doubt, Findory has a higher PageRank than my pathetic site on Blogger.

I have noticed this more and more recently. I have been doing a lot of travel research for my honeymoon (next May is coming too soon!) and I've been exploring hotels and cities and things. What I find quite often is that hotels and cities in Europe (Italy, France, Greece, etc.) obviously don't know much about SEO! Many of the websites for these places are one of two types: fancy art deco Flash that looks very expensive but lacks any substantive content, or a quick mom & pop type homepage with simple information and maybe a couple of pictures, if I am lucky. Neither ranks well in search engines.

So what do I see most often? I see the travel sites that review those hotels, like Yahoo Travel, TripAdvisor, etc. My favorite is TripAdvisor, which I actually find quite useful for its fantastic user community. It would be great if TripAdvisor linked to the hotel website, but it doesn't! In fact most of these types of travel guides / sellers do not. It is very frustrating sometimes.

These are two examples where link popularity breaks down. First my blog -- I can't compete with Findory in link quantity or quality. In the travel / hotel instance the hotels are in a similar position, competing for link text with major sites like TripAdvisor, Fodors, and their peers. Many are definitely borderline spammish.

How are search engines dealing with this problem? Good question. I know there was some discussion a while back about TrustRank. Teoma / Clusty try to help with clustering and refinements. (See also: DiscoWeb rank based roughly on Kleinberg's HITS algorithm). I'll think on this some more later -- it's time for some sleep. There must be something better we can come up with.

For now it is just an interesting lesson (and frustrating as I try to plan my honeymoon!). On a side note, I have this intuition that Google feels "spammier", perhaps what I mean is much more commercial, for travel searches than some other types of searches I generally run.

Information retrieval (IR) and NLP resources continued

So I guess it's been a busy day for me posting. I tend to go in spurts, really.

Here are some resources I wanted to make sure people knew about:

I found something pretty cool today, an online draft of Introduction to Information Retrieval by Manning, Raghavan, and Schütze. Two of them are the same authors as the Foundations of Statistical NLP I recommended yesterday. According to the site the book is scheduled to be published in 2007. It looks like they have drafts of about half the chapters online right now. Very interesting reading from what I see so far.

For IR practitioners I ran across a relatively new book:
Information Retrieval : Algorithms and Heuristics by Grossman and Frieder (2004).

I mentioned yesterday the CS276a course on IR. Well, there is also CS276b on Text Mining, which is very relevant to what we do here at GS (as well as the other SEs) and which people shouldn't miss out on.

And lastly if you want yet more resources there is always

There is some fodder for a future post in Lecture 7 of CS276B, specifically, slide 3 the "Evolution of Search Engines". Stay tuned.

So much to read, so little time!

Human Vs. Computer RTS Game

Yahoo (Human) vs. Google (Computer) Real Time Search Game. Who will win?

Here is an interesting article in Business 2.0 about Flickr's acquisition and how Yahoo is betting on social networks, tagging, etc...

That upstart in neighboring Mountain View may have a better reputation for search, it may dominate online advertising, and it may always win when it comes to machines and math. But Yahoo has 191 million registered users. What would happen if it could form deep, lasting, Flickr-like bonds with them -- and get them to apply tags not just to photos, but to the entire Web?
So, just how do you get people involved in tagging the web? What does that look like? Perhaps Rollyo is a start?

The real question of the day is: is Google playing on easy, average, or super ultimo death mode?

You know, I bet those geeks at Google/Yahoo have some killer private HLF2 Counter-Strike / Quake III servers. What DO you do with a few thousand servers and almost unlimited bandwidth? Hmm... really now. Next up: Yahoo's social networking / bookmarking / life MMORPG where you play as yourself. ;-). Imagine the possibilities.

Any Googlers / Yahooers / MSNers / etc.. want to chime in? ... not that they actually read this, but still! Any search engine death matches going on after hours?

Kudos to SearchBlog for the tip and Westwood Studios (EA) / Sony / Id for the idea.

Google Research from 2000

Google did some research in 2000 where they spidered the top 100 pages for popular queries. They then compiled a variety of interesting metrics about the pages (such as document size, url length, average title length, file types, etc.).
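If someone wanted to repeat this kind of measurement, the per-page metrics are easy to compute. Here is a minimal Python sketch that assumes you already have the (url, html) pairs in hand; it is my own illustration, not the method used in the original study, and names like `page_metrics` are hypothetical.

```python
from html.parser import HTMLParser
from statistics import mean


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_metrics(url: str, html: str) -> dict:
    """Compute a few simple size metrics for one page."""
    parser = TitleParser()
    parser.feed(html)
    return {
        "url_length": len(url),
        "doc_size_bytes": len(html.encode("utf-8")),
        "title_length": len(parser.title.strip()),
    }


def average_metrics(pages):
    """Average each metric over an iterable of (url, html) pairs."""
    rows = [page_metrics(url, html) for url, html in pages]
    return {key: mean(row[key] for row in rows) for key in rows[0]}
```

Running `average_metrics` over the top results for a set of queries, once on a general engine and once on a vertical, would give a first cut at the comparison.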

The research was published as part of the Search Engine World Quarterly (Q3 2000).

Anyone know of follow-up research? The article mentioned a phase 2 where they used the same data / methods but analyzed links. I'll try to track that down, but somehow I have doubts about finding it.

It might be interesting (and feasible) to update this study. It would be even more interesting to look at this on Globalspec and other verticals to see if it holds in a domain centric search engine. Do the same properties hold over time and in different types of web pages (i.e. blogs or engineering pages)?

Monday, November 14

Reading List on NLP / Information Retrieval

So I ran across some reviews by Peter Norvig on Amazon. On a side note, judging by Peter's wish list it looks like he's really getting into photography.

Some of the books caught my attention and I'm going to dig into some of them in more detail when I get the chance.

First some blatant ripping of selections from his list:

Statistical Learning and NLP

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Jurafsky & Martin. Peter says this book is a good general NLP / Theory book. Foundations of Statistical Natural Language Processing is more focused on algorithms.
Foundations of Statistical Natural Language Processing by Chris Manning from Stanford
Neural Networks for Pattern Recognition by Christopher Bishop

Information Retrieval (IR)
Managing Gigabytes by Witten, Moffat, and Bell. Definitely a must read in the genre!

-- and now for my appendix to Peter's very nice list:
Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
Mining the Web by Chakrabarti

The interested reader should also refer to: (and its newer sibling)

Other Miscellaneous Books / Proceedings
Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization by Jackson and Moulinier

I think I might also give one of the others from Peter's list a try:
Selected Papers on Computer Science (Csli Lecture Notes) looks very interesting to me.

Also I had a friend point this one out to me which I'm sure I will find stimulating:
Survey of Text Mining : Clustering, Classification, and Retrieval by Michael Berry.

Did I miss anything? Anyone want to add some? More importantly, it looks like I need a crash course in statistics, one thing they didn't teach us at Union -- does anyone know any good books?

Google Spell-Check Snafu


Some of you may have noticed that the HTML in my last post was slightly fuzzled for a few days. I started using the Google spell check that is built into the toolbar. To make a long story short, I accidentally hit "publish" when I was in spell check mode. Oops!

Has this happened to anyone else? I resorted to manual HTML editing -- paste into Notepad, edit the HTML, paste back into Firefox and re-post. The HTML was a big mess, almost as bad as Word HTML. All of the span highlighting was a pain to get out! I wish there was an "undo" or "clean" feature. I could probably write one. It shouldn't be too hard.

All of the span tags have a common id convention: gtbmisp_xx, where xx is the number of the spell check correction. The spell checker also adds a bunch of divs at the end where the actual spell check corrections are rendered. They have ids like gtbspellmenu_xx.
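A "clean" pass based on those two id conventions could be sketched roughly as follows. This is a minimal sketch and not anything from the toolbar itself: the function name cleanSpellCheckMarkup is made up for illustration, and it assumes the markup is simple enough for regular expressions (the spell-check spans wrap plain text, and the menu divs contain no nested divs).

```javascript
// Hypothetical helper: strips spell-check residue from an HTML string.
// Assumes gtbmisp_ spans wrap plain text and gtbspellmenu_ divs hold no nested <div>s.
function cleanSpellCheckMarkup(html) {
  // Matches <span id="gtbmisp_NN" ...>word</span> and captures the wrapped word.
  var spanPattern = /<span\b[^>]*\bid=["']?gtbmisp_\d+["']?[^>]*>([\s\S]*?)<\/span>/gi;
  // Matches <div id="gtbspellmenu_NN" ...>...</div> (the rendered correction menus).
  var menuPattern = /<div\b[^>]*\bid=["']?gtbspellmenu_\d+["']?[^>]*>[\s\S]*?<\/div>/gi;
  // Drop the menus entirely, then unwrap the highlighted words.
  return html.replace(menuPattern, "").replace(spanPattern, "$1");
}

var dirty = 'I <span id="gtbmisp_0" class="hl">acidentally</span> hit publish.' +
            '<div id="gtbspellmenu_0">accidentally</div>';
console.log(cleanSpellCheckMarkup(dirty));
```

Spans and divs without those ids are left alone, so ordinary post formatting should survive the pass.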

You know, it's really amazing how simple the spell checking tool really is. It's not that hard to implement with a little XMLHttp and some CSS / Javascript tricks.

Here's hoping for a proper clean button!

Thursday, November 10

Vertical search definition and context

Vertical Search Engine – A program that allows a specialized collection of data harvested from the Internet or local machine by a piece of software called a spider or robot, to be searched using keywords, boolean logic, or more advanced criteria. The specialized collection may be limited to a specific topic, media format, genre, purpose, location, or other differentiating feature.

Further decomposing it: Vertical (as in a vertical market) means specialized, narrow, focused, specific, niche. Vertical search engines are contrasted with general or broad “horizontal” search engines that are general purpose and all-encompassing.

John Battelle does a great job of describing one class of vertical search engines, domain-specific search engines, in his book, The Search:

Domain-specific search solutions focus on one area of knowledge, creating customized search experiences, that because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers.

There has been quite a bit of controversy over what defines a vertical search engine. My foray into the controversy started when I heard that Niki Scevak from Jupiter Research had written a market report about “Vertical Search,” specifically focused on marketing and advertising, in February of 2005. This sparked a debate between Tom Evslin, a retired technology CEO turned author and blogger, and Fred Wilson, a venture capitalist.

What’s this controversy all about? Well, at the crux of the disagreement is something crucial and fundamental: How do you define “search engine?”

First, let’s hear from a source that is supposed to know, the good old Oxford English Dictionary. It says a search engine is “a program for the retrieval of data, files, or documents from a database or network, especially the Internet.” And for a more populist perspective, let’s compare with Wikipedia. According to Wikipedia, a search engine is “a program designed to help find information stored on a computer system such as the World Wide Web, or a personal computer.” For even more variety try define:search engine on Google. One thing becomes apparent when looking at these definitions: when most people refer to a “search engine,” they usually mean an “internet search engine” such as Google, Yahoo, MSN, Altavista, etc. Most people don’t associate “search engine” with searching their local computer or a database. This difference in definition results in the controversy between Niki, Tom, and Fred.

Are Travelocity, Expedia, Orbitz, Mobissimo, etc. vertical search engines? According to the technical definition, yes. However, under this definition every database-driven website is a “search engine.” I find this difficult to swallow. Instead, I argue that they are not. In doing so I agree with Tom Evslin’s argument. These travel sites have highly structured proprietary data that they use to answer a query. In contrast, a web search engine typically searches a large database, an inverted text index of unstructured data – words from web pages. As one definition puts it, search engines are like “card catalogs for the internet.”
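To make the "inverted text index" idea concrete, here is a toy sketch. It is only an illustration under simple assumptions: the page names and text are invented, tokenizing is just lowercasing and splitting on non-alphanumeric characters, and the function names buildInvertedIndex and search are my own. A real engine adds ranking, stemming, compression, and much more.

```javascript
// Build a toy inverted index: each word maps to the list of page ids containing it.
function buildInvertedIndex(pages) {
  var index = {};
  Object.keys(pages).forEach(function (pageId) {
    var words = pages[pageId].toLowerCase().match(/[a-z0-9]+/g) || [];
    words.forEach(function (word) {
      if (!Object.prototype.hasOwnProperty.call(index, word)) {
        index[word] = [];
      }
      // Record each page at most once per word.
      if (index[word].indexOf(pageId) === -1) {
        index[word].push(pageId);
      }
    });
  });
  return index;
}

// Return the ids of pages that contain every word of the query.
function search(index, query) {
  var words = query.toLowerCase().match(/[a-z0-9]+/g) || [];
  if (words.length === 0) return [];
  return words
    .map(function (word) {
      return Object.prototype.hasOwnProperty.call(index, word) ? index[word] : [];
    })
    .reduce(function (hits, postings) {
      // Intersect the running result with the next word's page list.
      return hits.filter(function (id) { return postings.indexOf(id) !== -1; });
    });
}

// Invented sample pages, standing in for crawled web content.
var pages = {
  "a.html": "Cheap flights to Boston",
  "b.html": "Boston pressure sensors",
  "c.html": "Pressure sensors and valves"
};
var index = buildInvertedIndex(pages);
console.log(search(index, "pressure sensors"));
```

The point of the structure is that a query looks up words, not rows: nothing about the pages has to be known in advance beyond the text that was crawled.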

It seems that the real differentiator between a web search engine and a database-powered website boils down to the source of the data. If it is derived at least in part from data provided by websites, it is a "search engine". If it is not, then it's not. It may be a travel site, a desktop application (GDS, Yahoo desktop), or any website on a specialized topic, but these aren't "vertical search engines". In my opinion, the bottom line is that sites like Kayak, etc. are not "vertical search engines" because they don't derive their content from other, external websites through an automated process, like crawling.

In the past one could make the argument that it was about structured database content vs. unstructured web pages. However, today that line is becoming blurred. Google and other search engines use information extraction to create structure from the unstructured mess, allowing product search, job search, and other more specialized “advanced” search functionality. The old structured vs. unstructured distinction no longer works. However, that is because search engines are now smart enough to turn unstructured content into structured data. The source of the content hasn’t changed; they are simply smarter in their processing. The content is still derived from crawling, harvesting the information from the web.

Examples of Vertical Search Engines
Domain Specific:
Globalspec – 200+ Million pages of "Engineering" web content. (And my current employer.)
HealthLine -- A consumer health research site
KnowledgeStorm -- An IT search engine for IT decision makers.
Scirus -- A science only search engine based on FAST search technology.

Media Specific (Audio and Video)
Truveo -- A video and multimedia search engine.
Singing Fish -- An audio search engine

Genre Specific -- Blogs & News
Technorati -- A Lucene-based blog search engine. -- A news and blog search engine.

Task-Specific Search utilizing information extraction
Trulia -- Real estate search
Indeed -- Job search
SimplyHired -- Job search
Oodle -- A local classified search engine.

Many of the above engines use specialized knowledge derived from extracting very specific information from parts of the web, called “information extraction.” One of the companies that pioneered these advances in search engine technology was WhizBang Labs. It was one of the first major companies (created from Carnegie Mellon) to develop commercial information extraction technology, which is now used in many of the above. (More on WhizBang and information extraction in a future post.)

The bottom line is that sites like Kayak, etc. are not "vertical search engines" because they don't derive their content from other, external websites through an automated process, like crawling. Desktop search, local search, etc. can fall into "vertical search engines" because they are specialized search engines derived from "crawling" the web or a local desktop.

Saturday, November 5

Feature request for GDS 2.0

I like to use the Google sidebar as my simple, stupid news and RSS reader. However, there is one thing I find annoying -- ok, two things.

1) No ability to import or export feed lists! I have feeds in my RSS readers and other websites in OPML and other formats. Is this possible with an extension? I'll have to look around.

2) The automatic adding of recent clips is annoying and leads to a lot of low-quality sites being added to my web clips. One of the first things I did after using GDS 2.0 for a while was turn this feature off.

More on vertical search and other things soon to come ;-)

Thursday, November 3

Vertical Search: An Introduction

I had the opportunity to hear Jan Pedersen from Yahoo speak at an industry conference on search. In his presentation, Jan highlighted what he thinks part of the future of search might be: Vertical Search.

He identified several emerging search markets: local search, image search, desktop search, product search, and personalized search. Of course he also plugged Yahoo's new contextual search, Y!Q. First, I think Y!Q is a fantastic idea, but it's not really vertical search, so I'll leave that for another series on search and context.

So, what's in the news recently? Well...

Google just released Google Desktop 2.0 and moved Google Local out of beta. In light of these two new "vertical search" offerings and Jan's claim to their emerging importance, I think now is an auspicious time to take a closer look.

Update: It looks like Jan and company have been busy. Yesterday, they released a new version of Yahoo local / Maps beta. It has some very nifty features. Yes, I think this is fortuitous timing!

What I think I am going to do is start a series on "Vertical Search." I am going to start with Jan's emerging vertical search areas and compare what the offerings from the major players are in these areas. Then I'll give some of my ideas on vertical search and what I think is the future of vertical search.

However, before we dive right in, let's try to define vertical search and put it into some context. What is Vertical Search? What are some of its different definitions, and who are some of its early pioneers? ... next time!

Tuesday, November 1

Google Toolbar Autolink Scanners

So, my hunch was right. I've been digging through the GTB code.

It's a very cool piece of code.

If you unpack the google-toolbar.jar file in the toolbar extension folder you will find a nice directory structure. Inside the content directory you find gtb.js -- this seems to be where most of the cool code powering the toolbar exists.

Here are some of the biggest chunks of code I identified: 1) Google Suggest, 2) AutoLink, and 3) spell checking (utilizing XMLHttp and Google's servers).
Additionally, there is code to store your search history and provide the PageRank visualizer. Nothing too shocking (yet..).

What's cool is that it doesn't seem to be obfuscated. Sure, there are no comments -- but I have seen open source software that is harder to understand.

The auto-link feature is powered by scanners. These scanners look over the words in the current document. Here is a selection of scanners and the words they look for:

Books ("ISBN", "Book", "Publication")
Package tracking for Fedex Ground / Express ("fedex" / "fed") tracking info / UPS ("ups") / USPS ("usps")
Vehicle Histories: "vin", "vehicle", "auto", "car", "bus", "pickup", "truck", "suv", "bike", "moto", "numbers", "number"
Address Mapping --

The address scanner was what I was looking for, so I'll cover that in a little more detail -- although all of the above are also interesting.

Address Scanner
Looking at the scanner code I see some interesting things. First, the list of states is neat. Here are some highlights:
"American Samoa", "AS"
"Federated States of Micronesia", "FM"
"Marshall Islands", "MH"
"Northern Mariana Islands", "MP"
"Palau", "PW"

I didn't even know those were valid US states! You also have all the usual suspects. Nothing international in here.

It can't find certain addresses. Here is why:

var addressStreetScanner = ["street", "st", "ave", "road", "avenue", "rd", "san", "blvd", "dr", "drive", "new", "york", "west", "east", "north", "south", "ct", "park", "way", "los", "city", "parkway", "beach", "main", "boulevard", "santa", "se", "ne", "sw", "nw"];

Is it just me or do some of those seem very west coast oriented -- santa? beach? los? hmm... very interesting indeed.

What is more telling is what is lacking. Take a look at Wikipedia's entry on street names and what it has to say about street name designations. How did Google come up with this list? Perhaps it is direct from Google maps? Is the list the output of some sort of address text classifier / extractor ? Why aren't the Wikipedia street designations in it?
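To illustrate why a keyword list like this would miss addresses, here is a small sketch. It is not the toolbar's actual matching logic, which I have not reproduced here. It simply assumes that a candidate address must contain at least one word from the list, and the function name hasStreetKeyword is hypothetical.

```javascript
// Street keywords as listed in the post.
var streetWords = ["street", "st", "ave", "road", "avenue", "rd", "san", "blvd",
  "dr", "drive", "new", "york", "west", "east", "north", "south", "ct", "park",
  "way", "los", "city", "parkway", "beach", "main", "boulevard", "santa",
  "se", "ne", "sw", "nw"];

// Hypothetical check: does the text contain any known street keyword?
// Assumption: an address is only recognized if one of its words is on the list.
function hasStreetKeyword(text) {
  var tokens = text.toLowerCase().match(/[a-z]+/g) || [];
  return tokens.some(function (token) {
    return streetWords.indexOf(token) !== -1;
  });
}

console.log(hasStreetKeyword("1600 Amphitheatre Parkway")); // "parkway" is on the list
console.log(hasStreetKeyword("12 Maple Lane"));             // "lane" is not
console.log(hasStreetKeyword("4 Birch Circle"));            // "circle" is not
```

Under that assumption, any address whose only street designation is something like "Lane" or "Circle" never gets picked up, no matter how well formed the rest of it is.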

The address parsing problem extends beyond the Google Toolbar. It is a symptom of a larger problem with Google Maps. I went back over the list of addresses that auto-link couldn't extract and tried manually punching them into Google Maps. Did they work? Nope. Google Maps can't handle the address formats I mentioned in my previous post -- and unsurprisingly, the functionality isn't built into the toolbar.

Now, just for curiosity's sake -- let's try Mapquest. Most of my problem addresses work! It finds most of the addresses entered, including nice maps. However, Mapquest still can't handle the Empire State Plaza or Executive Park -- two local landmarks. Competition is good for business -- Mapquest is making upgrades to its UI to compete with Google Maps. It has recently upgraded to nice big maps, even if they aren't yet draggable. Did I mention Mapquest's maps are prettier and easier to read?
Verdict: at least for now, MapQuest is this geek's top choice for mapping / directions.

So what else did I find puttering around in the toolbar Javascript? Lots of cool and interesting things. There is enough material there to keep me busy for awhile. But here is a little interesting bit that I found enlightening:

The Google Toolbar pings back to the Googlesphere daily. In fact, the Google Toolbar sends out a GET request with the user's first search of the day. This isn't a 100% surprise; I think I saw something about usage statistics in the EULA. However, this is the first hard evidence I have found of Google collecting my information.
The ping url:
var url = "" + installId + "&" + "dll=" + "1.0.20051012" + "&" + "hl=" + GTB_GoogleToolbarOverlay.languageCode + "&" + "browser=" + encodeURIComponent(window.navigator.appVersion) + firstsearch;

One last ending thought:
As I mentioned earlier, I would like to see Google AutoLink integrated with Google Local and its entity extraction algorithms. It would be so cool! Imagine, I am browsing a local restaurant review site and see Pizza Hut -- ok, let's order. I hit the auto-link button. It extracts the entity Pizza Hut from the page using Google's meta data / local and it knows my zip code. Google looks up Pizza Hut in Google Local and gives me the option to auto-link to its Google Local listing, complete with my local phone number and reviews. Easier than putzing about Pizza Hut's website trying to find my nearest location!

Auto-Link has a lot of potential, but I think its name could be improved. I had no idea what it did or when I should use it before I saw the advert. Even then, I thought it was just for addresses, but the code showed me lots of cool ways to use it! How about more info on this in the docs, Googlers? Did I miss this somewhere?