Friday, April 20

StumblingUpon $40 Million-ish

Rumor has it that eBay has purchased StumbleUpon, a social search / bookmarking site that lets users explore new sites via "collaborative serendipity." The sale price is reportedly about $40 million (see TechCrunch and GigaOm). Not bad for a small company with only $1.5 million in investment. I can't remember if I blogged about it, but I predicted that they would be purchased this year. However, I predicted that one of the big three would buy them.

There is a great Q&A on SELand with Garrett Camp, one of the founders, covering some of the technology:

Our 2 million registered users stumble around 5 million times a day, so we have a pretty active user base. If they find something new, it's incredibly easy for them to submit it to us. All they need to do is click the thumbs-up button on the toolbar and it's submitted to our database. We get over 16,000 new URL submissions a day - all new and unique content endorsed by our members... We have a classification engine which automatically places content into one of 500 predefined categories based upon on-the-page factors. This means most content submitted can be distributed to interested members even before tags have been applied.
It is also a great, relatively undervalued marketing channel:
StumbleUpon has a unique business model that works well for marketers where we can deliver traffic directly to your site. You can target by category, age, gender and location. So for product launches, distributing audio/visual content or just getting feedback on your blog, StumbleUpon often works better than PPC approaches since targeting is precise and no click through is required.

Perhaps eBay will extend it to products... or videos of products. Who knows. I'm not sure I get this one...

In response, Google has launched its own blatant rip-off. See Google's blog post, "Searching Without a Query." Google's new personalization will take into account not only your search history, but now also your web browsing history via the Google Toolbar (see the separate post, "Your Slice of the Web"). You did know that the Google Toolbar tracked the sites you visit if you had PageRank turned on, right?

Remember my previous post from Hakia's future of search: how much do you trust Google with your data (mail, docs, purchase history (Checkout), search history, web browsing history, files (GDrive), etc.)? Imagine the possibilities... for good or evil.

QueryCat: The query's meow

One of my co-workers has launched a new vertical search engine: QueryCat, a FAQ search engine. Other coverage is on Search Engine Land: QueryCat - Search FAQs.

Kevin created the search engine using Alexa's Web Search platform to mine the web for questions and answers in FAQs. Once FAQ pages are discovered, the questions and answers are extracted from them and indexed with Lucene.

According to Kevin:
The idea was inspired by some of the "answer engines", such as as well as Google's "one box". I think that the next level of search will involve more understanding of a user's query and matching it up with structured information parsed from the web. These sort of techniques help the user find the answer just a little bit faster...We have about 2 million questions and answers right now, but I believe we can double or even triple that in the next few weeks.
Some of the answers are spot on, but others still need some tuning. For example, a query for "what is the capital of mexico?" returns as the first result "What is the difference between the mortgage rate and the APR?" (presumably because "mexico" and "capital" are in the description).
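To make the extract-then-match idea concrete, here is a minimal sketch in Python. It is not QueryCat's code: the "line ending in ? starts a question" heuristic, the function names, and the sample FAQ entries are all assumptions made up for illustration, and the scoring is plain word overlap rather than anything Lucene does.

```python
import re

def extract_faq_pairs(text):
    """Split FAQ text into (question, answer) pairs.

    Assumed heuristic: a line ending in '?' starts a new question and the
    non-empty lines that follow form its answer.
    """
    pairs, question, answer_lines = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("?"):
            if question is not None and answer_lines:
                pairs.append((question, " ".join(answer_lines)))
            question, answer_lines = line, []
        elif question is not None:
            answer_lines.append(line)
    if question is not None and answer_lines:
        pairs.append((question, " ".join(answer_lines)))
    return pairs

def tokens(s):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def overlap_score(query, pair):
    """Count query words that appear anywhere in the question or answer."""
    question, answer = pair
    return len(tokens(query) & tokens(question + " " + answer))

# Hypothetical FAQ entries, invented for this example.
mortgage = ("What is the difference between the mortgage rate and the APR?",
            "Lenders in New Mexico with more capital may quote a lower APR.")
geography = ("Which city is Mexico's capital?", "Mexico City.")

query = "what is the capital of mexico?"
ranked = sorted([geography, mortgage],
                key=lambda p: overlap_score(query, p), reverse=True)
```

In this toy case the mortgage entry ranks first, because common words like "what", "is", and "the" count as much as "capital" and "mexico". Dropping such words or weighting rare terms more heavily is the usual kind of tuning for this problem.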

Good luck Kevin.

Thursday, April 19

eBay's Hybrid Desktop Application: Project San Dimas

There has been a lot of buzz about eBay's San Dimas prototype. Project San Dimas is a hybrid desktop-web app based on Adobe Apollo technology.

There has been a lot of coverage recently from the Web 2.0 Conference: TechCrunch, the blogs of two of the developers, Rob Abbott and Alan Lewis, and of course the video from the Adobe Conference. (Side note: Alan has a great presentation, The Future of the Desktop, that he gave at Web 2.0, with some great slides on San Dimas.)

The Adobe video highlights some of the features, including the wicked feature that allows you to interact with eBay (post items, bid, etc.) even if your network is disconnected. The app will automatically sync up when the connection is restored.

As a side note, there is an interesting shake-up going on with the San Dimas project: one of the lead UI designers on the project, Alan Lewis, is leaving the company to join Ribbit. He writes, "Despite our success and raves from the San Dimas team, all design work on our side ended abruptly at the end of Q1."

I have not experienced San Dimas firsthand, but I look forward to playing with what Apollo has to offer. It is interesting to note that the screenshots I have seen of San Dimas look somewhat similar to eBay Express, at least with the controls for query refinement at the top of the page (what Google just abandoned for Product Search).

I will save an in-depth discussion of faceted search and eBay Express for another day.

Froogle Rebranded to boring name

Google has re-branded Froogle to Product Search. Maybe if your stock valuation is as high as Google's, you can afford to go Dean and Deluca instead of Price Chopper... Here is the news straight from their blog, Back to Basics. I must admit, the new name works, but it's a lot less witty. Next thing you know Google Base will become Google Object Database.

As usual, Danny Sullivan at SE Land has great in-depth coverage, Goodbye Froogle, Hello Google Product Search. One interesting aspect of the UI change is the change in the way query refinement is done:
The big giant box of query refinement options that were at the top of the page will move to the bottom and be more condensed. The refinements were relatively little used at the top of the page, Mayer said, and putting them down at the bottom also seemed to make more sense.
I'm not sure if I like the change, but it sure makes products the focus of the page, with more content above the fold.

CNet's coverage, Google takes the pun out of shopping, has a great title. It is a decent article, but most of it is on Google Base, not Product Search (highlighting some continued confusion in this area). Here are some highlights from what I would describe as the Google Base article:
Rather than encourage people to go to specific sites for specialized search, which is what vertical sites do, Google wants them to go to first and find the best results from its own specialized searches there. And most people do start their searches, for everything from cars to houses to jobs, on a major search site, experts say. Recent statistics from online traffic measurement firm Hitwise found that search engines are the primary way that Internet users navigate to key industry categories...

But Mayer says Google Base isn't intended to be competition for e-commerce companies. "Faceted search is an important part of the process," allowing people to search for part-time versus full-time jobs and to search for a five-bedroom house, she said. "We know that's important to search and that's something Google hasn't done particularly well in the past."

A case study on culinary Web site provided by Google said the company didn't see any results from its recipe listings on Google Base until it added descriptors such as cuisine type, course and main ingredient. Then traffic to the site jumped 6 percent immediately.

As Google's prominence and power of user attention grows, vertical sites can find Google's approach unsettling. Instead, many verticals are trying to lessen their dependence on Google and find ways to drive direct and repeat usage where Google is not a part of the transaction.
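For readers new to the term, faceted search boils down to filtering and counting on structured attributes. The sketch below is an assumption-laden toy, not Google Base's data model: the recipe records and function names are invented, and the facet fields simply reuse the descriptors named in the case study (cuisine type, course, main ingredient).

```python
from collections import Counter

# Hypothetical listings, invented for this example.
recipes = [
    {"title": "Chicken Mole", "cuisine": "mexican", "course": "main", "ingredient": "chicken"},
    {"title": "Flan", "cuisine": "mexican", "course": "dessert", "ingredient": "egg"},
    {"title": "Coq au Vin", "cuisine": "french", "course": "main", "ingredient": "chicken"},
]

def filter_items(items, **facets):
    """Keep only the items whose attributes match every requested facet value."""
    return [it for it in items
            if all(it.get(field) == value for field, value in facets.items())]

def facet_counts(items, field):
    """Count how many items fall under each value of one facet field."""
    return Counter(it[field] for it in items if field in it)

mexican_mains = filter_items(recipes, cuisine="mexican", course="main")
courses = facet_counts(recipes, "course")
```

The counts are what a UI would show next to each refinement option; a listing with no descriptors cannot match any facet, which fits the case study's point that the descriptors had to be added before the listings produced results.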

Wednesday, April 18

April Showers Bring... Search Engine Video Lectures

Here in the northeast it has not been pleasant; the cold rain is incessant. The old saying for April this year could be "April Monsoons (hopefully) bring May flowers." Since it's raining outside, you may as well watch search engine video lectures, assuming you aren't completely under water.

Here are some of the best sources for search videos on the web.

SIMS 141: Search Engines: Technology, Society, and Business - The class lectures from Marti Hearst's UC Berkeley class in 2005. Great speakers including John Battelle, Sergey, and Jan Pedersen. A good mixture of technical content and business content.

Research Channel - A great wealth of academic lectures available online. A good starting place can be found via the SIGIR Talks page.

VideoLectures.Net - A European site focused on computer science research videos (from conferences and workshops) with over 1,000 video lectures online. It is focused primarily on machine learning and the semantic web. As a starting point, many of the videos from The Future of Web Search workshop from last May hosted by Yahoo Research Barcelona have been posted.

Google Tech Talks - A series of internal lectures given at Google. The topics run the gamut from biofuels to computer security and programming languages. While most are not search focused, they are quite fascinating (and obviously for the technical and geeky audience).

The hard part is choosing what to watch.

That should be enough to keep you busy for at least 40 days and 40 nights, or at least until the spring floods subside.

Sunday, April 15

The Spock Entity Resolution challenge and other miscellany

The Spock Contest - via O'Reilly Radar. Spock is a new people search engine. Like Netflix, Spock has started a contest. First, some background on Spock:
Spock is a search application that helps consumers discover more about people who matter in their lives. At the core, we organize relevant information around people and have developed unique technology to do so...With over one hundred million individuals indexed and millions added every day, Spock is the largest and most comprehensive people specific search application.
Next up, information on The Challenge:

We have selected one of our most interesting problems, namely Entity Resolution, to share with the community, allowing other leading computer scientists and engineers to compete in an open contest... You can work individually and in teams. The competition will last 4 months and the winning team will win a Grand Prize of $50,000! Most importantly you’ll be working on a very important and widely applicable problem. We will also be issuing prizes for 2nd and 3rd place.

The dataset is 1.5 GB compressed. Time to dig a little deeper... more soon.
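For anyone unfamiliar with entity resolution, the task is deciding which records refer to the same real-world person. Below is a minimal sketch, assuming each record has a name and a set of descriptive terms. The record layout, the 0.5 threshold, and the matching rule are all my own assumptions for illustration; I have not yet looked at the format of the Spock dataset.

```python
from itertools import combinations

def normalize(name):
    """Lowercase, drop periods, and sort name parts so 'Smith John' == 'John Smith'."""
    return " ".join(sorted(name.lower().replace(".", "").split()))

def jaccard(a, b):
    """Jaccard similarity of two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def resolve(records, threshold=0.5):
    """Cluster record indices that appear to describe the same person.

    Two records are linked when their normalized names match and their
    term sets are similar enough; links are merged with union-find.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        r, s = records[i], records[j]
        if (normalize(r["name"]) == normalize(s["name"])
                and jaccard(r["terms"], s["terms"]) >= threshold):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical records, invented for this example.
people = [
    {"name": "John Smith", "terms": {"seattle", "engineer", "java"}},
    {"name": "Smith John", "terms": {"seattle", "engineer"}},
    {"name": "John Smith", "terms": {"chef", "paris"}},
]
clusters = resolve(people)
```

Here the first two records end up in one cluster and the chef in another. Comparing every pair is quadratic, so a dataset with over a hundred million people would need some form of blocking or indexing first.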

Now, other miscellany:

Microsoft Research Asia has released a package of benchmarks for creating and testing machine learning based ranking algorithms called Letor (LEarning TO Rank). Their goal is to create a platform that allows researchers to more easily compare the effectiveness of their ML based ranking systems through the use of a standard set of benchmarks.
Ranking is the central problem for many applications, and using machine learning technologies to learn the ranking function has been a promising research direction. However, the lack of public benchmark datasets (e.g. standard features, relevance judgments, data partitioning, and evaluation metrics) makes the existing work difficult to be compared with each other...We benchmarked several state-of-the-arts ranking models with these features and provide baseline results for future studies.

Found via a blog post on the topic by Fernando Diaz, a grad student at UMass CIIR.
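As an illustration of what a shared evaluation metric looks like, here is a small sketch of NDCG, a measure commonly used to score ranked lists against graded relevance judgments. The exponential gain and log2 discount used here are one common formulation; whether they match Letor's exact definitions is an assumption on my part, and the function names are my own.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results.

    relevances: graded judgments (0 = not relevant) in ranked order.
    """
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A ranker that puts the best document last scores worse than the ideal order.
good_order = ndcg_at_k([2, 1, 0], 3)
bad_order = ndcg_at_k([0, 1, 2], 3)
```

With fixed judgments and a fixed metric like this, two ranking systems can be compared on the same queries simply by comparing their average scores.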

AIRWeb 2007 Papers Announced

Also, for the latest in Web Spam research, the AIRWeb 2007 accepted papers are now online. Search Engine Land has a great article on the topic, with links and descriptions of all the papers, something lacking on the website. One of the primary organizers of AIRWeb is Brian Davison. Brian is presenting a paper on link filtering at the conference, Measuring Similarity to Detect Qualified Links.