Wednesday, July 2

Powerset: An early exit for a promising startup

Back in May, I covered Powerset's launch of WikiSearch and expressed my skepticism. Today, it was formally announced that Microsoft is buying Powerset for an undisclosed sum, though rumors put it around $100M. Microsoft VP Satya Nadella has more on the Live Search blog, and Powerset announced it on their blog as well. Powerset writes:
With any startup, the challenge is to take the seeds of an idea and grow it into a viable company. At Powerset, we transformed our idea into a world-class semantic search platform, demonstrating the future of search with our Wikipedia search experience. But building a large-scale semantic search engine is expensive, requiring an engineering effort and computing resources beyond what most start-ups could ever imagine.
Unfortunately, it seems Powerset wasn't meant to be a viable long-term independent company.

To me, it sounds like investors weren't willing to pour the millions (or tens of millions) of dollars into the hardware and engineering infrastructure needed to scale Powerset's service. Considering the competition and current market conditions, it's not surprising that investors were unwilling to take such a big risk. For Microsoft, which already has large investments in data centers and an ad platform, it's an easier proposition. Don Dodge, Director of Biz Dev for Microsoft's Emerging Business Team, discusses the implications of Powerset's semantic text analysis technology and its potential applications for Microsoft.

My biggest concern about the acquisition is that it could mean the end of Powerset's investment in open source infrastructure, namely the HBase project, which is an open-source version of Google's BigTable.

It's sad to see Powerset get folded into Microsoft at such a young age. Still, search is a tough business, and considering the situation, the deal seems like a good business decision for both parties. Congratulations to the Powerset team! I look forward to watching what you are able to do with the resources of Microsoft behind you.

Thursday, June 26

Ask.com Interview: new Edison ranking algorithm; Ask still way behind in relevance

Barry Schwartz over at SELand has an interview with Ask CEO Jim Safka and Teoma co-founder Apostolos Gerasoulis (AG).

The big news is that in March Ask.com launched their "Edison" ranking algorithm along with an entirely new infrastructure that reportedly greatly improved index freshness. Last April at SES NY, AG first mentioned Ask's new ranking algorithm (see my post from last April on Edison), but nothing has been heard about it since. We don't know much about Edison, except that it incorporates updated versions of Teoma's HITS topical link analysis algorithm as well as a modernized version of DirectHit's click tracking.

From my personal experience, I think Ask still has a long way to go before they are competitive in relevance. For example, a relatively generic query for "DC Motors" returns both http://www.maxonmotor.com/ and http://www.maxonmotorusa.com/, despite the fact that these sites are mirrors with almost identical content. Also, the "Smart Answers" box displays the Wikipedia "Electric Motors" page, which is then duplicated in the organic results. Both of these cases of duplicate content illustrate a lack of attention to detail. Other queries I tried yielded similarly disappointing results compared with GYM's results.
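
To make the mirror-site complaint concrete, here is a minimal sketch of one common way to flag near-duplicate pages: word shingling plus Jaccard similarity. This is not Ask's method (nothing is public about how Edison handles duplicates). The page text, the function names, and the 0.5 threshold are all assumptions made up for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.5, k=3):
    """Flag two documents as near-duplicates when shingle overlap is high.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold


# Hypothetical page text, not real content from either site.
page_a = "maxon motor is a leading supplier of high precision dc motors and drives"
page_b = "maxon motor usa is a leading supplier of high precision dc motors and drives"
page_c = "the electric motor converts electrical energy into mechanical energy"

print(is_near_duplicate(page_a, page_b))  # True
print(is_near_duplicate(page_a, page_c))  # False
```

A results page could collapse hits flagged this way into a single entry. At web scale an engine would more likely compare compact fingerprints of the shingle sets than the sets themselves.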

Good luck guys, there's still a long hill to climb.

Wednesday, June 25

Eclipse Ganymede simultaneous release

Eclipse 3.4, "Ganymede," is finally here. Check out the very lengthy New and Noteworthy.

One of the big new features is real-time shared editing via Cola.


Cola: Real-Time Shared Editing from Mustafa K. Isik on Vimeo.

The new release also includes the JavaScript Development Tools (JSDT) platform for JavaScript development.

One minor frustration is that the PHP Development Tools (PDT) did not participate. Ganymede users will need to upgrade to a beta version of PDT 1.1, not due to be released until September. If you do plan to try the beta, there are installation instructions on the wiki.

Eclipse is a great IDE for developing Java search technology such as Lucene, Minion, Solr, Nutch, and others.

Sunday, June 15

Google I/O Presentations, Google Infrastructure

The Google I/O presentations are now online via Google Sites.

Underneath the Covers at Google: Current Systems and Future Directions by Jeff Dean. James Hamilton has a summary online. [Found via Greg].

A single query currently hits between 700 and 1,000 servers. Slide 38 from his slides lays out the basic query serving architecture (index shards, plus replicas for redundancy and query volume). I'm sure there are more gems here.
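
For readers who haven't seen the shards-plus-replicas pattern, here is a toy sketch of the general idea. It is not Google's implementation. The class names, the in-memory postings, and the random replica choice are all assumptions for illustration. Each shard holds a slice of the index, any replica of a shard can answer for it, and the front end scatters the query to every shard and merges the partial top-k lists.

```python
import heapq
import random


class Replica:
    """One copy of a shard's inverted index: term -> {doc_id: score}."""

    def __init__(self, postings):
        self.postings = postings

    def search(self, term, k):
        hits = self.postings.get(term, {})
        # Return the k best (score, doc_id) pairs held by this shard.
        return heapq.nlargest(k, ((score, doc) for doc, score in hits.items()))


class Shard:
    """A slice of the index, served by several identical replicas."""

    def __init__(self, postings, num_replicas=2):
        self.replicas = [Replica(postings) for _ in range(num_replicas)]

    def search(self, term, k):
        # Any replica can answer; picking one at random spreads query load.
        return random.choice(self.replicas).search(term, k)


def search(shards, term, k=3):
    """Scatter the query to every shard, then merge the partial top-k lists."""
    partial = []
    for shard in shards:
        partial.extend(shard.search(term, k))
    return [doc for score, doc in heapq.nlargest(k, partial)]


# Hypothetical two-shard index.
shards = [
    Shard({"motor": {"doc1": 0.9, "doc2": 0.4}}),
    Shard({"motor": {"doc3": 0.7}, "search": {"doc4": 0.8}}),
]

print(search(shards, "motor"))  # ['doc1', 'doc3', 'doc2']
```

Adding shards lets the index grow, while adding replicas per shard raises query throughput and tolerates machine failures. Because every shard is consulted, the number of servers a query touches grows with the number of shards.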

I also ran across a few more presentations I'd like to watch:
Effective Java Reloaded by Josh Bloch.
How to Index Your Geo Data by Lior Ron and Mano Marks

How Microsoft Live Search Plans to Differentiate Itself

Robert Scoble posted a short video interview with Brad Goldman, the General Manager of Windows Live Search.

Robert's title for the interview, "Microsoft Search: Will It Use Mahalo Techniques to Compete With Google?" is blatant link bait, trying to invoke the controversy over Jason Calacanis's Mahalo. I'll save you the time of watching the video. The short answer is a resounding no. MS is focused on scalable algorithmic solutions (Search is still about the long tail; 50% of MS's queries are distinct within a given week). Below are my other notes from the interview.

Introduction
For users, the most important factors are relevance and speed (the time to find information, not simply search responsiveness). Most users still think that relevance is the most important factor in search engine usage. There's still a lot of room for improvement.

Brad claims that all three of the search engines are about even on relevance. However, he acknowledges the fact that there is no objective third party measuring service and that this is still very subjective. [I don't believe it. I want them to put their data where their mouth is so that we can publicly decide. I suspect there is bias in their evaluation methodology.]

Focus on relevance
- A big milestone was surpassing an index of 20 billion documents (last fall around Searchification 2007). They are now confident that they can keep up with the growth of the web and have a fresh and comprehensive search index.
- Last fall they had 85% of their search team working on relevance. It's a really important priority.
- A goal for the coming year is to continue to focus on great relevance.

Task-centric search
Brad claims that search is becoming more task centric. He categorizes searches into four broad categories:
  1. Users are looking to be entertained
  2. Users are looking to buy something (commercial queries)
  3. Users are looking to find an article or other piece of information (research)
  4. Users are looking to navigate (navigational queries)
[It's interesting that he broke out entertainment as a query type. He didn't clearly explain what this really meant or what its characteristics were. I don't buy it as a top four label for search query behavior.]

Differentiation plans: Better Shopping Experience
Brad talked about Microsoft's plans to differentiate itself by focusing on no. 2 above, the commercial queries. One push on the consumer side is to turn Windows Live Shopping into Live Search Cashback. [It's not clear how Live Product Search will be incorporated.] Brad briefly mentioned a plan to better incorporate reviews and to do more to improve the user experience for this class of queries. For the advertiser, they want to move from a CPC model to a more productive CPA model.

Wrap-up and analysis
Overall, I expected more interesting information from the interview. Scoble rambled on about Twitter and Friendfeed, which he clearly loves, but which didn't produce any interesting discussion. The interview did clarify Microsoft's push on shopping as a path for differentiation. It was also comforting to see that MS is still focused on improving relevance.

It makes sense for MS to focus on shopping first; follow the Benjamins. Creating a better user experience in shopping could lead to a greater share of monetizable searches, providing greater revenue opportunity per search.

Live Search's plan of differentiating itself in commercial search reminds me of the shopping vertical search engine, Become.com.
Become.com searches over 5.6 billion web pages and uses its patent-pending AIR (Affinity Index Ranking) search technology to provide the Internet's most useful product reviews and guides, and then makes it easy to find and buy products from brand name retailers at the best prices. With over 25 million products from 5000 merchants, Become.com provides the Web's most robust and easy to use combination of relevant product research and comparison shopping.
Will Microsoft buy Become.com?

Personally, I don't find better shopping a compelling vision for search. But, maybe the team will prove me wrong.

Friday, June 13

Change the algorithm, not the dataset

Preach it.

Tuesday, June 10

The Rise of Intention and Preference Machines

Yesterday, I mentioned Eric Horvitz's presentation Machine Learning, Reasoning, and Intelligence in Daily Life: Directions and Challenges.

He spends a good deal of his presentation talking about "preference machines," which include recommendation systems. Intention machines are services that use models to predict activities and goals. In short, they use past history to predict future behavior.

First, an excerpt from the mobile arena, the "Predestination" project that predicts driver destinations.
We have been exploring the uses of the data in learning and reasoning systems, including the construction of a system that can predict and then harness drivers’ likely destinations, given initial driving trajectories [Krumm and Horvitz, 2006]. Beyond geocentric intention machines, we have been exploring the feasibility of building geocentric preference machines, that perform geocentric collaborative filtering: Given sets of sensed destinations of multiple people and the sensed destinations of a particular driver, what places, unvisited previously by that driver, might be of interest, and how and when might the driver be best informed (e.g., by hearing a paid advertisement when he or she is approaching such destinations).
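
The "geocentric collaborative filtering" idea in the excerpt can be illustrated with a tiny sketch. This is not the method from Krumm and Horvitz. It is plain user-based collaborative filtering with Jaccard similarity, and the drivers, places, and function names are invented for illustration. Given each driver's set of visited destinations, it suggests places a driver has not visited, weighted by how similar the other drivers' histories are.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(driver, visits, top_n=3):
    """Suggest unvisited places, scored by the similarity of other drivers."""
    mine = visits[driver]
    scores = Counter()
    for other, places in visits.items():
        if other == driver:
            continue
        sim = jaccard(mine, places)
        if sim == 0:
            continue  # drivers with nothing in common contribute nothing
        for place in places - mine:
            scores[place] += sim
    return [place for place, _ in scores.most_common(top_n)]


# Hypothetical data: driver -> set of sensed destinations.
visits = {
    "alice": {"cafe", "gym", "library"},
    "bob": {"cafe", "gym", "bookstore"},
    "carol": {"gym", "library", "bookstore", "cinema"},
    "dave": {"airport"},
}

print(recommend("alice", visits))  # ['bookstore', 'cinema']
```

The excerpt's harder question, how and when to tell the driver, is not addressed here. The sketch only covers the "what places might be of interest" part.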
Challenges in Learning and supervision
Priorities research explored a middle ground of allowing users to become more involved with in-stream supervision. In versions of Priorities, users could inspect and modify in-stream supervision policies. Such awareness and potential modification allows the in-stream supervision to become a grounded collaboration between the machine and user...

Challenging areas of research include developing a better understanding of the best approaches to constructing generic models that can provide valuable, usable initial experiences with intelligent applications and services, but that allow for efficient adaptation downstream with a user’s explicit training efforts or in-stream supervision. Research may lead to deeper insights about setting up systems for “ideal adaptability” given expectations about the nature of different kinds of environments, and adaptations, given the users and uses.
Machines and humans need to learn to work together. Sometimes machines can help us make decisions, but one key challenge is translating the machine's recommendation into a rationale that humans can understand, so that the two can begin a "dialogue" to correct mistakes and provide more accurate predictions.

This barrier is one reason that Google does not use ML for their core ranking algorithm. See the recent post "Are Machine-Learned Models Prone to Catastrophic Errors?" for an enlightening interview with Google's Peter Norvig. Anand relates:
Peter tells me that their best machine-learned model is now as good as, and sometimes better than, the hand-tuned formula on the results quality metrics that Google uses... Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.