Thursday, July 10

Yahoo BOSS: Full web search at your fingertips

Today Yahoo launched BOSS - Build your Own Search Service. This is a huge step forward: it allows developers to build white-label search applications on top of Yahoo's search engine. Vik Singh, Search Strategist & Architect at Yahoo!, has a great insider's take on the launch.

In my online experience, I typically visit a variety of sites: Techmeme, Digg, Techcrunch, eBay, Amazon, del.icio.us, etc... The biggest goal of Boss is to help bootstrap sites like these to get comprehensiveness and basic ranking for free, as well as offer tools to re-rank, blend, and overlay the results in a way that revolutionizes the search experience... I think users should be confident that if they searched in a search box on any page in the whole wide web that they’ll get results that are just as good as Yahoo/Google and only better...

The next couple of milestones for Boss I think are even more interesting and disruptive - server side services, monetization, blending ranking models, more features exposure … so stay tuned.
The new search API returns results in both JSON and XML. There is also a new BOSS Mashup Framework that provides SQL-like syntax for querying web data sources such as BOSS.
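
To make that concrete, here's a minimal Python sketch of a JSON call (the endpoint pattern and response fields are as I recall them from the launch docs, and the appid is a placeholder):

    import json
    from urllib.parse import quote
    from urllib.request import urlopen

    APPID = "YOUR_BOSS_APPID"  # placeholder; register with Yahoo for a real appid

    def boss_search(query, start=0, count=10):
        """Fetch one page of BOSS web results as parsed JSON."""
        url = (f"http://boss.yahooapis.com/ysearch/web/v1/{quote(query)}"
               f"?appid={APPID}&format=json&start={start}&count={count}")
        with urlopen(url) as response:
            data = json.load(response)
        return data["ysearchresponse"]["resultset_web"]

    for result in boss_search("chicken mole"):
        print(result["title"], result["url"])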

My first plan is to try it with RecipeComun, my recipe search engine, to add spell-checking. Ideally, I'd also like to backfill recipe results with web search results, but only from a slice of Yahoo's index: the food vertical. I can do something like this with Google Custom Search Engine, but I'm not keen on Google's restrictions. Perhaps I will start by adding site: restrictions, though I'm not sure how far I'll get with that.
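
Something like this is what I have in mind for the site: restriction approach (a sketch reusing boss_search from above; the site list is illustrative, and how many OR-ed site: operators BOSS accepts is exactly the query-length question below):

    # Illustrative slice of food sites; the real list would be much longer.
    FOOD_SITES = ["allrecipes.com", "epicurious.com", "foodnetwork.com"]

    def food_query(query):
        """Restrict a web query to known recipe sites by OR-ing
        site: operators onto the user's terms."""
        restriction = " OR ".join(f"site:{s}" for s in FOOD_SITES)
        return f"{query} ({restriction})"

    backfill = boss_search(food_query("chicken mole"))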

One other observation: the 50-results-per-request limit is very restrictive if you want to perform any kind of significant reranking. Will multiple queries at different start offsets be throttled if you try to get around it? (Answer: you can get up to result 1000 by paging. This seems inefficient; why not allow it in fewer queries?)
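
In the meantime, paging looks something like this (building on the boss_search sketch above):

    def boss_search_deep(query, want=200, page_size=50):
        """Collect up to `want` results by paging the start offset.
        With count capped at 50 and start capped at 950, this tops
        out around result 1000."""
        results, start = [], 0
        while len(results) < want and start <= 950:
            page = boss_search(query, start=start, count=page_size)
            if not page:  # ran out of results early
                break
            results.extend(page)
            start += page_size
        return results[:want]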

Questions:
  • What is the maximum "start" position? (Answer: 950)
  • What is the maximum supported query length? (for query expansion and site restrictions)
  • In the future, would it be possible to get result metadata from the Webmap?
Thanks to Vik and all the other Yahoo!s involved in the product. I look forward to trying it out.

Wednesday, July 9

Google: No Query Left Behind

Amit Singhal, the head of the Core Ranking team at Google, has a post on Google's philosophy of ranking.

To summarize the article:
Today, I would like to briefly share the philosophies behind Google ranking:
1) Best locally relevant results served globally.
2) Keep it simple.
3) No manual intervention.
The first one is obvious. Given our passion for search, we absolutely want to make sure that every user query gets the most relevant results. We often call this the "no query left behind" principle...

We make about ten ranking changes every week, and simplicity is a big consideration in launching every change. Our engineers understand exactly why a page was ranked the way it was for a given query. This simple, understandable system has allowed us to innovate quickly, and it shows...

No discussion of Google's ranking would be complete without asking the common - but misguided! :) - question: "Does Google manually edit its results?" Let me just answer that with our third philosophy: no manual intervention... often a broken query is just a symptom of a potential improvement to be made to our ranking algorithm.
Also see the older NY Times article Google Keeps Tweaking Its Search Engine (requires registration).

Tuesday, July 8

Cool new serialization library: Google Protocol Buffers

Update: Tom White, of the Hadoop project, has a follow-up on Hadoop's serialization mechanism and the possible application of Google Protocol Buffers.

Originally via Matt Cutts.
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format...

As the system evolved, it acquired a number of other features and uses:
  • Automatically-generated serialization and deserialization code avoided the need for hand parsing.
  • In addition to being used for short-lived RPC (Remote Procedure Call) requests, people started to use protocol buffers as a handy self-describing format for storing data persistently (for example, in Bigtable).
  • Server RPC interfaces started to be declared as part of protocol files, with the protocol compiler generating stub classes that users could override with actual implementations of the server's interface.
Protocol buffers are now Google's lingua franca for data – at time of writing, there are 48,162 different message types defined in the Google code tree across 12,183 .proto files. They're used both in RPC systems and for persistent storage of data in a variety of storage systems.
I'm a big fan of compact and simple serialization formats. We've developed a similar system here at Globalspec for our inter-system communication. If I'm allowed, perhaps I can elaborate more in the future. It's really exciting to take a look at Google's solution to a similar problem. One of the coolest features is that their protocol language generates stubs for multiple languages: C++, Java, and Python.
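
For a taste of the Python side, here's a sketch using a hypothetical person.proto (the generated-module naming and API follow the protobuf tutorial):

    # person.proto (hypothetical example):
    #   message Person {
    #     required string name  = 1;
    #     required int32  id    = 2;
    #     optional string email = 3;
    #   }
    # Compiled with: protoc --python_out=. person.proto
    import person_pb2

    person = person_pb2.Person()
    person.name = "Alice"
    person.id = 123

    data = person.SerializeToString()   # compact binary wire format

    same_person = person_pb2.Person()
    same_person.ParseFromString(data)   # round-trips losslessly
    assert same_person.name == "Alice"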

I wonder if Hadoop/HBase could leverage this as a way to store serialized data in the filesystem.

I can't wait to try it out; this looks incredibly useful. Thank you, Google developers!

Update: I ran across Thrift, Facebook's framework for cross-language service communication. Thrift is now being spun off as a new Apache Incubator project, Apache Thrift.
Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages.
I don't have time for a more complete comparison, but a few differences:
1) Thrift defines services and communication; Google Protocol Buffers are only a way to serialize data.
2) Thrift uses a C-like definition language; Google's Protocol Buffers use a specialized data language with very simple structures.

Monday, July 7

A diversity of opinions: IBM Many Aspects Text Summarization Tool

IBM recently posted a new text summarization tool, IBM Many Aspects, on their alphaWorks site. Many Aspects is a Java tool that is available for download. Its goal seems to be to summarize documents with multiple topics or viewpoints.

From the description:
These sentences are picked using the following two criteria:
  • Coverage: The sentences should span a large portion of the spectrum of the document's subject matter.
  • Orthogonality: Each sentence should capture different aspects of the document's content. That is, the sentences in the summary should be as orthogonal to each other as possible.

...For example, in online comments and discussions following blogs, videos, and news articles, it is desirable to have a summary that highlights different angles of these comments because each often has a different focus. With IBM Many Aspects Document Summarization Tool, you can get a concise yet comprehensive overview of the document without having to spend lots of time drilling down into the details.

You can also read the research behind the tool: ManyAspects: A System for Highlighting Diverse Concepts in Documents, by Kun Liu, Evimaria Terzi, and Tyrone Grandison, from VLDB 2008.

From the paper, the primary use case seems to be summarizing user opinions about a movie or a product. In these cases, it's useful to identify the different aspects of the product or movie being discussed.
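
To make the two criteria concrete, here's a toy greedy selection over term vectors (my own illustration of the orthogonality idea, not IBM's actual algorithm):

    import math
    from collections import Counter

    def term_vector(sentence):
        """Bag-of-words term-frequency vector."""
        return Counter(sentence.lower().split())

    def cosine(a, b):
        """Cosine similarity between two sparse term vectors."""
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values()))
        norm *= math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def summarize(sentences, k=3):
        """Greedily pick k sentences, each as dissimilar as possible
        ('orthogonal') to the sentences already chosen."""
        vectors = [term_vector(s) for s in sentences]
        chosen = [0]  # seed with the first sentence
        while len(chosen) < min(k, len(sentences)):
            rest = [i for i in range(len(sentences)) if i not in chosen]
            # Lower max-similarity to the chosen set means more orthogonal.
            best = min(rest, key=lambda i: max(cosine(vectors[i], vectors[j])
                                               for j in chosen))
            chosen.append(best)
        return [sentences[i] for i in sorted(chosen)]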

If you get a chance to try it out, I'd love to hear what you think. Is it useful?

Lingpipe rant on Lucene Tokenization

The Lingpipe blog has a good rant on Lucene's tokenizer infrastructure.

When I get around to it, I have a few Lucene rants of my own, but the downside is that whenever I start writing about the problems I feel obligated to submit patches. That's the problem with ranting on open source software: it's (usually) open to user contribs.

I'm reminded of Grant Ingersoll's response to a JavaLobby article offering Six Ways of Improving Lucene, in which he writes, "...thanks for the ideas. Hope to see your patches soon!"

Have a Lucene rant and want to fix it? Read the Lucene doc on How To Contribute.