Friday, March 4

Evgeniy Gabrilovich wins 2010 Karen Spärck Jones Award

The British Computer Society IRSG announced that the 2010 Karen Spärck Jones Award goes to Evgeniy Gabrilovich. Evgeniy is a Senior Research Scientist and Manager of the NLP & IR Group at Yahoo! Research.

He will present a keynote talk at the upcoming ECIR 2011 conference later this month. His presentation will be Ad Retrieval Systems in vitro and in vivo: Knowledge-Based Approaches to Computational Advertising.

Congratulations, Evgeniy! I have heard a lot of great things from Yahoo! Research interns who worked under his guidance.

Thursday, March 3

Google's War on Content Farms: Project Big Panda

In late February Google launched a significant update to its ranking algorithm to address "shallow content" pages. Externally the change has been referred to as the "Farmer" update; internally it is known as "Panda".

Amit Singhal and Matt Cutts posted about the change on the Google blog, Finding more high quality sites in search. It reduced the rankings of "low quality sites" that aggregated content from other websites and didn't add a significant amount of value for users. According to the post, the update affected 11.8% of queries. They also launched the Chrome Blocklist Extension to let people block websites from their Google results. O'Reilly Radar published an article with a very good overview of the discussion.

What is behind the change? The most informative article is a recent Wired interview by Steven Levy, The ‘Panda’ That Hates Farms. It interviews Matt Cutts and Amit Singhal, who managed the update.

What was the answer? In short, they built a document quality classifier trained on lots of rater data. Here are some of the questions from the article that they asked raters:
  • Would you be comfortable giving this site your credit card?
  • Would you be comfortable giving medicine prescribed by this site to your kids?
  • Do you consider this site to be authoritative?
  • Would it be okay if this was in a magazine?
  • Does this site have excessive ads?
These questions seem to ask about the authoritativeness and trustworthiness of the content on a page. The results were also confirmed by an 84% overlap between sites downgraded in the change and those that people blocked using the Chrome extension, even though the blocklist data is not used as a feature in the update.
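To make the overlap check concrete, here is a minimal sketch of how such a number could be computed. The domain names are made up, and I am assuming the figure means "the fraction of blocked sites that were also downgraded"; the article does not spell out the exact definition.

```python
def overlap_fraction(blocked_sites, downgraded_sites):
    """Fraction of blocked sites that also appear among the downgraded sites.

    Assumption: this is one plausible reading of "overlap"; the exact
    definition behind the reported figure is not given in the article.
    """
    blocked = set(blocked_sites)
    if not blocked:
        return 0.0
    return len(blocked & set(downgraded_sites)) / len(blocked)


# Hypothetical domains, for illustration only.
most_blocked = ["farm-a.example", "farm-b.example", "farm-c.example", "news.example"]
downgraded = ["farm-a.example", "farm-b.example", "farm-c.example", "scraper.example"]

# Three of the four blocked sites were also downgraded.
print(f"{overlap_fraction(most_blocked, downgraded):.0%}")
```

Because the blocklist data is held out of the classifier, a high overlap acts as an independent sanity check rather than a training signal.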

How did Google become overrun with almost-spam content? Amit sheds a bit of light on the question in one of his answers:
So we did Caffeine in late 2009. Our index grew so quickly, and we were just crawling at a much faster speed. When that happened, we basically got a lot of good fresh content, and some not so good. The problem had shifted from random gibberish, which the spam team had nicely taken care of, into somewhat more like written prose. But the content was shallow.
The interview then gets bogged down in bigger issues around editorial process and transparency, which are important but not as technically interesting.

Wednesday, March 2

HeyStaks launches: Social and Collaborative Web Search App

HeyStaks is a collaborative search startup that launched publicly yesterday at DemoCon. HeyStaks has a browser / iPhone app that lets you share your search experiences. It lets you save searches and pages you find into "Staks" and then share them with your "Search Buddies".

VentureBeat has an article covering their launch that is worth checking out. Their website also has a video.

The chief scientist at the company is Barry Smyth, a professor at University College Dublin.

It's a bit early for a full review, but I tried it out and it seems promising. I have some privacy concerns about browser toolbars that save and share my search history, especially when the service is oriented towards public sharing of the information.

HeyStaks reminds me of the failed Yahoo! Search Pad, but with a more social focus, and it works across search engines. I hope it has better luck.

I would like to see the service evolve to support more collaboration in the search itself, beyond saving and sharing results. One example is the deeper integration that Gene Golovchinsky, Jeremy Pickens, and others have been advocating. See their paper, Algorithmic mediation for collaborative exploratory search, which won the best paper award at SIGIR 2008.

My congratulations to HeyStaks on the launch. I look forward to Chrome and Android versions of the app, which I hope will follow soon.

News Highlights: Bing Price search, Yahoo! Boss, Google Data Publishing, and more

Here is a roundup of news from around the web:
  • Bing adds price recognition to its query support. You can now search for "digital camera under $200" and it will automatically add the price filter. It is a good step in the right direction. How about something a bit harder? "Canon 12 MP Camera under $200" with the manufacturer and megapixel attribute restrictions.
  • Google data publishing. From the announcement: "We created this format to address a key problem in the Public Data Explorer and other, similar tools, namely, that existing data formats don’t provide enough information to support easy yet powerful data exploration by non-technical users."
  • Scala tip: Check out the REPL for interactive debugging.
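For the price query item above, here is a rough sketch of what attribute recognition might look like, including the harder manufacturer and megapixel case. This is not how Bing does it; the regular expressions, the brand list, and the `parse_product_query` function are all my own assumptions for illustration.

```python
import re

# Matches phrases such as "under $200" or "less than $1,500.50".
PRICE_PATTERN = re.compile(
    r"\b(?:under|below|less than)\s+\$(\d[\d,]*(?:\.\d+)?)", re.IGNORECASE
)
# Matches phrases such as "12 MP" or "12 megapixels".
MEGAPIXEL_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*(?:MP|megapixels?)\b", re.IGNORECASE)
# Hypothetical brand list; a real system would consult a product catalog.
KNOWN_BRANDS = {"canon", "nikon", "sony"}


def parse_product_query(query):
    """Split a query into free-text keywords and structured filters."""
    filters = {}
    price = PRICE_PATTERN.search(query)
    if price:
        filters["max_price"] = float(price.group(1).replace(",", ""))
        query = query[:price.start()] + query[price.end():]
    megapixels = MEGAPIXEL_PATTERN.search(query)
    if megapixels:
        filters["megapixels"] = float(megapixels.group(1))
        query = query[:megapixels.start()] + query[megapixels.end():]
    keywords = []
    for token in query.split():
        if token.lower() in KNOWN_BRANDS:
            filters["brand"] = token.lower()
        else:
            keywords.append(token.lower())
    return " ".join(keywords), filters


print(parse_product_query("digital camera under $200"))
print(parse_product_query("Canon 12 MP Camera under $200"))
```

Hand-written patterns like these break down quickly across product categories, which is part of why the harder query is a real test.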

Tuesday, March 1

WhistlePig: A minimalist real-time search engine

William Morgan recently announced the release of Whistlepig, a real-time search engine written in C with Ruby bindings. It is now up to release 0.4. Whistlepig is a minimalist in-memory search system that returns results in reverse date order. You can read William's blog post for his motivations for writing it. Here is a description from the current readme:
Roughly speaking, realtime search means:
- documents are available to queries immediately after indexing, without any reindexing or index merging;
- later documents are more important than earlier documents.

Whistlepig takes these principles to an extreme.
- It only returns documents in the reverse (LIFO) order to which they were
added, and performs no ranking, reordering, or scoring.
- It only supports incremental indexing. There is no notion of batch indexing or index merging.
- It does not support document deletion or modification (except in the
special case of labels; see below).
- It only supports in-memory indexes.

Features that Whistlepig does provide:
- Incremental indexing. Updates to the index are immediately available to
- Fielded terms with arbitrary fields.
- A full query language and parser with conjunctions, disjunctions, phrases, negations, grouping, and nesting.
- Labels: arbitrary tokens which can be added to and removed from documents at any point, and incorporated into search queries.
- Early query termination and resumable queries.
- A tiny, <>
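To illustrate the principles in the readme, here is a toy sketch of an index that is incremental, in-memory, label-aware, and returns results newest first with no scoring. It is not Whistlepig's implementation or API; the class and method names are invented, and it only handles simple conjunctive queries.

```python
class TinyRealtimeIndex:
    """Toy in-memory index: newest documents first, no ranking or scoring."""

    def __init__(self):
        self.postings = {}  # term -> list of doc ids, in insertion order
        self.labels = {}    # label -> set of doc ids
        self.next_id = 0

    def add(self, text):
        """Index a document; it is searchable as soon as this returns."""
        doc_id = self.next_id
        self.next_id += 1
        for term in set(text.lower().split()):
            self.postings.setdefault(term, []).append(doc_id)
        return doc_id

    def add_label(self, doc_id, label):
        """Labels are the only mutable part of an indexed document."""
        self.labels.setdefault(label, set()).add(doc_id)

    def remove_label(self, doc_id, label):
        self.labels.get(label, set()).discard(doc_id)

    def search(self, terms, label=None, limit=10):
        """Conjunctive query; returns at most `limit` doc ids, newest first."""
        term_sets = [set(self.postings.get(t.lower(), [])) for t in terms]
        if not term_sets:
            return []
        matches = set.intersection(*term_sets)
        if label is not None:
            matches &= self.labels.get(label, set())
        return sorted(matches, reverse=True)[:limit]


index = TinyRealtimeIndex()
first = index.add("hello world")
second = index.add("hello search")
index.add_label(first, "inbox")
print(index.search(["hello"]))                 # newest document first
print(index.search(["hello"], label="inbox"))  # restricted to labeled documents
```

Since doc ids only ever increase, reverse id order is the same as reverse insertion order, which is what makes the "no ranking" design so cheap.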

Monday, February 28

Palantir: Next Gen Platform for Information Analysis

Palantir is a very ambitious new tech company building a high-powered information analysis platform. They currently have products targeted at the government and financial industries. Their product is a highly specialized enterprise data system to support intelligence and business analysts.

What does Palantir do?
... the most central hard problem that we address in trying to enable the analyst is data modeling, the process of figuring out what data types are relevant to a domain, defining what they represent in the world, and deciding how to represent them in the system. At Palantir we make sure our data model (ontology) is both flexible and dynamic, and that it mirrors the concepts people naturally use when reasoning about the domain.
The platform handles both structured and unstructured information and performs extraction and data integration. See their infrastructure page and white videos for a few more details.

Their data platform handles objects. An object in their platform has four components:
- Properties: text object attributes
- Media: images, video, and binary data
- Notes: free text
- Relationships: links between objects

Clients can specialize this generic object into specific types, using the "Dynamic Ontology" tool to define the semantics. Their platform has one fixed schema with five tables: object, property, notes, media, and object-object. Each object is linked to one or more data sources, which is critical for data lineage and access controls.
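Here is a rough sketch of what a fixed five-table schema along these lines could look like, using SQLite from Python. Only the five table names come from their description; every column, the `source` lineage field, and the sample rows are my own guesses for illustration.

```python
import sqlite3

# Table names follow the description above; all columns are assumptions.
SCHEMA = """
CREATE TABLE object (id INTEGER PRIMARY KEY, type TEXT NOT NULL);
CREATE TABLE property (object_id INTEGER REFERENCES object(id), name TEXT, value TEXT, source TEXT);
CREATE TABLE notes (object_id INTEGER REFERENCES object(id), body TEXT, source TEXT);
CREATE TABLE media (object_id INTEGER REFERENCES object(id), content BLOB, source TEXT);
CREATE TABLE object_object (from_id INTEGER REFERENCES object(id), to_id INTEGER REFERENCES object(id), relation TEXT, source TEXT);
"""


def build_demo_db():
    """Create an in-memory database with two hypothetical linked objects."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    # The object type is just data, so new types need no schema change.
    db.executemany("INSERT INTO object (id, type) VALUES (?, ?)",
                   [(1, "Person"), (2, "Organization")])
    db.executemany("INSERT INTO property VALUES (?, ?, ?, ?)",
                   [(1, "name", "Alice Example", "hr_feed"),
                    (2, "name", "Example Corp", "registry")])
    db.execute("INSERT INTO object_object VALUES (1, 2, 'employee_of', 'hr_feed')")
    return db


def related_names(db, object_id):
    """Return (name, relation, source) for each object linked from object_id."""
    rows = db.execute(
        """SELECT p.value, r.relation, r.source
           FROM object_object r JOIN property p ON p.object_id = r.to_id
           WHERE r.from_id = ? AND p.name = 'name'""",
        (object_id,),
    )
    return rows.fetchall()


print(related_names(build_demo_db(), 1))
```

Keeping a source on every row is one simple way to get lineage: each property or relationship can be traced back to the feed it came from.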

A key component of the platform is search over the objects. According to their blog, their scenario has two differentiating features from web search:
  • Realtime indexing and querying – we need information to be available immediately as it changes in the system.
  • Leak-proof access controls – we need the search engine to help us make sure that we don’t have information leaking across access control boundaries.
They go into more detail on their modifications to Lucene for their use cases in two blog posts, Search with a Twist Part I and Part II. From the comments, it sounds like they are using a custom branch of Lucene 2.4.
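The two requirements can be sketched in a few lines. This is not their Lucene modification, just a toy illustration with invented names: documents are searchable as soon as they are added, and the access check happens inside the search itself, so a document the user cannot see never enters the result list.

```python
class SecureIndex:
    """Toy index with immediate visibility and per-document access groups."""

    def __init__(self):
        self.docs = []  # list of (text, allowed_groups), in insertion order

    def add(self, text, allowed_groups):
        """Documents are searchable immediately after being added."""
        self.docs.append((text, frozenset(allowed_groups)))

    def search(self, term, user_groups):
        """Return matching texts the user is allowed to see."""
        user_groups = set(user_groups)
        term = term.lower()
        return [
            text
            for text, allowed in self.docs
            # The access check is part of matching, not a later filtering pass.
            if allowed & user_groups and term in text.lower().split()
        ]


index = SecureIndex()
index.add("Budget report draft", {"finance"})
index.add("Budget meeting notes", {"finance", "staff"})
print(index.search("budget", {"staff"}))
```

A real system also has to keep hit counts, snippets, and suggestions from leaking restricted content, which is where most of the difficulty lies.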

Palantir's platform combines data processing over large heterogeneous datasets, filtering, mapping, visualization, and search in unique ways to create a compelling toolset. It built an intelligence platform that the government could not build itself, by recruiting a team of uber-geek talent lured by hip Silicon Valley panache worthy of James Bond.