Saturday, January 31

We're Sorry to Inform You...

Don't listen to all the criticism you read.

We're Sorry to Inform You...

A reminder that we are bad at predicting the importance of breakthroughs and research in the long run. Keep this in mind next time you read or review someone's research.

Friday, January 30

Facebook's Database of Sentiment Intent

You've probably heard John Battelle talk about the "database of intentions" captured in our search queries. Today, I ran across an interesting parallel with Facebook. Scoble interviewed Mark Zuckerberg at Davos. Scoble reports,
Facebook is, he [Mark] told me studying “sentiment” behavior. It hasn’t yet used that research in its public service yet, but is looking to figure out if people are having a good day or bad day. He said that already his teams are able to sense when nasty news, like stock prices are headed down, is underway. He also told me that the sentiment engine notices a lot of “going out” kinds of messages on Friday afternoon and then notices a lot of “hungover” messages on Saturday morning. He’s not sure where that research will lead...
Facebook's social interaction data brings the 'database of intentions' to an entirely new depth of understanding: people's actions and feelings.

Thursday, January 29

NY Times Open Blog and 20 years of articles available

The NY Times started an Open blog to share information about their efforts to free more of their data.

They just announced an API to access their best seller list.

I also wanted to remind readers about the Times Annotated Corpus. From the introduction:
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at
You can read the full description on the LDC website.

For some ideas on what you could do with it, start by looking at the Stanford data mining course offered this winter.

Reconsidering Relevance talk online

Daniel's talk on Reconsidering Relevance is finally online.

You can also view his slides.

I'm not sure when I'll get a chance to watch it; maybe between commercials during the Super Bowl.

Monday, January 26

Aperture framework for content crawling and conversion

Aperture is a framework for crawling, extracting, and storing data from different systems for indexing and other processing. Aperture contains crawlers for different content systems and content converters to extract text from a variety of common file formats. It writes the extracted data in RDF for storage and indexing.
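
To make the crawl, convert, and store flow concrete, here is a rough sketch in Python. It is not Aperture's API (Aperture is a Java framework); the function names, the two supported file types, and the example.org predicate are all made up for illustration. It walks a directory, pulls plain text out of .txt and .html files, and writes each result as one N-Triples statement.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(path):
    """Hypothetical converter: plain text for .txt/.html files, None otherwise."""
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="replace")
    if suffix in (".html", ".htm"):
        collector = _TextCollector()
        collector.feed(path.read_text(encoding="utf-8", errors="replace"))
        collector.close()
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join(" ".join(collector.parts).split())
    return None  # no converter for this format in the sketch


def crawl(root):
    """Hypothetical file-system crawler: yield (path, text) per convertible file."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            text = extract_text(path)
            if text is not None:
                yield path, text


def to_ntriples(path, text, predicate="http://example.org/vocab#plainText"):
    """Serialize one document as an N-Triples line (made-up predicate, minimal escaping)."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'<{path.resolve().as_uri()}> <{predicate}> "{escaped}" .'
```

Calling to_ntriples(path, text) for each pair from crawl("some/dir") gives lines you could load into an RDF store or hand to an indexer. A real setup would need many more converters (PDF, Office formats, mail) and richer metadata, which is the part a framework like this saves you from writing.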

Found via the Search and Text Analysis presentation from Grant.

See also the Lucene Tika incubator project for extracting text and structured data from a variety of formats.

Lucid Imagination Adds Further Commercial Lucene Support

Today, Lucid Imagination launched as a new company providing commercial support for Lucene and related open-source search technologies. It received $6M in venture capital.

Congratulations to Grant, Yonik, and Erik Hatcher.

LI joins Otis' Sematext and other companies providing commercial consulting for Lucene and other open-source search solutions.