Saturday, December 5

Google Gets More Personal: Extends Personalization to All Users

Google announced in a blog post that it will personalize results for all users, using data from cookies. It will draw on up to 180 days of history about your behavior.

It sounds like they've learned enough from using the histories of signed-in users to generalize the technique to a wider audience. They also believe it has enough benefit to push it on everyone, even though it raises more privacy issues. It will make people more aware that Google tracks how they interact with search results over an extended period of time.

Thursday, December 3

BixoLabs Public Terabyte Webcrawl

Ken Krugler and the team at BixoLabs are doing some things worth noting. They created Bixo, an open-source Java web crawler built on top of the Cascading MapReduce framework. I think it's one of the best open-source options for large-scale web crawling. Going further, they have their own cluster that they use for specialized client crawls.

You should read their blog for some of the recent talks Ken gave on web mining.

Particularly interesting is that they are starting to work on a Public Terabyte Project web crawl. The code and the crawl will be available for free via Amazon's dataset hosting.

I look forward to taking a closer look at it. It's an important resource, because ClueWeb09 is only available to researchers who pay for it.

Microsoft EntityCube is live

Microsoft demoed EntityCube at TechFest (see my previous post). The system is now live.

See the top-ranked papers by domain, for example information retrieval.

It provides an interesting list of the most influential people in a field by citations.