Friday, April 13

Friday Round-up: GSOC, ICSWSM vids, and Edison

  1. The winners of Google's Summer of Code were announced today; all of the results are here. The collaborative filtering software Taste was a winner, with two projects selected.

  2. The ICSWSM keynote videos are online at the ICSWSM blog.

  3. Apostolos Gerasoulis, co-founder of Teoma and professor at Rutgers, leaked at SES NY (more coverage on the Social Search panel at SES NY from SEORoundTable) that Ask is working on a new ranking system. The new system, "Edison", combines Teoma's HITS link analysis (see Kleinberg's Authoritative Sources in a Hyperlinked Environment) with user behavior analysis, an evolution of DirectHit technology.

    From Rahul Lahiri, Vice President of Product Management and Search Technology at

    Edison is still in development, so we can't say too much at this juncture. I can tell you that it's a next generation algorithm that, among many other things, synthesizes modernized versions of Teoma and DirectHit technologies, as AG said this morning. It's much more complicated than saying we're just counting clicks, in the case of DirectHit. The technologies we have, and the patents we hold, go way beyond that. We're also taking a deeper look at communities and calculating the authorities in those communities. We were really inspired by looking into the universe of user behavior, and what that could tell us, and the social fabric of the Web itself, and what that tells us. We're also rolling out an upgraded search infrastructure over the course of 2007 and building a new datacenter along the Columbia River in eastern Washington, which will help our speed, freshness and data quality. It's safe to say that Edison itself will roll out over the course of the year, as we improve it and tweak parameters.

    via SearchEngineLand. Remember what I said about implicit feedback being underrated...
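For anyone who hasn't run into HITS before, here is a minimal sketch of the hub/authority iteration from Kleinberg's paper. To be clear, this is not Ask's or Teoma's code, and it says nothing about how Edison mixes in user behavior; the function name `hits`, the toy graph `toy_links`, and the fixed iteration count are my own assumptions for illustration.

```python
from math import sqrt


def hits(links, iterations=50):
    """Toy HITS. `links` maps each page to the list of pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        new_auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                new_auths[t] += hubs[src]

        # Hub update: sum the (new) authority scores of the pages linked to.
        new_hubs = {p: sum(new_auths[t] for t in links.get(p, [])) for p in pages}

        # Normalize both vectors to unit length so the scores stay bounded.
        a_norm = sqrt(sum(v * v for v in new_auths.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        auths = {p: v / a_norm for p, v in new_auths.items()}
        hubs = {p: v / h_norm for p, v in new_hubs.items()}

    return hubs, auths


# Hypothetical four-page graph: "a" and "b" both point at "c", and "c" points at "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
hubs, auths = hits(toy_links)
best_authority = max(auths, key=auths.get)
```

In this toy graph the page with two in-links ("c") ends up as the top authority, and the two pages pointing at it come out as the strongest hubs. The real thing runs on a query-specific subgraph of the Web, which this sketch ignores.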

Grand Re-Opening: un caffè, per favore

Now open later for your search engine information fix.

You may have noticed (or more likely not have noticed) that the name of the blog has been changed from "Jeff's Search Cafe" to "Jeff's Search Engine Caffè". Likewise, the address has been updated and has a new TLD name,

The reason is that I was contacted by a company that owns the trademark for "SearchCafe". To avoid any further conflict and hassle, I decided to take this as an opportunity to buy a new domain name and update the site. In the long run, having its own domain name will give me more freedom -- in case I want to switch off of Blogger.

In the next few weeks I hope to create a new template to go with the new name and address; the current look and feel is pretty lame.

Thanks for reading, and if you get a chance I would appreciate it if you update any bookmarks (such as those on Delicious).

P.S. A piece of advice to anyone who decides to do anything online: before you start, do a trademark search. At the very least, search the USPTO and Google -- even if you only plan to create something on a sub-domain of a free service and your content is completely non-commercial.

Wednesday, April 11

Hakia's Quest for Better Search

Hakia, a semantic and NLP-based engine, started a discussion among some of the leading bloggers and search engine journalists on the future of search:

The Search For Better Search

Here are some of my thoughts on The Future.

The future of search engines is about context and authority.

First, search engines of the future will provide more relevant information because they will have more information about me: my level of expertise in different disciplines (5th grader versus a post-doc), my current location (geographic and home/work), what I am working on at the moment (writing a research paper, writing code, reading news, researching my next trip), and so on.

Second, search engines will (hopefully) be able to tell the difference between overall popularity and topical authority. For example, The Wall Street Journal may not be an authority on Food or culinary information. And a mere blogger may be an authority on Personalization.

The future of search has major implications for digital identity and privacy as people do more online. Search engines will begin to be aware of what we do: what websites we visit through toolbars, what searches we execute, what search results and ads we click on, what information is in our documents, and how this information is connected with our blogs and even our Amazon Wishlists and product reviews.

How much privacy are you willing to give up to get good search results? How much is necessary?

As John Battelle commented:
To get there, we'll need to trust that everything we disclose online - our behaviors, our clickstreams, and our intent - are managed through a trusting relationship. The future of search is as a conversation with someone we trust.
How much do you trust Google? Or MSN or Yahoo?

In the future, the service that gets my search traffic may not be the one that provides the most relevant search results, but the one I trust most with my data.

Recent Graduate Courses on Information Retrieval

If you are looking to get into search, or just stay up to date on what's happening, a good place to start is the latest courses being offered. Here is a review of some of the best IR courses taught in the past year.

CS276 (Fall 2006) - The Stanford Graduate IR course, taught by Christopher Manning and Prabhakar Raghavan. This is The Standard for an IR course. Their new book Introduction to Information Retrieval is quickly becoming one of the standard textbooks.

IR 11741 (Spring 2007) - The CMU Graduate IR course, taught by Jamie Callan and Yiming Yang. They also taught 15-493, Information Retrieval and Web Mining, in Fall 2006.

CS646 (Fall 2006) - The UMass Graduate IR course, taught by James Allan. This course is strong on probabilistic IR and Language Modeling, UMass's (and Indri's) particular area of expertise.

CSE345/445 (Spring 2007) - WWW Search Engines: Algorithms, Architectures and Implementations, taught by Brian Davison at Lehigh University. Brian is one of the creators of DiscoWeb, which later became Teoma, which was in turn acquired by Ask. He also chairs the AIRWeb (Spam) workshop.

CS584 (Fall 2006) - The Emory University IR course, taught by Eugene Agichtein. Eugene's specialty is Information Extraction. He previously worked at MSR in the Search and Navigation Group.

That's all for now, although I am surely missing some. If you run across some good ones, let me know!

Monday, April 9

Monday: Catching up on last week's news

  1. Karen Spärck Jones (26 August 1935 – 4 April 2007) died last Wednesday from cancer; the press release is available here. She was a pioneer in computational linguistics and information retrieval. Karen pioneered the ideas of IDF, BM25, and automatic document summarization. She was also instrumental in helping to create the TREC evaluation model. Karen will be greatly missed. If you get the chance, look over her recent papers (even into 2007).

  2. SEOmoz posted an article on Search Engine Ranking Factors, Version 2. It incorporates replies from 37 leading SEO practitioners. The important things are not all that surprising: 1) Keyword use in the title, 2) Site/Domain popularity or "authority", 3) anchor text.

    I believe one of the most underrated factors in their article is user behavior (relative click-thru rates, time spent on the page, popularity measured via toolbars, number of times bookmarked, etc...). It is hard to measure (unless you are the SE), so I can see how it might be overlooked. However, I believe it is more important than most people realize.

    User-behavior-based ranking is a hot research area, and one important researcher in the field is Susan Dumais at MSR. For example, two recent MSR papers are Learning User Interaction Models for Predicting Web Search Result Preferences and Improving Web Search Ranking by Incorporating User Behavior Information. Google has not been publishing in this area, but you can bet that they have built similar models of user behavior to measure and improve the relevancy of their search results.

  3. David Sifry, founder and CEO of Technorati, posted their quarterly State of the Live Web Report. It has some interesting information on the growth and size of the blogosphere.

  4. LingPipe is moving to Java 1.5 for their 3.0 release. They have an interesting write-up on their experience using Generics in LingPipe 3.0.
This concludes my round-up.