Friday, November 21

Google harnesses "Wisdom of Crowds" with Wiki of Search

See the official blog post. Here is the description of "SearchWiki" from Googlers Cedric and Corin:
...SearchWiki, a way for you to customize search by re-ranking, deleting, adding, and commenting on search results. With just a single click you can move the results you like to the top or add a new site... The changes you make only affect your own searches. But SearchWiki also is a great way to share your insights with other searchers. You can see how the community has collectively edited the search results by clicking on the "See all notes for this SearchWiki" link.
This has been in testing for a while. See my previous post on Google's Wiki of Search and Eric Schmidt's notes from 2006.

I think getting users more involved is a good idea, especially when you have an audience as big as Google's. However, I'm skeptical about the current system's utility. For example, it is disconnected from Google's related products, such as Google Notebook and Google Bookmarks. It doesn't allow me to incorporate my social network. I don't think that most people have a compelling desire to edit or comment on their own search results. A few common queries may get edits, but what about the long tail of search? That said, maybe I'll go change the ranking for a few vanity searches anyway ;-).

SE Land also has informative coverage.

Thursday, November 20

MIT Mathematics for Computer Science is a godsend

This semester I'm taking a class in Advanced Algorithms. We are currently investigating randomized algorithms.

I've found the MIT OpenCourseWare material a godsend. MIT offers a course, Mathematics for Computer Science (2002), with a significant section on probability theory, including the bounding techniques we've been studying. If you want a good crash course in probability, I highly recommend reading the notes for lectures 10-14. The notes are clear and the examples fascinating. I'll share one of my favorites. Professor Chernoff did an investigation of the Mass. lottery, described in the notes for lectures 13-14:
There is a lottery game called Pick 4. In this game, each player picks 4 digits, defining a number in the range 0 to 9999. A winning number is drawn each week. The players who picked the winning number win some cash. A million people play the lottery, so the expected number of winners each week is 100... In this case, a fraction of all money taken in by the lottery was divided up equally among the winners. A bad strategy would be to pick a popular number. Then, even if you pick the winning number, you must share the cash with many other players. A better strategy is to pick a lot of unpopular numbers. You are just as likely to win with an unpopular number, but will not have to share with anyone. Chernoff found that peoples’ picks were so highly correlated that he could actually turn a 7% profit by picking unpopular numbers!
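To make the arithmetic in the excerpt concrete, here is a small sketch in Python. The player count and number range come from the excerpt. The $1 ticket price, the `payout_fraction` value, and the `popularity` multiplier are hypothetical choices for illustration only; the notes do not give them here, and this toy model is not meant to reproduce Chernoff's 7% figure.

```python
NUM_NUMBERS = 10_000      # Pick 4: every number from 0000 to 9999
NUM_PLAYERS = 1_000_000   # from the lecture notes excerpt


def expected_winners(players=NUM_PLAYERS, numbers=NUM_NUMBERS):
    """Expected winners per week if picks were spread uniformly."""
    return players / numbers


def expected_return_per_dollar(popularity, payout_fraction,
                               players=NUM_PLAYERS, numbers=NUM_NUMBERS):
    """Rough expected return on a $1 ticket (ticket price is an assumption).

    popularity: how often your number is picked relative to average
                (1.0 = average, 0.5 = half as popular). Hypothetical knob.
    payout_fraction: share of ticket sales paid out as prizes. Hypothetical.

    Simplification: the number of other players on your number is treated
    as a fixed value rather than a random variable.
    """
    pool = payout_fraction * players            # total prize money
    others = popularity * players / numbers     # other players sharing your number
    share_if_win = pool / (others + 1)          # prize split equally among winners
    return share_if_win / numbers               # you win with probability 1/numbers


if __name__ == "__main__":
    print("expected winners:", expected_winners())
    for pop in (2.0, 1.0, 0.5):
        print(pop, round(expected_return_per_dollar(pop, payout_fraction=0.6), 3))
```

Under these made-up parameters, an average number returns less than the ticket price while a number picked half as often returns more, which is the effect the excerpt describes: the chance of winning is the same, but the prize is split fewer ways.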
Most state-of-the-art retrieval algorithms are based on statistics and the probability of a word's occurrence in a document with respect to a collection of documents. So, even if you aren't taking a class in algorithms, probability is useful background to study for search.
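As one illustration of what "probability of a word in a document relative to a collection" can look like, here is a minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing. This is just one well-known model picked as an example; the function name `query_log_likelihood`, the toy documents, and the smoothing weight `lam` are all assumptions made for this sketch.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log-probability of the query under a smoothed document language model.

    query, doc: lists of tokens. collection: list of token lists.
    lam: weight on the document model (illustrative value, not tuned).
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(term for d in collection for term in d)
    coll_len = sum(coll_tf.values())

    log_p = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0   # P(term | document)
        p_coll = coll_tf[term] / coll_len                 # P(term | collection)
        p = lam * p_doc + (1 - lam) * p_coll              # interpolate the two
        if p == 0.0:
            return float("-inf")  # term never seen anywhere in the collection
        log_p += math.log(p)
    return log_p


if __name__ == "__main__":
    docs = [
        ["randomized", "algorithms", "probability"],
        ["search", "engine", "ranking"],
    ]
    for d in docs:
        print(d, query_log_likelihood(["probability"], d, docs))
```

The collection term keeps a document from scoring zero just because it is missing one query word, which is where the "with respect to a collection" part comes in.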

Thank you MIT!

Wednesday, November 19

Berry picking your way through search

Gord Hotchkiss has a writeup titled Berrypicking Your Way Through Search where he looks at information seeking behavior.

In the article, he builds on the insights of an older, pre-web paper on information seeking behavior: The design of browsing and berrypicking techniques for the online search interface by Marcia Bates.

Here is a brief excerpt from the original article:

So throughout the process of information retrieval evaluation under the classic model, the query is treated as a single unitary, one-time conception of the problem. Though this assumption is useful for simplifying IR system research, real-life searches frequently do not work this way... At each stage they are not just modifying the search terms used in order to get a better match for a single query. Rather the query itself (as well as the search terms used) is continually shifting, in part or whole. This type of search is here called an evolving search.

Another reminder that search is an inherently interactive process and classical models that do not account for this are very limiting. On a related note, see previous coverage of Nick Belkin's ECIR 2008 keynote address (and Daniel's notes).

Tuesday, November 18

Yahoo! BOSS API updated with Prisma document terms

Yahoo! Search announced an interesting extension to their public BOSS API on their blog today. From their description:
Key Terms is derived from a Yahoo! Search capability we refer to internally as "Prisma."... Key Terms is an ordered terminological representation of what a document is about. The ordering of terms is based on each term's frequency and its positional and contextual heuristics...Each result contains up to 20 terms describing the document.

Add the parameter view=keyterms to the BOSS request to see the new functionality.
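For reference, here is a hedged sketch of building such a request URL in Python. Only the `view=keyterms` parameter comes from the announcement. The endpoint in `BOSS_WEB_ENDPOINT`, the `appid` and `format` parameters, and the helper name `build_boss_url` are my assumptions about how a BOSS web search request is formed; check them against the BOSS documentation before relying on them.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint for BOSS web search; verify against the official docs.
BOSS_WEB_ENDPOINT = "http://boss.yahooapis.com/ysearch/web/v1/"


def build_boss_url(query, app_id, view="keyterms"):
    """Build a BOSS web search URL that asks for the Key Terms view.

    app_id is a placeholder for your own application ID (assumed parameter
    name: appid). The response format parameter is also an assumption.
    """
    params = {"appid": app_id, "format": "json", "view": view}
    # The query goes in the path, so percent-encode it fully.
    return BOSS_WEB_ENDPOINT + quote(query, safe="") + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_boss_url("information retrieval", "YOUR_APP_ID"))
```

The sketch only constructs the URL and does not send a request, so it can be adapted to whatever HTTP client you prefer.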

I wonder if this is at all related to the Key Term Extraction API that Yahoo! provides.

Monday, November 17

Symposium on Semantic Knowledge Discovery, Organization, and Use

Over the weekend NYU hosted an NSF-sponsored symposium on semantic knowledge. To summarize the description, the symposium tackles the extraction of knowledge from large corpora using automatic or semi-automatic methods. It is a forum to discuss research and provide a high-level picture of the field.

Daniel attended the symposium and has notes from Day 1 and Day 2. His notes are a good start, but I'm really disappointed by the dearth of information available for those who could not attend. The IRF symposium provides a good model for how to do this: there was a live stream of the presentations, and the videos and slides are available after the conference.

Beyond basics, in the future we should enable remote audience registration and participation. We should be able to watch presentations and have online discussion. After all, traveling to conferences is expensive and often infeasible.

Sunday, November 16

Open Source HTML parsers for Java

Simple HTML parsing and text extraction is, well, not so easy to do well. Over two years ago, I wrote a post: Open Source HTML parsers. Since then, I've mainly stuck by TagSoup as the best open-source choice for me, but today there are a few other alternatives that I'm considering for a new project.

HtmlCleaner - A small, lightweight parser that fixes up and re-orders HTML to produce well-formed XML. It won top marks in Ben McCann's comparison of HTML parsers. However, I tried it out on a few Wikipedia pages and the text it returned was not acceptable: it contained snippets of JavaScript and commented-out CDATA content.

The best parsers are those found in the top web browsers. However, it's usually quite challenging (and slow) to use them in external programs.

Java Mozilla Html Parser - A Java wrapper around the Firefox HTML parser that provides a Java API to parse documents. The website is out of date, but there was a v0.3 release in October.

Of course, you still have the option to write your own for maximum flexibility and speed. I'm still waiting for a real production-quality parser. We'll need something better than what's available today to deal with those messy billion-document test collections that are coming soon.
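To show the kind of filtering a text extractor has to do, here is a minimal sketch that drops script, style, and comment content. It is written in Python with the standard library's `html.parser` rather than any of the Java libraries above, purely to keep the example short. The class name `TextExtractor` and helper `extract_text` are made up for this sketch, and it is nowhere near robust enough for messy real-world pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content.

    Comments are ignored because HTMLParser routes them to handle_comment,
    which does nothing by default.
    """

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        # Simplification: chunks are joined with single spaces, so the
        # original whitespace and block structure are lost.
        return " ".join(self._chunks)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = ("<html><head><script>var x = 1;</script></head>"
            "<body><!-- hidden --><p>Hello <b>world</b></p></body></html>")
    print(extract_text(page))
```

Even this toy version has to track element nesting by hand, which hints at why doing it well on malformed markup is hard.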