Thursday, December 15

Information Retrieval and other interesting collocations

In my spare time on evenings and weekends, I have been doing some research (and programming) on collocations and term co-occurrence, especially their applications in web search. Let me begin with some definitions.

According to Manning and Schütze, a collocation is "an expression consisting of two or more words that correspond to some conventional way of saying things" (Manning and Schütze, 1999). Another definition, used by Columbia NLP researcher Frank Smadja, is that a collocation is "an arbitrary and recurrent word combination" (Benson 1990). Term co-occurrence is less strict: it simply refers to words that happen to occur in the same document.

Collocations are very interesting. They are combinations of words that carry a special meaning, like "a stiff wind." In a technical context, they often show up as technical terms. For example, the title of this article includes a CS/IR collocation, "information retrieval"; another might be "information extraction." For a full discussion of collocations, you can read Chapter 5 of Manning & Schütze's book, which happens to be available online.
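
That chapter also covers statistical measures for deciding whether a word pair is a real collocation rather than a chance co-occurrence; one of the simplest is pointwise mutual information. As a rough illustration (the counts below are made-up placeholders, not real corpus statistics), here is what scoring a bigram looks like in Java:

    // Rough sketch: score a candidate bigram with pointwise mutual information (PMI),
    // one of the association measures discussed in Manning & Schütze, Chapter 5.
    // The counts below are made-up placeholders, not real corpus statistics.
    public class PmiExample {

        // PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
        static double pmi(long countXY, long countX, long countY,
                          long totalBigrams, long totalWords) {
            double pXY = (double) countXY / totalBigrams;
            double pX = (double) countX / totalWords;
            double pY = (double) countY / totalWords;
            return Math.log(pXY / (pX * pY)) / Math.log(2);
        }

        public static void main(String[] args) {
            // Hypothetical counts for the bigram "information retrieval"
            double score = pmi(150, 2000, 900, 1_000_000, 1_000_000);
            System.out.println("PMI(information, retrieval) = " + score);
        }
    }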

Collocations can be used in web search in some interesting ways. One way is to guide the discovery of information: they give you ideas for keywords you might want to add to your query to make it more specific. One example of this is Ask Jeeves' "narrow your search" suggestions. Collocations are also used in clustering, but that is a story for another day.

On those same evenings and weekends, I have been writing a Java program that uses the Google API to extract interesting collocations from web search result summaries. It uses N-grams (sequences of words or characters) to identify interesting phrases. Hopefully, I'll have it working well enough to post online soon.
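
My code isn't ready to post yet, but the N-gram step itself is simple: slide a window of n words over each result summary. A minimal sketch (the summary string is a placeholder; the actual Google API call is omitted):

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the word N-gram step: slide a window of size n over the
    // tokens of each search result summary. The summary here is a placeholder;
    // in the real program it would come back from the Google API.
    public class NGramSketch {

        static List<String> wordNGrams(String text, int n) {
            String[] tokens = text.toLowerCase().split("\\W+");
            List<String> ngrams = new ArrayList<>();
            for (int i = 0; i + n <= tokens.length; i++) {
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < n; j++) {
                    if (j > 0) sb.append(' ');
                    sb.append(tokens[i + j]);
                }
                ngrams.add(sb.toString());
            }
            return ngrams;
        }

        public static void main(String[] args) {
            String summary = "Information retrieval and information extraction are related fields";
            System.out.println(wordNGrams(summary, 2)); // bigrams
        }
    }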

My current approach is not too sophisticated: I am using a list of stop words (a, the, is, of, etc.) to remove "uninteresting" phrases; a minimal sketch of that filter follows below. If I have time, I would like to pursue a more sophisticated approach that uses part-of-speech (POS) tagging.
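
Here is roughly what the stop-word filter looks like, continuing the N-gram sketch above (the stop list is just a tiny placeholder):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Sketch of the stop-word filter: drop candidate phrases whose first or last
    // word is a stop word. The stop list here is just a tiny placeholder.
    public class StopWordFilter {

        static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

        static List<String> filter(List<String> ngrams) {
            return ngrams.stream()
                    .filter(ng -> {
                        String[] words = ng.split(" ");
                        return !STOP_WORDS.contains(words[0])
                                && !STOP_WORDS.contains(words[words.length - 1]);
                    })
                    .collect(Collectors.toList());
        }
    }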

One example of a POS-tagging approach to this problem is the Xtract program by Frank Smadja. I managed to find a copy of his paper online (not through the ACM portal): Retrieving Collocations from Text: Xtract. When I get around to it, I think it would be really interesting to combine Stanford's POS tagger with my program to see if I can produce better results.

Finally, I would also like to track down and read Justeson and Katz's paper: Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text, which I had some difficulty locating through common online sources.
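
If I do wire in the Stanford tagger, the filtering step itself shouldn't be much code. A rough sketch, assuming the tagger's MaxentTagger class (the model path is a placeholder) and keeping only adjective/noun sequences that end in a noun, which is roughly the kind of pattern Justeson and Katz describe:

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    // Rough sketch: tag a candidate phrase with the Stanford POS tagger and keep it
    // only if it looks like a technical term: adjectives and nouns ending in a noun,
    // roughly the pattern Justeson and Katz describe. The model path is a placeholder.
    public class PosFilterSketch {

        private final MaxentTagger tagger =
                new MaxentTagger("models/english-left3words-distsim.tagger");

        boolean looksLikeTerm(String phrase) {
            // tagString returns e.g. "information_NN retrieval_NN"
            String[] tagged = tagger.tagString(phrase).trim().split("\\s+");
            for (String wordAndTag : tagged) {
                String tag = wordAndTag.substring(wordAndTag.lastIndexOf('_') + 1);
                if (!(tag.startsWith("NN") || tag.startsWith("JJ"))) {
                    return false;           // only adjectives and nouns allowed
                }
            }
            String last = tagged[tagged.length - 1];
            String lastTag = last.substring(last.lastIndexOf('_') + 1);
            return lastTag.startsWith("NN"); // must end in a noun
        }
    }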

I'd like to deploy my program as a web app so that people can try it out and see the results in real time without having to download and run the program.

Stay tuned.

Tuesday, December 13

Google Active Homepage

Google today announced on its blog that in addition to RSS feeds it will start supporting "richer web apps as well." In other words, if you can write some HTML and JavaScript, you can include it as a widget on your homepage.

Creating an application with the Google Homepage Developer APIs is a snap. To create a widget, all you need to do is wrap some simple XML around the HTML of your choice. As a demo, I created a widget to search GlobalSpec's Engineering Web search engine. If you want to check it out, all you need to do is:
  1. Copy this link location.
  2. Open the "Add content" pane of your Google personalized homepage.
  3. Paste it into the Google "Create Section" text box.
  4. Try a search for "Britney Spears" (you might even learn how to work for Intel), or learn to build a LEGO robot by checking out "Lego Mindstorms".
Now, was that so incredibly difficult? Making a widget is almost as hard!
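
For the curious, the widget itself is just a small XML file wrapped around your HTML. The sketch below shows the general shape only; the element names are from memory and the search form markup (URL and parameter name) is a made-up placeholder, so check Google's developer docs for the exact format.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Structural sketch only: element names from memory, form markup is a placeholder. -->
    <Module>
      <ModulePrefs title="GlobalSpec Engineering Web Search" />
      <Content type="html"><![CDATA[
        <form action="http://search.globalspec.com/Search" method="get">
          <input type="text" name="query" size="25" />
          <input type="submit" value="Search" />
        </form>
      ]]></Content>
    </Module>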

It's nice to see some action from Google in this area. I use Google as my personalized homepage, but Google is playing catch-up in this arena. Microsoft has been building its "Windows Live" platform, which includes your own personalized, active homepage. Microsoft's widget gallery has been around longer and is richer, with a mind-blowing 110 widgets.

The other player, Yahoo, has its widget system from its Konfabulator acquisition. It has a hefty 1,600+ widgets. The downside: you have to download an engine that runs on your desktop. Perhaps it's just me, but isn't this what I have a browser for? You want me to download a desktop app to use widgets? Sorry, I'm not convinced yet.

Right now Google's directory only has five widgets, but I look forward to seeing what is added as time goes on.

Did you get the Memeorandum?

I was watching the news of the Delicious acquisition over the weekend, and I came across a website called Memeorandum mentioned more than a few times. Up until today I thought it was spelled MEMOrandum; I was wrong, but it appears to be a not uncommon mistake.

"Memeorandum is like an automated “New York Times” for the Web," according to John Furrier, Founder & CEO, PodTech.net, in a recent interview. What's so cool? As Scoble writes in his blog on the topic, "Memeorandum chews through thousands of blogs in minutes and tells you what's important." Currently the paper only has two sheets: politics & technology.

As you've probably already figured out, Memeorandum is a play on the words "meme" and "memorandum." I didn't know what a meme was, so I looked it up:
Meme. As defined by Richard Dawkins in The Selfish Gene (1976): "a unit of cultural transmission, or a unit of imitation." "Examples of memes are tunes, ideas, catch-phrases, clothes fashions, ways of making pots or of building arches. Just as genes propagate themselves in the gene pool by leaping from body to body via sperms or eggs, so memes propagate themselves in the meme pool by leaping from brain to brain via a process which, in the broad sense, can be called imitation."
In this case, the meme is a news story. Memeorandum tracks a story's viral-like propagation throughout the blogosphere and news media, and rates stories based on this propagation of ideas.

Memeorandum was founded by Gabe Rivera, a former Intel compiler designer, so clearly he has some experience designing complex software. According to his blog, he founded Memeorandum on three basic principles:
  1. Recognize the Web as Editor: There's this notion that blogs collectively function as news editor.
  2. Rapidly uncover new sources: Sometimes breaking news is posted to a blog created just to relate that news.
  3. Relate the conversation: Communication on the web naturally tends toward conversation.
The last bullet is what I find really interesting. The coolest thing about Memeorandum is that it groups stories into headlines and collects the most relevant discussions of each story into a thread that you can navigate like a forum. It's a one-click way to see the repercussions and discussion of a story throughout the "Live Web."

One of the most interesting things about Memeorandum is that it excels at filtering out the noise present in other services like Digg or even Slashdot. It is great for busy people who need to see quickly what is going on on the web and what people are saying. Memeorandum lets you do this without visiting dozens of sites or plowing through lots of posts in an RSS reader. So how does it select who is shown on the front page?

Gabe has a blog post that is particularly enlightening on the topic. Here is an excerpt:

I'll start with the most common question: how are sources selected for inclusion?

To answer that, I'll begin with my philosophy: I want writers to be selected by their peers. That is, I want the writers in each topic area to select which of their peers show up on the site. Not deliberately, but implicitly, by whom they link to and in what context they link.

The source-picking algorithm is based on this philosophy and works roughly as follows: I feed it a number of sites representative of the topic area I want coverage. It then scans text and follows links to discover a much larger corps of writers within that area.

The decisions for including sources are continually reevaluated, in such a way that new sources can be included in real time. Think about that for a second.
Wow! Not very Web 2.0-y, that's for sure. Gabe is saying that there must be hierarchy! Heresy, burn him! In fact, he says that the initial seed sites are ones that he selects for a topic. So much for "radical decentralization and harnessing the collective intelligence." This is centralized authority harnessing the elite collective intelligence. This elite then selects other relevant writers and authors based on their votes -- their links. It sounds to me a bit like hiring by committee at Google, done on a story-by-story basis. It isn't susceptible to spam because, as Gabe says in the PodTech interview, "They won't link to what isn't relevant because it will spam up their own blogs." In other words, these guys are important and have a reputation to protect, a bit like 'real journalists,' so they are more careful about what and whom they link to.
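
Just to make that concrete, here is a toy reading of "selected by their peers" in code. This is purely my own illustration, not Gabe's actual algorithm: start from a few hand-picked seeds, count links from the current set of sources to candidate sites, and promote any candidate that collects enough votes.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Toy illustration of "writers selected by their peers" -- NOT Gabe's actual
    // algorithm. Start from hand-picked seeds, count links from current sources to
    // candidate sites, and promote candidates that collect enough "votes."
    public class PeerSelectionSketch {

        // Hypothetical input: maps a source site to the sites it currently links to.
        static Set<String> includedSources(Map<String, List<String>> outboundLinks,
                                           Set<String> seeds, int votesNeeded) {
            Set<String> sources = new HashSet<>(seeds);
            boolean changed = true;
            while (changed) {
                changed = false;
                Map<String, Integer> votes = new HashMap<>();
                for (String source : sources) {
                    for (String target : outboundLinks.getOrDefault(source, List.of())) {
                        votes.merge(target, 1, Integer::sum);
                    }
                }
                for (Map.Entry<String, Integer> e : votes.entrySet()) {
                    if (e.getValue() >= votesNeeded && sources.add(e.getKey())) {
                        changed = true;   // a new source earned enough peer links
                    }
                }
            }
            return sources;
        }
    }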

Gabe makes an interesting distinction between Memeorandum and other commercial news sites in the same PodTech interview:
One way that you should look at this as different from the NY Times or CNET News is that it's open to you! If you have something to say on a story and you're a blogger, you may get a link with ease; but if you don't, show your work to someone else and get them to link to you, and you may find that you'll be added in minutes.
If you haven't read the PodTech interview yet, check it out!

Another site that is similar to this, but limited strictly to blog sources, is Blogniscient. In my opinion, Memeorandum is a lot more interesting!

Very smart guy, very cool technology. I'm not the only one who is smitten. The folks over at TechCrunch wrote about it here, here, and again here! We also agree -- we both prefer Memeorandum over Blogniscient.

Alexa Web Search Platform: IBM WebFountain 2.0

John Battelle and others are reporting on the launch of the Alexa Web Search Platform.

Is it economically feasible for a new vertical search engine to build its own web crawler and a multi-terabyte data storage system? The cost presents a sizable barrier to entry into the vertical search arena, and it is one of the main reasons there are still so few vertical search engines. Alexa hopes to change that by offering a hosted platform for companies and users to create their own custom search engines, or perhaps just to extract some metadata.

From Alexa's website:
The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools. Alexa provides compute and storage resources that allow users to quickly process and store large amounts of web data. Users can view the results of their processes interactively, transfer the results to their home machine, or publish them as a new web service.
My initial question was: how did they get around the legal problems associated with this? After all, they are essentially charging for access to my and other users' copyrighted content.

They got around this the same way IBM did with WebFountain. I had the opportunity to talk to the head of IBM's WebFountain project at a search engine conference earlier in the year. One of the reasons WebFountain was never a runaway hit was that IBM couldn't provide direct access to the content; the reason the IBM employee gave was that it wasn't legal to charge for access to others' copyrighted works. Imagine me charging you to download all of the Star Trek episodes as a service off my website. IBM got around this by providing access to a derivative work: metadata. WebFountain aggregated the knowledge of the web to create a new product that IBM could sell. IBM would mine the web for you and provide answers to your business intelligence questions. However, IBM had to write the software to run on its cluster, so you paid per question, and each question was expensive because it required custom programming.

The Alexa platform is the next evolution of this business model. I call it WebFountain 2.0. Instead of approaching IBM and asking them to design a program to answer your question, now you can write your own program and have Amazon, er, Alexa, run it.

So, what exactly is the platform they are offering? According to the FAQ:
This store contains both the raw document content and the metadata extracted by Alexa's internal processes. All Platform users have access to the data in this store... Alexa maintains three Web snapshots in the Data Store. Each Web snapshot represents two months of crawling activity. Each snapshot contains about 100 Terabytes of uncompressed data so at any time, the Data Store contains 200 - 300 Terabytes of data.
In other words, they will let you run programs against their 5-billion-page web crawl on their cluster. At the end, you can download your results (metadata) or publish your own private search engine hosted on their cluster. I don't think you can actually download the content directly from the repository.

The pricing model is pretty simple and straightforward: you pay for CPU time, bandwidth, and storage space.

Two things make the Alexa platform innovative compared to WebFountain: 1) you can write your own code against the system, and 2) the end product can be a private/custom search engine instead of just some metadata.

We'll see what happens!