Thursday, January 10

Terrier 2.0 Search Engine Released

The University of Glasgow recently released Terrier 2.0, just in time for ECIR 2008.

The new version has two main new features: 1) faster and more efficient single-pass indexing and 2) a new Divergence From Randomness (DFree) retrieval model. See What's New for a complete list.

Indexing
The release includes a faster, single-pass indexing architecture contributed by Roi Blanco of the University of A Coruña in Spain. Roi is also organizing the first Efficiency Issues in Information Retrieval Workshop at ECIR with Fabrizio Silvestri.
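
For readers new to the idea, here is a minimal Python sketch of what single-pass indexing generally means: scan the collection once, build postings in memory, flush a sorted "run" whenever memory fills up, and merge the runs at the end. This is not Terrier's code (Terrier is written in Java). The names build_runs and merge_runs and the max_postings_in_memory threshold are hypothetical, and in-memory lists stand in for the on-disk runs a real indexer would write.

```python
import heapq
from collections import defaultdict


def build_runs(docs, max_postings_in_memory):
    """Scan (doc_id, text) pairs once, flushing a sorted run when the buffer fills."""
    runs = []
    postings = defaultdict(list)  # term -> [(doc_id, term_frequency), ...]
    buffered = 0
    for doc_id, text in docs:
        # Count term frequencies for this document only.
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
            buffered += 1
        # Assumed memory model: flush once enough postings are buffered.
        if buffered >= max_postings_in_memory:
            runs.append(sorted(postings.items()))  # stands in for a disk write
            postings = defaultdict(list)
            buffered = 0
    if postings:
        runs.append(sorted(postings.items()))
    return runs


def merge_runs(runs):
    """Merge term-sorted runs into a single inverted index."""
    index = {}
    # heapq.merge keeps ties in run order, so doc ids stay ascending per term.
    for term, plist in heapq.merge(*runs, key=lambda item: item[0]):
        index.setdefault(term, []).extend(plist)
    return index
```

Calling merge_runs(build_runs(docs, 2)) on a handful of short documents gives a dictionary from each term to its list of (doc_id, term_frequency) pairs. The point of the design is that no document is read twice and memory use is bounded by the flush threshold.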

Retrieval
DFree is a robust, parameter-free(!) retrieval model. In short, it should work on new document collections without any tuning. It was contributed by Fondazione Ugo Bordoni, very likely including Gianni Amati.

Last year, I included Terrier in my roundup of open source search engines.

Vint Cerf History of the Internet video

Vint Cerf, one of the inventors of the Internet, provides a fascinating retrospective on its development and directions for the future.



Or, watch it directly.

Monday, January 7

Semantic tagging of Wikipedia and a workshop at ECIR

The European Conference on Information Retrieval, ECIR 2008, is coming up, from March 30th to April 3rd at Glasgow University. I would love to attend, but it doesn't appear likely.

One workshop that I would like to attend is Exploiting Semantic Annotations in Information Retrieval, organized by Omar Alonso from A9 and Hugo Zaragoza from Yahoo! Barcelona. From the description:

By semantic annotations we refer to linguistic annotations (such as named entities, semantic classes, etc.) as well as user annotations such as microformats, RDF, tags, etc. We are not interested in the annotations themselves, but on their application to information retrieval tasks such as ad-hoc retrieval, classification, browsing, textual mining, summarization, question answering, etc...

In particular, techniques have been developed to ground named entities in terms of geo-codes, ISO time codes, Gene Ontology ids, etc. Furthermore, the number of collections which explicitly identify entities is growing fast with Web 2.0 and Semantic Web initiatives...

Despite the growing number and complexity of annotations, and despite the potential impact that these may have in information retrieval tasks, annotations have not yet made a significant impact in Information Retrieval research or applications. Further research is needed before we can unleash the potential of annotations.

There have been some recent efforts to automatically annotate Wikipedia with semantic information. For example, Hugo and other Yahoo researchers made available a Semantically Annotated Snapshot of the English Wikipedia (SW1).

Also, in the paper Autonomously Semantifying Wikipedia, Fei Wu and Daniel Weld from the University of Washington describe the KYLIN system, which automatically extracts semantic information from Wikipedia, with two main goals:
  1. Automatically generating "infoboxes", the concise tabulated summaries of a subject's attributes
  2. Autonomously linking articles to create useful structure between articles
And of course, as I have mentioned in the past, there is Freebase, a structured version of Wikipedia.