Friday, November 7

IRF Symposium on patent search wrap-up session notes

I watched the livestream of the closing session at the IRF Symposium on patent retrieval. The videos should be available next weekend. Here are my notes from the session, which included both Mark Sanderson and Steve Adams.

Mark Sanderson (Academic retrieval perspective)

- IRF Symposium 2007 introduced IR people to IP people. What was striking then was an example of a deliberate misspelling in a document, made by someone trying to make sure their patent would not be found. This exposed the adversarial nature of some aspects of IP retrieval, which has parallels in the web retrieval community, though in the opposite direction.

- In 2008 academics drew on that experience, but much of the work has been based on newspaper and web test collections. There is still a disconnect between what academics have solved and what is relevant to the IP community. For example, academic groups are still evaluating with Precision@5 and MAP, which focus on precision, rather than recall, which matters more for IP. We need to look at new ways of assessing results.
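To make the precision versus recall point concrete, here is a minimal sketch of the three measures on a made-up query. The document IDs and relevance judgments are invented for illustration and do not come from the talk. MAP is simply the mean of average precision over a set of queries.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant docs."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def recall(ranking, relevant):
    """Fraction of all relevant documents that appear anywhere in the ranking."""
    return len(set(ranking) & relevant) / len(relevant) if relevant else 0.0


# Hypothetical query: the system returns five documents, all relevant,
# but ten relevant documents exist in the collection.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d%d" % i for i in range(1, 11)}

p5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
rec = recall(ranking, relevant)
print(p5, ap, rec)
```

In this toy case Precision@5 is perfect while half of the relevant documents are never retrieved. A result list like that might be fine for a web search, but for a patent searcher a single missed document can be the one that matters.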

Projects - Matrixware contributions
Alexandria System - a large-scale global archive of IP data
Leonardo System - an application development platform to access the data repositories. There is potential here for information studies specialists to study how IP searchers work and analyze their interactions.

He encouraged academics to take part in the CLEF IP and TREC chemical tracks in 2009. He drew parallels to the TREC legal track and the new and interesting understanding that developed from that relationship. For example, legal track people are wedded to Boolean retrieval, and it was a big shock when ranked retrieval systems found documents that Boolean search missed.
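A toy sketch of how that can happen, assuming a strict AND query and a very crude ranking function that just counts matching query terms. The three "patents" and the query are invented for illustration, and real ranked systems use far better scoring than this.

```python
def boolean_and(docs, terms):
    """Return IDs of documents containing every query term."""
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.split() for term in terms)]


def ranked(docs, terms):
    """Rank documents by number of matching query terms, dropping zero scores."""
    scores = {doc_id: sum(1 for term in terms if term in text.split())
              for doc_id, text in docs.items()}
    hits = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(hits, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical mini-collection.
docs = {
    "p1": "wireless charging coil for mobile phone",
    "p2": "inductive energy apparatus for handheld device",
    "p3": "wireless power transfer coil",
}
query = ["wireless", "charging", "coil"]

boolean_hits = boolean_and(docs, query)
ranked_hits = ranked(docs, query)
print(boolean_hits)
print(ranked_hits)
```

The AND query returns only p1, while the ranked list also surfaces p3, which matches two of the three terms. Neither approach finds p2, which describes a similar idea in entirely different vocabulary.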

Steve Adams (IP Industry)

- He characterized the theme of this year as "hybrid".

Hybrid Documents
A patent document embodies a fundamental tension. In the end the patent office delivers a document that serves both the legal community and the technical community, and those two functions are often at odds. Writing a single document that performs both functions takes a lot of practice. Patents are also hybrid because they contain both text and non-text data. We need retrieval systems that can pull out the non-text parts of the documents.

Hybrid approaches to IR
No single system or paradigm is going to deliver all the results on every occasion for every search.

Multi-linguality - we were reminded there are multiple methods to retrieve documents: query translation and document translation and both are useful.

Annotation (Eric) – the basic question is: Do we get good retrieval based solely on the original document, or do we need some form of enriched documents to give better retrieval? As we face ever-expanding corpora, is it possible to continue enriching the documents automatically or semi-automatically? If so, this will be very helpful. Semantic annotation currently requires a stable ontology, but we have a very dynamic vocabulary that develops over time.

Boolean vs ranked - Leif's findability index was very interesting. It could be the beginning of evaluation tools. Both Boolean and best-match ranking have their place.

Hybrid Responsibilities
Pierre identified the fact that getting to the bottom of each player's role is an important preliminary step: who does what? On the IP side, Mark referred to ‘dirty data’; we need to improve our data at the early stage of document production, not after it has been published.

Monika’s paper, Multimedia challenge – the patent application of 20 years in the future may not be text at all. Send us the CAD/CAM files, send us the 3D crystallographic model, send us the chip mask. We are light years away from being able to search these types of documents.

Some of the highlights seem to have been:
Mapping how easily Documents can be found - by Leif Azzopardi
Annotations and Ontologies in the Context of Patent Retrieval - Eric Gaussier
Also the Alexandria and Leonardo systems from Matrixware.

Thursday, November 6

Paul Ogilvie is blogging again

Paul Ogilvie from CMU has a blog, Information Retrieval on the Live Web. He recently started blogging again after a period of absence. He also posts on the official blog of his company, mSpoke. His post there examines which features of blog posts correlate with their popularity.

Welcome back Paul! I look forward to more interesting posts ;-).

I meant to write about this sooner, but Jon and Daniel beat me to it.

Paul's blog is a nice addition to my blogroll.

CIKM 2008 coverage and best interdisciplinary paper award

Greg has some of the best reporting including:
Matthew Hurst has rough notes from Andrew Tomkins' keynote.

Xing Yi returned and told us that the best interdisciplinary award went to:
Structural Relevance: A Common Basis for the Evaluation of Structured Document Retrieval by Sadek Ali, Mariano Consens, Gabriella Kazai, Mounia Lalmas.

Other highlights from Xing include:
Both of these look really interesting to me. I'll try to write something up on them this weekend.

Does anyone know who won the best poster award? The website hasn't been updated and we have a few people at the lab who would be interested.

I am also looking forward to the video lectures being available online.

Information Retrieval Facility Symposium 2008

The IR Facility works on patent search, bringing together IR researchers and professional patent examiners. Its annual symposium, IR Facility Symposium 2008, is underway in Vienna. They are live streaming the event if you want to watch the presentations. Unfortunately, Thursday is over, but you can still catch tomorrow's presentations.

Here is a description from the programme:
The main themes of this year’s speeches are multilingual retrieval, annotation and ontology, retrieval in non-textual documents and the improvement of user interfaces. The latest scientific projects from the fields of semantic and linguistic retrieval, text mining, automated quality control and machine translation will be presented for the first time.
The CIIR here collaborates with the IRF. We have researchers there presenting work on using retrieval methods to detect errors in OCRed patent documents. I hope to have more details to follow.

The IRF also recently hosted the Patent Information Retrieval workshop at CIKM '08. The papers should be available through the ACM.