I watched the livestream of the closing session at the IRF Symposium on patent retrieval. The videos should be available next weekend. Here are my notes from the session, which included both Mark Sanderson and Steve Adams.
Mark Sanderson (Academic retrieval perspective)
- IRF Symposium 2007 was an introduction of IR people to IP people. What was striking then was an example of a deliberate misspelling in a document, made by someone trying to ensure their patent would not be found. This exposed the adversarial nature of some aspects of IP retrieval, which has parallels in the web retrieval community in the opposite direction.
- In 2008 academics drew on that experience, but much of the work has been based on newspaper and web test collections. There is still a disconnect between the problems academics have solved and what is relevant to the IP community. For example, academic groups are still evaluating with Precision@5 and MAP, which emphasize precision, rather than recall, which matters more for IP. We need to look at new ways of assessing results.
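To make the precision versus recall point concrete, here is a minimal sketch of the three measures on a made-up result list. The document IDs, the relevance set and the function names are all hypothetical, chosen only for illustration. MAP is the mean of average precision over a set of queries; only one query is shown here.

```python
from typing import List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def recall(ranking: List[str], relevant: Set[str]) -> float:
    """Fraction of all relevant documents that were retrieved at any rank."""
    if not relevant:
        return 0.0
    return len(set(ranking) & relevant) / len(relevant)


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Precision at each relevant hit, summed and divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical prior-art search: ten relevant patents exist (p1..p10),
# but the system returns only five of them, all at the top of the list.
relevant = {f"p{i}" for i in range(1, 11)}
ranking = ["p1", "p2", "p3", "p4", "p5", "x1", "x2", "x3"]

p_at_5 = precision_at_k(ranking, relevant, 5)
rec = recall(ranking, relevant)
ap = average_precision(ranking, relevant)
print(p_at_5, rec, ap)
```

In this toy case Precision@5 is perfect while half of the relevant patents are never retrieved, which is the kind of gap a precision-oriented evaluation can hide from a patent searcher.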
Projects - Matrixware contributions
Alexandria System - a large-scale global archive of IP data
Leonardo System - an application development platform to access the data repositories. There is potential here for information studies specialists to study how IP searchers work and analyze their interactions.
He encouraged academics to take part in the CLEF IP and TREC chemical 2009 tracks this coming year. He drew parallels to the TREC legal track and the new and interesting understanding that developed from that relationship. For example, legal track people were wedded to Boolean retrieval, and it was a big shock when ranked retrieval systems found documents that Boolean search missed.
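As a rough illustration of how a ranked system can surface a document that a strict Boolean query misses, here is a toy sketch. The three documents, the query and the simple term-overlap score are assumptions made up for this example; they are not the systems or data used in the TREC legal track.

```python
from typing import Dict, List, Set

# Hypothetical mini-collection; d2 is on topic but uses different vocabulary.
docs: Dict[str, str] = {
    "d1": "solar cell with silicon substrate",
    "d2": "photovoltaic device using a silicon wafer",
    "d3": "method for brewing coffee",
}


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization, returned as a set of terms."""
    return set(text.lower().split())


def boolean_and(query_terms: List[str], collection: Dict[str, str]) -> List[str]:
    """Return documents containing every query term (strict AND)."""
    wanted = set(query_terms)
    return [doc_id for doc_id, text in collection.items() if wanted <= tokenize(text)]


def ranked(query_terms: List[str], collection: Dict[str, str]) -> List[str]:
    """Rank documents by number of matching query terms, dropping zero matches."""
    wanted = set(query_terms)
    scored = []
    for doc_id, text in collection.items():
        overlap = len(wanted & tokenize(text))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first, ties broken by document ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


query = ["solar", "cell", "silicon"]
boolean_hits = boolean_and(query, docs)
ranked_hits = ranked(query, docs)
print(boolean_hits, ranked_hits)
```

The Boolean query returns only d1, while the ranked list also includes d2 lower down because it shares one term with the query. A real ranked system would use a proper weighting scheme rather than raw overlap, but the partial-match behaviour is the same idea.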
Steve Adams (IP Industry)
- He characterized the theme of this year as "hybrid".
A patent document embodies a fundamental tension. In the end the patent office delivers a document that has to serve both the legal community and the technical community, and those two functions often pull against each other. Writing a single document that performs both functions takes a lot of practice. Patents are also hybrid because they contain both text and non-text data. We need retrieval systems that can pull out the non-text parts of the documents.
Hybrid approaches to IR
No single system or paradigm is going to deliver all the results on every occasion for every search.
Multi-linguality - we were reminded there are multiple methods to retrieve documents: query translation and document translation and both are useful.
Annotation (Eric) - the basic question is: do we get good retrieval based solely on the original document, or do we need some form of enriched documents to give better retrieval? As we face ever-expanding corpora, the question is whether we can continue enriching the documents automatically or semi-automatically, which would be very helpful. Semantic annotation currently requires a stable ontology, but we have a very dynamic vocabulary that develops over time.
Boolean vs ranked - Leif's findability index was very interesting. It could be the beginning of evaluation tools. Both boolean and best match ranking have their place.
Pierre pointed out that getting to the bottom of each player's role is an important preliminary step: who does what? IP: Mark referred to ‘dirty data’; we need to improve our data at the early stage of document production, not after it has been published.
Monika’s paper, Multimedia challenge - the patent application of 20 years in the future. It may not be text at all. Send us the CAD/CAM files, send us the 3D crystallographic model, send us the chip mask. We are light years away from being able to search these types of documents.
Some of the highlights seem to have been:
Mapping how easily documents can be found - by Leif Azzopardi
Annotations and Ontologies in the Context of Patent Retrieval - Eric Gaussier
Also the Alexandria and Leonardo systems from Matrixware.