Wednesday, October 27

CIKM 2010 Jamie Callan Keynote: Search Engine Support for Software Applications

I am not at CIKM, but Michael Bendersky sent me his notes from Jamie Callan's keynote address. Gene also posted his writeup on the FXPal Blog.

Jamie Callan: Search Engine Support for Software Applications

  • Motivation: SE (search engine) as a "language DB"
    • Computer Assisted Language Learning
    • Q&A
    • Read-the-Web

  • IR typically assumes a "user" is a person

  • Software applications are a new, challenging class of SE users

  • Applications, as "users," currently expect very little from a SE
    • E.g., SE's are mostly used for keyword search

  • The recall-precision tradeoff discourages SE's from using a highly structured query language (like Indri's)
    • BOW query - high recall/low precision
    • Structured query - low recall/high precision

  • Motivation II: using rich language/information resources
    • Wordnet, Freebase, Dbpedia, ...
    • SE's are not very good at using them

  • Structured queries and documents are well-studied IR topics, but
    • Do we really understand them?
    • Maybe the basic structures, but not the more advanced ones

  • Document = structured object
    • Metadata:
    • Fielded text: title, chapters, sections, references
    • Relations to other documents

  • Example application: REAP Project: Computer Assisted Language Learning
    • Find interesting documents/passages for students based on their language level
    • Use the structured Indri query language to find relevant documents or document parts

  • A typical approach to fields
    • Exact Boolean match on the attributes
    • Can be brittle.

  • Another type of document structure
    • Text annotations in documents (POS, semantic labeling, co-referencing)
    • Annotations can be considered to be "small fields"

  • Problems with retrieval with text annotations
    • Annotations are not always accurate and can be ambiguous
      • Missing annotations
      • Wrong annotation boundaries
      • Conflated annotations: white/JJ house/NN should be white/NP house/NP

    • Term weighting in short fields is hard; field length normalization has to be taken into account.

    • Problem of multiple matches: combining evidence from different fields of the same type is not a solved problem.

  • Relations among documents/entities
    • Hyperlinks & RDF
    • XML

  • Relational Retrieval (Lao & Cohen 2010)
    • Example uses: journal recommendations, expert finding
    • Some parts of metadata are "domain knowledge" --- they really reside outside the documents.

    • How to model domain knowledge as an integral part of the documents
      • Have different types of documents: paper, journal, authors...
      • Have typed relations between the documents: transcribes, appears in, ...
      • Have an Indri-like query language to match documents and relations

  • Inferred knowledge: Read-the-Web project
    • How to integrate the accumulated knowledge into SE's
    • Entity search is one example
    • General purpose solutions are still in progress.
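
To make the recall/precision point in the notes a bit more concrete, here is a minimal sketch of my own (it is not from the talk and it is not Indri). It compares a bag-of-words match against an exact phrase match restricted to one field. The three documents, the field names, and the relevance judgments are all made up for illustration.

```python
# Toy corpus; documents and field names are hypothetical.
DOCS = {
    "d1": {"title": "white house press briefing", "body": "the president spoke today"},
    "d2": {"title": "painting tips", "body": "we painted the house white last summer"},
    "d3": {"title": "tour guide", "body": "visiting the white house in washington"},
}
# Assumed relevance judgments for the query "white house".
RELEVANT = {"d1", "d3"}


def bow_match(doc, terms):
    """Bag-of-words: match if any query term occurs in any field."""
    words = set(" ".join(doc.values()).split())
    return any(t in words for t in terms)


def phrase_in_field(doc, terms, field):
    """Structured: match only if the terms form an exact phrase in one field."""
    words = doc[field].split()
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


query = ["white", "house"]
bow = {d for d, doc in DOCS.items() if bow_match(doc, query)}
structured = {d for d, doc in DOCS.items() if phrase_in_field(doc, query, "title")}

print("BOW       ", sorted(bow), precision_recall(bow, RELEVANT))
print("structured", sorted(structured), precision_recall(structured, RELEVANT))
```

On this toy data the bag-of-words query retrieves all three documents (full recall, but d2 is a false hit), while the title-phrase query retrieves only d1 (perfect precision, but it misses d3). That miss is also the brittleness of exact field matching in miniature: the right phrase sits in the body, not in the field the query insisted on.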

More CIKM coverage soon.
