Tuesday, July 26

SIGIR 2011 Keynote ChengXiang Zhai: Beyond Search: Statistical topic models for text analysis

ChengXiang Zhai gave the second keynote address at SIGIR 2011, held this week in Beijing.

Here are the notes from my friend and fellow UMass grad student Michael Bendersky (follow him on @bemikelive). Also, be sure to check out his workshop on Query Representation and Understanding.

Be sure to read Michael's notes from Qi Lu's first keynote talk on the Future of the Web & Search.

Beyond Search: Statistical topic models for text analysis
  • Complex Task Completion Flow
    - Multiple Searches → Information Synthesis & Analysis → Task Completion
    - Sometimes the process above is iterative

    Examples of complex tasks
    • What laptop to buy?
    • What’s hot in database research?
    • What do people say in blogs on a certain topic? How does the topic coverage change over time?
    • What do people like/dislike about “Da Vinci Code”?

  • Can we model complex tasks in a general way?
  • Can we solve them in a unified framework?
  • How do we bring users into the loop?

  • Proposed solution – Statistical Topic Models
    - Generative model
    - Captures language model shifts based on topics
    - Language model serves as a convenient topic representation
    - Every document has a lot of contextual data (metadata)
    o Author
    o Communities
    o Location
    o Author’s occupation
    o User labels
  • Any combination of contextual data can induce a partition over the documents
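
To make the partition idea concrete, here is a minimal sketch (not from the talk). It groups a toy collection by any combination of context variables and estimates a maximum-likelihood unigram language model for one group. The documents, the field names (`author`, `location`, `text`) and the helpers `partition` and `unigram_lm` are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy documents; the field names are illustrative only.
docs = [
    {"author": "alice", "location": "Beijing", "text": "topic models for text"},
    {"author": "bob", "location": "Beijing", "text": "search engines rank text"},
    {"author": "alice", "location": "Amherst", "text": "topic models for search"},
]

def partition(docs, keys):
    """Group documents by the values of the chosen context variables."""
    groups = defaultdict(list)
    for doc in docs:
        groups[tuple(doc[k] for k in keys)].append(doc)
    return dict(groups)

def unigram_lm(group):
    """Maximum-likelihood unigram language model over a group of documents."""
    counts = Counter(w for doc in group for w in doc["text"].split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# One context variable, or any combination of them, induces a partition.
by_location = partition(docs, ["location"])
by_author_location = partition(docs, ["author", "location"])

# Each cell of the partition gets its own language model.
lm_beijing = unigram_lm(by_location[("Beijing",)])
```

Comparing the language models of different cells is one simple way to see how word usage shifts with context.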

  • We should make topics depend on context variables
    o Text is generated from a contextualized PLSA model
    o Fitting such a model enables a wide range of analysis tasks on a document
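
For readers who have not seen PLSA before, here is a rough sketch of fitting the plain (non-contextualized) model with EM in pure Python. This is not the contextualized model from the talk: as I understand it, that variant additionally conditions the topic distributions on context variables, which is omitted here. The function name `plsa`, its parameters and the toy documents are assumptions for illustration.

```python
import random

def plsa(docs, num_topics, iterations=50, seed=0):
    """Fit plain PLSA with EM. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [{w: d.count(w) for w in set(d)} for d in docs]

    def normalize(weights):
        total = sum(weights)
        return [x / total for x in weights]

    # Random initialization of p(topic | doc) and p(word | topic).
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in docs]
    p_w_z = []
    for _ in range(num_topics):
        probs = normalize([rng.random() + 0.1 for _ in vocab])
        p_w_z.append(dict(zip(vocab, probs)))

    for _ in range(iterations):
        new_w_z = [{w: 0.0 for w in vocab} for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in docs]
        for d, doc_counts in enumerate(counts):
            for w, c in doc_counts.items():
                # E-step: posterior over topics for this (doc, word) pair.
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(num_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(num_topics):
                    new_w_z[z][w] += c * post[z]
                    new_z_d[d][z] += c * post[z]
        # M-step: renormalize expected counts into probabilities.
        p_w_z = []
        for z in range(num_topics):
            total = sum(new_w_z[z].values())
            p_w_z.append({w: v / total for w, v in new_w_z[z].items()})
        p_z_d = [normalize(row) for row in new_z_d]
    return p_z_d, p_w_z

# Hypothetical toy corpus.
toy_docs = [
    "laptop battery screen laptop".split(),
    "battery screen price laptop".split(),
    "database query index database".split(),
    "query index transaction database".split(),
]
doc_topics, topic_words = plsa(toy_docs, num_topics=2)
```

Each entry of `topic_words` is a word distribution, which is the "language model as topic representation" idea from the notes above. Each row of `doc_topics` is the topic mixture of one document.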

  • Applications of contextual topic models
    o Social Network Analysis can help derive more coherent topic models
    o Opinion mining – integration of expert reviews and personal opinions
    • Take into account the well-formed and faceted design of expert reviews to impose context on personal opinions, which come from a variety of unstructured sources (blogs, micro-blogs, review sites, comments)
    • Derive integrated expert/personal opinions on different aspects
    • Infer aspect ratings and weights

  • Using topic models to go from search engine to analysis engine
    o Tasks
    • What is a task?
    • How is a task different from an information need/intent?
    • How do we help users to express tasks?
    o What does ranking mean in an analysis engine?
    o How to evaluate the output of the analysis engine?
    o Operators to allow analysis of search results
    -- Select, Split, Intersection/Union, Interpret, Rank, Compare
    • Operators can be combined, similar to SQL/InQuery languages
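
The notes do not define the operators, so the following is only a guess at how a few of them might compose, written as plain Python functions over a list of documents. The semantics of `select`, `split`, `union`, `intersection`, `rank` and `compare` below are my own assumptions, the toy collection is made up, and Interpret is left out because the notes give no hint of what it computes.

```python
from collections import Counter

# Hypothetical toy collection; field names are illustrative only.
docs = [
    {"id": 1, "year": 2010, "text": "topic models for opinion mining"},
    {"id": 2, "year": 2011, "text": "search engines and ranking"},
    {"id": 3, "year": 2011, "text": "topic models for search results"},
]

def select(docs, predicate):
    """Keep the documents that satisfy a predicate."""
    return [d for d in docs if predicate(d)]

def split(docs, key):
    """Partition documents into groups by a key function."""
    groups = {}
    for d in docs:
        groups.setdefault(key(d), []).append(d)
    return groups

def union(a, b):
    """Documents in either set, without duplicates (by id)."""
    seen = {d["id"] for d in a}
    return a + [d for d in b if d["id"] not in seen]

def intersection(a, b):
    """Documents present in both sets (by id)."""
    ids = {d["id"] for d in b}
    return [d for d in a if d["id"] in ids]

def rank(docs, score):
    """Order documents by a scoring function, highest first."""
    return sorted(docs, key=score, reverse=True)

def compare(a, b):
    """Word counts in `a` that exceed the counts in `b`."""
    ca = Counter(w for d in a for w in d["text"].split())
    cb = Counter(w for d in b for w in d["text"].split())
    return ca - cb

# Operators compose like clauses in a query language.
topical = select(docs, lambda d: "topic" in d["text"].split())
by_year = split(docs, lambda d: d["year"])
recent_topical = intersection(topical, by_year[2011])
ranked = rank(union(topical, by_year[2011]), lambda d: d["year"])
distinctive = compare(by_year[2011], by_year[2010])
```

Because every operator takes and returns document sets (or groups of them), the output of one can feed the next, which is the property that makes the SQL/InQuery analogy work.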