Beyond Search: Statistical topic models for text analysis
Complex Task Completion Flow
- Multiple Searches → Information Synthesis & Analysis → Task Completion
- Sometimes the process above is iterative
Examples of complex tasks
• What laptop to buy?
• What’s hot in database research?
• What do people say in blogs on a certain topic? How does the topic coverage change over time?
• What do people like/dislike about “Da Vinci Code”?
Can we model complex tasks in a general way?
Can we solve them in a unified framework?
How do we bring users into the loop?
Proposed solution – Statistical Topic Models
- Generative model
- Captures language model shifts based on topics
- Language model serves as a convenient topic representation
- Every document has a lot of contextual data (metadata)
  o Author
  o Communities
  o Location
  o Author’s occupation
  o User labels
Any combination of contextual data can induce a partition over the documents
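As a small illustration of how context induces a partition, the sketch below groups documents by whichever metadata fields are chosen. The document layout (`id`, `meta`) and the function name `partition_by_context` are hypothetical and not from the talk.

```python
from collections import defaultdict

def partition_by_context(docs, keys):
    """Group document ids by the values of the chosen metadata keys."""
    groups = defaultdict(list)
    for doc in docs:
        # The tuple of context values identifies one cell of the partition.
        cell = tuple(doc["meta"].get(k) for k in keys)
        groups[cell].append(doc["id"])
    return dict(groups)

# Hypothetical documents with author and location metadata.
docs = [
    {"id": 1, "meta": {"author": "ann", "location": "Beijing"}},
    {"id": 2, "meta": {"author": "bob", "location": "Beijing"}},
    {"id": 3, "meta": {"author": "ann", "location": "Urbana"}},
]
by_location = partition_by_context(docs, ["location"])
by_author_location = partition_by_context(docs, ["author", "location"])
```

Choosing more keys yields a finer partition: here location alone gives two cells, while author plus location gives three.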
We should make topics depend on context variables
o Text is generated from a contextualized PLSA (probabilistic latent semantic analysis) model
o Fitting such a model enables a wide range of analysis tasks on a document
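To make the idea of context-dependent topics concrete, here is a minimal sketch of a heavily simplified variant: a PLSA-style mixture in which the topic coverage is tied to a single context label per document, fitted with EM. This is an assumption made for illustration and is not the full contextualized model from the talk; the function name `fit_context_plsa` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def normalize(weights):
    """Scale a dict of positive weights so that its values sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def fit_context_plsa(docs, n_topics, n_iter=50, seed=0):
    """docs: list of (context, tokens). Returns (coverage, topics), where
    coverage[c][z] approximates p(z | c) and topics[z][w] approximates p(w | z)."""
    rng = random.Random(seed)
    vocab = sorted({w for _, toks in docs for w in toks})
    contexts = sorted({c for c, _ in docs})
    # Word counts pooled over all documents that share a context label.
    counts = {c: defaultdict(int) for c in contexts}
    for c, toks in docs:
        for w in toks:
            counts[c][w] += 1
    # Random positive initialization of both parameter sets.
    topics = [normalize({w: rng.random() + 0.1 for w in vocab})
              for _ in range(n_topics)]
    coverage = {c: normalize({z: rng.random() + 0.1 for z in range(n_topics)})
                for c in contexts}
    for _ in range(n_iter):
        topic_acc = [defaultdict(float) for _ in range(n_topics)]
        cov_acc = {c: defaultdict(float) for c in contexts}
        for c in contexts:
            for w, n in counts[c].items():
                # E-step: posterior over topics for word w in context c.
                joint = [coverage[c][z] * topics[z][w] for z in range(n_topics)]
                norm = sum(joint)
                for z in range(n_topics):
                    resp = n * joint[z] / norm
                    topic_acc[z][w] += resp
                    cov_acc[c][z] += resp
        # M-step: re-estimate parameters; tiny smoothing keeps them positive.
        topics = [normalize({w: topic_acc[z][w] + 1e-12 for w in vocab})
                  for z in range(n_topics)]
        coverage = {c: normalize({z: cov_acc[c][z] + 1e-12
                                  for z in range(n_topics)})
                    for c in contexts}
    return coverage, topics

# Hypothetical toy corpus: the context is the source type of each document.
docs = [
    ("expert", "battery life screen battery".split()),
    ("blog", "love screen hate price".split()),
    ("blog", "price price battery".split()),
]
coverage, topics = fit_context_plsa(docs, n_topics=2)
```

Comparing `coverage["expert"]` with `coverage["blog"]` is the kind of analysis such a fit enables: it shows how topic emphasis differs between contexts. A fuller model would also let the topic word distributions themselves vary with context.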
Applications of contextual topic models
o Social network analysis can help derive more coherent topic models
o Opinion mining – integration of expert reviews and personal opinions
  • Use the well-formed, faceted structure of expert reviews to impose context on personal opinions, which come from a variety of unstructured sources (blogs, micro-blogs, review sites, comments)
  • Derive integrated expert/personal opinions on different aspects
  • Infer aspect ratings and weights
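The notes do not say how aspect ratings and weights relate to a review's overall rating. One plausible reading, assumed here purely for illustration, is that the overall rating is a weighted average of per-aspect ratings; inference would then run this relationship in reverse, starting from observed overall ratings and text. The function name and the numbers below are hypothetical.

```python
def overall_rating(aspect_ratings, aspect_weights):
    """Weighted average of aspect ratings (assumed relationship, not from the talk)."""
    total_weight = sum(aspect_weights.values())
    return sum(aspect_ratings[a] * w / total_weight
               for a, w in aspect_weights.items())

# Hypothetical reviewer who cares three times more about battery than screen.
ratings = {"battery": 4.0, "screen": 2.0}
weights = {"battery": 3.0, "screen": 1.0}
overall = overall_rating(ratings, weights)  # (4*3 + 2*1) / 4 = 3.5
```

Under this assumption, two reviewers who give the same aspect ratings can still arrive at different overall ratings, which is why the weights are worth inferring separately.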
Using topic models to go from search engine to analysis engine
o Tasks
  • What is a task?
  • How is a task different from an information need/intent?
  • How do we help users express tasks?
o What does ranking mean in an analysis engine?
o How do we evaluate the output of the analysis engine?
o Operators to allow analysis of search results: Select, Split, Intersection/Union, Interpret, Rank, Compare
  • Operators can be combined, similar to SQL/InQuery languages
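The composability of such operators can be sketched as plain functions over lists of results. The semantics below are assumptions, since the notes only name the operators; Interpret and Compare are left out because their behavior is not described. The result fields (`id`, `source`, `year`, `score`) are hypothetical.

```python
from collections import defaultdict

def select(results, predicate):
    """Keep the results that satisfy the predicate."""
    return [r for r in results if predicate(r)]

def split(results, key):
    """Partition results into groups keyed by key(result)."""
    groups = defaultdict(list)
    for r in results:
        groups[key(r)].append(r)
    return dict(groups)

def intersection(a, b):
    """Results of a whose id also appears in b, in a's order."""
    ids = {r["id"] for r in b}
    return [r for r in a if r["id"] in ids]

def union(a, b):
    """Results of a followed by results of b, without duplicate ids."""
    seen, out = set(), []
    for r in a + b:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def rank(results, score):
    """Order results by descending score."""
    return sorted(results, key=score, reverse=True)

# Hypothetical search results.
results = [
    {"id": 1, "source": "blog", "year": 2010, "score": 0.4},
    {"id": 2, "source": "review", "year": 2011, "score": 0.9},
    {"id": 3, "source": "blog", "year": 2011, "score": 0.7},
]
recent = select(results, lambda r: r["year"] == 2011)
blogs = select(results, lambda r: r["source"] == "blog")
recent_blogs = intersection(recent, blogs)
top = rank(union(recent, blogs), lambda r: r["score"])
by_source = split(results, lambda r: r["source"])
```

Because every operator takes and returns result collections, they nest freely, much as SQL subqueries do.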
Agenda
- Perspective of the web/IT industry
- Future of search
- Role of IR
- Challenges
- Opportunity
The heritage: web of documents
The future:
- Social web: Facebook profiles, like buttons
- Geospatial web: mobile devices
- Temporal web: collection of information over time, real-time microblogging
- Application web: the fundamental design of the browser doesn’t support new application models
IT industry of the future
- Devices + cloud services
- Shifting the capture of user intent from the rigid keyboard/mouse/keyword combination to more natural modalities
  • Natural language understanding
  • Voice recognition
    - On mobile devices
    - In living-room products
  • Body gestures
    - Microsoft Kinect
  • Image/audio/video capture
Vision of the future of search
o Empower people with knowledge
o Re-organize the web for search to unlock the full potential of the web
  • Better discovery
  • More informed decisions
  • Easier task completion
Role of IR
o Understanding user intent
o Modeling the web of the world
  • People/places/things
  • Relations
o Task completion & decision making
o Incentive engineering to get people to do more things on the web
Challenges
o Measurement, evaluation & self-correction
  • Some things are inherently hard to evaluate: objectiveness, design, opinions
  • Search results have a profound influence on the way people perceive the world
  • It is important that they have no inherent bias or skew
o Lack of
  • Tools & understanding in existing disciplines
  • Training & development of cross-disciplinary talent
o Barriers for academic research
  • Access to data
  • Computing infrastructure
  • Funding
    - Not just based on company agenda
    - Funding projects based on pure creativity
Opportunities
• Key breakthroughs in the areas of
  - Serendipitous discovery (e.g. Hunch.com)
  - Information theory for the age of the web and social networks
  - Science of big data
• Broadening collaborations
  - Research
  - Development (API/tools)
  - Investment (Training & Development)
• Vibrant community
Follow #sigir2011 for more news, although given the censorship in China, the results are very sparse.