Thursday, December 18
He highlights skills such as: time management, writing/speaking, leadership, and entrepreneurship. It's a good list. However, I would be interested to know what skills grad students should be developing that differ from what people should be developing in the 'real world'. In fact, I think industry can teach you these skills faster in many cases because the environment can be more demanding with near-term revenue and jobs on the line.
Personally, I think grad students should learn about funding and the grant process. This is similar to the reason that programmers in industry should know the basics of business and managerial accounting. The bottom line is that techies need to be able to live and communicate effectively with a non-technical audience, some of whom may be their bosses.
Tuesday, December 16
You can search the 1.5 million BioMed articles and add them to your 'basket'. The system then recommends new articles you should read.
Now that the semester is over, it's good to be able to focus on research again. After all, the SIGIR 2009 deadline is only about a month away!
Friday, December 12
Monday, December 8
For the past 157 years (that's how old the newspaper is) we've essentially delivered 'dumb content' to people's doorsteps. You and I, irrespective of interests, location etc. have received the same newspaper on our doorsteps every morning. We're beginning to explore ways to make content smarter, to understand what you've read, which device you've read it on and your micro level interests—making the most important news find you, instead of you having to find it.
However, it may be too little, too late. Newspapers could have adapted to the Internet and the realities of the web; instead, they chose to fight tooth and nail trying to forestall the inevitable change in business models. For example, they could have developed really good online classified services, but they didn't, and Craigslist is eating them alive. Despite owning some of the best content and writers, they can't seem to figure out what to do with it all. Many of them have largely ignored the reality of content navigation via search engines and SEO. In short, they don't get the web. There are a few individuals trying to change this, for example Marshall Simmonds.
I'm even less sympathetic to the big three automakers whose short-sightedness building gas-guzzling trucks and SUVs has finally caught up with them.
If these companies can't adapt to the reality of the market, then they should fail. There should be no government bailouts. I don't want to loan them my money when they ruined their companies and there do not appear to be concrete, viable turnaround plans.
Standing by and letting car companies and newspapers fail will certainly hurt in the short run, but ultimately it will make our economy stronger and more competitive by getting rid of the weak and corrupt. Both newspapers and car companies will be replaced by new breeds of more nimble businesses that figure out ways to adapt and thrive in a new world.
Just a last note: this does not mean that I am unsympathetic to the hard-working blue-collar workers at these companies. I have friends and relatives involved in these businesses. I think that if anything is done, the government should extend education money, health care, and other benefits to help these workers make ends meet and re-train. We should all chip in and help our neighbors get by in tough times; I'd want others to do the same for me.
Saturday, December 6
I appreciate Erik's in-depth responses to the questions. In the process he shares some wisdom from his grad student days at UW, in particular how the research focus of MetaCrawler evolved. For example, the mechanics of distributed querying wasn't an interesting research problem in itself:
However, that tool could be used to collect a large number of web pages about a topic from “knowledgeable sources” and thus we could do something to analyze semantic structure. However, this wasn’t terribly well defined, and by the time we had MetaCrawler, we still weren’t sure what structure we’d want to investigate and even what kinds of semantics we were interested in. So, that part of the project was dropped, and we focused more on the research of MetaCrawler itself.
Things don't work out as planned, but good researchers adapt and shift focus. One last nugget of wisdom for researchers from the interview:
Oren’s advice on the matter was to always investigate surprises with great vigor. Predictable things are, well, predictable, and the research that comes from steady improvement, while beneficial, tends to be rather boring. However, when you discover something that was unexpected, the results and explanations are almost always exciting and fascinating.
I can't help but notice two connections between meta-search and current search engines.
- The decision to perform 'deep web surfacing' rather than federating results from third-party data sources. For example, Google has started crawling the data behind forms. See the recent paper, Google's Deep-Web Crawl.
- The rise of "Universal Search", the process of blending results from multiple vertical search indices, is an interesting application of meta-search. Has there been research focused on the unique challenges of this use case? Considering its importance to industry, it's surprising to see the dearth of recent work in this area.
Friday, December 5
Should IR groups be using it or a similar model to distribute and perform processing of test collections?
For example, there will likely be a billion-document web corpus for TREC 2009. However, there's concern over the number of groups that have the resources to handle a collection that large.
Thursday, December 4
The interview reminds me of the article I wrote back in 2006 on the beginning of metasearch featuring MetaCrawler. Maybe sometime I'll get around to part II.
One quote from the interview struck me because it deals with the problem of extracting interesting research questions from engineering tasks. Erik writes,
Fundamentally, a Web service that simply sends a query to a number of search engines and brings back results isn’t all that interesting for a researcher. That’s an engineering problem, and not a difficult one. But there are a number of questions that ARE interesting — such as how do you optimally collate results? How do you scale the service?... Oren pushed me to answer those questions.
The ability to abstract the interesting problems in a system and focus on those is a skill I'm still in the process of acquiring.
Erik solved the problem of combining a bunch of unreliable search engines to create one that was very useful; in the process, he pioneered early research on meta-search. It's amazing how far web search engines have come: from unreliable early prototypes developed by grad students into today's multi-billion dollar industry.
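Just for fun, here's a rough sketch of one classic answer to the collation question: reciprocal-rank fusion over several engines' ranked lists. The engine names and documents are invented for illustration; real meta-search engines used more sophisticated schemes.

```python
# A toy sketch of rank fusion: merge ranked lists by summing
# reciprocal-rank scores, so documents that rank highly in several
# engines float to the top of the fused list.

def fuse(rankings, depth=10):
    """Merge ranked lists by summing reciprocal-rank scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking[:depth]):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["d1", "d2", "d3"]
engine_b = ["d2", "d1", "d4"]
engine_c = ["d2", "d5"]

print(fuse([engine_a, engine_b, engine_c]))  # d2 first: 1/2 + 1 + 1 = 2.5
```

The appeal of this family of methods is that it needs only ranks, not the engines' incomparable internal scores.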
I look forward to reading part II.
Wednesday, December 3
CSE 490H: Scalable Systems: Design, Implementation and Use of Large Scale Clusters
The topics covered are MapReduce, MapReduce algorithms, distributed file systems like the Google File System, cluster monitoring, and power and availability issues. The course is taught by Ed Lazowska and Aaron Kimball. The class uses the widely used Hadoop MapReduce framework created by Doug Cutting and Yahoo! to give students hands-on experience.
The four class assignments help students become familiar with real-world tools and tasks:
- Set up and test Apache Hadoop, using it to count words in a corpus and build an inverted index
- Run PageRank on Wikipedia to find the most highly cited articles.
- Assignments 3-4 build a rudimentary version of Google Maps. Assignment 3 creates maps and tiles of the US from geographic survey data.
- Use Amazon S3 storage and an EC2 compute cluster to look up addresses on the maps created in assignment three, connecting it all to a web front-end.
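For a sense of what the first assignment involves, here's a toy single-process sketch of the MapReduce word-count dataflow. This is obviously not Hadoop itself, which runs these phases distributed across a cluster; it just shows the map/reduce contract.

```python
# Word count as a map phase (emit (word, 1) pairs) and a reduce phase
# (sum the counts per word), run locally in one process.
from collections import defaultdict

def map_phase(doc_id, text):
    # doc_id is unused here but mirrors the MapReduce input signature
    for word in text.lower().split():
        yield word, 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

corpus = {1: "to be or not to be"}
pairs = [kv for doc_id, text in corpus.items() for kv in map_phase(doc_id, text)]
print(reduce_phase(pairs))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The inverted-index variant is nearly identical: emit (word, doc_id) pairs in the map phase and collect posting lists in the reduce phase.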
The videos and slides of the lectures are also available to view/download. This is fantastic because the guest speakers look really interesting; Jeff Dean from Google and Werner Vogels from Amazon speak about the tools and their future directions.
The class is a great quick-start on using Hadoop for cluster computation.
On a related note, you may also want to look at the lectures and materials for a mini-course on cluster computing for the Google interns.
Here at UMass we do large-scale indexing using a Map-Reduce like framework called TupleFlow that powers the Galago search engine; both were written by Trevor Strohman (now at Google).
Tuesday, December 2
CS276 (updated for fall 2008) - The Stanford graduate IR course, taught by Christopher Manning and Prabhakar Raghavan. This is the standard IR course. Their new book Introduction to Information Retrieval is quickly becoming one of the standard texts.
CS572: Information Retrieval and Web Search
At Emory taught by Eugene Agichtein.
CS 4300 / INFO 4300 Information Retrieval
At Cornell, taught by William Arms.
CSI550: Information Retrieval
At University of Albany, taught by Prof. Tomek Strzalkowski.
In addition to the aforementioned Stanford IR book, the new IR book from the UMass IR lab, Search Engines: Information Retrieval in Practice by Bruce Croft, Donald Metzler, and Trevor Strohman, seems to be gaining adoption.
See also my previous post on IR courses.
Monday, December 1
Rodrygo from Glasgow has a post covering the blog track workshop, focusing mainly on the discussion around the 2009 track. Notably, the opinion finding and polarity tasks are being discontinued.
It was a consensus among the attendees that opinion retrieval and polarity detection are still open, relevant problems. Yet while a few groups managed to deploy interesting techniques that achieved consistent opinion retrieval performance across several strongly performing baselines in the track this year, polarity detection approaches looked rather naive.
In their place, new tasks for 2009 were discussed.
Faceted distillation task
The goal for this task will be to identify desirable blogs on a topic, for example non-spam blogs that have a recurring interest in a given topic. It's 'faceted' because the topic can specify desired properties of the blog, such as non-spam, satirical, Republican, last month, etc. Personally, I'm encouraged because it's more realistic and goes beyond topicality as the sole criterion for 'relevance'.
Story tracking task
The task will investigate how stories emerge and evolve over time. It may be linked to a parallel news corpus to show connections between blog news and the media. This reminds me of a previous discussion I had about using the blog corpus for topic detection and tracking (TDT) tasks.
There will also be a new blog corpus. See also the previous related discussion on the ICWSM 2009 data challenge for similar tasks on a different blog corpus.
Hopefully, I'll hear more about TREC 2008 as well as the future for TREC 2009 beyond just the blog track. I hear exciting rumors about the possibility of resurrecting the web search track with a new corpus.
Tuesday, November 25
Just as today we recognize keywords, we should recognize tasks and intent. The search engine should 'read' and synthesize the information to solve the intent...
Here are some of the complex tasks that I've performed or worked on in the last few months where the current retrieval technology did not solve my underlying task:
- Find me the 'best' used car within 100 miles of Amherst, MA for less than $10,000.
- Create a gourmet Thanksgiving menu that is gluten-free.
- Plan a romantic weekend away to the Finger Lakes that involves hiking and wine tasting for less than $150 dollars per day.
- Create a homemade Christmas pack of spice mixes and drinks as presents for friends.
One intriguing thread of many of my search tasks is that I would like to poll my friends and family with knowledge of a specific domain (machine learning) or a geographic area (Finger Lakes). I then want to synthesize their opinions weighted by their expertise and preference similarity. What is the best way to leverage their expertise to help me solve my task?
What is the way forward from keyword-centric search to task-centric search? How will search engines help me leverage my social network and the knowledge of others to solve my task?
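To make the idea concrete, here's a naive sketch of the kind of synthesis I have in mind: weight each friend's rating by their expertise and their preference similarity to me. All the names and numbers are invented; the real question is how to estimate those weights in the first place.

```python
# Aggregate polled opinions, weighting each rating by the friend's
# expertise in the domain and their preference similarity to me.

def aggregate(opinions):
    """opinions: list of (rating, expertise, similarity) tuples."""
    total = sum(e * s for _, e, s in opinions)
    return sum(r * e * s for r, e, s in opinions) / total

# (rating out of 5, expertise 0-1, similarity 0-1) -- invented values
polls = [(4.0, 0.9, 0.8), (2.0, 0.3, 0.5), (5.0, 0.7, 0.9)]
print(round(aggregate(polls), 2))  # 4.22: the low-expertise dissenter barely moves it
```

The weighted mean is the crudest possible combiner, but it already shows the shape of the problem: the hard part is learning expertise and similarity, not the arithmetic.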
Friday, November 21
...SearchWiki, a way for you to customize search by re-ranking, deleting, adding, and commenting on search results. With just a single click you can move the results you like to the top or add a new site... The changes you make only affect your own searches. But SearchWiki also is a great way to share your insights with other searchers. You can see how the community has collectively edited the search results by clicking on the "See all notes for this SearchWiki" link.
This has been in testing for a while. See my previous post on Google's Wiki of Search and Eric Schmidt's notes from 2006.
I think getting users more involved is a good idea, especially when you have an audience as big as Google. However, I'm skeptical about the current system's utility. For example, it is disconnected from Google's related products, such as Google Notebook and Google Bookmarks. It doesn't allow me to incorporate my social network. I don't think that most people have a compelling desire to edit or comment on their own search results. A few common queries may get edits, but what about the long tail of search? That said, maybe I'll go change the ranking for a few vanity searches anyway ;-).
SE Land also has informative coverage.
Thursday, November 20
I've found the MIT OpenCourseWare material a godsend. MIT offers a course, Mathematics for Computer Science (2002), with a significant section on probability theory, including the bounding techniques we've been studying. If you want a good crash course in stats, I highly recommend reading the notes on lectures 10-14. The notes are clear and the examples fascinating. I'll share one of my favorites. Professor Chernoff did an investigation of the Mass. lottery, described in the notes for lectures 13-14:
There is a lottery game called Pick 4. In this game, each player picks 4 digits, defining a number in the range 0 to 9999. A winning number is drawn each week. The players who picked the winning number win some cash. A million people play the lottery, so the expected number of winners each week is 100... In this case, a fraction of all money taken in by the lottery was divided up equally among the winners. A bad strategy would be to pick a popular number. Then, even if you pick the winning number, you must share the cash with many other players. A better strategy is to pick a lot of unpopular numbers. You are just as likely to win with an unpopular number, but will not have to share with anyone. Chernoff found that peoples’ picks were so highly correlated that he could actually turn a 7% profit by picking unpopular numbers!
Most state-of-the-art retrieval algorithms are based on statistics and the probability of word occurrences in a document w.r.t. a collection of documents. So, even if you aren't taking a class in algorithms, it's useful background to study for search.
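A quick back-of-the-envelope sketch of why unpopular numbers win (pot size and pick counts invented): every number is equally likely to be drawn, but the pot is split among fewer winners.

```python
# Expected payoff of a Pick 4 number: probability of being drawn times
# your share of the pot if it is.

def expected_payoff(my_number, picks, pot=100000.0, outcomes=10000):
    winners = picks.get(my_number, 0) + 1   # other players on my number, plus me
    return (1.0 / outcomes) * (pot / winners)

picks = {1234: 500, 8317: 1}                # 1234 is "popular", 8317 is not
print(expected_payoff(1234, picks))         # pot split 501 ways
print(expected_payoff(8317, picks))         # pot split only 2 ways
```

Same odds of winning, wildly different expected value — exactly the correlation Chernoff exploited.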
Thank you MIT!
Wednesday, November 19
In the article, he jumps off from the insights of an older pre-web paper on information seeking behavior: The design of browsing and berrypicking techniques for the online search interface by Marcia Bates.
Here is a brief excerpt from the original article:
So throughout the process of information retrieval evaluation under the classic model, the query is treated as a single unitary, one-time conception of the problem. Though this assumption is useful for simplifying IR system research, real-life searches frequently do not work this way... At each stage they are not just modifying the search terms used in order to get a better match for a single query. Rather the query itself (as well as the search terms used) is continually shifting, in part or whole. This type of search is here called an evolving search.
Another reminder that search is an inherently interactive process, and classical models that do not account for this are very limiting. On a related note, see previous coverage of Nick Belkin's ECIR 2008 keynote address (and Daniel's notes).
Tuesday, November 18
Key Terms is derived from a Yahoo! Search capability we refer to internally as "Prisma."... Key Terms is an ordered terminological representation of what a document is about. The ordering of terms is based on each term's frequency and its positional and contextual heuristics...
Each result contains up to 20 terms describing the document.
Add the parameter view=keyterms to the BOSS request to see the new functionality.
I wonder if this is at all related to the Key Term Extraction API that Yahoo! provides.
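Yahoo! hasn't published Prisma's actual heuristics, but here's a toy approximation of frequency-plus-position term ordering to illustrate the flavor: frequent terms score higher, and earlier occurrences count for more.

```python
# Score each term by summing a position-discounted weight over its
# occurrences, then return the top-k terms. A real system would also
# strip stopwords and use contextual features.

def key_terms(text, k=5):
    words = text.lower().split()
    scores = {}
    for pos, w in enumerate(words):
        boost = 1.0 / (1 + pos / len(words))   # earlier occurrences count more
        scores[w] = scores.get(w, 0.0) + boost
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(key_terms("search engines rank documents search engines index documents fast"))
```

Even this crude scorer pulls the repeated, early terms to the front, which is roughly what you want from a document's "key terms".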
Monday, November 17
Daniel attended the symposium and has notes from Day 1 and Day 2. His notes are a good start, but I'm really disappointed by the dearth of information available for those who could not attend. The IRF symposium provides a good model for how to do this; there was a live stream of the presentations, and the videos and slides are available after the conference.
Beyond basics, in the future we should enable remote audience registration and participation. We should be able to watch presentations and have online discussion. After all, traveling to conferences is expensive and often infeasible.
Sunday, November 16
The best parsers are those found in the top web browsers. However, it's usually quite challenging (and slow) to use them in external programs.
Java Mozilla Html Parser - A Java wrapper around the Firefox HTML parser that provides a Java API to parse documents. The website is out-of-date; there was a v0.3 release in October.
Of course, you still have the option to write your own for maximum flexibility and speed. I'm still waiting for a real production-quality parser. We'll need something better than what's available today to deal with those messy billion-document test collections that are coming soon.
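In the meantime, Python's forgiving stdlib parser makes a serviceable baseline for pulling text out of messy, unclosed HTML — nowhere near browser quality, but it won't choke on broken markup.

```python
# Extract visible text from sloppy HTML with the stdlib's tolerant
# event-driven parser; unclosed tags are handled gracefully.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><body><p>Hello <b>messy</b> web")  # note the unclosed tags
print(" ".join(extractor.chunks))  # Hello messy web
```

For large crawls you'd want script/style filtering and charset handling on top of this, but as a fallback it's hard to beat for zero dependencies.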
Tuesday, November 11
Chapters 8 (Machine Learning with WEKA), 9 (Statistical NLP), and 10 (Information Gathering) are all highly relevant for those of us in IR/text processing fields.
The code is designed to be easy to learn and to teach concepts, rather than to be the most efficient or the latest state-of-the-art. One good benefit is that it recommends other implementations for those seeking more depth.
Friday, November 7
Mark Sanderson (Academic retrieval perspective)
- IRF Symposium 2007 was an introduction of IR people to IP people. What was striking then was an example of a deliberate misspelling in a document, made because someone was trying to make sure their patent wouldn't be found. This exposed the adversarial nature of some aspects of IP retrieval, which has parallels in the web retrieval community in the opposite direction.
- In 2008, academics drew on that experience, but much of the work has been based on newspaper and web test collections. There is still a disconnect between what academics have solved and what is relevant to the IP community. For example, academic groups are still evaluating using Precision@5 and MAP, which focus on precision, instead of recall, which matters more for IP. We need to look at new ways of assessing results.
Projects - Matrixware contributions
Alexandria System - a large-scale global archive of IP data
Leonardo System - an application development platform to access the data repositories. There is potential here for information studies specialists to study how IP searchers work and analyze their interactions.
He encouraged academics to take part in the CLEF IP and TREC chemical 2009 tracks this coming year. He drew parallels to the TREC legal track and the new and interesting understanding developed from that relationship. For example, legal track people are wedded to boolean retrieval and it was a big shock when ranked retrieval systems found documents that boolean search missed.
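The legal-track surprise in miniature: a strict boolean AND misses a relevant document that even a crude ranked term-overlap model still surfaces. The documents and query here are invented.

```python
# Boolean AND returns only documents containing every query term;
# a ranked model orders all documents by term overlap instead.
docs = {
    "d1": "patent infringement claim filed",
    "d2": "the claim was rejected",   # no "patent", so boolean AND misses it
}
query = ["patent", "claim"]

boolean_hits = [d for d, t in docs.items() if all(q in t.split() for q in query)]
ranked = sorted(docs, key=lambda d: -sum(q in docs[d].split() for q in query))

print(boolean_hits)  # ['d1']
print(ranked)        # ['d1', 'd2'] -- d2 still appears, just ranked lower
```

That's the whole shock in two lines: ranked retrieval degrades gracefully where boolean retrieval returns nothing at all.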
Steve Adams (IP Industry)
- He characterized the theme of this year as "hybrid".
A patent document embodies a fundamental tension. In the end, the patent office delivers a document that serves both the legal community and the technical community, and those two functions are often in conflict. Producing a single document that performs both functions takes a lot of practice. Patents are also hybrid because they contain both text and non-text data. We need retrieval systems that can pull out the non-text parts of the documents.
Hybrid approaches to IR
No single system or paradigm is going to deliver all the results on all occasions for every search.
Multi-linguality - we were reminded there are multiple methods to retrieve documents, query translation and document translation, and both are useful.
Annotation (Eric) - the basic question is: do we get good retrieval based solely on the original document, or do we need some form of enriched document to give better retrieval? If it proves possible to continue automatically or semi-automatically enriching documents as we face ever-expanding corpora, this will be very helpful. Semantic annotation currently requires a stable ontology, but we have a very dynamic vocabulary that develops over time.
Boolean vs ranked - Leif's findability index was very interesting. It could be the beginning of evaluation tools. Both boolean and best match ranking have their place.
Pierre identified the fact that getting to the bottom of each player's role is an important preliminary step: who does what? Mark referred to 'dirty data'; we need to improve our data at the early stage of document production, not after it has been published.
Monika’s paper raised the multimedia challenge: the patent application of 20 years in the future may not be text at all. Send us the CAD-CAM files, send us the 3D crystallographic model, send us the chip mask. We are light years away from being able to search these types of documents.
Some of the highlights seem to have been:
Mapping how easily Documents can be found - by Leif Azzopardi
Annotations and Ontologies in the Context of Patent Retrieval - Eric Gaussier
Also the Alexandria and Leonardo systems from Matrixware.
Thursday, November 6
Welcome back Paul! I look forward to more interesting posts ;-).
I meant to write about this sooner, but Jon and Daniel beat me to it.
Paul's blog is a nice addition to my blogroll.
- Coverage of Bruce Croft's keynote
Unsolved Problems in Search (and how we might approach them)
- Data is good, code is a liability, coverage of Peter Norvig's industry day talk:
Statistical Learning as the Ultimate Agile Development Tool
Xing Yi returned and told us that the best interdisciplinary award went to:
Structural Relevance: A Common Basis for the Evaluation of Structured Document Retrieval by Sadek Ali, Mariano Consens, Gabriella Kazai, Mounia Lalmas.
Other highlights from Xing include:
- MedSearch: A Specialized Search Engine for Medical Information Retrieval
- Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs
Does anyone know who won the best poster award? The website hasn't been updated and we have a few people at the lab who would be interested.
I am also looking forward to the video lectures being available online.
Here is a description from the programme,
The main themes of this year’s speeches are multilingual retrieval, annotation and ontology, retrieval in non-textual documents and the improvement of user interfaces. The latest scientific projects from the fields of semantic and linguistic retrieval, text mining, automated quality control and machine translation will be presented for the first time.
The CIIR here collaborates with the IRF. We have researchers there presenting work on using retrieval methods to detect errors in OCRed patent documents. I hope to have more details to follow.
The IRF also recently hosted the Patent Information Retrieval workshop at CIKM '08. The papers should be available through the ACM.
Wednesday, October 29
Google is rolling out "SearchWiki". It's a step in the direction Eric Schmidt outlined back in the 2006 analyst day presentation. From the notes on Slide 8:
- Encourage our large user base to actively contribute metadata that leads to better search results
- Wiki of search: empower users/experts to improve search results in their domains of expertise — create a million verticals
- Effectively integrate user feedback (ratings, comments, tags) into search
It's exciting to see some innovation in this area. It's a logical next step for Google to collect explicit feedback from users on the quality of results. For example, not that I dislike my blog ranking highly, but a search for [Kleinberg memetracker] could be improved by moving memetracker.org higher.
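My guess at the simplest mechanics behind SearchWiki-style edits: blend the engine's base score with a user's explicit votes. The sites, scores, and the 0.2 weight below are all invented for illustration.

```python
# Re-rank results by adding a vote bonus to each document's base
# relevance score; one promote click is enough to swap the order.
base_scores = {"memetracker.org": 0.70, "myblog.example": 0.75}
my_votes = {"memetracker.org": +1}          # one promote click

def rerank(scores, votes, weight=0.2):
    return sorted(scores, key=lambda d: -(scores[d] + weight * votes.get(d, 0)))

print(rerank(base_scores, my_votes))  # memetracker.org now ranks first
```

The interesting design question is the weight: per-user edits can be applied aggressively, but aggregating votes across users for everyone's results would need heavy spam protection.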
In somewhat related news, Microsoft recently showed a research prototype, U Rank (see their blog), that leverages your social network and lets you organize and edit search results and share them with others. This is closer to a "wiki of search", but in the limited context of a social network.
Bruce Croft from the CIIR here at UMass is giving the keynote today: Unsolved Problems in Search (and how we might approach them). I'm afraid I don't have any inside information on the content of his presentation.
I look forward to watching the videos on videolectures.net
Monday, October 27
I can't find much online for most of them. I agree with Daniel who says, Please Blog!. Live blogging anyone? Here's a little bit that I could find on one of the tutorials:
Large graph mining: patterns, tools and case studies
(See also the tutorial from WWW 2008 and ECML 2007 on VideoLectures).
Elif and Jangwon from our lab here at UMass are attending, so hopefully I'll have their highlights when they return.
You can read Jangwon's paper on blog search Blog Site Search Using Resource Selection.
Napa is gorgeous. It also happens to have some of the best restaurants in the world. I would love any coverage of those too ;-).
Friday, October 24
Building on the work by the Yahoo! Research team in the paper "Information Re-Retrieval: Repeat Queries in Yahoo! Logs," the algorithm that generates the personalized results has been enhanced to return more targeted results.
Inquisitor will also search the contents of your bookmarks to help you re-discover old content.
I've been trying it out and I really like it.
See ReadWriteWeb's coverage.
Thursday, October 23
Monday, October 20
The dataset consists of 44 million blog posts (27 GB compressed) crawled by Spinn3r between August 1st and October 1st 2008. The paper deadline is in January, so get to work!
It's exciting to see that in at least one task relevance will include not only topical relevance, but also the 'quality' of the content. This is one of my major criticisms of the current Cranfield/TREC paradigm and most current academic experiments.
I find the blog track interesting, and not just because I have a blog. I'm interested in utilizing the highly temporal nature of blog posts to study the importance of temporal relevance. For example, to study the trade-off between authority and recency in ranking.
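One simple way to frame that trade-off: exponentially decay a post's authority score by its age, with the half-life as the tuning knob. The scores, ages, and half-life below are invented.

```python
# Score = authority discounted by age with an exponential half-life
# decay; a short half-life favors recency, a long one favors authority.
import math

def score(authority, age_days, half_life=7.0):
    return authority * math.exp(-math.log(2) * age_days / half_life)

fresh_post = score(authority=0.5, age_days=1)    # low authority, brand new
old_post   = score(authority=0.9, age_days=30)   # high authority, a month old
print(fresh_post > old_post)  # True: with a 7-day half-life, recency wins
```

The research question is what the decay curve should actually look like for a given query: breaking-news queries want a steep one, evergreen queries barely any.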
See also my previous discussion of the 2009 blog track.
You can read the announcement on the website for the full list of updates and changes. As an example, there are new document features for ranking:
In LETOR3.0 we added in-link number, out-link number, length of URL, number of slashes in the URL, etc. as new features. Also, we extracted those existing features in all streams (URL, title, anchor and body), while features in some streams are missing in LETOR2.0. Overall, there are 64 features (Table. 2) which can be directly used by learning algorithms.
Also of note, the document parser used to index the documents changed (different tf counts), and the definition of some of the document fields differs slightly from version 2.0. Furthermore, the IDF calculation changed significantly.
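Two of the new URL features are trivial to compute yourself; a quick sketch (the URL is made up):

```python
# URL length and slash count, two of the document features added in
# LETOR 3.0 for learning-to-rank experiments.

def url_features(url):
    return {"url_length": len(url), "num_slashes": url.count("/")}

print(url_features("http://example.com/a/b/page.html"))
# {'url_length': 32, 'num_slashes': 5}
```

Cheap features like these turn out to be useful proxies for page depth and page type (homepages tend to have short, shallow URLs), which is presumably why they made the feature set.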
Saturday, October 18
I've been blogging about search since late 2005. Early on I focused on domain-specific search engines, like Globalspec. More recently, the blog has become more technical and research oriented. I look forward to using it to share not only research highlights, but practical guides and code for practitioners. This is one area where my previous job as a developer limited my ability to be open and participate in these types of discussions. Here are a few highlights from the past few years.
The most popular posts:
- Open source Information Extraction and Text Mining tools
- Open source search libraries
- Vertical search definition and context
- Open source Collaborative Filtering and recommendation engines
(and the more recent web-scale recommendation engines).
- How to quickly reset a Java array
My favorite posts:
Monday, October 6
Truevert is a new semantic search engine built to demo Orcatec's semantic technology. They built a 'green search engine' using Yahoo! BOSS (see also my recent post on the BOSS-U workshop).
From their blog:
Truevert has solved the problem of semantic search by learning the meaning of words directly from the documents that it reads rather than by relying on a prebuilt taxonomy, ontology, dictionary, or thesaurus...
In conjunction with an excerpt from a more recent post:
Delivering focused search results depends on the ability to understand the meaning of words to a detailed level. This understanding will not come from syntactic analysis or from the construction of elaborate ontologies. It will come from using human-like processes on the documents themselves.
Interesting. I'd love to learn more about their semantic analysis technology.
The workshop consisted of a series of all-day sessions in which academics from MIT, Stanford, UIUC, UMass, and Purdue, and experts from the Yahoo! Search Team and Yahoo! Research brainstormed and discussed ways to incorporate BOSS-U into academic research and teaching programs.
Getting more academic involvement in what has been traditionally a very closed industrial environment is very encouraging. One of the goals is to provide academic researchers access to web-scale data. To start the process, Yahoo! is being quite generous with access to their API for academic researchers. It will be interesting to see what research and ideas emerge from the collaboration.
Tuesday, September 30
What would it take to convince you:
- ... to buy a spam filter?
- ... to win a Nobel Prize for spam filtering?
- ... to publish a paper?
- ... to grant a PhD?
Here are a few of the many ideas that surfaced:
- develop a system that generates undetectable spam
- create a high-accuracy system that performs automatic unsupervised learning so that the user is never bothered with spam again.
- prove the problem of spam is the same as a currently known solvable or unsolvable problem
Myth - a widely held, but false belief or idea (in this context). Myths get us into trouble when we say they are false, but we act like they are true.
- Computer Science isn't science, it's just processing.
- The right questions and their possible answers are obvious.
- To find good research problems just look at what everyone else is doing.
"I skate to where the puck is going to be, not to where it has been.” - Wayne Gretzky
- Science is just common sense.
- Good research is based on what your undergraduate degree trained you to do well.
- All findings in major journals are true.
- Failure is bad.
Design an experiment to learn regardless of the outcome.
- Great researchers are born, not made.
- To be successful I just need to show my system is better.
- To be successful I have to work all the time.
Focus on productivity.
- To be successful, I just need to do more of what I'm already doing
1) think harder or 2) code more
- Applied Math/CS is not as good as theory
"The code you write today won't run in five years. Get over it. What will be used? It is the understanding derived from running the code."
They also referenced two great books: The Structure of Scientific Revolutions and The Sciences of the Artificial.
See the website for last year's version.
If you want to learn more about methods for conducting constructive Computer Science research, I recommend David Jensen's Research Methods class. The notes from Spring 2008 are available.
Upon reflection, what struck me is that I sometimes have a tendency to follow what's hot right now rather than looking ahead to the future. Don't fall into this trap.
Thursday, September 18
Instead of watching The Office or other mind rotting television this fall, you may want to consider watching NLP and ML lectures courtesy of Stanford.
Stanford's Engineering Everywhere is offering some course materials for free, including lecture videos and course notes. I hope CMU and other top CS programs do likewise.
There are two interesting courses, including lectures that are relevant for IR people:
Natural Language Processing (CS224N)
by Chris Manning (course site) (SEE link).
Machine Learning (CS229)
by Andrew Ng (course website) (SEE link).
(A small plug for the Machine Learning course, CMPSCI 689 here at UMass, which I look forward to taking.)
Monday, September 15
Traditional search evaluation has focused on the relevance of the results, and of course that is our highest priority as well. But today's search-engine users expect more than just relevance. Are the results fresh and timely? Are they from authoritative sources? Are they comprehensive? Are they free of spam? Are their titles and snippets descriptive enough? Do they include additional UI elements a user might find helpful for the query (maps, images, query suggestions, etc.)? Our evaluations attempt to cover each of these dimensions where appropriate.
One of my biggest issues with TREC and similar environments is the single-minded focus on topical relevance. See my previous post on the TREC blog track. For example, a spam post that is relevant to a topic would be acceptable, even if you would never want to read it in real life. It's time we move beyond the basics and find ways to tackle the more challenging retrieval-quality aspects in a way that is still amenable to cost-effective measurement.
Note: I also highly recommend What People Think About When Searching by Daniel Russell who analyzes user intent and behavior at Google.
Friday, September 12
Fernando Diaz, a recent CIIR alumnus has the first post: Blogs, queries, corpora. He's continuing the discussion that Iadh started on tasks for the TREC 2009 blog track (see my earlier post in response). Fernando focuses on the origins of the current TREC tasks and deriving future tasks from the behavior of real-world users of blog search engines. Fernando writes,
One question I hope will be resolved in the comments is where these query types came from. Are they derived from actual blog searchers?... One approach would be to inspect query logs to blog search engines for different retrieval scenarios and then improve performance for those scenarios.
He poses a very good question. I don't recall seeing any published research analyzing the behavior of users with blog search log data. Ultimately, the problem comes back to a fundamental issue: academia struggles to create relevant and realistic test scenarios without access to log data from real-world systems. However, hopefully we can at least try to improve what we have today.
I would like to see TREC topics begin to model the interactive nature of search. A first step is acknowledging that users enter multiple queries in order to find information. Today, TREC topics contain only a single query, which is unrealistic and overly simplistic. As a starting point, I advocate the development of multi-query topics built from query refinement chains. Evaluation would be performed on each query in the chain, and the results for the whole chain combined. Thoughts?
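To make the idea concrete, here's a minimal sketch of how such a multi-query topic might be scored. Everything here is hypothetical (toy document IDs, a deliberately simple precision-at-k metric, and a plain mean as the chain combiner); it's meant only to show the shape of the evaluation, not a proposed official measure.

```python
# Sketch: scoring one multi-query TREC topic built from a query refinement chain.
# Each query in the chain has its own ranked list; the topic score combines the
# per-query scores (here, a simple mean).

def precision_at_k(ranking, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def score_query_chain(chain_rankings, relevant):
    """chain_rankings: list of ranked doc-id lists, one per query in the chain."""
    scores = [precision_at_k(r, relevant) for r in chain_rankings]
    return sum(scores) / len(scores)

# A two-query refinement chain for one hypothetical topic.
chain = [
    ["d9", "d7", "d1", "d8", "d4", "d3", "d2", "d5", "d6", "d10"],  # initial, vaguer query
    ["d3", "d1", "d2", "d5", "d9", "d6", "d7", "d8", "d4", "d10"],  # refined query
]
relevant = {"d1", "d2", "d3"}
print(round(score_query_chain(chain, relevant), 2))  # prints 0.4
```

A real proposal would need a principled combiner (e.g., weighting later queries differently), but even this toy version makes the point that a topic's score is a function of the whole chain rather than a single query.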
Wednesday, September 10
First, I'm looking forward to the new blog corpus. The 2006 blog corpus is small and only covers eleven weeks. Hopefully, the new 2008 corpus will be much larger over a longer time frame that includes the upcoming US presidential election and all of the controversy surrounding it.
I read What Should Blog Search Look Like by Hearst, Hurst, and Dumais. The paper covers three key tasks:
1. Find out what people are thinking or feeling about X.
2. Find good blogs/authors to read.
3. Find useful information that was published in blogs sometime in the past.
The paper focuses heavily on the search features needed to support these tasks. Its main criticism of the current blog distillation task (roughly task 2 above) is that the current task focuses only on relevance and does not incorporate information about the quality of the content or the authority of the blog discovered.
I also read On the Trec Blog Track which summarizes the last two years of the blog track. It talks about an extension to the existing opinion finding track that I think could be really interesting:
For example, for a given product, one might wish to have a list of its positive and negative features, supported by a set of opinionated sentences extracted from blogs (Popescu & Etzioni 2005). Such a task complements work in the TREC Question Answering track.
An interesting extension to this would be to try to summarize the positive and negative opinions on individual features.
I focused on the section Lessons Learnt and Future Tasks. The paper outlines three possible new tasks:
- Feed/Information Filtering - Inform me of new feeds or new blog posts about X.
- Story Detection - Identify all posts related to story X. A possible variant is to ask the participating systems to provide the top important stories or events for a given date or a given range of dates.
- Information Leaders - Identify all information leaders about topic X.
The first two sound very interesting. They are similar to some of the tasks in the older Topic Detection and Tracking (TDT) community that worked with news data.
Personally, I really like the first two because I spend a lot of my time reading the blogs of leading technologists and researchers to stay on top of interesting topics in information retrieval and related fields. The current alert systems (e.g. Google Alerts) are inadequate; they don't find all of the new information and often surface many duplicates. A sub-task here could be linking and deduping different versions of the same story.
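As a rough illustration of that deduping sub-task (toy data and hypothetical helper names, not a proposed TREC system), one simple baseline is to flag two posts as versions of the same story when their word-shingle sets overlap heavily:

```python
# Sketch: near-duplicate story detection via word shingles and Jaccard similarity.

def shingles(text, k=3):
    """The set of k-word shingles (overlapping word windows) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Set overlap: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(posts, threshold=0.5):
    """Return index pairs of posts whose shingle overlap exceeds the threshold."""
    sigs = [shingles(p) for p in posts]
    return [(i, j)
            for i in range(len(posts))
            for j in range(i + 1, len(posts))
            if jaccard(sigs[i], sigs[j]) > threshold]

posts = [
    "google releases new search index with faster crawling today",
    "google releases new search index with faster crawling this week",
    "yahoo opens its boss search api to academic researchers",
]
print(near_duplicates(posts))  # prints [(0, 1)]
```

At web scale you'd replace the exact pairwise comparison with something like MinHash signatures, but the core idea, treating two posts as one story when their shingle sets mostly agree, stays the same.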
For the second task, it's interesting to find all posts about a story. However, finding ALL posts isn't very realistic, primarily because not all posts are useful or worth reading. For example, a post may simply be a link to the story without any other content; this is relevant but not very useful. Again, I would like to incorporate some sense of quality: find the highest-quality posts on story X.
To me the third task is slightly less interesting. However, it would be interesting to try to link the conversations together and track the discussion across blogs (including both comments and posts). The end goal might be to discover novel subtopics branching off the original story.
Through all of these, one theme stands out: the need to find not just relevant information, but posts or blogs that contain quality, authoritative content.
To start, he gives a brief history of the track over the three years it has run.
Our main findings and conclusions from the first two years of the Blog track at TREC are summarised in the ICWSM 2008 paper, entitled On the Trec Blog Track. The Blog track 2006 and 2007 overview papers provide further detailed analysis and results.
He also points to a position paper by Marti Hearst, et al., What Should Blog Search Look Like? that will be presented at CIKM 2008.
I will read both papers and give them some thought. You will hear from me soon ;-).
Monday, September 8
I wonder what words he used when he opened this door? Open sissy, open cecil, no that can't be it! whoop, it's giving way, it's giving way...
Unfortunately, Popeye's super strength won't help you with your search tasks. Or perhaps you feel like the bumbling fool Hasan attempting to prevent Daffy from stealing the master's treasure:
- Popeye watch the full episode via the IA (or jump to the second half on YouTube)
Using today's search engines is playing the game of Guess the Magic Words. Guess the right words, and Open Sesame! Guess poorly and you bang your head against the cave door for minutes or hours. How good are you?
A current problem with today's search engines is this: if you type a long query and give the search engine more information about your information need, you are likely to get worse results than if you had entered only a few brief keywords.
Barney Pell from Powerset describes the current language of search engines as keywordese. If you enter too few of these keywords, your query is likely too vague; too many, and relevant documents are mistakenly filtered out. And too often we don't know the 'magic words' to find the desired information.
Until search engines can utilize the information in long queries without being overwhelmed by the 'noise', search will remain broken.
Marissa Mayer said in a recent LA Times Article on Google's 10 year anniversary:
I think there will be a continued focus on innovation, particularly in search. Search is an unsolved problem. We have a good 90 to 95% of the solution, but there is a lot to go in the remaining 10%.
Search isn't 90% solved. It's not easy to quantify because search is constantly evolving. Regardless, beating the Guess the Magic Words level in the game of search is still a long way off.
Thursday, August 21
O'Reilly Radar has good coverage.
Don't forget to subscribe to the Photosynth Team Blog.
Wednesday, August 20
My wife and I are moving to Northampton to begin the long road to a PhD in CS from UMass Amherst at the CIIR. Orientation starts Sept. 2nd; my last day at Globalspec is Friday.
Needless to say, things have been quite hectic. More to come once we get settled.
Tuesday, August 19
It reminds me of some of the best advice on writing a good essay: create a rough outline and then start writing! Plan on rewriting or revising most of the first draft.
This same principle also translates quite successfully to software development. Break the project down into small, well-defined chunks with clear responsibilities and functions. Building these pieces creates code that kind-of, sort-of works. Then you can redesign and reshape, adding layers of functionality and fleshing out details based on what you learned building the 'draft'. This follows the principle of "plan to throw one away; you will, anyhow" from Brooks's The Mythical Man-Month. Just don't put too much effort into the first one. ;-)
Friday, August 15
It's been really busy recently, so I haven't had time to read through the papers in depth yet. I am in the frenzied process of packing for the imminent move to western Mass. to begin the PhD program at UMass.
I previously mentioned the final publication of the new Intro to IR book from Manning, et al. It will be a classic.
There is also background material for the MPhil in Computer Speech, Text and Internet Technology (CSTIT) from the CL at Cambridge.
An Introduction to Linguistics by Ted Briscoe.
Background on basic maths: set theory and logic, linear algebra, and probability.
The main book for the one-year CSTIT course is Speech and Language Processing by Daniel Jurafsky and James Martin. A new second edition of the book was released this spring. It's at the top of my reading list. The first edition was a classic, and I'll try to write up a review of the second edition when I get through it.
If that doesn't keep you busy, there is always Moffat, Zobel, and Hawking's Recommended Reading for IR Research Students.
Prabhakar Raghavan gave a talk, New Sciences for a New Web at the Infocomm Development Authority (IDA) of Singapore.
They also celebrated the release of the new book Raghavan co-authored: Introduction to Information Retrieval. The website now has a new PDF for online reading with hyperlinks.
Thursday, August 14
I have the same problem with Firefox using my old Hotmail account. The 'full' version of Hotmail prompts me that my Firefox 3.0 browser is unsupported and that the product may not function correctly.
Perhaps there are more complicated 'platform support considerations', but I am sick of the BS. Firefox has about 20% of the browser market. Microsoft, stop pretending Firefox users don't exist. Real developers write code that supports rich experiences in all of the modern (post-IE6) browsers.
In the meantime, I recommend Confluence as an alternative to Sharepoint's Wiki component. The downside is it doesn't come cheap...
Tuesday, August 5
Iadh's posted the team's research presented at SIGIR 2008.
Welcome to the blogosphere Iadh, Craig, and the rest of the Glasgow IR team. It's about time, considering you organize the TREC blog track! ;-)
Sunday, August 3
Later, the author writes:
Lackadaisical attention, slipshod thinking, diffused interest, scattering of mental forces: all these get us nowhere. The Jack-of-all-trades is a failure in the twentieth century. The big rewards of this age go to the man who can do one thing supremely well. That means concentration.
I need to practice more:
Try to hold a chosen topic of thought for a fixed period of time. Do not be too ambitious... Try it on the following topics: "How I spent my last birthday." "My favorite book." "The best moving-picture I ever saw." "The most inspiring lecture I ever heard"... Be content with a minute, at the outset less, perhaps, if your pride permit.
Via Feld.
Saturday, August 2
Learning to Rank (full proceedings)
The main purpose of this workshop is to bring together information retrieval researchers and machine learning researchers... The goal is to design and apply methods to automatically learn a function from training data, such that the function can sort objects (e.g., documents) according to their degrees of relevance, preference, or importance as defined in a specific application.
See Jon's coverage.
Focused Retrieval - (full proceedings)
Focused retrieval has been used to extract relevant sections from academic documents; and the application to text book searching is obvious (such commercial systems already exist). The purpose of this workshop is to raise issues and promote discussion on focused retrieval - that is, Question Answering (QA), Passage Retrieval, and Element Retrieval (XML-IR).
Information Retrieval for Advertising - (full proceedings)
Online advertising systems incorporate many information retrieval techniques by combining content analysis, user interaction models, and commercial constraints. Advances in online advertising have come from integrating several core research areas: information retrieval, data mining, machine learning, and user modeling. The workshop will cover a range of topics on advertising, with a focus on application of information retrieval techniques.
Mobile Information Retrieval (MobIR '08) - (full proceedings)
Mobile Information Retrieval (MobIR'08) is a timely workshop concerned with the indexing and retrieval of textual, audio and visual information such as text, graphics, animation, sound, speech, image, video and their various possible combinations for use in mobile devices with wireless network connectivity.
Beyond Binary Relevance: Preferences, Diversity, and Set-Level Judgments - (full proceedings)
New methods like preference judgments or usage data require learning methods, evaluation measures, and collection procedures designed for them. This workshop will address research challenges at the intersection of novel measures of relevance, novel learning methods, and core evaluation issues.
Future Challenges in Expertise Retrieval - (full proceedings and slides)
The main theme of the workshop concerns future challenges in Expertise Retrieval. Instead of focusing on core algorithmic aspects of a specific expert finding scenario (as is the case for the TREC Expert Finding task), our aim is to broaden the topic area and to seek for potential connections with other related fields.
Analytics for Noisy Unstructured Text Data (full proceedings behind ACM web login)
Noise in text can be defined as any kind of difference between the surface form of a coded representation of the text and the intended, correct, or original text. The goal of the AND workshops is to focus on the problems encountered in analyzing noisy documents coming from various sources.
Best student paper award: Latent Dirichlet Allocation Based Multi-Document Summarization by Rachit Arora and Balaraman Ravindran
Workshops without proceedings online (yet):
Speech Search (SSCS)
Lastly, an older, but highly related workshop from WWW 2008:
Adversarial Information Retrieval (AIRWeb) - (program with papers and slides)
The program is structured around 3 sessions with presentations of peer-reviewed papers on Adversarial IR on the Web, covering usage analysis, network analysis and content analysis; followed by one session with the Web Spam Challenge results and a panel on the future of Adversarial IR on the Web.
Thursday, July 31
First up, Paul Heymann on the Stanford InfoLab Blog has some of the best coverage of the conference I've read yet.
In case you missed Jon's earlier comment, he has coverage of the Learning to rank sessions and workshop.
Paraic Sheridan covers the keynote on Google China and its future in Africa. Paraic is a computational linguist at the Centre for Next Generation Localisation (CNGL) at Dublin City University.
Pranam Kolari, from Yahoo!'s web spam team has coverage of Kai Fu Lee's Keynote.
Best paper awards
I couldn't find the award information on the SIGIR 2008 site, but here's what I pieced together, please correct me if I'm wrong:
Algorithmic Mediation for Collaborative Exploratory Search (best paper award)
BrowseRank: Letting Web Users Vote for Page Importance (best student paper award)
Also, I look forward to reading Peter Bailey's paper Relevance Assessment: Are Judges Exchangeable and Does it Matter?
In the paper they examine the impact of assessor expertise on the quality of relevance judgments. In the end, they conclude:
...the Cranfield method of evaluation is somewhat robust to variations in relevance judgements. Having controlled for task and topic expertise, system performance measures show statistically significant, but not large, differences. Similarly, system orderings allow us to identify "good" and "bad" IR systems at a broad-brush level.
Tuesday, July 29
Cuil's plans to differentiate itself
1) It's about the infrastructure, of course.
From a recent interview with GigaOm, Anna Patterson, formerly one of Google's infrastructure designers, reportedly said:
How it works is that the company has an index of around 120 billion pages that is sorted on dedicated machines, each one tasked with conducting topic-specific search — for instance, health, sports or travel. This approach allows them to sift through the web faster (and probably cheaper) than Google...
The Forbes article has a little more detail on their query serving architecture:
Finally, to compare with Google's architecture, a quote from Danny Sullivan's interview with Anna:
Patterson and Costello's impressive feat is that they've done this with a total of 1,400 eight-CPU computers (1,000 find and data-mine Web pages, the remaining 400 serve up those pages) [JD: Even assuming there is no redundancy 120 billion docs / 400 servers = 300 million documents per node. This seems unrealistically high, especially considering that Lucene, a widely used search library can realistically handle 10-20 million.] ...
Cuil attempts to see relationships between words and to group related pages in a single server. Patterson says this enables quicker, more efficient searching: "While most queries [at competitors] go out to thousands of servers looking for an answer, 90% of our queries go to just one machine."
If they [Google] wanted to triple the size of their index, they'd have to triple the size of every server and cluster. It's not easy or fast... increasing the index size will be a 'non-trivial' exercise.
According to the news, Cuil's index serving infrastructure is a key competitive advantage over Google and the other major players. It remains to be seen whether they can leverage this platform to produce world-class results.
On their size claims
Last I heard, Google's index is rumored to be in the 40 billion range, Microsoft is in the 10-20+ billion range. Cuil claims their architecture allows at least a 3x increase in index size over Google. However, it's hard to verify this because Cuil's hit counts are badly broken: a search for [the] returns an estimated 250 documents. The lack of support for advanced search, such as site: also makes it difficult to compare coverage of individual sites, such as Wikipedia.
Other differentiating features:
- Topic-specific ranking
From Danny's interview, it sounds like Cuil is doing post-retrieval analysis of document content, analyzing phrase co-occurrence and extracting 'concepts'. From the interview:
It figures out these relationships by seeing what type of words commonly appear across the entire set of pages it finds. Since "gryffindor" appears often on pages that also say "harry potter," it can tell these two words (well, three words -- but two different query terms) are related.
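For illustration only (this is not Cuil's actual algorithm, and the data is a toy example), one standard way to quantify that kind of term relatedness is pointwise mutual information over page-level co-occurrence: terms that appear together more often than their individual frequencies would predict score high.

```python
# Sketch: inferring term relatedness from co-occurrence across a set of pages.
import math

def cooccurrence_pmi(pages, t1, t2):
    """Pointwise mutual information of two terms over page-level occurrences."""
    n = len(pages)
    sets = [set(p.lower().split()) for p in pages]
    p1 = sum(t1 in s for s in sets) / n   # P(t1 on a page)
    p2 = sum(t2 in s for s in sets) / n   # P(t2 on a page)
    p12 = sum(t1 in s and t2 in s for s in sets) / n  # P(both on a page)
    if p12 == 0:
        return float("-inf")  # never co-occur
    return math.log(p12 / (p1 * p2))

pages = [
    "harry potter visits gryffindor tower",
    "gryffindor wins the cup for harry potter fans",
    "potter pottery classes in town",
    "quidditch news and scores",
]
# "gryffindor" appears only on pages that also mention "potter",
# so that pair scores higher than a pair that never co-occurs.
print(cooccurrence_pmi(pages, "gryffindor", "potter") >
      cooccurrence_pmi(pages, "quidditch", "potter"))  # prints True
```

Real systems work over billions of pages and likely use phrase-level statistics and smoothing, but the underlying signal, co-occurrence beyond chance, is the same idea the interview describes.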
Cuil then reportedly computes a topic specific link score. It sounds very similar to Teoma's HITS technology. Again, there is no support (yet) for Cuil's claim that this is superior to other search approaches.
- UI and exploration
Cuil has a non-standard two- or three-column layout of results, which attempts to feel more like a newspaper, with images associated with many results.
It appears to use the information from the content analysis to create the 'Explore by Category' box to drill down into specific topics, as well as offering related searches as tabs across the top of the page.
Size matters, but it's more important to get the right content in the index. This is subjective since the web is effectively infinite and only a subset of it is useful. Google tracks at least a trillion distinct URLs, while Cuil crawls a mere 186 billion (SE Land reference). It's critical that crawling and indexing be prioritized correctly. For example, despite the reported massive index size, this blog is not indexed by Cuil. While on its own this doesn't mean much, Daniel reports similar experiences with lack of coverage of his content.
I am unimpressed with Cuil's current coverage and relevance, but it's still early. Despite all the criticism (much of it justified), launching a search engine of this scale is an impressive feat. I think what Cuil is doing is exciting, and I'm withholding judgment until it has time to mature. Once again, congratulations to the Cuil team and good luck with the long road ahead.
Monday, July 28
Read the good coverage from GigaOm and Danny Sullivan over at SELand. It's an early stage product, but maybe I'll have a review later today.
Cuil has a formidable group of search veterans including Louis Monier from Altavista/eBay, Tom Costello from IBM Webfountain, and Google veterans Anna Patterson and Russell Power, see their bios on the Cuil management page. Congratulations to the entire team, launching a product of this scale is a monumental undertaking!
Wednesday, July 23
However, there is a major difference from Wikipedia:
The key principle behind Knol is authorship. Every knol will have an author (or group of authors) who put their name behind their content. It's their knol, their voice, their opinion. We expect that there will be multiple knols on the same subject, and we think that is good... People can submit comments, rate, or write a review of a knol.
Greg's coverage of Kai-Fu Lee's keynote on Google China.
Jon's posted his slides from his presentation:
I'm going to be taking a long weekend for vacation with the family. I will see you Sunday or Monday. Hopefully there will be more coverage when I return!