First, if you want to learn how these work you should read a recent paper published by Ron Kohavi from Microsoft:
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO, from SIGKDD 2007 (Knowledge Discovery and Data Mining), which is happening next week in San Jose. Essentially, he provides guidance on running experiments based on his experience running controlled tests at Amazon and Microsoft.
I have to say up front that creating good tests is HARD. One of the hardest parts of testing is creating what Ron calls the Overall Evaluation Criterion (OEC): a single measure that maximizes not only short-term profits, but also long-term profits and customer satisfaction. The long-term impact is often overlooked in favor of short-term gain.
For example, spamming a customer with lots of promotions now may be great for short-term sales, but in the long term it is a poor decision because it drives up unsubscribe rates, shrinking the audience for future promotions. All too often it is hard for profit-driven companies to take the long view and think beyond this quarter's (or if you are lucky, this year's) bottom line.
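To make the OEC idea concrete, here is a minimal sketch of one as a weighted score. The metric names and weights are invented for illustration, not anything prescribed by the paper; the point is simply that a treatment can win on short-term revenue alone yet lose once long-term signals are priced in.

```python
def oec(metrics: dict) -> float:
    """Combine per-user metrics into a single evaluation score.

    Weights are hypothetical: they express that a lost subscriber
    costs far more over time than a small bump in revenue earns now.
    """
    return (
        1.0 * metrics["revenue_per_user"]      # short-term gain
        + 5.0 * metrics["repeat_visit_rate"]   # long-term engagement
        - 20.0 * metrics["unsubscribe_rate"]   # long-term cost
    )

# Invented example data: a promotion-heavy treatment vs. control.
control   = {"revenue_per_user": 2.10, "repeat_visit_rate": 0.40, "unsubscribe_rate": 0.01}
treatment = {"revenue_per_user": 2.40, "repeat_visit_rate": 0.35, "unsubscribe_rate": 0.04}

# Treatment wins on revenue alone (2.40 > 2.10) but loses on the OEC.
print(f"control OEC:   {oec(control):.2f}")
print(f"treatment OEC: {oec(treatment):.2f}")
```

Choosing those weights is exactly the hard part: they encode a judgment about how much future value is worth today.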
The authors have some good advice, suggesting that if you don't get the results you expected, you should drill down into many metrics and slice the data by user segment to understand what happened in more detail, learning from what didn't work. The devil is in these oft-overlooked details.
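A small sketch of what that drill-down can look like in practice. The segment data below is invented: overall conversion looks flat, but slicing by segment reveals the treatment helped one group and hurt another. The significance test is a standard two-proportion z-test, not a method taken from the paper.

```python
from math import erf, sqrt

def two_proportion_p(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference between two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # Normal CDF via erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical (conversions, users) for control vs. treatment per segment.
segments = {
    "new users":       ((200, 5000), (260, 5000)),
    "returning users": ((400, 5000), (350, 5000)),
}

# Overall: 600/10000 vs. 610/10000 -- essentially flat.
for name, ((ca, na), (cb, nb)) in segments.items():
    p = two_proportion_p(ca, na, cb, nb)
    print(f"{name}: {ca/na:.1%} -> {cb/nb:.1%}  (p = {p:.3f})")
```

Stopping at the flat overall number would bury the real story: the new-user gain is significant on its own, and the returning-user drop points at what to investigate next.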
Erik Selberg raises this issue when he comments on the 'data-driven decision process.' He writes,
A data-driven approach has to start with the right question, followed by experiments that provide data that is properly interpreted to provide the answer. Typically, people fail in either starting out with the wrong question, or by conducting poor experiments that produce flawed data. An evil twin of flawed data is what I call Executive Data Bias. A decision-maker will have a certain bias on what to do, and is looking for data to back up that decision. Thus, flawed data that backs up the decision is accepted without much probing, while good data and the implications are rejected, typically by asking for “more experiments” or “more data,” or questioning assumptions made in the experiment or question.

He goes on to say that it can work, but it must be done very, very carefully.
I sometimes hear, 'we tested that, it didn't work,' but too often when I ask why it didn't work, and what exactly failed, I don't get satisfactory answers. This incomplete test data is then used to dismiss projects that may still hold great potential, if only we understood in more detail what failed in the test that was run. Unfortunately, more often than I would like, I find myself in Erik's camp.
Hopefully, Ronny's team at MS and others will help all of us become more educated on this important topic.