Tuesday, September 30

Ten Myths of Computer Science Research

Dave Jensen and David Smith recently gave a presentation here at UMass titled "Myths of Research in Computer Science." A copy of their slides is now available online. The talk began with an icebreaker and a small-group discussion on spam filtering.

To open, they posed the following question: what is a 'major finding' in spam filtering? What would it take to convince you:
  • ... to buy a spam filter?
  • ... to win a Nobel Prize for spam filtering?
  • ... to publish a paper?
  • ... to grant a PhD?

Here are a few of the many ideas that surfaced:
  • develop a system that generates undetectable spam
  • create a high-accuracy system that learns automatically, without supervision, so that the user is never bothered with spam again
  • prove that the spam problem reduces to a known solvable or unsolvable problem
One of the interesting takeaways was that a system itself usually doesn't win the highest prize; the theory and knowledge behind it are the key.
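
To make the system-versus-theory point concrete, here is a minimal sketch of the sort of system the discussion was about: a bare-bones naive Bayes spam classifier (the classifier choice, function names, and toy data are my own illustration, not from the talk). The few lines of code are not the 'major finding'; the finding would be the understanding of why word statistics separate spam from ham.

```python
from collections import Counter
import math

def train(messages):
    """Count words per class from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals

def classify(text, counts, totals):
    """Return True (spam) if the spam log-probability is higher."""
    vocab = len(set(counts[True]) | set(counts[False]))
    scores = {}
    for label in (True, False):
        # Class prior plus add-one-smoothed word likelihoods.
        score = math.log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (n + vocab))
        scores[label] = score
    return scores[True] > scores[False]

# Toy data, purely illustrative.
training = [("buy cheap pills now", True),
            ("meeting moved to noon", False),
            ("cheap rolex offer now", True),
            ("draft of the paper attached", False)]
counts, totals = train(training)
print(classify("cheap pills offer", counts, totals))  # True
```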

Myth - a widely held but false belief or idea (in this context). Myths get us into trouble when we say they are false but act as if they are true.

Ten (Make That Eleven) Myths of Computer Science Research
  1. Computer Science isn't science; it's just processing.

  2. The right questions and their possible answers are obvious.

  3. To find good research problems, just look at what everyone else is doing.
    "I skate to where the puck is going to be, not to where it has been.” - Wayne Gretzky

  4. Science is just common sense.
    Related myth: good research is just what your undergraduate degree trained you to do well.

  5. All findings in major journals are true.

  6. Failure is bad.
    Design an experiment to learn regardless of the outcome.

  7. Great researchers are born, not made.

  8. To be successful, I just need to show my system is better.

  9. To be successful, I have to work all the time.
    Focus on productivity.

  10. To be successful, I just need to do more of what I'm already doing: 1) think harder, or 2) code more.

  11. Applied Math/CS is not as good as theory.

One of the great quotes that I took away from the talk was from Dave Jensen quoting Paul Cohen:
"The code you write today won't run in five years. Get over it. What will be used? It is the understanding derived from running the code."
They also referenced two great books: Kuhn's The Structure of Scientific Revolutions and Simon's The Sciences of the Artificial.

See the website for last year's version.

If you want to learn more about how to conduct constructive Computer Science research, I recommend David Jensen's Research Methods Class. The notes from the Spring 2008 offering are available.

Upon reflection, what struck me is that I sometimes tend to chase what's hot right now rather than look ahead to where the field is going. Don't fall into this trap.

6 comments:

  1. I agree with most of the myths mentioned. However, I really can't understand the first one - "CS isn't science".

    Sure, there is CS research that is so badly designed it can hardly be called science. This happens in all fields.

    Can you give a little more detail on the reasoning behind this assertion? Thanks!

  2. Nothing like replying to myself after rereading the text. Obviously I interpreted the first myth backwards. Please ignore. :)

    From the quote, which I don't see in their slides, it sounds like these guys (or whoever they're quoting) have no clue about commercial software. Or even reasonably well-written and documented academic software. I wrote academic software in 1992 that's still fairly widely used today. Many of the classes in our current production software were written more than five years ago.

    Real software in real products tends to be incremental rather than being continually rewritten from scratch by each new grad student who can't figure out what the previous one did.

    Sergio: the myth was that CS isn't science. I believe the authors, by calling it a myth, are implying that it is science.

    The Gretzky quote is what my old VP at Bell Labs used to call the "pee-wee soccer effect" of everyone following the same ideas.

    Simon's Sciences of the Artificial is one of my all-time faves, but I don't see how it's relevant to this discussion.

    While I'm not saying Kuhn's book on scientific revolutions is wrong, it's used in practice to justify nonsense by authors who think they're revolutionaries.

    PS: A good practically-oriented and fun read on this topic is Hamming's advice to young scientists.

  4. Bob,

    I asked Dave Jensen about the quote and it is from Paul Cohen.

    I don't think Paul was saying that there isn't value in well designed and documented software, especially in the commercial sector.

    His message is that in the long run, whether it's five years or fifty, the code won't survive. However, the ideas in it and the findings produced by running it will endure. This is especially true for people conducting experiments in academia, where ideas, not software, are the product. In this context, software is just a tool to produce empirical results and prove theories.

    The Sciences of the Artificial was mentioned to refute the claim that computer science isn't science because we don't study 'real' subjects.

    The Hamming talk is one of my favorites, thank you for pointing it out.

  5. Interesting thoughts, all around.

    I tend to agree with the "your code won't be used in 5 years" statement. My feeling is that this has a lot to do with the "science" aspect of computer science, rather than the engineering aspect.

    With engineering, you are trying to design systems for stability and longevity. You are not really trying to make things extensible or flexible, except perhaps for some sort of plug-in architecture. But for the most part, user requirements do not really change, and so it makes sense that code will still be used 5 years later. Modulo those plugins, the user is essentially doing the same thing with the software 5 years later.

    With science, on the other hand, you are continually testing, discarding, reformulating, retesting and synthesizing new hypotheses. In a sense, very little code is written to ever run twice, because once you've tested a hypothesis, you move on to the next one. Why would you keep running the same experiments, over and over? Maybe if new data were continually made available? But even then, you might run your code for 2 years, or 3 years. After your 5th year of running that code, how many papers or posters or journal articles are you going to get, saying "yup, the hypothesis continues to bear out"?

    I think that's all that they're saying, here.

    Perhaps the exception to this rule is for lower-level, data structure or generic algorithm software development. For example, if you've spent a fair amount of time writing data structure libraries (b-trees, hashsets, etc.) in C, you probably will be reusing that code not just in 5 years, but in 10 years. Similarly, if you've implemented some low-level machine learning algorithms (Adaboost or whatever), you will probably reuse that as well.

    But those become libraries, into which you plug your main code, your hypothesis-testing code. And that main code is the code that you won't be running in 5 years. And that main, hypothesis code is where all the interesting stuff is happening, where all the "real" development is happening. Not in the b-tree library for C/C++.
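
    To make that split concrete, here's a tiny sketch (the names and the toy "hypothesis" are just illustrative, not from any real project): the generic routine is the part that survives, while the hypothesis function is the part you throw away once the experiment is done.

```python
# Durable "library" layer: generic code you may still call in ten years.
def accuracy(predict, examples):
    """Fraction of (input, label) pairs that predict() gets right."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

# Throwaway "hypothesis" layer: written for one experiment, then abandoned.
def todays_hypothesis(message):
    # One-off idea: "long messages are spam." Tested once, then discarded.
    return len(message) > 40

examples = [("buy cheap pills now and save big with this offer", True),
            ("see you at the meeting", False)]
print(accuracy(todays_hypothesis, examples))  # 1.0
```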

    "Science" code is a very different creature from enterprise, production code.
