Wednesday, October 26

CIKM 2011 Keynote II: Justin Zobel on Biomedicine

Data, Health, and Algorithmics: Computational Challenges for Biomedicine
by Justin Zobel

(I missed the first part of this talk)

The Central Dogma
- DNA consists of a sequence of four bases: A, C, G, T
- The concept of a gene is now uncertain

SNP analysis (Nature 00)
 - used PCA

 - Read DNA directly
 - $1000 by the end of 2012

The data is error-prone, incomplete, voluminous, and ambiguous.

Also, reads are not very random

Within a few years there will be DNA databases of 10-100 terabases, which we will use to find matches to short read data

Challenge: Assembly
 - Imagine a million copies of a phone book, a million pages long
 - Shredded into tiny pieces, each no more than 20 or 30 characters
 - 99.999% are thrown away.
 - The task: reconstruct the phone book from the billion remaining pieces

The problem of assembling short reads.

Genomes are a combinatorial minefield
 - vast quantities of repeated material

The genome is cheap, but the analysis is expensive

de Bruijn graph
 - Divide 7-base reads into k-mers (3-mers)
 - each node is a k-mer, each arc is an overlap
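
The construction in the bullets above can be sketched in a few lines of Python. This is only an illustration: the two 7-base reads are made up, and the function names `kmers` and `build_graph` are my own, not from the talk. Each node is a 3-mer, and an arc joins two consecutive 3-mers from the same read (they overlap by two bases).

```python
from collections import defaultdict


def kmers(read, k):
    """Return every length-k substring of read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k=3):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        # Consecutive k-mers overlap by k-1 bases; record that as an arc.
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph


# Hypothetical 7-base reads, chosen only to show a branch in the graph.
reads = ["ACGTACG", "GTACGTT"]
graph = build_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one outgoing arc (here `CGT`) is where repeated material makes the assembly ambiguous.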

The graph is about 4 terabytes, and it needs to be in memory.

Succinct 'Gossamer' representation
 - fast access with simple index
 - space down by a factor > 10
 - cuts the storage down to 32 GB

DNA dictionaries
 - there is no grammar for DNA that would allow construction of a parser
 - a dictionary of all possible tokens would be impossibly large

Dictionary - any representative string
 -> solves the text compression problem for DBs

Genetics for diagnosis
 -> diagnosis inferred from symptoms replaced by diagnosis based on DNA analysis
 -> drug effect and health outcome determined directly from historical health records
 - Built to simplify, improve, and automate bureaucratic decisions

'Guardian Angel' clinical decisions
 - Electronic health records analyzed on the fly to check whether a mistake is about to be made.

Health at Home
 -> health monitoring deeply embedded in our e-lifestyle activities
 -> webcam that determines how well you are based on your skin
 -> iPhone app that tells if you're drunk

Computer Science vs health research
 - many algorithmic solutions are not biologically meaningful
 - spend money on IT and the number of errors is decreased (It saves lives.)

13 of the top 25 questions (Science mag July 2005) are about DNA

The real way to have an impact on medicine is in the clinic -> text mining of records, helping doctors make decisions.