|Data, Health, and Algorithmics: Computational Challenges for Biomedicine|
by Justin Zobel
(I missed the first part of this talk)
The Central Dogma
- DNA consists of a sequence of four bases: A, C, G, T
- The concept of a gene is now uncertain
SNP analysis (Nature 00)
- used PCA
- Read DNA directly
- $1000 by the end of 2012
The data is error-prone, incomplete, voluminous, ambiguous.
Also, reads are not very random
Within a few years there will be DNA databases of 10-100 terabases, which we will use to find matches to short read data
- Imagine a million copies of a phone book, a million pages long
- Shredded into tiny pieces, each no more than 20 or 30 characters
- 99.999% are thrown away.
- The task: reconstruct the phone book from the billion remaining pieces
The Problem of assembling short reads.
Genomes are a combinatorial minefield
- vast quantities of repeated material
The genome is cheap, but the analysis is expensive
de Bruijn graph
- Divide 7-base reads into kmers (3mers)
- each node is a kmer, each arc is an overlap
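To make the node/arc description concrete, here is a minimal sketch of building such a graph. It is my own illustration, not code from the talk: the reads are made up, the function names (`kmers`, `build_de_bruijn`) are hypothetical, and a real assembler would also handle reverse complements, sequencing errors, and memory limits.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring of the read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each k-mer (node) to the set of k-mers that follow it in some read.

    Consecutive k-mers in a read overlap by k-1 bases; that overlap is the arc.
    """
    graph = defaultdict(set)
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            graph[kmer]  # touch the key so nodes with no outgoing arc still appear
            if previous is not None:
                graph[previous].add(kmer)
            previous = kmer
    return graph


# Made-up 7-base reads, split into 3-mers as in the notes above.
reads = ["ACGTACG", "GTACGTT"]
graph = build_de_bruijn(reads, 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one outgoing arc (here CGT, which is followed by both GTA and GTT) is where the assembler has to choose a path, which is why repeated material makes assembly hard.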
The graph is about 4 terabytes, and it needs to be in memory.
Succinct 'Gossamer' representation
- fast access with simple index
- space down by a factor > 10
- cuts the storage down to 32 GB
- there is no grammar for DNA that would allow construction of a parser
- a dictionary of all possible tokens would be impossibly large
Dictionary - any representative string
-> solves the text compression problem for DBs
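One way to read the dictionary idea (this is my interpretation; the notes do not name the method) is to describe each sequence as a list of copies from a representative string, in the spirit of Lempel-Ziv factorization against a fixed reference. A minimal sketch under that assumption, with hypothetical function names and made-up sequences, using a naive search rather than the index a real system would need:

```python
def factorize(reference, sequence):
    """Greedily cover `sequence` with (position, length) copies from `reference`.

    A base that never occurs in the reference is emitted as a literal (base, 0).
    """
    factors = []
    i = 0
    while i < len(sequence):
        best_pos, best_len = -1, 0
        length = 1
        # Naive search: extend the match one base at a time with str.find.
        while i + length <= len(sequence):
            pos = reference.find(sequence[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            factors.append((sequence[i], 0))
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def expand(reference, factors):
    """Rebuild the original sequence from the reference and its factors."""
    parts = []
    for first, length in factors:
        parts.append(first if length == 0 else reference[first:first + length])
    return "".join(parts)


reference = "ACGTACGTTAGC"   # stands in for the representative string
sequence = "ACGTTAGCACGA"    # a similar sequence to be stored
factors = factorize(reference, sequence)
restored = expand(reference, factors)
print(factors)
```

The more a stored sequence resembles the reference, the fewer and longer the factors, which is where the compression would come from.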
Genetics for diagnosis
-> diagnosis inferred from symptoms replaced by diagnosis based on DNA analysis
-> drug effect and health outcome determined directly from historical health records
- Built to simplify, improve, and automate bureaucratic decisions
'Guardian Angel' clinical decisions
- Electronic health records analyzed on the fly to check whether a mistake is about to be made.
Health at Home
-> health monitoring deeply embedded in our e-lifestyle activities
-> webcam that determines how well you are based on your skin
-> iPhone app that tells if you're drunk
Computer Science vs health research
- many algorithmic solutions are not biologically meaningful
- spending money on IT decreases the number of errors (it saves lives)
13 of the top 25 questions (Science mag July 2005) are about DNA
The real way to have an impact on medicine is in the clinic -> text mining records. Helping doctors make decisions.