Class 01 – Fragment assembly
DNA sequence data
DNA sequence data is the mother lode of molecular biology. 10^10 base pairs. One human genome/year. It is our portal to protein sequences. It is fast, cheap and reliable. How do we get it?
Where the fragments come from
Make many copies of a chromosome using PCR (polymerase chain reaction). Break the copies up into short pieces (we can sequence only short pieces). Reassemble the short pieces.
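
A toy simulation can make the copy-and-break step concrete. This is only a sketch under simplifying assumptions: the function name `shotgun_fragments`, the length bounds, and the example chromosome are all hypothetical, reads are error-free, and every copy is cut left to right at random points.

```python
import random

def shotgun_fragments(genome, copies, min_len, max_len, seed=0):
    """Cut `copies` copies of `genome` into consecutive random-length pieces.

    Each copy is cut at different points, so pieces from different copies
    overlap one another. The last piece of a copy may be shorter than min_len.
    """
    rng = random.Random(seed)  # seeded so the toy example is repeatable
    fragments = []
    for _ in range(copies):
        pos = 0
        while pos < len(genome):
            length = rng.randint(min_len, max_len)  # inclusive bounds
            fragments.append(genome[pos:pos + length])
            pos += length
    return fragments

genome = "ACGTTGCAAGGCTTACGATCGA"  # made-up example sequence
fragments = shotgun_fragments(genome, copies=3, min_len=4, max_len=8)
print(fragments)
```

The assembly problem is the reverse direction: given only `fragments`, recover `genome`.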
Simplest version
Like a jigsaw puzzle, except that we match overlaps rather than adjacencies. Assume that the shortest assembled string (the shortest superstring) is the correct solution. Assume also that we know the orientation of each fragment and the approximate length of the correct answer. (Real-world considerations.)
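
Matching overlaps can be sketched in a few lines. The helper names `overlap` and `merge` are illustrative choices, not part of the original slides, and the sketch assumes exact (error-free) matches and known orientation, as in the simplest version above.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b):
    """Join `a` and `b`, writing their overlapping part only once."""
    return a + b[overlap(a, b):]

# "ACGTTG" ends with "TTG" and "TTGCAA" starts with "TTG".
print(overlap("ACGTTG", "TTGCAA"))  # 3
print(merge("ACGTTG", "TTGCAA"))    # ACGTTGCAA
```

Note that `overlap` is not symmetric: the order of the two fragments matters, which is what makes the puzzle an ordering problem.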
Real world complications
This model is too optimistic to be realistic. Problems: errors in reading fragments; contamination (chimeras); a fragment could come from either strand; repeats; inverted repeats.
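
The "either strand" problem means a fragment may have to be compared both as read and as its reverse complement. A minimal sketch, assuming sequences use only the uppercase letters A, C, G, T (the function name `reverse_complement` is our own):

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("AAGC"))    # GCTT
# A sequence equal to its own reverse complement reads the same on both strands.
print(reverse_complement("GAATTC"))  # GAATTC
```

An inverted repeat is a sequence that reappears elsewhere as its reverse complement, so an assembler that tries both orientations of every fragment can confuse the two occurrences.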
The coverage problem
Coverage may be incomplete, leaving disconnected ‘contigs’. Or we may have complete coverage but not know it for sure.
Shortest superstring problem (SSP)
Input: A collection F of strings.
Output: A shortest possible string S such that every f in F is a substring of S (that is, S is a superstring of every f).
Theorem: SSP is NP-hard (its decision version is NP-complete).
Fact: approximation algorithms for SSP are of no known practical value.
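
To make the definition concrete, here is a brute-force sketch that solves SSP exactly on tiny inputs by trying every ordering of the fragments. It is an illustration of the problem statement, not a practical assembler: the running time grows like n! in the number of fragments. The function names and the example fragments are our own.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shortest_superstring(fragments):
    """Exact SSP by exhaustive search over fragment orderings (tiny inputs only)."""
    frags = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    # A fragment contained in another is covered for free, so drop it.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for f in order[1:]:
            s += f[overlap(s, f):]  # glue on f, reusing the maximal overlap
        if best is None or len(s) < len(best):
            best = s
    return best

F = ["ACGTTG", "TTGCAA", "CAAGGC", "GTT"]
S = shortest_superstring(F)
print(S, len(S))  # ACGTTGCAAGGC 12
```

Once contained fragments are removed, some ordering merged with maximal overlaps always attains the optimum, which is why searching over orderings is enough.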
Does motivation trump solution?
Biologist: ‘Find an efficient algorithm which solves my problem.’
Computer scientist: ‘Give me a problem which I can solve efficiently.’
Culture clash: What happens when neither is possible?