Bioinformatics: Buzzword or Discipline (???)
Outline of the course
  Analysis of one DNA sequence: shotgun sequencing, Markov-chain modeling, patterns and repeats.
  Analysis of multiple DNA or protein sequences: dynamic programming alignments, substitution matrices.
  BLAST: algorithm for sequence retrieval and comparison.
  Refresher on Markov chains: capsule theory, Markov-chain Monte Carlo algorithms.
  Hidden Markov models: Viterbi algorithm and its applications.
  Evolutionary models: models of nucleotide mutation and substitution, recombination and genetic drift, with applications to genome evolution and gene mapping.
  Molecular phylogenetics (tree making): distance matrix, maximum likelihood and parsimony.
  Special topics: gene and protein networks, analysis of DNA-microarray data, …
30,000 genes make up only 3% of the genome (BCM-HGSC)
Genome Sizes (base pairs)
  Human          3.0 x 10^9
  Mouse          3.0 x 10^9
  Drosophila     1.1 x 10^8
  Worm           1.0 x 10^8
  Dictyostelium  3.4 x 10^7
  Yeast          1.2 x 10^7
  Bacteria       1.0 - 5.0 x 10^6
Shotgun Sequencing
  High-accuracy sequence: < 1 error per 10,000 bases
The Human Genome: 3 Billion Base Pairs
  Whole Genome Shotgun Strategy
  Genome: 3 billion bases
  Libraries of clones: 3 kb, 10 kb, 50 kb
  DNA sequence reads: 500 bases each (e.g. AGGCTCACTG)
Statistical issues in shotgun strategy
  Model for the random fragments: Binomial/Poisson process
  Coverage of sequence by random fragments
  Mean number of contigs
  Mean size of contigs
  Coverage by anchored contigs
Binomial/Poisson Process
  N fragments, each of length L, are scattered at random in an interval of length G.
  Coverage: a = NL/G.
  Contig: a union of overlapping fragments. We want the contigs to cover as much of G as possible.
  Pr[#frags with left end in (x, x+h) = k] “is” binomial(N, h/G), or approximately Poisson(Nh/G) (when?).
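A quick numerical way to explore the "(when?)" prompt: the sketch below compares the binomial(N, p) and Poisson(Np) probabilities, with p standing for h/G. The three (N, p) pairs are illustrative assumptions chosen so that Np = 5 in every case; they are not taken from the slides.

```python
import math

def binomial_pmf(k, n, p):
    """P[X = k] for X ~ Binomial(n, p)."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P[X = k] for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_pmf_gap(n, p, k_max=20):
    """Largest |binomial - Poisson| probability over k = 0..k_max."""
    lam = n * p
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
        for k in range(k_max + 1)
    )

# Hypothetical settings with the same mean N*p = 5 but shrinking p = h/G.
for n, p in [(10, 0.5), (100, 0.05), (1000, 0.005)]:
    print(f"N = {n:5d}  h/G = {p:<6}  max pmf gap = {max_pmf_gap(n, p):.5f}")
```

The gap shrinks as N grows and h/G shrinks with Nh/G held fixed, which is the regime in which the Poisson approximation is usually trusted.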
Mean number of contigs
  E[#contigs] = N Pr[a frag is rightmost in a contig]
              = N Pr[frag does not include the left end of any other frag]
              = N exp(-NL/G)
              = (aG/L) exp(-a)
  Example: L = 800, G = 100,000
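The formula can be tabulated directly. The sketch below evaluates E[#contigs] = N exp(-NL/G) for the slide's example values L = 800 and G = 100,000 and locates the number of fragments at which it peaks; the function name and the scanned range of N are assumptions made for illustration.

```python
import math

def expected_contigs(N, L, G):
    """E[#contigs] = N * exp(-N*L/G), boundary effects ignored."""
    a = N * L / G  # coverage
    return N * math.exp(-a)

L, G = 800, 100_000  # example values from the slide

# Scan a range of fragment counts (range chosen for illustration only).
values = {N: expected_contigs(N, L, G) for N in range(1, 1001)}
N_peak = max(values, key=values.get)

print("peak at N =", N_peak, "coverage a =", N_peak * L / G)
print("E[#contigs] at the peak =", round(values[N_peak], 2))
```

With these values the peak falls at N = G/L, that is at coverage a = 1; beyond that point additional fragments merge contigs faster than they create new ones.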
Mean contig size
  E[S] = E[#frags - 1] * E[inter-epoch distance] + L
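One possible way to turn this decomposition into numbers, under assumptions that are not stated on the slide: in the Poisson model each fragment is rightmost in its contig with probability exp(-a), so the number of fragments per contig is taken to be geometric with mean exp(a); and the distance between consecutive left ends inside a contig is taken to be an exponential gap with rate a/L, conditioned on being shorter than L. With those two ingredients the product collapses to L (exp(a) - 1) / a, which the sketch checks numerically.

```python
import math

def mean_contig_size(a, L):
    """E[S] = E[#frags - 1] * E[inter-epoch distance] + L.

    Assumptions (not from the slides): geometric fragment count per
    contig, exponential gaps of rate a/L conditioned on being < L.
    """
    frags_minus_one = math.exp(a) - 1.0
    # Mean of an exponential(rate a/L) variable given that it is < L.
    gap = L / a - L * math.exp(-a) / (1.0 - math.exp(-a))
    return frags_minus_one * gap + L

def mean_contig_size_closed_form(a, L):
    """Simplified form of the same expression: L * (exp(a) - 1) / a."""
    return L * math.expm1(a) / a

L = 800  # fragment length, as in the slides' example
for a in (0.5, 1.0, 2.0, 5.0):
    print(a, round(mean_contig_size(a, L), 1),
          round(mean_contig_size_closed_form(a, L), 1))
```

Both columns agree, and the values grow with coverage a.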
Mean contig size
  [Figure: E(S) plotted against coverage a]
Number of anchored contigs
  #anchors = M, #frags = N
  a = NL/G, b = ML/G
  E[#anchored contigs] = N b [exp(-a) - exp(-b)] / (b - a)
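A small sketch for evaluating the anchored-contig formula. The function name, the example values of N and M, and the handling of the case b = a (where the formula reads 0/0 and is replaced here by its limit N a exp(-a)) are assumptions added for illustration.

```python
import math

def expected_anchored_contigs(N, M, L, G):
    """E[#anchored contigs] = N*b*(exp(-a) - exp(-b)) / (b - a)."""
    a = N * L / G  # fragment coverage
    b = M * L / G  # anchor density per fragment length
    if math.isclose(a, b):
        # Limit of the formula as b -> a (assumed handling, see lead-in).
        return N * a * math.exp(-a)
    return N * b * (math.exp(-a) - math.exp(-b)) / (b - a)

L, G = 800, 100_000  # example values used earlier in the slides
N = 125              # hypothetical fragment count, giving a = 1

for M in (25, 125, 250, 625, 2500, 100_000):
    print(M, round(expected_anchored_contigs(N, M, L, G), 2))
```

For this choice of N the value rises with M, passes through a maximum, and then settles toward N exp(-a) as anchors become dense.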
Conclusions
  Expected number of contigs first increases, then decreases with coverage.
  Expected size of contig increases with coverage.
  Expected number of anchored contigs first increases, then decreases with anchor density.
  Attention: Computations do not involve boundary effects.