1
Lior, Bernd & Seth Algebraic Statistics for Computational Biology
2
What is Biology? The study of living organisms.
What is Statistics? The science concerned with the collection, organization, analysis and interpretation of data.
What is Algebra? The part of mathematics that deals with generalized arithmetic.
3
What is Algebraic Statistics?
4
There is no dictionary definition yet. The term was coined by European statisticians interested in applying Gröbner bases to the design of experiments. Their book is: G. Pistone, E. Riccomagno and H. Wynn, “Algebraic Statistics: Computational Algebra in Statistics”. CRC Press, 2000.
5
New book: Algebraic Statistics for Computational Biology
Edited by Lior Pachter and Bernd Sturmfels
Cambridge University Press, 2005

Table of Contents

Part I - Introduction to the four themes
1. Statistics
2. Computation
3. Algebra
4. Biology

Part II - Studies on the four themes
5. Parametric Inference
6. Polytope Propagation on Graphs
7. Parametric Sequence Alignment
8. Bounds for Optimal Sequence Alignment
9. Inference Functions
10. Geometry of Markov Chains
11. Equations Defining Hidden Markov Models
12. The EM Algorithm for Hidden Markov Models
13. Homology Mapping with Markov Random Fields
14. Mutagenic Tree Models
15. Catalog of Small Trees
16. The Strand Symmetric Model
17. Extending Tree Models to Split Networks
18. Small Trees and Generalized Neighbor Joining
19. Tree Construction Using Singular Value Decomposition
20. Application of Interval Methods to Phylogenetics
21. Analysis of Point Mutations in Vertebrate Genomes
22. Ultra-Conserved Elements in Vertebrate and Fly Genomes
6
Algebraic Statistics for Computational Biology Group
Department of Mathematics, U.C. Berkeley
Photo courtesy of Robert Fisher
Lawrence Hall of Science, March 7th, 2005
http://math.berkeley.edu/~lpachter/ascb/
7
Who is this girl?
TAGAGACGGGGGTTTCACAATGTTGGCCA
Her name is DiaNA. She makes DNA sequences.
8
The human genome
Consists of 2.8 billion DNA bases.
Sequenced in 2001 and finished in 2004.
Contains genes:
- these are subsequences which code for protein.
- estimated number of genes: 20,000-25,000.
- genes make up less than 5% of the genome.
Example: Breast-ovarian cancer susceptibility gene (BRCA1)
9
The human genome
11
>hg17_dna range=chr17:38464686-38473085 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=noneATCCAGAAGTCTAGTATACATCTCAAAATTCATGCATCTGGCCGGGCACAGTGGCTCACACCTGCAATCCCAGCACTTTGGGAGGCCGAGGTGGGTGGATTACC TGAGGTCAGGAGTTTAAGACCAGCCTGGCCAACATGGTAAAACCCCATCTCTACTAAAAATACAAGTATTAGCCAGGCATTGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGGCT GAGGCAGGAAAATCACTTGAACCGGGAGGCGGAGGTTGGAGTGAGCTGAGATCGTGCTACCGCACTCCATGCACTCTAGCCTGGGCAACAGAACGAGATGCTGTCACAACAACAAC AACAACAACAACAACAACAACAACAACAACAACAAATTCTCACATCTAAAACAGAGTTCCTGGTTCCATTCCTGCTTCCTGCCTTTCCCACTCCCCCATATTCCCTACCATGCCTTCTTC ATCTAATTTAATATTACTAACAAGATCTATTGTTCAAGCCAAAACCCAAGTGTCACTCCTTCAATTTCTCTTTACCTTATCCTCCAAATTTAATCCATTAGCAAGTCCTCTCTTCAAACCCA TCCCAAACCAACCTTGTTTTTAACCATCTCCACACCACCAATTACCACAAGGATAAAATCTGAATTCCTTACCACCAAATACTATGTGATCTGGCCCTCATCTATGACCTTCTCCCATTCC TTGTGTAATCTCTGCCTCCACACATAATTTGCAAATTACTCCAGCTACACTGGCCTATTATTATTATTATTATTATTTTTGAGACGGAGTCTTGCTCTTTCGCCCAGCCTGGAGTGCAGTG GCGCAATCTCAGCTCACTGCAATCTCCGCCTCCTGGGTTCAAGCGATTCTCCTGCCCCAGCCTCCCAAGTAGCTGTGATTACAGGCACATGCCACCATTCCCAGCTAATTTTTTTTTGT TTTTGAGATGGAGTTTCACTCTTGTTGCCCAGGCTGGAGTGCAATGGTGCGATCTCAGCTCACCACAACCTCCACCTCCCGGGTTGATGAAGTGATTCTCTTGTCTCAGCCTCCCGTG TAGCTGGGATTAGAGGCACGCGCCACCACGCTGGGCAAATTTTTGTATTTTTAGTAGAGACAGGGTTTCTACCTCAGTGATCTGTCCGCCTTGACCTCCCAAAGTGCTGGGATTACAG GAATGAGCCACCACACCCAGCCGTGCCCAGCTAATTTTTGCATTTTTTAGTAGAGATGGGGTTTTGCCACGTTGGCCAGGCTGGTCTCAAACTCCTGACCTCAGGGGATCTGCCTGC CTCGGCCTCCTAGAGTGCTGGAATTACAGGTGTGAGCCACTGTGCCCGAACCTTTTATCATTATTATTTCTTGAGACAGGAGTCTTGCTCTGTCGTTCAGGCTGGAGTGCAGTGATGC GATCTTGGCTCACTGTAACTCCTACCTTTCGGTTCAAGTGATTCTCCTGCCTCAGCCTCTGGAGTAGCTGGGATTACAGGCACTGGGATTACAGGCACACACCACCACACCATGCTAG TTTTTTGTATTTTTAGTAGAGATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATTTGCCTGCCTTGGCTTCCCAAAGTGCTGGGATTATAGGCACGAG CCACCACACACGACCAACATTGGCCTATCTTTTAAAAAATAAACCAAGCTCTGGCCGGGCACAGTGGCTCACACCTGTGATCCCAGCACTTTGGGAGGTTGAGGTGGTTGGATCACTT GAGTTCAGGAGTTTGAGACCAGCCTGACCAACGTGGTAAAACCCCATCTCTACTAAAAATAAAAACTAGTCGGGTGTGGTAGCACGCGTGCCTGTAATACCAGCTACTCAGGAGGCC 
AAGGCAGGAGAATTGCTTGAACCCAGGAGACAGAGTTTGCAGTGAGCCAAGATTGTGCCACTGCACTCCAGCCTGGGGGATAGAGGGAGACACCATCTCAAAAAAACCAAAATACA GAAATCAAAAAACCACACTCATTATTACCTCAAGACCTTTATGTTTGCTATTCCTCTGCCTATAAGATGCATTCCCTTCATTTTTCAAGGACAATTATTTCTTGTTATTTAGGTCTCAGCTC AATTTTTTCAGAAAGGCTTTCCCTGGCCTCCTTAAACGAAAGTAATCAACAACCTTTGACAGCTAATACTATTCCACTGTTCTGTATATTTCTCCATAGCATTTATTGTTATCTTAAATTCA TCTTTATTGTGTATCTCCCCTCGACAGAACCTGAATCCTACCAGGGACTTAGTTAGTCTTATTTACTGTTGCATTCCTAGTGCCCAGAACACAGTAGGCTCCCAATAAATAGCCACTGAA TAAAAGTTAAAACCAACAAAAATAATCATTTAATTAATTATGAATACATCGAATTGTGCACAATAGTTTATAAAATTACTTTTTTTTTTTTTTTAAGACAGGGTCTCATTCTGTCTCACAGGC TGGAGTGCAGTGGTGCAATCTAGGCTCACTGCAACCTCCGCCTCCCGGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCCAGCAGCTAGGATTACAGGCACATGCCACCACGCTCGAC TAATTTTTTTGTGTTTTTAGTAGAGACAAGGTTTCACCATGTTGACCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATCCACCTGCCTTGGCCACTCAAAGTGCTGGGATTATAGGCA TGAGCCACCACGCCTGGCCTATAAAATTACTTTCACATTTCATTTTGCCTGATCTGTTGTCACAGAAGTTCTCAGATGGCTGTTCTGAAATTATTCCTCCTCCTACACTCTATCTTATTTA CTTCTCACTGTTCTCAGTATCATAAAGTGCAACATCTTTTTGAAGCAATCTGAATTATAAACAGATACATTTGCATGTATATATATGTATATATGCATATGCACACACACACTTTTTTTTTTT TAAGAGACAGGGTCTTGCTCTGTGCAAGTGCAAGAGTGCAATGGTATGATCATAGCTCACTGCAGCCTTGAACTCCTGGGCTCAAGTGATTCTTCTGGCTTAGCTTCCTCAGTAGCTA AGACTACAGAAGCACACTGCCATGCCCGGCTAATTAAAAAAAAATTTTGTGGAGACAGAGTCTCACTATGTTGCCCAGGCTGGTTTCAAACTCCTGGCCTCAAGTAATCTTCCTGTCTC AGCCTCCCAAAGGGCTGAGATTATAAGTGTGAGCCACTGCATCTGGACTGCATATTAATATGAAGAGCTTTTCTTCAACAACAGTGAACAGTTTTCTACAAAGGTATATGCAAGTGGGC CCACTTCTTGTTCTTATGAATCTTTTCTTTCCTTTTATAAAACTCCTTTTCCTTTCTCTTTTCCCCAAAGAAAGGACTGTTTCTTTTGAAATCTAGAACAAATGAGAACAGAGGATATCCTG GTTTGCGCTGCAAAATTTTTTTTTTTTTTAAGACGGAGTCTCGCTCTGTTGCCAGGTTGGAGTGCAGTGGCACGATCTTGGCTCATTGCAACCTCCACCTCCCGGGTTCAAGAGATTCT CCTGCCTCAGCCTCCTGAGTAGCTGGAACTAAAGGCGCATGCCACCACGCTGAGTAATTTTTTGTATTTTAGTAGAGACAGGGTTTCACCATGTTGCCCAGGCTGATCTCGAACTCCT GAGCTCAGGCAATCTGCCTGTCTTGGCCTCCCACAGTGTTAGGATTACAGGCATGAGCCACTGCACCCGATTTTTTTTTTCTTTTGATGGAGTTTTGCTCTTGTTGCCCAGGTTAGAGT 
GCAATGATGCGATCTCAGCTCACTGCAACCCCCGCCTCCCAGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGAATTACAGGCAAGTGCCACCAAGCCCGGCTAATTT TGTATTTTTAGTAGAAACGGGGTTTCTCCATGTTGGTCAGGCTGGTCTTGAACTCCCGACATCAGGTGATCCAAGCGCCTCAGCCTCCCAAAGCGCTGGGATTATAGGTATGAGCCAC AGTGCAGGCCTGCATAATTCTTGATGATCCTCATTATCATGGAAAATTTGTGCATTGTTAAGGAAAGTGGTGCATTGATGGAAGGAAGCAAATACATTTTTAACTATATGACTGAATGAA TATCTCTGGTTAGTTTGTAACATCAAGTACTTACCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCCTAAAGAGATCATAGAAAAGACAGGTTACATACAGCAGAAGAACG TGCTCTTTTCACGGAGATAGAGAGGTCAGCGATTCACAAAAGAGCACAGGAAGAATGACAGAGGAGAGGTCCTTCCCTCTAAAGCCACAGCCCTTTAATAAGGCTTGTAGCAGCAGT TTCCTTCTGGAGACAGAGTTGATGTTTAATTTAAACATTATAAGTTTGCCTGCTGCACATGGATTCCTGCCGACTATTAAATAAATCCCTAGCTCATATGCTAACATTGCTAGGAGCAGA TTAGGTCCTATTAGTTATAAAAGAGACCCATTTTCCCAGCATCACCAGCTTATCTGAACAAAGTGATATTAAAGATAAAAGTAGTTTAGTATTACAATTAAAGACCTTTTGGTAACTCAGA CTCAGCATCAGCAAAAACCTTAGGTGTTAAACGTTAGGTGTAAAAATGCAATTCTGAGGTGTTAAAGGGAGGAGGGGAGAAATAGTATTATACTTACAGAAATAGCTAACTACCCATTT TCCTCCCGCAATTCCTAGAAAATATTTCAGTGTCCGTTCACACACAAACTCAGCATCTGCAGAATGAAAAACACTCAAAGGATTAGAAGTTGAAAACAAAATCAGGAAGTGCTGTCCTA AGAAGCTAAAGAGCCTCAGTTTTTTACACTCCCAAGATCAATCTGGATTTATGATTCTAAAACCCCTGGTGACAGAATCAGAGGCTGAAAACACCACTAATTATAACCAGCAGGTATGG ATATTTGGAAGTCTAGGGGAGGCTGATATGAAGTTAAGACCAGAGGAAATATCTGTCCACTCCCTCTTCTCAACACCCATCTTCTAGACGCCAAGGCTAGCTATAGATCTCCATTATAG TGTTCAAGGAATTAGGAATTATCCATGTCAATAGTTTTGATTAATGTGGACGGAGAACATCTATATTACTAGATGGCAATATGTGAAAGAAGAAAACAGTATTGTTGAAAACCTAAATCT GAAATGTCAATGTAATGACAAATTTTCACCCCTAGAATGTCTACCTGGGGAGTCCTAACCCTCTAATATTCCCCTGAGAGGGATGGGAGAATACAGTGCAGAGCTTTTATATAAGTATT TCAGAAAGCAGTAGCTAAAGAATCACTTGTTTATTTCCCAGTGTTTCAAAGGCCCTTCTGAAGAACTAAGCAAACTAAGGAAAGACCATTTAGTTTTAAACAGGAGAAATGTATTTAACT AAATCCTAAACACAGCAGGCTATCTGCAAGCAGCAGCAGCAGCAGCAGCCATGCTCCCTCACAGAATCCTTACAATTTTTGAAGTTTTTTGTTTAACTGCTACAAAAGCCGATTTAGTA ACATTTATTACACTTAAAAACTTCAGTTCATTTGTAGTTCAAAGCAAATGTATTGGCTTTGAGTTTAAAGACTGAACTACTTTAGATTTGATTTGCATTTTTTTTTTTTTTTTTTTTTGAGATG 
CAGTCTTGCTCTGTCAGCCAGGCTGGAGTGCAGTGGCTGGATCTCAGCTCACGGCAAGCTCTGCCTCCTGGGTTCATGCCATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGACTAC AGATGCCCGCCACCATGCCCGGCTAATTTTTTGTATTTTTACTAGAGATGGGGTTTCACCGTGTTAGCCAGGATGGTCTCGATCTCCTGACCTCGTGATCTGCCCGCCTTGGCCCCCC AAAGCGCTGGGATTACAGGCCTGAGCCACCACGCTTGGCATCTTTTTACCTTTCATTAACTTTGATGCAAACCTATAGCTTAAGGTATCTTAAACTTTAATGACATTTTTCTCTAAAATAG TAGTTTGTAATAACTTGTTCTGGCACCTGGCTCCAATGAACACTACCCTCTGACCCTGTGGTATAATTTTCATGAGTAAGTGGAAACCTAAGATCTTAGAAGTTCAACGGCAATGTGTCC AAGGGGTTTAGATCCTCTCCTTAAGTGCCTGTATCTCTGTGAAAAGAATCATCATAGGCTAGGCGCGATGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGTAGGTGGAT CACCTGAGGTCGGGAGTCCAAGACCAGCCTGACTGACATGGAAAAACCCTGTCTCTACTAAAAATACAAAATTAGGTATGGTGGTGCATTCCTGTAATCCCAGCTACTCGGGAGGCTG AGGCAGGAGAATCGCTTGAACCCGGGAGGGGGAGGTTGCAGCAAGCCAAGATCGTGCCATTGCACTCCAGCAGCCTGGGCAACAAGAGTGAAAAACTACACCTCAAAAACAAAAAC AAAAACAAAAGAATCATCATCAAGTGAACTGGAACACATCCAGAGAACTAATTTTGTTAGAAAGATTTTAGAGTTGAGCCACACAATCTGCATCTTCTGCGTCCTCCATGCACTCGTCTG CTTTCTGGAGCCCCATGAGTGAGTCTTAATCCTGTTCCAGATAACAGTTCTCTTCCGGGTAACGGTTCTTCAGATACTTGAAGACAGTGTCTTATTTCCTTAAATCTTCTCATTTCTTCTT CAAAAGACAGTATTTCAAGTTACTTTTATGTATCTTTACCATCTACCTCTGGATAAACACTCTCCAATTTGTCAGTGACCATGTTAAAAACCAAGCACGGTGCTTAAAACTGACATCATCT TTCAGGCAATCACTCCATTGGAGAATACAGTGGGGCTCTGGATCTGTACTTCACTTGCTCCAGAGCCTCTGCTTGTGTTAATACGGCCCAGTTTCAAATAAGCATTTTTAGCAGCCCTG AAATGTGTACTCAGATTTAGTTTATAGTCAACTAAAAACACCCAGAGGTCTCCTGTATTACACAAGTTATAATTAAAACCTTAAAAGAGAAAGGTATAGGACAAATGATCTGTCTCCTCC CTTTTTTGCTTTTTCATATGTTAAGACTATCTCGGAGCTGTTATCAGACTTTTTTCCTGAAAAACTCTCAACAATACTCAAACTAGGTGTTACATGAAGCTGGGGTCTCCAGGTTTTGCCT CACTTGTTCTTTCTTTTGTTGTTGTTGAGACAGAGTCTCACTCTGTCGCCAGGCTGGAGTGCAGTGGCAGGATCTCAGCTGACTGCAACCTCAGCCTCCAGAGTTCAAGCAATTCTTCT GTGTCAGCCTCCCAAGTAGCTGGGATTACAGGTGCACACCACCACGCCCAGCCA
12
Another example of annotation
INPUT: ..t..r…o..p..i..c..a..a..l...g..e..e..t..r..y..
OUTPUT: ..t..r…o..p..i..c..a..a..l...g..e..e..t..r..y..
Annotation is the labeling of the input sequence, in this case with 3 colors: keep, change, delete.
13
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagag
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag
14
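tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagag
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag

Data summary (rows: base in the first sequence; columns: base in the second sequence at the same position)

     a   c   g   t
a    9   4   3   4
c    4   4   6   4
g    3   2   5   2
t    5   9   4  13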
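The data summary can be reproduced by counting, position by position, which pair of bases the two sequences show. The sketch below is one way to do that; the function name `pair_counts` and the convention that rows index the first sequence are choices made here for illustration, not something stated on the slide.

```python
from collections import Counter

# The two 81-base sequences shown on the slide.
SEQ1 = "tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagag"
SEQ2 = "tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag"
ALPHABET = "acgt"


def pair_counts(x, y, alphabet=ALPHABET):
    """Return u with u[i][j] = number of positions where x shows
    alphabet[i] and y shows alphabet[j] (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    counts = Counter(zip(x, y))
    return [[counts[(a, b)] for b in alphabet] for a in alphabet]


table = pair_counts(SEQ1, SEQ2)
for base, row in zip(ALPHABET, table):
    print(base, row)
```

The 16 counts add up to 81, the common length of the two sequences.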
15
Statistics Question: Are the two sequences independent?
Algebra Question: Is the 4x4 matrix close to rank 1?

     a   c   g   t
a    9   4   3   4
c    4   4   6   4
g    3   2   5   2
t    5   9   4  13
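One way to make "close to rank 1" concrete is to compare the observed counts with the rank-1 table built from the row and column sums, which is what independence would predict. The sketch below does this and computes Pearson's chi-square statistic as a distance between the two tables. The choice of this statistic and the helper names are assumptions made for illustration; the slides do not say which measure of closeness is intended.

```python
# Observed 4x4 count table from the slide (rows and columns ordered a, c, g, t).
U = [
    [9, 4, 3, 4],
    [4, 4, 6, 4],
    [3, 2, 5, 2],
    [5, 9, 4, 13],
]


def independence_fit(u):
    """Return (n, row sums, column sums, expected counts under independence).

    The expected table is the outer product of the row and column sums
    divided by n, so it has rank 1 by construction.
    """
    n = sum(map(sum, u))
    row = [sum(r) for r in u]
    col = [sum(c) for c in zip(*u)]
    expected = [[r * c / n for c in col] for r in row]
    return n, row, col, expected


def pearson_chi2(u, expected):
    """Pearson chi-square distance between observed and expected counts."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(u, expected)
        for o, e in zip(obs_row, exp_row)
    )


n, row, col, expected = independence_fit(U)
print("row sums:", row)
print("column sums:", col)
print("chi-square statistic:", pearson_chi2(U, expected))
```

A small statistic means the observed matrix sits near the rank-1 table. For a 4x4 table the usual reference distribution has (4-1)(4-1) = 9 degrees of freedom.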
16
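The independence model
m = 16 observable states: $\{A,C,G,T\}^2$
d = 6 unknown parameters: $\theta = (\beta_A, \beta_C, \beta_G, \beta_T, \gamma_A, \gamma_C, \gamma_G, \gamma_T)$, where $\beta_A + \beta_C + \beta_G + \beta_T = \gamma_A + \gamma_C + \gamma_G + \gamma_T = 1$
Independence means probabilities factor: $p_{AG} = \mathrm{prob}(A,G) = \beta_A \cdot \gamma_G$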
17
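The independence model
m = 16 observable states: $\{A,C,G,T\}^2$
d = 6 unknown parameters: $\theta = (\beta_A, \beta_C, \beta_G, \beta_T, \gamma_A, \gamma_C, \gamma_G, \gamma_T)$, where $\beta_A + \beta_C + \beta_G + \beta_T = \gamma_A + \gamma_C + \gamma_G + \gamma_T = 1$
Independence means probabilities factor: $p_{AG} = \mathrm{prob}(A,G) = \beta_A \cdot \gamma_G$
The model is the polynomial map $f: \mathbb{R}^6 \to \mathbb{R}^{16}$, $\theta \mapsto (p_{ij})$ with $p_{ij} = \beta_i \cdot \gamma_j$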
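The polynomial map of the independence model can be written out directly: 3 + 3 free parameters go in, 16 joint probabilities come out. The sketch below assumes that the last entry of each marginal distribution is determined by the other three, which is one way to arrive at d = 6. The names `independence_map`, `beta_free` and `gamma_free` are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def independence_map(beta_free, gamma_free):
    """Map 6 free parameters to the 16 joint probabilities p[xy].

    beta_free and gamma_free hold the probabilities of A, C, G for the
    first and second sequence; the probability of T is 1 minus their sum.
    """
    beta = list(beta_free) + [1 - sum(beta_free)]
    gamma = list(gamma_free) + [1 - sum(gamma_free)]
    return {
        a + b: beta[i] * gamma[j]
        for (i, a), (j, b) in product(enumerate(BASES), repeat=2)
    }


p = independence_map([0.1, 0.2, 0.3], [0.25, 0.25, 0.25])
print(len(p), "coordinates, summing to", sum(p.values()))
print("p[AG] =", p["AG"])
```

Each coordinate is a product of one parameter per sequence, so the 4x4 matrix of outputs has rank 1 for every parameter choice.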
18
Models for discrete data
A statistical model is a parameterized family of probability distributions.
d = number of parameters
m = number of observable states
$\Theta \subseteq \mathbb{R}^d$ = the parameter space
$\Delta \subseteq \mathbb{R}^m$ = probability simplex on the m states
19
The geometry of maximum likelihood estimation
[Figure labels: data, parameter space, probability simplex]
20
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagagatc
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag
Observed data
21
tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCctaaagagatc
tccgggattagtctgt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACTTGATCACCCctgaagagaag
** * ***** *** ** * **** *** ** **** * *********** ******* * ******** ******
Hidden data
22
The alignment problem is to find the shortest path from start to finish in the alignment graph. This is solved with dynamic programming, and the method is known in computational biology as the Needleman-Wunsch algorithm.
Example: n=5, m=4
gttta-
gt--gc
**
[Figure: alignment graph with axes labeled g t t t a and g t g c, and nodes marked start and finish]
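The shortest-path formulation can be sketched as a short dynamic program over the alignment grid. The costs used here (0 for a match, 1 for a mismatch, 1 for a gap) are assumptions chosen for illustration. The slides do not state their scoring scheme, so the optimal alignment found below need not coincide with the one drawn on the slide.

```python
def needleman_wunsch(x, y, mismatch=1, gap=1):
    """Return (cost, aligned_x, aligned_y) for a minimum-cost global alignment.

    dist[i][j] is the length of the shortest path from the start node (0, 0)
    to node (i, j) of the alignment graph.
    """
    n, m = len(x), len(y)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap
    for j in range(1, m + 1):
        dist[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else mismatch
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub,  # diagonal edge: match or mismatch
                dist[i - 1][j] + gap,      # gap in y
                dist[i][j - 1] + gap,      # gap in x
            )

    # Trace one shortest path back from finish to start.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = mismatch
        if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
            sub = 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + sub:
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + gap:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return dist[n][m], "".join(reversed(top)), "".join(reversed(bottom))


cost, top, bottom = needleman_wunsch("gttta", "gtgc")
print(cost)
print(top)
print(bottom)
```

With these unit costs the shortest path for the slide's example (n=5, m=4) has length 3. Other cost choices generally give other optimal alignments.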
23
The algebraic statistical model for sequence alignment, known as the pair hidden Markov model, is the image of a map whose coordinates are polynomials with one term for each path in the alignment graph. The logarithms of the 33 parameters give the edge lengths for the shortest path problem on the alignment graph.
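To get a feel for how many terms such a polynomial has, one can count the paths from start to finish. The sketch below assumes the alignment graph is the usual grid in which each node has a horizontal, a vertical and a diagonal incoming edge. That structure is taken from the Needleman-Wunsch setting rather than spelled out on this slide.

```python
def count_alignment_paths(n, m):
    """Count start-to-finish paths in an (n+1) x (m+1) alignment grid."""
    paths = [[1] * (m + 1) for _ in range(n + 1)]  # one path along each border
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1] + paths[i - 1][j - 1]
    return paths[n][m]


print(count_alignment_paths(5, 4))  # 681
```

Even for the small example with n=5 and m=4 there are 681 paths, so writing the polynomial out term by term quickly becomes impractical, while the recursion stays cheap.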
24
General Mathematical Framework
Statistical models are algebraic varieties.
Algebraic varieties can be tropicalized.
Tropicalized models are useful for MAP inference in statistics.
L. Pachter and B. Sturmfels, Tropical Geometry of Statistical Models, Proceedings of the National Academy of Sciences 101:46 (2004), pp. 16132-16137.
L. Pachter and B. Sturmfels, Parametric Inference for Biological Sequence Analysis, Proceedings of the National Academy of Sciences 101:46 (2004), pp. 16138-16143.
25
2.1. Tropical arithmetic and dynamic programming In tropical algebraic geometry, varieties are piecewise linear…
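The link between tropical arithmetic and dynamic programming can be illustrated by running one recursion in two semirings. With ordinary (+, ×) the recursion evaluates a sum over all alignment paths of products of edge weights. Replacing + by min and × by + turns the same recursion into a shortest-path computation. The edge costs and the helper name `path_polynomial` below are assumptions made for this sketch.

```python
import math

# Hypothetical edge costs; weights in ordinary arithmetic are exp(-cost).
COST = {"match": 0.0, "mismatch": 1.0, "gap": 1.0}


def path_polynomial(x, y, weight, add, mul, one):
    """Evaluate, in the semiring (add, mul), the sum over all alignment
    paths of the product of the edge weights along each path."""
    n, m = len(x), len(y)
    val = [[None] * (m + 1) for _ in range(n + 1)]
    val[0][0] = one
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0:
                terms.append(mul(val[i - 1][j], weight("gap")))
            if j > 0:
                terms.append(mul(val[i][j - 1], weight("gap")))
            if i > 0 and j > 0:
                kind = "match" if x[i - 1] == y[j - 1] else "mismatch"
                terms.append(mul(val[i - 1][j - 1], weight(kind)))
            total = terms[0]
            for term in terms[1:]:
                total = add(total, term)
            val[i][j] = total
    return val[n][m]


# Ordinary arithmetic: sum of products.
ordinary = path_polynomial(
    "gttta", "gtgc",
    weight=lambda kind: math.exp(-COST[kind]),
    add=lambda a, b: a + b,
    mul=lambda a, b: a * b,
    one=1.0,
)

# Tropical arithmetic: min of sums, i.e. the shortest path length.
tropical = path_polynomial(
    "gttta", "gtgc",
    weight=lambda kind: COST[kind],
    add=min,
    mul=lambda a, b: a + b,
    one=0.0,
)

print("sum over paths:", ordinary)
print("shortest path :", tropical)
```

The tropical value is the cost of the single best path, while the ordinary value collects contributions from every path, so its negative logarithm can never exceed the tropical value.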
26
Comparative Genomics
human tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCctaaagagatc
rat   tccgggattagtctgt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACTTGATCACCCctgaagagaag
** * ***** *** ** * **** *** ** **** * *********** ******* * ******** ******
27
Comparative Genomics
Human tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCA
Chimp tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCA
Mouse tcccagatcagttcgt---atcaggtacccacCACATTCAGAAGTCTTCTTTCTTGGATAGACCGGACCA
Rat   tccgggattagtctgt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACTTGATCA
Dog   tttctgattcgtttgtaacattgagtacctacCTCATCTAGTATCTTTCTTTCTTTAATAGACTGGGTTA
* * * ** ** ** **** *** ** ** * ********* ****** * *
A phylogenetic tree on 5 taxa.
28
Comparative Genomics
Human tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCA
Chimp tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCA
Mouse tcccagatcagttcgt---atcaggtacccacCACATTCAGAAGTCTTCTTTCTTGGATAGACCGGACCA
Rat   tccgggattagtctgt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACTTGATCA
Dog   tttctgattcgtttgtaacattgagtacctacCTCATCTAGTATCTTTCTTTCTTTAATAGACTGGGTTA
* * * ** ** ** **** *** ** ** * ********* ****** * *
Petersen graph parametrizes trees on 5 taxa.
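The statement about the Petersen graph can be checked by brute force. The sketch below builds the graph in its standard form, with the 2-element subsets of the 5 taxa as vertices and an edge between two subsets whenever they are disjoint. Reading each edge as the unrooted binary tree whose two cherries are the two subsets is the usual interpretation; the numbering of the taxa and the helper name `tree_from_edge` are assumptions made here.

```python
from itertools import combinations

TAXA = [1, 2, 3, 4, 5]

# Vertices: the 10 pairs of taxa (each pair is a possible cherry).
vertices = [frozenset(pair) for pair in combinations(TAXA, 2)]

# Edges: two pairs are joined when they are disjoint.
edges = [(a, b) for a, b in combinations(vertices, 2) if not (a & b)]


def tree_from_edge(edge):
    """Describe the 5-taxon tree of an edge as (cherry, middle leaf, cherry)."""
    a, b = edge
    (middle,) = set(TAXA) - a - b
    return tuple(sorted(a)), middle, tuple(sorted(b))


degree = {v: sum(v in e for e in edges) for v in vertices}
print(len(vertices), "vertices,", len(edges), "edges")
for edge in edges:
    print(tree_from_edge(edge))
```

The 15 edges match the 15 unrooted binary trees on 5 labeled leaves: there are 5 choices for the middle leaf and 3 ways to split the remaining four leaves into two cherries.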
29
Trees are Ubiquitous in Biology
Fig. 1 of: Antonio Bernardo Carvalho and Andrew G. Clark, "Y Chromosome of D. pseudoobscura Is Not Homologous to the Ancestral Drosophila Y", Science, January 7, 2005.
30
[Figure: trees with leaves labeled 1 to 5]
31
Summer school Themes
TAGAGACGGGGGTTTCACAATGTTGGCCA
Algebra, discrete mathematics and statistics…
…are relevant for genomics…
…and vice versa...
Organ system (digestive)
Organ (liver)
Tissue (liver sinusoid)
Cell (hepatocyte)
Organelle (nucleus)
Molecule (DNA)