In silico reconstruction of an ancestral mammalian genome UQAM Seminaire de bioinformatique Mathieu Blanchette.



The Human genome
[Slide background: a wall of raw nucleotide sequence, CGACTGCATCAGACGACGATCAGACTACTATATCAGCAGATTACGGT...]
- Sequence of ~3 × 10^9 nucleotides
- Complete sequence is known (2001)
HOW DOES IT WORK??

Comparative Genomics
Goal: Functional annotation of the genome
- What is the role of each region of the genome?
- Very hard to answer...
Idea: Look not only at what our genome is now, but also at how it evolved
- Different types of functional regions have different evolutionary signatures
Complete genomes are sequenced for:
- Human, chimp, mouse, rat, house, chicken, zebrafish, pufferfish
Partial genomes are available for:
- Dog, cow, rabbit, elephant, armadillo

Mutations
G(t)   = ACGTAGGCGATCAG---ATCGAT
G(t+1) = ACGAAGG--ATCAGGGGATCGAT
Labeled events: substitutions, deletions, insertions
Other, less frequent mutations:
- Duplications
- Genome rearrangements (e.g. large inversions)
Mutations happen randomly
Natural selection favors mutations that improve fitness
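The three event types on this slide can be read directly off the two aligned rows. The following is a minimal sketch, not part of the talk: the function name `classify_events` and the (kind, position, length) output format are assumptions made for illustration. It treats the first row as the ancestor G(t) and the second as the descendant G(t+1).

```python
from itertools import groupby

def classify_events(ancestor: str, descendant: str):
    """List (kind, start_column, length) events between two aligned rows.

    A gap in the descendant is a deletion, a gap in the ancestor is an
    insertion, and a mismatch between two bases is a substitution.
    """
    if len(ancestor) != len(descendant):
        raise ValueError("aligned rows must have equal length")

    def column_type(a, d):
        if a == "-" and d == "-":
            return "skip"
        if d == "-":
            return "deletion"
        if a == "-":
            return "insertion"
        return "match" if a == d else "substitution"

    events = []
    pos = 0
    columns = zip(ancestor, descendant)
    for kind, run in groupby(columns, key=lambda col: column_type(*col)):
        length = len(list(run))
        if kind in ("deletion", "insertion"):
            # A maximal run of gap columns is counted as one indel event.
            events.append((kind, pos, length))
        elif kind == "substitution":
            # Each mismatching column is counted as its own substitution.
            events.extend(("substitution", pos + i, 1) for i in range(length))
        pos += length
    return events

g_t = "ACGTAGGCGATCAG---ATCGAT"
g_t1 = "ACGAAGG--ATCAGGGGATCGAT"
print(classify_events(g_t, g_t1))
# [('substitution', 3, 1), ('deletion', 7, 2), ('insertion', 14, 3)]
```

Treating each maximal gap run as a single event is a simplification: two adjacent deletions would be merged into one.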

A random walk in genome space

Mammalian evolution
- Rapid radiation ~75 Myrs ago
- Many nearly independent lineages
- Many "noisy" copies of the ancestor
- Accurate reconstruction of ancestors may be feasible

Ancestral Genome Reconstruction Given: - Genomic sequences of several mammals - Phylogenetic tree Find: The genomic sequence of all their ancestors ARMADILLO TGCTACTAATATTTAGTACATAGAGCCCAGGGGTGCTGCTGAAAGTCTTAAAATGCACAGTGTAGCCCCTCCTCC COW GCCTCTCTTTCTGCCCTGCAGGCTAGAATGTATCACTTAGATGTTCCAAATCAGAAAGTGTTCAGCCATTTCCATACC HORSE GTCACAATTTAGGAAGTGCCACTGGCCTCTAGAGGGTAGAAGACAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCC CAT GTCACAGTTTAGGGGGTACTACTGGCATCTATCGGGTGGAGGATAGGGATACTGATAATCATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCC DOG GTCACAATTTGGGGGATACTACTGGCATCTAATGGGTAGAGGACAGGGATACTGATAATTGCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCC HEDGEHOG GTCATAGTTTGATTATATGGGCTTCTTAGTAGACAAAGAAAAAGATGTTCTGGTAGTCATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTC MOUSE GTCACAGTTTGGAGGATGTTACTGACATCTAGAGAGTAGACTTTAAAGATACTGATAGTCACCCCATTGTGCACCTCC RAT GTCACAATTTGGAGGATGTTACTGGCATCTAGAGAGTAGACTTTAAGGACACTGATAATCATACTATGCTGCACTTCC RABBIT ATCACAATTTGGGGAACACCACTGGCATCTCGGGTAGCAGGCCAGGCATGCTGGTAATTATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACC LEMUR ATCACAATTGGGGGTGCCACGGTCCTCCAGTGGGTAGAGAACAGGGAGGCTGATAACCACCCTGCAGTGCACAGGGCAGTGCCCCACTCCCACCAC MOUSE-LEMUR ATCACAGTTGGGGGATGCCACTGGCCTCAAGTGGGTAGAGAACAGGGAGGCTGAAAACCACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCC VERVET GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAGAAACAGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAGAAACAGGAATGCTTATAATCATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCC BABOON GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAAAAACAGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTCGACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCC GORILLA GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGTGGGGATGCTTATACTCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC CHIMP GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC HUMAN GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC 
Mutational operations
- Small scale: substitutions, deletions, insertions (incl. transposons)
- Large scale: genome rearrangements, segmental/tandem duplications
All of it: functional, non-functional, introns, intergenic, repeats, everything (*)!
(*) Heterochromatin not included

Reconstruction algorithm
1) Identify syntenic regions in each species
- Blastz (Schwartz et al.) and chaining/netting programs (Kent et al.)
- In the ENCODE case: targeted BAC sequencing
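To give a feel for what "chaining" means in step 1, here is a minimal sketch under stated assumptions. It is not the Kent et al. program, which uses a more efficient algorithm and realistic gap costs. The `Block` fields, the `best_chain` name, and the flat `gap_cost` are all hypothetical. The idea shown is only the core one: among local alignments between two genomes, keep the highest-scoring subset whose blocks appear in the same order in both.

```python
from typing import List, NamedTuple

class Block(NamedTuple):
    """One gap-free local alignment (hypothetical record layout)."""
    q_start: int   # start in the query genome
    q_end: int     # end (exclusive) in the query genome
    t_start: int   # start in the target genome
    t_end: int     # end (exclusive) in the target genome
    score: float

def best_chain(blocks: List[Block], gap_cost: float = 0.0) -> List[Block]:
    """Quadratic-time dynamic program for the best collinear chain."""
    if not blocks:
        return []
    order = sorted(blocks, key=lambda b: (b.q_start, b.t_start))
    best = [b.score for b in order]   # best chain score ending at block i
    prev = [-1] * len(order)          # predecessor of block i in that chain
    for i, cur in enumerate(order):
        for j in range(i):
            p = order[j]
            # p may precede cur only if it ends before cur in both genomes.
            if p.q_end <= cur.q_start and p.t_end <= cur.t_start:
                cand = best[j] + cur.score - gap_cost
                if cand > best[i]:
                    best[i], prev[i] = cand, j
    i = max(range(len(order)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(order[i])
        i = prev[i]
    return chain[::-1]

blocks = [
    Block(0, 100, 0, 100, 90),
    Block(150, 250, 500, 600, 80),   # off-diagonal hit, breaks collinearity
    Block(120, 220, 130, 230, 70),
    Block(300, 400, 310, 410, 60),
]
for b in best_chain(blocks):
    print(b)
```

In this toy input the off-diagonal block is left out of the chain because it cannot be followed by the last block in both genomes at once.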

Reconstruction algorithm
2) Compute multiple genome alignment
- TBA program (Blanchette, Miller, et al.)
ARMADILLO TGCTACTAATAT-----T-TAGTA-CATAGAG-CC-CAGGGGTGCTGCTGAAA GTCTTAAAATGCACAGTGTAGCCCCTCCTCC ACAAAGAATTAACTAGCCCAGAATGTCAGGA GT--A-CCAAG
COW GCCTCTCTTT CTGCCCTGCAGGC-TAGAA-TGTATCA-CT-TAGATGTTCCAA ATCAGAAAGTGTTCAG CCATTTCCATACCACC----AGGAGCTA-CAATGTTGGGCTGCAGCTA TTTGGATCAAA
HORSE GTCACAATTTAGGAAGTGCCACTGGCCT-----C-TAGAG-GGTAGAA-GA-CAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCCATCAACAAAGAATTATCCAGCCCAAAATGCCAATA GT--GCCCAGA
CAT GTCACAGTTTAGGGGGTACTACTGGCAT-----C-TATCG-GGTGGAG-GA-TAGGGATACTGATAATC ATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCCACAA-CAAAGAATTATCCAGCCCAAAATGCCAACA GT--GCTCAGA
DOG GTCACAATTTGGGGGATACTACTGGCAT-----C-TAATG-GGTAGAG-GA-CAGGGATACTGATAATT GCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCCAAAAGCAAAGTATTATCCAGCCCCAAATGCCAATG GT--GCTCAGA
HEDGEHOG GTCATAGTTT----GATTATATGGGCTT-----CTTAGTA-GACAAAGAAA-AAGATGTTCTGGTAGTC ATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTCCAAAATTAAGAGTCATCATACTCAGTGTGCCAATA TG--GCCCAGA
MOUSE GTCACAGTTTGGAGGATGTTACTGACAT-----C-TAGAG-AGTAGAC-TT-TAAAGATACTGATAGTC ACCCCATTGTGCAC CTCCAACAATAATGGCTCATCGAAACCTAAATGCCAATCTGCCAATTAT--GTCCATG
RAT GTCACAATTTGGAGGATGTTACTGGCAT-----C-TAGAG-AGTAGAC-TT-TAAGGACACTGATAATC ATACTATGCTGCAC TTCCAACAATAATGGCTCATCTAGACCTAAATACCAATCTGCCAATTAT--ATCCATG
RABBIT ATCACAATTTGGGGAACACCACTGGCAT-----C-TCGGGTAGCAGGC----CAGGCATGCTGGTAATT ATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACCAACAACA--GGTTTATGCTGCCCAAAGTGCCAGTGTGC CCACG
LEMUR ATCACAA-TTGGGGG-TGCCACGGTCCT-----C-CAGTG-GGTAGAG-AA-CAGGGAGGCTGATAACC ACCCTGCAGTGCACAGGGCAGTGCC-CCACTCCCACCACAACAATGGAGAATTATTGGGCCCCAAATGCCAATA GT--GCCCAAG
MOUSELEMUR ATCACAG-TTGGGGGATGCCACTGGCCT-----C-AAGTG-GGTAGAG-AA-CAGGGAGGCTGAAAACC ACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCCAACAACGGAGAATTATTGGGTCCCAAATGCCAATA GT--GCCCAGG
VERVET GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGAACCCAAAATGTTAATA GT--GTCCAGG
MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGAATGCTTATAATC ATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGCTAATG GT--GTCCAGG
BABOON GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAA-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGTTAATG GT--GTCCAGG
ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTC-----G-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCACTGGACCCAAAATGTTAATG GT--GTCCAGG
GORILLA GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGTGGGGATGCTTATACTC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGG
CHIMP GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGA
HUMAN GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCTAAAATGTTAATG GT--GTCCAGG
Goal: Phylogenetic correctness
Two nucleotides are aligned if and only if they descend from the same nucleotide in a common ancestor.

Reconstruction algorithm
3) Reconstruct insertion/deletion history
- Find the most likely explanation for the gaps observed in the multiple alignment from step 2
- This defines the presence (N) or absence (-) of a base at each position of each ancestor:
NNNNNNNNNNNNNNNNNNNNNNNNNNNN-----N-NNNNN-NNNNNNN-NN-NNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Reconstruction algorithm
4) Infer the maximum-likelihood nucleotide at each position
- Felsenstein's algorithm with a context-sensitive model
Reconstructed ancestor row, shown under the multiple alignment from step 2:
GTCACAATTTGGGGGATGCTACTGGCAT-----C-TAGTG-GGTAGAG-AA-CAGGGATGCTGATAATC ATCCTACAGTGCACAGGACAGTGCCCCCACCCCCACTCCAACAACAAAGAATTATCCGGCCCAAAATGCCAATA GT--GCCCAGG
Ancestral sequences are inferred!
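Step 4 can be illustrated with Felsenstein's pruning algorithm on a single alignment column. This is a minimal sketch, and it differs from the talk in one important way: the slide mentions a context-sensitive model, whereas the code below assumes the much simpler Jukes-Cantor model, where every column is independent. The tree encoding (a leaf is a base, an internal node is a list of (child, branch length) pairs) and the function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"

def jc69(t):
    """Jukes-Cantor transition matrix for a branch of length t (subst/site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial_likelihoods(node):
    """Return L[x] = P(leaves observed below node | node carries base x)."""
    if isinstance(node, str):              # leaf: the observed base
        return [1.0 if b == node else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, branch_length in node:      # internal node: (child, length) pairs
        p = jc69(branch_length)
        child_l = partial_likelihoods(child)
        for x in range(4):
            result[x] *= sum(p[x][y] * child_l[y] for y in range(4))
    return result

def root_posterior(tree):
    """Posterior over the root base, assuming a uniform prior on A, C, G, T."""
    like = partial_likelihoods(tree)
    total = sum(like)
    return {b: value / total for b, value in zip(BASES, like)}

# Two closely related leaves carrying A, and a more distant leaf carrying G.
tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("G", 0.3)]
posterior = root_posterior(tree)
print(max(posterior, key=posterior.get), posterior)
```

Here the root is most likely A, since two short branches support it against one long branch supporting G. The same recursion can be re-rooted at any internal node to score the other ancestors.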

Reconstructing indel history
Optimal indel reconstruction: not so easy!
NNNNNNNNNNNNNNN
NN------NNNNNNN
NNNN NNNN
NNNNNN-----NNNN
Later builds of the slide add:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NN NNNNNNN
NNNN NNNN
NNNNNN NNNN

Inferring indel history
Given:
- A multiple sequence alignment
- A phylogenetic tree
- A probability model for deletions: probability depends on deletion length and branch length
- A probability model for insertions: probability depends on insertion length, branch length, and content
Find: The most likely set of insertions and deletions that led to the given alignment
NP-hard (Chindelevitch et al. 2006)
Approaches:
- Fredslund et al. (2003): Restricted enumeration
- Blanchette et al. (2004): Greedy algorithm
- Chindelevitch et al. (2006): Integer Linear Programming
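As a point of comparison for the problem stated above, here is a much simpler baseline, offered as a sketch only. It applies Fitch parsimony to one alignment column at a time to decide whether each ancestor had a base there. It is not any of the three cited methods: it ignores indel lengths, branch lengths, and the probability models, and it treats columns independently, which is exactly what makes the real problem hard. The nested-tuple tree, the function name, and the tie-break toward "base present" are assumptions.

```python
def reconstruct_column(tree, present):
    """Fitch parsimony for one column of presence/absence data.

    tree:    nested 2-tuples of leaf names, e.g. (("human", "chimp"), "mouse")
    present: dict mapping each leaf name to True (base) or False (gap)
    Returns a dict mapping every node (leaf name or subtree tuple) to a bool.
    """
    sets, states = {}, {}

    def bottom_up(node):
        if isinstance(node, str):
            sets[node] = {present[node]}
        else:
            left, right = (bottom_up(child) for child in node)
            # Intersection if the children agree, union otherwise.
            sets[node] = (left & right) or (left | right)
        return sets[node]

    def top_down(node, parent_state):
        candidates = sets[node]
        # Keep the parent's state when possible; ties at the root are
        # broken toward "present" (an arbitrary choice in this sketch).
        states[node] = parent_state if parent_state in candidates else max(candidates)
        if not isinstance(node, str):
            for child in node:
                top_down(child, states[node])

    bottom_up(tree)
    top_down(tree, None)
    return states

tree = ((("human", "chimp"), "mouse"), "dog")
column = {"human": True, "chimp": True, "mouse": False, "dog": True}
states = reconstruct_column(tree, column)
print(states[tree], states[(("human", "chimp"), "mouse")])
# True True  (the gap is explained by one deletion on the mouse branch)
```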

Partial Results - Deletions only
If only deletions are allowed and all deletions have the same probability (cost), then:
- Rectangle-covering problem, where the tree determines which sets of rows are admissible
NNNNNNN---NN-----N
NNNNNNNN--NN-----N
N---NNNNNNNNNN---N
NN--NNNNNNNNNNNNNN
- Exact polynomial-time greedy algorithm
- Idea: There always exists a "forced move", i.e. a gap that can only be covered by a single maximal deletion

Measuring accuracy
We use simulations of mammalian sequence evolution to evaluate the accuracy of the reconstruction on neutrally evolving DNA.
- Start with a random (realistic) ancestral sequence: AGCATAGA
- Simulate evolution along the mammalian tree: AGCATAGA ACGACGATA AGCATA AGCATCAG AGCAAATC AGACTACA AGCATCAGC AGG AGGCT AGGACATCA AGGACACCA AGGACCCCA AGGATTC AGGGTTC AGCATAGA AGGATAGA AGCATTAGA AGCATTGAGA
- Use TBA to align the sequences generated: AG-C-AT--- ACGA-CG--- A----GC--- AGC--AT--- AGCA-A---- AGAC-TA--- AGCAATC--- AGGC AGGA-CA--- AGGA-CACCA AGGA-CCCCA AGGA--TTC- AGGG--TTC-
- Reconstruct the indel history
- Infer ancestral sequences at each node: AGATCGA AGCTTGAGA AGTATTTAGA AGTATAGGA
- For each node, align the true and predicted ancestors and count: missing bases + added bases + substituted bases
ACGCATT-AGA
A-GTATTTAGA
3 errors / 10 bp, error rate = 0.3
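The error count on this slide can be reproduced with a few lines of code. This is a minimal sketch: the function name is hypothetical, and dividing by the length of the true ancestor is an assumption (the slide only says "3 errors/10 bp", and both sequences happen to be 10 bp long).

```python
def reconstruction_errors(true_row: str, predicted_row: str):
    """Count missing, added and substituted bases between two aligned rows.

    Returns (missing, added, substituted, error_rate), where error_rate is
    the total error count divided by the number of bases in the true row.
    """
    if len(true_row) != len(predicted_row):
        raise ValueError("aligned rows must have equal length")
    missing = added = substituted = 0
    for t, p in zip(true_row, predicted_row):
        if t == p:
            continue
        if p == "-":
            missing += 1        # base in the true ancestor, absent in prediction
        elif t == "-":
            added += 1          # base predicted, absent in the true ancestor
        else:
            substituted += 1    # both have a base, but a different one
    true_length = sum(c != "-" for c in true_row)
    return missing, added, substituted, (missing + added + substituted) / true_length

print(reconstruction_errors("ACGCATT-AGA", "A-GTATTTAGA"))
# (1, 1, 1, 0.3)
```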

Simulation details
We simulate neutrally evolving regions of 50 kb
We model:
- Lineage-specific neutral mutation rates
- Insertions and deletions based on empirical frequency and length distributions
- Insertion of transposable elements
- CpG effect
We don't model:
- DNA polymerase slippage
- Positive selection
- Genome rearrangements, duplications
Sanity checks: simulated sequences are similar to actual mammalian sequences:
- Same pairwise percent identity
- Same frequency and length distribution of insertions and deletions
- Same repetitive content and age distribution of repeats

Guess which ancestor can be best reconstructed? Eizirik et al. 2001

Reconstructability and tree topology
Star phylogeny (root R, n independent descendants):
- Leaves are independent
- Accuracy approaches 100% exponentially fast as n increases
Bifurcating root (root R with children A and B, n dependent descendants):
- Information lost between R and A or B can't be recovered
- Can't do better than if A and B were reconstructed perfectly
- Accuracy < 100% - ε for all n, for some ε > 0
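The contrast between the two topologies can be made concrete in a toy model. The sketch below is an illustration under strong assumptions that are not in the talk: characters have only two states, each branch flips the state independently with probability `p_change`, and the root is inferred by majority vote (which is the maximum-likelihood rule for this symmetric model). For the bifurcating case, even with A and B known perfectly, the root is correct with probability q² + ½·2qp = q, where q = 1 - p, so accuracy is capped at 1 - p however many leaves are added.

```python
from math import comb

def star_accuracy(n: int, p_change: float) -> float:
    """P(majority vote over n independent leaves recovers the root state)."""
    q = 1.0 - p_change
    accuracy = 0.0
    for k in range(n + 1):                  # k leaves still show the root state
        prob = comb(n, k) * q**k * p_change**(n - k)
        if 2 * k > n:
            accuracy += prob                # clear majority for the true state
        elif 2 * k == n:
            accuracy += 0.5 * prob          # tie: broken by a fair coin flip
    return accuracy

def bifurcating_bound(p_change: float) -> float:
    """Upper bound on root accuracy when A and B are known exactly."""
    q = 1.0 - p_change
    # Both children agree with R (q*q), or exactly one does and we guess (q*p).
    return q * q + q * p_change

for n in (1, 3, 11, 51):
    print(n, round(star_accuracy(n, 0.3), 4))
print("bifurcating bound:", round(bifurcating_bound(0.3), 4))
```

With `p_change = 0.3`, the star accuracy climbs from 0.7 toward 1 as n grows, while the bifurcating bound stays at 0.7.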

Eizirik et al. 2001

How many species do we need? Best choice of species: - Sample many taxa - Choose slowly evolving species

What if the fast-radiation model is wrong?

Reconstructing real ancestors

MOUSE-LEMUR COW RAT CHIMP, GORILLA, ORANGUTAN, MACAQUE, VERVET, BABOON For this set of species, simulations predict: - Expected accuracy ~95%

External validation using ancestral transposons
[Figure: diagram relating the transposon consensus, the actual mammalian ancestor, the reconstructed mammalian ancestor, and the human relic. Distances between them, and the resulting reconstruction error, are labeled in subst/site; the numeric values are missing.]

What's next? Whole genome!
Data available:
- Whole genomes: Human, chimp, mouse, rat, dog
- Unassembled / low-coverage genomes: Cow, rabbit, armadillo, elephant
Challenges:
- Fewer species
- Unassembled contigs
- Genome rearrangements
- Recombination hotspots
We expect that 90% of the Boreoeutherian genome can be reconstructed with ~90% accuracy

Why should we care?
- An ancestral genome lets us see what changed in our genome, and when
  - Allows detection and "dating" of lineage-specific innovations (e.g. FOXP2)
- Allows a better understanding of the forces driving genome evolution
- New model organism?
  - The human genome is 4 times closer to the ancestral genome than to the mouse genome: a better model for human phenotypes?

Even if we had the full genomes of all living mammalian species:
Technological problem:
- We can't synthesize large regions of DNA
Many regions can't be reconstructed at all:
- Heterochromatin
- Regions with high recombination rates
99% base-by-base accuracy is not enough:
- One mistake may be enough to make life impossible

Acknowledgements
- David Haussler, Brian Raney (UC Santa Cruz)
- Webb Miller (Penn State Univ.)
- Eric Green (NHGRI)
UC Santa Cruz group: Adam Siepel, Robert Baertsch, Gill Bejerano, Jim Kent
McGill group: Leonid Chindelevitch, Zhentao Li, Eric Blais