[Bejerano Fall10/11] 1 HW1 Due This Fri 10/15 at noon. TA Q&A: What to ask, How to ask
[Bejerano Fall10/11] 2 Lecture 8 Chains & Nets cont’d Genome Reconstruction Genomic Conservation & Function
Mutations Chromosomal Mutations: –Deletion –Inversion –Translocation –Duplication –(Nondisjunction) 3 [Bejerano Fall10/11]
4 Gene Families Orthologs : Genes related via speciation (e.g. C,M,H3) Paralogs: Genes related through duplication (e.g. H1,H2,H3) Homologs: Genes that share a common origin (e.g. C,M,H1,H2,H3) Species tree Gene tree Speciation Duplication Loss single ancestral gene
[Bejerano Fall10/11] 5 Chaining (Paralogs) Protease Regulatory Subunit 3
[Bejerano Fall10/11] 6 Netting (Ortholog)
[Bejerano Fall10/11] 7 Convert / LiftOver "LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process. LiftOver – batch utility
[Bejerano Fall10/11] 8 Net highlights rearrangements A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.
Drawbacks Nets relentlessly try to fill in gaps Heuristic guess of best chain! True dog ortholog just hasn’t been sequenced 9
[Bejerano Fall10/11] 10 And Retrogenes
[Bejerano Fall10/11] 11 Conservation Track Documentation
[Bejerano Fall10/11] 12 Useful in finding pseudogenes Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting! gene pred.
[Bejerano Fall10/11] 13 Cautionary Note 2
[Bejerano Fall10/11] 14 Same Region… same in all the other fish
[Bejerano Fall10/11] 15 A Rearrangement Hot Spot Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.
Drawbacks Inversions not handled optimally > > > > chr1 > > > < < < < chr1 < < < < < < < < chr5 < < < < Chains Nets > > > > chr1 > > > < < < < chr5 < < < < 16
Drawbacks High copy number genes can break orthology 17
[Bejerano Fall10/11] 18 Self Chain
From pairwise to multiple alignments [Bejerano Fall10/11] 19
Example: in 3D (three sequences): 7 neighbors/cell F(i,j,k) = max{ F(i-1,j-1,k-1)+S(x i, x j, x k ), F(i-1,j-1,k )+S(x i, x j, - ), F(i-1,j,k-1)+S(x i, -, x k ), F(i-1,j,k )+S(x i, -, - ), F(i,j-1,k-1)+S( -, x j, x k ), F(i,j-1,k )+S( -, x j, x k ), F(i,j,k-1)+S( -, -, x k ) } Multidimensional DP
Progressive Alignment When evolutionary tree is known: Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result x w y z p xy p zw p xyzw E.g: Blastz – Multiz shown in UCSC browser
Anchor based alignment [Bejerano Fall10/11] 22 Example:
Anchor based alignment [Bejerano Fall10/11] 23 E.g: Enredo - Pecan shown in ENSEMBL browser
Reconstruct the Boreoeutherian ancestor
Ancestral Genome Reconstruction Given: - Genomic sequences of several mammals - Phylogenetic tree Find: The genomic sequence of all their ancestors ARMADILLO TGCTACTAATATTTAGTACATAGAGCCCAGGGGTGCTGCTGAAAGTCTTAAAATGCACAGTGTAGCCCCTCCTCC COW GCCTCTCTTTCTGCCCTGCAGGCTAGAATGTATCACTTAGATGTTCCAAATCAGAAAGTGTTCAGCCATTTCCATACC HORSE GTCACAATTTAGGAAGTGCCACTGGCCTCTAGAGGGTAGAAGACAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCC CAT GTCACAGTTTAGGGGGTACTACTGGCATCTATCGGGTGGAGGATAGGGATACTGATAATCATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCC DOG GTCACAATTTGGGGGATACTACTGGCATCTAATGGGTAGAGGACAGGGATACTGATAATTGCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCC HEDGEHOG GTCATAGTTTGATTATATGGGCTTCTTAGTAGACAAAGAAAAAGATGTTCTGGTAGTCATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTC MOUSE GTCACAGTTTGGAGGATGTTACTGACATCTAGAGAGTAGACTTTAAAGATACTGATAGTCACCCCATTGTGCACCTCC RAT GTCACAATTTGGAGGATGTTACTGGCATCTAGAGAGTAGACTTTAAGGACACTGATAATCATACTATGCTGCACTTCC RABBIT ATCACAATTTGGGGAACACCACTGGCATCTCGGGTAGCAGGCCAGGCATGCTGGTAATTATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACC LEMUR ATCACAATTGGGGGTGCCACGGTCCTCCAGTGGGTAGAGAACAGGGAGGCTGATAACCACCCTGCAGTGCACAGGGCAGTGCCCCACTCCCACCAC MOUSE-LEMUR ATCACAGTTGGGGGATGCCACTGGCCTCAAGTGGGTAGAGAACAGGGAGGCTGAAAACCACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCC VERVET GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAGAAACAGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAGAAACAGGAATGCTTATAATCATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCC BABOON GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAAAAACAGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTCGACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCC GORILLA GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGTGGGGATGCTTATACTCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC CHIMP GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC HUMAN GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC Mutational operations Small-scale : Substitutions, deletions, insertions (inc. transposons) Large scale: Genome rearrangement, segmental/tandem duplications (*): Heterochromatin non-included All of it: Functional, non-functional, introns, intergenic, repeats, everything * !
Reconstruction algorithm 1)Identify orthologous regions in each species
Reconstruction algorithm 2) Compute multiple genome alignment ARMADILLO TGCTACTAATAT-----T-TAGTA-CATAGAG-CC-CAGGGGTGCTGCTGAAA GTCTTAAAATGCACAGTGTAGCCCCTCCTCC ACAAAGAATTAACTAGCCCAGAATGTCAGGA GT--A-CCAAG COW GCCTCTCTTT CTGCCCTGCAGGC-TAGAA-TGTATCA-CT-TAGATGTTCCAA ATCAGAAAGTGTTCAG CCATTTCCATACCACC----AGGAGCTA-CAATGTTGGGCTGCAGCTA TTTGGATCAAA HORSE GTCACAATTTAGGAAGTGCCACTGGCCT-----C-TAGAG-GGTAGAA-GA-CAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCCATCAACAAAGAATTATCCAGCCCAAAATGCCAATA GT--GCCCAGA CAT GTCACAGTTTAGGGGGTACTACTGGCAT-----C-TATCG-GGTGGAG-GA-TAGGGATACTGATAATC ATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCCACAA-CAAAGAATTATCCAGCCCAAAATGCCAACA GT--GCTCAGA DOG GTCACAATTTGGGGGATACTACTGGCAT-----C-TAATG-GGTAGAG-GA-CAGGGATACTGATAATT GCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCCAAAAGCAAAGTATTATCCAGCCCCAAATGCCAATG GT--GCTCAGA HEDGEHOG GTCATAGTTT----GATTATATGGGCTT-----CTTAGTA-GACAAAGAAA-AAGATGTTCTGGTAGTC ATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTCCAAAATTAAGAGTCATCATACTCAGTGTGCCAATA TG--GCCCAGA MOUSE GTCACAGTTTGGAGGATGTTACTGACAT-----C-TAGAG-AGTAGAC-TT-TAAAGATACTGATAGTC ACCCCATTGTGCAC CTCCAACAATAATGGCTCATCGAAACCTAAATGCCAATCTGCCAATTAT--GTCCATG RAT GTCACAATTTGGAGGATGTTACTGGCAT-----C-TAGAG-AGTAGAC-TT-TAAGGACACTGATAATC ATACTATGCTGCAC TTCCAACAATAATGGCTCATCTAGACCTAAATACCAATCTGCCAATTAT--ATCCATG RABBIT ATCACAATTTGGGGAACACCACTGGCAT-----C-TCGGGTAGCAGGC----CAGGCATGCTGGTAATT ATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACCAACAACA--GGTTTATGCTGCCCAAAGTGCCAGTGTGC CCACG LEMUR ATCACAA-TTGGGGG-TGCCACGGTCCT-----C-CAGTG-GGTAGAG-AA-CAGGGAGGCTGATAACC ACCCTGCAGTGCACAGGGCAGTGCC-CCACTCCCACCACAACAATGGAGAATTATTGGGCCCCAAATGCCAATA GT--GCCCAAG MOUSELEMUR ATCACAG-TTGGGGGATGCCACTGGCCT-----C-AAGTG-GGTAGAG-AA-CAGGGAGGCTGAAAACC ACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCCAACAACGGAGAATTATTGGGTCCCAAATGCCAATA GT—-GCCCAGG VERVET GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGAACCCAAAATGTTAATA GT--GTCCAGG MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGAATGCTTATAATC ATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGCTAATG GT--GTCCAGG BABOON GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAA-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGTTAATG GT--GTCCAGG ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTC-----G-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCACTGGACCCAAAATGTTAATG GT--GTCCAGG GORILLA GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGTGGGGATGCTTATACTC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGG CHIMP GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGA HUMAN GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCTAAAATGTTAATG GT--GTCCAGG Goal: Phylogenetic correctness Two nucleotides are aligned if and only if they have a common ancestor.
Reconstruction algorithm 3) Reconstruct insertion/deletion history Find most likely explanation for gaps observed ARMADILLO TGCTACTAATAT-----T-TAGTA-CATAGAG-CC-CAGGGGTGCTGCTGAAA GTCTTAAAATGCACAGTGTAGCCCCTCCTCC ACAAAGAATTAACTAGCCCAGAATGTCAGGA GT--A-CCAAG COW GCCTCTCTTT CTGCCCTGCAGGC-TAGAA-TGTATCA-CT-TAGATGTTCCAA ATCAGAAAGTGTTCAG CCATTTCCATACCACC----AGGAGCTA-CAATGTTGGGCTGCAGCTA TTTGGATCAAA HORSE GTCACAATTTAGGAAGTGCCACTGGCCT-----C-TAGAG-GGTAGAA-GA-CAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCCATCAACAAAGAATTATCCAGCCCAAAATGCCAATA GT--GCCCAGA CAT GTCACAGTTTAGGGGGTACTACTGGCAT-----C-TATCG-GGTGGAG-GA-TAGGGATACTGATAATC ATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCCACAA-CAAAGAATTATCCAGCCCAAAATGCCAACA GT--GCTCAGA DOG GTCACAATTTGGGGGATACTACTGGCAT-----C-TAATG-GGTAGAG-GA-CAGGGATACTGATAATT GCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCCAAAAGCAAAGTATTATCCAGCCCCAAATGCCAATG GT--GCTCAGA HEDGEHOG GTCATAGTTT----GATTATATGGGCTT-----CTTAGTA-GACAAAGAAA-AAGATGTTCTGGTAGTC ATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTCCAAAATTAAGAGTCATCATACTCAGTGTGCCAATA TG--GCCCAGA MOUSE GTCACAGTTTGGAGGATGTTACTGACAT-----C-TAGAG-AGTAGAC-TT-TAAAGATACTGATAGTC ACCCCATTGTGCAC CTCCAACAATAATGGCTCATCGAAACCTAAATGCCAATCTGCCAATTAT--GTCCATG RAT GTCACAATTTGGAGGATGTTACTGGCAT-----C-TAGAG-AGTAGAC-TT-TAAGGACACTGATAATC ATACTATGCTGCAC TTCCAACAATAATGGCTCATCTAGACCTAAATACCAATCTGCCAATTAT--ATCCATG RABBIT ATCACAATTTGGGGAACACCACTGGCAT-----C-TCGGGTAGCAGGC----CAGGCATGCTGGTAATT ATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACCAACAACA--GGTTTATGCTGCCCAAAGTGCCAGTGTGC CCACG LEMUR ATCACAA-TTGGGGG-TGCCACGGTCCT-----C-CAGTG-GGTAGAG-AA-CAGGGAGGCTGATAACC ACCCTGCAGTGCACAGGGCAGTGCC-CCACTCCCACCACAACAATGGAGAATTATTGGGCCCCAAATGCCAATA GT--GCCCAAG MOUSELEMUR ATCACAG-TTGGGGGATGCCACTGGCCT-----C-AAGTG-GGTAGAG-AA-CAGGGAGGCTGAAAACC ACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCCAACAACGGAGAATTATTGGGTCCCAAATGCCAATA GT—-GCCCAGG VERVET GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGAACCCAAAATGTTAATA GT--GTCCAGG MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGAATGCTTATAATC ATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGCTAATG GT--GTCCAGG BABOON GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAA-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGTTAATG GT--GTCCAGG ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTC-----G-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCACTGGACCCAAAATGTTAATG GT--GTCCAGG GORILLA GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGTGGGGATGCTTATACTC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGG CHIMP GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGA HUMAN GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCTAAAATGTTAATG GT--GTCCAGG
Reconstruction algorithm 3) Reconstruct insertion/deletion history Find most likely explanation for gaps observed ARMADILLO TGCTACTAATAT-----T-TAGTA-CATAGAG-CC-CAGGGGTGCTGCTGAAA GTCTTAAAATGCACAGTGTAGCCCCTCCTCC ACAAAGAATTAACTAGCCCAGAATGTCAGGA GT--A-CCAAG COW GCCTCTCTTT CTGCCCTGCAGGC-TAGAA-TGTATCA-CT-TAGATGTTCCAA ATCAGAAAGTGTTCAG CCATTTCCATACCACC----AGGAGCTA-CAATGTTGGGCTGCAGCTA TTTGGATCAAA HORSE GTCACAATTTAGGAAGTGCCACTGGCCT-----C-TAGAG-GGTAGAA-GA-CAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCCATCAACAAAGAATTATCCAGCCCAAAATGCCAATA GT--GCCCAGA CAT GTCACAGTTTAGGGGGTACTACTGGCAT-----C-TATCG-GGTGGAG-GA-TAGGGATACTGATAATC ATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCCACAA-CAAAGAATTATCCAGCCCAAAATGCCAACA GT--GCTCAGA DOG GTCACAATTTGGGGGATACTACTGGCAT-----C-TAATG-GGTAGAG-GA-CAGGGATACTGATAATT GCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCCAAAAGCAAAGTATTATCCAGCCCCAAATGCCAATG GT--GCTCAGA HEDGEHOG GTCATAGTTT----GATTATATGGGCTT-----CTTAGTA-GACAAAGAAA-AAGATGTTCTGGTAGTC ATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTCCAAAATTAAGAGTCATCATACTCAGTGTGCCAATA TG--GCCCAGA MOUSE GTCACAGTTTGGAGGATGTTACTGACAT-----C-TAGAG-AGTAGAC-TT-TAAAGATACTGATAGTC ACCCCATTGTGCAC CTCCAACAATAATGGCTCATCGAAACCTAAATGCCAATCTGCCAATTAT--GTCCATG RAT GTCACAATTTGGAGGATGTTACTGGCAT-----C-TAGAG-AGTAGAC-TT-TAAGGACACTGATAATC ATACTATGCTGCAC TTCCAACAATAATGGCTCATCTAGACCTAAATACCAATCTGCCAATTAT--ATCCATG RABBIT ATCACAATTTGGGGAACACCACTGGCAT-----C-TCGGGTAGCAGGC----CAGGCATGCTGGTAATT ATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACCAACAACA--GGTTTATGCTGCCCAAAGTGCCAGTGTGC CCACG LEMUR ATCACAA-TTGGGGG-TGCCACGGTCCT-----C-CAGTG-GGTAGAG-AA-CAGGGAGGCTGATAACC ACCCTGCAGTGCACAGGGCAGTGCC-CCACTCCCACCACAACAATGGAGAATTATTGGGCCCCAAATGCCAATA GT--GCCCAAG MOUSELEMUR ATCACAG-TTGGGGGATGCCACTGGCCT-----C-AAGTG-GGTAGAG-AA-CAGGGAGGCTGAAAACC ACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCCAACAACGGAGAATTATTGGGTCCCAAATGCCAATA GT—-GCCCAGG VERVET GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGAACCCAAAATGTTAATA GT--GTCCAGG MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGAATGCTTATAATC ATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGCTAATG GT--GTCCAGG BABOON GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAA-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGTTAATG GT--GTCCAGG ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTC-----G-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCACTGGACCCAAAATGTTAATG GT--GTCCAGG GORILLA GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGTGGGGATGCTTATACTC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGG CHIMP GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGA HUMAN GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCTAAAATGTTAATG GT--GTCCAGG
Reconstruction algorithm 3) Reconstruct insertion/deletion history –Find most likely explanation for gaps observed This defines the presence/absence of a base at each position of each ancestor ARMADILLO TGCTACTAATAT-----T-TAGTA-CATAGAG-CC-CAGGGGTGCTGCTGAAA GTCTTAAAATGCACAGTGTAGCCCCTCCTCC ACAAAGAATTAACTAGCCCAGAATGTCAGGA GT--A-CCAAG COW GCCTCTCTTT CTGCCCTGCAGGC-TAGAA-TGTATCA-CT-TAGATGTTCCAA ATCAGAAAGTGTTCAG CCATTTCCATACCACC----AGGAGCTA-CAATGTTGGGCTGCAGCTA TTTGGATCAAA HORSE GTCACAATTTAGGAAGTGCCACTGGCCT-----C-TAGAG-GGTAGAA-GA-CAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCCATCAACAAAGAATTATCCAGCCCAAAATGCCAATA GT--GCCCAGA CAT GTCACAGTTTAGGGGGTACTACTGGCAT-----C-TATCG-GGTGGAG-GA-TAGGGATACTGATAATC ATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCCACAA-CAAAGAATTATCCAGCCCAAAATGCCAACA GT--GCTCAGA DOG GTCACAATTTGGGGGATACTACTGGCAT-----C-TAATG-GGTAGAG-GA-CAGGGATACTGATAATT GCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCCAAAAGCAAAGTATTATCCAGCCCCAAATGCCAATG GT--GCTCAGA HEDGEHOG GTCATAGTTT----GATTATATGGGCTT-----CTTAGTA-GACAAAGAAA-AAGATGTTCTGGTAGTC ATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTCCAAAATTAAGAGTCATCATACTCAGTGTGCCAATA TG--GCCCAGA MOUSE GTCACAGTTTGGAGGATGTTACTGACAT-----C-TAGAG-AGTAGAC-TT-TAAAGATACTGATAGTC ACCCCATTGTGCAC CTCCAACAATAATGGCTCATCGAAACCTAAATGCCAATCTGCCAATTAT--GTCCATG RAT GTCACAATTTGGAGGATGTTACTGGCAT-----C-TAGAG-AGTAGAC-TT-TAAGGACACTGATAATC ATACTATGCTGCAC TTCCAACAATAATGGCTCATCTAGACCTAAATACCAATCTGCCAATTAT--ATCCATG RABBIT ATCACAATTTGGGGAACACCACTGGCAT-----C-TCGGGTAGCAGGC----CAGGCATGCTGGTAATT ATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACCAACAACA--GGTTTATGCTGCCCAAAGTGCCAGTGTGC CCACG LEMUR ATCACAA-TTGGGGG-TGCCACGGTCCT-----C-CAGTG-GGTAGAG-AA-CAGGGAGGCTGATAACC ACCCTGCAGTGCACAGGGCAGTGCC-CCACTCCCACCACAACAATGGAGAATTATTGGGCCCCAAATGCCAATA GT--GCCCAAG MOUSELEMUR ATCACAG-TTGGGGGATGCCACTGGCCT-----C-AAGTG-GGTAGAG-AA-CAGGGAGGCTGAAAACC ACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCCAACAACGGAGAATTATTGGGTCCCAAATGCCAATA GT—-GCCCAGG VERVET GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGAACCCAAAATGTTAATA GT--GTCCAGG MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGAATGCTTATAATC ATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGCTAATG GT--GTCCAGG BABOON GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAA-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGTTAATG GT--GTCCAGG ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTC-----G-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCACTGGACCCAAAATGTTAATG GT--GTCCAGG GORILLA GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGTGGGGATGCTTATACTC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGG CHIMP GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGA HUMAN GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCTAAAATGTTAATG GT--GTCCAGG NNNNNNNNNNNNNNNNNNNNNNNNNNNN-----N-NNNNN-NNNNNNN-NN-NNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Reconstruction algorithm ARMADILLO TGCTACTAATAT-----T-TAGTA-CATAGAG-CC-CAGGGGTGCTGCTGAAA GTCTTAAAATGCACAGTGTAGCCCCTCCTCC ACAAAGAATTAACTAGCCCAGAATGTCAGGA GT--A-CCAAG COW GCCTCTCTTT CTGCCCTGCAGGC-TAGAA-TGTATCA-CT-TAGATGTTCCAA ATCAGAAAGTGTTCAG CCATTTCCATACCACC----AGGAGCTA-CAATGTTGGGCTGCAGCTA TTTGGATCAAA HORSE GTCACAATTTAGGAAGTGCCACTGGCCT-----C-TAGAG-GGTAGAA-GA-CAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCCATCAACAAAGAATTATCCAGCCCAAAATGCCAATA GT--GCCCAGA CAT GTCACAGTTTAGGGGGTACTACTGGCAT-----C-TATCG-GGTGGAG-GA-TAGGGATACTGATAATC ATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCCACAA-CAAAGAATTATCCAGCCCAAAATGCCAACA GT--GCTCAGA DOG GTCACAATTTGGGGGATACTACTGGCAT-----C-TAATG-GGTAGAG-GA-CAGGGATACTGATAATT GCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCCAAAAGCAAAGTATTATCCAGCCCCAAATGCCAATG GT--GCTCAGA HEDGEHOG GTCATAGTTT----GATTATATGGGCTT-----CTTAGTA-GACAAAGAAA-AAGATGTTCTGGTAGTC ATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTCCAAAATTAAGAGTCATCATACTCAGTGTGCCAATA TG--GCCCAGA MOUSE GTCACAGTTTGGAGGATGTTACTGACAT-----C-TAGAG-AGTAGAC-TT-TAAAGATACTGATAGTC ACCCCATTGTGCAC CTCCAACAATAATGGCTCATCGAAACCTAAATGCCAATCTGCCAATTAT--GTCCATG RAT GTCACAATTTGGAGGATGTTACTGGCAT-----C-TAGAG-AGTAGAC-TT-TAAGGACACTGATAATC ATACTATGCTGCAC TTCCAACAATAATGGCTCATCTAGACCTAAATACCAATCTGCCAATTAT--ATCCATG RABBIT ATCACAATTTGGGGAACACCACTGGCAT-----C-TCGGGTAGCAGGC----CAGGCATGCTGGTAATT ATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACCAACAACA--GGTTTATGCTGCCCAAAGTGCCAGTGTGC CCACG LEMUR ATCACAA-TTGGGGG-TGCCACGGTCCT-----C-CAGTG-GGTAGAG-AA-CAGGGAGGCTGATAACC ACCCTGCAGTGCACAGGGCAGTGCC-CCACTCCCACCACAACAATGGAGAATTATTGGGCCCCAAATGCCAATA GT--GCCCAAG MOUSELEMUR ATCACAG-TTGGGGGATGCCACTGGCCT-----C-AAGTG-GGTAGAG-AA-CAGGGAGGCTGAAAACC ACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCCAACAACGGAGAATTATTGGGTCCCAAATGCCAATA GT—-GCCCAGG VERVET GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGAACCCAAAATGTTAATA GT--GTCCAGG MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGAATGCTTATAATC ATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGCTAATG GT--GTCCAGG BABOON GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAA-AAACAGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGTTAATG GT--GTCCAGG ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTC-----G-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCACTGGACCCAAAATGTTAATG GT--GTCCAGG GORILLA GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGTGGGGATGCTTATACTC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGG CHIMP GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG GT--GTCCAGA HUMAN GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCTAAAATGTTAATG GT--GTCCAGG GTCACAATTTGGGGGATGCTACTGGCAT-----C-TAGTG-GGTAGAG-AA-CAGGGATGCTGATAATC ATCCTACAGTGCACAGGACAGTGCCCCCACCCCCACTCCAACAACAAAGAATTATCCGGCCCAAAATGCCAATA GT--GCCCAGG 4) Infer max.-like. nucleotide at each position Ancestral sequences are inferred!
Reconstructing Cancer Genomes Gleevec TM (Novartis 2001) targets BCR-ABL oncogene. Change gene structure and regulatory “wiring” of the genome. Create “bad” novel fusion genes and break “good” old genes. Example: translocation in leukemia. promoter ABL gene BCR genepromoter Chromosome 9 Chromosome 22 BCR-ABL oncogene
Complex Tumor Genomes 1)What are detailed architectures of tumor genomes? 2)What rearrangements/duplications produce these architectures and what is the order of these events? 3)What are the novel fusion genes and old “broken” genes?
Tumor Genomes Projects Tumor genomeHuman genome 1)Identify recurrent aberrations 2)Identify temporal sequence of aberrations 3)Use these data for tumor diagnostics and therapeutics Mutation, selection Tumor genome 2 Tumor genome 4 Tumor genome 3
[Bejerano Fall10/11] 35 Meet Your Genome contd. [Human Molecular Genetics, 3rd Edition]
[Bejerano Fall10/11] 36 Sequence Conservation implies Function (but which function/s?...) human another species common ancestor...CTTTGCGA-TGAGTAGCATCTACTATTT......ACGTGGGACTGACTA-CATCGACTACGA... functional region! Comparative Genomics of Distantly related species: Note: the inverse “no conservation no function” is a much weaker statement given current knowledge
[Bejerano Fall10/11] 37 Vertebrates: what to sequence? [Human Molecular Genetics, 3rd Edition] you are here, Opossum, Lizard, Stickleback too far sweet spot too close Which species to compare to? Too close and purifying selection will be largely indistinguishable from the neutral rate. Too far and many functional orthologs will diverge beyond our ability to accurately align them.
[Bejerano Fall10/11] 38 The Dawn of Whole Genome Comparative Genomics % DNA alignable 95% coding genes shared
[Bejerano Fall10/11] 39 More Species Have Joined Since