Interpreting the human genome


1 Interpreting the human genome
Manolis Kellis Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory

2 [Slide background: a long stretch of raw DNA sequence, beginning TTATATTGAATTTTCAAAAATTCTTAC...]
Speaker notes:
1. The structure of DNA was solved exactly 50 years ago today. However, it is only in the last 5-10 years that technological advances have enabled the sequencing of complete genomes, including our own.
2. We have essentially read 3 billion letters of A, C, G, T. That is 6 billion bits of uncommented machine code, all zeros and ones, probably the biggest computer program ever written outside Seattle. Without the subsequent analysis, this sequence would be useless.
3. This is a new time in science, where we have more data than we know what to do with. And it is all digital information. It is now POSSIBLE to apply computational methods to understand biology. In fact, it is now IMPERATIVE to apply computational methods to understand biology. Without the subsequent analysis, this information is just uncommented assembly code. How do we interpret...
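
The "3 billion letters, 6 billion bits" figure in the notes follows from the fact that each of the four nucleotides fits in 2 bits. A minimal sketch of such an encoding; the function names and the particular letter-to-bits assignment are illustrative assumptions, not something stated in the talk.

```python
# Hypothetical 2-bit code: any assignment of the four letters to 0..3 would do.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per nucleotide."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    bases = []
    for _ in range(length):
        bases.append(LETTERS[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


def bits_needed(num_bases):
    """Storage cost of a sequence under the 2-bit code."""
    return 2 * num_bases
```

Under this code, `bits_needed(3_000_000_000)` gives 6 billion, matching the number quoted in the notes.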

3 Genes and regulatory motifs
Genes: encode proteins
Regulatory motifs: control gene expression
[Slide background: the same DNA sequence as on slide 2]

4 [Slide background: the same DNA sequence as on slide 2]

5 Large-scale comparative genomics datasets
32 mammals; 12 flies; 17 fungi (8 Candida, 9 yeasts).
Figure labels: post-duplication, pre-duplication, diploid, haploid, P, N.

6 Comparative Genomics: Evolution and Genomics
Using evolution to study genomes. Using genomics to study evolution.

7 Computational challenges in biology
1. Genomic interpretation: decode the human genome; discover all functional elements → the building blocks
2. Cell circuitry: discover all control constructs; regulatory network properties → the interconnections
3. Evolutionary innovation: emergence of new functions; genome duplication → the dynamics

8 Evolutionary signatures for diverse functions
Protein-coding genes: codon substitution frequencies; reading frame conservation
RNA structures: compensatory changes; silent G-U substitutions
microRNAs: shape of conservation profile; structural features (loops, pairs); relationship with 3' UTR motifs
Regulatory motifs: mutations preserve consensus; increased Branch Length Score; genome-wide conservation
Stark et al, Nature 2007; Lin et al, Genome Research 2007
Speaker notes:
We have developed comparative genomics methods to systematically interpret complete genomes.
[top graph] Comparative genomics can help pinpoint functional regions, based on their increased conservation. However, high conservation alone does not reveal the specific functions of these elements.
[bottom figure with four panels] We showed that when you look more closely within conserved elements, you can distinguish their specific functions, based on characteristic patterns of nucleotide mutation associated with each function, which we call "evolutionary signatures". We developed such signatures for the systematic annotation of protein-coding genes, RNA structures, microRNAs, regulatory motifs, and individual motif instances.
[Animation] We applied them genome-wide in yeast, fly, and human, to discover new genes, refine and extend known genes, and recognize mistakes in the existing annotations.
[Animation] Using these signatures we were also able to find genes that defied the traditional rules of protein-coding translation. For example, 150 genes, strongly enriched in neuronal functions, showed protein-coding translation well past a conserved stop codon, suggesting abundant stop-codon read-through in the adult brain.
[Animation] We found abundant novel RNA structures strongly associated with RNA editing and mRNA localization, and new structures suggestive of translational regulation.
[Animation] Lastly, we were able to discover many novel microRNA genes in the human and fly genomes, and to refine the annotation of existing microRNAs, resulting in drastic changes to their target spectrum.
Overall, the method of evolutionary signatures has proved extremely powerful in yeast, fly, and human, and has resulted in many new insights about gene usage and gene regulation. The methods are general, and should be applicable in any species.
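
Reading frame conservation, one of the protein-coding signatures listed on this slide, can be illustrated with a simplified score: in a pairwise alignment, count the aligned nucleotides whose frame offset between the two sequences is the same, and take the best of the three possible offsets. This is only a sketch under that simplifying definition; the function name is hypothetical and the published test may differ in its details.

```python
def reading_frame_conservation(ref_aln, other_aln):
    """Fraction of aligned bases that stay in one common reading frame.

    ref_aln, other_aln: two rows of a pairwise alignment, '-' for gaps.
    Gaps whose length is a multiple of three keep the frame; others shift it.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("alignment rows must have equal length")
    counts = [0, 0, 0]          # aligned bases seen at each frame offset
    ref_pos = other_pos = 0     # ungapped coordinates in each sequence
    aligned = 0
    for r, o in zip(ref_aln, other_aln):
        if r != "-" and o != "-":
            counts[(ref_pos - other_pos) % 3] += 1
            aligned += 1
        if r != "-":
            ref_pos += 1
        if o != "-":
            other_pos += 1
    return max(counts) / aligned if aligned else 0.0
```

A three-base gap leaves the score at 1.0, while a single-base gap splits the aligned bases between two offsets and lowers it, which is the behavior one expects from coding versus non-coding regions.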

9 Surprise 1: translational read-through in the brain
Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation.
New mechanism of post-transcriptional control. Hundreds of fly genes, a handful of human genes. Enriched in brain proteins and ion channels. Initial experiments show a potential ADAR role (Reenan Lab).
Many questions remain: A-to-I editing of the stop codon (TAG|TGA|TAA → TGG)? Cryptic splice sites? RNA secondary structure?
Lin et al, Genome Research 2007
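
The A-to-I editing hypothesis (TAG|TGA|TAA → TGG) can be checked by enumeration: inosine is read as G during translation, so editing any subset of the A's in a stop codon yields a small set of possible codons, and TGG (tryptophan) is among them for all three stops. A minimal sketch; the function names are illustrative.

```python
from itertools import product

STOP_CODONS = ("TAG", "TGA", "TAA")


def a_to_i_products(codon):
    """All codons reachable by editing any subset of A's (I is read as G)."""
    options = [("A", "G") if base == "A" else (base,) for base in codon]
    return {"".join(choice) for choice in product(*options)}


def edits_to_tgg(codon):
    """Number of A-to-I edits needed to turn a stop codon into TGG."""
    if "TGG" not in a_to_i_products(codon):
        raise ValueError("TGG is not reachable from " + codon)
    return sum(1 for a, b in zip(codon, "TGG") if a != b)
```

TAG and TGA each need a single edit, while TAA needs both of its A's edited.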

10 Surprise 2: MicroRNAs and developmental control
Illustrates miR/miR* and miR/miR-AS cooperation

11 miR-iab-4AS leads to homeotic transformations
Figure labels: wing, haltere, WT, sense, antisense, sensory bristles, wing w/ bristles. Note: panels C, D, E are shown at the same magnification.
Mis-expression of mir-iab-4S and mir-iab-4AS: halteres → wings homeotic transformation.
Stronger phenotype for the AS miRNA.
Sense/anti-sense pairs as general building blocks for miRNA regulation.
9 new anti-sense miRNAs in mouse.
Stark et al, Genes & Development 2007

12 Computational challenges in biology
1. Genomic interpretation: decode the human genome; discover all functional elements → the building blocks
2. Cell circuitry: discover all control constructs; regulatory network properties → the interconnections
3. Evolutionary innovation: emergence of new functions; genome duplication → the dynamics

13 3: Sequence determinants of TF binding
Hundreds of proteins bind overlapping regions.
Regulatory motif analysis reveals sequence specificity.
Basis for understanding motif combinations and grammars.
Example: insulator proteins in Drosophila: CTCF (check), GAF (check), Su(Hw) (check), BEAF-32 (variant), CP190 (novel), Mod(mdg4) (novel).
Although insulator-bound regions overlap, each motif is specific to exactly one protein.

14 4: Regulatory network inference
ChIP-grade quality: similar functional enrichment; high sensitivity, high specificity.
Systems-level: 81% of transcription factors; 86% of microRNAs; 8k + 2k targets; 46k connections.
Lessons learned: pre- and post-transcriptional regulation are correlated (hi/hi, lo/lo); regulators are heavily targeted (feedback loops).
Sushmita Roy; Kheradpour et al, Genome Research, 2007

15 Chromatin signatures for genome annotation
Similarly to evolutionary signatures, chromatin signatures encode functional elements.
The difference: this information is dynamic.
The epigenetic code hypothesis: distinct combinations of marks encode distinct chromatin states.
Can we discover them de novo?

16 A multivariate HMM for understanding chromatin state
[Figure: a stretch of DNA with a transcription start site, an enhancer, and a transcribed region. Tracks show the observed histone modifications, the most likely hidden state at each position (states numbered 1 to 6), and the modifications that are highly likely in each state (probabilities of 0.7 to 0.9).]
Even though a modification was not observed at a position, we can still infer the correct state from neighboring locations, because a position is likely to be in the same state as its neighbors.
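
The idea that neighbors can rescue a position with missing marks is what Viterbi decoding of a hidden Markov model provides. Below is a minimal sketch, assuming that each mark is present or absent independently given the state; the two states, two marks, and all probabilities are invented toy values, not parameters from the talk.

```python
import math


def viterbi(observations, start, trans, emit):
    """Most likely state path for a sequence of binary mark vectors.

    observations: list of tuples of 0/1, one entry per mark
    start[s]: initial probability of state s
    trans[s][t]: probability of moving from state s to state t
    emit[s][m]: probability that mark m is present in state s
    All probabilities must lie strictly between 0 and 1 (logs are taken).
    """
    n_states = len(start)

    def log_emission(state, obs):
        return sum(math.log(p if seen else 1.0 - p)
                   for p, seen in zip(emit[state], obs))

    scores = [math.log(start[s]) + log_emission(s, observations[0])
              for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = [], []
        for t in range(n_states):
            best = max(range(n_states),
                       key=lambda s: scores[s] + math.log(trans[s][t]))
            new_scores.append(scores[best] + math.log(trans[best][t])
                              + log_emission(t, obs))
            pointers.append(best)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Invented toy model: state 0 favors mark 0, state 1 favors mark 1.
START = [0.5, 0.5]
TRANS = [[0.8, 0.2], [0.2, 0.8]]
EMIT = [[0.9, 0.1], [0.1, 0.9]]
MARKS = [(1, 0), (1, 0), (0, 0), (1, 0), (0, 1), (0, 1)]
```

In this toy input the third position has no observed marks, and both states explain an empty observation equally well, so the decoder assigns it the state of its neighbors: `viterbi(MARKS, START, TRANS, EMIT)` returns `[0, 0, 0, 0, 1, 1]`.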

17

18 Combine chromatin signatures and evolutionary signatures → New class of large non-coding RNAs (lincRNAs) in human
H3K4me3 - H3K36me3
Our experiments confirm:
These regions produce RNA molecules.
They have exon/intron structures.
They are evolutionarily conserved.
They show no coding potential and no protein-coding evolutionary signature.
Their promoters and regulation are conserved.
They play diverse roles in chromatin regulation.
Mikkelsen et al. 2007; Guttman et al. Nature, March 12, 2009 (online already)

19 Combine chromatin signatures and regulatory motifs → New developmental enhancers in human and fly
Visel, Pennacchio, Rubin, Ren, Nature 2008; Zeitlinger et al, Genes & Development 2007
Chromatin signatures and evolutionary signatures are predictive of enhancer elements.
Experimental techniques have been developed for inferring expression domains in human and fly.
Large-scale databases mapping every element to its expression pattern are emerging.
Ability to test new patterns and artificial elements in fly and mouse embryos.

20 The grand challenge ahead
Binding sites of every developmental regulator.
Sequence motifs for every regulator: GAF (check), Su(Hw) (check), BEAF-32 (variant), Mod(mdg4) (novel), CP190 (novel), CTCF (check).
Annotations and images for all expression patterns.
Expression domain primitives (dorsal-ventral, anterior-posterior) reveal underlying logic.
Understand the regulatory logic specifying development.

21 Going forward: Drosophila and Human ENCODE
Kevin White, Bing Ren, Jim Posakony: hundreds of sequence-specific factors; dozens of chromatin / histone modifications; dozens of tissues / stages / conditions.
Bernstein, Lander, Broad Institute: ChIP-seq for dozens of chromatin modifications; follow differentiation lineages (activation, inactivation); discover tissue-specific regulatory motifs.
Many open questions remain:
Regulatory logic and architecture within enhancer regions.
Sequence determinants of enhancer/promoter specificity.
Motif-based prediction of expression domains of enhancer elements.
Motif-based design of novel enhancers and developmental patterns.

22 Computational challenges in biology
1. Genomic interpretation: decode the human genome; discover all functional elements → the building blocks
2. Cell circuitry: discover all control constructs; regulatory network properties → the interconnections
3. Evolutionary innovation: emergence of new functions; genome duplication → the dynamics

23 Alignment of closely related species
Species: S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus
All genes are present, in the same order and orientation.
Speaker notes: When we take a closer look, the local gene order is conserved across all four genomes. It is easy to argue orthology in the presence of synteny and whole-genome disambiguation. Moreover, we can distinguish coding and non-coding regions (based on common signals and type of conservation) and analyze these separately. Hence we can align every intergenic region unambiguously based on flanking genes. (Maybe a quick mention of transposable elements, tRNA, rRNA, and their mobility or lack thereof.)

24 Distant species: Evolutionary mysteries
Figure labels: K. waltii. Few genes remain in 2 copies.
S. cerevisiae regions together account for all K. yarrowii genes: each K. yarrowii gene is present in one sister or the other (and sometimes both).
K. yarrowii genes account for both S. cerevisiae regions: S. cerevisiae genes from both sisters are present in K. yarrowii.
Gene order and transcriptional orientation are conserved.
Gene interleaving is evidence of complete duplication.
Speaker notes: Intro (previous slide). Now let's look closely at the correspondence of these three regions: the one copy from K. yarrowii and the two copies from S. cerevisiae. We observe a very intriguing pattern of gene correspondence. The indirect evidence from the duplicated genes is weak, since nearly 90% of duplicated genes were lost, but the mapping becomes unambiguous given the full gene interleaving. Look closely at these regions. They are both orthologous. Gene order is conserved across the entire segment for both copies in S. cerevisiae. In fact, we see an interesting phenomenon: each of the regions contains only a subset of the K. yarrowii genes, while their union contains the complete gene set.
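
The interleaving pattern described here can be phrased as a simple test: each sister region must list its genes in the ancestral order, and the two sisters together must cover every gene of the single-copy region. A minimal sketch, assuming genes have already been mapped to shared ortholog identifiers; the function names are hypothetical, and transcriptional orientation is not checked.

```python
def is_subsequence(candidate, reference):
    """True if candidate's items occur in reference in the same relative order."""
    remaining = iter(reference)
    # 'gene in remaining' advances the iterator, so order is enforced.
    return all(gene in remaining for gene in candidate)


def supports_duplication(single_copy_order, sister1, sister2):
    """Check the interleaving pattern: order kept in both sisters, union complete."""
    order_kept = (is_subsequence(sister1, single_copy_order)
                  and is_subsequence(sister2, single_copy_order))
    all_covered = set(sister1) | set(sister2) == set(single_copy_order)
    return order_kept and all_covered
```

For example, with a single-copy region `g1..g8`, sisters `[g1, g3, g4, g7]` and `[g2, g4, g5, g6, g8]` pass the check even though only `g4` survives in two copies.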

25 Evolution by whole-genome duplication
Figure labels: ancestral gene order; duplication; gene loss; duplicated lineage; single-copy lineage; 100 Myrs.
Yeast duplication: Kellis et al. Nature, Apr 8, 2004
Vertebrate duplication: Jaillon et al. Nature, Oct 21, 2004

26 Duplicate mapping over entire genome
[Figure: K. waltii chromosomes (Chr 1 to Chr 8), each shown alongside its two S. cerevisiae copies (Scer copy 1 and Scer copy 2).]

27 Duplicate mapping of chromosomes
Most duplicated genes are lost.
Speaker notes: This figure also illustrates our ability to detect ancestrally duplicated regions even in the absence of any remaining two-copy genes. This has allowed us to complete the duplication map of S. cerevisiae.

28 Whole-genome duplication results in 500 new genes
[Plot: number of genes over time, from 100 Myrs ago to today. 5,000 genes before the WGD; 10,000 right after the WGD; gene loss down to 5,500 today (~500 gained).]
Emergence of new functions?
Speaker notes: We can conclude that a WGD event has indeed occurred in the lineage of Saccharomyces cerevisiae, and we have resolved the controversy regarding its ancestry.
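
The gene counts on this slide tie in with the "nearly 90% of duplicated genes were lost" remark on slide 24. A small bookkeeping sketch using the round numbers from the plot; the function name is illustrative, and it assumes every present-day gene beyond the ancestral count is a retained duplicate.

```python
def duplication_bookkeeping(ancestral_genes, genes_today):
    """Gene counts implied by one whole-genome duplication followed by loss."""
    after_duplication = 2 * ancestral_genes
    copies_lost = after_duplication - genes_today
    genes_gained = genes_today - ancestral_genes
    return {
        "after_duplication": after_duplication,
        "copies_lost": copies_lost,
        "genes_gained": genes_gained,
        # Share of duplicated pairs that returned to a single copy.
        "fraction_of_pairs_lost": copies_lost / ancestral_genes,
    }
```

With 5,000 ancestral genes and 5,500 genes today, this gives 10,000 genes after duplication, 4,500 copies lost, 500 gained, and a loss fraction of 0.9.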

29 Emerging gene functions after duplication
Origin of replication → silencing (4-fold acceleration): Scer Sir3 (silencing); Scer Orc1 (origin of replication); Kwal Orc1.
Translation initiation → anti-viral defense (3-fold acceleration): Scer Ski7 (anti-viral defense); Scer Hbs1 (translation initiation); Kwal Hbs1.
Asymmetric divergence → recognize ancestral / derived.

30 Distinct properties of emerging functions
Gene deletion: ancestral function, lethal (20%); derived function, never lethal.
Expression: ancestral function, abundant; derived function, specific (stress, starvation).
Localization: ancestral function, general; derived function, mitochondrion, spores.
Gain new function and lose ancestral function.

31 Network evolution by duplication
Figure labels: time; pre-WGD network; modern network; lost; duplicate; network motif.
Scenario 1: ancestral network motifs, duplication, loss (-).
Scenario 2: duplication, gain (+), modern network motif.

32 Phylogenomics
Traditionally kept distinct: many species, one gene; many species, many genes; one species, many genes.
Traditional phylogenetics focused on uniform trees: any topology makes a good story.
Phylogenomics imposes additional constraints: gene trees evolve inside species trees.
Independent measure of phylogenetic accuracy.

33 Wide-spread errors in phylogenetic reconstruction
Phylogenies of 5,154 syntenic 1:1 orthologs.
[Figure: the correct phylogeny, several incorrect topologies, and 316 other (incorrect) topologies.]

34 Errors are due to lack of informative sites
[Plot: reconstruction accuracy vs. sequence length. Marked on the plot: typical gene length; 37% accuracy; recommended length for accurate species trees (concatenate 20 genes).]
Errors are due to a lack of informative sites (not methodology).
Accuracy increases with gene length.
The "recommended length" is unrealistic for gene trees.
Need additional information → study gene trees systematically.

35 Study recurring properties of phylogenetic trees
Study properties of correct trees: input the correct topology, fit branch lengths, study the resulting tree properties.
Total tree lengths in the examples shown: 0.56, 0.8, 1.0, 1.26, and 2.01 subs.
Total tree length varies widely: some genes are fast-evolving, some are slow-evolving.
But, importantly, branches are uniformly longer or shorter: branch lengths are highly correlated.

36 Gene-trees are highly correlated
Figure values: 1.00, 1.26, 0.80, 2.01, 0.56; average gene tree.
Correlations found:
Fast genes are fast in all species (gene-specific mutation rate).
Fast species are fast for all genes (species-specific mutation rate).
Gene trees are scaled versions of the average tree: gene tree = F * average tree, with a scaling factor F for each gene.
93% of gene trees show R > 0.8 to the average tree (histogram axis: correlation, 0.6 to 1.0).
Common property across all genes (genome-wide).
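
The relation "gene tree = F * average tree" suggests two simple quantities per gene: a scaling factor F and the correlation R between the gene's branch lengths and the average tree's. A minimal sketch; the talk does not say how F is fitted, so the least-squares fit through the origin used here is an assumption, as are the function names.

```python
import math


def scaling_factor(gene_branches, avg_branches):
    """Least-squares F in gene = F * average (fit through the origin)."""
    numerator = sum(g * a for g, a in zip(gene_branches, avg_branches))
    denominator = sum(a * a for a in avg_branches)
    return numerator / denominator


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of branch lengths."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Both functions expect branch lengths listed in the same branch order for the gene tree and the average tree; a gene whose branches are all exactly twice the average would give F = 2 and R = 1.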

37 The two forces of gene-family evolution
Mutation rate of gene j in species i = gene rate * species rate: $b_{ij} = F_j \cdot S_i$
1. Family rate: $F_j \sim \mathrm{Gamma}(a, b)$. Reflects selective pressures on gene function.
2. Species-specific rates: $S_i \sim \mathrm{Normal}(\mu_i, \sigma_i)$. Reflect population dynamics of the species.
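
The two-factor model can be read as a generative recipe: draw one family rate per gene, one rate per species, and multiply. A minimal sketch using Python's standard library; the species names and parameter values are invented, the gamma is taken in shape/scale form (the slide does not say which parameterization is meant), the second normal parameter is treated as a standard deviation, and negative draws are clamped to zero as a simplification.

```python
import random


def sample_branch_lengths(species_params, gamma_shape, gamma_scale, rng):
    """Draw one gene family's branch lengths b_i = F * S_i.

    species_params: dict mapping species name to (mean, sd) of its rate
    rng: a random.Random instance, so that results are reproducible
    """
    family_rate = rng.gammavariate(gamma_shape, gamma_scale)
    lengths = {}
    for species, (mean, sd) in species_params.items():
        # Clamp at zero: a rate cannot be negative (assumption of this sketch).
        species_rate = max(rng.gauss(mean, sd), 0.0)
        lengths[species] = family_rate * species_rate
    return family_rate, lengths


# Hypothetical species and parameters, for illustration only.
EXAMPLE_SPECIES = {"species_a": (0.10, 0.02), "species_b": (0.25, 0.05)}
```

Calling `sample_branch_lengths(EXAMPLE_SPECIES, 2.0, 0.5, random.Random(0))` returns the sampled family rate and a branch length per species; repeated calls simulate many gene families that share the same species-specific rates.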

38 Evaluation: Great increase in accuracy
Real data, syntenic regions: increasing number of species; both 1-to-1 and duplication sets → great increase in accuracy.
Simulated data: run the (generative) model; 1 dup event → many dup genes; method robust to dup/loss.

39 Computational challenges in biology
1. Genomic interpretation: decode the human genome; discover all functional elements → the building blocks
2. Cell circuitry: discover all control constructs; regulatory network properties → the interconnections
3. Evolutionary innovation: emergence of new functions; genome duplication → the dynamics

40 compbio.mit.edu where we are

41 compbio.mit.edu who we are
Jason Rachel Jessica Rogerio Sushmita Rogerio Manolis Pouya Abdoulaye Chris Loyal Daniel Mike Matt

