1
Whole-genome comparative genomics: Analyzing the human genome. 6.095/6.895 - Computational Biology: Genomes, Networks, Evolution. Lecture 21, Dec 6, 2005
2
Challenges in Computational Biology
[Course overview diagram] Topics shown: Genome Assembly, Gene Finding, Regulatory motif discovery, Database lookup, Gene expression analysis, Sequence alignment, Evolutionary Theory, Cluster discovery, Gibbs sampling, Protein network analysis, Emerging network properties, Regulatory network inference, Comparative Genomics, RNA folding
Diagram labels: DNA, RNA transcript
Example sequences: TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT
3
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA CATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC CGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT AGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA AAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG ATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG AACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA GCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA CTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA TAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT GGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAA GTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAA TGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGA TACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT 
TCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACAT TTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAA AGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAAT ACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTAC AACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATAT CAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCG TTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTC TTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATT AATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT TCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATA CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTA AGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGA TTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATA GTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATG CTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACT TAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGAT TGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT
4
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA CATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC CGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT AGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA AAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG ATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG AACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA GCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA CTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA TAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT GGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAA GTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAA TGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGA TACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT 
TCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACAT TTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAA AGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAAT ACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTAC AACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATAT CAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCG TTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTC TTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATT AATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT TCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATA CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTA AGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGA TTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATA GTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATG CTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACT TAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGAT TGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATTT Promoter motifs 3’ UTR motifsExons Introns
5
Comparing genomes reveals functional elements Ultra-conserved elements Protein-coding genes Short regulatory motifs
6
Extensive sequencing of mammalian tree
Clades: Monotremata, Marsupialia, Afrotheria, Xenarthra, Euarchontoglires, Laurasiatheria
Species shown: platypus, opossum, sloth, anteater, armadillo, hedgehog, shrew, mole, phyllostomid microbat, microbat (brown bat), false vampire bat, flying fox, megabat (horseshoe bat), whale, dolphin, hippo, cow, pig, llama, horse, rhino, tapir, cat, dog, pangolin, squirrel, mouse, rat, hystricid, guinea pig, rabbit, pika, tree shrew, tree shrew urogale, flying lemur variegatus, flying lemur volans, lemur, mouse lemur, galago, bushbaby, tarsier bancanus, tarsier syrichta, spider monkey, goeldi monkey, marmoset, macaque, baboon, vervet, human, chimpanzee, gorilla, orangutan, gibbon, tenrec, golden mole, short eared elephant shrew, long eared elephant shrew, aardvark, sirenian, hyrax, elephant
Legend: Black - complete 8X; Red - 2x sequencing (elephant, armadillo, rabbit, bat, tenrec, shrew, cat, hedgehog)
Average extra branch length: 0.2 subs/site
7
Hidden Markov Models for gene finding
8
Modeling biological sequences
Ability to emit DNA sequences of a certain type
–Not exact alignment to previously known gene
–Preserving ‘properties’ of type, not identical sequence
Ability to recognize DNA sequences of a certain type (state)
–What (hidden) state is most likely to have generated observations
–Find set of states and transitions that generated a long sequence
Ability to learn distinguishing characteristics of each state
–Training our generative models on large datasets
–Learn to classify unlabelled data
States shown: Intergenic, CpG island, Promoter, First exon, Intron, Other exon, Intron
Example sequence: GGTTACAGGATTATGGGTTACAGGTAACCGTTGTACTCACCGGGTTACAGGATTATGGGTTACAGGTAACCGGTACTCACCGGGTTACAGGATTATGGTAACGGTACTCACCGGGTTACAGGATTGTTACAGG
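The "recognize" task above (which hidden states most likely generated the observations) is usually solved with the Viterbi algorithm. The following is a minimal sketch, not the model used by any gene finder named in this lecture: the two states (`AT`, `GC`) and all probabilities are hypothetical values chosen only to make the example readable.

```python
import math

# Hypothetical two-state model: an AT-rich state and a GC-rich state.
STATES = ["AT", "GC"]
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1},
         "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {"AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a non-empty sequence."""
    # Work in log space to avoid underflow on long sequences.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
              for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this position.
            prev = max(states,
                       key=lambda r: scores[r] + math.log(trans_p[r][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("AAAATTGGGGCC", STATES, START, TRANS, EMIT))
```

A real gene-finding HMM would use many more states (exon, intron, splice sites, and so on) and parameters trained on annotated sequence, but the decoding step has the same shape.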
9
HMM-based Gene Finding GENSCAN (Burge 1997) FGENESH (Solovyev 1997) HMMgene (Krogh 1997) GENIE (Kulp 1996) GENMARK (Borodovsky & McIninch 1993) VEIL (Henderson, Salzberg, & Fasman 1997) TWINSCAN (Brent 2001) NSCAN (Brent 2005)
10
VEIL: Viterbi Exon-Intron Locator
Contains 9 hidden states or features
Each state is a complex internal Markovian model of the feature
Features:
–Exons, introns, intergenic regions, splice sites, etc.
Exon HMM Model
–Enter: start codon or intron (3’ Splice Site)
–Exit: 5’ Splice site or three stop codons (taa, tag, tga)
VEIL Architecture [diagram] States: Upstream, Start Codon, Exon, Stop Codon, Downstream, 3’ Splice Site, Intron, 5’ Poly-A Site, 5’ Splice Site
11
Genie
Uses a generalized HMM (GHMM)
Edges in model are complete HMMs
States can be any arbitrary program
States are actually neural networks specially designed for signal finding
State labels: J5’ – 5’ UTR; EI – Initial Exon; E – Exon, Internal Exon; I – Intron; EF – Final Exon; ES – Single Exon; J3’ – 3’ UTR
Signals: Begin Sequence, Start Translation, Donor splice site, Acceptor splice site, Stop Translation, End Sequence
12
Genscan Overview
Developed by Chris Burge (Burge 1997)
Based on a general probabilistic model of genomic sequence composition and gene structure
Characteristics:
–Designed to predict complete gene structures: introns and exons, promoter sites, polyadenylation signals
–Incorporates: descriptions of transcriptional, translational and splicing signals; length distributions (Explicit State Duration HMMs); compositional features of exons, introns, intergenic, C+G regions
–Larger predictive scope: deals with partial and complete genes; multiple genes separated by intergenic DNA in a sequence; consistent sets of genes on either/both DNA strands
13
Genscan Architecture It is based on Generalized HMM (GHMM) Model both strands at once –Other models: Predict on one strand first, then on the other strand –Avoids prediction of overlapping genes on the two strands (rare) Each state may output a string of symbols (according to some probability distribution). Explicit intron/exon length modeling Special sensors for Cap-site and TATA-box Advanced splice site sensors Fig. 3, Burge and Karlin 1997
14
GenScan States
N - intergenic region
P - promoter
F - 5’ untranslated region
E_sngl – single exon (intronless) (translation start -> stop codon)
E_init – initial exon (translation start -> donor splice site)
E_k – phase k internal exon (acceptor splice site -> donor splice site)
E_term – terminal exon (acceptor splice site -> stop codon)
I_k – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
15
Classification-based Gene finding Mike Lin
16
Gene identification
DNA: TTACGGTACCGCTATACCCGAACGTCTAATAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
Protein: MTKSHSEEVIVPEFK
Intuition
–Genes are translated in units of 3 nucleotides (codons): every DNA strand can be translated in 3 reading frames; insertions and deletions may cause frame-shifts
–Selective pressure on the amino-acid translation: silent substitutions tolerated; codons for similar amino-acids frequently exchanged
Method
–Observe patterns of nucleotide change in genes / intergenic regions
–Develop signatures / tests to discriminate between the two
–Validate tests with known genes / intergenic regions
–Use them to revisit the yeast and human genomes
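To make the "units of 3 nucleotides" intuition concrete, here is a small sketch that translates a DNA string in any of the three forward reading frames using the standard genetic code. The function name `translate` and the use of `X` for unrecognized codons are choices made for this illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate complete codons of `dna` starting at offset `frame` (0, 1, 2)."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X marks codons with non-ACGT letters
        for i in range(frame, len(dna) - 2, 3)
    )


if __name__ == "__main__":
    # Coding portion of the slide's sequence, starting at the ATG.
    orf = "ATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTC"
    for frame in range(3):
        print(frame, translate(orf, frame))
```

Only frame 0 reproduces the start of the protein shown on the slide; a one-base insertion or deletion would shift every downstream codon into another frame.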
17
Gene identification Study known genes Derive conservation rules Discover new genes
18
Overall conservation vs. signatures of divergence
Not a gene
–Region of perfect/near-perfect non-coding conservation
–Scores very well with HMM approaches, ExoniPhy, N-Scan, which measure general levels of local nucleotide conservation
Real gene
–Mutations do occur, consistent with constraints under which genes evolve
–Insertions preserve reading frame. Mutations preserve amino-acid function
Quantify and capture these constraints computationally
Alignment with substitutions and gaps in multiples of three:
human TGC---CCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCACGTGACGTGGCTG---TGGCAGCGGCAGCTAAAAAAGAGCTTAAGTAT
rat   TGCCAGCCACGCGACGTGGCCG---TGGCAGCAGCCGCTAAAAAGGAACTTAAGTAC
dog   TGCCAGCCACGCGAGGTGGCGG---------CTGCGGCCAAGAAAGAGCTCAAGTAC
Alignment with perfect conservation:
human TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
rat   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
dog   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
      *********************************************************
19
Signature 1: Reading frame conservation
              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold
20
Signature 2: Distinct patterns of codon substitution
[Two codon substitution matrices, Genes and Intergenic; axes: codon observed in species 1 vs. codon observed in species 2]
Codon substitution patterns specific to genes
–Genetic code dictates substitution patterns
–Amino acid properties dictate substitution patterns
21
Evaluating reading frame conservation (RFC)
Scer    CTTCTAGATTTTCATCTT-GTCGATGTTCAAACAACGTGTTA-----TCAGAGAAACAGCTCTATGAGAAATCAGCTGATG
Scer_f1 123123123123123123-12312312312312312312312-----3123123123123123123123123123123123
Spar    TATTCATA-TCTCATCTTCATCAATGTTCAAACAGCGTGTTACAGACACAGAGAAACAGCTTC-TGAGAAGTCAGCCGGTG
Spar_f1 12312312-312312312312312312312312312312312312312312312312312312-31231231231231231   RFC 43%
Spar_f2 23123123-123123123123123123123123123123123123123123123123123123-12312312312312312   RFC 34%
Spar_f3 31231231-231231231231231231231231231231231231231231231231231231-23123123123123123   RFC 23%
Figure annotations: window percentages 100%, 60%, 90%, 40%, 60%, 100%, 30%, 56%, 100%; frame labels F1 F2 F1 F2 F3
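The frame-labelling idea on this slide can be sketched as follows: number the bases of each aligned sequence by codon position, try the three possible frame offsets for the second species, and report the fraction of ungapped columns where the two codon positions agree. This is a simplified whole-alignment version written for illustration; the windowing suggested by the per-window percentages on the slide is not reproduced, and the function name is hypothetical.

```python
def reading_frame_conservation(aln_a, aln_b):
    """Return three fractions, one per frame offset of aln_b.

    aln_a and aln_b are equal-length aligned strings with '-' for gaps.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    fractions = []
    for offset in range(3):
        pos_a = pos_b = 0      # number of bases consumed in each sequence
        agree = ungapped = 0
        for a, b in zip(aln_a, aln_b):
            if a != "-" and b != "-":
                ungapped += 1
                if pos_a % 3 == (pos_b + offset) % 3:
                    agree += 1
            if a != "-":
                pos_a += 1
            if b != "-":
                pos_b += 1
        fractions.append(agree / ungapped if ungapped else 0.0)
    return fractions


if __name__ == "__main__":
    # A single-base gap shifts the frame for everything downstream of it.
    print(reading_frame_conservation("ATGAAACCCGGG", "ATG-AACCCGGG"))
```

A gap whose length is a multiple of three leaves the downstream columns in the same frame, so a region evolving under coding constraints tends to keep one offset dominant.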
22
Evaluating the codon substitution score (CSM)
p X/Y = P(human codon X aligns to mouse codon Y in genes)
q X/Y = P(human codon X aligns to mouse codon Y outside genes)
Filling in the CSM (values ×10^-5; axes labeled Mouse and Human):
        AAA/K  AAG/K  AAC/N  AAT/N  AGA/R  AGG/R  ...  TAA/X
AAA/K    1552    608     12      8     74     26           0
AAG/K     423   2531     11      9     23     73           0
AAC/N       8     13   1368    331      1      1           0
AAT/N       8     12    444   1007      2      1           0
AGA/R      44     22      1      1    664    178           0
AGG/R      15     72      1      1    148    594           0
Scoring an aligned region:
human CTGTTTTTCCCCTTTTGTAGGAAGTCAC
mouse CTGTTTTTCCTCTTTTGTAGTAAGTCAC
Terms shown for the differing codons: p CCC/CTC, q CCC/CTC, p AGG/AGT, q AGG/AGT
Coding Score = [formula not preserved]
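The scoring formula itself did not survive in this transcript. One common way to combine per-codon probabilities of this kind is a sum of log-odds ratios, log(p/q), over aligned codon pairs; the sketch below assumes that form and should not be read as the lecture's exact definition. The probability values in the example are hypothetical, and `floor` is an illustrative guard for codon pairs missing from the tables.

```python
import math


def coding_score(seq_a, seq_b, p, q, off_diagonal_only=True, floor=1e-9):
    """Sum log(p/q) over aligned codon pairs of two ungapped, in-frame sequences.

    p and q map (codon_a, codon_b) to probabilities inside and outside genes.
    With off_diagonal_only=True, identical codon pairs are skipped.
    """
    score = 0.0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        pair = (seq_a[i:i + 3], seq_b[i:i + 3])
        if off_diagonal_only and pair[0] == pair[1]:
            continue
        score += math.log(p.get(pair, floor) / q.get(pair, floor))
    return score


if __name__ == "__main__":
    human = "CTGTTTTTCCCCTTTTGTAGGAAGTCAC"
    mouse = "CTGTTTTTCCTCTTTTGTAGTAAGTCAC"
    # Hypothetical probabilities for the two codon pairs that differ.
    p = {("CCC", "CTC"): 0.002, ("AGG", "AGT"): 0.001}
    q = {("CCC", "CTC"): 0.001, ("AGG", "AGT"): 0.004}
    print(coding_score(human, mouse, p, q))
```

A positive total means the observed substitutions look more like those seen in genes than outside them; a negative total means the opposite.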
23
Multiple levels of selection
[Two codon substitution matrices, Genes and Intergenic; axes: codon observed in species 1 vs. codon observed in species 2]
Multi-level information
–All positions: overall conservation
–Exclude conserved triplets: amino-acid sequence
–Exclude conserved amino-acids: amino-acid properties
24
Effect of using only off-diagonal CSM positions
Using full CSM matrix: “Is it conserved like a coding gene?” False positives
Using only off-diagonal positions: “Has it diverged like a coding gene?” No false positives
Plots: CSM coding score for human/mouse (x-axis) and human/dog (y-axis) in CFTR region
25
Putting it all together: ExoClass gene finder Train Support Vector Machine (SVM) classifier –Reading Frame Conservation (RFC) score –Codon Substitution Matrix (CSM) coding score –Splice signal conservation, ESEs, ESIs –Exon length, conservation boundaries Apply it systematically to all candidate intervals Use full gene model constraints for post-processing
26
Results in yeast
                           Accept   Reject
~4000 named genes          99.9%    0.1%
~300 intergenic regions    1%       99%
2000 Hypothetical ORFs     1500     500
High sensitivity and specificity
Species compared: Spar, Smik, Sbay, Scer
6235 annotated genes: 528 deleted, 43 novel, 280 boundary changes, 34 merged
5695 ‘real’ genes
27
Results in human ENCODE regions (Human/Mouse)
High nucleotide sensitivity and specificity
–Increases with additional species (with some caveats)
‘Missed’ exons due to:
–Sequencing / assembly / alignment problems
–Rapidly evolving genes: Immunity and olfactory families
‘Wrong’ exons due to:
–Novel exons, Novel exons, Novel exons
–Existing evidence: human / non-human spliced mRNAs
–New evidence: validated using specific RT-PCR (with MGC)
            Nucl Sn  Nucl Sp  Exon Sn  Exon Sp  Missed  ‘Wrong’  w/evidnc
GENSCAN        85       62       67       49      17       39       17
TWINSCAN       77       88       66       79      26       11       25
SGP2           84                72       69      18       20       24
Exoniphy       73       88       57       67      26       10       53
ExoClass       86       87       73       75      17       14       37
28
Examples in the human Example 1: New gene Example 2: Deleted gene
29
Example 3: Changed exons
30
Initial results for the whole human genome
Species compared: Human, Mouse, Rat, Dog
–9862 fully confirmed
–7,717 refined
–1065 fully rejected
–454 novel (2591 exons)
–1,919 not aligned
Fully rejected genes typically have only weak evidence
New exons often supported by existing experimental evidence
RT-PCR validation of 90 fully novel genes: 50 confirmed
31
Experimental validation Select novel predictions with highest specificity –Unique in the genome –No pseudogenes –Absolutely no previous experimental evidence Results –June 2005: 454 genes 90 entirely novel –RT-PCR validation for specific exon splicing –50 fully validated using pooled tissues New validation set –Top of the list: 354 genes, 1162 exons –… and many more (gene families, lower scores)
32
Gene Identification: Summary Exon-centric approach –Identify discriminating variables –Observed distinct patterns of nucleotide change –Systematically identify all exons in the genome –Use gene structure constraints to link them Application –High sensitivity and specificity (~90%) –More powerful than experimental methods –Largest reannotation of the yeast genome –Reannotation of the human gene set
33
Regulatory Motif Discovery Xiaohui Xie
34
Regulatory Motif Discovery
[Figure: GAL1 promoter region upstream of ATGACTAAATCTCATTCAGAAGAAGTGA; labels: CCCCWCGGCCG, Gal4, Mig1, CGGCCG, Gal4]
Gene regulation
–Genes are turned on / off in response to changing environments
–No direct addressing: subroutines (genes) contain sequence tags (motifs)
–Specialized proteins (transcription factors) recognize these tags
What makes motif discovery hard?
–Motifs are short (6-8 bp), sometimes degenerate
–Can contain any set of nucleotides (no ATG or other rules)
–Act at variable distances upstream (or downstream) of target gene
35
Regulatory Motif Discovery Study known motifs Derive conservation rules Discover novel motifs
36
Known motifs are preferentially conserved
human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG
Motifs marked on the alignment: Gabpa, Err
Is this enough to discover motifs? No.
37
Known motifs are frequently conserved
Across the human promoter regions, the Err motif:
–appears 434 times
–is conserved 162 times (Human, Dog, Mouse, Rat)
Conservation rate: 37%
Compare to random control motifs
–Conservation rate of control motifs: 6.8%
–Err enrichment: 5.4-fold
–Err p-value < 10^-50 (25 standard deviations under binomial)
Motif Conservation Score (MCS)
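The "25 standard deviations under binomial" figure can be checked directly from the numbers on this slide using the normal approximation to the binomial: with n occurrences, k conserved, and a control conservation rate r, the score is (k - nr) / sqrt(nr(1 - r)). The function name below is chosen for this sketch.

```python
import math


def conservation_z_score(conserved, total, basal_rate):
    """Standard deviations by which `conserved` exceeds the binomial expectation."""
    expected = total * basal_rate
    std_dev = math.sqrt(total * basal_rate * (1.0 - basal_rate))
    return (conserved - expected) / std_dev


if __name__ == "__main__":
    # Err motif: 434 occurrences, 162 conserved, control rate 6.8%.
    z = conservation_z_score(162, 434, 0.068)
    print(f"conservation rate = {162 / 434:.1%}, z = {z:.1f}")
```

With these inputs the score comes out at roughly 25, matching the slide.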
38
Use MCS to discover new motifs
MCS distribution of all 6-mers shows excess conservation
–High scoring patterns include known motifs
–Excess specific to promoters and 3’-UTRs (not introns)
–For MCS > 6, estimate 97% specificity
Select motifs with MCS > 6.0, cluster
Plot: motif density vs. Motif Conservation Score (MCS)
39
Hill-climbing in sequence space Seed selection –Three mini-motif conservation criteria (CC1, CC2, CC3) Motif extension –Non-random conservation of neighbors Motif collapsing –Merge neighbors using hierarchical clustering, avg-max-linkage Re-scoring complex motifs –Motif conservation score for full motifs (MCS)
40
Test 1: Intergenic conservation Total count Conserved count CGG-11-CCG
41
Test 1: Selecting mini-motifs
Estimate basal rate of conservation
–Expected conservation rate at the evolutionary distances observed
–Average conservation rate of non-outlier mini-motifs
Score conservation of mini-motif
–k: conserved motif occurrences
–n: total motif occurrences
–r: basal conservation rate
–Evaluate binomial probability of observing k successes out of n trials
Assign z-score to each mini-motif
–Bulk of distribution is symmetric
–Estimate specificity as (R-L)/R
–Select cutoff: 5.0 sigma
–1190 mini-motifs, 97.5% non-random
Plot labels: conservation rate r, N, binomial score, right tail, left tail, specificity, cutoff
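The (R-L)/R estimate relies on the symmetry argument above: if the bulk of the z-score distribution is symmetric noise, the count L in the left tail below -cutoff estimates how many of the R scores above +cutoff are noise as well. A minimal sketch, with hypothetical z-scores and a function name chosen for illustration:

```python
def tail_specificity(z_scores, cutoff):
    """Estimate the fraction of right-tail scores that are non-random.

    R = number of scores >= cutoff, L = number of scores <= -cutoff.
    Returns (R - L) / R, or 0.0 when the right tail is empty.
    """
    right = sum(1 for z in z_scores if z >= cutoff)
    left = sum(1 for z in z_scores if z <= -cutoff)
    if right == 0:
        return 0.0
    return (right - left) / right


if __name__ == "__main__":
    toy_scores = [-6.0, -1.0, 0.1, 2.0, 5.5, 6.0, 7.0, 8.0]  # hypothetical
    print(tail_specificity(toy_scores, cutoff=5.0))
```

In this toy input four scores exceed the cutoff and one falls below its mirror image, giving an estimate of 0.75.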
42
Test 2: Intergenic vs. Coding Coding Conservation Intergenic Conservation CGG-11-CCG Higher Conservation in Genes
43
Test 3: Upstream vs. Downstream CGG-11-CCG Downstream motifs? Most Patterns Downstream Conservation Upstream Conservation
44
Constructing full motifs
[Pipeline diagram] Test 1, Test 2, Test 3; 2,000 mini-motifs; Extend; Collapse; Merge; 72 full motifs
Example patterns in figure: CTA CGA R R CTGRC CGAA ACCTGCGAACTGRCCGAACTRAY CGAA Y
45
Extending mini-motifs
Separate conserved and non-conserved instances
–Causal set: CTACGA (gap 6), extended as CTACGARGW
–Random set: CTxxGA (gap 6), extended as CTxxGAYHS
Find maximally discriminating neighborhood (counts N1, N2, M1, M2)
Evaluate non-randomness of neighborhood
–chi-square contingency test on [N1,M1], [N2,M2]
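The contingency test named on this slide can be sketched for a 2x2 table of counts. The slide does not spell out exactly what N1, M1, N2 and M2 count, so the sketch treats them generically as the two rows of the table; the example counts are hypothetical. It uses the closed form chi2 = n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)) without continuity correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a zero row or column carries no information
    return n * (a * d - b * c) ** 2 / denominator


if __name__ == "__main__":
    # Hypothetical counts: a neighboring base is seen next to 30 of 40
    # instances in one set and next to 20 of 60 instances in the other.
    print(round(chi_square_2x2(30, 10, 20, 40), 2))
```

With one degree of freedom, values above about 3.84 correspond to p < 0.05, so a neighborhood position with a statistic this large would count as non-random in this toy setting.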
46
Collapsing similar motifs Motif similarity: sequence and genomic positions –Motifs share similar sequences, count bits in common –Motifs appear conserved in similar sets of regions Regions with motif 2 Regions with motif 1 Regions containing both motifs Collapsing: Hierarchical clustering –Sort the order of joins by decreasing similarity –Average max-linkage cluster similarity score
47
Systematically test candidate patterns
Pipeline: all potential motifs; evaluate MCS; cluster similar motifs
Enumerate
–Length between 6 and 15 nt, allow central gap
–11 letter alphabet (A C G T, 2-fold codes, N)
Score
–Compute binomial score (conserved vs. total)
–Select MCS > 6.0 (specificity 97%)
Cluster
–Sequence similarity
–Overlapping occurrences
Example pattern in figure: GTC AGT R R Y gap S W
Result: 174 motifs in promoters, 106 motifs in 3’ UTRs
Are these real?
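The 11-letter alphabet corresponds to the four bases, the six two-fold IUPAC codes (R, Y, M, K, S, W) and N. The sketch below shows how a degenerate pattern, optionally with a central gap written as a run of N, can be matched against a sequence. The function names are hypothetical, and counting conserved versus total occurrences across genomes is left out.

```python
# IUPAC codes used by the 11-letter alphabet on the slide.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "N": "ACGT",
}


def motif_matches(motif, site):
    """True if every base of `site` is allowed by the code at that position."""
    return len(motif) == len(site) and all(
        base in IUPAC[code] for code, base in zip(motif.upper(), site.upper())
    )


def find_motif(motif, sequence):
    """Return 0-based start positions where `motif` matches `sequence`."""
    width = len(motif)
    return [
        i for i in range(len(sequence) - width + 1)
        if motif_matches(motif, sequence[i:i + width])
    ]


def gapped(left, gap, right):
    """Build a pattern such as CGG-11-CCG as CGG + 11 N + CCG."""
    return left + "N" * gap + right


if __name__ == "__main__":
    print(find_motif("RCGCAnGCGY", "TTACGCATGCGCAA"))
    print(find_motif(gapped("CGG", 11, "CCG"), "ACGG" + "A" * 11 + "CCGT"))
```

Lower-case letters such as the `n` in RCGCAnGCGY are upper-cased before lookup, so patterns can be written the way they appear in the motif table.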
48
Functions of discovered motifs
49
Evidence of motif function
Promoter motifs:
(1) Comparison to known motifs
(2) Distance from TSS
(3) Expression enrichment
[Gene diagram: Promoter (174 motifs), ATG, Stop, 3’-UTR (106 motifs)]
50
(1) Promoter motifs match known TF binding sites
Compare discovered motifs to TRANSFAC database of 125 known motifs
–55% of TRANSFAC motifs match discovered motifs
–45% of discovered motifs match TRANSFAC motifs (only 2% of control sequences match TRANSFAC motifs)
51
(2) Promoter motifs show preferred distance to TSS
Discovered motifs occur preferentially within 200 bp of Transcription Start Site
Individual motifs show strong peaks, regardless of conservation
32% of discovered motifs show strong positional bias
Figure: distance from TSS for each of 174 discovered motifs, showing motif instances in human and conserved motif sites in all four species; labeled examples Motif 8 and Motif 4, with peaks marked at -81 and -63
52
(3) Promoter motifs enriched in specific tissues
70% of motifs show significant enrichment in at least one tissue
Figure panels: New motifs, Known TFs
53
Summary for promoter motifs
Rank | Discovered motif | Known TF motif | Tissue enrichment | Distance bias
1 | RCGCAnGCGY | NRF-1 | Yes
2 | CACGTG | MYC | Yes
3 | SCGGAAGY | ELK-1 | Yes
4 | ACTAYRnnnCCCR | New | Yes
5 | GATTGGY | NF-Y | Yes
6 | GGGCGGR | SP1 | Yes
7 | TGAnTCA | AP-1 | Yes
8 | TMTCGCGAnR | New | Yes
9 | TGAYRTCA | ATF3 | Yes
10 | GCCATnTTG | YY1 | Yes
11 | MGGAAGTG | GABP | Yes
12 | CAGGTG | E12 | Yes
13 | CTTTGT | LEF1 | Yes
14 | TGACGTCA | ATF3 | Yes
15 | CAGCTG | AP-4 | Yes
16 | RYTTCCTG | C-ETS-2 | Yes
17 | AACTTT | IRF1(*) | Yes
18 | TCAnnTGAY | SREBP-1 | Yes
19 | GKCGCn(7)TGAYG | New | Yes
20 | GTGACGY | E4F1 | Yes
21 | GGAAnCGGAAnY | New | Yes
22 | TGCGCAnK | New | Yes
23 | TAATTA | CHX10 | Yes
24 | GGGAGGRR | MAZ | Yes
25 | TGACCTY | ERRA | Yes
174 promoter motifs
–70 match known TF motifs
–115 expression enrichment
–60 show positional bias
–75% have evidence
Control sequences
–< 2% match known TF motifs
–< 5% expression enrichment
–< 3% show positional bias
–< 7% false positives
Most discovered motifs are likely to be functional
54
What about 3’-UTR motifs?
–Sequence properties of 3’-UTR motifs
–Regulatory roles of 3’-UTR motifs
[Gene diagram: TSS, promoter (174 motifs), ATG, Stop, 3’-UTR (106 motifs)]
55
Directionality of 3’-UTR motifs
Promoter motifs: also conserved on reverse strand (DNA level – both strands are available)
3’-UTR motifs: NOT conserved on reverse strand (RNA level – only coding strand is available)
3’-UTR motifs likely to act post-transcriptionally
Plot axes: forward strand conservation vs. reverse strand conservation, for promoter motifs and 3’-UTR motifs
56
What are microRNAs (miRNAs)? Endogenous small non-coding RNA ~22nt in length Located in genomic loci that can produce fold-back structures Often conserved (but conservation may not be required)
57
miRNA and siRNA miRNA gene/miRNA host gene Double stranded RNA formation POH 5’3’ RISC Complex
58
miRNA & siRNA as Negative Regulators of Gene Expression miRNA siRNA lin-14 mRNA lin-4 RNA, 22 nt mRNA Near Perfect Match Degradation of Target Partial Match Inhibition of Translation Degradation of Target Chromosomal Silencing Off-Target Effect
59
Properties of microRNA genes (miRNAs)
Small non-coding RNA genes involved in post-transcriptional regulation
Repress target genes via loose sequence complementarity
Figure labels: DNA; ~100 nt precursor; ~50 nt stem loop structure; cleaved; ~22 nt miRNA gene; protein-coding gene; 3’-UTR; miRNA
Properties similar to the motifs we have discovered
Properties of 3’-UTR motifs
–Enriched in motifs of length 8
–75% end with nucleotide ‘A’
Sequence properties of miRNAs
–Near-perfect complement to 7-mer seed
–Many miRNAs start with ‘U’
60
3’-UTR motif properties
(2) Length distribution
–Enriched in motifs of length 8
(3) Sequence composition
–75% end with nucleotide A
Have we in fact discovered targets of microRNA genes?
61
Compare 8-mer sequence to known miRNAs
Compare 8-mer motifs against all 207 known miRNAs
–72 discovered 8-mers match 44% of known miRNA genes (72 control sequences only match 5%)
–Specifically, 8-mers match 5’-end of miRNA in 95% of cases
Plot: position in miRNA where 8-mers match
8-mer motifs are likely miRNA targets
62
Using 8-mers to discover novel miRNA genes
Novel miRNA genes show deep evolutionary conservation
Conserved much further than mammalian lineage
63
Can we use 8-mers to discover miRNA genes?
8-mer motif and miRNA complement: TTGCATATATATGCAA
Figure labels: inferred miRNA; conserved stem loop; 5’ end; 3’ end
Aligned stem-loop sequences from the four species:
ACGGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTACTTCATACTAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCGGCC
ACGAGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTACTTCATACTAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCTGGCC
GCAGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTATTTTATACTA--AGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCCGCC
GCCGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTATTTCATA-TAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCAGCC
258 stem loops discovered
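The search idea on this slide, looking for the complement of a conserved 3’-UTR 8-mer inside a candidate stem loop, can be sketched with a reverse-complement lookup. The sketch works on DNA letters only, as the slide does; the stem-loop folding and conservation checks that a real miRNA search would need are not shown, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def find_complement_sites(motif, sequence):
    """Return 0-based positions where the motif's reverse complement occurs."""
    target = reverse_complement(motif)
    sequence = sequence.upper()
    hits, start = [], sequence.find(target)
    while start != -1:
        hits.append(start)
        start = sequence.find(target, start + 1)
    return hits


if __name__ == "__main__":
    # Fragment of the first stem-loop sequence shown on the slide.
    fragment = "GGCTGGTTGCATATGTAGGATG"
    print(reverse_complement("ATATGCAA"), find_complement_sites("ATATGCAA", fragment))
```

The reverse complement of ATATGCAA is TTGCATAT, which is the first half of the TTGCATATATATGCAA string on the slide and also occurs in each of the four aligned stem-loop sequences.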
64
Properties of discovered miRNA genes
258 candidate miRNA genes discovered
–114 correspond to known miRNA genes (of 222)
–144 novel candidate miRNA genes
Experimentally tested 12 representative novel miRNAs
–Specifically tested for expression of inferred 22mer using RT-PCR
–Pooled small RNAs from 10 adult human tissues
–6 of 12 found to be expressed with predicted structure in adults (developmental tissues may contain additional miRNA genes)
Many of the discovered miRNA genes are likely to be real
Figure labels: ATATGCAA 8-mer motif; discovered miRNA gene
65
Two classes of miRNA genes. 222 known miRNA genes: 114 re-discovered, 108 missed. Re-discovered: conserved 8-mers, many targets (~150 targets), slowly evolving. Missed: no 8-mers, few targets (~5 targets), rapidly evolving (5-fold higher mutation rate). Axes: number of targets, number of mutations. Evolutionary constraint: co-evolution of miRNA genes and their targets?
66
How many targets do miRNA genes regulate? Figure labels: 8-mer motif ATATGCAA; miRNA gene; inferred 3’-UTR targets. What fraction of conserved 8-mers are true miRNA targets? –40% of genes contain at least one discovered 8-mer –(vs. 25% for appropriate control 8-mers). P(conserved) = P(conserved | real) · P(real) + P(conserved | not real) · P(not real); 40% = 1 · p + ¼ · (1 - p), so p = 20%. ~20% of genes are targeted by miRNAs. Extraordinary importance of miRNA regulation.
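The estimate p = 20% follows from solving the mixture equation on the slide for p. The intermediate steps, written out with the slide's own numbers (40% observed, 25% control rate, and P(conserved | real) taken as 1):

```latex
\begin{align*}
0.40 &= 1 \cdot p + \tfrac{1}{4}\,(1 - p) \\
0.40 - 0.25 &= p - 0.25\,p \\
0.15 &= 0.75\,p \\
p &= 0.20
\end{align*}
```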
67
3’-UTR motifs and post-transcriptional regulation. 46 motifs are 8-mer associated: targets of microRNAs. 60 motifs left: targets of RNA-binding proteins. Several noteworthy examples –AATAAA: Poly-A signal –6 AT-rich elements: mRNA stability and degradation –24 TGTA-rich elements: mRNA localization (PUF-family) –29 other, potential targets of RNA-binding proteins. Figure: motif length for 8-mer associated vs. other 3’-UTR motifs. May help systematic study of post-transcriptional regulation.
68
Summary: Regulatory motif discovery. Systematic discovery of regulatory motifs in the human genome: frequently occurring, strongly conserved short regulatory signals. Gene diagram labels: TSS, ATG, Stop, 3’-UTR. Promoters: 174 promoter motifs; 70 match known TF motifs; 115 show expression enrichment; 60 show positional bias. 3’-UTR: 106 motifs in 3’-UTR; strand specific; 8-mers are miRNA-associated; mRNA localization and stability. miRNA regulation: discovered 8-mers (ATATGCAA); 114 known + 144 new miRNA genes; target ~20% of human 3’-UTRs.
69
Towards human regulatory networks Global motif co-occurrence map Reveal co-operating regulators Initial network of master regulators Reveal hubs, cascades, network motifs From sequence-based discovery to dynamic models Ste12 Tec1 CBF1 Met31 Gcn4 Leu3 rESR1Abf1 rESR2 Gcr1 Msn2
70
Motifs outside promoters and 3’-UTRs
71
Extract conserved regions in the human genome. Procedure for generating conserved regions: 1. Extract top 5% most conserved regions in the human genome based on PhyloHMM score (142M bp). 2. Remove protein-coding regions. 3. Extract regions with conservation rate above 80% in sliding windows of 20 bp in human/mouse/rat/dog alignment. 4. Remove alignments not in syntenic blocks. 5. Remove alignments not in one-to-one mapping. 6. Mask repeat sequences. => 70M bp of sequence (2.5% of the human genome)
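Step 3 of the procedure (conservation rate above 80% in sliding windows of 20 bp) can be sketched as follows. The slide does not define "conservation rate"; this sketch assumes it means the fraction of alignment columns in which all species carry the same non-gap base. The function name `conserved_windows` and its parameters are illustrative only.

```python
def conserved_windows(aligned: list[str], window: int = 20, min_rate: float = 0.8) -> list[int]:
    """Return start columns of windows whose conservation rate exceeds min_rate.

    `aligned` holds one gapped row per species, all of equal length.
    A column counts as conserved when every row has the same non-gap base
    (an assumed definition, not taken from the lecture).
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all alignment rows must have the same length")

    conserved = [len(set(col)) == 1 and col[0] != "-" for col in zip(*aligned)]

    starts = []
    count = sum(conserved[:window])  # conserved columns in the first window
    for start in range(length - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old one.
            count += conserved[start + window - 1] - conserved[start - 1]
        if count / window > min_rate:
            starts.append(start)
    return starts


# Four identical rows: every 20-column window is fully conserved.
identical = ["A" * 25] * 4
# One row differs in its first five columns.
diverged = ["A" * 25, "C" * 5 + "A" * 20, "A" * 25, "A" * 25]
```

The running count keeps the scan linear in alignment length, which matters at genome scale; overlapping qualifying windows would still need to be merged into regions afterwards.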
72
Random chance of occurrence of k-mers of different sizes in conserved regions. Mean number of occurrences in the 70M bp region by chance (table column headers: Size; Number of mismatches 0, 1, 2): size 12: 4; size 15: 1; size 18: 0.15; size 20: 0.01
73
An example k-mer: the 18-mer TTCAGCACCATGGACAGC. Number of occurrences: appears 199 times in the conserved regions --> 1300-fold enrichment. Enrichment in the conserved regions: in the whole human genome, the 18-mer occurs 446 times (45% of the sites are in conserved regions) --> an enrichment of 18-fold, compared with 2.5%.
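The two enrichment figures on this slide can be checked with a few lines of arithmetic. This sketch assumes the 1300-fold figure compares the 199 observed sites with the chance expectation of 0.15 listed for 18-mers on the earlier slide; that pairing is an inference, not stated in the lecture. Variable names are illustrative.

```python
conserved_hits = 199           # occurrences in the conserved regions (from the slide)
genome_hits = 446              # occurrences in the whole human genome (from the slide)
expected_by_chance = 0.15      # ASSUMED: chance expectation for 18-mers from the earlier slide
conserved_genome_share = 0.025 # conserved regions are 2.5% of the genome (from the slide)

# Enrichment over chance within the conserved regions (about 1300-fold).
fold_vs_chance = conserved_hits / expected_by_chance

# Share of all genomic sites that fall in conserved regions (about 45%).
share_in_conserved = conserved_hits / genome_hits

# Enrichment of conserved regions relative to their 2.5% genome share (about 18-fold).
fold_vs_genome = share_in_conserved / conserved_genome_share
```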
74
Model motifs by consensus with mismatches. Given a k-mer word w, we consider the ball B(w, r) of radius r around w, where r bounds the distance (number of mismatches) between w and each word in the ball. Example: k=20, w=‘GGCGCTGTCCGTGGTGCTGA’, r=2. Example members: GGCGCTGTCCGTGGTGCTGA TGCGCTGTCCGTGGTGCTGA GGAGCTGTCCGTGGTACTGA GGCACTGGCCGTGGTGCTGA ...
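A minimal sketch of the ball B(w, r), assuming the distance is Hamming distance (mismatches only, no indels), which is consistent with the four example words on the slide. The function names are illustrative.

```python
from math import comb


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def in_ball(word: str, center: str, radius: int) -> bool:
    """True when `word` lies in B(center, radius) under Hamming distance."""
    return hamming(word, center) <= radius


def ball_size(k: int, radius: int) -> int:
    """Number of distinct DNA words within `radius` mismatches of a k-mer."""
    return sum(comb(k, i) * 3 ** i for i in range(radius + 1))


w = "GGCGCTGTCCGTGGTGCTGA"  # center word from the slide (k=20)
examples = [
    "GGCGCTGTCCGTGGTGCTGA",
    "TGCGCTGTCCGTGGTGCTGA",
    "GGAGCTGTCCGTGGTACTGA",
    "GGCACTGGCCGTGGTGCTGA",
]
distances = [hamming(w, e) for e in examples]
```

For k=20 and r=2 the ball holds 1,771 words, which is why enumerating neighbors of every candidate word is slow but still exhaustive.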
75
Algorithms for searching overrepresented sequences. Word-search based method: Ver1: Build suffix tree first, and then enumerate motifs with mismatches (doesn’t allow indels, but motif search is exhaustive; slow). Ver2: Hash k-mers first, and extend shared k-mer sites to screen out sites that are similar to each other (allows indels, but with lower sensitivity; fast). Alignment based method (for long sequences > 30 bp): 1. Blastz human vs. human sequences. 2. Extract sequences with multiple hits. 3. Generate consensus sequence for each multiple alignment. 4. Smith-Waterman alignment on the whole genome to identify all hits for each consensus.
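The first step of Ver2 (hash k-mers, then look at sites that share a k-mer) might look like the sketch below. It covers only the hashing and the selection of frequently shared k-mers; the extension step and any indel handling described on the slide are not implemented here. The names `index_kmers` and `frequent_kmers` are hypothetical.

```python
from collections import defaultdict


def index_kmers(sequences: list[str], k: int) -> dict[str, list[tuple[int, int]]]:
    """Map each k-mer to the (sequence index, position) sites where it occurs."""
    index: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def frequent_kmers(index: dict[str, list[tuple[int, int]]], min_sites: int) -> dict[str, list[tuple[int, int]]]:
    """Keep only k-mers seen at `min_sites` or more sites: candidates for extension."""
    return {kmer: sites for kmer, sites in index.items() if len(sites) >= min_sites}


toy_sequences = ["ACGTACGT", "TTACGT"]  # toy input, not real conserved regions
toy_index = index_kmers(toy_sequences, k=4)
shared = frequent_kmers(toy_index, min_sites=3)
```

A dictionary gives constant-time lookup per k-mer, which is the source of the speed advantage over the exhaustive suffix-tree enumeration; the cost is that sites differing inside the seed k-mer are missed.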
76
Discovered sequences: 67 consensus sequences with average size 80 bp, enrichment rate > 0.6, and number > 20; 30 20-mers with enrichment rate > 20% and number > 20; 46 18-mers with enrichment rate > 30% and number > 30.
79
A few examples Sequence Enrichment Total in_gene in_promotUTR TGGAAATGCTGACACAACCT 0.78921720 TTCATTTACACTTAACTCAT 0.739902850 AAAGGCCCTTTTCAGAGCCA 0.7294646043 AAATGCTGACAGACCCTTAA 0.700251340 GTCTGTCAGCATTTCCATTA 0.698351410 GGTTCCCATGGCAACAGCCT 0.686221030 AACTCCCATTAATGCTAATG 0.68021700 CAGCATCTGGCTCCTTGGCA 0.66721700 GTTGCCATGGCAACAGCAGC 0.640321452 TTTTATGGCTGAGTTATAAA 0.640231111 CTGTTGCCATGGCAACCAGG 0.6303922111 GGTCTCCATGGCAACCAGCC 0.62115730 AGTGGCCTGAAAGAGTTAAT 0.615221210 TTATAATGGAAATGCTGACA 0.604522320 GTCTGTTAGCATTTCCATTA 0.595231020 AATAGGGGTTTATAATGGAA 0.594271121 TCCCATTAATGTTAATGGGA 0.591231020 GCTTTGGTTTCCATGGAAAC 0.58325720 CTGTCAGCATTTCCATTATA 0.556492241 CAGCATTTCCATTACAAACC 0.550221010 CCACAAGAGGGCAGCAGAGG 0.5213215110 GTGCTATATAAATGCTAAAT 0.50021410 GACTACAACTCCCAGCAGGC 0.4744240371 TCAGCACCACGGACAGCGCC 0.3814435240
80
Context of K-mers: conservation island Conservation island
81
Context of K-mers: extended conservation TGCTGTTCCATGGCAAC Palindromic sequence
82
Context of K-mers: connected conservation Histone 3’UTR motif TGGCTCTGAAAAGAGCCTTT
83
Context of K-mers: connected conservation
85
Identify long sequences based on alignment
86
Interesting RNA structure of the sequence GGAAGAAGGGAAGAAATGGCTCACTTTTCAGAGGTGCATTTACTCTTTGACCCACTAGGGTACTATTTAGTGTTCTAGAAGAGGTAATTTAGTAAATTGTA CCCCAGTGGCCTGAAAAAGTTAATGCAACTCTGAAAAGTGAGCCATTCAATCGATTTTCCCTATTGCTTTTAAAAAAT.(((((.(((((((((((((((((((((((((((.((((((.(((.(((.(.(((((.((((((.(((((.((.(((.....))).....)).))))).)) )))).)))))..).))).))).)))))).))))))))))))))))))).......))))))))....)))))....... (-74.51)
87
Conserved instance in the intron of ADCY5 TGCTGTTCCATGGCAAC
88
Conclusion. Goldmines of conservation in the human genome –Short motifs, very frequently occurring –Longer motifs, many occurrences –Extremely long elements, near-perfect conservation. Regulatory role? –microRNA genes / other non-coding RNAs –Early development, body-plan formation –Repeat elements hijacked for regulatory roles? Contain strong enhancer regions, scattered across the genome –A lot of untranslated transcription
89
Regulatory motif evolution Erez Lieberman Genes Regulation Evolution
90
Motif disappears, and reappears about 100 bp downstream in S. mikatae CGTNNNNNRYGAY Scer GGCTCCATCAATTCGTATCAAGTGATAATT-AT------CACATAAATTATATAATTGTA Spar AACCCTATTAATTCGTAAGCAGTGATATAA-AT-AGAATAACCTAACTTATACAACTGTA Smik AACCCTATGAATTCCTAGTAAGCCACCTATTATAGAGATAACCTAAGTAGTATAGTAGTA Sbay AGCCCTATACATTCGTACCAAGTGATAAAT-ATTATTAAGACCTAACATTTAAAACAGTT * * ** **** ** ** * ** ** *** ** * ** CGTNNNNNRYGAY Scer AACCT------ATTAATAACCCTAAT-ATCATCCTCATGCCCTA-AGAAATATTCAATAT Spar TCCCTTTTAAACCCCCTAATATTACC-ATCTAAGACCTAACTAATATCAA----GGGAAA Smik A-CCTATTAAAATTAAAAACGTTAACCATGATGCCCTAACAATATAATGA-----AGGAA Sbay ACCCT-----ACCCTAAAATGGGAAC-ATAAAACACAAACCCTATATAAACGTAGAGAAA *** ** * ** * * * * * ABF1 YHR078W S. cerevisiae S. paradoxus S. mikatae S. bayanus Evidence of motif movement by neutral evolution
91
Evidence of strand crossing for near-palindromic motifs ABF1 Crosses the Strand in YHL012W CGTNNNNNRYGAY RTCRYNNNNNACG Scer ---TAAAATAGCATATCGTTAAAAACGACAAACGCGT Spar ---TAATATAACATCTCGTTAAAAACGACAAACGCGT Smik TAATGAAATAA-ATCTCGTAAAAAACGACAAACGCGT Sbay ---TGATCTGCCCTTCCGTATATAATGACAAACGCGT ABF1 S. cerevisiae S. paradoxus S. mikatae S. bayanus YHL012W
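The two consensus strings on these slides, CGTNNNNNRYGAY and RTCRYNNNNNACG, are reverse complements of each other under the IUPAC codes (R = A/G, Y = C/T, N = any base). A minimal sketch of scanning a sequence for the motif on either strand, assuming gap characters have already been stripped from the alignment rows; only the IUPAC codes that appear in this motif are handled, and the function names are illustrative.

```python
import re

# IUPAC codes used by the ABF1 consensus on the slides.
IUPAC_REGEX = {"A": "A", "C": "C", "G": "G", "T": "T",
               "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}
IUPAC_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
                    "R": "Y", "Y": "R", "N": "N"}


def reverse_complement_consensus(consensus: str) -> str:
    """Reverse complement of an IUPAC consensus string."""
    return "".join(IUPAC_COMPLEMENT[s] for s in reversed(consensus))


def find_motif(consensus: str, sequence: str) -> list[tuple[int, str]]:
    """Return (start, strand) for non-overlapping matches on either strand."""
    hits = []
    for strand, pattern in (("+", consensus),
                            ("-", reverse_complement_consensus(consensus))):
        regex = re.compile("".join(IUPAC_REGEX[s] for s in pattern))
        hits.extend((m.start(), strand) for m in regex.finditer(sequence))
    return sorted(hits)


ABF1 = "CGTNNNNNRYGAY"
# Ungapped prefixes of the YHR078W alignment rows shown on the earlier slide.
scer_yhr078w = "GGCTCCATCAATTCGTATCAAGTGATAATT"
smik_yhr078w = "AACCCTATGAATTCCTAGTAAGCCACCTATT"
# Ungapped S. cerevisiae row of the YHL012W alignment.
scer_yhl012w = "TAAAATAGCATATCGTTAAAAACGACAAACGCGT"
```

On the YHL012W row the scan reports overlapping sites on both strands, which is what makes strand crossing plausible for a near-palindromic motif: a handful of substitutions can shift the best match from one strand to the other.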
92
The birth-death process of regulatory motifs Motif birth Motif movement Motif death Hap4 Abf1 Msn2
93
Motif birth governed by random process? Footprint (examples AANNCG, GTNNTG, GNNNT; rates 2X, 1X): wider = faster. Information (examples AC, CT, GT; rates 4X, 1X): more bits = slow movement.
94
Motif birth governed by random process ! Model Observed rate Motif birth can be modeled as a largely random process Observed motif birth rate Motif information content
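The slides relate motif birth rate to information content without defining the measure. One common definition, assumed here, scores each consensus position as 2 - log2(number of allowed bases) bits under a uniform base composition, so a fixed base contributes 2 bits, R or Y contributes 1 bit, and N contributes 0. The function names are illustrative.

```python
import math

# Number of bases each IUPAC code allows (only the codes used in these slides).
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "N": 4}


def information_bits(consensus: str) -> float:
    """Information content in bits, assuming uniform base frequencies."""
    return sum(2 - math.log2(DEGENERACY[s]) for s in consensus)


def chance_match_probability(consensus: str) -> float:
    """Probability that a random site of the same length matches the consensus."""
    return 2.0 ** -information_bits(consensus)


abf1_bits = information_bits("CGTNNNNNRYGAY")
abf1_chance = chance_match_probability("CGTNNNNNRYGAY")
```

Under this assumption each extra bit halves the chance that random sequence matches, which is one way to read the claim that higher-information motifs are born more slowly.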
95
Motif aging. Panels: Age 0, Age 1, Age 2, Age 3, Age 4 (axes: information content, number of instances). Red: all regions. Green: bound regions. What is responsible for the shift in distribution?
96
3. Death rates governed by selective landscape Green: Death rate in bound regions Red: Death rate in unbound regions Motif death rates drastically different in functional / non-functional regions
97
Intensity of selection determines motif death rate Bound & Cooperative BoundNot bound Cooperative Neither Rate of motif death Each level of selective pressure shows distinct death rate
98
Birth and death events for chromosome arm (16R). Birth-death process governed by selection landscape. Green = motif birth, Red = motif death, Blue = motif aging. Axes: strength of selective pressure, chromosomal position on chromosome 16 (right arm). Panel label: Yap 1.
99
Motif evolution governed by three processes Motif birth –Short motifs can appear by neutral evolution –Rate of motif birth ~ information content Motif aging –Motif abundance shifts towards bound regions –Distribution changes gradually over time Motif death –Governed by functional selection landscape –Predicted by partner motifs + factor binding Modeling motif evolution can lead to better discovery
100
Network evolution by duplication Aviva Presser Motif discovery Motif evolution Network evolution
101
Networks are dynamic in time and in evolution Global motif co-occurrence map Reveal co-operating regulators Initial network of master regulators Reveal hubs, cascades, network motifs How do networks change in the face of gene duplication ? Ste12 Tec1 CBF1 Met31 Gcn4 Leu3 rESR1Abf1 rESR2 Gcr1 Msn2
102
Evidence of Whole Genome Duplication
103
Whole Genome Duplications in diverse lineages Yeast Duplication Kellis et al. Nature, Apr 8, 2004 Vertebrate Duplication in Fish Jaillon et al. Nature, Oct 21, 2004 Two rounds of WGD in human! Dehal et al. PLoS Biology, Oct 2005
104
The return to haploidy. Figure: number of genes over time; 5,000 before WGD, 10,000 after WGD, 5,500 today following gene loss (~500 gained); time axis label: 100 Myrs. Advantage of WGD may lie in the 500 gained genes.
105
Functions of duplicated genes S. cerevisiae copy 1 S. cerevisiae copy 2 K. waltii Evidence of accelerated protein divergence ? As a group –Biased towards environment adaptation –Sugar metabolism, fermentation, regulation Individual pairs –Are new gene functions gained by WGD ? –How are new gene functions emerging ? WGD Rate 1 Rate 2
106
Scenarios for rapid gene evolution. One copy faster (Ohno, 1970) vs. both copies faster (Force, 1999); each tree shows Scer copy 1, Scer copy 2, and Kwal. 20% of duplicated genes show acceleration. 95% of cases: only one copy faster.
107
Emerging gene functions after duplication. Asymmetric divergence: recognize ancestral / derived. Scer Sir3 (silencing), Scer Orc1 (origin of replication), Kwal Orc1: 4-fold acceleration; origin of replication (ancestral), silencing (derived). Scer Ski7 (anti-viral defense), Scer Hbs1 (translation initiation), Kwal Hbs1: 3-fold acceleration; translation initiation (ancestral), anti-viral defense (derived).
108
Asymmetric divergence: distinct functional properties. Gain new function and lose ancestral function. Ancestral function vs. derived function: Gene deletion: lethal (20%) vs. never lethal. Expression: abundant vs. specific (stress, starvation). Localization: general vs. specific (mitochondrion, spores).
110
Asymmetry also found in network connectivity. Interaction loss more likely than gain. One protein maintains ancestral function? Study network in context of duplication. Figure labels: duplication; asymmetric divergence; duplicated gene; interaction partners.
111
Network evolution by duplication Lost Duplicate Time Pre-WGD Modern Network Network motif Duplication ++ - - Loss Duplication Gain Modern network motif Ancestral network motifs Scenario 1 Scenario 2
112
Mechanisms of network motif emergence. Duplication: creation probabilities p·(1-q), q. Divergence: transition probability (1 - P_plus) · (1 - P_minus)^3 · P_minus^2. Figure labels: lost interactions; kept interactions; gained interactions. Pre-duplication probabilities –p = probability of interaction –q = probability of self-interaction. Post-duplication probabilities –P_plus = probability of adding an interaction –P_minus = probability of eliminating an interaction
113
All have either 4 or 0 edges across the pairs (4-across or 0-across) Emergence of post-duplication network motifs
114
Modeling network evolution –Parameters: fraction duplicated vs. spontaneous generation; fraction of edges deleted; number of edges for spontaneous genes –90% of timesteps: duplication. Pick a gene at random; duplicate it with all its connections; delete on average 35% of the new connections –10% of timesteps: creation. “Create” a new gene; randomly connect it to the existing network with 0–20 connections. Study emergence of network motifs.
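The model on this slide can be simulated in a few lines. The sketch below follows the stated parameters (90% duplication, 35% deletion of new connections, 0 to 20 connections for created genes) and fills in details the slide leaves open with labeled assumptions: a two-gene seed network, independent deletion of each inherited edge, a uniform draw for the number of edges of a created gene, and no edge between a gene and its duplicate. The name `simulate_network` is hypothetical.

```python
import random


def simulate_network(steps: int, p_duplicate: float = 0.9, p_delete: float = 0.35,
                     max_new_edges: int = 20, seed: int = 0) -> dict[int, set[int]]:
    """Grow an undirected network by duplication (with edge loss) and gene creation."""
    rng = random.Random(seed)
    # ASSUMPTION: start from two connected genes; the slide gives no seed network.
    adj: dict[int, set[int]] = {0: {1}, 1: {0}}

    for _ in range(steps):
        new = len(adj)  # label of the gene added in this timestep
        if rng.random() < p_duplicate:
            # Duplication: copy a random gene's connections, then drop each
            # copied edge independently with probability p_delete.
            parent = rng.randrange(new)
            kept = {n for n in adj[parent] if rng.random() >= p_delete}
        else:
            # Creation: connect a new gene to 0..max_new_edges random genes
            # (ASSUMPTION: uniform choice of the number of edges).
            n_edges = rng.randint(0, min(max_new_edges, new))
            kept = set(rng.sample(range(new), n_edges))
        adj[new] = set(kept)
        for neighbour in kept:
            adj[neighbour].add(new)
    return adj


network = simulate_network(steps=200)
degrees = sorted(len(neighbours) for neighbours in network.values())
```

Counting network motifs or plotting the degree distribution of `network` would be the next step toward the comparisons described on the following slides; neither is attempted here.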
115
Abundance of network motifs predicted by duplication
116
2. High frequency of ohnolog pair interaction. Ancestral self-interaction or gain of ohnolog interaction (figure labels: duplication, divergence). 1. Asymmetry in network connectivity: interaction loss more likely than gain. One protein maintains ancestral network function? Lessons learned: 1. Abundance of ancestral self-interactions. 2. Gain of ohnolog interaction by proximity due to common interactions. 3. Selection for ohnologs with interaction; both kept since neither can mutate. A faulty A’ would disrupt polymerization of A-A-A-A, reducing fitness.
117
3. Abundance of global properties and network hubs. Panels: duplication + asymmetric divergence model; traditional preferential attachment model. Model matches local and global network properties.
118
Network evolution: Conclusions Asymmetric evolution of network connectivity –One pair preserves connections –One pair keeps subset (rarely gains) WGD preserves network connectivity –Duplicates highly interconnected Simple model of network evolution –Estimate rates of interaction gain and loss –Very good fit to simulated and actual yeast network Infer connectivity patterns of ancestral network –Ancestral network shows increased number of self-interactions –Self-interacting proteins favored in duplicated network?
119
Comparative genomics and regulatory networks Regulatory motif discovery –Genome-wide conservation score –Validated using expression, positional bias, multiplicity –Pre- and post-transcriptional regulation microRNA regulation –Motif-centric discovery of new microRNA genes –Many new microRNAs, experimentally validated –Role of microRNA regulation: 20% of the genome Regulatory motif evolution –Underlying birth-death process, random birth process –Aging shifts distribution, death governed by selection –Ability to model motifs for discovery in many species Protein network evolution –Simple duplication-based model –Motif abundance, degree distribution can be predicted –Asymmetric divergence, cross-interactions
120
Acknowledgements Human motifs –Xiaohui Xie –Eric Lander –Vamsi Mootha –Kerstin Lindblad-Toh –Jun Lu –E.J. Kulbokas –Todd R. Golub Fungal comparisons –Bruce Birren –Christina Cuomo –James Galagan –Li-Jun Ma –Joshua Grochow Gene identification –Mike Lin –Michael Brent Network evolution –Aviva Presser –Michael Elovitz –Roy Kishony Motif Evolution –Erez Lieberman –Martin Nowak Genome-wide phylogeny –Matt Rasmussen –Marcia Lara
121
Who’s actually doing the work Matt Rasmussen Whole-genome phylogeny Xiaohui Xie Motif finding Josh Grochow Protein motifs Erez Lieberman Motif evolution Aviva Presser Network evolution Mike Lin Gene identification Alex Stark Fly regulatory networks Pouya Kheradpour Human motifs