Introduction to Bioinformatics Tuesday, 19 March
Are genes encoding proteins with all the universal motifs of cytosine methyltransferases commonly found in phages?
Define motifs (known proteins)
Find motif (unknown proteins)
Motifs – not only for proteins! Position-specific scoring matrices (PSSMs)
Nature of Regulatory Sites
[Diagram] Known sites are used to build a Sequence Filter. Genomic sequence (containing unknown sites) is passed through the Sequence Filter to give Predicted sites.
Nature of Sequence Filters
Hidden Markov model-based methods
Ad hoc methods
Position-dependent scoring matrix (PSSM) = Position-specific frequency table = Weight table
Making a PSSM
Some of 106 aligned human promoter sequences (near -26)
CCCTATATAAGGC...   histone H1t
CGCTATAAAAACT...   HMG-17
GGGTATATAAGCG...   β-tubulin β2
GGCTATATAAAAC...   α-actin skel-m.
TTCTATAAAGCGG...   α-cardiac actin
CCCTATAAAACCC...   β-actin
GAGTATAAAGCAC...   keratin I 50K
GGTTATAAAAACA...   vimentin
CAGTATAAAAGGG...   α1(I) collagen
CCGTATAAATAGG...   α2(I) collagen
TCCCATATAAGCC...   fibronectin
   TATAAA          Consensus
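The step from aligned sites to a position-specific frequency table can be sketched in a few lines. This is a minimal illustration, not the procedure used in the lecture: it uses only the 11 sequences shown on the slide (not all 106), and the function names and the pseudocount of 1 are assumptions introduced here.

```python
# Aligned promoter segments copied from the slide (11 of the 106 sequences).
ALIGNED = [
    "CCCTATATAAGGC", "CGCTATAAAAACT", "GGGTATATAAGCG", "GGCTATATAAAAC",
    "TTCTATAAAGCGG", "CCCTATAAAACCC", "GAGTATAAAGCAC", "GGTTATAAAAACA",
    "CAGTATAAAAGGG", "CCGTATAAATAGG", "TCCCATATAAGCC",
]
BASES = "ACGT"


def count_matrix(seqs):
    """Count each base at each aligned position."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    counts = {b: [0] * length for b in BASES}
    for s in seqs:
        for i, base in enumerate(s):
            counts[base][i] += 1
    return counts


def frequency_table(counts, pseudocount=1.0):
    """Turn counts into per-position frequencies.

    The pseudocount (an assumption, not from the slides) keeps bases that
    were never observed at a position from getting a frequency of zero.
    """
    length = len(counts["A"])
    freqs = {b: [0.0] * length for b in BASES}
    for i in range(length):
        total = sum(counts[b][i] for b in BASES) + 4 * pseudocount
        for b in BASES:
            freqs[b][i] = (counts[b][i] + pseudocount) / total
    return freqs


def consensus(counts):
    """Most frequent base at each position (ties go to the first of A, C, G, T)."""
    length = len(counts["A"])
    return "".join(max(BASES, key=lambda b: counts[b][i]) for i in range(length))


if __name__ == "__main__":
    counts = count_matrix(ALIGNED)
    for base in BASES:
        print(base, counts[base])
    print("consensus:", consensus(counts))
```

For these 11 sequences, positions 4 to 9 of the computed consensus spell TATAAA, matching the consensus on the slide. The flanking positions are much less conserved, which is exactly the information a frequency table keeps and a consensus string throws away.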
Making a PSSM
Where to get a training set?
Experimentally proven regulatory sites
Orthologs of genes in different organisms
  Not too far (divergence of binding sites)
  Not too close (hidden amidst overall similarity)
Experimentally indicated coregulated genes
Suspected coregulated genes
Using a PSSM
Experimentally proven start sites
atpI   ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB   ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA   ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH   TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ   TTCACACAGGAAACAGCTATGACCATG
rpsJ   AATTGGAGCTCTGGTCTCATGCAGAAC
serC   GCAACGTGGTGAGGGGAAATGGCTCAA
sucA   GATGCTTAAGGGATCACGATGCAGAAC
trpE   CAAAATTAGAGAATAACAATGCAAACA
Using a PSSM
Unknown start site (?)
aceB   ACTATGGAGCATCTGCACATGAAAACC
Using a PSSM
Proven start sites, aligned with gaps (.)
atpI   ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB   ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA   ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH   TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ   TTCACACAGGAAACAG....CTATGACCATG
rpsJ   AATTGGAGCTCTGGTCTCATGCAGAAC
serC   GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA   GATGCTTAAGGGATCA....CGATGCAGAAC
trpE   CAAAATTAGAGAATA...ACAATGCAAACA
[Frequency table with rows A, C, G, T; values not preserved]
Using a PSSM
Unknown start site, before and after alignment to the proven sites
aceB   ACCACATAACTATGGAGCATCTGCACATGAAAACC    (unaligned)
aceB   ACCACATAACTATGGAGCATCT.GCACATGAAAACC   (aligned, one gap inserted)
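Using a PSSM on an unknown sequence amounts to sliding the matrix along the sequence and scoring every window. The sketch below is one common way to do this (log-odds scores against a uniform background), offered under stated assumptions rather than as the method from the lecture. It trains on the nine ungapped start-site sequences, which are already aligned at the start codon, so it ignores the variable spacing that the gapped alignment introduces. The function names, the pseudocount of 1 and the background of 0.25 are all assumptions.

```python
import math

BASES = "ACGT"

# Experimentally proven start sites from the slides (ungapped, 27 nt each).
START_SITES = {
    "atpI": "ACCTCGAAGGGAGCAGGAGTGAAAAAC",
    "bioB": "ACGTTTTGGAGAAGCCCCATGGCTCAC",
    "glnA": "ATCCAGGAGAGTTAAAGTATGTCCGCT",
    "glnH": "TAGAAAAAAGGAAATGCTATGAAGTCT",
    "lacZ": "TTCACACAGGAAACAGCTATGACCATG",
    "rpsJ": "AATTGGAGCTCTGGTCTCATGCAGAAC",
    "serC": "GCAACGTGGTGAGGGGAAATGGCTCAA",
    "sucA": "GATGCTTAAGGGATCACGATGCAGAAC",
    "trpE": "CAAAATTAGAGAATAACAATGCAAACA",
}

# Region around the unknown aceB start site, as shown on the slides.
ACEB = "ACCACATAACTATGGAGCATCTGCACATGAAAACC"


def log_odds_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix: one dict of base -> score per position."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("training sites must all have the same length")
    pssm = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudocount
        pssm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-position scores of one window of the same width as the PSSM."""
    return sum(column[base] for column, base in zip(pssm, window))


def scan(pssm, sequence):
    """Score every window of the sequence; returns (offset, score) pairs."""
    width = len(pssm)
    return [
        (i, score_window(pssm, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]


def best_hit(pssm, sequence):
    """Offset and score of the highest-scoring window."""
    return max(scan(pssm, sequence), key=lambda hit: hit[1])


if __name__ == "__main__":
    pssm = log_odds_pssm(list(START_SITES.values()))
    for offset, score in scan(pssm, ACEB):
        print(f"offset {offset:2d}  score {score:7.2f}  {ACEB[offset:offset + len(pssm)]}")
```

Each printed line is one candidate placement of the 27-nt model on the aceB region. The highest-scoring offset is the model's prediction for where the start site lies, and comparing it against the scores of the proven sites gives a rough sense of how convincing that prediction is.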
What to do with no training set?
New pattern discovery (MEME, Gibbs sampler, BioProspector)
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6)    AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t         GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14              CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1        CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin           GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E             TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14              GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17              TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19    ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1       GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin β2        GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m.     CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin     TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin             CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA
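To make the idea of pattern discovery concrete, here is a deliberately naive sketch that counts how many of the unaligned sequences contain each exact k-mer. It is not how MEME, the Gibbs sampler or BioProspector work; those tools fit probabilistic models and tolerate mismatches. The function name `shared_kmers` and the choice of k = 5 are assumptions made for illustration.

```python
from collections import defaultdict

# Unaligned human upstream sequences copied from the slide.
UPSTREAM = {
    "snRNA U1 (pU1-6)": "AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC",
    "histone H1t": "GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT",
    "HMG-14": "CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG",
    "TP1": "GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT",
    "protamine P1": "CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT",
    "nucleolin": "GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG",
    "snRNP E": "TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT",
    "rp S14": "GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC",
    "rp S17": "TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT",
    "ribosomal p. S19": "ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT",
    "alpha-tubulin": "GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG",
    "beta-tubulin": "GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA",
    "alpha-actin skel-m.": "CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC",
    "alpha-cardiac actin": "TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC",
    "beta-actin": "CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA",
}


def shared_kmers(seqs, k):
    """Rank exact k-mers by the number of sequences that contain them.

    Returns (kmer, n_sequences) pairs, most widely shared first;
    ties are broken alphabetically so the order is deterministic.
    """
    support = defaultdict(set)
    for index, seq in enumerate(seqs):
        for i in range(len(seq) - k + 1):
            support[seq[i:i + k]].add(index)
    ranked = [(kmer, len(members)) for kmer, members in support.items()]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for kmer, n in shared_kmers(list(UPSTREAM.values()), k=5)[:10]:
        print(f"{kmer}  found in {n} of {len(UPSTREAM)} sequences")
```

In GC-rich promoter regions like these, the most widely shared k-mers need not be the biologically meaningful ones. That is one reason dedicated tools compare candidate motifs against a background model instead of relying on raw counts.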
Things to do
ME