Frog’s eye view of the jungle (time frozen)
Frog’s eye view of the jungle (time moving)
Frog’s eye view of the jungle (through movement filter)
Filters: Information reducers Movement filter
Filters: Information reducers Sequence filter TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT CTCCGTAAAC CTCTAAC... How organism is made How organism works
From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of folding Active site
From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of folding Active site Cell interaction Metabolism, Architecture
From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of folding Active site Gives us: Custom antibiotics
From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Gives us: Custom antibiotics Custom antibodies Custom enzymes New materials Genetic code Rules of folding Active site
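The "Genetic code" step on these slides can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code; the function names `translate` and `three_letter` are made up for this example and do not come from any program named in the slides.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# One-letter to three-letter amino acid names ("*" marks a stop codon).
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at its first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def three_letter(protein):
    """Render a one-letter protein string as hyphenated three-letter codes."""
    return "-".join(THREE_LETTER[aa] for aa in protein)


print(three_letter(translate("ATGACTTATGATCAACGCACAGGGCTA")))
# prints "Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu"
```

Translation is the easy, fully known filter. The rules of folding and of transcriptional control that follow on the slides have no such lookup table.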
From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of transcriptional and post-transcriptional control Transcr’l initiation Transcr’l termination/ polyA tailing Splicing Transl’l initiation ? TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA ATGACTTATGATCAACGCACAGGGCTA 3% TCTACTTATATTCAATCCACAGGGCTA CACCTAGTTCTTGAAGAGTCTGTTGAA TGAACACATACATGGTTTATCTGTTTT TCTGTCTGCTCTGACCTCTGGCAGCTT TAGCCTGCCCCACTCTTAGATAAACGA ACCTTAGTGACTTCTGCTATACCAAAG TCTCCACGCCCCTCCGTAAACCTCTAA CATGATGTCAGCAAATATTAAAAATGA 97%
From Sequence to Organism How does Nature do it? Natural filters/transformations Selective transcription Selective processing Translation Folding DNA Functional protein
From Sequence to Organism How does Nature do it? Natural filters/transformations DNA Functional protein
From Sequence to Organism How can WE do it? Simulation of Nature Utterance of Wm Shakespeare Utterance of George W Bush “Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” “We must give our military every tool and weapon it needs to prevail...” ???
From Sequence to Organism How can WE do it? Surrogate Processes Utterance of Wm Shakespeare Utterance of George W Bush “Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” “We must give our military every tool and weapon it needs to prevail...” Words/sentence; Choice of words; Sentence structure; …
From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding Surrogate filters Characteristics of coding sequences/introns My sequence Gene finders Predicted coding regions
From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding Surrogate filters Gene finders Similarity finders Sequence/motif Databases My sequence
From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding Surrogate filters Gene finders Similarity finders Feature finders Predicted features Characteristics of features My sequence
From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding Surrogate filters Gene finders Similarity finders Feature finders Pattern finders My sequences Statistical engine
Surrogate Filters Gene finders Similarity finders Feature finders Pattern finders How do they work? Case studies Real problems Mixed strategies You do it
Surrogate Filters Gene finders Class 1: Start/Stop codon search (Map, Frames, OrfFinder) CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA Look for start codons (ATG) (GTG,TTG) Look for stop codons (TAA,TAG,TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG Surrogate Filters Gene finders Class 1: Start/Stop codon search (Map, Frames, OrfFinder) Look for start codons (ATG) (GTG,TTG) Look for stop codons (TAA,TAG,TGA)
Pro: Quick, simple Con: Useless for eukaryotic genomic sequences (introns) Inaccurate (start codon problem) Inaccurate (doubtful short open reading frames) Surrogate Filters Gene finders Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
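A Class 1 gene finder can be sketched as a scan of all six reading frames for a start codon followed by the first in-frame stop codon. The code below is an illustrative sketch, not the algorithm of Map, Frames, or OrfFinder; the names `find_orfs` and `orfs_in_frame` are hypothetical, and only ATG is treated as a start unless the caller passes a different set.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # add "GTG" and "TTG" to allow the alternative starts


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_frame(seq, frame, starts=START_CODONS):
    """Yield (start, end) of each ORF in one frame; end is exclusive and
    includes the stop codon. The first start codon seen opens the ORF."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in starts:
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def find_orfs(seq, min_codons=0):
    """Scan three frames on each strand.

    Returns tuples (strand, frame, start, end, orf_sequence); coordinates on
    the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if (end - start) // 3 - 1 >= min_codons:
                    results.append((strand, frame, start, end, s[start:end]))
    return results


slide_sequence = (
    "CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGA"
    "CATGTACAAATGGAAATATGCAA"
)
for orf in find_orfs(slide_sequence):
    print(orf)
```

The `min_codons` threshold is one crude answer to the "doubtful short open reading frames" problem listed above. The start codon problem remains: the sketch simply takes the first ATG in each frame.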
Surrogate Filters Gene finders The code is degenerate Class 2: Codon bias recognition (TestCode) Are codons equally used?
Surrogate Filters Gene finders Codon usage is biased Most frequently used codons Class 2: Codon bias recognition (TestCode) Codon bias universal?
Surrogate Filters Gene finders Class 2: Codon bias recognition (TestCode) Pro: Quick, simple, available through GCG Better than Class 1 in excluding false open reading frames Con: Useless for eukaryotic genomic sequences (introns) Gives only general areas of open reading frames
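The question "Are codons equally used?" can be checked by counting codons in a known coding sequence and comparing each codon with its synonyms. The sketch below shows only that counting step, assuming the standard genetic code. It is not the statistic TestCode computes, and the function names are invented for this example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_counts(cds):
    """Count codons in a coding sequence read from its first base.
    Codons containing anything other than A, C, G, T are skipped."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def synonymous_usage(cds):
    """For each codon seen, return its share among the codons that encode
    the same amino acid (1.0 means it is the only synonym used)."""
    counts = codon_counts(cds)
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


# Toy example: five leucine codons, three of them CTG.
print(synonymous_usage("CTGCTGCTGCTATTA"))
```

A region whose codon choices resemble the organism's known bias is more likely to be coding. Because bias differs between organisms, the reference table has to come from the same organism or a close relative.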
Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Principle Step 1: Create model through extensive training set * Training set = proven or suspected genes * Organism-specific Step 2: Assess candidate genes through filter of model
Step 1: Create model through extensive training set Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition AAA AAC AAG AAT ACA... TTG TTT Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 1: Create model through extensive training set AAAA: 33% AAAC: 25% AAAG: 12% AAAT: 30% Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition AAA AAC AAG AAT ACA... TTG TTT Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 1: Create model through extensive training set AACA: 30% AACC: 20% AACG: 15% AACT: 35% AAA AAC AAG AAT ACA... TTG TTT Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 2: Assess candidate genes A C G T AAA AAC AAG AAT ACA TTG TTT Candidate gene AAAGCAA… 3rd-order Markov model Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes AAAGCAA… 0.12 x 3rd-order Markov model Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition A C G T AAA AAC AAG AAT ACA TTG TTT Candidate gene
Step 2: Assess candidate genes AAAGCTA… 0.12 x So far, not a good candidate! 3rd-order Markov model Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition A C G T AAA AAC AAG AAT ACA TTG TTT Candidate gene
Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Pro: Among the most accurate methods known Con: Needs big training set May miss genes of foreign origin Will miss very small genes
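Steps 1 and 2 above can be sketched as a plain 3rd-order Markov chain: tabulate how often each base follows each three-base context in the training set, then multiply those probabilities along a candidate. This is a simplified illustration and not a full hidden Markov model or the method of any specific gene finder. The names `train` and `log_likelihood`, the pseudocount, and the uniform fallback for unseen contexts are all assumptions of this sketch.

```python
import math
from collections import defaultdict


def train(sequences, order=3, pseudocount=1.0):
    """Step 1: estimate P(next base | previous `order` bases) from a training set.
    The pseudocount keeps unseen transitions from getting probability zero."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            if nxt in "ACGT" and set(context) <= set("ACGT"):
                counts[context][nxt] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {base: c / total for base, c in row.items()}
    return model


def log_likelihood(model, candidate, order=3):
    """Step 2: sum the log probabilities of each base given its context.
    Logs replace the running product (0.12 x ...) to avoid underflow;
    contexts never seen in training fall back to 0.25 per base."""
    candidate = candidate.upper()
    uniform = dict.fromkeys("ACGT", 0.25)
    score = 0.0
    for i in range(len(candidate) - order):
        row = model.get(candidate[i:i + order], uniform)
        score += math.log(row.get(candidate[i + order], 0.25))
    return score


# Toy training set: after AAA, the base A is by far the most common.
model = train(["AAAAAAAA"])
print(model["AAA"])
print(log_likelihood(model, "AAAA"), log_likelihood(model, "AAAC"))
```

In practice the score is usually compared with the score of the same candidate under a model trained on noncoding sequence, so that a verdict like "not a good candidate" rests on a ratio and not on the raw product.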
Surrogate Filters Scenario I – Case of the Hidden Heterocyst
Case of the Hidden Heterocyst [Figure: heterocysts; labels N2, NH3, O2] Matveyev and Elhai (unpublished)
Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Nostoc genome Transposon 1. Use transposon mutagenesis
Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Nostoc genome Transposon 1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation
Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Nostoc genome 2. Sequence out from transposon AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATA ATCAATGACTATCAGACAGAGAATCATCGTGCTGTCA GTAAAACCTCTGATTTCGATCTTTACCATAATTGTTA TGTTGTAATGACTAACCAGACTATCTTTTACAGAGCT TCTGGTTAACACTTGTCTAATTAGACATTGATAATGT TTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAAT TACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACG AGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTT AACTTCAGAAATTCACGGCGGAAATCCATAGTTATTA TTACTTATGACTAAAACAAAATTACTATGGCGGCTTG TTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTA AAGTCCCACTAACTTTTTTCTCATCTATTGCTATATT TCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCA AATAACAAACTCATTTTTAGTAGATATTTCATGCAAA CTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTAC AGCCACTCCACAAACCTTAGAATGGCTACTCAATATT GCAATTGATCATGAATATCCCACTGGTAGAGCAGTTT TAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGT TGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA 1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation
Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Nostoc genome 2. Sequence out from transposon AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATA ATCAATGACTATCAGACAGAGAATCATCGTGCTGTCA GTAAAACCTCTGATTTCGATCTTTACCATAATTGTTA TGTTGTAATGACTAACCAGACTATCTTTTACAGAGCT TCTGGTTAACACTTGTCTAATTAGACATTGATAATGT TTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAAT TACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACG AGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTT AACTTCAGAAATTCACGGCGGAAATCCATAGTTATTA TTACTTATGACTAAAACAAAATTACTATGGCGGCTTG TTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTA AAGTCCCACTAACTTTTTTCTCATCTATTGCTATATT TCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCA AATAACAAACTCATTTTTAGTAGATATTTCATGCAAA CTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTAC AGCCACTCCACAAACCTTAGAATGGCTACTCAATATT GCAATTGATCATGAATATCCCACTGGTAGAGCAGTTT TAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGT TGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA 1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation 3. Find gene boundaries 4. Identify gene Do it
Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes 1. Go to 2. Open second browser (Ctrl-N in Netscape) Go to same site (copy and paste URL) 3. In 1st browser, go to Program List Click on Gene Finders Open GeneMark 4. In 2nd browser, open Nostoc sequence
Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Mission successful: >Translation: (direct), 81 amino acids VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNAR VTQTVKLVRLEKFLSLQKSVEEALENVK* … or was it? Check predicted protein against databases
Surrogate Filters Similarity finders Blast BlastP: Protein sequence to search protein database BlastN: Nucleotide sequence to search nucleotide database BlastX: Nucleotide sequence (translated) to search protein database TBlastN: Protein sequence to search (translated) nucleotide database Blast2Seq: Compare two sequences you specify Do it FastA (Various flavors) Pfam (Protein motif families) Finds conserved motifs similar to protein sequence
Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Mission successful: >Translation: (direct), 81 amino acids VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNAR VTQTVKLVRLEKFLSLQKSVEEALENVK* Why? GeneMark correct: Conservation of noncoding regions VLGSK GeneMark wrong: Fooled by weird aa sequence or start codon
Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Moral Automated gene finders are wonderful, but common sense is better Don’t trust automated annotation
Surrogate Filters Feature finders Hidden Markov model-based methods Good for contiguous features (e.g. signal sequences) Not good with features with gaps (e.g. promoters) Ad hoc methods Feature-specific rules (e.g. tandem repeats, terminators) Position-dependent frequency tables = Position-specific scoring matrix (PSSM) = Weight table
Surrogate Filters Feature finders Position-dependent frequency tables CCCTATATAAGGC...histone H1t CGCTATAAAAACT...HMG-17 GGGTATATAAGCG...β-tubulin β2 GGCTATATAAAAC...α-actin skel-m. TTCTATAAAGCGG...α-cardiac actin CCCTATAAAACCC...β-actin GAGTATAAAGCAC...keratin I 50K GGTTATAAAAACA...vimentin CAGTATAAAAGGG...α1(I) collagen CCGTATAAATAGG...α2(I) collagen TCCCATATAAGCC...fibronectin Some of 106 aligned human promoter sequences (near -26) Consensus TATAAA
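A position-dependent frequency table can be built directly from the eleven aligned promoter fragments shown above. The sketch below counts bases per column, converts the counts to log-odds scores against a uniform background, and slides the resulting matrix along a sequence. The pseudocount of 1, the 0.25 background, and all function names are assumptions made for illustration.

```python
import math

# The eleven aligned human promoter fragments from the slide (13 nt each).
TATA_SITES = [
    "CCCTATATAAGGC", "CGCTATAAAAACT", "GGGTATATAAGCG", "GGCTATATAAAAC",
    "TTCTATAAAGCGG", "CCCTATAAAACCC", "GAGTATAAAGCAC", "GGTTATAAAAACA",
    "CAGTATAAAAGGG", "CCGTATAAATAGG", "TCCCATATAAGCC",
]


def frequency_table(sites, pseudocount=1.0):
    """One dict per alignment column: base -> smoothed frequency."""
    total = len(sites) + 4 * pseudocount
    table = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        table.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return table


def log_odds(table, background=0.25):
    """Convert frequencies to log2(frequency / background) scores."""
    return [{b: math.log2(p / background) for b, p in col.items()} for col in table]


def score(pssm, window):
    """Sum the column scores for one window of the same width as the matrix."""
    return sum(col[base] for col, base in zip(pssm, window))


def best_hit(pssm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pssm)
    return max((score(pssm, seq[i:i + width]), i) for i in range(len(seq) - width + 1))


def consensus(table):
    """Most frequent base at each position."""
    return "".join(max(col, key=col.get) for col in table)


table = frequency_table(TATA_SITES)
print(consensus(table)[3:9])  # prints "TATAAA"
print(best_hit(log_odds(table), "GCGCGCGCCCCTATAAAACCCGCGCGCGC"))
```

Unlike a bare consensus string, the matrix keeps the information that some positions tolerate variation (the fourth base of the box is T in four of the eleven sites) while others do not.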
aceB ACTATGGAGCATCTGCACATGAAAACC atpI ACCTCGAAGGGAGCAGGAGTGAAAAAC bioB ACGTTTTGGAGAAGCCCCATGGCTCAC glnA ATCCAGGAGAGTTAAAGTATGTCCGCT glnH TAGAAAAAAGGAAATGCTATGAAGTCT lacZ TTCACACAGGAAACAGCTATGACCATG rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC serC GCAACGTGGTGAGGGGAAATGGCTCAA sucA GATGCTTAAGGGATCACGATGCAGAAC trpE CAAAATTAGAGAATAACAATGCAAACA Position-Specific Scoring Matrix in action Surrogate Filters Feature finders Experimentally proven start sites unknown
aceB ACCACATAACTATGGAGCATCTGCACATGAAAACC atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT lacZ TTCACACAGGAAACAG....CTATGACCATG rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC serC GCAACGTGGTGAGGG...GAAATGGCTCAA sucA GATGCTTAAGGGATCA....CGATGCAGAAC trpE CAAAATTAGAGAATA...ACAATGCAAACA Surrogate Filters Feature finders Position-Specific Scoring Matrix in action ACGTACGT
aceB ACCACATAACTATGGAGCATCT.GCACATGAAAACC atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT lacZ TTCACACAGGAAACAG....CTATGACCATG rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC serC GCAACGTGGTGAGGG...GAAATGGCTCAA sucA GATGCTTAAGGGATCA....CGATGCAGAAC trpE CAAAATTAGAGAATA...ACAATGCAAACA Surrogate Filters Feature finders Position-Specific Scoring Matrix in action ACGTACGT
Surrogate Filters Pattern finders Specified patterns (FindPatterns, PatScan) e.g. Find instances of restriction sites New pattern discovery (Meme, Gibbs sampler) snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT nucleolin GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG snRNP E TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT rp S14 GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC rp S17 TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT ribosomal p. S19 ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT α-tubulin bα1 GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG β-tubulin β2 GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA α-actin skel-m. CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC α-cardiac actin TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC β-actin CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA Human sequences 5’ to transcriptional start
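Searching for a specified pattern, such as a restriction site, amounts to a string search in which degenerate IUPAC letters expand to sets of bases. The sketch below uses Python's `re` module; it illustrates the idea only and is not how FindPatterns or PatScan is implemented. The function name `find_pattern` is hypothetical, and only the strand given is searched.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_pattern(seq, pattern):
    """Return 0-based start positions of every match, including overlapping
    ones (the lookahead lets the search resume one base after each hit)."""
    body = "".join(IUPAC[c] for c in pattern.upper())
    regex = re.compile("(?=(" + body + "))")
    return [m.start() for m in regex.finditer(seq.upper())]


# HindIII recognizes AAGCTT; the first row of the Nostoc sequence begins with one.
nostoc_start = "AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC"
print(find_pattern(nostoc_start, "AAGCTT"))  # prints [0]
print(find_pattern("ACGTACAT", "ACRT"))      # prints [0, 4]
```

Palindromic sites such as AAGCTT read the same on both strands, so a one-strand search finds them all. A non-palindromic pattern would also need a search of the reverse complement.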
Surrogate Filters Pattern finders How do pattern finders work? snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT Step 1. Arbitrarily choose candidate pattern from a sequence Step 2. Find best matches to pattern in all sequences Step 3. Construct position-dependent frequency table based on matches Step 4. Calculate relative probability of matches from frequency table GACAGGGCAGAA GCCCGGGTGTTT GCCGGGGACGCG GCCCCCGGGCCT GCCGCAGAGCTG
Surrogate Filters Pattern finders How do pattern finders work? snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT Step 1. Arbitrarily choose candidate pattern from a sequence Step 2. Find best matches to pattern in all sequences Step 3. Construct position-dependent frequency table based on matches Step 4. Calculate relative probability of matches from frequency table Step 5. If probability score high, remember pattern and score
Surrogate Filters Pattern finders How do pattern finders work? snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT Step 1. Arbitrarily choose candidate pattern from a sequence Step 2. Find best matches to pattern in all sequences Step 3. Construct position-dependent frequency table based on matches Step 4. Calculate relative probability of matches from frequency table Step 5. If probability score high, remember pattern and score Step 6. Repeat Steps 1 - 5
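Steps 1 through 6 can be sketched as a short loop. This is a deliberately simplified illustration and is neither Meme's expectation-maximization nor a Gibbs sampler. Two assumptions are specific to the sketch: every window is tried in turn as the candidate (instead of an arbitrary choice) so the result is deterministic, and the Step 4 score is the information content of the frequency table with a pseudocount of 1. All names are hypothetical.

```python
import math


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match(pattern, seq):
    """Step 2: the window of seq with the fewest mismatches to pattern."""
    width = len(pattern)
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    return min(windows, key=lambda w: mismatches(pattern, w))


def information_content(matches, pseudocount=1.0):
    """Steps 3-4: build the position-dependent frequency table column by
    column and sum p * log2(p / 0.25) over all columns and bases."""
    total = len(matches) + 4 * pseudocount
    bits = 0.0
    for column in zip(*matches):
        for base in "ACGT":
            p = (column.count(base) + pseudocount) / total
            bits += p * math.log2(p / 0.25)
    return bits


def discover_motif(seqs, width):
    """Return (score, candidate, matches) for the best-scoring candidate."""
    best = (float("-inf"), None, None)
    for seq in seqs:                                            # Step 6: repeat
        for i in range(len(seq) - width + 1):
            candidate = seq[i:i + width]                        # Step 1
            matches = [best_match(candidate, s) for s in seqs]  # Step 2
            score = information_content(matches)                # Steps 3-4
            if score > best[0]:                                 # Step 5
                best = (score, candidate, matches)
    return best


# Toy input: a TATAAA box planted in three GC-rich sequences.
toy = ["GCGGTATAAAGCGC", "CCTATAAACGCCGG", "GGCGGCATATAAAT"]
print(discover_motif(toy, 6))
```

Real pattern finders refine the frequency table iteratively and estimate how likely a score is to arise by chance, which this sketch does not attempt.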
Surrogate Filters Scenario II – Case of the Masked Motif You’ve found a gene related to Purple Tongue Syndrome BlastP: Encoded protein related to cAMP-binding proteins Are the similarities trivial? Related to cAMP binding? Does your protein contain a cAMP-binding site? What IS a cAMP-binding site? Task 1. Determine what a cAMP-binding site is 2. Determine if your protein has one
Surrogate Filters Scenario II – Case of the Masked Motif Strategy 1. Collect sequences of known cAMP-binding proteins 2. Run Meme, a pattern-finding program Ask it to find any significant motifs 3. Rerun Meme. Demand that every protein has identified motifs 4. Run Pfam over known sequence to check Do it
Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Progressive External Ophthalmoplegia (PEO) Slow paralysis of voluntary eye muscles Many other symptoms (e.g., frequent deafness) Loss of mitochondrial DNA
Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Progressive External Ophthalmoplegia (PEO) Slow paralysis of voluntary eye muscles Many other symptoms (e.g., frequent deafness) Loss of mitochondrial DNA Inheritance Mendelian Autosomal dominant Linked to chromosome 4q34
Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Progressive External Ophthalmoplegia (PEO) Slow paralysis of voluntary eye muscles Many other symptoms (e.g., frequent deafness) Loss of mitochondrial DNA Inheritance Mendelian Autosomal dominant Linked to chromosome 4q34 Your task Examine sequence of 4q34 region Assess likelihood that a gene in the area could cause disease symptoms
Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Examining Sequence of 4q34 Region tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaatgaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgct ctgacctctggcagctttccactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatgccctctgtggccctggaaccttagtgacttctgctat accaaagtctccacgcccagggtgacacgcagctgcagctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagtttataaaaacaatgaataaactttgttaaaggtacaaatgaaaat tagcaaacatgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagtcattctaggggaaggaacagttgtatttgaaaacctgtatggttacatgaactgcctaaaaaacaagctaagga aaattaaagctcagatttatatattttaagaaattaattgcaattaatttcctgggattaaatagcatttcctcaaccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttctccgga aggctgacagcactgaccctcaagaaggcaccggctgacagacagaacattctgccctaatatgtgctgaaattccgctgagagcagagtggtacattgaaccctttaggggcttacaaaagaagtgtcct gtgttttagagtcacagagttttgcagaaacaagtatgaattcacctagtggccccctgcaccaggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcagaatgaatgactgaacgaa cgattgaatgaaaagaaatgagaggcagcaggttgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatttgagaggactgccatttattctcgggagcgcacggctctaaagaggcc catatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggaggaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagggtcaagaactctccaccggcggcagcggc ccggtgtctgccccggcttcgccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcggtgacacggtgttccctgggctcggcgggacagataacatgaatgtgccctttaaacgtcc caagttgcagggacagcccccggcccagcctcgctcccggaagcgccttcgcccccgatgccctctgcagctgggaggagggggcgccccgcacctgcccagccaatgcgcggcgcgagcgccggccgcga cccgcctcctctcgcgagagcccggcggggatataagggggagctgcgggccaggcggcggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgggcgtgggcgagagcacgaacgg gctgcctgcgggctgagagcgtcgagctgtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgggggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagagggtca aactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgggcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcggggcgccgcggaaaatctgcgccaggccacaggc 
ccgggcgcccgcccgcccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtcgcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatggaaacccacccggagc cggtttacgtgtgccagatcctgcgcccgtgacagcacgggcgtgcactcaggcccggaggcacctagtgattgccagtatttttggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaat aatcatcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtaggctgacgccttcatctttatgtaacctctgtgagagagttattcttctccattttacagatgaagctgaggttttgaa atattaagaaacaattttcggaataaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgctgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtcagggcccct caatgaggaagagcccaatttgggagtcagaattactaacaacaaaacccccacaaattgctcacaacggcagcaaacccttaataattgattacttggattatctgcttgaaaactttggaggcctaatg tttagtggatttattctccttcctctattagagcatctagtagagatcctcatctccagggtgatcagagtgacactgagaaattgtcattttttggccatcatgtctattaaatccaaagccctttgaag cagggagtgttactcatttctgtcccccagtaagcccctcatacagttctcaaacctagggaaagtgaaataaataaatggctatagctttatataattcaatcaccttttcagtttatttggggcaatac ctttccctcaaataccctaataattgaagcaacattggattattttggcttgttatccagtaactaacatggataacagtatccatttacacgtcctcgtatccatttgatttcctcatcctttttttctt caaaaaaaaaatctaggaagtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctgcctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagcaaacagatcagtgc tgagaagcagtacaaagggatcattgattgtgtggtgagaatccctaaggagcagggcttcctctccttctggaggggtaacctggccaacgtgatccgttacttccccacccaagctctcaacttcgcct tcaaggacaagtacaagcagctcttcttagggggtgtggatcggcataagcagttctggcgctactttgctggtaacctggcgtccggtggggccgctggggccacctccctttgctttgtctacccgctg gactttgctaggaccaggttggctgctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgactgtatcatcaagatcttcaagtctgatggcctgagggggctctaccagggtttcaa cgtctctgtccaaggcatcattatctatagagctgcctacttcggagtctatgatactgccaagggtgagagaggggcatcggggagaaggagggtggtgtggaaagaggatcctatgggatctataactc acaaaggacctgatatatattgatcttgttttttctagtctctgggataattgaggcttctgaatgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcctctactgaaataaactctg gcctttagttattcagagaggaggaggggggagcctgtctccctctagacacagccatagcagttactgagtttaacttgaagccacttccaatgccctgtatacaagctgagcactgcccctccggggtc 
cggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcacctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgcagtggcctctctccctccacctgctttctg ctgagaacaggcacttcatagccgttcggcttctgggctctgtccacagggatgctgcctgaccccaagaacgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtcgcagggctggt gtcctacccctttgacactgttcgtcgtagaatgatgatgcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttgtttggttttgcccgaggagaacattttacagggctcctttca gtcttccttactggaaattaattttcaaaattatttgataaggacttagggaagaaagatggtattaattccccctaacgttctcaactatcctattagggaaaagtattttccattttattagagatgat aagaacatgaatagtaagacatttagatgtgaatttaactaggtatccagcattatagagaccctaggccctcttcccttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttcttaca aagaactcttgcttccctcctagttacaggtgttagtgggatggggtgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaagttttggcttctataggttgaaccatatgaaattgc cactttaaaagtcaaaaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagccttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctcagttcggtcctccata aaaaaaggtaaccgcgtagcataatactcctgctccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggttaattgccccagttcttcactgaccttgaactaatggagtaggaatg acaggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgttgcatggagctgggactccatgcccagatgaccctgattttataaaactggtaacagtgtgtacagatatgtttcagg ggaaaagtctctttcctccagcgttacggagccctcaccagcatttgtttccacagccgatattatgtacacggggacagttgactgctggaggaagattgcaaaagacgaaggagccaaggccttcttca aaggtgcctggtccaatgtgctgagaggcatgggcggtgcttttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaattaaaacacaagttcacagatttacatgaacttgatctacaag ttcacagatccattgtgtggtttaatagactattcctaggggaagtaaaaagatctgggataaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttcattaaaccacacatgtatttt gtatttattttacatttaaattcccacagcaaatagaaaataatttatcatacttgtacaattaactgaagaattgataataactgaatgtgaaacatcaataaagaccacttaatgcacgctttctattt tattgaactcttattaactgtaaaatgcatttttaaaagatcaaaaatgcatattttctagcatgattcatgtatcagtcagcagccaagcttctaaatgccagatattatattgagaatgtattatatga gaacgtacaatgcttaaagttccggttttcaaacttaggcaggtcatattctatctatcttatccagcgttactgtaggctagaaagtgataatggctttcataatcctgccttgtcttaggcactttcct gcag
Strategy Protein has function associated with mitochondrial location? Protein has structure associated with mitochondrial location? Assume that encoded protein is in mitochondria – Use Gene finder to identify protein sequence(s) – Use Similarity finder to identify possible function – Use Feature finders to identify pertinent regions – (What ARE pertinent regions?) Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Name: PEO-related_gene? First three lines of sequence: tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg fgene Wed Feb 27 16:55:29 GMT 2002 >PEO-related_gene? length of sequence number of predicted exons - 5 positions of predicted exons: w= ORF: w= 9.13 ORF: w= 6.08 ORF: w= ORF: w= 1.93 ORF: Length of Coding region- 708bp Amino acid sequence - 235aa MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG IIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSG RKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV* Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Run 4q34 region through FGene
Name: PEO-related_gene? First three lines of sequence: tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg Fgenesh Wed Feb 27 16:59:14 GMT 2002 FGENESH 1.0 Prediction of potential genes in Human genomic DNA Time: Wed Feb 27 16:59: Seq name: PEO-related_gene? Length of sequence: 5768 GC content: 48 Zone: 2 Positions of predicted genes and exons: G Str Feature Start End Score ORF Len 1 + TSS CDSf CDSi CDSi CDSl PolA Predicted protein(s): >FGENESH 1 4 exon (s) aa, chain + MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV FGENE output w= w= w= w= w= 1.93 Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Run 4q34 region through FGeneSH
How to decide where exons are? [Figure: DNA (P, Exon, Intron, Exon, Intron, Exon), hnRNA, mRNA with poly(A) tail] Strategy Compare sequence of 4q34 region to sequence of mRNA Sequence of mRNA may be in cDNA library Expressed Sequence Tag (EST) library Problems Library may not exist Expression of gene may be low
MORAL: Trust, but verify. Final Score Card for Gene Finders Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Run 4q34 region through BlastN (vs. human ESTs)
Strategy Protein has function associated with mitochondrial location? Protein has structure associated with mitochondrial location? Assume that encoded protein is in mitochondria – Use Gene finder to identify protein sequence(s) – Use Similarity finder to identify possible function – Use Feature finders to identify pertinent structures – (What ARE pertinent structures?) Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Name: PEO-related_gene? First three lines of sequence: tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg Fgenesh Wed Feb 27 16:59:14 GMT 2002 FGENESH 1.0 Prediction of potential genes in Human genomic DNA Time: Wed Feb 27 16:59: Seq name: PEO-related_gene? Length of sequence: 5768 GC content: 48 Zone: 2 Positions of predicted genes and exons: G Str Feature Start End Score ORF Len 1 + TSS CDSf CDSi CDSi CDSl PolA Predicted protein(s): >FGENESH 1 4 exon (s) aa, chain + MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Run 4q34 region through BlastP
Summary One protein in region Contains mitochondrial carrier motifs Similar to ATP/ADP transporter Mitochondrial signal sequence? Reasonable candidate for PEO-related protein Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Run 4q34 region through BlastP
Complex gene discovery Your turn: Repeat and extend characterization of PEO-related gene 1. Take same sequence (FastA format) e-mailed to you 2. Get better estimate of promoter and polyA site (e.g. by TSSW and PolyASH) (Is there a TATA box upstream from the predicted promoter?) 3. Find encoded protein sequence by suitable method (e.g. FGeneSH(GC) or comparison with cDNA) 4. Continue characterization of protein * Contains signal sequence? * Contains transmembrane domains?
Filter limitation Inevitable… but whose filter?
Filters controlled by outside programmers
Filters controlled by you