Frog’s eye view of the jungle (time frozen) Push to restart time.

Presentation transcript:

Frog’s eye view of the jungle (time frozen) Push to restart time

Frog’s eye view of the jungle (time moving) Frog’s eye view of the jungle (time frozen)

Frog’s eye view of the jungle (through movement filter) Push to restart time

Frog’s eye view of the jungle (through movement filter)

Filters: Information reducers Movement filter

Filters: Information reducers Sequence filter TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT CTCCGTAAAC CTCTAAC... How organism is made How organism works

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of folding Active site
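The "Genetic code" step on this slide can be reproduced in a few lines. The following is a minimal sketch, assuming the standard genetic code; the names `CODON_TABLE`, `THREE_LETTER`, `translate` and `three_letter` are illustrative and do not come from the presentation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one amino acid per codon, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def translate(dna):
    """Translate DNA in frame 0 until the first stop codon or the end."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def three_letter(protein):
    """Render a one-letter protein string in the slide's three-letter style."""
    return "-".join(THREE_LETTER[aa] for aa in protein)


print(three_letter(translate("ATGACTTATGATCAACGCACAGGGCTA")))
# Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu
```

Translation is the easy part of the path from sequence to organism; the later steps on the slide (rules of folding, active site) have no comparably simple table.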

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Active site Cell interaction Metabolism, Architecture Genetic codeRules of folding

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Active site Gives us: Custom antibiotics Genetic code Rules of folding

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Gives us: Custom antibiotics Custom antibodies Custom enzymes New materials Genetic code Rules of folding Active site

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of transcriptional and post-transcriptional control Transcr’l initiation Transcr’l termination/ polyA tailing Splicing Transl’l initiation ? TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA ATGACTTATGATCAACGCACAGGGCTA 3% TCTACTTATATTCAATCCACAGGGCTA CACCTAGTTCTTGAAGAGTCTGTTGAA TGAACACATACATGGTTTATCTGTTTT TCTGTCTGCTCTGACCTCTGGCAGCTT TAGCCTGCCCCACTCTTAGATAAACGA ACCTTAGTGACTTCTGCTATACCAAAG TCTCCACGCCCCTCCGTAAACCTCTAA CATGATGTCAGCAAATATTAAAAATGA 97%

From Sequence to Organism How does Nature do it? Natural filters/transformations Selective transcription Selective processing Translation Folding DNA Functional protein

From Sequence to Organism How does Nature do it? Natural filters/transformations DNA Functional protein

From Sequence to Organism How can WE do it? Simulation of Nature Utterance of Wm Shakespeare: “Whether ‘tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” Utterance of George W Bush: “We must give our military every tool and weapon it needs to prevail...” ???

From Sequence to Organism How can WE do it? Surrogate Processes Utterance of Wm Shakespeare: “Whether ‘tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” Utterance of George W Bush: “We must give our military every tool and weapon it needs to prevail...” Words/sentence; Choice of words; Sentence structure; …

From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding Surrogate filters Characteristics of coding sequences/introns My sequence Gene finders Predicted coding regions

From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding Surrogate filters Gene finders Similarity finders Sequence/motif Databases My sequence

From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding Surrogate filters Gene finders Similarity finders Feature finders Predicted features Characteristics of features My sequence

From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding Surrogate filters Gene finders Similarity finders Feature finders Pattern finders My sequences Statistical engine

Surrogate Filters Gene finders Similarity finders Feature finders Pattern finders How do they work? Case studies Real problems Mixed strategies You do it

Surrogate Filters Gene finders Class 1: Start/Stop codon search (Map, Frames, OrfFinder) CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA Look for start codons (ATG) (GTG,TTG) Look for stop codons (TAA,TAG,TGA)

CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG Surrogate Filters Gene finders Class 1: Start/Stop codon search (Map, Frames, OrfFinder) Look for start codons (ATG) (GTG,TTG) Look for stop codons (TAA,TAG,TGA)

Surrogate Filters Gene finders Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Pro: Quick, simple
Con: Useless for eukaryotic genomic sequences (introns); Inaccurate (start codon problem); Inaccurate (doubtful short open reading frames)
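A Class 1 gene finder can be sketched as a scan of all six reading frames for a start codon followed in frame by a stop codon. This is a minimal illustration, not the code of Map, Frames or OrfFinder; `find_orfs`, `reverse_complement` and the `min_codons` cutoff are assumed names and parameters.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orfs(dna, starts=("ATG",), min_codons=2):
    """Return (strand, frame, begin, end, sequence) for each start..stop run.

    Coordinates are 0-based and refer to the strand being scanned, so hits
    on the "-" strand are positions in the reverse complement.
    """
    results = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            begin = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if begin is None and codon in starts:
                    begin = i  # first start codon opens the frame
                elif begin is not None and codon in STOPS:
                    if (i + 3 - begin) // 3 >= min_codons:
                        results.append((strand, frame, begin, i + 3, seq[begin:i + 3]))
                    begin = None
    return results


print(find_orfs("CCATGAAATAGCC"))
# [('+', 2, 2, 11, 'ATGAAATAG')]
```

Passing `starts=("ATG", "GTG", "TTG")` admits the alternative start codons from the slide, at the cost of more candidate starts per frame, which is the "start codon problem" listed under Con. Raising `min_codons` is the usual crude guard against doubtful short open reading frames.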

Surrogate Filters Gene finders The code is degenerate Class 2: Codon bias recognition (TestCode) Are codons equally used?

Surrogate Filters Gene finders Codon usage is biased Most frequently used codons Class 2: Codon bias recognition (TestCode) Codon bias universal?

Surrogate Filters Gene finders Class 2: Codon bias recognition (TestCode)
Pro: Quick, simple, available through GCG; Better than Class 1 in excluding false open reading frames
Con: Useless for eukaryotic genomic sequences (introns); Gives only general areas of open reading frames
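The idea behind codon bias recognition can be illustrated by tabulating codon frequencies in known genes and asking whether a candidate reading frame uses the favored codons. The sketch below is a simplified stand-in and is not the TestCode algorithm; `codon_usage`, `log_likelihood_ratio` and the `floor` value for unseen codons are assumptions made for illustration.

```python
import math
from collections import Counter


def codon_usage(orf):
    """Fraction of each codon among the in-frame codons of a sequence."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def log_likelihood_ratio(orf, reference, floor=1e-4):
    """Sum over codons of log2(reference frequency / uniform frequency).

    Positive totals mean the frame prefers the codons that the reference
    genes prefer; `floor` replaces codons never seen in the reference.
    """
    orf = orf.upper()
    uniform = 1 / 64
    total = 0.0
    for i in range(0, len(orf) - 2, 3):
        total += math.log2(reference.get(orf[i:i + 3], floor) / uniform)
    return total


# Toy reference in which leucine is encoded mostly by CTG.
reference = codon_usage("CTGCTGCTACTG")
print(reference)  # {'CTG': 0.75, 'CTA': 0.25}
print(log_likelihood_ratio("CTGCTG", reference) > log_likelihood_ratio("CTACTA", reference))
# True
```

In practice the reference table would come from many genes of the same organism, since the slide's question "Codon bias universal?" points at the fact that preferred codons differ between organisms.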

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Principle
Step 1: Create model through extensive training set
* Training set = proven or suspected genes
* Organism-specific
Step 2: Assess candidate genes through filter of model

Step 1: Create model through extensive training set Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition AAA AAC AAG AAT ACA... TTG TTT Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

Step 1: Create model through extensive training set AAAA: 33% AAAC: 25% AAAG: 12% AAAT: 30% Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition AAA AAC AAG AAT ACA... TTG TTT Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

Step 1: Create model through extensive training set AACA: 30% AACC: 20% AACG: 15% AACT: 35% AAA AAC AAG AAT ACA... TTG TTT Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model (preceding three bases: AAA AAC AAG AAT ACA ... TTG TTT; next base: A C G T) Candidate gene: AAAGCAA…

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model (preceding three bases: AAA AAC AAG AAT ACA ... TTG TTT; next base: A C G T) Candidate gene: AAAGCAA… 0.12 x

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model (preceding three bases: AAA AAC AAG AAT ACA ... TTG TTT; next base: A C G T) Candidate gene: AAAGCTA… 0.12 x So far, not a good candidate!
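Steps 1 and 2 can be sketched as follows: count how often each base follows each three-base context in the training set, then multiply those probabilities along a candidate. This is a plain fixed-order Markov chain, which is only the scoring core and not a full hidden Markov model with coding and noncoding states; `train_markov`, `log_score` and the `floor` probability for unseen transitions are assumed for illustration.

```python
import math
from collections import defaultdict


def train_markov(training_seqs, order=3):
    """Step 1: estimate P(next base | previous `order` bases) from known genes."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in training_seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {base: n / total for base, n in following.items()}
    return model


def log_score(model, candidate, order=3, floor=1e-3):
    """Step 2: log of the product of transition probabilities along a candidate.

    Logs are summed instead of multiplying raw probabilities, because a
    product such as 0.12 x ... shrinks toward zero on real gene lengths.
    """
    candidate = candidate.upper()
    score = 0.0
    for i in range(len(candidate) - order):
        context, base = candidate[i:i + order], candidate[i + order]
        score += math.log(model.get(context, {}).get(base, floor))
    return score


model = train_markov(["AAAGAAAGAAAT"])
print(model["AAA"])  # G follows AAA two times out of three, T once
print(log_score(model, "AAAG") > log_score(model, "AAAC"))  # True
```

A candidate is usually judged by comparing its score under a model trained on genes with its score under a model trained on noncoding sequence, which is why a large, organism-specific training set matters.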

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition
Pro: Among the most accurate methods known
Con: Needs big training set; May miss genes of foreign origin; Will miss very small genes


Surrogate Filters Scenario I – Case of the Hidden Heterocyst

Case of the Hidden Heterocyst [Figure: heterocysts, labeled N2, NH3 and O2] Matveyev and Elhai (unpublished)

Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Nostoc genome Transposon 1. Use transposon mutagenesis

Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Nostoc genome Transposon 1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation

Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Nostoc genome 2. Sequence out from transposon AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATA ATCAATGACTATCAGACAGAGAATCATCGTGCTGTCA GTAAAACCTCTGATTTCGATCTTTACCATAATTGTTA TGTTGTAATGACTAACCAGACTATCTTTTACAGAGCT TCTGGTTAACACTTGTCTAATTAGACATTGATAATGT TTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAAT TACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACG AGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTT AACTTCAGAAATTCACGGCGGAAATCCATAGTTATTA TTACTTATGACTAAAACAAAATTACTATGGCGGCTTG TTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTA AAGTCCCACTAACTTTTTTCTCATCTATTGCTATATT TCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCA AATAACAAACTCATTTTTAGTAGATATTTCATGCAAA CTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTAC AGCCACTCCACAAACCTTAGAATGGCTACTCAATATT GCAATTGATCATGAATATCCCACTGGTAGAGCAGTTT TAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGT TGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA 1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation

Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Nostoc genome 2. Sequence out from transposon AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATA ATCAATGACTATCAGACAGAGAATCATCGTGCTGTCA GTAAAACCTCTGATTTCGATCTTTACCATAATTGTTA TGTTGTAATGACTAACCAGACTATCTTTTACAGAGCT TCTGGTTAACACTTGTCTAATTAGACATTGATAATGT TTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAAT TACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACG AGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTT AACTTCAGAAATTCACGGCGGAAATCCATAGTTATTA TTACTTATGACTAAAACAAAATTACTATGGCGGCTTG TTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTA AAGTCCCACTAACTTTTTTCTCATCTATTGCTATATT TCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCA AATAACAAACTCATTTTTAGTAGATATTTCATGCAAA CTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTAC AGCCACTCCACAAACCTTAGAATGGCTACTCAATATT GCAATTGATCATGAATATCCCACTGGTAGAGCAGTTT TAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGT TGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA 1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation 3. Find gene boundaries 4. Identify gene Do it

Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes
1. Go to
2. Open second browser (Ctrl-N in Netscape). Go to same site (copy and paste URL)
3. In 1st browser, go to Program List. Click on Gene Finders. Open GeneMark
4. In 2nd browser, open Nostoc sequence

Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Mission successful: >Translation: (direct), 81 amino acids VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNAR VTQTVKLVRLEKFLSLQKSVEEALENVK* … or was it? Check predicted protein against databases

Surrogate Filters Similarity finders
Blast
BlastP: Protein sequence to search protein database
BlastN: Nucleotide sequence to search nucleotide database
BlastX: Nucleotide sequence (translated) to search protein database
TBlastN: Protein sequence to search (translated) nucleotide database
Blast2Seq: Compare two sequences you specify
Do it
FastA (Various flavors)
Pfam (Protein motif families): Finds conserved motifs similar to protein sequence

Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Mission successful: >Translation: (direct), 81 amino acids VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNAR VTQTVKLVRLEKFLSLQKSVEEALENVK* Why? GeneMark correct: Conservation of noncoding regions VLGSK GeneMark wrong: Fooled by weird aa sequence or start codon

Case of the Hidden Heterocyst Strategy to find heterocyst differentiation genes Moral Automated gene finders are wonderful, but common sense is better Don’t trust automated annotation

Surrogate Filters Feature finders
Hidden Markov model-based methods: Good for contiguous features (e.g. signal sequences); Not good with features with gaps (e.g. promoters)
Ad hoc methods: Feature-specific rules (e.g. tandem repeats, terminators)
Position-dependent frequency tables = Position-specific scoring matrix (PSSM) = Weight table

Surrogate Filters Feature finders Position-dependent frequency tables
Some of 106 aligned human promoter sequences (near -26)
CCCTATATAAGGC... histone H1t
CGCTATAAAAACT... HMG-17
GGGTATATAAGCG... β-tubulin β2
GGCTATATAAAAC... α-actin skel-m.
TTCTATAAAGCGG... α-cardiac actin
CCCTATAAAACCC... β-actin
GAGTATAAAGCAC... keratin I 50K
GGTTATAAAAACA... vimentin
CAGTATAAAAGGG... α1(I) collagen
CCGTATAAATAGG... α2(I) collagen
TCCCATATAAGCC... fibronectin
Consensus TATAAA

Surrogate Filters Feature finders Position-dependent frequency tables
Some of 106 aligned human promoter sequences (near -26)
CCCTATATAAGGC... histone H1t
CGCTATAAAAACT... HMG-17
GGGTATATAAGCG... β-tubulin β2
GGCTATATAAAAC... α-actin skel-m.
TTCTATAAAGCGG... α-cardiac actin
CCCTATAAAACCC... β-actin
GAGTATAAAGCAC... keratin I 50K
GGTTATAAAAACA... vimentin
CAGTATAAAAGGG... α1(I) collagen
CCGTATAAATAGG... α2(I) collagen
TCCCATATAAGCC... fibronectin
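A position-dependent frequency table for the eleven promoter fragments shown on the slide can be built by counting bases column by column. The sketch below is illustrative: `frequency_table`, `consensus` and `log_odds` are assumed names, and the uniform background of 0.25 and the pseudocount of 0.01 are arbitrary choices rather than values from the presentation.

```python
import math
from collections import Counter

# The eleven aligned fragments listed on the slide (13 bases each).
ALIGNED = [
    "CCCTATATAAGGC", "CGCTATAAAAACT", "GGGTATATAAGCG", "GGCTATATAAAAC",
    "TTCTATAAAGCGG", "CCCTATAAAACCC", "GAGTATAAAGCAC", "GGTTATAAAAACA",
    "CAGTATAAAAGGG", "CCGTATAAATAGG", "TCCCATATAAGCC",
]


def frequency_table(aligned):
    """One dict per alignment column giving the fraction of A, C, G and T."""
    table = []
    for column in zip(*aligned):
        counts = Counter(column)
        table.append({base: counts[base] / len(aligned) for base in "ACGT"})
    return table


def consensus(table):
    """Most frequent base at each position."""
    return "".join(max("ACGT", key=column.get) for column in table)


def log_odds(table, window, background=0.25, pseudo=0.01):
    """Score a window of the same length as the table against the background."""
    return sum(
        math.log2((column[base] + pseudo) / (background + pseudo))
        for column, base in zip(table, window)
    )


table = frequency_table(ALIGNED)
print(consensus(table)[3:9])  # TATAAA
```

Sliding `log_odds` along a longer sequence and keeping the high-scoring windows is how such a table acts as a feature finder. Because each column is scored independently, the approach suits fixed-width sites and handles gapped features poorly.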

Surrogate Filters Feature finders Position-Specific Scoring Matrix in action
aceB ACTATGGAGCATCTGCACATGAAAACC
atpI ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ TTCACACAGGAAACAGCTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGGGAAATGGCTCAA
sucA GATGCTTAAGGGATCACGATGCAGAAC
trpE CAAAATTAGAGAATAACAATGCAAACA
Experimentally proven start sites unknown


Surrogate Filters Feature finders Position-Specific Scoring Matrix in action
aceB ACCACATAACTATGGAGCATCTGCACATGAAAACC
atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ TTCACACAGGAAACAG....CTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA GATGCTTAAGGGATCA....CGATGCAGAAC
trpE CAAAATTAGAGAATA...ACAATGCAAACA
ACGTACGT

Surrogate Filters Feature finders Position-Specific Scoring Matrix in action
aceB ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ TTCACACAGGAAACAG....CTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA GATGCTTAAGGGATCA....CGATGCAGAAC
trpE CAAAATTAGAGAATA...ACAATGCAAACA
ACGTACGT

Surrogate Filters Pattern finders
Specified patterns (FindPatterns, PatScan) e.g. Find instances of restriction sites
New pattern discovery (Meme, Gibbs sampler)
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14 GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17 TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19 ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1 GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin β2 GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m. CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA
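Searching for a specified pattern such as a restriction site amounts to an exact string scan. The following is a minimal sketch and not the interface of FindPatterns or PatScan; `SITES` and `find_sites` are illustrative names, and the three recognition sequences are the well-known sites of EcoRI, HindIII and BamHI.

```python
import re

# Recognition sequences of three common restriction enzymes.
SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}


def find_sites(dna, sites=SITES):
    """Return (enzyme, 0-based position) for every occurrence, sorted by position."""
    dna = dna.upper()
    hits = []
    for name, pattern in sites.items():
        # A zero-width lookahead reports overlapping occurrences as well.
        for match in re.finditer(f"(?={pattern})", dna):
            hits.append((name, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


print(find_sites("AAGCTTGACCGAATTCAAGCTT"))
# [('HindIII', 0), ('EcoRI', 10), ('HindIII', 16)]
```

Because these sites are palindromic, scanning one strand is enough. A pattern that is not its own reverse complement would need a second pass over the opposite strand.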

Surrogate Filters Pattern finders How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
GACAGGGCAGAA GCCCGGGTGTTT GCCGGGGACGCG GCCCCCGGGCCT GCCGCAGAGCTG

Surrogate Filters Pattern finders How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score

Surrogate Filters Pattern finders How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Step 6. Repeat Steps 1 - 5
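Steps 1 to 6 can be sketched as a simple random-restart search. This is a deliberately simplified illustration and is neither Meme's expectation-maximization nor a true Gibbs sampler: matches are chosen by fewest mismatches, and the "relative probability" of Step 4 is replaced by the information content of the match columns against a uniform background. The names `best_match`, `information_score` and `discover` and the `trials` and `seed` parameters are assumptions.

```python
import math
import random
from collections import Counter


def best_match(pattern, seq):
    """Step 2: the window of seq with the fewest mismatches to pattern."""
    k = len(pattern)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return min(windows, key=lambda w: sum(a != b for a, b in zip(pattern, w)))


def information_score(matches):
    """Steps 3 and 4: per-column frequencies scored in bits (2 per perfect column)."""
    score = 0.0
    for column in zip(*matches):
        for count in Counter(column).values():
            freq = count / len(matches)
            score += freq * math.log2(freq / 0.25)
    return score


def discover(seqs, k, trials=200, seed=0):
    """Steps 1, 5 and 6: try arbitrary candidates and remember the best one."""
    rng = random.Random(seed)
    best = (float("-inf"), None, None)
    for _ in range(trials):
        source = rng.choice(seqs)
        start = rng.randrange(len(source) - k + 1)
        pattern = source[start:start + k]            # Step 1
        matches = [best_match(pattern, s) for s in seqs]
        score = information_score(matches)
        if score > best[0]:                          # Step 5
            best = (score, pattern, matches)
    return best


print(best_match("TATAAA", "GGGTATAAAGGG"))   # TATAAA
print(information_score(["TATAAA"] * 3))      # 12.0
```

Calling `discover(sequences, 12)` on the five promoter fragments above returns the best-scoring 12-base pattern found, together with its score and its match in each sequence. More trials give a better chance of starting from a window that overlaps the real motif.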

Surrogate Filters Scenario II – Case of the Masked Motif
You’ve found a gene related to Purple Tongue Syndrome
BlastP: Encoded protein related to cAMP-binding proteins
Are the similarities trivial? Related to cAMP binding? Does your protein contain cAMP-binding site? What IS a cAMP-binding site?
Task
1. Determine what is a cAMP-binding site
2. Determine if your protein has one

Surrogate Filters Scenario II – Case of the Masked Motif Strategy
1. Collect sequences of known cAMP-binding proteins
2. Run Meme, a pattern-finding program. Ask it to find any significant motifs
3. Rerun Meme. Demand that every protein has identified motifs
4. Run Pfam over known sequence to check
Do it

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Progressive External Ophthalmoplegia (PEO) Slow paralysis of voluntary eye muscles Many other symptoms (e.g., frequent deafness) Loss of mitochondrial DNA

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Progressive External Ophthalmoplegia (PEO) Slow paralysis of voluntary eye muscles Many other symptoms (e.g., frequent deafness) Loss of mitochondrial DNA Inheritance Mendelian Autosomal dominant Linked to chromosome 4q34

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Progressive External Ophthalmoplegia (PEO) Slow paralysis of voluntary eye muscles Many other symptoms (e.g., frequent deafness) Loss of mitochondrial DNA Inheritance Mendelian Autosomal dominant Linked to chromosome 4q34 Your task Examine sequence of 4q34 region Assess likelihood that a gene in the area could cause disease symptoms

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Examining Sequence of 4q34 Region tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaatgaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgct ctgacctctggcagctttccactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatgccctctgtggccctggaaccttagtgacttctgctat accaaagtctccacgcccagggtgacacgcagctgcagctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagtttataaaaacaatgaataaactttgttaaaggtacaaatgaaaat tagcaaacatgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagtcattctaggggaaggaacagttgtatttgaaaacctgtatggttacatgaactgcctaaaaaacaagctaagga aaattaaagctcagatttatatattttaagaaattaattgcaattaatttcctgggattaaatagcatttcctcaaccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttctccgga aggctgacagcactgaccctcaagaaggcaccggctgacagacagaacattctgccctaatatgtgctgaaattccgctgagagcagagtggtacattgaaccctttaggggcttacaaaagaagtgtcct gtgttttagagtcacagagttttgcagaaacaagtatgaattcacctagtggccccctgcaccaggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcagaatgaatgactgaacgaa cgattgaatgaaaagaaatgagaggcagcaggttgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatttgagaggactgccatttattctcgggagcgcacggctctaaagaggcc catatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggaggaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagggtcaagaactctccaccggcggcagcggc ccggtgtctgccccggcttcgccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcggtgacacggtgttccctgggctcggcgggacagataacatgaatgtgccctttaaacgtcc caagttgcagggacagcccccggcccagcctcgctcccggaagcgccttcgcccccgatgccctctgcagctgggaggagggggcgccccgcacctgcccagccaatgcgcggcgcgagcgccggccgcga cccgcctcctctcgcgagagcccggcggggatataagggggagctgcgggccaggcggcggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgggcgtgggcgagagcacgaacgg gctgcctgcgggctgagagcgtcgagctgtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgggggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagagggtca aactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgggcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcggggcgccgcggaaaatctgcgccaggccacaggc 
ccgggcgcccgcccgcccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtcgcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatggaaacccacccggagc cggtttacgtgtgccagatcctgcgcccgtgacagcacgggcgtgcactcaggcccggaggcacctagtgattgccagtatttttggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaat aatcatcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtaggctgacgccttcatctttatgtaacctctgtgagagagttattcttctccattttacagatgaagctgaggttttgaa atattaagaaacaattttcggaataaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgctgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtcagggcccct caatgaggaagagcccaatttgggagtcagaattactaacaacaaaacccccacaaattgctcacaacggcagcaaacccttaataattgattacttggattatctgcttgaaaactttggaggcctaatg tttagtggatttattctccttcctctattagagcatctagtagagatcctcatctccagggtgatcagagtgacactgagaaattgtcattttttggccatcatgtctattaaatccaaagccctttgaag cagggagtgttactcatttctgtcccccagtaagcccctcatacagttctcaaacctagggaaagtgaaataaataaatggctatagctttatataattcaatcaccttttcagtttatttggggcaatac ctttccctcaaataccctaataattgaagcaacattggattattttggcttgttatccagtaactaacatggataacagtatccatttacacgtcctcgtatccatttgatttcctcatcctttttttctt caaaaaaaaaatctaggaagtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctgcctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagcaaacagatcagtgc tgagaagcagtacaaagggatcattgattgtgtggtgagaatccctaaggagcagggcttcctctccttctggaggggtaacctggccaacgtgatccgttacttccccacccaagctctcaacttcgcct tcaaggacaagtacaagcagctcttcttagggggtgtggatcggcataagcagttctggcgctactttgctggtaacctggcgtccggtggggccgctggggccacctccctttgctttgtctacccgctg gactttgctaggaccaggttggctgctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgactgtatcatcaagatcttcaagtctgatggcctgagggggctctaccagggtttcaa cgtctctgtccaaggcatcattatctatagagctgcctacttcggagtctatgatactgccaagggtgagagaggggcatcggggagaaggagggtggtgtggaaagaggatcctatgggatctataactc acaaaggacctgatatatattgatcttgttttttctagtctctgggataattgaggcttctgaatgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcctctactgaaataaactctg gcctttagttattcagagaggaggaggggggagcctgtctccctctagacacagccatagcagttactgagtttaacttgaagccacttccaatgccctgtatacaagctgagcactgcccctccggggtc 
cggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcacctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgcagtggcctctctccctccacctgctttctg ctgagaacaggcacttcatagccgttcggcttctgggctctgtccacagggatgctgcctgaccccaagaacgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtcgcagggctggt gtcctacccctttgacactgttcgtcgtagaatgatgatgcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttgtttggttttgcccgaggagaacattttacagggctcctttca gtcttccttactggaaattaattttcaaaattatttgataaggacttagggaagaaagatggtattaattccccctaacgttctcaactatcctattagggaaaagtattttccattttattagagatgat aagaacatgaatagtaagacatttagatgtgaatttaactaggtatccagcattatagagaccctaggccctcttcccttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttcttaca aagaactcttgcttccctcctagttacaggtgttagtgggatggggtgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaagttttggcttctataggttgaaccatatgaaattgc cactttaaaagtcaaaaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagccttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctcagttcggtcctccata aaaaaaggtaaccgcgtagcataatactcctgctccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggttaattgccccagttcttcactgaccttgaactaatggagtaggaatg acaggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgttgcatggagctgggactccatgcccagatgaccctgattttataaaactggtaacagtgtgtacagatatgtttcagg ggaaaagtctctttcctccagcgttacggagccctcaccagcatttgtttccacagccgatattatgtacacggggacagttgactgctggaggaagattgcaaaagacgaaggagccaaggccttcttca aaggtgcctggtccaatgtgctgagaggcatgggcggtgcttttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaattaaaacacaagttcacagatttacatgaacttgatctacaag ttcacagatccattgtgtggtttaatagactattcctaggggaagtaaaaagatctgggataaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttcattaaaccacacatgtatttt gtatttattttacatttaaattcccacagcaaatagaaaataatttatcatacttgtacaattaactgaagaattgataataactgaatgtgaaacatcaataaagaccacttaatgcacgctttctattt tattgaactcttattaactgtaaaatgcatttttaaaagatcaaaaatgcatattttctagcatgattcatgtatcagtcagcagccaagcttctaaatgccagatattatattgagaatgtattatatga gaacgtacaatgcttaaagttccggttttcaaacttaggcaggtcatattctatctatcttatccagcgttactgtaggctagaaagtgataatggctttcataatcctgccttgtcttaggcactttcct gcag

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Strategy
Protein has function associated with mitochondrial location?
Protein has structure associated with mitochondrial location?
Assume that encoded protein is in mitochondria
– Use Gene finder to identify protein sequence(s)
– Use Similarity finder to identify possible function
– Use Feature finders to identify pertinent regions
– (What ARE pertinent regions?)

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through FGene
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
fgene Wed Feb 27 16:55:29 GMT 2002
>PEO-related_gene?
length of sequence
number of predicted exons - 5
positions of predicted exons:
w= ORF:
w= 9.13 ORF:
w= 6.08 ORF:
w= ORF:
w= 1.93 ORF:
Length of Coding region- 708bp
Amino acid sequence - 235aa
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
IIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSG
RKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV*

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through FGeneSH
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human genomic DNA
Time: Wed Feb 27 16:59:
Seq name: PEO-related_gene?
Length of sequence: 5768 GC content: 48 Zone: 2
Positions of predicted genes and exons:
G Str Feature Start End Score ORF Len
1 + TSS CDSf CDSi CDSi CDSl PolA
Predicted protein(s):
>FGENESH 1 4 exon (s) aa, chain +
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM
QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV
FGENE output w= w= w= w= w= 1.93
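The FGene and FGENESH proteins above differ in length (235 aa versus a longer FGENESH product), so it is worth checking exactly where they disagree. The following is a minimal sketch in Python, not part of the original slides; the function name `missing_segment` is hypothetical, and it assumes the shorter protein equals the longer one with a single contiguous stretch removed.

```python
# Predicted proteins copied from the FGene and FGENESH outputs above.
FGENE = (
    "MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR"
    "IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG"
    "IIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSG"
    "RKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV"
)
FGENESH = (
    "MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR"
    "IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG"
    "GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS"
    "VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM"
    "QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV"
)


def missing_segment(short, long):
    """Return (prefix_len, suffix_len, segment).

    `segment` is what remains of `long` after trimming the prefix and
    suffix it shares with `short`. Assumes len(short) <= len(long).
    """
    prefix = 0
    while prefix < len(short) and short[prefix] == long[prefix]:
        prefix += 1
    # Do not let the shared suffix run back into the shared prefix.
    max_suffix = len(short) - prefix
    suffix = 0
    while suffix < max_suffix and short[-1 - suffix] == long[-1 - suffix]:
        suffix += 1
    return prefix, suffix, long[prefix:len(long) - suffix]


prefix, suffix, segment = missing_segment(FGENE, FGENESH)
print(f"shared N-terminal residues: {prefix}")
print(f"shared C-terminal residues: {suffix}")
print(f"only in FGENESH ({len(segment)} aa): {segment}")
# Sanity check on the FGene report: coding bp = 3 * (aa + stop codon).
print(f"FGene coding length: {3 * (len(FGENE) + 1)} bp")
```

On these two sequences the sketch reports a 63-residue stretch, beginning GAAGATSLCF, that follows residue 120 and appears only in the FGENESH prediction. The 708 bp FGene coding region corresponds to 235 codons plus a stop codon.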

How to decide where exons are?
[Figure labels: DNA, P, Exon, Intron, Exon, Intron, Exon, hnRNA, mRNA, AAAAAAAA]
Strategy
– Compare sequence of 4q34 region to sequence of mRNA
– Sequence of mRNA may be in cDNA library or Expressed Sequence Tag (EST) library
Problems
– Library may not exist
– Expression of gene may be low
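The idea of locating exons by comparing genomic DNA with its mRNA can be illustrated with a toy example. This is a minimal sketch in Python under strong assumptions: the sequences are invented for illustration, matching is exact, and the function name `map_exons` is hypothetical. Real comparisons (for example BlastN against ESTs) tolerate mismatches and handle ambiguous splice junctions, which this sketch does not.

```python
def map_exons(genomic, cdna, min_block=4):
    """Greedily place consecutive cDNA blocks on the genomic sequence.

    Returns a list of (start, end) genomic coordinates (0-based, end
    exclusive), one per inferred exon. Exact matches only.
    """
    exons = []
    g_pos = 0  # search the genome only downstream of the previous exon
    c_pos = 0  # next unplaced cDNA position
    while c_pos < len(cdna):
        placed = False
        # Try the longest remaining cDNA prefix first, then shorten it.
        for length in range(len(cdna) - c_pos, min_block - 1, -1):
            start = genomic.find(cdna[c_pos:c_pos + length], g_pos)
            if start != -1:
                exons.append((start, start + length))
                g_pos = start + length
                c_pos += length
                placed = True
                break
        if not placed:
            raise ValueError(f"cDNA position {c_pos} cannot be placed")
    return exons


# Invented toy gene: three exons separated by two introns.
exon1, exon2, exon3 = "ATGGCC", "TTTGGA", "CCTTAA"
intron1, intron2 = "GTAAGTTTTCAG", "GTATGCCCCAG"
genomic = "GG" + exon1 + intron1 + exon2 + intron2 + exon3 + "TT"
cdna = exon1 + exon2 + exon3

for start, end in map_exons(genomic, cdna):
    print(start, end, genomic[start:end])
```

The gaps between consecutive blocks are the inferred introns. If no cDNA or EST exists for the gene, or its expression is too low to be sampled, there is nothing to map, which is why gene finders are still needed.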

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastN (vs. human ESTs)
Final Score Card for Gene Finders
MORAL: Trust, but verify.

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Strategy
Protein has function associated with mitochondrial location?
Protein has structure associated with mitochondrial location?
Assume that encoded protein is in mitochondria
– Use Gene finder to identify protein sequence(s)
– Use Similarity finder to identify possible function
– Use Feature finders to identify pertinent structures
– (What ARE pertinent structures?)

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human genomic DNA
Time: Wed Feb 27 16:59:
Seq name: PEO-related_gene?
Length of sequence: 5768 GC content: 48 Zone: 2
Positions of predicted genes and exons:
G Str Feature Start End Score ORF Len
1 + TSS CDSf CDSi CDSi CDSl PolA
Predicted protein(s):
>FGENESH 1 4 exon (s) aa, chain +
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM
QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV

Surrogate Filters Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
Summary
– One protein in region
– Contains mitochondrial carrier motifs
– Similar to ATP/ADP transporter
– Mitochondrial signal sequence?
– Reasonable candidate for PEO-related protein
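The claim that the protein contains mitochondrial carrier motifs can be spot-checked with a simple pattern scan. This is a minimal sketch in Python; the pattern P-x-[DE]-x-x-[KR] is a simplified form of the mitochondrial carrier signature and is an assumption introduced here, not something stated on the slides. Dedicated motif databases use stricter patterns and profiles. The function name `carrier_motifs` is hypothetical.

```python
import re

# Protein predicted by FGENESH for the 4q34 region (from the output above).
PROTEIN = (
    "MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR"
    "IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG"
    "GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS"
    "VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM"
    "QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV"
)

# Simplified carrier signature (assumption): P, any, D or E, any, any, K or R.
CARRIER_MOTIF = re.compile(r"P.[DE]..[KR]")


def carrier_motifs(protein):
    """Return (0-based start, matched residues) for each motif hit."""
    return [(m.start(), m.group()) for m in CARRIER_MOTIF.finditer(protein)]


for start, hit in carrier_motifs(PROTEIN):
    print(f"residue {start + 1}: {hit}")
```

The scan finds three hits (PIERVK, PLDFAR and PFDTVR), spaced roughly 100 residues apart. A six-residue pattern this loose will also match unrelated proteins by chance, so a hit is supporting evidence rather than proof.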

Complex gene discovery
Your turn: Repeat and extend characterization of PEO-related gene
1. Take the same sequence (FastA format) e-mailed to you
2. Get a better estimate of the promoter and polyA site (e.g. by TSSW and PolyASH) (Is there a TATA box upstream from the predicted promoter?)
3. Find the encoded protein sequence by a suitable method (e.g. FGeneSH(GC) or comparison with cDNA)
4. Continue characterization of the protein
* Contains signal sequence?
* Contains transmembrane domains?
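For the TATA box question in step 2, a quick scan can complement the promoter predictors. This is a minimal sketch in Python; the consensus TATA[AT]A[AT] is an assumption introduced here, the function name `find_tata_boxes` and its `tss` and `window` parameters are hypothetical, and the snippet is a short stretch taken from the 4q34 sequence shown earlier. Supply the transcription start site your own promoter prediction reports.

```python
import re

# Lookahead so that overlapping candidate sites are all reported.
TATA = re.compile(r"(?=(tata[at]a[at]))", re.IGNORECASE)


def find_tata_boxes(seq, tss=None, window=50):
    """Return (0-based start, motif) for TATA-like sites in `seq`.

    If `tss` (predicted transcription start, 0-based) is given, keep only
    sites that start within `window` bases upstream of it.
    """
    hits = [(m.start(), m.group(1)) for m in TATA.finditer(seq)]
    if tss is not None:
        hits = [h for h in hits if tss - window <= h[0] < tss]
    return hits


# Short stretch of the 4q34 region.
snippet = "gctgcctgtgctataaatacgcggcccacatgccgcgg"
print(find_tata_boxes(snippet))

# With a hypothetical TSS at position 35 of the snippet:
print(find_tata_boxes(snippet, tss=35, window=30))
```

A match by itself does not establish a functional promoter. Compare its position with the TSS reported by the prediction tool before drawing conclusions.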

Filter limitation Inevitable… but whose filter?

Filters controlled by outside programmers

Filters controlled by you