BIOS816/VBMS818 Lecture 7 – Gene Prediction Guoqing Lu Office: E115 Beadle Center Tel: (402) Website:
Genes Protein-coding genes –ORF –Regulatory signals (depend on organism) RNA genes –rRNA –tRNA –snRNA, others…
Prokaryotic Gene Expression [Figure: a promoter, cistrons 1 to N, and a terminator are transcribed by RNA polymerase into one mRNA (5’ to 3’); ribosomes, tRNAs, and protein factors translate it into polypeptides 1 to N, each drawn N-terminus to C-terminus]
Eukaryotic Gene Expression [Figure: enhancer, promoter, transcribed region (exon 1, intron 1, exon 2), and terminator; transcription by RNA polymerase II gives a primary transcript (5’ to 3’) that is capped (7mG), spliced, cleaved/polyadenylated (An), and transported before translation into a polypeptide (N to C)]
Gene Finding Comparative –Compare your sequence to what is already known – BLASTN, BLASTX Predictive: Stitch together a consensus –HMM, GRAIL… –Frames, Testcode –Findpatterns … Empirical approach –cDNA OR protein OR genetic evidence
ORF Characteristics Primary characteristics –Start codon (ATG) –Stop codon (TAA, TAG, TGA) Secondary characteristics –Codon bias –Biased nucleotide distribution
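The primary characteristics above are enough for a naive six-frame ORF scan. The following is a minimal sketch, not the method used by Frames, Vector NTI, or ORF Finder; the function names and the `min_codons` cutoff are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=3):
    """Find ATG...stop ORFs in all six frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned (for "-" this
    is the reverse complement). The stop codon is included in the ORF.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))  # [('+', 3, 2, 14)]
```

Real tools usually add a much larger minimum length and may allow alternative start codons; both are left out here to keep the sketch short.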
ORF finding tools GCG –Frames, Map VectorNTI –ORF WWW tools –ORF Finder (NCBI) –…
Vector NTI - ORF ORFs of the lac operon GI:
Statistical analysis as a means to find genes ORF example Codon Bias Fickett’s Statistic
Codon Bias The genetic code is degenerate. Codon usage varies –organism to organism –gene to gene. High bias correlates with high-level expression. Bias correlates with tRNA isoacceptors. Change the bias or the tRNAs, and expression changes.
Codon Bias: Gene Differences [Table: usage of the glycine codons GGG, GGA, GGT, and GGC in GAL4 versus ADH1; GCG: CodonFrequency]
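A comparison like the one on this slide amounts to counting how often each glycine codon is used within its synonymous family. The sketch below shows that count for one coding sequence; it is not GCG CodonFrequency, and the function name and the short example sequence are made up for illustration.

```python
from collections import Counter

GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")

def family_usage(cds, family=GLY_CODONS):
    """Fraction of each codon within one synonymous family, read in frame 1."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in family)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in family}

# Hypothetical toy CDS: ATG GGT GGT GGC GGA TAA
print(family_usage("ATGGGTGGTGGCGGATAA"))
# {'GGG': 0.0, 'GGA': 0.25, 'GGT': 0.5, 'GGC': 0.25}
```

Running the same function on two genes (or on the pooled genes of two organisms) gives the kind of side-by-side table the slide describes.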
Codon Bias Organism Differences PcMl
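Codon Bias Calculation
$$\mathrm{Pref} = \frac{f_{\text{codon}} / F_{\text{family}}}{r_{\text{codon}} / R_{\text{family}}}$$
where $f_{\text{codon}}$ is the codon frequency and $F_{\text{family}}$ the frequency of its synonymous family, and $r_{\text{codon}}$ and $R_{\text{family}}$ are the same two frequencies in random sequence. Bias > 1 in the CORRECT frame; bias < 1 in an incorrect frame.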
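The preference ratio can be computed directly from a codon usage table. This is a minimal sketch under two assumptions that are not stated on the slide: the "random" frequency of a codon is taken as the product of its three base frequencies, and a family with no usage data is given a neutral preference of 1. The `usage` counts in the example are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_preference(codon, usage, base_freq):
    """(codon freq / family freq) divided by the same ratio in random sequence."""
    family = [c for c, aa in CODON_TABLE.items() if aa == CODON_TABLE[codon]]

    def random_freq(c):
        # Assumption: random sequence with independent bases.
        return base_freq[c[0]] * base_freq[c[1]] * base_freq[c[2]]

    family_total = sum(usage.get(c, 0) for c in family)
    if family_total == 0:
        return 1.0                      # assumption: no data means no preference
    observed = usage.get(codon, 0) / family_total
    expected = random_freq(codon) / sum(random_freq(c) for c in family)
    return observed / expected

uniform = {b: 0.25 for b in "ACGT"}
usage = {"GGT": 6, "GGC": 2}            # hypothetical glycine codon counts
print(codon_preference("GGT", usage, uniform))  # 3.0
print(codon_preference("GGC", usage, uniform))  # 1.0
```

Averaging the logarithm of these preferences over a sliding window in each reading frame gives a per-frame score; windows in the correct frame of a biased gene should then score above 1, as the slide states.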
Codon-Biased Genes: Ribosomal Protein S2 (rpsB) and EF-Ts (tsf) [Figure: codon bias plot with Frame 2 and Frame 3 panels; rpsB and tsf marked]
Fickett’s Statistic [Figure: rpsB, tsf] –Analyzes the local nonrandomness at every third base in the sequence in a frame-independent fashion –Does not use codon frequency statistics
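The "nonrandomness at every third base" can be illustrated with the position asymmetry part of Fickett's TESTCODE: for each base, count its occurrences at sequence positions 1, 4, 7, ..., at 2, 5, 8, ..., and at 3, 6, 9, ..., then divide the largest count by the smallest count plus one. The sketch below computes only that ratio. The full statistic also uses base content and published probability and weight tables, which are not reproduced here; the function name is an illustrative choice.

```python
def position_asymmetry(seq):
    """Per-base position asymmetry: max(count) / (min(count) + 1) over the
    three period-3 positions. Taking max and min over all three positions
    makes the value independent of where the reading frame starts."""
    seq = seq.upper()
    asymmetry = {}
    for base in "ACGT":
        counts = [sum(1 for ch in seq[offset::3] if ch == base)
                  for offset in range(3)]
        asymmetry[base] = max(counts) / (min(counts) + 1)
    return asymmetry

# A perfectly periodic toy sequence gives strong asymmetry for A, T, and G.
print(position_asymmetry("ATGATGATGATG"))
# {'A': 4.0, 'C': 0.0, 'G': 4.0, 'T': 4.0}
```

Coding regions tend to show higher asymmetry than noncoding sequence, and no codon usage table is needed to compute it.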
Error-Rich DNA [Figure: Fickett’s statistic on the normal sequence versus a sequence corrupted with 1% substitutions and 2 indels]
ORF Found, Now What?
ORFs are the biggest target and the easiest to find
Find promoter elements
–Should be upstream of the 5’-most ORF (remember, one promoter can regulate expression of multiple cistrons)
–May have ambiguous sequence
Find ribosome binding site(s) and start codon(s)
–One within each ORF (cistron), near the 5’ end
–RBS is close to (~5-10 nt) and upstream of the start codon
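The spacing rule for the RBS and start codon can be turned into a simple scan. The sketch below assumes the E. coli Shine-Dalgarno consensus AGGAGG as the RBS motif, which the slide does not specify; real sites often match it only partially, so exact matching is a simplification. The function name and default parameters are illustrative.

```python
import re

def find_rbs_start_pairs(seq, motif="AGGAGG", min_gap=5, max_gap=10):
    """Return (rbs_start, atg_start, gap) for every exact motif match that has
    an ATG beginning min_gap..max_gap nt downstream of the motif's 3' end."""
    seq = seq.upper()
    hits = []
    # The lookahead lets overlapping motif occurrences be reported too.
    for match in re.finditer(f"(?={motif})", seq):
        rbs_end = match.start() + len(motif)
        for gap in range(min_gap, max_gap + 1):
            atg = rbs_end + gap
            if seq[atg:atg + 3] == "ATG":
                hits.append((match.start(), atg, gap))
    return hits

# Toy sequence: motif at index 2, seven spacer bases, then ATG at index 15.
toy = "TT" + "AGGAGG" + "T" * 7 + "ATGAAA"
print(find_rbs_start_pairs(toy))  # [(2, 15, 7)]
```

Combining these hits with the ORF coordinates from an ORF scan helps choose among several candidate ATGs at the 5’ end of a cistron.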
ORF Found, Now What? –More complex signals/regulatory elements –More genes –Combinatorial regulation common –Introns/exons
Eukaryotic Gene Complexity Yeast –introns rare –promoters adjacent –genome dense
Eukaryotes, cont’d “higher” Eukaryotes –introns common, LONGER than exons –Promoter/enhancer –genome sparse Fungi –introns common, short relative to exons –promoter/enhancer –genome dense
Fungi and “higher” eukaryotes Sew together exons –ORF regions –consensus sequences –domain/polypeptide matches
Exon/Intron Structure Unspliced: CCACATT gt n(30-10,000) a n(5-20) ag CAGAA... Spliced: CCACATTCAGAA... Protein: Pro His Ser Glu...
Alternative Splice Unspliced: CCACATT gt n(30-10,000) a n(5-20) agcag AA... Spliced: CCACATTAA... Protein: Pro His STOP
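The two splice outcomes on these slides can be reproduced in a few lines. In this sketch the intron interior (the n(...) stretches) is filled with arbitrary placeholder bases, and the function names are illustrative; only the exon sequences and the gt...ag boundaries come from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def splice(unspliced, intron_start, intron_end):
    """Remove unspliced[intron_start:intron_end] after checking the GT...AG rule."""
    intron = unspliced[intron_start:intron_end].upper()
    if not (intron.startswith("GT") and intron.endswith("AG")):
        raise ValueError("intron does not follow the GT...AG rule")
    return (unspliced[:intron_start] + unspliced[intron_end:]).upper()

def translate(seq):
    """Translate in frame 1, stopping after the first stop codon ('*')."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

exon1 = "CCACATT"
intron = "gt" + "c" * 30 + "a" + "t" * 10 + "ag"   # placeholder interior bases
unspliced = exon1 + intron + "cagAA"

first_ag = len(exon1) + len(intron)    # intron ends at the first "ag"
second_ag = first_ag + 3               # intron ends at the "ag" of "cag"

print(translate(splice(unspliced, len(exon1), first_ag)))   # PHSE
print(translate(splice(unspliced, len(exon1), second_ag)))  # PH*
```

Shifting the acceptor site by three bases changes the codon that spans the exon junction from TCA (Ser) to TAA (stop), which is why the alternative splice truncates the protein.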
How do we know what sequences to look for? Promoter sites Intron/Exon Transcription Termination/PolyA Translation initiation
Finding Functional Sequences Known Consensus Sequences Consensus Sequence Generation –Position Weight Matrices –Sequence Logos –Hidden Markov Models Functional Tests
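A position weight matrix is the simplest of the consensus models listed above. The sketch below builds a log-odds matrix from a handful of aligned sites and slides it along a sequence. The four example sites are hypothetical variants of the bacterial -10 promoter consensus TATAAT, and the pseudocount and uniform background are assumptions chosen for simplicity.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, candidate):
    """Sum of per-position log-odds for one candidate of the matrix width."""
    return sum(column[base] for column, base in zip(pwm, candidate))

def best_match(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))

sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]   # hypothetical aligned sites
pwm = build_pwm(sites)
print(best_match(pwm, "CCCCTATAATCCCC")[1])        # 4
```

The pseudocount keeps unseen bases from receiving a score of minus infinity; sequence logos and profile HMMs extend the same column-by-column idea.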
Gene-Finding Tools (WWW) GRAIL II: integrated gene parsing GenLang GENIE HMMgene GENSCAN GeneMark
GLIMMER for gene-finding in bacteria (
YOU are the best universal gene finder… You understand the “rules” –ORF, Promoter, RBS –Organism specific You understand relationships/sequences –5’ to 3’ You are a good sequence finder –search patterns You can resolve ambiguities EXPERIENCE
Exercise
ORF analysis using Vector NTI:
Open Vector NTI
Retrieve the E. coli lac operon sequence
–Find Tools -> Open Link -> GID in the molecular display window
–Type the GenBank ID into the “Genbank ID required” window
Do ORF analysis
–Find Analysis -> ORF in the molecular display window
–Use the default Start & Stop settings
Present a figure showing your ORF analysis result, and report the start and stop positions and lengths of the ORFs.
Exercise (cont’d)
ORF analysis using GeneMark
Go to the GeneMark web site: ark24.cgi
Paste in the lac operon sequence
Choose E. coli as the organism
Report the start and stop positions and lengths of the predicted ORFs, and compare them to those found with the Vector NTI ORF analysis.
Assignment #2 Download from Blackboard –Go to “Assignment” page –Open “Assignment #2” –Download the file “Assignment1” Submit to Blackboard –Go to “Assignment” page –Open “Assignment #2” – Submit your answer through Tools->Digital Drop Box Assignment #2 – due March 12