1
Gene Structure and Identification
Genes and Genomes; ORFs and more; Consensus Sequences; Gene Finding
2
GATC to Gene
Cells recognize genes from DNA sequence. Can we??
3
Genes
Protein-coding genes
RNA genes: rRNA, tRNA, snRNA, snoRNA…
4
Protein Coding Genes
ORF: long (usually >100 aa); “known” proteins likely
Regulatory signals: depend on organism; prokaryotes vs. eukaryotes; vertebrates vs. fungi
E.g. yeast: ~1% of genes have ORFs <100 aa
5
Infer Gene Structure ???
ORF = protein
Promoter: strength, regulation
mRNA: splicing, stability
6
Genomes: Gene Content
E. coli: 4000 genes × 1 kbp/gene = 4 Mbp
Genome = 4 Mbp!
7
Genomes: Gene Content
Human: 100,000 genes × 2 kbp = 200 Mbp
Introns = 600 Mbp?
2200 Mbp = ???
8
Prokaryotic Gene Expression
Promoter, Cistron 1, Cistron 2, … Cistron N, Terminator
Transcription (RNA polymerase): one mRNA, 5’ to 3’, carrying cistrons 1, 2, … N
Translation (ribosome, tRNAs, protein factors): separate polypeptides (1, 2, 3 …), each with its own N and C terminus
9
ORF Characteristics
No STOPS!
Codon bias
Biased nucleotide distribution
Periodicity of 3
Dicodon frequency
10
ORFs
$P(\mathrm{ORF}) = (61/64)^{n}$
$P(20) = (61/64)^{20} = 0.38$
$P(100) = 0.008$
$P(200) = 10^{-4}$
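The numbers on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes all 64 codons are equally likely, so that 61 of 64 are non-stop codons; the function name `p_open` is only illustrative.

```python
def p_open(n_codons: int) -> float:
    """Chance that n random codons contain no stop codon.

    Assumes every one of the 64 codons is equally likely,
    so 61/64 of them are sense codons.
    """
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    for n in (20, 100, 200):
        print(f"P({n}) = {p_open(n):.2g}")
```

Real genomes are not uniform in base composition, so the actual chance of a long random open stretch depends on G+C content.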
11
ORF finding tools
Translate/Map
Frames: graphical 6 frames; UNIX graphics problem (see WWW)
Testcode
CodonPreference
WWW tools: ORF Finder (NCBI), BCM Search Launcher...
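What a six-frame scan does can be sketched in plain Python. This is not the GCG Frames program or NCBI ORF Finder; it is a simplified stand-in that reports stop-free stretches in all six frames, and the names `open_stretches` and `min_codons` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def open_stretches(seq: str, min_codons: int = 100):
    """Yield (strand, frame, start, end, n_codons) for stop-free runs.

    Coordinates are 0-based, end-exclusive, and measured on the
    strand being read (the reverse complement for strand "-").
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOPS:
                    n = (pos - run_start) // 3
                    if n >= min_codons:
                        yield strand, frame, run_start, pos, n
                    run_start = pos + 3  # next run begins after the stop
                pos += 3
            # Report a run that reaches the end of the sequence.
            n = (pos - run_start) // 3
            if n >= min_codons:
                yield strand, frame, run_start, pos, n
```

The sketch does not require an ATG at the start of a run; a stricter ORF definition would trim each run to its first ATG.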
12
Codon Bias
Genetic code degenerate
Codon usage varies: organism to organism; gene to gene
High bias correlates with high-level expression
Bias correlates with tRNA isoacceptors
Change bias or tRNAs, change expression
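Codon usage for a single gene can be tabulated as the fraction each codon contributes among its synonymous codons. The sketch below assumes the standard genetic code and an in-frame ORF; `codon_fractions` is an illustrative name, not a function from GCG or any other package.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_fractions(orf: str) -> dict:
    """Return {codon: share of that codon among its synonymous codons}."""
    orf = orf.upper()
    counts = Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        if codon in CODON_TABLE:  # skip codons with ambiguous bases
            per_amino_acid[CODON_TABLE[codon]] += n
    return {
        codon: n / per_amino_acid[CODON_TABLE[codon]]
        for codon, n in counts.items()
        if codon in CODON_TABLE
    }
```

For example, `codon_fractions("CTGCTGCTGCTA")` reports that CTG supplies three quarters of the leucine codons in that toy sequence.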
13
Codon Bias
14
Codon Bias Gene Differences
15
Codon Bias Organism Differences
Micrococcus luteus Pneumocystis carinii
16
Codon Bias Organism Differences
Pc Ml
17
Nucleotide Bias
Coding DNA vs non-coding DNA
Often G+C content higher than bulk
Empirical statistics (Fickett’s TESTCODE)
Useful: ORF matches “typical” organism, bias; ORF obscured by STOP codons; DNA sequence errors?
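One simple way to see nucleotide bias is to measure G+C content separately at each of the three codon positions. The sketch below is a much simpler statistic than Fickett's TESTCODE and is not a reimplementation of it; `gc_by_codon_position` is a hypothetical helper.

```python
def gc_by_codon_position(seq: str, frame: int = 0) -> list:
    """G+C fraction at codon positions 1, 2 and 3 for a given frame.

    Coding sequence often shows different values at the three
    positions; non-coding sequence tends to look the same at all three.
    """
    seq = seq.upper()[frame:]
    fractions = []
    for k in range(3):
        column = seq[k::3]  # every third base, starting at position k
        gc = sum(base in "GC" for base in column)
        fractions.append(gc / len(column) if column else 0.0)
    return fractions
```

Because the statistic does not depend on stop codons, it can still flag a likely coding region when a sequencing error has interrupted the ORF.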
18
Complex Genome DNA
~10% highly repetitive (300 Mbp): NOT GENES
~25% moderate repetitive (750 Mbp): some genes
~25% exons and introns (800 Mbp)
40% = ? Regulatory regions, intergenic regions
19
Bacterial Promoter
-35: T(82) T(84) G(78) A(65) C(54) A(45) … (16-18 bp) … -10: T(80) A(95) T(45) A(60) A(50) T(96) … (A,G)
(numbers in parentheses: percent conservation of the base at that position)
Alternate sigma factors: CCCTTGAA….CCCGATNT
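The consensus above reads TTGACA at -35 and TATAAT at -10, separated by 16 to 18 bp. A minimal scan for that arrangement might look like the following; it is a sketch in the spirit of a pattern search, not the GCG Findpatterns program, and the `min_matches` threshold is an arbitrary assumption.

```python
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


def count_matches(site: str, consensus: str) -> int:
    return sum(a == b for a, b in zip(site, consensus))


def scan_promoters(seq: str, min_matches: int = 9, spacers=(16, 17, 18)):
    """Return (start_of_-35_box, spacer_length, matches_out_of_12) hits."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        box35 = seq[i:i + 6]
        for gap in spacers:
            j = i + 6 + gap
            box10 = seq[j:j + 6]
            if len(box10) < 6:
                continue  # -10 box would run off the end of the sequence
            score = count_matches(box35, MINUS_35) + count_matches(box10, MINUS_10)
            if score >= min_matches:
                hits.append((i, gap, score))
    return hits
```

Counting matches treats every position equally, although the percentages show that some positions are far better conserved than others; a position weight matrix would use that information.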
20
Terminators
Rho-independent: stem/loop (structural only); 3’-U tail
Rho-dependent: C-rich, G-poor; “loose” consensus
21
Translation Ribosome Binding Site, Shine-Dalgarno Site
nnGGAGGnnnnnATG… typical E. coli nnaaAGGnnnnnATG
22
Operon Structure Promoter?
23
GCG Tools
Frames
Testcode
Findpatterns (bacterial promoters)
Setplot
Options:
Frames –all myseq.seq output.png
FTP output.png
View output.png
24
.ps
25
Eukaryotic Gene Expression
Enhancer, Promoter, Transcribed Region, Terminator
Transcription (RNA Polymerase II): primary transcript, 5’ Exon1, Intron1, Exon2 3’
Cap, splice, cleave/polyadenylate: mRNA with 7mG cap and An tail
Transport, then translation: polypeptide, N to C
26
Eukaryotic Gene Complexity
Yeast: introns rare; promoters adjacent; genome dense
27
Eukaryotes, cont’d
Fungi: introns common, short relative to exons; promoter/enhancer; genome dense
“Large” eukaryotes: introns common, LONGER than exons; promoter/enhancer; genome sparse
28
Intron Prevalence
29
Intron Size
30
Exon Size
31
Yeast
ORFs = genes!
Small ORFs (RNA genes)
Regulatory sequences
32
Fungi
Sew together exons: ORF regions; consensus sequences; domain/polypeptide matches
33
Exon/Intron Structure
CCACATTgtn(30-10,000)an(5-20)agCAGAA
...CCACATTCAGAA...
...ProHisSerGlu...
34
Alternative Splice
CCACATTgtn(30-10,000)an(5-20)agcagAA
...CCACATTAA...
...ProHisSTOP
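The two splice outcomes can be reproduced on a toy pre-mRNA. In this sketch the intron filler bases (30 c's and 10 t's) are made-up placeholders within the ranges shown on the slide, and `splice` and `translate` are illustrative helpers that assume the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def splice(pre_mrna: str, donor: int, acceptor: int) -> str:
    """Remove the intron pre_mrna[donor:acceptor] (0-based, end-exclusive)."""
    intron = pre_mrna[donor:acceptor].lower()
    if not (intron.startswith("gt") and intron.endswith("ag")):
        raise ValueError("intron does not follow the gt...ag rule")
    return (pre_mrna[:donor] + pre_mrna[acceptor:]).upper()


def translate(mrna: str) -> str:
    """Translate complete codons; '*' marks a stop and ends translation."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


# Toy transcript: exon 1, intron (gt ... a ... ag), exon 2.
INTRON = "gt" + "c" * 30 + "a" + "t" * 10 + "ag"
PRE_MRNA = "CCACATT" + INTRON + "CAGAA"
DONOR = len("CCACATT")
ACCEPTOR = DONOR + len(INTRON)

normal = splice(PRE_MRNA, DONOR, ACCEPTOR)           # uses the first ag
alternative = splice(PRE_MRNA, DONOR, ACCEPTOR + 3)  # uses the ag in "cag"
```

Moving the acceptor three bases downstream keeps the intron within the gt...ag rule but changes the reading frame of the second exon, which is why the alternative product ends in a stop codon.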
35
Consensus Sequences
Promoter sites
Intron/Exon
Transcription Termination/PolyA
Translation initiation
Position Weight Matrices
36
Finding Functional Sequences
Known Consensus Sequences Consensus Sequence Generation Functional Tests
37
Consensus Inference
Position Weight Matrices
Sequence Logos
Hidden Markov Models
ProfileScan
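A position weight matrix can be built from a set of aligned sites and then used to score candidate windows. The sketch below uses log-odds scores against a uniform background with a pseudocount; both of those choices, and the function names, are assumptions made for illustration.

```python
import math


def build_pwm(sites, pseudocount: float = 1.0, background: float = 0.25):
    """Log2-odds weight matrix from equal-length, aligned ACGT sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window: str) -> float:
    """Sum of per-position weights; higher means closer to the consensus."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, seq: str):
    """Return (score, start) of the best-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1))
```

A matrix like this scores each position independently, so it cannot represent correlated positions; that limitation is one reason hidden Markov models appear on the same list.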
38
Translation Initiation Sites
C G T C A T G G
39
Functional Assay
CCATGG 100
CCCTGG 0
CCTTGG 5
CCATAG 0
CTATGG 90
CCATGA 85
Conservation
Correlated Positions
40
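Splicing Consensus
Vert: A(64) G(73) G T A(62) A(68) G(84) T(63) … Y(80) N Y(80) Y(87) R(75) A Y(95) … C(65) A G N N
(numbers in parentheses: percent conservation of the base at that position)
Fungi: GTRNGT(N){ } CTRAC(N){5-15}YAG
Alternate Splicing!??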
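The fungal pattern can be turned into a regular expression by expanding the IUPAC codes (R = A or G, Y = C or T, N = any base). The first spacer range is missing on the slide, so this sketch leaves it as a required argument rather than guessing; `fungal_intron_regex` and its parameters are hypothetical names.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> str:
    return "".join(IUPAC[symbol] for symbol in pattern)


def fungal_intron_regex(min_gap: int, max_gap: int):
    """Compile GTRNGT (N){min_gap,max_gap} CTRAC (N){5,15} YAG.

    min_gap and max_gap are supplied by the caller because the slide
    does not give the first spacer range.
    """
    return re.compile(
        iupac_to_regex("GTRNGT")
        + f"[ACGT]{{{min_gap},{max_gap}}}?"
        + iupac_to_regex("CTRAC")
        + "[ACGT]{5,15}?"
        + iupac_to_regex("YAG")
    )
```

Both spacers are matched lazily, so the expression reports the shortest intron consistent with the pattern from a given start; alternative acceptor sites further downstream are not reported.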
41
Linguistic Approach
Non-repetitive DNA!!
Long ORF
Similar to known protein
ORF extended by “reasonable” splices
ORF begins with “good” ATG
Promoter/terminator flanks
Looks like a duck...
42
Protein Database Matches
Great for the “known” What about the unknown???
43
Codon Bias-useful? High bias = high confidence
Low bias = low confidence Sensitive to indel
44
Tools-GCG Most USEFUL Frames Testcode FindPatterns Map/Translate
45
Tools-WWW
GRAIL II: integrated gene parsing
GenLang
GENIE
HMMGene (lock ESTs, etc.)
GENSCAN
GENEMARK
HMM Probabilities
46
Hidden Markov Models
Probabilistic models
Applicable to linear sequences
P(all states) = 1; infer probabilities of all states from the observed sequence (hidden states are unobserved)
Work best when local correlations are unimportant
Applications: genefinding, phylogeny, secondary structure, genetic mapping
Work best with a “training set”
Quantitative probabilities
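Inferring hidden states from an observed sequence can be illustrated with the Viterbi algorithm on a two-state model. Everything numeric below is an invented toy parameter set (a non-coding state "N" that is A+T rich and a coding state "C" that is G+C rich), not values from any real gene finder or training set.

```python
import math

# Hypothetical toy parameters, chosen only for illustration.
STATES = ("N", "C")
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {
    "N": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},
    "C": {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for a non-empty observed sequence."""
    # Work in log space so long sequences do not underflow.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back_pointers = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        back_pointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy numbers, `viterbi("A" * 10 + "G" * 10, STATES, START_P, TRANS_P, EMIT_P)` labels the first half "N" and the second half "C". In practice the probabilities are estimated from a training set of annotated sequence.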
47
Accuracy Assessment
PP = predicted coding; PN = predicted non-coding
AP = “real” positives; AN = “real” negatives
TP = number correct positive; TN = number correct negative
FP = number false positive; FN = number false negative
Sn = TP/AP; Sp = TP/PP
$$AC = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TP}{TP+FP} + \frac{TN}{TN+FP} + \frac{TN}{TN+FN}\right) - 1$$
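These measures are straightforward to compute from the four counts. The sketch below follows the definitions on this slide, with AP = TP + FN and PP = TP + FP; the function name `accuracy` and the example counts in the note are made up for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int):
    """Return (Sn, Sp, AC) as defined on the slide.

    Sn = TP / AP with AP = TP + FN
    Sp = TP / PP with PP = TP + FP
    AC = (sum of the four conditional proportions) / 2 - 1
    Assumes none of the four denominators is zero.
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    ac = (tp / (tp + fn) + tp / (tp + fp)
          + tn / (tn + fp) + tn / (tn + fn)) / 2 - 1
    return sn, sp, ac
```

A perfect prediction gives AC = 1. With hypothetical counts of 80 true positives, 900 true negatives, 20 false positives and 20 false negatives, Sn and Sp are both 0.8 and AC is about 0.78.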
48
Accuracy Levels DNA Sequence Error Rate!??
49
NEXT: Regulatory Sequences
Known Consensus Sequences
Consensus Sequence Generation
Functional (Lab) Data
Real examples
50
Gene Regulatory Sequences
Functional sites Consensus Experimental tests
51
Regulatory Sites
Transcript initiation
mRNA processing
Translation sites
52
Transcript Initiation
Basal Promoters
Enhancers/Silencers/Regulatory Sites
Boundary elements?
Transcription initiation: prokaryotes vs eukaryotes; organism-to-organism
53
mRNA processing
Exon/Intron
Alternate splicing
Polyadenylation/Cleavage
Stability
54
Translation
Initiation site
(Frameshifting)
Translational regulatory elements: upstream ORFs; translational enhancers
55
Regulatory Factors
lacI, trpR, CAP, araC….
GAL4, NDT80…
Known from experiment
Infer from genome?
Infer from expression data?
56
EUKARYOTES More complex signals More genes More dispersed signals
Combinatoric regulation common
57
Basal Promoter Analysis
Myers and Maniatis, Genes VI, 831
TATA: ATATAA, -30, TBP
CAAT: GGCCAATC, -75, CTF/NF1
GC: GCCACACCC, -90, SP1
+1: transcription start
58
Enhancer Elements
Octamer: OCT1, OCT2
κB: NF-κB
ATF: ATF
AP1: AP1
……..
False +, False -
59
Poly A sites Metazoans AATAAA Yeast-different
60
Translation Sites
Initiate at 5’-ATG
upstream ORF…regulatory
(Frameshifting)
Translation enhancers….
61
Consensus Sequence Databases
WWW-based TFD (transcription factor database) BCM Search launcher
62
Practical Gene Finding
Use ALL tools
Comparative: BLASTN, BLASTX
Predictive: stitch together a consensus; HMM, GRAIL…; Frames, Testcode; Findpatterns (and WWW pattern searches)
cDNA OR protein OR genetic evidence
Most genefinding starts with mRNA; that’s not where the cell actually starts!
63
DATABASE SEARCH
www.ncbi.nlm.nih.gov
BLASTN: DNA:DNA comparison (ALWAYS!); not sensitive (DNA conservation low)
BLASTX: 6-frame ORFs vs. polypeptide database
TBLASTX: 6 frames vs. 6 frames of a DNA database
64
FRAMES-aldolase gene
65
If aldolase is so tough, how do you really do it?
Combine DNA sequence with other data!
66
Genome-cDNA
DNA sequencing
Align (GAP) cDNA
Infer Promoter, Enhancer
Test in cis