Gene Structure and Identification

Presentation on theme: "Gene Structure and Identification"— Presentation transcript:

1 Gene Structure and Identification
Genes and Genomes ORFs and more Consensus Sequences Gene Finding

2 GATC to Gene
Cells recognize genes from DNA sequence. Can we??

3 Genes
Protein coding; RNA genes: rRNA, tRNA, snRNA, snoRNA…

4 Protein Coding Genes
ORF: long (usually >100 aa); “known” proteins likely
Regulatory signals: depend on organism (prokaryotes vs eukaryotes; e.g. vertebrates vs fungi)
Yeast: ~1% of genes have ORFs <100 aa

5 Infer Gene Structure ???
ORF = protein
mRNA: splicing, stability
Promoter: strength, regulation

6 Genomes: Gene Content
E. coli: 4000 genes × 1 kbp/gene = 4 Mbp
Genome = 4 Mbp!

7 Genomes: Gene Content
Human: 100,000 genes × 2 kbp = 200 Mbp
Introns = 600 Mbp?
2200 Mbp = ???
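
The arithmetic on these two slides can be checked in a few lines. This is only a sketch using the slide's own round numbers; the function name `coding_mbp` is made up for illustration, and the 3000 Mbp total is simply the sum of the three human figures quoted above.

```python
def coding_mbp(genes: int, kbp_per_gene: float) -> float:
    """Total gene content in Mbp (1 Mbp = 1000 kbp)."""
    return genes * kbp_per_gene / 1000.0

ecoli = coding_mbp(4000, 1)       # E. coli: genes account for the whole 4 Mbp genome
human = coding_mbp(100_000, 2)    # human: the slide's gene count and gene size
introns = 600.0                   # Mbp, the slide's rough guess
unexplained = 2200.0              # Mbp, the slide's "???"

total = human + introns + unexplained
print(f"E. coli genes: {ecoli} Mbp")
print(f"Human genes: {human} Mbp of {total} Mbp ({human / total:.1%})")
```

With these figures, genes fill essentially all of the E. coli genome but well under a tenth of the human total.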

8 Prokaryotic Gene Expression
[Figure: promoter, cistron 1, cistron 2 … cistron N, terminator. Transcription (RNA polymerase) gives one mRNA, 5’ to 3’, carrying cistrons 1 to N. Translation (ribosome, tRNAs, protein factors) gives separate polypeptides, each N to C.]

9 ORF Characteristics
No STOPS!; codon bias; biased nucleotide distribution; periodicity of 3; dicodon frequency

10 ORFs
$P(\mathrm{ORF}) = (61/64)^{n}$
$P(20) = (61/64)^{20} = 0.38$; $P(100) = 0.008$; $P(200) \approx 10^{-4}$
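
The numbers on this slide follow from assuming that each of the 64 codons is equally likely and that 3 of them are stops. A minimal sketch under that assumption (the function name `p_open` is illustrative, not from the slides):

```python
def p_open(n_codons: int, stops: int = 3, codons: int = 64) -> float:
    """Chance that n random codons contain no stop codon,
    assuming all codons are equally likely (uniform base composition)."""
    return ((codons - stops) / codons) ** n_codons

for n in (20, 100, 200):
    print(n, f"{p_open(n):.2g}")
```

Real genomes are not uniform in base composition, so an A+T-rich genome (stop codons are A+T-rich) will show fewer long random ORFs than this estimate and a G+C-rich genome more.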

11 ORF finding tools
Translate/Map
Frames: graphical 6 frames; UNIX graphics problem (see WWW)
Testcode
CodonPreference
WWW tools: ORF Finder (NCBI), BCM Search Launcher...
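
What a six-frame display is looking for can be sketched in plain Python. This is not how Frames or ORF Finder are implemented; it is a toy scan, with made-up function names, that reports stop-free stretches in all six frames and does not require a start codon.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def open_stretches(seq: str, min_codons: int = 100):
    """Yield (strand, frame, start, end) for every stop-free stretch of at
    least min_codons codons. Coordinates are 0-based on the scanned strand."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i
                    start = i + 3
            # stretch still open at the end of the sequence
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - start) // 3 >= min_codons:
                yield strand, frame, start, end

demo = "ATGAAACCCGGGTAAATGTTTTAG"
for hit in open_stretches(demo, min_codons=3):
    print(hit)
```

The default `min_codons=100` mirrors the ">100 aa" rule of thumb from the Protein Coding Genes slide; the demo lowers it only because the toy sequence is short.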

12 Codon Bias
Genetic code degenerate
Codon usage varies organism to organism and gene to gene
High bias correlates with high-level expression
Bias correlates with tRNA isoacceptors
Change bias or tRNAs, change expression
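
Codon bias is measured by counting. The sketch below tallies in-frame codons and reports usage within one synonymous family; the function names and the toy ORF are illustrative assumptions, not data from the slides.

```python
from collections import Counter

LEUCINE = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

def codon_usage(orf: str) -> Counter:
    """Count codons in frame 0 (a trailing partial codon is ignored)."""
    orf = orf.upper()
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))

def relative_usage(counts: Counter, synonymous: list[str]) -> dict[str, float]:
    """Fraction of each codon within one synonymous family."""
    total = sum(counts[c] for c in synonymous)
    return {c: (counts[c] / total if total else 0.0) for c in synonymous}

toy_orf = "CTGCTGCTGTTACTG"   # made-up sequence: four CTG, one TTA
print(relative_usage(codon_usage(toy_orf), LEUCINE))
```

Comparing such tables between genes, or between organisms, gives the kind of difference the following slides plot.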

13 Codon Bias

14 Codon Bias Gene Differences

15 Codon Bias Organism Differences
Micrococcus luteus Pneumocystis carinii

16 Codon Bias Organism Differences
Pc Ml

17 Nucleotide Bias
Coding DNA vs non-coding DNA: often G+C content higher than bulk
Empirical statistics (Fickett’s TESTCODE)
Useful: ORF matches “typical” organism bias; ORF obscured by STOP codons; DNA sequence errors?
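
The simplest version of this idea is a sliding-window G+C profile. The sketch below is not TESTCODE, which uses positional base statistics; it only illustrates the "G+C higher than bulk" observation, and the window and step sizes are arbitrary choices.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_windows(seq: str, window: int = 120, step: int = 30):
    """Return (start, gc_fraction) for each full window along seq."""
    return [(i, gc_fraction(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]

print(gc_windows("AAAAGGGG", window=4, step=4))
```

Windows that sit well above the genome-wide G+C fraction are candidates worth checking with the other ORF criteria.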

18 Complex Genome DNA
~10% highly repetitive (300 Mbp): NOT GENES
~25% moderate repetitive (750 Mbp): some genes
~25% exons and introns (800 Mbp)
40% = ? Regulatory regions, intergenic regions

19 Bacterial Promoter
Subscripts give percent conservation at each position.
-35: T(82) T(84) G(78) A(65) C(54) A(45) … (16-18 bp) … -10: T(80) A(95) T(45) A(60) A(50) T(96) … start (A,G)
Alternate sigma factors: CCCTTGAA….CCCGATNT
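
Because no position is 100% conserved, an exact-match search for TTGACA and TATAAT (the -35 and -10 hexamers read off the slide) misses most real promoters. One simple alternative, sketched here with made-up names and an arbitrary threshold, is to count matches to both hexamers across the allowed 16-18 bp spacers.

```python
MINUS35, MINUS10 = "TTGACA", "TATAAT"

def matches(a: str, b: str) -> int:
    """Number of positions where a and b agree."""
    return sum(x == y for x, y in zip(a, b))

def promoter_candidates(seq: str, min_score: int = 9, spacers=(16, 17, 18)):
    """Yield (position of -35 box, spacer length, score); max score is 12."""
    seq = seq.upper()
    for i in range(len(seq) - 5):
        box35 = seq[i:i + 6]
        for sp in spacers:
            j = i + 6 + sp
            box10 = seq[j:j + 6]
            if len(box10) < 6:
                continue
            score = matches(box35, MINUS35) + matches(box10, MINUS10)
            if score >= min_score:
                yield i, sp, score

toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"   # invented test sequence
print(list(promoter_candidates(toy)))
```

Weighting each position by its conservation, rather than counting all matches equally, leads to the position weight matrices discussed later in the deck.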

20 Terminators
Rho-independent: stem/loop (structural only), 3’-U tail
Rho-dependent: C-rich, G-poor, “loose” consensus

21 Translation Ribosome Binding Site, Shine-Dalgarno Site
nnGGAGGnnnnnATG… typical E. coli nnaaAGGnnnnnATG
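
Both patterns on this slide translate directly into regular expressions. The sketch below keeps the slide's fixed 5-nt spacer, which is a simplification since real spacing varies; the pattern and function names are illustrative.

```python
import re

# Strict form (nnGGAGGnnnnnATG) and a looser AGG core (nnaaAGGnnnnnATG).
# Lookaheads are used so that overlapping candidates are all reported.
STRICT = re.compile(r"(?=(GGAGG[ACGT]{5}ATG))")
LOOSE = re.compile(r"(?=(AGG[ACGT]{5}ATG))")

def start_sites(seq: str, pattern=LOOSE):
    """Return 0-based positions of the ATG in each pattern match."""
    return [m.start() + len(m.group(1)) - 3
            for m in pattern.finditer(seq.upper())]

toy = "TTGGAGGACTTAATGAAA"   # invented sequence with one site
print(start_sites(toy, STRICT), start_sites(toy, LOOSE))
```

An ATG that opens a long ORF and also sits behind such a site is a stronger start-codon candidate than the ATG alone.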

22 Operon Structure Promoter?

23 GCG Tools
Setplot: Options
Frames –all myseq.seq: output.png; FTP output.png; View output.png
Testcode
Findpatterns (bacterial promoters)

24 .ps

25 Eukaryotic Gene Expression
[Figure: enhancer, promoter, transcribed region, terminator. Transcription (RNA polymerase II) gives the primary transcript, 5’ Exon1, Intron1, Exon2 3’. Cap, splice, cleave/polyadenylate give the mRNA, 7mG … An. Transport, then translation, gives the polypeptide, N to C.]

26 Eukaryotic Gene Complexity
Yeast: introns rare; promoters adjacent; genome dense

27 Eukaryotes, cont’d
Fungi: introns common, short relative to exons; promoter/enhancer; genome dense
“Large” eukaryotes: introns common, LONGER than exons; promoter/enhancer; genome sparse

28 Intron Prevalence

29 Intron Size

30 Exon Size

31 Yeast
ORFs = genes!; small ORFs (RNA genes); regulatory sequences

32 Fungi
Sew together exons: ORF regions; consensus sequences; domain/polypeptide matches

33 Exon/Intron Structure
CCACATTgtn(30-10,000)an(5-20)agCAGAA ...CCACATTCAGAA... ...ProHisSerGlu...

34 Alternative Splice CCACATTgtn(30-10,000)an(5-20)agcagAA
...CCACATTAA... ...ProHisSTOP
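
The two splicing outcomes above can be reproduced by cutting out the intron and translating. In this sketch the exon sequences come from the slides, while the intron filler is invented (only its gt…a…ag landmarks matter) and the function names are illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order; '*' marks a stop codon.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def splice(pre_mrna: str, intron_start: int, intron_end: int) -> str:
    """Remove pre_mrna[intron_start:intron_end] and join the flanks."""
    return pre_mrna[:intron_start] + pre_mrna[intron_end:]

def translate(seq: str) -> str:
    """Translate frame 0 into one-letter amino acids."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

exon1, exon2 = "CCACATT", "CAGAA"
intron = "gt" + "c" * 30 + "a" + "t" * 10 + "ag"   # invented filler
pre = exon1 + intron + exon2
end = len(exon1) + len(intron)

normal = splice(pre, len(exon1), end)          # acceptor at ...ag|CAGAA
alternate = splice(pre, len(exon1), end + 3)   # acceptor at ...agcag|AA
print(normal, translate(normal))
print(alternate, translate(alternate))
```

Moving the acceptor site by just three bases turns the Ser codon into an in-frame stop, which is why splice-site choice matters so much when stitching exons into an ORF.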

35 Consensus Sequences
Promoter sites; intron/exon; transcription termination/polyA; translation initiation
Position weight matrices

36 Finding Functional Sequences
Known Consensus Sequences Consensus Sequence Generation Functional Tests

37 Consensus Inference
Position weight matrices; sequence logos; hidden Markov models; ProfileScan
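
A position weight matrix can be built from a set of aligned sites in a few lines. The sketch below uses log2-odds against a uniform background with a pseudocount; those choices, the function names, and the four toy sites are all assumptions for illustration, not data from the slides.

```python
import math
from collections import Counter

def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one dict per column mapping base -> log2-odds score."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned and of equal length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for col in range(length):
        counts = Counter(s[col] for s in sites)
        pwm.append({b: math.log2((counts[b] + pseudocount) / total / background)
                    for b in "ACGT"})
    return pwm

def score(pwm, window: str) -> float:
    """Sum of column scores for a window of the same length as the PWM."""
    return sum(col[b] for col, b in zip(pwm, window))

toy_sites = ["CCATGG", "CTATGG", "CCATGA", "ACATGG"]   # invented examples
pwm = position_weight_matrix(toy_sites)
print(round(score(pwm, "CCATGG"), 2), round(score(pwm, "TTTTTT"), 2))
```

Unlike a consensus string, the matrix scores every candidate site, so near-matches are ranked rather than simply accepted or rejected.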

38 Translation Initiation Sites
C G T C A T G G

39 Functional Assay
Sequence, assay result: CCATGG 100; CCCTGG 0; CCTTGG 5; CCATAG 0; CTATGG 90; CCATGA 85
Conservation
Correlated positions

40 Splicing Consensus
Vert: A(64) G(73) G T A(62) A(68) G(84) T(63) … Y(80) N Y(80) Y(87) R(75) A Y(95) … C(65) A G N N
Fungi: GTRNGT(N){ } CTRAC(N){5-15}YAG
Alternate splicing!??

41 Linguistic Approach
Non-repetitive DNA!!
Long ORF; similar to known protein
ORF extended by “reasonable” splices
ORF begins with “good” ATG
Promoter/terminator flanks
Looks like a duck...

42 Protein Database Matches
Great for the “known” What about the unknown???

43 Codon Bias-useful? High bias = high confidence
Low bias = low confidence Sensitive to indel

44 Tools-GCG Most USEFUL Frames Testcode FindPatterns Map/Translate

45 Tools-WWW
GRAIL II: integrated gene parsing; GenLang; GENIE; HMMGene (lock ESTs, etc.); GENESCAN; GENEMARK
HMM probabilities

46 Hidden Markov Models
Probabilistic models, applicable to linear sequences
P(all states) = 1; infer probabilities of all states from the observed sequence (hidden states unobserved)
Work best when local correlations unimportant
Uses: genefinding, phylogeny, secondary structure, genetic mapping
Work best with a “training set”
Quantitative probabilities
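
To make "infer hidden states from the observed sequence" concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. Every probability below is invented for illustration (a real gene finder estimates them from a training set and uses many more states), and the input must be a non-empty string over A, C, G, T.

```python
import math

STATES = ("N", "C")   # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
# Made-up emissions: the coding state is slightly G+C-rich.
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

def viterbi(seq: str) -> str:
    """Most probable state path for seq, computed in log space."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for x in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            # best predecessor state for landing in s at this position
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            ptr[s] = prev
            col[s] = v[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][x])
        v.append(col)
        back.append(ptr)
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):   # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("A" * 20 + "G" * 20))
```

The output labels each base as N or C; the score of the winning path is the "quantitative probability" the slide refers to.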

47 Accuracy Assessment
PP = predicted coding; PN = predicted non-coding; AP = “real” positives; AN = “real” negatives
TP = number correct positive; TN = number correct negative; FP = number false positive; FN = number false negative
$Sn = TP/AP$; $Sp = TP/PP$
$AC = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TP}{TP+FP} + \frac{TN}{TN+FP} + \frac{TN}{TN+FN}\right) - 1$
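
These definitions can be checked numerically. The sketch below uses AP = TP + FN and PP = TP + FP, which follow from the definitions on the slide; the function name and the example counts are made up, and all four denominators must be non-zero.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int):
    """Return (Sn, Sp, AC) from per-nucleotide prediction counts."""
    sn = tp / (tp + fn)   # Sn = TP / AP
    sp = tp / (tp + fp)   # Sp = TP / PP
    ac = (sn + sp + tn / (tn + fp) + tn / (tn + fn)) / 2 - 1
    return sn, sp, ac

print(accuracy(50, 50, 0, 0))      # perfect prediction
print(accuracy(25, 25, 25, 25))    # no better than chance
print(accuracy(80, 90, 10, 20))    # invented intermediate case
```

AC runs from 1 for a perfect prediction down to -1 for a completely inverted one, with 0 meaning no better than chance.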

48 Accuracy Levels DNA Sequence Error Rate!??

49 NEXT: Regulatory Sequences
Known consensus sequences; consensus sequence generation; functional (lab) data; real examples

50 Gene Regulatory Sequences
Functional sites Consensus Experimental tests

51 Regulatory Sites
Transcript initiation; mRNA processing; translation sites

52 Transcript Initiation
Basal promoters
Enhancers/silencers/regulatory sites
Boundary elements?
Transcription initiation: prokaryotes vs eukaryotes; organism-to-organism

53 mRNA processing
Exon/intron: alternate splicing
Polyadenylation/cleavage
Stability

54 Translation
Initiation site
Translational regulatory elements: upstream ORFs, translational enhancers
(Frameshifting)

55 Regulatory Factors
lacI, trpR, CAP, araC…. GAL4, NDT80…
Known from experiment
Infer from genome?
Infer from expression data?

56 EUKARYOTES More complex signals More genes More dispersed signals
Combinatoric regulation common

57 Basal Promoter Analysis
Myers and Maniatis, Genes VI, 831
TATA: ATATAA, -30, TBP
CAAT: GGCCAATC, -75, CTF/NF1
GC: GCCACACCC, -90, SP1
+1: transcription start

58 Enhancer Elements
Octamer: OCT1, OCT2
κB: NF-κB
ATF: ATF
AP1: AP1 ……..
False +, False -

59 Poly A sites Metazoans AATAAA Yeast-different

60 Translation Sites
Initiate at 5’-ATG
Upstream ORF…regulatory
(Frameshifting)
Translation enhancers….

61 Consensus Sequence Databases
WWW-based TFD (transcription factor database) BCM Search launcher

62 Practical Gene Finding
Use ALL tools
Comparative: BLASTN, BLASTX
Predictive: HMM, GRAIL…; Frames, Testcode; Findpatterns (and WWW pattern searches)
Stitch together a consensus
cDNA OR protein OR genetic evidence
Most genefinding starts with mRNA; that’s not where the cell actually starts!

63 DATABASE SEARCH
www.ncbi.nlm.nih.gov
BLASTN: DNA:DNA comparison (ALWAYS!); not sensitive (DNA conservation low)
BLASTX/TBLASTX: 6-frame ORFs vs. a polypeptide database (BLASTX); 6 frames vs. 6 frames of a DNA database (TBLASTX)

64 FRAMES-aldolase gene

65 If aldolase is so tough, how do you really do it?
Combine DNA sequence with other data!

66 Genome-cDNA
DNA sequencing; align (GAP) cDNA
Infer promoter (P), enhancer
Test in cis

