Chap 9. Gene Discovery. DNARNA cDNA protein EST (Expressed Seq. Tag)


1 Chap 9. Gene Discovery

2 DNA, RNA, cDNA, protein, EST (Expressed Sequence Tag)

3 Gene Discovery
- A major application of bioinformatics
- Works by matching known patterns of genes
- A gene = promoter + 5’ UTR + protein-coding sequence + 3’ UTR
- The coding sequence starts with ATG and stops with TAG, TGA, or TAA
- The coding sequence is called an open reading frame (ORF)

4 Gene Structure

5 ORF (Open Reading Frame): DNA can be read in six reading frames, so it can encode up to six different proteins
Frames on the top strand:
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
Frames on the bottom strand:
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
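The six frames on this slide can be reproduced with a short sketch. The function names `reverse_complement` and `six_frames` are illustrative choices, not part of the original slides; the input is the top strand shown above.

```python
# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq: str) -> dict:
    """Split a sequence into codons for all six reading frames.

    Keys "+1".."+3" are frames on the given strand; "-1".."-3" are
    frames on the reverse complement. Incomplete trailing codons are dropped.
    """
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames

top = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
frames = six_frames(top)
for name, codons in frames.items():
    # Prints the first two codons of each frame, e.g. "+1 CAT CAA".
    print(name, " ".join(codons[:2]))
```

The first two codons of each frame match the six lines on the slide (CAT CAA, ATC AAC, TCA ACT, GTG GGT, TGG GTA, GGG TAG).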

6 Transcription
- The gene sequence is copied from one strand only
- Sense (coding) strand: same sequence as the mRNA (with U in place of T)
- Antisense (template) strand: the strand that is read to generate the mRNA
5’ CGCTATAGCGTTTCAT 3’ : antisense, template strand
3’ GCGATATCGCAAAGTA 5’ : sense, coding strand
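As a small illustration of the strand relationship, the sketch below derives the mRNA from the template strand on this slide. The function name `transcribe` is a hypothetical helper, not something from the course material.

```python
# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def transcribe(template_5to3: str) -> str:
    """Return the mRNA (5' to 3') made from a template strand given 5' to 3'."""
    # The reverse complement of the template is the coding (sense) strand.
    coding = template_5to3.translate(COMPLEMENT)[::-1]
    # The mRNA has the coding-strand sequence with U in place of T.
    return coding.replace("T", "U")

template = "CGCTATAGCGTTTCAT"
mrna = transcribe(template)
print(mrna)  # AUGAAACGCUAUAGCG
```

The result begins with AUG, the start codon, which is consistent with the bottom strand on the slide being the sense strand.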

7 sense Template, anti-sense

8 Transcription initiation
- The double-helix DNA strands are separated in the gene coding region
- Which enzyme detects the beginning of a gene? RNA polymerase (a multi-subunit enzyme that synthesizes RNA) binds to the promoter
  - RNA polymerase I: 28S, 5.8S and 18S rRNA genes
  - RNA polymerase II: coding genes, snRNA
  - RNA polymerase III: tRNA, 5S rRNA, snoRNA
- Other factors: General (Basal) Transcription Factors (GTFs) TFIIA, TFIIB, TFIID
  - TFIID recognizes the promoter sequence
http://www.youtube.com/watch?v=MkUgkDLp2iE

9 Promoter in E.coli

10 Transcription initiation in E.coli

11 Transcription initiation in eukaryotes
- The promoter consists of the -25 or TATA box (TATAWAW; W = A or T)
- and the Inr (initiator) sequence (YYCARR; Y = C or T; R = A or G)

12 Transcription initiation in eukaryotes Initial contact is made by general transcription factor (GTF) TFIID, which consists of TATA-binding protein (TBP) and at least 12 TBP-associated factors (TAF)

13

14 Transcription Start Site (TSS)
www.cs.uml.edu/~Kim/580/review_polII_11_Kadonaga.pdf
- TSS: the first base copied into mRNA
- Core promoter: the region around a TSS
- Conventionally, the core promoter has a TATA box at -30 bp relative to an Inr (Initiator)
- Transcription factors (TFs) bind to the TATA box, the Inr sequence, and other sites; bend the DNA by 90 degrees; and recruit general TFs
- CpG islands: 300-3000 bp stretches rich in C and G, found in 40% of promoters
- More recently: a TATA box is found in only 10-20% of promoters
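A rough way to test whether a window looks like a CpG island is to compute its GC content and its observed/expected CpG ratio. The sketch below is an illustration only: the thresholds (length at least 200 bp, GC content above 50%, observed/expected above 0.6) are the commonly cited Gardiner-Garden and Frommer criteria, which are an assumption here and do not come from the slide, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # number of CpG dinucleotides
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-content, and obs/exp thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp

# Toy inputs: an artificial CG repeat and an AT repeat of the same length.
print(looks_like_cpg_island("CG" * 150))  # True
print(looks_like_cpg_island("AT" * 150))  # False
```

In practice the test is applied to sliding windows along a genome and adjacent positive windows are merged; that step is omitted here.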

15 Core Promoter Elements
- TFIIB Recognition Element (BRE) (SSRCGCC)
  - BREu (BREd) suppresses (enhances) transcription
- TATA box: TATAWAAR (metazoans)
- Inr: YYANWYY (A at +1)
- DPE (Downstream Core Promoter Element)
- MTE (Motif Ten Element)
Base codes: W = A or T; S = C or G; R = A or G (purine); Y = T or C (pyrimidine); N = any base
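The consensus motifs above use IUPAC ambiguity codes, which translate directly into regular-expression character classes. The following sketch shows one way to scan a sequence for them; the function name `find_motif` and the test sequence are made up for illustration.

```python
import re

# IUPAC codes used on the slides, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}

def find_motif(motif, seq):
    """Return 0-based start positions of all matches, including overlapping ones."""
    body = "".join(IUPAC[code] for code in motif)
    # A lookahead makes every match zero-width, so overlapping hits are reported.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]

# Hypothetical sequence containing a BRE, a TATA box, and an Inr.
seq = "GGGCGCCTATAAAAGGCCTCAGTCC"
print(find_motif("SSRCGCC", seq))   # [0]
print(find_motif("TATAWAAR", seq))  # [7]
print(find_motif("YYANWYY", seq))   # [18]
```

Exact consensus matching is a simplification; real promoter scanners usually score positions with weight matrices so that near-matches are also found.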

16 Focused/Dispersed TSS
Focused (Sharp) TSS
- Distinct TSS site
- Usually a TATA box in a sharp TSS
- Primarily in tissue-specific expression
Dispersed (Broad) TSS
- Multiple weak start sites within 50-100 nt
- A few Inr or Inr-like sequences in the neighborhood
- Generally associated with ubiquitously expressed genes
- Thought to be related to CpG islands

17

18 How to recognize the end of transcription? The terminator sequence stalls the polymerase

19 Splicing
- Alternative splicing to produce mRNA
- Spliceosome: a complex of snRNAs and associated proteins

20 Function of Introns
www.cs.uml.edu/~kim/580/review_intron.pdf
- When inserted into the promoter region, introns boost expression level
- First introns are long
- Alternative exons are flanked by long introns
- But no association between intron length and expression breadth has been found in human
- Removal of the 2nd intron of the human beta-globin gene reduces the efficiency of 3’-end formation
- RNA pol II elongation rate: 3.8 kb/min
- Introns may serve as time delays between the activation of a gene and the appearance of its product
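The time-delay idea is simple arithmetic on the elongation rate quoted above. The sketch below assumes a constant rate and ignores initiation, pausing, and splicing time; the function name and the 38 kb example length are hypothetical.

```python
POL2_RATE_KB_PER_MIN = 3.8  # RNA pol II elongation rate quoted on the slide

def transcription_minutes(length_bp, rate_kb_per_min=POL2_RATE_KB_PER_MIN):
    """Minutes needed to transcribe length_bp bases at a constant elongation rate."""
    return (length_bp / 1000.0) / rate_kb_per_min

# A hypothetical 38 kb intron adds roughly ten minutes of transcription time.
delay = transcription_minutes(38_000)
print(round(delay, 1))  # 10.0
```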

21 Annotation: How do I get from this… >mouse_ear_cress_1080 AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACC GGTGTGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGA AAGCGGGTTGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAA TTTACCAAGGGAGAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAG AGCTCTACGGGTTTTAGTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGT TTTGGGATGTAGAGCTAACCAGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGA ATACCAGTTTTAAATTACCGTCAGATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACA AACTCTTTAAGAACGTATCTTTCAGTTTTCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACT GAACCGAATTTAAACCGGAGGGAGGGTTTGACTTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGA AGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAAGCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGA CCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCCCAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTC AACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGGAAAGGTTGATATTTTCCCCTTCGCTTTGGTCTT ATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTTGGCTAAGAAGAGATCTTTACTCTCTGTAT TTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAATAAAGTATTGAGCTTTACTAAGCTT TCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTTCTCCAGCTCGACTACACTGAA GGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAAAGAGAGTAATTGCTTTG CGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACACTTCTCTAATTGAT AACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTTTACTGTCTG TGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATATTTGA

22 …to this?

23 Meaning?

24 Comparative Tools (Database searches)

25 What do we know about genes?
Expressed (Transcribed)
- Transcriptional start & termination sites (TXSS, TXTS)
- Transcription artifacts (cDNA & ESTs (Expressed Sequence Tags))
Regulated
- Promoters (TATAAA)
- Transcription Factor Binding Sites
- CpG
Meaningful (Translated)
- 3n basepairs
- Codon usage
- Translational start & stop/termination codons (TLSS, TLTS)
- Translation artifacts (proteins)
Spliced
- Splice sites (GT-AG)
Derived (Homology: Paralogy/Orthology)
- Search for known genes, proteins (BLAST)

26 How might this knowledge help to find genes?
Predict genes
- Look for potential starts and stops.
- Connect them into open reading frames (ORFs).
- Filter for “correct” length & codon usage.
Search databases
- Known genes: UniGene
- Known proteins: UniProt
Use transcript evidence
- cDNA
- ESTs (Expressed Sequence Tags)
- proteins
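The "predict genes" steps above (find starts and stops, connect them into ORFs, filter by length) can be sketched as follows. This is a minimal illustration on a single strand: `find_orfs` and `min_codons` are hypothetical names, the codon-usage filter is left out, and introns are ignored.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ORF on the given strand.

    start is the index of the ATG, end is the index just past the stop codon,
    and frame is 0, 1, or 2. The length filter counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first potential start in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Toy sequence with one ORF, ATG AAA TTT TAG, in frame 2.
toy = "CCATGAAATTTTAGGG"
print(find_orfs(toy))  # [(2, 14, 2)]
```

To cover all six frames, the same function can be run on the reverse complement of the sequence.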

27 Canonical splice sites
Figure labels: Pre-mRNA, Exon, Intron, 5’ Splice Site, 3’ Splice Site
Reddy, S.N. Annu. Rev. Plant Biol. 2007 58:267-94
Of 1588 examined predicted splice sites in Arabidopsis, 1470 sites (93%) followed the canonical GT…AG consensus. (Plant (2004) 39, 877–885)

28 Alternative Splicing
The primary transcript of a gene is spliced into different mRNAs, leading to multiple proteins generated from the same gene.
- Contributes to protein diversity.
- Can occur in any part of the transcript, including UTRs.
- Can alter start codons, stop codons, reading frame, CDS, UTRs.
- May alter mRNA stability, translation (time, location, duration), protein sequence, or a combination of these.

29 The dogmas – they are a~changing…
- One gene, one enzyme
- One gene, one polypeptide
- One gene, one set of transcripts (> 0)

30 Alternative splicing in metazoans (Animalia)
- Alternative splicing is well characterized in animals.
- As many as 96% of human genes may have multiple splice forms.
- The functional significance of alternative splicing is still poorly understood.
References: Alternative splicing in animals. Nature Genetics Research 36; 2004. Bridging the gap between genome and transcriptome. Nucleic Acids Research 32, 2004.
Figure: Splice statistics for human genes

31 Alternative splicing in plants
RuBisCo alternative splicing, one of the first plant examples:
“The data presented here demonstrate the existence of alternative splicing in plant systems, but the physiological significance of synthesizing two forms of rubisco activase remains unclear. However, this process may have important implications in photosynthesis. If these polypeptides were functionally equivalent enzymes in the chloroplast, there would be no need for the production of both….”

32 Biological significance of AS in plants
…includes:
- regulation of flowering;
- resistance to diseases;
- enzyme activity (timing, duration, turn-over time, location).
Most genome databases give alternatively spliced plant gene variants

33 Example: Jasmonate signaling in Arabidopsis
- Jasmonate is a plant hormone; it affects cell division, growth, reproduction and responses to insects, pathogens, and abiotic stress factors.
- The splice variants JAZ 10.1, JAZ 10.3 and JAZ 10.4 of the Jasmonate Signaling Repressor Protein JAZ 10 differ in susceptibility to degradation.
- Phenotypic consequences include male sterility and altered root growth.

34 Example: Jasmonate signaling in Arabidopsis
- Alternative splice sites C’ and D’ lead to different splice variants
- JAZ10.3: premature stop codon in D exon, intact JAS domain
- JAZ10.4: truncated C exon, protein lacks JAS domain
- JAZ 10 is encoded by At5G13220

35 AS in different Reading Frames

36 Gene Prediction

37 Gene Prediction Methods
Intrinsic or template methods (ab initio)
- Search by signal
  - Signals: short, functional DNA elements involved in gene specification
  - Four basic signals define coding exons: translation start site, 5’ (donor) site, 3’ (acceptor) site, stop site
- Search by content
Extrinsic or look-up methods
- Homology-based
  - Compare the sequence of interest against known coding sequences
- Comparative gene prediction
  - Compare the sequence of interest against anonymous sequences

38 Gene Prediction Methods
Sequence-based
- Search for ORFs and consensus sequences
Alignment-based
- Search for orthologous genes of other organisms
- Search for strong conservation of a genome region
Content-based
- Search for patterns, such as nucleotide or codon frequency, characteristic of coding sequences
Probabilistic
- Prediction algorithms

39 Typical Computational Steps in Gene Prediction
- Identify and score suitable splice sites and start/stop signals along the query sequence
- Predict candidate exons as detected by these signals
- Score exons as a function of signals and coding statistics
- Factor in the quality of alignment between the query and known coding sequences
- Assemble a subset of these exon candidates into a predicted gene structure
- Choose the assembly that maximizes a particular scoring function

40 Prediction and Scoring of Exons
- Protein-coding regions have a characteristic compositional bias, e.g., a triplet pattern in coding regions
- The hexamer frequency method with 5th-order Markov models is widely used
- The likelihood of a particular base at a given position depends on the five preceding bases
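To make the 5th-order Markov idea concrete, the sketch below estimates the probability of each base given the five preceding bases from training sequences, then scores a query by the log-odds of a "coding" model against a "non-coding" model. It is a simplified illustration: the names `train` and `log_odds`, the add-one pseudocount, and the tiny training strings are all assumptions, and real gene finders typically use separate models for each codon position.

```python
import math
from collections import defaultdict

ORDER = 5  # each base is conditioned on the five bases before it

def train(seqs, pseudocount=1.0):
    """Return a function prob(context, base) estimated from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(ORDER, len(s)):
            counts[s[i - ORDER:i]][s[i]] += 1  # one hexamer observation

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob

def log_odds(seq, coding_prob, noncoding_prob):
    """Sum of log(P_coding / P_noncoding) over every base with a full context."""
    score = 0.0
    for i in range(ORDER, len(seq)):
        context, base = seq[i - ORDER:i], seq[i]
        score += math.log(coding_prob(context, base) / noncoding_prob(context, base))
    return score

# Toy training data (made up): a GCT-repeat "coding" set and an AT-repeat background.
coding = train(["ATGGCTGCTGCTGCTGCTGCT"])
background = train(["ATATATATATATATATATAT"])
print(log_odds("GCTGCTGCTGCT", coding, background) > 0)  # True: looks coding
print(log_odds("ATATATATATAT", coding, background) < 0)  # True: looks like background
```

A positive score means the hexamer composition resembles the coding training set more than the background set.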

41 From Exons to RNA
- Assembling several exons into a gene is combinatorially difficult
- Can use dynamic programming
- Programs: GRAIL (Gene Recognition and Analysis Internet Link), FGENESH, GENEID
- HMM (Hidden Markov Model): GENSCAN
- Sequence similarity-based gene prediction: GENEWISE
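The dynamic-programming step can be illustrated with a toy version of exon assembly: given scored candidate exons, pick the non-overlapping chain with the highest total score. This is only a sketch of the general idea and not the algorithm of any program named above; the function name and the candidate exons are hypothetical, and real gene finders also enforce reading-frame and splice-site compatibility.

```python
from bisect import bisect_right

def best_exon_chain(exons):
    """Pick the highest-scoring set of non-overlapping exons.

    exons: list of (start, end, score) with end exclusive.
    Returns (total_score, chain) with the chain ordered along the sequence.
    """
    exons = sorted(exons, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    # best[k] holds the optimal (score, chain) using only the first k exons.
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon starts
        j = bisect_right(ends, start, 0, k)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[k]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

# Hypothetical candidate exons: (start, end, score).
candidates = [(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (160, 260, 3.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)  # 9.0 [(0, 100, 5.0), (120, 200, 4.0)]
```

Sorting by end coordinate lets each exon reuse the best solution for everything that finishes before it starts, which avoids enumerating all exon subsets.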

42 How Well Do Predictions Work?
- Sensitivity (Sn) = TP / (TP + FN)
- Specificity (Sp) = TP / (TP + FP)
- Correlation coefficient (CC)
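These measures are easy to compute from nucleotide-level counts. In the sketch below, Sn and Sp follow the definitions on the slide; the slide does not spell out the correlation coefficient, so the code assumes the usual form CC = (TP·TN − FN·FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)). The function name and the example counts are made up.

```python
import math

def prediction_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from true/false positive and negative counts."""
    sn = tp / (tp + fn)  # fraction of real coding bases that were predicted
    sp = tp / (tp + fp)  # fraction of predicted coding bases that are real
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else float("nan")
    return sn, sp, cc

# Hypothetical counts for a 1000-base test sequence.
sn, sp, cc = prediction_accuracy(tp=80, fp=20, tn=880, fn=20)
print(round(sn, 3), round(sp, 3), round(cc, 3))  # 0.8 0.8 0.778
```

Note that Sp as defined here is what other fields call precision; it is used in gene-finding evaluations because true negatives vastly outnumber coding bases.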

43

44 Accuracy of Gene Finding Programs Sanja Rogic, Alan K. Mackworth, and Francis B.F. Ouellette (2001) Genome Research 11

45 Promoter Analysis

46 Annotation Cheat Sheet
A. DNA Subway
- Open existing project or generate new (Red square)
- Run RepeatMasker
- Generate evidence (Predictions, BLAST searches)
- Synthesize evidence into gene models (Apollo)
- Browse results locally and in context (Phytozome)
- Conduct functional analysis (link from Browser)
- Prospect for gene family (Yellow Line from Browser)

47 B. Apollo
- Select region that holds biological gene evidence
- Optimize work space and zoom to region (View tab)
- Expand all tiers (Tiers tab)
- Drag evidence item(s) onto workspace (mouse)
- Edit to match biological evidence (right-click item for tools)
- Record what was done in Annotation Info Editor
- Assess necessity to build alternative model(s)
- Upload model(s) to DNA Subway (File tab)

