
ORF Calling. Why? Need to know protein sequence Protein sequence is usually what does the work Functional studies Crystallography Proteomics Similarity.


1 ORF Calling

2 Why? Need to know protein sequence. Protein sequence is usually what does the work: functional studies, crystallography, proteomics, similarity studies. Proteins are better for remote similarities than DNA sequences. Protein sequences change slower than DNA sequences. ORF Calling

3 Intrinsic gene calling: only use information in your DNA sequences. Does not use other information. Extrinsic gene calling: compare your DNA sequences to known sequences. Needs other sequences that are known! ORF Calling

4 • Start with DNA sequence • Translate in all 6 reading frames. Extrinsic gene calling

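5 [Figure: a double-stranded DNA sequence with its reading frames marked; frame labels on the slide: 1, 2, 3, -2, -3] One strand: TCA TTT TGA AAT TAA CAA CCA ATT. Its base-paired complement: AGT AAA ACT TTA ATT GTT GGT TAA. The first strand grouped into codons at three offsets: TCA TTT TGA AAT TAA CAA CCA ATT; T CAT TTT GAA ATT AAC AAC CAA TT; TC ATT TTG AAA TTA ACA ACC AAT T. Why are there 6 reading frames?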
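A minimal sketch of six-frame translation, to make the slide concrete. It assumes the TCA... strand above is the forward strand written 5'→3' and uses the standard genetic code; the function names (`six_frames`, `translate`) are illustrative, not from the presentation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Map frame label (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper()
    frames = {}
    for sign, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[sign * (offset + 1)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("TCATTTTGAAATTAACAACCAATT").items():
        print(label, protein)
```

Three offsets on each of two strands give the six frames. For the slide's sequence, frame 1 translates to SF*N*QPI, so it already contains two stop codons.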

6 • Start with DNA sequence • Translate in all 6 reading frames • Compare your sequence to known protein sequences • Find the ends of each, and call those genes! Extrinsic gene calling

7 [Figure: a DNA sequence with similar protein sequences (e.g., from BLAST) aligned to it, marking out a protein-encoding gene] For example

8 • This is how (most) metagenome ORF calling is done • Eukaryotic ORF calling, especially using EST sequences. Uses of extrinsic calling

9 • Very slow (depending on search algorithm) • Dependent on your database • Only finds known genes. Problems with extrinsic calling

10 • Intrinsic gene calling • Ab initio gene calling • What are the start codons? ATG • What are the stop codons? TAA, TAG, TGA. Alternatives to extrinsic gene calling

11 Approximately once every 20 amino acids at random! A stretch of 100 amino acids is likely to have a stop codon! How frequently do stop codons appear?
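A quick check of the numbers on this slide, as a sketch. It assumes random DNA in which all 64 codons are equally likely, which real genomes only approximate.

```python
# Assumption: uniform random DNA, so each of the 64 codons has probability 1/64.
STOP_CODONS = ("TAA", "TAG", "TGA")
P_STOP = len(STOP_CODONS) / 64


def expected_codons_between_stops():
    """Mean spacing of stop codons in one reading frame."""
    return 1 / P_STOP


def prob_no_stop(n_codons):
    """Chance that n_codons consecutive random codons contain no stop."""
    return (1 - P_STOP) ** n_codons


if __name__ == "__main__":
    print(round(expected_codons_between_stops(), 1))
    print(round(prob_no_stop(100), 4))
```

Under that assumption a stop turns up about once every 21 codons, and a 100-codon stretch with no stop occurs by chance less than 1% of the time. That is why long open stretches are taken as evidence of a real gene.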

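12 [Figure: DNA with reading frames 3, 2, 1, -2, -3] How to call ORFs (the easy way)

13 [Figure: DNA with reading frames 3, 2, 1, -2, -3] Find all the stop codons

14 [Figure: DNA with reading frames 3, 2, 1, -2, -3] x is often 100 amino acids. Find all the ORFs > x amino acids

15 [Figure: DNA with reading frames 3, 2, 1, -2, -3] Trim to those ORFs that have a start

16 [Figure: DNA with reading frames 3, 2, 1, -2, -3] Short ORFs that overlap others. Remove “shadow” ORFs

17 [Figure: DNA with reading frames 3, 2, 1, -2, -3] Trim the start sites to first ATG

18 [Figure: DNA with reading frames 3, 2, 1, -2, -3] These are the ORFs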
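The easy way above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the presenter's code: ATG is the only start codon considered, stretches that run off the end of the sequence are ignored, and a “shadow” ORF is taken to mean one lying entirely inside a longer ORF. The names `find_orfs`, `remove_shadows` and `ORF` are hypothetical.

```python
from collections import namedtuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
# start/end are 0-based, half-open coordinates on the forward strand.
ORF = namedtuple("ORF", "strand frame start end codons")


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=100):
    """Find stop-to-stop stretches of at least min_codons, trimmed to the first ATG."""
    dna = dna.upper()
    length = len(dna)
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            codons = [seq[i:i + 3] for i in range(frame, length - 2, 3)]
            open_from = 0  # index of the first codon after the previous stop
            for idx, codon in enumerate(codons):
                if codon not in STOP_CODONS:
                    continue
                stretch = codons[open_from:idx]
                stretch_start = open_from
                open_from = idx + 1
                # Keep long stretches only, and only those containing a start.
                if len(stretch) < min_codons or START_CODON not in stretch:
                    continue
                first_atg = stretch_start + stretch.index(START_CODON)
                start = frame + 3 * first_atg
                end = frame + 3 * (idx + 1)  # includes the stop codon
                if strand == "-":
                    start, end = length - end, length - start
                orfs.append(ORF(strand, frame + 1, start, end, idx - first_atg))
    return orfs


def remove_shadows(orfs):
    """Drop ORFs that lie entirely inside a longer ORF (assumed meaning of 'shadow')."""
    kept = []
    for orf in sorted(orfs, key=lambda o: o.codons, reverse=True):
        if not any(k.start <= orf.start and orf.end <= k.end for k in kept):
            kept.append(orf)
    return sorted(kept, key=lambda o: o.start)
```

Calling `remove_shadows(find_orfs(genome))` on a genome string would return the surviving ORFs sorted by position. Real gene callers also consider alternative start codons and partial genes at contig ends.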

19 Intrinsic ORF calling using Markov Models

20 • Based on language processing • Common for gene and protein finding, alignments, and so on. Markov Models

21 English: the; Spanish: el (la); Portuguese: que. What is the most common word?

22 Scrabble

23 In Scrabble, how do they score the letters? The most abundant letters (easiest to place on the board) are given the lowest score. Scrabble

24 1 point: E, A, I, O, N, R, T, L, S, U; 2 points: D, G; 3 points: B, C, M, P; 4 points: F, H, V, W, Y; 5 points: K; 8 points: J, X; 10 points: Q, Z. Scrabble

25 Frequency of letters

26 If I want to make up a sentence, I could choose some letters at random, based on how often they occur in the language (i.e., their Scrabble score): rla bsht es stsfa ohhofsd. Making up sentences

27 What follows a period (“.”)? Usually a space “ ”. What follows a t? Usually an “i” (-tion, -tize, ...). Let's get clever!

28 When the first letter is “t” (from 3,269 words): ti 51%, te 20%, ta 15%, th 8%. Frequency of two letters

29 Choose a letter based on the probability that it follows the letter before: shandtuchtineymeleolld Level 1 analysis
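A small sketch of this level-1 (first-order) idea: count which letter follows which in some training text, then pick each new letter according to those counts. The training text, function names and seed are placeholders chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def train_first_order(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, first, length, seed=0):
    """Build a string in which each letter is drawn given the one before it."""
    rng = random.Random(seed)
    out = [first]
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:  # the last letter was never followed by anything
            break
        letters, weights = zip(*followers.items())
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)


if __name__ == "__main__":
    model = train_first_order("the cat sat on the mat and then the rat ran")
    print(generate(model, "t", 20))
```

With a large English text as training data, the output starts to look like the pseudo-words on the slide, because common pairs such as "th" and "an" dominate.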

30 1 letter (a, e, o …): zero-order model. 2 letters (th, ti, sh …): first-order model. 3 letters (the, and, …): second-order model. 4 letters (that, …): third-order model. Levels of analysis

31 With about 10th-order Markov models of English you get complete words and sentences! Markov models


33 Scoring words with Markov Models. If I choose random letters, how can I tell if they are real words? Sum the scores of 10th-order Markov models across the words … if it is high, it is likely to be a real word! In reality, maybe use 1st-, 2nd-, 3rd-, 4th-, 5th-, 6th-, … order models and compare to some known words

34 Codons have three letters (ATG, CAC, GGG, ...). Use a 2nd-order Markov model for ORF calling. The probability of a letter is predicted from the two letters before it. Markov Models and ORF calling
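A sketch of a second-order Markov model for DNA: estimate the probability of each base given the two bases before it, then sum log-probabilities to score a candidate sequence. The pseudocount smoothing and the function names are assumptions added for illustration, and the code expects plain A/C/G/T input.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"


def train_second_order(sequences, pseudocount=1.0):
    """Estimate P(base | previous two bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in BASES:
        for b in BASES:
            ctx = a + b
            total = sum(counts[ctx][n] for n in BASES) + 4 * pseudocount
            model[ctx] = {n: (counts[ctx][n] + pseudocount) / total for n in BASES}
    return model


def log_score(model, seq):
    """Sum of log-probabilities; higher means more like the training set."""
    seq = seq.upper()
    return sum(math.log(model[seq[i - 2:i]][seq[i]]) for i in range(2, len(seq)))


if __name__ == "__main__":
    model = train_second_order(["ATGATGATGATG"])
    print(log_score(model, "ATGATGATG"), log_score(model, "CCCCCCCCC"))
```

A candidate ORF that scores higher under a model trained on known genes than under a background model is more likely to be coding. Scores for sequences of different lengths are not directly comparable unless normalized.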

35 Scrabble

36 Do English and Spanish use the same letters? Scrabble (México)

37

38 1 point: E, A, I, O, N, R, T, L, S, U; 2 points: D, G; 3 points: B, C, M, P; 4 points: F, H, V, W, Y; 5 points: K; 8 points: J, X; 10 points: Q, Z. Scrabble (US). Based on the front page of the NY Times!

39 1 point: A, E, O, I, S, N, L, R, U, T; 2 points: D, G; 3 points: C, B, M, P; 4 points: H, F, V, Y; 5 points: CH, Q; 8 points: J, LL, Ñ, RR, X; 10 points: Z. Scrabble (Spanish)

40 Will vary with the composition of the organism! Remember, some organisms have high G+C compared to A+T. What about Scrabble scores for DNA?
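The DNA equivalent of Scrabble letter frequencies is the base composition, which can be computed directly. A minimal sketch (function names are illustrative; characters other than A, C, G, T are skipped):

```python
from collections import Counter


def base_frequencies(dna):
    """Fraction of A, C, G and T, ignoring any other characters."""
    counts = Counter(b for b in dna.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(dna):
    """G+C fraction of the sequence."""
    freqs = base_frequencies(dna)
    return freqs["G"] + freqs["C"]


if __name__ == "__main__":
    print(gc_content("GGCCATGC"))
```

Because G+C content differs between organisms, a model trained on one genome may score genes from a very different genome poorly.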

41 Use a 2nd-order Markov model for ORF calling. The probability of a letter is predicted from the two letters before it. Markov Models and ORF calling

42 Need to train the Markov model: not all organisms are the same. Can use phylogenetically close organisms. Can use “long ORFs”, which are likely to be correct because long random stretches without a stop codon are unlikely! Problems!

43 Markov Models of order 1-8 (word size 2-9). Discard (or ↓ weight) rare words. Promote (or ↑ weight) common words. Probability is the sum of the probabilities from all orders 1-8 (word sizes 2-9). Interpolated Markov Model (the IMM in GLIMMER)
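A toy sketch of the interpolation idea: predictions from several orders are combined, with contexts seen rarely in training given less weight. This is not GLIMMER's actual weighting scheme. The linear weight, the `full_weight_at` threshold, the inclusion of order 0 as a fallback, and all names here are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_counts(sequences, max_order=8):
    """For each order k, count which base follows each k-letter context."""
    counts = {k: defaultdict(Counter) for k in range(max_order + 1)}
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq)):
            for k in range(min(i, max_order) + 1):
                counts[k][seq[i - k:i]][seq[i]] += 1
    return counts


def interpolated_prob(counts, context, base, full_weight_at=400):
    """Weighted average of P(base | last k letters of context) over all orders.

    A context seen n times gets weight min(1, n / full_weight_at), so rare
    contexts count for little and common contexts count fully (hypothetical rule).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for k, table in counts.items():
        if k > len(context):
            continue
        seen = table.get(context[len(context) - k:])
        if not seen:
            continue
        n = sum(seen.values())
        weight = min(1.0, n / full_weight_at)
        weighted_sum += weight * seen[base] / n
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.25


if __name__ == "__main__":
    counts = train_counts(["ATGATGATGATG"], max_order=2)
    print(interpolated_prob(counts, "AT", "G", full_weight_at=4))
```

The weights are normalized, so the four base probabilities for a given context still sum to 1. With realistic training data the high-order contexts are often too rare to trust, which is the situation interpolation is meant to handle.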

44 As with proteins, two main methods: ab initio (intrinsic) and homology-based (extrinsic). RNA genes

45 Ribosomes are made of proteins and RNA Ribosomes

46 30S subunit from Thermus aquaticus. Blue: protein. Orange: rRNA

47 E. coli 16S rRNA secondary structure

48 Variable region Conserved region

49 Variable regions in the 16S rRNA. Vn – 9 regions (n) – variable loop(s); forward/rev primers. V1 (6), V2 (8-11), V3 (18), V4 (P23-1, 24), V5 (28, 29), V6 (37), V7 (43), V8 (45, 46), V9 (49). Van de Peer Y, Chapelle S, De Wachter R. (1996) A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucl. Acids Res. 24:3381-3391

50 Ribosomes are made of proteins and RNA. Prokaryotic ribosome: large subunit (50S), 5S and 23S rRNA genes; small subunit (30S), 16S rRNA gene. Ribosomes

51 Easiest way is iterative: • BLAST • ALIGN • TRIM. Problem: secondary structure makes identification of the ends difficult. Finding 16S genes

52 Not as easy as rRNA. Much shorter. Varied sequence. Only conservation is 2° structure. Finding tRNA genes

53 tRNAscan-SE (Sean Eddy). Use it!

54 How does this relate to tRNA? tRNA-Phe by Yikrazuul - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons https://commons.wikimedia.org/wiki/File:TRNA-Phe_yeast_en.svg

55 tRNA structure: start of acceptor stem (7-9 bp); D-loop (4-6 bp stem plus loop); anticodon arm (6 bp stem plus loop with anticodon); T-loop (4-5 bp stem plus loop); end of acceptor stem (7-9 bp); CCA to attach amino acid (may not be in sequence... added during processing)

