ORF Calling.

Presentation on theme: "ORF Calling."— Presentation transcript:

1 ORF Calling

2 ORF Calling: Why?
Need to know the protein sequence. Protein sequence is usually what does the work.
Functional studies
Crystallography
Proteomics
Similarity studies
Proteins are better for remote similarities than DNA sequences.
Protein sequences change more slowly than DNA sequences.

3 ORF Calling
Extrinsic gene calling: compare your DNA sequences to known sequences. Needs other sequences that are known!
Intrinsic gene calling: only use information in your DNA sequences. Does not use other information.

4 Extrinsic gene calling
Start with DNA sequence
Translate in all 6 reading frames

5 Why are there 6 reading frames?
Frame  3:  AG TAA AAC TTT AAT TGT TGG TTA A
Frame  2:  A GTA AAA CTT TAA TTG TTG GTT AA
Frame  1:  AGT AAA ACT TTA ATT GTT GGT TAA
DNA:       AGT AAA ACT TTA ATT GTT GGT TAA
           ||| ||| ||| ||| ||| ||| ||| |||
           TCA TTT TGA AAT TAA CAA CCA ATT
Frame -1:  TCA TTT TGA AAT TAA CAA CCA ATT
Frame -2:  TC ATT TTG AAA TTA ACA ACC AAT T
Frame -3:  T CAT TTT GAA ATT AAC AAC CAA TT
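The six frames on this slide can be enumerated with a few lines of Python. This is only a minimal sketch: the function names (`reverse_complement`, `split_codons`, `six_frames`) are made up for illustration, the input is assumed to contain only A, C, G and T, and the negative frames are read 5'->3' along the reverse strand (that is, right to left in the diagram).

```python
# Complement of each DNA base (assumes the sequence contains only A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_codons(seq: str, offset: int) -> list[str]:
    """Split seq into complete codons, skipping the first `offset` bases."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq: str) -> dict[int, list[str]]:
    """Map the frame labels 1, 2, 3, -1, -2, -3 to their codons."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = split_codons(seq, offset)     # forward strand
        frames[-(offset + 1)] = split_codons(rev, offset)  # reverse strand
    return frames


frames = six_frames("AGTAAAACTTTAATTGTTGGTTAA")  # the sequence from the slide
```

Incomplete codons at the ends of a frame (the lone `A` and `AA` in frame 2 on the slide) are dropped here, since they cannot be translated.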

6 Extrinsic gene calling
Start with DNA sequence
Translate in all 6 reading frames
Compare your sequence to known protein sequences
Find the ends of each, and call those genes!

7 For example
[Figure: a DNA sequence with similar protein sequences (e.g. from BLAST) bracketed beneath it; the region they cover is marked as a protein-encoding gene.]

8 Uses of extrinsic calling
This is how (most) metagenome ORF calling is done.
Eukaryotic ORF calling, especially using EST sequences.

9 Problems with extrinsic calling
Very slow (depending on search algorithm)
Dependent on your database
Only finds known genes

10 Alternatives to extrinsic gene calling
Intrinsic gene calling
Ab initio gene calling
What are the start codons? ATG
What are the stop codons? TAA, TAG, TGA

11 How frequently do stop codons appear?
At random, approximately once every 20 codons (3 of the 64 possible codons are stops)! A random stretch of 100 amino acids is very likely to contain a stop codon!
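The arithmetic behind this slide can be checked quickly. The sketch below assumes that all 64 codons are equally likely and independent of each other, which real genomes do not obey exactly; the names are illustrative only.

```python
# Assumption: all 64 codons are equally likely and independent of each other.
N_CODONS = 64
N_STOPS = 3  # TAA, TAG, TGA

p_stop = N_STOPS / N_CODONS


def expected_codons_between_stops() -> float:
    """Average spacing between stop codons in random sequence."""
    return 1 / p_stop


def prob_no_stop(n_codons: int) -> float:
    """Probability that n_codons random codons in a row contain no stop."""
    return (1 - p_stop) ** n_codons


spacing = expected_codons_between_stops()  # 64 / 3, a little over 21 codons
p_open_100 = prob_no_stop(100)             # (61 / 64) ** 100
```

Under these assumptions fewer than 1 in 100 random 100-codon stretches stays open, which is why a long ORF is unlikely to be an accident.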

12 How to call ORFs (the easy way)
[Figure: the DNA sequence with its six reading frames, labelled 3, 2, 1 above the DNA and -1, -2, -3 below. The same diagram is repeated on slides 13-18.]

13 Find all the stop codons

14 Find all the ORFs > x amino acids
x is often 100 amino acids

15 Trim to those ORFs that have a start

16 Remove “shadow” ORFs
Short ORFs that overlap others

17 Trim the start sites to first ATG

18 These are the ORFs
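The "easy way" on slides 12-18 could be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation used by any particular tool: the names `scan_strand` and `call_orfs` and the parameters `min_aa` and `max_overlap` are hypothetical, ATG is taken as the only start codon, an open stretch must be at least `min_aa` codons long, and a stretch that runs off the end of the sequence without a stop is ignored. Coordinates are 0-based, half-open, and always given on the forward strand.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_aa):
    """Return (start, end) pairs for one strand, in that strand's coordinates.

    Each ORF runs from the first ATG after a stop to the end of the next stop.
    """
    found = []
    for offset in range(3):
        region_start = offset  # first base after the previous stop
        first_atg = None       # first ATG seen since the previous stop
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and first_atg is None:
                first_atg = i
            elif codon in STOPS:
                # Keep stop-to-stop stretches that are long enough and have a start.
                long_enough = (i - region_start) // 3 >= min_aa
                if long_enough and first_atg is not None:
                    found.append((first_atg, i + 3))
                region_start, first_atg = i + 3, None
    return found


def call_orfs(seq, min_aa=100, max_overlap=0):
    """Call ORFs on both strands and drop shorter ORFs that overlap longer ones."""
    seq = seq.upper()
    n = len(seq)
    orfs = [(s, e, "+") for s, e in scan_strand(seq, min_aa)]
    # Map reverse-strand hits back to forward-strand coordinates.
    orfs += [(n - e, n - s, "-")
             for s, e in scan_strand(reverse_complement(seq), min_aa)]
    # Shadow removal: visit the longest ORFs first, keep one only if it does
    # not overlap an already-kept ORF by more than max_overlap bases.
    kept = []
    for s, e, strand in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        if all(min(e, ke) - max(s, ks) <= max_overlap for ks, ke, _ in kept):
            kept.append((s, e, strand))
    return sorted(kept)


toy = "ATG" + "GCT" * 5 + "TAA"
toy_orfs = call_orfs(toy, min_aa=5)
```

Treating every overlapping shorter ORF as a shadow is a deliberate simplification; real genes can overlap slightly, which is what the hypothetical `max_overlap` parameter is there to allow.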

19 Intrinsic ORF calling using Markov Models

20 Markov Models Based on language processing
Common for gene and protein finding, alignments, and so on

21 What is the most common word?
English: the Spanish: el (la) Portuguese: que

22 Scrabble

23 Scrabble In scrabble, how do they score the letters?
The most abundant letters (easiest to place on the board) are given the lowest score

24 Scrabble
1 point: E, A, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z

25 Frequency of letters

26 Making up sentences
If I want to make up a sentence, I could choose some letters at random, based on how often they occur in the language (i.e. their Scrabble score):
rla bsht es stsfa ohhofsd

27 Let's get clever!
What follows a period (“.”)? Usually a space (“ ”).
What follows a t? Usually an “i” (-tion, -tize, ...).

28 Frequency of two letters
When the first letter is “t” (from 3,269 words): ti % te % ta % th %

29 Level 1 analysis Choose a letter based on the probability that it follows the letter before: s h a n d t u c t h i n e y m e l e o l l d
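Choosing each letter from the letters that followed the previous one is easy to try out. The sketch below is illustrative only: `train_first_order` and `generate` are hypothetical names, and the training text is whatever string is passed in.

```python
import random
from collections import Counter, defaultdict


def train_first_order(text):
    """Count how often each letter follows each other letter."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Build a string by repeatedly sampling a follower of the last letter."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last letter was never followed by anything
        letters, weights = zip(*followers.items())
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)


counts = train_first_order("the theme")
text = generate(counts, "t", 20)
```

Every pair of neighbouring letters in the output was seen in the training text, which is why the result looks more word-like than letters drawn independently.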

30 Levels of analysis
1 letter (a, e, o …): zero order model
2 letters (th, ti, sh …): first order model
3 letters (the, and, …): second order model
4 letters (that, …): third order model

31 Markov models With about 10th order Markov models of English you get complete words and sentences!

33 Markov Models and ORF calling
Codons have three letters (ATG, CAC, GGG, ...).
Use a 2nd order Markov model for ORF calling.
The probability of each letter is predicted from the two letters before it.
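A second-order model of this kind could be sketched as below. It is a minimal illustration, not the scoring used by any specific gene caller: the function names and the `pseudocount` parameter are assumptions, and sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"


def train_second_order(sequences, pseudocount=1.0):
    """Estimate P(base | previous two bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in BASES:
        for b in BASES:
            ctx = a + b
            # The pseudocount keeps unseen letters from getting probability 0.
            total = sum(counts[ctx][x] for x in BASES) + 4 * pseudocount
            model[ctx] = {x: (counts[ctx][x] + pseudocount) / total
                          for x in BASES}
    return model


def log_likelihood(model, seq):
    """Sum of log P(base | previous two bases) along the sequence."""
    seq = seq.upper()
    return sum(math.log(model[seq[i - 2:i]][seq[i]])
               for i in range(2, len(seq)))


model = train_second_order(["ATGATGATGATG"])
score_like_training = log_likelihood(model, "ATGATG")
score_unlike_training = log_likelihood(model, "ATCATC")
```

A sequence that resembles the training data scores higher than one that does not, and that difference is the signal an ORF caller can use.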

34 Scrabble

35 Scrabble (México) Do English and Spanish use the same letters?

36 Scrabble (México)

37 Scrabble (US)
1 point: E, A, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z
Based on the front page of the NY Times!

38 Scrabble (Spanish)
1 point: A, E, O, I, S, N, L, R, U, T
2 points: D, G
3 points: C, B, M, P
4 points: H, F, V, Y
5 points: CH, Q
8 points: J, LL, Ñ, RR, X
10 points: Z

39 What about scrabble scores for DNA?
Will vary with the composition of the organism! Remember, some organisms have high G+C compared to A+T
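The DNA equivalent of the Scrabble letter counts is simply the base composition, which is a zero-order model. A small sketch, with hypothetical function names and ignoring any character other than A, C, G and T:

```python
from collections import Counter


def base_frequencies(seq):
    """Fraction of A, C, G and T in a sequence (other characters are ignored)."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(seq):
    """Fraction of the sequence that is G or C."""
    freqs = base_frequencies(seq)
    return freqs["G"] + freqs["C"]
```

Running `gc_content` on genomes from different organisms would show how much the "letter scores" differ between them.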

40 Markov Models and ORF calling
Use a 2nd order Markov model for ORF calling.
The probability of each letter is predicted from the two letters before it.

41 Problems!
Need to train the Markov model: not all organisms are the same.
Can use phylogenetically close organisms.
Can use “long ORFs”, which are likely to be correct because a long stretch without a stop codon is unlikely to occur at random!

42 Interpolated Markov Model (The imm in GLIMMER)
Markov models of order 1-8 (word size 2-9).
Discard (or ↓ weight) rare words.
Promote (or ↑ weight) common words.
The probability is the weighted sum of the probabilities from all the models (orders 1-8, word sizes 2-9).
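The idea of combining several orders can be illustrated with a deliberately simplified sketch. This is not GLIMMER's actual weighting scheme: here each order is simply weighted by how many times its context was seen in training, so rare words count for little and common words count for more. All names and the `pseudocount` parameter are hypothetical.

```python
from collections import Counter, defaultdict

BASES = "ACGT"


def train_counts(sequences, max_order=8):
    """Count the base that follows every context of length 1..max_order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq)):
            for k in range(1, max_order + 1):
                if i - k >= 0:
                    counts[seq[i - k:i]][seq[i]] += 1
    return counts


def interpolated_prob(counts, context, base, max_order=8, pseudocount=1.0):
    """Weighted average of P(base | last k bases) over k = 1..max_order."""
    weighted_sum = 0.0
    total_weight = 0.0
    for k in range(1, min(max_order, len(context)) + 1):
        seen = counts.get(context[-k:], Counter())
        n = sum(seen.values())
        p_k = (seen[base] + pseudocount) / (n + 4 * pseudocount)
        # Simplification: the weight is the number of times the context was seen.
        weighted_sum += n * p_k
        total_weight += n
    if total_weight == 0:
        return 1 / len(BASES)  # nothing seen at any order: fall back to uniform
    return weighted_sum / total_weight


imm_counts = train_counts(["ATGATGATGATG"], max_order=3)
p_g_after_at = interpolated_prob(imm_counts, "AT", "G", max_order=3)
```

Because every order's estimate sums to 1 over the four bases, the weighted average does too, so the result can be used like any other conditional probability.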

43 RNA genes
As with proteins, two main methods:
Ab initio (intrinsic)
Homology based (extrinsic)

44 Ribosomes Ribosomes are made of proteins and RNA

45 30S subunit from Thermus aquaticus
Blue: protein Orange: rRNA

46 E. coli 16S rRNA secondary structure

47 Variable region Conserved region

48 Variable regions in the 16S rRNA
[Figure: 16S rRNA secondary structure with the variable regions marked. Legend: Vn = the 9 variable regions; (n) = variable loop(s); forward/reverse primers are also shown.]
V1 (6), V2 (8-11), V3 (18), V4 (P23-1, 24), V5 (28, 29), V6 (37), V7 (43), V8 (45, 46), V9 (49)
Van de Peer Y, Chapelle S, De Wachter R. (1996) A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucl. Acids Res. 24:

49 Ribosomes
Ribosomes are made of proteins and RNA.
Prokaryotic ribosome:
Large subunit (50S): 5S and 23S rRNA genes
Small subunit (30S): 16S rRNA gene

50 Finding 16S genes
Easiest way is iterative: BLAST, align, trim.
Problem: secondary structure makes identification of the ends difficult.

51 Finding tRNA genes
Not as easy as rRNA:
Much shorter
Varied sequence
Only conservation is 2° (secondary) structure

52 tRNAscan-SE
Sean Eddy
Use it!

53 How does this relate to tRNA?
tRNA-Phe by Yikrazuul - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons

54 tRNA structure
Start of acceptor stem (7-9 bp)
D-loop: 4-6 bp stem plus loop
Anticodon arm: 6 bp stem plus loop with anticodon
T-loop: 4-5 bp stem plus loop
End of acceptor stem (7-9 bp)
CCA to attach amino acid (may not be in sequence ... added during processing)

