Computational Biology, Part 2: Representing and Finding Sequence Features using Consensus Sequences. Robert F. Murphy. Copyright © 1996-2001. All rights reserved.

1 Computational Biology, Part 2: Representing and Finding Sequence Features using Consensus Sequences. Robert F. Murphy. Copyright © 1996-2001. All rights reserved.

2 Sequence Analysis Tasks
- Representing sequences and sequence features, and finding sequence features using consensus sequences and frequency matrices

3 Sequence features
- A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids
- A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function

4 Sequence features
- Features following an exact pattern
  - restriction enzyme recognition sites
- Features with approximate patterns
  - promoters
  - transcription initiation sites
  - transcription termination sites
  - polyadenylation sites
  - ribosome binding sites
  - protein features

5 Consensus sequences
- A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature
- Consensus sequences are regular expressions

6 Representation of Sequences
- characters
  - simplest
  - easy to read, edit, etc.
- bit-coding
  - more compact, both on disk and in memory
  - comparisons more efficient
  - more to come on this for 03-510 students
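The slides do not say which bit-coding scheme is meant, so the following is only a minimal sketch of one common possibility: a 2-bit code per unambiguous base, packed into a single integer. The code table and the names `CODE`, `pack`, and `unpack` are assumptions made for illustration.

```python
# Hypothetical 2-bit code for unambiguous DNA; the slides do not specify a scheme.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Recover the DNA string; the length is required because leading A's are zero bits."""
    return "".join(BASES[(value >> (2 * i)) & 0b11] for i in reversed(range(length)))


packed = pack("GAATTC")
print(bin(packed), unpack(packed, 6))
```

Under this scheme six bases occupy 12 bits instead of six characters, and two equal-length sequences can be compared with a single integer comparison.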

7 Character representation of sequences
- DNA or RNA
  - use 1-letter codes (e.g., A, C, G, T)
- protein
  - use 1-letter codes
  - can convert to/from 3-letter codes

8 Representing uncertainty in nucleotide sequences
- It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is "possible" at a given position
  - to express ambiguity during sequencing
  - to express variation at a position in a gene during evolution
  - to express ability of an enzyme to tolerate more than one base at a given position of a recognition site

9 Representing uncertainty in nucleotide sequences
- To do this for nucleotides, we use a set of single character codes that represent all possible combinations of bases
- This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code

10 The I.U.B. Code
- A, C, G, T, U
- R = A, G (puRine)
- Y = C, T (pYrimidine)
- S = G, C (Strong hydrogen bonds)
- W = A, T (Weak hydrogen bonds)
- M = A, C (aMino group)
- K = G, T (Keto group)
- B = C, G, T (not A)
- D = A, G, T (not C)
- H = A, C, T (not G)
- V = A, C, G (not T/U)
- N = A, C, G, T/U (iNdeterminate); X or - are sometimes used
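Since consensus sequences are regular expressions, the code table above can be written down as a lookup table and used to translate a consensus sequence into a regular expression. This is a minimal sketch, not part of the original slides: the names `IUB` and `consensus_to_regex` are hypothetical, and treating U as T (so that one table serves a DNA search) is an assumption.

```python
import re

# I.U.B. codes from the slide, mapped to the DNA bases each one allows.
# Assumption: U is treated as T so the same table can be used on DNA text.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Translate a consensus sequence in I.U.B. codes into a regular expression."""
    parts = []
    for code in consensus.upper():
        bases = IUB[code]
        # Unambiguous codes stay literal; ambiguous ones become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


pattern = consensus_to_regex("GTMKAC")  # the AccI site used later in the slides
print(pattern)
print([m.start() for m in re.finditer(pattern, "TTGTATACGGGTCGACAA")])
```

Note that `re.finditer` reports non-overlapping matches only, so a position-by-position scan is the safer choice when sites may overlap.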

11 Representing uncertainty in protein sequences
- Given the size of the amino acid "alphabet", it is not practical to design a set of codes for ambiguity in protein sequences
- Fortunately, ambiguity is less common in protein sequences than in nucleic acid sequences
- Could use bit-coding as for nucleic acids but rarely done

12 Finding occurrences of consensus sequences
- Example: recognition site for a restriction enzyme
  - EcoRI recognizes GAATTC
  - AccI recognizes GTMKAC
- Basic Algorithm
  1. Start with first character of sequence to be searched
  2. See if enzyme site matches starting at that position
  3. Advance to next character of sequence to be searched
  4. Repeat Steps 2 and 3 until all positions have been tested
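The basic algorithm can be sketched directly as a position-by-position scan. This is an illustrative version, not the course's spreadsheet implementation: the function name `find_sites` is hypothetical, positions are reported 0-based, and U is treated as T by assumption.

```python
# I.U.B. codes mapped to the DNA bases they allow (assumption: U is treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_sites(sequence, site):
    """Return the 0-based start positions at which `site` matches `sequence`."""
    sequence = sequence.upper()
    site = site.upper()
    hits = []
    # Steps 1, 3 and 4: visit every start position that leaves room for the site.
    for start in range(len(sequence) - len(site) + 1):
        window = sequence[start:start + len(site)]
        # Step 2: every base in the window must be allowed by the code at that position.
        if all(base in IUB[code] for base, code in zip(window, site)):
            hits.append(start)
    return hits


print(find_sites("CCGAATTCGG", "GAATTC"))          # EcoRI
print(find_sites("TTGTATACGGGTCGACAA", "GTMKAC"))  # AccI
```

Because every start position is tested, overlapping occurrences are all reported.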

13 Interactive Demonstration
- Copies of all demo spreadsheets (Excel files) can be found on the course web page
- (A1 Pattern matching demo)

14 Block Diagram for Search with a Consensus Sequence
- Inputs: sequence to be searched; consensus sequence (in IUB codes)
- Search engine
- Output: list of positions where matches occur

15 Sequence Analysis Tasks
- Calculating the probability of finding a sequence pattern

16 Statistics of pattern appearance
- Goal: Determine the significance of observing a feature (pattern)
- Method: Estimate the probability that the pattern would occur randomly in a given sequence. Three different methods:
  - Assume all nucleotides are equally frequent
  - Use measured frequencies of each nucleotide (mononucleotide frequencies)
  - Use measured frequencies with which a given nucleotide follows another (dinucleotide frequencies)

17 Determining mononucleotide frequencies
- Count how many times each nucleotide appears in sequence
- Divide (normalize) by total number of nucleotides
- Result: $f_A$, the mononucleotide frequency of A (frequency that A is observed)
- Define: $p_A$, the mononucleotide probability that a nucleotide will be an A
- $p_A$ assumed to equal $f_A$
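The counting and normalizing steps above can be sketched in a few lines. The function name `mononucleotide_frequencies` is hypothetical, and the sketch assumes a non-empty DNA sequence; characters outside the alphabet are ignored.

```python
from collections import Counter


def mononucleotide_frequencies(sequence, alphabet="ACGT"):
    """Count each nucleotide and normalize by the total number of nucleotides."""
    counts = Counter(sequence.upper())
    # Only characters in the alphabet contribute to the total.
    total = sum(counts[base] for base in alphabet)
    return {base: counts[base] / total for base in alphabet}


freqs = mononucleotide_frequencies("TTTAACTGGG")
print(freqs)
```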

18 Determining dinucleotide frequencies
- Make 4 x 4 matrix, one element for each ordered pair of nucleotides
- Zero all elements
- Go through sequence linearly, adding one to matrix entry corresponding to the pair of sequence elements observed at that position
- Divide by total number of dinucleotides
- Result: $f_{AC}$, the dinucleotide frequency of AC (frequency that AC is observed out of all dinucleotides)
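A minimal sketch of the matrix procedure above, using a nested dictionary as the 4 x 4 matrix. The function name `dinucleotide_frequencies` is hypothetical. The sketch counts overlapping pairs, so a sequence of length n contributes n - 1 dinucleotides, and it assumes the sequence has at least two bases.

```python
def dinucleotide_frequencies(sequence, alphabet="ACGT"):
    """Return f[first][second], the frequency of each ordered pair of nucleotides."""
    sequence = sequence.upper()
    # 4 x 4 matrix with all elements zeroed.
    counts = {first: {second: 0 for second in alphabet} for first in alphabet}
    total = 0
    # Walk the sequence linearly, one overlapping pair per position.
    for first, second in zip(sequence, sequence[1:]):
        if first in counts and second in counts:
            counts[first][second] += 1
            total += 1
    # Divide by the total number of dinucleotides counted.
    return {f: {s: counts[f][s] / total for s in alphabet} for f in alphabet}


f = dinucleotide_frequencies("TTTAACTGGG")
print(f["A"]["C"], f["T"]["T"])
```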

19 Determining conditional dinucleotide probabilities
- Divide each dinucleotide frequency by the mononucleotide frequency of the first nucleotide
- Result: $p^*_{AC}$, the conditional dinucleotide probability of observing a C given an A
- $p^*_{AC} = f_{AC} / f_A$
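One way to sketch this calculation is to normalize each row of the dinucleotide count matrix. That equals $f_{AC} / f_A$ when $f_A$ is measured over the nucleotides that begin a dinucleotide (all but the last one), and it guarantees that each row sums to 1; for long sequences the difference from the whole-sequence $f_A$ is negligible. The function name and the choice to return an all-zero row for a nucleotide never seen as the first of a pair are assumptions.

```python
def conditional_dinucleotide_probabilities(sequence, alphabet="ACGT"):
    """Return cond[first][second], the probability of `second` given `first`."""
    sequence = sequence.upper()
    counts = {first: {second: 0 for second in alphabet} for first in alphabet}
    for first, second in zip(sequence, sequence[1:]):
        if first in counts and second in counts:
            counts[first][second] += 1
    cond = {}
    for first in alphabet:
        row_total = sum(counts[first].values())
        # Assumption: a nucleotide never seen as the first of a pair gets an all-zero row.
        cond[first] = {
            second: (counts[first][second] / row_total if row_total else 0.0)
            for second in alphabet
        }
    return cond


cond = conditional_dinucleotide_probabilities("TTTAACTGGG")
print(cond["A"]["C"], cond["C"]["T"])
```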

20 Illustration of probability calculation
- What is the probability of observing the sequence feature ART (A followed by a purine, either A or G, followed by a T)?
- Using equal mononucleotide frequencies
  - $p_A = p_C = p_G = p_T = 1/4$
  - $p_{ART} = 1/4 \times (1/4 + 1/4) \times 1/4 = 1/32$

21 Illustration (continued)
- Using observed mononucleotide frequencies:
  - $p_{ART} = p_A (p_A + p_G) p_T$
- Using dinucleotide frequencies:
  - $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$
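Both formulas generalize to any pattern written in I.U.B. codes. The sketch below is one possible way to compute them, not the course's own implementation: the mononucleotide version multiplies, position by position, the summed probabilities of the allowed bases, and the dinucleotide version sums over every base path consistent with the pattern. The function names are hypothetical and U is treated as T by assumption.

```python
# I.U.B. codes mapped to the DNA bases they allow (assumption: U is treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mono_probability(pattern, p):
    """Pattern probability from mononucleotide probabilities p[base]."""
    prob = 1.0
    for code in pattern.upper():
        prob *= sum(p[base] for base in IUB[code])
    return prob


def dinucleotide_probability(pattern, p, cond):
    """Pattern probability from p[base] and conditional probabilities cond[prev][base]."""
    pattern = pattern.upper()
    # state[b] = probability that the pattern so far is matched and ends in base b.
    state = {base: p[base] for base in IUB[pattern[0]]}
    for code in pattern[1:]:
        state = {
            base: sum(prob * cond[prev][base] for prev, prob in state.items())
            for base in IUB[code]
        }
    return sum(state.values())


equal = {base: 0.25 for base in "ACGT"}
uniform_cond = {base: dict(equal) for base in "ACGT"}
print(mono_probability("ART", equal))
print(dinucleotide_probability("ART", equal, uniform_cond))
```

With equal frequencies both functions reduce to the 1/32 obtained on the previous slide, and for ART the second function expands to exactly the dinucleotide formula above.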

22 Another illustration
- What is $p_{ACT}$ in the sequence TTTAACTGGG?
  - $f_A = 2/10$, $f_C = 1/10$
  - $p_A = 0.2$
  - $f_{AC} = 1/10$, $f_{CT} = 1/10$
  - $p^*_{AC} = 0.1/0.2 = 0.5$, $p^*_{CT} = 0.1/0.1 = 1$
- $p_{ACT} = p_A \, p^*_{AC} \, p^*_{CT} = 0.2 \times 0.5 \times 1 = 0.1$
- (would have been $1/5 \times 1/10 \times 4/10 = 0.008$ using mononucleotide frequencies)
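The arithmetic on this slide can be checked with a short script. The slide divides the dinucleotide counts by the sequence length (10), although the sequence contains only 9 overlapping dinucleotides; the sketch below sidesteps that choice by estimating each conditional probability as the count of the pair divided by the number of pairs that start with the first base, which gives the same 0.5 and 1. The helper names `freq` and `cond` are hypothetical.

```python
seq = "TTTAACTGGG"
pairs = list(zip(seq, seq[1:]))  # the 9 overlapping dinucleotides


def freq(base):
    """Mononucleotide frequency of `base` in seq."""
    return seq.count(base) / len(seq)


def cond(first, second):
    """Estimated probability of `second` given `first`, from pair counts."""
    starts = sum(1 for a, _ in pairs if a == first)
    hits = sum(1 for a, b in pairs if (a, b) == (first, second))
    return hits / starts


p_act_mono = freq("A") * freq("C") * freq("T")
p_act_dinuc = freq("A") * cond("A", "C") * cond("C", "T")
print(p_act_mono, p_act_dinuc)
```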

23 Expected number and spacing
- Probabilities are per nucleotide
- How do we calculate the expected number of features in a sequence of length L?
  - Expected number (for large L) ≈ $Lp$
- How do we calculate the expected spacing between features?
  - Expected spacing between ART features = $1/p_{ART}$
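A minimal sketch of these two calculations follows. The function names are hypothetical. The optional `pattern_length` argument is an addition not found on the slide: it counts only the start positions at which the whole pattern fits, which approaches $Lp$ when L is large.

```python
def expected_count(p, length, pattern_length=None):
    """Expected number of occurrences in a sequence of the given length."""
    if pattern_length is None:
        return length * p  # large-L approximation from the slide
    # Only start positions that leave room for the whole pattern are counted.
    return max(length - pattern_length + 1, 0) * p


def expected_spacing(p):
    """Expected distance, in nucleotides, between successive occurrences."""
    return 1.0 / p


p_art = 1 / 32        # ART with equal base frequencies
p_ecori = 0.25 ** 6   # GAATTC with equal base frequencies
print(expected_count(p_art, 3200), expected_spacing(p_art))
print(expected_spacing(p_ecori))
```

Under equal base frequencies a six-base site such as the EcoRI site is therefore expected about once every 4096 nucleotides.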

24 Interactive Demonstration
- For this demo, the maximum number of iterations during calculations must be set to 1. In Excel, select Options... from Tools, select the Calculation tab, check the Iteration box and enter 1 for Maximum Iterations (see following slide for illustration). These options are loaded along with the spreadsheet IF IT IS THE FIRST ONE LOADED.

25 [Image-only slide: illustration of the Excel Calculation options described on slide 24]

26 Interactive Demonstration
- (A3 Mono- and Dinucleotide Frequencies)

27 Renewals
- For greatest accuracy in calculating the spacing of features, we need to consider renewals of a feature (taking into account whether a feature can overlap with a neighboring copy of that feature)
- See Eric S. Lander (1989) in Mathematical Methods for DNA Sequences, M.S. Waterman (ed.), CRC Press, Inc., Boca Raton, FL

28 Comments
- Note that we are calculating probabilities given a consensus sequence and therefore how well such sites might match typical sites is not known. The probability analysis applies well to restriction enzymes but poorly to promoters.
- Be careful to distinguish dinucleotide frequencies, conditional dinucleotide probabilities, and expected number of occurrences.

29 Summary, Part 2
- Sequences can be represented with characters or binary (bit-coded) values
- Ambiguity in a nucleic acid sequence can be represented using I.U.B. codes
- Features of more than one sequence element can be represented by a string of characters (including ambiguity), e.g., restriction enzyme sites

30 Summary, Part 2
- Mononucleotide and dinucleotide frequencies can be used to estimate the probability of observing a feature

