Published by Kristopher Harrington. Modified over 8 years ago.
1
Computational Biology, Part 3
Representing and Finding Sequence Features using Frequency Matrices
Robert F. Murphy
Copyright 1996-2001. All rights reserved.
2
Sequence Analysis Tasks
- Calculating the probability of finding a region with a particular base composition
3
Statistics of AT- or GC-rich regions
- What is the probability of observing a “run” of the same nucleotide (e.g., 25 A’s)?
- Let $p_x$ be the mononucleotide probability of nucleotide x.
- The per-nucleotide probability of a run of N consecutive x’s is $p_x^N$.
- The probability of occurrence in a sequence of length L, where L is much longer than N, is $\approx L\,p_x^N$.
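As a rough illustration of the run calculation, the sketch below evaluates the per-nucleotide run probability (p_x raised to the power N) and the approximation obtained by multiplying by L. The function names and the example values (equal base frequencies of 0.25, a sequence of one million bases) are assumptions made for illustration and do not come from the slides.

```python
def run_probability(p_x: float, n: int) -> float:
    """Probability that a run of n consecutive x's starts at a given position."""
    return p_x ** n


def approx_occurrence(p_x: float, n: int, length: int) -> float:
    """Approximate probability of seeing the run somewhere in a sequence.

    Multiplies the per-position probability by the sequence length. This is
    only a good approximation when length is much larger than n and the
    result is small; strictly there are length - n + 1 start positions.
    """
    return length * run_probability(p_x, n)


# Assumed example: 25 A's with p_A = 0.25 in a sequence of 1,000,000 bases.
per_position = run_probability(0.25, 25)             # about 8.9e-16
in_sequence = approx_occurrence(0.25, 25, 1_000_000)  # about 8.9e-10
print(per_position, in_sequence)
```

Because the product L times p_x^N is really an expected count, it should be read as a probability only when it is far below 1.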
4
Statistics of AT- or GC-rich regions
- What if J “mismatches” are allowed?
- Let $p_y$ be the probability of observing a different nucleotide (normally $p_y = 1 - p_x$).
- The probability of observing N-J of nucleotide x and J of nucleotide y in a region of length N is
  $p_x^{N-J}\, p_y^{J}\, C(N,J)$, where $C(N,J) = \dfrac{N!}{(N-J)!\,J!}$
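The mismatch formula can be checked numerically with a short sketch. It uses `math.comb` from the Python standard library for C(N,J); the function name `prob_exact_mismatches` and the example numbers are assumptions made for illustration.

```python
from math import comb


def prob_exact_mismatches(p_x: float, n: int, j: int, p_y=None) -> float:
    """Probability of exactly n - j x's and j mismatches in a region of length n.

    If p_y is not supplied it defaults to 1 - p_x, the usual case on the slide.
    """
    if p_y is None:
        p_y = 1.0 - p_x
    return comb(n, j) * p_x ** (n - j) * p_y ** j


# Assumed example: a 25-base region that is all A except for 2 mismatches,
# with p_A = 0.25.
print(prob_exact_mismatches(0.25, 25, 2))
```

With j = 0 the expression reduces to the plain run probability, which is a quick consistency check.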
5
Statistics of AT- or GC-rich regions
- As before, we can multiply by L to approximate the probability of observing that combination in a sequence of length L.
- Note that this is the probability of observing exactly N-J matches and exactly J mismatches. We may also wish to know the probability of finding at least N-J matches, which requires summing the probability of exactly I mismatches for I = 0 to I = J.
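A minimal sketch of the "at least N-J matches" calculation follows. It sums the exact-mismatch probability over I = 0 to J and then multiplies by L as the slide suggests. The function names and the example values are illustrative assumptions, and p_y is taken to be 1 - p_x.

```python
from math import comb


def prob_at_most_mismatches(p_x: float, n: int, j: int) -> float:
    """Probability of at least n - j x's (at most j mismatches) in n positions."""
    p_y = 1.0 - p_x  # assumption: every non-x base counts as a mismatch
    return sum(comb(n, i) * p_x ** (n - i) * p_y ** i for i in range(j + 1))


def approx_in_sequence(p_x: float, n: int, j: int, length: int) -> float:
    """Approximation for a sequence of the given length (valid when small)."""
    return length * prob_at_most_mismatches(p_x, n, j)


# Assumed example: at least 23 A's in a 25-base window, p_A = 0.25,
# in a sequence of 1,000,000 bases.
print(approx_in_sequence(0.25, 25, 2, 1_000_000))
```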
6
Statistics of AT- or GC-rich regions
(A4 Enriched seq prob demo)
7
Sequence Analysis Tasks
- Calculating the probability of finding a sequence pattern
- Calculating the probability of finding a region with a particular base composition
- Representing and finding sequence features/motifs using frequency matrices
8
Describing features using frequency matrices
- Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences
- Need to describe how often particular bases are found in particular positions in a sequence feature
9
Describing features using frequency matrices
- Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
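The definition can be made concrete with a small sketch that builds a frequency matrix from an aligned set of sequences. The four example sites are made up for illustration, and the representation (a dictionary mapping each alphabet character to a row of m frequencies) is one assumed layout among several reasonable ones.

```python
from collections import Counter


def frequency_matrix(aligned, alphabet="ACGT"):
    """Return {character: [frequency at position 0, ..., position m-1]}."""
    m = len(aligned[0])
    if any(len(seq) != m for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    matrix = {ch: [0.0] * m for ch in alphabet}
    for pos in range(m):
        # Count how often each character appears in this column.
        counts = Counter(seq[pos] for seq in aligned)
        for ch in alphabet:
            matrix[ch][pos] = counts[ch] / len(aligned)
    return matrix


# Hypothetical aligned sites, not taken from the presentation.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
fm = frequency_matrix(sites)
for ch in "ACGT":
    print(ch, fm[ch])
```

Each column of the resulting n by m matrix sums to 1, since every aligned sequence contributes exactly one character per position.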
10
Frequency matrices (continued)
- Three uses of frequency matrices
  - Describe a sequence feature
  - Calculate probability of occurrence of feature in a random sequence
  - Calculate degree of match between a new sequence and a feature
11
Interactive Demonstration
(A2 Frequency matrix demo)
12
Frequency Matrices, PSSMs, and Profiles
- A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores (e.g., by taking logs)
- PSSMs are also called Position Weight Matrices (PWMs) or Profiles
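One common way to turn frequencies into scores is a log-odds transform against expected (background) frequencies. The sketch below shows that approach under stated assumptions: the slides only say "e.g., by taking logs", so the base-2 log-odds form, the pseudocount that avoids taking the log of zero, and all names and numbers here are illustrative choices rather than the presentation's own method.

```python
from math import log2


def pssm_from_frequencies(freqs, background, pseudocount=0.01):
    """Convert {symbol: [frequencies]} into {symbol: [log-odds scores]}.

    The pseudocount (an assumption) is added to every frequency and the
    column is renormalized, so a frequency of 0 still gets a finite score.
    """
    n_symbols = len(freqs)
    pssm = {}
    for symbol, row in freqs.items():
        scores = []
        for f in row:
            adjusted = (f + pseudocount) / (1.0 + pseudocount * n_symbols)
            scores.append(log2(adjusted / background[symbol]))
        pssm[symbol] = scores
    return pssm


# Hypothetical two-position feature: position 0 is always T,
# position 1 shows no preference.
freqs = {
    "A": [0.0, 0.25],
    "C": [0.0, 0.25],
    "G": [0.0, 0.25],
    "T": [1.0, 0.25],
}
background = {base: 0.25 for base in "ACGT"}  # assumed uniform expectation
pssm = pssm_from_frequencies(freqs, background)
print(pssm)
```

Positive scores mark bases seen more often than expected at a position, negative scores mark bases seen less often, and a score near zero means no preference.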
13
Finding occurrences of a sequence feature using a Profile
- As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches
- For each position, we calculate a score by “looking up” the value corresponding to the base at that position
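The look-up procedure can be sketched as a sliding window: for every candidate start position, look up the score of each base in the window and add the scores together. Summation is the usual choice when the profile holds log scores, but it is an assumption here, as are the toy profile values and the function name `scan`.

```python
def scan(sequence, pssm):
    """Return a list of (start position, score) for every window in sequence."""
    width = len(next(iter(pssm.values())))  # feature length m
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Look up the score for the base found at each feature position.
        score = sum(pssm[base][i] for i, base in enumerate(window))
        results.append((start, score))
    return results


# Toy profile of length 3 that favors "ATA"; the values are hypothetical.
toy_pssm = {
    "A": [1, -1, 1],
    "C": [-1, -1, -1],
    "G": [-1, -1, -1],
    "T": [-1, 1, -1],
}
print(scan("GGATAC", toy_pssm))
# → [(0, -1), (1, -3), (2, 3), (3, -3)]
```

The highest score, at position 2, corresponds to the window "ATA". The sketch assumes the sequence contains only characters present in the profile.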
14
Interactive Demonstration
(A10 Searching with Profile demo)
15
Block Diagram for Building a PSSM
- Inputs: Set of Aligned Sequence Features; Expected frequencies of each sequence element
- Process: PSSM builder
- Output: PSSM
16
Block Diagram for Searching with a PSSM
- Inputs: PSSM; Set of Sequences to search; Threshold
- Process: PSSM search
- Outputs: Sequences that match above threshold; Positions and scores of matches
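The search stage of the diagram can be sketched as a function that takes a PSSM, a set of sequences, and a threshold, and reports which sequences match along with the positions and scores of the matches. The function name, the dictionary-based inputs, the toy profile, and the choice to accept scores at or above the threshold are all assumptions made for illustration.

```python
def pssm_search(pssm, sequences, threshold):
    """Return {sequence name: [(position, score), ...]} for matching sequences."""
    width = len(next(iter(pssm.values())))
    hits = {}
    for name, seq in sequences.items():
        matches = []
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            score = sum(pssm[base][i] for i, base in enumerate(window))
            if score >= threshold:  # assumption: ties with the threshold count
                matches.append((start, score))
        if matches:
            hits[name] = matches
    return hits


# Hypothetical profile favoring "ATA" and three made-up sequences.
toy_pssm = {
    "A": [1, -1, 1],
    "C": [-1, -1, -1],
    "G": [-1, -1, -1],
    "T": [-1, 1, -1],
}
sequences = {"seq1": "GGATAC", "seq2": "CCCCCC", "seq3": "ATAATA"}
print(pssm_search(toy_pssm, sequences, threshold=3))
# → {'seq1': [(2, 3)], 'seq3': [(0, 3), (3, 3)]}
```

Raising the threshold trades sensitivity for specificity: fewer sequences are reported, but the ones that remain match the feature more closely.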
17
Block Diagram for Searching for sequences related to a family with a PSSM
- Stage 1, PSSM builder: takes the Set of Aligned Sequence Features and the Expected frequencies of each sequence element, and produces the PSSM
- Stage 2, PSSM search: takes the PSSM, the Set of Sequences to search, and the Threshold
- Outputs: Sequences that match above threshold; Positions and scores of matches
18
Consensus sequences vs. frequency matrices
- Should I use a consensus sequence or a frequency matrix to describe my site?
  - If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence. Example: Restriction enzyme recognition sites
  - If some allowed characters are "better" than others, use a frequency matrix. Example: Promoter sequences
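For the consensus-sequence case, a site written in IUB ambiguity codes can be searched for by expanding each code into a character class. The sketch below uses the standard nucleotide ambiguity codes and Python's `re` module; the function name, the test sequence, and the use of the HincII recognition site GTYRAC as the example are illustrative choices.

```python
import re

# Standard IUB/IUPAC nucleotide ambiguity codes.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(consensus, sequence):
    """Return start positions of every (possibly overlapping) match."""
    pattern = "".join(IUB[code] for code in consensus)
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence)]


# GTYRAC matches GTCGAC, GTCAAC, GTTGAC and GTTAAC equally well.
print(find_sites("GTYRAC", "AAGTCGACTTGTTAACGG"))
# → [2, 10]
```

Every allowed base at an ambiguous position counts as an equally good match, which is exactly the information a frequency matrix would refine.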
19
Consensus sequences vs. frequency matrices
- Advantages of consensus sequences: smaller description, quicker comparison
- Disadvantage: lose quantitative information on preferences at certain locations
20
Summary, Part 3
- Probability of finding sequences enriched in one or more bases can be calculated using probability of consecutive bases multiplied by number of combinations allowed
- Complex sequence features can be described using frequency matrices
- Frequency matrices can be used for quantitative estimates of the degree to which a given sequence matches a feature