Computational Biology, Part 3: Representing and Finding Sequence Features using Frequency Matrices. Robert F. Murphy. Copyright © 1996-2001. All rights reserved.



Sequence Analysis Tasks
- Calculating the probability of finding a region with a particular base composition

Statistics of AT- or GC-rich regions
- What is the probability of observing a “run” of the same nucleotide (e.g., 25 A’s)?
- Let $p_x$ be the mononucleotide probability of nucleotide x.
- The per-nucleotide probability of a run of N consecutive x’s is $p_x^N$.
- The probability of occurrence in a sequence of length L (with L longer than N) is $\approx L \, p_x^N$.
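The run calculation on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the example values (equal base frequencies, a sequence of one million bases) are assumptions, not part of the original slides. Note that L * p_x^N is strictly an expected count of run start positions, so it only approximates a probability when the result is much smaller than 1.

```python
def run_probability(p_x: float, n: int) -> float:
    """Per-position probability that n consecutive bases are all x: p_x ** n."""
    return p_x ** n


def approx_run_probability_in_sequence(p_x: float, n: int, seq_len: int) -> float:
    """Slide approximation L * p_x ** n; meaningful only when the result is << 1."""
    return seq_len * run_probability(p_x, n)


# Illustrative values (assumed): p_A = 0.25 and a sequence of 1,000,000 bases.
p_a = 0.25
print(run_probability(p_a, 25))                                # per-position probability of 25 A's
print(approx_run_probability_in_sequence(p_a, 25, 1_000_000))  # approximate probability in the sequence
```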

Statistics of AT- or GC-rich regions
- What if J “mismatches” are allowed?
- Let $p_y$ be the probability of observing a different nucleotide (normally $p_y = 1 - p_x$).
- The probability of observing N-J of nucleotide x and J of nucleotide y in a region of length N is $p_x^{N-J} \, p_y^{J} \, C(N,J)$, where $C(N,J) = \frac{N!}{(N-J)! \, J!}$.
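As a rough sketch, the mismatch formula on this slide is a binomial term and can be evaluated with Python's standard `math.comb`. The function name and the example numbers are assumptions chosen for illustration.

```python
from math import comb


def prob_exact_mismatches(p_x: float, n: int, j: int, p_y=None) -> float:
    """Probability of exactly n-j bases x and j other bases in a window of length n.

    Implements C(n, j) * p_x**(n-j) * p_y**j; p_y defaults to 1 - p_x.
    """
    if p_y is None:
        p_y = 1.0 - p_x
    return comb(n, j) * p_x ** (n - j) * p_y ** j


# Illustrative values (assumed): a 25-base window with exactly 2 non-A bases, p_A = 0.25.
print(prob_exact_mismatches(0.25, 25, 2))
```

With j = 0 the expression reduces to p_x ** n, the pure-run case.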

Statistics of AT- or GC-rich regions
- As before, we can multiply by L to approximate the probability of observing that combination in a sequence of length L.
- Note that this is the probability of observing exactly N-J matches and exactly J mismatches. We may also wish to know the probability of finding at least N-J matches, which requires summing the probability of exactly I mismatches for I = 0 to I = J: $\sum_{I=0}^{J} C(N,I) \, p_x^{N-I} \, p_y^{I}$.
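The "at least N-J matches" case can be sketched as a sum of binomial terms over 0 to J mismatches. The function name and the example values below are assumptions for illustration; p_y is taken to be 1 - p_x.

```python
from math import comb


def prob_at_most_mismatches(p_x: float, n: int, j: int) -> float:
    """Probability of at least n-j bases x (at most j mismatches) in a window of length n."""
    p_y = 1.0 - p_x
    # Sum the exact-mismatch probabilities for i = 0, 1, ..., j mismatches.
    return sum(comb(n, i) * p_x ** (n - i) * p_y ** i for i in range(j + 1))


# Illustrative values (assumed): at least 23 A's in a 25-base window, p_A = 0.25,
# then multiplied by an assumed sequence length L = 1,000,000 as on the slide.
window_prob = prob_at_most_mismatches(0.25, 25, 2)
print(window_prob)
print(1_000_000 * window_prob)
```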

Statistics of AT- or GC-rich regions
- (A4 Enriched seq prob demo)

Sequence Analysis Tasks
- Calculating the probability of finding a sequence pattern
- Calculating the probability of finding a region with a particular base composition
- Representing and finding sequence features/motifs using frequency matrices

Describing features using frequency matrices
- Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences
- Need to describe how often particular bases are found in particular positions in a sequence feature

Describing features using frequency matrices
- Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
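A minimal sketch of this definition, assuming the aligned sequences are equal-length strings over the DNA alphabet. The function name, the dictionary-of-rows layout, and the example sites are made up for illustration and do not come from the slides.

```python
from collections import Counter


def frequency_matrix(aligned, alphabet="ACGT"):
    """Return {character: [frequency at position 0, ..., position m-1]}.

    Each row corresponds to one alphabet character (n rows), each column
    to one position in the aligned feature (m columns).
    """
    m = len(aligned[0])
    if any(len(seq) != m for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    matrix = {ch: [0.0] * m for ch in alphabet}
    for pos in range(m):
        counts = Counter(seq[pos] for seq in aligned)
        for ch in alphabet:
            matrix[ch][pos] = counts[ch] / len(aligned)
    return matrix


# Hypothetical aligned sites, invented for this example.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
for base, row in frequency_matrix(sites).items():
    print(base, row)
```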

Frequency matrices (continued)
- Three uses of frequency matrices
  - Describe a sequence feature
  - Calculate probability of occurrence of feature in a random sequence
  - Calculate degree of match between a new sequence and a feature

Interactive Demonstration
- (A2 Frequency matrix demo)

Frequency Matrices, PSSMs, and Profiles
- A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores (e.g., by taking logs)
- PSSMs are also called Position Weight Matrices (PWMs) or Profiles
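One common way to "take logs" is a log-odds score: the log of the observed frequency divided by the expected (background) frequency of that base. The sketch below assumes that scheme, base-2 logs, and a small pseudocount to avoid log(0); the slides do not specify any of these choices, and the function name and example numbers are hypothetical.

```python
from math import log2


def pssm_from_frequencies(freqs, background, pseudocount=0.01):
    """Convert {base: [frequencies]} into {base: [log-odds scores]}.

    The pseudocount (an assumption, not from the slides) is added to every
    frequency and the column is renormalized so that a frequency of 0 still
    yields a finite score.
    """
    n = len(freqs)  # alphabet size
    pssm = {}
    for base, column in freqs.items():
        pssm[base] = [
            log2(((f + pseudocount) / (1.0 + n * pseudocount)) / background[base])
            for f in column
        ]
    return pssm


# Hypothetical one-column frequency matrix and uniform background.
freqs = {"A": [0.5], "C": [0.25], "G": [0.25], "T": [0.0]}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(pssm_from_frequencies(freqs, background))
```

Positive scores mark bases that are more frequent in the feature than expected by chance, and negative scores mark bases that are less frequent.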

Finding occurrences of a sequence feature using a Profile
- As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches
- For each position, we calculate a score by “looking up” the value corresponding to the base at that position
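The lookup procedure described here can be sketched as a sliding window: for each start position, sum the profile scores of the bases in the window. The function name, the profile layout ({base: [score per position]}), and the toy two-position profile are assumptions for illustration.

```python
def scan_with_profile(sequence, pssm):
    """Return a list of (start, score) for every window of the profile's width."""
    width = len(next(iter(pssm.values())))
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Look up the score of each base at its position within the window.
        score = sum(pssm[base][i] for i, base in enumerate(window))
        results.append((start, score))
    return results


# Toy profile (made up) that favors the dinucleotide "AG".
toy_pssm = {
    "A": [1, -1],
    "C": [-1, -1],
    "G": [-1, 1],
    "T": [-1, -1],
}
print(scan_with_profile("TAGA", toy_pssm))  # [(0, -2), (1, 2), (2, -2)]
```

Characters outside the profile's alphabet (for example N) would raise a KeyError in this sketch; real data would need an explicit policy for them.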

Interactive Demonstration
- (A10 Searching with Profile demo)

Block Diagram for Building a PSSM
[Figure: the PSSM builder takes a set of aligned sequence features and the expected frequencies of each sequence element as inputs, and produces a PSSM.]

Block Diagram for Searching with a PSSM
[Figure: the PSSM search takes a PSSM, a set of sequences to search, and a threshold as inputs, and produces the sequences that match above threshold together with the positions and scores of the matches.]
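The search box in this diagram could look roughly like the following sketch: it takes a PSSM, a set of named sequences, and a threshold, and reports each match at or above the threshold with its position and score. The function name, the data layout, and the toy values are assumptions, not the course's actual tool.

```python
def search_with_pssm(sequences, pssm, threshold):
    """Return (name, start, score) for every window scoring >= threshold.

    sequences: {name: sequence string}; pssm: {base: [score per position]}.
    """
    width = len(next(iter(pssm.values())))
    hits = []
    for name, seq in sequences.items():
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            score = sum(pssm[base][i] for i, base in enumerate(window))
            if score >= threshold:
                hits.append((name, start, score))
    return hits


# Toy three-position profile (made up) that favors "TAT".
toy_pssm = {
    "A": [-1, 2, -1],
    "C": [-1, -1, -1],
    "G": [-1, -1, -1],
    "T": [2, -1, 2],
}
toy_sequences = {"seq1": "GGTATC", "seq2": "CCCCCC"}
print(search_with_pssm(toy_sequences, toy_pssm, threshold=5))  # [('seq1', 2, 6)]
```

The set of sequences that match above threshold is then just the distinct names appearing in the returned hits.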

Block Diagram for Searching for sequences related to a family with a PSSM
[Figure: the PSSM builder takes a set of aligned sequence features and the expected frequencies of each sequence element and produces a PSSM. That PSSM, a set of sequences to search, and a threshold feed the PSSM search, which produces the sequences that match above threshold together with the positions and scores of the matches.]

Consensus sequences vs. frequency matrices
- Should I use a consensus sequence or a frequency matrix to describe my site?
  - If all allowed characters at a given position are equally "good", use IUB codes to create consensus sequence
    - Example: Restriction enzyme recognition sites
  - If some allowed characters are "better" than others, use frequency matrix
    - Example: Promoter sequences
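For the consensus-sequence case, matching with IUB (IUPAC) nucleotide codes can be sketched as a per-position set membership test. The code table below is the standard nucleotide ambiguity code; the function name, the pattern GTYRAC, and the test sequence are illustrative assumptions.

```python
# Standard IUB/IUPAC nucleotide ambiguity codes.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_consensus(sequence, consensus):
    """Return the start positions where every base is allowed by the consensus code."""
    width = len(consensus)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if all(sequence[start + k] in IUB_CODES[code] for k, code in enumerate(consensus))
    ]


# Illustrative degenerate pattern: Y allows C or T, R allows A or G.
print(find_consensus("AAGTCGACTTGTTAACG", "GTYRAC"))  # [2, 10]
```

Every allowed base counts equally here, which is exactly the situation in which a consensus sequence loses nothing compared with a frequency matrix.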

Consensus sequences vs. frequency matrices
- Advantages of consensus sequences: smaller description, quicker comparison
- Disadvantage: lose quantitative information on preferences at certain locations

Summary, Part 3
- Probability of finding sequences enriched in one or more bases can be calculated using probability of consecutive bases multiplied by number of combinations allowed
- Complex sequence features can be described using frequency matrices
- Frequency matrices can be used for quantitative estimates of the degree to which a given sequence matches a feature