Finding Patterns
Gopalan Vivek, Lee Teck Kwong Bernett

Recap: Multiple Sequence Alignment
           ....|....| ....|....| ....|....| ....|....| ....|....|
Sp1        ACTCPYCKDS EGRGSG---- DPGKKKQHIC HIQGCGKVYG KTSHLRAHLR
Sp2        ACTCPNCKDG EKRS------ GEQGKKKHVC HIPDCGKTFR KTSLLRAHVR
Sp3        ACTCPNCKEG GGRGTN---- -LGKKKQHIC HIPGCGKVYG KTSHLRAHLR
Sp4        ACSCPNCREG EGRGSN---- EPGKKKQHIC HIEGCGKVYG KTSHLRAHLR
DrosBtd    RCTCPNCTNE MSGLPPIVGP DERGRKQHIC HIPGCERLYG KASHLKTHLR
DrosSp     TCDCPNCQEA ERLGPAGV-- HLRKKNIHSC HIPGCGKVYG KTSHLKAHLR
CeT22C8.5  RCTCPNCKAI KHG------- DRGSQHTHLC SVPGCGKTYK KTSHLRAHLR
Y40B1A.4   PQISLKKKIF FFIFSNFR-- GDGKSRIHIC HL--CNKTYG KTSHLRAHLR

Introduction
Terminology in pattern finding is used loosely.
Different authors may use the same term differently.
You therefore need to know the context in which a term is used.

           ....|....| ....|....| ....|....| ....|....| ....|....| ....|....|
Sp1        ACTCPYCKDS EGRGSG---- DPGKKKQHIC HIQGCGKVYG KTSHLRAHLR WHTGERPFMC
Sp2        ACTCPNCKDG EKRS------ GEQGKKKHVC HIPDCGKTFR KTSLLRAHVR LHTGERPFVC
Sp3        ACTCPNCKEG GGRGTN---- -LGKKKQHIC HIPGCGKVYG KTSHLRAHLR WHSGERPFVC
Sp4        ACSCPNCREG EGRGSN---- EPGKKKQHIC HIEGCGKVYG KTSHLRAHLR WHTGERPFIC
DrosBtd    RCTCPNCTNE MSGLPPIVGP DERGRKQHIC HIPGCERLYG KASHLKTHLR WHTGERPFLC
DrosSp     TCDCPNCQEA ERLGPAGV-- HLRKKNIHSC HIPGCGKVYG KTSHLKAHLR WHTGERPFVC
CeT22C8.5  RCTCPNCKAI KHG------- DRGSQHTHLC SVPGCGKTYK KTSHLRAHLR KHTGDRPFVC
Y40B1A.4   PQISLKKKIF FFIFSNFR-- GDGKSRIHIC HL--CNKTYG KTSHLRAHLR GHAGNKPFAC
C2H2 zinc finger motif
Prosite pattern: C-x(2,4)-C-x(12)-H-x(3)-H

Motif
– Common sequence elements shared by a group of sequences, indicative of a functional or evolutionary relationship.
– Example: N-glycosylation site, N-{P}-[ST]-{P}
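As a minimal sketch of how such a motif can be located, the N-glycosylation pattern N-{P}-[ST]-{P} can be written as the regular expression N[^P][ST][^P]. The function name `find_nglyc_sites` and the example sequences are made up for illustration. A lookahead is used so that overlapping sites are reported as well.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue but Pro, Ser or Thr, any residue but Pro.
# Wrapping the pattern in a lookahead lets overlapping sites be found.
NGLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_nglyc_sites(seq):
    """Return 0-based start positions of candidate N-glycosylation sites."""
    return [m.start() for m in NGLYC.finditer(seq.upper())]

if __name__ == "__main__":
    # Illustrative sequences only, not taken from the slides.
    print(find_nglyc_sites("ANGTKNPSA"))
    print(find_nglyc_sites("NNSTA"))
```

A hit only marks a candidate site; whether the residue is actually glycosylated cannot be decided from the pattern alone.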

Pattern
– “A consistent, characteristic form, style, or method, as a composite of traits or features characteristic of an individual or a group.” (dictionary.com)
– A physical expression of a motif.
– Many forms of expression.

Signature/Print
– A set of patterns that defines a group of sequences having a certain common characteristic.
– Example: Bacterial Rhodopsin (2 patterns)
  » R-Y-x-[DT]-W-x-[LIVMF]-[ST]-T-P-[LIVM](3)
  » [FYIV]-x-[FYVG]-[LIVM]-D-[LIVMF]-x-[STA]-K-x(2)-[FY]
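One way to read "a set of patterns" is that a sequence belongs to the group only when every pattern in the set is found. The sketch below makes that assumption explicit. The two regular expressions are hand translations of the bacterial rhodopsin patterns, and `matches_signature` and the test sequence are hypothetical. Real signature databases may apply looser rules, such as accepting partial matches.

```python
import re

# Hand-translated regular expressions for the two bacterial rhodopsin patterns.
RHODOPSIN_PATTERNS = [
    re.compile(r"RY.[DT]W.[LIVMF][ST]TP[LIVM]{3}"),
    re.compile(r"[FYIV].[FYVG][LIVM]D[LIVMF].[STA]K.{2}[FY]"),
]

def matches_signature(seq, patterns=RHODOPSIN_PATTERNS):
    """True only if every pattern occurs somewhere in seq (an assumed rule)."""
    return all(p.search(seq) is not None for p in patterns)

if __name__ == "__main__":
    # Synthetic sequence built to contain both patterns; not a real protein.
    synthetic = "MRYADWALSTPLLLGGGFAFLDLASKAAFG"
    print(matches_signature(synthetic))
    print(matches_signature("MRYADWALSTPLLLGGG"))  # only the first pattern
```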

A single point is not indicative of identity. But many points allow for identification.

Why pattern finding and not sequence comparison?
Useful for inferring function or family when sequence similarity is low
– Certain motifs are characteristic of a function or family.
– Zinc finger motif: indicative of DNA binding.
– Avidin motif: indicative of the avidin family of proteins.

Detection of specific motifs or signals
– Example: Restriction endonuclease sites
– EcoRI
  » 5’-G^AATTC-3’ (sense strand)
  » 3’-CTTAA^G-5’ (antisense strand)
Transcription factor binding sites
– GAL4
  » CCCCAGaTTTTC
Protein motifs
– Zinc finger
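Searching for a restriction site is the simplest case of pattern finding, because the pattern is a fixed string. The sketch below looks for EcoRI sites (GAATTC, cut after the first G) in a DNA string. The function names and the test sequence are illustrative. Because the site is its own reverse complement, scanning one strand is enough.

```python
import re

ECORI_SITE = "GAATTC"  # recognition sequence, read 5'->3'
CUT_OFFSET = 1         # EcoRI cuts between G and A: G^AATTC

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_ecori_cuts(dna):
    """Return 0-based indices of the first base after each sense-strand cut."""
    dna = dna.upper()
    return [m.start() + CUT_OFFSET for m in re.finditer(ECORI_SITE, dna)]

if __name__ == "__main__":
    print(reverse_complement(ECORI_SITE))       # palindromic site
    print(find_ecori_cuts("ttGAATTCaaGAATTC"))  # made-up sequence
```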

Usually faster than sequence comparison
– BLAST has to search using many fragments of the query.
– A pattern search scans each sequence only once.

Types of Patterns
DNA
– Restriction endonuclease sites
– DNA binding motifs
– Transcription factor binding sites
– Splicing site motifs
– Other signals

Protein
– Sequence motifs
  » Zinc finger
  » SH2 domains
– Structural patterns

Representations
– Regular Expression (RE)
– Prosite Patterns
– Profiles (PSSM)
– Hidden Markov Models (HMM)

Sequences containing zinc finger motif
Sp1        CHIQGCGKVYGKTSHLRAHLRWH
Sp2        CHIPDCGKTFRKTSLLRAHVRLH
Sp3        CHIPGCGKVYGKTSHLRAHLRWH
Sp4        CHIEGCGKVYGKTSHLRAHLRWH
DrosBtd    CHIPGCERLYGKASHLKTHLRWH
DrosSp     CHIPGCGKVYGKTSHLKAHLRWH
CeT22C8.5  CSVPGCGKTYKKTSHLRAHLRKH
Y40B1A.4   CHL--CNKTYGKTSHLRAHLRGH

Regular Expression
Used in computer science
Syntax:
Character   Meaning
^           Match the beginning of the line
$           Match the end of the line
*           Match 0 or more repetitions of the preceding character
+           Match 1 or more repetitions of the preceding character
?           Match 0 or 1 occurrence of the preceding character
{m}         Match exactly m repetitions of the preceding character
{m,n}       Match m to n repetitions of the preceding character
Char        Match the character Char
.           Match any character
[]          Match any one character within the brackets
[^Char]     Match any character except Char
Zinc finger motif: C.{2,4}C.{12}H.{3}H

Example
C.{2,4}C.{12}H.{3}H
Sp1        CHIQGCGKVYGKTSHLRAHLRWH
Sp2        CHIPDCGKTFRKTSLLRAHVRLH
Sp3        CHIPGCGKVYGKTSHLRAHLRWH
Sp4        CHIEGCGKVYGKTSHLRAHLRWH
DrosBtd    CHIPGCERLYGKASHLKTHLRWH
DrosSp     CHIPGCGKVYGKTSHLKAHLRWH
CeT22C8.5  CSVPGCGKTYKKTSHLRAHLRKH
Y40B1A.4   CHL--CNKTYGKTSHLRAHLRGH
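The example can be checked with Python's `re` module. This is a small sketch, not part of the original slides: `has_zinc_finger` is an illustrative name, and alignment gaps are removed before matching, on the assumption that the pattern describes the ungapped sequence.

```python
import re

ZINC_FINGER = re.compile(r"C.{2,4}C.{12}H.{3}H")

SEQUENCES = {
    "Sp1": "CHIQGCGKVYGKTSHLRAHLRWH",
    "Sp2": "CHIPDCGKTFRKTSLLRAHVRLH",
    "Sp3": "CHIPGCGKVYGKTSHLRAHLRWH",
    "Sp4": "CHIEGCGKVYGKTSHLRAHLRWH",
    "DrosBtd": "CHIPGCERLYGKASHLKTHLRWH",
    "DrosSp": "CHIPGCGKVYGKTSHLKAHLRWH",
    "CeT22C8.5": "CSVPGCGKTYKKTSHLRAHLRKH",
    "Y40B1A.4": "CHL--CNKTYGKTSHLRAHLRGH",
}

def has_zinc_finger(aligned_seq):
    """Strip alignment gaps, then search for the zinc finger pattern."""
    return ZINC_FINGER.search(aligned_seq.replace("-", "")) is not None

if __name__ == "__main__":
    for name, seq in SEQUENCES.items():
        print(f"{name:10s} {has_zinc_finger(seq)}")
```

The variable spacer `.{2,4}` is what lets Y40B1A.4, which has only two residues between the cysteines, match alongside the other seven sequences.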

Prosite Patterns
Very similar to RE
A pattern written in Prosite style can easily be converted to RE style, and vice versa
More familiar to biologists

RE          Prosite
^           <
$           >
?           (0,1)
{m}         (m)
{m,n}       (m,n)
Char        Char
.           x
[]          []
[^char]     {char}
Zinc finger motif
RE:      C.{2,4}C.{12}H.{3}H
Prosite: C-x(2,4)-C-x(12)-H-x(3)-H
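The correspondence in the table is mechanical enough to automate. The following is a minimal sketch of a Prosite-to-RE converter that covers only the constructs listed above. The function name `prosite_to_regex` is an assumption, and the full Prosite syntax has further details that are not handled here.

```python
import re

def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    regex = pattern.strip().replace("-", "")                # '-' only separates elements
    regex = re.sub(r"\{([A-Z]+)\}", r"[^\1]", regex)        # {P}   -> [^P]
    regex = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)   # (2,4) -> {2,4}
    regex = regex.replace("x", ".")                         # x     -> any residue
    regex = regex.replace("<", "^").replace(">", "$")       # sequence start / end
    return regex

if __name__ == "__main__":
    print(prosite_to_regex("C-x(2,4)-C-x(12)-H-x(3)-H"))
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
```

The `{...}` exclusions are rewritten before the `(...)` repeat counts, so that the braces produced by the second step are not mistaken for exclusions.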

Profiles
Similar to the scoring matrices used in sequence comparison
The outcome of applying the matrix is a score
A threshold is used to determine whether it is a hit

Profile
Sp1        C H I Q G C G K  VYGKTSHLRAHLRWH
Sp2        C H I P D C G K  TFRKTSLLRAHVRLH
Sp3        C H I P G C G K  VYGKTSHLRAHLRWH
Sp4        C H I E G C G K  VYGKTSHLRAHLRWH
DrosBtd    C H I P G C E R  LYGKASHLKTHLRWH
DrosSp     C H I P G C G K  VYGKTSHLKAHLRWH
CeT22C8.5  C S V P G C G K  TYKKTSHLRAHLRKH
Y40B1A.4   C H L - - C N K  TYGKTSHLRAHLRGH
[Profile matrix: one row per position (Pos), one column for each of A C D E F G H I K L M N P Q R S T V W X and the gap character (–); the values are not preserved in this transcript]

[Profile matrix repeated, columns Pos A C D E F G H I K L M N P Q R S T V W X –; values not preserved]
seq: C H I Q G C G K, score = 49

Example
Sp1        CHIQGCGK = 49
Sp2        CHIPDCGK = 48
Sp3        CHIPGCGK = 53
Sp4        CHIEGCGK = 49
DrosBtd    CHIPGCER = 42
DrosSp     CHIPGCGK = 53
CeT22C8.5  CSVPGCGK = 42
Y40B1A.4   CHL--CNK = 34  <- lowest
Since all the sequences are known to contain the zinc finger motif, the threshold can be set at 34. Any sequence that scores below the threshold is rejected, and any sequence that scores at or above it is likely to have the zinc finger motif.
Unrelated seq CADEGCEK = 31  REJECT
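The matrix values are missing from this transcript, but the quoted scores are consistent with a plain count profile: each column stores how often each residue (or gap) occurs at that position among the eight sequences, and a sequence's score is the sum of the counts of its residues. The sketch below rests on that assumption. It reproduces all the scores above, but it may not be exactly how the original matrix was built, and the function names are illustrative.

```python
from collections import Counter

# First eight alignment columns of the zinc finger sequences.
ALIGNED = {
    "Sp1": "CHIQGCGK", "Sp2": "CHIPDCGK", "Sp3": "CHIPGCGK", "Sp4": "CHIEGCGK",
    "DrosBtd": "CHIPGCER", "DrosSp": "CHIPGCGK",
    "CeT22C8.5": "CSVPGCGK", "Y40B1A.4": "CHL--CNK",
}

def build_count_profile(seqs):
    """One Counter per alignment column: residue or gap -> occurrences."""
    return [Counter(column) for column in zip(*seqs)]

def score(profile, seq):
    """Sum the per-column counts of the residues in seq (unseen residues add 0)."""
    return sum(column[res] for column, res in zip(profile, seq))

PROFILE = build_count_profile(ALIGNED.values())
# Lowest score among the known members, used as the acceptance threshold.
THRESHOLD = min(score(PROFILE, s) for s in ALIGNED.values())

if __name__ == "__main__":
    for name, seq in ALIGNED.items():
        print(f"{name:10s} {seq} = {score(PROFILE, seq)}")
    unrelated = "CADEGCEK"
    verdict = "ACCEPT" if score(PROFILE, unrelated) >= THRESHOLD else "REJECT"
    print(f"Unrelated  {unrelated} = {score(PROFILE, unrelated)} {verdict}")
```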

The unrelated sequence was rejected because of its low score. With a Prosite pattern, however, it would have been accepted.
– C-x(2,4)-C-x(2) <= Prosite motif
Advantages of profiles
– More expressive: details are included
– More sensitive
– Provide a quantitative value
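The acceptance by the pattern is easy to confirm. In this short sketch the eight-residue Prosite motif C-x(2,4)-C-x(2) is written as a regular expression and applied to the unrelated sequence. The name `pattern_accepts` is illustrative.

```python
import re

# C-x(2,4)-C-x(2) written as a regular expression.
SHORT_MOTIF = re.compile(r"C.{2,4}C.{2}")

def pattern_accepts(seq):
    """True if the whole sequence fits the pattern; there is no score."""
    return SHORT_MOTIF.fullmatch(seq) is not None

if __name__ == "__main__":
    print(pattern_accepts("CADEGCEK"))  # the unrelated sequence
    print(pattern_accepts("CHIPGCGK"))  # a known zinc finger fragment
```

A pattern gives only a yes or no answer, so it cannot tell these two sequences apart, whereas the profile assigns them different scores.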

The example provided is very simple
It is possible to include
– Evolutionary distance
– Amino acid frequency
– Substitution matrix
This makes the profile even more accurate
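As one illustration of these refinements, the sketch below takes amino acid frequency into account by turning the column counts into log-odds scores against a background distribution. The uniform background, the pseudocount of 1 and the function names are all assumptions made for this example. Evolutionary distance and substitution matrices are not covered.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MEMBERS = ["CHIQGCGK", "CHIPDCGK", "CHIPGCGK", "CHIEGCGK",
           "CHIPGCER", "CHIPGCGK", "CSVPGCGK", "CHL--CNK"]

def build_log_odds_profile(seqs, pseudocount=1.0, background=None):
    """Per-column log2(observed frequency / background frequency)."""
    if background is None:
        # Assumed uniform background; real residue frequencies could be used.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    profile = []
    for column in zip(*seqs):
        counts = Counter(res for res in column if res in background)  # skip gaps
        total = sum(counts.values()) + pseudocount * len(background)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background[aa])
            for aa in background
        })
    return profile

def score(profile, seq):
    """Sum of log-odds scores; gaps and unknown residues contribute 0."""
    return sum(column.get(res, 0.0) for column, res in zip(profile, seq))

if __name__ == "__main__":
    profile = build_log_odds_profile(MEMBERS)
    for seq in ("CHIPGCGK", "CADEGCEK"):
        print(seq, round(score(profile, seq), 2))
```

With log-odds scores, a residue that is seen more often than the background expects contributes a positive value, and an unseen residue contributes a negative one rather than zero.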

Hidden Markov Models (HMM)
Profiles are a special case of HMMs
An HMM has a number of states
Transitions from one state to another are governed by a set of probabilities called transition probabilities
At each state an observation is generated

The model is called hidden because only the observations are visible; the states are hidden.
The probabilities are first estimated from an MSA.
The estimated probabilities are then used to decide whether a sequence has the pattern or not.
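To make the roles of states, transition probabilities and emitted observations concrete, here is a toy two-state HMM with the forward algorithm, which sums the probability of an observed sequence over all hidden state paths. Every number and name in it is hypothetical. It is not a profile HMM and was not estimated from the alignment.

```python
from itertools import product

# Hypothetical model: the "motif" state favours C and H, "background" is uniform.
STATES = ("motif", "background")
START = {"motif": 0.5, "background": 0.5}
TRANS = {
    "motif": {"motif": 0.8, "background": 0.2},
    "background": {"motif": 0.1, "background": 0.9},
}
EMIT = {
    "motif": {"C": 0.45, "H": 0.45, "A": 0.10},
    "background": {"C": 1 / 3, "H": 1 / 3, "A": 1 / 3},
}

def forward(observations):
    """Probability of a non-empty observation string, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())

def total_probability(length, alphabet="CHA"):
    """Sanity check: probabilities of all strings of one length should sum to 1."""
    return sum(forward("".join(p)) for p in product(alphabet, repeat=length))

if __name__ == "__main__":
    print(forward("CHC"), forward("AAA"))
    print(total_probability(3))
```

In practice the resulting probability is compared with that of a background model or with a threshold, much as a profile score is.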

A Short Profile HMM
[Figure: state diagram with states labelled I1, M1, D2 and M2]
I represents insertion states, M represents match states and D represents deletion states.
Both I and M states emit amino acids.

Sources and Creation of Patterns
Source of patterns
– The source of patterns is mainly MSA.
Creation of patterns
– Manually, as in Prosite
– Automatically, through machine learning
  » MEME
  » Pratt

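Considerations
Sensitivity/Recall
– How many of the true instances of the pattern were found
– TP / (TP + FN)
Specificity/Precision
– How many of the reported hits are correct
– TP / (TP + FP)
– Strictly, this ratio is the precision (positive predictive value); specificity in the statistical sense is TN / (TN + FP)
It is usually a balance between these two measures.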
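The trade-off can be seen by computing both measures at different thresholds. The scores and labels below are invented for illustration, and the function names are assumptions. A hit is any sequence scoring at or above the threshold.

```python
def sensitivity(tp, fn):
    """TP / (TP + FN); 0.0 when there are no true members."""
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    """TP / (TP + FP); 0.0 when nothing is reported as a hit."""
    return tp / (tp + fp) if tp + fp else 0.0

def evaluate(scores, labels, threshold):
    """Return (sensitivity, precision) for hits scoring >= threshold."""
    tp = sum(1 for s, member in zip(scores, labels) if s >= threshold and member)
    fp = sum(1 for s, member in zip(scores, labels) if s >= threshold and not member)
    fn = sum(1 for s, member in zip(scores, labels) if s < threshold and member)
    return sensitivity(tp, fn), precision(tp, fp)

# Hypothetical profile scores; True marks sequences that really have the motif.
SCORES = [53, 49, 42, 34, 40, 31, 20]
LABELS = [True, True, True, True, False, False, False]

if __name__ == "__main__":
    for threshold in (34, 41):
        print(threshold, evaluate(SCORES, LABELS, threshold))
```

Lowering the threshold recovers every true member but admits a false positive, and raising it removes the false positive at the cost of missing a member.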

Ideal situation
[Figure; label: Threshold]

The real situation
[Figure; labels: False Positive, False Negative]

Other points:
– A literature search can be done to identify potential conserved or functional regions suitable for use in pattern creation. For example, alanine scanning may indicate a region of functional importance.
– All calculations of sensitivity and specificity are based on the current state of the database.
– The coverage of the existing database needs to be considered.

Summary
– Definition of patterns and motifs
– Why use pattern finding
– Types of patterns
– Sources and creation of patterns