
PROTEIN SEQUENCE ANALYSIS

Good protein sequence analysis tools are needed because:
- As the number of sequences increases, so does the gap between sequence data and experimental data
- But more sequences mean larger sequence databases, and therefore a greater chance of finding a similar sequence
- Computer analysis can narrow down the number of functional experiments required

UNKNOWN PROTEIN SEQUENCE
LOOK FOR:
- Similar sequences in databases ((PSI-)BLAST)
- Distinctive patterns/domains associated with function
- Functionally important residues
- Secondary and tertiary structure
- Physical properties (hydrophobicity, IEP etc.)

BASIC INFORMATION COMES FROM SEQUENCE
- One sequence: can get some information, e.g. amino acid properties
- More than one sequence: get more information on conserved residues, fold and function
- Multiple alignments of related sequences: can build up consensus sequences of known families, domains, motifs or sites
- Sequence alignments can give information on loops, families and function from conserved regions

LEVEL OF FUNCTION INFORMATION IN PROTEIN SEQUENCES
- SUPERFAMILY
- FAMILY
- DOMAIN
- MOTIF
- SITE
- RESIDUE
- SECONDARY STRUCTURE
- 3D STRUCTURE

AMINO ACID PROPERTIES
Small                 Ala, Gly
Small hydroxyl        Ser, Thr
Basic                 His, Lys, Arg
Aromatic              Phe, Tyr, Trp
Small hydrophobic     Val, Leu, Ile
Medium hydrophobic    Val, Leu, Ile, Met
Acidic/amide          Asp, Glu, Asn, Gln
Small/polar           Ala, Gly, Ser, Thr, Pro
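
The property table above can be held as a simple lookup. This is a minimal sketch: the group names and members are copied from the slide, while the dictionary and function names (PROPERTY_GROUPS, groups_for, share_group) are made up for illustration.

```python
# Property groups as listed on the slide, using one-letter codes.
# All names in this block are illustrative, not from any library.
PROPERTY_GROUPS = {
    "small": {"A", "G"},
    "small hydroxyl": {"S", "T"},
    "basic": {"H", "K", "R"},
    "aromatic": {"F", "Y", "W"},
    "small hydrophobic": {"V", "L", "I"},
    "medium hydrophobic": {"V", "L", "I", "M"},
    "acidic/amide": {"D", "E", "N", "Q"},
    "small/polar": {"A", "G", "S", "T", "P"},
}


def groups_for(residue):
    """Return the sorted names of every group that contains the residue."""
    r = residue.upper()
    return sorted(name for name, members in PROPERTY_GROUPS.items() if r in members)


def share_group(a, b):
    """True if two residues fall in at least one common property group."""
    return bool(set(groups_for(a)) & set(groups_for(b)))


if __name__ == "__main__":
    print(groups_for("A"))
    print(share_group("S", "T"), share_group("K", "D"))
```

Residues that share a group are the ones a chemical-similarity view would treat as conservative substitutions. Cys has no entry because the slide does not place it in any of these groups.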

Protein functions from specific residues
Residue classes:
- Polar (C,D,E,H,K,N,Q,R,S,T): active sites
- Aromatic (F,H,W,Y): protein ligand-binding sites
- Zn2+-coord (C,D,E,H,N,Q): active site, zinc finger
- Ca2+-coord (D,E,N,Q): ligand-binding site
- Mg/Mn-coord (D,E,N,S,R,T): Mg2+ or Mn2+ catalysis, ligand binding
- Ph-bind (H,K,R,S,T): phosphate and sulphate binding
Proteins rich in particular residues:
C     disulphide-rich, metallothionein, zinc fingers
DE    acidic proteins (unknown)
G     collagens
H     histidine-rich glycoprotein
KR    nuclear proteins, nuclear localisation
P     collagen, filaments
SR    RNA binding motifs
ST    mucins

Protein functions from regions
- Active sites: short, highly conserved regions
- Loops: charged residues and variable sequence
- Interior of protein: conservation of charged amino acids

Additional analysis of protein sequences
- transmembrane regions
- signal sequences
- localisation signals
- targeting sequences
- GPI anchors
- glycosylation sites
- hydrophobicity
- amino acid composition
- molecular weight
- solvent accessibility
- antigenicity
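
Two of the analyses listed above, amino acid composition and hydrophobicity, are simple enough to sketch directly. The slides do not name a hydrophobicity scale; the Kyte-Doolittle hydropathy scale is assumed here, and the window length of 9 and the function names are illustrative choices.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (assumed scale; the slides do not specify one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each amino acid present in the sequence."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def hydropathy_profile(seq, window=9):
    """Mean hydropathy over each sliding window of the given length."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    seq = "MKTLLLTLVVVTIVCLDLGYT"  # arbitrary example sequence
    print(composition(seq))
    print([round(v, 2) for v in hydropathy_profile(seq)])
```

Long runs of high window averages are the usual first hint of a transmembrane region or signal sequence. Dedicated predictors do considerably more than this.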

FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES
- Pattern: short, simplest, but limited
- Motif: conserved element of a sequence alignment, usually predictive of a structural or functional region
- To get more information across the whole alignment: Matrix, Profile, HMM

PATTERNS
- Small, highly conserved regions
- Shown as regular expressions
- Example: [AG]-x-V-x(2)-x-{YW}
  - [] shows either amino acid
  - x is any amino acid
  - x(2) is any amino acid in the next 2 positions
  - {} shows any amino acid except these
- BUT: limited to near-exact matches in a small region
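
The pattern notation above maps almost directly onto ordinary regular expressions. The sketch below converts only the elements shown on the slide ([], x, x(n), {}) and ignores other parts of the PROSITE syntax such as the < and > anchors. The function names are illustrative.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regex syntax.

    Handles only [..], {..}, x and (n) or (n,m) repeats; amino acids must be
    upper-case and the wildcard a lower-case x.
    """
    regex = pattern.replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")  # {YW} -> [^YW]
    regex = regex.replace("(", "{").replace(")", "}")   # x(2) -> x{2}
    return regex.replace("x", ".")                      # x    -> .


def find_matches(seq, pattern):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # The lookahead lets the search restart at every position.
    compiled = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in compiled.finditer(seq)]


if __name__ == "__main__":
    pat = "[AG]-x-V-x(2)-x-{YW}"
    print(prosite_to_regex(pat))
    print(find_matches("MMAKVLLQCMM", pat))
```

Because every position in the pattern must match, a single substitution at a constrained position loses the hit. That is the "near-exact match" limitation noted above.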

MATRIX
- 210 possible aa pairs (190 pairs of different aa, 20 pairs of identical aa)
- Start with a sequence alignment and build up a table of probabilities of finding each aa in each position of the sequence
- Can be scored in several different ways
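
The pair count and the per-position table can both be checked in a few lines. This is a sketch under simple assumptions: the alignment is ungapped, every sequence has the same length, and the function name column_frequencies is illustrative.

```python
from collections import Counter
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Unordered pairs of the 20 amino acids: 20 identical + 190 different = 210.
PAIRS = list(combinations_with_replacement(AMINO_ACIDS, 2))
IDENTICAL = [p for p in PAIRS if p[0] == p[1]]
DIFFERENT = [p for p in PAIRS if p[0] != p[1]]


def column_frequencies(alignment):
    """For each alignment column, the observed fraction of each residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    table = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        table.append({aa: n / len(alignment) for aa, n in counts.items()})
    return table


if __name__ == "__main__":
    print(len(PAIRS), len(DIFFERENT), len(IDENTICAL))
    print(column_frequencies(["ACD", "ACE", "GCD"]))
```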

Matrix scores can be based on:
- Genetic code: base changes required to convert codons for 2 amino acids
- Chemical similarity: polarity, size, shape, charge
- Observed substitutions: based on analysing frequencies seen in alignments (inter-reliable)
- Dayhoff mutation data matrix: likelihood of mutation from one aa to another, but different positions are not equally mutable, and it is only useful for closely related sequences because the underlying alignments are of very closely related proteins
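
The genetic-code criterion can be made concrete by counting the fewest base changes separating any codon of one amino acid from any codon of another. The sketch assumes the standard genetic code, written as the usual 64-letter string in TCAG order. The function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(aa):
    """All codons that encode the given amino acid (one-letter code)."""
    return [codon for codon, encoded in CODON_TABLE.items() if encoded == aa]


def min_base_changes(aa1, aa2):
    """Fewest single-base changes needed to turn a codon for aa1 into one for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


if __name__ == "__main__":
    for a, b in [("F", "L"), ("D", "E"), ("M", "W")]:
        print(a, b, min_base_changes(a, b))
```

A matrix built this way depends only on the code, not on what substitutions are actually observed in proteins.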

Matrix scoring continued
- BLOSUM (blocks substitution matrix): matrix from ungapped alignments of distantly related sequences
  - cluster sequences similar at a threshold value of % identity
  - substitution frequencies for all pairs of aa calculated
  - used to calculate a log-odds matrix
  - can vary threshold values
- 3D structure matrix: derived from tertiary structure alignment; good, but only used if structure is known
- Best matrices are derived from observed substitution data; it is important to select a scoring matrix appropriate for the evolutionary distance of interest
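
The log-odds step can be sketched as follows. The pair counts are invented toy numbers and the clustering step is left out. Scores are scaled to rounded half-bits, which is the usual BLOSUM convention but is an assumption here, since the slide does not give the scaling.

```python
import math


def log_odds_scores(pair_counts):
    """Turn observed counts of aligned residue pairs into log-odds scores.

    pair_counts maps a pair such as ("A", "S") to how often it was seen in
    the aligned blocks; ("A", "S") and ("S", "A") are treated as the same pair.
    """
    total = sum(pair_counts.values())
    q = {}  # observed frequency of each unordered pair
    for pair, n in pair_counts.items():
        key = tuple(sorted(pair))
        q[key] = q.get(key, 0.0) + n / total

    p = {}  # marginal frequency of each residue
    for (a, b), freq in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))  # half-bits
    return scores


if __name__ == "__main__":
    toy_counts = {("A", "A"): 60, ("A", "S"): 20, ("S", "S"): 20}  # invented
    print(log_odds_scores(toy_counts))
```

A positive score means the pair is seen more often than chance would predict from the residue frequencies; a negative score means it is seen less often.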

PROFILES
- Table or matrix containing comparison information for aligned sequences
- Used to find sequences similar to an alignment rather than to one sequence
- Contains the same number of rows as positions in the sequences
- Each row contains the score for alignment of that position with each residue

Example of a Profile
[Figure: example profile] Match values are higher for conserved residues

Building a Profile
- To get a good profile, need a good, hand-curated alignment
- Use the alignment to build up a position-specific scoring matrix
- Use the matrix (profile) to do PSI-BLAST with several iterations
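
A position-specific scoring matrix can be sketched from an ungapped alignment as below. This is not how PSI-BLAST computes its matrix. It is a simplified illustration that assumes a uniform background frequency and a fixed pseudocount of 1, and all function names are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """One row per alignment position; each row maps residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    n_seqs = len(alignment)
    pssm = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment)
        denom = n_seqs + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / denom) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_window(pssm, window):
    """Sum of the per-position scores for a stretch as long as the profile."""
    return sum(row[aa] for row, aa in zip(pssm, window))


def best_hit(pssm, seq):
    """(score, start) of the best-scoring ungapped window in seq."""
    width = len(pssm)
    return max(
        (score_window(pssm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


if __name__ == "__main__":
    profile = build_pssm(["ACD", "ACE", "ACD"])
    print(best_hit(profile, "GGGACDGG"))
```

Conserved columns give high scores to the conserved residue and low scores to everything else, which is why the quality of the starting alignment matters so much.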

SCORES
- The E-value is the number of hits expected by chance from random sequences
- E-value 1.0: not significant; 0.1: possibly significant; < 0.01: most likely to be significant
- All depends on database size
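
The dependence on database size can be illustrated with the Karlin-Altschul form E = K * m * n * exp(-lambda * S), which the slides do not state and which is brought in here as an assumption. The default values of lam and k are placeholders of the order quoted for ungapped BLOSUM62 searches and should not be taken as authoritative.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.3176, k=0.134):
    """E-value under the Karlin-Altschul form E = K * m * n * exp(-lambda * S).

    lam and k are illustrative placeholder parameters (assumption).
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    for db_len in (1e6, 1e8):
        print(db_len, expected_hits(raw_score=60, query_len=250, db_len=db_len))
```

The same alignment score gives an E-value 100 times larger in a database 100 times bigger, so a fixed cut-off means different things in different databases.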

HIDDEN MARKOV MODELS (HMM)
- An HMM is a large-scale profile with gaps, insertions and deletions allowed in the alignments, and built around probabilities
- Package used: HMMER
- Start with one sequence or alignment: build the model with HMMbuild, then calibrate with HMMcalibrate, then search the database with the HMM
- E-value: number of false matches expected with a certain score
- Assume an extreme value distribution (EVD) for noise; calibrate by searching random sequences with the HMM to build up the curve of noise
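
The calibration idea can be sketched with a Gumbel (type I extreme value) distribution fitted to scores from random sequences. HMMER's own fitting procedure is not reproduced here. A method-of-moments fit is swapped in for simplicity, and the function names are illustrative.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(noise_scores):
    """Method-of-moments estimate of Gumbel location (mu) and scale (lam).

    Uses mean = mu + gamma / lam and variance = pi^2 / (6 * lam^2).
    """
    mean = statistics.mean(noise_scores)
    sd = statistics.stdev(noise_scores)
    lam = math.pi / (sd * math.sqrt(6))
    mu = mean - EULER_GAMMA / lam
    return mu, lam


def p_value(score, mu, lam):
    """P(noise score >= score) = 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def e_value(score, mu, lam, n_sequences):
    """Expected number of false matches at this score in n_sequences searches."""
    return n_sequences * p_value(score, mu, lam)


if __name__ == "__main__":
    # Invented noise scores standing in for a search against random sequences.
    noise = [8.1, 9.4, 10.2, 10.9, 11.5, 12.3, 13.8, 15.0, 17.2, 21.6]
    mu, lam = fit_gumbel(noise)
    print(mu, lam, e_value(30.0, mu, lam, n_sequences=100000))
```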

REPEATS
- Structural and evolutionary entities found in 2 or more copies
- Often assemble into elongated "rods", "superhelices" or "barrel" structures
- Specialised cases when building profiles

PITFALLS OF METHODS
- BLAST: only picks up homologues, not distant, divergent family members
- PSI-BLAST: fine for superfamilies, not very good for small, very conserved motifs
- Patterns: small, localised, and need highly conserved regions
- HMMER: slow process for searching a database
- Profiles: if a false positive is picked up, it pulls in its companions; in large families, members can be missed
- Alignment methods: automatic, less biological significance

Big problem in protein sequence analysis: multidomain proteins
- The most conserved domain will score highest in sequence similarity searches, so lower-scoring domains may be overlooked
- Iterative searching of multi-domain proteins could pick up unrelated proteins
[Figure: domain diagrams for proteins A, B and C, built from Domain 1 and Domain 2] A=B, B=C, A≠C; A, B & C share a common domain
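
The A=B, B=C, A≠C situation can be reproduced with sets of domains. The domain assignments below are a hypothetical reading of the figure (A has Domain 1, B has both, C has Domain 2), and "similar" is reduced to "shares at least one domain".

```python
# Hypothetical domain content for the three proteins in the figure.
PROTEINS = {
    "A": {"Domain 1"},
    "B": {"Domain 1", "Domain 2"},
    "C": {"Domain 2"},
}


def shares_domain(x, y, proteins=PROTEINS):
    """Stand-in for a significant similarity hit: any domain in common."""
    return bool(proteins[x] & proteins[y])


def iterative_hits(query, proteins=PROTEINS, rounds=2):
    """Each round adds every protein that matches anything already found."""
    found = {query}
    for _ in range(rounds):
        found |= {
            name for name in proteins
            if any(shares_domain(name, hit, proteins) for hit in found)
        }
    return found


if __name__ == "__main__":
    print(shares_domain("A", "B"), shares_domain("B", "C"), shares_domain("A", "C"))
    print(sorted(iterative_hits("A", rounds=1)), sorted(iterative_hits("A", rounds=2)))
```

After one round the search from A finds only B. After a second round B's other domain pulls in C, which shares nothing with A.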

SUMMARY OF PATTERN METHODS
- Single motif method: extract regular expression (PROSITE)
- Multiple motif methods: frequency matrix (PRINTS) or PSS matrix (BLOCKS)
- Full domain alignment methods (ProDom, DOMO)
- Full domain profile or HMM (Pfam, SMART)

COMMON PROTEIN PATTERN DATABASES
- Prosite patterns
- Prosite profiles
- Pfam
- SMART
- Prints
- ProDom
- DOMO
- BLOCKS

SOFTWARE FOR PROTEIN SEQUENCE ANALYSIS
- GCG
- EMBOSS (ftp:ftp.sanger.ac.uk/pub/EMBOSS)
- PIX - HGMP
- ExPASy Proteomics tools
- PredictProtein ( heidelberg.de/predictprotein/)