Psi-Blast Morten Nielsen, Department of systems biology, DTU.

Slides:

Advertisements

Similar presentations

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence information, logos and Hidden Markov Models Morten Nielsen, CBS, BioCentrum,

Advertisements

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden.

Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.

BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.

Hidden Markov Models What are the good for? Morten Nielsen CBS.

Optimization methods Morten Nielsen Department of Systems Biology, DTU.

Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.

Protein Fold recognition Morten Nielsen, CBS, BioSys, DTU.

Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.

Heuristic alignment algorithms and cost matrices

Protein structure and homology modeling Morten Nielsen, CBS, BioCentrum, DTU.

Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.

Protein Fold recognition Morten Nielsen, CBS, Department of Systems Biology, DTU.

PSI (position-specific iterated) BLAST The NCBI page described PSI blast as follows: “Position-Specific Iterated BLAST (PSI-BLAST) provides an automated,

Protein Fold recognition Morten Nielsen, Thomas Nordahl CBS, BioCentrum, DTU.

Position-Specific Substitution Matrices. PSSM A regular substitution matrix uses the same scores for any given pair of amino acids regardless of where.

Protein Fold recognition

Introduction to bioinformatics

Multiple sequence alignments and motif discovery Tutorial 5.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,

Similar Sequence Similar Function Charles Yan Spring 2006.

Protein homology modeling Morten Nielsen, CBS, BioCentrum, DTU.

Psi-Blast Morten Nielsen, CBS, Department of Systems Biology, DTU.

Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.

Blast heuristics Morten Nielsen Department of Systems Biology, DTU.

Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU.

Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒黃尹柔田耕豪蕭逸嫻謝朝茂莊閔傑 2014/05/12 1.

What is bioinformatics?. What are bioinformaticians up to, actually? Manage molecular biological data –Store in databases, organise, formalise, describe...

Protein Sequence Alignment and Database Searching.

Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.

Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.

Sequence encoding, Cross Validation Morten Nielsen BioSys, DTU

Motif discovery Tutorial 5. Motif discovery MEME Creates motif PSSM de-novo (unknown motif) MAST Searches for a PSSM in a DB TOMTOM Searches for a PSSM.

Sequence alignment SEQ1: VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKK VADALTNAVAHVDDPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHA SLDKFLASVSTVLTSKYR.

Construction of Substitution Matrices

Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.

The Blosum scoring matrices Morten Nielsen BioSys, DTU.

Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.

Tutorial 4 Substitution matrices and PSI-BLAST 1.

Motif discovery and Protein Databases Tutorial 5.

BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.

Dealing with Sequence redundancy Morten Nielsen Department of Systems Biology, DTU.

Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Lecture 7 CS5661 Heuristic PSA “Words” to describe dot-matrix analysis Approaches –FASTA –BLAST Searching databases for sequence similarities –PSA –Alternative.

Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.

Sequence Alignment.

Construction of Substitution matrices

Blast 2.0 Details The Filter Option: –process of hiding regions of (nucleic acid or amino acid) sequence having characteristics.

Blosum matrices What are they? Morten Nielsen BioSys, DTU

Step 3: Tools Database Searching

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Chapter 6 - Profiles1 Assume we have a family of sequences. To search for other sequences in the family we can Search with a sequence from the family Search.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Tutorial 4 Comparing Protein Sequences Intro to Bioinformatics 1.

Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.

Blast heuristics, Psi-Blast, and Sequence profiles Morten Nielsen Department of systems biology, DTU.

Outline Basic Local Alignment Search Tool

Outline Basic Local Alignment Search Tool

Sequence Alignment Algorithms Morten Nielsen BioSys, DTU

Blosum matrices What are they

Basic Local Alignment Search Tool

BLAST Slides adapted & edited from a set by

BLAST Slides adapted & edited from a set by

Presentation transcript:

Psi-Blast Morten Nielsen, Department of systems biology, DTU

Objectives Understand why BLAST often fails for low sequence similarity See the beauty of sequence profiles –Position specific scoring matrices (PSSMs) Use BLAST to generate Sequence profiles Use profiles to identify amino acids essential for protein function and structure

Sequence Profiles Alignments based on conventional scoring matrices (BLOSUM62) scores all positions in a sequence in an equal manner Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix) Sequence profile are ideal suited to describe such position specific variations

What goes wrong when Blast fails? Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acids matches in the two protein sequences

Alignment scoring matrices Blosum62 score matrix. Fg=1. Ng=0? LAGDSD F I G D S L

Alignment scoring matrices Blosum62 score matrix. Fg=1. Ng=0? Score = =17 LAGDSD F I G-4060 D S L LAGDS I-GDS

What goes wrong when Blast fails? Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acids matches in the two protein sequences This scoring matrix is identical at all positions in the protein sequence! EVVFIGDSLVQLMHQC X X AGDS.GGGDSAGDS.GGGDS

When Blast works! 1PLC._ 1PLB._

When Blast fails! 1PLC._ 1PMY._

Sequence profiles In reality not all positions in a protein are equally likely to mutate Some amino acids (active cites) are highly conserved, and the score for mismatch must be very high Other amino acids can mutate almost for free, and the score for mismatch should be lower than the BLOSUM score Sequence profiles can capture these differences

Sequence profiles TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADPPAYVTQCPI

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADPPAYVTQCPI Sequence profiles Conserved Non-conserved Matching any thing but G => large negative score Any thing can match TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP

Visualization of Sequence logos ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV P A = 6/10 = 0.6 P G = 2/10 = 0.2 P T = P K = 1/10 = 0.1 P C = P D = …P V = 0.0 q A = 0.07 q G = 0.07 q T = 0.05 q K = 0.06

Sequence logos (Kullback-Leibler) Height of a column equal to I Relative height of a letter is p (Letters upside-down if p a < q a ) High information positions

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI Sequence profiles Conserved Non-conserved Matching any thing but G => large negative score Any thing can match TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

How to make sequence profiles 1.Align (BLAST) sequence against large sequence database (Swiss-Prot) 2.Select significant alignments and make sequence profile 3.Use profile to align against sequence database to find new significant hits 4.Repeat 2 and 3 (normally 3 times!)

Protein world Blast iterations Protein

How to make sequence profiles The blast command blastpgp -d db -e j 4 -Q blastprofile -i fastafile -o out Last position-specific scoring matrix computed A R N D C Q E G H I L K M F P S T W Y V 1 T T V Y L A G D S T M A

Sequence profiles for a single sequence TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V

Sequence profiles (1K7C.A) 0 iterations (Blosum62) A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V

Sequence profiles (1K7C.A) 0 iterations (Blosum62) A R N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V

Sequence profiles (1K7C.A) 3 iterations A R N D C Q E G H I L K M F P S T W Y V T T V Y L A G D S T M A K N G G G S G T Sequence profile

Example. >1K7C.A TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL What is the function Where is the active site?

What would you do? Function Run Blast against PDB No significant hits Run Blast against NR (Sequence database) Function is Acetylesterase? Where is the active site?

Example. Where is the active site? 1WAB Acetylhydrolase 1G66 Acetylxylan esterase 1USW Hydrolase

When Blast fails! 1K7A.A 1WAB._

Example. (SGNH active site)

Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

Profile-profile scoring matrix 1K7C.A 1WAB._

Summary Sequence logo is a power tool to visualize (binding) motifs –Information content identifies essential residues for function and/or structural stability Weight matrices and sequence profiles can be derived from very limited number of data using the techniques of –Sequence weighting –Pseudo counts Weight matrices and sequences profiles can accurately describe binding motifs, sequence conservation, active sites...