1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.

Presentation transcript:

1
Genome information: GenBank (Entrez nucleotide); species-specific databases
Protein sequence: GenBank (Entrez protein); UniProtKB (SwissProt)
Protein structure: Protein Data Bank (PDB)
Similar protein sequences / domain analysis: Protein Families (Pfam); BLAST
Homology modeling: Swiss Model
Evolution trees: protein databases; CLUSTAL-W
3D structure visualization: Protein workshop; Swiss PDBViewer
Most of these databases can be accessed by:
  sequence identifier
  keywords
  BLAST
The GenBank Release (Aug 16th 2010) requires roughly 451 GB (uncompressed sequence files only).
Translation? ex1 ex2

2 BLAST: Basic Local Alignment Search Tool
query: CGNLSTCMLGTYTQDFNKF-----HTFPQTAIGVGAP
       |.||. :.: : : :..| :| :
match: KCNTATCATQRLANFLVHSSNNFGAILSSTNVGSNTY   (the match is also called Sbjct or hit)
High-scoring Segment Pairs (HSPs): find all segment pairs whose scores cannot be improved by extension or trimming.
Reported for each HSP above the cutoff: score, E-value, P-value.
Substitution matrix:
  BLOSUM = Blocks Substitution Matrix
  PAM = Point/Percent Accepted Mutation
Gap insertion penalty and gap extension penalty
Smith-Waterman algorithm
Protein sequence database
Multiple alignment: ClustalW
Altschul SF et al. Basic Local Alignment Search Tool. J Mol Biol. 1990; 215: 403–410.
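The Smith-Waterman algorithm named on this slide can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses a flat match/mismatch score instead of a BLOSUM or PAM matrix, and a single linear gap penalty instead of separate insertion and extension penalties. The function name and the default score values are hypothetical, not taken from the slides, and BLAST itself uses a faster heuristic rather than this exhaustive dynamic programming.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b: returns (score, aligned_a, aligned_b).

    Assumptions (not from the slides): flat match/mismatch scores and a
    linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `smith_waterman(query, match)` with the two sequences shown on the slide returns the highest-scoring local segment pair under these toy scores; with a real substitution matrix the result would generally differ.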

3 Alignment score matrices
Example of BLOSUM62:
  start from a set of 'trusted' aligned protein sequences
  select pairs of sequences with less than 62% identity
  calculate the log-odds score $s(a,b) = \frac{1}{\lambda} \log \frac{p_{a,b}}{f_a f_b}$, where $p_{a,b}$ is the probability of finding residues a and b aligned and $f_x$ is the occurrence probability of amino acid x
BLOSUM80: more conserved sequences
BLOSUM40: more divergent sequences
Sean R Eddy 2004, Nature Biotechnology 22: 1035-6
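As a rough illustration of how one matrix entry follows from the pair frequency p(a,b) and the background frequencies f(a) and f(b), the sketch below computes a rounded log-odds score. The function name, the `scale` parameter (2 means half-bit units) and the input numbers in the usage line are assumptions made for this example; they are not values from the slides or from the real BLOSUM62 data.

```python
import math

def log_odds_score(p_ab, f_a, f_b, scale=2.0):
    """Rounded log-odds score for aligning residues a and b.

    p_ab  : probability of seeing a aligned with b in trusted alignments
    f_a, f_b : background occurrence probabilities of a and b
    scale : assumed scaling factor; 2.0 expresses the score in half-bits
    """
    odds = p_ab / (f_a * f_b)  # > 1 if the pair occurs more often than by chance
    return round(scale * math.log2(odds))

# Hypothetical numbers: a pair seen 4x more often than chance scores positive,
# a pair seen 4x less often than chance scores negative.
favoured = log_odds_score(0.04, 0.1, 0.1)
disfavoured = log_odds_score(0.0025, 0.1, 0.1)
```

A positive score means the pair is aligned more often than expected by chance, a negative score means less often.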

4 Evaluation of the similarity: E- and P-value
m: query size
n: database size
S: score
E-value: the expected number of HSPs with score at least S is $E = K m n e^{-\lambda S}$, where K and $\lambda$ depend on the database statistics (amino acid frequencies) and on the scoring system. K and $\lambda$ are estimated from the score distribution.
Bit scores: normalized scores $S' = (\lambda S - \ln K) / \ln 2$, so that $E = m n 2^{-S'}$.
P-value: the probability that the score S from the comparison of two unrelated sequences is at least x is $P(S \ge x) = 1 - e^{-E(x)}$. For small E-values, $P \approx E$.
[Figure: example of a score distribution fitted with the E-distribution]
[Figure: P-score distribution of the same data]
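The relations on this slide can be checked numerically. The sketch below is a direct transcription of the formulas; the function names are made up for this example, and the values of K, lambda, the raw score and the database size in the usage lines are illustrative placeholders, not parameters reported in the slides.

```python
import math

def evalue(score, m, n, K, lam):
    """Expected number of HSPs with raw score >= score: E = K m n e^(-lam S)."""
    return K * m * n * math.exp(-lam * score)

def bit_score(score, K, lam):
    """Normalized score S' = (lam S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)

def evalue_from_bits(bits, m, n):
    """Same E-value expressed with the bit score: E = m n 2^(-S')."""
    return m * n * 2.0 ** (-bits)

def pvalue(e):
    """P = 1 - e^(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e)

# Illustrative placeholders (assumptions): a 32-residue query, a database of
# 10 million residues, and made-up K and lambda.
m, n, K, lam, S = 32, 10_000_000, 0.041, 0.267, 60
e = evalue(S, m, n, K, lam)
e_bits = evalue_from_bits(bit_score(S, K, lam), m, n)
p = pvalue(e)
```

Both routes give the same E-value, which is the point of the bit score: it folds K and lambda into the score so that hits obtained with different scoring systems can be compared.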

5 Practical BLAST
The different BLAST programs:
  Program   Database                Query
  BLASTN    nucleotide              nucleotide
  BLASTP    protein                 protein
  BLASTX    protein                 translated nucleotide
  TBLASTN   translated nucleotide   protein
  TBLASTX   translated nucleotide   translated nucleotide
Databases:
  Species-specific genomes (not curated): choose one or more species or group at
  Protein database (curated):
Parameters:
  Cutoff E ≤ 0.01: conservative search
  Cutoff E ≤ 1: weak homologies
  Gap penalties (gap-open, gap-extend, ...): leave them as they are, to start with!
  Filter repetitive sequences: yes!
PSI-BLAST: an iterative BLAST program, to find distantly related proteins
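The program table can be read as a lookup from the type of the query and the type of the database to a program name. The helper below is a hypothetical convenience written for this example (it is not part of any BLAST distribution); `translate_both` is an assumed flag that distinguishes TBLASTX from BLASTN, since both take a nucleotide query and a nucleotide database.

```python
# (query type, database type) -> program, for the cases where at most one
# side is translated. Types refer to the sequences as stored, before translation.
_PROGRAMS = {
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "protein"): "BLASTX",    # query is translated
    ("protein", "nucleotide"): "TBLASTN",   # database is translated
}

def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the table on the slide."""
    key = (query_type, database_type)
    if translate_both:
        # TBLASTX compares translated query against translated database.
        if key != ("nucleotide", "nucleotide"):
            raise ValueError("TBLASTX needs a nucleotide query and database")
        return "TBLASTX"
    if key not in _PROGRAMS:
        raise ValueError(f"unknown sequence types: {key}")
    return _PROGRAMS[key]
```

For the calcitonin peptide used later in the slides, `choose_blast_program("protein", "protein")` selects BLASTP.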

6 More information
GenBank, Pubmed, Entrez
The NCBI handbook
More on bioinformatics: Bioinformatics for Human Biologists - course programme, winter
Expasy: UniProtKB protein database; protein analysis tools, Swiss-PDB Viewer, Swiss-Model
Protein DataBank: protein 3D structure, Protein workshop
Protein families (Pfam)

7 Example: calcitonin sequence
Expasy → UniProtKB → 'human calcitonin' → P01258 (CALC_HUMAN)
Retrieve the calcitonin peptide sequence in FASTA format:
>P01258|
CGNLSTCMLGTYTQDFNKFHTFPQTAIGVGAP
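Once the FASTA record has been retrieved, it is often convenient to split it into a header and a sequence before pasting it into BLAST or another tool. The parser below is a minimal sketch written for this example (the function name is hypothetical); it assumes plain FASTA text in which each record starts with a `>` line, and it keeps the header exactly as given, including the truncated form shown on the slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the leading '>'
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# The calcitonin record from the slide, header left as shown there.
calcitonin = parse_fasta(">P01258|\nCGNLSTCMLGTYTQDFNKFHTFPQTAIGVGAP\n")
```

The resulting sequence string can then be used as the query in a BLASTP search against a protein database.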

8 Graphical overview of BLAST results
The query sequence is represented by the numbered red bar at the top of the figure. Database hits are shown aligned to the query, below the red bar. Of the aligned sequences, the most similar are shown closest to the query. In this case, there are three high-scoring database matches that align to most of the query sequence. The next twelve bars represent lower-scoring matches that align to two regions of the query, from about residues 3–60 and residues 220–500. The cross-hatched parts of these bars indicate that the two regions of similarity are on the same protein, but that the intervening region does not match. The remaining bars show lower-scoring alignments. Mousing over a bar causes the definition line for that sequence to be shown in the window above the graphic.
The NCBI handbook, The BLAST Sequence Analysis Tool, Tom Madden

9 The UniProtKB database
Release 2010_09 of 10-Aug-2010
A curated database: SwissProt (A Bairoch et al.)
An automated database: TrEMBL
[Figure: sequence length distribution]
[Figure: organism distribution; H sapiens: 0.6%]
