Workshop OUTLINE
Part 1: Introduction and motivation; How does BLAST work?
Part 2: BLAST programs; Sequence databases; Work steps; Extract and analyze results



BLAST programs
All types of searches are possible. Query: DNA or protein. Database: DNA or protein.
- blastn: nucleotide query vs. nucleotide database
- blastp: protein query vs. protein database
- blastx: translated nucleotide query vs. protein database
- tblastn: protein query vs. translated nucleotide database
- tblastx: translated nucleotide query vs. translated nucleotide database
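The program table on this slide can be written as a small lookup. This is only an illustrative sketch: the function name `choose_blast_program` and the labels "dna"/"protein" are made up for the example and are not part of any BLAST interface.

```python
def choose_blast_program(query: str, database: str, translated: bool = False) -> str:
    """Pick a BLAST program name from the query and database types.

    `query` and `database` are "dna" or "protein" (hypothetical labels).
    `translated` only matters for DNA vs. DNA, where it selects tblastx
    (both sides translated) instead of blastn.
    """
    table = {
        ("dna", "dna"): "tblastx" if translated else "blastn",
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",    # query is translated
        ("protein", "dna"): "tblastn",   # database is translated
    }
    try:
        return table[(query, database)]
    except KeyError:
        raise ValueError(f"unknown combination: {query!r} vs. {database!r}") from None


if __name__ == "__main__":
    print(choose_blast_program("protein", "dna"))
```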

BLAST programs
Amino acid sequence: most suitable for homology search
The database and the query can each be either nucleotides or amino acids! We prefer amino acid sequences because:
- Amino acid sequences are more conserved.
- The alphabet has 20 letters: two random sequences share about 5% identity on average (compared with 25% for DNA sequences).
- Protein comparison matrices are more sensitive.
- Protein databases are smaller, so there are fewer random hits.
- We want to draw conclusions about structure, for which proteins are much more relevant.
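The 5% and 25% figures follow from the alphabet size if every letter is assumed to be equally frequent. A minimal sketch of that arithmetic (the uniform composition is an assumption made here; real proteins and genomes have skewed compositions, which raises the chance identity somewhat):

```python
def expected_random_identity(frequencies):
    """Probability that two independently drawn letters are identical.

    `frequencies` holds the background frequency of each letter and is
    expected to sum to 1.
    """
    return sum(p * p for p in frequencies)


# Assumption: uniform letter frequencies for both alphabets.
dna_uniform = [1 / 4] * 4
protein_uniform = [1 / 20] * 20

if __name__ == "__main__":
    print(f"DNA:     {expected_random_identity(dna_uniform):.2%}")
    print(f"Protein: {expected_random_identity(protein_uniform):.2%}")
```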

General Issues
Where? (to find homologues)
- Structural templates: search against the PDB
- Sequence homologues: search against SwissProt or UniProt (recommended!)
How many?
- As many as possible, as long as the MSA looks good (next week…)

General Issues
How long? (length of homologues)
- Fragments: short homologues (less than 50-60% of the query's length) = bad alignment
- Ensure your sequences contain the wanted domain(s)
- N/C termini tend to vary in length between homologues
How close? (distance from the query sequence)
- All too close: no information
- Too many too far: bad alignment
- Ensure that you have a balanced collection!
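The fragment rule can be sketched as a simple length filter. The function name `drop_fragments`, the dictionary layout and the default cutoff of 0.6 are assumptions for illustration; the slide only gives the 50-60% range.

```python
def drop_fragments(hits, query_length, min_fraction=0.6):
    """Keep only hits at least `min_fraction` as long as the query.

    `hits` maps a sequence name to its (ungapped) sequence string.
    """
    min_length = min_fraction * query_length
    return {name: seq for name, seq in hits.items() if len(seq) >= min_length}


if __name__ == "__main__":
    # Hypothetical hits for a 273-residue query.
    hits = {"full_length": "A" * 250, "fragment": "A" * 100}
    print(sorted(drop_fragments(hits, query_length=273)))
```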

General Issues
From whom? (which species the sequence belongs to)
- Don't care, all homologues are welcome
- Orthologues/paralogues may be helpful
- Sequences from distant/close species provide different types of information
Which method? (BLAST/PSI-BLAST)
- Depends on the protein, the available homologues, the goal in mind…

Sequence databases: where do we want to search?
DNA sequences:
- ESTs: a pool of sequences with no annotated coding sequence; the largest pool of sequence data for many organisms (NCBI)
- NR: all GenBank + EMBL + DDBJ + PDB sequences. No longer "non-redundant" due to computational cost.
- Genomes of specific organisms
- RefSeq: mRNA or genomic; an annotated collection from the NCBI Reference Sequence Project
- EMBL: Europe's primary nucleotide sequence resource (EBI)
- …

Sequence databases: where do we want to search?
Protein databases:
- PDB: the sequences of proteins for which structures are available
- NR (non-redundant): non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr
- RefSeq: sequences from the NCBI Reference Sequence project
- Proteins of specific organisms
- UniProt: SwissProt or TrEMBL
- …

Sequence databases: where do we want to search?
UniProt
UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). In 2002, the three institutes decided to pool their resources and expertise and formed the UniProt Consortium.

Sequence databases: where do we want to search?
UniProt
The world's most comprehensive catalog of information on proteins: sequence, function & more…
Comprised mainly of the databases:
- SwissProt (… last year, … protein entries now): high-quality annotation, non-redundant & cross-referenced to many other databases.
- TrEMBL (… last year, … protein entries now): computer translation of the genetic information from the EMBL Nucleotide Sequence Database → many proteins are poorly annotated, since only automatic annotation is generated.

Overall work steps
1. Run the search:
   1. Select database
   2. E-value threshold
   3. BLAST or PSI-BLAST: how many rounds?
2. Take out sequences:
   1. HSPs or full sequences
   2. Can (should!) filter out redundant sequences and sequences that are too short (fragments)
3. Usually: align the sequences; choose an alignment program
4. View the alignment with BioEdit or another program
5. Calculate trees, conservation scores (ConSeq), etc.
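Step 1.2 (the E-value threshold) can also be applied after the search, on a saved hit table. The sketch below assumes the results were saved in the standard 12-column tabular format of the BLAST+ command-line tools (`-outfmt 6`); the function name `significant_hits` and the default cutoff of 1e-4 are illustrative choices, not values from the slides.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def significant_hits(tabular_text, max_evalue=1e-4):
    """Return (subject id, E-value) pairs with E-value <= max_evalue."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(FIELDS, row))
        evalue = float(record["evalue"])
        if evalue <= max_evalue:
            hits.append((record["sseqid"], evalue))
    return hits


if __name__ == "__main__":
    # Two made-up hit lines, for illustration only.
    example = (
        "query\thitA\t98.0\t270\t5\t0\t1\t270\t1\t270\t2e-150\t500\n"
        "query\thitB\t30.0\t80\t50\t3\t10\t90\t5\t85\t0.5\t30\n"
    )
    for subject, evalue in significant_hits(example):
        print(subject, evalue)
```

Lowering `max_evalue` makes the collection stricter; raising it admits more distant (and more doubtful) homologues.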

Overall work steps
Multiple Sequence Alignment (MSA)
Perform alignment of a large collection of sequences. Many algorithms; the leading ones:
1. ClustalW
2. MUSCLE
3. T-COFFEE

Overall work steps
Examining BAliBASE 2005… (Edgar, R.C., 2004): MUSCLE is superior!

BLAST NCBI

BLAST NCBI: the well-known server
- All program types
- Many databases to choose from, both nucleotide and protein
- 12 genome-specific databases
- Can also look for conserved domains, SNPs and more…

BLAST NCBI: BLASTp

BLAST NCBI, BLASTp (figure labels): Query sequence; Database; Run

BLAST NCBI (figure labels): As many as possible; Matrix; E-value

BLAST NCBI (figure labels): Mark all; Mark only wanted

BLAST EBI

BLAST EBI
- Many databases, including UniProt
- Insert sequence
- Run
- Get the maximum number of alignments!

BLAST EBI (figure labels): Send sequences to ClustalW; Mark all or wanted; Get sequences

PSI-BLAST

PSI-BLAST NCBI (figure labels): Query sequence; Database; Run; PSI-BLAST

PSI-BLAST NCBI (figure labels): Pre-calculated PSSM; Threshold for inclusion in PSSM

PSI-BLAST NCBI (figure labels): Run next round; Include sequence in the PSSM; Not found in previous round

PSI-BLAST EBI (figure labels): Query sequence; Database; Run; Number of iterations

(PSI-)BLAST on ConSeq, extract sequence & align

PSI-BLAST on ConSeq
The ConSeq webserver
- Calculates evolutionary conservation scores that are then displayed on the sequence.
- Requires a Multiple Sequence Alignment (MSA); if not provided, it can create one automatically.
- Runs (PSI-)BLAST, extracts hits from the BLAST results, filters according to E-value and aligns the sequences.
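To give a feel for what a per-position conservation score looks like, here is a rough sketch that scores each MSA column by Shannon entropy (0 = fully conserved, higher = more variable). The slide does not describe how ConSeq computes its scores, so entropy is swapped in here purely as a simple stand-in; it ignores the phylogeny and should not be read as ConSeq's method.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Entropy for every column of an MSA given as equal-length strings."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(col) for col in zip(*alignment)]


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    msa = ["MKA", "MRA", "MKA", "MRA"]
    print(conservation_profile(msa))
```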

PSI-BLAST on ConSeq, the ConSeq webserver (figure labels): Query sequence

PSI-BLAST on ConSeq, the ConSeq webserver (figure labels): Alignment algorithm; Database (swissprot or uniprot); No. of homologues; Iterations; E-value

PSI-BLAST on ConSeq, the ConSeq webserver (figure labels): All BLAST hits; MSA

NCBI vs. EBI vs. ConSeq
Summary of web servers:
1. PSI-BLAST at NCBI
- Can control PSSM, included sequences & threshold
- All types of BLAST programs
- Not against UniProt; against SwissProt or NR
- Against RefSeq and NT
- Full sequences downloaded like BLAST
- Number of sequences up to 2000

NCBI vs. EBI vs. ConSeq
Summary of web servers:
2. BLAST at EBI
- Against UniProt or EMBL, not NR or specific genomes
- Can't control PSSM: just get the last round
- Download and align only full sequences
- The number of presented sequences is limited
- blastN, blastP, tblastN, tblastX

NCBI vs. EBI vs. ConSeq
Summary of web servers:
3. BLAST at ConSeq
- Get HSPs, not entire sequences!!!
- Only blastP
- Search uniprot/swissprot
- Still, can't control all options, such as redundancy and minimal length of HSP

(PSI-)BLAST via Max-Planck

1. Run (PSI-)BLAST
2. Either send HSPs or full sequences directly to an alignment program, or:
   - Forward HSPs to filtration via “BLAMMER”
   - Download the filtered sequences
   - Align the sequences via a program of choice

(PSI-)BLAST via Max-Planck
BLAST at Max-Planck
- Databases: swissprot, trembl, NR, env, pdb or any combination for proteins, but only NT for DNA
- All BLAST programs
- Main advantage: you can easily extract and filter the HSPs, in addition to full sequences.

The Query Protein
Name: Dihydrodipicolinate reductase
Enzyme reaction:
Molecular process: Lysine biosynthesis (early stages)
Organism: E. coli
Sequence length: 273 aa

The Query Protein
Query: DAPB_ECOLI
>DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAG
KTGVTVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQ
AIRDAAADIAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTA
LAMGEAIAHALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGE
RLEITHKASSRMTFANGAVRSALWLSGKESGLFDMRDVLDLNNL

(PSI-)BLAST via Max-Planck
- Choose a database or databases (select several using CTRL)
- Upload a sequence or an MSA

(PSI-)BLAST via Max-Planck
Save the PSI-BLAST result

(PSI-)BLAST via Max-Planck E-value threshold can be assessed using the distribution

Filter Results via Max-Planck Forward results to BLAMMER

Filter Results via Max-Planck
BLAMMER
Supposed to create MSAs from BLAST results; we will use it just to filter the results and then align them via MUSCLE or another known MSA program.
Filter according to:
- E-value
- Min. coverage: minimum percent of the query protein
- Max. redundancy: remove overly similar sequences
- Max. number of homologues, if wanted
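The redundancy and maximum-number filters can be pictured as a greedy pass over the hits, best first. This is not BLAMMER's implementation, only a sketch of the idea: the function names are hypothetical, the 0.95 cutoff is an arbitrary example, and `difflib.SequenceMatcher` is used as a crude stand-in for a real alignment-based identity.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_redundant(hits, max_similarity=0.95, max_hits=None):
    """Greedily keep hits that are not too similar to any hit already kept.

    `hits` is a list of (name, sequence) pairs, ordered best first
    (for example by increasing E-value).
    """
    kept = []
    for name, seq in hits:
        if max_hits is not None and len(kept) >= max_hits:
            break
        if all(similarity(seq, other) <= max_similarity for _, other in kept):
            kept.append((name, seq))
    return kept


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    hits = [("a", "MKTAYIAKQR"), ("b", "MKTAYIAKQR"), ("c", "GGGGGGGGGG")]
    print([name for name, _ in filter_redundant(hits)])
```

Because the pass is greedy, the order of the input decides which member of a redundant group survives, so sorting by E-value first keeps the best-scoring representative.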

Filter Results via Max-Planck (figure labels): Forwarded PSI-BLAST result; Filtering parameters

Filter Results via Max-Planck Save & then re-align!

Align the BLAST sequences

Align via Max-Planck

Align via Max-Planck
1. Forward BLAST to MUSCLE, MAFFT etc.
- Choose program
- Use hits or full sequences

2. Filter via BLAMMER and then align: upload the file downloaded from BLAMMER

Align via Max-Planck Alignment results: Save the alignment

Alignment viewing & editing
BioEdit
- Easy-to-use sequence alignment editor
- View and manipulate alignments of up to 20,000 sequences
- Four modes of manual alignment: select and slide, dynamic grab and drag, gap insert and delete by mouse click, and on-screen typing which behaves like a text editor
- Reads and writes GenBank, FASTA, Phylip 3.2, Phylip 4, and NBRF/PIR formats. Also reads GCG and Clustal formats.

Alignment viewing & editing: easiest using BioEdit

Alignment viewing & editing: easiest using BioEdit
- Find a specific sequence: “Edit-> search -> in titles”
- Erase/add sequences: “Edit-> cut\paste\delete sequence”
- “Sequence Identity matrix” under “Alignment”: useful for a rough evaluation of distances within the alignment.
- After taking out sequences, “Minimize Alignment” under “Alignment” removes unnecessary gaps.
- Can save an image using “File -> Graphic View” and then “Edit -> Copy page as BITMAP”.
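For readers who want to see what a sequence identity matrix contains, the sketch below computes one from an existing alignment. It is an illustration only: BioEdit's exact definition (for example, how gaps are counted) may differ from the convention assumed here, where columns that are gaps in both sequences are skipped and a gap against a residue counts as a mismatch.

```python
def pairwise_identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column is a gap in both sequences: not informative
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def identity_matrix(alignment):
    """Nested dict of pairwise identities for a {name: aligned sequence} MSA."""
    return {
        n1: {n2: pairwise_identity(s1, s2) for n2, s2 in alignment.items()}
        for n1, s1 in alignment.items()
    }


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    msa = {"s1": "MKT-A", "s2": "MKTLA", "s3": "MRT-A"}
    for name, row in identity_matrix(msa).items():
        print(name, [round(v, 2) for v in row.values()])
```

Pairs with very high identity are candidates for removal as redundant, while rows with uniformly low values flag sequences that may be too distant to align reliably.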

No “miracle solution”
Each sequence is a different story → adjust parameters:
- BLAST: E-value, substitution matrix, gap penalties, database, minimum length, redundancy level, fragment overlap…
- PSI-BLAST: BLAST parameters + PSSM inclusion threshold (or choose manually), number of rounds…
- Try using HSPs or full sequences, different MSA programs…

THANKS Some slides were taken from previous presentations by members of the Pupko lab and Prof. Beni Chor