Sequence Analysis Tools

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Sequence Alignment.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Sequence Similarity Searching Class 4 March 2010.
Heuristic alignment algorithms and cost matrices
Introduction to bioinformatics
Sequence Comparison Intragenic - self to self. -find internal repeating units. Intergenic -compare two different sequences. Dotplot - visual alignment.
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Multiple Sequence Alignments
Sequence alignment, E-value & Extreme value distribution
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Chapter 5 Multiple Sequence Alignment.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple sequence alignment
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
An Introduction to Bioinformatics
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
BLAST What it does and what it means Steven Slater Adapted from pt.
Protein Evolution and Sequence Analysis Protein Evolution and Sequence Analysis.
Protein Sequence Alignment and Database Searching.
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Construction of Substitution Matrices
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight.
Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
BIOINFORMATICS Ayesha M. Khan Spring Lec-6.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Pairwise Sequence Alignment and Database Searching
Bioinformatics for Research
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Sequence Based Analysis Tutorial
Basic Local Alignment Search Tool
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Sequence Analysis Tools Erik Arner Omics Science Center, RIKEN Yokohama, Japan arner@gsc.riken.jp

Aim of lecture Why align sequences? How are sequences aligned to each other? Variants Limitations Basic understanding of common tools for Similarity search Multiple alignment

Outline Sequence analysis Basics of sequence alignment Homology/similarity Basics of sequence alignment Global vs. local Computing/scoring alignments Substitution matrices Similarity search BLAST Multiple alignment ClustalW Show of hands: how many have used BLAST? Clustal? How many know how they work?

Sequence analysis Sequence analysis Inferring biological properties through Similarity with other sequences Properties intrinsic to the sequence itself Combination Sequence analysis often (always?) includes sequence alignment Sequence alignment methods fundamental part of bioinformatics

Sequence analysis Why aligning sequences? Similarity in sequence → similarity in function Similarity in sequence → common ancestry Homology = similarity due to shared ancestry Similar → important Selective pressure

Sequence analysis Wikipedia

Sequence analysis Example: comparative genomics – when new genome is sequenced, a lot of the annotation work is done at a button press. Fig from Genome Biology (http://genomebiology.com/2006/7/s1/S11/)

Sequence analysis Similarity ≠ homology Similarity = factual (% identity) Homology = hypothesis supported by evidence

Sequence analysis Similarity ≠ homology Similarity = factual (% identity) Homology = hypothesis supported by evidence … but in many cases, similarity is the only tool we have accessible Need a measure of the significance of the similarity

Basics of sequence alignment Global vs. local alignment Global Assumes sequences are similar across entire length Local Allows locally similar sub-regions to be pinpointed Introns/exons Protein domains

Basics of sequence alignment Which one is correct?

Basics of sequence alignment Which one is correct? Both? None? In sequence alignment, you get what you ask for

Basics of sequence alignment Other types of alignment Glocal Overlaps in shotgun sequencing Structural thioredoxins from human and Drosophila, Wikipedia

Basics of sequence alignment Computing alignments Dynamic programming Needleman – Wunsch (global alignment) Smith – Waterman (local alignment) For a given pair of sequences and a scoring scheme, find the optimal alignment Several may exist

Basics of sequence alignment Scoring alignments Simple example Match = +1 Mismatch = -1 Gap = -1 ATGC AGTC ATGC AGTC = 0   A T G C +1 -1 ATG-C A-GTC = 1

Basics of sequence alignment Scoring alignments Simple example Match = +1 Mismatch = -1 Gap = -2 ATGC AGTC ATGC AGTC = 0   A T G C +1 -1 ATG-C A-GTC = -1

Basics of sequence alignment In sequence alignment, you get EXACTLY what you ask for Heavily penalized gaps → less gaps in alignment Heavily penalized mismatches → more gaps in alignment

Basics of sequence alignment Substitution matrices DNA scoring mostly straightforward More clever scoring for protein sequences Biochemical properties Lower penalties for substitutions into amino acids with similar properties Low penalty for isoleucine(I) → valine(V) subsitution – both hydrophobic Observed substitution frequencies Multiple alignments of proteins known to share ancestry and/or function Insert pic of simple DNA subst matrix

Basics of sequence alignment Common substitution matrices PAM BLOSUM BLOSUM62 most widely used Default in BLAST Recent paper discovered bug in BLOSUM62… …but buggy matrix performs “better”! Insert pic of BLOSUM62 Graphic from EBI Paper in Nature Biotechnology

Basics of sequence alignment Gap penalties Gaps generally considered to cause greater disruption of function than mismatches Gap open penalty Gap extension penalty What matrix to use?

Similarity search Premise: The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypothesis concerning relatives and function. Abundance of biological sequence data forbids extensive searches All nucleotides/amino acids in query sequence cannot be compared to all aa:s/nt:s in database Fast searches are achieved using methods that trade off sensitivity for speed and specificity NCBI

Similarity search General approach: A set of algorithms (e.g. BLAST) are used to compare a query sequence to all the sequences in a specified database Comparisons are made in a pairwise fashion Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared The higher the score, the greater the degree of similarity Alignments can be global or local (BLAST: local) Discriminating between real and artifactual matches is done using an estimate of probability that the match might occur by chance Similarity, by itself, cannot be considered a sufficient indicator of function http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html

Similarity search – BLAST A set of sequence comparison algorithms introduced in 1990 Breaks the query and database sequences into fragments ("words"), initially seeks matches between fragments Initial search is done for a word of length "W" that scores at least "T" when compared to the query using a given substitution matrix Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S“ "W" parameter dictates the speed and sensitivity of the search

Similarity search – BLAST

Similarity search – BLAST Scoring Unitary matrix used for DNA Only identical nucleotides give positive score Substitution matrices are used for amino acid alignments BLOSUM62 is default Non-identical amino acids may give positive score Gaps Gap scores are negative The presence of a gap is ascribed more significance than the length of the gap A single mutational event may cause the insertion or deletion of more than one residue Initial gap is penalized heavily, whereas a lesser penalty is assigned to each subsequent residue in the gap No widely accepted theory for selecting gap costs It is rarely necessary to change gap values from the default

Similarity search – BLAST Significance of hits P value Given the database size, the probability of an alignment occurring with the same score or better Highly significant P values close to 0 Expectation value The number of different alignments with equivalent or better scores that are expected to occur in a database search by chance The lower the E value, the more significant the score Human judgment

Similarity search – BLAST BLAST at NCBI http://blast.ncbi.nlm.nih.gov

Similarity search – BLAST BLAST at NCBI http://blast.ncbi.nlm.nih.gov

Similarity search – BLAST BLAST at NCBI http://blast.ncbi.nlm.nih.gov First consider your research question: –Are you looking for an orthologin a particular species? •BLAST against the genome of that species. –Are you looking for additional members of a protein family across all species? •BLAST against nr, if you can’t find hits check wgs, htgs, and the trace archives. –Are you looking to annotate genes in your species of interest? •BLAST against known genes (RefSeq) and/or ESTsfrom a closely related species.

Similarity search – BLAST BLAST at NCBI http://blast.ncbi.nlm.nih.gov

Multiple alignment Why align multiple sequences? Determine evolutional relationship between sequences → species Phylogenetics Identify domains PWM:s Pinpoint functional elements Highly conserved amino acids among more divergent ones → catalytic activity?

Multiple alignment Multiple alignment algorithms Finding optimal alignment is very time consuming Exponential complexity Approximations and heuristics used for speeding up Heuristics: "rules of thumb", educated guesses, intuitive judgments or simply common sense (from Wikipedia) Progressive alignment GIGO

Multiple alignment – ClustalW Basics of progressive algorithm All sequences are compared to each other pairwise A guide tree is constructed, where sequences are grouped according to pairwise similarity The multiple alignment is iteratively computed, using the guide tree

Multiple alignment – ClustalW

Multiple alignment – ClustalW Heuristics Individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions

Summary Know your parameters Defaults are good choices in most cases However, be aware of what they mean You get what you ask for

Sequence analysis tools EMBOSS Suite of tools for various analysis tasks ORF finding, alignment, secondary structure prediction... http://www.ebi.ac.uk/emboss/ http://emboss.sourceforge.net/

Sequence analysis tools ExPASy Comprehensive collection of protein analysis webtools http://www.expasy.ch/

Sequence analysis tools EBI SRS One-stop shop for sequence searching to analysis http://srs.ebi.ac.uk/

Sequence analysis Berkeley – Understanding evolution (http://evolution.berkeley.edu/evolibrary/article/similarity_hs_01)