Sequence similarity, BLAST alignments & multiple sequence alignments


Sequence similarity, BLAST alignments & multiple sequence alignments May 30, 2017

Sequence similarity
Why do we care? It is the workhorse of bioinformatics:
  Genome assembly & annotation
  Protein function prediction
  Phylogeny & evolution (metagenomics)

Most common methods
Pairwise alignment: BLAST
Multiple sequence alignment: ClustalW, MUSCLE
Protein domain profiles: PFAM, INTERPRO, PANTHER

Pairwise alignments
How are two sequences related to each other?
  Homologous: share a common ancestor
    Homology cannot be measured
    Measure similarity; infer homology
    Orthologs: separated by speciation
    Paralogs: separated by duplication
Are there gaps in one versus the other?
What is the percent similarity?
What is a significant alignment?

Multiple sequence alignments
Do these sequences share a core level of similarity?
  Can be used to build a profile for that family
  Protein domain profiles are used to annotate the function of genes in a newly sequenced genome
  Starting point for phylogenetic analyses

Pairwise alignment
First string  = a b c d e
Second string = a c d d e f
Two alignments of the first string against the second:
  Alignment 1:   a b c d - e -
  Second string: a - c d d e f
  Alignment 2:   a b c - d e -
Which alignment is better?
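
One way to approach that question is to score each alignment column by column. The sketch below is only illustrative: the scheme (+1 match, -1 mismatch, -1 gap), the function name and the way the slide's rows are written as strings are all assumptions, not part of the lecture.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum column scores for two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap          # a residue aligned against a gap
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total


# The second string with its gap, and the two gapped versions of the first string.
second = "a-cddef"
alignment_1 = score_alignment("abcd-e-", second)
alignment_2 = score_alignment("abc-de-", second)
print(alignment_1, alignment_2)
```

Under this particular scheme the two alignments receive the same score, which is one reason more biologically informed scoring is discussed next.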

Scoring schemes
A method of scoring matches, mismatches & gaps that is biologically relevant
Nucleotide alignments:
  Identity only, with a positive score for matches & a negative score for mismatches
  Or transitions (A -> G, T -> C) & transversions (purine -> pyrimidine) scored differently
  Transitions are more common and more likely to be silent
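
A nucleotide scheme that separates transitions from transversions might look like the following sketch. The score values (+1, -1, -2) and function names are hypothetical choices for illustration only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Return 'match', 'transition' or 'transversion' for two DNA bases."""
    x, y = x.upper(), y.upper()
    if x == y:
        return "match"
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score a base pair, penalising transversions more than transitions."""
    scores = {"match": match, "transition": transition, "transversion": transversion}
    return scores[classify_substitution(x, y)]


print(classify_substitution("A", "G"), nucleotide_score("A", "C"))
```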

Amino acid substitution matrices
Based on observed frequencies of amino acid distributions and substitutions
Model the conservative nature of substitutions
Implicitly represent evolutionary patterns
Scores are based on information theory

Amino acid   Symbol   Freq (%)
Leu          L        9.66
Ala          A        8.25
Gly          G        7.07
Val          V        6.87
Glu          E        6.75
Ser          S        6.56
Ile          I        5.96
Lys          K        5.84
Arg          R        5.53
Asp          D        5.45
Thr          T        5.34
Pro          P        4.7
Asn          N        4.06
Gln          Q        3.93
Phe          F        3.86
Tyr          Y        2.92
Met          M        2.42
His          H        2.27
Cys          C        1.37
Trp          W        1.08
Met and Trp have only 1 codon
Leu, Ser and Arg have 6 codons

Scoring amino acid substitutions
Amino acids share similarity based on chemical and physical properties
Not all substitutions are equally likely, due to physical/chemical constraints
  e.g. L -> I is much more conservative than L -> Y

Information theory
Information (H) associated with some probability p is the base 2 logarithm of the inverse of p: H = log2(1/p)
Values expressed as base 2 logarithms are given the unit bits
Information is described as a message of symbols
If there are n symbols and all n have an equal probability, then the probability of any symbol appearing is 1/n

Information Theory
If all symbols are NOT equally probable, then the entropy (H) is the negative sum over all n symbols of the probability of each symbol (p_i) multiplied by the log base 2 of that probability (log2 p_i): H = -sum_i p_i log2 p_i
The entropy of a normal coin is therefore: -( (0.5)(-1) + (0.5)(-1) ) = 1 bit
The entropy of a trick coin where heads comes up 3/4 of the time is: -( (0.75)(-0.415) + (0.25)(-2) ) = 0.81 bit
The entropy of random DNA is: -( (0.25)(-2) + (0.25)(-2) + (0.25)(-2) + (0.25)(-2) ) = 2 bits
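
The three worked entropies can be checked numerically. This is a minimal sketch; the function name is an assumption, and zero-probability symbols are skipped because they contribute nothing to the sum.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over all symbols."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy_bits([0.5, 0.5])
trick_coin = entropy_bits([0.75, 0.25])
random_dna = entropy_bits([0.25, 0.25, 0.25, 0.25])
print(fair_coin, round(trick_coin, 2), random_dna)
```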

Scoring matrices
S = score for an amino acid pairing in the alignment, a log-odds ratio: S_ij = log2( q_ij / (p_i * p_j) ), usually scaled and rounded to an integer
q_ij is the observed pairing frequency of amino acids i and j
p_i and p_j are the expected frequencies for amino acids i and j
Commonly observed substitutions: S > 0
Rarely observed substitutions: S < 0
Observed and random frequency the same: S = 0
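
The three cases (S > 0, S < 0, S = 0) follow directly from a log-odds ratio. The sketch below assumes the standard unscaled form S = log2(q_ij / (p_i * p_j)); published matrices additionally scale and round the values. The frequencies used are made-up numbers chosen only to show each case.

```python
import math


def log_odds_score(q_ij, p_i, p_j):
    """Log-odds score in bits: observed pair frequency over the chance expectation."""
    return math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies: chance expectation is 0.1 * 0.1 = 0.01.
common = log_odds_score(0.02, 0.1, 0.1)    # observed twice as often as chance
rare = log_odds_score(0.005, 0.1, 0.1)     # observed half as often as chance
neutral = log_odds_score(0.01, 0.1, 0.1)   # observed as often as chance
print(common, rare, neutral)
```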

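BLOSUM62 Matrix
BLOcks SUbstitution Matrices (BLOSUM) are based on protein alignments
The number is the percent identity at or above which sequences in an alignment block are clustered and counted as one when the matrix is built, so higher numbers suit more similar sequences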

Amino acid chemical relationships

BLOSUM62 Matrix (figure annotations)
Large positive: rare amino acids
Large negative: unlikely substitutions
Near zero: no penalty for substitutions

BLOSUM90 Matrix
Scores are more extreme than in BLOSUM62: positives more positive, negatives more negative
Based on blocks of aligned protein sequences that are at least 90% identical to another sequence in the block

Choosing a matrix
Matrix     Best use                                                                  Similarity (%)
BLOSUM90   Short alignments that are highly similar                                  70-90
BLOSUM80   Detecting members of a protein family                                     50-60
BLOSUM62   Most effective in finding all potential similarities (DEFAULT in BLAST)   30-40
BLOSUM45   Longer alignments of more divergent sequences                             <45

BLAST: Build a list of words from query sequence
Overall steps: Build word list from query sequence -> Find hits in database sequence -> Extend the hits to form HSPs -> Calculate statistical significance of matches
Build a list of words from the query sequence (word length 3 for proteins, 11 for DNA)
Evaluate each word for matches using the scoring matrix and discard all below threshold
Generally 50 matches per word
T value is the threshold; it determines sensitivity and speed of search

Query sequence: PSATPVLICWAAG
Word list: PSA ATP VLI CWA
Threshold score (T): 11
Matches to PSA   Score
PSA              15
PST              9
PDA              11
WSA              4
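
The word-list step can be sketched as below. The function names are hypothetical, the scores are copied from the slide rather than computed from a matrix, and keeping words that score at or above T (rather than strictly above) is an assumption.

```python
def build_word_list(query, word_size=3):
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def filter_by_threshold(scored_words, threshold):
    """Keep the words whose score is at least the threshold T."""
    return [word for word, score in scored_words.items() if score >= threshold]


query = "PSATPVLICWAAG"
words = build_word_list(query)

# Scores for matches to the word PSA, as listed on the slide.
slide_scores = {"PSA": 15, "PST": 9, "PDA": 11, "WSA": 4}
kept = filter_by_threshold(slide_scores, threshold=11)
print(words)
print(kept)
```

The slide lists only four of the query's words; the full overlapping list for a 13-residue query with word size 3 has 11 entries.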

BLAST: Find match for each word in database
Database is indexed so all possible words in all sequences are known
This search is very fast (500K words/sec)
Matches scoring at or above the threshold (T) are used as seeds for alignments

BLAST: Extend the hits to form HSPs
Extend the alignment from each word in both directions so long as the score increases
These alignments are the high-scoring segment pairs (HSPs)
Keep HSPs if the score is above a given threshold

Extending the hit
(1) Score of previous alignment (A) + score of new aligned pair = score of new alignment
    PSA/PSA (15) + C/C (9) = PSAC/PSAC (24)
(2) Score of previous alignment (B) + score of new aligned pair = score of alignment (C)
    PSAC/PSAC (24) + Y/W (2) = PSACY/PSACW (26)
(3) Repeat adding aligned pairs until the score goes down or the end of a sequence is reached.
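
The extension rule on this slide can be sketched for one direction as follows. This is a simplification under stated assumptions: it stops at the first pair that does not raise the score, whereas BLAST itself tolerates temporary dips up to a drop-off limit. The pair scores for C/C and Y/W come from the slide; the -1 default, the trailing residues G and K, and all names are hypothetical.

```python
# Pair scores taken from the slide's worked example.
SLIDE_PAIR_SCORES = {("C", "C"): 9, ("Y", "W"): 2}


def pair_score(x, y):
    """Look up a pair score; -1 is a hypothetical default for unlisted pairs."""
    return SLIDE_PAIR_SCORES.get((x, y), -1)


def extend_right(query, subject, q_pos, s_pos, seed_score, score_fn):
    """Extend a seed to the right while each new aligned pair increases the score."""
    score = seed_score
    while q_pos < len(query) and s_pos < len(subject):
        gain = score_fn(query[q_pos], subject[s_pos])
        if gain <= 0:
            break                 # the score would not increase, so stop
        score += gain
        q_pos += 1
        s_pos += 1
    return score, q_pos


# Seed PSA/PSA (score 15) ends at index 3 in both sequences.
final_score, query_end = extend_right("PSACYG", "PSACWK", 3, 3, 15, pair_score)
print(final_score, query_end)
```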

BLAST: Combine HSPs into a gapped alignment
Highest scoring HSPs are extended in both directions as long as score > threshold
Do NOT usually get an alignment over the ENTIRE length of the sequence

Score = 272 bits Identities = 135/310 (43%) Positives = 200/310 (64%) Expect = 2e-73

Significance of alignment
P = probability that the observed match could have happened by chance; values between 0 and 1
Expect value (E) = number of matches as good as the observed one that would be expected to appear by chance in a database of the size probed
E = P x size of the database
E values range from 0 to the size of the database
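
The relation between P and E stated here is simple enough to check directly. The function name and the example numbers are assumptions for illustration.

```python
def expect_value(p_value, database_size):
    """E = P x size of the database, following the slide's definition."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("P must lie between 0 and 1")
    return p_value * database_size


# A one-in-a-million chance match against a million-sequence database.
e_value = expect_value(1e-6, 1_000_000)
print(e_value)
```

Because P is bounded by 0 and 1, E is bounded by 0 and the database size, as the slide notes.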

When is an alignment significant?
Identify a true ortholog between species
  In a protein-protein alignment, E-values < 10^-25
  Are all the domains present in both?
  Does the number of exons match?
  Are the splice boundaries the same?
Annotation (transfer between species)
  E-values < 10^-25
Functional homolog?
  Protein alignment, E-values < 10^-10

Multiple Sequence Alignments (MSA)
Alignment of ≥ 3 sequences to bring as many similar characters into register as possible
Hypothetical model of mutations (substitutions, insertions & deletions)
Best represents the most likely evolutionary scenario
Cannot be unambiguously established

MSA: Motivation
Correspondence: which parts "do the same thing"
  Similar genes are conserved across widely divergent species, often performing similar functions
Structure prediction
  Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members
  Structure is more conserved than sequence
Create "profiles" for protein families
  Allow us to search for other members of the family
MSA is the starting point for phylogenetic analysis

Globin alignment

ClustalW Alignment
  *  identity
  :  high similarity
  .  low similarity
  -  gap in sequence
Amino acids are often color coded based on physico-chemical properties

MSA -> Profiles
Profile: a table that lists the frequencies of each amino acid in each position of a protein sequence
Frequencies are calculated from an MSA containing a domain of interest
Allows us to identify a consensus sequence
Derived scoring scheme allows us to align a new sequence to the profile
Profile can be used in database searches
  Find new sequences that match the profile
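
A profile of this kind can be sketched as one frequency table per alignment column. The toy alignment, the function names and the choice to count the gap character as an ordinary symbol are all assumptions; real profile tools also apply sequence weighting and pseudocounts, which are omitted here.

```python
from collections import Counter


def build_profile(aligned_sequences):
    """Return one {residue: frequency} table per column of the alignment."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def consensus(profile):
    """Pick the most frequent residue in each column."""
    return "".join(max(column, key=column.get) for column in profile)


# Hypothetical four-sequence alignment of a short domain.
toy_msa = ["ACDE", "ACDQ", "GCDE", "ACNE"]
profile = build_profile(toy_msa)
print(profile[0])
print(consensus(profile))
```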

Why not just use BLAST?
Database searches using a profile or position-specific scoring matrix (PSSM) are much more sensitive for detecting weak or distant relationships than database searches using a single sequence as the query
Information content is higher in a PSSM

Pairwise alignment

Position Specific Scoring Matrix (PSSM)

Where and how are profiles used?
Used extensively in defining functional domain profiles of proteins
Major protein domain databases: PFAM, InterPro, PANTHER
A PSSM can also be generated using PSI-BLAST and then used to search a different database to find remote homologs

This week's exercise
Using BLAST to identify the source of unknown DNA sequences
Using BLAST to identify the taxonomic distribution of a known sequence
Using BLAST to identify homologs of specific proteins in other species