Bioinformatics for Research


Bioinformatics for Research
Module 1: Sequence Alignment
September 1, 2015
Mainlab Bioinformatics, Washington State University

Learning Outcomes
- Understand what sequence alignment is and how it is useful
- Understand the different types of sequence alignment and when you might use them
- Understand the importance of homology
- Understand scoring matrices and when to use them
- Understand that there are different alignment algorithms
- Gain a basic understanding of BLAST and its different flavors

What is Sequence Analysis?
A sequence is ___________________________________
A biological sequence is __________________________
Sequence analysis in bioinformatics refers to __________
______________________________________________

Concept of Sequence Alignment
- An alignment is a mutual arrangement of two sequences.
- It shows where two sequences are similar and where they differ.
- An 'optimal' alignment has the most correspondences and the fewest differences.
- Sequences that are similar probably have the same function (descent from a common ancestor).
- Sequence alignment involves identifying the correct location of deletions and insertions that have occurred in either of the two lineages since divergence from a common ancestor.

Sequence Alignment
- Sequence alignment is the procedure of comparing, at the residue level, two sequences (pairwise alignment) or more than two sequences (multiple sequence alignment) by searching for a series of common features (characters or patterns) that occur in the same order in the sequences.
- Sequence alignment is useful for discovering functional, structural and evolutionary information in biological sequences.

Importance of Homology!
- Homology strongly suggests that molecules have similar structure and function.
- Significantly similar molecular sequences are very unlikely to occur by chance.
- Significant similarity between sequences implies that the sequences/structures are homologous, i.e. at some point in the past they shared a common ancestor, and therefore they are likely to share structure and function.
- Differences between families of species resulted from mutations during the course of evolution.
- Most of these changes are due to local mutations in nucleotide sequences.

Orthologs: Sharing a Common Ancestor
- Orthologs occur in separate species and descend from a common ancestor.
[Diagram: over time, Ancestor A gives rise to A1 in Descendent 1 and A2 in Descendent 2]

Paralogs: Sharing a Common Ancestor
- Paralogs arise from gene duplication independent of speciation (genome duplication).
[Diagram: over time, Ancestor A gives rise to A1 and A2 within the same Descendent]

Homology
- Homology designates a qualitative relationship of common descent between entities.
- Two genes are either homologs or they are not!
- It doesn't make sense to say "two genes are 43% homologous", just as it doesn't make sense to say "Jane is 43% diabetic".

Sequence Alignment
- Changes in sequence may be found in alignments due to divergence from the common ancestor. These changes are categorized as substitutions, insertions and deletions.
- A primary use of sequence alignment is to determine whether two sequences are sufficiently similar to declare them homologous and therefore likely to share similar structure and function.
- What else can we use sequence alignment for?
______________________________________
______________________________________
______________________________________

Sequence Alignment
What does sequence identity mean? ___________________
_________________________________________________
What does sequence similarity mean? __________________
_________________________________________________
What is sequence homology? _________________________
_________________________________________________

Sequence Alignment Methods

Method: Dot Matrix Plot
Use: General exploration of your sequence
- Discovering repeats
- Finding rearrangements
- Predicting regions of self-complementary RNA
- Extracting portions of sequence to make a multiple alignment

Method: Global Alignments
Use: Comparing two sequences over their entire length
- Identifying long insertions/deletions
- Checking the quality of your data
- Identifying every mutation in your sequence

Method: Local Alignments
Use: Comparing sequences with partial homology
- Making high-quality alignments
- Making residue-per-residue analysis

Dot Matrix Plots
- Also known as dot plots, they are the simplest method of evaluating similarity between two sequences.
- A dot plot identifies all possible matches of residues between the two sequences.
- One sequence (A) is listed horizontally (top of page) and the other sequence (B) vertically (left side of page).
- Starting with the first character in B, the comparison moves across the row and places a dot in the plot wherever the two sequence elements are the same.
- Adjacent regions of identity between the two sequences produce diagonal lines of dots in the plot.

Dot Matrix Plots
- A diagonal line always appears when a sequence is compared to itself.
- Detection of matching regions can be improved by filtering out random matches, for example by increasing the window size.
- Filtering is done with a sliding window that compares the two sequences a stretch at a time.
- Window size: the number of characters to compare.
- Stringency: the number of characters within the window that have to match exactly.
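The sliding-window filter described on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the code of any particular dot plot program: the function names `dot_plot` and `render`, and the short test sequences, are made up for the example.

```python
def dot_plot(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (row, col) positions where a dot is placed.

    seq_a runs horizontally (columns) and seq_b vertically (rows).
    A dot is placed at (i, j) when at least `stringency` of the `window`
    characters starting at seq_b[i] and seq_a[j] are identical.
    """
    dots = set()
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the left."""
    lines = ["  " + " ".join(seq_a)]
    for i, ch in enumerate(seq_b):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_a)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    # Self-comparison: the main diagonal is always present.
    print(render("ACGT", "ACGT", dot_plot("ACGT", "ACGT")))
```

With `window=1` every single-character match gets a dot; raising `window` and `stringency` removes isolated dots and leaves only runs that match over a stretch.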

Dot Matrix Plots: Window Size
- A larger window size is used for DNA sequences than for proteins, because random matches are far more frequent with only 4 DNA characters than with 20 amino acid characters.
- For DNA sequence comparisons, use long windows and high stringencies, e.g. a window of 15 and a stringency of 10.
- For protein comparisons, use short windows and low stringencies, except when looking for a short domain in a partially similar sequence.
[Figure: a window of size 15 sliding along the sequence GAACTCATACGAATTCACATTAGAC]
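To see why the alphabet size matters, the chance that two unrelated windows pass the filter can be estimated with a binomial tail. The sketch below assumes that every character is equally likely and that positions are independent, which real sequences only approximate; `chance_of_dot` is a hypothetical helper name.

```python
from math import comb


def chance_of_dot(window, stringency, alphabet_size):
    """Probability that at least `stringency` of `window` positions match
    by chance, assuming uniform and independent characters."""
    p = 1.0 / alphabet_size  # chance that one position matches
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


if __name__ == "__main__":
    # Single-character comparison: DNA versus protein.
    print(chance_of_dot(1, 1, 4), chance_of_dot(1, 1, 20))
    # The DNA setting suggested on the slide: window 15, stringency 10.
    print(chance_of_dot(15, 10, 4))
```

Under these assumptions a single DNA position matches by chance a quarter of the time, while a 15-character window with stringency 10 passes by chance far less often, which is the effect the filter relies on.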

Dot Matrix Plots
Try this with the example THEFATCAT as sequence 1 and THEFASTCAT as sequence 2. What does the result tell you?
Sequence 1: T H E F A T C A T
Sequence 2: T H E F A S T C A T

Sequence Analysis: Dot Plot Programs
http://emboss.bioinformatics.nl/
- dottup: displays a wordmatch dotplot of two sequences
- dotmatcher: draws a threshold dotplot of two sequences
- dotpath: draws a non-overlapping wordmatch dotplot of two sequences
- polydot: draws dotplots for an all-against-all comparison of a sequence set

Types of Sequence Alignment: Global
- Global alignments stretch over the entire sequence length and include as many matching residues as possible.
G G S D N W S A - T I P G G N – R A W A A M N P A
- Used to align two closely related sequences of similar length.
- Useful for checking minor differences between two sequences, analyzing polymorphism between closely related species, and comparing two sequences that partially overlap.
- Algorithm: Needleman-Wunsch
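As a rough illustration of how a global alignment is computed, here is a compact Needleman-Wunsch sketch. The scoring values (match +1, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen for readability, not values from the slides; real tools use substitution matrices and affine gaps.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("THEFATCAT", "THEFASTCAT")
    print(s)
    print(x)
    print(y)
```

Because the whole table is filled and the traceback starts in the corner, every residue of both sequences ends up in the alignment, which is what makes it global.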

Types of Sequence Alignment: Local
- Higher priority is given to aligning local regions of high similarity than to extending the alignment to neighboring residues with lower scores.
- - - - - D T G A - - - - -
- - - - - D T G A - - - - -
- Dynamic programming: the Smith-Waterman algorithm provides the best possible alignment, but is slow!
- Heuristic methods: BLAST and FASTA use fast approximate methods to align two sequences.
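The local case differs from the global one in two small ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell instead of the corner. The following Smith-Waterman sketch shows this; the scoring values and the two flanked test sequences are assumptions for illustration only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a fresh local alignment
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("KKKDTGAKKK", "PPPDTGAPPP"))
```

Only the shared DTGA core is reported; the dissimilar flanking residues are left out of the alignment rather than forced into it.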

Types of Sequence Alignment: Local (cont.)
- Heuristic algorithms are empirical: they use rules of thumb to align.
- They are much faster than dynamic programming algorithms, so they are better suited to database searches.
- Unlike dynamic programming algorithms, they do not guarantee an optimal alignment.
- Question: you have found a homolog to your unknown gene of interest using BLAST. What might you do to optimize the alignment?

Alignment Algorithms Require a scoring system for evaluating match or mismatch of 2 characters (aa or nt)

Substitution Matrices
Likelihood of one amino acid mutating into another over evolutionary time
- Negative score: unlikely to happen (e.g., Gly/Trp, -7)
- Positive score: conservative substitution (e.g., Lys/Arg, +3)
- High score for identical matches: rare amino acids (e.g., Trp, Cys)

Scoring Matrices
- There are several of them, and the choice can affect the outcome.
- Values are proportional to the probability that one amino acid mutates into another.
- They can be based on chemical similarity, functional similarity, structure, evolutionary similarity, etc.
Common matrices for protein comparisons: PAM (Point Accepted Mutation)
- Based on global alignments of closely related proteins that are at least 85% identical, and on an implicit model of evolution.
- The PAM250 matrix is widely used.
- PAM is not necessarily good for identifying relationships in highly divergent species.
- Does not account for conserved blocks or motifs.

Scoring Matrices: BLOSUM Matrices
- Look only for differences in conserved, ungapped regions of a protein family.
- Directly calculated, using no extrapolations.
- More sensitive to structural or functional substitutions.
- Generally perform better than PAM matrices for local similarity searches (Henikoff and Henikoff, 1993).
BLOSUM62
- Every possible identity and substitution is assigned a score based on the observed frequencies of such occurrences in alignments of related proteins.
- BLOSUM62 is calculated from conserved blocks in which sequences more than 62% identical are clustered together, so its scores reflect substitutions between sequences that are at most 62% identical.
- Also BLOSUM 32 or 80.
- Default for BLAST.
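The idea of scoring a pair by its observed frequency can be made concrete with a log-odds calculation: compare how often a pair is seen in alignments of related proteins with how often it would be seen by chance. The sketch below uses the half-bit scaling (2 x log2) of the BLOSUM family; the frequencies are invented for illustration, and treating the observed value as an ordered-pair frequency is a simplifying assumption.

```python
from math import log2


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2):
    """Rounded log-odds score for a residue pair.

    observed_pair_freq: how often the pair is seen aligned (hypothetical input)
    freq_a, freq_b:     background frequencies of the two residues
    scale:              2 gives half-bit units
    """
    expected = freq_a * freq_b  # chance of the pair if residues were unrelated
    return round(scale * log2(observed_pair_freq / expected))


if __name__ == "__main__":
    # Pair seen four times more often than chance: positive score.
    print(log_odds_score(0.01, 0.05, 0.05))
    # Pair seen four times less often than chance: negative score.
    print(log_odds_score(0.000625, 0.05, 0.05))
```

A positive score means the substitution is seen more often than chance in related proteins (conservative), a negative score means it is seen less often, and zero means no information either way.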

Scoring Matrices

Alignment Algorithms Require
- A penalty function for gaps in sequences.
- A method for finding an optimal pairing of sequences (which may introduce gaps to optimize the score).
- A gap is a space introduced into an alignment to compensate for insertions and deletions in the sequences being compared.
- For each gap introduced there is a penalty, and extending the gap further increases the penalty.
- Score the residue matches and score the residue gaps:

AGGVLIIQVG
||||||xxxx
AGGVLIQVG-

AGGVLIIQVG
|||||x||||
AGGVL-IQVG

AGGVLIIQVG
||||||x|||
AGGVLI-QVG
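The three candidate alignments on this slide can be compared by scoring each column and adding the results. The sketch below uses an opening penalty for the first position of a gap and a smaller penalty for each further position; all numeric values are assumptions for illustration, and a gap is treated as continuing regardless of which sequence it is in, which is a simplification.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two equal-length aligned strings that use '-' for gap positions."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First gap position pays the opening cost, later ones the extension cost.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += match if x == y else mismatch
            in_gap = False
    return total


if __name__ == "__main__":
    query = "AGGVLIIQVG"
    for candidate in ("AGGVLIQVG-", "AGGVL-IQVG", "AGGVLI-QVG"):
        print(candidate, score_alignment(query, candidate))
```

Placing the gap at the end leaves three mismatched columns, while placing it inside the I-I pair keeps every other residue matched, so the two internal placements score higher under these values.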

Heuristic Algorithms
- FASTA: based on k-tuples (2 amino acids)
- BLAST: triples of conserved amino acids
- Gapped BLAST: allows gaps in segment pairs
- PHI-BLAST: Pattern-Hit Initiated search
- PSI-BLAST: Position-Specific Iterated search
What is a heuristic algorithm?

FASTA Algorithm at EBI http://www.ebi.ac.uk/Tools/fasta/index.html

BLAST Algorithms
BLAST (Basic Local Alignment Search Tool)
- Searches a sequence against a database.
- Extremely fast.
- Robust.
- Most widely used.
- It finds very short segment pairs between the query and a sequence in the database.
- These segments are then extended in both directions until the maximum possible score for that particular segment is reached.
- Available at NCBI, EBI and many other community database sites.
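The seed-and-extend idea on this slide can be sketched in a few lines. This is a deliberately simplified illustration and not the BLAST implementation: it seeds only on exact words, extends without gaps to the ends of the sequences and keeps the best-scoring extent, whereas BLAST also uses similar ("neighborhood") words, a substitution matrix and an early stopping rule. The names `find_seeds` and `extend_seed`, the scores and the test sequences are assumptions.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=3, match=1, mismatch=-1):
    """Extend a seed without gaps in both directions; keep the best extent."""
    best = score = word_size * match
    best_right = 0
    k = 0
    while i + word_size + k < len(query) and j + word_size + k < len(subject):
        same = query[i + word_size + k] == subject[j + word_size + k]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
    score = best  # continue leftwards from the best right-hand extent
    best_left = 0
    k = 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        same = query[i - k - 1] == subject[j - k - 1]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
    q_seg = query[i - best_left:i + word_size + best_right]
    s_seg = subject[j - best_left:j + word_size + best_right]
    return best, q_seg, s_seg


if __name__ == "__main__":
    q, s = "MKTDTGAWL", "PPRDTGAQQ"
    for i, j in find_seeds(q, s):
        print((i, j), extend_seed(q, s, i, j))
```

Looking words up in an index instead of filling a full dynamic programming table is what makes this kind of search fast, at the price of missing alignments that contain no shared word.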

BLAST
A BLAST search has five components: query, database, program, search purpose/goal and results interpretation.
- Query: a sequence that you want to find out more information about.
- Database: you need to know what databases are available (NCBI, EBI, etc.).
- Program: which program to select to meet your specific purpose.
- Interpreting your results: what do they mean?

NCBI Protein Databases

NCBI Nucleotide Databases


BLAST Program Selection Nucleotide queries Protein queries http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide

BLAST Program Selection Specialized queries Protein queries http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide

Optional Parameters in blastn

Basic Optional Parameters in BLASTP

Advanced BLAST Parameters
The default parameters are not always the right parameters for your search; the right choice depends on your question.
- G: cost to open a gap (default = 5 for nucleotides, 11 for proteins)
- E: cost to extend a gap (default = 2 for nucleotides, 1 for proteins)
- q: penalty for a nucleotide mismatch (default = -3)
- r: reward for a nucleotide match (default = 1)
- e: expectation value (default = 10)

BLAST Choice of Programs
- MEGABLAST is specifically designed to efficiently find long alignments between very similar sequences, and is thus the best tool for finding an identical match to your query sequence.
- Discontiguous MEGABLAST is better at finding nucleotide sequences that are similar, but not identical, to your nucleotide query. It uses a non-contiguous word within a longer template window. In coding mode, third-base wobble is taken into account by focusing on matches at the first and second codon positions while ignoring mismatches at the third position. At the same word size, discontiguous MEGABLAST is more sensitive and efficient than standard blastn.

BLAST Choice of Programs
- "Search for short nearly exact matches" is useful for primer or other short nucleotide searches. Short sequences (less than 20 bases) will often not find any significant matches to database entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons are that the significance threshold governed by the Expect value parameter is set too stringently and the default word size parameter is set too high. You can adjust both the word size and the Expect value on the standard BLAST pages to work with short sequences.
- A common use of this page is to check the specificity of PCR or hybridization primers. A useful way to check a pair of PCR primers is to first concatenate them by inserting a string of 20 or more N's between the two primers, and then search the concatenated pair as one sequence. Since BLAST looks for local alignments and automatically searches both strands, there is no need to reverse complement the reverse primer before doing the concatenation or the search.
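The primer-pair trick described on this slide amounts to a simple string operation. The sketch below builds the concatenated query; the helper name `concatenate_primers` and the two primer sequences are made up for the example, and the resulting string would still need to be pasted into or submitted to a BLAST search separately.

```python
def concatenate_primers(forward, reverse, spacer_length=20):
    """Join two primers with a run of N's so they can be searched as one query.

    The reverse primer is used as written; it is not reverse complemented.
    """
    if spacer_length < 20:
        raise ValueError("use a spacer of 20 or more N's")
    return forward + "N" * spacer_length + reverse


if __name__ == "__main__":
    # Hypothetical primer sequences, for illustration only.
    query = concatenate_primers("ATGCGTACGTTAGCCTAGGA", "TTAGCGGATCCATGCAAGTC")
    print(query)
    print(len(query))
```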

BLAST Choice of Programs
- Use the Trace Archive BLAST page to search raw primary sequence trace files. The sequence data come from a variety of projects and sequencing strategies, including Whole Genome Shotgun (WGS), BAC end sequencing, and EST sequencing. The trace data are single-pass sequencing reads that have not been trimmed for quality or vector contamination. Their average lengths are between 500 and 700 bp.
- Standard protein-protein BLAST (blastp) is designed for protein searches. It is used both for identifying a query amino acid sequence and for finding similar sequences in protein databases. Like other BLAST programs, blastp is designed to find local regions of similarity. When sequence similarity spans the whole sequence, blastp will also report a global alignment, which is the preferred result for protein identification purposes.
- PSI-BLAST is designed for more sensitive protein-protein similarity searches. Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins or new members of a protein family. Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits or returned hits with descriptions such as "hypothetical protein" or "similar to...".

Database Search Questions
- What database should I search?
- What kind of sequences should I search with?
- What E-value is significant?
- What can I reliably infer about the function of my sequence based on homology?

Databases
- Bigger databases have more sequences.
- Bigger databases are also more redundant, which can skew the statistics.
- Bigger databases are often more poorly annotated (homology with an "unidentified sequence" doesn't really tell you much).
- Bigger databases take a lot of time to search.
- Smaller databases (like Swiss-Prot) are often better curated and annotated.

Databases
- Smaller databases are much less redundant.
- Smaller databases can be restricted to phylogenetically relevant sequences (e.g. all plant).
- Smaller databases are much faster to search.

What is a significant E-value?
- For a single search, an E-value of 10^-3 is significant, though the match is typically quite distant.
- For multiple searches, the E-value cutoff varies according to the number of searches.

Multiple Sequence Searches
e.g. 15,000 EST query sequences
- An E-value cutoff of 10^-3 means that you should expect one false positive in 1,000 searches.
- Thus with 15,000 searches, we should expect 15 false positives with a cutoff of 10^-3.
- To reduce the chance of identifying a false positive, set the E-value cutoff lower.
- For 15,000 searches, an E-value cutoff of 10^-5 means that you should expect 0.15 false positives.
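The arithmetic on this slide is just the number of searches multiplied by the cutoff. A small sketch makes it easy to try other batch sizes; the function names are hypothetical, and the calculation treats each search as contributing the cutoff's worth of expected false positives, as the slide does.

```python
def expected_false_positives(n_searches, e_value_cutoff):
    """Expected number of false positives across a batch of searches."""
    return n_searches * e_value_cutoff


def cutoff_for_target(n_searches, target_false_positives):
    """E-value cutoff that keeps the expected false positives at the target."""
    return target_false_positives / n_searches


if __name__ == "__main__":
    print(expected_false_positives(15_000, 1e-3))
    print(expected_false_positives(15_000, 1e-5))
    print(cutoff_for_target(15_000, 1))
```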

Multiple Sequence Searches
In general:
- DNA to DNA alignment: for nucleotide sequences at least 100 bp long, if 70% of your nucleotides are identical to your match sequence, the sequences can be considered homologous.
- AA to AA alignment: for amino acid sequences at least 100 aa long, if 25% of your amino acids are identical to your match sequence, the sequences can be considered homologous.
- Below these values, the alignments are considered to be in the twilight zone!
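These rules of thumb can be written as a small check on an existing alignment. The sketch below assumes that percent identity is counted over all alignment columns, including gap columns, which is only one of several conventions in use; the function names are hypothetical and the thresholds are the ones stated on the slide.

```python
def percent_identity(top, bottom):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(top) != len(bottom) or not top:
        raise ValueError("need two non-empty aligned strings of equal length")
    identical = sum(x == y and x != "-" for x, y in zip(top, bottom))
    return 100.0 * identical / len(top)


def passes_rule_of_thumb(top, bottom, kind):
    """Apply the slide's thresholds: 70% for 'dna', 25% for 'protein', length >= 100."""
    threshold = {"dna": 70.0, "protein": 25.0}[kind]
    return len(top) >= 100 and percent_identity(top, bottom) >= threshold


if __name__ == "__main__":
    print(percent_identity("AGGVL-IQVG", "AGGVLIIQVG"))
```

An alignment that fails the check is not shown to be unrelated; it simply falls in the range where identity alone cannot settle the question.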