Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University 2015-12-09.


Sequence alignment
A way of arranging two or more sequences to identify regions of similarity
Shows the locations of similarities and differences between the sequences
An 'optimal' alignment exhibits the most similarities and the fewest differences
Aligned residues are assumed to derive from the same residue in the common ancestor
Insertions and deletions are represented by gaps in the alignment
Examples
Protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***
Nucleotide sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    * ****
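The row of asterisks under an alignment can be derived mechanically from the two aligned rows. The sketch below is only an illustration: the function names `match_line` and `percent_identity` are made up here, and identity is computed over all alignment columns, which is one of several conventions in use.

```python
def match_line(row_a: str, row_b: str, gap: str = "-") -> str:
    """Return '*' under every column where both rows hold the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "*" if a == b and a != gap else " "
        for a, b in zip(row_a, row_b)
    )


def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Identical columns as a percentage of all alignment columns."""
    return 100.0 * match_line(row_a, row_b, gap).count("*") / len(row_a)


# The nucleotide example from the slide.
row_a = "attcgttggcaaatcgcccctatccggccttaa"
row_b = "att---tggcggatcg-cctctacgggcc----"

print(row_a)
print(row_b)
print(match_line(row_a, row_b))
print(f"identity: {percent_identity(row_a, row_b):.1f}%")
```

Columns where either row has a gap never receive an asterisk, so insertions and deletions lower the identity.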

Sequence alignment: Purpose
Reveal structural, functional and evolutionary relationships between biological sequences
Similar sequences may have similar structure and function
Similar sequences are likely to have a common ancestral sequence
Annotation of new sequences
Modelling of protein structures
Design and analysis of gene expression experiments

Sequence alignment: Types
Global alignment
–Aligns each residue in each sequence by introducing gaps
–Example: Needleman-Wunsch algorithm
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A

Sequence alignment: Types
Local alignment
–Finds regions with the highest density of matches locally
–Example: Smith-Waterman algorithm
T G K G
A G K G

Sequence alignment: Scoring
Scoring matrices are used to assign scores to each comparison of a pair of characters
Identities and substitutions by similar amino acids are assigned positive scores
Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores
ACDEFGHIK
ACYEFGRIK
Option 1
TACGGGCAG
-AC-GGC-G
Option 2
TACGGGCAG
-ACGG-C-G
Option 3
TACGGGCAG
-ACG-GC-G
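To see why a scoring scheme is needed to choose between candidate alignments, the three options on this slide can be scored column by column. This is a minimal sketch with an assumed toy scheme (match +1, mismatch -1, gap -1); these values are not from the slides, and a real protein comparison would look the pair scores up in a PAM or BLOSUM matrix instead.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum a per-column score over two aligned rows (toy scheme, assumed values)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # a gap in either row
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


reference = "TACGGGCAG"
options = {
    "Option 1": "-AC-GGC-G",
    "Option 2": "-ACGG-C-G",
    "Option 3": "-ACG-GC-G",
}
for label, row in options.items():
    print(label, score_alignment(reference, row))
```

Under this simple scheme all three options contain six matches and three gap positions, so they receive the same score. Separating them requires a scheme that treats gap opening and gap extension differently.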

Sequence alignment: Scoring
PAM matrices
–PAM - Point Accepted Mutation
–A PAM matrix gives the probability that a given amino acid will be replaced by any other amino acid
–An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection
–Derived from global alignments of closely related sequences
–The number with the matrix (PAM40, PAM100) refers to the evolutionary distance (greater numbers mean greater distances)
–The 1-PAM matrix refers to the amount of evolution that would change 1% of the residues/bases (on average)
–The 2-PAM matrix does NOT refer to a change in 2% of the residues: it is the 1-PAM matrix applied twice, and some variations may change back to the original residue
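The statement that 2-PAM is "1-PAM applied twice" corresponds to multiplying the mutation probability matrix by itself. The sketch below uses a made-up three-letter alphabet in which every residue stays unchanged with probability 0.99. This toy matrix is an assumption for illustration only; the real PAM1 matrix is 20 x 20 with unequal substitution probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy 1-PAM matrix: each residue changes with probability 0.01 in total.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

for n in (1, 2, 100, 250):
    pam_n = mat_pow(toy_pam1, n)
    changed = 1.0 - pam_n[0][0]   # probability that the residue now differs
    print(f"{n:>3}-PAM: {100 * changed:.3f}% of residues differ")
```

For the 2-PAM case the differing fraction comes out slightly below 2%, because a position that mutated in the first step can mutate back in the second.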

PAM

Sequence alignment: Scoring
BLOSUM matrices
–BLOSUM - Blocks Substitution Matrix
–The score for each pair of residues is derived from the observed frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff]
–For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity

BLOSUM

Which scoring matrix to use?
For global alignments use PAM matrices
–Lower PAM matrices tend to find short alignments of highly similar regions
–Higher PAM matrices will find weaker, longer alignments
For local alignments use BLOSUM matrices
–BLOSUM matrices with a HIGH number are better for similar sequences
–BLOSUM matrices with a LOW number are better for distant sequences

Sequence alignment: Methods
Pairwise alignment
–Finding the best alignment of two sequences
–Often used for searching for the most similar sequences in sequence databases
–Methods: Dot Matrix Analysis, Dynamic Programming (DP), Short word matching
Multiple Sequence Alignment (MSA)
–Alignment of more than two sequences
–Often used to find conserved domains, regions or sites among many sequences
–Methods: Dynamic programming, Progressive methods, Iterative methods
Structural alignments
–Alignments based on structure

Dot matrix
Method for comparing two amino acid or nucleotide sequences
Let's align two sequences using a dot matrix
A: A G C T A G G A
B: G A C T A G G C
–Sequence A is placed along the X-axis and sequence B along the Y-axis

Dot matrix
–Starting from the first nucleotide in B, move along the first row, placing a dot in each column with a matching nucleotide
–Repeat the procedure for all the nucleotides in B
–A region of similarity is revealed by a diagonal row of dots
–Other isolated dots represent random matches
Sequence A (columns) against sequence B (rows):
  A G C T A G G A
G . ● . . . ● ● .
A ● . . . ● . . ●
C . . ● . . . . .
T . . . ● . . . .
A ● . . . ● . . ●
G . ● . . . ● ● .
G . ● . . . ● ● .
C . . ● . . . . .
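The procedure described above is short enough to write out directly. The following is a minimal sketch; the helper names `dot_matrix` and `render` are chosen here for illustration, and no window or threshold filtering is applied, so every single-residue match gets a dot.

```python
def dot_matrix(seq_a, seq_b):
    """One row per residue of seq_b, one column per residue of seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b, dot="●", blank="."):
    """Format the matrix as text with seq_a on the X-axis and seq_b on the Y-axis."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(b + " " + " ".join(dot if hit else blank for hit in row))
    return "\n".join(lines)


print(render("AGCTAGGA", "GACTAGGC"))
```

In the printed grid the shared CTAGG stretch shows up as the diagonal running from the third to the seventh position, while the remaining dots are isolated chance matches.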

Dot matrix
Two similar, but not identical, sequences
An insertion or deletion
A tandem duplication

Dot matrix
An inversion
Joining sequences

Limitations of dot matrix
Sequences with low-complexity regions give false diagonals
–Low-complexity regions are sequence regions with little diversity
Noisy and space inefficient
Limited to 2 sequences

Dotplot exercise
Use the following three tools to generate dot plots for the given two sequences
YASS: genomic similarity search tool –
Lalign/Palign –
multi-zPicture –

Dynamic programming
Breaks down the alignment problem into smaller problems
Examples
–Needleman-Wunsch algorithm: global alignment
–Smith-Waterman algorithm: local alignment
Three steps
–Initialization
–Scoring
–Traceback
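The three steps can be sketched for global alignment as follows. This is a minimal Needleman-Wunsch illustration with a linear gap penalty; the default scores (match +2, mismatch -1, gap -1) and the sequences AGTTA and AGTGCA are the ones used in the Needleman-Wunsch vs Smith-Waterman comparison in this deck. When several alignments are equally good, the traceback below returns just one of them.

```python
def needleman_wunsch(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1

    def sub(i, j):
        return match if seq_b[i - 1] == seq_a[j - 1] else mismatch

    # 1. Initialization: first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # 2. Scoring: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in seq_a
                score[i][j - 1] + gap,            # gap in seq_b
            )

    # 3. Traceback: walk from the bottom-right cell back to the origin.
    out_a, out_b = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(seq_a[j - 1])
            out_b.append(seq_b[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append("-")
            out_b.append(seq_b[i - 1])
            i -= 1
        else:
            out_a.append(seq_a[j - 1])
            out_b.append("-")
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("AGTTA", "AGTGCA")
print(total)
print(top)
print(bottom)
```

The score matrix has (len(seq_a) + 1) x (len(seq_b) + 1) cells, so time and memory both grow with the product of the two sequence lengths.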

Gap penalties
Insertion of gaps in the alignment
Gaps should be penalized
Gap opening should be penalized more heavily than gap extension (or at least equally)
With BLOSUM62
–Gap opening score = -11
–Gap extension score =
Gap extension
AAAGAGAAA
AAA--AAAA
Gap initiation
AAAGAGAAA
AAA-A-AAA
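The difference between the two example alignments becomes visible once gap opening and gap extension are charged separately. The sketch below assumes one common convention: the first position of a gap costs the opening score and every further position costs the extension score. The opening score of -11 is taken from the slide; the extension score of -1 is an assumed placeholder, since the slide does not show its value, and some tools define the two penalties slightly differently.

```python
import re


def gap_cost(aligned_row, gap_open=-11, gap_extend=-1):
    """Total affine gap cost of one aligned row (gap_extend is an assumed value)."""
    total = 0
    for run in re.findall(r"-+", aligned_row):
        # First gap position pays gap_open, each additional one pays gap_extend.
        total += gap_open + (len(run) - 1) * gap_extend
    return total


print(gap_cost("AAA--AAAA"))  # one gap of length 2
print(gap_cost("AAA-A-AAA"))  # two separate gaps of length 1
```

With these values the single two-residue gap is much cheaper than two separate one-residue gaps, which matches the idea that one longer insertion or deletion event is more plausible than several independent ones.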

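Needleman-Wunsch vs Smith-Waterman
[Figure: partially filled score matrices for AGTTA (columns) against AGTGCA (rows). The Needleman-Wunsch matrix starts its rows with negative gap scores (-2, -3, -4, -5, -6 shown), the Smith-Waterman matrix starts every row with 0, and both show 2 for the first A/A match.]
Needleman-Wunsch
–Match=+2
–Mismatch=-1
–Gap=-1
Smith-Waterman
–Match=+2
–Mismatch=-1
–Gap=-1
–All negative values are replaced by 0
–Traceback starts at the highest value and ends at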
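For comparison, here is a minimal Smith-Waterman sketch using the same scores as on this slide (match +2, mismatch -1, gap -1) and the sequences AGTTA and AGTGCA. It follows the standard formulation of the algorithm: cell values are floored at 0, and the traceback starts from the highest-scoring cell and stops when it reaches a cell with score 0. If several cells share the highest score, this sketch starts from the first one found in row order, which is an arbitrary choice.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (best_score, aligned_a, aligned_b, score_matrix)."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1

    def sub(i, j):
        return match if seq_b[i - 1] == seq_a[j - 1] else mismatch

    # Initialization: the first row and column stay at 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Scoring: as in global alignment, but never below 0.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: from the best cell until a 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(seq_a[j - 1])
            out_b.append(seq_b[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append("-")
            out_b.append(seq_b[i - 1])
            i -= 1
        else:
            out_a.append(seq_a[j - 1])
            out_b.append("-")
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b)), score


best, local_a, local_b, matrix = smith_waterman("AGTTA", "AGTGCA")
print(best, local_a, local_b)
for row in matrix:
    print(row)
```

Unlike the global method, the result covers only the best-matching region of each sequence rather than both sequences from end to end.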

Needleman-Wunsch vs Smith-Waterman Sequence alignment teacher (

Dynamic programming: example
Scoring
–Match= +2
–Mismatch= -2
–Gap=

Dynamic programming exercise
Generate a scoring matrix for nucleotides (A, C, G, and T)
Align two sequences using dynamic programming
Align two sequences using the following tools
–EMBOSS Needle
–EMBOSS Water

Multiple sequence alignment
A multiple sequence alignment (MSA) is an alignment of three or more sequences
Why MSA?
–To identify patterns of conservation across more than 2 sequences
–To characterize protein families and generate profiles of protein families
–To infer relationships within and among gene families
–To predict secondary and tertiary structures of new sequences
–To perform phylogenetic studies

Recall: dynamic programming 2 sequences 3 sequences

MSA methods
Dynamic programming
–Align each pair of sequences
–Sum scores for each pair at each position
Progressive sequence alignment
–Hierarchical or tree-based method
–E.g. ClustalW, T-Coffee
Iterative sequence alignment
–Improved progressive alignment
–Realigns the sequences repeatedly
–E.g. MUSCLE
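"Sum scores for each pair at each position" is usually called the sum-of-pairs score. The sketch below scores an existing multiple alignment this way. The scoring values (match +1, mismatch -1, gap -1, gap against gap 0) are assumptions chosen for illustration, and the three aligned rows are a made-up toy example.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-1):
    """Score one pair of characters from the same alignment column."""
    if x == "-" and y == "-":
        return 0                  # two gaps carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Add up pair scores over every column and every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


toy_alignment = ["ACG", "ACG", "A-G"]
print(sum_of_pairs(toy_alignment))
```

The number of row pairs grows quadratically with the number of sequences, and exact dynamic programming over all sequences at once grows far faster still, which is why progressive and iterative heuristics are used in practice.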

Tools for MSA

ClustalW
Progressive sequence alignment
Basic steps
–Calculate pairwise distances based on pairwise alignments between the sequences
–Build a guide tree, which is an inferred phylogeny for the sequences
–Align the sequences in the order given by the guide tree
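The first step can be illustrated with a simple distance measure. The sketch below is not ClustalW's actual procedure: it assumes three made-up toy sequences of equal length, so no pairwise alignment is needed, and it uses the fraction of differing positions (p-distance) as the distance. The names `p_distance` and `distance_table` are invented for this example.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this toy measure needs equal-length sequences")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def distance_table(sequences):
    """Map every pair of sequence names to its p-distance."""
    return {
        (name_a, name_b): p_distance(sequences[name_a], sequences[name_b])
        for name_a, name_b in combinations(sequences, 2)
    }


toy_sequences = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TCGTTCGA",
}
table = distance_table(toy_sequences)
for pair, dist in table.items():
    print(pair, dist)

# The closest pair is the one a guide tree would join first.
print(min(table, key=table.get))
```

A progressive aligner would align the closest pair first and then add the remaining sequences following the branching order of the guide tree.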

Progressive MSA

MUSCLE
Iterative sequence alignment
Follows 3 steps
–Progressive alignment
–Second progressive alignment
–Refinement

Phylogenetic tree
A phylogenetic tree shows evolutionary relationships between the sequences
Types:
–Rooted: nodes represent the most recent common ancestor; edge lengths represent time estimates
–Unrooted: no ancestry and time estimates
Algorithms to generate a phylogenetic tree
–Neighbor-joining
–Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
–Maximum parsimony
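Of the listed algorithms, UPGMA is the simplest to write down: repeatedly join the two closest clusters and average their distances to all other clusters, weighted by cluster size. The sketch below returns only the tree topology as nested tuples and does not compute branch lengths. The four taxa and their distances are made up for illustration.

```python
from itertools import combinations


def upgma(distances):
    """Build a tree topology from {(leaf_a, leaf_b): distance}; returns nested tuples."""
    d = {frozenset(pair): value for pair, value in distances.items()}
    size = {}                       # cluster -> number of leaves it contains
    for pair in distances:
        for leaf in pair:
            size[leaf] = 1

    while len(size) > 1:
        # Pick the two closest clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for other in size:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


toy_distances = {
    ("A", "B"): 2, ("A", "C"): 8, ("A", "D"): 8,
    ("B", "C"): 8, ("B", "D"): 8, ("C", "D"): 4,
}
print(upgma(toy_distances))
```

UPGMA assumes that all lineages evolve at the same rate and therefore produces a rooted tree; neighbor-joining drops that assumption and produces an unrooted tree.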

Neighbor joining method

MSA exercise
Align the protein sequences SET 1 and SET 2 using MSA tools and compare the alignments
Clustalw2 –
MUSCLE –

What to align: DNA or protein sequence?
Many mismatches in DNA sequences are synonymous
DNA sequences contain non-coding regions, which should be avoided in homology searching
Matches are more reliable in protein sequences
–Probability of a match occurring randomly at any position in a sequence
–Amino acids: 1/20 = 0.05
–Nucleotides: 1/4 = 0.25
Searching at the protein level: in case of frameshifts, the alignment score for the protein sequences may be very low even though the DNA sequences are similar
ACTTTTCATGGG... ThrPheHisGly...
ACTTTTTCATGGG.. ThrPheSerTrp
If an ORF exists, then always align at the protein level
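The frameshift example can be reproduced with a few lines of code. The sketch below builds the standard genetic code from its usual compact string representation and translates both DNA sequences from the slide in reading frame 1. The function name `translate` is chosen here, and one-letter amino acid codes are used instead of the three-letter codes on the slide.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate in frame 1; an incomplete trailing codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "ACTTTTCATGGG"
with_insertion = "ACTTTTTCATGGG"   # one extra T shifts the reading frame

print(translate(original))         # T F H G  (Thr Phe His Gly)
print(translate(with_insertion))   # T F S W  (Thr Phe Ser Trp)
```

The two DNA sequences differ by a single inserted base, yet every codon downstream of the insertion changes, which is why protein-level scores drop sharply after a frameshift.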

Searching bioinformatics databases using keywords and sequences

Search strategy
Keyword search
–Find information related to specific keywords
–Each bioinformatics database has its own search tool
–Some search tools have a wide scope: they access multiple databases and gather the results together
–Gquery, EBI search
Sequence search
–Use a sequence of interest to find more information about the sequence
–BLAST, FASTA

Keyword search
Find information related to specific keywords
Gquery
–A central search tool to find information in NCBI databases
–Searches a large number of NCBI databases and shows the results on one page
–
EBI search
–Search tool to find information from databases developed, managed and hosted by EMBL-EBI
–

Gquery

EBI search

Limitations
Synonyms
Misspellings
Old and new names/terms
ELA2 / ELANE, HIV 1 / HIV-1 (PubMed, ClinVar)
NOTES:
–Use different synonyms and read the literature to find more appropriate keywords
–Use Boolean operators to combine different keywords
–Do not expect to find all the information using keyword search alone
–Note the database version or the version of the entries in the databases you used
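Combining synonyms with Boolean operators can be done by hand, but a small helper keeps long queries consistent. The sketch below is only an illustration: `build_query` is a made-up function, and the conventions it uses (uppercase OR/AND, double quotes around multi-word terms) are common but not universal, so the exact syntax should be checked for each database.

```python
def build_query(*synonym_groups):
    """Join synonyms with OR inside a group and combine the groups with AND."""
    clauses = []
    for group in synonym_groups:
        # Quote multi-word terms so they are treated as one phrase.
        terms = [f'"{term}"' if " " in term else term for term in group]
        clause = " OR ".join(terms)
        clauses.append(f"({clause})" if len(terms) > 1 else clause)
    return " AND ".join(clauses)


# Old and new gene symbol from the slide.
print(build_query(["ELA2", "ELANE"]))
# Two spelling variants from the slide.
print(build_query(["HIV 1", "HIV-1"]))
```

Passing several groups, for example `build_query(["ELA2", "ELANE"], ["HIV 1", "HIV-1"])`, joins them with AND; that particular combination is shown only to illustrate the syntax.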

Gene nomenclature
HUGO Gene Nomenclature Committee (HGNC)
–Assigns standardized nomenclature to human genes
–Each symbol is unique and each gene is given only one name
Species-specific nomenclature committees
–Mouse Genome Informatics Database
–Rat Genome Database

HGNC symbol report
Approved symbol
Approved name
Synonyms
–Terms used in the literature to indicate the gene
–HGNC, Ensembl, Entrez Gene, OMIM
Previous symbols and names
–Previous HGNC approved symbol
NOTE: HGNC does not approve protein names. Usually genes and proteins have the same name, and gene names are written in italics

HGNC search

Keyword search Exercise