COT 6930 HPC and Bioinformatics Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.



Outline
  Why sequence alignment, and definitions
  What is sequence alignment
  How to score an alignment
    Substitution matrix
    Gap penalty

Access Sequence Database Query by sequence TCCAGCAGTGGATCTACTGGAGAAATATATGCAGCAAGGGA AAAGACAGAGAGAGCAGAGAGAGAGACCATACAAGGAGGT GACAGAGGACTTACTGCACCTCGAGCAGAGGGAGACACCAT ACAGGGAGCCACCAACAGAGGACTTGCTGCACCTCAATTCT CTCTTTGGAAAAGACCAGTAGTCACAGCATACATTGAGGGT CAGCCAGTAGAAGTTTTGTTAGACACGGGAGCTGACGACTC AATAGTAGCAGGAATAGAGTTAGGAAACAATTATAGCCCAAA AATAGTAGGGGGAATAGGGGGATTCATAAATACCAAGGAAT ATAAAAATGTAGAGATAGAAGTTCTAAATAAAAAGGTACGGG CCACCATAATGACAGGCGACACCCCAATCAACATTTTTGGCA GAAATATT CTGACAGCCTTAGGCATGTCATTAAATCTA

Aligning biological sequences
  Nucleic acid (4-letter alphabet + gap)
    TT-GCAC
    TTTACAC
  Proteins (20-letter alphabet + gap)
    RKVA--GMAKPNM
    RKIAVAAASKPAV

Why sequence alignment
  Lots of sequences with unknown structure and function vs. a few (but a growing number of) sequences with known structure and function
  If they align, they are “similar”
  If they are similar, then they might have similar structure and/or function
  Identify conserved patterns (motifs)
  If one of them has known structure/function, then aligning the others against it might yield insight into how their structure/function works
  Similar motif content might hint at similar function
  Define evolutionary relationships

Problems!
  How much is “similar”?
    95% similarity in proteins is ~ identical
    80% similarity is a lot in proteins
    Less similarity than that needed for DNA
  Database techniques are inadequate – they are too precise!
  Datasets are very large to search

Pair-wise sequence alignment is the most fundamental operation of bioinformatics
  It is used to decide if two proteins (or genes) are related structurally or functionally
  It is used to identify domains or motifs that are shared between proteins
  It is used in the analysis of genomes

What is sequence alignment
  Given two strings, and a scoring scheme for evaluating matching letters, find the optimal pairing of letters from one sequence to letters of the other sequence
  Align:
    THIS IS A RATHER LONGER SENTENCE THAN THE NEXT
    THIS IS A SHORT SENTENCE

    THIS IS A RATHER LONGER - SENTENCE THAN THE NEXT
    |||| || | --*|-- -|---| - ||||||||
    THIS IS A --SH-- -O---R T SENTENCE
  or
    THIS IS A RATHER LONGER SENTENCE THAN THE NEXT
    |||| || |               ||||||||
    THIS IS A SHORT         SENTENCE

Definitions
  Similar (or not)
  Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
  Identity: the extent to which two sequences are invariant.
  Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

  RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA 59
                 + K         GTW++MA      + L   + A
  glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA 55
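The identity definition above can be made concrete with a short sketch. The function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption; other conventions divide by the shorter sequence or by the number of ungapped columns.

```python
def percent_identity(row_a, row_b):
    """Count identical columns in two aligned rows of equal length.

    Assumption: percent identity is taken over the whole alignment
    length, including columns that contain a gap ('-').
    """
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    length = len(row_a)
    return identical, length, 100.0 * identical / length


# The RBP / glycodelin fragment from the slide.
rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA"

identical, length, pct = percent_identity(rbp, glycodelin)
print(f"{identical}/{length} identical columns = {pct:.1f}%")
```

For this fragment the sketch counts 8 identical columns out of 34 (about 23.5%), which matches the letters shown in the middle row of the slide's alignment.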

retinol-binding protein 4 (NP_006735)
β-lactoglobulin (P02754)
Page 42

Pairwise alignment of retinol-binding protein 4 and β-lactoglobulin

    1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
      . ||| |. |... | :.||||.:| :
    1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

   51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
      : | | | | :: |.|. || |: || |.
   45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

   98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
      || ||. | :.|||| |..|
   94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

  137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
      . | | | : ||. | || |
  136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI lactoglobulin

  Symbols in the middle row: identity (bar), somewhat similar (one dot), very similar (two dots)
  Gaps in the sequence rows: internal gap, terminal gap

Biological interpretation of sequence alignment
  Homology: similarity attributed to descent from a common ancestor
  Two types of homology
    Orthologs
      Homologous sequences in different species that arose from a common ancestral gene during speciation
      May or may not be responsible for a similar function
      Differences due to speciation (evolution)
    Paralogs
      Homologous sequences within a single species that arose by gene duplication
      Usually assume different functions
  Similarity Homology

Orthologs: members of a gene (protein) family in various organisms.
This tree shows retinol-binding protein (RBP) orthologs.
[Tree figure. Leaves: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, mouse, rat, rabbit, cow, pig, horse, human. Scale bar: 10 changes.]

Paralogs: members of a gene (protein) family within a species
[Tree figure. Leaves: apolipoprotein D, retinol-binding protein 4, complement component 8, prostaglandin D2 synthase, neutrophil gelatinase-associated lipocalin, lipocalin 1, odorant-binding protein 2A, progestagen-associated endometrial protein, alpha-1 microglobulin/bikunin. Scale bar: 10 changes.]

Similarity versus Homology
  Similarity refers to the likeness or % identity between 2 sequences
    Similarity means sharing a statistically significant number of bases or amino acids
    Similarity does not imply homology
  Homology refers to shared ancestry
    Two sequences are homologous if they are derived from a common ancestral sequence
    Homology usually implies similarity

Similarity versus Homology
  Similarity can be quantified
    It is correct to say that two sequences are X% identical
    It is correct to say that two sequences have a similarity score of Z
    It is generally incorrect to say that two sequences are X% similar

Aligning biological sequences
  Any two sequences can always be aligned
  There are many possible alignments
  Nucleic acid (4-letter alphabet + gap)
    TT-GCAC      TTG-CAC
    TTTACAC      TTTACAC
  Proteins (20-letter alphabet + gap)
    RKVA--GMAKPNM      RK--VAGMAKPNM
    RKIAVAAASKPAV      RKIAVAAASKPAV
  Sequence alignment needs to be scored to find the “optimal” alignment

Statement of problem
  Given:
    2 sequences
    Scoring system for evaluating match (or mismatch) of two characters (simple for nucleic acids, difficult for proteins)
    Penalty function for gaps in sequences
  Produce:
    Optimal pairing of sequences that retains the order of characters in each sequence, perhaps introducing gaps, such that the total score is optimal
  “Optimal” alignment
    Alignment with best score relative to metrics used
    May or may not have biological significance, because the algorithm relies on approximations:
      Scoring matches / mismatches
      Scoring gaps
      many more…
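The slides state the optimization problem but do not name an algorithm for it. One standard approach is dynamic programming in the style of Needleman-Wunsch; the sketch below is an illustration under assumptions, not the method of these slides. The function name and the default scores (match +1, mismatch -1, a linear gap penalty of -2 per gap character) are hypothetical choices.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b.

    Assumptions: a simple match/mismatch scoring system and a linear
    gap penalty (each gap character costs `gap`).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # best[i][j] = best score for aligning a[:i] with b[:j]
    best = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        best[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            best[i][j] = max(
                best[i - 1][j - 1] + pair,  # pair a[i-1] with b[j-1]
                best[i - 1][j] + gap,       # a[i-1] opposite a gap
                best[i][j - 1] + gap,       # b[j-1] opposite a gap
            )
    return best[-1][-1]


print(global_alignment_score("TTGCAC", "TTTACAC"))
```

Because characters are only ever paired left to right, the order of characters in each sequence is retained, as the problem statement requires. The table has (len(a)+1) × (len(b)+1) cells, so time and memory grow with the product of the two lengths.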

Pairwise alignment: protein sequences or DNA sequences?
  Protein is more informative (20 vs. 4 characters); many amino acids share related biophysical properties
  Codons are degenerate: changes in the third position often do not alter the amino acid that is specified

Pairwise alignment: protein sequences can be more informative than DNA
  DNA sequences can be translated into protein, and then used in pairwise alignments
  DNA can be translated into six potential proteins

  Forward-strand reading frames: 5’ CATCAA, 5’ ATCAAC, 5’ TCAACT
  Reverse-strand reading frames: 5’ GTGGGT, 5’ TGGGTA, 5’ GGGTAG

  5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
  3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
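The six reading frames on this slide can be enumerated with a few lines of code. This is a minimal sketch: the helper names are hypothetical, and it only produces the six nucleotide frames, leaving translation into protein (which needs a codon table) out of scope.

```python
# Map each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the three forward and three reverse reading frames."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(strand[offset:])
    return frames


seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(seq):
    print("5'", frame[:6], "...")
```

For the slide's sequence, the first six bases of each frame are CATCAA, ATCAAC, TCAACT (forward strand) and GTGGGT, TGGGTA, GGGTAG (reverse strand), matching the labels on the slide.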

Scoring Function
  Positive score for identities
  Some partial positive score for conservative substitutions
  Gap penalties

Parameters of Sequence Alignment
  Scoring systems: each symbol pairing is assigned a numerical value, based on a symbol comparison table
  Gap penalties:
    Opening: the cost to introduce a gap
    Extension: the cost to elongate a gap
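A rough sketch of how a symbol comparison table scores a given alignment. The table `DNA_TABLE` (match +1, mismatch -1) and the flat per-character gap penalty of 2 are toy values assumed for illustration; real protein tables such as PAM assign a separate value to every amino acid pair, and opening and extension are usually penalized differently.

```python
# Toy symbol comparison table for DNA: +1 for a match, -1 for a mismatch.
DNA_TABLE = {(x, y): (1 if x == y else -1) for x in "ACGT" for y in "ACGT"}


def score_alignment(row_a, row_b, table, gap_penalty):
    """Sum the table value of every aligned pair, minus gap penalties.

    Assumption: every gap character ('-') costs the same flat penalty.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total -= gap_penalty
        else:
            total += table[(x, y)]
    return total


# The two alternative nucleotide alignments shown earlier in the slides.
print(score_alignment("TT-GCAC", "TTTACAC", DNA_TABLE, gap_penalty=2))
print(score_alignment("TTG-CAC", "TTTACAC", DNA_TABLE, gap_penalty=2))
```

Under these toy values both alternative alignments receive the same score, which illustrates why the choice of scoring system and gap penalties decides which alignment counts as “optimal”.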

Scoring Protein Similarity – PAM

Gaps
  Can be inserted in aligned sequences
  Can represent
    Actual insertion / deletion (indel) mutations
    Regions of low sequence similarity
  Scoring gaps
    Commonly use affine cost model: Cost = h + g × gap length
      h = gap opening penalty (large)
      g = gap extension penalty (small)
    Costs empirically determined (relative to scoring matrix)
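The affine cost model above can be sketched as follows. The function name and the values h = 10 and g = 1 are assumptions chosen only to show the effect; as the slide notes, real costs are determined empirically relative to the scoring matrix.

```python
import re


def affine_gap_cost(row, h, g):
    """Total gap cost of one aligned row under Cost = h + g * gap length.

    Each maximal run of '-' characters is one gap: it pays the opening
    penalty h once, plus the extension penalty g per gap character.
    """
    return sum(h + g * len(run) for run in re.findall(r"-+", row))


# One gap of length 2 (from the protein example in the slides) ...
print(affine_gap_cost("RKVA--GMAKPNM", h=10, g=1))
# ... versus two separate gaps of length 1 in a made-up row.
print(affine_gap_cost("RK-VA-GMAKPNM", h=10, g=1))
```

With a large h and a small g, one gap of length 2 costs 12 while two gaps of length 1 cost 22, so the model favors a few longer gaps over many scattered ones.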

Global & Local Alignment
  Global alignment
    Best alignment of entire sequences to each other
    Q: Are two sequences generally the same?
  Local alignment
    Best alignment of parts of sequences
    Q: Do two sequences contain regions of high similarity?
  Biologically
    Two sequences may differ in structure and function, but share a common substructure / subfunction
  In general
    Use local alignment to find sequences with shared similarity
    Use global alignment to compare resulting sequences
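To make the local question concrete, here is a minimal sketch of a local alignment score in the style of Smith-Waterman, a standard technique that the slides do not name. The function name, the scores (match +1, mismatch -1, linear gap -2), and the two test strings are made up for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b.

    Assumptions: simple match/mismatch scores and a linear gap penalty.
    Scores are floored at zero, so an alignment may start anywhere, and
    the answer is the largest cell, so it may end anywhere.
    """
    rows, cols = len(a) + 1, len(b) + 1
    best = [[0] * cols for _ in range(rows)]
    top = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            best[i][j] = max(
                0,                          # start a fresh local alignment
                best[i - 1][j - 1] + pair,  # extend with an aligned pair
                best[i - 1][j] + gap,       # a[i-1] opposite a gap
                best[i][j - 1] + gap,       # b[j-1] opposite a gap
            )
            top = max(top, best[i][j])
    return top


# Two made-up sequences that differ overall but share the region GTG.
print(local_alignment_score("AAAGTGTT", "CCCGTGCC"))
```

The shared region GTG gives a local score of 3 here, even though the two strings as a whole are dissimilar, which is the situation the “Biologically” point describes.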

Recap
  Sequence alignment reveals the similarity between two sequences
  Similar sequences might be homologous sequences and (due to the evolutionary connection) have similar function
  The sequence alignment problem is an optimization problem: produce the best alignment according to a scoring function
  A scoring function provides numeric values for each possible symbol pairing and for gaps in an alignment