Sequence Analysis Topics to be covered What is sequence analysis? Why we need it? How we do it? What are the tools available?

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
EVOLUTIONARY CHANGE IN DNA SEQUENCES - usually too slow to monitor directly… … so use comparative analysis of 2 sequences which share a common ancestor.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Sequence Similarity Searching Class 4 March 2010.
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Sequence analysis course
Bioinformatics and Phylogenetic Analysis
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Introduction to bioinformatics
Sequence Comparison Intragenic - self to self. -find internal repeating units. Intergenic -compare two different sequences. Dotplot - visual alignment.
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Roadmap The topics:  basic concepts of molecular biology  more on Perl  overview of the field  biological databases and database searching  sequence.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Basics of Sequence Alignment and Weight Matrices and DOT Plot
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Sequence Analysis Alignments dot-plots scoring scheme Substitution matrices Search algorithms (BLAST)
Pairwise & Multiple sequence alignments
An Introduction to Bioinformatics
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.
Protein Evolution and Sequence Analysis Protein Evolution and Sequence Analysis.
Pairwise Sequence Alignment
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely) This lecture is designed to introduce you to.
Alineamiento Matricial (Harr Plot, Matrix Plot, Dot Plot, Dot Matrix)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Sequence Alignment Only things that are homologous should be compared in a phylogenetic analysis Homologous – sharing a common ancestor This is true for.
Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
COT 6930 HPC and Bioinformatics Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Protein Sequence Alignment Multiple Sequence Alignment
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Sequence similarity, BLAST alignments & multiple sequence alignments
Basic Local Alignment Search Tool
Presentation transcript:

Sequence Analysis Topics to be covered What is sequence analysis? Why we need it? How we do it? What are the tools available?

Sequence Databases GenBank at the National Center of Biotechnology Information, National Library of Medicine, Washington, DC accessible from: European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England DNA DataBank of Japan (DDBJ) at Mishima, Japan Protein International Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC 5.

Sequence Databases The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research in Epalinges/Lausanne bin/sprot-search-de

Servers for Sequence Alignment Bytes Block Aligner BCM Search Launcher: Pair-wise sequence alignment search/alignment.html SIM—Local similarity program for finding alternative alignments Global alignment programs (GAP, NAP) FASTA program suite BLAST 2 sequence alignment (BLASTN, BLASTP) Likelihood-weighted sequence alignment (lwa) wa.html

ALIGNMENT How do we tell whether two macromolecules are similar? Why?

Sequence Comparison Similarity between and among sequences is at the heart of sequence comparison methods. Q: What causes sequences to be similar? A: Evolutionary OR developmental causes.

Similarity, Homology, Orthology, Paralogy Similarity cannot be equated with Homology, Orthology, or Paralogy. Different organisms may have similar characteristics for different causes/reasons: evolutionary OR developmental. We need to concern ourselves with one set only, the set whose similarities are due to evolutionary causes/events; in other words, we take into account ONLY those characteristics that are similar among organisms due to evolution (common ancestry).

Similarity,Homology, Orthology, Paralogy

Homology Homology: designates a relationship of common descent between any entities, without further specification of the evolutionary scenario. Accordingly, the entities related by homology, in particular, genes, are called homologs.

Sequences are said to be homologous if they are related by divergence from a common ancestor Homology is not a measure of similarity, but an absolute statement that sequences have a divergent rather than a convergent relationship.

Orthology Orthologs are homologous genes in different species with analogous functions.

Paralogy Paralogs are similar genes that are the result of a gene duplication. Paralogs are therefore neither homologs nor orthologs.

Orthology vs. Paralogy A phylogeny that includes both orthologs and paralogs is likely to be incorrect. The figure above shows how gene duplication gave rise to two paralogous branches, α and β, species within each branch are orthologs of each other.

ALIGNMENT One-to-One One-to-Database Many-to-Many

Origins of Sequence Similarity Homology –common evolutionary descent Similarity in function –convergence Chance

Identity GAACAAT |||||||7/7 OR 100% GAACAAT ||| |||6/7 OR 84% GAATAAT MISMATCH

Mismatches GAACAAT ||| |||6/7 OR 84% GAATAAT GAACAAT ||| |||6/7 OR 84% GAAGAAT

Terminal Mismatch GAACAATttttt ||| ||| aaaccGAATAAT 6/7 OR 84%

Types of Sequence Analysis Knowledge-based single/multiple sequence analysis for sequence characteristics Pair-wise sequence alignment Multiple sequence alignment ~ Sequence motif discovery in multiple sequence alignment ~Phylogenetic analysis

Sequence Alignment Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

Sequence Alignment I Fundamental to inferring homology (common ancestry) and function. IIIf two sequences are in alignment  part or all of the pattern of nucleotides and polypeptides match, then they are similar and can be said to be homologous. IIIIf the sequence of a protein or other molecule ‘significantly’ matches the sequence of a protein with a known structure and function, then the molecules may share structure and function.

Pair-wise Sequence Alignment One pair of elements at a time Challenge – Find optimum alignment of 2 sequences with some degree of similarity Optimality is based on SCORE Score reflects the no. of paired characters in the 2 sequences and the no. and length of gaps introduced to adjust the sequences so that max no. of characters are in alignment

Pair-wise Sequence Alignment Example 1 Consider the ideal case of 2 identical nucleotide sequences Score – 1 point per pair of aligned characters SCORE = 20 points ATTCGGCATTCAGTGCTAGA ATTCGGCATTCAGTGCTAGA

Pair-wise Sequence Alignment Example 2 Consider the case when several of the characters are not aligned A) SCORE = 11 points?? Note that the last 6 characters are identical! ATTCGGCATTCAGTGCTAGA ATTCGGCATTGCTAGA

Pair-wise Sequence Alignment Example 2 contd. SCORE = 16 points?? Any cost/penalty of inserting gaps? Penalty = -0.5 / gap Final Score = 16 – 0.5*4 = 14 ATTCGGCATTCAGTGCTAGA ATTCGGCATT----GCTAGA

Pair-wise Sequence Alignment Example 3 Areas of similarity and dissimilarity not obvious as opposed to the previous example where there were two blocks of identical characters SCORE = 12 points ATTCGGCATTCAGAGCTAGA ATTCGACATTGCTAGTGGTA

Pair-wise Sequence Alignment Example 3 contd… Areas of similarity and dissimilarity not obvious SCORE = 14 – 0.5*4 = 12 SCORE = 14 (1 for exact match) – 0.5*4 ( for gap) – 0.5*1 ( for inexact match) = 14 – 2 – 0.5 =11.5 ATTCGGCATTCAGAGCTAGa ATTCGACATT----GCTAGt

Pair-wise Sequence Alignment Example 3 contd. SCORE = 14 – 0.5*4 – 0.5*1 (inexact matches) = 11.5 Penalty(gap)=Cost(opening) + Cost(per ext)* Length(gap)  in Example 2 and above, gap penalty = (-0.5*4) = -2.5  score in the above case becomes 8.5 ATTCGGCATTCAGAGCTAGa ATTCGACATT----GCTAGt

Two types of Sequence Alignment Global Alignment: An attempt is made to align the entire sequence using as many characters as possible, till the ends of both the sequences. Local Alignment: Stretches of sequence with the highest density of matches are aligned generating one or more islands of matches. This is particularly relevant for multi-domain proteins that share a conserved domain.

Local vs. Global Alignment Global alignment An attempt to line up two sequences matching as many characters as possible Considers all characters in a sequence. Bases alignment on the total score, even at the expense of stretches that that share obvious similarity Used for determining whether two protein sequences are in the same family

Local vs. Global Alignment Local Alignment More meaningful – points out conserved regions between two sequences Aligns two partially overlapping sequences Aligns two sequences where one is a subsequence of another

Methods Dot Matrix/Dot Plot Bayesian Method Dynamic Programming Smith-Waterman Algorithm Hidden Markov Models Genetic Algorithms Neural Networks Word-based techniques FASTA BLAST

Dot Matrix Method Visual inspection of linear sequences with 100 or more characters is impractical Dot Matrix method is visually more informative Makes similarities in patterns more obvious to visual inspection A dot is placed at the intersection of matching character pairs Dotter is a graphical dot plot program for detailed comparison of two sequences. Reference A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167:GC1-10 (1995 )

Dot Matrix Method

The biggest asset of dot matrix analysis is it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region. Since your own mind and eyes are still better than computers at discerning complex visual patterns, especially when more than one pattern is being considered, you can see all these ‘less than best’ comparisons as well as the main one and then you can ‘zoom-in’ on those regions of interest using more detailed procedures.

Dot Matrix Method ‘mutated’ inter-sequence comparison

Dot Matrix Method: Detection of Repeats

Dot Matrix Method: detecting complicated mutations

Dot Matrix Method Complicated MUTATIONS Again, notice the diagonals. However, they have now been displaced off of the center diagonal of the plot. This is an example of a ‘transposition’. Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.

Noise in Dot plots and filtering SHIBASISHBITSCHOWDHURY SSSSS HHHHH IIII BBB AA SSSSS IIII SSSSS HHHHH BBB IIII TT SSSSS

Dot Matrix Method: filtering noise using the sliding window approach Reconsider the same plot. Notice the extraneous dots that neither indicate runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, the composition of the sequences themselves. How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met within a user-defined window. What is meant by this is that if within some defined window size, some defined criterion is met, then and only then, will a dot be placed at the middle of that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.

Sliding Window Algorithm T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C CTAT GACATACGGTATGCTAT GACATACGGTATG Window Size = 3 Stringency = 3 

Hemoglobin  -chain Hemoglobin  -chain Dotplot (Window = 130 / Stringency = 9)

Dotplot (Window = 18 / Stringency = 10) Hemoglobin  -chain Hemoglobin  -chain

Parameters of Sequence Alignment Scoring Systems: Each symbol pairing is assigned a numerical value, based on a symbol comparison table. Gap Penalties: Opening: The cost for opening a gap Extension: The cost for elongating a gap

Protein Scoring Systems The concept of amino acid substitution matrices : The concept of amino acid substitution matrices : A 20x20 matrix containing values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. –Pioneered by Margaret Dayhoff (1968) who first described a method to derive scoring matrices from amino acid replacements observed in aligned families –of present-day sequences. –In 1978 her group studied 1572 mutations in 71 families of closely-related protein sequences. Based on these they built matrices known as PAM (percent accepted mutation) matrices.

PAM 250 (log odds form or mutation data matrix or MDM) A R N D C Q E G H I L K M F P S T W Y V B Z A R N D C Q E G H I L K M F P S T W Y V B Z C W W Each matrix value is calculated from an odds score, which is the probability that the amino acid pair will be found in alignments of homologous proteins divided by the probability that the pair will be found in alignments of unrelated proteins by random chance.

Using the matrix… 3 Alignment Sequence ATyrCysAspAla Sequence BPheMetGluGly PAM 250 matrix value Total score for alignment of sequence A with sequence B =?

Using the matrix (example 2) PTHPLASKTQILPEDLASEDLTI PTHPLAGERAIGLARLAEEDFGM Sequence 1 Sequence 2 Scoring matrix T:G= -2 T:T = 5 Score= 48 CSTPAGND.. C 9 S-1 4 T P A G N D CSTPAGND.. C 9 S-1 4 T P A G N D

First pairs of aligned amino acids in verified alignments are used to build a count matrix Then the count matrix is used to estimate a mutation matrix at 1 PAM (evolutionary unit). From the mutation matrix, a Dayhoff scoring matrix is constructed. This Dayhoff matrix along with a model of indel events is then used to score new alignments These alignments can then be used in an iterative process to construct new count matrices.

Extrapolating the PAM 1 matrix to obtain other matrices of the PAM family Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. 1.0 PAM unit, is the amount of evolution which will change, on average, 1% of amino acids in a protein sequence.

PAM family PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices can be extrapolated from PAM1. Each PAM matrix gives the substitutions expected for a given period of evolutionary time. It is rare that a PAM matrix would be used for an evolutionary distance any greater than 256 PAMS.

BLOSUM (Block Substitution Matrices) BLOSUM matrices are based on local alignments. BLOSUM 62 is a matrix calculated from comparisons of sequences with no less than 62% divergence. Unlike PAM matrices, all BLOSUM matrices are based on observed alignments, they are not extrapolated from comparisons of closely related sequences. BLOSUM 62 is the default matrix in BLAST. Though BLOSUM 62 is tailored for comparisons of moderately distant proteins, it performs well in detecting close relationships. A search for distant relatives may be more sensitive with a different matrix.

Blosum 62 substitution matrix S Henikoff, and JG Henikoff Amino Acid Substitution Matrices from Protein Blocks PNAS 89: , 1992.

The relationship between PAM and BLOSUM matrices BLOSUM 80BLOSUM 62BLOSUM 45 PAM 1 PAM 120 PAM 250 Less divergent More divergent

Nucleic Acid Scoring Systems -very simple actaccagttcatttgatacttctcaaa taccattaccgtgttaactgaaaggacttaaagact Sequence 1 Sequence 2 AGCTA1000G0100C0010T0001AGCTA1000G0100C0010T0001 Match: 1 Mismatch: 0 Score = 5 Unitary matrix

T A T G T G G A A T G A Scoring Insertions and Deletions and Gap Penalties A T G T - - A A T G C A A T G T A A T G C A T A T G T G G A A T G A The creation of a gap is penalized with a negative score value. insertion / deletion

1 GTGATAGACACAGACCGGTGGCATTGTGG 29 ||| | | ||| | || || | 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29 Why Gap Penalties? Gaps allowed but not penalized Score: 88 Gaps not permitted Score: 0 1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29 ||| || | | | ||| || | | || || | 1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29 Match = 5 Mismatch = -4

The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps. There is a tradeoff between these two - adding gaps reduces mismatches Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. Penalizing gaps forces alignments to have relatively few gaps. Why Gap Penalties?

Gap Penalties How to balance gaps with mismatches? Gaps must get a steep penalty, or else you’ll end up with nonsense alignments. In real sequences, multi-base (or amino acid) gaps are quit common genetic insertion/deletion events “Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.

Scoring Insertions and Deletions A T G T T A T A C T A T G T G C G T A T A Total Score: 4 Gap parameters: d = 3(gap opening) e = 0.1(gap extension) g = 3(gap length)  (g) = -3 - (3 -1) 0.1 = -3.2 T A T G T G C G T A T A A T G T T A T A C insertion / deletion match = 1 mismatch = 0 Total Score: = 4.8 Scoring system 1 Scoring system 2

Building the PAM mutation data matrix Studied 1572 mutations in 71 families of closely-related protein sequences. 1 Similar sequences were organized into a phylogenetic tree in each family. In each of the aligned sequence families, the number of changes of each amino acid to every other amino acid were counted. Example Out of 1572 changes, there were 260 changes between F and Y. Multiply