Pairwise sequence alignment Urmila Kulkarni-Kale Bioinformatics Centre, University of Pune, Pune 411 007.

Pairwise sequence alignment Urmila Kulkarni-Kale Bioinformatics Centre, University of Pune, Pune 411 007. urmila@bioinfo.ernet.in

October 2K5

Bioinformatics Databases
– Collection of records
    DNA sequences: GenBank, EMBL
    Protein sequences: NBRF-PIR, SWISSPROT
– organized to permit search and retrieval
    Text-based searching: Entrez, SRS (authors, keywords)
    Sequence-based searching: BLAST, FASTA
– allow processing and reorganization
    Alignments, finding patterns
– help to discover patterns

Heuristic approaches: local sequence alignment
The two main heuristic local alignment algorithms are BLAST and FASTA. They are significantly faster than exhaustive dynamic programming, but they are not guaranteed to find the optimal alignment.

How to analyse sequences?
Analysis of single sequence
  – Composition
  – Location of pattern
  – Profile of properties such as hydrophilicity, hydrophobicity
Comparison with self
  – Repeats
Comparison with one or more sequences
  – Sequence and/or structural similarity
  – Evolutionary relationship (homology)

Basis for Sequence comparison
Theory of evolution:
  – Gene sequences have evolved/derived from a common ancestor
Proteins that are similar in sequence are likely to have similar structure and function

WHAT IS ALIGNMENT?
Alignments are useful organizing tools because they provide a pictorial representation of similarity/homology in protein or nucleic acid sequences.

Sample Alignment
SEQ_A: GDVEKGKKIFIMKCSQ
SEQ_B: GCVEKGKIFINWCSQ
There are two possible linear alignments:
1.  GDVEKGKKIFIMKCSQ
    | |||||
    GCVEKGKIFINWCSQ
2.  GDVEKGKKIFIMKCSQ
           ||||  |||
     GCVEKGKIFINWCSQ

The optimal alignment
    GDVEKGKKIFIMKCSQ
    | ||||| |||  |||
    GCVEKGK-IFINWCSQ
Insertion of one break maximizes the identities.

Theoretical background
Alignment is a method based on the theoretical view that two sequences are derived from each other by a number of elementary transformations:
  – Mutation (residue substitution)
  – Insertion/deletion
  – Slide function

Transformations
Substitution, addition/deletion, slide function
The most homologous sequences are those which can be derived from one another by the smallest number of such transformations.
How do we decide "the smallest number of transformations"?
Alignment is therefore an optimization problem.

Terminology
Identity
Similarity
Homology

Identity
Objective and well defined
Can be quantified
  – Percent
  – The number of identical matches divided by the length of the aligned region
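
The percent-identity definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `percent_identity` is hypothetical, and it assumes that "length of the aligned region" counts every alignment column, including columns that contain a break ("-").

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two already-aligned sequences of equal length.

    Assumption: the denominator is the number of alignment columns,
    including columns that contain a gap character '-'.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as an identity only if both residues match and
    # neither of them is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# The gapped sample alignment from the slides: 12 identities in 16 columns.
print(percent_identity("GDVEKGKKIFIMKCSQ", "GCVEKGK-IFINWCSQ"))
```

If gap columns were excluded from the denominator instead, the same alignment would give a slightly higher value, so the convention should always be stated alongside the number.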

What is Similarity?
Objective and well defined
Can be quantified by using "scoring schemes"
  – Percent
  – The number of "similar matches" divided by the length of the aligned region
Protein similarity could be due to
  – Evolutionary relationship
  – Similar two- or three-dimensional structure
  – Common function

What is Homology?
Homologous proteins may be encoded by
  – The same genes in different species
  – Genes that have been transferred between species
  – Genes that have originated from duplication of ancestral genes

Difference between Homology and Similarity
Similarity does not necessarily imply homology.
Homology has a precise definition: having a common evolutionary origin.
Since homology is a qualitative description of the relationship, the term "% homology" has no meaning.
Supporting data for a homologous relationship may include sequence or structural similarities, which can be described in quantitative terms.
  – % identities, rmsd

An optimal alignment
    AALIM
    AAL-M
A sub-optimal alignment
    AALIM
    AA-LM

Global Alignment

Local Alignment

Needleman & Wunsch algorithm
JMB (1970) 48:443-453.
Maximizes the number of amino acids of one protein that can be matched with the amino acids of the other protein while allowing for optimum deletions/insertions.
Based on the theory of random walk in two dimensions

Random walk in two dimensions
3 possible paths
  – Diagonal
  – Horizontal
  – Vertical
Optimum path
  – Diagonal

N & W Algorithm
The optimal alignment is obtained by maximizing the similarities and minimizing the gaps.
GLOSSARY
1. PROTEINS: Words composed of 20 letters
2. LETTER: An element other than NULL
3. NULL: The symbol "-", i.e. the GAP
4. GAPS: Runs of nulls, which indicate deletion(s) in one sequence and insertion(s) in the other sequence

Contd../
5. SCORING MATRIX: Assigns a value to each possible pair of amino acids. Examples of matrices are UN, MD, GCM, CSW, UP.
6. PENALTY: There are two types of penalties.
   Matrix Bias: Added to every cell of the scoring matrix; it decides the size of the break. Also called the gap continuation penalty.
   Break Penalty: Applied every time a gap is inserted in either sequence.

Unitary Matrix
Simplest scoring scheme
Amino acid pairs are classified into 2 types:
  – Identical
  – Non-identical
Identical pairs are scored 1
Non-identical pairs are scored 0
Less effective for detection of weak similarities

       A  R  N  D  ...
    A  1  0  0  0
    R  0  1  0  0
    N  0  0  1  0
    D  0  0  0  1
    ...
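
As a rough illustration of the unitary scheme, the matrix can be built as a lookup table and used to score an ungapped comparison. The names `AMINO_ACIDS`, `unitary_matrix` and `ungapped_score` are hypothetical and chosen only for this sketch; the residue ordering is arbitrary.

```python
# The 20 standard amino acids in one-letter code (ordering is arbitrary).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def unitary_matrix(alphabet: str = AMINO_ACIDS) -> dict:
    """Unitary scoring matrix: 1 for identical pairs, 0 for all other pairs."""
    return {(x, y): 1 if x == y else 0 for x in alphabet for y in alphabet}


def ungapped_score(a: str, b: str, matrix: dict) -> int:
    """Sum the pair scores position by position, without inserting any gaps.

    zip() stops at the end of the shorter sequence, so trailing residues
    of the longer sequence are ignored.
    """
    return sum(matrix[(x, y)] for x, y in zip(a, b))


matrix = unitary_matrix()
# First linear alignment of the sample sequences: 6 identical positions.
print(ungapped_score("GDVEKGKKIFIMKCSQ", "GCVEKGKIFINWCSQ", matrix))
```

Because every mismatch scores 0 regardless of how chemically similar the two residues are, this scheme cannot reward conservative substitutions, which is why it is less effective for weak similarities.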

N & W definitions/variables
A, B: Two sequences under comparison
M, L: Lengths of the two sequences
A(i): i-th amino acid in sequence A
B(j): j-th amino acid in sequence B
MAT: A two-dimensional array used to compare all possible pair combinations of sequences A and B
SM(i,j): The cell that represents the pair combination containing A(i) and B(j). In the simplest case:
  – SM(i,j) = 1 if A(i) = B(j)
  – SM(i,j) = 0 if A(i) ≠ B(j)

MAT(i,j) = SM(i,j) + max(x, y, z)
where
  x = row max along the diagonal - penalty
  y = column max along the diagonal - penalty
  z = next diagonal: MAT(i+1, j+1)

    GDVEKGKKIFIMKCSQ
    | ||||| |||  |||
    GCVEKGK-IFINWCSQ
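
A minimal sketch of global alignment by dynamic programming is given below. Note the assumptions: it uses the now-common forward formulation with a single linear gap penalty, not the row/column-maximum form written on the slide, and the function name `needleman_wunsch` and the default scores (match 1, mismatch 0, gap -1) are illustrative choices rather than values from the presentation.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment with a linear gap penalty (illustrative sketch).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # diagonal: pair residues
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GDVEKGKKIFIMKCSQ", "GCVEKGKIFINWCSQ")
print(score)
print(top)
print(bottom)
```

With these scores the sample sequences align with one break and 12 identities. The break may land next to either of the two adjacent K residues, since both placements score the same; which one is reported depends on the tie-breaking order in the trace-back.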

Trace back
    GDVEKGKKIFIMKCSQ
    | ||||| |||  |||
    GCVEKGK-IFINWCSQ

Generation of Random sequences: How & Why
Obtain randomized sequences such that
  – Length and composition are the same
Why randomisation?
  – To filter chance similarities from biologically significant ones
  – To obtain statistical scores

Contd../
Real Score (R)
  – Similarity score of the real sequences
Mean Score (M)
  – Average similarity score of randomly permuted sequences
Standard deviation (Sd)
  – Standard deviation of the similarity scores of randomly permuted sequences
Alignment Score (A)
  – A = (R - M)/Sd
  – The alignment score is expressed as the number of standard deviation units by which the similarity score for the real sequences (R) exceeds the average similarity score (M) of randomly permuted sequences
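
The calculation A = (R - M)/Sd can be sketched as follows. Several details here are assumptions made for illustration only: the similarity score is a simple ungapped identity count standing in for a real alignment score, only the second sequence is permuted, 200 permutations are used, and the names `identity_score`, `shuffled` and `alignment_score` are hypothetical.

```python
import random
import statistics


def identity_score(a, b):
    """Stand-in similarity score: identical residues at matching positions."""
    return sum(1 for x, y in zip(a, b) if x == y)


def shuffled(seq, rng):
    """Random permutation of seq; length and composition are preserved."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def alignment_score(a, b, score_fn=identity_score, n_shuffles=200, seed=0):
    """A = (R - M) / Sd, estimated from randomly permuted copies of b."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    real = score_fn(a, b)              # R: score of the real sequences
    random_scores = [score_fn(a, shuffled(b, rng)) for _ in range(n_shuffles)]
    mean = statistics.mean(random_scores)    # M
    sd = statistics.stdev(random_scores)     # Sd (sample standard deviation)
    if sd == 0:
        raise ValueError("random scores do not vary; A is undefined")
    return (real - mean) / sd


print(alignment_score("GDVEKGKKIFIMKCSQ", "GCVEKGKIFINWCSQ"))
```

In practice `score_fn` would be the score of the optimal alignment of each pair, so every permutation requires a full alignment; the number of permutations is a trade-off between run time and the stability of M and Sd.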

Significant Alignment Score
A < 3 Sd
  – No homology
A = 3-6 Sd
  – May or may not be similar or homologous
  – Need additional evidence to prove similarity/homology
A > 6 Sd
  – Sequences are similar and may be homologous
  – Additional experimental evidence required to prove homology
A > 9 Sd
  – Homology could be deduced from sequence alignment studies alone

Calculation of Normalized Alignment Score

NAS = [ (# Ident * 10) + (# C * 25) - (# B * 20) ] / (Length of alignment) * 100
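
The arithmetic of the NAS formula can be checked with a small helper. The slide does not define "# C" and "# B", so this sketch treats them as opaque counts supplied by the caller; the function and parameter names are hypothetical.

```python
def normalized_alignment_score(n_ident, n_c, n_b, alignment_length):
    """NAS = ((#Ident * 10) + (#C * 25) - (#B * 20)) / length * 100.

    n_c and n_b are the counts written as '# C' and '# B' on the slide;
    their meaning is not defined there and is not assumed here.
    """
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    raw = n_ident * 10 + n_c * 25 - n_b * 20
    return raw / alignment_length * 100


# Purely arithmetic example: 12 identities, # C = 0, # B = 1, length 16.
print(normalized_alignment_score(12, 0, 1, 16))
```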

Sample output

An example of a high-scoring alignment (7.55 Sd) between citrate synthase (2cts) and transthyretin (2paba), two proteins that actually share no structural similarity. Note the completely different secondary structures.

The distribution of S.D. scores for 100,000 optimal alignments of length >20 between proteins of unrelated three-dimensional structure

Evolutionary process
Orthologues
A single Gene X is retained as the species diverges into two separate species
The genes in the two species are orthologues
[Diagram labels: Gene X, Gene X]

Evolutionary process
Paralogues: genes that arise due to duplication
A single gene X in one species is duplicated
As each gene gathers mutations, it may begin to perform a new function or may specialize in carrying out functions of the ancestral gene
These genes in a single species are paralogues
If the species diverges, the daughter species may maintain the duplicated genes; each species then contains an orthologue and a paralogue to each gene in the other species
[Diagram labels: Gene X, Gene A, Gene B]

Homologous/Orthologous/Paralogous sequences
Orthologous sequences are homologous sequences in different species that have a common origin
  – Differences between orthologues result from gradual evolutionary modification from the common ancestor
  – Perform the same function in different species
Paralogous sequences are homologous sequences that exist within a species
  – They have a common origin but arise through gene duplication events
  – Gene duplication frees a copy of the sequence to take on a new function
  – Perform different functions

Local Sequence Alignment Using Smith-Waterman Dynamic Programming Algorithm

Significance of local sequence alignment
Locating common domains in proteins
  – Example: transmembrane proteins, which might have different ends sticking out of the cell membrane but have common "middle parts"
Comparing long DNA sequences with a short one
  – Comparing a gene with a complete genome
Detecting similarities between highly diverged sequences which still share common subsequences (that have few or no mutations)

Local sequence alignment
Performs an exhaustive search for the optimal local alignment
Modification of the Needleman-Wunsch algorithm:
  – Negative weighting of mismatches
  – Matrix entries non-negative
  – Optimal path may start anywhere (not just in the first/last row/column)
After the whole path matrix is filled, the optimal local alignment is given by the path that starts at the highest score in the path matrix and includes all contributing cells until the path score has dropped to zero.

Smith-Waterman Algorithm

Example of local alignment

[Score table for the sequences HEAGAWGHEE (columns) and PAWHEAE (rows); the cell values are incomplete in this transcript.]
Scoring the alignment using the BLOSUM50 matrix
Gap penalty: -8

Summary: S & W
Fill the matrix using a similarity scoring matrix
Implement the dynamic programming algorithm
Find the maximal value in the matrix
Trace back from that value until a 0 value is reached
Because a new alignment can start anywhere, the scores cannot be negative.
Trace-back starts at the highest value rather than at the lower right-hand corner.
Trace-back stops as soon as a zero is encountered.
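
The steps summarized above can be sketched as follows. This is an illustrative version only: it uses a simple match/mismatch score with a linear gap penalty (match 2, mismatch -1, gap -2, all assumed values) instead of the BLOSUM50 matrix and -8 gap penalty of the worked example, and the function name `smith_waterman` is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty (illustrative sketch).

    Returns (best_score, aligned_a, aligned_b) for one optimal local alignment.
    """
    n, m = len(a), len(b)
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1];
    # the first row and column stay 0.
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                        # start a new alignment here
                          h[i - 1][j - 1] + s,      # diagonal: pair residues
                          h[i - 1][j] + gap,        # gap in b
                          h[i][j - 1] + gap)        # gap in a
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two sequences that share only the core "ACGT".
print(smith_waterman("XXXACGTYYY", "ZZACGTWW"))
```

The `max(0, ...)` term is what distinguishes this from the global algorithm: it lets a poorly scoring prefix be discarded so that the reported alignment covers only the well-matching region.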
