Pairwise sequence alignments Dynamic programming (Needleman-Wunsch), finds optimal alignment Heuristics: Blast (Altschul et al) does not guarantee finding.

Presentation transcript:

Pairwise sequence alignments. Dynamic programming (Needleman-Wunsch): finds the optimal alignment. Heuristics: BLAST (Altschul et al.): does not guarantee finding the optimal alignment, but is fast.

Pairwise sequence alignments. Example:
APLFVA----ITRSDD
APVFIAGDTRITRSEE
Assumptions: (1) sequences evolve through mutations and deletions/insertions; (2) the more similar two sequences are, the more likely they are evolutionarily related.

Similarity measures: percent identity. Identity score: exact matches receive a score of 1 and non-exact matches a score of 0. Example:
AVLILKQW
AVLIILQT
Identity score = 5 (score of the alignment under “identity”). Percent identity = identity_score / length_of_the_shorter_protein. Disadvantage of percent identity: it does not take into account the similarity between the properties of the amino acids.
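
The identity score and percent identity defined above can be sketched in a few lines of Python. This is only an illustration; the function names are made up for this example, and the treatment of gap characters (never counted as matches, ignored when measuring the shorter protein) is an assumption.

```python
def identity_score(a: str, b: str) -> int:
    """Count aligned positions holding the same residue (gaps never match)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def percent_identity(a: str, b: str) -> float:
    """identity_score / length of the shorter (ungapped) protein, in percent."""
    shorter = min(len(a.replace("-", "")), len(b.replace("-", "")))
    return 100.0 * identity_score(a, b) / shorter


score = identity_score("AVLILKQW", "AVLIILQT")
pid = percent_identity("AVLILKQW", "AVLIILQT")
print(score, pid)
```

For the slide's example the score is 5 out of 8 aligned positions.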

Substitution matrices: a measure of the “similarity” score of amino acids. M(i,j) ~ probability of substituting i with j over some time period. A Percent Accepted Mutation (PAM) unit is the evolutionary time corresponding to an average of 1 mutation per 100 residues. The two most popular classes of matrices: PAMn relates to mutation probabilities over an evolutionary interval of n PAM units (PAM120 is often used in practice); BLOSUMx relates to substitution probabilities observed in conserved blocks of related proteins, with sequences more than x% identical clustered together. BLOSUM62 is roughly comparable to PAM160 (and BLOSUM45 to PAM250).

Scoring the gaps. The two alignments below have the same score when every gap position is penalized equally, but the alignment with a single contiguous gap is better:
ATTTTAGTAC
ATT--AGTAC
and
ATTTTAGTAC
A-T-TAGTAC
Solution: add an additional penalty for opening a gap. Affine gap penalty: w(k) = h + gk, with h, g constants. Interpretation: the cost of starting a gap is h + g; extending it by one position costs a further g.
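
A small sketch of the affine penalty w(k) = h + gk applied to the two gapped rows. The values h = 2 and g = 1 are arbitrary assumptions, the helper names are invented for this example, and the second row is read here as A-T-TAGTAC.

```python
import re


def gap_lengths(aligned: str) -> list[int]:
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned)]


def linear_penalty(aligned: str, g: int) -> int:
    """Every gap position costs g, regardless of how gaps are grouped."""
    return sum(g * k for k in gap_lengths(aligned))


def affine_penalty(aligned: str, h: int, g: int) -> int:
    """Each gap of length k costs w(k) = h + g*k."""
    return sum(h + g * k for k in gap_lengths(aligned))


one_gap = "ATT--AGTAC"   # one gap of length 2
two_gaps = "A-T-TAGTAC"  # two gaps of length 1

print(linear_penalty(one_gap, g=1), linear_penalty(two_gaps, g=1))
print(affine_penalty(one_gap, h=2, g=1), affine_penalty(two_gaps, h=2, g=1))
```

Under the linear penalty both rows cost the same; under the affine penalty the row with two separate gaps pays the opening cost h twice.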

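Dot plot illustration. Sequence TTACTCAAT runs along one axis and ACTCATTAC along the other; a dot marks each pair of identical residues. The alignment corresponds to a path from the upper left corner to the lower right corner going through the maximum number of dots. Deletions: TTACTCAAT ACTCA- TTAC. Adapted from T. Przytycka.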
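
A dot plot for the two sequences on this slide can be produced with a short sketch like the one below. Which sequence goes on which axis is an assumption, and `dot_plot` is a name made up for this example.

```python
def dot_plot(top: str, side: str) -> list[str]:
    """One text row per residue of `side`; '*' where it equals the `top` residue."""
    return ["".join("*" if s == t else "." for t in top) for s in side]


top, side = "TTACTCAAT", "ACTCATTAC"
rows = dot_plot(top, side)

print("  " + top)
for residue, row in zip(side, rows):
    print(residue, row)
```

Diagonal runs of `*` correspond to stretches of equivalent residues.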

Gap penalties. Consider two pairs of alignments. First pair:
ATCG
ATTG
and
AT-CG
ATT-G
The first problem is corrected by introducing a “gap penalty”: for each gap, subtract the gap penalty from the score. Second pair:
AT-C-TA
ATTTTTA
and
ATC--TA
ATTTTTA
They have the same score, but the right alignment is more likely from an evolutionary perspective (simpler explanation = better explanation). The second problem is corrected by introducing an additional penalty for opening a gap. Affine gap penalty: w(k) = h + gk, with h, g constants. Interpretation: the cost of starting a gap is h + g; extending it by one position costs a further g.

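Organizing the computation: dynamic programming table. Cell (i, j) of the table holds Align(S_i, S'_j), computed from its three neighbors:
$$\mathrm{Align}(S_i, S'_j) = \max \begin{cases} \mathrm{Align}(S_{i-1}, S'_{j-1}) + s(a_i, a'_j) \\ \mathrm{Align}(S_{i-1}, S'_j) - g \\ \mathrm{Align}(S_i, S'_{j-1}) - g \end{cases}$$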
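
The recurrence can be turned into a table-filling routine along the following lines. This is a minimal sketch, not the slide author's code: the scoring values (match +1, mismatch -1, gap penalty g = 1) and the name `nw_table` are assumptions.

```python
def nw_table(s: str, t: str, match: int = 1, mismatch: int = -1, g: int = 1):
    """Fill the global-alignment table; A[i][j] scores s[:i] against t[:j]."""
    n, m = len(s), len(t)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: s[:i] against all gaps
        A[i][0] = -g * i
    for j in range(1, m + 1):          # first row: t[:j] against all gaps
        A[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + sub,   # align a_i with a'_j
                          A[i - 1][j] - g,         # a_i against a gap
                          A[i][j - 1] - g)         # a'_j against a gap
    return A


table = nw_table("ATTG", "ATGC")
print(table[-1][-1])  # score of the optimal global alignment
```

The bottom-right cell holds the optimal score; the run time and memory are both proportional to the product of the sequence lengths.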

Recovering the path (dynamic programming table for ATTG and ATGC). Resulting alignment:
ATTG-
AT-GC
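
Path recovery can be sketched by walking back from the bottom-right cell and checking which neighbor produced each value. The scoring values and the function name are assumptions, and when several alignments are co-optimal the tie-breaking order used here may return a different one than the slide shows.

```python
def needleman_wunsch(s: str, t: str, match: int = 1, mismatch: int = -1, g: int = 1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        A[i][0] = -g * i
    for j in range(1, m + 1):
        A[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + sub, A[i - 1][j] - g, A[i][j - 1] - g)

    # Trace back from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if A[i][j] == A[i - 1][j - 1] + sub:      # diagonal move
                top.append(s[i - 1]); bottom.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and A[i][j] == A[i - 1][j] - g:      # move up: gap in t
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:                                         # move left: gap in s
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return A[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_s, row_t = needleman_wunsch("ATTG", "ATGC")
print(score)
print(row_s)
print(row_t)
```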

Ignoring initial and final gaps: semiglobal comparison. Recall the initialization step for the dynamic programming table: the entries A[0,i] and A[j,0] are responsible for initial gaps, so set them to zero. How to ignore final gaps? Take the largest value in the last row or column and trace back from there. Example (no penalties for the gaps at the ends): CAGCA-CTTGGATTCTCGG aligned with CAGCGTGG.
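
Both changes (zero initialization, maximum over the last row and column) fit in a short scoring routine. This is a sketch under assumed scoring values (match +1, mismatch -1, gap penalty 1); `semiglobal_score` is a hypothetical name.

```python
def semiglobal_score(s: str, t: str, match: int = 1, mismatch: int = -1, g: int = 1) -> int:
    """Best alignment score when gaps at either end of either sequence are free."""
    n, m = len(s), len(t)
    # First row and first column stay 0: initial gaps are not charged.
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + sub, A[i - 1][j] - g, A[i][j - 1] - g)
    # Final gaps are not charged: take the best cell in the last row or column.
    return max(max(A[n]), max(row[m] for row in A))


embedded = semiglobal_score("AAAACGTAAAA", "CGT")
slide_example = semiglobal_score("CAGCACTTGGATTCTCGG", "CAGCGTGG")
print(embedded, slide_example)
```

A traceback for this variant would start from the cell that attains the maximum rather than from the bottom-right corner.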

Comparing similar sequences. Similar sequences: the optimal alignment has a small number of gaps, so the “alignment path” stays close to the diagonal. From Setubal & Meidanis, “Introduction to Computational Molecular Biology”.

Local and global alignments (figure panels: Global, Local)

Local alignment (Smith-Waterman). So far we have been dealing with global alignment. Local alignment: alignment between substrings. Main idea: if the alignment becomes too bad, drop it.
$$a[i,j] = \max \begin{cases} a[i-1,j-1] + s(a_i, a'_j) \\ a[i-1,j] - g \\ a[i,j-1] - g \\ 0 \end{cases}$$
Here g > 0 is the gap penalty, as in the global recurrence.
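
The only differences from the global case are the extra 0 in the maximum and the fact that the best score may sit anywhere in the table. A minimal scoring sketch, assuming match +1, mismatch -1 and gap penalty 1 (the function name is made up):

```python
def smith_waterman_score(s: str, t: str, match: int = 1, mismatch: int = -1, g: int = 1) -> int:
    """Score of the best local alignment between substrings of s and t."""
    n, m = len(s), len(t)
    a = [[0] * (m + 1) for _ in range(n + 1)]  # borders stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                      # drop an alignment that became too bad
                          a[i - 1][j - 1] + sub,
                          a[i - 1][j] - g,
                          a[i][j - 1] - g)
            best = max(best, a[i][j])
    return best


local = smith_waterman_score("TTTACGTTT", "GGACGGG")
print(local)
```

To recover the aligned substrings, a traceback would start at the cell holding `best` and stop at the first cell with value 0.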

Example

BLAST. Local, heuristic, fast, with good statistics. Uses a precalculated lookup table of all high-scoring word matches three residues long. Each hit is extended until the score drops below some threshold.
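
The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not the BLAST implementation: it indexes exact three-letter words only (BLAST also includes similar words that score above a threshold under a substitution matrix), scores with +1/-1 instead of a matrix, and extends only to the right. All names and parameter values are assumptions.

```python
from collections import defaultdict


def build_word_table(query: str, w: int = 3) -> dict:
    """Map every length-w word of the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def find_hits(query: str, subject: str, w: int = 3) -> list:
    """(query_pos, subject_pos) pairs where the same word occurs in both."""
    table = build_word_table(query, w)
    hits = []
    for si in range(len(subject) - w + 1):
        for qi in table.get(subject[si:si + w], []):
            hits.append((qi, si))
    return hits


def extend_right(query, subject, qi, si, w=3, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps until the score falls `drop` below the best."""
    score = best = w * match
    best_len = k = w
    while qi + k < len(query) and si + k < len(subject):
        score += match if query[qi + k] == subject[si + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > drop:
            break
    return best, best_len


query, subject = "LDLVFVC", "AALDLVFLC"
hits = find_hits(query, subject)
extension = extend_right(query, subject, *hits[0])
print(hits, extension)
```

The speed comes from the lookup table: only positions that share a word with the query are examined further.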

Sequence-profile alignments: sequence profiles describe conserved features with respect to position in a multiple alignment. Example alignment (profile rows A, R, N, D, C, ...):
IDVVVVC
LDLV--C
LDLVFVC
ADIIFLI
Gribskov et al., PNAS, 1987; Schaffer et al., Nucleic Acids Res., 2001

Computational aspects of protein structure

Examples of protein architecture: a β-sheet with all pairs of strands parallel; a β-sheet with all pairs of strands anti-parallel. Architecture refers to the arrangement and orientation of secondary structure elements (SSEs), but not to their connectivity.

Examples of protein topology. Topology refers to the manner in which the SSEs are connected. Example: two β-sheets (all parallel) with different topologies.

Secondary structures are connected to form motifs. G.M. Salem et al. J. Mol. Biol. (1999)

Supersecondary structure: Greek key motifs G.M. Salem et al. J. Mol. Biol. (1999)

Some supersecondary structure motifs are associated with a specific function: DNA-binding motifs. Helix-turn-helix motif: recognizes a specific palindromic DNA sequence. Zn-finger motif: Zn binds to two Cys and two His; the motifs bind in tandem along the major groove.

P-loop motif. Sequence pattern: G/AxxxxGK(x)S/T. Function: mononucleotide binding.
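
A sequence pattern like this maps directly onto a regular expression. In the sketch below, `x` is taken to mean any residue and `(x)` an optional residue; both readings are assumptions about the slide's notation, and the test sequence is purely illustrative.

```python
import re

# G/A, four arbitrary residues, G, K, an optional residue, then S/T.
P_LOOP = re.compile(r"[GA].{4}GK.?[ST]")


def find_p_loops(sequence: str) -> list:
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(sequence)]


matches = find_p_loops("MTEYKLVVVGAGGVGKSALTIQ")
print(matches)
```

A pattern match only flags a candidate site; short patterns also occur by chance in unrelated proteins.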

Calcium-binding motif. Sequence pattern: DxD/NxDxxxE/DxxE. Function: binding of Ca(2+); calmodulin: Ca-dependent signaling pathways. A. Lewit-Bentley & S. Rety, 2000

Protein domains can be defined based on:
Geometry: a group of residues with high contact density; the number of contacts within domains is higher than the number of contacts between domains. Domains may be chain-continuous or chain-discontinuous.
Kinetics: a domain as an independently folding unit.
Physics: a domain as a rigid body linked to other domains by flexible linkers.
Genetics: the minimal fragment of a gene that is capable of performing a specific function.

Domains as recurrent units of proteins. The same or similar domains are found in different proteins. Each domain has a well-determined compact structure and performs a specific function. Proteins evolve through duplication and domain shuffling. Protein domain classification based on comparing their recurrent sequence, structure and functional features: the Conserved Domain Database.

Conserved Domain Database (CDD). Protein domain classification based on comparing their recurrent sequence, structure and functional features. CDD is a collection of multiple sequence alignments corresponding to different protein domains.

CDD includes a set of multiple sequence alignments. Accurate alignments, since structure-structure alignments are reconciled with sequence alignments. Block-based alignments. Annotated alignments. Annotated functionally important sites.

PSSMs for each CDD alignment are calculated using observed residue frequencies and relationships between different residue types. Example alignment (PSSM rows A, R, N, D, C, ...):
IDVVVVC
LDLV--I
LDLVFVI
ADIIFLI
W(D,3) = log( Q(D,3) / P(D) ), where P(D) is the background probability of residue “D” and Q(D,3) is the estimated probability for residue “D” to be found in column 3.
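
A single PSSM entry W(a, c) = log(Q(a, c) / P(a)) can be computed with a sketch like the one below. The way Q is estimated here (raw column counts plus a pseudocount of 1) and the uniform background P = 1/20 are simplifying assumptions; the slide does not specify either, and real profile methods use more elaborate estimates.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_weight(alignment, residue, column, pseudocount=1.0):
    """W(residue, column) = log(Q / P); `column` is 1-based, gaps are skipped."""
    col = [row[column - 1] for row in alignment if row[column - 1] != "-"]
    counts = Counter(col)
    # Q: smoothed estimate of the residue's probability in this column.
    q = (counts[residue] + pseudocount) / (len(col) + pseudocount * len(AMINO_ACIDS))
    # P: background probability, assumed uniform over the 20 amino acids.
    p = 1.0 / len(AMINO_ACIDS)
    return math.log(q / p)


alignment = ["IDVVVVC", "LDLV--I", "LDLVFVI", "ADIIFLI"]
w_d2 = pssm_weight(alignment, "D", 2)  # D is conserved in column 2
w_d3 = pssm_weight(alignment, "D", 3)  # D never appears in column 3
print(w_d2, w_d3)
```

A positive weight means the residue is seen in that column more often than expected from the background; a negative weight means less often.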

How to annotate domains in a protein using CDD? To annotate domains in a protein: find the domain boundaries and assign a function (structure) to each domain. For each query sequence, perform a CD-search, in which the query sequence is compared with sequence profiles derived from CDD multiple sequence alignments.

Classwork Retrieve 1WQ1 from MMDB, look at structural domains and domains annotated by CDD. How different are they? Pretend you do not know the structure of 1WQ1, perform the CD-search, annotate domain boundaries.

Protein folds. Fold definition: two folds are similar if they have a similar arrangement of SSEs (architecture) and connectivity (topology). Sometimes a few SSEs may be missing. Fold classification: structural similarity between folds is searched for using structure-structure comparison algorithms. The number of folds is limited: ~1000 to 3000.

Superfolds are the most populated protein folds (C. Orengo et al., 1994). There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar. Superfolds are characterized by a wide range of sequence diversity and span a range of non-similar functions.

Why are some folds more populated than others? Thermodynamic stability? Fast folding? By chance, through duplication processes? Because they perform essential functions? Symmetrical folds that emerged through gene duplication? High supersecondary structure content, i.e. a higher fraction of local interactions?

Distinguishing structural similarity due to common origin versus convergent evolution. Divergent evolution: homologs. Convergent evolution: analogs.

TIM barrels. Classified into 21 families in the CATH database. Mostly enzymes, but they participate in a diverse collection of biochemical reactions. There are intriguing common features across the families, e.g. the active site is always located at the C-terminal end of the barrel. Catalytic and metal-binding residues are aligned in structure-structure alignments. Nagano, C. Orengo and J. Thornton, 2002

Functional diversity of TIM-barrels.

TIM barrel evolutionary relationships. Sequence analyses with advanced programs such as PSI-BLAST have identified further relationships among the families. Further interesting similarities have been observed from careful comparison of structures, e.g. a phosphate-binding site commonly formed by loops 7 and 8 and a small helix. In summary, there is evidence for evolutionary relationships between 17 of the 21 families.

SCOP (Structural Classification of Proteins). Levels of the SCOP hierarchy:
– Family: clear evolutionary relationship
– Superfamily: probable common evolutionary origin
– Fold: major structural similarity
– Class: secondary structure content

CATH (Class, Architecture, Topology, Homologous superfamily)

Classwork Using SCOP and CATH classify four protein structures (1b5t, 1n8i, 1tph and 1hti). How different are the classifications produced by SCOP and CATH? Can these proteins be considered homologous?