Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March 2004

Introduction to Bioinformatics Bioinformatics in Göttingen: Dep. of Bioinformatics (UKG), Edgar Wingender; Dep. of Bioinformatics (IMG), BM; Inst. Num. and Applied Mathematics, Stephan Waack; Dep. of Genetics (Hans Fritz, IMG), Rainer Merkl

Introduction to Bioinformatics Definition: Bioinformatics = development and application of software tools for Molecular Biology

Bioinformatics: Topics: (a) Sequence Analysis (Gene finding …) (b) Structure Analysis (RNA, Protein) (c) Gene Expression Analysis (d) Metabolic Pathways, Virtual Cell

Bioinformatics: Areas of work: (a) Application of software tools for data analysis in (Molecular) Biology (b) Computing infrastructure, database development, support (c) Development of algorithms and software tools

Information flow in the cell

Idea: Sequence -> Structure -> Function

Information flow in the cell: Lots of data available at the sequence level; fewer data at the structure and function level

Topics of lecture: Data bases (SwissProt, GenBank); pair-wise sequence comparison; data base searching; multiple sequence alignment; gene prediction

Protein data bases: Sanger and Tuppy: protein-sequencing methods (1951). Margaret Dayhoff: Atlas of Protein Sequence and Structure (1972); later: Protein Identification Resource (PIR) as international collaboration: (a) organize proteins into families; (b) amino acid substitution frequencies. Amos Bairoch: SwissProt (1986)

Exponential growth of data bases

DNA data bases: Maxam and Gilbert; Sanger: DNA sequencing methods (1977). GenBank DNA data base (1979), now run by NCBI. Collaboration with EMBL (1982), DDBJ (1984). Translated DNA sequences stored in protein data bases (PIR, TrEMBL)

Most important tool for sequence analysis: Sequence comparison

The dot plot: sequence Y Q E W T Y I V A R E A Q Y E (horizontal) compared with sequence C I V M R E Q Y (vertical). An X marks each matrix position where the two sequences carry the same residue.

The dot plot: second example, sequence Y Q E W T Y Q E V R E Y Q E I (horizontal) compared with sequence C I V M R Y Q E (vertical). The repeated segment Y Q E shows up as several parallel diagonals of X marks.

The dot plot Advantages: 1. Various types of similarity detectable (repeats, inversions) 2. Useful for large-scale analysis

The dot plot
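The dot plot described above can be sketched in a few lines of Python. This is only an illustration: the function name dot_plot and the "." filler for non-matching positions are assumptions made here, not something taken from the slides.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return a text dot plot: seq_a runs horizontally, seq_b vertically.

    An "X" marks every position where the two residues are identical,
    a "." marks every other position (the filler is an assumption).
    """
    rows = ["  " + " ".join(seq_a)]          # header row with seq_a
    for b in seq_b:
        cells = ["X" if a == b else "." for a in seq_a]
        rows.append(b + " " + " ".join(cells))
    return rows


# The two sequences of the first dot-plot example.
for row in dot_plot("YQEWTYIVAREAQYE", "CIVMREQY"):
    print(row)
```

Runs of X along a diagonal correspond to segments that occur in both sequences.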

Pair-wise sequence alignment Evolutionarily or structurally related sequences: alignment possible. Sequence homologies represented by inserting gaps

Pair-wise sequence alignment [Figure: dot plot of T Y I V A R E A Q Y E (horizontal) against C I V M R E Q Y (vertical)]

Pair-wise sequence alignment
T Y I V A R E A Q Y E
C I V M R E Q Y

Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Global alignment: sequences aligned over the entire length

Pair-wise sequence alignment Basic task: Find best alignment of two sequences = alignment that reflects structural and evolutionary relations

Pair-wise sequence alignment Questions: 1. What is a good alignment? 2. How to find the best alignment?

Pair-wise sequence alignment Problem: Astronomical number of possible alignments, for example
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
or
T Y I V A R E A Q Y E
C I - V M R E - Q Y -
Stupid computer has to find out: which alignment is best?
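To get a feeling for "astronomical", the number of possible alignments can be counted with a small recursion. The sketch below assumes that every alignment column is one of three kinds (residue/residue, residue/gap, gap/residue) and that alignments differing only in the order of neighbouring gaps count as different; the function name count_alignments is made up for this illustration.

```python
def count_alignments(m: int, n: int) -> int:
    """Count global alignments of two sequences of lengths m and n.

    Assumption: the last column is either residue/residue, residue/gap
    or gap/residue, so N(m, n) = N(m-1, n-1) + N(m-1, n) + N(m, n-1),
    with N(m, 0) = N(0, n) = 1 (only gaps remain).
    """
    prev = [1] * (n + 1)                     # row for length 0
    for _ in range(m):
        cur = [1]
        for j in range(1, n + 1):
            cur.append(prev[j - 1] + prev[j] + cur[j - 1])
        prev = cur
    return prev[n]


# Lengths of the two example sequences (11 and 8 residues).
print(count_alignments(11, 8))
```

Even for these two short sequences the count is far too large to score every alignment one by one, which motivates the dynamic-programming approach.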

Pair-wise sequence alignment First (simplified) rules: 1. Minimize number of mismatches 2. Maximize number of matches
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
compared with
T Y I V A R E A Q Y E
C I - V M R E - Q Y -

Pair-wise sequence alignment Second (simplified) rule: Minimize number of gaps
T Y I V A R E A Q Y E
C I - V M R E - Q Y -
compared with
T Y I V - A R E A Q Y E
C I - V M - R E - Q Y -
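The simplified rules can be checked mechanically by counting the three kinds of columns in an alignment. A minimal sketch, assuming the alignment is given as two gapped strings of equal length with "-" as the gap symbol and that "number of gaps" means the number of columns containing a gap; the name column_counts is hypothetical.

```python
def column_counts(row_a: str, row_b: str) -> tuple[int, int, int]:
    """Return (matches, mismatches, gap columns) of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("both alignment rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gaps += 1                        # residue aligned with a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# The three candidate alignments shown on the slides.
print(column_counts("TYIVAREAQYE", "-CIVMRE-QY-"))
print(column_counts("TYIVAREAQYE", "CI-VMRE-QY-"))
print(column_counts("TYIV-AREAQYE", "CI-VM-RE-QY-"))
```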

Pair-wise sequence alignment For protein sequences: Different degrees of similarity among amino acids. Simply counting matches and mismatches is too simplistic

Pair-wise sequence alignment
T Y I V
T L V

Pair-wise sequence alignment
T Y I V
T L - V
or
T Y I V
T - L V

Pair-wise sequence alignment Use similarity scores for amino acids: Define score s(a,b) for amino acids a and b

Pair-wise sequence alignment
T Y I V
T - L V
Given a similarity score for pairs of amino acids: Define score of alignment as sum of similarity values s(a,b) of aligned residues minus gap penalty g for each residue aligned with a gap

Pair-wise sequence alignment Example (for the alignment above): Score = s(T,T) + s(I,L) + s(V,V) - g
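The score definition translates directly into code. In the sketch below the similarity function toy_s and its values (4 for identical residues, 2 for the pair I/L, -1 otherwise) as well as the gap penalty g = 3 are invented for illustration; they are not taken from any published substitution matrix.

```python
def toy_s(a: str, b: str) -> int:
    """Hypothetical similarity score s(a, b); the values are made up."""
    if a == b:
        return 4
    if {a, b} == {"I", "L"}:
        return 2                             # "similar" amino acids
    return -1


def alignment_score(row_a: str, row_b: str, s, g: int) -> int:
    """Sum of s(a, b) over aligned residues minus g per gapped residue."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total -= g
        else:
            total += s(a, b)
    return total


# s(T,T) + s(I,L) + s(V,V) - g  versus  s(T,T) + s(Y,L) + s(V,V) - g
print(alignment_score("TYIV", "T-LV", toy_s, g=3))
print(alignment_score("TYIV", "TL-V", toy_s, g=3))
```

With similarity scores the two candidate alignments no longer tie, although both contain two identical columns and one gap.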

Pair-wise sequence alignment Dynamic-programming algorithm finds alignment with best score (Needleman and Wunsch, 1970)

Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Alignment corresponds to path through comparison matrix: a diagonal step aligns two residues, a horizontal or vertical step aligns a residue with a gap

Pair-wise sequence alignment
T W L V - R E A Q I
- C I V M R E - H Y

Pair-wise sequence alignment Score of alignment: Sum of similarity values of aligned residues minus gap penalty. Example: S = - g + s(W,C) + s(L,I) + s(V,V) - g + s(R,R) …

Pair-wise sequence alignment Alignment corresponds to path through comparison matrix

Pair-wise sequence alignment Dynamic programming: Calculate scores S(i,j) of optimal alignment of prefixes up to positions i and j (i: position in the horizontal sequence, j: position in the vertical sequence).
T W L V - R
- C I V M R

Pair-wise sequence alignment S(i,j) can be calculated from possible predecessors S(i-1,j-1), S(i,j-1), S(i-1,j).

Pair-wise sequence alignment Score of optimal path that comes from top left = S(i-1,j-1) + s(R,R)
T W L V - R
- C I V M R

Pair-wise sequence alignment Score of optimal path that comes from above = S(i,j-1) - g
T W L V R -
- C I V M R

Pair-wise sequence alignment Score of optimal path that comes from left = S(i-1,j) - g
T W L - - V R
- C I V M R -

Pair-wise sequence alignment Score of optimal path = Maximum of these three values

Pair-wise sequence alignment Recursion formula: S(i,j) = max { S(i-1,j-1) + s(a_i,b_j), S(i-1,j) - g, S(i,j-1) - g }
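The recursion can be written down almost literally. This is a sketch under stated assumptions, not a reference implementation: the similarity function simple_s (+2 for identical residues, -1 otherwise) and the gap penalty g = 2 are arbitrary choices made here, and the border entries are set to -i*g and -j*g.

```python
def simple_s(a: str, b: str) -> int:
    """Hypothetical similarity score: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def score_matrix(a: str, b: str, s, g: int) -> list[list[int]]:
    """Fill S[i][j] = score of the best alignment of a[:i] and b[:j]."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * g                     # prefix of a against empty prefix
    for j in range(1, m + 1):
        S[0][j] = -j * g                     # prefix of b against empty prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # a_i aligned with b_j
                S[i - 1][j] - g,                          # a_i aligned with a gap
                S[i][j - 1] - g,                          # b_j aligned with a gap
            )
    return S


S = score_matrix("TYIV", "TLV", simple_s, g=2)
print(S[-1][-1])   # score of the best global alignment
```

The entry in the last row and last column is the score of the optimal global alignment; the alignment itself still has to be recovered from the matrix.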

Pair-wise sequence alignment Fill matrix from top left to bottom right [Figure: comparison matrix for T W L V R … (horizontal) and C I V M R E H Y (vertical), filled step by step]

Pair-wise sequence alignment Find optimal alignment by trace-back procedure
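A possible way to combine matrix filling and trace-back is sketched below. It assumes integer scores (so that the equality tests in the trace-back are exact) and, when several predecessors give the same score, prefers the diagonal step; that tie-breaking rule is a choice made here, not part of the slides. The names needleman_wunsch and simple_s are illustrative.

```python
def simple_s(a: str, b: str) -> int:
    """Hypothetical similarity score: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def needleman_wunsch(a: str, b: str, s, g: int) -> tuple[int, str, str]:
    """Global alignment: return (score, gapped a, gapped b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * g
    for j in range(1, m + 1):
        S[0][j] = -j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
    # Trace back from the bottom-right cell to the top-left cell.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - g:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = needleman_wunsch("TYIV", "TLV", simple_s, g=2)
print(score)
print(top)
print(bottom)
```

The trace-back only revisits cells on one optimal path, so it adds little to the cost of filling the matrix.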

Pair-wise sequence alignment Initial matrix entries?

Pair-wise sequence alignment Entries S(i,j): scores of optimal alignment of prefixes up to positions i and j.
T W L V
- C I V

Pair-wise sequence alignment Entries S(i,0): scores of optimal alignment of prefix up to position i and empty prefix. Score = - i * g
T W L V
- - - -

Pair-wise sequence alignment Initial matrix entries: Example, g = 2. First column, for the prefixes of C I V M R E H Y: C -2, I -4, V -6, M -8, R -10, E -12, H -14, Y -16
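The border values of this example follow directly from "score = - k * g" for a prefix of length k aligned with the empty prefix. A tiny sketch (the function name initial_border is an assumption made here):

```python
def initial_border(length: int, g: int) -> list[int]:
    """Border entries for prefixes of length 0..length against the empty prefix."""
    return [-k * g for k in range(length + 1)]


# Prefixes of C I V M R E H Y with gap penalty g = 2.
for residue, value in zip("CIVMREHY", initial_border(8, 2)[1:]):
    print(residue, value)
```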

Pair-wise global alignment
T W L V - R E A Q I
- C I V M R E - F Y

Pair-wise global alignment Complexity: l_1 and l_2 lengths of sequences. Computing time and memory proportional to l_1 * l_2. Time and space complexity = O(l_1 * l_2)

Pair-wise local alignment Sequences often share only local sequence similarity (conserved genes or domains). Important for database searching

Pair-wise local alignment
T W L V - R E A Q I
- C I V M R E - F Y

Pair-wise local alignment Problem: Find pair of segments with maximal alignment score (not necessarily part of optimal global alignment!)

Pair-wise sequence alignment Recursion formula for global alignment: S(i,j) = max { S(i-1,j-1) + s(a_i,b_j), S(i-1,j) - g, S(i,j-1) - g }

Pair-wise sequence alignment Recursion formula for local alignment: S(i,j) = max { 0, S(i-1,j-1) + s(a_i,b_j), S(i-1,j) - g, S(i,j-1) - g }

Pair-wise sequence alignment Initial matrix entries = 0 (first row and first column of the matrix for T W L V R … and C I V M R E H Y)

Pair-wise sequence alignment Example: s(C,T) = -2, so the first inner entry is max { 0, 0 + s(C,T), 0 - g, 0 - g } = 0

Pair-wise sequence alignment Recursion formula for local alignment: S(i,j) = max { 0, S(i-1,j-1) + s(a_i,b_j), S(i-1,j) - g, S(i,j-1) - g }. Store position with maximal value S(i,j) in matrix
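The local variant differs from the global one in three places: the border is 0, the recursion never drops below 0, and the trace-back starts at the stored maximum and stops at the first 0. A minimal sketch, again with an invented similarity function (simple_s: +2 for identity, -1 otherwise), integer scores and made-up test sequences:

```python
def simple_s(a: str, b: str) -> int:
    """Hypothetical similarity score: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def smith_waterman(a: str, b: str, s, g: int) -> tuple[int, str, str]:
    """Local alignment: return (best score, gapped segment of a, of b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]    # border entries stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
            if S[i][j] > best:                   # store position of maximum
                best, best_pos = S[i][j], (i, j)
    # Trace back from the maximum until an entry 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - g:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Two made-up sequences that share only the segment TYIV.
print(smith_waterman("GGGTYIVAAA", "CCTYIVDD", simple_s, g=2))
```

Because unrelated flanking regions can never push a score below 0, they do not penalise the conserved segment, which is what makes this variant suitable for database searching.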

Pair-wise local alignment
T W L V - R E A Q I
- C I V M R E - F Y

Pair-wise local alignment Algorithm by Smith and Waterman (1981). Implementation: e.g. BestFit in GCG package