1
Introduction to Bioinformatics
Burkhard Morgenstern
Institute of Microbiology and Genetics, Department of Bioinformatics
Goldschmidtstr. 1, Göttingen, March 2004
2
Introduction to Bioinformatics
Bioinformatics in Göttingen:
Dep. of Bioinformatics (UKG), Edgar Wingender
Dep. of Bioinformatics (IMG), BM
Inst. Num. and Applied Mathematics, Stephan Waack
Dep. of Genetics (Hans Fritz, IMG), Rainer Merkl
3
Introduction to Bioinformatics
Definition: Bioinformatics = development and application of software tools for Molecular Biology
4
Bioinformatics: Topics:
(a) Sequence Analysis (Gene finding …)
(b) Structure Analysis (RNA, Protein)
(c) Gene Expression Analysis
(d) Metabolic Pathways, Virtual Cell
5
Bioinformatics: Areas of work:
(a) Application of software tools for data analysis in (Molecular) Biology
(b) Computing infrastructure, database development, support
(c) Development of algorithms and software tools
6
Information flow in the cell
7
Idea: Sequence -> Structure -> Function
8
Information flow in the cell
Lots of data available at the sequence level
Fewer data at the structure and function level
9
Topics of lecture:
Data bases: SwissProt, GenBank
Pair-wise sequence comparison
Data base searching
Multiple sequence alignment
Gene prediction
10
Protein data bases
Sanger and Tuppy: protein-sequencing methods (1951)
Margaret Dayhoff: Atlas of Protein Sequence and Structure (1972); later: Protein Identification Resource (PIR) as international collaboration
(a) Organize proteins into families
(b) Amino acid substitution frequencies
Amos Bairoch: SwissProt (1986)
11
Exponential growth of data bases
17
DNA data bases
Maxam and Gilbert; Sanger: DNA sequencing methods (1977)
GenBank DNA data base (1979), now run by NCBI. Collaboration with EMBL (1982), DDBJ (1984)
Translated DNA sequences stored in protein data bases (PIR, trEMBL)
23
Most important tool for sequence analysis: Sequence comparison
24
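The dot plot
[figure: empty comparison matrix with sequence Y Q E W T Y I V A R E A Q Y E on the horizontal axis and sequence C I V M R E Q Y on the vertical axis]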
26
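The dot plot
[figure: comparison matrix of Y Q E W T Y I V A R E A Q Y E (horizontal) against C I V M R E Q Y (vertical); an X marks each cell where the two residues are identical]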
30
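The dot plot
[figure: comparison matrix of Y Q E W T Y Q E V R E Y Q E I (horizontal) against C I V M R Y Q E (vertical); the segment Y Q E of the vertical sequence matches three separate positions of the horizontal sequence]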
32
The dot plot
Advantages:
1. Various types of similarity detectable (repeats, inversions)
2. Useful for large-scale analysis
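A dot plot of this kind can be sketched in a few lines. The function name `dot_plot` and the text rendering (`X` for identical residues, `.` otherwise) are illustrative choices, not part of the lecture; the two sequences are the ones used on the dot-plot slides.

```python
def dot_plot(horizontal: str, vertical: str) -> list[str]:
    """Return one text row per residue of `vertical`.

    Each row starts with the residue label, followed by one cell per
    residue of `horizontal`: 'X' if the two residues are identical,
    '.' otherwise.
    """
    rows = []
    for b in vertical:
        cells = ["X" if a == b else "." for a in horizontal]
        rows.append(f"{b} " + " ".join(cells))
    return rows


if __name__ == "__main__":
    seq1 = "YQEWTYIVAREAQYE"   # horizontal sequence from the slides
    seq2 = "CIVMREQY"          # vertical sequence from the slides
    print("  " + " ".join(seq1))
    for row in dot_plot(seq1, seq2):
        print(row)
```

Runs of X along a diagonal indicate a shared segment; repeats show up as several parallel diagonals.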
33
The dot plot
34
Pair-wise sequence alignment
Evolutionarily or structurally related sequences: alignment possible
Sequence homologies are represented by inserting gaps
35
Pair-wise sequence alignment
[figure: comparison matrix of T Y I V A R E A Q Y E (horizontal) against C I V M R E Q Y (vertical); an X marks each cell where the two residues are identical]
39
Pair-wise sequence alignment
T Y I V A R E A Q Y E
C I V M R E Q Y
40
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
41
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Global alignment: sequences aligned over the entire length
42
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Basic task: Find best alignment of two sequences
43
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Basic task: Find best alignment of two sequences = alignment that reflects structural and evolutionary relations
44
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Questions:
1. What is a good alignment?
2. How to find the best alignment?
45
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Problem: Astronomical number of possible alignments
46
Pair-wise sequence alignment
T Y I V A R E A Q Y E
C I - V M R E - Q Y -
Problem: Astronomical number of possible alignments
47
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Problem: Astronomical number of possible alignments
Stupid computer has to find out: which alignment is best?
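To get a feeling for "astronomical", the number of alignments can be counted. The sketch below assumes one particular way of counting, which is not stated in the lecture: every alignment is a path through the comparison matrix built from three kinds of steps (residue against residue, or a gap in either sequence), so alignments that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` is hypothetical.

```python
def count_alignments(m: int, n: int) -> int:
    """Count paths from the top-left to the bottom-right corner of an
    (m+1) x (n+1) matrix using diagonal, horizontal and vertical steps.

    Each path corresponds to one gapped alignment of sequences of
    lengths m and n (under the counting convention described above).
    """
    # Along the borders there is exactly one path (gaps only).
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column aligns two residues
                + counts[i - 1][j]    # last column: residue against gap
                + counts[i][j - 1]    # last column: gap against residue
            )
    return counts[m][n]


if __name__ == "__main__":
    # Lengths of the two example sequences from the slides.
    print(count_alignments(11, 8))
    # Growth with sequence length.
    for length in (5, 10, 20, 50):
        print(length, count_alignments(length, length))
```

Enumerating all alignments is therefore hopeless even for short sequences, which motivates a scoring scheme together with an efficient search.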
48
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
First (simplified) rules:
1. Minimize number of mismatches
2. Maximize number of matches
49
Pair-wise sequence alignment
T Y I V A R E A Q Y E
C I - V M R E - Q Y -
First (simplified) rules:
1. Minimize number of mismatches
2. Maximize number of matches
52
Pair-wise sequence alignment
T Y I V A R E A Q Y E
C I - V M R E - Q Y -
Second (simplified) rule: Minimize number of gaps
53
Pair-wise sequence alignment
T Y I V - A R E A Q Y E
C I - V M - R E - Q Y -
Second (simplified) rule: Minimize number of gaps
54
Pair-wise sequence alignment
For protein sequences: amino acids differ in their degree of similarity, so simply counting matches and mismatches is too simplistic.
55
Pair-wise sequence alignment
T Y I V
T L V
56
Pair-wise sequence alignment
T Y I V
T L - V
57
Pair-wise sequence alignment
T Y I V
T - L V
58
Pair-wise sequence alignment
T Y I V
T - L V
Use similarity scores for amino acids
60
Pair-wise sequence alignment
T Y I V
T - L V
Use similarity scores for amino acids: Define score s(a,b) for amino acids a and b
62
Pair-wise sequence alignment
T Y I V
T - L V
Given a similarity score for pairs of amino acids:
Define the score of an alignment as the sum of the similarity values s(a,b) of aligned residues, minus a gap penalty g for each residue aligned with a gap
63
Pair-wise sequence alignment
T Y I V
T - L V
Example: Score = s(T,T) + s(I,L) + s(V,V) - g
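The scoring rule can be written down directly. In the sketch below the similarity values in `TOY_SCORES` are made up for illustration (they are not taken from any published substitution matrix), and the names `alignment_score` and `toy_s` are hypothetical.

```python
GAP = "-"

def alignment_score(row1: str, row2: str, s, g: int) -> int:
    """Sum s(a, b) over aligned residue pairs and subtract the gap
    penalty g for every residue that is aligned with a gap."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == GAP or b == GAP:
            total -= g
        else:
            total += s(a, b)
    return total


# Hypothetical similarity values, chosen only to make the example concrete.
TOY_SCORES = {("T", "T"): 5, ("I", "L"): 2, ("V", "V"): 4}

def toy_s(a: str, b: str) -> int:
    """Symmetric lookup; unlisted pairs get a score of -1 (an assumption)."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -1))


if __name__ == "__main__":
    # s(T,T) + s(I,L) + s(V,V) - g with the toy values and g = 2
    print(alignment_score("TYIV", "T-LV", toy_s, g=2))
    # The alternative gap placement from the earlier slide
    print(alignment_score("TYIV", "TL-V", toy_s, g=2))
```

With these toy values the first alignment scores higher than the second, because it pairs I with the similar residue L instead of pairing Y with L.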
64
Pair-wise sequence alignment
T Y I V
T - L V
Dynamic-programming algorithm finds alignment with best score. (Needleman and Wunsch, 1970)
65
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Alignment corresponds to path through comparison matrix
66
Pair-wise sequence alignment
[figure: comparison matrix of T Y I V A R E A Q Y E (horizontal) against C I V M R E Q Y (vertical), with identical residues marked X]
67
Pair-wise sequence alignment
[figure: the same comparison matrix with the cells on the alignment path marked X]
68
Pair-wise sequence alignment
T Y I V A R E A Q Y E
- C I V M R E - Q Y -
Alignment corresponds to path through comparison matrix
69
Pair-wise sequence alignment
T W L V - R E A Q I
- C I V M R E - H Y
70
Pair-wise sequence alignment
Score of alignment: Sum of similarity values of aligned residues minus gap penalty
T W L V - R E A Q I
- C I V M R E - H Y
71
Pair-wise sequence alignment
Example: S = - g + s(W,C) + s(L,I) + s(V,V) - g + s(R,R) …
T W L V - R E A Q I
- C I V M R E - H Y
72
Pair-wise sequence alignment
Alignment corresponds to path through comparison matrix
[figure: comparison matrix with the alignment path marked X]
T W L V - R E A Q I
- C I V M R E - H Y
73
Pair-wise sequence alignment
Dynamic programming: Calculate scores S(i,j) of optimal alignment of prefixes up to positions i and j.
[figure: comparison matrix with column index i and row index j; path up to cell (i,j) marked X]
T W L V - R
- C I V M R
74
Pair-wise sequence alignment
S(i,j) can be calculated from possible predecessors S(i-1,j-1), S(i,j-1), S(i-1,j).
[figure: comparison matrix with column index i and row index j]
T W L V - R
- C I V M R
75
Pair-wise sequence alignment
Score of optimal path that comes from top left = S(i-1,j-1) + s(R,R)
[figure: comparison matrix with column index i and row index j]
T W L V - R
- C I V M R
76
Pair-wise sequence alignment
Score of optimal path that comes from above = S(i,j-1) - g
[figure: comparison matrix with column index i and row indices j-1, j]
T W L V R -
- C I V M R
77
Pair-wise sequence alignment
Score of optimal path that comes from left = S(i-1,j) - g
[figure: comparison matrix with column indices i-1, i and row index j]
T W L - - V R
- C I V M R -
78
Pair-wise sequence alignment
Score of optimal path = Maximum of these three values
[figure: comparison matrix with column indices i-1, i and row index j]
T W L - - V R
- C I V M R -
79
Pair-wise sequence alignment
Recursion formula:
S(i,j) = max { S(i-1,j-1) + s(a_i, b_j), S(i-1,j) - g, S(i,j-1) - g }
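The recursion translates almost literally into code. This is a minimal sketch that only computes the optimal score; it assumes the border initialisation S(i,0) = -i*g and S(0,j) = -j*g and uses a toy similarity function `identity_score` (+1 for identical residues, -1 otherwise) that is not part of the lecture.

```python
def global_alignment_score(a: str, b: str, s, g: int) -> int:
    """Optimal global alignment score of a and b.

    S[i][j] holds the best score for aligning the prefix a[:i]
    with the prefix b[:j].
    """
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # A prefix aligned against the empty prefix costs one gap per residue.
    for i in range(1, n + 1):
        S[i][0] = -i * g
    for j in range(1, m + 1):
        S[0][j] = -j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # a_i aligned with b_j
                S[i - 1][j] - g,                          # a_i aligned with a gap
                S[i][j - 1] - g,                          # b_j aligned with a gap
            )
    return S[n][m]


def identity_score(x: str, y: str) -> int:
    """Toy similarity function (an assumption for this example)."""
    return 1 if x == y else -1


if __name__ == "__main__":
    print(global_alignment_score("TYIV", "TLV", identity_score, g=2))
```

The strings are indexed from 0 in Python, so residue a_i of the formula is `a[i - 1]` in the code.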
80
Pair-wise sequence alignment
T W L V R
C I V M R E H Y
81
Pair-wise sequence alignment
Fill matrix from top left to bottom right:
[figure: matrix for T W L V R (horizontal) and C I V M R E H Y (vertical); first entries filled]
82
Pair-wise sequence alignment
Fill matrix from top left to bottom right:
[figure: the same matrix with one further entry filled]
83
Pair-wise sequence alignment
Fill matrix from top left to bottom right:
[figure: the same matrix with all entries filled]
84
Pair-wise sequence alignment
Find optimal alignment by trace-back procedure
[figure: completely filled matrix for T W L V R (horizontal) and C I V M R E H Y (vertical)]
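A trace-back can be sketched as follows: after the matrix is filled, walk from the bottom-right corner back to the top-left corner, at each cell choosing a predecessor that explains the stored score. The sketch assumes integer scores (so the equality tests are exact) and breaks ties in favour of the diagonal step; both are choices made for this example, as are the names `global_align` and `identity_score`.

```python
def global_align(a: str, b: str, s, g: int) -> tuple[str, str, int]:
    """Return (row for a, row for b, score) of one optimal global alignment."""
    n, m = len(a), len(b)
    # Fill the matrix from top left to bottom right.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * g
    for j in range(1, m + 1):
        S[0][j] = -j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)

    # Trace back from the bottom-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - g:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    # The rows were collected back to front.
    return "".join(reversed(top)), "".join(reversed(bottom)), S[n][m]


def identity_score(x: str, y: str) -> int:
    """Toy similarity function (an assumption for this example)."""
    return 1 if x == y else -1


if __name__ == "__main__":
    row1, row2, score = global_align("TWLVR", "CIVMREHY", identity_score, g=2)
    print(row1)
    print(row2)
    print(score)
```

If several alignments share the optimal score, this sketch returns only one of them.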
85
Pair-wise sequence alignment
Initial matrix entries?
[figure: matrix for T W L V R (horizontal) and C I V M R E H Y (vertical); only the first row and first column are marked]
86
Pair-wise sequence alignment
Entries S(i,j): scores of optimal alignment of prefixes up to positions i and j.
[figure: comparison matrix with column index i and row index j]
T W L V
- C I V
87
Pair-wise sequence alignment
Entries S(i,0): scores of optimal alignment of prefix up to position i and empty prefix.
Score = -i * g
[figure: comparison matrix with the first row marked up to column i]
T W L V
- - - -
88
Pair-wise sequence alignment
T W L V R
C I V M R E H Y
Initial matrix entries: Example, g = 2
89
Pair-wise sequence alignment
Initial matrix entries: Example, g = 2
         T   W   L   V   R
     0  -2  -4  -6  -8 -10
C   -2
I   -4
V   -6
M   -8
R  -10
E  -12
H  -14
Y  -16
90
Pair-wise global alignment
[figure: comparison matrix with the path of the global alignment marked X]
T W L V - R E A Q I
- C I V M R E - F Y
91
Pair-wise global alignment
Complexity: for sequences of lengths l1 and l2,
computing time and memory are proportional to l1 * l2
Time and space complexity = O(l1 * l2)
92
Pair-wise local alignment
Sequences often share only local sequence similarity (conserved genes or domains)
Important for database searching
93
Pair-wise local alignment
[figure: comparison matrix with alignment path marked X]
T W L V - R E A Q I
- C I V M R E - F Y
94
Pair-wise local alignment
[figure: comparison matrix with alignment path marked X]
T W L V - R E A Q I
- C I V M R E - F Y
95
Pair-wise local alignment
Problem: Find pair of segments with maximal alignment score (not necessarily part of optimal global alignment!)
96
Pair-wise local alignment
[figure: comparison matrix with alignment path marked X]
T W L V - R E A Q I
- C I V M R E - F Y
97
Pair-wise sequence alignment
Recursion formula for global alignment:
S(i,j) = max { S(i-1,j-1) + s(a_i, b_j), S(i-1,j) - g, S(i,j-1) - g }
98
Pair-wise sequence alignment
Recursion formula for local alignment:
S(i,j) = max { 0, S(i-1,j-1) + s(a_i, b_j), S(i-1,j) - g, S(i,j-1) - g }
99
Pair-wise sequence alignment
Initial matrix entries = 0
         T   W   L   V   R
     0   0   0   0   0   0
C    0
I    0
V    0
M    0
R    0
E    0
H    0
Y    0
100
Pair-wise sequence alignment
s(C,T) = -2
         T   W   L   V   R
     0   0   0   0   0   0
C    0   0
I    0
V    0
M    0
R    0
E    0
H    0
Y    0
101
Pair-wise sequence alignment
Recursion formula for local alignment:
S(i,j) = max { 0, S(i-1,j-1) + s(a_i, b_j), S(i-1,j) - g, S(i,j-1) - g }
Store position with maximal value S(i,j) in matrix
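The local variant differs from the global one in three places: the borders are initialised with 0, the maximum includes 0, and the trace-back starts at the stored maximum and stops at the first cell with value 0. The sketch below assumes integer scores and a toy similarity function `toy_score` (+2 for identical residues, -1 otherwise); the function names and the test sequences are illustrative and not taken from the lecture.

```python
def local_align(a: str, b: str, s, g: int) -> tuple[str, str, int]:
    """Return (row for a, row for b, score) of one best local alignment."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # borders stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
            if S[i][j] > best:
                # Store the position with the maximal value.
                best, best_pos = S[i][j], (i, j)

    # Trace back from the maximum until a cell with value 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - g:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), best


def toy_score(x: str, y: str) -> int:
    """Toy similarity function (an assumption for this example)."""
    return 2 if x == y else -1


if __name__ == "__main__":
    # Two made-up sequences that share only the segment ACGT.
    print(local_align("XXXACGTYYY", "ZZACGTWW", toy_score, g=2))
```

Because negative prefixes are reset to 0, dissimilar flanking regions do not lower the score of a conserved segment.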
102
Pair-wise local alignment
[figure: comparison matrix with alignment path marked X]
T W L V - R E A Q I
- C I V M R E - F Y
103
Pair-wise local alignment
Algorithm by Smith and Waterman (1981)
Implementation: e.g. BestFit in GCG package