CZ5225 Methods in Computational Biology Lecture 2-3: Protein Families and Family Prediction Methods Prof. Chen Yu Zong Tel: Room 07-24, level 7, SOC1, NUS August
2 Protein Evolution: SARS coronavirus as an example
3 SARS Coronavirus A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)
4 SARS Infection How the SARS coronavirus enters a cell and reproduces
5 Protein Evolution Generation of different species
6 Protein Families
Sequence alignment-based families.
–Based on the principle of the sequence-structure-function relationship.
–Derived by multiple sequence alignment
–Database: PFAM (Nucleic Acids Res. 30: )
Structure-based families.
–Derived by visual inspection and comparison of structures
–Database: SCOP (J. Mol. Biol. 247, )
Functional families.
–Databases:
G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: ), ORDB (Nucleic Acids Res. 30: )
Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: )
Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
Transporters: TC-DB (Microbiol Mol Biol Rev. 64: )
Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: )
Therapeutic targets: TTD (Nucleic Acids Res. 30, )
Drug side-effect targets: DART (Drug Safety 26: )
7 Protein Families
Sequence families ≠ Structural families ≠ Functional families
Sequence similar, structure different
Sequence different, structure similar
Sequence similar, function different (distantly related proteins)
Sequence different, function similar
Homework: find examples
8 Protein Family Prediction Methods
Sequence alignment-based families: Multiple sequence alignment (HMM): HMMER; JMB 235, ; JMB 301,
Structure-based families: Visual inspection and comparison of structures
Functional families: Statistical learning methods:
–Neural network: ProtFun (Bioinformatics, 19: )
–Support vector machines: SVMProt (Nucleic Acids Res., 31: )
9 Sequence Comparison as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap): ATTCTTGC against ATCCTATTCTAGC
Bad alignment (two gaps): AT TCTT GC against ATCCTATTCTAGC
Many alignments can be constructed => which is the best?
10 How to rate an alignment?
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
C - - - T T A A C T
C G G A T C A - - T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
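The scoring rule on this slide can be applied column by column. The following is a minimal sketch in Python; the function name `alignment_score` and its keyword parameters are hypothetical, and the aligned rows are the ones shown on the Alignment Graph slide (C---TTAACT over CGGATCA--T).

```python
def alignment_score(row_a, row_b, match=8, mismatch=-5, gap=-3):
    """Sum the column scores of two aligned rows of equal length ('-' is a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # w(-, x) = w(x, -) = -3
        elif x == y:
            total += match      # w(x, y) = 8 if x = y
        else:
            total += mismatch   # w(x, y) = -5 if x != y
    return total


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```

Four matches, one mismatch and five gap symbols give 4(8) - 5 - 5(3) = 12, the score quoted on the slide.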
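11 Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: alignment graph with the letters of sequence b (C G G A T C A T) along one axis and the letters of sequence a (C T T A A C T) along the other]
Alignment shown:
C---TTAACT
CGGATCA--T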
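12 An optimal alignment -- the alignment of maximum score
Let $A = a_1 a_2 \dots a_m$ and $B = b_1 b_2 \dots b_n$.
$S_{i,j}$: the score of an optimal alignment between $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_j$.
With proper initializations, $S_{i,j}$ can be computed as follows:
$$S_{i,j} = \max \{\, S_{i-1,j} + w(a_i, -),\ S_{i,j-1} + w(-, b_j),\ S_{i-1,j-1} + w(a_i, b_j) \,\}$$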
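13 Computing $S_{i,j}$ [Figure: cell (i, j) is reached from (i-1, j) with weight $w(a_i, -)$, from (i, j-1) with weight $w(-, b_j)$, and from (i-1, j-1) with weight $w(a_i, b_j)$; the score of the full alignment is $S_{m,n}$]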
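The three moves into cell (i, j) translate directly into a dynamic-programming table. The sketch below assumes the scoring scheme from the earlier slide (match 8, mismatch -5, gap -3). It also assumes the usual initialization in which the first row and column accumulate gap penalties, so $S_{i,0} = -3i$ and $S_{0,j} = -3j$. The function name `global_alignment_table` is hypothetical.

```python
def global_alignment_table(a, b, match=8, mismatch=-5, gap=-3):
    """Fill S[i][j], the optimal global score of a[:i] against b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed initialization: a prefix aligned against nothing costs one gap per letter.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j] + gap,      # a_i aligned with a gap
                S[i][j - 1] + gap,      # b_j aligned with a gap
                S[i - 1][j - 1] + w,    # a_i aligned with b_j
            )
    return S


S = global_alignment_table("CTTAACT", "CGGATCAT")
print(S[7][8])  # 14, the score S_{m,n} of the full alignment
```

Filling the table takes time proportional to m × n, which is why this approach scales to sequences far longer than can be aligned by enumerating alternatives.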
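14 Initializations [Figure: dynamic-programming table for a = CTTAACT and b = CGGATCAT with the first row and first column filled in]
15 $S_{3,5}$ = ? [Figure: the same table, partially filled, with cell (3, 5) marked]
16 $S_{3,5}$ = ? [Figure: the completed table; the last cell holds the optimal score]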
17
C T T A A C - T
C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
[Figure: the completed table for a = CTTAACT and b = CGGATCAT with the path of this alignment]
18 Global Alignment vs. Local Alignment global alignment: local alignment:
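19 An optimal local alignment
$S_{i,j}$: the score of an optimal local alignment ending at $a_i$ and $b_j$.
With proper initializations, $S_{i,j}$ can be computed as follows:
$$S_{i,j} = \max \{\, 0,\ S_{i-1,j} + w(a_i, -),\ S_{i,j-1} + w(-, b_j),\ S_{i-1,j-1} + w(a_i, b_j) \,\}$$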
20 Local alignment
Match: 8 Mismatch: -5 Gap symbol: -3
[Figure: dynamic-programming table for a = CTTAACT and b = CGGATCAT]
21 The best score
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18 (local alignment)
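The local variant differs from the global one in two ways: every cell is floored at 0, and the answer is the largest cell anywhere in the table rather than the last one. The sketch below assumes the same scoring scheme (match 8, mismatch -5, gap -3) and zero initialization of the first row and column. The function name `local_alignment` and the traceback tie-breaking order (diagonal, then up, then left) are illustrative choices, not taken from the slides.

```python
def local_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best score, aligned piece of a, aligned piece of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 stay at 0
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j] + gap, S[i][j - 1] + gap,
                          S[i - 1][j - 1] + w)
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_cell
    row_a, row_b = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + w:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


score, piece_a, piece_b = local_alignment("CTTAACT", "CGGATCAT")
print(score)    # 18
print(piece_a)  # A-C-T
print(piece_b)  # ATCAT
```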
22 Multiple sequence alignment (MSA)
The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
Aligned:
GC-TC
A---C
G-ATC
23 How to score an MSA?
Sum-of-Pairs (SP-score)
GC-TC
A---C
G-ATC
Score = score(GC-TC, A---C) + score(GC-TC, G-ATC) + score(A---C, G-ATC)
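The SP-score can be sketched as a sum over all pairs of rows. The slide does not state which pairwise scheme is used, so the sketch below assumes the earlier one (match 8, mismatch -5, gap -3) and additionally assumes that a column in which both rows have a gap contributes 0. The names `pair_score` and `sp_score` are hypothetical.

```python
from itertools import combinations


def pair_score(row_x, row_y, match=8, mismatch=-5, gap=-3):
    """Score two rows of an MSA column by column."""
    total = 0
    for x, y in zip(row_x, row_y):
        if x == "-" and y == "-":
            continue            # assumption: gap against gap scores 0
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sp_score(rows):
    """Sum-of-pairs score: add up pair_score over every pair of rows."""
    return sum(pair_score(x, y) for x, y in combinations(rows, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))  # 12 = (-3) + 18 + (-3)
```

Under these assumptions the three pairwise scores are -3, 18 and -3; a different convention for gap-gap columns would change the total.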
24 Functional Classification by SVM
A protein is classified as either belonging (+) or not belonging (-) to a functional family.
By screening against all families, the function of the protein can be identified (example: SVMProt).
What is SVM?
Support vector machines: a statistical machine learning method that learns from examples and classifies objects into one of two classes.
Advantages of SVM:
–Tolerates diversity among the members of a class.
–Uses sequence-derived physico-chemical features as the basis for classification.
–Suitable for functional family classification.
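The screening step can be illustrated with a toy example. The sketch below assumes that each family already has a trained two-class model and, for simplicity, that the model is a linear decision function f(x) = w·x + b; a real system such as SVMProt would use trained kernel SVMs. The family names, weights and feature values are all made up for illustration.

```python
def decision_value(weights, bias, features):
    """Linear decision function f(x) = w . x + b; f(x) > 0 means class (+)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def screen_families(features, models):
    """Return the families whose classifier says (+), strongest first."""
    hits = {}
    for family, (weights, bias) in models.items():
        value = decision_value(weights, bias, features)
        if value > 0:
            hits[family] = value
    return sorted(hits, key=hits.get, reverse=True)


# Hypothetical per-family models: family name -> (weights, bias).
models = {
    "family_A": ([1.0, -1.0], 0.0),
    "family_B": ([-1.0, 1.0], 0.0),
    "family_C": ([0.5, 0.5], -2.0),
}

print(screen_families([2.0, 1.0], models))  # ['family_A']
print(screen_families([3.0, 2.0], models))  # ['family_A', 'family_C']
```

Because each family has its own yes/no classifier, a protein can be assigned to several families or to none.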
25 SVM References
C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
Online lecture notes
26 Introduction to Machine Learning
Goal: to "improve" (gaining knowledge, enhancing computing capability)
Tasks:
Forming concepts by data generalization.
Compiling knowledge into compact form.
Finding useful explanations for valid concepts.
Clustering data into classes.
Reference: Machine Learning in Molecular Biology Sequence Analysis.
27 Introduction to Machine Learning
Categories:
Inductive learning. Forming concepts from data without much domain knowledge (learning from examples).
Analytic learning. Using existing knowledge to derive new useful concepts (explanation-based learning).
Connectionist learning. Using artificial neural networks to search for or represent concepts.
Genetic algorithms. Searching for the most effective concept by means of Darwin's "survival of the fittest" approach.
28 Machine Learning Methods Inductive learning: Concept learning and example-based learning Concept learning:
29 Machine Learning Methods Analytic learning:
30 Machine Learning Methods Neural network:
31 Machine Learning Methods Genetic algorithms: Strength Pattern Classification
33 SVM
44 SVM for Classification of Proteins
How to represent a protein?
Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
–amino acid composition
–hydrophobicity
–normalized van der Waals volume
–polarity
–polarizability
–charge
–surface tension
–secondary structure
–solvent accessibility
Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties. Nucleic Acids Res., 31:
45 SVM for Classification of Proteins Descriptors for amino acid composition of protein: C=(53.33, 46.67) T=(51.72) D=(3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0) Nucleic Acids Res., 31:
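The three descriptors can be computed as in the sketch below. The slide's source sequence is not preserved here, so the 30-letter two-class sequence in the code is an assumption, chosen because it reproduces the C, T and D values above. The function name `ctd_descriptors` and the rule for locating the 25%, 50% and 75% marks (rounding the member rank down, with a minimum of 1) are likewise assumptions.

```python
from itertools import combinations


def ctd_descriptors(encoded, classes):
    """Composition, transition and distribution descriptors, in percent."""
    n = len(encoded)
    # C: percentage of positions occupied by each class.
    composition = [round(100 * encoded.count(c) / n, 2) for c in classes]
    # T: percentage of adjacent pairs that switch between two given classes.
    pairs = list(zip(encoded, encoded[1:]))
    transition = [
        round(100 * sum(1 for x, y in pairs if {x, y} == {c1, c2}) / (n - 1), 2)
        for c1, c2 in combinations(classes, 2)
    ]
    # D: where along the chain the first, 25%, 50%, 75% and 100% of each
    # class's members are found, as a percentage of the sequence length.
    distribution = []
    for c in classes:
        positions = [k for k, x in enumerate(encoded, start=1) if x == c]
        if not positions:
            distribution.extend([0.0] * 5)
            continue
        ranks = [1] + [max(1, len(positions) * pct // 100)
                       for pct in (25, 50, 75, 100)]
        distribution.extend(round(100 * positions[r - 1] / n, 2) for r in ranks)
    return composition, transition, distribution


# Assumed example sequence with two classes, A (16 members) and E (14 members).
sequence = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"
C, T, D = ctd_descriptors(sequence, "AE")
print(C)  # [53.33, 46.67]
print(T)  # [51.72]
print(D)  # [3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0]
```

For a property with more than two classes the same function returns one T value per pair of classes and five D values per class, so the feature vector grows accordingly.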
46 Assignment 1
CZ5225 Methods in Computational Biology, Assignment 1
Project 1: Protein family classification by SVM
–Construct training and testing datasets
–Generate feature vectors
–Perform SVM classification and analysis
–Write a report and include a softcopy of your datasets
Project 2: Develop a program for pairwise sequence alignment using a simple scoring scheme.
–Write the code in any programming language
–Test it on a few examples (such as estrogen receptor and progesterone receptor)
–Can you extend your program to multiple alignment?
–Write a report and include a softcopy of your program