CZ5225 Methods in Computational Biology Lecture 2-3: Protein Families and Family Prediction Methods Prof. Chen Yu Zong Tel: 6874-6877


1 CZ5225 Methods in Computational Biology Lecture 2-3: Protein Families and Family Prediction Methods Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS August 2004

2 2 Protein Evolution: SARS coronavirus as an example

3 3 SARS Coronavirus A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)

4 4 SARS Infection How the SARS coronavirus enters a cell and reproduces

5 5 Protein Evolution Generation of different species

6 6 Protein Families
Sequence alignment-based families.
– Based on the principle of the sequence-structure-function relationship.
– Derived by multiple sequence alignment.
– Database: PFAM (Nucleic Acids Res. 30:276-280)
Structure-based families.
– Derived by visual inspection and comparison of structures.
– Database: SCOP (J. Mol. Biol. 247, 536-540)
Functional families.
– Databases:
  G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: 346-349), ORDB (Nucleic Acids Res. 30:354-360)
  Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: 346-349)
  Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
  Transporters: TC-DB (Microbiol Mol Biol Rev. 64:354-411)
  Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: 294-295)
  Therapeutic targets: TTD (Nucleic Acids Res. 30, 412-415)
  Drug side-effect targets: DART (Drug Safety 26: 685-690)

7 7 Protein Families
Sequence families ≠ Structural families ≠ Functional families
Sequence similar, structure different
Sequence different, structure similar
Sequence similar, function different (distantly related proteins)
Sequence different, function similar
Homework: find examples

8 8 Protein Family Prediction Methods
Sequence alignment-based families: Multiple sequence alignment (HMM): HMMER; JMB 235, 1501-153; JMB 301, 173-190
Structure-based families: Visual inspection and comparison of structures
Functional families: Statistical learning methods:
– Neural network: ProtFun (Bioinformatics, 19:635-642)
– Support vector machines: SVMProt (Nucleic Acids Res., 31: 3692-3697)

9 9 Sequence Comparison as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap, marked by an arrow on the slide):
a: ATTCTTGC
b: ATCCTATTCTAGC
Bad alignment (two gaps, each marked by an arrow):
a: AT TCTT GC
b: ATCCTATTCTAGC
Many alignments can be constructed => which is the best?

10 10 How to rate an alignment?
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
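The scoring rule above can be sketched in a few lines of Python. The function names `w` and `alignment_score` are illustrative choices, not taken from the lecture; the weights are the ones given on the slide.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # scoring scheme from the slide


def w(x: str, y: str) -> int:
    """Score one aligned column; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# The alignment shown on the slide.
print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```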

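11 11 Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: grid with the letters of b (C G G A T C A T) across the top and the letters of a (C T T A A C T) down the side; a path through the grid corresponds to the alignment below.]
C---TTAACT
CGGATCA--T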

12 12 An optimal alignment: the alignment of maximum score
Let $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$.
$S_{i,j}$: the score of an optimal alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.
With proper initializations, $S_{i,j}$ can be computed as follows.

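13 13 Computing $S_{i,j}$
[Figure: dynamic-programming grid indexed by i and j; the cell $S_{i,j}$ is reached by three edges weighted $w(a_i, -)$, $w(-, b_j)$ and $w(a_i, b_j)$, and the final cell holds $S_{m,n}$.]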
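The recurrence itself appears only as an image on the slides. Assuming the standard dynamic-programming formulation, which is consistent with the three edge weights named on this slide and with the initial row and column shown on the next one, it can be written as:

\[
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
\]

with the initializations

\[
S_{0,0} = 0, \qquad S_{i,0} = S_{i-1,0} + w(a_i, -), \qquad S_{0,j} = S_{0,j-1} + w(-, b_j).
\]

The optimal global alignment score is then $S_{m,n}$.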

14 14 Initializations
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

15 15 $S_{3,5} = ?$
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21

16 16 $S_{3,5} = ?$
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14
Optimal score: $S_{7,8} = 14$

17 17 C T T A A C - T
      C G G A T C A T
[Matrix as on slide 16, with the traceback path from the optimal score highlighted.]
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
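The matrix fill and traceback walked through on slides 14 to 17 can be sketched as follows. This is a minimal illustration assuming the standard three-way maximum for each cell and a traceback that prefers the diagonal move when several moves tie; the function name `global_align` and the tie-breaking order are assumptions, not part of the lecture.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the global alignment matrix S and trace back one optimal alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initial column and row: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # align a_i with b_j
                          S[i - 1][j] + gap,      # align a_i with a gap
                          S[i][j - 1] + gap)      # align b_j with a gap
    # Traceback from S[m][n] to S[0][0], preferring the diagonal on ties.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return S, "".join(reversed(row_a)), "".join(reversed(row_b))


S, top, bottom = global_align("CTTAACT", "CGGATCAT")
print(S[3][5])   # 5
print(S[7][8])   # 14
print(top)       # CTTAAC-T
print(bottom)    # CGGATCAT
```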

18 18 Global Alignment vs. Local Alignment
[Figure: schematic of a global alignment compared with a local alignment.]

19 19 An optimal local alignment
$S_{i,j}$: the score of an optimal local alignment ending at $a_i$ and $b_j$.
With proper initializations, $S_{i,j}$ can be computed as follows.
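The local recurrence is also an image on the slide. Assuming the usual formulation, in which a score is never allowed to drop below zero (consistent with the all-zero first row and column in the matrices that follow), it reads:

\[
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
\qquad S_{i,0} = S_{0,j} = 0.
\]

The best local alignment score is the largest entry $\max_{i,j} S_{i,j}$ rather than the final cell.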

20 20 local alignment
Match: 8 Mismatch: -5 Gap symbol: -3
          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3    ?
A    0
C    0
T    0

21 21 local alignment
          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3   13   10
A    0    0    0    0    8    5    2   11    8
C    0    8    5    2    5    3   13   10    7
T    0    5    3    0    2   13   10    8   18
The best score: 18
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18
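The local alignment fill and traceback can be sketched in the same style. This is an illustrative version assuming the usual rule that scores are clamped at zero and that the traceback starts at the largest entry and stops at the first zero; the function name `local_align` and the diagonal-first tie-breaking are assumptions.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return the best local score and one alignment that achieves it."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Trace back from the best cell until a zero entry is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = local_align("CTTAACT", "CGGATCAT")
print(best)    # 18
print(top)     # A-C-T
print(bottom)  # ATCAT
```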

22 22 Multiple sequence alignment (MSA)
The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
Alignment:
GC-TC
A---C
G-ATC

23 23 How to score an MSA?
Sum-of-Pairs (SP-score):
Score(GC-TC, A---C, G-ATC) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)
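A minimal sketch of the SP-score follows. The slide does not give a scoring scheme for the MSA, so this example assumes the pairwise weights used earlier (match 8, mismatch -5, gap -3) and additionally assumes that a column pairing two gap symbols scores 0; both choices, and the names `column_score` and `sp_score`, are illustrative.

```python
from itertools import combinations


def column_score(x, y, match=8, mismatch=-5, gap=-3):
    """Score one column of a pairwise projection of the MSA."""
    if x == "-" and y == "-":
        return 0  # assumption: a gap aligned with a gap costs nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_score(msa):
    """Sum the pairwise alignment scores over all pairs of rows."""
    return sum(column_score(x, y)
               for row1, row2 in combinations(msa, 2)
               for x, y in zip(row1, row2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))  # 12  (pair scores: -3 + 18 + -3)
```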

24 24 Functional Classification by SVM
A protein is classified as either belonging (+) or not belonging (-) to a functional family.
By screening against all families, the function of this protein can be identified (example: SVMProt).
What is SVM? Support vector machines: a machine learning method (learning by examples, statistical learning) that classifies objects into one of two classes.
Advantages of SVM:
Accommodates diversity among class members.
Uses sequence-derived physico-chemical features as the basis for classification.
Suitable for functional family classification.
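To make the two-class idea concrete, here is a small linear SVM written from scratch. It is only a sketch: it uses Pegasos-style subgradient descent on the hinge loss with no bias term and a fixed sample order, which is a stand-in chosen for brevity and not the method used by SVMProt. The two-dimensional feature vectors and the names `train_linear_svm` and `predict` are hypothetical.

```python
def train_linear_svm(samples, labels, lam=0.1, epochs=50):
    """Pegasos-style training of a linear SVM (no bias term).

    samples: list of feature vectors; labels: +1 (member) or -1 (non-member).
    """
    w = [0.0] * len(samples[0])
    t = 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink w (regularization step) ...
            w = [(1.0 - eta * lam) * wi for wi in w]
            # ... and push it toward the sample if the margin is violated.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 if x falls on the positive side of the hyperplane, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Hypothetical 2-D feature vectors: family members (+1) and non-members (-1).
samples = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w = train_linear_svm(samples, labels)
print([predict(w, x) for x in samples])  # [1, 1, 1, -1, -1, -1]
```

In practice a library implementation with a kernel and a bias term would be used; the point here is only the (+)/(-) decision per family.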

25 25 SVM References
C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
Online lecture notes

26 26 Introduction to Machine Learning
Goal: To “improve” (gaining knowledge, enhancing computing capability)
Tasks:
Forming concepts by data generalization.
Compiling knowledge into compact form.
Finding useful explanations for valid concepts.
Clustering data into classes.
Reference: Machine Learning in Molecular Biology Sequence Analysis.
Internet links: http://www.ai.univie.ac.at/oefai/ml/ml-resources.html

27 27 Introduction to Machine Learning
Category:
Inductive learning. Forming concepts from data without much domain knowledge (learning from examples).
Analytic learning. Use of existing knowledge to derive new useful concepts (explanation-based learning).
Connectionist learning. Use of artificial neural networks in searching for or representing concepts.
Genetic algorithms. Searching for the most effective concept by means of Darwin’s “survival of the fittest” approach.

28 28 Machine Learning Methods Inductive learning: Concept learning and example-based learning Concept learning:

29 29 Machine Learning Methods Analytic learning:

30 30 Machine Learning Methods Neural network:

31 31 Machine Learning Methods Genetic algorithms: Strength Pattern Classification

32 32

33 33 SVM

34 34 SVM

35 35 SVM

36 36 SVM

37 37 SVM

38 38 SVM

39 39 SVM

40 40 SVM

41 41 SVM

42 42 SVM

43 43 SVM

44 44 SVM for Classification of Proteins
How to represent a protein?
Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
– amino acid composition
– hydrophobicity
– normalized van der Waals volume
– polarity
– polarizability
– charge
– surface tension
– secondary structure
– solvent accessibility
Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.
Nucleic Acids Res., 31: 3692-3697

45 45 SVM for Classification of Proteins
Descriptors for amino acid composition of protein:
C = (53.33, 46.67)
T = (51.72)
D = (3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0)
Nucleic Acids Res., 31: 3692-3697
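The sequence behind these numbers is not captured in the transcript. The sketch below therefore uses a hypothetical 30-residue sequence over two letters (A and E), chosen only because it reproduces the quoted values. It assumes C is the percentage of each class, T is the percentage of adjacent positions where the class changes, and D gives the positions (as a percentage of sequence length) of the first, 25%, 50%, 75% and 100% occurrences of each class. The function name `ctd_descriptors` is illustrative, and the sketch handles only the two-class case with every class present.

```python
def ctd_descriptors(seq, classes):
    """Composition, transition and distribution descriptors for a two-class sequence."""
    n = len(seq)
    # C: percentage of positions occupied by each class.
    composition = [round(100.0 * seq.count(c) / n, 2) for c in classes]
    # T: percentage of adjacent pairs whose classes differ.
    changes = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
    transition = [round(100.0 * changes / (n - 1), 2)]
    # D: relative positions of the first, 25%, 50%, 75% and last occurrence.
    distribution = []
    for c in classes:
        positions = [i for i, x in enumerate(seq, start=1) if x == c]
        count = len(positions)
        for k in range(5):
            rank = 1 if k == 0 else max(1, count * k // 4)
            distribution.append(round(100.0 * positions[rank - 1] / n, 2))
    return composition, transition, distribution


# Hypothetical sequence (an assumption, not taken from the slide).
seq = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"
C, T, D = ctd_descriptors(seq, "AE")
print(C)  # [53.33, 46.67]
print(T)  # [51.72]
print(D)  # [3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0]
```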

46 46 Assignment 1
CZ5225 Methods in Computational Biology Assignment 1
Project 1: Protein family classification by SVM
– Construction of training and testing datasets
– Generating feature vectors
– SVM classification and analysis
– Write a report and include a softcopy of your datasets
Project 2: Develop a program for pair-wise sequence alignment using a simple scoring scheme.
– Write the code in any programming language
– Test it on a few examples (such as estrogen receptor and progesterone receptor)
– Can you extend your program to multiple alignment?
– Write a report and include a softcopy of your program

