CZ5225 Methods in Computational Biology Lecture 2-3: Protein Families and Family Prediction Methods Prof. Chen Yu Zong Tel: 6874-6877

Presentation transcript:

CZ5225 Methods in Computational Biology Lecture 2-3: Protein Families and Family Prediction Methods Prof. Chen Yu Zong Tel: 6874-6877 Room 07-24, level 7, SOC1, NUS August

2 Protein Evolution: SARS coronavirus as an example

3 SARS Coronavirus A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)

4 SARS Infection How the SARS coronavirus enters a cell and reproduces

5 Protein Evolution Generation of different species

6 Protein Families
Sequence alignment-based families.
– Based on the principle of the sequence-structure-function relationship
– Derived by multiple sequence alignment
– Database: PFAM (Nucleic Acids Res. 30)
Structure-based families.
– Derived by visual inspection and comparison of structures
– Database: SCOP (J. Mol. Biol. 247)
Functional families.
– Databases:
G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29), ORDB (Nucleic Acids Res. 30)
Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29)
Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
Transporters: TC-DB (Microbiol Mol Biol Rev. 64)
Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29)
Therapeutic targets: TTD (Nucleic Acids Res. 30)
Drug side-effect targets: DART (Drug Safety 26)

7 Protein Families
Sequence families ≠ Structural families ≠ Functional families
– Sequence similar, structure different
– Sequence different, structure similar
– Sequence similar, function different (distantly related proteins)
– Sequence different, function similar
Homework: find examples

8 Protein Family Prediction Methods
Sequence alignment-based families:
– Multiple sequence alignment (HMM): HMMER; JMB 235; JMB 301
Structure-based families:
– Visual inspection and comparison of structures
Functional families:
– Statistical learning methods:
Neural network: ProtFun (Bioinformatics 19)
Support vector machines: SVMProt (Nucleic Acids Res. 31)

9 Sequence Comparison as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap):
ATTCTTGC
ATCCTATTCTAGC
Bad alignment (two gaps):
AT TCTT GC
ATCCTATTCTAGC
Many alignments can be constructed => which is the best?

10 How to rate an alignment?
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
C - - - T T A A C T
C G G A T C A - - T
8 - 3 - 3 - 3 + 8 - 5 + 8 - 3 - 3 + 8 = +12 (alignment score)
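The scoring rule above can be checked with a few lines of code. This is a minimal sketch, assuming the two rows of an alignment are given as equal-length strings with "-" as the gap symbol; the function names `w` and `alignment_score` are illustrative and do not come from the lecture.

```python
def w(x: str, y: str) -> int:
    """Column score: +8 match, -5 mismatch, -3 if either symbol is a gap."""
    if x == "-" or y == "-":
        return -3
    return 8 if x == y else -5


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# The alignment of CTTAACT and CGGATCAT used on the slide.
print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```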

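11 Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: alignment graph with sequence b (CGGATCAT) along one axis and sequence a (CTTAACT) along the other; the marked path corresponds to the alignment below.]
C---TTAACT
CGGATCA--T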

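12 An optimal alignment: the alignment of maximum score
Let $A = a_1 a_2 \dots a_m$ and $B = b_1 b_2 \dots b_n$.
$S_{i,j}$: the score of an optimal alignment between $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_j$.
With proper initializations, $S_{i,j}$ can be computed as follows:
$$S_{i,j} = \max \{\, S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j),\; S_{i-1,j-1} + w(a_i, b_j) \,\}$$

13 Computing $S_{i,j}$
[Figure: cell $(i,j)$ of the score matrix is reached from $(i-1,j)$ with $w(a_i,-)$, from $(i,j-1)$ with $w(-,b_j)$, and from $(i-1,j-1)$ with $w(a_i,b_j)$; the score of the optimal alignment is $S_{m,n}$.]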

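14 Initializations
$S_{0,0} = 0$, $S_{i,0} = S_{i-1,0} + w(a_i, -)$, $S_{0,j} = S_{0,j-1} + w(-, b_j)$
[Figure: score matrix for CTTAACT (rows) and CGGATCAT (columns) with the first row and first column filled in.]

15 $S_{3,5}$ = ?
[Figure: partially filled score matrix for CTTAACT and CGGATCAT.]

16 $S_{3,5}$ = ?
[Figure: completed score matrix for CTTAACT and CGGATCAT; the last cell holds the optimal score.]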

17
C T T A A C - T
C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
[Figure: score matrix for CTTAACT and CGGATCAT with the traceback path of the optimal alignment.]
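The recurrence for the optimal global alignment score can be sketched as a small dynamic-programming routine. This is an illustrative sketch rather than the lecture's own program: the function name `global_alignment` and its parameters are assumptions, the default scores are the match/mismatch/gap values from the slides, and ties in the traceback are broken in an arbitrary fixed order (diagonal, then up, then left).

```python
def global_alignment(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3):
    """Fill the score matrix S and trace back one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: aligning a prefix against the empty string costs only gaps.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    # Recurrence: best of gap in b, gap in a, or aligning a[i-1] with b[j-1].
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap, S[i][j - 1] + gap, S[i - 1][j - 1] + sub)
    # Traceback from S[m][n] to S[0][0].
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return S, "".join(reversed(row_a)), "".join(reversed(row_b))


S, row_a, row_b = global_alignment("CTTAACT", "CGGATCAT")
print(S[7][8])  # 14
print(row_a)
print(row_b)
```

The matrix has (m+1) x (n+1) cells, so time and memory both grow with the product of the two sequence lengths.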

18 Global Alignment vs. Local Alignment
Global alignment: aligns the two sequences over their entire lengths.
Local alignment: aligns the best-matching pair of subsequences.

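19 An optimal local alignment
$S_{i,j}$: the score of an optimal local alignment ending at $a_i$ and $b_j$.
With proper initializations, $S_{i,j}$ can be computed as follows:
$$S_{i,j} = \max \{\, 0,\; S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j),\; S_{i-1,j-1} + w(a_i, b_j) \,\}$$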

20 Local alignment
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: local-alignment score matrix for CTTAACT and CGGATCAT, to be filled in.]

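21 Local alignment: the best score
[Figure: completed local-alignment score matrix for CTTAACT and CGGATCAT with the best-scoring cell marked.]
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18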
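The local-alignment variant differs from the global one in two places: every cell is floored at 0, and the traceback starts at the best-scoring cell and stops at the first 0. The following is a minimal sketch under the same scoring values; the function name `local_alignment` is an assumption made for illustration.

```python
def local_alignment(a: str, b: str, match: int = 8, mismatch: int = -5, gap: int = -3):
    """Return (best score, aligned row of a, aligned row of b) for a local alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j] + gap, S[i][j - 1] + gap, S[i - 1][j - 1] + sub)
            if S[i][j] > best:
                best, best_i, best_j = S[i][j], i, j
    # Trace back from the best cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


best, row_a, row_b = local_alignment("CTTAACT", "CGGATCAT")
print(best)   # 18
print(row_a)  # A-C-T
print(row_b)  # ATCAT
```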

22 Multiple sequence alignment (MSA)
The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
Alignment:
GC-TC
A---C
G-ATC

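23 How to score an MSA?
Sum-of-Pairs (SP-score)
MSA:
GC-TC
A---C
G-ATC
Pairwise alignments induced by the MSA:
GC-TC / A---C
GC-TC / G-ATC
A---C / G-ATC
Score = score(Seq1, Seq2) + score(Seq1, Seq3) + score(Seq2, Seq3)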
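The sum-of-pairs idea can be written down directly. The slide does not state which column scores are used here, so this sketch makes two assumptions: it reuses the pairwise values from the earlier slides (match +8, mismatch -5, gap -3), and it scores a column where both rows have a gap as 0. The names `pair_column_score` and `sp_score` are illustrative.

```python
from itertools import combinations


def pair_column_score(x: str, y: str) -> int:
    """Score one column of one induced pairwise alignment."""
    if x == "-" and y == "-":
        return 0   # assumption: gap against gap costs nothing
    if x == "-" or y == "-":
        return -3  # assumption: same gap penalty as for pairwise alignment
    return 8 if x == y else -5


def sp_score(rows: list[str]) -> int:
    """Sum the scores of all pairwise alignments induced by the MSA rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all MSA rows must have the same length")
    return sum(
        pair_column_score(x, y)
        for r1, r2 in combinations(rows, 2)
        for x, y in zip(r1, r2)
    )


# Under these assumed scores the three pairs contribute -3, 18 and -3.
print(sp_score(["GC-TC", "A---C", "G-ATC"]))  # 12
```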

24 Functional Classification by SVM
A protein is classified as either belonging (+) or not belonging (-) to a functional family.
By screening against all families, the function of this protein can be identified (example: SVMProt).
What is SVM? Support vector machines are a statistical machine learning method that learns from examples and classifies objects into one of two classes.
Advantages of SVM:
– Diversity of class members (no racial discrimination).
– Use of sequence-derived physico-chemical features as the basis for classification.
– Suitable for functional family classification.
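To make the two-class idea concrete, here is a minimal sketch of a linear soft-margin SVM trained by full-batch subgradient descent on the hinge loss. It is a simplified stand-in, not the method used by SVMProt: practical tools rely on optimized solvers and often on nonlinear kernels. The two-dimensional feature vectors, the hyperparameters, and the function names are all hypothetical; +1 stands for "belongs to the family" and -1 for "does not belong".

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b))) by subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:  # only points inside the margin contribute
                for k in range(d):
                    grad_w[k] -= yi * xi[k] / n
                grad_b -= yi / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x) -> int:
    """Return +1 (member of the family) or -1 (non-member)."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
X = [(2, 2), (3, 3), (2, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3), (-3, -2)]
y = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print(predict(w, b, (1, 1)), predict(w, b, (-1, -0.5)))  # 1 -1
```

Screening a protein against all families then amounts to training one such classifier per family and reporting the families whose classifier returns +1.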

25 SVM References
– C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (online).
– R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
– S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
– Online lecture notes

26 Introduction to Machine Learning
Goal: To “improve” (gaining knowledge, enhancing computing capability)
Tasks:
– Forming concepts by data generalization.
– Compiling knowledge into compact form.
– Finding useful explanations for valid concepts.
– Clustering data into classes.
Reference: Machine Learning in Molecular Biology Sequence Analysis.

27 Introduction to Machine Learning
Categories:
– Inductive learning. Forming concepts from data without much domain knowledge (learning from examples).
– Analytic learning. Use of existing knowledge to derive new useful concepts (explanation-based learning).
– Connectionist learning. Use of artificial neural networks in searching for or representing concepts.
– Genetic algorithms. Searching for the most effective concept by means of Darwin’s “survival of the fittest” approach.

28 Machine Learning Methods Inductive learning: Concept learning and example-based learning Concept learning:

29 Machine Learning Methods Analytic learning:

30 Machine Learning Methods Neural network:

31 Machine Learning Methods Genetic algorithms: Strength Pattern Classification

32

33 SVM

34 SVM

35 SVM

36 SVM

37 SVM

38 SVM

39 SVM

40 SVM

41 SVM

42 SVM

43 SVM

44 SVM for Classification of Proteins
How to represent a protein?
Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
– amino acid composition
– hydrophobicity
– normalized Van der Waals volume
– polarity
– polarizability
– charge
– surface tension
– secondary structure
– solvent accessibility
Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.
(Nucleic Acids Res. 31)

45 SVM for Classification of Proteins
Descriptors for amino acid composition of protein:
C = (53.33, 46.67)
T = (51.72)
D = (3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0)
(Nucleic Acids Res. 31)
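The C, T and D values can be reproduced with a short routine. This is a sketch under stated assumptions: the slide's example sequence is not preserved in the transcript, so the 30-letter string below (16 residues of class "A", 14 of class "E") is a hypothetical input chosen because it yields the same numbers. The reading of D used here (position of the first, 25%, 50%, 75% and 100% residue of each class, as a percentage of sequence length, with fractional ranks rounded down) is likewise an assumption, and the single transition value only covers the two-class case.

```python
def ctd_descriptors(encoded: str, classes: str):
    """Composition, transition and distribution descriptors for a class-encoded sequence.

    Assumes every class in `classes` occurs at least once in `encoded`.
    """
    n = len(encoded)
    # C: percentage of residues belonging to each class.
    comp = [round(100.0 * encoded.count(c) / n, 2) for c in classes]
    # T: percentage of adjacent residue pairs whose classes differ (two-class case).
    changes = sum(1 for x, y in zip(encoded, encoded[1:]) if x != y)
    trans = round(100.0 * changes / (n - 1), 2)
    # D: relative position of the first, 25%, 50%, 75% and 100% residue of each class.
    dist = []
    for c in classes:
        positions = [i + 1 for i, x in enumerate(encoded) if x == c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            rank = max(1, int(len(positions) * frac))
            dist.append(round(100.0 * positions[rank - 1] / n, 2))
    return comp, trans, dist


# Hypothetical two-class sequence (not taken from the transcript).
seq = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"
C, T, D = ctd_descriptors(seq, "AE")
print(C)  # [53.33, 46.67]
print(T)  # 51.72
print(D)  # [3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0]
```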

46 Assignment 1 (CZ5225 Methods in Computational Biology)
Project 1: Protein family classification by SVM
– Construct training and testing datasets
– Generate feature vectors
– Perform SVM classification and analysis
– Write a report and include a softcopy of your datasets
Project 2: Develop a program for pairwise sequence alignment using a simple scoring scheme
– Write the code in any programming language
– Test it on a few examples (such as estrogen receptor and progesterone receptor)
– Can you extend your program to multiple alignment?
– Write a report and include a softcopy of your program