Prediction of Subcellular Localization of Proteins ~ Past, Present, and Future ~ Human Genome Center, Inst. Med. Sci., University of Tokyo Kenta Nakai.


Prediction of Subcellular Localization of Proteins ~ Past, Present, and Future ~ Human Genome Center, Inst. Med. Sci., University of Tokyo Kenta Nakai Swiss-Prot 20 Years

20 Years Ago
I became a graduate student in Prof. Minoru Kanehisa’s lab.
I wanted to write a program that interprets the information encoded in DNA sequences.
But biology is full of exceptions.

Diagnosis System of Bacterial Infections (MYCIN 1974)
Enter Information about the patient. (Name, Age, Sex, and Race)
Are there any positive cultures obtained from SALLY? …
Has SALLY recently had symptoms of persistent headache or other abnormal neurologic symptoms (dizziness, lethargy, etc.)? …
INFECTION-1 is MENINGITIS + MYCOBACTERIUM-TB [from clinical evidence only] + …
[REC-1] My preferred therapy recommendation is as follows: 1) ETHAMBUTAL Dose: ( mg-tablets) q24h PO for 60 days [calculated on basis of 25 mg/kg] then 770 mg ( mg-tablets) q24h PO..

Knowledge Base for Automatic Reasoning
Knowledge is represented as a collection of “if-then” rules, which are chained so that the system can solve a realistic problem.
Rule 123
If: the gram stain of the organism is negative
and: the aerobicity of the organism is anaerobic
and: the morphology of the organism is rod
then: the genus of the organism is bacteroides with a certainty factor of 0.6
Working Memory
Name: Sally
Age: 42 years
Sex: Female
Race: …
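The rule-plus-working-memory idea on this slide can be sketched in a few lines. This is only an illustration, not MYCIN's actual implementation: the `Rule` class, the `forward_chain` function, and the second rule (`RULE_DEMO`, with its `needs_review` conclusion) are hypothetical names introduced here to show chaining. The certainty of a conclusion is taken as the rule's certainty factor times the weakest premise, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: tuple   # ((attribute, required value), ...)
    conclusion: tuple   # (attribute, value)
    certainty: float    # certainty factor attached to the rule


def forward_chain(rules, memory):
    """Fire rules until nothing new is derived.

    Returns a dict mapping attribute -> (value, certainty).
    Facts entered by the user start with certainty 1.0.
    """
    facts = {attr: (value, 1.0) for attr, value in memory.items()}
    fired = set()
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.name in fired:
                continue
            matched = all(
                attr in facts and facts[attr][0] == value
                for attr, value in rule.conditions
            )
            if matched:
                # Simplification: conclusion certainty is the rule's
                # factor scaled by the least certain premise.
                weakest = min(facts[attr][1] for attr, _ in rule.conditions)
                attr, value = rule.conclusion
                facts[attr] = (value, rule.certainty * weakest)
                fired.add(rule.name)
                changed = True
    return facts


RULES = [
    # Rule 123 as stated on the slide.
    Rule(
        "RULE_123",
        (("gram_stain", "negative"),
         ("aerobicity", "anaerobic"),
         ("morphology", "rod")),
        ("genus", "bacteroides"),
        0.6,
    ),
    # Hypothetical second rule, present only to show one rule
    # consuming the conclusion of another.
    Rule("RULE_DEMO", (("genus", "bacteroides"),), ("needs_review", True), 0.5),
]

WORKING_MEMORY = {
    "gram_stain": "negative",
    "aerobicity": "anaerobic",
    "morphology": "rod",
}

facts = forward_chain(RULES, WORKING_MEMORY)
```

Here the second rule fires only because Rule 123 first wrote `genus` into the working memory, which is what "chaining" refers to.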

Expert Systems Knowledge Base Inference Engine

Sample Problem

Prediction of Subcellular Localization

Typical Sorting Signals
Signal Function | Example
Import into nucleus | -P-P-K-K-K-R-K-V-
Export from nucleus | -L-A-L-K-L-A-G-L-D-I-
Import into mitochondria | <-MLSLRQSIRFFKPATRTLCSSRYLL-
Import into plastid | <-MVAMAMASLQSSMSSLSLSSNSFLGQPLSPITLSPFLQG-
Import into peroxisomes | -S-K-L->
Import into ER | <-MMSFVSLLLVGILFWATEAEQLTKCEVFN-
Return to ER | -K-D-E-L->
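As a rough illustration of how the simplest of these signals could be checked in a sequence, the sketch below looks for the two C-terminal examples (KDEL, SKL) and the nuclear import example by exact string matching. The function name `find_simple_signals` is hypothetical, and exact matching is an assumption made for brevity: real sorting signals are degenerate, and the N-terminal signals in the table cannot be found this way.

```python
# Example signals taken from the table above.
C_TERMINAL_SIGNALS = {
    "KDEL": "return to ER",
    "SKL": "import into peroxisomes",
}
NLS_EXAMPLE = "PPKKKRKV"  # the nuclear import example, matched anywhere


def find_simple_signals(sequence):
    """Return (motif, function) pairs found by exact matching."""
    seq = sequence.upper()
    hits = []
    for motif, function in C_TERMINAL_SIGNALS.items():
        # These two signals act at the very C-terminus.
        if seq.endswith(motif):
            hits.append((motif, function))
    if NLS_EXAMPLE in seq:
        hits.append((NLS_EXAMPLE, "import into nucleus"))
    return hits
```

A call such as `find_simple_signals("MAAAKDEL")` reports the ER retrieval signal; a sequence with none of the motifs returns an empty list.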

Amino Acid Composition
Another good clue for prediction
Suited for machine learning
Example: outer membrane proteins and periplasmic proteins of Gram-negative bacteria
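Amino acid composition is easy to turn into a fixed-length feature vector, which is why it suits machine learning. A minimal sketch, assuming the 20 standard residues in alphabetical one-letter order (the name `composition` is introduced here for illustration):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def composition(sequence):
    """Return the fraction of each standard residue, in AMINO_ACIDS order."""
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

The resulting 20 numbers sum to 1 for any sequence containing at least one standard residue, so proteins of different lengths become directly comparable.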

PSORT (I)
Nakai & Kanehisa, 1991, 1992
Expert system using about 100 “if-then” rules
[Figure: decision tree. Tests include signal peptide, signal cleavage site, TMS in mature part, topology, apolar, and specific signals (KDEL, GPI, MTS, NLS, SKL, GY motif, KK); the leaves are localization sites.]

Papers and the web server Nakai & Kanehisa, Proteins 1991 –cited 295 times Nakai & Kanehisa, Genomics 1992 –cited 961 times –34 in 2006 Web server since 1993

Limitations of PSORT
Relatively low accuracy, possibly because of the complexity of the sorting mechanisms
It is difficult to optimize the certainty parameters assigned to each rule
It is tedious to update the knowledge base as the training data grow

PSORT II
Nakai & Horton, 1997, 1999 (cited 638 times)
Machine learning with the kNN (k-nearest neighbor) method
[Figure: a query point Q classified by its nearest neighbors, k = 3]
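The kNN idea with k = 3 can be sketched as follows. This is not the PSORT II code: the function `knn_predict`, the two-dimensional feature vectors, and the labels in `TRAINING` are toy values invented for illustration, and plain Euclidean distance with majority voting is an assumption.

```python
import math
from collections import Counter


def knn_predict(query, training, k=3):
    """Label a query by majority vote among its k nearest training points.

    `training` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: hypothetical two-feature proteins with known localization.
TRAINING = [
    ((0.9, 0.1), "nuclear"),
    ((0.8, 0.2), "nuclear"),
    ((0.7, 0.3), "nuclear"),
    ((0.1, 0.9), "mitochondrial"),
    ((0.2, 0.8), "mitochondrial"),
]

prediction = knn_predict((0.15, 0.85), TRAINING, k=3)
```

For the query above, two of the three nearest neighbors are mitochondrial, so the vote goes that way even though the third neighbor is nuclear. Adding newly characterized proteins only means appending to the training list, which is the practical advantage over hand-tuned rules.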

iPSORT: Bannai et al
Rule 1: A protein has an SP (signal peptide) if the sum of hydropathy index values within [6,25] exceeds 18.3.
Rule 2: A protein has either an mTP (mitochondrial targeting peptide) or a cTP (chloroplast transit peptide) if it contains fewer than 3 D/Es within [1,30] and if it contains a motif similar to , where 2=(I,R), 3=(D,E,H,K,N), 1=otherwise.
Rule 3: A protein has an mTP if it satisfies Rule 2, if the sum of isoelectric point values within [1,15] exceeds 93, and if it contains a motif similar to , where 2=(K,R), 3=(I,P), 1=otherwise.
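Rule 1 and the D/E count in Rule 2 are simple enough to sketch directly. The slide does not say which hydropathy scale is used, so the Kyte-Doolittle scale below is an assumption, and the 18.3 threshold may not correspond to it exactly. The function names are hypothetical, positions are read as 1-based and inclusive, and the motif-matching parts of Rules 2 and 3 are omitted.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not stated on the slide).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_sum(sequence, start=6, end=25):
    """Sum hydropathy over positions start..end (1-based, inclusive)."""
    window = sequence.upper()[start - 1:end]
    return sum(KYTE_DOOLITTLE.get(residue, 0.0) for residue in window)


def has_signal_peptide(sequence, threshold=18.3):
    """Rule 1: predicted SP if the hydropathy sum exceeds the threshold."""
    return hydropathy_sum(sequence) > threshold


def acidic_count(sequence, start=1, end=30):
    """Count D and E residues within start..end (used by Rule 2)."""
    window = sequence.upper()[start - 1:end]
    return sum(1 for residue in window if residue in "DE")


# Toy sequence: a leucine-rich stretch at positions 6-25.
example = "MKKKK" + "L" * 20 + "DDDDD"
example_has_sp = has_signal_peptide(example)
```

Each rule reduces to a window, a per-residue score, and a threshold, so the rules stay readable by a biologist while still being learned from data.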

PSORTb and PSORT.ORG Gardy et al. 2003, 2004 –Contribution from a Canadian group (Brinkman lab) Update for bacterial proteins

WoLF-PSORT Horton et al Latest PSORT update for eukaryotic proteins WoLF: Women only Love Fools!?

Current Dilemma
More data are necessary to improve the training process
But the practical value of prediction methods diminishes as experimental data grow
Moreover, the more we investigate, the more exceptions we find

It’s a General Problem
Gene Finding
Prediction of Protein Structure
…
We come to know the answer to a problem before we learn how to solve it
Similarity search against the data of typical model organisms will suffice in many cases

New Generation Predictors
Should help engineer proteins for desired targeting sites
Should compensate for errors in proteome analyses (e.g., isoforms with differential localization)
Should be comprehensively example-based rather than based on statistical features (such as amino acid composition)

Biology is like Linguistics
Both arose naturally and are full of exceptions
There may not exist “general principles”

Future of Sequence Analysis
It will become “DNA linguistics”
Large dictionaries (databases) will contain both general cases and exceptions
Such databases may be a sort of knowledge base that can be used to simulate subcellular processes

Past, Present, and Future
Past –Expert system-based predictions
Present –Machine learning-based predictions
Future –Combination of both? –Revival of knowledge bases to simulate cellular processes?

Acknowledgments Minoru Kanehisa Paul Horton Hideo Bannai, Satoru Miyano Jennifer Gardy, Fiona Brinkman And all the other people who contributed to the PSORT project!