Machine Learning Challenges in Location Proteomics Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated.

Presentation transcript:

Machine Learning Challenges in Location Proteomics Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated Learning and Discovery Carnegie Mellon University

Protein characteristics relevant to systems approach
- sequence
- structure
- expression level
- activity
- partners
- location

Subcellular locations from major protein databases
- Giantin
  - Entrez: /note="a new 376kD Golgi complex outher membrane protein"
  - SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE.
- GPP130
  - Entrez: /note="GPP130; type II Golgi membrane protein"
  - SwissProt: nothing

More questions than answers
- We learned that Giantin and GPP130 are both Golgi proteins, but do we know:
  - What part (i.e., cis, medial, trans) of the Golgi complex they each are found in?
  - If they have the same subcellular distribution?
  - If they also are found in other compartments?

Vocabulary is part of the problem
- Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns
- Efforts to create restricted vocabularies (e.g., Gene Ontology consortium) for location have been made

SWALL entries for giantin and gpp130

ID GIAN_HUMAN STANDARD; PRT; 3259 AA.
AC Q14789; Q14398;
GN GOLGB1.
DR GO; GO: ; C:Golgi membrane; TAS.
DR GO; GO: ; C:Golgi stack; TAS.
DR GO; GO: ; C:integral to membrane; TAS.
DR GO; GO: ; P:Golgi organization and biogenesis; TAS.

ID O00461 PRELIMINARY; PRT; 696 AA.
AC O00461;
GN GPP130.
DR GO; GO: ; C:endocytotic transport vesicle; TAS.
DR GO; GO: ; C:Golgi cis-face; TAS.
DR GO; GO: ; C:Golgi lumen; TAS.
DR GO; GO: ; C:integral to membrane; TAS.

Words are not enough
- Still don't know how similar the location patterns of these proteins are
- Restricted vocabularies do not provide the necessary complexity and specificity

Needed: Systematic Approach
- Need new methods for accurately and objectively determining the subcellular location pattern of all proteins
- Distinct from drug screening by low-resolution microscopy
- Need to advance past "cartoon" view of subcellular location
- Need systematic, quantitative approach to protein location

First Decision Point
- Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since
  - different cells have different shapes, sizes, orientations
  - organelles within cells are not found in fixed locations
- Therefore, use feature-based methods rather than (pixel) model-based methods

Input Images
- Created 2D image database for HeLa cells
- Ten classes covering all major subcellular structures: Golgi (two classes, giantin and gpp130), ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules
- Included classes that are similar to each other

Example 2D Images of HeLa

Features: SLF
- Developed sets of Subcellular Location Features (SLF) containing features of different types
- Motivated in part by descriptions used by biologists (e.g., punctate, perinuclear)
- First type of features derived from morphological image processing: finding objects by automated thresholding

Features: Morphological
- Number of fluorescent objects per cell
- Variance of the object sizes
- Ratio of the largest object to the smallest
- Average distance of objects to the "center of fluorescence"
- Average "roundness" of objects
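
As a rough illustration of how the first four morphological features could be computed, here is a minimal pure-Python sketch. It is not the SLF implementation: the fixed `threshold` argument stands in for the automated thresholding mentioned in the slides, 4-connectivity is an assumption, roundness is omitted, and the function and key names are hypothetical.

```python
from collections import deque
from statistics import mean, pvariance


def label_objects(mask):
    """Group True pixels into 4-connected objects; returns a list of pixel lists."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                objects.append(pixels)
    return objects


def morphological_features(image, threshold):
    """Compute object-based features from a 2D list of non-negative intensities."""
    mask = [[v > threshold for v in row] for row in image]
    objects = label_objects(mask)
    if not objects:
        raise ValueError("no pixels above threshold")
    sizes = [len(obj) for obj in objects]
    # Center of fluorescence: intensity-weighted centroid of the whole image.
    total = sum(v for row in image for v in row)
    cy = sum(r * v for r, row in enumerate(image) for v in row) / total
    cx = sum(c * v for row in image for c, v in enumerate(row)) / total
    distances = []
    for obj in objects:
        oy = mean(p[0] for p in obj)
        ox = mean(p[1] for p in obj)
        distances.append(((oy - cy) ** 2 + (ox - cx) ** 2) ** 0.5)
    return {
        "n_objects": len(objects),
        "size_variance": pvariance(sizes),
        "size_ratio": max(sizes) / min(sizes),
        "mean_distance_to_cof": mean(distances),
    }


image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 0, 0, 0, 8],
    [0, 0, 0, 0, 0],
]
feats = morphological_features(image, threshold=5)
```

On this toy image the sketch finds two objects (a 2x2 block and a single pixel).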

Features: Haralick texture
- Give information on correlations in intensity between adjacent pixels to answer questions like
  - is the pattern more like a checkerboard or alternating stripes?
  - is the pattern highly organized (ordered) or more scattered (disordered)?
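
To make the idea of "correlations in intensity between adjacent pixels" concrete, the sketch below builds a gray-level co-occurrence table for horizontally adjacent pixels and computes its entropy, one of the Haralick features. It is a simplified sketch under stated assumptions: only one direction is used, intensities are assumed to be already quantized to a few gray levels, and base-2 logarithms are an arbitrary choice. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def cooccurrence(image):
    """Symmetric, normalized co-occurrence table for horizontal neighbors."""
    counts = Counter()
    for row in image:
        for a, b in zip(row, row[1:]):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # count both orders so the table is symmetric
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def texture_entropy(image):
    """Entropy (in bits) of the co-occurrence distribution."""
    return sum(-p * log2(p) for p in cooccurrence(image).values())


flat = [[5, 5, 5, 5] for _ in range(4)]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
entropy_flat = texture_entropy(flat)
entropy_checker = texture_entropy(checkerboard)
```

A perfectly uniform image has a single co-occurrence pair and therefore zero entropy; patterns with more varied neighbor pairs score higher.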

Example: Difference detected by texture feature “entropy”

Features: Zernike moment
- Measure degree to which pattern matches a particular Zernike polynomial
- Give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern
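
The following sketch shows one way a Zernike moment magnitude could be computed for a small square image, using the standard radial polynomial definition. The mapping of the image onto the unit disk and the normalization by total intensity are assumptions made here for illustration, not details taken from the slides; `radial_poly` and `zernike_magnitude` are hypothetical names. It requires |m| <= n with n - |m| even.

```python
import cmath
from math import atan2, factorial, hypot, pi


def radial_poly(n, m, r):
    """Zernike radial polynomial R_n^|m|(r)."""
    m = abs(m)
    return sum(
        (-1) ** k * factorial(n - k)
        / (factorial(k) * factorial((n + m) // 2 - k) * factorial((n - m) // 2 - k))
        * r ** (n - 2 * k)
        for k in range((n - m) // 2 + 1)
    )


def zernike_magnitude(image, n, m):
    """|Z(n, m)| of a square image mapped onto the unit disk (assumed mapping)."""
    size = len(image)
    center = (size - 1) / 2
    radius = size / 2
    total = sum(v for row in image for v in row)
    acc = 0j
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            dx, dy = (x - center) / radius, (y - center) / radius
            r = hypot(dx, dy)
            if r <= 1 and v:  # pixels outside the unit disk are ignored
                acc += v * radial_poly(n, m, r) * cmath.exp(-1j * m * atan2(dy, dx))
    return abs(acc) * (n + 1) / pi / total


uniform = [[1] * 8 for _ in range(8)]
z20 = zernike_magnitude(uniform, 2, 0)
z22 = zernike_magnitude(uniform, 2, 2)
```

Using the magnitude rather than the complex value discards the phase, which is what carries the dependence on in-plane rotation of the pattern.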

Examples of Zernike Polynomials: Z(2,0), Z(4,4), Z(10,6)

Subcellular Location Features: 2D
- Morphological features
- Haralick texture features
- Zernike moment features
- Geometric features
- Edge features

2D Classification Results
Overall accuracy = 92% (95% for major patterns)
[Confusion matrix: true class (rows) vs. output of the classifier (columns); classes DNA, ER, Gia, Gpp, Lam, Mit, Nuc, Act, TfR, Tub]

Human Classification Results Overall accuracy = 83% (92% for major patterns)

Computer vs. Human

Extending to 3D: Labeling approach
- Total protein labeled with Cy5 reactive dye
- DNA labeled with PI
- Specific proteins labeled with primary Ab + Alexa488 conjugated secondary Ab

3D Image Set
[Image panels: Giantin, Nuclear, ER, Lysosomal, gpp130, Actin, Mitoch., Nucleolar, Tubulin, Endosomal]

New features to measure "z" asymmetry
- 2D features treated x and y equivalently
- For 3D images, while it makes sense to treat x and y equivalently (cells don't have a "left" and "right"), z should be treated differently ("top" and "bottom" are not the same)
- We designed features to separate distance measures into x-y component and z component
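
A minimal sketch of the separation described above: a 3D displacement is split into an unsigned in-plane (x-y) distance and a z component. Keeping the z component signed, so that "above" and "below" remain distinguishable, is an assumption made here; `split_distance` is a hypothetical helper name.

```python
from math import hypot


def split_distance(point, center):
    """Split the displacement from center to point into (x-y distance, signed z offset)."""
    dx, dy, dz = (p - c for p, c in zip(point, center))
    return hypot(dx, dy), dz


# Object at (3, 4, 7) relative to a reference point at (0, 0, 2).
xy_distance, z_offset = split_distance((3, 4, 7), (0, 0, 2))
```

Rotating the cell in the x-y plane leaves both values unchanged, while flipping it top to bottom changes the sign of the z offset.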

Classification Results for 3D images
Overall accuracy = 97%

How to do even better
- Biologists interpreting images of protein localization typically view many cells before reaching a conclusion
- Can simulate this by classifying sets of cells from the same microscope slide
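
One simple way to classify a set of cells is to classify each cell individually and then take the most frequent label. The slides do not state the combination rule, so the plurality vote below is an assumption, and `classify_set` is a hypothetical name.

```python
from collections import Counter


def classify_set(single_cell_labels):
    """Return the plurality label for a set of cells and its share of the votes."""
    counts = Counter(single_cell_labels)
    label, votes = counts.most_common(1)[0]  # ties go to the label seen first
    return label, votes / len(single_cell_labels)


# Nine cells from one slide; a few individual cells are misclassified.
labels = ["Gia"] * 6 + ["Gpp"] * 3
set_label, support = classify_set(labels)
```

Because individual errors have to outnumber the correct label to change the outcome, set-level accuracy can exceed single-cell accuracy.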

Classification of Sets of 3D Images
Set size 9, overall accuracy = 99.7%
[Confusion matrix: true class vs. predicted class; classes DNA, ER, Gia, Gpp, Lyso, Mito, Nucle, Actin, Endo, Tub]

First Conclusion
- Description of subcellular locations for systems biology should be implemented using a data-driven approach rather than a knowledge-capture approach, but…

Subcellular Location Image Finder
- Have automated system for finding images in on-line journal articles that match a particular pattern; enables connection between new images and previously published results
[Pipeline diagram. Data items: Figure, Caption, Panels, Panel labels, ImagePtr, Scope, Annotated Scopes, Annotated Panels. Processing steps: panel splitting; label finding; label matching; panel classification and micrograph analysis (image type, image scale, subcellular pattern analysis, …); caption understanding and entity extraction (proteins, cells, drugs, experimental conditions, …); alignment between caption entities and panels. Citations shown on the diagram: [Murphy et al, 2001], [Cohen et al, 2003]]

Image Similarity
- Classification power of features implies that they capture essential characteristics of protein patterns
- Can be used to measure similarity between patterns

Clustering by Image Similarity
- Ability to measure similarity of protein patterns allows us, for the first time, to create a systematic, objective framework for describing subcellular locations
- Ideal for database references
- One way is by creating a Subcellular Location Tree
- Illustration: Build hierarchical dendrogram

Subcellular Location Tree for 10 classes in HeLa cells

Do this for all proteins: Location Proteomics
- Can use CD-tagging (developed by Dr. Jonathan Jarvik) to randomly tag many proteins: infect population of cells with a retrovirus carrying a DNA sequence that will produce a "tag" in a random gene in each cell
- Isolate separate clones, each of which expresses one tagged protein
- Use RT-PCR to identify tagged gene in each clone
- Collect images of many cells for each clone using fluorescence microscopy

Example images of CD-tagged clones
(A) Glut1 gene (type 1 glucose transporter)
(B) Tmpo gene (thymopoietin)
(C) tuba1 gene (α-tubulin)
(D) Cald gene (caldesmon 1)
(E) Ncl gene (nucleolin)
(F) Rps11 gene (ribosomal protein S11)
(G) Hmga1 gene (high mobility group AT-hook 1)
(H) Col1a2 gene (procollagen type I α2)
(I) Atp5a1 gene (ATP synthase isoform 1)

Proof of principle
- Cluster 46 clones expressing different tagged proteins based on their subcellular location patterns

Feature selection
- Use Stepwise Discriminant Analysis to rank features based on their ability to distinguish proteins
- Use increasing numbers of features to train neural network classifiers and evaluate classification accuracy over all 46 clones
- Best performance obtained with 10 features
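
The "increasing numbers of features" step can be sketched as a search over prefixes of the ranked feature list. The sketch below does not implement Stepwise Discriminant Analysis or a neural network: the ranking is taken as given, and `evaluate` is a hypothetical callback standing in for training a classifier and returning its accuracy. The feature names and accuracy values in the usage example are made up for illustration.

```python
def best_prefix(ranked_features, evaluate):
    """Try the top-1, top-2, ... features and keep the best-scoring prefix."""
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        if score > best_score:  # strict '>' keeps the smallest prefix on ties
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score


# Toy stand-in for "train a classifier and report accuracy" (made-up values).
toy_accuracy = {1: 0.60, 2: 0.80, 3: 0.75}
selected, accuracy = best_prefix(
    ["feature_a", "feature_b", "feature_c"],
    lambda feats: toy_accuracy[len(feats)],
)
```

In this toy run the first two features are kept, since adding the third lowers the reported accuracy.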

Tree building
- Therefore use these 10 features with z-scored Euclidean distance function to build SLT
- Find optimal number of clusters using k-means clustering and AIC
- Find consensus hierarchical trees by randomly dividing the images for each protein in half and keeping branches conserved between both halves (repeat for 50 random divisions)
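
The z-scored Euclidean distance can be sketched as follows: each feature is standardized across all clones, then ordinary Euclidean distance is taken between the standardized vectors. Using the population standard deviation and mapping constant features to zero are assumptions made here; the function names are hypothetical. Requires Python 3.8+ for `math.dist`.

```python
from math import dist
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant column -> all zeros
    return [
        [(v - mu) / sd for v, mu, sd in zip(row, means, stds)]
        for row in rows
    ]


def distance_matrix(rows):
    """Pairwise Euclidean distances between z-scored feature vectors."""
    z = zscore_columns(rows)
    return [[dist(a, b) for b in z] for a in z]


# Two clones, two features on very different scales.
mean_features = [[0, 100], [2, 300]]
d = distance_matrix(mean_features)
```

Standardizing first keeps features with large numeric ranges from dominating the distance; a hierarchical clustering routine could then be run on the resulting matrix.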

Consensus Subcellular Location Tree

Examples from major clusters

Significance
- Proteins clustered by location analogous to proteins clustered by sequence (e.g., PFAM)
- Can subdivide clusters by observing response to drugs, oncogenes, etc.
- These represent protein location states
- Base knowledge required for modeling
- Can be used to filter protein interactions

From patterns to causes
- Machine learning approaches have been previously used to find localization motifs in protein sequences, but the set of locations used was limited to major organelles
- High-resolution subcellular location trees can be used to discover (recursively) new motifs that determine location of each group
- Can include post-translational modifications

More Conclusions
- Organized data collection approach is required to capture high-resolution information on the subcellular location of all proteins
- Prohibitive combinatorial complexity makes colocalization approach infeasible, so major effort should focus on one protein at a time

Center for Bioimage Informatics
- $2.75 M CMU funding from NSF ITR
- Joint with UCSB and collaborators at Berkeley and MIT
- R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
- Jelena Kovacevic (Biomedical Engineering)
- Tom Mitchell (CALD)
- Christos Faloutsos (CALD)

Acknowledgments
- Former students
  - Michael Boland, Mia Markey, William Dirks, Gregory Porreca, Edward Roques, Meel Velliste
- Current grad students
  - Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu, Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua
- Funding
  - NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco Settlement Fund
- Collaborators/Consultants
  - Simon Watkins, David Cassasent, Tom Mitchell, Christos Faloutsos, Jon Jarvik, Peter Berget