Machine Learning Challenges in Location Proteomics Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated.

Machine Learning Challenges in Location Proteomics Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated Learning and Discovery Carnegie Mellon University

Protein characteristics relevant to systems approach
- sequence
- structure
- expression level
- activity
- partners
- location

Subcellular locations from major protein databases
- Giantin
  - Entrez: /note="a new 376kD Golgi complex outher membrane protein"
  - SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE.
- GPP130
  - Entrez: /note="GPP130; type II Golgi membrane protein"
  - SwissProt: nothing

More questions than answers
- We learned that Giantin and GPP130 are both Golgi proteins, but do we know:
  - What part (i.e., cis, medial, trans) of the Golgi complex they each are found in?
  - If they have the same subcellular distribution?
  - If they also are found in other compartments?

Vocabulary is part of the problem
- Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns
- Efforts to create restricted vocabularies (e.g., Gene Ontology consortium) for location have been made

SWALL entries for giantin and gpp130

ID GIAN_HUMAN STANDARD; PRT; 3259 AA.
AC Q14789; Q14398;
GN GOLGB1.
DR GO; GO:0000139; C:Golgi membrane; TAS.
DR GO; GO:0005795; C:Golgi stack; TAS.
DR GO; GO:0016021; C:integral to membrane; TAS.
DR GO; GO:0007030; P:Golgi organization and biogenesis; TAS.

ID O00461 PRELIMINARY; PRT; 696 AA.
AC O00461;
GN GPP130.
DR GO; GO:0005810; C:endocytotic transport vesicle; TAS.
DR GO; GO:0005801; C:Golgi cis-face; TAS.
DR GO; GO:0005796; C:Golgi lumen; TAS.
DR GO; GO:0016021; C:integral to membrane; TAS.
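The limited overlap between the two annotation sets can be made concrete with a small sketch. The GO identifiers are transcribed from the entries above; the use of a Jaccard index as the overlap measure is an assumption made for illustration and is not something the talk proposes.

```python
# GO annotations transcribed from the SWALL entries for giantin and gpp130.
# Note that GO:0007030 is a process (P) term; the rest are component (C) terms.
GO_TERMS = {
    "giantin": {"GO:0000139", "GO:0005795", "GO:0016021", "GO:0007030"},
    "gpp130": {"GO:0005810", "GO:0005801", "GO:0005796", "GO:0016021"},
}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


# Terms the two proteins have in common, and a single overlap score.
shared = GO_TERMS["giantin"] & GO_TERMS["gpp130"]
similarity = jaccard(GO_TERMS["giantin"], GO_TERMS["gpp130"])
print(sorted(shared), round(similarity, 3))
```

Only "integral to membrane" is shared, so a term-overlap score says little about how similar the two Golgi patterns actually look.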

Words are not enough
- Still don’t know how similar the location patterns of these proteins are
- Restricted vocabularies do not provide the necessary complexity and specificity

Needed: Systematic Approach
- Need new methods for accurately and objectively determining the subcellular location pattern of all proteins
- Distinct from drug screening by low-resolution microscopy
- Need to advance past “cartoon” view of subcellular location
- Need systematic, quantitative approach to protein location

First Decision Point
- Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since
  - different cells have different shapes, sizes, orientations
  - organelles within cells are not found in fixed locations
- Therefore, use feature-based methods rather than (pixel) model-based methods

Input Images
- Created 2D image database for HeLa cells
- Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules
- Included classes that are similar to each other

Example 2D Images of HeLa

Features: SLF
- Developed sets of Subcellular Location Features (SLF) containing features of different types
- Motivated in part by descriptions used by biologists (e.g., punctate, perinuclear)
- First type of features derived from morphological image processing: finding objects by automated thresholding

Features: Morphological
- Number of fluorescent objects per cell
- Variance of the object sizes
- Ratio of the largest object to the smallest
- Average distance of objects to the ‘center of fluorescence’
- Average “roundness” of objects
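Four of the morphological features listed here can be sketched in plain Python. This is an illustration under stated assumptions, not the SLF implementation: the threshold is passed in by hand rather than chosen automatically, objects are 4-connected pixel groups, the variance is the population variance, and the center of fluorescence is taken as the intensity-weighted centroid of the whole image. Roundness is omitted, and the function names are hypothetical.

```python
from math import hypot
from statistics import mean, pvariance


def label_objects(mask):
    """Return one list of (row, col) pixels per 4-connected object in a boolean mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                stack, pixels = [(r, c)], []
                seen[r][c] = True
                while stack:  # flood fill from the seed pixel
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                objects.append(pixels)
    return objects


def morphological_features(image, threshold):
    """Compute object-based features; assumes at least one pixel exceeds the threshold."""
    mask = [[value > threshold for value in row] for row in image]
    objects = label_objects(mask)
    sizes = [len(pixels) for pixels in objects]
    # Center of fluorescence: intensity-weighted centroid over all pixels.
    total = sum(sum(row) for row in image)
    cy = sum(r * v for r, row in enumerate(image) for v in row) / total
    cx = sum(c * v for row in image for c, v in enumerate(row)) / total
    distances = [
        hypot(mean([y for y, _ in pixels]) - cy, mean([x for _, x in pixels]) - cx)
        for pixels in objects
    ]
    return {
        "object_count": len(objects),
        "size_variance": pvariance(sizes),
        "size_ratio": max(sizes) / min(sizes),
        "mean_distance_to_cof": mean(distances),
    }


# Toy image: one bright 2x2 object and one dim single-pixel object.
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 0],
]
features = morphological_features(IMAGE, threshold=1)
print(features)
```

Because each value summarizes objects rather than pixel positions, the features do not depend on where in the frame the cell happens to sit.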

Features: Haralick texture
- Give information on correlations in intensity between adjacent pixels to answer questions like
  - is the pattern more like a checkerboard or alternating stripes?
  - is the pattern highly organized (ordered) or more scattered (disordered)?
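As a rough illustration of how texture features of this kind are derived, the sketch below builds a gray-level co-occurrence table for one pixel offset and computes two Haralick-style statistics from it, entropy and contrast. It is a simplification: only a single horizontal offset is used, whereas texture features are commonly averaged over several directions, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def cooccurrence(image, dy=0, dx=1):
    """Symmetric, normalized co-occurrence probabilities of gray levels at one offset."""
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count both orders so the table is symmetric
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def entropy(p):
    """Entropy in bits: low for ordered textures, high for disordered ones."""
    return -sum(v * log2(v) for v in p.values())


def contrast(p):
    """Mean squared gray-level difference between neighboring pixels."""
    return sum((a - b) ** 2 * v for (a, b), v in p.items())


# A 4x4 checkerboard and 4x4 horizontal stripes, both with gray levels 0 and 1.
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
stripes = [[r % 2] * 4 for r in range(4)]
print(contrast(cooccurrence(checker)), contrast(cooccurrence(stripes)))
```

With a horizontal offset, every neighboring pair in the checkerboard differs while every pair in the horizontal stripes matches, so contrast separates the two patterns even though both contain the same number of bright pixels.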

Example: Difference detected by texture feature “entropy”

Features: Zernike moment
- Measure degree to which pattern matches a particular Zernike polynomial
- Give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern
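A minimal sketch of a Zernike moment magnitude follows, using the standard definition V(n,l) = R(n,l)(rho) * exp(i*l*theta) on the unit disk. The mapping of the image onto the disk (centered on the image, radius equal to half its width) and the normalization of intensities to sum to 1 are assumptions made for this illustration; the talk does not say how its features handle these details. The function requires |l| <= n with n - |l| even.

```python
from cmath import exp
from math import atan2, factorial, hypot, pi


def radial_poly(n, l, rho):
    """Zernike radial polynomial R_nl(rho); needs |l| <= n and n - |l| even."""
    l = abs(l)
    return sum(
        (-1) ** s * factorial(n - s)
        / (factorial(s) * factorial((n + l) // 2 - s) * factorial((n - l) // 2 - s))
        * rho ** (n - 2 * s)
        for s in range((n - l) // 2 + 1)
    )


def zernike_magnitude(image, n, l):
    """|Z_nl| of a square image mapped onto the unit disk; pixels outside are ignored."""
    size = len(image)
    center = (size - 1) / 2
    radius = size / 2
    total = sum(sum(row) for row in image)
    moment = 0j
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            y, x = (r - center) / radius, (c - center) / radius
            rho = hypot(x, y)
            if value and rho <= 1:
                # Project onto the complex conjugate of the Zernike polynomial.
                moment += value / total * radial_poly(n, l, rho) * exp(-1j * l * atan2(y, x))
    return abs((n + 1) / pi * moment)
```

Taking the magnitude discards the phase term, so rotating the pattern about the center leaves the value unchanged, which matters because cells have no fixed orientation.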

Examples of Zernike Polynomials: Z(2,0), Z(4,4), Z(10,6)

Subcellular Location Features: 2D
- Morphological features
- Haralick texture features
- Zernike moment features
- Geometric features
- Edge features

2D Classification Results
Overall accuracy = 92% (95% for major patterns)
Rows: true class; columns: output of the classifier (%)

      DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA    99    1    0    0    0    0    0    0    0    0
ER      0   97    0    0    0    2    0    0    0    1
Gia     0    0   91    7    0    0    0    0    2    0
Gpp     0    0   14   82    0    0    2    0    1    0
Lam     0    0    1    0   88    1    0    0   10    0
Mit     0    3    0    0    0   92    0    0    3    3
Nuc     0    0    0    0    0    0   99    0    1    0
Act     0    0    0    0    0    0    0  100    0    0
TfR     0    1    0    0   12    2    0    1   81    2
Tub     1    2    0    0    0    1    0    0    1   95
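The overall accuracy on this slide can be checked from the diagonal of the confusion matrix. The sketch below assumes the overall figure is the unweighted mean of the per-class percentages, which is consistent with the reported 92% but is not stated in the talk.

```python
# Per-class percent correct, read from the diagonal of the 2D confusion matrix.
CLASSES = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]
PERCENT_CORRECT = [99, 97, 91, 82, 88, 92, 99, 100, 81, 95]

# Unweighted mean over classes (assumes each class counts equally).
overall = sum(PERCENT_CORRECT) / len(PERCENT_CORRECT)

# The class the classifier finds hardest.
hardest = min(zip(PERCENT_CORRECT, CLASSES))[1]
print(overall, hardest)
```

The lowest per-class values belong to Gpp and TfR, the classes the slide deck describes as similar to other classes in the set.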

Human Classification Results Overall accuracy = 83% (92% for major patterns)

Computer vs. Human

Extending to 3D: Labeling approach
- Total protein labeled with Cy5 reactive dye
- DNA labeled with PI
- Specific proteins labeled with primary Ab + Alexa488 conjugated secondary Ab

3D Image Set: Giantin, Nuclear, ER, Lysosomal, gpp130, Actin, Mitoch., Nucleolar, Tubulin, Endosomal

New features to measure “z” asymmetry
- 2D features treated x and y equivalently
- For 3D images, while it makes sense to treat x and y equivalently (cells don’t have a “left” and “right”), z should be treated differently (“top” and “bottom” are not the same)
- We designed features to separate distance measures into x-y component and z component
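The separation of a distance into an x-y component and a z component can be sketched as below. The function name is hypothetical, and keeping the sign of the z component (so that "above" and "below" the reference point differ) is an assumption about how such features might preserve the top/bottom distinction.

```python
from math import hypot


def split_distance(point, reference):
    """Split a 3D displacement into an unsigned x-y distance and a signed z distance."""
    dx, dy, dz = (p - q for p, q in zip(point, reference))
    return hypot(dx, dy), dz


# An object 5 units away horizontally and 3 units below the reference point.
horizontal, vertical = split_distance((3, 4, 2), (0, 0, 5))
print(horizontal, vertical)
```

Rotating the cell about the z axis changes dx and dy but not the horizontal distance, while flipping the cell upside down changes the sign of the vertical component.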

Classification Results for 3D images
Overall accuracy = 97%

How to do even better
- Biologists interpreting images of protein localization typically view many cells before reaching a conclusion
- Can simulate this by classifying sets of cells from the same microscope slide
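One simple way to classify a set of cells is to classify each cell individually and report the most frequent label. The talk does not say which combination rule was used, so the plurality rule in this sketch is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def classify_set(cell_labels):
    """Return the most common single-cell prediction in the set (plurality rule)."""
    return Counter(cell_labels).most_common(1)[0][0]


# Nine cells from one slide: a few are misclassified individually,
# but the set as a whole is still assigned to the majority pattern.
predictions = ["Gia"] * 5 + ["Gpp"] * 3 + ["TfR"]
print(classify_set(predictions))
```

Ties are broken in favor of the label that appears first in the list, which is a property of `Counter.most_common` rather than a considered design choice.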

Classification of Sets of 3D Images
Set size 9, overall accuracy = 99.7%
True class (rows) vs. predicted class (columns), %

        Tub  Endo Actin  Nucl  Mito  Lyso   Gpp   Gia    ER   DNA
Tub      99     0     0     0     0     0     0     0     0     0
Endo      0   100     0     0     0     0     0     0     0     0
Actin     0     0   100     0     0     0     0     0     0     0
Nucl      0     0     0   100     0     0     0     0     0     0
Mito      0     0     0     0   100     0     0     0     0     0
Lyso      0     0     0     0     0   100     0     0     0     0
Gpp       0     0     0     0     0     0    99     0     0     0
Gia       0     0     0     0     0     0     0   100     0     0
ER        0     0     0     0     0     0     0     0    99     0
DNA       0     0     0     0     0     0     0     0     0   100

First Conclusion
- Description of subcellular locations for systems biology should be implemented using a data-driven approach rather than a knowledge-capture approach, but…

Subcellular Location Image Finder
- Have automated system for finding images in on-line journal articles that match a particular pattern; this enables connection between new images and previously published results
[Diagram of the processing pipeline. Elements: Figure, Caption, Panels, Scope, Annotated Scopes, Annotated Panels, ImagePtr, Panel labels. Steps: panel splitting; label finding; label matching; caption understanding; entity extraction (proteins, cells, drugs, experimental conditions, …); panel classification and micrograph analysis (image type, image scale, subcellular pattern analysis…); alignment between caption entities and panels. Citations shown: [Murphy et al, 2001], [Cohen et al, 2003].]

Image Similarity
- Classification power of features implies that they capture essential characteristics of protein patterns
- Can be used to measure similarity between patterns

Clustering by Image Similarity
- Ability to measure similarity of protein patterns allows us for the first time to create a systematic, objective framework for describing subcellular locations
- Ideal for database references
- One way is by creating a Subcellular Location Tree
- Illustration: Build hierarchical dendrogram
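A hierarchical tree of this kind can be sketched with a small agglomerative clustering routine. The choice of average linkage and plain Euclidean distance is an assumption made for the illustration, and the two-dimensional feature vectors are invented toy values, not measurements from the talk.

```python
from itertools import combinations
from math import dist


def build_tree(features):
    """Average-linkage agglomerative clustering.

    features maps a class name to its feature vector.
    Returns the tree as nested tuples of names.
    """
    # Each cluster is a pair: (subtree, list of member names).
    clusters = [(name, [name]) for name in features]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(dist(features[x], features[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the closest pair of clusters into a new subtree.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Hypothetical feature vectors: two similar Golgi patterns and a distinct nuclear one.
TOY_FEATURES = {"Gia": (1.0, 1.0), "Gpp": (1.1, 1.0), "DNA": (5.0, 5.0)}
tree = build_tree(TOY_FEATURES)
print(tree)
```

The two nearby patterns are joined first and the distant one attaches at the root, which is the behavior a dendrogram of location patterns relies on.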

Subcellular Location Tree for 10 classes in HeLa cells

Do this for all proteins: Location Proteomics
- Can use CD-tagging (developed by Dr. Jonathan Jarvik) to randomly tag many proteins: infect population of cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random gene in each cell
- Isolate separate clones, each of which expresses one tagged protein
- Use RT-PCR to identify tagged gene in each clone
- Collect images of many cells for each clone using fluorescence microscopy

Example images of CD-tagged clones
(A) Glut1 gene (type 1 glucose transporter)
(B) Tmpo gene (thymopoietin)
(C) tuba1 gene (α-tubulin)
(D) Cald gene (caldesmon 1)
(E) Ncl gene (nucleolin)
(F) Rps11 gene (ribosomal protein S11)
(G) Hmga1 gene (high mobility group AT-hook 1)
(H) Col1a2 gene (procollagen type I α2)
(I) Atp5a1 gene (ATP synthase isoform 1)

Proof of principle
- Cluster 46 clones expressing different tagged proteins based on their subcellular location patterns

Feature selection
- Use Stepwise Discriminant Analysis to rank features based on their ability to distinguish proteins
- Use increasing numbers of features to train neural network classifiers and evaluate classification accuracy over all 46 clones
- Best performance obtained with 10 features

Tree building
- Therefore use these 10 features with z-scored Euclidean distance function to build SLT
- Find optimal number of clusters using k-means clustering and AIC
- Find consensus hierarchical trees by randomly dividing the images for each protein in half and keeping branches conserved between both halves (repeat for 50 random divisions)
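A z-scored Euclidean distance standardizes each feature before measuring distance, so that a feature with a large numeric range does not dominate. The sketch below assumes the population standard deviation is used and that every feature varies across the data set; the function names and the toy numbers are illustrative only.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # must be nonzero for every feature
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


def euclidean(a, b):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two features on very different scales (toy values).
ROWS = [[1, 100], [2, 200], [3, 300]]
scaled = zscore_columns(ROWS)
raw_distance = euclidean(ROWS[0], ROWS[2])
z_distance = euclidean(scaled[0], scaled[2])
print(raw_distance, z_distance)
```

In the raw data the second feature accounts for nearly all of the distance; after z-scoring both features contribute equally.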

Consensus Subcellular Location Tree

Examples from major clusters

Significance
- Proteins clustered by location analogous to proteins clustered by sequence (e.g., PFAM)
- Can subdivide clusters by observing response to drugs, oncogenes, etc.
- These represent protein location states
- Base knowledge required for modeling
- Can be used to filter protein interactions

From patterns to causes
- Machine learning approaches have been previously used to find localization motifs in protein sequences, but the set of locations used was limited to major organelles
- High-resolution subcellular location trees can be used to discover (recursively) new motifs that determine location of each group
- Can include post-translational modifications

More Conclusions
- Organized data collection approach is required to capture high-resolution information on the subcellular location of all proteins
- Prohibitive combinatorial complexity makes colocalization approach infeasible, so major effort should focus on one protein at a time

Center for Bioimage Informatics
- $2.75 M CMU funding from NSF ITR
- Joint with UCSB and collaborators at Berkeley and MIT
- R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
- Jelena Kovacevic (Biomedical Engineering)
- Tom Mitchell (CALD)
- Christos Faloutsos (CALD)

Acknowledgments
- Former students
  - Michael Boland, Mia Markey, William Dirks, Gregory Porreca, Edward Roques, Meel Velliste
- Current grad students
  - Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu, Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua
- Funding
  - NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco Settlement Fund
- Collaborators/Consultants
  - Simon Watkins, David Cassasent, Tom Mitchell, Christos Faloutsos, Jon Jarvik, Peter Berget
