Computational Biology, Part 24: Biological Imaging IV. Robert F. Murphy. Copyright © 2001. All rights reserved.

Presentation transcript:

Proteomics
- The set of proteins expressed in a given cell type or tissue is called its proteome.
- Not all transcripts are actually made into protein, and the steady-state level of protein expression is controlled by many factors other than transcript amount.
- Protein differences between cell types are responsible for the different roles of those cells.

Proteomics
- Things to learn about proteins:
  - sequence
  - location (gives insight into function)
  - structure
  - activity
  - partners
- Almost nothing is known about most proteins!

One Approach to Proteomics - CD-tagging
- Infect cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random protein.
- Examine many cells, each of which is expected to express one tagged protein, to determine the subcellular location of that protein.
- Use fluorescence microscopy.

Principles of CD-Tagging (CD = Central Dogma)
[Diagram: genomic DNA (Exon 1, Intron 1, Exon 2) plus a CD cassette gives tagged DNA (Exon 1, Tag, Exon 2), which gives tagged mRNA and then a tagged protein carrying the tag (epitope).]

CD-Tagging: Proof of concept
- Use a CD-cassette containing the hemagglutinin (HA) epitope.
- Insert the cassette into introns of the nucleolin gene.
- Obtain clonal lines expressing the tagged protein.
- Image the distribution of nucleolin using immunofluorescence microscopy.

Results: CD-Tagging
[Image: tagged nucleolin in HeLa cells]

CD-Tagging: Extensions
- Improved epitope tag
  - The HA epitope works only in one reading frame.
  - Designed an epitope that is the same in all three reading frames: the universal epitope.
- Endogenously fluorescent tags
  - Can use fluorescent proteins (e.g., GFP, YFP) as the inserted tag!
  - Don’t need fixation and antibodies.

CD-tagging project
- Large project funded by the National Cancer Institute to identify locations for all expressed genes
  - Jonathan Jarvik
  - Peter Berget
  - Robert Murphy
- My group is responsible for automated analysis of subcellular location patterns.

The Problem
- Different investigators may use different terms to refer to the same pattern, or the same term to refer to different patterns.
- Current determinations do not lend themselves to incorporation into databases (at best, databases may describe a protein in comment fields as being a “cytoskeletal protein” or an “endosomal protein”, even though these terms are known to be imprecise).

Cartoonist’s view of Subcellular Locations
- Cells Alive! “rollover” cell with information on each organelle

The Starting Point
- A systematic, quantitative approach to protein localization (whether from a pattern analysis or a bioinformatics perspective) has not been presented previously.

The Goal
[Figure caption: “This is a Golgi protein!”]

More problems
- Direct (point-by-point) comparison of individual images is not possible, since
  - different cells have different shapes, sizes, orientations
  - organelles within cells are not found in fixed locations

The Approach
1. Create sets of images showing the localization of many different proteins (each set defines one class of pattern).
2. Reduce each image to a set of numerical values (“features”) that are insensitive to position and rotation of the cell.
3. Use statistical classification methods to “learn” how to distinguish each class using the features.

Input
- Created image database for HeLa cells.
- Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules (nine structures; the Golgi is represented by two markers, giantin and gpp130).
- Includes classes that are similar to each other.

Example Images
- Patterns that might be easily confused
[Images: Endoplasmic Reticulum (ER) and Mitochondria]

Example Images
- Patterns that might be easily confused
[Images: Lysosomes (LAMP2) and Endosomes (TfR)]

Example Images
- Patterns that might be easily confused
[Images: F-actin and Tubulin]

Example Images
- Classes expected to be indistinguishable
[Images: Golgi (Giantin) and Golgi (gpp130)]

Features
- Zernike moment features (based on the Zernike polynomials): give information on the basic nature of the pattern (e.g., circle, donut) and the sizes (frequencies) present in the pattern.
- Haralick texture features: give information on correlations in intensity between adjacent pixels.
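To make the idea behind texture features concrete, here is a minimal sketch of a gray-level co-occurrence matrix and two statistics commonly derived from it (contrast and energy). It is an illustration only: the function names, the choice of horizontal neighbors, and the tiny test images are assumptions, not the feature code used for the slides.

```python
from collections import Counter


def cooccurrence(image, levels):
    """Normalized co-occurrence matrix of gray levels for horizontally adjacent pixels."""
    counts = Counter()
    for row in image:
        for a, b in zip(row, row[1:]):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # count both directions so the matrix is symmetric
    total = sum(counts.values())
    return [[counts[(i, j)] / total for j in range(levels)] for i in range(levels)]


def contrast(p):
    """Large when neighboring pixels differ strongly in intensity."""
    n = len(p)
    return sum((i - j) ** 2 * p[i][j] for i in range(n) for j in range(n))


def energy(p):
    """Angular second moment: large when few gray-level pairs dominate."""
    return sum(v * v for row in p for v in row)


# Hypothetical 4-level images: smooth blocks versus a checkerboard.
smooth = [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
checker = [[0, 3, 0, 3], [3, 0, 3, 0], [0, 3, 0, 3], [3, 0, 3, 0]]

print(contrast(cooccurrence(smooth, 4)), contrast(cooccurrence(checker, 4)))
```

The checkerboard has a much higher contrast than the blocky image, which is the kind of difference a texture feature is meant to capture. A fuller implementation would average over several directions and distances.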

Examples of Zernike Polynomials

Zernike Moments Reconstruction
[Images: original, and reconstructions at order 12, order 20 and order 45]

Features
- Developed additional features (SLF, for Subcellular Location Features).
- Motivated by descriptions of patterns used by biologists (e.g., punctate, perinuclear).
- Combined with Zernike and Haralick features to give 84 features used to describe each image.

Example Features from SLF1
- Number of fluorescent objects per cell
- Variance of the object sizes
- Ratio of the largest object to the smallest
- Average distance of objects to the ‘center of fluorescence’
- Fraction of convex hull occupied by fluorescence
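The first four features listed above can be sketched in a few lines. This is a hedged illustration, not the SLF1 implementation: the threshold, the use of 4-connectivity, the population variance, and the names `find_objects` and `object_features` are all assumptions made for the example. The convex hull feature is omitted.

```python
from collections import deque
from math import hypot
from statistics import mean, pvariance


def find_objects(image, threshold=0):
    """Group above-threshold pixels into 4-connected objects (lists of (row, col))."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill of one object
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects


def object_features(image, threshold=0):
    """Object count, size variance, size ratio, and mean distance to the COF."""
    objects = find_objects(image, threshold)
    if not objects:
        raise ValueError("no fluorescent objects above threshold")
    sizes = [len(pixels) for pixels in objects]
    # Center of fluorescence (COF): intensity-weighted centroid of all object pixels.
    weighted = [(y, x, image[y][x]) for pixels in objects for y, x in pixels]
    total = sum(w for _, _, w in weighted)
    cof_y = sum(y * w for y, _, w in weighted) / total
    cof_x = sum(x * w for _, x, w in weighted) / total
    # Distance from each object's unweighted centroid to the COF.
    distances = [
        hypot(mean(y for y, _ in pixels) - cof_y, mean(x for _, x in pixels) - cof_x)
        for pixels in objects
    ]
    return {
        "num_objects": len(objects),
        "size_variance": pvariance(sizes),
        "size_ratio": max(sizes) / min(sizes),
        "mean_distance_to_cof": mean(distances),
    }


# Hypothetical image: one 2x2 object and one single-pixel object.
image = [
    [5, 5, 0, 0, 0],
    [5, 5, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 9],
]
print(object_features(image))
```

Because the features are computed from objects and their distances to the center of fluorescence, they do not change when the cell is translated, which matches the requirement that features be insensitive to cell position.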

The Approach
1. Acquisition of Images
2. Image Processing
3. Feature Extraction
4. Classifier Design and Training
5. Classification
[Diagram: each of M images is reduced to features feature1, feature2, ..., featureN; the classifier output reads “This is a Golgi Protein”.]

Backpropagation Neural Network
[Diagram: Input 1, Input 2, ..., Input n feed a layer of internal ‘neurons’, which feed Output 1, Output 2, ..., Output m.]

Classification accuracy for single images
- Average Correct Classification Rate: 81%

How does it work?
[Figure: scatter plot for TfR and LAMP2]

Feature Subsets
- The large number of features used may make training of the network harder, because of the large number of weights that need to be adjusted.
- Therefore stepwise discriminant analysis was used to select a subset of the features that optimizes a criterion for distinguishing classes.

Results: “Best” Features
- Average Correct Classification Rate: 83%

How to do even better
- Biologists interpreting images of protein localization typically view many cells before reaching a conclusion.
- Can simulate this by classifying sets of cells from the same microscope slide.
- Also applicable to colonies of CD-tagged cells.

Classification accuracy for sets of ten images
- Average Correct Classification Rate = 98% (99% for those sets not considered “unknown”)
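One simple way to classify a set of images is to classify each image separately and take a plurality vote. The sketch below assumes that rule, and assumes a set is reported as “unknown” when the top two classes tie; the slides do not state the exact rule used, so treat both as illustrative choices. The function name `classify_set` is hypothetical.

```python
from collections import Counter


def classify_set(predictions):
    """Plurality vote over per-image class labels; a tie for first place gives 'unknown'."""
    if not predictions:
        raise ValueError("need at least one per-image prediction")
    ranked = Counter(predictions).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "unknown"  # assumed tie rule, not taken from the slides
    return top_label


# Hypothetical per-image outputs for a set of ten cells from one slide.
print(classify_set(["Golgi"] * 6 + ["ER"] * 3 + ["lysosome"]))
print(classify_set(["TfR"] * 5 + ["LAMP2"] * 5))
```

Voting helps because single-image errors are unlikely to agree with each other: a classifier that is right about 83% of the time per image rarely produces a wrong plurality over ten images.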

Conclusion so far
- Have demonstrated the feasibility of using automated classification to assign a subcellular location “class” to an image.
- Gearing up to do this for thousands of proteins.

SLIC (Subcellular Location Image Classifier)
[Figure caption: “This is a Golgi protein!”]

Extending to 3D
- Have begun extending this approach to 3D images collected by confocal microscopy.
- Also beginning to collect 3D images by a new method using a “grating imager” (with F. Lanni).

3D labeling approach
- All proteins labeled with a Cy5-conjugated reactive dye
- DNA labeled with PI (propidium iodide)
- Specific proteins labeled with primary Ab + secondary Alexa488-conjugated Ab

Features for 3D Images
Use a subset of the 2D SLF features (COF = center of fluorescence):
- Number of objects
- Euler number
- Average object size
- Standard deviation of object sizes
- Ratio of the largest to the smallest object size
- Average distance of objects from the COF
- Standard deviation of object distances from the COF
- Ratio of the largest to the smallest object distance

DNA Features
Use the parallel DNA image to calculate:
- The average object distance from the COF of the DNA image
- The variance of object distances from the DNA COF
- The ratio of the largest to the smallest object-to-DNA-COF distance
- The distance between the protein COF and the DNA COF
- The ratio of the volume occupied by protein to that occupied by DNA
- The fraction of the protein fluorescence that co-localizes with DNA

3D Classification Results with 14 features
- Overall accuracy = 96%

2D Results with the same 14 features
- Overall accuracy = 82%

Next: Experiment Interpretation
- The growing use of digital microscopy is anticipated to create a need for a variety of computational approaches that can automate the extraction of information from images or the testing of hypotheses using image sets.
- The key is the design and validation of feature sets.

Goal: Typical Image Selection
- To develop automated methods for selecting a representative image from a set of images obtained by fluorescence microscopy.

TypIC - Typical Image Chooser
[Figure: an image set, with the output “The third image is the most typical of the set!!”]

Motivation
- Authors and speakers must choose images for publication or presentation that represent an entire set.
- Currently the choice is subjective and may change over time.
- Currently the choice cannot be verified by others.

Approach
- Use sets of images collected for the classification project to evaluate various approaches to choosing a typical image.

Sample Images

Approach
- Calculate numerical features that contain information about each image (just like when classifying images).
- Calculate the similarity of each image to the other images (using the numerical features).
- Choose the image that is representative (typical) by choosing the image that is most similar to the others.
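The three steps above can be sketched as follows, assuming the features have already been computed. The sketch standardizes each feature, measures dissimilarity as Euclidean distance, and picks the image with the smallest total distance to all others. The distance measure, the standardization, and the name `most_typical` are assumptions made for illustration; the slides do not specify them.

```python
from math import dist
from statistics import mean, pstdev


def standardize(features):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*features))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # constant columns are left unscaled
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in features]


def most_typical(features):
    """Index of the image whose summed distance to all other images is smallest."""
    z = standardize(features)
    totals = [sum(dist(a, b) for b in z) for a in z]
    return min(range(len(z)), key=totals.__getitem__)


# Hypothetical feature vectors for five images; the last one sits in the middle.
features = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, -1.0], [1.0, 0.0]]
print(most_typical(features))
```

Standardizing first keeps a feature with a large numeric range (for example, object count) from dominating one with a small range (for example, a ratio).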

Image Similarity
- Why do we need to be able to measure image similarity?
  - To find images similar to a particular image, either on the web, in a database or on a microscope
  - To pick a representative image from a set
  - To test hypotheses regarding images (are two images or groups of images the same or different?)

What is typical?
- What do we mean by a typical (or representative) point in multidimensional space?
- In one dimension, we think of the median point.
- What we need then is a multidimensional median.
  - Problem: no unique definition

Possible approaches to multidimensional median
- Convex peeling
- Closest point to combination of unidimensional medians
- Closest point to mean
- In all cases, beware of outliers!
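Two of the approaches listed above are easy to compare directly. The sketch below picks the data point closest to the mean and the data point closest to the vector of per-feature medians, and shows how a single outlier pulls the two choices apart. The function names and the sample points are hypothetical, features are assumed to be already on comparable scales, and convex peeling is not shown.

```python
from math import dist
from statistics import mean, median


def closest_to(points, target):
    """Index of the point nearest (Euclidean distance) to the target vector."""
    return min(range(len(points)), key=lambda i: dist(points[i], target))


def closest_to_mean(points):
    return closest_to(points, [mean(col) for col in zip(*points)])


def closest_to_median(points):
    # "Combination of unidimensional medians": the median of each feature separately.
    return closest_to(points, [median(col) for col in zip(*points)])


# Four similar points plus one outlier at (40, 40).
points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [40.0, 40.0]]
print(closest_to_mean(points), closest_to_median(points))
```

With the outlier present, the mean moves to (10, 10), so the point closest to the mean is the one nearest the outlier, while the median-based choice stays in the middle of the main group.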

Results For Golgi (giantin) Images
[Images: ordered from most typical to least typical]

Goal: Image Set Comparison
- A common paradigm in molecular cell biology is to compare the distribution of a protein with and without the addition of a potential perturbing agent (e.g., drug, overexpressed protein).
- Such experiments are usually assayed by visual examination.
- We have explored automating such comparisons.

SImEC - Statistical Imaging Experiment Comparator
[Figure: Image Set 1 and Image Set 2, with the output “These sets are statistically different!”]

Inputs to Method
1) Two sets of images taken under identical conditions except for the condition being tested (e.g., with and without drug)
  - Should have a roughly equal number of images in each set
  - The total number of images in both sets combined should exceed the number of features

Inputs to Method
2) A specification of the feature set to be used
  - default is 65 features: 49 Zernike moments and 16 SLF features
3) A confidence level
  - default is 95%

Method
- Calculate the feature matrix for each set of images.
- Compare the feature matrices using a multivariate hypothesis test called the Hotelling $T^2$ test.

Hotelling $T^2$ test
- Let $n_1$ and $n_2$ be the number of images in the two sets.
- Let $p$ be the number of features.
- Calculate the mean vector for each set, $I_1$ and $I_2$.
- Calculate the covariance matrix for each set, $\mathrm{cov}_1$ and $\mathrm{cov}_2$.

Hotelling $T^2$ test
- Calculate the merged covariance matrix.

Hotelling $T^2$ test
- Calculate the Mahalanobis distance between the mean vectors using the combined covariance matrix.
  - measures how far apart the two sets are

Hotelling $T^2$ test
- Calculate Hotelling $T^2$ and the associated F statistic.

Hotelling $T^2$ test
- This F statistic has n and n-p degrees of freedom.
- Tests $H_0$: $I_1 = I_2$
- Accept $H_0$ if F is less than the critical value for the two degrees of freedom.
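The equations on these slides did not survive in the transcript. For reference, the standard textbook form of the two-sample Hotelling $T^2$ test, written with the symbols defined above, is given below. This is the usual formulation and is offered as an assumption about what the slides showed; $S$, $D^2$ and the subscripts on $F$ are notation introduced here.

```latex
% Pooled ("merged") covariance matrix
S = \frac{(n_1 - 1)\,\mathrm{cov}_1 + (n_2 - 1)\,\mathrm{cov}_2}{n_1 + n_2 - 2}

% Squared Mahalanobis distance between the mean vectors
D^2 = (I_1 - I_2)^{\mathsf{T}} \, S^{-1} \, (I_1 - I_2)

% Hotelling statistic and its F transform
T^2 = \frac{n_1 n_2}{n_1 + n_2} \, D^2
\qquad
F = \frac{n_1 + n_2 - p - 1}{p \,(n_1 + n_2 - 2)} \, T^2

% Under H_0 : I_1 = I_2
F \sim F_{\,p,\; n_1 + n_2 - p - 1}
```

In this textbook form the degrees of freedom are $p$ and $n_1 + n_2 - p - 1$, which is written differently from the degrees of freedom quoted on the slide. The second value must be positive, which agrees with the input requirement that the total number of images exceed the number of features.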

Summary of Method
- Collect two sets of images.
- Extract features.
- Perform the Hotelling $T^2$ test.
- If the F value falls below the critical value for the desired confidence level (e.g., 95%), then the two distributions are considered to be the same.
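The test step can be sketched in pure Python once features are extracted. The sketch assumes the standard two-sample form of the test: pooled covariance, squared Mahalanobis distance $D^2$, $T^2 = \frac{n_1 n_2}{n_1 + n_2} D^2$, and $F = \frac{n_1 + n_2 - p - 1}{p(n_1 + n_2 - 2)} T^2$ with $p$ and $n_1 + n_2 - p - 1$ degrees of freedom. The function names are hypothetical, and the critical value is not computed here; it would come from an F table or a statistics library.

```python
from statistics import mean


def covariance(rows, mu):
    """Sample covariance matrix (divides by n - 1)."""
    n, p = len(rows), len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [v] for row, v in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def hotelling_f(set1, set2):
    """Return the F statistic and its (numerator, denominator) degrees of freedom."""
    n1, n2, p = len(set1), len(set2), len(set1[0])
    if n1 + n2 - p - 1 <= 0:
        raise ValueError("need more images than features")
    mu1 = [mean(col) for col in zip(*set1)]
    mu2 = [mean(col) for col in zip(*set2)]
    c1, c2 = covariance(set1, mu1), covariance(set2, mu2)
    pooled = [[((n1 - 1) * c1[i][j] + (n2 - 1) * c2[i][j]) / (n1 + n2 - 2)
               for j in range(p)] for i in range(p)]
    d = [a - b for a, b in zip(mu1, mu2)]
    d2 = sum(di * xi for di, xi in zip(d, solve(pooled, d)))  # squared Mahalanobis distance
    t2 = n1 * n2 / (n1 + n2) * d2
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return f, (p, n1 + n2 - p - 1)


# Hypothetical feature matrices: four images, two features each.
control = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
treated = [[v + 10.0 for v in row] for row in control]
print(hotelling_f(control, control))
print(hotelling_f(control, treated))
```

Identical sets give F = 0, and shifting every feature by a constant gives a large F, which would exceed any reasonable critical value and lead to the conclusion that the sets differ.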

F values for comparison of all pairs of classes using 65 features
- Critical values are approximately 1.4 for all comparisons (the exact value depends on the number of images).

Comparison of two sets drawn randomly from the same class
[Table, values not preserved in this transcript. Columns: TfR, Phal. Rows: Average F; Critical F (0.95); Number of failing sets.]
- Expected result obtained: 95% of randomly drawn sets are considered to be the same.

SImEC
- Have a system for comparing image sets
  - can detect subtle differences
  - but still concludes that two sets of images of the same protein are the same

Conclusions
- New frontier of automated cell biology just opening
  - Classification of subcellular patterns
  - Selection of representative images
  - Comparison of image sets
- Will be combined with informatics tools to produce self-justifying, self-populating knowledge bases for proteins