Computational Biology, Part 24: Biological Imaging IV. Robert F. Murphy. Copyright. All rights reserved.
Proteomics: The set of proteins expressed in a given cell type or tissue is called its proteome. Not all transcripts are actually made into protein, and the steady-state level of protein expression is controlled by many factors other than transcript amount. Protein differences between cell types are responsible for the different roles of those cells.
Proteomics: Things to learn about proteins: sequence; location (gives insight into function); structure; activity; partners. Almost nothing is known about most proteins!
One Approach to Proteomics: CD-tagging. Infect cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random protein. Examine many cells, each of which is expected to express one tagged protein, to determine the subcellular location of that protein. Use fluorescence microscopy.
Principles of CD-Tagging (CD = Central Dogma): [Diagram labels: Genomic DNA (Exon 1, Intron 1, Exon 2) + CD-cassette; Tagged DNA (Exon 1, Tag, Exon 2); Tagged mRNA; Tagged Protein with Tag (Epitope)]
CD-Tagging: Proof of concept. Use a CD-cassette containing the hemagglutinin (HA) epitope. Insert the cassette into introns of the nucleolin gene. Obtain clonal lines expressing the tagged protein. Image the distribution of nucleolin using immunofluorescence microscopy.
Results: CD-Tagging. [Figure: tagged nucleolin in HeLa cells]
CD-Tagging: Extensions. Improved epitope tag: the HA epitope works in only one reading frame, so an epitope was designed that is the same in all three reading frames (the universal epitope). Endogenously fluorescent tags: fluorescent proteins (e.g., GFP, YFP) can be used as the inserted tag, so fixation and antibodies are not needed.
CD-tagging project: Large project funded by the National Cancer Institute to identify locations for all expressed genes. Jonathan Jarvik, Peter Berget, Robert Murphy. My group is responsible for automated analysis of subcellular location patterns.
The Problem: Different investigators may use different terms to refer to the same pattern, or the same term to refer to different patterns. Current determinations do not lend themselves to incorporation into databases (at best, databases may describe a protein in comment fields as being a “cytoskeletal protein” or an “endosomal protein” even though these are known to be imprecise).
Cartoonist's View of Subcellular Locations: Cells Alive! “rollover” cell with information on each organelle.
The Starting Point: A systematic, quantitative approach to protein localization (whether from a pattern analysis or a bioinformatics perspective) has not been presented previously.
The Goal: [Figure: automated output reading “This is a Golgi protein!”]
More problems: Direct (point-by-point) comparison of individual images is not possible, since different cells have different shapes, sizes, and orientations, and organelles within cells are not found in fixed locations.
The Approach 1. Create sets of images showing the localization of many different proteins (each set defines one class of pattern) 2. Reduce each image to a set of numerical values (“features”) that are insensitive to position and rotation of the cell 3. Use statistical classification methods to “learn” how to distinguish each class using the features
Input: Created an image database for HeLa cells. Ten classes covering all major subcellular structures: Golgi (two classes, giantin and gpp130), ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules. Includes classes that are similar to each other.
Example Images: Patterns that might be easily confused. [Figure: Endoplasmic Reticulum (ER) vs. Mitochondria]
Example Images: Patterns that might be easily confused. [Figure: Lysosomes (LAMP2) vs. Endosomes (TfR)]
Example Images: Patterns that might be easily confused. [Figure: F-actin vs. Tubulin]
Example Images: Classes expected to be indistinguishable. [Figure: Golgi (Giantin) vs. Golgi (gpp130)]
Features: Zernike moment features (based on the Zernike polynomials) give information on the basic nature of the pattern (e.g., circle, donut) and the sizes (frequencies) present in the pattern. Haralick texture features give information on correlations in intensity between adjacent pixels.
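To make "correlations in intensity between adjacent pixels" concrete, here is a minimal sketch of the gray-level co-occurrence idea behind Haralick features. It assumes images already quantized to a small number of integer gray levels and looks only at horizontal neighbors; the function names are illustrative, and the actual feature set covers more directions and statistics than the two shown.

```python
def cooccurrence(image, levels):
    """Normalized, symmetric co-occurrence matrix for horizontally adjacent pixels.

    image: list of rows of integer gray levels in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
            counts[right][left] += 1  # count each neighbor pair in both directions
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def contrast(p):
    """Haralick contrast: large when neighboring pixels differ strongly."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def angular_second_moment(p):
    """Haralick angular second moment: large when few gray-level pairs dominate."""
    return sum(v * v for row in p for v in row)


# Two toy 4x4 images with the same pixel values but different texture.
smooth = [[0, 0, 1, 1]] * 4
checker = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]

print(contrast(cooccurrence(smooth, 2)))   # about 0.333
print(contrast(cooccurrence(checker, 2)))  # 1.0
```

Both toy images contain eight pixels of each gray level, so a histogram cannot tell them apart, but the co-occurrence statistics can.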
Examples of Zernike Polynomials
Zernike Moments Reconstruction: [Figure panels: Original, Order 12, Order 20, Order 45]
Features: Developed additional features (SLF, for Subcellular Location Features). Motivated by descriptions of patterns used by biologists (e.g., punctate, perinuclear). Combined with Zernike and Haralick features to give 84 features used to describe each image.
Example Features from SLF1: Number of fluorescent objects per cell. Variance of the object sizes. Ratio of the largest object to the smallest. Average distance of objects to the ‘center of fluorescence’. Fraction of convex hull occupied by fluorescence.
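The object-based features listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the SLF1 implementation: objects are taken to be 4-connected groups of pixels above a fixed threshold, the center of fluorescence is taken to be the intensity-weighted centroid of the whole image, and the names `find_objects` and `object_features` are hypothetical. The convex hull feature is omitted.

```python
import math
from collections import deque
from statistics import mean, pvariance


def find_objects(mask):
    """Group True pixels into 4-connected objects; returns lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill of one object
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects


def object_features(image, threshold):
    """Compute four object-based features for one 2D intensity image."""
    mask = [[value > threshold for value in row] for row in image]
    objects = find_objects(mask)
    if not objects:
        raise ValueError("no pixels above threshold")
    sizes = [len(pixels) for pixels in objects]
    total = sum(map(sum, image))
    # Assumed definition: intensity-weighted centroid of the whole image.
    cof = (
        sum(y * v for y, row in enumerate(image) for v in row) / total,
        sum(x * v for row in image for x, v in enumerate(row)) / total,
    )
    centers = [(mean(y for y, _ in px), mean(x for _, x in px)) for px in objects]
    return {
        "num_objects": len(objects),
        "size_variance": pvariance(sizes),
        "size_ratio": max(sizes) / min(sizes),
        "mean_distance_to_cof": mean(math.dist(c, cof) for c in centers),
    }


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 0],
]
print(object_features(image, threshold=1))
```

None of these values change if the cell is translated, and the distances are unchanged by rotation, which is the property the approach asks of its features.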
The Approach: 1. Acquisition of Images 2. Image Processing 3. Feature Extraction 4. Classifier Design and Training 5. Classification. [Diagram: each of M images is reduced to feature1, feature2, ..., featureN, and the classifier reports “This is a Golgi Protein”]
Backpropagation Neural Network: [Diagram: Input 1, Input 2, ..., Input n; internal ‘neurons’; Output 1, Output 2, ..., Output m]
Classification accuracy for single images: average correct classification rate 81%.
How does it work? [Figure: scatter plot for TfR and LAMP2]
Feature Subsets: The large number of features used may make training of the network harder due to the large number of weights needing to be adjusted. Therefore stepwise discriminant analysis was used to select a subset of the features that optimizes a criterion for distinguishing classes.
Results: “Best” Features. Average correct classification rate: 83%.
How to do even better: Biologists interpreting images of protein localization typically view many cells before reaching a conclusion. Can simulate this by classifying sets of cells from the same microscope slide. Also applicable for colonies of CD-tagged cells.
Classification accuracy for sets of ten images: average correct classification rate = 98% (99% for those sets not considered “unknown”).
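One simple way to classify a set of cells is to classify each image and let the images vote. The slides do not state the exact rule used, so the sketch below assumes a plurality vote that returns "unknown" when the winning class does not hold more than a chosen share of the votes; `classify_set` and `min_fraction` are hypothetical names.

```python
from collections import Counter


def classify_set(predictions, min_fraction=0.5):
    """Combine single-image class predictions for one set of images.

    Returns the most common class if it holds more than min_fraction of the
    votes, otherwise "unknown" (assumed rule, for illustration only).
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes / len(predictions) > min_fraction:
        return label
    return "unknown"


print(classify_set(["Golgi"] * 7 + ["ER"] * 3))                # Golgi
print(classify_set(["Golgi"] * 4 + ["ER"] * 4 + ["TfR"] * 2))  # unknown
```

A few misclassified images in a set are outvoted by the rest, which is why set-level accuracy can exceed single-image accuracy.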
Conclusion so far: Have demonstrated feasibility of using automated classification to assign a subcellular location “class” to an image. Gearing up to do this for thousands of proteins.
SLIC (Subcellular Location Image Classifier): [Figure: automated output reading “This is a Golgi protein!”]
Extending to 3D: Have begun extending this approach to 3D images collected by confocal microscopy. Also beginning to collect 3D images by a new method using a “grating imager” (with F. Lanni).
3D labeling approach: All proteins labeled with Cy5-conjugated reactive dye. DNA labeled with PI (propidium iodide). Specific proteins labeled with primary Ab + secondary Alexa488-conjugated Ab.
Features for 3D Images: Use a subset of the 2D SLF features: number of objects; Euler number; average object size; standard deviation of object sizes; ratio of the largest to the smallest object size; average distance of objects from COF; standard deviation of object distances from COF; ratio of the largest to smallest object distance.
DNA Features: Use the parallel DNA image to calculate: the average object distance from the COF of the DNA image; the variance of object distances from the DNA COF; the ratio of the largest to the smallest object to DNA COF distance; the distance between the protein COF and the DNA COF; the ratio of the volume occupied by protein to that occupied by DNA; the fraction of the protein fluorescence that co-localizes with DNA.
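Three of the DNA-relative features can be sketched directly from their descriptions. The representation here is an assumption made for brevity: each channel is a dict mapping (z, y, x) voxel coordinates to intensity for above-threshold voxels, volume is counted in voxels, and COF is taken to be the intensity-weighted centroid. The function names are hypothetical.

```python
import math


def center_of_fluorescence(voxels):
    """Intensity-weighted centroid of a {(z, y, x): intensity} dict."""
    total = sum(voxels.values())
    return tuple(
        sum(coord[axis] * value for coord, value in voxels.items()) / total
        for axis in range(3)
    )


def dna_features(protein, dna):
    """Return (COF distance, volume ratio, co-localizing fraction)."""
    cof_distance = math.dist(
        center_of_fluorescence(protein), center_of_fluorescence(dna)
    )
    volume_ratio = len(protein) / len(dna)  # volumes counted in voxels
    overlap = sum(value for coord, value in protein.items() if coord in dna)
    coloc_fraction = overlap / sum(protein.values())
    return cof_distance, volume_ratio, coloc_fraction


# Toy data: a 2x2 DNA patch and two protein voxels, one inside the patch.
dna = {(0, 0, 0): 1.0, (0, 0, 1): 1.0, (0, 1, 0): 1.0, (0, 1, 1): 1.0}
protein = {(0, 0, 0): 2.0, (0, 0, 3): 2.0}
print(dna_features(protein, dna))
```

The DNA channel supplies a common landmark, so these features describe where the protein sits relative to the nucleus rather than relative to the image frame.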
3D Classification Results with 14 features: overall accuracy = 96%.
2D Results, Same 14 Features: overall accuracy = 82%.
Next: Experiment Interpretation. Growing use of digital microscopy is anticipated to give rise to a need for a variety of computational approaches that can automate extraction of information from images or testing of hypotheses using image sets. Key is design and validation of feature sets.
Goal: Typical Image Selection. To develop automated methods for selecting a representative image from a set of images obtained by fluorescence microscopy.
TypIC - Typical Image Chooser: [Figure: for an image set, the system reports “The third image is the most typical of the set!!”]
Motivation: Authors/speakers must choose images for publication/presentation that represent an entire set. Currently the choice is subjective and may change over time. Currently the choice cannot be verified by others.
Approach: Use sets of images collected for the classification project to evaluate various approaches to choosing a typical image.
Sample Images
Approach: Calculate numerical features that contain information about each image (just like when classifying images). Calculate the similarity of each image to the other images (using the numerical features). Choose the image that is representative (typical) by choosing the image that is most similar to the others.
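One literal reading of "most similar to the others" is to rank images by their total distance to all other images in feature space. The sketch below assumes Euclidean distance on features that have already been put on comparable scales; `rank_by_typicality` is a hypothetical name and this is not necessarily the measure TypIC uses.

```python
import math


def rank_by_typicality(features):
    """Return image indices ordered from most to least typical.

    features: one feature vector per image. An image is more typical when the
    sum of its Euclidean distances to all other images is smaller.
    """
    totals = [sum(math.dist(a, b) for b in features) for a in features]
    return sorted(range(len(features)), key=totals.__getitem__)


# Five images near each other in feature space plus one far away.
features = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [4, 4]]
ranking = rank_by_typicality(features)
print(ranking[0], ranking[-1])  # 4 5
```

The image in the middle of the cluster ranks first and the distant image ranks last, which matches the ordering from most typical to least typical that the method is meant to produce.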
Image Similarity: Why do we need to be able to measure image similarity? To find images similar to a particular image, either on the web, in a database, or on a microscope. To pick a representative image from a set. To test hypotheses regarding images (are two images or groups of images the same or different?).
What is typical? What do we mean by a typical (or representative) point in multidimensional space? In one dimension, we think of the median point. What we need then is a multidimensional median. Problem: no unique definition.
Possible approaches to multidimensional median: Convex peeling. Closest point to combination of unidimensional medians. Closest point to mean. >>> In all cases, beware of outliers!
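Two of the approaches above are easy to compare side by side, and doing so shows why outliers matter. This is a minimal sketch with hypothetical function names; it uses plain Euclidean distance and skips convex peeling.

```python
import math
from statistics import mean, median


def closest_to(features, center):
    """Index of the feature vector nearest to a given center point."""
    return min(range(len(features)), key=lambda i: math.dist(features[i], center))


def typical_by_mean(features):
    """Closest point to the mean vector."""
    return closest_to(features, [mean(column) for column in zip(*features)])


def typical_by_medians(features):
    """Closest point to the vector of per-feature (unidimensional) medians."""
    return closest_to(features, [median(column) for column in zip(*features)])


# Four similar images and one extreme outlier.
features = [[1, 1], [2, 2], [3, 3], [4, 4], [100, 100]]
print(typical_by_mean(features))     # 3
print(typical_by_medians(features))  # 2
```

The single outlier drags the mean to (22, 22), so the mean-based choice shifts toward it, while the median-based choice stays in the middle of the main group.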
Results for Golgi (giantin) Images: [Figure: images ordered from most typical to least typical]
Goal: Image Set Comparison. A common paradigm in molecular cell biology is to compare the distribution of a protein with and without the addition of a potential perturbing agent (e.g., drug, overexpressed protein). Such experiments are usually assayed by visual examination. We have explored automating such comparisons.
SImEC - Statistical Imaging Experiment Comparator: [Figure: given Image Set 1 and Image Set 2, the system reports “These sets are statistically different!”]
Inputs to Method: 1) Two sets of images taken under identical conditions except for the condition being tested (e.g., with and without drug). The sets should have roughly equal numbers of images. The total number of images across both sets should exceed the number of features.
Inputs to Method: 2) A specification of the feature set to be used (default is 65 features: 49 Zernike moments and 16 SLF features). 3) A confidence level (default is 95%).
Method: Calculate the feature matrix for each set of images. Compare the feature matrices using a multivariate hypothesis test called the Hotelling $T^2$ test.
Hotelling $T^2$ test: Let $n_1$ and $n_2$ be the number of images in the two sets. Let $p$ be the number of features. Calculate the mean vector for each set, $I_1$ and $I_2$. Calculate the covariance matrix for each set, $\mathrm{cov}_1$ and $\mathrm{cov}_2$.
Hotelling $T^2$ test: Calculate the merged covariance matrix.
Hotelling $T^2$ test: Calculate the Mahalanobis distance between the mean vectors using the combined covariance matrix. It measures how far apart the two sets are.
Hotelling $T^2$ test: Calculate Hotelling $T^2$ and the associated F statistic.
Hotelling $T^2$ test: Under $H_0$ this F statistic has $p$ and $n_1 + n_2 - p - 1$ degrees of freedom (the standard two-sample result). Tests $H_0$: $I_1 = I_2$. Accept $H_0$ if F is less than the critical value for the two degrees of freedom.
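For reference, the quantities named in these steps have the following textbook two-sample forms, written with the symbols defined above. $S$ for the merged covariance matrix and $D^2$ for the squared Mahalanobis distance are notation introduced here, and these are the standard definitions rather than a transcription of the slides' own formulas.

```latex
\begin{align*}
S   &= \frac{(n_1 - 1)\,\mathrm{cov}_1 + (n_2 - 1)\,\mathrm{cov}_2}{n_1 + n_2 - 2} \\
D^2 &= (I_1 - I_2)^{\mathsf{T}}\, S^{-1}\, (I_1 - I_2) \\
T^2 &= \frac{n_1 n_2}{n_1 + n_2}\, D^2 \\
F   &= \frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\, T^2
      \;\sim\; F_{p,\; n_1 + n_2 - p - 1} \quad \text{under } H_0
\end{align*}
```

Inverting $S$ requires it to be nonsingular, which is why the total number of images must exceed the number of features.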
Summary of Method: Collect two sets of images. Extract features. Perform the Hotelling $T^2$ test. If the F value falls below the critical value for the desired confidence level (e.g., 95%), then the two distributions are considered to be the same.
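The test itself is short enough to sketch with the standard library alone. This is an illustration of the standard two-sample Hotelling $T^2$ computation, not the SImEC code: the function names are hypothetical, each image set is a list of feature vectors, and the linear system is solved with a small Gaussian elimination routine instead of a numerical library.

```python
from statistics import mean


def covariance(rows):
    """Sample covariance matrix (divides by n - 1) of a list of feature vectors."""
    n, p = len(rows), len(rows[0])
    mu = [mean(column) for column in zip(*rows)]
    return [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("merged covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def hotelling_f(set1, set2):
    """Return (F, (df1, df2)) for the two-sample Hotelling T^2 test."""
    n1, n2, p = len(set1), len(set2), len(set1[0])
    if n1 + n2 - p - 1 <= 0:
        raise ValueError("need more images than features")
    c1, c2 = covariance(set1), covariance(set2)
    merged = [
        [((n1 - 1) * c1[i][j] + (n2 - 1) * c2[i][j]) / (n1 + n2 - 2) for j in range(p)]
        for i in range(p)
    ]
    diff = [mean(a) - mean(b) for a, b in zip(zip(*set1), zip(*set2))]
    d2 = sum(d * s for d, s in zip(diff, solve(merged, diff)))  # Mahalanobis D^2
    t2 = n1 * n2 / (n1 + n2) * d2
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return f, (p, n1 + n2 - p - 1)


control = [[0, 0], [1, 0], [0, 1], [1, 1]]
treated = [[1, 0], [2, 0], [1, 1], [2, 1]]  # same shape, shifted by 1 in feature 1
print(hotelling_f(control, treated))
```

To finish the test, the F value is compared with the critical value of the F distribution at the chosen confidence level, for example `scipy.stats.f.ppf(0.95, df1, df2)` if SciPy is available; that step is left out here to keep the sketch dependency-free.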
F values for comparison of all pairs of classes using 65 features: critical values are approximately 1.4 for all comparisons (the exact value depends on the number of images).
Comparison of two sets drawn randomly from the same class: [Table: columns TfR, Phal; rows Average F, Critical F (0.95), Number of failing sets out of] Expected result obtained: 95% of randomly drawn sets are considered to be the same.
SImEC: Have a system for comparing image sets that can detect subtle differences but still concludes that two sets of images of the same protein are the same.
Conclusions: New frontier of automated cell biology just opening: classification of subcellular patterns; selection of representative images; comparison of image sets. Will be combined with informatics tools to produce self-justifying, self-populating knowledge bases for proteins.