Applications of Visualization and Data Clustering to 3D Gene Expression Data

Oliver Rübel 1,2,3,7, Gunther H. Weber 3,7, Min-Yu Huang 1,7, E. Wes Bethel 3, Mark D. Biggin 4,7, Charless C. Fowlkes 5,7, Cris L. Luengo Hendriks 6,7, Soile V. E. Keränen 4,7, Michael B. Eisen 4,7, David W. Knowles 6,7, Jitendra Malik 5,7, Hans Hagen 2, and Bernd Hamann 1,2,3,7

1. Institute for Data Analysis and Visualization, University of California, Davis, One Shields Avenue, Davis CA 95616, USA
2. International Research Training Group “Visualization of Large and Unstructured Data Sets,” University of Kaiserslautern, Germany
3. Computational Research Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley CA 94620, USA
4. Genomics Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley CA 94620, USA
5. Computer Science Division, University of California, Berkeley, CA, USA
6. Life Sciences Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley CA 94620, USA
7. Berkeley Drosophila Transcription Network Project, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley CA 94620, USA

Biological Background

Animals comprise dynamic 3D arrays of cells that express gene products in intricate spatial and temporal patterns. These patterns of gene expression determine the shape and form of the animal. Biologists have typically analyzed gene expression and morphology by visual inspection of 2D microscopic images. A rigorous understanding of developmental processes requires methods that can quantitatively analyze these phenomenally complex arrays at the level of cellular resolution.

Single Pattern Analysis

Genes are frequently expressed in complex patterns consisting of quantitative differences in expression between cells of an embryo. Clustering can be used effectively to discretize the expression pattern of a gene.
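As a minimal sketch of discretizing a single expression pattern by clustering, the following assigns each cell's expression value to one of k levels. The poster does not name the clustering algorithm used, so 1D k-means (Lloyd's algorithm) is an assumption here, and the function name `discretize_expression` and the sample values are hypothetical.

```python
from typing import List, Tuple


def discretize_expression(
    values: List[float], k: int, iterations: int = 100
) -> Tuple[List[int], List[float]]:
    """Assign each cell's expression value to one of k levels (1D k-means).

    Returns (labels, centers). Label 0 is the lowest expression level.
    Assumption: k-means stands in for whatever clustering the tool uses.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of cells")
    lo, hi = min(values), max(values)
    # Deterministic initialization: centers spread evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        # Update step: each center moves to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        converged = new_labels == labels and new_centers == centers
        labels, centers = new_labels, new_centers
        if converged:
            break
    return labels, centers


# Hypothetical per-cell expression values, classified into two levels.
levels, level_centers = discretize_expression([0.0, 0.1, 0.05, 0.9, 1.0, 0.95], k=2)
print(levels)  # [0, 0, 0, 1, 1, 1]
```

Calling the function with k = 2, 3, or 6 gives progressively finer discretizations of the same pattern.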
Discretization of expression patterns can be very useful, e.g., to create logical models of gene networks. Here the pattern of eve (a) is classified into 2, 3, and 6 levels (b-d). Based on the results shown in (d), seven clusters, each selecting one stripe of the eve pattern, are created using cluster post-processing techniques. Characteristics of the seven stripes are revealed in the scatter plot of three eve regulators: gt, hb, and Kr.

Temporal Variation Analysis

Gene expression patterns are not static but highly dynamic. Understanding the temporal profile of a gene expression pattern is therefore essential if we are to understand complex relationships between genes. To assist in the analysis of the spatio-temporal expression patterns of genes, we use PointCloudXplore to cluster cells into groups based on the similarity of their temporal expression profiles. The example here shows the classification of the spatio-temporal pattern of giant (gt) expression. Cluster statistics, such as the average temporal expression profiles of clusters, reveal the complex changes of gene patterns and allow quantitation of their temporal variation.

Multiple Pattern Analysis

To dissect the complex regulatory interactions between genes, the expression patterns of multiple potential regulatory transcription factors can be used as input to cluster analysis. Cells are classified into clusters that have similar combinations of expression for the input set of regulators. Each cluster describes one potential sub-pattern that a regulatory network composed of these factors could give rise to. The results of such a clustering can also be compared to the expression patterns of suspected target genes to assess possible regulatory relationships. Here, the patterns of the genes giant (gt), hunchback (hb), and Krüppel (Kr) have been used as input to the clustering.
Clustering results are compared to stripe two of the eve expression pattern, suggesting that the anterior and posterior borders of the stripe as well as the ventral dip in eve expression can be modeled using gt, hb, and Kr expression levels.

3D Gene Expression Data

The BDTNP has developed a suite of methods to quantitate the expression of genes in 3D at cellular resolution from whole Drosophila embryos. Drosophila embryos are first imaged using two-photon fluorescence microscopy. The resulting 3D image stacks are segmented in order to extract information about the expression of genes on a per-cell basis. Currently, datasets with information on up to about 100 genes at up to six different time steps are available.

PointCloudXplore: A Framework for Visualization and Clustering of 3D Gene Expression Data

In our software, called PointCloudXplore, we have linked dedicated physical and information visualization views of the data via the concept of brushing (cell selection). A user can select and highlight cells of interest in any view. All brushes (cell selectors) are then stored in a central cell selector management system, allowing one to highlight all selections in any view. Data clustering provides a means for automatic detection and definition of data features by automatically classifying cells into groups of similar behavior, the clusters. Clusters, each defining a selection of cells, can be managed and visualized in the same way as user-defined cell selections. Visualization is used for validation and improvement of clustering results, while clustering is used to analyze the data as well as to improve the visualization. For improvement of clustering results we have developed dedicated cluster post-processing techniques, such as splitting, merging, and filtering of clusters based on spatial cell positions.

e) Clustering-based False Coloring

Using hierarchical clustering one can define a linear order of the cells.
This linear order can be used as a basis for false coloring of the data. By defining ranges in this linear cell order, one can also easily define data features based on cell similarity.

[Figure labels: Data Clustering; Cell Selector Management; Data Selection; Physical Views; Abstract Views; Cell Selector Statistics; Post-Processing; Visualization; PointCloudXplore; Clusters. Gene labels: giant (gt), Krüppel (Kr), hunchback (hb), tailless (tll).]
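The clustering-based false coloring described above can be sketched as follows. This is an illustration under stated assumptions, not the PointCloudXplore implementation: the poster does not specify the linkage criterion or the color map, so average linkage and an HSV hue ramp are assumed, and the names `linear_cell_order`, `false_colors`, and `select_range` are hypothetical. The naive clustering loop is only practical for small inputs.

```python
import colorsys
from typing import List, Sequence, Tuple


def _dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def linear_cell_order(profiles: Sequence[Sequence[float]]) -> List[int]:
    """Linear order of cells taken from the leaf order of a dendrogram.

    Assumption: naive average-linkage agglomerative clustering. Each merge
    concatenates the member lists, so the final list is the leaf order.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    _dist(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters[0] if clusters else []


def false_colors(order: List[int]) -> List[Tuple[float, float, float]]:
    """Map each cell's rank in the linear order to an RGB color.

    Assumption: hue runs from red (rank 0) to violet (last rank).
    """
    n = len(order)
    colors = [(0.0, 0.0, 0.0)] * n
    for rank, cell in enumerate(order):
        hue = 0.75 * rank / max(n - 1, 1)
        colors[cell] = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return colors


def select_range(order: List[int], start: int, stop: int) -> List[int]:
    """Define a data feature as the cells in a range of the linear order."""
    return order[start:stop]


# Hypothetical one-gene profiles for four cells.
order = linear_cell_order([[0.0], [10.0], [1.0], [12.0]])
print(order)                      # [0, 2, 1, 3]
print(select_range(order, 0, 2))  # [0, 2]
```

Because similar cells end up adjacent in the order, they receive similar colors, and a contiguous range of the order selects a group of mutually similar cells.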