Ranjit Ganta, Raj Acharya, Shruthi Prabhakara
Department of Computer Science and Engineering, Penn State University

DATA WAREHOUSE FOR BIO-GEO HEALTH CARE INFORMATICS

Health-care records: "bio-geo" informatics
  Patient identification information
  Geographical information
  Clinical information:
    Organ/cellular level: tumor, pathology
    Molecular level: DNA sequence, microarray
    Laboratory data: blood tests, diagnosis, prognosis

INTRODUCTION

Integration of health-care records:
  Privacy violation
  Distributed integration of health-care records
Integration within health-care records:
  Information fusion: combine multiple disparate sources of information such that the whole is more than the sum of its parts.

CORRESPONDENCE ANALYSIS

For the patient demographic data set, correspondence analysis helps answer questions such as:
  Which age/race profile(s), if any, define a typical prostate cancer patient?
  Are middle-aged Caucasian males more prone to prostate cancer than Caucasians of other age groups?
  Is there a close association between age and race groups?

Sample Result: (figure)

Example Data: Dhanasekaran et al., "Delineation of prognostic biomarkers in prostate cancer", Letters to Nature, Vol. 412, August 2001. Supplementary data (Fig. 1C, pg. 823, Commercial Pool). Gene expression (microarray data) in four clinical states of prostate-derived tissues.

CLINICAL STATES

Benign states
  BPH: Benign Prostatic Hyperplasia
  NAP: Normal Adjacent Prostate
Malignant states
  PCA: Localized prostate cancer
  MET: Metastatic sample

Sample Result: (figure)

KL-CLUSTERING

Figure: input profiles g1 to g6 are mapped to clusters of co-regulated genes:
  Down-regulated: {g1}
  Up-regulated: {g2, g4}; {g3}; {g5}
  No change: {g6}

The Kullback-Leibler (KL) divergence measures the relative dissimilarity of the shapes of two gene profiles.
1-D SOM (self-organizing map) algorithm + KL: minimize D(Gene || SOM weight for each node) at each iteration step. [Bioinformatics, Vol. 19, No. 4, 2003]

Common Motifs

Motif: a short segment of DNA that acts as a binding site for a specific transcription factor.
  Typically 6-25 bp in length
  Statistically different in composition compared to the background
  Often repeated within a sequence

Table: gene-by-motif matrix (one row per gene, up to Gene n; columns Motif 1, Motif 2, …, Motif k); each entry is the frequency of occurrence of that motif.

COMBINED CLUSTERING

Clustering using more than one data source aims at identifying clusters of genes with similar properties across all of the data. The goal of combined clustering is to answer the following:
1. If genes have similar expression profile patterns, do they also share common motifs?
2. If genes have a set of motifs in common, do they also exhibit similar expression profile patterns?
3. Which genes share both, that is, have similar expression profile patterns and share a set of common motifs?

Alpha Factor Experiments
Figure panels: Cluster on motif vectors; Cluster on gene expression; Combined clustering. All genes in the cluster share the transcription factor MCBa.

CONCLUSION

Figure: Information Fusion Based Attack Prototype for Bio-geo Data Warehouse
Figure labels: Gene Expression; Clinical and Pathology; Public Data (Literature etc.); Patient Information; Geographical Information; Global Statistics; Information Fusion based Clustering; Cancer Research Grid; Cancer Analysis Applications; Result Visualization.

We have demonstrated the significance of information fusion based tools for bio-geo health care informatics. The prototype serves two purposes:
  To act as a data warehouse for the various data sets involved in bio-geo health care informatics studies.
  To provide and demonstrate a set of information fusion tools for disease research.
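
The KL-clustering step described in the KL-CLUSTERING panel (pick, at each iteration, the SOM node whose weight minimizes D(Gene || node weight)) can be illustrated with a minimal sketch. This is not the implementation from the cited Bioinformatics paper: the profile normalization, the evenly spaced initialization, the linearly decaying learning rate, the shrinking neighbourhood, and the standard SOM weight update are all assumptions made here for illustration, and the function names and toy profiles are hypothetical.

```python
import math
import random


def normalize(profile, eps=1e-9):
    """Scale a non-negative profile to sum to 1, so only its shape remains."""
    clipped = [max(v, 0.0) for v in profile]
    total = sum(clipped)
    if total <= 0:
        raise ValueError("profile needs at least one positive value")
    n = len(clipped)
    # eps keeps every entry strictly positive so the logarithm is defined
    return [(v / total + eps) / (1.0 + n * eps) for v in clipped]


def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i) for two normalized profiles."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def best_node(x, weights):
    """Index of the SOM node whose weight minimizes D(x || weight)."""
    return min(range(len(weights)), key=lambda j: kl_divergence(x, weights[j]))


def train_som_1d(profiles, n_nodes=3, n_iter=200, lr0=0.5, seed=0):
    """Train a 1-D SOM whose matching criterion is the KL divergence."""
    rng = random.Random(seed)
    data = [normalize(p) for p in profiles]
    # Assumed initialization: evenly spaced input profiles
    weights = [list(data[(j * len(data)) // n_nodes]) for j in range(n_nodes)]
    for t in range(n_iter):
        lr = lr0 * (1.0 - t / n_iter)          # assumed linear decay
        radius = 1 if t < n_iter // 2 else 0   # assumed shrinking neighbourhood
        x = rng.choice(data)
        b = best_node(x, weights)
        for j in range(max(0, b - radius), min(n_nodes, b + radius + 1)):
            # Standard SOM update (an assumption), then renormalize
            moved = [w + lr * (xi - w) for w, xi in zip(weights[j], x)]
            weights[j] = normalize(moved)
    return weights


def assign(profiles, weights):
    """Cluster label of each profile = index of its KL-closest node."""
    return [best_node(normalize(p), weights) for p in profiles]


if __name__ == "__main__":
    # Hypothetical toy profiles: two rising and two falling shapes
    toy = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]]
    print(assign(toy, train_som_1d(toy, n_nodes=3)))
```

Because each profile is normalized before comparison, two profiles that differ only in overall magnitude have the same shape and are always assigned to the same node, which matches the poster's statement that the KL divergence compares profile shapes.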
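
The three combined-clustering questions can also be made concrete with a small sketch. The poster does not say how the two data sources are fused, so the approach below (intersecting an expression clustering with a motif clustering, and listing the motifs common to a group of genes) is only one assumed reading. The gene names, motif names, labels, and counts are hypothetical.

```python
from collections import defaultdict


def shared_motifs(genes, motif_counts):
    """Motifs that occur at least once in every listed gene."""
    present = [
        {m for m, count in motif_counts[g].items() if count > 0} for g in genes
    ]
    return set.intersection(*present) if present else set()


def combined_clusters(expr_label, motif_label):
    """Group genes that agree on BOTH the expression and the motif cluster."""
    groups = defaultdict(list)
    for gene in sorted(expr_label):
        groups[(expr_label[gene], motif_label[gene])].append(gene)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical labels from two separate clusterings of the same genes
    expr_label = {"g1": 0, "g2": 1, "g3": 1, "g4": 1}
    motif_label = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
    # Hypothetical gene-by-motif frequency table
    motif_counts = {
        "g1": {"M1": 0, "M2": 3},
        "g2": {"M1": 0, "M2": 1},
        "g3": {"M1": 2, "M2": 1},
        "g4": {"M1": 1, "M2": 0, "M3": 4},
    }
    for key, genes in combined_clusters(expr_label, motif_label).items():
        print(key, genes, shared_motifs(genes, motif_counts))
```

Here `shared_motifs` applied to one expression cluster addresses question 1, comparing expression labels within one motif cluster addresses question 2, and the groups returned by `combined_clusters` are the genes that share both.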