Presentation is loading. Please wait.

Presentation is loading. Please wait.

Ptree * -based Approach to Mining Gene Expression Data Fei Pan 1, Xin Hu 2, William Perrizo 1 1. Dept. Computer Science, 2. Dept. Pharmaceutical Science,

Similar presentations


Presentation on theme: "Ptree * -based Approach to Mining Gene Expression Data Fei Pan 1, Xin Hu 2, William Perrizo 1 1. Dept. Computer Science, 2. Dept. Pharmaceutical Science,"— Presentation transcript:

1 Ptree * -based Approach to Mining Gene Expression Data Fei Pan 1, Xin Hu 2, William Perrizo 1 1. Dept. Computer Science, 2. Dept. Pharmaceutical Science, North Dakota State University, Fargo, ND 58105 INTRODUCTION Gene expression data provides valuable information to analyze human diseases. The process of drug discovery involves the screening of compounds against a drug target to identify those compounds that interact with the target to produce the desired effect. Cluster analysis is currently the most widely used technique to analyze gene expression profiles. However, this technique is limited as to the amount of useful information that it can provide. On the other hand, to mine large microarray data sets and to analyze proteomic data sets is becoming a huge challenge. New computational tools and interpretation methods in the analysis of results. To address the problem, we introduce a new data structure, called Peano Count Tree (Ptree) and a new dissimilarity metric, called HOBbit metric. We apply a new hierarchical clustering algorithm to analyze the NCI 60 cell lines data set. The data set comprises 1376 gene expression profiles and the growth inhibition of 1400 chemical compounds on the same cell lines, providing an excellent test case case in the context of linking genomic data mining with high-throughput drug design. METHODS The hierarchical Ptree clustering algorithm was developed based on Peano Count Trees (Ptree) and the HOBbit metric. We defined the Peano Count Tree and HOBbit metric and then use HPC algorithm to evaluate the cell- cell correlations on the basis of NCI gene expression profiles. ACKNOWLEDGEMENTS We thank Pratap Kotala for his support and assistance. A special thanks to GSA Grant ACT# K96130308 for financial support. RESULTS The clustering results in Figure 3 show that various cancer cell lines tend to cluster according to the tissue of origin whereas, the gene-drug interaction suggest that drug metabolism rather than the mechanism of drug action is an important feature of the drug activity-gene expression correlation. Figure 3. (left) Cluster results of gene expression profiles. (right) cluster results represented using mask Ptree We compared our approach to the widely used average linkage method on the whole gene expression dataset. Results showed that our method is much faster than the average linkage method with a better tightness around the centroid. CONCLUSIONS We introduce a new data structure as well as a dissimilarity metric and apply the Ptree based hierarchical clustering to the analysis of large gene expression data sets. Empirical analysis of the NCI-60 cancer cell lines data set show that the new approach present a better tightness around the centroid compared to the average linkage method, thus providing high accuracy. Also the new clustering approach makes it possible to handle large data set and achieved speed improvements. Peano Count Tree (Ptree) HOBbit dissimilarity measurement Hierarchical Ptree Clustering Algorithm (HPC) HPC features one database scan and high accuracy by using Ptree techniques and an optimal approximation LP algorithm. Step1: build accumulative bit quadrants and Ptrees while reading gene expression data. Step2: find the 2-median of the whole Ptree as well as its sub-trees using the LP primal dual optimization algorithm recursively. Step3: assign gene expression data to its nearest median and build the mask P-tree hierarchically.  Let a i and b i be the i th bits of integer A and integer B respectively, and let m (m  1) be the number of bits in binary representations of the values, the HOBbit similarity between two integers A and B is defined as: HOB( A, B ) =  Let m (m  1) be the number of bits in binary representations of the integer values or mantissa of floating point, the HOBbit dissimilarity between two data points A and B is defined as: d v ( A, B ) = m – HOB( A, B ).  Peano order, also called Z-ordering, is a recursive raster ordering. An accumulative quadrant is the run-length quadrant-wise compression of a bit sequence in Peano order. A quadrant of a bit sequence is pure if it is entirely 1 or entirely 0.  Peano Count Trees (Ptrees) are defined as the recursive 1-bit counts of the accumulative quadrants in peano order.  A mask P-tree, also called template P-tree, is the representation of clustering groups. It contains 1 bit at the same index position of the data objects if this data objects belong to this groups, otherwise contains 0 bit. Figure 2. (left) Accumulative bit quadrant of gene expression data, (right) its corresponding Peano Count Tree Figure 1. (left) Scan of cDNA microarray containing whole yeast genome. (right) microarray spotting device at The Institute for Genome Research 56 depth=0 level=3 ____________/ / \ \___________ / _____/ \ _ \ ___12 __ 16 16 __ 12_ depth=1 level=2 / / | \ / | \ \ 4 2 4 2 2 4 2 4 depth=2 level=1 //|\ //|\ //|\ //|\ 1100 1100 0011 0011 depth=3 level=0 56 depth=0 level=3 ____________/ / \ \___________ / _____/ \ _ \ ___12 __ 16 16 __ 12_ depth=1 level=2 / / | \ / | \ \ 4 2 4 2 2 4 2 4 depth=2 level=1 //|\ //|\ //|\ //|\ 1100 1100 0011 0011 depth=3 level=0 * Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308, NSF Grant OSR-9553368.


Download ppt "Ptree * -based Approach to Mining Gene Expression Data Fei Pan 1, Xin Hu 2, William Perrizo 1 1. Dept. Computer Science, 2. Dept. Pharmaceutical Science,"

Similar presentations


Ads by Google