
Ptree*-based Approach to Mining Gene Expression Data
Fei Pan 1, Xin Hu 2, William Perrizo 1
1. Dept. Computer Science, 2. Dept. Pharmaceutical Science, North Dakota State University, Fargo, ND

INTRODUCTION
Gene expression data provide valuable information for analyzing human diseases. The process of drug discovery involves screening compounds against a drug target to identify those that interact with the target to produce the desired effect. Cluster analysis is currently the most widely used technique for analyzing gene expression profiles, but it is limited in the amount of useful information it can provide. At the same time, mining large microarray data sets and analyzing proteomic data sets is becoming a huge challenge. New computational tools and interpretation methods are needed in the analysis of results. To address the problem, we introduce a new data structure, called the Peano Count Tree (Ptree), and a new dissimilarity metric, called the HOBbit metric. We apply a new hierarchical clustering algorithm to analyze the NCI-60 cell lines data set. The data set comprises 1376 gene expression profiles and the growth inhibition of 1400 chemical compounds on the same cell lines, providing an excellent test case in the context of linking genomic data mining with high-throughput drug design.

METHODS
The hierarchical Ptree clustering (HPC) algorithm was developed based on Peano Count Trees (Ptrees) and the HOBbit metric. We defined the Peano Count Tree and the HOBbit metric and then used the HPC algorithm to evaluate the cell-cell correlations on the basis of the NCI gene expression profiles.

ACKNOWLEDGEMENTS
We thank Pratap Kotala for his support and assistance. Special thanks to GSA Grant ACT# K for financial support.
RESULTS
The clustering results in Figure 3 show that cancer cell lines tend to cluster according to their tissue of origin, whereas the gene-drug interactions suggest that drug metabolism, rather than the mechanism of drug action, is an important feature of the drug activity-gene expression correlation.

Figure 3. (left) Cluster results of gene expression profiles. (right) Cluster results represented using a mask Ptree.

We compared our approach to the widely used average linkage method on the whole gene expression data set. Our method was much faster than the average linkage method and gave better tightness around the centroid.

CONCLUSIONS
We introduce a new data structure as well as a dissimilarity metric, and apply Ptree-based hierarchical clustering to the analysis of large gene expression data sets. Empirical analysis of the NCI-60 cancer cell lines data set shows that the new approach gives better tightness around the centroid than the average linkage method, thus providing high accuracy. The new clustering approach also makes it possible to handle large data sets and achieves speed improvements.

Peano Count Tree (Ptree); HOBbit dissimilarity measurement; Hierarchical Ptree Clustering Algorithm (HPC)

HPC requires only one database scan and achieves high accuracy by using Ptree techniques and an optimal approximation LP algorithm.
Step 1: Build accumulative bit quadrants and Ptrees while reading the gene expression data.
Step 2: Find the 2-median of the whole Ptree, as well as of its sub-trees, by applying the LP primal-dual optimization algorithm recursively.
Step 3: Assign each gene expression data object to its nearest median and build the mask P-tree hierarchically.
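To make Step 3 concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the two medians are already known (the LP primal-dual search of Step 2 is not reproduced), that expression values are non-negative integers of at most m bits, and that the per-attribute HOBbit dissimilarities of a profile are combined by taking their maximum. The names `hob_similarity`, `hob_dissimilarity`, `profile_dissimilarity` and `build_masks` are hypothetical, and each mask is kept as a flat bit list instead of a compressed tree.

```python
from typing import List, Sequence


def hob_similarity(a: int, b: int, m: int) -> int:
    """Number of leading bits (out of m) on which a and b agree."""
    diff = (a ^ b) & ((1 << m) - 1)   # 1-bits mark positions where a and b differ
    return m - diff.bit_length()      # bits above the highest differing bit all match


def hob_dissimilarity(a: int, b: int, m: int) -> int:
    """HOBbit dissimilarity d_v(A, B) = m - HOB(A, B)."""
    return m - hob_similarity(a, b, m)


def profile_dissimilarity(p: Sequence[int], q: Sequence[int], m: int) -> int:
    # Assumption: combine attributes by the maximum per-attribute dissimilarity.
    return max(hob_dissimilarity(x, y, m) for x, y in zip(p, q))


def build_masks(profiles: Sequence[Sequence[int]],
                medians: Sequence[Sequence[int]],
                m: int) -> List[List[int]]:
    """Return one bit mask per median; masks[k][i] is 1 if profile i is nearest to median k."""
    masks = [[0] * len(profiles) for _ in medians]
    for i, profile in enumerate(profiles):
        distances = [profile_dissimilarity(profile, med, m) for med in medians]
        nearest = distances.index(min(distances))  # ties go to the first median
        masks[nearest][i] = 1
    return masks


if __name__ == "__main__":
    profiles = [(8, 9), (10, 8), (1, 2), (3, 0)]   # toy 4-bit expression values
    medians = [(9, 9), (2, 1)]                     # assumed output of Step 2
    print(build_masks(profiles, medians, m=4))
```

Applying the same assignment again inside each group, with that group's own two medians, would give the hierarchical refinement described in Step 3.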
• Let $a_i$ and $b_i$ be the $i$-th bits of integers $A$ and $B$ respectively, counted from the most significant bit, and let $m$ ($m \ge 1$) be the number of bits in the binary representations of the values. The HOBbit similarity between two integers $A$ and $B$ is defined as: $\mathrm{HOB}(A, B) = \max\{\, s : 0 \le s \le m,\ a_i = b_i \text{ for all } 1 \le i \le s \,\}$
• Let $m$ ($m \ge 1$) be the number of bits in the binary representations of the integer values or of the mantissa of floating-point values. The HOBbit dissimilarity between two data points $A$ and $B$ is defined as: $d_v(A, B) = m - \mathrm{HOB}(A, B)$.
• Peano order, also called Z-ordering, is a recursive raster ordering. An accumulative quadrant is the run-length, quadrant-wise compression of a bit sequence in Peano order. A quadrant of a bit sequence is pure if it is entirely 1 or entirely 0.
• Peano Count Trees (Ptrees) are defined as the recursive 1-bit counts of the accumulative quadrants in Peano order.
• A mask P-tree, also called a template P-tree, is the representation of a clustering group. It contains a 1 bit at the index position of each data object that belongs to the group, and a 0 bit otherwise.

Figure 1. (left) Scan of a cDNA microarray containing the whole yeast genome. (right) Microarray spotting device at The Institute for Genomic Research.

Figure 2. (left) Accumulative bit quadrant of gene expression data. (right) Its corresponding Peano Count Tree.
[Figure 2 tree diagram: root count 56 at depth 0 (level 3); child counts at depth 1 (level 2), of which two values of 12 are legible; further nodes at depth 2 (level 1) and depth 3 (level 0).]

* Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K, NSF Grant OSR
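The Peano Count Tree definition above can be illustrated with a small sketch. It assumes the bits are already laid out as a square 2^k by 2^k array, that quadrants are visited in Z-order (upper-left, upper-right, lower-left, lower-right), and that recursion stops at pure quadrants. `PNode` and `build_ptree` are hypothetical names, and the counts are recomputed naively at each level for readability rather than speed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class PNode:
    count: int                      # number of 1-bits in this quadrant
    size: int                       # side length of the quadrant
    children: List["PNode"] = field(default_factory=list)


def build_ptree(bits: Sequence[Sequence[int]],
                row: int = 0, col: int = 0,
                size: Optional[int] = None) -> PNode:
    """Recursively count 1-bits per quadrant; pure quadrants become leaves."""
    if size is None:
        size = len(bits)            # assumption: square array, power-of-two side
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    node = PNode(count=count, size=size)
    is_pure = count == 0 or count == size * size
    if size > 1 and not is_pure:
        half = size // 2
        # Z-order: upper-left, upper-right, lower-left, lower-right
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half)):
            node.children.append(build_ptree(bits, row + dr, col + dc, half))
    return node


if __name__ == "__main__":
    toy_bits = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 0, 1, 1],
    ]
    root = build_ptree(toy_bits)
    print(root.count, [child.count for child in root.children])
```

In this toy input the upper-left and lower-right quadrants are pure, so they are stored as single counts with no children, which is where the compression comes from.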