Peano Count Trees and Association Rule Mining for Gene Expression Profiling using DNA Microarray Data Dr. William Perrizo, Willy Valdivia, Dr. Edward Deckard,


Peano Count Trees and Association Rule Mining for Gene Expression Profiling using DNA Microarray Data
Dr. William Perrizo, Willy Valdivia, Dr. Edward Deckard, Francis Larson; North Dakota State University {william.perrizo, willy.valdivia, edward.deckard,
Patents pending on bSQ and Ptree technology

The Problem
There is a lot of data available today (e.g., gene expression data), but too little information. Data mining attempts to reduce raw data to information for decision support.
[Diagram: raw data (gigs, teras, petas, exas…) is reduced by data mining to decisions (often 1 bit: Y/N, T/F, Do/Don't_do; 0/1). Data mining tasks shown: classification (supervised learning), clustering (unsupervised learning), association rule mining (ARM). Other labels: statistics, machine learning, data structuring, signal processing.]

A Solution?
Currently the predominant method employed in bioinformatics is clustering (with a little classification) on isolated microarray datasets.
Needed: a data mining software suite able to:
- transform copies of pertinent data from a variety of databases into a data-mining-ready form in real time (our solution, based on P-trees?). We say "transform copies" rather than "standardize" since standardization rarely works! There will always be an MS (and I don't mean Martha Stewart) to frustrate/destroy the standardization effort.
- facilitate Association Rule Mining, Clustering and Classification in a uniform way (so data mining results from other areas can be used).
Bioinformatics: a Walmart or a Kmart?!? Walmart took DM seriously (an early, comprehensive approach borrowing useful techniques from a variety of application areas). Kmart? Too little, too late.

Using data mining techniques developed for other application areas in bioinformatics?
[Figure: a TIFF image and a Yield Map.]
Remotely Sensed Images (RSI) can be viewed as collections of pixels. Each pixel has a value for each feature attribute. For example, the RSI dataset above has 1320 rows and 1320 columns of pixels (1,742,400 pixels) and 4 feature attributes (Red, Green, Blue, Yield). The (R,G,B) feature bands are in the TIFF image and the Y feature is color coded in the Yield Map.
Microarray or DNA chip data is not much different (multiple attributes corresponding to treatments or conditions). Much data mining (ARM) has been done on RSI data. Can it be useful in bioinformatics?

Regulation Pathway Discovery is not very different from Market Basket Research (à la Walmart)
- The results of clustering microarray data may indicate that genes 1 to 9 are involved in a regulation pathway.
- High-confidence rule mining on that cluster can discover the relationships among those genes. For example, the expression of one gene, Gene2, might be discovered to be regulated by Genes 1, 3, 5, 6, 8 and 9, while Gene4 and Gene7 may not be directly regulating Gene2 and can therefore be excluded.
[Diagram: Clustering groups Gene1 through Gene9; ARM then relates Gene2 to Gene1, Gene3, Gene5, Gene6, Gene8 and Gene9, with Gene4 and Gene7 set apart.]

ARM for Microarray Data
A gene regulatory pathway component can be represented as an association rule, {G_1, ..., G_n} → G_m, where {G_1, ..., G_n} is the antecedent and G_m is the consequent.
Microarray data is most often represented as a relation G(Gid, T_1, ..., T_n), where Gid is the gene identifier, T_1 ... T_n are the treatments (or conditions), and the data values represent gene expression levels. Call this the "Gene Table".
Currently, data mining techniques concentrate on the Gene Table: specifically, on finding clusters of genes that exhibit similar expression patterns under selected treatments (clustering the Gene Table).

Gene-ID \ Trmt-ID | T_1 | T_2 | T_3 | T_4 | ...
G_1               | gene expression values
G_2               |
G_3               |
G_4               |
...

ARM for Microarray Data (Contd.)
An alternate data format exists, called the "Treatment Table": T(Tid, G_1, G_2, ..., G_n), where Tid is the treatment identifier and G_1 ... G_n are the gene identifiers.
The Treatment Table provides a convenient form for ARM of gene expression levels. The goal is to mine for rules among genes by associating Treatment Table columns.

Trtmt-ID \ Gene-ID | G_1 | G_2 | G_3 | G_4 | ...
T_1                | gene expression values
T_2                |
T_3                |
T_4                |
...

The form of the Treatment Table with binary values (coding only whether an expression level exceeds or does not exceed a threshold) is identical to Market Basket Data, for which a wealth of rule mining techniques have been developed in the last 8 years.
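The thresholding step and the market-basket view can be sketched in a few lines of Python. This is only an illustration: the expression levels, the cutoff of 1.0, and the helper names `to_baskets`, `support` and `confidence` are assumptions made for the example, not part of the presentation.

```python
# Hypothetical expression levels: one row per treatment, one column per gene.
treatment_table = {
    "T1": {"G1": 2.4, "G2": 1.9, "G3": 0.3},
    "T2": {"G1": 2.1, "G2": 2.2, "G3": 1.8},
    "T3": {"G1": 0.2, "G2": 0.4, "G3": 1.7},
    "T4": {"G1": 1.6, "G2": 1.5, "G3": 0.1},
}
THRESHOLD = 1.0  # assumed cutoff; a real study would choose this per experiment


def to_baskets(table, threshold):
    """Binarize: each treatment becomes the set of genes expressed above the threshold."""
    return {tid: {g for g, level in row.items() if level > threshold}
            for tid, row in table.items()}


def support(baskets, genes):
    """Fraction of treatments in which all of the given genes exceed the threshold."""
    genes = set(genes)
    hits = sum(1 for basket in baskets.values() if genes <= basket)
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    s_a = support(baskets, antecedent)
    if s_a == 0:
        return 0.0
    return support(baskets, set(antecedent) | set(consequent)) / s_a


baskets = to_baskets(treatment_table, THRESHOLD)
print(confidence(baskets, {"G1"}, {"G2"}))
```

With these made-up numbers, every treatment that expresses G1 above the cutoff also expresses G2, so the rule {G1} → G2 has confidence 1.0.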

The Gene Table is usually given as a standard (MS Excel) spreadsheet of gene expression levels coming from microarray experiments. It is a 2-D data cube which can be rotated (to the Treatment Table), rolled up, sliced, diced, drilled down, association rule mined, etc.
[Figure: the Gene Table (rows G_1 ... G_4, columns T_1 ... T_4) shown beside the Treatment Table (rows T_1 ... T_4, columns G_1 ... G_4).]
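For a 2-D cube, the rotation from Gene Table to Treatment Table amounts to a transpose. A minimal sketch, using invented values and a hypothetical helper named `rotate`:

```python
# Hypothetical Gene Table: one row per gene, one column per treatment.
gene_ids = ["G1", "G2", "G3"]
treatment_ids = ["T1", "T2"]
gene_table = [
    [2.4, 2.1],  # G1 under T1, T2
    [1.9, 2.2],  # G2
    [0.3, 1.8],  # G3
]


def rotate(rows):
    """Transpose a rectangular table: rows become columns and vice versa."""
    return [list(column) for column in zip(*rows)]


# Treatment Table: one row per treatment, one column per gene.
treatment_table = rotate(gene_table)
for tid, row in zip(treatment_ids, treatment_table):
    print(tid, dict(zip(gene_ids, row)))
```

Rotating twice gives back the original Gene Table, so no information is lost by switching views.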

What are Peano Trees? First, what are the spatial data formats?
[Example: two bands, BAND-1 and BAND-2, each a grid of pixel values.]
- Band SeQuential (BSQ) format: one file per band (2 files here, Band 1 and Band 2).
- Band InterLeaved by Line (BIL) format: 1 file.
- Band Interleaved by Pixel (BIP) format: 1 file.

Spatial Data Formats (Cont.)
[Example: the same two bands, BAND-1 and BAND-2, stored as BSQ (2 files), BIL (1 file) and BIP (1 file).]
- bit SeQuential (bSQ) format: 16 files for the two 8-bit bands, one per bit position (B_11, B_12, ..., B_18 for Band 1, and likewise for Band 2). This is related to bit planes in graphics.
Reasons for using bSQ format:
- Different bits contribute to the value differently.
- bSQ format facilitates representation of a precision hierarchy (1-bit, 2-bit, ..., n-bit precision).
- bSQ format facilitates the creation of an efficient P-tree data structure and P-tree algebra.
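As a rough illustration of the bSQ idea, the sketch below splits one band of 8-bit values into eight bit files and rebuilds the values from them. The pixel values and the function names are made up for the example; plane 0 is taken to be the high-order bit.

```python
def to_bit_planes(band, bits=8):
    """Split a band into `bits` bit vectors; planes[0] holds the high-order bit."""
    return [[(value >> (bits - 1 - j)) & 1 for value in band]
            for j in range(bits)]


def from_bit_planes(planes):
    """Rebuild values from the leading bit planes (all planes = full precision)."""
    bits = len(planes)
    return [sum(planes[j][k] << (bits - 1 - j) for j in range(bits))
            for k in range(len(planes[0]))]


band1 = [254, 127, 14, 193]  # hypothetical 8-bit pixel values
planes = to_bit_planes(band1)

print(planes[0])                    # high-order bit of every pixel
print(from_bit_planes(planes))      # full 8-bit precision
print(from_bit_planes(planes[:2]))  # 2-bit precision from the top two planes
```

Using only the first few planes gives a coarser view of the same data, which is the precision hierarchy mentioned above.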

BSQ and bSQ formats
- BSQ and bSQ are "tabular" formats:
  - BSQ consists of a separate table for each band (e.g., Gene or Treatment).
  - bSQ consists of a separate table for each bit of each band.
- One can view it this way: the data set is initially one relation or table, R(K_1, ..., K_k, A_1, A_2, ..., A_n), where K_1, ..., K_k are structure attributes and each A_i is a feature attribute.
- Structure attributes of an RSI are the X and Y coordinates (one could put the same structure on the Gene Table, but I want to focus on the Treatment Table).
- Structure attributes of the Treatment Table might be a collection of treatment dimensions, based on the MIAME standard (Minimum Information About a Microarray Experiment):
  - Experimental design
  - Array design
  - Samples
  - Hybridisations
  - Measurements
  - Normalization control

A Universal Format?
E.g., one large universal table with 5 dimensions based on the MIAME standard, for data mining across all treatments and genes?
- E = Experimental design (including hybridisation procedures)
- A = Array design
- S = Samples
- M = Measurements
- N = Normalization control

"GREASMN" (5-D Universal Gene Expression Cube)

Tid (E,A,S,M,N) \ Gene-Rep | G_1 | G_2 | ... | G_n
(E,A,S,M,N)_1              | gene expression values
(E,A,S,M,N)_2              |
...
(E,A,S,M,N)_m              |

Cardinality is high, but compression will be substantial (next slide).

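GREASMN datacube rolled up onto (E,S)
[Figure: grid with S (Organism..: Yeast, S_1, S_2, ..., S_n) on one axis and E (Lab…: E_1, E_2, ..., E_n) on the other; a few non-zero blocks, zeros everywhere else.]
The non-zero blocks may occur off the diagonal.
The point: a massive but very sparse dataset!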
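A roll-up of a sparse cube can be sketched by storing only the non-empty cells and aggregating over the dimensions that are dropped. Everything here is illustrative: the cells, the dimension labels, and the choice of "number of non-empty cells" as the aggregate are assumptions, since the slides do not specify the aggregate.

```python
from collections import Counter

# Dimension order assumed for the cell keys: Gene, E, A, S, M, N.
DIMS = ("G", "E", "A", "S", "M", "N")

# Hypothetical sparse cube: only non-empty cells are stored.
cube = {
    ("G1", "E1", "A1", "S1", "M1", "N1"): 2.4,
    ("G2", "E1", "A1", "S1", "M1", "N1"): 0.3,
    ("G1", "E2", "A1", "S2", "M1", "N1"): 1.7,
}


def roll_up(cells, keep):
    """Count non-empty cells per combination of the kept dimensions."""
    positions = [DIMS.index(d) for d in keep]
    counts = Counter()
    for key in cells:
        counts[tuple(key[p] for p in positions)] += 1
    return counts


rolled = roll_up(cube, ("E", "S"))
print(dict(rolled))
```

Only the (E, S) combinations that actually hold data appear in the result, which mirrors the "mostly zeros" picture of the rolled-up cube.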

Peano Count Tree (P-tree)
- A P-tree represents spatial bSQ data bit-by-bit in a recursive quadrant-by-quadrant arrangement.
- A P-tree is a lossless, compressed, data-mining-ready representation of the data:
  - partially run-length compressed using the structure attributes;
  - "count pre-computed".

An example of a Peano Count tree
Terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).
Given a bSQ file, B_ij (shown in spatial positions below), we create its basic PC-tree, P_ij, as follows.

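An example of PC-tree
Terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).
[Figure: the PC-tree for the example, with its levels labelled (Level-3, Level-2, ..., Level-0); the pixel at (7, 1), i.e. (111, 001) in binary, is annotated.]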
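The construction can be sketched as a short recursion. The 8x8 bit grid below does not appear in the transcript; it is reconstructed from the P1V vector table on the "Alternative forms for Ptrees" slide, under the assumption that quadrants are numbered 00 = upper-left, 01 = upper-right, 10 = lower-left, 11 = lower-right. The function names and the dictionary layout are likewise assumptions made for illustration.

```python
# 8x8 bit file, reconstructed under the assumptions stated above.
BITS = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_count_tree(bits, row=0, col=0, size=None):
    """Return {'count': ones in the quadrant, 'children': [...]}.

    Pure quadrants (all 0s or all 1s) are not subdivided; mixed quadrants
    get four children in Z-order (upper-left, upper-right, lower-left,
    lower-right).
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    node = {"count": count, "children": []}
    if size > 1 and 0 < count < size * size:
        half = size // 2
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half)):
            node["children"].append(
                build_count_tree(bits, row + dr, col + dc, half))
    return node


def qid(row, col, levels=3):
    """Quadrant ID of a pixel: interleave row and column bits, high bit first."""
    return ".".join(f"{(row >> s) & 1}{(col >> s) & 1}"
                    for s in range(levels - 1, -1, -1))


tree = build_count_tree(BITS)
print(tree["count"], [child["count"] for child in tree["children"]])
print(qid(7, 1))
```

The root count and the four level-2 counts agree with the 55 and 16, 8, 15, 16 that appear in the tree diagram later in the deck, which is some evidence that the reconstruction is the intended example.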

Alternative forms for Ptrees (all lossless)
- P1: 1 means the quadrant is pure-1, 0 otherwise (pure-0 if there are no sub-tree pointers, otherwise mixed).
- P0: 1 means the quadrant is pure-0, 0 otherwise.
- PNZ (= P0'): 1 means the quadrant is not pure-zero, 0 otherwise. (Note: PM = PNZ XOR P1.)
[Figure: the P1, P0 and PNZ trees for the example.]
Vector forms: a table entry for each mixed inode containing its qid and its children bit-vector. This eliminates the need for subtree pointers.

qid     | P1V  | P0V  | PNZV
[ ]     | 1001 | 0000 | 1111
[01]    | 0010 | 0100 | 1011
[10]    | 1101 | 0000 | 1111
[01.00] | 1110 | 0001 | 1110
[01.11] | 0010 | 1101 | 0010
[10.10] | 1101 | 0010 | 1101

Since there is no qid = [01.01] in the table, we know it is pure-0, not mixed.
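The relationships among the vector forms can be checked mechanically. The sketch below copies the P1V and PNZV entries from the table (with the root qid written as an empty string), derives P0V and PMV from them, and recovers the root count from P1V alone. The helper names and the 8x8 image size are assumptions for the example.

```python
LABELS = ("00", "01", "10", "11")

# Children bit-vectors keyed by qid; "" stands for the root [ ].
P1V = {"": "1001", "01": "0010", "10": "1101",
       "01.00": "1110", "01.11": "0010", "10.10": "1101"}
PNZV = {"": "1111", "01": "1011", "10": "1111",
        "01.00": "1110", "01.11": "0010", "10.10": "1101"}


def xor_bits(a, b):
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def not_bits(a):
    return "".join("0" if x == "1" else "1" for x in a)


def child_qid(parent, label):
    return f"{parent}.{label}" if parent else label


# PM = PNZ XOR P1: a child is mixed when it is non-zero but not pure-1.
PMV = {q: xor_bits(PNZV[q], P1V[q]) for q in P1V}
# PNZ = P0', so P0 is the complement of PNZ.
P0V = {q: not_bits(v) for q, v in PNZV.items()}


def count_ones(p1v, qid="", side=8):
    """Root count from P1V alone: pure-1 children contribute their full area,
    children with their own table entry are mixed and are recursed into,
    and everything else is pure-0."""
    half = side // 2
    total = 0
    for label, bit in zip(LABELS, p1v[qid]):
        child = child_qid(qid, label)
        if bit == "1":
            total += half * half
        elif child in p1v:
            total += count_ones(p1v, child, half)
    return total


mixed = {child_qid(q, label)
         for q, vector in PMV.items()
         for label, bit in zip(LABELS, vector) if bit == "1"}
print(sorted(mixed))
print(count_ones(P1V))
```

The set of mixed children derived from PM is exactly the set of non-root qids in the table, which is why no subtree pointers are needed.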

Basic, Value and Tuple Ptrees
- Basic Ptrees (e.g., P_11, P_12, …, P_18, P_21, …, P_28, …, P_71, …, P_78)
- Value Ptrees, obtained by ANDing basic Ptrees or their complements (e.g., P_1,001 = P_11' AND P_12' AND P_13)
- Tuple Ptrees, obtained by ANDing value Ptrees (e.g., P_001,010,111 = P_1,001 AND P_2,010 AND P_3,111)
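The logic of the AND can be illustrated on plain, uncompressed bit vectors. This sketch is not the compressed P-tree AND itself; it only shows which bits are combined. The 3-bit values for the two bands and the function names are invented for the example.

```python
def basic_bit_vectors(values, bits=3):
    """One bit vector per bit position, high-order bit first."""
    return [[(v >> (bits - 1 - j)) & 1 for v in values] for j in range(bits)]


def value_mask(planes, pattern):
    """AND each plane (for a '1' in the pattern) or its complement (for a '0').

    For pattern '001' this computes P_1' AND P_2' AND P_3.
    """
    mask = [1] * len(planes[0])
    for plane, want in zip(planes, pattern):
        mask = [m & (b if want == "1" else 1 - b) for m, b in zip(mask, plane)]
    return mask


band1 = [1, 5, 1, 7, 0, 1]  # hypothetical 3-bit values, one per pixel
band2 = [2, 2, 3, 2, 2, 2]

p_1_001 = value_mask(basic_bit_vectors(band1), "001")  # pixels where band 1 == 001
p_2_010 = value_mask(basic_bit_vectors(band2), "010")  # pixels where band 2 == 010

# Tuple mask: pixels where (band 1, band 2) == (001, 010).
p_tuple = [a & b for a, b in zip(p_1_001, p_2_010)]
print(sum(p_1_001), sum(p_tuple))  # root counts of the value and tuple masks
```

The sum of a mask plays the role of the root count of the corresponding value or tuple Ptree.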

Distributed P trees?
Vector-form entries of three basic Ptrees (columns: qid, NZ, P1; bit vectors shown where legible):
- P_11: [ ], [01], [10], [01.00] 1110, [01.11] 0010, [10.10] 1101
- P_12: [ ], [10], [10.11] 0111
- P_13: [ ], [01], [10], [01.11] 0110, [10.00] 1000
Assume a 5-computer cluster: NodeC, Node 00, Node 01, Node 10, Node 11. Send an entry to Node ij if its qid ends in ij:
- NodeC: [ ], [ ], [ ]
- Node 00: [01.00], [10.00] 1000
- Node 01: [01], [01]
- Node 10: [10], [10.10], [10], [10]
- Node 11: [01.11], [10.11], [01.11] 0110
A data mining request involves a series of multicast invocations and at most one unicast reply for each receiving node.
A distributed genomic data mining federation of Beowulf clusters? Each node computes only a tiny portion of the necessary count information, then sends it to the requesting node?
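The routing rule is simple enough to state as code. In this sketch the root entry (empty qid) is assumed to live on NodeC, and the function names are hypothetical; the qids are those listed for P11 in the vector tables.

```python
def node_for(qid):
    """Route a vector-form entry by the last two-bit segment of its qid.

    The root entry (empty qid) is assumed to be kept on NodeC.
    """
    if qid == "":
        return "NodeC"
    return "Node " + qid.split(".")[-1]


def partition(qids):
    """Group qids by the node that would store them."""
    placement = {}
    for q in qids:
        placement.setdefault(node_for(q), []).append(q)
    return placement


p11_qids = ["", "01", "10", "01.00", "01.11", "10.10"]
placement = partition(p11_qids)
for node in sorted(placement):
    print(node, placement[node])
```

Because the node is determined by the qid alone, a requester can address the right node for any quadrant without consulting a directory.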

Non-ARM Ptree-based Microarray data mining methods
bSQ format: bit files of intervalized, normalized red/green ratios for each microarray.
Ptree format: one P-tree for each bit position of each bSQ file (e.g., the high-order bit).
[Figure: example count tree with root count 55 (depth=0, level=3); children 16, 8, 15, 16 (depth=1, level=2); further levels at depth=2 (level=1) and depth=3 (level=0).]
Methods:
- Hierarchical clustering: agglomerative, divisive
- Non-hierarchical clustering: K-clustering, PCA, SOM
- Supervised learning or classification: SVM, decision trees, KNN

A plan
[Diagram labels: Temporal Gene Exp. Analysis; Spatial Gene Exp. Analysis; Genotypic Gene Exp. Analysis; Data Repository (bSQ, Ptrees); Development of Data Mining Tools; User; JAVA Graphical Interface; SQL, XML; Other Microarray Data Repositories (Stanford, EMBL, SGDB).]

Data Mining in Genomics: Conclusion
- Application areas with huge raw data stores, such as Market Basket Research, Remotely Sensed Imagery, and Genomics (Proteomics?, Transcriptomics, Metabolomics?), are remarkably similar in terms of data and data mining needs. There should be more collaboration across applications.
- In these application areas, data cube rotation can open data mining possibilities.
- We suggest a universal data structure (GREASMN Table and P-trees) striped across a wide federation of computer nodes, using P-tree technology to facilitate data mining and eliminate barriers introduced by scale limitations, incompatible data formats, etc.