Peano Count Trees and Association Rule Mining for Gene Expression Profiling using DNA Microarray Data
Dr. William Perrizo, Willy Valdivia, Dr. Edward Deckard, Francis Larson; North Dakota State University
{william.perrizo, willy.valdivia, edward.deckard,
Patents pending on bSQ and Ptree technology
The Problem
There is a lot of data available today (e.g., gene expression data), but too little information. Data mining attempts to reduce raw data to information for decision support.
[Diagram: raw data (gigs, teras, petas, exas…) is reduced by data mining to decisions (often 1 bit: Y/N, T/F, Do/Don’t_do, i.e., 0/1).]
– Data mining: Classification (supervised learning), Clustering (unsupervised learning), Association Rule Mining (ARM)
– Also shown: Statistics, Machine Learning, Data Structuring, Signal Processing
A Solution?
Currently the predominant method employed in bioinformatics is clustering (with a little classification) on isolated microarray datasets.
Needed? A data mining software suite able to:
– transform copies of pertinent data from a variety of databases into a data-mining-ready form in real time (our solution based on P-trees?). “Transform copies” rather than “standardize”, since standardization rarely works! There will always be an MS (and I don’t mean Martha Stewart) to frustrate or destroy the standardization effort.
– facilitate Association Rule Mining, Clustering, and Classification in a uniform way (so data mining results from other areas can be used).
Bioinformatics: a Walmart or a Kmart?!?
– Walmart took DM seriously (an early, comprehensive approach borrowing useful techniques from a variety of application areas).
– Kmart? Too little, too late.
Using data mining techniques developed for other application areas in bioinformatics?
[Figures: TIFF image; Yield Map]
Remotely Sensed Images (RSI) can be viewed as collections of pixels. Each pixel has a value for each feature attribute. For example, the RSI dataset shown has 1320 rows and 1320 columns of pixels (1,742,400 pixels) and 4 feature attributes (Red, Green, Blue, Yield). The (R, G, B) feature bands are in the TIFF image, and the Y feature is color coded in the Yield Map.
Microarray or DNA chip data is not much different (multiple attributes corresponding to treatments or conditions). Much data mining (ARM) has been done on RSI data. Can it be useful in bioinformatics?
Regulation Pathway Discovery is not very different from Market Basket Research (à la Walmart)
The results of clustering microarray data may indicate that genes 1 to 9 are involved in a regulation pathway. High-confidence rule mining on that cluster can discover the relationships among those genes: e.g., the expression of one gene, Gene2, might be discovered to be regulated by Genes 1, 3, 5, 6, 8, and 9, while Gene4 and Gene7 may not be directly regulating Gene2 and can therefore be excluded.
[Diagram: Clustering groups Gene1 through Gene9; ARM then relates Gene2 to Gene1, Gene3, Gene5, Gene6, Gene8, and Gene9, with Gene4 and Gene7 set apart.]
ARM for Microarray Data
A gene regulatory pathway component can be represented as an association rule, {G1, …, Gn} → Gm, where {G1, …, Gn} is the antecedent and Gm is the consequent.
Microarray data is most often represented as a relation G(Gid, T1, …, Tn), where Gid is the gene identifier, T1, …, Tn are the treatments (or conditions), and the data values represent gene expression levels. Call this the “Gene Table”.
Currently, data mining techniques concentrate on the Gene Table: specifically, on finding clusters of genes that exhibit similar expression patterns under selected treatments (clustering the Gene Table).
[Table: Gene Table. Rows: Gene-ID (G1, G2, G3, G4, …). Columns: Trmt-ID (T1, T2, T3, T4). Cells: gene expression values.]
ARM for Microarray Data (Contd.)
An alternate data format exists, called the “Treatment Table”: T(Tid, G1, G2, …, Gn), where Tid is the treatment identifier and G1, …, Gn are the gene identifiers.
The Treatment Table provides a convenient form for ARM of gene expression levels. The goal is to mine for rules among genes by associating Treatment Table columns.
[Table: Treatment Table. Rows: TrtmtID (T1, T2, T3, T4, …). Columns: GeneID (G1, G2, G3, G4). Cells: gene expression values.]
The form of the Treatment Table with binary values (coding only whether an expression level exceeds or does not exceed a threshold) is identical to Market Basket Data, for which a wealth of rule mining techniques has been developed in the last 8 years.
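The market-basket view of a binarized Treatment Table can be sketched as follows. This is a minimal illustration, not the authors' P-tree-based method: the expression values, the threshold of 1.0, and the helper names `binarize`, `support`, and `confidence` are all assumptions made up for the example. Each treatment becomes a "basket" of the genes expressed above the threshold, and a rule such as {G1} → G2 is scored by the usual support and confidence measures.

```python
from typing import Dict, List, Set

# Hypothetical Treatment Table: one row per treatment, one column per gene.
# The expression values and the threshold are invented for illustration.
treatment_table: Dict[str, Dict[str, float]] = {
    "T1": {"G1": 2.1, "G2": 1.8, "G3": 0.2},
    "T2": {"G1": 1.9, "G2": 1.6, "G3": 1.4},
    "T3": {"G1": 0.3, "G2": 0.1, "G3": 1.7},
    "T4": {"G1": 2.4, "G2": 0.4, "G3": 0.5},
}
THRESHOLD = 1.0


def binarize(table: Dict[str, Dict[str, float]], threshold: float) -> List[Set[str]]:
    """Turn each treatment row into the set of genes expressed above the threshold."""
    return [{g for g, v in row.items() if v > threshold} for row in table.values()]


def support(baskets: List[Set[str]], items: Set[str]) -> float:
    """Fraction of treatments whose basket contains every gene in `items`."""
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(baskets: List[Set[str]], antecedent: Set[str], consequent: Set[str]) -> float:
    """support(antecedent and consequent together) / support(antecedent).

    Assumes the antecedent occurs in at least one basket.
    """
    return support(baskets, antecedent | consequent) / support(baskets, antecedent)


baskets = binarize(treatment_table, THRESHOLD)
rule_support = support(baskets, {"G1", "G2"})          # rule {G1} -> G2
rule_confidence = confidence(baskets, {"G1"}, {"G2"})
print(rule_support, rule_confidence)
```

With these made-up values, G1 and G2 are jointly over-expressed in two of the four treatments, and G2 is over-expressed in two of the three treatments where G1 is.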
[Tables: Treatment Table (rows T1, T2, T3, T4, …; columns G1, G2, G3, G4, …) and Gene Table (rows G1, G2, G3, G4, …; columns T1, T2, T3, T4, …).]
The Gene Table is usually given as a standard (MS Excel) spreadsheet of gene expression levels coming from microarray experiments. It is a 2-D data cube which can be rotated (to the Treatment Table), rolled up, sliced, diced, drilled down, association rule mined, etc.
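Rotating the 2-D cube amounts to a transpose. The sketch below assumes a tiny in-memory Gene Table held as a dictionary of lists; the values and the function name `rotate` are hypothetical and only illustrate the change of orientation.

```python
from typing import Dict, List

# Hypothetical Gene Table: rows are genes, columns are treatments T1..T3.
treatments: List[str] = ["T1", "T2", "T3"]
gene_table: Dict[str, List[float]] = {
    "G1": [2.1, 1.9, 0.3],
    "G2": [1.8, 1.6, 0.1],
}


def rotate(table: Dict[str, List[float]], columns: List[str]) -> Dict[str, Dict[str, float]]:
    """Rotate the Gene Table so that treatments become rows and genes become columns."""
    genes = list(table)
    return {t: {g: table[g][i] for g in genes} for i, t in enumerate(columns)}


treatment_table = rotate(gene_table, treatments)
print(treatment_table["T2"])
```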
What are Peano Trees? First, what are the Spatial Data Formats?
[Figure: example image with two bands, BAND 1 and BAND 2; pixel values not recoverable.]
– Band SeQuential (BSQ) format (2 files): Band 1 in one file, Band 2 in the other.
Spatial Data Formats (Cont.)
– BSQ format (2 files)
– Band InterLeaved by Line (BIL) format
Spatial Data Formats (Cont.)
– BSQ format (2 files)
– BIL format (1 file)
– Band Interleaved by Pixel (BIP) format (1 file)
Spatial Data Formats (Cont.)
– BSQ format (2 files)
– BIL format (1 file)
– BIP format (1 file)
– bit SeQuential (bSQ) format (16 files; related to bit planes in graphics): one file per bit position of each band, B11 through B18 for Band 1 and B21 onward for Band 2.
Reasons for using bSQ format:
– Different bits contribute to the value differently.
– bSQ format facilitates representation of a precision hierarchy (1-bit, 2-bit, …, n-bit precision).
– bSQ format facilitates the creation of an efficient P-tree data structure and P-tree algebra.
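A rough sketch of the bSQ idea, under stated assumptions: values are 8-bit integers, bit files are ordered high-order bit first, and the names `to_bsq_bits` and `from_bsq_bits` are hypothetical helpers rather than part of any bSQ tool. It shows how one band splits into eight bit files and how keeping only the first few files gives a lower-precision view of the same data.

```python
from typing import List, Optional


def to_bsq_bits(band: List[int], n_bits: int = 8) -> List[List[int]]:
    """Split one band into n_bits bit files, high-order bit first (assumed ordering)."""
    return [[(v >> (n_bits - 1 - j)) & 1 for v in band] for j in range(n_bits)]


def from_bsq_bits(bit_files: List[List[int]], precision: Optional[int] = None) -> List[int]:
    """Rebuild values from the first `precision` bit files (all of them by default)."""
    n_bits = len(bit_files)
    used = n_bits if precision is None else precision
    values = [0] * len(bit_files[0])
    for j in range(used):
        weight = 1 << (n_bits - 1 - j)  # contribution of bit position j
        values = [v + weight * b for v, b in zip(values, bit_files[j])]
    return values


band1 = [254, 127, 14, 193]          # made-up 8-bit pixel values
bits = to_bsq_bits(band1)
print(bits[0])                       # the high-order bit file
print(from_bsq_bits(bits, 2))        # 2-bit precision view
print(from_bsq_bits(bits) == band1)  # full precision is lossless
```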
BSQ and bSQ formats
– BSQ and bSQ are “tabular” formats:
  » BSQ consists of a separate table for each band (e.g., Gene or Treatment).
  » bSQ consists of a separate table for each bit of each band.
– One can view it this way: the data set is initially one relation or table, R(K1, …, Kk, A1, A2, …, An), where K1, …, Kk are structure attributes and each Ai is a feature attribute.
– Structure attributes of an RSI are the X and Y coordinates (one could put the same structure on the Gene Table, but I want to focus on the Treatment Table).
– Structure attributes of the Treatment Table might be a collection of treatment dimensions, based on the MIAME standard (Minimum Information About a Microarray Experiment):
  » Experimental design
  » Array design
  » Samples
  » Hybridisations
  » Measurements
  » Normalization Control
A Universal Format?
E.g., one large universal table with 5 dimensions based on the MIAME standard, for data mining across all treatments and genes?
– E = Experimental design, Hybridisation Procedures
– A = Array design
– S = Samples
– M = Measurements
– N = Normalization Control
"GREASMN" (5-D Universal Gene Expression Cube)
[Table: one row per Tid, a combination (E,A,S,M,N)1, (E,A,S,M,N)2, …, (E,A,S,M,N)m; one column per gene G1, G2, …, Gn (Gene-Rep); cells hold gene expression values.]
Cardinality is high, but compression will be substantial (next slide).
GREASMN datacube rolled up onto (E,S)
[Figure: grid with axes S (Organism..), S1, S2, …, Sn, and E (Lab…), E1, E2, …, En; labels: Yeast, zeros.]
The non-zero blocks may occur off the diagonal.
The Point: Massive but very sparse dataset!
Peano Count Tree (P-tree)
– A P-tree represents spatial bSQ data bit-by-bit in a recursive quadrant-by-quadrant arrangement.
– A P-tree is a lossless, compressed, data-mining-ready representation of the data:
  » partially run-length compressed using the structure attributes;
  » “count pre-computed”.
An example of Peano Count tree
Given a bSQ file, Bij (shown in its spatial positions), we create its basic PC-tree, Pij.
Terms illustrated: Peano or Z-ordering; Pure (Pure-1/Pure-0) quadrant; Root Count; Level; Fan-out; QID (Quadrant ID).
[Figure: bit file in spatial layout; not recoverable.]
An example of PC-tree
Terms illustrated: Peano or Z-ordering; Pure (Pure-1/Pure-0) quadrant; Root Count; Level; Fan-out; QID (Quadrant ID).
[Figure: PC-tree with levels labeled from Level-3 down to Level-0; one annotation reads (7, 1) = (111, 001) in binary.]
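Since the example figure did not survive, here is a minimal sketch of how a basic Peano count tree could be built. The 8x8 bit file `BITS` is an illustrative assumption, chosen so that its counts agree with the numbers that do appear elsewhere in this deck (root count 55; quadrant counts 16, 8, 15, 16). The class `PNode` and function `build` are hypothetical names, and the code recounts bits at every level for clarity rather than speed.

```python
from typing import List, Optional

# Illustrative 8x8 bit file (an assumption, not recovered from the slide figure).
BITS = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


class PNode:
    def __init__(self, count: int, size: int, children: Optional[List["PNode"]] = None):
        self.count = count        # number of 1-bits in this quadrant
        self.size = size          # side length of the quadrant
        self.children = children  # None for pure quadrants (and single bits)


def build(bits: List[List[int]], row: int = 0, col: int = 0, size: Optional[int] = None) -> PNode:
    """Recursively build a count tree over quadrants taken in Peano (Z) order."""
    size = len(bits) if size is None else size
    count = sum(bits[r][c] for r in range(row, row + size) for c in range(col, col + size))
    if count == 0 or count == size * size:
        return PNode(count, size)  # pure-0 or pure-1: no subtree needed
    half = size // 2
    # Z-order: upper-left (00), upper-right (01), lower-left (10), lower-right (11)
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    children = [build(bits, row + dr, col + dc, half) for dr, dc in offsets]
    return PNode(count, size, children)


root = build(BITS)
print(root.count, [c.count for c in root.children])
```

Pure quadrants stop the recursion, which is where the compression comes from: the two pure-1 quadrants of this example are each stored as a single node with count 16.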
Alternative forms for Ptrees (all lossless)
– P1 (root bit 0): 1 means quadrant is pure-1, 0 otherwise (pure-0 if no sub-tree pointers, otherwise mixed).
– P0 (root bit 0): 1 means quadrant is pure-0, 0 otherwise.
– PNZ (= P0’, root bit 1): 1 means quadrant is Not pure-Zero, 0 otherwise. (Note: PM = PNZ XOR P1.)
[Figures: tree diagrams of P1, P0, and PNZ; not recoverable.]
Vector forms: a table entry for each mixed inode containing its qid and its children bit-vector; this eliminates the need for subtree pointers.

qid      P1V   P0V   PNZV
[ ]      1001  0000  1111
[01]     0010  0100  1011
[10]     1101  0000  1111
[01.00]  1110  0001  1110
[01.11]  0010  1101  0010
[10.10]  1101  0010  1101

Since there is no qid = [01.01] in the table, we know it is pure-0, not mixed.
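The vector forms can be exercised directly. The sketch below copies the P1V and PNZV entries from this slide into dictionaries (the root qid [ ] is written as the empty string) and assumes an 8x8 image, which matches three levels of quadrants; the function names are hypothetical. It recovers the mixed vector PM = PNZ XOR P1 and the 1-bit count of any quadrant without subtree pointers.

```python
from typing import Dict

# qid -> children bit-vector, copied from the slide; "" stands for the root qid [ ].
P1V: Dict[str, str] = {
    "": "1001", "01": "0010", "10": "1101",
    "01.00": "1110", "01.11": "0010", "10.10": "1101",
}
PNZV: Dict[str, str] = {
    "": "1111", "01": "1011", "10": "1111",
    "01.00": "1110", "01.11": "0010", "10.10": "1101",
}


def mixed_vector(qid: str) -> str:
    """PM = PNZ XOR P1: a child is mixed if it is not pure-zero and not pure-1."""
    return "".join("1" if a != b else "0" for a, b in zip(PNZV[qid], P1V[qid]))


def count(qid: str = "", side: int = 8) -> int:
    """Count the 1-bits under `qid`, whose quadrant has the given side length."""
    half = side // 2
    total = 0
    for i, pure1 in enumerate(P1V[qid]):
        child = format(i, "02b")
        child_qid = child if qid == "" else qid + "." + child
        if pure1 == "1":
            total += half * half             # pure-1 child: every bit is 1
        elif child_qid in P1V:
            total += count(child_qid, half)  # mixed child: look up its own entry
        # otherwise the child has no table entry and is not pure-1, so it is pure-0
    return total


print(mixed_vector(""), count())
```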
Basic, Value and Tuple Ptrees
– Basic Ptrees (i.e., P11, P12, …, P18, P21, …, P28, …, P71, …, P78)
– Value Ptrees (i.e., P1,001 = P11’ AND P12’ AND P13), obtained by ANDing basic Ptrees or their complements
– Tuple Ptrees (i.e., P001,010,111 = P1,001 AND P2,010 AND P3,111), obtained by ANDing value Ptrees
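The AND formulas can be illustrated with flat bit vectors standing in for the compressed trees. This is a deliberate simplification and an assumption of the sketch: a real P-tree AND works quadrant by quadrant on the tree structure, whereas the code below uses plain Python lists. The attribute values, the 3-bit width, and the helper names are made up for the example.

```python
from typing import List


def bit_planes(values: List[int], n_bits: int = 3) -> List[List[int]]:
    """Basic bit vectors for one attribute, high-order bit first (uncompressed stand-ins)."""
    return [[(v >> (n_bits - 1 - j)) & 1 for v in values] for j in range(n_bits)]


def complement(p: List[int]) -> List[int]:
    return [1 - b for b in p]


def and_all(*vectors: List[int]) -> List[int]:
    return [int(all(bits)) for bits in zip(*vectors)]


def value_vector(planes: List[List[int]], pattern: str) -> List[int]:
    """AND each basic vector, complemented wherever the pattern bit is 0."""
    terms = [p if bit == "1" else complement(p) for p, bit in zip(planes, pattern)]
    return and_all(*terms)


attr1 = [1, 5, 1, 7, 0, 1]  # made-up 3-bit values of attribute 1
attr2 = [2, 2, 3, 2, 2, 2]  # made-up 3-bit values of attribute 2

p1_001 = value_vector(bit_planes(attr1), "001")  # P11' AND P12' AND P13
p2_010 = value_vector(bit_planes(attr2), "010")  # P21' AND P22 AND P23'
tuple_vector = and_all(p1_001, p2_010)           # analogue of a tuple Ptree
root_count = sum(tuple_vector)                   # number of rows equal to (001, 010)
print(p1_001, p2_010, root_count)
```

The final sum plays the role of the root count of the tuple Ptree: it is the number of rows holding exactly that combination of values, which is the quantity ARM needs for support counting.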
Distributed P trees?
[Tables: vector-form tables (columns qid, NZ, P1) for three basic Ptrees; most vector values are not recoverable, and surviving values are shown in parentheses.]
– P11: qids [ ], [01], [10], [01.00] (1110), [01.11] (0010), [10.10] (1101)
– P12: qids [ ], [10], [10.11] (0111)
– P13: qids [ ], [01], [10], [01.11] (0110), [10.00] (1000)
Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11. Send an entry to Node ij if its qid ends in ij:
– NodeC: the three root entries [ ]
– Node00: [01.00], [10.00]
– Node01: [01], [01]
– Node10: [10], [10.10], [10], [10]
– Node11: [01.11], [10.11], [01.11]
A data mining request involves a series of multicast invocations and at most one unicast reply for each receiving node.
A distributed genomic data mining federation of Beowulf clusters? Each node computes only a tiny portion of the necessary count information, then sends it to the requesting node?
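The routing rule can be sketched as a small function. The sketch assumes that the root entry, whose qid is empty, is kept on NodeC, which is what the node tables on this slide appear to show; `node_for` and `distribute` are hypothetical names, and only the qids (not the lost vector values) are used.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# qids per basic Ptree as listed on the slide; "" stands for the root qid [ ].
QIDS: Dict[str, List[str]] = {
    "P11": ["", "01", "10", "01.00", "01.11", "10.10"],
    "P12": ["", "10", "10.11"],
    "P13": ["", "01", "10", "01.11", "10.00"],
}


def node_for(qid: str) -> str:
    """Pick the node from the last two-bit step of the qid (root goes to NodeC, by assumption)."""
    if qid == "":
        return "NodeC"
    return "Node" + qid.split(".")[-1]


def distribute(tables: Dict[str, List[str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Group every (Ptree name, qid) entry by the node that should store it."""
    placement: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for name, qids in tables.items():
        for qid in qids:
            placement[node_for(qid)].append((name, qid))
    return dict(placement)


placement = distribute(QIDS)
for node in sorted(placement):
    print(node, placement[node])
```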
Non-ARM Ptree-based Microarray data mining methods
– bSQ format: bit files of intervalized, normalized red/green ratios for each microarray.
– Ptree format: one P-tree for each bit position of each bSQ file (e.g., the high-order bit).
[Figure: example P-tree with root count 55 at depth=0, level=3; quadrant counts 16, 8, 15, 16 at depth=1, level=2; the nodes at depth=2, level=1 and depth=3, level=0 are not recoverable.]
Methods:
– Hierarchical Clustering: Agglomerative, Divisive
– Non-Hierarchical Clustering: K-clustering, PCA, SOM
– Supervised Learning or Classification: SVM, Decision Trees, KNN
A plan
[Diagram components:]
– Other Microarray Data Repositories: Stanford, EMBL, SGDB
– SQL, XML
– Data Repository: bSQ, Ptrees
– Development of Data Mining Tools: Temporal Gene Exp. Analysis, Spatial Gene Exp. Analysis, Genotypic Gene Exp. Analysis
– JAVA Graphical Interface
– User
Data Mining in Genomics: Conclusion
– Application areas with huge raw data stores, such as Market Basket Research, Remotely Sensed Imagery, and Genomics (Proteomics?, Transcriptomics, Metabolomics?), are remarkably similar in terms of data and data mining needs. There should be more collaboration across applications.
– In these application areas, data cube rotation can open data mining possibilities.
– We suggest a universal data structure (GREASMN Table and P-trees) striped across a wide federation of computer nodes, using P-tree technology to facilitate data mining and eliminate barriers introduced by scale limitations, incompatible data formats, etc.