DataSURG (Database Systems Users and Research Group)

Presentation transcript:

Data Mining and Data Warehousing, many-to-many Relationships, applications DataSURG (Database Systems Users and Research Group) North Dakota State University Fargo, North Dakota USA dataSURG.ite@ndsu.nodak.edu

Data Mining on a Data Warehouse vs. Transaction Processing on a Data Base (Workload on Repository Question)
C.J. Date, circa 1980: transactions on a DBMS vs. file processing programs on file systems. “Use a DBMS instead of file systems! Unify data resources, centralize control, promote standards and consistency, eliminate redundancy, increase data value and usage, yadda, yadda.”
Circa 1990: “Buy a separate DW for DM” (separate from your DBMS for TP). 2 separate, quite redundant, non-sharing, inconsistent systems!
What happened? Great marketing success! (sold more hardware and software) Great Concurrency Control R&D failure! We failed to integrate transactions and queries (OLTP and OLAP, i.e., updates and reads) in one system with acceptable performance! The marketing was so successful, nobody noticed the failure!

OUTLINE
I still hold out hope that DW and DB will eventually be unified again. I believe eventually the industry will demand it. Already, there’s work to update DWs! For now let’s just focus on DATA.
I consider Data Mining (DM) to be on the unstructured side of querying. And on that side, you run up against two curses immediately:
Curse of non-scalability (solutions don’t scale with data volume).
Curse of dimensionality (solutions don’t scale with data dimension).
I will talk about techniques we use to address these curses:
Horizontal processing of vertically structured data (instead of the ubiquitous vertical processing of horizontal data, i.e., the record orientation).
Parallelize the DM engine: parallelize the software DM engine on clusters of computers, and parallelize the greyware DM engine on clusters of people (i.e., browser-enable all software for visual data mining).

The DataSURG DM Architecture
[Diagram components: YOUR DATA; the Internet; DCI (Data Capture Interface); the Ptree Repository (lossless, compressed, vertically-structured replicas); DMI (Data Mining Interface); data mining algorithms (yours/ours); a PREDICATE and the count of objects satisfying the PREDICATE.]

Data mining finds information in data. Why do we need Data Mining? Data volume expands by Parkinson’s Law: data volume expands to fill available data storage. Disk storage expands by Moore’s law: capacity ∝ 2^(t / 9 months), i.e., available storage doubles every 9 months!
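To make the doubling claim concrete, here is a small Python sketch of the growth factor it implies. The function name and the 9-month default are illustrative assumptions taken from the slide's figure, not a measured constant.

```python
def capacity_factor(months: float, doubling_months: float = 9.0) -> float:
    """Growth factor after `months` if capacity doubles every `doubling_months`.

    The 9-month default is the doubling period quoted on the slide.
    """
    return 2.0 ** (months / doubling_months)


print(capacity_factor(9))    # 2.0: one doubling period
print(capacity_factor(120))  # ten years: a factor of roughly ten thousand
```

Under this assumption, ten years of growth multiplies available storage by about four orders of magnitude, which is the pressure behind the projected data sizes that follow.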

We’re awash with data!
Network data: hi-speed, DWDM, All-opt (mgmt, flow classif’n, QoS, security) (10 terabytes by 2003 ~ 10^13 B).
US EROS Data Center (EDC) archives Earth Observing System (EOS) Remotely Sensed Imagery (RSI), satellite and aerial photo data (10 petabytes by 2005 ~ 10^16 B).
National Virtual Observatory (aggregated astronomical data) (10 exabytes by 2010 ~ 10^19 B).
Sensor data from sensors (including micro- and nano-sensor networks) (10 zettabytes by 2015 ~ 10^22 B).
WWW (and other text collections) (10 yottabytes by 2020 ~ 10^25 B).
Genomic/Proteomic/Metabolomic data (microarrays, genechips, genome sequences) (10 gazillabytes by 2030 ~ 10^28 B?). I had to make up this name! Projected data sizes are overrunning our ability to name those sizes!
Stock market prediction data (prices + all the above? especially astronomy data?) (10 supragazillabytes by 2040 ~ 10^31 B?).
Useful information must be teased out of these large volumes of data through data mining.

More’s Law: More’s Less. The more volume, the less information. (AKA: Shannon’s Canon)
A simple illustration: Which phone book has more info? (both involve the same 4 data granules)
BOOK-1                BOOK-2
Name   Number         Name   Number
Smith  234-9816       Smith  234-9816
Jones  231-7237       Smith  231-7237
                      Jones  234-9816
                      Jones  231-7237
Data mining reduces volume and raises the information level.

Precision Agriculture Data Mining
The dataset consists of an aerial photograph (TIFF image taken during the growing season) and a synchronized yield map (crop yield taken at harvest). Altogether there are 4 feature attributes (B, G, R, Y) and ~100,000 pixels. [Figures: TIFF image; Yield Map.]
A producer wants to know the relationship between the color intensities and yield. One hypothesis, the Association Rule hi_green and low_red → hi_yield, is intuitive and could be made and verified without data mining (simple querying). Data mining has found a stronger rule, hi_NIR and low_red → very_hi_yield. So many producers use VIR instead of RGB cameras to get the better information.

Another Precision Agriculture Data Mining Example: Grasshopper Infestation Prediction (again involving RSI data)
Grasshoppers cause significant economic loss each year. Early infestation prediction is key to damage control. Pixel classification on remotely sensed imagery holds significant promise for achieving early detection. Pixel classification (signaturing) has many applications: pest detection, forest fire detection, wetlands monitoring, … (for signaturing we developed the SMILEY software/greyware system, http://midas.cs.ndsu.nodak.edu/~smiley).

Sensor Network Data Mining
Micro- and nano-scale sensor blocks are being developed for sensing: biological agents, chemical agents, motion, coatings deterioration, RF-tagging of inventory, structural materials fatigue.
There will be trillions++ of individual sensors creating mountains of data. The data must be mined for its information.

Sensor Network Application: CubE for Active Situation Replication (CEASR)
Situation space (with nano-sensors).
Operational Capability: Using Alien Technology’s Fluidic Self-Assembly (FSA) technology, clear layers (with embedded nano-LED elements) laminated together produce a visualization cube (a nano-LED at each pixel corresponding to a nano-sensor at each pixel in the situation space). Nano-sensors turn on CEASR display nano-LEDs when a threshold level (chemical, vibrational, biological, thermal, …) is exceeded.
Proposed Technical Approach: Sense chemical, vibrational, biological and thermal levels in real time. The problems to be solved include: communication between sensor field(s) and CEASR; nano-sensor position registration; Fluidic Self-Assembly (FSA) of the cube. FSA is an Alien Technology patented process capable of producing clear flexible substrates with embedded nano-LED display units. Each energized nano-sensor must transmit a ping together with its location. These locations are then translated to 3-dimensional coordinates at the display. The corresponding voxel on the display lights up. A more sophisticated CEASR device would sense and transmit the intensity levels, lighting up the display voxel with the same intensity.

Anthropology Application: Digital Archive Network for Anthropology (DANA) (data mine anthropological artifacts: shape, color, discovery location, …)

Astronomy Application: The celestial sphere RA dec

Data Mining? Querying is asking specific questions and expecting specific answers. Data Mining is going into the MOUNTAIN of DATA and returning with information gems. But also, some fool’s gold? Relevance and interestingness analysis serves to assay those information and knowledge gems.

Data Mining Process
Data mining: the core of the knowledge discovery process. The stages, from raw data to result:
Mountain of Raw Data
Data Cleaning/Integration: missing data, outliers, noise, errors
Data Warehouse: cleaned, integrated, read-only, periodic, historical raw database
Selection: feature extraction, tuple selection
Task-relevant Data
Data Mining: OLAP, Classification, Clustering, ARM
Pattern Evaluation and Assay

Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
Standard querying: SQL SELECT FROM WHERE; complex queries (nested, EXISTS, ...).
Searching and aggregating: FUZZY queries, search engines, BLAST searches; OLAP (rollup, drilldown, slice/dice, ...).
Machine learning and data mining: supervised learning (classification, regression); unsupervised learning (clustering); association rule mining; data prospecting; fractals, …
On the Query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record ’02). On the Data Mining end, the surface has barely been scratched. But even those scratches had a great impact: one of the early scratchers became the biggest corporation in the world last year, and a non-scratcher filed for bankruptcy (Walmart vs. KMart).

Our Approach
Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees (Ptrees in either case) [1], processed horizontally (DBMSs process horizontal data vertically). Ptrees are data-mining-ready, compressed data structures, which attempt to address the curse of scalability and the curse of dimensionality.
A compressed, OLAP-ready data warehouse structure, the Pcube [1], which facilitates OLAP and querying using Ptrees.
[1] Technology is patent pending by North Dakota State University.

A table, R(A1..An), is a horizontal structure (a set of horizontal records) that is normally processed vertically (vertical scans). Ptrees instead vertically partition the table, compress each vertical bit slice into a basic Ptree, and horizontally process these Ptrees using one multi-operand logical AND operation.
R( A1  A2  A3  A4 )
   010 111 110 001
   011 111 110 000
   010 110 101 001
   010 111 101 111
   101 010 001 100
   010 010 001 101
   111 000 001 100
   111 000 001 100
Vertical bit slices: R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 (Rij is bit j of attribute Ai, one bit per record), e.g., R11 = 0 0 0 0 1 0 1 1. Each slice Rij is compressed into a basic Ptree Pij.
1-Dimensional Ptrees are built by recording the truth of the predicate “pure 1” recursively on halves, until there is purity. For P11:
1. Whole file is not pure1 → 0
2. 1st half is not pure1 → 0
3. 2nd half is not pure1 → 0
4. 1st half of 2nd half is not pure1 → 0
5. 2nd half of 2nd half is pure1 → 1
6. 1st half of 1st half of 2nd half is pure1 → 1
7. 2nd half of 1st half of 2nd half is not pure1 → 0
E.g., to count the occurrences of 111 000 001 100, use “pure111000001100”: P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43. The result tree is 0 at the 2^3-level, 0 0 at the 2^2-level and 0 1 at the 2^1-level, so the count is 1 x 2^1 = 2.
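The slicing, halving and counting steps of this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the DataSURG implementation: the function names (bit_slices, p1tree, levels, count_pattern) are hypothetical, the tree is held as nested tuples rather than a compressed structure, the count is taken by ANDing raw bit slices rather than trees, and the table is assumed to have 8 records with the last record occurring twice (as implied by the three halving levels and the count of 2).

```python
# Table R(A1 A2 A3 A4): 8 records, 4 attributes of 3 bits each (assumed layout).
ROWS = [
    "010 111 110 001",
    "011 111 110 000",
    "010 110 101 001",
    "010 111 101 111",
    "101 010 001 100",
    "010 010 001 101",
    "111 000 001 100",
    "111 000 001 100",
]


def bit_slices(rows):
    """Return {(i, j): bits}, where slice (i, j) holds bit j of attribute Ai."""
    slices = {}
    for row in rows:
        for i, value in enumerate(row.split(), start=1):
            for j, bit in enumerate(value, start=1):
                slices.setdefault((i, j), []).append(int(bit))
    return slices


def p1tree(bits):
    """Build a 1-D pure1 tree by recursive halving.

    A node is (flag, children); flag is 1 iff the segment is all 1s.
    Pure segments (all 1s or all 0s) are leaves with children None.
    """
    if all(bits):
        return (1, None)
    if not any(bits):
        return (0, None)
    mid = len(bits) // 2
    return (0, (p1tree(bits[:mid]), p1tree(bits[mid:])))


def levels(tree):
    """List the node flags level by level, root first."""
    out, frontier = [], [tree]
    while frontier:
        out.append([flag for flag, _ in frontier])
        frontier = [c for _, kids in frontier if kids for c in kids]
    return out


def count_pattern(slices, pattern):
    """Count records equal to `pattern` by ANDing the bit slices,
    complementing a slice wherever the pattern bit is 0."""
    wanted = pattern.replace(" ", "")
    keys = sorted(slices)
    result = [1] * len(slices[keys[0]])
    for key, want in zip(keys, wanted):
        column = slices[key]
        result = [r & (b if want == "1" else 1 - b)
                  for r, b in zip(result, column)]
    return sum(result)


slices = bit_slices(ROWS)
print(slices[(1, 1)])                            # [0, 0, 0, 0, 1, 0, 1, 1]
print(levels(p1tree(slices[(1, 1)])))            # [[0], [0, 0], [0, 1], [1, 0]]
print(count_pattern(slices, "111 000 001 100"))  # 2
```

The level listing for P11 reproduces the seven recorded truth values in order (root, the two halves, the two quarters of the mixed half, and the two single bits).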

Horizontal Processing of Vertical Structures: History
In the 1980’s vertical data structures were proposed for record-based workloads:
Decomposition Storage Model (DSM, Copeland et al.)
Attribute Transposed File (ATF)
Band Sequential (BSQ, in RSI)
Bit Transposed File (BTF, Wang et al.)
These initiatives didn’t last. Why not?

Horizontal Processing of Vertical Structures for Record-based Workloads
For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may have introduced too much post-processing? For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, so there is no reconstructive post-processing?
[Figure: the table R(A1 A2 A3 A4) and its vertical bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43.]

Run Lists: generalized Ptrees using standard run-length compression of vertical bit files (alternatively, using Lempel-Ziv? Golomb? other?).
Run Lists record the type (truth) and start offset of each pure run, written truth:start. E.g., RL11, for the bit slice R11 = 0 0 0 0 1 0 1 1 of the table R(A1 A2 A3 A4):
1. 1st run is Pure0 → 0:000
2. 2nd run is Pure1 → 1:100
3. 3rd run is Pure0 → 0:101
4. 4th run is Pure1 → 1:110
RL11 = 0:000 1:100 0:101 1:110 (to complement, flip the purity bits).
E.g., to count the occurrences of 111 000 001 100, use “pure111000001100”: RL11^RL12^RL13^RL’21^RL’22^RL’23^RL’31^RL’32^RL33^RL41^RL’42^RL’43
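The truth:start encoding can be illustrated with a short Python sketch. The helper names (run_list, complement, expand, show) are hypothetical, and the bit slice used is R11 = 0 0 0 0 1 0 1 1, which is the slice that RL11 decodes to when the file holds 8 records (an assumption).

```python
def run_list(bits):
    """Run-length encode a bit slice as (truth, start_offset) pairs."""
    runs = []
    for offset, bit in enumerate(bits):
        if not runs or runs[-1][0] != bit:
            runs.append((bit, offset))
    return runs


def complement(runs):
    """Complement a run list by flipping the purity bit of every run."""
    return [(1 - truth, start) for truth, start in runs]


def expand(runs, length):
    """Decode a run list back into a bit slice of the given length."""
    bits = []
    for k, (truth, start) in enumerate(runs):
        end = runs[k + 1][1] if k + 1 < len(runs) else length
        bits.extend([truth] * (end - start))
    return bits


def show(runs, width=3):
    """Format runs as truth:start with binary start offsets."""
    return " ".join(f"{t}:{s:0{width}b}" for t, s in runs)


R11 = [0, 0, 0, 0, 1, 0, 1, 1]
rl11 = run_list(R11)
print(show(rl11))              # 0:000 1:100 0:101 1:110
print(show(complement(rl11)))  # 1:000 0:100 1:101 0:110
```

Because only run boundaries are stored, complementing touches one bit per run and never the underlying data.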

RunList-trees? (RLtrees)
To facilitate subsetting (isolating a subset) and processing, a Ptree structure can be constructed on top of the RunList using the “pure1” predicate. For R11, with RL11 = 0:000 1:100 0:101 1:110:
1. Whole file is not pure1 → 0
2. 1st half is not pure1 → 0
3. 2nd half is not pure1 → 0
4. 1st half of 2nd half is not pure1 → 0
5. 2nd half of 2nd half is pure1 → 1
6. 1st half of 1st half of 2nd half is pure1 → 1
7. 2nd half of 1st half of 2nd half is not pure1 → 0

RunList-trees continued
Alternatively, separate NotPure0 index trees could be built, where the predicate is NotPure0 (also note, the tree could be terminated at a given level). First, AND the NP0 index trees. Only the 1-branches of the result need to be ANDed through list scans. The more operands, the fewer 1-branches. For R11, with RL11 = 0:000 1:100 0:101 1:110:
1. Whole file is true → 1
2. 1st half is false → 0
3. 2nd half is true → 1
4. 1st half of 2nd half is true → 1
5. 2nd half of 2nd half is true → 1
6. 1st half of 1st half of 2nd half is true → 1
7. 2nd half of 1st half of 2nd half is false → 0

Ptrees
Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing.
The jury is still out on the best parallelization approach: vertical (by relation), horizontal (by tree node), or some combination. Horizontal parallelization is pretty, but network multicast overhead is huge. Use active networking? Clusters of Playstations? ...
The most useful form of a Ptree is the Pure1-tree or P1tree: a 1-bit at a node iff the corresponding half is pure1. There are many other useful predicates, e.g., NonPure0-trees, but we will focus on P1trees.
All Ptrees shown so far were 1-dimensional (recursively halving bit files), but they can be 2-D (recursively quartering, e.g., used for 2-D images), 3-D (recursively eighth-ing), …

A 2-Dimensional P1tree
A node is 1 iff that quadrant is purely 1-bits. E.g., take a bit-file (from, e.g., a 2-D image):
1111110011111000111111001111111011110000111100001111000001110000
Run-length compress the corresponding raster-ordered matrix using Peano order.
[Figure: the raster-ordered bit matrix and the resulting P1tree.]

Alternatively, a Count tree?
Counts are the ultimate goal, but P1trees are more compressed and produce the needed counts quite quickly.
Root count: 55 at level 3. Level-2 counts: 16 8 15 16. The mixed quadrants (8 and 15) are split again at level 1, and level 0 holds single bits; pure-1 and pure-0 quadrants are not expanded.
Tree levels: 3, 2, 1, 0 (purity counts of 4^3, 4^2, 4^1, 4^0 resp.). Fan-out = 2^dim = 4.
QID (Quadrant ID): e.g., 2.2.3 for the pixel ( 7, 1 ) = ( 111, 001 ), whose interleaved bits are 10.10.11.
[Figure: the 8x8 bit matrix and its count tree.]
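A count tree and a quadrant ID can be sketched in Python as follows. The 8x8 image below is an assumption: it is reconstructed so that it reproduces the slide's root count of 55 and level-2 counts of 16, 8, 15, 16, and the function names (count_tree, qid) are hypothetical.

```python
# Assumed 8x8 bit image, chosen to match the slide's counts (55; 16 8 15 16).
IMAGE = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11111111",
    "11111111",
    "11111111",
    "01111111",
]


def count_tree(img, row=0, col=0, size=None):
    """Return (count, children) for the size x size quadrant at (row, col).

    Children are in Peano (Z) order: upper-left, upper-right,
    lower-left, lower-right. Pure quadrants are not expanded.
    The corresponding P1tree bit of a node is 1 iff count == size * size.
    """
    if size is None:
        size = len(img)
    count = sum(img[r][col:col + size].count("1")
                for r in range(row, row + size))
    if count in (0, size * size) or size == 1:
        return (count, None)
    half = size // 2
    kids = [count_tree(img, row + dr, col + dc, half)
            for dr in (0, half) for dc in (0, half)]
    return (count, kids)


def qid(row, col, levels=3):
    """Quadrant ID of a pixel: interleave row and column bits, top bit first."""
    digits = []
    for k in reversed(range(levels)):
        digits.append(2 * ((row >> k) & 1) + ((col >> k) & 1))
    return ".".join(map(str, digits))


root_count, quadrants = count_tree(IMAGE)
print(root_count)                  # 55
print([c for c, _ in quadrants])   # [16, 8, 15, 16]
print(qid(7, 1))                   # 2.2.3
```

Only the two mixed quadrants (counts 8 and 15) carry children; the pure quadrants with count 16 stop at level 2, which is where the compression comes from.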

Logical Operations on Ptrees (are used to get counts of any pattern)
[Figure: Ptree 1, Ptree 2, their AND result and their OR result.]
The AND operation is faster than the bit-by-bit AND since there are shortcuts (any pure0 operand node means the result node is pure0), e.g., only load quadrant 2 to AND Ptree1, Ptree2, etc. The more operands there are in the AND, the greater the benefit due to this shortcut (more pure0 nodes).
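The pure0 shortcut can be shown with a small Python sketch of a tree AND. This is a simplified illustration, not the patented structure: nodes are represented as 0 (pure0), 1 (pure1) or a (left, right) tuple for a mixed segment, the trees are 1-D, the slice length is assumed to be a power of two, and all names are hypothetical.

```python
def build(bits):
    """Build a 1-D tree: 0 = pure0, 1 = pure1, (left, right) = mixed."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))


def tree_and(a, b):
    """AND two trees; a pure0 operand ends the recursion immediately."""
    if a == 0 or b == 0:
        return 0          # shortcut: the other subtree is never visited
    if a == 1:
        return b          # pure1 is the identity for AND
    if b == 1:
        return a
    left = tree_and(a[0], b[0])
    right = tree_and(a[1], b[1])
    if left == 0 and right == 0:
        return 0          # collapse a mixed node that became pure0
    return (left, right)


def count_ones(tree, size):
    """Count the 1-bits represented by a tree over `size` positions."""
    if not isinstance(tree, tuple):
        return size if tree == 1 else 0
    return count_ones(tree[0], size // 2) + count_ones(tree[1], size // 2)


a = build([0, 0, 0, 0, 1, 0, 1, 1])   # (0, ((1, 0), 1))
b = build([0, 0, 1, 1, 1, 1, 1, 1])   # ((0, 1), 1)
both = tree_and(a, b)
print(both)                  # (0, ((1, 0), 1))
print(count_ones(both, 8))   # 3
```

In this example the left half of the first operand is pure0, so the left half of the second operand is never inspected; with more operands, more such pure0 nodes appear and more work is skipped.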

A Natural 3-D Application: CubE for Active Situation Replication (CEASR), the sensor network application described earlier.

Another Natural 3-D Application: Digital Archive Network for Anthropology (DANA) (data mine anthropological artifacts: shape, color, …)

3-Dimensional Ptrees (e.g., for the CEASR sensor network or the Digital Archive Network for Anthropology)

Ptree dimension
The dimension of the Ptree structure is a user-chosen parameter.
It can be chosen to fit the data dimension:
Most datasets → 1-D Ptrees (recursive halving)
2-D images → 2-D Ptrees (recursive quartering)
3-D solids → 3-D Ptrees (recursive eighth-ing)
Or the dimension can be chosen based on other considerations: optimize compression, increase processing speed (next slide).

Raster Sorting: attributes 1st, bit position 2nd. Peano Sorting: bit position 1st, attributes 2nd. Generalized Raster and Peano Sorting: generalizes to any table with numeric attributes (not just images).
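The two orderings differ only in how the sort key is assembled from the attribute bits, which a short Python sketch can show. The function names and the sample records are hypothetical; each attribute is assumed to be an unsigned integer of nbits bits.

```python
def raster_key(record, nbits):
    """Attributes first, bit position second: concatenate whole attributes."""
    key = 0
    for value in record:
        key = (key << nbits) | value
    return key


def peano_key(record, nbits):
    """Bit position first, attributes second: interleave the bits,
    taking the most significant bit of every attribute before the next bit."""
    key = 0
    for k in reversed(range(nbits)):
        for value in record:
            key = (key << 1) | ((value >> k) & 1)
    return key


records = [(3, 0), (0, 3), (1, 1), (2, 2)]   # two 2-bit attributes
print(sorted(records, key=lambda r: raster_key(r, 2)))
# [(0, 3), (1, 1), (2, 2), (3, 0)]
print(sorted(records, key=lambda r: peano_key(r, 2)))
# [(1, 1), (0, 3), (3, 0), (2, 2)]
```

Raster order is ruled entirely by the first attribute, whereas Peano order groups records that agree on the high-order bits of all attributes, so records that are close in every dimension tend to end up adjacent.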

Generalized Peano Sorting: KNN speed improvement (UCI MLR data sets)
[Bar chart: time in seconds (axis from 20 to 120) for the adult, spam, mushroom, function and crop data sets; series: Unsorted, Generalized Raster, Generalized Peano.]

National Virtual Observatory data
What Ptree dimension and what ordering should be used for astronomical data, where all bodies are assumed to be on the surface of a sphere (the celestial sphere, which shares its equatorial plane with the earth and has no specified radius)?
Peano Triangle Mesh Tree (PTM-tree)
Peano Celestial Coordinate tree (PCCtree): uses the (RA, dec) coordinates of the celestial sphere. RA = Right Ascension (longitudinal angle); dec = declination (latitude angle).

Peano Triangular Mesh Tree (PTM-tree): similar to the Hierarchical Triangular Mesh (HTM) used in the Sloan Digital Sky Survey project. In both, the sphere is divided into triangles, and triangle sides are always great circle segments. PTM differs from HTM in the way in which the triangles are ordered.

The difference between HTM and PTM-trees is in the ordering.
[Figure: triangle 1 with its sub-triangles 1,0 to 1,3 and their further subdivisions (e.g., 1,1,0 to 1,1,3 and 1,3,0 to 1,3,3), shown once with the ordering of the PTM-tree and once with the ordering of HTM.]
Why use a different ordering?

PTM Triangulation of the Celestial Sphere: traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point. This “Peano ordering” produces a sphere-surface filling curve with good continuity characteristics. [Figure axes: RA, dec.]

PTM triangulation – Next Level LRLR LRLR LRLR LRLR

PTM-triangulation - Next Level LRLR RLRL LRLR RLRL LRLR RLRL LRLR RLRL

Peano Celestial Coordinate Trees (PCCtrees): unlike PTM-trees, which initially partition the sphere into the 8 faces of an octahedron, in the PCCtree scheme the sphere is transformed into a cylinder, then into a rectangle, and then standard Peano ordering is used on the celestial coordinates. Celestial coordinates: RA runs from 0° to 360°; dec runs from -90° to 90°.

PRAdec-scheme: Sphere → Cylinder → Plane
[Figure: the North Plane and South Plane, with dec running from 90° through 0° to -90° and RA from 0° to 360°, covered by Z-shaped Peano curve segments.]

Graph data (many-to-many self relations) “Everything should be made as simple as possible, but not simpler” Albert Einstein

Representing graphs. Examples:
Genomics: protein-protein interactions (ACM KDD-Cup ’02); focus is on node structure.
WWW: focus is on link structure.
Publication citations (ACM KDD-Cup ’03): focus is on both.
(Scientific American 05/03)

Genomics: Gene-Experiment-Organism Cube (1 iff that gene from that organism expresses at a threshold level in that experiment). A many-to-many-to-many relationship.
[Figure: the cube over genes g0..g3, organisms o0..o3 and experiments e0..e3, with its dimension tables: the Gene Dimension Table (SubCell-Location: Myta, Ribo, Nucl; Function: apop, mito, meio; StopCodonDensity; PolyA-Tail), the Organism Dimension Table (Organism, Species, Vert, Genome Size in million bp; human Homo sapiens, mouse Mus musculus, fly Drosophila melanogaster, yeast Saccharomyces cerevisiae), the Experiment Dimension Table (MIAME; LAB, PI, UNV, STR, CTY, STZ, ED, AD) and the Gene-Org Dim Table (chromosome, length).]

Protein-Protein Interactions (PPI) (2-hop interactions)
[Figure: the Gene Dimension Table (SubCell-Location, Function, StopCodonDensity, PolyA-Tail), its binary form (Myta, Ribo, Nucl, apop, meio, mito, SCD, Poly-A) and gene-by-gene (g0..g3) interaction matrices.]

Association for Computing Machinery KDD-Cup-02 NDSU Team

Greyware PPI graph mining tool
Visualize feature information using a glyph for each gene (PPI graph node). There is a PPI edge iff the 2 genes code for interacting proteins. This visual data mining tool was effective in KDD-CUP ’02.
[Figure: the glyph for g1 and the non-binary Gene Dimension Table (Myta, Ribo, Nucl, apop, meio, mito, SCD (stopcodondensity), Info-qty, length, essential, Dis-center).]

Thanks so much! Don’t forget to submit your best work to CAINE, Nov 11-13, 2003, Las Vegas NV, by July 1. Submit to the Program Chair, kendall.nygard@ndsu.nodak.edu, or the Conference Chair, william.perrizo@ndsu.nodak.edu. See http://www.cs.ndsu.nodak.edu/~krile/caine03 or http://www.isca-hq.org
For those interested in DM in genomics and bioinformatics: Virtual Conference in Genomics and Bioinformatics, VGAB-III, Sep 16-18, 2003, http://www.ndsu.edu/~virtual-genomics. Submit papers to the Program Chair, willy.valdivia@ndsu.nodak.edu, or to the Conference Chair, william.perrizo@ndsu.nodak.edu. VGAB-III will be available over Access Grid and Real Player to anywhere, for free (no registration fee).