DataSURG (Database Systems Users and Research Group)

Data Mining and Data Warehousing, Many-to-Many Relationships, Applications
North Dakota State University, Fargo, North Dakota, USA

Data Mining on a Data Warehouse vs. Transaction Processing on a Database
Question (workload on repository):
Circa 1980 (C.J. Date): transactions on a DBMS vs. file processing programs on file systems. “Use a DBMS instead of file systems! Unify data resources, centralize control, promote standards and consistency, eliminate redundancy, increase data value and usage, yadda, yadda.”
Circa 1990: “Buy a separate DW for DM” (separate from your DBMS for TP). The result: 2 separate, quite redundant, non-sharing, inconsistent systems!
What happened?
Great marketing success! (It sold more hardware and software.)
Great concurrency control R&D failure! We failed to integrate transactions and queries (OLTP and OLAP, i.e., updates and reads) in one system with acceptable performance.
The marketing was so successful, nobody noticed the failure!

OUTLINE
I still hold out hope that DW and DB will eventually be unified again. I believe the industry will eventually demand it. Already, there’s work on updating DWs!
For now, let’s just focus on DATA.
I consider Data Mining (DM) to be on the unstructured side of querying. On that side, you run up against two curses immediately:
Curse of non-scalability (solutions don’t scale with data volume).
Curse of dimensionality (solutions don’t scale with data dimension).
I will talk about the techniques we use to address these curses:
Horizontal processing of vertically structured data (instead of the ubiquitous vertical processing of horizontal data, i.e., the record orientation).
Parallelizing the DM engine:
Parallelize the software DM engine on clusters of computers.
Parallelize the greyware DM engine on clusters of people (i.e., browser-enable all software for visual data mining).

The DataSURG DM Architecture
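[Figure: the DataSURG DM architecture. YOUR DATA enters over the Internet through the DCI (Data Capture Interface) into the Ptree Repository (lossless, compressed, vertically-structured replicas). Data mining algorithms (yours/ours) work against the repository through the DMI (Data Mining Interface): a PREDICATE goes in, and the count of objects satisfying the PREDICATE comes back.]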

Data mining finds information in data. Why do we need Data Mining?
Data volume expands by Parkinson’s Law: data volume expands to fill available data storage.
Disk storage expands by Moore’s law: Capacity ∝ 2^(t / 9 months).
Available storage doubles every 9 months!
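To make the doubling rule concrete, here is a minimal Python sketch of the growth factor 2^(t / 9 months). The function name `capacity_growth` and the sample time spans are assumptions chosen for illustration; only the 9-month doubling period comes from the slide.

```python
def capacity_growth(months: float, doubling_period: float = 9.0) -> float:
    """Growth factor of storage capacity after `months`, doubling every `doubling_period` months."""
    return 2.0 ** (months / doubling_period)


# Illustrative time spans (not taken from the slide).
for months in (9, 18, 36, 90):
    print(f"after {months:3d} months: x{capacity_growth(months):,.0f}")
```

Under this rule, 90 months (7.5 years) already gives ten doublings, i.e., roughly a thousandfold increase.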

We’re awash with data!
Network data: high-speed, DWDM, all-optical (management, flow classification, QoS, security) (10 terabytes by 2003 ~ 10^13 B).
US EROS Data Center (EDC) archives Earth Observing System (EOS) Remotely Sensed Imagery (RSI), satellite and aerial photo data (10 petabytes by 2005 ~ 10^16 B).
National Virtual Observatory (aggregated astronomical data) (10 exabytes ~ 10^19 B).
Sensor data from sensors (including micro- and nano-sensor networks) (10 zettabytes ~ 10^22 B).
WWW (and other text collections) (10 yottabytes ~ 10^25 B).
Genomic/Proteomic/Metabolomic data (microarrays, genechips, genome sequences) (10 gazillabytes ~ 10^28 B?). I had to make up this name! Projected data sizes are overrunning our ability to name those sizes!
Stock market prediction data (prices + all the above? especially astronomy data?) (10 supragazillabytes by 2040 ~ 10^31 B?).
Useful information must be teased out of these large volumes of data through data mining.

More’s Law: More’s Less
The more volume, the less information. (AKA: Shannon’s Canon)
A simple illustration: which phone book has more info? (Both involve the same 4 data granules.)
[Table: BOOK-1 and BOOK-2, each with Name and Number columns and entries for Smith and Jones]
Data mining reduces volume and raises the information level.

Precision Agriculture Data Mining
The dataset consists of an aerial photograph (a TIFF image taken during the growing season) and a synchronized yield map (crop yield taken at harvest). Altogether there are 4 feature attributes (B, G, R, Y) and ~100,000 pixels.
[Figure: TIFF image and yield map]
A producer wants to know the relationship between the color intensities and yield.
One hypothesis, the association rule hi_green and low_red => hi_yield, is intuitive and could be made and verified without data mining (simple querying).
Data mining has found a stronger rule: hi_NIR and low_red => very_hi_yield.
So many producers use VIR instead of RGB cameras to get the better information.
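A rule such as hi_green and low_red => hi_yield is usually judged by its support and confidence. The sketch below computes both from per-pixel Boolean flags. The slide does not name these measures, so treating them as the yardstick is an assumption, and the eight pixel flags are hypothetical values invented purely for illustration (the real dataset has ~100,000 pixels).

```python
def rule_stats(antecedent, consequent):
    """Support and confidence of `antecedent => consequent` over per-pixel 0/1 flags."""
    n = len(antecedent)
    both = sum(1 for a, c in zip(antecedent, consequent) if a and c)
    ante = sum(1 for a in antecedent if a)
    support = both / n                          # fraction of all pixels matching the whole rule
    confidence = both / ante if ante else 0.0   # fraction of antecedent pixels that also match the consequent
    return support, confidence


# Hypothetical flags for 8 pixels (not data from the presentation).
hi_green = [1, 1, 1, 0, 1, 0, 1, 1]
low_red = [1, 1, 0, 0, 1, 1, 1, 0]
hi_yield = [1, 1, 0, 0, 1, 0, 0, 0]

# The antecedent "hi_green and low_red" is a bitwise AND of two flag columns.
antecedent = [g & r for g, r in zip(hi_green, low_red)]
support, confidence = rule_stats(antecedent, hi_yield)
print(f"support={support}, confidence={confidence}")
```

Both measures reduce to counts of pixels satisfying a conjunction of predicates, which is the kind of count the Ptree techniques later in the talk are meant to deliver.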

Another Precision Agriculture Data Mining Example: Grasshopper Infestation Prediction (again involving RSI data)
Grasshoppers cause significant economic loss each year.
Early infestation prediction is key to damage control.
Pixel classification on remotely sensed imagery holds significant promise for early detection.
Pixel classification (signaturing) has many applications: pest detection, forest fire detection, wetlands monitoring, … (for signaturing we developed the SMILEY software/greyware system).

Sensor Network Data Mining
Micro- and nano-scale sensor blocks are being developed for sensing:
Biological agents
Chemical agents
Motion detection
Coatings deterioration
RF-tagging of inventory
Structural materials fatigue
There will be trillions++ of individual sensors creating mountains of data. The data must be mined for its information.

Sensor Network Application:
CubE for Active Situation Replication (CEASR)
[Figure: situation space (with nano-sensors) and the CEASR display cube]
Operational Capability: Using Alien Technology’s Fluidic Self-Assembly (FSA) technology, clear layers (with embedded nano-LED elements) laminated together produce a visualization cube (a nano-LED at each pixel corresponding to a nano-sensor at each pixel in the situation space). Nano-sensors turn on CEASR display nano-LEDs when a threshold level (chemical, vibrational, biological, thermal, …) is exceeded.
Proposed Technical Approach: Sense chemical, vibrational, biological, and thermal levels in real time. The problems to be solved include:
Communication between sensor field(s) and CEASR.
Nano-sensor position registration.
Fluidic Self-Assembly (FSA) of the cube. FSA is an Alien Technology patented process capable of producing clear flexible substrates with embedded nano-LED display units.
Each energized nano-sensor must transmit a ping together with its location. These locations are then translated to 3-dimensional coordinates at the display. The corresponding voxel on the display lights up. A more sophisticated CEASR device would sense and transmit the intensity levels, lighting up the display voxel with the same intensity.

Anthropology Application: Digital Archive Network for Anthropology (DANA) (data mine anthropological artifacts: shape, color, discovery location, …)

Astronomy Application: The celestial sphere
[Figure: the celestial sphere with its RA and dec coordinates]

Data Mining?
Querying is asking specific questions and expecting specific answers.
Data Mining is going into the MOUNTAIN of DATA and returning with information gems. But also some fool’s gold?
Relevance and interestingness analysis serves to assay those information and knowledge gems.

Data Mining Process
Data mining: the core of the knowledge discovery process.
[Figure: the process stages, from the raw data up to the evaluated patterns]
1. Mountain of Raw Data
2. Data Cleaning/Integration: missing data, outliers, noise, errors
3. Data Warehouse: cleaned, integrated, read-only, periodic, historical raw database
4. Selection: feature extraction, tuple selection
5. Task-relevant Data
6. Data Mining: OLAP, Classification, Clustering, ARM
7. Pattern Evaluation and Assay

Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
Standard querying: SQL SELECT FROM WHERE; complex queries (nested, EXISTS, …)
Searching and aggregating: FUZZY queries, search engines, BLAST searches; OLAP (rollup, drilldown, slice/dice, …)
Machine learning: supervised learning (classification, regression); unsupervised learning (clustering)
Data mining: association rule mining, data prospecting, fractals, …
On the query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record ’02).
On the data mining end, the surface has barely been scratched. But even those scratches have had a great impact: one of the early scratchers became the biggest corporation in the world last year, while a non-scratcher filed for bankruptcy (Walmart vs. Kmart).

Our Approach
Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees (Ptrees in either case) [1], processed horizontally (DBMSs process horizontal data vertically).
Ptrees are data-mining-ready, compressed data structures that attempt to address the curse of non-scalability and the curse of dimensionality.
A compressed, OLAP-ready data warehouse structure, the Pcube [1], which facilitates OLAP and querying using Ptrees.
[1] Technology is patent pending by North Dakota State University.

Vertical partitioning: R(A1 A2 A3 A4) --> R[A1] R[A2] R[A3] R[A4]
A table, R(A1..An), is a horizontal structure (a set of horizontal records) that is processed vertically (vertical scans).
Ptrees: vertically partition the table; compress each vertical bit slice into a basic Ptree; horizontally process these Ptrees using one multi-operand logical AND operation.
With four 3-bit attributes, the bit slices are R11 R12 R13, R21 R22 R23, R31 R32 R33, R41 R42 R43, and the corresponding basic Ptrees are P11 P12 P13, P21 P22 P23, P31 P32 P33, P41 P42 P43.
1-Dimensional Ptrees are built by recording the truth of the predicate “pure1” recursively on halves, until there is purity. For P11 (built from bit slice R11):
1. Whole file is not pure1 --> 0
2. 1st half is not pure1 --> 0
3. 2nd half is not pure1 --> 0
4. 1st half of 2nd half is not pure1 --> 0
5. 2nd half of 2nd half is pure1 --> 1
6. 1st half of 1st half of 2nd half is pure1 --> 1
7. 2nd half of 1st half of 2nd half is not pure1 --> 0
E.g., to count occurrences of the tuple 111 000 001 100, use “pure111000001100”:
P11 ^ P12 ^ P13 ^ P’21 ^ P’22 ^ P’23 ^ P’31 ^ P’32 ^ P33 ^ P41 ^ P’42 ^ P’43
(P’ denotes the complement of P.)

Horizontal Processing of Vertical Structures History
In the 1980s, vertical data structures were proposed for record-based workloads:
Decomposition Storage Model (DSM, Copeland et al.)
Attribute Transposed File (ATF)
Band Sequential (BSQ) format in RSI
Bit Transposed File (BTF, Wang et al.)
These initiatives didn’t last. Why not?

Horizontal Processing of Vertical Structures for Record-based Workloads
For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may have introduced too much post-processing.
For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, so there is no reconstructive post-processing.
[Figure: table R(A1 A2 A3 A4) and its vertical bit slices R11 … R43]

Run Lists: R(A1 A2 A3 A4) --> R[A1] R[A2] R[A3] R[A4]
Run Lists: generalized Ptrees using standard run-length compression of vertical bit files (alternatively, using Lempel-Ziv? Golomb? other?).
Run Lists record the type and start offset of each pure run, written truth:start. E.g., RL11 for bit slice R11:
1st run is pure0 --> 0:000
2nd run is pure1 --> 1:100
3rd run is pure0 --> 0:101
4th run is pure1 --> 1:110
RL11 = 0:000 1:100 0:101 1:110 (to complement, flip the purity bits)
[Figure: run lists RL11 … RL43 for the twelve bit slices]
E.g., to count occurrences of the tuple 111 000 001 100, use “pure111000001100”:
RL11 ^ RL12 ^ RL13 ^ RL’21 ^ RL’22 ^ RL’23 ^ RL’31 ^ RL’32 ^ RL33 ^ RL41 ^ RL’42 ^ RL’43
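The truth:start encoding is easy to reproduce. In the hedged sketch below, the helper names `run_list`, `fmt`, and `complement` are assumptions made for illustration; the bit column 0000 1011 is simply what the run list 0:000 1:100 0:101 1:110 decodes to for an 8-bit slice.

```python
def run_list(bits):
    """Return (truth, start_offset) pairs, one per maximal pure run."""
    runs = []
    for offset, bit in enumerate(bits):
        if not runs or runs[-1][0] != bit:   # a new run starts where the bit value changes
            runs.append((bit, offset))
    return runs


def fmt(runs, width=3):
    """Render runs in the slide's truth:start notation, offsets in binary."""
    return " ".join(f"{truth}:{start:0{width}b}" for truth, start in runs)


def complement(runs):
    """Complementing a run list only flips the purity bits; offsets stay put."""
    return [(1 - truth, start) for truth, start in runs]


R11 = [0, 0, 0, 0, 1, 0, 1, 1]
RL11 = run_list(R11)
print(fmt(RL11))               # 0:000 1:100 0:101 1:110
print(fmt(complement(RL11)))   # 1:000 0:100 1:101 0:110
```

Only run boundaries are stored, so a long pure stretch costs a single entry regardless of its length.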

RunList-trees? (RLtrees)
To facilitate subsetting (isolating a subset) and processing, a Ptree structure can be constructed on top of the RunList using the “pure1” predicate. For R11, with RL11 = 0:000 1:100 0:101 1:110:
1. Whole file is not pure1 --> 0
2. 1st half is not pure1 --> 0
3. 2nd half is not pure1 --> 0
4. 1st half of 2nd half is not pure1 --> 0
5. 2nd half of 2nd half is pure1 --> 1
6. 1st half of 1st half of 2nd half is pure1 --> 1
7. 2nd half of 1st half of 2nd half is not pure1 --> 0

RunList-trees continued
Alternatively, separate NotPure0 (NP0) index trees could be built, where the predicate is NotPure0 (also note that the tree could be terminated at a given level).
First, AND the NP0 index trees. Only the 1-branches of the result need to be ANDed through list scans. The more operands, the fewer 1-branches.
For R11, with RL11 = 0:000 1:100 0:101 1:110:
1. Whole file is true --> 1
2. 1st half is false --> 0
3. 2nd half is true --> 1
4. 1st half of 2nd half is true --> 1
5. 2nd half of 2nd half is true --> 1
6. 1st half of 1st half of 2nd half is true --> 1
7. 2nd half of 1st half of 2nd half is false --> 0
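The pruning idea (AND the NP0 indexes first, then scan only the surviving 1-branches) can be sketched with a flat, single-level index, which is a simplification of the tree “terminated at a given level”. The names `np0_index` and `and_with_index`, the segment size, and the second column `other` are assumptions made for illustration; R11 is the column that RL11 decodes to.

```python
def np0_index(bits, seg):
    """One flag per segment of `seg` bits: 1 iff the segment is NOT pure0."""
    return [1 if any(bits[i:i + seg]) else 0 for i in range(0, len(bits), seg)]


def and_with_index(columns, seg):
    """AND several bit columns, scanning only segments that every NP0 index marks 1."""
    n = len(columns[0])
    indexes = [np0_index(col, seg) for col in columns]
    live = [min(flags) for flags in zip(*indexes)]      # AND of the index flags
    result, scanned = [0] * n, 0
    for k, flag in enumerate(live):
        if not flag:
            continue                                     # some operand is pure0 here: result stays 0
        scanned += 1
        for i in range(k * seg, min((k + 1) * seg, n)):
            result[i] = min(col[i] for col in columns)   # bitwise AND within the segment
    return result, scanned


R11 = [0, 0, 0, 0, 1, 0, 1, 1]
other = [1, 1, 0, 0, 0, 0, 1, 1]        # hypothetical second operand
result, scanned = and_with_index([R11, other], seg=2)
print(result, scanned)
```

Here only one of the four 2-bit segments has to be scanned; adding more operands can only clear more index flags, which is the sense in which more operands mean fewer 1-branches.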

Ptrees
Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing.
The jury is still out on the best parallelization approach: vertical (by relation), horizontal (by tree node), or some combination.
Horizontal parallelization is pretty, but the network multicast overhead is huge.
Use active networking? Clusters of Playstations? ...
The most useful form of a Ptree is the Pure1-tree or P1tree: a 1-bit at a node iff the corresponding half is pure1.
There are many other useful predicates, e.g., NonPure0-trees, but we will focus on P1trees.
All Ptrees shown so far were 1-dimensional (recursively halving bit files), but they can be 2-D (recursively quartering; e.g., used for 2-D images), 3-D (recursively eighth-ing), …

A 2-Dimensional P1tree
A node is 1 iff that quadrant is purely 1-bits.
E.g., take a bit file (from, e.g., a 2-D image) and run-length compress the corresponding raster-ordered matrix using Peano order.
[Figure: an example bit matrix and its 2-D P1tree]
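Recursive quartering can be sketched as follows. This is an illustrative sketch under stated assumptions: the nested-list representation (1 for a pure1 quadrant, 0 for a pure0 quadrant, a 4-element list for a mixed quadrant), the quadrant order (upper-left, upper-right, lower-left, lower-right), and the 4x4 bit matrix are all chosen here for illustration and are not taken from the slide’s figure.

```python
def p1tree_2d(img, r0=0, c0=0, size=None):
    """Quadrant tree over a square 0/1 matrix whose side is a power of two.

    Returns 1 for a pure1 quadrant, 0 for a pure0 quadrant, and a list of four
    sub-trees (upper-left, upper-right, lower-left, lower-right) for a mixed one.
    """
    if size is None:
        size = len(img)
    cells = [img[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    if all(cells):
        return 1
    if not any(cells):
        return 0
    h = size // 2                                  # quarter the mixed quadrant
    return [p1tree_2d(img, r0, c0, h), p1tree_2d(img, r0, c0 + h, h),
            p1tree_2d(img, r0 + h, c0, h), p1tree_2d(img, r0 + h, c0 + h, h)]


def count_ones(tree, size):
    """Number of 1-bits under a node that covers a size x size quadrant."""
    if tree == 1:
        return size * size
    if tree == 0:
        return 0
    return sum(count_ones(sub, size // 2) for sub in tree)


image = [                                          # hypothetical 4x4 bit matrix
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
]
tree = p1tree_2d(image)
print(tree, count_ones(tree, 4))
```

Pure quadrants collapse to a single node at whatever level they occur, which is where the compression comes from.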

Alternatively, a Count tree?
Counts are the ultimate goal, but P1trees are more compressed and produce the needed counts quite quickly.
[Figure: a count tree with a root count at level-3 (a pure quadrant there holds 4^3 bits) and quadrant counts at level-2, level-1, and level-0]
Tree levels: 3, 2, 1, 0.
Fan-out = 2^dim = 4.
QID (Quadrant ID) identifies a quadrant, e.g., a pure-1 or pure-0 quadrant; in the example the pixel (7, 1) is (111, 001) in binary.

Logical Operations on Ptrees (are used to get counts of any pattern)
[Figure: two operand Ptrees with their AND result and OR result]
The AND operation is faster than a bit-by-bit AND, since there are shortcuts: any pure0 operand node means the result node is pure0 (e.g., only quadrant 2 has to be loaded to AND Ptree1, Ptree2, etc.).
The more operands there are in the AND, the greater the benefit due to this shortcut (more pure0 nodes).
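The pure0 shortcut falls out naturally when AND is written directly on the trees. The sketch below uses 1-D halving for brevity (the slide’s figure uses quadrants), and its representation (1 for pure1, 0 for pure0, a two-element list for a mixed node) as well as the names `build`, `ptree_and`, `ptree_or`, `count`, and the second operand `B` are assumptions made for illustration.

```python
def build(bits):
    """1 = pure1, 0 = pure0, [left, right] = mixed (length must be a power of two)."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return [build(bits[:mid]), build(bits[mid:])]


def ptree_and(a, b):
    if a == 0 or b == 0:
        return 0                     # shortcut: a pure0 operand makes the result pure0
    if a == 1:
        return b                     # pure1 is the identity for AND
    if b == 1:
        return a
    left, right = ptree_and(a[0], b[0]), ptree_and(a[1], b[1])
    return 0 if left == 0 and right == 0 else [left, right]


def ptree_or(a, b):
    if a == 1 or b == 1:
        return 1                     # shortcut: a pure1 operand makes the result pure1
    if a == 0:
        return b
    if b == 0:
        return a
    left, right = ptree_or(a[0], b[0]), ptree_or(a[1], b[1])
    return 1 if left == 1 and right == 1 else [left, right]


def count(node, size):
    """Number of 1-bits under a node covering `size` bits."""
    if node == 1:
        return size
    if node == 0:
        return 0
    return count(node[0], size // 2) + count(node[1], size // 2)


A = build([0, 0, 0, 0, 1, 0, 1, 1])
B = build([1, 1, 1, 1, 0, 0, 1, 1])     # hypothetical second operand
print(ptree_and(A, B), count(ptree_and(A, B), 8))
print(ptree_or(A, B), count(ptree_or(A, B), 8))
```

In the AND, the whole first half is resolved without descending into `B`, because `A` is pure0 there; with more operands, more subtrees are cut off this way.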

A Natural 3-D Application: CubE for Active Situation Replication (CEASR)

Another Natural 3-D Application: Digital Archive Network for Anthropology (DANA) (data mine anthropological artifacts: shape, color, …)

3-Dimensional Ptrees (e.g., for the CEASR sensor network or the Digital Archive Network for Anthropology)

Ptree dimension
The dimension of the Ptree structure is a user-chosen parameter.
It can be chosen to fit the data dimension:
Most datasets --> 1-D Ptrees (recursive halving)
2-D images --> 2-D Ptrees (recursive quartering)
3-D solids --> 3-D Ptrees (recursive eighth-ing)
Or the dimension can be chosen based on other considerations:
optimize compression
increase processing speed (next slide)

Raster Sorting: attributes 1st, bit position 2nd.
Peano Sorting: bit position 1st, attributes 2nd.
Generalized Raster and Peano Sorting: generalizes to any table with numeric attributes (not just images).
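One way to read the two orderings is as two ways of forming a sort key from a row of non-negative integer attributes: raster sorting concatenates whole attributes, while Peano sorting interleaves their bits from the most significant position down. The sketch below takes that reading; the function names, the 2-bit attribute width, and the sample rows are assumptions made for illustration.

```python
def raster_key(row, nbits):
    """Attributes 1st, bit position 2nd: concatenate all bits of each attribute in turn."""
    key = 0
    for value in row:
        key = (key << nbits) | value
    return key


def peano_key(row, nbits):
    """Bit position 1st, attributes 2nd: interleave attribute bits, most significant first."""
    key = 0
    for bit in range(nbits - 1, -1, -1):
        for value in row:
            key = (key << 1) | ((value >> bit) & 1)
    return key


rows = [(0, 3), (1, 0), (3, 1), (2, 2)]          # hypothetical rows with two 2-bit attributes
raster_sorted = sorted(rows, key=lambda r: raster_key(r, 2))
peano_sorted = sorted(rows, key=lambda r: peano_key(r, 2))
print(raster_sorted)
print(peano_sorted)
```

Because the Peano key gives every attribute’s high-order bit priority over any attribute’s low-order bits, rows that are close in all attributes tend to end up near each other, which favors long pure runs in the bit slices.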

Generalized Peano Sorting
[Figure: KNN speed improvement on UCI MLR data sets (adult, spam, mushroom, function, crop): time in seconds, on an axis from 20 to 120, for Unsorted, Generalized Raster, and Generalized Peano orderings]

National Virtual Observatory data
What Ptree dimension and what ordering should be used for astronomical data, where all bodies are assumed to be on the surface of a sphere (the celestial sphere, which shares its equatorial plane with the earth and has no specified radius)?
Peano Triangle Mesh Tree (PTM-tree)
Peano Celestial Coordinate tree (PCCtree), which uses the (RA, dec) coordinates of the celestial sphere:
RA = right ascension (longitudinal angle)
dec = declination (latitude angle)

Peano Triangular Mesh Tree (PTM-tree)
Similar to the Hierarchical Triangular Mesh (HTM) used in the Sloan Digital Sky Survey project.
In both:
The sphere is divided into triangles.
Triangle sides are always great circle segments.
PTM differs from HTM in the way in which the triangles are ordered.

The difference between HTM and PTM-trees is in the ordering.
[Figure: a triangle 1 subdivided into sub-triangles 1,0 to 1,3 and again into sub-triangles such as 1,1,0 to 1,1,3 and 1,3,0 to 1,3,3, shown once with the ordering of the PTM-tree and once with the ordering of HTM]
Why use a different ordering?

PTM Triangulation of the Celestial Sphere
Traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point.
[Figure: the traversal drawn on the celestial sphere, with RA and dec axes]
This “Peano ordering” produces a sphere-surface-filling curve with good continuity characteristics.

PTM triangulation: next levels
[Figure: two successive refinements of the triangulation, the first with triangles labeled LRLR and the second with triangles labeled alternately LRLR and RLRL]

Peano Celestial Coordinate Trees (PCCtrees)
Unlike PTM-trees, which initially partition the sphere into the 8 faces of an octahedron, in the PCCtree scheme the sphere is transformed into a cylinder, then into a rectangle, and then standard Peano ordering is used on the celestial coordinates.
Celestial coordinates: RA runs from 0° to 360°; dec runs from -90° to 90°.
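As a rough sketch of the last step, Peano ordering on the (RA, dec) rectangle, the code below quantizes both angles onto a 2^nbits by 2^nbits grid and interleaves the cell indices. Treating RA and dec directly as the rectangle’s axes, the grid resolution, the choice to put the dec bit ahead of the RA bit, and the name `pcc_key` are all assumptions made for illustration; the slides do not specify these details.

```python
def pcc_key(ra_deg, dec_deg, nbits=8):
    """Peano (bit-interleaved) key of the grid cell containing (RA, dec).

    RA is taken modulo 360 degrees; dec is expected in [-90, 90].
    """
    cells = 1 << nbits
    col = min(int(ra_deg % 360.0 / 360.0 * cells), cells - 1)      # RA  -> column index
    row = min(int((dec_deg + 90.0) / 180.0 * cells), cells - 1)    # dec -> row index
    key = 0
    for bit in range(nbits - 1, -1, -1):
        # Two bits per level: the dec (row) bit first, then the RA (column) bit.
        key = (key << 2) | (((row >> bit) & 1) << 1) | ((col >> bit) & 1)
    return key


# With nbits=1 the rectangle is split into four cells numbered 0..3.
for ra, dec in [(0.0, -90.0), (180.0, -90.0), (0.0, 0.0), (180.0, 0.0)]:
    print((ra, dec), pcc_key(ra, dec, nbits=1))
```

Sorting objects by such a key places bodies from the same sky cell next to each other at every level of the hierarchy.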

PRAdec-scheme: Sphere --> Cylinder --> Plane
[Figure: the sphere mapped to a cylinder and then to a plane, split into a North Plane and a South Plane with dec running from 90° through 0° to -90°; each cell of the plane is marked with a Z-shaped Peano pattern]

Graph data (many-to-many self relations)
“Everything should be made as simple as possible, but not simpler” Albert Einstein

Representing graphs
Examples:
Genomics: protein-protein interactions (ACM KDD-Cup ’02). Focus is on node structure.
WWW: focus is on link structure.
Publication citations (ACM KDD-Cup ’03): focus is on both.
(Scientific American 05/03)

Genomics: the Gene-Experiment-Organism Cube
A many-to-many-to-many relationship: a cell of the cube is 1 iff that gene from that organism expresses at a threshold level in that experiment.
[Figure: the cube with genes g0 … g3, organisms o0 … o3, and experiments e0 … e3, together with its dimension tables:
Gene Dimension Table: SubCell-Location (Myta, Ribo, Nucl), Function (apop, mito, meio), StopCodonDensity, PolyA-Tail
Organism Dimension Table: Organism, Species, Vert, Genome Size (million bp), with rows for human (Homo sapiens), mouse (Mus musculus, 3000), fly (Drosophila melanogaster, 185), and yeast (Saccharomyces cerevisiae, 12.1)
Gene-Org Dim Table: chromosome, length
Experiment Dimension Table (MIAME): LAB, PI, UNV, STR, CTY, STZ, ED, AD, S, H, M, N]

Protein-Protein Interactions (PPI) (2-hop interactions)
[Figure: the Gene Dimension Table with columns SubCell-Location (Myta, Ribo, Nucl, Ribo), Function (apop, meio, mito, apop), StopCodonDensity (.1, .1, .1, .9), and PolyA-Tail; its binary form, the Gene Dimension Table (Binary), with columns Poly-A, SCD, Mito, Meio, apop, Nucl, Ribo, Myta; and gene-by-gene interaction matrices over g0 … g3]

Association for Computing Machinery KDD-Cup-02
NDSU Team

Greyware PPI graph mining tool
Visualize feature information using a glyph for each gene (PPI graph node).
There is a PPI edge iff the 2 genes code for interacting proteins.
This visual data mining tool was effective in KDD-Cup ’02.
[Figure: the glyph for g1, drawn from the Gene Dimension Table (non-binary) with attributes Info-qty, SCD (stop codon density), Mito, Meio, apop, Nucl, Ribo, Myta, length, essential, and Dis-center for genes g1 … g4]

Thanks so much!
Don’t forget to submit your best work to CAINE, Nov 11-13, 2003, Las Vegas NV, by July 1. Submit to the Program Chair or the Conference Chair.
For those interested in DM in genomics and bioinformatics: the Virtual Conference in Genomics and Bioinformatics, VGAB-III, Sep 16-18. Submit papers to the Program Chair or to the Conference Chair. VGAB-III will be available over Access Grid and Real Player to anywhere, for free (no registration fee).
