Data Mining and Data Warehousing, many-to-many Relationships, applications William Perrizo DataSURG (Database Systems Users and Research Group) North Dakota.

Data Mining and Data Warehousing, many-to-many Relationships, applications William Perrizo DataSURG (Database Systems Users and Research Group) North Dakota State University Fargo, North Dakota USA william.perrizo@ndsu.nodak.edu

DM on a DW vs. TP on a DB (workload on repository)
The question, C.J. Date, circa 1980: transactions on a DBMS vs. file-processing programs on file systems?
- “Use a DBMS instead of file systems! It unifies data resources, centralizes control, promotes standards and consistency, eliminates redundancy, and increases data value (wider data usage).”
Circa 1990:
- “Buy a separate DW for DM” (separate from your DBMS for TP).
- The result: 2 separate, quite redundant, non-sharing, inconsistent systems!
What happened?
- Great marketing success! (It sold more hardware and software.)
- Great concurrency control R&D failure! We failed to integrate transactions and queries (OLTP and OLAP, i.e., updates and reads) in one system with acceptable performance.
- The marketing was so successful that nobody noticed the failure!

OUTLINE
I still hold out hope that DW and DB will eventually be unified again. I believe the industry will eventually demand it. Already, there is work on updating DWs!
For now, let’s just focus on DATA and DM.
- I consider DM to be on the unstructured side of querying. There you immediately run up against two curses:
  - Curse of non-scalability (solutions don’t scale with data volume).
  - Curse of dimensionality (solutions don’t scale with data dimension).
- I will talk about techniques we have used to address the curses:
  - Process vertically structured data horizontally (instead of the ubiquitous vertical processing of horizontal data, i.e., the record orientation).
  - Parallelize the DM engine:
    - Parallelize the software DM engine on clusters of computers.
    - Parallelize the greyware DM engine on clusters of people (i.e., browser-enable all software for visual data mining).

Data mining finds information in data. Why do we need Data Mining?
- Data volume expands by Parkinson’s Law: data volume expands to fill available data storage.
- Disk storage expands by Moore’s law: capacity grows as 2^(t / 9 months), i.e., available storage doubles every 9 months!

We’re awash with data!
- Network data: high-speed, DWDM, all-optical (management, flow classification, QoS, security) (10 terabytes by 2003 ~ 10^13 B).
- US EROS Data Center (EDC) archives of Earth Observing System (EOS) Remotely Sensed Imagery (RSI), satellite and aerial photo data (10 petabytes by 2005 ~ 10^16 B).
- National Virtual Observatory (aggregated astronomical data) (10 exabytes by 2010 ~ 10^19 B).
- Sensor data from sensor networks (micro- and nano-sensor networks) (10 zettabytes by 2015 ~ 10^22 B).
- The WWW (and other text collections) will continue to grow (10 yottabytes by 2020 ~ 10^25 B).
- Micro-arrays, gene-chips, genome sequence data (10 gazillabytes by 2030 ~ 10^28 B?).
- Stock market prediction data (prices + all the above? especially astronomy data?) (10 supragazillabytes by 2040 ~ 10^31 B?).
Useful info must be teased out of these large volumes of data through data mining.
I had to make up the last two names. Are projected data sizes overrunning our ability to name those sizes?

More’s Law: More’s Less
The more volume, the less information. (AKA: Shannon’s Canon)
A simple illustration: which phone book has more info? (Both involve the same 4 data granules.)

BOOK-1                BOOK-2
Name    Number        Name    Number
Smith   234-9816      Smith   234-9816
Jones   231-7237      Smith   231-7237
                      Jones   234-9816
                      Jones   231-7237

Data mining reduces volume and raises the information level.

Precision Agriculture Data Mining
[Figures: TIFF image; yield map.]
The dataset consists of an aerial photograph (TIFF image taken during the growing season) and a synchronized yield map (crop yield taken at harvest). Altogether there are 4 feature attributes (B, G, R, Y) and ~100,000 pixels.
A producer wants to know the relationship between the color intensities and yield.
One hypothesis, the association rule hi_green and low_red -> hi_yield, is intuitive and could be made and verified without data mining (simple querying).
Data mining has found a stronger rule, hi_NIR and low_red -> very_hi_yield. So many producers use VIR instead of RGB cameras to get the better information.
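As a rough illustration of how a rule such as hi_green and low_red -> hi_yield could be checked by simple querying, the sketch below computes the rule's support and confidence. The pixel values and the HI/LOW thresholds are made up for the example; they are not from the dataset described on the slide.

```python
# Hypothetical toy pixels (G = green, R = red, Y = yield); not the real data.
pixels = [
    {"G": 200, "R": 30, "Y": 180},
    {"G": 210, "R": 40, "Y": 190},
    {"G": 190, "R": 35, "Y": 90},
    {"G": 60, "R": 200, "Y": 50},
    {"G": 70, "R": 180, "Y": 170},
]

HI, LOW = 128, 64  # assumed intensity thresholds for "hi" and "low"


def hi_green_low_red(p):
    """Antecedent of the rule: high green and low red intensity."""
    return p["G"] >= HI and p["R"] < LOW


def hi_yield(p):
    """Consequent of the rule: high yield."""
    return p["Y"] >= HI


def support_confidence(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n_ante = sum(1 for r in rows if antecedent(r))
    n_both = sum(1 for r in rows if antecedent(r) and consequent(r))
    support = n_both / len(rows)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


support, confidence = support_confidence(pixels, hi_green_low_red, hi_yield)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A stronger rule in this sense is one with higher confidence (and enough support) on the same pixels.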

Another Precision Agriculture Data Mining Example: Grasshopper Infestation Prediction (again involving RSI data)
- Grasshoppers cause significant economic loss each year.
- Early infestation prediction is key to damage control.
- Pixel classification on remotely sensed imagery holds significant promise for achieving early detection.
- Pixel classification (signaturing) has many applications: pest detection, forest fire detection, wetlands monitoring, ...
(For signaturing we developed the SMILEY software/greyware system.) http://midas.cs.ndsu.nodak.edu/~smiley

Sensor Network Data Mining
- Micro- and nano-scale sensor blocks are being developed for sensing:
  - Biological agents
  - Chemical agents
  - Motion detection
  - Coatings deterioration
  - RF-tagging of inventory
  - Structural materials fatigue
There will be trillions ++ of individual sensors creating mountains of data. The data must be mined for its information.

CEASR slide 

Data Mining?
Querying is asking specific questions and expecting specific answers.
Data Mining goes into the MOUNTAIN of DATA and hopefully returns information gems. But also some fool’s gold?
Relevance and interestingness analysis serves to assay those gems (it helps pick out the valuable ones).

Data Mining Process
Data mining: the core of the knowledge discovery process.
Mountain of raw data
-> Data cleaning/integration: missing data, outliers, noise, errors
-> Data warehouse: cleaned, integrated, read-only, periodic, historical raw database
-> Selection of task-relevant data: feature extraction, tuple selection
-> Data mining: OLAP, classification, clustering, ARM
-> Pattern evaluation and assay

Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
- Standard querying: SQL SELECT FROM WHERE; complex queries (nested, EXISTS, ...)
- Searching and aggregating: fuzzy queries, search engines, BLAST searches; OLAP (rollup, drilldown, slice/dice, ...)
- Machine learning / data mining: supervised learning (classification, regression); unsupervised learning (clustering); association rule mining; data prospecting; fractals, ...
On the query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record ’02).
On the data mining end, the surface has barely been scratched. But even those scratches had a great impact: one of the early scratchers became the biggest corporation in the world last year, while a non-scratcher filed for bankruptcy (Walmart vs. KMart).

Our Approach
- Vertical, compressed data structures, Peano-trees (Ptrees) (1), processed horizontally (DBMSs process horizontal data vertically).
  Ptrees are data-mining-ready, compressed data structures, which attempt to address the curse of non-scalability and the curse of dimensionality.
- A compressed, OLAP-ready data warehouse structure, the Peano Data Cube (PDcube) (1).
  The PDcube facilitates OLAP operations and query processing, using the Ptree data structure.
(1) Technology is patent pending by North Dakota State University.

Vertical structure processed horizontally (ANDs) vs. horizontal structure processed vertically (scans)
A DBMS relation (table) is a horizontal structure (a set of horizontal records) processed vertically (vertical scans).
Ptrees are vertical structures (compressed vertical bit files) processed horizontally (using multi-operand logical ANDs).

R( A1  A2  A3  A4 ), the horizontal relation:
   010 111 110 001
   011 111 110 000
   010 110 101 001
   010 111 101 111
   101 010 001 100
   010 010 001 101
   111 000 101 100
   111 000 001 100

--> R[A1] R[A2] R[A3] R[A4], split into vertical bit files (one per bit position):
R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43
 0   1   0   1   1   1   1   1   0   0   0   1
 0   1   1   1   1   1   1   1   0   0   0   0
 0   1   0   1   1   0   1   0   1   0   0   1
 0   1   0   1   1   1   1   0   1   1   1   1
 1   0   1   0   1   0   0   0   1   1   0   0
 0   1   0   0   1   0   0   0   1   1   0   1
 1   1   1   0   0   0   1   0   1   1   0   0
 1   1   1   0   0   0   0   0   1   1   0   0

(1-Dim) Ptrees are built by recording the truth of the predicate “pure 1” recursively on halves, until there is purity. For P11, built from R11 = 00001011:
1. Whole file is not pure1 -> 0
2. 1st half is not pure1 -> 0
3. 2nd half is not pure1 -> 0
4. 1st half of 2nd half is not pure1 -> 0
5. 2nd half of 2nd half is pure1 -> 1
6. 1st half of 1st half of 2nd half is pure1 -> 1
7. 2nd half of 1st half of 2nd half is not pure1 -> 0

Each bit file R11 ... R43 is compressed in the same way into its Ptree P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43.
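The construction can be sketched in a few lines of Python. This is only an illustration under assumptions: the node layout (bit, children) and the function names bit_slices and build_p1tree are invented for the example, and the patent-pending Ptree implementation is certainly more compact than nested tuples. The relation is the 8-row R(A1 A2 A3 A4) from the slide.

```python
# The 8-row relation R(A1, A2, A3, A4) from the slide, 3 bits per attribute.
ROWS = [
    (0b010, 0b111, 0b110, 0b001),
    (0b011, 0b111, 0b110, 0b000),
    (0b010, 0b110, 0b101, 0b001),
    (0b010, 0b111, 0b101, 0b111),
    (0b101, 0b010, 0b001, 0b100),
    (0b010, 0b010, 0b001, 0b101),
    (0b111, 0b000, 0b101, 0b100),
    (0b111, 0b000, 0b001, 0b100),
]


def bit_slices(rows, nbits=3):
    """Split a horizontal relation into vertical bit files R11, R12, ..."""
    slices = {}
    for a in range(len(rows[0])):
        for b in range(nbits):
            shift = nbits - 1 - b  # b = 0 is the most significant bit
            slices[f"R{a + 1}{b + 1}"] = [(row[a] >> shift) & 1 for row in rows]
    return slices


def build_p1tree(bits):
    """Build a 1-D P1tree as (bit, children); children is None for pure halves."""
    if all(bits):
        return (1, None)  # pure1: record 1 and stop
    if not any(bits):
        return (0, None)  # pure0: record 0 and stop
    mid = len(bits) // 2  # mixed: record 0 and recurse on both halves
    return (0, (build_p1tree(bits[:mid]), build_p1tree(bits[mid:])))


slices = bit_slices(ROWS)
p11 = build_p1tree(slices["R11"])
print(slices["R11"])
print(p11)
```

The nested result mirrors steps 1 to 7 on the slide: a 0 at the root, a childless 0 for the pure0 first half, and a mixed second half that bottoms out in the pure1 quarter.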

Ptrees
Run-length-compressed, lossless, vertical structures representing the data in a way that facilitates fast horizontal AND-processing.
- The jury is still out on which parallelization approach is best: vertical (by relation), horizontal (by tree node), or some combination.
  - Horizontal is pretty, but network multicast overload eats us alive (maybe active networking? clusters of Playstations? ...)
- The most useful form of a Ptree is the predicate-Ptree. E.g., when we use the “pure 1” predicate, as in the previous example, they are called Pure1-trees or P1trees (1-bit at a node iff the corresponding half is pure1).
- There are many other useful predicates, e.g., NonPure0-trees or NP0trees (1 iff the half is not pure0), but we will focus on P1trees.
- The Ptrees on the previous slide were 1-dimensional (recursively halving bit files), but they can be 2-D (recursively quartering), 3-D, ...
- Ptrees for 2-D spatial data are usually 2-dimensional (recursively quartering, in Peano order).
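To make the idea of a predicate-Ptree concrete, here is a minimal sketch in which the predicate is a parameter. The names build_predicate_tree, pure1 and non_pure0 and the (bit, children) node layout are assumptions made for this illustration, not the authors' implementation. The input is the bit file R11 = 00001011 from the previous example.

```python
def pure1(bits):
    """Predicate for P1trees: true iff every bit in the half is 1."""
    return all(bits)


def non_pure0(bits):
    """Predicate for NP0trees: true iff the half is not purely 0-bits."""
    return any(bits)


def build_predicate_tree(bits, predicate):
    """Record the predicate's truth at each node; recurse only on mixed halves."""
    bit = int(predicate(bits))
    if all(bits) or not any(bits):  # pure half: nothing more to record below
        return (bit, None)
    mid = len(bits) // 2
    return (
        bit,
        (
            build_predicate_tree(bits[:mid], predicate),
            build_predicate_tree(bits[mid:], predicate),
        ),
    )


r11 = [0, 0, 0, 0, 1, 0, 1, 1]
p1 = build_predicate_tree(r11, pure1)
np0 = build_predicate_tree(r11, non_pure0)
print(p1)
print(np0)
```

The two trees have the same shape and differ only in the bits stored at mixed nodes: 0 in the P1tree, 1 in the NP0tree.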

A 2-Dimensional P1tree
A node is 1 iff that quadrant is purely 1-bits.
A bit-file (from, e.g., a 2-D image):
1111110011111000111111001111111011110000111100001111000001110000
The corresponding raster-ordered matrix:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
Run-length compress the raster-ordered matrix using Peano order. The resulting P1tree, one line per level:
0
1000
0010 1101
1110 0010 1101
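The quartering can be sketched as follows, using the 64-bit file from this slide. The quadrant order (upper-left, upper-right, lower-left, lower-right) is the one that reproduces the slide's level strings; the function names and the (bit, children) node layout are assumptions for the illustration.

```python
BIT_FILE = "1111110011111000111111001111111011110000111100001111000001110000"
N = 8  # the bit file is an 8 x 8 image in raster order
MATRIX = [[int(ch) for ch in BIT_FILE[r * N:(r + 1) * N]] for r in range(N)]


def build_p1tree_2d(m, r0=0, c0=0, size=None):
    """Build a 2-D P1tree as (bit, children); recurse on mixed quadrants only."""
    if size is None:
        size = len(m)
    cells = [m[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    if all(cells):
        return (1, None)  # pure1 quadrant
    if not any(cells):
        return (0, None)  # pure0 quadrant
    h = size // 2
    # Peano order of the quadrants: upper-left, upper-right, lower-left, lower-right.
    offsets = ((0, 0), (0, h), (h, 0), (h, h))
    kids = tuple(build_p1tree_2d(m, r0 + dr, c0 + dc, h) for dr, dc in offsets)
    return (0, kids)


def levels(tree):
    """Return the tree's bits level by level (breadth-first), one string per level."""
    out = []
    frontier = [tree]
    while frontier:
        out.append("".join(str(bit) for bit, _ in frontier))
        frontier = [k for _, kids in frontier if kids for k in kids]
    return out


tree = build_p1tree_2d(MATRIX)
for line in levels(tree):
    print(line)
```

The printed levels match the tree shown on the slide: 0, then 1000, then 00101101, then 111000101101.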

Peano coordinates (QID)
Raster coordinate order sorts by dimension 1st and bit-position 2nd.
Peano coordinate order sorts by bit-position 1st and dimension 2nd.
Example: the pixel at ( 7, 1 ) = ( 111, 001 ) has Peano coordinate 10.10.11 = 2.2.3.

2-D Count Ptrees (original form), one line per level:
55
16 8 15 16
3 0 4 1    4 4 3 4
1110    0010    1101

Counts are the ultimate goal, but we use predicate Ptrees (e.g., P1trees) because they are more compressed, facilitate faster ANDing, and produce the needed counts quite quickly.
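The Peano coordinate of a pixel can be obtained by interleaving the bits of its two raster coordinates, most significant bit first. The helper below is a sketch; the name peano_qid is made up for the example, and which raster coordinate is the row and which is the column is left open, as on the slide.

```python
def peano_qid(x, y, nbits):
    """Interleave the bits of (x, y), high bit first, into base-4 quadrant digits."""
    digits = []
    for i in range(nbits - 1, -1, -1):
        xb = (x >> i) & 1
        yb = (y >> i) & 1
        digits.append(2 * xb + yb)  # the bit pair "xb yb" read as one quadrant digit
    return digits


qid = peano_qid(7, 1, 3)
print(".".join(str(d) for d in qid))
```

For the slide's pixel ( 7, 1 ) = ( 111, 001 ) the bit pairs are 10, 10 and 11, so the helper returns the quadrant path 2.2.3.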

CEASR and a 3-D diagram  With animation arrows? If so, go back first Then left, Then down.  Point out: Add additional bits to increase intensity resolution (granularity)? Add DANA slides.

Logical Operations on P-trees (used to get counts of any pattern)
The Ptree AND operation is faster than the bit-by-bit AND operation, since there are shortcuts: wherever one operand has a pure0 node, the result is pure0 and the other operands need not be examined there. E.g., only quadrant 2 needs to be loaded to AND Ptree1, Ptree2, etc.
The more operands there are in the AND, the greater the benefit from this shortcut (more pure0 nodes).
[Figure: Ptree 1, Ptree 2, their AND result, and their OR result.]
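A minimal sketch of the AND shortcut on 1-D P1trees follows. The (bit, children) node layout and the names build_p1tree and ptree_and are assumptions for this illustration; a real implementation would work on the compressed structure rather than on nested tuples. The operands are two bit columns, 00001011 and 11110111, taken from the earlier relation example.

```python
PURE1 = (1, None)
PURE0 = (0, None)


def build_p1tree(bits):
    """Build a 1-D P1tree as (bit, children); children is None for pure halves."""
    if all(bits):
        return PURE1
    if not any(bits):
        return PURE0
    mid = len(bits) // 2
    return (0, (build_p1tree(bits[:mid]), build_p1tree(bits[mid:])))


def ptree_and(a, b):
    """AND two P1trees without expanding them back into bit files."""
    if a == PURE0 or b == PURE0:
        return PURE0  # shortcut: the other operand's subtree is never visited
    if a == PURE1:
        return b      # ANDing with pure1 leaves the other operand unchanged
    if b == PURE1:
        return a
    kids = tuple(ptree_and(x, y) for x, y in zip(a[1], b[1]))
    if all(k == PURE0 for k in kids):
        return PURE0  # collapse so the result stays in compressed form
    return (0, kids)


r11 = [0, 0, 0, 0, 1, 0, 1, 1]
r12 = [1, 1, 1, 1, 0, 1, 1, 1]
result = ptree_and(build_p1tree(r11), build_p1tree(r12))
expected = build_p1tree([x & y for x, y in zip(r11, r12)])
print(result == expected)
```

With more operands the same function can be folded over the list of trees, and every pure0 node met along the way prunes the corresponding subtrees of all remaining operands.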

Ptree dimension
- The dimension of the Ptree structure is a user-chosen parameter.
- It can be chosen to fit the data dimension:
  - Most datasets -> 1-D Ptrees (recursive halving)
  - 2-D images -> 2-D Ptrees (recursive quartering)
  - 3-D solids -> 3-D Ptrees (recursive eighth-ing)
- Or the dimension can be chosen based on other considerations:
  - optimize compression
  - increase processing speed (next slide)

Generalized Raster and Peano Sorting
- Raster sorting: attributes 1st, bit position 2nd.
- Peano sorting: bit position 1st, attributes 2nd.
Generalized raster and Peano sorting generalizes to any table with numeric attributes (not just images).
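The two orderings can be compared on a small table. In this sketch the rows are hypothetical 3-bit attribute pairs and peano_key is an illustrative helper name; raster order is simply the ordinary attribute-by-attribute sort, while the Peano key interleaves the attributes' bits from the highest bit position down.

```python
def peano_key(row, nbits):
    """Sort key with bit position first and attribute second (bit interleaving)."""
    key = 0
    for i in range(nbits - 1, -1, -1):  # highest bit position first
        for value in row:               # then each attribute in turn
            key = (key << 1) | ((value >> i) & 1)
    return key


# Hypothetical rows with two 3-bit numeric attributes.
rows = [(7, 1), (0, 6), (3, 3), (4, 0)]

raster_sorted = sorted(rows)  # attributes first, bit position second
peano_sorted = sorted(rows, key=lambda r: peano_key(r, 3))

print(raster_sorted)
print(peano_sorted)
```

Because the key is built from any number of attributes, the same function applies to tables with more than two numeric columns.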

Generalized Peano Sorting: KNN speed improvement (UCI MLR data sets)
[Bar chart: time in seconds (axis 0 to 120) on the adult, spam, mushroom, function and crop data sets, for three orderings: unsorted, generalized raster, generalized Peano.]

National Virtual Observatory data
- What Ptree dimension and what ordering should be used for astronomical data?
  - Peano Triangular Mesh Tree (PTM-tree)
  - Peano Celestial Coordinate tree (PCCtree)
- Uses (RA, dec) coordinates of the celestial sphere: RA = Right Ascension, dec = declination.

Peano Triangular Mesh Tree (PTM-tree)
- Similar to the Hierarchical Triangular Mesh (HTM) scheme (Sloan Digital Sky Survey project).
- The sphere is divided into triangles.
- Triangle sides are always great circle segments.
- PTM differs from HTM in the way in which the triangles are ordered.

PTM-Ordering the Triangular Mesh
[Figure: triangle 1 subdivided into triangles 1,0 1,1 1,2 1,3; triangle 1,3 subdivided into 1,3,0 1,3,1 1,3,2 1,3,3.]

The Half Sphere up to 3 Levels
Traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point.
[Figure axes: RA, dec.]

PTM-tree up to 3 Levels
[Figure: tree with branches labeled L R L R, L R L R.]

PTM-tree up to 4 Levels
[Figure: tree with branches labeled alternately L R L R and R L R L.]

Peano Celestial Coordinate Trees (PCCtrees)
- Unlike the PTM-tree, which partitions the sphere into the 8 faces of an octahedron, the PCCtree scheme transforms the sphere into a cylinder, then into a rectangle, and then uses regular Peano coordinates.
- Celestial coordinates: RA runs from 0° to 360°; dec runs from -90° to 90°.

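PRAdec-scheme: Sphere -> Cylinder -> Plane
[Figure: the sphere (axis Z, north plane and south plane) is mapped to a cylinder and then unrolled into a plane, with dec running from 90° through 0° to -90° and RA from 0° to 360°.]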

Graph data (many-to-many self relations) “Everything should be made as simple as possible, but not simpler” Albert Einstein

Representing graphs
Examples:
- Genomics: protein-protein interactions (ACM KDD-Cup ’02). Focus is on node structure.
- WWW: focus is on link structure.
- Publication citations (ACM KDD-Cup ’03): focus is on both.
(Scientific American 05/03)

Genomics: Gene-Experiment-Organism Cube
A cell is 1 iff that gene from that organism expresses at a threshold level in that experiment. This is a many-to-many-to-many relationship.
[Figure: 0/1 cube with gene axis g0 g1 g2 g3, experiment axis e0 e1 e2 e3 and organism axis o0 o1 o2 o3.]

Organism Dimension Table
Genome Size (million bp)   Vert   Species                    Organism
3000                       1      Mus musculus               mouse
12.1                       0      Saccharomyces cerevisiae   yeast
185                        0      Drosophila melanogaster    fly
3000                       1      Homo sapiens               human

Gene Dimension Table
PolyA-Tail         0     0     1     1
StopCodonDensity   .9    .1
Function           apop  mito  meio  apop
SubCell-Location   Ribo  Nucl  Ribo  Myta

Experiment Dimension Table (MIAME)
Column labels: NMHS, AD, ED, STZ, CTY, STR, UNV, PI, LAB
Rows: 1asa42, 1aca42, 0hsb22, 1hca23

Gene-Org Dim Table (chromosome, length)
17, 78    12, 60    Mi, 40    1, 48
10, 75    0         0         7, 40
0         14, 65    0         0
16, 76    0         9, 45     Pl, 43

Protein-Protein Interactions (PPI) (2-hop interactions)
[Figure: gene-by-gene interaction matrices over g0 g1 g2 g3. Recoverable bit rows: 0100 0101 0110 1001 and 100 011 011 01 01 0.]

Gene Dimension Table
PolyA-Tail         0     0     1     1
StopCodonDensity   .9    .1
Function           apop  mito  meio  apop
SubCell-Location   Ribo  Nucl  Ribo  Myta

Gene Dimension Table (Binary)
Column labels: Poly-A, SCD1, Mito, Meio, apop, Nucl, Ribo, Myta, SCD2, SCD3, SCD4
GENE   bits
g4     01001001010
g3     01000100100
g2     11000010010
g1     11000101001
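One plausible reading of "2-hop interactions" is that two genes are related when their proteins both interact with a common third protein. Under that assumption, the 2-hop relation is the Boolean product of the interaction matrix with itself. The adjacency matrix below is a hypothetical chain g0 - g1 - g2 - g3, not the matrix from the slide, and two_hop is an illustrative name.

```python
def two_hop(adj):
    """Boolean square of an adjacency matrix, ignoring the diagonal."""
    n = len(adj)
    return [
        [
            int(i != j and any(adj[i][k] and adj[k][j] for k in range(n)))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical symmetric PPI matrix for the chain g0 - g1 - g2 - g3.
ADJ = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

hops = two_hop(ADJ)
for row in hops:
    print(row)
```

Here g0 and g2 become 2-hop neighbors through g1, and g1 and g3 through g2.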

KDD-Cup ’02 results NDSU Team

Greyware PPI graph mining tool
- Visualize feature information using a glyph for each gene (PPI graph node).
- There is a PPI edge iff the 2 genes code for interacting proteins.
This visual data mining tool was effective in KDD-Cup ’02.

Gene Dimension Table (non-binary)
Column labels: GENE, Info-qty, SCD (stop codon density), Mito, Meio, apop, Nucl, Ribo, Myta, length, essential, Dis-center
g4   2809.9   001010
g3   5004.1   100100
g2   1506     010010
g1   4114 o   101001
[Figure: glyph for g1.]

Thanks so much!
- Don’t forget to submit your best work to CAINE, Nov 11-13, 2003, Las Vegas NV, by July 1. Submit to the Program Chair, kendall.nygard@ndsu.nodak.edu, or the Conference Chair, william.perrizo@ndsu.nodak.edu. See http://www.cs.ndsu.nodak.edu/~krile/caine03 or http://www.isca-hq.org
- For those interested in DM in genomics and bioinformatics: Virtual Conference in Genomics and Bioinformatics, VGAB-III, Sep 16-18, 2003, http://www.ndsu.edu/~virtual-genomics. Submit papers to the Program Chair, willy.valdivia@ndsu.nodak.edu, or to the Conference Chair, william.perrizo@ndsu.nodak.edu.
- VGAB-III will be available over Access Grid and Real Player to anywhere, for free (no registration fee).
