Data Mining and Data Warehousing, many-to-many Relationships, applications DataSURG (Database Systems Users and Research Group) North Dakota State University.


1 Data Mining and Data Warehousing, many-to-many Relationships, applications DataSURG (Database Systems Users and Research Group) North Dakota State University Fargo, North Dakota USA dataSURG.ite@ndsu.nodak.edu

2 DM on a DW vs. TP on a DB: the Workload-on-Repository Question
• C.J. Date, circa 1980: transactions on a DBMS vs. file processing programs on file systems.
  “Use a DBMS instead of file systems! Unifies data resources, centralizes control, promotes standards and consistency, eliminates redundancy, increases data value (wider data usage)”
• Circa 1990: “Buy a separate DW for DM” (separate from your DBMS for TP).
  The result: 2 separate, quite redundant, non-sharing, inconsistent systems!
• What happened?
  Great marketing success! (It sold more hardware and software.)
  Great Concurrency Control R&D failure! We failed to integrate transactions and queries (OLTP and OLAP, i.e., updates and reads) in one system with acceptable performance.
  The marketing was so successful that nobody noticed the failure!

3 OUTLINE
I still hold out hope that DW and DB will eventually be unified again. I believe the industry will eventually demand it. Already, there is work on updating DWs! For now, let’s just focus on DATA and DM.
• I consider DM to be on the unstructured side of querying. There you immediately run up against two curses:
  Curse of non-scalability (solutions don’t scale with data volume).
  Curse of dimensionality (solutions don’t scale with data dimension).
• I will talk about techniques we have used to address the curses:
  Process vertically structured data horizontally (instead of the ubiquitous vertical processing of horizontal data, i.e., the record orientation).
  Parallelize the DM engine:
  • Parallelize the software DM engine on clusters of computers.
  • Parallelize the greyware DM engine on clusters of people (i.e., browser-enable all software for visual data mining).

4 Data mining finds information in data. Why do we need Data Mining?
• Data volume expands by Parkinson’s Law: data volume expands to fill available data storage.
• Disk storage expands by Moore’s law: capacity ∝ 2^(t / 9 months), i.e., available storage doubles every 9 months!
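As a quick arithmetic check of the doubling claim, here is a minimal sketch. The function name `capacity_growth` and the chosen time spans are illustrative assumptions; only the 9-month doubling period comes from the slide.

```python
def capacity_growth(months, doubling_months=9.0):
    """Growth factor after `months`, assuming capacity doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # The time spans below are arbitrary illustration points, not figures from the slide.
    for years in (1, 5, 10):
        print(f"after {years:2d} years: x{capacity_growth(12 * years):,.0f}")
```

Under this assumption a decade of growth is a factor of roughly ten thousand, which is the scale of increase the next slide's projections rely on.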

5 We’re awash with data!
• Network data: hi-speed, DWDM, all-optical (management, flow classification, QoS, security): 10 terabytes by 2003 (~10^13 B).
• US EROS Data Center (EDC) archives Earth Observing System (EOS) Remotely Sensed Imagery (RSI), satellite and aerial photo data: 10 petabytes by 2005 (~10^16 B).
• National Virtual Observatory (aggregated astronomical data): 10 exabytes by 2010 (~10^19 B).
• Sensor data from sensor networks (micro- and nano-sensor networks): 10 zettabytes by 2015 (~10^22 B).
• WWW (and other text collections) will continue to grow: 10 yottabytes by 2020 (~10^25 B).
• Micro-arrays, gene-chips, genome sequence data: 10 gazillabytes by 2030 (~10^28 B?).
• Stock market prediction data (prices + all the above? especially astronomy data?): 10 supragazillabytes by 2040 (~10^31 B?).
Useful information must be teased out of these large volumes of data through data mining.
(I had to make up the last two names: projected data sizes are overrunning our ability to name them!)

6 More’s Law: More’s Less
The more volume, the less information. (AKA: Shannon’s Canon)
A simple illustration: which phone book has more info? (Both involve the same 4 data granules.)

BOOK-1              BOOK-2
Name   Number       Name   Number
Smith  234-9816     Smith  234-9816
Jones  231-7237     Smith  231-7237
                    Jones  234-9816
                    Jones  231-7237

Data mining reduces volume and raises the information level.
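One way to make the phone-book comparison concrete is to measure how much a name tells you about a number. The sketch below uses mutual information for that; this measure is an assumption chosen for illustration, not something the slide specifies, and `mutual_information` is a hypothetical helper.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Mutual information (in bits) between the two columns of a list of pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    names = Counter(name for name, _ in pairs)
    numbers = Counter(number for _, number in pairs)
    return sum(
        (c / n) * log2((c / n) / ((names[a] / n) * (numbers[b] / n)))
        for (a, b), c in joint.items()
    )


BOOK_1 = [("Smith", "234-9816"), ("Jones", "231-7237")]
BOOK_2 = [("Smith", "234-9816"), ("Smith", "231-7237"),
          ("Jones", "234-9816"), ("Jones", "231-7237")]

if __name__ == "__main__":
    print("BOOK-1:", mutual_information(BOOK_1), "bit(s)")
    print("BOOK-2:", mutual_information(BOOK_2), "bit(s)")
```

BOOK-2 lists every name with every number, so under this measure knowing the name says nothing about the number, even though the book is twice as long.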

7 Precision Agriculture Data Mining
[Figure: TIFF image and yield map.]
The dataset consists of an aerial photograph (TIFF image taken during the growing season) and a synchronized yield map (crop yield taken at harvest). Altogether there are 4 feature attributes (B, G, R, Y) and ~100,000 pixels. A producer wants to know the relationship between the color intensities and yield. One hypothesis, the Association Rule hi_green and low_red → hi_yield, is intuitive and could be made and verified without data mining (simple querying). Data mining has found a stronger rule: hi_NIR and low_red → very_hi_yield. So many producers use VIR instead of RGB cameras to get the better information.

8 Another Precision Agriculture Data Mining Example: Grasshopper Infestation Prediction (again involving RSI data)
Grasshoppers cause significant economic loss each year. Early infestation prediction is key to damage control. Pixel classification on remotely sensed imagery holds significant promise for early detection. Pixel classification (signaturing) has many applications: pest detection, forest fire detection, wetlands monitoring, … (For signaturing we developed the SMILEY software/greyware system: http://midas.cs.ndsu.nodak.edu/~smiley)

9 Sensor Network Data Mining
• Micro- and nano-scale sensor blocks are being developed for sensing:
  Biological agents
  Chemical agents
  Motion detection
  Coatings deterioration
  RF-tagging of inventory
  Structural materials fatigue
There will be trillions++ of individual sensors creating mountains of data. The data must be mined for its information.

10 Operational Capability: Sensor Network Application: CubE for Active Situation Replication (CEASR)
Proposed Technical Approach: ability to sense chemical, vibrational, biological, and thermal conditions in real time (over hills, etc.).
The problems to be solved include:
1. Communication between sensor field(s) and CEASR.
2. Nano-sensors must be position registered.
3. Data fusion architecture must be developed.
4. Fluidic Self Assembly (FSA) of the Cube for Active Situation Replication (CEASR). FSA is an Alien Technology patented process capable of producing clear flexible substrates with embedded nano-LED display units.
Each energized nano-sensor must transmit a ping and its location (QID). These QIDs are then translated to 3-dimensional coordinates at the display device. The corresponding voxel on the display lights up. A more sophisticated CEASR device would sense and transmit the intensity levels, and the display device would light up with the corresponding intensity.
Using Alien Technology’s Fluidic Self Assembly (FSA) technology, clear sheets with embedded nano-LED elements will be laminated together to produce a clear visualization cube with a nano-LED at each pixel corresponding to nano-sensors at each pixel of the actual space. The nano-sensors will turn on their corresponding CEASR display nano-LED when a threshold is sensed (chemical, vibrational, biological, thermal, …). These sensed patterns will be transmitted, fused and displayed in real time.
[Figure: situation space (with nano-sensors) and the real-time replica of the sensed pattern (on a wristwatch?).]

11 Anthropological Application: Digital Archive Network for Anthropology (DANA): data mine anthropological artifacts (shape, color, discovery location, …).

12 Astronomy Application: The celestial sphere RA dec

13 Data Mining? Querying is asking specific questions and expecting specific answers. Data Mining goes into the MOUNTAIN of DATA and, hopefully, returns information gems. But also some fool’s gold? Relevance and interestingness analysis serves to assay those gems (it helps pick out the valuable ones).

14 Data Mining Process
Data mining: the core of the knowledge discovery process.
Mountain of Raw Data
→ Data Cleaning/Integration: missing data, outliers, noise, errors
→ Data Warehouse: cleaned, integrated, read-only, periodic, historical raw database
→ Task-relevant Data Selection: feature extraction, tuple selection
→ Data Mining: OLAP, Classification, Clustering, ARM
→ Pattern Evaluation and Assay

15 Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
• Standard querying: SQL SELECT FROM WHERE; complex queries (nested, EXISTS, ..)
• Searching and aggregating: FUZZY query, search engines, BLAST searches; OLAP (rollup, drilldown, slice/dice, ..)
• Machine Learning / Data Mining: supervised learning (classification, regression); unsupervised learning (clustering); Association Rule Mining; Data Prospecting; Fractals, …
On the Query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record ’02). On the Data Mining end, the surface has barely been scratched. But even those scratches had a great impact: one of the early scratchers became the biggest corporation in the world last year, while a non-scratcher filed for bankruptcy (Walmart vs. KMart).

16 Our Approach
• Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees (Ptrees in either case) (1), processed horizontally (DBMSs process horizontal data vertically).
  Ptrees are data-mining-ready, compressed data structures that attempt to address the curses of non-scalability and dimensionality.
• And a compressed, OLAP-ready data warehouse structure, the Pcube (1).
  Pcubes facilitate OLAP operations and query processing, using the Ptree data structure.
(1) Technology is patent pending by North Dakota State University.

17 Ptrees: vertical partitioning and compression
A table, R(A1..An), is a horizontal structure (a set of horizontal records) that is normally processed vertically (vertical scans). Ptrees instead vertically partition the table, compress each vertical bit slice into a basic Ptree, and horizontally process these Ptrees using one multi-operand logical AND operation.

R(A1 A2 A3 A4), 3 bits per attribute (the last row occurs twice):
A1  A2  A3  A4
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 001 100
111 000 001 100

Vertical bit slices: R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43, where Rij is bit j of attribute Ai read down the table. E.g., R11 = 0000 1011.

1-Dimensional Ptrees are built by recording the truth of the predicate “pure1” recursively on halves, until there is purity. For P11 (built from R11):
1. Whole file is not pure1 → 0
2. 1st half is not pure1 → 0
3. 2nd half is not pure1 → 0
4. 1st half of 2nd half is not pure1 → 0
5. 2nd half of 2nd half is pure1 → 1
6. 1st half of 1st half of 2nd half is pure1 → 1
7. 2nd half of 1st half of 2nd half is not pure1 → 0
P11 by level: 0 / 0 0 / 0 1 / 1 0

E.g., to count the occurrences of 111 000 001 100, use “pure111000001100”:
P11 ^ P12 ^ P13 ^ P’21 ^ P’22 ^ P’23 ^ P’31 ^ P’32 ^ P33 ^ P41 ^ P’42 ^ P’43
The result tree is 0 at the 2^3-level, 0 0 at the 2^2-level and 0 1 at the 2^1-level, so the count is 1 × 2^1 = 2.
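The vertical partitioning and the recursive “pure1” construction can be sketched as follows. This is a minimal illustration, not the patented Ptree implementation: the function names are hypothetical, and the table is assumed to have 8 rows with the last row repeated, which is what R11 = 00001011 and the stated count of 2 imply.

```python
# Table R from the slide: 12 bits per row (A1..A4, 3 bits each).
# Assumption: 8 rows, the last one occurring twice.
ROWS = [
    "010111110001",
    "011111110000",
    "010110101001",
    "010111101111",
    "101010001100",
    "010010001101",
    "111000001100",
    "111000001100",
]


def bit_slices(rows):
    """Vertical partition: slice k holds bit k of every row, top to bottom."""
    return ["".join(row[k] for row in rows) for k in range(len(rows[0]))]


def p1tree_levels(bits):
    """Node bits of a 1-D pure1 tree, level by level; only mixed halves are split."""
    levels, frontier = [], [bits]
    while frontier:
        # A node is 1 iff its segment contains no 0 (i.e., it is pure1).
        levels.append("".join("1" if "0" not in seg else "0" for seg in frontier))
        frontier = [
            half
            for seg in frontier
            if "0" in seg and "1" in seg  # pure0 and pure1 segments are leaves
            for half in (seg[: len(seg) // 2], seg[len(seg) // 2:])
        ]
    return levels


if __name__ == "__main__":
    r11 = bit_slices(ROWS)[0]
    print("R11 =", r11)
    print("P11 levels:", p1tree_levels(r11))
```

The printed levels for R11 follow the seven construction steps listed on the slide.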

18 Horizontal Processing of Vertical Structures: History
• In the 1980s it was proposed for DBMSs and record-based workloads:
  Decomposition Storage Model (DSM, Copeland et al.)
  Attribute Transposed File (ATF) (Band Sequential (BSQ) in RSI)
  Bit Transposed File (BTF, Wang et al.)
• These initiatives didn’t last. Why?

19 Horizontal Processing of Vertical Structures for Record-based Workloads
• For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing?
• For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, so there is no reconstructive post-processing?
[Figure: the table R(A1 A2 A3 A4) and its vertical bit slices R11 … R43.]

20 Run Lists: Generalized Ptrees
Run Lists generalize Ptrees by using standard run-length compression of the vertical bit files (alternatively, Lempel-Ziv? Golomb? other?).
A Run List records the type and start offset of each pure run. E.g., RL11, for R11 = 0000 1011:
1. 1st run is Pure0 → 0:000
2. 2nd run is Pure1 → 1:100
3. 3rd run is Pure0 → 0:101
4. 4th run is Pure1 → 1:110
RL11 = 0:000 1:100 0:101 1:110 (to complement, flip the purity bits)

Run Lists for all bit slices of R(A1 A2 A3 A4):
RL11: 0:000 1:100 0:101 1:110
RL12: 1:000 0:100 1:101
RL13: 0:000 1:001 0:010 1:100 0:101 1:110
RL21: 1:000 0:100
RL22: 1:000 0:110
RL23: 1:000 0:010 1:011 0:100
RL31: 1:000 0:100
RL32: 1:000 0:010
RL33: 0:000 1:010
RL41: 0:000 1:010
RL42: 0:000 1:010 0:101
RL43: 1:000 0:001 1:010 0:100 1:101 0:110

E.g., to count the occurrences of 111 000 001 100, use “pure111000001100”:
RL11 ^ RL12 ^ RL13 ^ RL’21 ^ RL’22 ^ RL’23 ^ RL’31 ^ RL’32 ^ RL33 ^ RL41 ^ RL’42 ^ RL’43
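The run-list encoding described on this slide can be sketched in a few lines. The helper names (`run_list`, `complement`, `expand`, `show`) are hypothetical, and the representation (a list of `(bit, start_offset)` pairs) is one plausible reading of “type and start-offset of pure runs”.

```python
def run_list(bits):
    """Encode a bit string as (bit, start_offset) pairs, one per pure run."""
    runs = []
    for offset, bit in enumerate(bits):
        if not runs or runs[-1][0] != bit:
            runs.append((bit, offset))
    return runs


def complement(runs):
    """Complement a run list by flipping the purity bits; offsets are unchanged."""
    return [("1" if bit == "0" else "0", offset) for bit, offset in runs]


def expand(runs, length):
    """Decode a run list back into a bit string of the given total length."""
    out = []
    for i, (bit, start) in enumerate(runs):
        end = runs[i + 1][1] if i + 1 < len(runs) else length
        out.append(bit * (end - start))
    return "".join(out)


def show(runs, width=3):
    """Format as on the slide, e.g. '0:000 1:100' (offsets in binary)."""
    return " ".join(f"{bit}:{offset:0{width}b}" for bit, offset in runs)


if __name__ == "__main__":
    rl11 = run_list("00001011")
    print("RL11  =", show(rl11))
    print("RL'11 =", show(complement(rl11)))
```

Note that the total length has to be stored alongside the run list, since the last run has no explicit end offset.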

21 RunList-trees? (RLtrees)
• To facilitate subsetting (isolating a subset) and processing, a Ptree structure can be constructed on top of the RunList using the “pure1” predicate.
For R11 = 0000 1011, with RL11 = 0:000 1:100 0:101 1:110:
1. Whole file is not pure1 → 0
2. 1st half is not pure1 → 0
3. 2nd half is not pure1 → 0
4. 1st half of 2nd half is not pure1 → 0
5. 2nd half of 2nd half is pure1 → 1
6. 1st half of 1st half of 2nd half is pure1 → 1
7. 2nd half of 1st half of 2nd half is not pure1 → 0
Tree by level: 0 / 0 0 / 0 1 / 1 0

22 RunList-trees continued
• Alternatively, separate NotPure0 (NP0) index trees could be built, where the predicate is NotPure0 (also note that the tree could be terminated at a given level). First, AND the NP0 index trees. Then only the 1-branches of the result need to be ANDed through list scans. The more operands, the fewer 1-branches.
For R11 = 0000 1011, with RL11 = 0:000 1:100 0:101 1:110:
1. Whole file is true → 1
2. 1st half is false → 0
3. 2nd half is true → 1
4. 1st half of 2nd half is true → 1
5. 2nd half of 2nd half is true → 1
6. 1st half of 1st half of 2nd half is true → 1
7. 2nd half of 1st half of 2nd half is false → 0
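The “AND the index first, then scan only the 1-branches” idea can be illustrated with a deliberately simplified sketch. It uses a flat, single-level NotPure0 index (one flag per fixed-size block) rather than a full tree terminated at a given level; that simplification, the block size, and the function names are all assumptions made for illustration.

```python
def np0_index(bits, block):
    """One flag per block: True iff the block is NotPure0 (contains at least one 1)."""
    return ["1" in bits[i:i + block] for i in range(0, len(bits), block)]


def indexed_and(slices, block):
    """AND equal-length bit strings, scanning only blocks whose ANDed NP0 flag is True.

    Returns (result_bits, number_of_blocks_actually_scanned).
    """
    live = [all(flags) for flags in zip(*(np0_index(s, block) for s in slices))]
    pieces, scanned = [], 0
    for k, is_live in enumerate(live):
        lo, hi = k * block, min((k + 1) * block, len(slices[0]))
        if is_live:
            scanned += 1
            pieces.append("".join(
                "1" if all(s[i] == "1" for s in slices) else "0"
                for i in range(lo, hi)))
        else:
            # Some operand is pure0 on this block, so the result is pure0 too.
            pieces.append("0" * (hi - lo))
    return "".join(pieces), scanned


if __name__ == "__main__":
    # R11, R12, R13 from the running example.
    result, scanned = indexed_and(["00001011", "11110111", "01001011"], block=2)
    print(result, "blocks scanned:", scanned, "of 4")
```

With more operands, more blocks are ruled out by the index alone, which is the effect the slide describes as “the more operands, the fewer 1-branches”.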

23 Ptrees
Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing.
• The jury is still out on the best parallelization approach: vertical (by relation), horizontal (by tree node), or some combination.
  Horizontal parallelization is pretty, but the network multicast overload is huge. Use active networking? Clusters of Playstations? ...
• The most useful form of a Ptree is the Pure1-tree or P1tree: a 1-bit at a node iff the corresponding half is pure1.
  There are many other useful predicates, e.g., NonPure0-trees (previously shown). We will focus on P1trees.
• All Ptrees shown so far were 1-dimensional (recursively halving bit files), but they can be 2-D (recursively quartering, e.g., for 2-D images), 3-D (recursively eighth-ing), …

24 A 2-Dimensional P1tree
A bit-file (from, e.g., a 2-D image):
1111110011111000111111001111111011110000111100001111000001110000
The corresponding raster-ordered matrix:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
Run-length compress the matrix using Peano order. A node is 1 iff that quadrant is purely 1-bits, e.g.:
root: 0
next level: 1000
next level: 0010 1101
leaf level: 1110 0010 1101
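The quadrant-by-quadrant construction can be checked against the slide's example with a short sketch. It assumes the quadrant order upper-left, upper-right, lower-left, lower-right (the usual Peano/Z order) and that only mixed quadrants are subdivided; `p1tree_levels_2d` is a hypothetical name.

```python
BIT_FILE = "1111110011111000111111001111111011110000111100001111000001110000"


def p1tree_levels_2d(bits, side):
    """Node bits of a 2-D pure1 tree, level by level, for a side x side raster image."""
    grid = [bits[r * side:(r + 1) * side] for r in range(side)]

    def cells(r, c, n):
        """Set of bit values found in the n x n quadrant with top-left corner (r, c)."""
        return {grid[i][j] for i in range(r, r + n) for j in range(c, c + n)}

    levels, frontier = [], [(0, 0, side)]
    while frontier:
        level_bits, next_frontier = [], []
        for r, c, n in frontier:
            found = cells(r, c, n)
            level_bits.append("1" if found == {"1"} else "0")
            if len(found) == 2:  # mixed quadrant: split into 4 sub-quadrants
                h = n // 2
                next_frontier += [(r, c, h), (r, c + h, h),
                                  (r + h, c, h), (r + h, c + h, h)]
        levels.append("".join(level_bits))
        frontier = next_frontier
    return levels


if __name__ == "__main__":
    for depth, level in enumerate(p1tree_levels_2d(BIT_FILE, 8)):
        print(depth, level)
```

Pure0 and pure1 quadrants are never expanded, which is where the compression comes from.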

25 Alternatively, a Count tree?
Counts are the ultimate goal, but P1trees are more compressed and produce the needed counts quite quickly.
• QID (Quadrant ID): e.g., 2.2.3 = 10.10.11, which identifies the pixel (7, 1) = (111, 001)
• Pure-1/Pure-0 quadrants
• Root Count
• Tree levels: 3, 2, 1, 0 (purity counts of 4^3, 4^2, 4^1, 4^0, respectively)
• Fan-out = 2^dim = 4
Count tree:
level-3 (pure = 4^3): 55
level-2: 16 8 15 16
level-1: 3 0 4 1 and 4 4 3 4
level-0: 1110 0010 1101
[Figure: the 8 x 8 bit matrix with quadrants numbered 0 to 3 and pixel 2.2.3 marked.]
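The QID example (2.2.3 = 10.10.11 → (7, 1) = (111, 001)) amounts to de-interleaving bits. The sketch below assumes that the first bit of each quadrant digit goes to the first coordinate and the second bit to the second coordinate, which reproduces the slide's numbers; both function names are hypothetical.

```python
def qid_to_coords(qid):
    """Translate a quadrant ID such as '2.2.3' into a coordinate pair."""
    first = second = 0
    for digit in map(int, qid.split(".")):
        first = (first << 1) | (digit >> 1)    # high bit of the quadrant digit
        second = (second << 1) | (digit & 1)   # low bit of the quadrant digit
    return first, second


def coords_to_qid(first, second, levels):
    """Inverse mapping: interleave the coordinate bits, most significant first."""
    digits = []
    for bit in range(levels - 1, -1, -1):
        digits.append(str((((first >> bit) & 1) << 1) | ((second >> bit) & 1)))
    return ".".join(digits)


if __name__ == "__main__":
    print(qid_to_coords("2.2.3"))
    print(coords_to_qid(7, 1, levels=3))
```

The same translation, extended to three bits per digit, is what a 3-D display would need to turn sensor QIDs into voxel coordinates.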

26 Logical Operations on Ptrees (used to get counts of any pattern)
• The Ptree AND operation is faster than a bit-by-bit AND because there are shortcuts: any pure0 operand node means the result node is pure0 (e.g., only quadrant 2 needs to be loaded to AND Ptree1, Ptree2, etc.).
• The more operands there are in the AND, the greater the benefit due to this shortcut (more pure0 nodes).
[Figure: Ptree 1, Ptree 2, their AND result, and their OR result.]
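A minimal sketch of the AND shortcut, using 1-D trees for brevity (the slide's figure is 2-D). The node representation (`0` = pure0, `1` = pure1, a pair = mixed) and all function names are assumptions for illustration; the 8-row table is the running example, with the last row assumed to occur twice. Bit-file lengths are assumed to be powers of two.

```python
ROWS = [
    "010111110001", "011111110000", "010110101001", "010111101111",
    "101010001100", "010010001101", "111000001100", "111000001100",
]


def build(bits):
    """P1tree node: 1 = pure1, 0 = pure0, (left, right) = mixed."""
    if "0" not in bits:
        return 1
    if "1" not in bits:
        return 0
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))


def complement(node):
    """Complement a tree by swapping pure0 and pure1 leaves."""
    if isinstance(node, tuple):
        return (complement(node[0]), complement(node[1]))
    return 1 - node


def tree_and(a, b):
    """AND two trees; a pure0 operand ends the recursion immediately."""
    if a == 0 or b == 0:  # the shortcut: the other operand is never visited
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = tree_and(a[0], b[0]), tree_and(a[1], b[1])
    if left == right and not isinstance(left, tuple):
        return left  # both halves pure and equal: collapse the node
    return (left, right)


def root_count(node, size):
    """Number of 1-bits represented by a tree covering `size` positions."""
    if isinstance(node, tuple):
        return root_count(node[0], size // 2) + root_count(node[1], size // 2)
    return node * size


def count_pattern(rows, pattern):
    """Count rows equal to `pattern` by ANDing bit-slice trees (complemented where the pattern bit is 0)."""
    result = 1  # pure1 is the identity for AND
    for k, want in enumerate(pattern):
        tree = build("".join(row[k] for row in rows))
        result = tree_and(result, tree if want == "1" else complement(tree))
    return root_count(result, len(rows))


if __name__ == "__main__":
    print(count_pattern(ROWS, "111000001100"))
```

Once an intermediate result has a pure0 subtree, every later operand's corresponding subtree is skipped, so the shortcut pays off more as operands are added.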

27 Operational Capability: Sensor Network Application: CubE for Active Situation Replication (CEASR) (repeats slide 10)

28 3-D Application: Digital Archive Network for Anthropology (DANA): data mine anthropological artifacts (shape, color, …).

29 3-Dimensional Ptrees (e.g., for the CEASR sensor network or the Digital Archive Network for Anthropology)

30 Ptree dimension
• The dimension of the Ptree structure is a user-chosen parameter.
  It can be chosen to fit the data dimension:
  Most datasets → 1-D Ptrees (recursive halving)
  2-D images → 2-D Ptrees (recursive quartering)
  3-D solids → 3-D Ptrees (recursive eighth-ing)
  Or the dimension can be chosen based on other considerations: to optimize compression or to increase processing speed (next slide).

31 Generalized Raster and Peano Sorting
• Raster Sorting: attributes 1st, bit position 2nd.
• Peano Sorting: bit position 1st, attributes 2nd.
• Both generalize to any table with numeric attributes (not just images).
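The two sort orders differ only in how the attribute bits are concatenated into a sort key. The sketch below is one reading of “attributes first, bit position second” versus “bit position first, attributes second”; the key functions and the sample rows are illustrative assumptions, not data from the slides.

```python
def raster_key(row, width):
    """Attributes 1st, bit position 2nd: all bits of A1, then all bits of A2, ..."""
    key = 0
    for value in row:
        key = (key << width) | value
    return key


def peano_key(row, width):
    """Bit position 1st, attributes 2nd: the top bit of every attribute, then the next bit, ..."""
    key = 0
    for bit in range(width - 1, -1, -1):
        for value in row:
            key = (key << 1) | ((value >> bit) & 1)
    return key


if __name__ == "__main__":
    # Made-up 2-attribute rows with 2-bit values, purely for illustration.
    rows = [(3, 0), (0, 3), (1, 1), (2, 2)]
    print("raster:", sorted(rows, key=lambda r: raster_key(r, 2)))
    print("peano: ", sorted(rows, key=lambda r: peano_key(r, 2)))
```

Peano order groups rows that agree on the high-order bits of all attributes at once, so neighbors in the sorted table tend to be close in every attribute rather than only in the first one.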

32 Generalized Peano Sorting: KNN speed improvement (UCI MLR data sets)
[Bar chart: time in seconds (0 to 120) on the adult, spam, mushroom, function and crop data sets, for Unsorted, Generalized Raster and Generalized Peano orderings.]

33 National Virtual Observatory data
• What Ptree dimension and what ordering should be used for astronomical data, where all bodies are assumed to be on the surface of a sphere (the celestial sphere, which shares its equatorial plane with the earth and has no specified radius)?
  Peano Triangular Mesh Tree (PTM-tree)
  Peano Celestial Coordinate tree (PCCtree)
• Uses the (RA, dec) coordinates of the celestial sphere:
  RA = Right Ascension (longitudinal angle)
  dec = declination (latitude angle)

34 Peano Triangular Mesh Tree (PTM-tree)
• Similar to the Hierarchical Triangular Mesh (HTM) used in the Sloan Digital Sky Survey project. In both:
  The sphere is divided into triangles.
  Triangle sides are always great circle segments.
• PTM differs from HTM in the way the triangles are ordered.

35 The difference between HTM and PTM-trees is in the ordering. Why use a different ordering?
[Figure: two labelings of triangle 1 and its sub-triangles (1,0 to 1,3, then 1,1,0 to 1,1,3 and 1,3,0 to 1,3,3), captioned “Ordering of HTM” and “Ordering of PTM-tree”.]

36 PTM-Ordering the Triangular Mesh
[Figure: triangle 1 with sub-triangles 1,0 to 1,3 and, within 1,3, sub-triangles 1,3,0 to 1,3,3.]

37 PTM Triangulation of the Celestial Sphere
Traverse the southern hemisphere in the reverse direction (just the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point. This “Peano ordering” produces a sphere-surface-filling curve with good continuity characteristics.
[Figure: celestial sphere with RA and dec axes.]

38 PTM triangulation: Next Level
[Figure: next-level triangles labeled alternately L and R.]

39 PTM-triangulation: Next Level
[Figure: next-level triangles labeled in alternating rows LRLR and RLRL.]

40 Peano Celestial Coordinate Trees (PCCtrees)
• Unlike PTM-trees, which initially partition the sphere into the 8 faces of an octahedron, in the PCCtree scheme:
  the sphere is transformed into a cylinder,
  then into a rectangle,
  then standard Peano ordering is used on the Celestial Coordinates.
• Celestial Coordinates: RA runs from 0° to 360°; dec runs from -90° to 90°.
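A rough sketch of the sphere → rectangle → Peano order idea follows. The slides do not specify the projection or the grid resolution, so this assumes a plain equirectangular unrolling (RA along the columns, dec along the rows, north at the top) onto a 2^levels × 2^levels grid; `pcc_key` is a hypothetical name.

```python
def pcc_key(ra_deg, dec_deg, levels):
    """Peano (Z-order) key of the grid cell containing (RA, dec).

    Assumptions: RA in degrees wraps at 360, dec lies in [-90, 90],
    and the rectangle is cut into 2**levels rows and 2**levels columns.
    """
    n = 1 << levels
    col = min(int((ra_deg % 360.0) / 360.0 * n), n - 1)
    row = min(int((90.0 - dec_deg) / 180.0 * n), n - 1)  # dec = -90 maps to the last row
    key = 0
    for bit in range(levels - 1, -1, -1):
        # Interleave one row bit and one column bit per level, top level first.
        key = (key << 2) | (((row >> bit) & 1) << 1) | ((col >> bit) & 1)
    return key


if __name__ == "__main__":
    for ra, dec in [(0.0, 90.0), (90.0, 0.0), (270.0, -45.0)]:
        print((ra, dec), "->", pcc_key(ra, dec, levels=2))
```

An equal-angle grid like this one makes cells near the poles much smaller in area than cells near the equator, which is one reason a triangular mesh such as the PTM-tree may be preferred.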

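41 PRAdec-scheme: Sphere → Cylinder → Plane
[Figure: the sphere (axis Z) unrolled through a cylinder into a plane; dec runs from 90° (North Plane) through 0° to -90° (South Plane), and RA runs from 0° to 360°.]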

42 Graph data (many-to-many self relations) “Everything should be made as simple as possible, but not simpler” Albert Einstein

43 Representing graphs
Examples:
• Genomics: protein-protein interactions (ACM KDD-Cup ’02). Focus is on node structure.
• WWW: focus is on link structure.
• Publication citations (ACM KDD-Cup ’03): focus is on both.
Scientific American 05/03

44 Genomics: Gene-Experiment-Organism Cube
Many-to-many-to-many relationship: a cube cell is 1 iff that gene from that organism expresses at a threshold level in that experiment.
[Figure: the bit cube with gene axis g0 … g3, experiment axis e0 … e3 and organism axis o0 … o3.]

Organism Dimension Table:
Organism  Species                   Vert  Genome Size (million bp)
human     Homo sapiens              1     3000
fly       Drosophila melanogaster   0     185
yeast     Saccharomyces cerevisiae  0     12.1
mouse     Mus musculus              1     3000

Gene Dimension Table (one value per gene):
SubCell-Location:  Ribo Nucl Ribo Myta
Function:          apop mito meio apop
StopCodonDensity:  .9 .1
PolyA-Tail:        0 0 1 1

Experiment Dimension Table (MIAME): [Table with columns including LAB, PI, UNV, STR, CTY, STZ, ED and AD, one row per experiment e0 … e3.]

Gene-Org Dimension Table (chromosome, length), one cell per gene-organism pair:
17, 78   12, 60   Mi, 40   1, 48
10, 75   0        0        7, 40
0        14, 65   0        0
16, 76   0        9, 45    Pl, 43

45 Protein-Protein Interactions (PPI) (2-hop interactions)
[Figure: PPI bit matrix over genes g0 … g3 and the corresponding 2-hop interaction matrix.]

Gene Dimension Table (one value per gene):
SubCell-Location:  Ribo Nucl Ribo Myta
Function:          apop mito meio apop
StopCodonDensity:  .9 .1
PolyA-Tail:        0 0 1 1

Gene Dimension Table (Binary):
GENE  Poly-A  SCD1 SCD2 SCD3 SCD4  Mito Meio apop  Nucl Ribo Myta
g4    0       1    0    0    1     0    0    1     0    1    0
g3    0       1    0    0    0     1    0    0     1    0    0
g2    1       1    0    0    0     0    1    0     0    1    0
g1    1       1    0    0    0     1    0    1     0    0    1

46 KDD-Cup ’02 results NDSU Team

47 Greyware PPI graph mining tool
• Visualize feature information using a glyph for each gene (PPI graph node).
• There is a PPI edge iff the 2 genes code for interacting proteins.
This visual data mining tool was effective in KDD-Cup ’02.
Gene Dimension Table (non-binary): columns GENE, Info-qty, SCD (stop codon density), Mito, Meio, apop, Nucl, Ribo, Myta, length, essential, Dis-center.
g4 2809.9 001010
g3 5004.1 100100
g2 1506 010010
g1 4114 o 101001
[Figure: glyph for g1.]

48 Thanks so much!
• Don’t forget to submit your best work to CAINE, Nov 11-13, 2003, Las Vegas NV, by July 1. Submit to the Program Chair, kendall.nygard@ndsu.nodak.edu, or the Conference Chair, william.perrizo@ndsu.nodak.edu. See http://www.cs.ndsu.nodak.edu/~krile/caine03 or http://www.isca-hq.org
• For those interested in DM in genomics and bioinformatics: Virtual Conference in Genomics and Bioinformatics, VGAB-III, Sep 16-18, 2003, http://www.ndsu.edu/~virtual-genomics. Submit papers to the Program Chair, willy.valdivia@ndsu.nodak.edu, or to the Conference Chair, william.perrizo@ndsu.nodak.edu.
  VGAB-III will be available over Access Grid and Real Player to anywhere, for free (no registration fee).

