Presentation transcript: "SDM Center: Data Mining and Access Pattern Discovery" (Oak Ridge National Laboratory, U.S. Department of Energy)

Slide 1: Data Mining and Access Pattern Discovery
Subprojects:
- Dimension reduction and sampling (Chandrika, Imola)
- Access pattern discovery (Ghaleb)
- "Run and Render" capability in ASPECT (George, Joel, Nagiza)
Common applications: climate and astrophysics
Common goals:
- Explore data for knowledge discovery
- Knowledge is used in different ways:
  - Explain volcano and El Niño effects on changes in the earth's surface temperature
  - Minimize disk access times
  - Reduce the amount of data stored
  - Quantify correlations between the neutrino flux and stellar core convection, between convection and spatial dimensionality, and between convection and rotation
Common tools that we use: cluster analysis, dimension reduction
The subprojects feed each other: dimension reduction feeds cluster analysis, and ASPECT feeds access pattern discovery

Slide 2: ASPECT: Adaptable Simulation Product Exploration and Control Toolkit
Nagiza Samatova, George Ostrouchov, Faisal AbuKhzam, Joel Reed, Tom Potok & Randy Burris
Computer Science and Mathematics Division, http://www.csm.ornl.gov/
SciDAC SDM ISIC All-Hands Meeting, March 26-27, 2002, Gatlinburg, TN

Slide 3: Team & Collaborators
Team:
- AbuKhzam, Faisal – distributed & streamline data mining research
- Ostrouchov, George – application coordination, sampling & data reduction, data analysis
- Reed, Joel – ASPECT's GUI interface, agents
- Samatova, Nagiza – management, streamline & distributed data mining algorithms in ASPECT, application tie-ins
- Summer students – Java-R back-end interface development
Collaborators:
- Burris, Randy – establishing the prototyping environment in Probe
- Drake, John – many of the ideas were inspired by his work
- Geist, Al – distributed and streamline data analysis research
- Mezzacappa, Tony – TSI application driver
- Million, Dan – establishing software environments in Probe
- Potok, Tom – ORMAC agent framework

Slide 4: Analysis & Visualization of Simulation Products: State of the Art
Post-processing data analysis tools (like PCMDI):
- Scientists must wait for simulation completion
- Can use lots of CPU cycles on long-running simulations
- Can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
Simulation monitoring tools:
- Need simulation code instrumentation (e.g., calls to visualization libraries)
- Interference with the simulation run: taking a snapshot of data can pause the simulation
- Computationally intensive data analysis tasks become part of the simulation
- Synchronous view of data and simulation run
- More control over the simulation

Slide 5: Improvements through ASPECT, a data stream (not simulation) monitoring tool
[Architecture diagram: simulation data flows from disks and tapes in PROBE through ASPECT plug-in modules (FFT, ICA, filters, D4, RACHET) to the GUI interface on the desktop.]
ASPECT's advantages:
- No simulation code instrumentation
- Single data source, multiple views of the data
- No interference with the simulation
ASPECT's drawbacks (e.g., unlike CUMULVS/ORNL):
- No computational steering
- No collaborative visualization
- No high performance visualization

Slide 6: "Run and Render" Simulation Cycle in SciDAC: Our Vision
[Cycle diagram: the TSI simulation runs in the computational environment (SP3); disks and tapes in PROBE provide storage and analysis of simulation data (high-dimensional, distributed, dynamic, massive data management); ASPECT provides data analysis; visualization (scalable, adaptable, interactive, collaborative) closes the loop back to the application scientist. Parts of this cycle already exist in SciDAC; others are missing.]

Slide 7: Approaching the Goal through a Collaborative Set of Activities
- Interact with application scientists (T. Mezzacappa, R. Toedte, D. Erickson, J. Drake)
- Build a workflow environment (Probe)
- Application data analysis research
- CS & math research driven by applications
- ASPECT design & implementation
- Publications, meetings & presentations
- Learn the application domain (problem, software)
- Data preparation & processing

Slide 8: Building a Workflow Environment

Slide 9: 80% => 20% Paradigm in Probe's Research & Application Driven Environment
From frustrations:
- Very limited resources
- General purpose software only
- Lack of interface with HPSS
- Homogeneous platform (e.g., Linux only)
To smooth operation:
- Hardware infrastructure: RS6000 S80 (6 processors, 2 GB memory, 1 TB IDE FibreChannel RAID), 360 GB Sun RAID
- Software infrastructure: compilers (Fortran, C, Java); data analysis (R, Java-R, GGobi); visualization (ncview, GrADS); data formats (netCDF, HDF); data storage & transfer (HPSS, hsi, pftp, GridFTP, MPI-IO)

Slide 10: ASPECT Design and Implementation

Slide 11: ASPECT Front-End Infrastructure
Menu of module categories: data acquisition, data filtering, data analysis, visualization.
[GUI screenshot: create an instance of a module (e.g., NetCDF reader, FFT, filter module, visualization module) and link modules together.]
Functionality:
- Instantiate modules
- Link modules
- Control valid links
- Synchronously control
- Add modules via an XML config file, e.g. entries mapping "Data Acquisition / NetCDF Reader" to datamonitor.NetCDFReader and "Data Filtering / Invert Filter" to datamonitor.Inverter

Slide 12: ASPECT Implementation
- Front-end interface: Java
- Back-end data analysis: R (GNU S-Plus) and C, providing a rich set of data analysis capabilities; Omegahat's Java-R interface (http://omegahat.org)
- Networking layer: ORNL's ORMAC agent architecture based on RMI; other options: servlets, HORB (http://horb.a02.aist.go.jp/horb/), CORBA
- File readers: NetCDF, ASCII, HDF5 (later) (see the reading sketch below)
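The ASPECT back end pulls fields out of NetCDF files for analysis in R. The snippet below is not the ASPECT reader itself, only a minimal sketch of that step using the ncdf4 package; the file name and variable name are hypothetical placeholders.

```r
# Minimal sketch of the R back end reading one variable from a NetCDF file.
# "simulation_output.nc" and "temperature" are illustrative names only.
library(ncdf4)

nc   <- nc_open("simulation_output.nc")   # open the simulation output file
temp <- ncvar_get(nc, "temperature")      # read the whole variable into an array
nc_close(nc)

summary(as.vector(temp))                  # quick look before further analysis
```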

Slide 13: Agents for Distributed Data Processing

Slide 14: Agents and Parallel Computing: Astrophysics Example
- Massive datasets
- A team of agents divides up the task
- Each agent contributes a solution for its portion of the dataset
- Agent-derived partial solutions are merged to create the total solution
- The solution is appropriately formatted for the resource

Slides 15-18: Team of Agents Divide Up Data (progressive build; varying resources)
1) A resource-aware agent receives a request
2) It announces the request to the agent team
3) The team responds
4) The resource-aware agent assembles the responses, formats them for the resource, and hands back the solution
(A divide-and-merge sketch follows below.)
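A minimal sketch of the divide-and-merge pattern described above, written with base R's parallel package rather than the ORMAC agent framework; the synthetic matrix and the column-mean task are assumptions for illustration.

```r
# Sketch of the divide / compute / merge pattern (not the ORMAC agents):
# split the data among workers, let each produce a partial solution,
# then merge the partial solutions into the total solution.
library(parallel)

data   <- matrix(rnorm(10000 * 5), ncol = 5)     # stand-in for a massive dataset
chunks <- split(seq_len(nrow(data)),
                cut(seq_len(nrow(data)), 4, labels = FALSE))   # 4 "agents"

cl <- makeCluster(4)
# each worker computes column means over its own portion of the data
partial <- parLapply(cl, chunks,
                     function(idx, d) colMeans(d[idx, , drop = FALSE]),
                     d = data)
stopCluster(cl)

# merge: size-weighted combination of the partial solutions
sizes <- vapply(chunks, length, integer(1))
total <- Reduce(`+`, Map(`*`, partial, sizes)) / sum(sizes)
total
```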

Slide 19: Distributed and Streamline Data Analysis Research

Slide 20: Complexity of Scientific Data Sets Drives Algorithmic Breakthroughs
Challenge: develop effective & efficient methods for mining scientific data sets.
- Tera- and petabytes: existing methods do not scale in terms of time and storage
- Distributed: existing methods work on a single centralized dataset; data transfer is prohibitive
- High-dimensional: existing methods do not scale up with the number of dimensions
- Dynamic: existing methods work with static data; changes lead to complete re-computation
Example, supernova explosion simulations: 1-D simulation 2 GB, 2-D simulation 1 TB, 3-D simulation 50 TB.

Slide 21: Need to Break the Algorithmic Complexity Bottleneck
Algorithmic complexity of common operations:
- Calculate means: O(n)
- Calculate FFT: O(n log n)
- Calculate SVD: O(r c)
- Clustering algorithms: O(n^2)
Time to process n data points (for illustration, the chart assumes 10^-12 sec of calculation time per data point; reproduced in the sketch below):

  Data size, n | O(n)        | O(n log n)  | O(n^2)
  100 B        | 10^-10 sec  | -           | 10^-8 sec
  10 KB        | 10^-8 sec   | -           | 10^-4 sec
  1 MB         | 10^-6 sec   | 10^-5 sec   | 1 sec
  100 MB       | 10^-4 sec   | 10^-3 sec   | 3 hrs
  10 GB        | 10^-2 sec   | 0.1 sec     | 3 yrs
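A small sketch of the arithmetic behind the table, assuming 10^-12 seconds per elementary operation and one data point per byte (an illustration of the scaling, not a benchmark):

```r
# Reproduce the timing table: time = number of operations * 1e-12 seconds.
per_op <- 1e-12                               # assumed cost of one operation
n      <- c(1e2, 1e4, 1e6, 1e8, 1e10)         # 100 B ... 10 GB of data points
data.frame(size      = c("100B", "10KB", "1MB", "100MB", "10GB"),
           linear    = n * per_op,            # O(n), e.g. means
           nlogn     = n * log2(n) * per_op,  # O(n log n), e.g. FFT
           quadratic = n^2 * per_op)          # O(n^2), e.g. clustering
# n = 1e10 gives about 1e8 seconds in the O(n^2) column, i.e. roughly 3 years.
```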

Slide 22: RACHET: High Performance Framework for Distributed Cluster Analysis
Key idea: perform cluster analysis in a distributed fashion with reasonable data transfer overheads.
Strategy:
- Compute local analyses using distributed agents
- Merge minimum information into a global analysis via peer-to-peer agent collaboration & negotiation
Benefits:
- No need to centralize data
- Linear scalability with data size and with data dimensionality

Slide 23: Paradigm Shift in Data Analysis
Distributed approach (RACHET approach):
- Data distribution is driven by a science application
- Software code is sent to the data
- One-time communication
- No assumptions on hardware architecture
- Provides an approximate solution
Parallel approach:
- Data distribution is driven by algorithm performance
- Data is partitioned by a software code
- Excessive data transfers
- Hardware architecture-centric
- Aims for the "exact" computation

Slide 24: Distributed Cluster Analysis
RACHET merges local dendrograms (one per site, built by intelligent agents) into a global dendrogram that determines the global cluster structure of the data (see the sketch below).
Notation: N is the data size, S is the number of sites, k is the number of dimensions; the number of sites is much smaller than the data size (|S| << N), and the overall cost is O(N).
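The following is a highly simplified sketch of the idea, not the published RACHET algorithm: each site builds a local dendrogram and transmits only a handful of cluster descriptors (here just centroids), which are then merged into a global hierarchy. The function names and the choice of centroids as descriptors are assumptions for illustration.

```r
# Simplified sketch of distributed hierarchical clustering (not RACHET itself):
# each site sends a few cluster centroids instead of its raw data.
local_descriptors <- function(x, n_clusters = 10) {
  hc     <- hclust(dist(x), method = "average")   # local dendrogram at one site
  labels <- cutree(hc, k = n_clusters)            # cut into a small set of clusters
  t(sapply(split(as.data.frame(x), labels), colMeans))  # n_clusters x k summary, << N rows
}

global_hierarchy <- function(descriptor_list) {
  summaries <- do.call(rbind, descriptor_list)    # all transmitted centroids
  hclust(dist(summaries), method = "average")     # merged "global" dendrogram
}

# Usage with three synthetic sites:
sites  <- replicate(3, matrix(rnorm(500 * 4), ncol = 4), simplify = FALSE)
global <- global_hierarchy(lapply(sites, local_descriptors))
plot(global)
```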

Slide 25: Distributed & Streamline Data Reduction: Merging Information Rather Than Raw Data
- Global principal components: transmit information, not data
- Dynamic principal components: no need to keep all data
Method: merge a few local PCs and local means (a sketch follows below).
Benefits:
- Little loss of information
- Much lower transmission costs
[Chart: performance ratio of distributed PCA vs. monolithic PCA as a function of the number of data sets.]
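A minimal sketch of the merge step, under the assumption that each site ships only its mean vector, sample size, and a truncated principal component decomposition. Reconstructing an approximate global covariance from these pieces is an illustrative approach, not the algorithm from the SIAM paper.

```r
# Sketch: approximate global PCA from per-site summaries (mean, top-k PCs).
# Each site transmits O(p*k) numbers instead of its n_i x p raw data block.
local_summary <- function(x, k = 3) {
  pc <- prcomp(x, center = TRUE)
  list(mean = colMeans(x), n = nrow(x),
       loadings = pc$rotation[, 1:k, drop = FALSE],   # p x k
       var      = pc$sdev[1:k]^2)                     # top-k component variances
}

merge_local_pca <- function(summaries) {
  p <- length(summaries[[1]]$mean)
  N <- sum(vapply(summaries, `[[`, numeric(1), "n"))
  grand_mean <- Reduce(`+`, lapply(summaries, function(s) s$n * s$mean)) / N
  scatter <- matrix(0, p, p)
  for (s in summaries) {
    k <- ncol(s$loadings)
    # truncated within-site scatter plus between-site (mean shift) scatter
    scatter <- scatter +
      s$loadings %*% diag(s$var * (s$n - 1), k) %*% t(s$loadings) +
      s$n * tcrossprod(s$mean - grand_mean)
  }
  eigen(scatter / (N - 1), symmetric = TRUE)   # approximate global PCs
}

# Usage with two synthetic sites:
sites <- list(matrix(rnorm(200 * 5), ncol = 5), matrix(rnorm(300 * 5), ncol = 5))
global_pcs <- merge_local_pca(lapply(sites, local_summary))
```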

Slide 26: DFastMap: Fast Dimension Reduction for Distributed and Streamline Data
[Diagram: a stream of simulation data arrives in chunks at t = t0, t1, t2, ...; each new chunk is folded into the existing projection by incremental update via fusion.]
Features:
- Linear time for each chunk
- One-time communication for the distributed version
- Roughly 5% deviation from the monolithic version
(A sketch of the underlying FastMap projection follows below.)
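For reference, a minimal sketch of the classical FastMap projection that DFastMap builds on; the pivot selection here is a simple heuristic, and the incremental fusion and distributed pieces are not shown.

```r
# Sketch of the classical FastMap projection onto `dims` coordinates.
fastmap <- function(x, dims = 2) {
  n      <- nrow(x)
  coords <- matrix(0, n, dims)
  d2     <- as.matrix(dist(x))^2               # squared pairwise distances
  for (j in seq_len(dims)) {
    a <- which.max(d2[1, ])                    # heuristic far-apart pivot pair
    b <- which.max(d2[a, ])
    dab2 <- d2[a, b]
    if (dab2 == 0) break                       # all remaining distances are zero
    coords[, j] <- (d2[a, ] + dab2 - d2[, b]) / (2 * sqrt(dab2))
    # residual distances in the subspace orthogonal to the pivot axis
    d2 <- d2 - outer(coords[, j], coords[, j], function(u, v) (u - v)^2)
  }
  coords
}

# Usage: project 4-dimensional points down to 2 dimensions.
projected <- fastmap(matrix(rnorm(100 * 4), ncol = 4), dims = 2)
```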

Slide 27: Application Data Reduction and Potentials for Scientific Discovery

Slide 28: Adaptive PCA-based Data Compression in the Supernova Explosion Simulation
PCA vs. sub-sampling compression; loss function: mean square error (MSE).
- Sub-sampling: 1 point out of 9 (black)
- PCA approximation: k PCs out of 400 (red)
Compression features:
- Adaptive, PCA-based
- Compression rate: 200 to 20 times
- 3 times better than subsampling
[Images: original vs. PCA-restored field at time step 0; MSE = 0.004, compression rate = 200, number of PCs = 3 of 400.]
(A compress-and-restore sketch follows below.)
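A sketch of the compress-and-restore step on a synthetic 400-column field. The choice of 3 components and the reported MSE are from the slide; the synthetic data, matrix shape, and size accounting below are illustrative assumptions.

```r
# Sketch: PCA-based compression of one time step, keeping k of 400 components,
# compared against 1-in-9 spatial sub-sampling. The field here is synthetic.
field <- matrix(rnorm(400 * 400), nrow = 400, ncol = 400)

pc <- prcomp(field, center = TRUE)
k  <- 3                                          # keep 3 of 400 components
restored <- pc$x[, 1:k] %*% t(pc$rotation[, 1:k])
restored <- sweep(restored, 2, pc$center, `+`)   # add the column means back
mse_pca  <- mean((field - restored)^2)

# sub-sampling baseline: keep 1 point out of every 3x3 block
subsampled <- field[seq(1, nrow(field), by = 3), seq(1, ncol(field), by = 3)]

# stored values: k scores per row plus k loadings per column vs. 1/9 of the points
c(pca_values = length(pc$x[, 1:k]) + length(pc$rotation[, 1:k]) + length(pc$center),
  sub_values = length(subsampled), mse_pca = mse_pca)
```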

Slide 29: Data Compression & Discovery of the Unusual by Fitting Local Models
Strategy:
- Segment the series
- Model the usual to find the unusual
Key ideas:
- Fit simple local models to segments
- Use the parameters for global analysis and monitoring
Resulting system:
- Detects specific events (targeted or unusual)
- Provides a global description of one or several data series
- Provides data reduction to the parameters of the local model

Slide 30: From Local Models to Annotated Time Series
1) Segment the series (100 observations per segment)
2) Fit a simple local model to each segment, keeping (c0, c1, c2, ||e||_inf, ||e||_2)
3) Select the extreme segments (10%)
4) Cluster the extreme segments (4 clusters)
5) Map back to the series
(A sketch of steps 1-4 follows below.)
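A compact sketch of steps 1-4, under the assumption that the "simple local model" is a quadratic in time with c0, c1, c2 as its coefficients (the slide lists those parameters; the quadratic form, the residual norms used for ranking, and the synthetic series are assumptions).

```r
# Sketch: segment a series, fit a quadratic to each segment, keep the
# coefficients and residual norms, then flag and cluster the extreme segments.
annotate_series <- function(y, seg_len = 100, extreme_frac = 0.10, n_groups = 4) {
  segments <- split(y, ceiling(seq_along(y) / seg_len))
  params <- t(sapply(segments, function(s) {
    tt  <- seq_along(s)
    fit <- lm(s ~ tt + I(tt^2))                 # local model: c0 + c1*t + c2*t^2
    c(coef(fit),
      e_inf = max(abs(residuals(fit))),         # ||e||_inf
      e_2   = sqrt(sum(residuals(fit)^2)))      # ||e||_2
  }))
  extreme <- params[, "e_2"] >= quantile(params[, "e_2"], 1 - extreme_frac)
  groups  <- kmeans(params[extreme, , drop = FALSE], centers = n_groups)$cluster
  list(params = params, extreme = which(extreme), groups = groups)
}

# Usage on a synthetic series of 5,000 observations (50 segments):
ann <- annotate_series(rnorm(5000) + sin(seq(0, 50, length.out = 5000)))
```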

Slide 31: Decomposition and Monitoring of a GCM Run
135-year CCM3 run at T42 resolution; average monthly temperature; CO2 increases to 3x.
The temperature series is decomposed as a sum of EOF components, EOF 1 + EOF 2 + EOF 3 + EOF 4 + ... + EOF N, each separated into
- Periodic + trend: 11-13 month bandpass and 15-year lowpass components
- Anomaly: 13 month to 15 year bandpass and 11-month highpass components
Findings: circulation through the 12 months; winter warming is more severe than summer warming.
(An EOF decomposition sketch follows below.)
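An EOF decomposition of this kind is essentially a PCA of the space-time field. The sketch below shows that basic step on synthetic data; the grid size, the running-mean filter, and the use of prcomp rather than the actual GCM post-processing chain are assumptions.

```r
# Sketch: EOF decomposition of a monthly temperature field. Rows are months
# (135 years x 12), columns are grid points; the data here are synthetic.
months <- 135 * 12
grid   <- 500                                    # assumed number of grid cells
field  <- matrix(rnorm(months * grid), nrow = months, ncol = grid)

eof <- prcomp(field, center = TRUE)
spatial_patterns <- eof$rotation[, 1:4]          # EOF 1-4 (spatial maps)
pc_series        <- eof$x[, 1:4]                 # their time series

# A 12-month running mean as a crude lowpass filter on the leading PC;
# the bandpass/highpass splits (11-13 mo, 15 yr, ...) would be built similarly.
lowpass_pc1 <- stats::filter(pc_series[, 1], rep(1 / 12, 12), sides = 2)
```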

Slide 32: Publications & Presentations
Publications:
- F. AbuKhzam, N. F. Samatova, and G. Ostrouchov (2002). "FastMap for Distributed Data: Fast Dimension Reduction." In preparation.
- Y. Qu, G. Ostrouchov, N. F. Samatova, and A. Geist (2002). "Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets." In Proc. of the Second SIAM International Conference on Data Mining, April 2002.
- N. F. Samatova, G. Ostrouchov, A. Geist, and A. Melechko (2002). "RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets." Distributed and Parallel Databases: An International Journal, Special Issue on Parallel and Distributed Data Mining, Vol. 11, No. 2, March 2002.
- N. Samatova, A. Geist, and G. Ostrouchov (2002). "RACHET: Petascale Distributed Data Analysis Suite." In Proc. of the SPEEDUP Workshop on Distributed Supercomputing: Data Intensive Computing, March 4-6, 2002, Badehotel Bristol, Leukerbad, Valais, Switzerland.
Presentations:
- N. Samatova, A. Geist, and G. Ostrouchov, "RACHET: Petascale Distributed Data Analysis Suite," SPEEDUP Workshop on Distributed Supercomputing: Data Intensive Computing, March 4-6, 2002, Badehotel Bristol, Leukerbad, Valais, Switzerland.
- A. Shoshani, R. Burris, T. Potok, and N. Samatova, "SDM-ISIC," TSI All-Hands Meeting, February 2002.

Slide 33: Thank You!

