Ohio State University Department of Computer Science and Engineering
Tools and Techniques for the Data Grid
Gagan Agrawal

Grids and Data Grids
Grid Computing
  – Large-scale problem solving using resources over the Internet
  – Distributed computing, but across multiple administrative domains
Data Grid
  – Grid with a focus on sharing and processing large-scale datasets

Scientific Data Analysis on Grid-based Data Repositories
Scientific data repositories
  – Large volume
    » Gigabyte, Terabyte, Petabyte
  – Distributed datasets
    » Generated/collected by scientific simulations or instruments
  – Data could be streaming in nature
Scientific data analysis: Data Specification, Data Organization, Data Extraction, Data Movement, Data Analysis, Data Visualization

Opportunities
Scientific simulations and data collection instruments are generating large-scale data
Grid standards are enabling sharing of data
Wide-area bandwidths are increasing rapidly

Existing Efforts
Data grids are recognized as an important component of grid/distributed computing
Major topics
  – Efficient/Secure Data Movement
  – Replica Selection
  – Metadata catalogs / Metadata services
  – Setting up workflows

Open Issues
Accessing / Retrieving / Processing data from scientific repositories
  – Need to deal with low-level formats
Integrating tools and services having/requiring data with different formats
Support for processing streaming data in a distributed environment
Efficient distributed data-intensive applications
Developing scalable data analysis applications

Ongoing Projects
Automatic Data Virtualization
On-the-fly information integration in a distributed environment
Middleware for Processing Streaming Data
Supporting Coarse-grained pipelined parallelism
Compiling XQuery on Scientific and Streaming Data
Middleware and Algorithms for Scalable Data Mining

Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Cluster and Grid-based data mining middleware

Automatic Data Virtualization: Motivation
Emergence of grid-based data repositories
  – Can enable sharing of data in an unprecedented way
Access mechanisms for remote repositories
  – Complex low-level formats make accessing and processing of data difficult
Main desired functionality
  – Ability to select, download, and process a subset of data

Data Virtualization
[Figure labels: an abstract view of data; dataset; Data Service; Data Virtualization]
As defined by the Global Grid Forum’s DAIS working group:
  – A Data Virtualization describes an abstract view of data.
  – A Data Service implements the mechanism to access and process data through the Data Virtualization.

Our Approach: Automatic Data Virtualization
Automatically create data services
  – A new application of compiler technology
A meta-data descriptor describes the layout of data on a repository
An abstract view is exposed to the users
Two implementations:
  – Relational/SQL-based
  – XML/XQuery-based

Relational/SQL Implementation
[Figure labels: Select Query Input; Query frontend; Meta-data Descriptor; User Defined Aggregate; Analysis and Code Generation; Extract Service; Aggregation Service]

Design a Meta-data Description Language
Requirements
  – Specify the relationship of a dataset to the virtual dataset schema
  – Describe the physical layout of the dataset within a file
  – Describe the distribution of the dataset on nodes of one or more clusters
  – Specify the subsetting index attributes
  – Easy to use for data repository administrators and also convenient for our code generation

An Example
Oil Reservoir Management
  – The dataset comprises several simulations on the same grid
  – For each realization and each grid point, a number of attributes are stored
  – The dataset is stored on a 4-node cluster

Component I: Dataset Schema Description
    [IPARS]                      // {* Dataset schema name *}
    REL = short int              // {* Data type definition *}
    TIME = int
    X = float
    Y = float
    Z = float
    SOIL = float
    SGAS = float

Component II: Dataset Storage Description
    [IparsData]                  // {* Dataset name *}
    DatasetDescription = IPARS   // {* Dataset schema for IparsData *}
    DIR[0] = osu0/ipars
    DIR[1] = osu1/ipars
    DIR[2] = osu2/ipars
    DIR[3] = osu3/ipars
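
To make the two descriptor components concrete, here is a minimal Python sketch that reads the example above into dictionaries. It is not the tool's actual parser: the function name `parse_descriptor` and the parsing rules (sections in square brackets, `key = value` lines, `//` starting a comment) are assumptions inferred from the slide text only.

```python
import re

# Descriptor text copied from the slide; the comment layout is a best guess.
DESCRIPTOR = """
[IPARS]                      // {* Dataset schema name *}
REL = short int              // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float

[IparsData]                  // {* Dataset name *}
DatasetDescription = IPARS   // {* Dataset schema for IparsData *}
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars
"""

def parse_descriptor(text):
    """Hypothetical helper: map each [Section] to a dict of its key/value lines."""
    sections = {}
    current = None
    for raw in text.splitlines():
        # Assumption: everything after '//' is a comment.
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            current = sections.setdefault(header.group(1), {})
        elif "=" in line and current is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key] = value
    return sections

sections = parse_descriptor(DESCRIPTOR)
# Follow the storage section's reference to its schema section.
schema = sections[sections["IparsData"]["DatasetDescription"]]
# One directory entry per cluster node that holds part of the dataset.
dirs = [v for k, v in sections["IparsData"].items() if k.startswith("DIR[")]
```

Here `schema` holds the attribute types of the virtual table and `dirs` lists the four per-node directories, which is the information a generated extraction service would need before it can locate and decode records.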

Evaluate the Scalability of Our Tool
Scale the number of nodes hosting the oil reservoir management dataset
Extract a subset of interest of size 1.3 GB
The execution times scale almost linearly
The performance difference varies between 5% and 34%, with an average difference of 16%

Comparison with an existing database (PostgreSQL)
6 GB of data for satellite data processing
The total storage required after loading the data into PostgreSQL is 18 GB
Indexes created in PostgreSQL on the spatial coordinates and on S1
No special performance tuning applied for the experiment

No.  Description
1    SELECT * FROM TITAN;
2    SELECT * FROM TITAN WHERE X>=0 AND X<=… AND Y>=0 AND Y<=… AND Z>=0 AND Z<=100;
3    SELECT * FROM TITAN WHERE DISTANCE(X,Y,Z) < 1000;
4    SELECT * FROM TITAN WHERE S1 < 0.01;
5    SELECT * FROM TITAN WHERE S1 < 0.5;
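
The query shapes in the table can be tried on a toy table. The sketch below uses Python's standard `sqlite3` module rather than PostgreSQL or the virtualization tool, and everything about the data is made up: the column types, the three sample rows, and in particular the meaning of `DISTANCE`, which is assumed here to be the Euclidean distance from the origin.

```python
import math
import sqlite3

def make_titan(rows):
    """Build an in-memory TITAN table from (X, Y, Z, S1) tuples (toy data only)."""
    conn = sqlite3.connect(":memory:")
    # Assumption: DISTANCE(X, Y, Z) is the Euclidean distance from the origin.
    conn.create_function(
        "DISTANCE", 3, lambda x, y, z: math.sqrt(x * x + y * y + z * z)
    )
    conn.execute("CREATE TABLE TITAN (X REAL, Y REAL, Z REAL, S1 REAL)")
    conn.executemany("INSERT INTO TITAN VALUES (?, ?, ?, ?)", rows)
    return conn

# Hypothetical sample rows, not taken from the satellite dataset.
SAMPLE_ROWS = [
    (10.0, 20.0, 5.0, 0.005),
    (500.0, 400.0, 50.0, 0.3),
    (3000.0, 4000.0, 90.0, 0.7),
]

conn = make_titan(SAMPLE_ROWS)
# Query 3 style: a user-defined function in the WHERE clause.
near = conn.execute(
    "SELECT * FROM TITAN WHERE DISTANCE(X, Y, Z) < 1000"
).fetchall()
# Query 4 style: a highly selective predicate on a non-spatial attribute.
low_s1 = conn.execute("SELECT * FROM TITAN WHERE S1 < 0.01").fetchall()
```

The five queries differ mainly in selectivity and in whether an index can help, which is presumably why both spatial and S1 indexes were built for the comparison.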

Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Coarse-grained pipelined parallelism

XML/XQuery Implementation
[Figure labels: TEXT, …, NetCDF, RMDB, HDF5; XML; XQuery; ???]

Programming/Query Language
High-level declarative languages ease application development
  – Popularity of Matlab for scientific computations
New challenges in compiling them for efficient execution
XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages!
  – XPath (a subset of XQuery) embedded in an imperative language is another option

Approach / Contributions
Use of XML Schemas to provide high-level abstractions on complex datasets
Using XQuery with these Schemas to specify processing
Issues in Translation
  – High-level to low-level code
  – Data-centric transformations for locality in low-level codes
  – Issues specific to XQuery
    » Recognizing recursive reductions
    » Type inferencing and translation

System Architecture
[Figure labels: External Schema; XQuery Sources; Compiler; XML Mapping Service; logical XML schema; physical XML schema; C++/C]

Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Cluster and Grid-based data mining middleware

Overall Goal
Tools for data integration driven by:
  – Data explosion
    » Data size & number of data sources
  – New analysis tools
  – Autonomous resources
    » Heterogeneous data representation & various interfaces
  – Frequent updates
  – Common situations:
    » Flat-file datasets
    » Ad hoc sharing of data

Current Approaches
Manually written wrappers
  – Problems
    » O(N^2) wrappers needed, O(N) for a single update
Mediator-based integration systems
  – Problems
    » Need a common intermediate format
    » Unnecessary data transformation
Integration using web/grid services
    » Needs all tools to be web services (all data in XML?)
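
The O(N^2) and O(N) figures for manually written wrappers can be checked with a short calculation. The sketch below assumes one hand-written wrapper per ordered pair of formats (source to target); that counting convention and the function names are illustrative assumptions, not something the slides spell out.

```python
def pairwise_wrappers(n):
    """Wrappers needed to convert between every ordered pair of n formats."""
    return n * (n - 1)

def wrappers_touched_by_update(n):
    """Wrappers to rewrite when one format changes: those reading or writing it."""
    return 2 * (n - 1)

for n in (3, 10, 50):
    print(n, pairwise_wrappers(n), wrappers_touched_by_update(n))
```

Under this convention, 10 formats already need 90 wrappers, and a change to one format touches 18 of them, so the total grows quadratically while the cost of a single update grows linearly.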

Our Approach
Automatically generate wrappers
  – Stand-alone programs
  – For integrated DBs, (grid) workflow systems
Transform data in files of arbitrary formats
  – No domain- or format-specific heuristics
  – Layout information provided by users
Help biologists write layout descriptors using data mining techniques
Particularly attractive for
  – Flat-file datasets
  – Ad hoc data sharing
  – Data grid environments

Our Approach: Advantages
  – No DB or query support required
  – One descriptor per resource needed
  – No unnecessary transformation
  – New resources can be integrated on the fly

Our Approach: Challenges
Description language
  – Format and logical view of data in flat files
  – Easy to interpret and write
Wrapper generation and execution
  – Correspondence between data items
  – Separating wrapper analysis and execution
Interactive tools for writing layout descriptors
  – What data mining techniques to use?

Wrapper Generation System Overview
[Figure labels: Layout Descriptor; Schema Descriptors; Parser; Mapping Generator; Data Entry Representation; Schema Mapping; Application Analyzer; WRAPINFO; DataReader; DataWriter; Synchronizer; Source Dataset; Target Dataset]

Layout Description Language
Goal
  – To describe data in arbitrary flat-file formats
  – Easy to interpret and write
Components:
  1. Schema description
  2. Layout description
Example: FASTA

Layout Description Language
Component I: Schema Description
    [FASTA]                  // Schema Name
    ID = string              // Data type definitions
    DESCRIPTION = string
    SEQ = string

Sample FASTA data:
    …
    >seq1 comment1 \n
    ASTPGHTIIYEAVCLHNDRTTIP \n
    >seq2 comment2 \n
    ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
    NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
    KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
    >seq3 …

Layout Description Language
Key observations on data layout
  – Strings of variable length
  – Delimiters widely used
  – Data fields divided into variables
  – Repetitive structures
Key tokens
  – “constant string”
  – LINESIZE
  – [optional]
  –
  – …

Layout Description Language
Component II: Layout Description
    …
    LOOP ENTRY 1:EOF:1 {
        “>” ID “ ” DESCRIPTION
        “\n” | EOF
    }
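
As a point of comparison for the descriptor-driven approach, the following Python sketch is a hand-written reader for the FASTA sample that fills the three schema fields (ID, DESCRIPTION, SEQ). It is not the generated wrapper and does not interpret the layout language; the names `FastaEntry` and `read_fasta` are hypothetical, and joining multi-line sequences into one string is an assumption about how SEQ should be represented.

```python
from typing import Iterator, NamedTuple

class FastaEntry(NamedTuple):
    """One record, using the field names from the [FASTA] schema description."""
    ID: str
    DESCRIPTION: str
    SEQ: str

def read_fasta(text: str) -> Iterator[FastaEntry]:
    """Yield one entry per '>' header line, joining the sequence lines below it."""
    ident, desc, seq_lines = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if ident is not None:
                yield FastaEntry(ident, desc, "".join(seq_lines))
            header = line[1:].split(" ", 1)
            ident = header[0]
            desc = header[1] if len(header) > 1 else ""
            seq_lines = []
        elif ident is not None:
            seq_lines.append(line)
    # End of input closes the last entry.
    if ident is not None:
        yield FastaEntry(ident, desc, "".join(seq_lines))

SAMPLE = """>seq1 comment1
ASTPGHTIIYEAVCLHNDRTTIP
>seq2 comment2
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES
"""

entries = list(read_fasta(SAMPLE))
```

Every format-specific decision here (the ">" marker, the space between ID and DESCRIPTION, the newline handling) is hard-coded, which is exactly the kind of knowledge the layout descriptor is meant to state declaratively instead.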

Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Coarse-grained pipelined parallelism

Streaming Data Model
Continuous data arrival and processing
Emerging model for data processing
  – Sources that produce data continuously: sensors, long-running simulations
  – WAN bandwidths growing faster than disk bandwidths
Active topic in many computer science communities
  – Databases
  – Data Mining
  – Networking …

Summary/Limitations of Current Work
Focus on
  – Centralized processing of a stream from a single source (databases, data mining)
  – Communication only (networking)
Many applications involve
  – Distributed processing of streams
  – Streams from multiple sources

Motivating Application
Network Fault Management System
[Figure labels: Switch; Network; X]

Motivating Application (2)
Computer Vision Based Surveillance

Features of Distributed Streaming Processing Applications
Data sources could be distributed
  – Over a WAN
Continuous data arrival
Enormous volume
  – Probably can't communicate it all to one site
Results from analysis may be desired at multiple sites
Real-time constraints
  – A real-time, high-throughput, distributed processing problem

Need for a Grid-Based Stream Processing Middleware
Application developers interested in data stream processing
  – Would like to have abstracted
    » Grid standards and interfaces
    » Adaptation function
  – Would like to focus on algorithms only
GATES is a middleware for grid-based, self-adapting data stream processing

Adaptation for Real-time Processing
Analysis on streaming data is approximate
The trade-off between accuracy and execution rate can be captured by certain parameters (adaptation parameters)
  – Sampling rate
  – Size of summary structure
Application developers can expose these parameters and a range of values

API for Adaptation
    public class Sampling-Stage implements StreamProcessing {
        …
        void init() { … }
        …
        void work(buffer in, buffer out) {
            …
            while (true) {
                Image img = get-from-buffer-in-GATES(in);
                Image img-sample = Sampling(img, sampling-ratio);
                put-to-buffer-in-GATES(img-sample, out);
            }
            …
        }
    }

Adaptation calls highlighted on the slide:
    GATES.Information-About-Adjustment-Parameter(min, max, 1)
    sampling-ratio = GATES.getSuggestedParameter();
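
The adaptation idea (declare a parameter range, then keep asking for a suggested value while processing) can be sketched in a few lines of Python. This is an illustration only and does not use or reproduce the GATES API: the class `AdaptationParameter`, the functions `sample` and `run_stage`, and the rule that a stage is "overloaded" when its output exceeds a fixed capacity are all assumptions made for the example.

```python
class AdaptationParameter:
    """A tunable value with a declared range, moved in fixed steps (hypothetical)."""

    def __init__(self, minimum, maximum, step, initial):
        self.minimum = minimum
        self.maximum = maximum
        self.step = step
        self.value = initial

    def adjust(self, overloaded):
        """Lower the value under overload, raise it otherwise, clamped to the range."""
        delta = -self.step if overloaded else self.step
        self.value = min(self.maximum, max(self.minimum, self.value + delta))
        return self.value

def sample(items, ratio):
    """Keep roughly `ratio` of the items by taking every k-th one."""
    stride = max(1, round(1 / ratio))
    return items[::stride]

def run_stage(batches, param, capacity):
    """Sample each batch at the current ratio; assume overload if output > capacity."""
    outputs = []
    for batch in batches:
        out = sample(batch, param.value)
        outputs.append(out)
        param.adjust(overloaded=len(out) > capacity)
    return outputs

# Sampling ratio may range from 0.25 to 1.0 in steps of 0.25, starting at 1.0.
param = AdaptationParameter(0.25, 1.0, 0.25, initial=1.0)
batches = [list(range(8))] * 3
outputs = run_stage(batches, param, capacity=4)
```

In this toy run the ratio is stepped down while the stage produces more than it can forward and stepped back up once it keeps pace, which mirrors the accuracy versus execution-rate trade-off that the adaptation parameters are meant to expose.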

Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Cluster and Grid-based data mining middleware

Scalable Mining Problem
Our understanding of what algorithms and parameters will give desired insights is often limited
The time required for creating scalable implementations of different algorithms and running them with different parameters on large datasets slows down the data mining process

Mining in a Grid Environment
A data mining application in a grid environment
  – Needs to exploit different forms of available parallelism
  – Needs to deal with different data layouts and formats
  – Needs to adapt to resource availability

FREERIDE Overview
Framework for Rapid Implementation of Datamining Engines
Demonstrated for a variety of standard mining algorithms
Targeted distributed memory parallelism, shared memory parallelism, and their combination
Can be used as a basis for scalable grid-based data mining implementations
Published in SDM 01, SDM 02, SDM 03, Sigmetrics 02, Europar 02, IPDPS 03, IEEE TKDE (to appear)

FREERIDE-G
Data processing may not be feasible where the data resides
Need to identify resources for data processing
Need to abstract data retrieval, movement, and parallel processing

Group Members
Ph.D. students
  – Liang Chen
  – Leo Glimcher
  – Kaushik Sinha
  – Li Weng
  – Xuan Zhang
  – Qian Zhu
Recently graduated
  – Ruoming Jin (Kent State)
  – Wei Du (Yahoo)
  – Xiaogang Li (Wi 06, AskJeeves)

Getting Involved
Talk to me
Most recent papers are available online
Sign in for my 888