Published by Buddy Holland. Modified over 9 years ago.
1
Ohio State University, Department of Computer Science and Engineering
Tools and Techniques for the Data Grid
Gagan Agrawal, The Ohio State University
2
Overall Motivation
Computation has long been an integral part of any scientific discipline
  – Parallels theory and experiments
The last 2 (or more) decades have seen Computational-X emerge
  – Major emphasis on computational modeling
  – Involved CS support for high-end computing
In the last 5-10 years, X-Informatics has been emerging
  – Data-driven science and engineering applications
  – Needs CS support for high-end and distributed computing
3
Context: Grid Computing
Wide-area collaborations and pooling of resources
Natural synergy with data-intensive applications
  – Wide-area sharing of data
  – Using distributed resources for data analysis
  – Stage multiple tasks: data generation, processing, visualization
4
Scientific Data Analysis on (Grid-based) Data Repositories
Scientific data repositories
  – Large volume
    » Gigabyte, Terabyte, Petabyte
  – Distributed datasets
    » Generated/collected by scientific simulations or instruments
  – Data could be streaming in nature
Scientific data analysis
  – Data Specification
  – Data Organization
  – Data Extraction
  – Data Movement
  – Data Analysis
  – Data Visualization
5
Coastal Forecasting and Change Detection (Lake Erie)
6
Opportunities
Scientific simulations and data collection instruments generating large-scale data
Rapidly increasing wide-area bandwidths
Grid standards enabling sharing of data
Service/grid model of computing
  – Plug and play application modules / data sources
7
Existing Efforts
Data grids recognized as an important component of grid/distributed computing
Major topics
  – Efficient/Secure Data Movement
  – Replica Selection
  – Metadata catalogs / Metadata services
  – Setting up workflows
8
Open Issues
Accessing / Retrieving / Processing data from scientific repositories
  – Need to deal with low-level formats
Integrating tools and services having/requiring data with different formats
Support for processing streaming data in a distributed environment
Developing scalable data analysis applications
9
Ongoing Projects
Automatic Data Virtualization
On-the-fly data integration in a distributed environment
Middleware for Processing Streaming Data
Supporting Coarse-grained pipelined parallelism
Compiling XQuery on Scientific and Streaming Data
Middleware for Scalable Data Processing
Data Mining Algorithms and Systems
  – Ask Ruoming!
10
Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Data Integration
Middleware for Streaming Data
Cluster and Grid-based data mining middleware
11
An Example Application Scenario
12
Automatic Data Virtualization: Motivation
Emergence of grid-based data repositories
  – Can enable sharing of data in an unprecedented way
Access mechanisms for remote repositories
  – Complex low-level formats make accessing and processing of data difficult
Main desired functionality
  – Ability to select, download, and process a subset of data
13
Data Virtualization
[Diagram labels: An abstract view of data; dataset; Data Service; Data Virtualization]
By Global Grid Forum's DAIS working group:
  – A Data Virtualization describes an abstract view of data.
  – A Data Service implements the mechanism to access and process data through the Data Virtualization.
14
Our Approach: Automatic Data Virtualization
Automatically create data services
  – A new application of compiler technology
A metadata descriptor describes the layout of data on a repository
An abstract view is exposed to the users
Two implementations:
  – Relational/SQL-based
  – XML/XQuery based
15
System Overview
SELECT FROM WHERE …. AND Filter( );
16
Design a Meta-data Description Language
Requirements
  – Specify the relationship of a dataset to the virtual dataset schema
  – Describe the dataset physical layout within a file
  – Describe the dataset distribution on nodes of one or more clusters
  – Specify the subsetting index attributes
  – Easy to use for data repository administrators and also convenient for our code generation
17
Design Overview
Dataset Schema Description Component
Dataset Storage Description Component
Dataset Layout Description Component
18
An Example
Oil Reservoir Management
  – The dataset comprises several simulations on the same grid
  – For each realization and each grid point, a number of attributes are stored
  – The dataset is stored on a 4-node cluster
Component I: Dataset Schema Description
    [IPARS]                      // {* Dataset schema name *}
    REL = short int              // {* Data type definition *}
    TIME = int
    X = float
    Y = float
    Z = float
    SOIL = float
    SGAS = float
Component II: Dataset Storage Description
    [IparsData]                  // {* Dataset name *}
    // {* Dataset schema for IparsData *}
    DatasetDescription = IPARS
    DIR[0] = osu0/ipars
    DIR[1] = osu1/ipars
    DIR[2] = osu2/ipars
    DIR[3] = osu3/ipars
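To illustrate how a tool might read the schema and storage components, here is a minimal Python sketch. It is not the authors' parser: the function name `parse_descriptor` and the decision to treat the descriptor as `[Section]` headers followed by `KEY = value` lines with `//` comments are assumptions drawn only from the example on this slide.

```python
def parse_descriptor(text):
    """Parse '[Section]' headers and 'KEY = value' lines into nested dicts.

    Hypothetical helper: the grammar is inferred from the slide's example only.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop '// {* ... *}' comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key] = value
    return sections


DESCRIPTOR = """
[IPARS]                      // {* Dataset schema name *}
REL = short int              // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float
[IparsData]                  // {* Dataset name *}
// {* Dataset schema for IparsData *}
DatasetDescription = IPARS
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars
"""

parsed = parse_descriptor(DESCRIPTOR)
schema = parsed[parsed["IparsData"]["DatasetDescription"]]  # follow the schema link
```

Here `schema` maps each virtual-table attribute to its declared type, and the `DIR[i]` entries give the per-node directories that the layout component refers to as `$DIR[$DIRID]`.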
19
Data Layout Description Component
[Diagram: tree with Dataset Root at the top; children dataset 1, dataset 2, dataset 3; leaves Data1 through Data6]
    DATASET "ROOT" {
      DATATYPE { … }
      DATAINDEX { … }
      DATA { DATASET dataset1 DATASET dataset2 DATASET dataset3 }
      DATASET "dataset1" {
        DATATYPE { … }
        DATASPACE { … }
        DATA { data1 data2 data3 }
      }
      DATASET "dataset2" {
        DATATYPE { … }
        DATASPACE { … }
        DATA { data4 }
      }
      DATASET "dataset3" { …. }
    }
20
An Example
Oil Reservoir Management
  – Use the LOOP keyword to capture the repetitive structure within a file
  – The grid has 4 partitions (0~3)
  – "IparsData" comprises "ipars1" and "ipars2". "ipars1" describes the data files that store the spatial coordinates; "ipars2" describes the data files that store the other attributes.
Component III: Dataset Layout Description
    DATASET "IparsData" {            // {* Name for Dataset *}
      DATATYPE { IPARS }             // {* Schema for Dataset *}
      DATAINDEX { REL TIME }
      DATA { DATASET ipars1 DATASET ipars2 }
      DATASET "ipars1" {
        DATASPACE {
          LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 {
            X Y Z
          }
        }
        DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 }
      }                              // {* end of DATASET "ipars1" *}
      DATASET "ipars2" {
        DATASPACE {
          LOOP TIME 1:500:1 {
            LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 {
              SOIL SGAS
            }
          }
        }
        DATA { $DIR[$DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 }
      }                              // {* end of DATASET "ipars2" *}
    }
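The nested LOOP ranges in the "ipars2" layout determine where each record sits inside a file. The Python sketch below works that arithmetic out under stated assumptions that the slides do not confirm: a `float` occupies 4 bytes, records are stored back to back with no header or padding, and the outer loop (TIME) varies slowest. The function names are hypothetical.

```python
FLOAT_BYTES = 4             # assumption: 'float' is a 4-byte value
POINTS_PER_PARTITION = 100  # from ($DIRID*100+1):(($DIRID+1)*100):1
TIME_STEPS = 500            # from LOOP TIME 1:500:1
DIRS = ["osu0/ipars", "osu1/ipars", "osu2/ipars", "osu3/ipars"]


def grid_range(dirid):
    """Inclusive grid-point range ($DIRID*100+1):(($DIRID+1)*100) of a partition."""
    return dirid * 100 + 1, (dirid + 1) * 100


def locate_soil_sgas(rel, time, grid):
    """Return (file, byte offset, byte length) of the SOIL/SGAS record."""
    dirid = (grid - 1) // POINTS_PER_PARTITION
    if not (0 <= dirid < len(DIRS)) or not (1 <= time <= TIME_STEPS):
        raise ValueError("outside the ranges declared in the descriptor")
    first, _ = grid_range(dirid)
    record_bytes = 2 * FLOAT_BYTES  # one SOIL value plus one SGAS value
    # TIME is the outer loop, GRID the inner loop.
    index = (time - 1) * POINTS_PER_PARTITION + (grid - first)
    return f"{DIRS[dirid]}/DATA{rel}", index * record_bytes, record_bytes
```

For example, `locate_soil_sgas(0, 2, 101)` resolves to the file `osu1/ipars/DATA0`, because grid point 101 is the first point of partition 1, and skips one full time step of 100 records before it.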
21
Automatic Virtualization Using Meta-data
Aligned file chunks:
    {num_rows, {File_1, Offset_1, Num_Bytes_1}, {File_2, Offset_2, Num_Bytes_2}, ……, {File_m, Offset_m, Num_Bytes_m}}
Our tool parses the meta-data descriptor and generates function code.
At run time, the query provides the parameters used to invoke the generated functions, which create the Aligned File Chunks.
[Diagram: tree with Dataset Root at the top; children dataset 1, dataset 2, dataset 3; leaves Data1 through Data6]
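As a rough illustration of the aligned file chunk structure, the Python sketch below builds one chunk that pairs the coordinates of a partition with the SOIL/SGAS values of one time step. The class and function names are hypothetical, and the byte sizes assume 4-byte floats and header-free files, neither of which the slides state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedFileChunk:
    """{num_rows, {File_1, Offset_1, Num_Bytes_1}, ..., {File_m, Offset_m, Num_Bytes_m}}"""
    num_rows: int
    segments: List[Tuple[str, int, int]]  # (file, offset, num_bytes)

    def row_bytes(self):
        """Bytes that each file contributes to a single output row."""
        return [num_bytes // self.num_rows for _, _, num_bytes in self.segments]


def ipars_chunk(dirid, rel, time, points=100, float_bytes=4):
    """Chunk aligning X, Y, Z (COORDS) with SOIL, SGAS (DATA$REL) for one time step."""
    coords = (f"osu{dirid}/ipars/COORDS", 0, points * 3 * float_bytes)
    data_offset = (time - 1) * points * 2 * float_bytes
    data = (f"osu{dirid}/ipars/DATA{rel}", data_offset, points * 2 * float_bytes)
    return AlignedFileChunk(num_rows=points, segments=[coords, data])


chunk = ipars_chunk(dirid=1, rel=0, time=3)
```

Each row of the virtual table is then assembled by reading `row_bytes()[i]` bytes from segment `i`, so the 100 rows of this chunk combine 12 bytes of coordinates with 8 bytes of SOIL/SGAS data.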
22
Compiler Analysis
[Diagram labels: Meta-data descriptor; Compiler Analysis; Create AFC; Process AFC; Index & Extraction function code]
    Data_Extract {
      Find_File_Groups()
      Process_File_Groups()
    }
    Find_File_Groups {
      Let S be the set of files that match against the query
      Classify files in S by the set of attributes they have
      Let S_1, …, S_m be the m sets
      T = Ø
      foreach {s_1, …, s_m}, s_i ∈ S_i {     {* cartesian product between S_1, …, S_m *}
        If the values of implicit attributes are not inconsistent {
          T = T ∪ {s_1, …, s_m}
        }
      }
      Output T
    }
    Process_File_Groups {
      foreach {s_1, …, s_m} ∈ T {
        Find_Aligned_File_Chunks()
        Supply implicit attributes for each file chunk
        foreach Aligned File Chunk {
          Check against index
          Compute offset and length
          Output the aligned file chunk
        }
      }
    }
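The Find_File_Groups step can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the generated code itself: the dictionary fields `name`, `attrs`, and `implicit` are hypothetical, and the input is assumed to be the set S of files that already match the query.

```python
from itertools import product


def find_file_groups(files):
    """Group files that together supply all attributes of a row.

    Each file is a dict with (hypothetical) keys:
      name     - file path
      attrs    - attributes stored in the file
      implicit - attribute values implied by the file's name or location
    """
    # Classify files by the set of attributes they hold: S_1, ..., S_m.
    by_attrs = {}
    for f in files:
        by_attrs.setdefault(frozenset(f["attrs"]), []).append(f)

    groups = []
    # Cartesian product between S_1, ..., S_m.
    for combo in product(*by_attrs.values()):
        merged = {}
        consistent = True
        for f in combo:
            for key, value in f["implicit"].items():
                # Two files disagree if they imply different values for a key.
                if merged.setdefault(key, value) != value:
                    consistent = False
        if consistent:
            groups.append(tuple(f["name"] for f in combo))
    return groups


FILES = [
    {"name": "osu0/ipars/COORDS", "attrs": ["X", "Y", "Z"], "implicit": {"DIRID": 0}},
    {"name": "osu1/ipars/COORDS", "attrs": ["X", "Y", "Z"], "implicit": {"DIRID": 1}},
    {"name": "osu0/ipars/DATA0", "attrs": ["SOIL", "SGAS"], "implicit": {"DIRID": 0, "REL": 0}},
    {"name": "osu1/ipars/DATA0", "attrs": ["SOIL", "SGAS"], "implicit": {"DIRID": 1, "REL": 0}},
]
groups = find_file_groups(FILES)
```

With these four files, only the two same-partition pairings survive the consistency check; the cross-partition combinations are dropped because their `DIRID` values disagree.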
23
Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Coarse-grained pipelined parallelism
24
XML/XQuery Implementation
[Diagram labels: TEXT; …; NetCDF; RMDB; HDF5; XML; XQuery; ???]
25
Programming/Query Language
High-level declarative languages ease application development
  – Popularity of Matlab for scientific computations
New challenges in compiling them for efficient execution
XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages!
  – XPath (a subset of XQuery) embedded in an imperative language is another option
26
Approach / Contributions
Use of XML Schemas to provide high-level abstractions on complex datasets
Using XQuery with these Schemas to specify processing
Issues in Translation
  – High-level to low-level code
  – Data-centric transformations for locality in low-level codes
  – Issues specific to XQuery
    » Recognizing recursive reductions
    » Type inferencing and translation
27
System Architecture
[Diagram labels: External Schema; XQuery Sources; Compiler; XML Mapping Service; logical XML schema; physical XML schema; C++/C]
28
Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Cluster and Grid-based data mining middleware
29
Data Integration: Overall Goal
Tools for data integration driven by:
  – Data explosion
    » Data size & number of data sources
  – New analysis tools
  – Autonomous resources
    » Heterogeneous data representation & various interfaces
  – Frequent updates
  – Common situations:
    » Flat-file datasets
    » Ad-hoc sharing of data
30
Current Approaches
Manually written wrappers
  – Problems
    » O(N^2) wrappers needed, O(N) for a single update
Mediator-based integration systems
  – Problems
    » Need a common intermediate format
    » Unnecessary data transformation
Integration using web/grid services
    » Needs all tools to be web services (all data in XML?)
31
Our Approach
Automatically generate wrappers
  – Stand-alone programs
  – For integrated DBs, (grid) workflow systems
Transform data in files of arbitrary formats
  – No domain- or format-specific heuristics
  – Layout information provided by users
Help biologists write layout descriptors using data mining techniques
Particularly attractive for
  – flat-file datasets
  – ad hoc data sharing
  – data grid environments
32
Our Approach: Advantages
Advantages:
  – No DB or query support required
  – One descriptor per resource needed
  – No unnecessary transformation
  – New resources can be integrated on-the-fly
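The "one descriptor per resource" advantage can be made concrete with a little counting. The Python sketch below contrasts it with manually written wrappers, under the assumption (not stated on the slides) that a manual wrapper is needed for every ordered pair of distinct formats; the function names are hypothetical.

```python
def manual_wrapper_count(n):
    """Wrappers needed when every format must be converted to every other: O(N^2)."""
    return n * (n - 1)


def wrappers_affected_by_format_change(n):
    """Wrappers to rewrite when one format changes (into it and out of it): O(N)."""
    return 2 * (n - 1)


def descriptor_count(n):
    """Layout descriptors needed with one descriptor per resource."""
    return n


for n in (5, 10, 20):
    print(n, manual_wrapper_count(n), wrappers_affected_by_format_change(n), descriptor_count(n))
```

Under this counting, a change to one format touches a single descriptor, whereas the manual approach requires revisiting every wrapper that reads or writes that format.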
33
Our Approach: Challenges
Description language
  – Format and logical view of data in flat files
  – Easy to interpret and write
Wrapper generation and execution
  – Correspondence between data items
  – Separating wrapper analysis and execution
Interactive tools for writing layout descriptors
  – What data mining techniques to use?
34
Wrapper Generation System Overview
[Diagram labels: Layout Descriptor; Schema Descriptors; Parser; Mapping Generator; Data Entry Representation; Schema Mapping; DataReader; DataWriter; Synchronizer; Source Dataset; Target Dataset; Application Analyzer; WRAPINFO]
35
Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Coarse-grained pipelined parallelism
36
Streaming Data Model
Continuous data arrival and processing
Emerging model for data processing
  – Sources that produce data continuously: sensors, long-running simulations
  – WAN bandwidths growing faster than disk bandwidths
Active topic in many computer science communities
  – Databases
  – Data Mining
  – Networking ….
37
Summary/Limitations of Current Work
Focus on
  – centralized processing of a stream from a single source (databases, data mining)
  – communication only (networking)
Many applications involve
  – distributed processing of streams
  – streams from multiple sources
38
Motivating Application
Network Fault Management System
[Diagram labels: Switch Network; X]
39
Motivating Application (2)
Computer Vision Based Surveillance
40
Features of Distributed Streaming Processing Applications
Data sources could be distributed
  – Over a WAN
Continuous data arrival
Enormous volume
  – Probably can't communicate it all to one site
Results from analysis may be desired at multiple sites
Real-time constraints
  – A real-time, high-throughput, distributed processing problem
41
Need for a Grid-Based Stream Processing Middleware
Application developers interested in data stream processing
  – Would like to have abstracted
    » Grid standards and interfaces
    » Adaptation function
  – Would like to focus on algorithms only
GATES is a middleware for grid-based, self-adapting data stream processing
42
Adaptation for Real-time Processing
Analysis on streaming data is approximate
Accuracy and execution rate trade-off can be captured by certain parameters (adaptation parameters)
  – Sampling rate
  – Size of summary structure
Application developers can expose these parameters and a range of values
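One way to picture how an exposed adaptation parameter could be tuned is the small Python sketch below. It is not the GATES adaptation algorithm, which these slides do not describe; the function name, the buffer-occupancy signal, and the water-mark thresholds are all hypothetical. It only shows the idea of moving a sampling rate inside the developer-supplied range to trade accuracy for processing rate.

```python
def suggest_parameter(current, buffer_fill, lo, hi, step,
                      low_water=0.25, high_water=0.75):
    """Return a new sampling rate within [lo, hi].

    buffer_fill is the fraction of the stage's input buffer in use
    (a hypothetical load signal): a nearly full buffer means the stage is
    falling behind, so sample less; a nearly empty one leaves room to
    sample more and regain accuracy.
    """
    if buffer_fill > high_water:
        current -= step
    elif buffer_fill < low_water:
        current += step
    return max(lo, min(hi, current))  # never leave the exposed range


rate = 0.5
for fill in (0.9, 0.9, 0.1, 0.5):
    rate = suggest_parameter(rate, fill, lo=0.25, hi=1.0, step=0.25)
```

The developer's part is limited to declaring the range and step; the middleware decides when to move the value, which is what lets the application code focus on the algorithm.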
43
API for Adaptation
    public class Sampling-Stage implements StreamProcessing {
        …
        void init() { … }
        …
        void work(buffer in, buffer out) {
            …
            while (true) {
                Image img = get-from-buffer-in-GATES(in);
                Image img-sample = Sampling(img, sampling-ratio);
                put-to-buffer-in-GATES(img-sample, out);
            }
            …
        }
    }
Adaptation calls:
    sampling-ratio = GATES.getSuggestedParameter();
    GATES.Information-About-Adjustment-Parameter(min, max, 1)
44
Outline
Automatic Data Virtualization
  – Relational/SQL
  – XML/XQuery based
Information Integration
Middleware for Streaming Data
Cluster and Grid-based data mining middleware
45
Scalable Mining Problem
Our understanding of what algorithms and parameters will give desired insights is often limited
The time required for creating scalable implementations of different algorithms and running them with different parameters on large datasets slows down the data mining process
46
Mining in a Grid Environment
A data mining application in a grid environment
  – Needs to exploit different forms of available parallelism
  – Needs to deal with different data layouts and formats
  – Needs to adapt to resource availability
47
FREERIDE Overview
Framework for Rapid Implementation of Datamining Engines
Demonstrated for a variety of standard mining algorithms
Targeted distributed memory parallelism, shared memory parallelism, and their combination
Can be used as a basis for scalable grid-based data mining implementations
Published in SDM 01, SDM 02, SDM 03, Sigmetrics 02, Europar 02, IPDPS 03, IEEE TKDE (to appear)
48
FREERIDE-G
Data processing may not be feasible where the data resides
Need to identify resources for data processing
Need to abstract data retrieval, movement and parallel processing
49
Conclusion
Distributed data-driven science:
  – We have a long way to go
The holy grail will be
  – The system finds all relevant data for you
  – The system finds all relevant analysis tools for you
  – The system best uses all possible resources to give you the fastest response
  – Does all of this transparently to you!
We will never get there, but the journey is interesting ….