Tools and Techniques for the Data Grid
Gagan Agrawal
Ohio State University, Department of Computer Science and Engineering
Scientific Data Analysis on Grid-based Data Repositories
Scientific data repositories
– Large volume
» Gigabytes, terabytes, petabytes
– Distributed datasets
» Generated/collected by scientific simulations or instruments
– Data could be streaming in nature
Scientific data analysis: data specification, data organization, data extraction, data movement, data analysis, data visualization
Opportunities
Scientific simulations and data collection instruments are generating large-scale data
Grid standards are enabling sharing of data
Rapidly increasing wide-area bandwidths
Motivating Scientific Applications
Data-driven applications from science, engineering, and biomedicine:
– Oil Reservoir Management
– Water Contamination Studies
– Cancer Studies using MRI
– Telepathology with Digitized Slides
– Satellite Data Processing
– Virtual Microscope
– …
Existing Efforts
Data grids recognized as an important component of grid/distributed computing
Major topics
– Efficient/secure data movement
– Replica selection
– Metadata catalogs / metadata services
– Setting up workflows
Open Issues
Accessing/retrieving/processing data from scientific repositories
– Need to deal with low-level formats
Integrating tools and services that have or require data in different formats
Support for processing streaming data in a distributed environment
Efficient distributed data-intensive applications
Developing scalable data analysis applications
Ongoing Projects
Automatic data virtualization
On-the-fly information integration in a distributed environment
Middleware for processing streaming data
Supporting coarse-grained pipelined parallelism
Compiling XQuery on scientific and streaming data
Middleware and algorithms for scalable data mining
Outline
Automatic data virtualization
– Relational/SQL-based
– XML/XQuery-based
Information integration
Middleware for streaming data
Coarse-grained pipelined parallelism
Automatic Data Virtualization: Motivation
Emergence of grid-based data repositories
– Can enable sharing of data in an unprecedented way
Access mechanisms for remote repositories
– Complex low-level formats make accessing and processing data difficult
Main desired functionality
– Ability to select, download, and process a subset of data
Current Approaches
Databases
– Relational model using SQL
– Transaction properties: atomicity, consistency, isolation, durability
– Good, but is it too heavyweight for read-mostly scientific data?
Manual implementation based on low-level datasets
– Needs a detailed understanding of low-level formats
HDF5, NetCDF, etc.
– No single established standard
BinX, BFD, DFDL
– Machine-readable descriptions, but the application is dependent on a specific layout
Data Virtualization
(Diagram: a data service exposes an abstract view of data, the data virtualization, over an underlying dataset.)
As defined by the Global Grid Forum's DAIS working group:
– A Data Virtualization describes an abstract view of data.
– A Data Service implements the mechanism to access and process data through the Data Virtualization.
Our Approach: Automatic Data Virtualization
Automatically create data services
– A new application of compiler technology
A meta-data descriptor describes the layout of data on a repository
An abstract view is exposed to the users
Two implementations:
– Relational/SQL-based
– XML/XQuery-based
System Overview
(Diagram: a query frontend accepts a select query and a user-defined aggregate as input; guided by the meta-data descriptor, compiler analysis and code generation produce an extraction service and an aggregation service that execute on STORM.)
Designing a Meta-data Description Language
Requirements
– Specify the relationship of a dataset to the virtual dataset schema
– Describe the dataset's physical layout within a file
– Describe the dataset's distribution across the nodes of one or more clusters
– Specify the subsetting index attributes
– Be easy to use for data repository administrators and convenient for our code generation
Design Overview
– Dataset schema description component
– Dataset storage description component
– Dataset layout description component
An Example: Oil Reservoir Management
– The dataset comprises several simulations (realizations) on the same grid
– For each realization and each grid point, a number of attributes are stored
– The dataset is stored on a 4-node cluster

Component I: Dataset Schema Description
[IPARS]                        // {* Dataset schema name *}
REL = short int                // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float

Component II: Dataset Storage Description
[IparsData]                    // {* Dataset name *}
DatasetDescription = IPARS     // {* Dataset schema for IparsData *}
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars
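To make the exposed relational abstraction concrete, here is a purely hypothetical Java sketch (not code produced by the tool; the record type, sample values, and predicate are assumptions for illustration only). It treats the virtual IPARS view defined by the schema descriptor above as a collection of tuples and applies the kind of subsetting predicate a select query would express.

// Illustrative only: a tuple of the virtual IPARS relation defined by the schema descriptor.
final class IparsTuple {
    final short rel; final int time;
    final float x, y, z, soil, sgas;
    IparsTuple(short rel, int time, float x, float y, float z, float soil, float sgas) {
        this.rel = rel; this.time = time; this.x = x; this.y = y; this.z = z;
        this.soil = soil; this.sgas = sgas;
    }
}

final class SubsetExample {
    // A hypothetical subsetting predicate over the virtual view, e.g. "REL = 1 AND 0 <= X <= 100".
    static boolean inSubset(IparsTuple t) {
        return t.rel == 1 && t.x >= 0 && t.x <= 100;
    }

    public static void main(String[] args) {
        java.util.List<IparsTuple> view = java.util.List.of(
            new IparsTuple((short) 1, 10, 5f, 2f, 3f, 0.3f, 0.7f),
            new IparsTuple((short) 2, 10, 150f, 2f, 3f, 0.4f, 0.6f));
        view.stream().filter(SubsetExample::inSubset)
            .forEach(t -> System.out.printf("REL=%d X=%.1f SOIL=%.2f%n", t.rel, t.x, t.soil));
    }
}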
Compiler Analysis
(Flow: meta-data descriptor, compiler analysis, create Aligned File Chunks (AFCs), process AFCs, index and extraction function code.)

Data_Extract {
    Find_File_Groups()
    Process_File_Groups()
}

Find_File_Groups {
    Let S be the set of files that match against the query
    Classify the files in S by the set of attributes they have
    Let S1, ..., Sm be the resulting m sets
    T = Ø
    foreach {s1, ..., sm}, si ∈ Si {      {* cartesian product of S1, ..., Sm *}
        if the values of the implicit attributes are not inconsistent {
            T = T ∪ {s1, ..., sm}
        }
    }
    Output T
}

Process_File_Groups {
    foreach {s1, ..., sm} ∈ T {
        Find_Aligned_File_Chunks()
        Supply implicit attributes for each file chunk
        foreach aligned file chunk {
            Check against the index
            Compute offset and length
            Output the aligned file chunk
        }
    }
}
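The following hypothetical Java sketch illustrates the Find_File_Groups step above; it is not the generated code, and the DataFile record and the consistency rule are simplifying assumptions. It groups matching files by their attribute sets, forms the cartesian product of the groups, and keeps only combinations whose shared implicit attribute values agree.

import java.util.*;

// Sketch of Find_File_Groups: classify files by attribute set, combine the sets via a
// cartesian product, and keep a combination only if its implicit attribute values agree.
final class FileGroupSketch {
    record DataFile(String name, Map<String, Integer> implicitAttrs) {}

    static List<List<DataFile>> findFileGroups(List<DataFile> matching) {
        // Classify files by the set of attributes they have (S1, ..., Sm).
        Map<Set<String>, List<DataFile>> byAttrSet = new LinkedHashMap<>();
        for (DataFile f : matching)
            byAttrSet.computeIfAbsent(f.implicitAttrs().keySet(), k -> new ArrayList<>()).add(f);

        List<List<DataFile>> result = new ArrayList<>();
        cartesian(new ArrayList<>(byAttrSet.values()), 0, new ArrayList<>(), result);
        return result;
    }

    private static void cartesian(List<List<DataFile>> sets, int i,
                                  List<DataFile> current, List<List<DataFile>> out) {
        if (i == sets.size()) {
            if (consistent(current)) out.add(new ArrayList<>(current));
            return;
        }
        for (DataFile f : sets.get(i)) {
            current.add(f);
            cartesian(sets, i + 1, current, out);
            current.remove(current.size() - 1);
        }
    }

    // A combination is consistent if no two files disagree on a shared implicit attribute.
    private static boolean consistent(List<DataFile> group) {
        Map<String, Integer> seen = new HashMap<>();
        for (DataFile f : group)
            for (var e : f.implicitAttrs().entrySet()) {
                Integer prev = seen.putIfAbsent(e.getKey(), e.getValue());
                if (prev != null && !prev.equals(e.getValue())) return false;
            }
        return true;
    }
}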
Testing the Ability of Our Code Generation Tool
Oil reservoir management dataset
– Correctly and efficiently handles a variety of different layouts for the same data
– The performance difference relative to Layout 0 is within 4%-10%
Evaluating the Scalability of Our Tool
– Scale the number of nodes hosting the oil reservoir management dataset
– Extract a subset of interest of size 1.3 GB
– The execution times scale almost linearly
– The performance difference varies between 5% and 34%, with an average difference of 16%
Comparison with an Existing Database (PostgreSQL)
– 6 GB of data for satellite data processing; the total storage required after loading the data into PostgreSQL is 18 GB
– Indexes created on both the spatial coordinates and S1 in PostgreSQL
– No special performance tuning applied for the experiment

Queries used:
1. SELECT * FROM TITAN;
2. SELECT * FROM TITAN WHERE X >= 0 AND X <= … AND Y >= 0 AND Y <= … AND Z >= 0 AND Z <= 100;
3. SELECT * FROM TITAN WHERE DISTANCE(X,Y,Z) < 1000;
4. SELECT * FROM TITAN WHERE S1 < 0.01;
5. SELECT * FROM TITAN WHERE S1 < 0.5;
Outline
Automatic data virtualization
– Relational/SQL-based
– XML/XQuery-based
Information integration
Middleware for streaming data
Coarse-grained pipelined parallelism
XML/XQuery Implementation
(Diagram: heterogeneous low-level sources such as text files, NetCDF, HDF5, and relational databases are exposed as XML and queried with XQuery.)
Programming/Query Language
High-level declarative languages ease application development
– Consider the popularity of Matlab for scientific computations
New challenges arise in compiling them for efficient execution
XQuery is a high-level language for processing XML datasets
– Derived from database, declarative, and functional languages
– XPath (a subset of XQuery) embedded in an imperative language is another option
Approach / Contributions
Use of XML Schemas to provide high-level abstractions of complex datasets
Using XQuery with these schemas to specify processing
Issues in translation
– High-level to low-level code
– Data-centric transformations for locality in low-level code
– Issues specific to XQuery
» Recognizing recursive reductions
» Type inference and translation
System Architecture
(Diagram: XQuery sources written against the external (logical) XML schema are compiled into C/C++ code, using an XML mapping service that relates the logical schema to the physical XML schema.)
Using a High-level Schema
High-level view of the dataset: a simple collection of pixels
– Latitude, longitude, and time are explicitly stored with each pixel
– Easy to specify processing: no need to worry about locality or unnecessary scans
– At least one order of magnitude overhead in storage
– Suitable as a logical format only
XQuery Overview
XQuery: a language for querying and processing XML documents
– Functional language
– Single assignment
– Strongly typed
XQuery expressions
– for / let / where / return (FLWR)
– unordered
– path expressions

unordered(
  for $d in document("depts.xml")//deptno
  let $e := document("emps.xml")//emp[deptno = $d]
  where count($e) >= 10
  return
    <big-dept>
      { $d,
        <headcount>{ count($e) }</headcount>,
        <avgsal>{ avg($e/salary) }</avgsal> }
    </big-dept>
)
Satellite Data Processing: XQuery Code
unordered(
  for $i in ($minx to $maxx)
  for $j in ($miny to $maxy)
  let $p := document("sate.xml")/data/pixel
  where $p/lat = $i and $p/long = $j
  return
    <pixel>
      <latitude>{ $i }</latitude>
      <longitude>{ $j }</longitude>
      <summary>{ accumulate($p) }</summary>
    </pixel>
)

define function accumulate($p) as double {
  let $inp := item-at($p, 1)
  let $NVDI := (($inp/band1 - $inp/band0) div ($inp/band1 + $inp/band0) + 1) * 512
  return
    if (empty($p)) then 0
    else max($NVDI, accumulate(subsequence($p, 2)))
}
Challenges
Need to translate to the low-level schema
– Focus on correctness and avoiding unnecessary reads
Enhancing locality
– Data-centric execution of XQuery constructs
– Use information about the low-level data layout
Issues specific to XQuery
– Reductions expressed as recursive functions (see the sketch below)
– Generating code in an imperative language
» For either direct compilation or for use as part of a runtime system
» Requires type conversion
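As an illustration of the recursive-reduction issue (a hypothetical hand-written sketch, not the compiler's actual output; the Pixel record is an assumption), the recursive accumulate() function from the satellite query can be recognized as a reduction and executed as an imperative loop:

// Hypothetical sketch: the recursive accumulate() reduction from the satellite query,
// rewritten as an imperative loop of the kind an imperative translation could use.
final class NvdiReduction {
    record Pixel(double band0, double band1) {}

    // max over ((band1 - band0) / (band1 + band0) + 1) * 512 for all pixels; 0 if the sequence is empty.
    static double accumulate(java.util.List<Pixel> pixels) {
        double result = 0.0;                     // value of the empty-sequence base case
        for (Pixel p : pixels) {
            double nvdi = ((p.band1() - p.band0()) / (p.band1() + p.band0()) + 1) * 512;
            result = Math.max(result, nvdi);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(accumulate(java.util.List.of(
            new Pixel(0.2, 0.6), new Pixel(0.3, 0.4))));
    }
}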
Outline
Automatic data virtualization
– Relational/SQL-based
– XML/XQuery-based
Information integration
Middleware for streaming data
Coarse-grained pipelined parallelism
Introduction: Information Integration
Goal: provide a uniform access/query interface to multiple heterogeneous sources
Challenges
– Global schema
– Query optimization
– Resource discovery
– Ontology discrepancies
– etc.
Introduction: Wrappers
Goal: provide the integration system with transparent access to data sources
Challenges
– Development cost
– Performance
– Transportability
Roadmap
– Introduction
– System overview
– Meta-data description language
– Wrapper generation
– Conclusion
Overview: Main Components
– User's view of the data: meta-data description language
– Mapping between input and output schemas: schema mapping
– Parsing inputs and generating outputs: DataReader and DataWriter
System Overview
(Diagram: the meta-data descriptor is parsed into an internal data entry representation; a mapping generator produces the schema mapping; the code generator emits a DataReader and a DataWriter, which the integrator uses to transform the source dataset into the target dataset.)
Meta-data Descriptor (1)
Design goals
– Easy to interpret and process
– Easy to write
– Sufficiently expressive
Added features (for bioinformatics datasets)
– Strings with no fixed size
– Delimiters used to separate fields
– Fields may be divided across lines/variables
– Total number of items unknown
Meta-data Descriptor (2)
Component I: Schema Description
[FASTA]                    // schema name
ID = string                // data field name = data type
DESCRIPTION = string
SEQ = string
Meta-data Descriptor (3)
Component II: Layout Description
DATASET "FASTAData" {              // dataset name
  DATATYPE {FASTA}                 // schema name
  DATASPACE LINESIZE=80 {          // file layout
    LOOP ENTRY 1:EOF:1 {
      ">" ID " " DESCRIPTION "\n"
      SEQ "\n" | EOF
    }
  }
  DATA {osu/fasta}                 // file location
}

Sample data:
>Example1 envelope protein
ELRLRYCAPAGFALLKCNDA
DYDGFKTNCSNVSVVHCTNL
MNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKH
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDT…
Meta-data Descriptor (4)
Input: SWISSPROT data
…
LOOP ENTRY 1:EOF:1 {
  "ID" ID
  LOOP I 1:3:1 { "\nDT" DATE }
  ["\nOG" ORGANELLE]
  …
  < "\nSQ SEQUENCE" LENGTH "AA;" MOL_WT "MW;" CRC "CRC64;"
  "\n//\n" | EOF
}
…

Sample entry:
ID   CRAM_CRAAB
AC   P01542;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   01-NOV-1997 (Rel. 35, Last annotation update)
DE   CRAMBIN.
GN   THI2.
OS   Crambe abyssinica (Abyssinian crambe).
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;
OC   core eudicots; Rosidae; eurosids II; Brassicales; Brassicaceae;
OC   Crambe.
…
SQ   SEQUENCE   46 AA;  4736 MW;  F6ADE458 CRC64;

>P1;CRAM_CRAAB
TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN
\\
…
Wrapper Generation: Mapping Generator
Goal: generate the schema mapping from the schema descriptors
Criterion: strict name matching

[input schema] : [output schema]        [SWISSPROT] : [FASTA]
source field : target field
  ID : ID
  SEQ : SEQ
  DESCRIPTION : DESCRIPTION "from SWISSPROT"
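A minimal sketch of strict name matching (illustrative only; the field lists and the mapping representation are assumptions, not the tool's internals):

import java.util.*;

// Illustrative sketch of strict name matching between an input and an output schema:
// a target field is mapped to the source field with exactly the same name.
final class NameMatcher {
    static Map<String, String> match(List<String> sourceFields, List<String> targetFields) {
        Map<String, String> mapping = new LinkedHashMap<>();   // target field -> source field
        for (String target : targetFields)
            if (sourceFields.contains(target)) mapping.put(target, target);
        return mapping;
    }

    public static void main(String[] args) {
        // SWISSPROT-style source fields versus FASTA target fields.
        List<String> swissprot = List.of("ID", "DATE", "ORGANELLE", "DESCRIPTION", "SEQ");
        List<String> fasta = List.of("ID", "DESCRIPTION", "SEQ");
        System.out.println(match(swissprot, fasta));   // {ID=ID, DESCRIPTION=DESCRIPTION, SEQ=SEQ}
    }
}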
Wrapper Generation: Parser
Key observation
– Data is stored in an entry-wise manner:
  LOOP ENTRY 1:EOF:1 { … single data entry … }
– Each entry is made of delimiter-variable pairs, with "environment" symbols in between
Wrapper Generation: Parse Tree
LOOP ENTRY 1:EOF:1 {
  ">" ID " " DESCRIPTION "\n"
  SEQ "\n" | EOF
}
(Parse tree: a Data Entry node with children ">"-ID, " "-DESCRIPTION, "\n"-SEQ, and "\n"-DUMMY|EOF.)
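As a purely illustrative data structure (an assumption, not the tool's actual representation), a node of such a parse tree can pair each delimiter with the variable it introduces:

import java.util.*;

// Illustrative parse-tree node for an entry layout: each child pairs a delimiter
// with the variable that follows it ("DUMMY" standing for a delimiter with no variable).
final class EntryNode {
    final String delimiter;     // e.g. ">", " ", "\n"
    final String variable;      // e.g. "ID", "DESCRIPTION", "SEQ", or "DUMMY"
    final boolean orEOF;        // true if end-of-file is an acceptable alternative
    final List<EntryNode> children = new ArrayList<>();

    EntryNode(String delimiter, String variable, boolean orEOF) {
        this.delimiter = delimiter; this.variable = variable; this.orEOF = orEOF;
    }

    public static void main(String[] args) {
        EntryNode entry = new EntryNode(null, "DataEntry", false);
        entry.children.add(new EntryNode(">", "ID", false));
        entry.children.add(new EntryNode(" ", "DESCRIPTION", false));
        entry.children.add(new EntryNode("\n", "SEQ", false));
        entry.children.add(new EntryNode("\n", "DUMMY", true));
        entry.children.forEach(c -> System.out.println(c.delimiter + " -> " + c.variable));
    }
}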
Wrapper Generation: Code Generator
Two application-specific modules are created:
– DataReader
  » Scans the source data file
  » Locates delimiter-variable (DLM-VAR) pairs
  » Submits each variable required by the target, along with its order
– DataWriter
  » Takes in a variable and its order
  » Looks up the DLM-VAR pair
  » Checks the line size
  » Writes the target file
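To make the DataReader's role concrete, here is a hypothetical hand-written sketch (not generated code; the parsing rules are simplified assumptions) that scans FASTA-style text and extracts the ID, DESCRIPTION, and SEQ fields of each entry in order:

import java.util.*;

// Hypothetical sketch of what a generated DataReader does for the FASTA layout:
// split the source at the ">" entry delimiter, then pull out the ID (up to the first
// space), the DESCRIPTION (rest of the header line), and the SEQ (remaining lines).
final class FastaReaderSketch {
    record Entry(String id, String description, String seq) {}

    static List<Entry> read(String source) {
        List<Entry> entries = new ArrayList<>();
        for (String chunk : source.split(">")) {
            if (chunk.isBlank()) continue;
            String[] lines = chunk.split("\n", 2);
            String header = lines[0];
            int space = header.indexOf(' ');
            String id = space < 0 ? header : header.substring(0, space);
            String description = space < 0 ? "" : header.substring(space + 1);
            String seq = lines.length > 1 ? lines[1].replace("\n", "") : "";
            entries.add(new Entry(id, description, seq));
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = ">Example1 envelope protein\nELRLRYCAPAGFALLKCNDA\nDYDGFKTNCSNVSVVHCTNL\n";
        read(sample).forEach(e ->
            System.out.println(e.id() + " | " + e.description() + " | " + e.seq()));
    }
}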
Outline
Automatic data virtualization
– Relational/SQL-based
– XML/XQuery-based
Information integration
Middleware for streaming data
Coarse-grained pipelined parallelism
Streaming Data Model
Continuous data arrival and processing
An emerging model for data processing
– Sources that produce data continuously: sensors, long-running simulations
– WAN bandwidths growing faster than disk bandwidths
An active topic in many computer science communities
– Databases
– Data mining
– Networking
– …
Summary/Limitations of Current Work
Existing work focuses on
– centralized processing of a stream from a single source (databases, data mining)
– communication only (networking)
Many applications involve
– distributed processing of streams
– streams from multiple sources
Motivating Application
(Diagram: a network fault management system monitoring streams from switches in a network.)
Motivating Application (2)
Computer-vision-based surveillance
Features of Distributed Stream Processing Applications
Data sources could be distributed
– Over a WAN
Continuous data arrival
Enormous volume
– Probably cannot communicate it all to one site
Results from the analysis may be desired at multiple sites
Real-time constraints
– A real-time, high-throughput, distributed processing problem
Motivation
Challenges and possible solutions
– Challenge 1: data-, communication-, and/or compute-intensive processing
» Solution: grid computing technologies
– Challenge 2: real-time analysis is required
» Solution: self-adaptation functionality is desired
Need for a Grid-Based Stream Processing Middleware
Application developers interested in data stream processing
– Would like to have abstracted away
» Grid standards and interfaces
» Adaptation functionality
– Would like to focus on algorithms only
GATES is a middleware for grid-based, self-adapting data stream processing
Using GATES
– Break the analysis into several sub-tasks that form a pipeline
– Implement each sub-task in Java
– Write an XML configuration file so the sub-tasks can be deployed automatically
– Launch the application by running a Java program (StreamClient.class) provided by GATES
System Architecture
Adaptation for Real-time Processing
Analysis of streaming data is approximate
The accuracy/execution-rate trade-off can be captured by certain parameters (adaptation parameters)
– Sampling rate
– Size of the summary structure
Application developers can expose these parameters and a range of values
API for Adaptation

public class Sampling-Stage implements StreamProcessing {
    ...
    void init() {
        ...
        // Declare the adaptation parameter and its range to GATES.
        GATES.Information-About-Adjustment-Parameter(min, max, 1);
    }

    void work(buffer in, buffer out) {
        ...
        while (true) {
            // Obtain the currently suggested value of the adaptation parameter.
            sampling-ratio = GATES.getSuggestedParameter();
            Image img = get-from-buffer-in-GATES(in);
            Image img-sample = Sampling(img, sampling-ratio);
            put-to-buffer-in-GATES(img-sample, out);
        }
        ...
    }
}
Outline
Automatic data virtualization
– Relational/SQL-based
– XML/XQuery-based
Information integration
Middleware for streaming data
Coarse-grained pipelined parallelism
Context: Coarse-Grained Pipelined Parallelism
(Diagram: motivating application scenarios, with data accessed across the Internet.)
Motivating Application Classes
Scientific data analysis
– Solving shallow water equations (SWE)
– Developing an Eastern North Pacific tidal model
Data mining
– k-nearest-neighbor search
– k-means clustering
– Hot list queries
Visualization
– Visualizing time-dependent, two-dimensional wake vortex computations
– Iso-surface rendering
Image analysis
– Virtual Microscope
Ways to Implement
– Local processing
– Remote processing
– Our approach: a coarse-grained pipelined execution model is a good match
(Diagrams contrast local, remote, and pipelined processing of data accessed over the Internet.)
Overview of Our Efforts
– Language and compiler framework for coarse-grained pipelined parallelism (SC 2003)
– Reduction strategies (SC 2003)
– Support for program adaptation (SC 2004)
– DataCutter runtime system (Saltz et al.)
– Packet size optimization (ICPP 2004)
– Filter decomposition problem (submitted)
Group Members
Ph.D. students
– Liang Chen
– Wei Du
– Leo Glimcher
– Ruoming Jin
– Xiaogang Li
– Kaushik Sinha
– Li Weng
– Xuan Zhang
Master's students
– Anjan Goswami
– Swarup Sahoo
Getting Involved
– Talk to me
– Most recent papers are available online
– Sign up for my 888