Compiler (and Runtime) Support for CyberInfrastructure Gagan Agrawal (joint work with Wei Du, Xiaogang Li, Ruoming Jin, Li Weng)
What is CyberInfrastructure? How computing is done is changing with advances in the internet and the emergence of the web: we access web pages, data, and web services from the internet. What does this mean for large-scale computing? Supercomputers are no longer stand-alone resources, and large data repositories are common.
What is CyberInfrastructure? Infrastructures we are familiar with: transportation infrastructure, telecommunication infrastructure, power supply/distribution infrastructure. CyberInfrastructure means large-scale computing infrastructure on the internet: it enables sharing of resources and large-scale web services. Access and process a 1-terabyte file as a web service; run a job on a large supercomputer using your web browser!
CyberInfrastructure CyberInfrastructure is also a new division within the CISE directorate of the National Science Foundation, which shows its importance. It needs new research at all levels: networking and parallel computing hardware, system software, and applications.
Why is Compiler Support Needed for CyberInfrastructure? Compilers have often simplified application development, and application development for CyberInfrastructure is a hard problem! We need transparency across different resources, transparency across different dataset sources and formats, and applications that adapt to resource availability, …
Outline Compiler-supported coarse-grained pipelined parallelism: Why? How? XML-based front-ends to scientific datasets. Compiler support for application self-adaptation. A SQL front-end to a grid data management system.
General Motivation Language and compiler support for many forms of parallelism has been explored: shared-memory parallelism, instruction-level parallelism, distributed-memory parallelism, multithreaded execution. Application and technology trends are making another form of parallelism desirable and feasible: coarse-grained pipelined parallelism.
Coarse-Grained Pipelined Parallelism (CGPP) Definition: the computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units. Example: K-nearest neighbors. Given a 3-D range R and a point p = (a, b, c), we want to find the K nearest neighbors of p within R. The pipeline has two stages: Range_query, then Find the K-nearest neighbors.
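The two stages above can be sketched as a two-unit pipeline in plain Java (a minimal illustration, not the talk's dialect or the DataCutter implementation): stage 1 streams the points that fall inside R over a queue, and stage 2 consumes them while maintaining the K nearest neighbors of p. All class and method names here are invented for this sketch.

```java
import java.util.*;
import java.util.concurrent.*;

public class KnnPipeline {
    public static List<double[]> run(List<double[]> data, double[] lo, double[] hi,
                                     double[] p, int k) throws InterruptedException {
        BlockingQueue<Optional<double[]>> q = new ArrayBlockingQueue<>(16);

        // Stage 1: range query -- forwards only points inside the box [lo, hi]
        Thread rangeQuery = new Thread(() -> {
            try {
                for (double[] pt : data) {
                    boolean inside = true;
                    for (int d = 0; d < 3; d++)
                        inside = inside && lo[d] <= pt[d] && pt[d] <= hi[d];
                    if (inside) q.put(Optional.of(pt));
                }
                q.put(Optional.empty()); // end-of-stream marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        rangeQuery.start();

        // Stage 2: keep the K points closest to p (max-heap ordered by distance)
        PriorityQueue<double[]> best = new PriorityQueue<>(
                (a, b) -> Double.compare(dist(b, p), dist(a, p)));
        for (Optional<double[]> item = q.take(); item.isPresent(); item = q.take()) {
            best.add(item.get());
            if (best.size() > k) best.poll(); // drop the farthest point
        }
        rangeQuery.join();
        return new ArrayList<>(best);
    }

    static double dist(double[] a, double[] b) { // squared Euclidean distance
        double s = 0;
        for (int d = 0; d < 3; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }
}
```

Because the stages communicate only through the queue, they can be placed on different machines along the path from the data repository to the user, which is the point of the coarse-grained pipelined model.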
Coarse-Grained Pipelined Parallelism is Desirable & Feasible Application scenarios: internet data.
Coarse-Grained Pipelined Parallelism is Desirable & Feasible A new class of data-intensive applications: scientific data analysis, data mining, data visualization, image analysis. There are two direct ways to implement such applications: downloading all the data to the user's machine (often not feasible), or computing at the data repository (usually too slow).
Coarse-Grained Pipelined Parallelism is Desirable & Feasible Our belief: a coarse-grained pipelined execution model is a good match for processing internet data.
Coarse-Grained Pipelined Parallelism Needs Compiler Support The computation needs to be decomposed into stages, and decomposition decisions depend on the execution environment: how many computing sites are available, how many computing cycles are available on each site, what communication links are available, and what the bandwidth of each link is. The code for each stage follows the same processing pattern, so it can be generated by the compiler. Shared- or distributed-memory parallelism also needs to be exploited. High-level language and compiler support are therefore necessary.
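To make the decomposition decision concrete, here is a hedged sketch of the kind of cost model involved: given per-stage work, inter-stage data volumes, site speeds, and link bandwidths, choose the contiguous split of pipeline stages across sites that minimizes the pipeline bottleneck. The model and all numbers are illustrative assumptions, not the compiler's actual algorithm.

```java
public class Decompose {
    // work[i]: compute cost of stage i; out[i]: data volume leaving stage i
    // speed[j]: computing power of site j; bw[j]: bandwidth of link j -> j+1
    public static double bestBottleneck(double[] work, double[] out,
                                        double[] speed, double[] bw) {
        return search(work, out, speed, bw, 0, 0);
    }

    // Assign stages[from..] to sites[site..]; return the minimal bottleneck,
    // i.e. the cost of the slowest stage-block or link in the best split.
    private static double search(double[] work, double[] out, double[] speed,
                                 double[] bw, int from, int site) {
        int n = work.length, m = speed.length;
        if (site == m - 1) { // last site takes all remaining stages
            double w = 0;
            for (int i = from; i < n; i++) w += work[i];
            return w / speed[site];
        }
        double best = Double.POSITIVE_INFINITY;
        double w = 0;
        for (int cut = from; cut < n; cut++) { // this site runs stages [from, cut]
            w += work[cut];
            double here = Math.max(w / speed[site], out[cut] / bw[site]);
            double rest = search(work, out, speed, bw, cut + 1, site + 1);
            best = Math.min(best, Math.max(here, rest));
        }
        return best;
    }
}
```

In the example below, keeping an early filtering stage at the data-hosting site avoids shipping a large volume over the slow link, which is exactly the trade-off the compiler must weigh.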
An Entire Picture A Java dialect is processed by the compiler support (decomposition and code generation), which targets the DataCutter runtime system.
Language Dialect Goal: to give the compiler information about independent collections of objects, parallel loops, reduction operations, and pipelined parallelism. Extensions of Java: Pipelined_loop, Domain & Rectdomain, the Foreach loop, and reduction variables.
ISO-Surface Extraction Example Code
public class isosurface {
  public static void main(String arg[]) {
    float iso_value;
    RectDomain CubeRange = [min:max];
    CUBE[1d] InputData = new CUBE[CubeRange];
    Point p, b;
    RectDomain PacketRange = [1:runtime_def_num_packets];
    RectDomain EachRange = [1:(max-min)/runtime_def_num_packets];
    Pipelined_loop (b in PacketRange) {
      Foreach (p in EachRange) {
        InputData[p].ISO_SurfaceTriangles(iso_value, …);
      }
      … …
    }
  }
}
The Foreach corresponds to the sequential loop:
for (int i = min; i < max-1; i++) {
  // operate on InputData[i]
}
A Pipelined_loop whose body contains stages 0 through n-1 (each a foreach or a statement S) executes those stages as a pipeline over the packets in PacketRange (for example, PacketRange = [1:4]), with merge steps combining the per-packet results.
Experimental Results Versions: (1) Default version: the site hosting the data only reads and transmits data, with no processing at all; the user's desktop only views the results, with no processing at all; all the work is done by the compute nodes, so their workload is heavy and the communication volume is high. (2) Compiler-generated version: intelligent decomposition is done by the compiler, and more computation is performed on the end nodes; the workload is balanced across nodes and the communication volume is reduced. (3) Manual version: hand-written DataCutter filters with a decomposition similar to the compiler-generated version.
Experimental Results: ISO-Surface Rendering (Z-Buffer Based) (charts: speedup and % improvement over the default version versus width of the pipeline, for a small dataset of 150 MB and a large dataset of 600 MB)
Outline Compiler-supported coarse-grained pipelined parallelism: Why? How? XML-based front-ends to scientific datasets. Compiler support for application self-adaptation. A SQL front-end to a grid data management system.
Motivation The need Analysis of datasets is becoming crucial for scientific advances Emergence of X-Informatics Complex data formats complicate processing Need for applications that are easily portable – compatibility with web/grid services The opportunity The emergence of XML and related technologies developed by W3C XML is already extensively used as part of Grid/Distributed Computing Can XML help in scientific data processing?
The Big Picture (diagram: data stored in many physical formats, such as TEXT, NetCDF, RDBMS, and HDF5, viewed as XML and queried with XQuery; the open question is how to bridge the two)
Programming/Query Language High-level declarative languages ease application development, as the popularity of Matlab for scientific computations shows, but they pose new challenges in compiling them for efficient execution. XQuery is a high-level language for processing XML datasets, derived from database, declarative, and functional languages! XPath (a subset of XQuery) embedded in an imperative language is another option.
Approach / Contributions Use of XML Schemas to provide high-level abstractions on complex datasets Using XQuery with these Schemas to specify processing Issues in Translation High-level to low-level code Data-centric transformations for locality in low-level codes Issues specific to XQuery Recognizing recursive reductions Type inferencing and translation
System Architecture (diagram: XQuery sources written against the external schema, a logical XML schema, are processed by the compiler, which uses the XML mapping service to relate the logical XML schema to the physical XML schema and generates C/C++ code)
Satellite Data Processing The data collected by satellites is a collection of chunks, each of which captures an irregular section of the earth at some time t. The entire dataset comprises multiple pixels for each point on the earth at different times, but not for all times. Typical processing is a reduction along the time dimension, which is hard to write against the raw data format.
Using a High-level Schema The high-level view of the dataset is a simple collection of pixels, with latitude, longitude, and time explicitly stored with each pixel. This makes processing easy to specify: the user need not worry about locality or unnecessary scans. It carries at least one order of magnitude overhead in storage, so it is suitable as a logical format only.
XQuery Overview XQuery: a language for querying and processing XML documents; a functional, single-assignment, strongly typed language. XQuery expressions: for-let-where-return (FLWR), unordered, path expressions. Example:
unordered(
  for $d in document("depts.xml")//deptno
  let $e := document("emps.xml")//emp[deptno = $d]
  where count($e) >= 10
  return
    <big_dept>
      { $d,
        <headcount>{ count($e) }</headcount>,
        <avgsal>{ avg($e/salary) }</avgsal> }
    </big_dept>
)
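For readers more familiar with imperative languages, the FLWR query above corresponds roughly to this Java-streams sketch: group employees by department, keep departments with at least 10 members, and report the count and average salary. The Emp record and method names are invented for illustration; "unordered" corresponds to the grouping imposing no particular order.

```java
import java.util.*;
import java.util.stream.*;

public class FlwrAnalogue {
    public record Emp(String deptno, double salary) {}

    // returns deptno -> {count, average salary} for departments with >= 10 employees
    public static Map<String, double[]> bigDepts(List<Emp> emps) {
        return emps.stream()
            .collect(Collectors.groupingBy(Emp::deptno))   // for $d ... let $e := ...
            .entrySet().stream()
            .filter(e -> e.getValue().size() >= 10)        // where count($e) >= 10
            .collect(Collectors.toMap(
                Map.Entry::getKey,                         // return { $d, ... }
                e -> new double[]{
                    e.getValue().size(),                   // count($e)
                    e.getValue().stream()
                        .mapToDouble(Emp::salary)
                        .average().orElse(0) }));          // avg($e/salary)
    }
}
```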
Satellite: XQuery Code
unordered(
  for $i in ($minx to $maxx)
  for $j in ($miny to $maxy)
  let $p := document("sate.xml")/data/pixel[lat = $i and long = $j]
  return
    <pixel> { $i } { $j } { accumulate($p) } </pixel>
)
define function accumulate($p) as double {
  if (empty($p)) then 0
  else
    let $inp := item-at($p, 1)
    let $NVDI := (($inp/band1 - $inp/band0) div ($inp/band1 + $inp/band0) + 1) * 512
    return max($NVDI, accumulate(subsequence($p, 2)))
}
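The accumulate function above is a recursive reduction: it combines the head of the sequence with the reduction of the tail through an associative operator (max). Once such a pattern is recognized, it can be computed by an iterative, constant-space loop instead of recursion. A sketch in plain Java, with an invented Pixel type standing in for the pixel elements:

```java
public class RecursiveReduction {
    public record Pixel(double band0, double band1) {}

    // per-element computation, as in the query above
    static double nvdi(Pixel p) {
        return ((p.band1() - p.band0()) / (p.band1() + p.band0()) + 1) * 512;
    }

    // what the recursive accumulate computes (0 for an empty sequence) ...
    public static double accumulate(java.util.List<Pixel> ps, int i) {
        if (i == ps.size()) return 0;
        return Math.max(nvdi(ps.get(i)), accumulate(ps, i + 1));
    }

    // ... and the loop a compiler could emit once the reduction is recognized
    public static double accumulateLoop(java.util.List<Pixel> ps) {
        double acc = 0; // identity value, matching the recursive base case
        for (Pixel p : ps) acc = Math.max(acc, nvdi(p));
        return acc;
    }
}
```

The loop form also exposes the reduction as data-parallel: the elements can be combined in any order, which is what the compiler exploits.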
Challenges Need to translate to the low-level schema, focusing on correctness and avoiding unnecessary reads. Enhancing locality: data-centric execution of XQuery constructs, using information on the low-level data layout. Issues specific to XQuery: reductions expressed as recursive functions, and generating code in an imperative language, for either direct compilation or use as part of a runtime system, which requires type conversion.
Mapping to the Low-level Schema A number of getData functions access element(s) of the required types. The getData functions are written in XQuery, which allows analysis and transformations. We want to insert getData functions automatically, preserving correctness and avoiding unnecessary scans. Examples: getData(lat x, long y), getData(lat x), getData(long y), getData(lat x, long y, time t), …
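One way to picture the insertion decision: among the available getData variants, pick the one whose index attributes cover as many of the query's bound attributes as possible, so that the fewest elements have to be scanned. A toy sketch (the matching rule here is a deliberate simplification, not the compiler's actual algorithm; variant names mirror the list above):

```java
import java.util.*;

public class GetDataSelect {
    // each getData variant is described by the set of attributes it indexes on
    public static final List<Set<String>> VARIANTS = List.of(
        Set.of("lat", "long"),
        Set.of("lat"),
        Set.of("long"),
        Set.of("lat", "long", "time"));

    // choose the variant usable with the query's bound attributes
    // that covers the largest number of them
    public static Set<String> choose(Set<String> bound) {
        Set<String> best = Set.of();
        for (Set<String> v : VARIANTS)
            if (bound.containsAll(v) && v.size() > best.size())
                best = v;
        return best;
    }
}
```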
Summary – XML-Based Front-ends A case for the use of XML technologies in scientific data analysis. XQuery: a data-parallel language? Identified and addressed compilation challenges; a compilation system has been built. Very large performance gains come from data-centric transformations. Preliminary evidence shows that high-level abstractions and a query language do not degrade performance substantially.
Outline Compiler-supported coarse-grained pipelined parallelism: Why? How? XML-based front-ends to scientific datasets. Compiler support for application self-adaptation. A SQL front-end to a grid data management system.
Applications in a Grid Environment Their characteristics, summarized: long-running applications; adaptation to changing environments is desirable; constraint-based response times; output can be varied in a given range (resolution, accuracy, precision). How do we achieve adaptation?
Proposed Language Extensions
public interface Adapt_Spec {
  String constraints;  // "RESP_TIME <= 50ms"
  List opti_vars;      // "m", "clipwin.x"
  List thresholds;     // "m >= N", "sampling_factor >= 1"
  List opti_dir;
}
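A hedged sketch of how a runtime might act on such a specification: when the measured response time violates the constraint, shrink the optimization variable m toward its threshold N; when there is slack, grow it back to improve output quality. The cost model (response time proportional to m) and all names are illustrative assumptions, not the implemented system.

```java
public class Adaptation {
    static double msPerUnit = 2.0; // illustrative cost model: resp time = m * msPerUnit

    static double respTimeMs(int m) { // stand-in for an actual measurement
        return m * msPerUnit;
    }

    // one adaptation step: returns the adjusted value of the optimization variable m
    public static int adapt(int m, int N, double limitMs) {
        double t = respTimeMs(m);
        if (t > limitMs && m > N) return m - 1;     // constraint violated: reduce quality
        if (t + msPerUnit <= limitMs) return m + 1; // slack available: improve quality
        return m;                                   // at equilibrium
    }

    // drive adaptation until m stops changing
    public static int settle(int m, int N, double limitMs) {
        for (int prev = -1; prev != m; ) {
            prev = m;
            m = adapt(m, N, limitMs);
        }
        return m;
    }
}
```

Under this toy model, both an over-budget and an under-budget starting point converge to the largest m whose response time still meets the constraint.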
Implementation Issues & Strategies Language Aspect Compiler Implementation Performance Modeling & Resource Monitoring Experimental Design
Outline Compiler-supported coarse-grained pipelined parallelism: Why? How? XML-based front-ends to scientific datasets. Compiler support for application self-adaptation. A SQL front-end to a grid data management system.
Overview of the Project A cyber-infrastructure/grid environment comprises distributed data sources, and users would like seamless access to the data. SQL is popular for accessing data from a single database; we want SQL for grid-based accesses, where the data is distributed and is not managed by a relational database system. We need to export data layout information to the query planner.
Overview (Contd.) Use GridDB-lite, a grid data management middleware, as the backend. Define and use a data description language. Parse SQL queries and the data description language, and generate a GridDB-lite application.
Design Dataset description file: the dataset schema. Dataset list file: cluster configuration and dataset storage locations. Meta-data: the logical data space (number of dimensions), attributes for index declaration, partitioning, and physical data storage annotation.
Description file:
[IPARS]
RID = INT2
TIME = INT4
X = FLOAT
Y = FLOAT
Z = FLOAT
POIL = FLOAT
PWAT = FLOAT
……
[bh]
DatasetDescription = IPARS
io = file
Dim = 17x65x65
Npart = 8
…
Data list file:
Osumed1 = osumed01.epn.osc.edu, osumed02.epn.osc.edu, …
0 = bh-10-1 osumed1 /scratch1/bh-10-1
1 = bh-10-2 osumed1 /scratch1/bh-10-2
……
Meta-data:
{ Group "ROOT" {
    DATASET "bh" {
      DATATYPE { IPARS }
      DATASPACE { RANK 3 }
      DATAINDEX { RID, TIME }
      PARTS { 9503, 9503, 9537, 9554, 9503, 9707, 9520, 9520 }
      DATA { DATASET SPACIAL, DATASET POIL, DATASET PWAT, …… }
    }
    Group "SUBGROUP" {
      DATASET "SPACIAL" {
        DATATYPE { }
        DATASPACE { SKIP 4 LINES LOOP PARTS { X SPACE Y SPACE Z SKIP 1 LINE } }
        DATA { PART in (0,1,2,3,4,5,6,7).0.PART.5.init }
      }
      DATASET "POIL" {
        DATATYPE { }
        DATASPACE { LOOP TIME { SKIP 1 double LOOP PARTS { POIL } } }
        DATA { PART in (0,1,2,3,4,5,6,7).0.PART.5.0 }
      }
      ……
    }
  }
}
Description file:
[TITAN]
X = INT4
Y = INT4
Z = INT4
S1 = INT4
S2 = INT4
S3 = INT4
S4 = INT4
S5 = INT4
[TitanData]
DatasetDescription = TITAN
io = file
Dim = NULL
Npart = 1
Data list file:
Osumed1 = osumed01.epn.osc.edu
0 = NULL osumed1 /scratch1/weng/Titan/
Meta-data:
{ Group "ROOT" {
    DATASET "TitanData" {
      DATATYPE { TITAN }
      DATASPACE { RANK 3 }
      DATAINDEX { FID, OFFSET, BSIZE }
      DATA { DATASET TITAN, INDEXSET TITANINDEX }
    }
    Group "SUBGROUP" {
      DATASET "TITAN" {
        DATATYPE { struct TITAN_Record_t { unsigned int x, y, z; unsigned int s1, s2, s3, s4, s5; }; }
        DATASPACE { LOOP { struct TITAN_Record_t } }
        DATA { 0 }
      }
      INDEXSET "TITANINDEX" {
        DATATYPE { HOST hostid; struct Block3D { MBR rect; JMP jmp; FID fid; OFFSET offset; BSIZE bsize; }; }
        DATASPACE { LOOP { HOST SPACE struct Block3D } }
        DATA { IndexFile }
      }
    }
  }
}
Compilation Issues Interface between Index() and Extractor(): range queries (a chunk can be totally in the query range, partially in the query range, or totally outside it), and how to choose a suitable size for the indexed chunks. Interface between Extractor() and GridDB-lite: explore alternative methods to get tuples/records; a smarter extractor can signal GridDB-lite to perform some filtering operations. Query transformation and optimization? Host allocation for the stages (DP, DM, Client). Some other potential issues: the granularity of a "tuple", data partitioning methods, …
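The first issue above, distinguishing chunks that are totally inside, partially inside, or totally outside the query range, can be handled by classifying each chunk's bounding box before extraction: OUTSIDE chunks are skipped without being read, INSIDE chunks need no per-tuple filtering, and only PARTIAL chunks require the extractor to filter tuples. A minimal sketch with axis-aligned boxes; the names are invented for illustration:

```java
public class ChunkFilter {
    public enum Overlap { INSIDE, PARTIAL, OUTSIDE }

    // classify the chunk box [clo, chi] against the query box [qlo, qhi],
    // one pair of bounds per dimension
    public static Overlap classify(double[] clo, double[] chi,
                                   double[] qlo, double[] qhi) {
        boolean inside = true;
        for (int d = 0; d < clo.length; d++) {
            // disjoint in any dimension => the chunk can be skipped entirely
            if (chi[d] < qlo[d] || clo[d] > qhi[d]) return Overlap.OUTSIDE;
            // fully contained only if contained in every dimension
            inside = inside && qlo[d] <= clo[d] && chi[d] <= qhi[d];
        }
        return inside ? Overlap.INSIDE : Overlap.PARTIAL;
    }
}
```

The chunk-size question interacts with this test: smaller chunks make the INSIDE/OUTSIDE classes more common but increase index size.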
Other Research Areas Runtime support systems Ease parallelization of data mining algorithms in a cluster environment (FREERIDE) Grid-based processing of distributed data streams Algorithms for Data Mining / OLAP Parallel and scalable algorithms Algorithms for processing distributed data streams
Group Members Seven Ph.D. students: Liang Chen, Wei Du, Anjan Goswami, Ruoming Jin, Xiaogang Li, Li Weng, Xuan Zhang. Two Master's students: Leo Glimcher, Swarup Sahoo. Part-time student: Kolagatla Reddy.
Getting Involved Talk to me. Sign up for my 888.