The Ohio State University

Presentation transcript:

Compiler Support for Efficient Processing of XML Datasets
Xiaogang Li, Renato Ferreira, Gagan Agrawal
The Ohio State University

Motivation
The need
  - Analysis of datasets is becoming crucial for scientific advances
  - Complex data formats complicate processing
  - Need for applications that are easily portable: compatibility with web/grid services
The opportunity
  - The emergence of XML and related technologies developed by W3C
  - XML is already extensively used as part of Grid/Distributed Computing
Can XML help in scientific data processing?

Motivation (Contd.)
  - High-level declarative languages ease application development
    - Popularity of Matlab for scientific computations
  - New challenges in compiling them for efficient execution
  - XQuery is a high-level language for processing XML datasets
    - Derived from database, declarative, and functional languages

Approach / Contributions
  - Show how XML schemas and XQuery simplify scientific data processing
  - Identify issues in compiling XQuery for this class of applications
  - New compiler analyses and algorithms
  - Developed and evaluated a compiler

Outline
  - XQuery language features
  - Scientific data processing applications
  - Compilation challenges
  - Compiler analyses
    - Identifying and parallelizing reductions
    - Data-centric transformations
    - Type inferencing for XQuery to C++ translation
  - System architecture
    - Use of Active Data Repository (ADR) as the runtime system
  - Experimental results
  - Conclusions

XQuery Overview
XQuery
  - A language for querying and processing XML documents
  - Functional language
  - Single assignment
  - Strongly typed
XQuery expressions
  - for let where return (FLWR)
  - unordered
  - path expression

  unordered(
    for $d in document("depts.xml")//deptno
    let $e := document("emps.xml")//emp[Deptno = $d]
    where count($e) >= 10
    return
      <big-dept>
        { $d,
          <count>{ count($e) }</count>,
          <avg>{ avg($e/salary) }</avg> }
      </big-dept>
  )
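For readers unfamiliar with FLWR expressions, the query above can be read as an ordinary loop with a filter. The following is a minimal Python sketch of that reading, not part of the original slides. It assumes departments arrive as a list of department numbers and employees as a list of dictionaries with hypothetical keys `deptno` and `salary`; the `min_count` parameter is an illustrative stand-in for the constant 10 in the query.

```python
from statistics import mean


def big_depts(deptnos, emps, min_count=10):
    """Mirror the FLWR query clause by clause (illustrative only)."""
    results = []
    for d in deptnos:                                     # for $d in ...//deptno
        e = [emp for emp in emps if emp["deptno"] == d]   # let $e := ...//emp[Deptno = $d]
        if len(e) >= min_count:                           # where count($e) >= 10
            results.append({                              # return <big-dept> ... </big-dept>
                "deptno": d,
                "count": len(e),
                "avg": mean(emp["salary"] for emp in e),
            })
    # The query is wrapped in unordered(...), so callers should not
    # rely on the order of the returned list.
    return results
```

Because the expression is marked unordered, an implementation is free to visit departments in any order, which is what later makes loop restructuring possible.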

Target Applications
  - Focus on scientific data processing applications
  - Arise in a number of scientific domains
  - Frequently involve generalized reductions
  - Can be expressed conveniently in XQuery
  - Low-level data layout is hidden from the application programmers

Satellite: XQuery Code

  unordered(
    for $i in ($minx to $maxx)
    for $j in ($miny to $maxy)
    let $p := document("sate.xml")/data/pixel
    return
      <pixel>
        <latitude>{ $i }</latitude>
        <longitude>{ $j }</longitude>
        <sum>{ accumulate($p) }</sum>
      </pixel>
  )

  define function accumulate($p) as double
  {
    let $inp := item-at($p, 1)
    let $NVDI := (($inp/band1 - $inp/band0) div ($inp/band1 + $inp/band0) + 1) * 512
    return
      if (empty($p)) then 0
      else max($NVDI, accumulate(subsequence($p, 2)))
  }
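To make the reduction in `accumulate` concrete, here is a minimal Python sketch of the same recursion, offered as an illustration rather than as the authors' implementation. It assumes each pixel is a dictionary with hypothetical keys `band0` and `band1`, and it checks for the empty sequence before reading the first item.

```python
def pixel_value(band0, band1):
    """Per-pixel value from the slide: ((band1 - band0) / (band1 + band0) + 1) * 512.

    Assumes band0 + band1 is non-zero.
    """
    return ((band1 - band0) / (band1 + band0) + 1) * 512


def accumulate(pixels):
    """Recursive maximum over a sequence of pixels, with 0 for the empty sequence."""
    if not pixels:
        return 0
    head = pixels[0]
    # pixels[1:] plays the role of subsequence($p, 2); it copies the tail
    # on every call, which hints at why direct execution is costly.
    return max(pixel_value(head["band0"], head["band1"]), accumulate(pixels[1:]))
```

For example, a pixel with `band0 = 1` and `band1 = 3` yields (2 / 4 + 1) * 512 = 768.0.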

VMScope: XQuery Code

  unordered(
    for $i in ($x1 to $x2)
    for $j in ($y1 to $y2)
    let $p := document("vmscope.xml")/data/pixel[(x = $i) and (y = $j) and (scale >= $z1)]
    return
      <pixel>
        <latitude>{ $i }</latitude>
        <longitude>{ $j }</longitude>
        <sum>{ accumulate($p) }</sum>
      </pixel>
  )

  define function accumulate($p) as element
  {
    if (empty($p)) then $null
    else
      let $max := accumulate(subsequence($p, 2))
      let $q := item-at($p, 1)
      return
        if (($q/scale < $max/scale) or ($max = $null)) then $max
        else $q
  }

Challenges
  - Reductions expressed as recursive functions
    - Direct execution can be very expensive
  - Generating code in an imperative language
    - For either direct compilation or use as part of a runtime system
    - Requires type conversion
  - Enhancing locality
    - Data-centric execution on XQuery constructs
    - Use information on low-level data layout

System Architecture
[Figure: architecture diagram. Components shown: XQuery Sources, XML Schema, XML Files, Compiler, XML Mapping Service, Data Indexing Service, Data Distribution Service]

Compilation Framework
[Figure: compilation pipeline. Labels shown: XQuery, Schema, XQuery Parser, Schema Parser, Compiler Front-end, Compiler Analysis (Recursion Analysis, Type Analysis, Data Centric Analysis), ADR Code Generation, Local Reduction, Global Reduction]

Compiler Analysis and Tasks
  - Analysis of recursive functions
    - Extract the associative and commutative operations involved
    - Transform recursive functions into iterative operations
  - Data-centric transformation
    - Reconstruct unordered loops of the query
    - New strategy requires only one scan of the dataset
  - Type inferencing and analysis

Recursion Analysis
Assumption
  - Expressed in canonical form
    1) Linear recursive function
    2) Operation is associative and commutative
Objective
  - Extract associative and commutative operations
  - Extract initialization conditions
  - Transform into iterative operations
  - Generate a global reduction function
Canonical form

  define function F($t)
  {
    if (p1) then F1($t)
    else F2(F3($t), F4(F(F5($t))))
  }
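The move from the canonical recursive form to an iterative reduction can be sketched in Python as follows. This is an illustration under stated assumptions, not the compiler's actual transformation: F4 is taken to be the identity, F2 is assumed associative and commutative, and the lowercase names `p1`, `f1`, `f2`, `f3`, `f5` are hypothetical parameters standing in for the slide's p1, F1, F2, F3, F5.

```python
def recursive_form(t, p1, f1, f2, f3, f5):
    """Canonical form with F4 assumed to be the identity."""
    if p1(t):
        return f1(t)
    return f2(f3(t), recursive_form(f5(t), p1, f1, f2, f3, f5))


def iterative_form(t, p1, f1, f2, f3, f5):
    """Same result as recursive_form when f2 is associative and commutative."""
    acc = None
    have_acc = False
    while not p1(t):
        value = f3(t)
        # Re-associating and re-ordering is valid only because f2 is
        # assumed associative and commutative.
        acc = f2(acc, value) if have_acc else value
        have_acc = True
        t = f5(t)
    base = f1(t)  # the initialization condition extracted from the base case
    return f2(acc, base) if have_acc else base


def recursive_max(seq):
    """Instance: max over a list with base value 0, written recursively."""
    return recursive_form(seq, lambda s: not s, lambda s: 0, max,
                          lambda s: s[0], lambda s: s[1:])


def iterative_max(seq):
    """The same instance evaluated by the loop."""
    return iterative_form(seq, lambda s: not s, lambda s: 0, max,
                          lambda s: s[0], lambda s: s[1:])
```

Once the reduction is a loop with an accumulator, each processor can reduce its own portion of the data and the partial results can be combined with the same operation, which is the role a global reduction function would play.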

Recursion Analysis: Algorithm
  1. Add the leaf nodes that represent, or are defined by, the recursive function to set S.
  2. Keep in S only the nodes that may be returned as the final value.
  3. Recursively find the least common ancestor A of all nodes in S.
  4. Return the subtree rooted at A.
  5. Check whether the subtree represents an associative and commutative operation.
Example

  define function accumulate($p) as double
  {
    if (empty($p)) then 0
    else
      let $val := accumulate(subsequence($p, 2))
      let $q := item-at($p, 1)
      return
        if ($q >= $val) then $val else $q
  }

Data Centric Transformation
Objective
  - Reconstruct unordered loops of the query so that a single scan of the entire dataset is sufficient
Algorithm
  1. Perform any loop fusion that is necessary
  2. Generate an abstract iteration space
  3. Extract necessary and sufficient conditions that map an element in the dataset to the iteration space
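The difference between iterating over the output space and iterating over the data can be sketched in Python as follows. This is an illustrative sketch, not the generated code: pixels are assumed to be dictionaries with hypothetical keys `x`, `y`, and `value`, the reduction is a maximum with base value 0, and the iteration space is given by two ranges.

```python
from functools import reduce


def naive_scans(pixels, xs, ys):
    """Iteration-space-centric: one full pass over the dataset per (i, j)."""
    out = {}
    for i in xs:
        for j in ys:
            matching = (p["value"] for p in pixels if p["x"] == i and p["y"] == j)
            out[(i, j)] = reduce(max, matching, 0)
    return out


def data_centric_scan(pixels, xs, ys):
    """Data-centric: a single pass; each element is mapped to its iteration point."""
    out = {(i, j): 0 for i in xs for j in ys}
    for p in pixels:
        key = (p["x"], p["y"])   # condition mapping an element to the iteration space
        if key in out:           # elements outside the iteration space are skipped
            out[key] = max(out[key], p["value"])
    return out
```

Both functions return the same dictionary. The second touches each pixel once, which matters when the dataset is disk-resident and far larger than the output.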

Naïve Strategy
[Figure labels: Dataset, Output]
  - Requires 3 scans

Data Centric Strategy
[Figure labels: Datasets, Output]
  - Requires just one scan

Type Inferencing
Objective
  - Type analysis for generation of C/C++ code
Challenge
  - An XQuery expression may return multiple types
  - XQuery supports parametric polymorphism
Algorithm
  - Constraint-based type inference algorithm
  - Bottom-up and recursive
  - Compute the set of all possible types an expression may return

  typeswitch ($pixel) {
    case element double pixel return max($pixel/d_value, 0)
    case element integer pixel return max($pixel/i_value, 0)
    default 0
  }
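The idea of computing, bottom-up, the set of all types an expression may return can be sketched in Python as follows. This is a simplified illustration and not the constraint-based algorithm itself: the expression classes `Lit`, `Call`, and `Branch`, the `SIGNATURES` table, and the rule that `max` yields a double when any argument is a double are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Lit:
    type_name: str   # a leaf whose type is known, e.g. "integer"


@dataclass(frozen=True)
class Call:
    name: str        # function name looked up in SIGNATURES
    args: tuple      # argument expressions


@dataclass(frozen=True)
class Branch:
    cases: tuple     # alternative result expressions (typeswitch / if)


def max_result_type(arg_types):
    """Assumed typing rule for max: double if any argument is a double."""
    return "double" if "double" in arg_types else "integer"


SIGNATURES = {"max": max_result_type}


def possible_types(expr):
    """Return the set of types expr may evaluate to, computed bottom-up."""
    if isinstance(expr, Lit):
        return {expr.type_name}
    if isinstance(expr, Branch):
        result = set()
        for case in expr.cases:
            result |= possible_types(case)   # any branch may be taken
        return result
    if isinstance(expr, Call):
        arg_sets = [possible_types(arg) for arg in expr.args]
        # Try every combination of argument types the callee may see.
        return {SIGNATURES[expr.name](combo) for combo in product(*arg_sets)}
    raise TypeError(f"unsupported expression: {expr!r}")


# The typeswitch from the slide, with d_value typed double and i_value integer.
TYPESWITCH_EXAMPLE = Branch((
    Call("max", (Lit("double"), Lit("integer"))),
    Call("max", (Lit("integer"), Lit("integer"))),
    Lit("integer"),
))
```

Here `possible_types(TYPESWITCH_EXAMPLE)` evaluates to a set containing both `double` and `integer`, so the generated code needs a result that can hold either type.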

Type Inference: C++ Code Generation
  - Use a union to represent multiple simple types
  - Use C++ class polymorphism to represent multiple complex types
  - Use function cloning for parametric polymorphism

  class t_pixel pixel;
  struct tmp_result_1 {
    union {
      double r1;
      int r2;
    };
  };
  tmp_result_1 t1;
  if (pixel.tag == double_pixel)
    t1.r1 = max_1(pixel.d_value, 0);
  else if (pixel.tag == integer_pixel)
    t1.r2 = max_2(pixel.i_value, 0);
  else
    t1.r2 = 0;

Evaluating Data-Centric Transformation
[Figure: performance charts for Virtual Microscope and Satellite]

Parallel Performance: VMScope

Parallel Performance: Satellite

Conclusions
  - A case for the use of XML technologies in scientific data analysis
  - XQuery: a data-parallel language?
  - Identified and addressed compilation challenges
  - A compilation system has been built
  - Very large performance gains from data-centric transformations
  - Achieves good parallel performance