Xiaodan Wang Department of Computer Science Johns Hopkins University Processing Data Intensive Queries in Scientific Database Federations.



Problem
Data avalanche in scientific databases
  - Exponential growth in data size (Pan-STARRS)
  - Accumulation of data at multiple data sources (clustered and federated databases)
Exploring massive, widely distributed data
  - Joins to find correlations across multiple databases
  - Queries are data intensive: they transfer large amounts of data over the network and scan large portions of the data
  - Query throughput limits the scale of exploration
Goal: improve overall query throughput, potentially at the expense of individual query performance

Target Application
SkyQuery Federation of Astronomy Databases
  - Dozens of multi-terabyte databases across three continents
  - Queries that perform full database scans lasting hours or days
  - Intermediate join results that are hundreds of MBs
  - Scalability concerns in both data size and number of sites
Cross-match
  - Probabilistic spatial join across multiple databases
  - Join results are accumulated, shipped from site to site, and delivered to scientists

Cross-Match Workload
A forward-looking analysis shows that network cost dominates performance (90%)
A quarter of the cross-match queries execute for minutes to several hours

Incorporating Network Structure

Network-Aware Join Processing
Capture heterogeneity in global-scale federations
  - Metric to exploit high-throughput paths
  - Decentralized, local optimizations using aggregate statistics
  - Routing at the application layer
  - Two-approximate, MST-based solution with extensions that employ semi-joins and explore bushy plans
  - Clustering to explore trade-offs with computation cost
Over a ten-fold reduction in network utilization for large joins

A Case for Batch Processing
The top ten buckets are accessed by 61% of queries, and reuse occurs close together in time
2% of buckets capture more than half of the workload and should be cached

LifeRaft: Data-Driven Batch Processing
Eliminate redundant I/O to improve query throughput
Batch queries that exhibit data sharing
  - Pre-process queries to identify data sharing
  - Co-schedule queries that access the same data
  - Access contentious data first to maximize sharing
  - Improves performance two-fold
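The batching idea on this slide can be illustrated with a minimal Python sketch. It is not the LifeRaft implementation: the function name `plan_batches`, the query and bucket identifiers, and the tie-breaking rule are all assumptions made for illustration. Queries are pre-processed into the set of buckets they touch, sub-queries are grouped by bucket, and buckets are read most-contended first so that a single read serves every waiting query.

```python
from collections import defaultdict

def plan_batches(queries):
    """Group sub-queries by bucket and order buckets by contention.

    `queries` maps a query id to the set of bucket ids it reads
    (hypothetical names standing in for the pre-processing step).
    Returns a list of (bucket, [query ids]) in the order buckets are read.
    """
    waiting = defaultdict(list)
    for qid, buckets in queries.items():
        for b in buckets:
            waiting[b].append(qid)
    # Most contended bucket first; ties broken by bucket id for determinism.
    order = sorted(waiting, key=lambda b: (-len(waiting[b]), b))
    return [(b, sorted(waiting[b])) for b in order]

# Hypothetical workload: three queries with overlapping bucket sets.
queries = {"Qi": {1, 2, 3}, "Qj": {1, 3, 4}, "Qk": {1, 4, 8}}
plan = plan_batches(queries)
reads_batched = len(plan)                               # one read per bucket
reads_unshared = sum(len(b) for b in queries.values())  # one read per sub-query
```

In this made-up workload, co-scheduling needs one read per distinct bucket instead of one read per sub-query, which is the redundant I/O the slide refers to.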

Discussion
Cache replacement for LifeRaft
  - Benefits contentious data regions that experience reuse (cache hit rate for LifeRaft is 40%, compared with 7% for arrival-order processing)
  - Evaluate strategies that exploit the I/O behavior of batch workloads (segmented strategy)
Buffering and workload overflow
  - Large intermediate join results
  - Migrate pairs of workload and bucket
Better support for interactive queries
  - Short and selective queries that focus on a small region
  - Indefinite queuing times in the presence of batch workloads

Discussion (cont.)
Batch processing in a distributed environment
  - Network-aware scheduling does not consider computation cost
  - Batch processing for a single-system environment
Federating LifeRaft
  - Coordinate execution of queries that join multiple DBs
  - Batch processing requires databases to buffer results
  - Maximize overall batch size while alleviating memory used for buffering and network cost

Exploring Alt. Join Schedules

Discussion (cont.)
Explore both join schedules and opportunities for batching simultaneously
  - Bushy and semi-join plans increase computation, while clustering decreases computation
  - Skew in join workload (i.e., sites close to the end user)
  - Quantify trade-offs with computation cost (i.e., number of buckets in batch processing)
Users submit cross-match queries in batches
Applying LifeRaft to other data-intensive, temporal-spatial data such as the Turbulence database

Supplementary Slides

Cross-Match Queries
Join by increasing cardinality (count *)
  - Minimal I/O
  - Fewer bytes on the network
[Figure labels: Query, Mediator, Probe Query, Result; Count: 30, Count: 100, Count: 800]
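As a rough illustration of "join by increasing cardinality", the sketch below sorts sites by the count returned by a probe query and uses that as the join order. The counts 30, 100, and 800 are the ones shown on the slide; the site names and the function `order_by_cardinality` are hypothetical, and how the mediator actually issues probe queries is not shown.

```python
def order_by_cardinality(probe_counts):
    """Return site names sorted by ascending probe-query count.

    `probe_counts` maps a (hypothetical) site name to the count(*)
    its probe query returned. Ties are broken by site name.
    """
    ranked = sorted(probe_counts.items(), key=lambda kv: (kv[1], kv[0]))
    return [site for site, _ in ranked]

# Counts taken from the slide; site names are made up for the example.
probe_counts = {"site_c": 800, "site_a": 30, "site_b": 100}
join_order = order_by_cardinality(probe_counts)
```

Starting at the site with the smallest count keeps the first result that has to be shipped small, which matches the slide's "fewer bytes on the network" point.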

Spanning Tree Approximation (STA)
[Figure: graph with nodes A, B, C, D, E, F, G, H]

STA: Find MST
[Figure: graph with nodes A, B, C, D, E, F, G, H]

STA: Join Using Paths on the MST
[Figure: graph with nodes A, B, C, D, E, F, G, H]
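The two STA steps (find an MST over the sites, then join along paths of that tree) can be sketched as follows. This is only an illustration under stated assumptions: the five-site graph and its edge costs are invented, Prim's algorithm is used as one standard way to build an MST, and the depth-first walk is an assumed way of turning the tree into a site-visit order. The slides do not specify either choice.

```python
import heapq

def minimum_spanning_tree(nodes, edges, root):
    """Prim's algorithm over an undirected graph.

    `edges` is a list of (u, v, cost). Returns (tree, total_cost), where
    `tree` maps each node to the children attached to it in the MST.
    """
    adj = {n: [] for n in nodes}
    for u, v, cost in edges:
        adj[u].append((cost, v))
        adj[v].append((cost, u))
    tree = {n: [] for n in nodes}
    visited = {root}
    total = 0
    heap = [(c, root, v) for c, v in adj[root]]
    heapq.heapify(heap)
    while heap and len(visited) < len(nodes):
        cost, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # edge would close a cycle
        visited.add(v)
        tree[u].append(v)
        total += cost
        for c, w in adj[v]:
            if w not in visited:
                heapq.heappush(heap, (c, v, w))
    return tree, total

def tree_walk(tree, root):
    """Depth-first pre-order over the tree: one possible site-visit order."""
    order, stack = [], [root]
    while stack:
        n = stack.pop()
        order.append(n)
        stack.extend(reversed(sorted(tree[n])))
    return order

# Hypothetical sites and link costs (not taken from the slides).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B", 1), ("B", "C", 2), ("A", "C", 5), ("C", "D", 3),
         ("B", "D", 6), ("B", "E", 4), ("D", "E", 8)]
tree, total_cost = minimum_spanning_tree(nodes, edges, "A")
visit_order = tree_walk(tree, "A")
```

Here the tree keeps the four cheapest links that connect all five sites, and the walk visits every site using only those links.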

Filter and refine
Partition data into buckets

Scheduling Behavior
[Figure: buckets B1, B2, B3, B4, B5, B6, B7, B8 overlaid with queries Qi, Qj, Qk]
Sub-divide queries by bucket:
  - Qi - Qi1, Qi2, Qi3
  - Qj - Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
  - Qj - Qj5, Qj6, Qj7, Qj8
Assumptions:
  - Inter-query time of 1 sec
  - I/O for each bucket of 1 sec
  - Cache size of 2
  - Join cost is negligible

Arrival order with no sharing
[Figure: timeline with arrival and end markers for Qi, Qj, Qk over buckets B1 to B8]
Bucket reads shown: Qi1 B1, Qi2 B2, Qi3 B3, Qj1 B1, Qj3 B3, Qj4 B4, Qj6 B6, Qj7 B7, Qj8 B8, Qk1 B1, Qk4 B4, Qk8 B8
Completion times: Qi - 3 sec, Qj - 8 sec, Qk - 13 sec, Avg - 8 sec
Throughput: 0.2 qry/sec

Age based scheduling (bias 1)
[Figure: timeline with arrival and end markers for Qi, Qj, Qk over buckets B1 to B8]
Bucket reads shown (co-scheduled sub-queries joined by +): Qi1 B1, Qi2 B2, Qi5 B5, Qi3+Qj3 B3, Qj1+Qk1 B1, Qj4+Qk4 B4, Qj6+Qk6 B6, Qj8+Qk8 B8, Qj7+Qk7 B7
Completion times: Qi - 3 sec, Qj - 7 sec, Qk - 7 sec, Avg - 5.6 sec
Throughput: 0.33 qry/sec

Contention based scheduling (bias 0)
[Figure: timeline with arrival and end markers for Qi, Qj, Qk over buckets B1 to B8]
Bucket reads shown (co-scheduled sub-queries joined by +): Qi1 B1, Qi2 B2, Qi3+Qj3 B3, Qk5 B5, Qj1+Qk1 B1, Qj4+Qk4 B4, Qj6+Qk6 B6, Qj7+Qk7 B7, Qj8+Qk8 B8
Completion times: Qi - 7 sec, Qj - 5 sec, Qk - 6 sec, Avg - 6 sec (5.6 under age based scheduling)
Throughput: 0.38 qry/sec (0.33 under age based scheduling)
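The averages and throughput figures on the three scheduling slides can be checked with a short calculation. The sketch below rests on assumptions that the slides do not state outright: queries arrive at t = 0, 1, 2 sec (from the 1 sec inter-query time), each completion time is measured from that query's arrival, and throughput is the number of queries divided by the time at which the last query ends. The name `summarize` is hypothetical.

```python
# Assumed arrival times, derived from the 1 sec inter-query time.
ARRIVALS = {"Qi": 0, "Qj": 1, "Qk": 2}

def summarize(completion_times):
    """Return (average completion time, throughput in queries/sec)."""
    avg = sum(completion_times.values()) / len(completion_times)
    # A query ends at its arrival time plus its completion time.
    last_end = max(ARRIVALS[q] + t for q, t in completion_times.items())
    return avg, len(completion_times) / last_end

# Completion times as listed on the three slides.
arrival_order = summarize({"Qi": 3, "Qj": 8, "Qk": 13})
age_based = summarize({"Qi": 3, "Qj": 7, "Qk": 7})
contention_based = summarize({"Qi": 7, "Qj": 5, "Qk": 6})
```

Under these assumptions the calculation gives 0.2, about 0.33, and 0.375 queries/sec, in line with the slides' .2, .33, and .38. The age based average works out to about 5.67 sec, which the slide shows as 5.6.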

Parameter tuning using trade-off curves

Tuning the age bias
The throughput performance gap grows, while the response time gap is insensitive to saturation
Increasing the age bias is more attractive at low saturation
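One way to picture an age bias between 0 (contention based) and 1 (age based) is a linear blend of two normalized scores. The slides do not give the actual scoring function, so the formula, the function name `pick_next_bucket`, and the example data below are all assumptions made for illustration only.

```python
def pick_next_bucket(pending, now, bias):
    """Choose the next bucket to read.

    `pending` maps a bucket id to the arrival times of the sub-queries
    waiting on it. With bias = 1 the bucket holding the oldest sub-query
    wins; with bias = 0 the bucket with the most waiting sub-queries wins.
    The linear blend in between is an assumed formula, not the source's.
    """
    max_age = max(now - min(ts) for ts in pending.values()) or 1
    max_wait = max(len(ts) for ts in pending.values())

    def score(bucket):
        age = (now - min(pending[bucket])) / max_age
        contention = len(pending[bucket]) / max_wait
        return bias * age + (1 - bias) * contention

    # Iterating in sorted order makes ties resolve to the lowest bucket id.
    return max(sorted(pending), key=score)

# Hypothetical state at t = 3: bucket 2 holds the oldest sub-query,
# while buckets 1 and 4 each have two sub-queries waiting.
pending = {1: [1, 2], 2: [0], 4: [1, 2]}
oldest_first = pick_next_bucket(pending, now=3, bias=1.0)
most_shared_first = pick_next_bucket(pending, now=3, bias=0.0)
```

With this made-up state, bias 1 reads bucket 2 first to serve the oldest request, while bias 0 reads bucket 1 first because more sub-queries share it.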