Xiaodan Wang, Randal Burns Department of Computer Science Johns Hopkins University Tanu Malik Cyber Center Purdue University LifeRaft: Data-Driven, Batch.


1 LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases
Xiaodan Wang, Randal Burns, Department of Computer Science, Johns Hopkins University
Tanu Malik, Cyber Center, Purdue University

2 BETTER LUCK NEXT TIME!

3 Problem
[Figure: queries Q1, Q2, Q3, Q4]

4 Goals
Eliminate redundant I/O to improve query throughput
Batch queries that exhibit data sharing
- Pre-process queries to identify data sharing
- Co-schedule queries that access the same data
- Access contentious data first to maximize sharing
Starvation resistance
- Avoid indefinite queuing times (response time)
- Enforce some constraints on completion order
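
The I/O savings from co-scheduling can be illustrated with a small sketch. It assumes each query has already been reduced to the set of buckets it touches (the `query_buckets` input and both function names are hypothetical, not taken from the slides), and it ignores cache limits and arrival times, so it gives only the best-case number of bucket reads.

```python
def bucket_reads(query_buckets):
    """Count bucket reads with and without co-scheduling.

    query_buckets: dict mapping a query id to the set of bucket ids it touches
    (hypothetical input; a real system would derive it by pre-processing queries).
    Returns (reads_without_sharing, reads_with_full_sharing).
    """
    # Without sharing, every query reads each of its buckets on its own.
    independent = sum(len(buckets) for buckets in query_buckets.values())
    # With full sharing, each distinct bucket is read once for all queries that need it.
    shared = len(set().union(*query_buckets.values()))
    return independent, shared


def most_contended_first(query_buckets):
    """Order buckets by how many queries need them, most contended first."""
    demand = {}
    for buckets in query_buckets.values():
        for b in buckets:
            demand[b] = demand.get(b, 0) + 1
    # Sort by descending demand, then by bucket id for a deterministic order.
    return sorted(demand, key=lambda b: (-demand[b], b))


if __name__ == "__main__":
    example = {"Qa": {1, 2, 3}, "Qb": {1, 3, 4}, "Qc": {1, 4}}
    print(bucket_reads(example))
    print(most_contended_first(example))
```

Reading the most contended buckets first is only one possible ordering; it can starve queries whose buckets are rarely shared.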

5 Target Applications
Data intensive scan queries
- Executed against a clustered index
- Clustered and federated databases (e.g. joins that correlate multiple nodes)
Peta-scale astronomy (Pan-STARRS)
- Data are partitioned spatially
- Many queries scan the full DB and last hours or days
Cross-match
- Probabilistic spatial join across multiple databases

6 Filter and Refine
Filter queries
- Pre-process queries to determine join buckets
- Buckets B_1, …, B_n and queries Q_1, …, Q_m
- Workload W_ij denotes the objects from Q_i that overlap B_j
Refinement
- Read buckets one at a time
- Sort-merge join (sort by HTM ID)
- Query-specific predicates applied on output tuples
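
A minimal sketch of the filter step, under stated assumptions: every object carries a single integer key (for example an HTM ID), and each bucket covers a contiguous key range described by its lower bound. The function name `filter_queries` and the bucket layout are illustrative only; the slides do not specify how buckets are represented.

```python
from bisect import bisect_right
from collections import defaultdict


def filter_queries(bucket_starts, queries):
    """Build the workload W_ij: objects of query i that fall into bucket j.

    bucket_starts: sorted lower bounds of each bucket's key range (assumed layout).
    queries: dict mapping a query id to a list of integer object keys.
    Returns a dict keyed by (query_id, bucket_index) with sorted object keys.
    """
    workload = defaultdict(list)
    for qid, keys in queries.items():
        for key in keys:
            # Index of the last bucket whose lower bound is <= key.
            j = bisect_right(bucket_starts, key) - 1
            if j >= 0:  # keys below the first bucket overlap nothing
                workload[(qid, j)].append(key)
    # Sort each sub-workload by key so a later sort-merge join can scan it in order.
    for objects in workload.values():
        objects.sort()
    return dict(workload)


if __name__ == "__main__":
    w = filter_queries([0, 100, 200], {"Q1": [150, 5, 120], "Q2": [210, 130]})
    for pair, objects in sorted(w.items()):
        print(pair, objects)
```

Keeping each W_ij sorted by key means the refinement step only has to sort the bucket side before merging.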

7 Workload Throughput Metric
Schedule buckets greedily in order of decreasing workload throughput
Exploits data regions that experience contention
May starve requests
- Favors buckets experiencing frequent reuse
- No guarantee a particular bucket or query receives service
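
The greedy choice can be sketched as follows. The transcript does not give the metric's formula, so this sketch assumes one plausible definition: pending objects for a bucket divided by the cost of serving it, where the cost is the bucket's I/O time (zero if cached) plus a per-object join time. All names and default values are hypothetical.

```python
import math


def workload_throughput(pending_objects, cached, io_cost=1.0, join_cost_per_object=0.0):
    """Objects joined per unit time if this bucket is served next (assumed definition)."""
    if pending_objects == 0:
        return 0.0
    cost = (0.0 if cached else io_cost) + join_cost_per_object * pending_objects
    # A cached bucket with negligible join cost is free to serve.
    return math.inf if cost == 0 else pending_objects / cost


def pick_bucket(pending, cache, io_cost=1.0, join_cost_per_object=0.0):
    """Return the bucket id with the highest workload throughput, or None if idle.

    pending: dict mapping bucket id to the number of queued objects for it.
    cache: set of bucket ids currently in memory.
    """
    if not pending:
        return None
    # Iterating in sorted order makes ties resolve to the lowest bucket id.
    return max(
        sorted(pending),
        key=lambda b: workload_throughput(
            pending[b], b in cache, io_cost, join_cost_per_object
        ),
    )


if __name__ == "__main__":
    print(pick_bucket({1: 10, 2: 50, 3: 50}, cache=set()))
    print(pick_bucket({1: 10, 2: 50, 3: 50}, cache={1}))
```

Because the choice depends only on current demand, a bucket with few pending objects may never be picked while busier buckets keep receiving work.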

8 Aged Workload Throughput Metric
Inspired by disk-drive head scheduling
Balance arrival order (low response time) with contention (high throughput)
Adaptive trade-offs based on workload saturation
- Maximize the rate at which objects are joined during saturated workloads
- Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
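
One way to picture the balance between arrival order and contention is a linear blend controlled by an age bias in [0, 1], with bias 1 meaning purely age based and bias 0 purely contention based, as in the scheduling examples in this deck. The exact formula is not in the transcript; the blend and the normalization by the current maxima below are assumptions made for illustration.

```python
def pick_bucket_aged(buckets, bias):
    """Pick the bucket with the best blended score (assumed linear blend).

    buckets: dict mapping bucket id to (throughput, age_seconds), where age is
             the waiting time of the oldest request queued on that bucket.
    bias: 0.0 favors contention only, 1.0 favors age only.
    """
    if not 0.0 <= bias <= 1.0:
        raise ValueError("bias must lie in [0, 1]")
    if not buckets:
        return None
    # Normalize both terms to [0, 1] so they are comparable (an assumption).
    max_tp = max(tp for tp, _ in buckets.values()) or 1.0
    max_age = max(age for _, age in buckets.values()) or 1.0

    def score(bucket_id):
        tp, age = buckets[bucket_id]
        return (1.0 - bias) * (tp / max_tp) + bias * (age / max_age)

    # Sorted iteration makes ties resolve to the lowest bucket id.
    return max(sorted(buckets), key=score)


if __name__ == "__main__":
    state = {"B1": (100.0, 1.0), "B2": (10.0, 30.0)}
    for bias in (0.0, 0.5, 1.0):
        print(bias, pick_bucket_aged(state, bias))
```

With any bias above zero, a request's score grows as it waits, so it cannot be postponed forever by busier buckets as long as ages are unbounded and throughput is bounded.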

9 Scheduling Behavior
[Figure: buckets B_1 to B_8 overlapped by queries Q_i, Q_j, Q_k]
Sub-divide queries by bucket:
- Q_i: Q_i1, Q_i2, Q_i3
- Q_j: Q_j3, Q_j4, Q_j5, Q_j6, Q_j7, Q_j8
- Q_j: Q_j5, Q_j6, Q_j7, Q_j8
Assumptions:
- Inter-query time of 1 sec
- I/O for each bucket of 1 sec
- Cache size of 2
- Join cost is negligible

10 Arrival order with no sharing
[Timeline figure: arrivals and completions of Q_i, Q_j, Q_k over buckets B_1 to B_8]
Bucket reads (in extraction order): Q_i1 (B_1), Q_i2 (B_2), Q_i3 (B_3), Q_j1 (B_1), Q_j3 (B_3), Q_j4 (B_4), Q_j6 (B_6), Q_j7 (B_7), Q_j8 (B_8), Q_k1 (B_1), Q_k4 (B_4), Q_k8 (B_8), …
Completion times: Q_i 3 sec, Q_j 8 sec, Q_k 13 sec, Avg 8 sec
Tp: .2 qry/sec

11 Age based scheduling (bias 1)
[Timeline figure: arrivals and completions of Q_i, Q_j, Q_k over buckets B_1 to B_8]
Bucket reads (in extraction order): Q_i1 (B_1), Q_i2 (B_2), Q_i5 (B_5), Q_i3 + Q_j3 (B_3), Q_j1 + Q_k1 (B_1), Q_j4 + Q_k4 (B_4), Q_j6 + Q_k6 (B_6), Q_j8 + Q_k8 (B_8), Q_j7 + Q_k7 (B_7)
Completion times: Q_i 3 sec, Q_j 7 sec, Q_k 7 sec, Avg 5.6 sec
Tp: .33 qry/sec

12 Contention based scheduling (bias 0)
[Timeline figure: arrivals and completions of Q_i, Q_j, Q_k over buckets B_1 to B_8]
Bucket reads (in extraction order): Q_i1 (B_1), Q_i2 (B_2), Q_i3 + Q_j3 (B_3), Q_k5 (B_5), Q_j1 + Q_k1 (B_1), Q_j4 + Q_k4 (B_4), Q_j6 + Q_k6 (B_6), Q_j7 + Q_k7 (B_7), Q_j8 + Q_k8 (B_8)
Completion times: Q_i 7 sec, Q_j 5 sec, Q_k 6 sec, Avg 6 sec (5.6)
Tp: .38 qry/sec (.33)
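
The averages and throughputs quoted on the three scheduling slides can be checked with a short calculation. It assumes the three queries arrive at t = 0, 1 and 2 sec (from the stated inter-query time of 1 sec) and that throughput means queries completed divided by the time from the first arrival to the last completion; both readings are inferences, and the function name is hypothetical.

```python
def summarize(arrivals, completion_times):
    """Return (average completion time, throughput in queries per second).

    arrivals: arrival time of each query in seconds.
    completion_times: time from arrival to completion for each query, in seconds.
    """
    n = len(completion_times)
    average = sum(completion_times) / n
    # Wall-clock span from the first arrival to the last query finishing.
    finish = max(a + c for a, c in zip(arrivals, completion_times))
    makespan = finish - min(arrivals)
    return average, n / makespan


if __name__ == "__main__":
    arrivals = [0, 1, 2]  # Q_i, Q_j, Q_k, assuming 1 sec between arrivals
    schedules = {
        "arrival order, no sharing": [3, 8, 13],
        "age based (bias 1)": [3, 7, 7],
        "contention based (bias 0)": [7, 5, 6],
    }
    for name, times in schedules.items():
        avg, tp = summarize(arrivals, times)
        print(f"{name}: avg {avg:.2f} sec, throughput {tp:.3f} qry/sec")
```

Under these assumptions the results agree with the slides up to rounding: 8 sec and 0.2 qry/sec with no sharing, about 5.67 sec and 0.333 qry/sec for age based scheduling, and 6 sec and 0.375 qry/sec for contention based scheduling.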

13 Throughput Performance

14 Tuning the age bias
Throughput performance gap grows while response time gap is insensitive to saturation
Increasing age bias is more attractive at low saturation

15 Parameter tuning using trade-off curves

16 Discussion
Impact of caching strategies
Workload overflow
- Large intermediate join results
- Migrate pairs of workload and bucket
Beyond completion order
- Higher priority for interactive queries
Batch processing in a clustered environment
P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.

17 WHAT ABOUT US?

18 Filter and refine
Partition data into buckets

19 Average Response Time

20 Outline
Motivation
- Goals for data-driven, batch scheduling
- Target application (SkyQuery)
LifeRaft scheduler
- Filter and refine queries
- Throughput maximizing metric
- Starvation resistance
- Differences in outcomes
Workload adaptive parameter selection

