Automatic Scaling of Selective SPARQL Joins Using the TIRAMOLA System


Automatic Scaling of Selective SPARQL Joins Using the TIRAMOLA System
E. Angelou, N. Papailiou, I. Konstantinou, D. Tsoumakos, N. Koziris
Computing Systems Laboratory, National Technical University of Athens
Semantic Web Information Management (SWIM) 2012

Motivation – the story (1)
- 'Big data' processing era
  - (Web) analytics, science, business
  - Store and analyze everything
- Distributed, high-performance processing
  - From P2P to Grid computing, and now to the clouds
- Traditional databases are not up to the task

Motivation – the story (2)
- NoSQL
  - Non-relational
  - Horizontally scalable
  - Distributed
  - Open source
  - And often: schema-free, easily replicated, simple API, eventually consistent (not ACID), big-data-friendly, etc.
- Column-family, Document store, Key-Value, ...

NoSQLs and elasticity
- Many offer elasticity + sharding:
  - Expand/contract resources according to demand
  - Pay-as-you-go, robustness, performance
  - The shared-nothing architecture allows this
  - Important! See the April 2011 Amazon outage (Foursquare, Reddit, ...)

Thus... (end of the story)
- PaaS and NoSQL systems are (or should be) inherently elastic
- How efficiently do they implement elasticity?
  - NoSQL engines over an IaaS platform (EC2, Eucalyptus, OpenStack, ...)
  - A study that registers qualitative and quantitative results
- Related work:
  - Reports NoSQL performance (not elasticity)
  - Cloud platform elasticity (no NoSQL)
  - Domain-specific
  - Initial cluster size (not dynamic)

Contributions
- TIRAMOLA: a VM-based framework for NoSQL cluster monitoring and resource provisioning
- For a cluster resize, identify and measure:
  - Costs and gains
  - In terms of: time, effort, increase in throughput, latency, ...?
- A generic platform:
  - Any NoSQL engine
  - User-defined policies
  - Automatic resource provisioning (a sketch of such a decision loop follows below)
- Open source:
- Use case: distributed SPARQL query processing
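To make "user-defined policies + automatic resource provisioning" concrete, here is a minimal sketch of a threshold-based monitor-decide-act loop. It is not the actual TIRAMOLA implementation: cluster.avg_cpu(), cluster.add_nodes() and cluster.remove_nodes() are hypothetical stand-ins for the monitoring (e.g. Ganglia) and IaaS (e.g. OpenStack) interfaces such a framework wraps.

```python
# Minimal sketch of a threshold-based monitoring/provisioning loop.
# NOT the actual TIRAMOLA code; the cluster object is a hypothetical
# wrapper around the monitoring and IaaS APIs.
import time

class ElasticityPolicy:
    def __init__(self, cluster, upper=0.60, lower=0.50, step=2, cooldown=300):
        self.cluster = cluster      # handle to the NoSQL cluster / IaaS API
        self.upper = upper          # scale out above this average CPU load
        self.lower = lower          # scale in below this average CPU load
        self.step = step            # VMs added/removed per decision
        self.cooldown = cooldown    # seconds to wait after a resize

    def decide(self, avg_cpu):
        """User-defined policy: map the observed load to a resize action."""
        if avg_cpu > self.upper:
            return +self.step
        if avg_cpu < self.lower:
            return -self.step
        return 0

    def run(self, interval=30):
        """Monitor-decide-act loop: poll metrics, resize, wait out the cooldown."""
        while True:
            action = self.decide(self.cluster.avg_cpu())
            if action > 0:
                self.cluster.add_nodes(action)
                time.sleep(self.cooldown)
            elif action < 0:
                self.cluster.remove_nodes(-action)
                time.sleep(self.cooldown)
            else:
                time.sleep(interval)
```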

TIRAMOLA Architecture
(diagram slide)

Platform Setup
- 8 physical nodes
  - 2 x six-core Intel Xeon with Hyper-Threading
  - 48 GB RAM, 2 x 2 TB RAID-0 disks each
- VMs
  - Similar to an Amazon EC2 large instance
  - 4-core processor, 8 GB RAM, 50 GB disk space
  - QCOW image: 1.6 GB compressed, 4.3 GB uncompressed
  - VM root fs instead of EBS (cf. the Reddit outage)
- OpenStack 2.0 cluster

Clients, Data and Workloads
- HBase (v ), Cassandra (v beta)
  - Hadoop
  - Replication factor: 3
  - 8 initial nodes (VMs)
- Ganglia for cluster monitoring
- YCSB tool (an example invocation is sketched below)
  - Database: 20M objects, 20 GB raw (Cassandra ~60 GB, HBase ~90 GB)
  - Loads: UNI_R, UNI_U, UNI_RMW, ZIPF_R
  - Default: uniform read, 10%-50% range
  - λ parameter (tricky)
- Both client-side (YCSB) and cluster-side (Ganglia) metrics reported
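For illustration, a YCSB run similar to the ones above could be driven as sketched below. The CoreWorkload properties (recordcount, readproportion, requestdistribution, ...) are standard YCSB settings, but the launcher path, binding name and flag spellings depend on the YCSB release actually used, so treat them as assumptions rather than the exact commands from the study.

```python
# Sketch: write a uniform-read YCSB workload and launch a rate-limited run.
# The "bin/ycsb run hbase" invocation and flags are assumptions that vary
# across YCSB releases; the workload properties are standard CoreWorkload keys.
import subprocess
import textwrap

workload = textwrap.dedent("""\
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    recordcount=20000000
    operationcount=100000000
    readproportion=1.0
    updateproportion=0.0
    requestdistribution=uniform
""")

with open("workload_uni_r", "w") as f:
    f.write(workload)

# -target caps the offered load (reqs/sec); -threads sets client concurrency.
subprocess.run([
    "bin/ycsb", "run", "hbase",
    "-P", "workload_uni_r",
    "-target", "180000",
    "-threads", "200",
], check=True)
```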

The setup
- λ_start = 180K reqs/sec (well over the critical point)
- At time t = 370 sec:
  - Double the cluster size (add 8 extra nodes), or
  - Triple the cluster size (add 16 extra nodes)
  - 4 different experiments for each database: READ+8, READ+16, UPDATE+8 and UPDATE+16
- Measure client-side and cluster-side metrics over time (a driver sketch follows below):
  - query latency, throughput and total cluster usage
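The protocol can be summarized as a small driver: offer a constant load, keep sampling client- and cluster-side metrics, and grow the cluster once at t = 370 s. The sketch below is illustrative only; `cluster` and `sample` are hypothetical handles to the IaaS API and to the YCSB/Ganglia metric feeds, not interfaces from the actual testbed.

```python
# Sketch of the resize experiment protocol: sample metrics throughout the run
# and add nodes once the resize time is reached.
import time

RESIZE_AT = 370          # seconds into the run before the cluster is resized
EXTRA_NODES = 8          # 8 for the "+8" experiments, 16 for the "+16" ones

def run_experiment(cluster, sample, duration=2 * RESIZE_AT, period=10):
    """`cluster` exposes add_nodes(); `sample` returns a dict of metrics."""
    samples, resized, t0 = [], False, time.time()
    while time.time() - t0 < duration:
        samples.append((time.time() - t0, sample()))   # latency, throughput, CPU, ...
        if not resized and time.time() - t0 >= RESIZE_AT:
            cluster.add_nodes(EXTRA_NODES)             # double (or triple) the cluster
            resized = True
        time.sleep(period)
    return samples
```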

HBase cluster resize (1/2)
- Transient negative effect in read workloads
  - HMaster reassigns regions
  - Clients experience cache misses
- The effect is more apparent in READ+8
  - A larger cluster absorbs the load more easily
- Updates experience oscillations: the compaction effect
  - Incoming data is cached in memory (low latency)
  - When memory is full, data is flushed to disk (high latency)

HBase cluster resize (2/2)
- For READ, throughput (roughly) doubles and triples
  - More servers handle more requests
  - Data is not transferred, but it is cached
- UPDATE is not affected
  - I/O-bound operation
  - Updates are still handled by the initial 8 nodes
  - Only new regions created by compactions are served by the new nodes

Cassandra cluster resize (1/2)
- READ workloads
  - The initial latency (~22 sec) drops to 1/2 and 1/3
  - More servers join the ring and take requests
  - No transient effect, thanks to the decentralized P2P design
- UPDATE workloads
  - Same effect as in the read case
  - Writes are faster than reads due to the weak consistency model

Cassandra cluster resize (2/2)
- READ workloads
  - Similar trend as in the latency case
- UPDATE workloads
  - The same, almost linear effect
  - Weak consistency model
- READ workloads (cluster CPU)
  - CPU decreases to 60% and 50%
  - The load is enough for every node to contribute
- UPDATE: same as READ

What about the time cost?
- VM initialization
  - ~3 min for addition, negligible for removal (a few seconds)
- Node configuration
  - Config files and propagation (at most a 30-sec cycle)
- Region rebalance
  - Nodes actively participate in the NoSQL cluster
  - Cassandra is more efficient; HBase depends on the data, number of nodes, ...
- Data rebalance
  - Optional
  - HBase: depends on data and cluster size (2+ hours)
  - Cassandra: individual loadbalance signals

Use Case: Scaling SPARQL Joins
- H2RDF: storing RDF triples over a cloud-based index
  - (s,p,o) permutations for efficient access (see the key-layout sketch below)
  - HBase store
  - WWW 2012 demo, code at
- Adaptive joins
  - MapReduce joins
  - Centralized joins
- SPARQL querying
  - Jena parser
  - Join planner
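To make the "(s,p,o) permutations" idea concrete, the sketch below shows how a triple can be written under several key orderings so that any bound prefix of a SPARQL triple pattern becomes a contiguous row-key range. The separator and the choice of exactly three permutations are illustrative assumptions, not H2RDF's exact HBase schema.

```python
# Minimal sketch of permutation-based triple indexing: one row key per
# materialized ordering, so bound prefixes map to prefix scans.
PERMUTATIONS = {
    "spo": (0, 1, 2),
    "pos": (1, 2, 0),
    "osp": (2, 0, 1),
}

def index_keys(s, p, o):
    """Return one row key per materialized permutation for triple (s, p, o)."""
    triple = (s, p, o)
    return {
        name: "|".join(triple[i] for i in order)
        for name, order in PERMUTATIONS.items()
    }

# A pattern like (?x, rdf:type, ub:GraduateStudent) has p and o bound, so it
# becomes a prefix scan "rdf:type|ub:GraduateStudent|" on the POS index.
print(index_keys("ub:Student17", "rdf:type", "ub:GraduateStudent"))
```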

SPARQL Querying Architecture
(diagram slide)

Elasticity-provisioning use case
- Dataset: LUBM (5k universities, 7M triples, 125 GB)
- Queries, for different courses:
  SELECT ?x WHERE { ?x rdf:type ub:GraduateStudent . ?x ub:takesCourse }
- Load: sinusoidal (40 queries/sec average, 70 queries/sec peak, 1-hour period; a load-generator sketch follows below)
- Objective function: CPU load, 50%-60% thresholds
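For reference, the sinusoidal load from the slide (40 queries/sec average, 70 queries/sec peak, 1-hour period) can be generated as below; such a generator would drive the CPU-threshold objective in the same way the policy loop sketched earlier consumes it. Only the numbers stated on the slide are used; everything else is a plain sketch.

```python
# Sketch of the sinusoidal query load: 40 q/s average, 70 q/s peak, 1 h period.
import math

AVG_RATE, PEAK_RATE, PERIOD = 40.0, 70.0, 3600.0    # queries/sec, seconds

def query_rate(t):
    """Offered SPARQL query rate (queries/sec) at time t (seconds)."""
    return AVG_RATE + (PEAK_RATE - AVG_RATE) * math.sin(2 * math.pi * t / PERIOD)

if __name__ == "__main__":
    # Print one value every 10 minutes over a full period.
    for t in range(0, int(PERIOD) + 1, 600):
        print(f"t = {t:4d} s -> {query_rate(t):5.1f} queries/sec")
```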

Results
- Elastic operations
- Pay as you go
(plot slide)

Questions?
- "TIRAMOLA: Elastic NoSQL Provisioning through a Cloud Management Platform" – SIGMOD 2012 (Demo Track), Demo C
- "On the Elasticity of NoSQL Databases over Cloud Management Platforms" – CIKM 2011
- "Automatic, multi-grained elasticity provisioning for the Cloud" – FP7 STReP (accepted for EU funding, kickoff 9/2012)