PERFORMANCE STUDY OF BIG DATA ON SMALL NODES. Team: Παναγιώτης Μιχαηλίδης, Αντρέας Σόλου. Instructor: Demetris Zeinalipour.

WHAT IS BIG DATA? Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

BIG DATA PROCESSING
- Traditional processing nodes
  - Large computing nodes (Intel Xeon)
  - High energy consumption (~thousands of watts)
- Data trends
  - Continuous increase in the volume of Big Data to process
  - Increasing variety of data
- Resulting problems
  - Energy utilization problem
  - Increased cost to build and maintain a datacentre
  - Need for more manpower

CASE STUDY
- Use of mobile nodes
  - Significantly improved in recent years
  - Energy efficient
  - Cheap
  - Small in size

TRADITIONAL NODE VS MOBILE NODE
- Supermicro 813M 1U server system with:
  - Two Intel Xeon E5-series CPUs
  - 8GB DDR3 memory
  - 1TB hard disk
  - 1Gbit Ethernet network card
- Odroid XU development board with:
  - Samsung Exynos 5410 SoC using the ARM big.LITTLE architecture
  - 8 cores
  - 2GB low-power DDR3 memory
  - 64GB eMMC flash storage
  - 100Mbit Ethernet network card

ARM big.LITTLE on the Odroid XU system
- The mobile CPU consists of:
  - 4 ARM Cortex-A7 LITTLE cores (up to 1200 MHz)
  - 4 ARM Cortex-A15 big cores (up to 1600 MHz)
- Only one cluster is used at a time
- To improve network speed we use a USB 3.0 Gigabit Ethernet adapter

BENCHMARKS
- CPU performance (MIPS)
  - Dhrystone benchmark
  - CoreMark CPU benchmark
  - Custom Java execution benchmark (a sketch of the idea follows this list)
- Memory bandwidth
  - Parallel memory bandwidth benchmark
- I/O throughput and latencies
  - dd 8.20
  - ioping 0.7
- Networking bandwidth on TCP and UDP
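
The slides reference a custom Java execution benchmark but do not show its source. Purely to illustrate the idea, here is a minimal hypothetical sketch (class name, workload, and iteration count are assumptions, not the authors' code): it times a fixed number of cheap integer operations and reports a MIPS-style figure.

    // Hypothetical sketch of a simple Java execution benchmark; not the
    // authors' code. Times a fixed number of cheap integer operations and
    // reports a MIPS-style throughput figure.
    public class JavaMipsBenchmark {
        public static void main(String[] args) {
            final long iterations = 1_000_000_000L; // loop count (illustrative)
            long checksum = 0;
            long start = System.nanoTime();
            for (long i = 0; i < iterations; i++) {
                checksum += i ^ (i << 1); // a couple of integer ops per iteration
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            double mips = iterations / seconds / 1e6; // million iterations per second
            System.out.println("checksum=" + checksum + ", ~MIPS=" + mips);
        }
    }

The checksum is printed so the JIT compiler cannot eliminate the loop as dead code.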

Dhrystone Benchmark
- ARM big.LITTLE
  - LITTLE cores have 21% better performance per MHz than big ones (see the normalization below)
  - ARM claims that LITTLE cores should have fewer Dhrystone MIPS per MHz than big ones
  - Reason: different compilers (gcc vs armcc) generating the machine code differently
- Xeon node
  - Almost twice the performance of ARM
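
For context, Dhrystone scores are conventionally normalized to DMIPS against the VAX 11/780 baseline of 1757 Dhrystones per second, then divided by the clock frequency to compare architectures per MHz. This is the standard definition, not a formula from the slides:

    \[
    \mathrm{DMIPS} = \frac{\text{Dhrystones per second}}{1757},
    \qquad
    \mathrm{DMIPS/MHz} = \frac{\mathrm{DMIPS}}{f_{\mathrm{clk}}\,[\mathrm{MHz}]}
    \]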

CoreMark Benchmark
- Confirms Dhrystone's results
  - LITTLE cores have better performance per MHz than big ones
  - big cores score 3.52 instead of the expected 4.68
  - Cause: a different compiler
- Xeon node
  - Almost twice the performance of the big ARM cores

Java Execution Benchmark
- ARM big.LITTLE
  - LITTLE cores have less than half the MIPS of big cores
  - big cores achieve just 7% of the performance of Xeon cores
    - While using only a quarter of the energy

Parallel Memory Bandwidth Benchmark (a measurement sketch follows this list)
- When data fits in cache
  - Xeon has 450GB/s bandwidth using 8 cores and 225GB/s using 4 cores
  - ARM big cores have 10 times less bandwidth
  - ARM LITTLE cores have 20 times less bandwidth
- When main memory access is needed
  - ARM big cores have 2 times less bandwidth
  - ARM LITTLE cores have 4 times less bandwidth
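
The slides do not identify the bandwidth benchmark used. A minimal sketch of one way to measure aggregate read bandwidth, with one streaming thread per core (thread count, array sizes, and class name are illustrative; run with a heap of at least 1 GB, e.g. -Xmx1g):

    // Hypothetical parallel memory-bandwidth sketch; not the benchmark used
    // in the study. One streaming-read thread per core; aggregate bytes swept
    // divided by elapsed time approximates read bandwidth.
    import java.util.concurrent.CountDownLatch;

    public class MemBandwidth {
        static final int THREADS = 4;              // one thread per core (illustrative)
        static final int WORDS = 16 * 1024 * 1024; // 128 MB of longs per thread

        public static void main(String[] args) throws Exception {
            long[][] data = new long[THREADS][WORDS]; // ~512 MB in total
            CountDownLatch done = new CountDownLatch(THREADS);
            long start = System.nanoTime();
            for (int t = 0; t < THREADS; t++) {
                final long[] a = data[t];
                new Thread(() -> {
                    long sum = 0;
                    for (int i = 0; i < a.length; i++) sum += a[i]; // streaming reads
                    if (sum == 42) System.out.print(""); // keep the loop from being elided
                    done.countDown();
                }).start();
            }
            done.await();
            double secs = (System.nanoTime() - start) / 1e9;
            double gb = (double) THREADS * WORDS * 8 / 1e9;
            System.out.printf("aggregate read bandwidth ~ %.1f GB/s%n", gb / secs);
        }
    }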

I/O Throughput and Latencies (a write-test sketch follows this list)
- On write instructions
  - ARM big cores have 4 times worse throughput than Xeon's
  - ARM LITTLE cores have even lower throughput
  - The disk driver and file system account for significant CPU usage
- On buffered reads
  - Results correlate with the memory bandwidth values mentioned before
- Write latency on eMMC memory is greater than on traditional hard disks
  - Modern hard disks have caches and intelligent controllers
  - eMMC uses NAND flash, which has high write latency
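
The study used dd 8.20 and ioping 0.7. Purely to illustrate what a dd-style sequential write test measures, here is a hypothetical Java equivalent (file name and sizes are arbitrary; force(true) plays the role of dd's conv=fdatasync by flushing to the device before the clock stops):

    // Hypothetical Java analogue of a dd-style sequential write test.
    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class SeqWrite {
        public static void main(String[] args) throws Exception {
            final int bufMB = 1, count = 1024; // ~1 GB total (illustrative sizes)
            ByteBuffer buf = ByteBuffer.allocate(bufMB * 1024 * 1024);
            File f = new File("testfile");
            try (FileChannel ch = new FileOutputStream(f).getChannel()) {
                long start = System.nanoTime();
                for (int i = 0; i < count; i++) {
                    buf.rewind();
                    ch.write(buf);           // sequential 1 MB writes
                }
                ch.force(true);              // flush data and metadata to storage
                double secs = (System.nanoTime() - start) / 1e9;
                System.out.printf("sequential write ~ %.1f MB/s%n", bufMB * count / secs);
            }
            f.delete();
        }
    }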

Networking Bandwidth on TCP and UDP (a measurement sketch follows this list)
- TCP
  - ARM big cores have 3 times less bandwidth than Xeon's
  - ARM LITTLE cores have 4 times less bandwidth than Xeon's
- UDP
  - ARM big cores have 2 times less bandwidth than Xeon's
  - ARM LITTLE cores have 3 times less bandwidth than Xeon's
- Justification
  - The Ethernet adapter is connected through USB 3.0 on the Odroid XU system
  - Longer communication path, shown by the greater ping latency
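
The slides do not name the network benchmark used. A minimal sketch of a TCP throughput measurement follows; the sink thread runs in-process over loopback so the example is self-contained, whereas a real test would run the sink on the peer machine (port and transfer size are illustrative):

    // Hypothetical TCP throughput sketch; not the tool used in the study.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class TcpBandwidth {
        public static void main(String[] args) throws Exception {
            final int port = 5001, megabytes = 256; // illustrative values
            ServerSocket server = new ServerSocket(port);
            Thread sink = new Thread(() -> {
                try (Socket s = server.accept(); InputStream in = s.getInputStream()) {
                    byte[] b = new byte[64 * 1024];
                    while (in.read(b) != -1) { /* drain the stream */ }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            sink.start();
            byte[] chunk = new byte[1024 * 1024];
            long start = System.nanoTime();
            try (Socket s = new Socket("localhost", port);
                 OutputStream out = s.getOutputStream()) {
                for (int i = 0; i < megabytes; i++) out.write(chunk);
            }
            sink.join();                            // wait until all bytes are received
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("TCP throughput ~ %.1f MB/s%n", megabytes / secs);
            server.close();
        }
    }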

Complex Data-Intensive Workloads
- Big Data execution on small nodes compared with traditional server-class nodes
- Evaluating the performance of HDFS, Hadoop MapReduce, and query processing frameworks

Hadoop TestDFSIO Benchmark (1)
- Throughput and energy consumption of I/O instructions
- 12GB of input data on 1 and 6 nodes
- Writing on 6 nodes
  - Throughput drastically reduced, especially on the Xeon node
  - HDFS replication factor of 3
  - Network and storage operations lower the overall throughput, also due to the replication
  - Leads to higher energy consumption
- Reading degradation when using 6 nodes is not as visible
  - Reading is not affected by the replication mechanism

Hadoop TestDFSIO Benchmark (2)
- Writing on 6 nodes
  - Xeon's throughput is 2 times greater than ARM's
  - With 4 times more energy consumption
- Reading on 6 nodes
  - Xeon's throughput is 3 times better than ARM big.LITTLE's
  - With 5 times more energy consumption
- ARM big.LITTLE consumes less energy than Xeon, but with 2-3 times lower throughput

Hadoop (1)
- Evaluating time performance and energy efficiency across the 5 workloads
- We set the number of slots to 4, matching the number of cores on each node
- Errors on Kmeans and TeraSort due to insufficient memory (see the configuration sketch below)
  - Kmeans: limit io.sort.mb to 50MB
  - TeraSort: limit the number of slots to 2
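
As an illustration only: in a Hadoop 1.x-era deployment these two tunings would typically be applied in mapred-site.xml, as sketched below under that assumption; the slides do not show the actual configuration files.

    <!-- mapred-site.xml (sketch): cap the map-side sort buffer for Kmeans -->
    <property>
      <name>io.sort.mb</name>
      <value>50</value>
    </property>
    <!-- reduce the per-node map slots from 4 to 2 for TeraSort -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>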

Time Performance and Energy Consumption (2)
- The Pi workload implemented in C++ benefits both nodes, especially the ARM node
- All workloads show sublinear scaling as the number of nodes increases
- To match the execution time of one Xeon node, multiple ARM nodes are needed

Workload Categorization (3)
- We categorize the workloads based on time, power, and energy
- Pi Java and Kmeans
  - Larger execution times on ARM compared to Xeon
  - Leading to higher energy usage due to high CPU usage
- Pi C++ and Grep
  - Smaller execution-time gap
  - Both are CPU-intensive and have high power usage
  - ARM energy usage is significantly lower
- Wordcount and TeraSort
  - I/O-intensive workloads
  - Better execution times on Xeon
  - Thanks to its higher memory and storage bandwidth

Hadoop Workloads Conclusion (4)
- The performance-to-power ratio (PPR) shows the amount of useful work performed per unit of energy (formula below)
- A higher PPR indicates a more energy-efficient execution
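
Written out in symbols (our notation; the slides describe PPR verbally, and the study's exact normalization may differ):

    \[
    \mathrm{PPR} = \frac{\text{useful work } W}{\text{energy } E} = \frac{W}{\int_0^T P(t)\,dt}
    \]

For TeraSort, for instance, W could be the bytes sorted and E the joules consumed over the run.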

Query Processing (1)
- Evaluating the performance of OLTP and OLAP workloads using the TPC-C and TPC-H benchmarks, respectively, on MySQL
- TPC-C: data-intensive, random data access with a read-to-write ratio of 1.9:1
  - Xeon has 2 times higher throughput than the Odroid XU
  - Similar response time (RT) between Xeon and ARM big/big.LITTLE
  - RT on ARM LITTLE is 2 times worse
  - Xeon has 10 times higher average power consumption with only 2 times better throughput

Query Processing (2)
- TPC-H: read-only queries, I/O bound
- We run 22 queries of different types
- On scan queries and long-running queries Xeon performs 2-5 times better than all ARM configurations
  - Because of its higher read throughput and larger memory used as file cache
- On random-access queries with a small working set ARM performs better
  - Because of the lower read latency of the Odroid XU's eMMC flash-based storage
- Summary
  - Xeon uses more energy
  - ARM is more energy efficient at executing queries

Query Processing – Shark (1)
- Evaluating the performance of in-memory Big Data analytics
- Scan and join queries
- On scan queries ARM big/big.LITTLE is slower than Xeon but 3 times better in energy usage
  - ARM LITTLE cores are 2 times slower and 4 times better in energy usage
- Join queries have 3 different scenarios
  - Both servers have enough cache memory
  - Only Xeon has enough cache memory
  - Neither has enough cache memory

Query Processing – Shark (2)
- In the first scenario
  - ARM big/big.LITTLE are 2-4 times slower than Xeon and more energy efficient
  - ARM LITTLE is 3-7 times slower and only slightly better in energy usage
- In the third scenario
  - ARM big/big.LITTLE are 2.4 times slower and LITTLE are 4.2 times slower
  - ARM has up to 4 times better energy usage
  - Xeon has to read from storage and thus doesn't benefit from its high memory bandwidth
- In the second scenario
  - The runtime gap increases, leading to high energy usage on ARM
- Summary
  - ARM has better energy efficiency on I/O-bound scan queries
  - For join queries, traditional servers are better because of their larger memory and I/O bandwidth

TCO Analysis (1)
- Two cost models
  - Marginal cost
  - Google's TCO model
- Lifespan of a server: 3 years
- Lifespan of a datacentre: 12 years
- Costs are expressed in US dollars
- Typical server utilization
  - Lower bound: 10% utilization
  - Upper bound: 75% utilization
- Electricity price range: starting from 0.024 $/kWh

Marginal Cost (2)
- Simple cost model: equipment cost + electricity cost (see the formula below)
- On Pi and TeraSort
  - ARM needs 20% and 50% utilization, respectively, to match Xeon's lower-bound performance, at 4 times less cost
  - We need more than one ARM node to match Xeon's upper-bound performance
    - 6 for the TeraSort workload, at 86% utilization each
    - 2 for the Pi workload, at 82% utilization each
- On I/O-intensive jobs with high utilization
  - ARM costs 50% more, since 6 ARM servers are needed to match the utilization of the Xeon
- If we assume zero energy consumption when idle
  - Xeon would use 22% less energy
  - ARM would use just 6-10% less energy, since it is already energy efficient
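
In symbols (our notation for the two components the slide names):

    \[
    C_{\mathrm{marginal}} = C_{\mathrm{equip}} + \bar{P} \cdot T \cdot c_{\mathrm{el}}
    \]

where \bar{P} is the average power draw, T the 3-year lifespan, and c_el the electricity price. As an illustrative calculation with an assumed 10 W average draw: 10 W × 26,280 h = 262.8 kWh, about $6.30 of electricity at the 0.024 $/kWh lower bound.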

Google's TCO Model (3)
- Also includes capital costs and operational expenses (a sketch of the full sum follows this list)
- Capital cost is 15 $/W of the datacentre's power capacity
- Operational cost is 0.04 $/kW per month
- Equipment cost includes a maintenance cost of 5% for both types of servers
- The model also includes the interest rate on a loan, 8% per year
- Electricity costs are calculated from the average power consumption of the datacentre
- Additional costs scale with the PUE (Power Usage Effectiveness) each month
  - PUE for the Xeon server: 1.1
  - PUE for the ARM servers: 1.5
- All in all, the 6-ARM-server deployment costs more than the traditional Xeon server
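
Putting the listed components together (our notation and arrangement; a sketch only, since Google's actual TCO model differs in detail):

    \[
    \mathrm{TCO} \approx \underbrace{15 \cdot P_{\mathrm{dc}}}_{\text{capex [\$]}}
    + \underbrace{0.04 \cdot \frac{P_{\mathrm{dc}}}{1000} \cdot M}_{\text{opex [\$]}}
    + 1.05 \cdot C_{\mathrm{equip}}
    + \mathrm{PUE} \cdot \bar{P} \cdot T \cdot c_{\mathrm{el}}
    + \text{loan interest (8\%/yr)}
    \]

with P_dc the datacentre power capacity in watts, M the lifespan in months, and the 1.05 factor carrying the 5% maintenance surcharge on equipment.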

Conclusion
- The lower-power advantage of ARM nodes is cancelled out by their small memory size, low memory and I/O bandwidth, and software immaturity
- On CPU-intensive workloads ARM is 10 times slower when running the Pi MapReduce workload in Java
  - In C++ this gap improves by a factor of 5
  - Data analysis is 4 times cheaper
- For I/O-intensive workloads, 6 ARM nodes are needed to achieve the performance of one Xeon node
  - Resulting in a 50% higher TCO
- For query processing, ARM nodes are much more energy efficient, with only slightly lower throughput
  - For small random database accesses, ARM nodes are faster due to lower I/O latency
  - For sequential database scans, Xeon servers benefit more from their bigger memory size

Conclusion
- In the near future, if ARM nodes gain more memory while maintaining their fast I/O, and as their software matures, they will become serious contenders to traditional Intel/AMD server systems