
INSTITUTE OF COMPUTING TECHNOLOGY BigDataBench: a Big Data Benchmark Suite from Internet Services. Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Gang Lu, Kent Zhang, Xiaona Li, and Bizhu Qiu. HPCA 2014.

Why Big Data Benchmarking? Measuring big data systems and architectures quantitatively.

What is BigDataBench? An open source big data benchmarking project
– 6 real-world data sets: generate (4V) big data
– 19 workloads: OLTP, Cloud OLTP, OLAP, and offline analytics
– Same workloads, different implementations

Executive summary
– Big data benchmarks: do we know enough about big data benchmarking?
– Big data workload characterization: what are the differences from traditional workloads?
– Exploring the best big data architectures: brawny-core, wimpy multi-core, or wimpy many-core?

Outline
– Benchmarking Methodology and Decision
– Big Data Workload Characterization
– Evaluating Hardware Systems with Big Data
– Conclusion

Methodology
[diagram] The 4V properties of big data and system/architecture characteristics feed into BigDataBench, which is iteratively refined.

Orlando, HPCA 2014 Methodology (Cont’) Diverse Data Sets Diverse Worklo ads Data Sources Text data Graph data Table data Extended … Data Types Structured Semi-structured Unstructured Big Data Sets Preserving 4V BigDataBench Investigate Typical Application Domains BDGS: big data generation tools Application Types OLTP Cloud OLTP OLAP Offline analytics Basic & Important Operations and Algorithms Extended… Represent Software Stack Extended… Big Data Workloads

Top Sites on the Web Search engine, social network, and electronic commerce services account for 80% of the page views of all Internet services.

BigDataBench Summary
– 19 workloads: (Cloud) OLTP, OLAP, and offline analytics
– Application domains: search engine, social network, e-commerce
– Software stacks: MPI, Shark, Impala, NoSQL
– Six real-world data sets: Google Web Graph, Facebook Social Network, Wikipedia Entries, Amazon Movie Reviews, E-commerce Transaction, and ProfSearch person resumes
– BDGS (Big Data Generator Suite) for scalable data

Outline
– Benchmarking Methodology and Decision
– Big Data Workload Characterization
– Evaluating Hardware Systems with Big Data
– Conclusion

Big Data Workloads Analyzed
Input data sizes vary from 32 GB to 1 TB.

Other Benchmarks Compared
– HPCC: representative HPC benchmark suite (7 benchmarks)
– PARSEC: CMP (multi-threaded) benchmark suite (12 benchmarks)
– SPEC CPU: SPECFP and SPECINT

Metrics
– User-perceivable metrics: OLTP services: requests per second (RPS); Cloud OLTP: operations per second (OPS); OLAP and offline analytics: data processed per second (DPS)
– Micro-architecture characteristics: hardware performance counters

Experimental Configurations
Testbed configurations
– Fifteen nodes: 1 master + 14 slaves
– Data input size: 32 GB ~ 1 TB
– Each node: 2 * Xeon E5645, 16 GB memory, 8 TB disk
– Network: 1 Gb Ethernet
– CPU: Intel Xeon E5645, 6 cores, 2.4 GHz; L1D cache 6*32 KB, L1I cache 6*32 KB, L2 cache 6*256 KB, L3 cache 12 MB
Software configurations
– OS: CentOS 5.5 with Linux kernel
– Stacks: Hadoop 1.0.2, HBase, Hive 0.9, MPICH2 1.5, Nutch 1.1, and RUBiS 5.0

Instruction Breakdown
[figure: data analytics vs. services]
– More integer instructions (fewer floating point instructions); the average ratio of integer to floating point instructions is 75
– FP instructions: X87 + SSE FP (X87, SSE_Pack_Float, SSE_Pack_Double, SSE_Scalar_Float, and SSE_Scalar_Double)
– Integer instructions: Integer_Ins = Total_Ins - FP_Ins - Branch_Ins - Store_Ins - Load_Ins
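A minimal sketch of the breakdown arithmetic above, assuming the raw counter totals have already been collected (the event totals here are placeholders, and real counter names vary by CPU):

```python
# Derive the integer-instruction count and integer/FP ratio exactly as the
# slide defines them, from raw performance-counter totals.

def instruction_breakdown(total, fp, branch, load, store):
    """Integer_Ins = Total_Ins - FP_Ins - Branch_Ins - Store_Ins - Load_Ins."""
    integer = total - fp - branch - load - store
    ratio = integer / fp if fp else float("inf")
    return integer, ratio

# Made-up counts for illustration:
ints, ratio = instruction_breakdown(
    total=1_000_000_000, fp=8_000_000, branch=180_000_000,
    load=250_000_000, store=120_000_000)
print(f"integer instructions: {ints:,}, integer/FP ratio: {ratio:.1f}")
```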

Floating Point Operation Intensity (E5310)
[figure: data analytics vs. services]
– Definition: total number of floating point instructions divided by total number of memory access bytes in a run of a workload
– Very low floating point operation intensity: two orders of magnitude lower than in traditional workloads
– CPU: Intel Xeon E5310, 4 cores, 1.6 GHz; L1 cache 4*32 KB, L2 cache 2*4 MB, no L3 cache
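A small sketch of how such an intensity value can be computed, assuming the instruction count and memory traffic have been measured; estimating traffic as last-level-cache misses times the 64-byte line size is one common approximation, not necessarily the method used here. The same helper applies to the integer operation intensity and network-byte ratios on the following slides:

```python
CACHE_LINE_BYTES = 64  # typical x86 cache line size

def operation_intensity(op_count, byte_count):
    """Operations per byte, per the slide's definition."""
    return op_count / byte_count

def traffic_from_llc_misses(llc_misses):
    # Rough approximation: each last-level-cache miss moves one cache line.
    return llc_misses * CACHE_LINE_BYTES

# Made-up totals for illustration:
fp_instructions = 5_000_000
llc_misses = 40_000_000
print(operation_intensity(fp_instructions,
                          traffic_from_llc_misses(llc_misses)))  # ~0.002
```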

Floating Point Operation Intensity
[figure: data analytics vs. services]
– Floating point operation intensity on the E5645 is higher than on the E5310

Integer Operation Intensity
[figure: data analytics vs. services]
– Integer operation intensity is of the same order of magnitude as in traditional workloads
– Integer operation intensity on the E5645 is higher than on the E5310: the L3 cache is effective, and bandwidth improves

Possible Reasons (Xeon E5645 vs. Xeon E5310)
Technique improvements of the Xeon E5645:
– More cores per processor: six cores in the Xeon E5645 vs. four cores in the Xeon E5310
– Deeper cache hierarchy (L1~L3 vs. L1~L2): the L3 cache is effective in decreasing memory access traffic for big data workloads
– Larger bandwidth: the Xeon E5645 adopts Intel QuickPath Interconnect (QPI) to eliminate bottlenecks in the front-side bus
– Hyper-threading: hyper-threading can improve performance by factors of 1.3~1.6 for scale-out workloads [ASPLOS 2012]

Cache Behaviors
[figure: data analytics vs. services]
– Higher L1I cache misses than traditional workloads
– Data analytics workloads have better L2 cache behavior than service workloads, with the exception of BFS
– Good L3 cache behavior

TLB Behaviors
[figure: data analytics vs. services]
– Higher ITLB misses than traditional workloads

Computation Intensity (integer operations)
[figure: integer operations per byte of memory access vs. integer operations per byte received from networks]
– X axis: (total number of integer instructions) / (total memory access bytes); higher means more integer operations are executed between two memory accesses
– Y axis: (total number of integer instructions) / (total bytes received from networks); higher means more integer operations are executed per byte received

Big Data Workload Characterization Summary
– Data movement dominated computing: low computation intensity
– Cache behaviors (Xeon E5645): very high L1I MPKI; the L3 cache is effective
– Diverse workload behaviors: computation/communication vs. computation/memory-access ratios

Outline
– Benchmarking Methodology and Decision
– Big Data Workload Characterization
– Evaluating Hardware Systems with Big Data (Y. Shi, S. A. McKee et al., "Performance and Energy Efficiency Implications from Evaluating Four Big Data Systems," submitted to IEEE Micro)
– Conclusion

State-of-the-art Big Data System Architectures
System and architecture trends: wimpy many-core processors, wimpy multi-core processors, brawny-core processors
– Hardware designers: what are the best big data systems and architectures in terms of both performance and energy efficiency?
– Data center administrators: how to choose appropriate hardware for big data applications?

Evaluated Platforms
– Scale-up: Xeon E5310 (brawny-core) -> Xeon E5645 (brawny-core)
– Scale-out: Atom D510 (wimpy multi-core) -> TileGx36 (wimpy many-core)

Basic information:
Model              | Xeon E5645 | Xeon E5310 | Atom D510 | TileGx36
No. of processors  | 2          | 1          | 1         | 1
No. of cores/CPU   | 6          | 4          | 2         | 36
Frequency          | 2.4 GHz    | 1.6 GHz    | 1.66 GHz  | 1.2 GHz
L1 cache (I/D)     | 32KB/32KB  | 32KB/32KB  | 32KB/24KB | 32KB/32KB
L2 cache           | 256KB*6    | 4096KB*2   | 512KB*2   | 256KB*36
L3 cache           | 12MB       | none       | none      | none
TDP                | 80W        | 80W        | 13W       | 45W

Architectural characteristics:
Model                  | Xeon E5645 | Xeon E5310 | Atom D510 | TileGx36
Pipeline depth         | (values lost in transcript)
Superscalar width      | 4          | 4          | 2         | 3
Instruction set        | x86        | x86        | x86       | MIPS
Hyper-threading        | yes        | no         | yes       | no
Out-of-order execution | yes        | yes        | no        | no
Dedicated FP unit      | yes        | yes        | yes       | no

Chosen Workloads from BigDataBench
– Offline analytics: Sort, Wordcount, Grep, Naïve Bayes, K-means
– Realtime analytics: Select Query, Aggregation Query, Join Query
– Time complexity: O(n*logn) for Sort; O(n) or O(m*n) for the others
– Map operations: quicksort; string comparison & integer calculation; statistics computation; distance computation; string comparison
– Reduce operations: merge sort; combination; merge; none; cross product
– Reduce input / map input ratios: (values garbled in the transcript; N/A for the query workloads)
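As a concrete reference for the map and reduce operations listed above, here is a minimal MapReduce-style WordCount in plain Python, illustrating the O(n) string-comparison and integer-calculation pattern (a sketch, not the benchmark's Hadoop implementation):

```python
from collections import defaultdict
from itertools import chain

def wc_map(line):
    # Map: emit a (word, 1) pair per word (string handling + integer counts).
    return [(word, 1) for word in line.split()]

def wc_reduce(pairs):
    # Reduce: combine per-word counts (the "combination" operation above).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big benchmarks", "big data workloads"]
print(wc_reduce(chain.from_iterable(wc_map(l) for l in lines)))
# {'big': 3, 'data': 2, 'benchmarks': 1, 'workloads': 1}
```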

Experimental Configurations
– Software stack: Hadoop
– Cluster configuration: Xeon- and Atom-based systems: 1 master + 4 slaves; Tilera system: 1 master + 2 slaves
– Data sizes: 500 MB, 2 GB, 8 GB, 32 GB, 64 GB, 128 GB
– Apples-to-apples comparison: deploy the systems with the same network and disk configurations; provide about 1 GB of memory for each hardware thread / core; adjust the Hadoop parameters to optimize performance (see the sketch below)
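The slides do not say which Hadoop parameters were tuned; as a hedged illustration, the helper below sizes task slots and heaps from the "about 1 GB per hardware thread" rule. The Hadoop 1.x property names are standard, but the sizing policy and values are assumptions, not the authors' actual settings:

```python
# Hypothetical slot/heap sizing for a Hadoop 1.x TaskTracker node; the
# policy here is an illustrative guess, not the authors' actual tuning.

def hadoop_task_settings(hw_threads, mem_gb, map_share=0.67):
    slots = min(hw_threads, int(mem_gb))        # ~1 GB of memory per slot
    map_slots = max(1, round(slots * map_share))
    reduce_slots = max(1, slots - map_slots)
    heap_mb = int(mem_gb * 1024 / slots)        # per-task JVM heap
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        "mapred.child.java.opts": f"-Xmx{heap_mb}m",
    }

# Xeon E5645 node: 2 sockets * 6 cores * 2 threads = 24 threads, 16 GB RAM
for name, value in hadoop_task_settings(hw_threads=24, mem_gb=16).items():
    print(f"{name} = {value}")
```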

Metrics
– Performance: data processed per second (DPS), where DPS = Data Input Size / Running Time
– Energy efficiency: data processed per joule (DPJ), where DPJ = Data Input Size / Energy Consumption
– DPS and DPJ are reported per processor
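A minimal sketch of the two metrics, assuming the run time and energy totals are measured externally (for example with a power meter); the numbers are placeholders:

```python
def dps(input_bytes, running_time_s):
    """Data processed per second = data input size / running time."""
    return input_bytes / running_time_s

def dpj(input_bytes, energy_joules):
    """Data processed per joule = data input size / energy consumption."""
    return input_bytes / energy_joules

input_bytes = 32 * 2**30            # a 32 GB input
time_s, avg_power_w = 840.0, 310.0  # made-up run time and average power
print(f"DPS: {dps(input_bytes, time_s) / 2**20:.1f} MB/s")
print(f"DPJ: {dpj(input_bytes, time_s * avg_power_w) / 2**20:.2f} MB/J")
```

Dividing each result by the number of processors gives the per-processor figures reported on the following slides.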

General Observations
[figures: the average DPS comparison; the average DPJ comparison]
– I/O-intensive workload (Sort): the many-core TileGx36 achieves the best performance and energy efficiency; the brawny-core processors do not provide performance advantages
– CPU-intensive, floating point operation dominated workloads (Bayes & K-means): brawny-core processors show obvious performance advantages, with energy efficiency close to that of the wimpy-core processors
– Other workloads: no platform consistently wins in terms of both performance and energy efficiency
– Averages are reported only for data sizes larger than 8 GB (the platforms are not fully utilized on small data sizes)

Improvements from Scaling Out the Wimpy Core (TileGx36 vs. Atom D510)
The TileGx36 core is wimpier than the Atom D510's:
– Adopts a MIPS-derived VLIW instruction set
– Does not support hyper-threading
– Has fewer pipeline stages
– Does not have dedicated floating point units
The TileGx36 integrates more cores on the NoC (Network on Chip): 36 cores in the TileGx36 vs. 4 cores in the Atom D510

Improvements from Scaling Out the Wimpy Core (TileGx36 vs. Atom D510)
[figures: the DPS comparison; the DPJ comparison]
– I/O-intensive workload (Sort): the TileGx36 shows 4.1 times performance improvement and 1.01 times energy improvement (on average)
– CPU-intensive, floating point operation dominated workloads (Bayes & K-means): the TileGx36 shows 2.5 times performance advantage and 0.7 times the energy efficiency (on average)
– Other workloads: the TileGx36 shows 2.5 times performance improvement and 1.03 times energy improvement (on average)

Improvements from Scaling Out the Wimpy Core (TileGx36 vs. Atom D510), takeaways:
– Scaling out the wimpy core can bring a performance advantage by improving execution parallelism
– Simplifying the wimpy cores and integrating more cores on the NoC is an option for big data workloads

Scaling Up the Brawny Core (Xeon E5645) vs. Scaling Out the Wimpy Core (TileGx36)
[figures: the DPS comparison; the DPJ comparison]
– I/O-intensive workload (Sort): the TileGx36 shows 1.2 times performance improvement and 1.9 times energy improvement (on average)
– CPU-intensive, floating point operation dominated workloads (Bayes & K-means): the E5645 shows 4.2 times performance improvement and 2.0 times energy improvement (on average)
– Other workloads: the E5645 shows a performance advantage, but no consistent energy improvement

Hardware Evaluation Summary
– No one-size-fits-all solution: none of the microprocessors consistently wins in terms of both performance and energy efficiency across all of our big data workloads
– One-size-fits-a-bunch solution: there are different classes of big data workloads, and each class achieves better performance and energy efficiency on a different architecture

Outline
– Benchmarking Methodology and Decision
– Big Data Workload Characterization
– Evaluating Hardware Systems with Big Data
– Conclusion

Conclusion
– An open source big data benchmark suite: a data-centric benchmarking methodology that must include diversity of data and workloads
– Big data workload characterization: data movement dominated computing; diverse behaviors
– Eschew one-size-fits-all solutions: tailor system designs to specific workload requirements

Thanks!