Presentation transcript:

Energy-efficient Cluster Computing with FAWN: Workloads and Implications
Vijay Vasudevan, David Andersen, Michael Kaminsky*, Lawrence Tan, Jason Franklin, Iulian Moraru
Carnegie Mellon University, *Intel Labs Pittsburgh

Energy in Data Centers
US data centers now consume 2% of total US power. Energy has become an important metric of system performance. Can we make data-intensive computing more energy efficient?
– Metric: Work per Joule
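Spelled out as an equation (the worked numbers below are hypothetical illustrations, not measurements from the talk):

```latex
\text{Work per Joule} \;=\; \frac{\text{work done}}{\text{energy consumed}}
\;=\; \frac{\text{work done}}{\int P(t)\,dt}
```

For example, sorting 10GB in 100 seconds at an average draw of 50W (both numbers hypothetical) would yield 10240 MB / 5000 J ≈ 2 MB/J.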

Goal: reduce peak power
[Figure: power flow in a traditional datacenter vs. FAWN. Power distribution and cooling each lose about 20% of input energy, so 1000W drawn at the facility delivers roughly 750W to the servers; the FAWN column shows server draw on the order of 100W.]
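The figure's numbers can be restated with Power Usage Effectiveness (PUE), the standard facility-overhead metric (the metric name is ours; the slide only shows the wattages):

```latex
\text{PUE} \;=\; \frac{\text{total facility power}}{\text{IT equipment power}}
\;=\; \frac{1000\,\text{W}}{750\,\text{W}} \;\approx\; 1.33
```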

Wimpy Nodes are Energy Efficient …but slow
[Figure: sorting 10GB of data – sort rate (MB/sec) and sort efficiency (MB/Joule) for Atom, Desktop, and Server nodes.]
Atom node: + energy efficient; - lower frequency (slower); - limited memory/storage

FAWN: Fast Array of Wimpy Nodes
Leveraging parallelism and scale-out to build energy-efficient clusters

FAWN in the Data Center
Why is FAWN more energy-efficient? When is FAWN more energy-efficient? What are the future design implications?

CPU Power Scaling and System Efficiency
The fastest processors exhibit superlinear power usage; fixed power costs can dominate efficiency for slow processors. FAWN targets the sweet spot in system efficiency once fixed costs are included.
[Figure: Speed vs. Efficiency. Efficiency numbers include 0.1W power overhead.]
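A minimal model of this sweet spot (our own illustration; the fixed overhead P_0, coefficient β, and exponent α are hypothetical parameters, not values from the slide):

```latex
\mathrm{eff}(s) \;=\; \frac{s}{P_0 + \beta\, s^{\alpha}}, \quad \alpha > 1,
\qquad
s^{*} \;=\; \left(\frac{P_0}{(\alpha - 1)\,\beta}\right)^{1/\alpha}
```

Setting the derivative of eff(s) to zero gives the most efficient speed s*: a larger fixed overhead P_0 pushes the optimum toward faster processors, while more strongly superlinear CPU power (larger α) pushes it toward wimpier ones, which is exactly the sweet-spot argument above.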

FAWN in the Data Center
Why is FAWN more energy-efficient? When is FAWN more energy-efficient?

When is FAWN more efficient?
Modern wimpy FAWN node prototype: Intel “Pineview” Atom, two 1.8GHz cores, 2GB of DRAM, 18W – 29W (idle – peak).
Core i7-based desktop (stripped down): single 2.8GHz quad-core Core i7, 2GB of DRAM, 40W – 140W (idle – peak).

Data-intensive computing workloads:
1. I/O-bound – seek or scan
2. Memory/CPU-bound
3. Latency-sensitive, but non-parallelizable
4. Large, memory-hungry
FAWN's sweet spot: workloads 1 and 2 (3 and 4 are the hurdles discussed later).

Memory-bound Workloads
The Atom is 2x as efficient when the working set fits in L1 or spills all the way to DRAM; the desktop Core i7 has an 8MB L3.
[Figure: Efficiency vs. matrix size – Atom wins, then Core i7-8T wins, then Atom wins again.]
Wimpy nodes can be more efficient when cache effects are taken into account, but your workloads may require tuning of algorithms.
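One common form such algorithm tuning takes is cache blocking: restructure the computation so each tile's working set fits the target cache. A minimal sketch (our illustration, not code from the talk; `block` is a hypothetical parameter you would tune per node, small for an Atom's caches, larger for an i7's 8MB L3):

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Multiply A @ B tile by tile so each tile's working set
    fits in cache; numpy slicing clips the ragged edge tiles."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C
```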

CPU-bound Workloads
Crypto: SHA1/RSA. Optimization matters!
– Unoptimized C: Atom wins
– Optimized assembly: with the old code the Core i7 wins; with the new code the Atom wins
[Table: Old-SHA1 (MB/J), New-SHA1 (MB/J), and RSA-Sign (Sign/J) for the Atom and the i7.]
CPU-bound operations can be more energy efficient on low-power processors. However, code may need to be hand-optimized.
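To make the MB/J unit concrete, here is a minimal sketch of how one could measure SHA1 efficiency; this is our illustration, not the talk's benchmark code. `measured_watts` is assumed to come from an external power meter, and note that CPython's hashlib typically delegates to an optimized library implementation, so which SHA1 code runs matters here just as on the slide.

```python
import hashlib
import os
import time

def sha1_mb_per_joule(measured_watts, total_mb=256):
    """Hash `total_mb` of data and convert throughput (MB/s)
    into efficiency: MB/J = (MB/s) / W."""
    chunk = os.urandom(2**20)  # 1MB buffer, reused each iteration
    h = hashlib.sha1()
    start = time.perf_counter()
    for _ in range(total_mb):
        h.update(chunk)
    elapsed = time.perf_counter() - start
    return (total_mb / elapsed) / measured_watts

# Hypothetical 20W wall-power reading on an Atom node:
# print(sha1_mb_per_joule(measured_watts=20.0))
```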

Potential Hurdles
Memory-hungry workloads
– Performance depends on locality at many scales, e.g., prior cache results, on or off chip/machine
– Some success with algorithmic changes, e.g., virus scanning
Latency-sensitive, non-parallelizable workloads
– E.g., Bing search, with a strict latency bound on processing time; without software changes, the Atom was found to be too slow

FAWN in the Data Center
Why is FAWN more energy-efficient? When is FAWN more energy-efficient? What are the future design implications?
– With efficient CPUs, memory power becomes critical

Memory power also important
Today's high-speed systems: memory ≈ 30% of power. DRAM power draw:
– Storage: idle/refresh
– Communication: precharge and read; memory bus (~40%?)
CPU-to-memory distance greatly affects power:
– A point-to-point topology is more efficient than a bus: it reduces trace length (+ lower latency, + higher bandwidth, + lower power consumption), but limits memory per core
– Why not stack CPU and memory?
[Figure: DRAM line refresh; CPU and memory bus.]

Preview of the Future: FAWN Roadmap
Nodes with a single CPU chip containing many low-frequency cores; less memory, stacked with a shared interconnect. Industry and academia are beginning to explore this direction – e.g., the iPad, EPFL's ARM+DRAM work.

To conclude: the FAWN architecture is more efficient, but…
Up to a 10x increase in processor count; tight per-node memory constraints; algorithms may need to be changed.
Research needed on…
– Metrics: Ops per Joule? Atoms increase workload variability and latency – incorporate quality-of-service metrics?
– Models: Will your workload work well on FAWN?
Questions?

Related Work
System architectures:
– JouleSort: SATA disk-based system with low-power CPUs
– Low-power processors for datacenter workloads: Gordon (focus on the FTL, simulations); CEMS, AmdahlBlades, Microblades, Marlowe, BlueGene
– IRAM: tackling the memory wall, a thematically similar approach
Sleeping, a complementary approach: Hibernator, Ganesh et al., Pergamum