Scaling for the Future
Katherine Yelick, U.C. Berkeley, EECS
http://iram.cs.berkeley.edu/{istore}
http://www.cs.berkeley.edu/projects/titanium

Two Independent Problems
- Building a reliable, scalable infrastructure
  - Scalable processor, cluster, and wide-area systems: IRAM, ISTORE, and OceanStore
- One example application for the infrastructure
  - Microscale simulation of biological systems
  - Model signals from the cell membrane to the nucleus
  - Goal: understanding disease and supporting pharmacological and BioMEMS-mediated therapy

IRAM: Scaling within a Chip
- Microprocessor & DRAM on a single chip:
  - Avoids the memory bus bottleneck
  - Addresses power limits by spreading logic over the chip
- VIRAM chip:
  - Vector architecture exploits the on-chip bandwidth and preserves the power & area advantages (a vectorizable loop of the sort it targets is sketched below)
  - Support for multimedia
  - IBM will fabricate it in Spring '01: 200 MHz, 3.2 Gflops, 2 W, 0.18 um mixed logic/DRAM process
[Figure: conventional organization, with processor, caches, bus, and I/O in a logic fab and DRAM in a separate DRAM fab at a cost of $B for separate fab lines for logic and memory, versus a single chip with either the processor built in the DRAM fab or the memory built in the logic fab]
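As a rough illustration of the kind of code a vector architecture such as VIRAM's targets, the C loop below (a minimal sketch, not VIRAM-specific code) has no dependences between iterations, so a vectorizing compiler can turn it into long vector operations whose speed is limited by memory bandwidth, exactly the resource that placing the processor next to DRAM supplies in abundance.

```c
/* Minimal sketch: a SAXPY-style loop with independent iterations.
 * A vectorizing compiler for a vector ISA (such as the one VIRAM
 * implements) can execute the loop body as a handful of vector
 * instructions operating on many elements at once, so performance
 * is set by memory bandwidth rather than instruction issue rate. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* no dependence between iterations */
}
```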

ISTORE: Scaling Clusters
- Design points
  - 2001: 80 nodes in 3 racks
  - 2002: 1000 nodes in 10 racks (?)
  - 2005: 10K nodes in 1 rack (?), by adding IRAM to a 1" disk
- Key problems are availability, maintainability, and evolutionary growth (AME) of thousand-node servers
- Approach
  - Hardware built for availability: monitoring, diagnostics
  - New class of benchmarks for AME
  - Reliable systems from unreliable hardware/software components
  - Introspection: the system watches itself (a toy monitoring loop is sketched below)
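A toy sketch of the introspection idea, "the system watches itself": every node periodically records a heartbeat, and a monitor flags nodes whose heartbeats have gone stale so that diagnostics or recovery can be started without an operator. The node count, timeout, and recovery action below are illustrative assumptions, not ISTORE interfaces.

```c
/* Illustrative introspection loop: flag nodes whose last heartbeat is
 * older than a timeout.  The recovery action is only a placeholder. */
#include <stdio.h>
#include <time.h>

#define NODES   80   /* assumed cluster size, matching the 2001 design point */
#define TIMEOUT 5    /* seconds without a heartbeat before a node is suspect */

static time_t last_heartbeat[NODES];

static void record_heartbeat(int node) { last_heartbeat[node] = time(NULL); }

static void check_nodes(void)
{
    time_t now = time(NULL);
    for (int n = 0; n < NODES; n++)
        if (now - last_heartbeat[n] > TIMEOUT)
            /* Reactive step: a real system would run diagnostics,
             * rebuild data, or reroute requests automatically. */
            printf("node %d suspected failed\n", n);
}

int main(void)
{
    for (int n = 0; n < NODES; n++)
        if (n != 42)              /* pretend node 42 never reports in */
            record_heartbeat(n);
    check_nodes();                /* should flag node 42 only */
    return 0;
}
```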

OceanStore: Scaling to Utilities
[Figure: data spread across a federation of providers, e.g., Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM]
- Transparent data service provided by a federation of companies:
  - Monthly fee paid to one service provider
  - Companies buy and sell capacity from each other
- Assumptions:
  - Untrusted infrastructure: only ciphertext in the infrastructure
  - Promiscuous caching: cache anywhere, anytime
  - Optimistic concurrency control: avoid locking (a generic version-check sketch is below)
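One way to read "optimistic concurrency control: avoid locking" is the familiar version-check pattern sketched below: a client prepares an update against the version it read, and the commit is accepted only if that version is still current; otherwise the client re-reads and retries. This is a generic illustration of the idea, not the OceanStore update protocol.

```c
/* Generic optimistic-concurrency sketch: no lock is held while the new
 * value is being prepared; the commit succeeds only if the object's
 * version is unchanged since it was read.  In a real system the check
 * and the commit would be applied atomically by whichever replica
 * holds the primary copy. */
#include <stdbool.h>

struct object {
    unsigned version;   /* bumped on every successful commit */
    int      value;
};

bool try_commit(struct object *obj, unsigned read_version, int new_value)
{
    if (obj->version != read_version)
        return false;           /* someone committed first: re-read, retry */
    obj->value = new_value;
    obj->version++;
    return true;                /* update accepted */
}
```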

The Real Scalability Problems: AME
- Availability: systems should continue to meet quality-of-service goals despite failures and extreme load
- Maintainability: minimize human administration
- Evolutionary growth: graceful evolution; dynamic scalability
These are problems for both computation and storage services.

Research Principles
- Redundancy everywhere
  - Hardware: processors, networks, disks, …
  - Software: language, libraries, runtime, …
- Introspection
  - Reactive techniques to detect and adapt to failures, workload variations, and system evolution
  - Proactive techniques to anticipate and avert problems before they happen
- Benchmarking
  - Define quantitative AME measures
  - Benchmarks drive the field

Benchmarks
- Availability benchmarks
  - Measure QoS as fault events occur (a skeleton of such a run is sketched below)
  - Support for fault injection is key
  - Example: software RAID system (next slide)
- Maintainability benchmarks
  - The human factor is a challenge
- Evolutionary growth benchmarks
  - Performance with heterogeneous hardware
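A hedged skeleton of what an availability-benchmark run might look like: drive the system with a steady workload, record delivered quality of service per time interval, inject a fault partway through, and report how deep and how long the resulting dip is. The simulated service below is a stand-in for a real system under test; only the measurement structure is the point.

```c
/* Skeleton of an availability benchmark against a simulated service:
 * measure QoS per interval, inject a fault partway through, log the dip. */
#include <stdio.h>
#include <stdlib.h>

static int degraded = 0;                 /* set when a fault is injected */

static void inject_fault(void) { degraded = 1; }

static int do_request(void)              /* stand-in for one client request */
{
    /* Healthy: always succeeds.  Degraded: ~30% of requests fail,
     * emulating a service limping along after a component failure. */
    return degraded ? (rand() % 10 >= 3) : 1;
}

int main(void)
{
    enum { INTERVALS = 60, REQS = 1000, FAULT_AT = 20 };
    for (int t = 0; t < INTERVALS; t++) {
        if (t == FAULT_AT)
            inject_fault();
        int completed = 0;
        for (int r = 0; r < REQS; r++)
            completed += do_request();
        printf("%d %d\n", t, completed);  /* QoS curve: completed vs. time */
    }
    return 0;
}
```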

Example: Faults in Software RAID
[Figure: measured performance over time during reconstruction after an injected disk fault, shown for Linux and for Solaris]
- Compares Linux and Solaris reconstruction (the rebuild operation itself is sketched below)
- Linux: minimal performance impact, but a longer window of vulnerability to a second fault
- Solaris: large performance impact, but restores redundancy fast
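For context, the routine below shows the core operation both systems perform during reconstruction: a lost block in a parity-protected stripe is recovered by XORing the surviving blocks. Scheduling these rebuild reads aggressively restores redundancy sooner but steals bandwidth from foreground requests, which is exactly the Linux/Solaris trade-off above. This is an illustrative sketch, not the Linux md or Solaris code.

```c
/* Rebuild one missing block of a parity-protected stripe by XORing the
 * surviving blocks.  Running many of these rebuilds back to back restores
 * redundancy quickly but competes with foreground I/O for bandwidth. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void rebuild_block(uint8_t *out, const uint8_t *const survivors[],
                   size_t nsurvivors, size_t block_size)
{
    memset(out, 0, block_size);
    for (size_t d = 0; d < nsurvivors; d++)
        for (size_t i = 0; i < block_size; i++)
            out[i] ^= survivors[d][i];
}
```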

Simulating Microscale Biological Systems
- Large-scale simulation is useful for
  - Fundamental biological questions: cell behavior
  - Design of treatments, including Bio-MEMS
- Simulations are limited in part by
  - Machine complexity, e.g., memory hierarchies
  - Algorithmic complexity, e.g., adaptation
- Old software model: hide the machine from the users
  - Implicit parallelism, hardware-controlled caching, …
  - Results were unusable; witness the success of MPI (an explicit fragment is sketched below)
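For contrast with the implicit model, the fragment below is a minimal MPI-in-C illustration of the explicit style that did succeed: each process owns a block of the data and exchanges only the boundary values it needs, so parallelism, distribution, and communication are all visible to the programmer. It is a generic sketch, not code from the applications discussed here.

```c
/* Minimal MPI sketch of the explicit model: each rank owns a block of
 * the data and exchanges only its edge values with neighbors (a 1-D
 * "ghost cell" exchange).  Decomposition and communication are explicit. */
#include <mpi.h>

#define LOCAL_N 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[LOCAL_N + 2];                 /* owned block plus two ghost cells */
    for (int i = 1; i <= LOCAL_N; i++)
        u[i] = rank;                       /* fill owned cells with some data */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Explicit communication: send my edge cells, receive neighbors' edges. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```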

New Model for Scalable, High-Confidence Computing
- Domain-specific language that judiciously exposes machine structure
  - Explicit parallelism, load balancing, and locality control
  - Allows construction of complex, distributed data structures
- Current: demonstration on higher-level models
  - Heart simulation
- Future plans
  - Algorithms and software that adapt to faults
  - Microscale systems

Conclusions
- Scaling at all levels: processors, clusters, wide area
- Application challenges: both storage- and compute-intensive
- Key challenges for future infrastructure:
  - Availability and reliability
  - Complexity of the machine