Copyright Gordon Bell. Beyond Moore's Law (and the web): What's next? Nets Everywhere
Presentation transcript:

All the chips outside… and around the PC: what new platforms? What apps? What challenges, what's interesting, and what needs doing? Gordon Bell, Bay Area Research Center, Microsoft Corporation

Architecture changes when everyone and everything is mobile! Power, security, RF, WWW, display, data-types (e.g. video & voice)… it's the application of architecture!

The architecture problem
- The apps
- Data-types: video, voice, RF, etc.
- Environment: power, speed, cost
- The material: clock, transistors…
- Performance… it's about parallelism
- Program & programming environment
- Network, e.g. WWW and Grid
- Clusters
- Multiprocessors
- Storage, cluster, and network interconnect
- Processor and special processing
- Multi-threading and multiple processors per chip
- Instruction-level parallelism vs. vector processors

IP On Everything

poochi

Sony PlayStation export limits

PC At An Inflection Point? The curve needs to continue upward. These scalable systems provide the highest technical (Flops) and commercial (TPC) performance, and they drive microprocessor competition! Meanwhile: non-PC devices and Internet PCs.

The Dawn Of The PC-Plus Era, Not The Post-PC Era… devices aggregate via PCs! Consumer PCs, mobile companions, TV/AV, household management, communications, automation & security.

The PC will prevail as a dominant platform for the next decade… 2nd to smart, mobile devices
- Moore's Law increases performance; alternatively, it reduces prices
- PC server clusters with a low-cost OS beat proprietary switches, smPs, and DSMs
- Home entertainment & control…
- Very large disks (1 TB by 2005) to "store everything"
- Screens to enhance use
- Mobile devices, etc. dominate the WWW after 2003!
- Voice and video become important apps!
(C = Commercial; C' = Consumer)

Where's the action? Problems?
- Constraints: speech, video, mobility, RF, GPS, security…
- Moore's Law, including network speed
- Scalability and high-performance processing
- Building them: clusters vs. DSM
- Structure: where are the processing, memory, and switches (disk and IP/TCP processing)?
- Micros: getting the most from the nodes
- Not ISAs: change can delay the Moore's Law effect… and wipe out software investment! Please, please, just interpret my object code!
- System-on-a-chip alternatives… apps drive data-types (e.g. video, voice, RF), performance, portability/power, and cost

High Performance Computing A 60+ year view

High-performance architecture/program timeline, 1950 to 2000 (flattened timeline chart):
- Device technology by era: vacuum tubes, transistors, MSI (minis), micros, RISC, new micros.
- Sequential programming (a single execution stream) runs through the whole period; SIMD and vector machines add parallelization along the way.
- Parallel programs, aka cluster computing: multicomputers, the MPP era ("ultracomputers", 10X in size & price!), 10x MPP "in situ" resources, 100x in //sm, NOW, VLSCC, and the geographically dispersed Grid.

Computer types by connectivity (chart; connectivity axis: WAN/LAN, SAN, DSM, SM): networked supers and the GRID (Legion, Condor) at WAN/LAN scale; clusters (Beowulf, NT clusters, NOW, T3E, SP2 (mP)) on SANs; SGI DSM clusters & mainframes; multis, workstations, and PCs under shared memory; micro-based versus vector machines (VPPuni, NEC mP, NEC super, Cray X…T, all mPv).

Technical computer types by connectivity (chart; connectivity axis: WAN/LAN, SAN, DSM, SM). Old world (one program stream): vector supers such as VPPuni, NEC mP, the T series, NEC super, and Cray X…T (all mPv), plus micros. New world, clustered computing (multiple program streams): networked supers and the GRID (Legion, Condor), Beowulf, SP2 (mP), NOW, SGI DSM clusters & mainframes, multis, workstations, and PCs.

Dead Supercomputer Society

Dead Supercomputer Society: ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics

SCI Research, c1985-1995: 35 university and corporate R&D projects; 2 or 3 successes… all the rest failed to work or failed in the market.

How to build scalables? To cluster or not to cluster… don’t we need a single, shared memory?

Application Taxonomy
- Technical: general-purpose, non-parallelizable codes (PCs have it!); vectorizable; vectorizable & //able (supers & small DSMs); hand-tuned, one-of-a-kind MPP coarse-grain; MPP embarrassingly // (clusters of PCs…)
- Commercial: database; database/TP; web host; stream audio/video. If central control & rich, then IBM or large SMPs, else PC clusters (a toy rendering of this rule follows below).
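A minimal sketch of the commercial-side decision rule on this slide; the predicate names are hypothetical, invented only to make the rule explicit.

```python
def commercial_platform(centrally_controlled: bool, rich_budget: bool) -> str:
    """Toy rendering of the slide's rule: 'If central control & rich
    then IBM or large SMPs, else PC clusters'."""
    if centrally_controlled and rich_budget:
        return "IBM mainframe or large SMP"
    return "PC cluster"

print(commercial_platform(True, True))    # -> IBM mainframe or large SMP
print(commercial_platform(False, True))   # -> PC cluster
```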

SNAP… c1995: Scalable Network And Platforms, A View of Computing in 2000+. We all missed the impact of the WWW! This talk / essay portrays our view of computer-server architecture trends. (It is silent on the client cellphones, toasters, and gameboys.) This is an early draft. We are sending a copy to you in hopes that you'll read and comment on it. We would like to publish it in several forms: a 2-hr video lecture, a kickoff article in a ComputerWorld issue that Gordon is editing, and a monograph, enlarged, to be published within a year. January 1, 1995. Gordon Bell, Jim Gray

Computing SNAP built entirely from PCs
- Diagram labels: legacy mainframe & minicomputer servers & terminals; portables; wide-area global network; mobile nets; wide & local area networks for terminals, PCs, workstations, & servers; person servers (PCs); scalable computers built from PCs; centralized & departmental uni- & mP servers (UNIX & NT); centralized & departmental servers built from PCs; TC = TV + PC in the home (CATV or ATM or satellite); a space, time (bandwidth), & generation scalable environment.
- Notes: Here's a much more radical scenario, but one that seems very likely to me. There will be very little difference between servers and the person servers, or what we mostly associate with clients. This will come because economy of scale is replaced by economy of volume. The largest computer is no longer cost-effective. Scalable computing technology dictates using the highest-volume, most cost-effective nodes. This means we build everything, including mainframes and multiprocessor servers, from PCs!

How Will Future Computers Be Built?
Thesis: SNAP, Scalable Networks and Platforms. Upsize from desktop to world-scale computer based on a few standard components.
Because: Moore's Law (exponential progress); standardization & commoditization; stratification and competition.
When: sooner than you think! Massive standardization gives massive use; economic forces are enormous.

Bell Prize and future peak Tflops over time (chart; machines marked include XMP, NCube, CM2, and NEC; *IBM Petaflops study target).

Top 10 TPC-C: the top two Compaq systems are 1.1x and 1.5x faster than the IBM SPs, at 1/3 the price of IBM and 1/5 the price of Sun.

Courtesy of Dr. Thomas Sterling, Caltech

Five Scalabilities
- Size scalable: designed from a few components, with no bottlenecks
- Generation scaling: no rewrite/recompile or user effort to run across generations of an architecture
- Reliability scaling: choose any level
- Geographic scaling: compute anywhere (e.g. multiple sites or in situ workstation sites)
- Problem x machine scalability: the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer. The problem x machine space maps to run time: problem scale and machine scale (#p) determine run time, which implies speedup and efficiency (a worked statement follows below).
Notes: It is unclear whether scalability has any meaning for a real system. While the following are all based on some order of N, the engineering details of a system determine the missing constants! A system is scalable if efficiency(n,x) = 1 for all algorithms, numbers of processors n, and problem sizes x. This fails to recognize cost, efficiency, and whether VLSCs are practical (affordable) in a reasonable time scale. Cost < O(N^2) rules out the crosspoint, whose cost is O(N^2) even though its latency is O(1); an Omega network is O(N log N), and ring/bus/mesh are O(N). Bandwidth is required to be < O(log N); supercomputer bandwidths are O(N)… no caching or hierarchies. SIMD didn't scale, and the CM5 probably won't. Compatibility with the future is important: no matter how much you build on standards, you want the next one to take all the programs (without recompilation) and files and run them with no changes!
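A worked statement of the speedup and efficiency quantities this slide appeals to, written in the usual form (a sketch; the symbols n for processor count and x for problem size follow the slide's wording, and T denotes run time).

```latex
% Speedup and efficiency for n processors on problem size x
S(n,x) = \frac{T(1,x)}{T(n,x)}, \qquad E(n,x) = \frac{S(n,x)}{n}

% The slide's (idealized) scalability criterion
\text{scalable} \iff E(n,x) = 1 \quad \forall\, \text{algorithms},\ \forall n,\ \forall x

% Interconnect orders cited on the slide: cost vs. latency
\text{crossbar: cost } O(N^2),\ \text{latency } O(1);\quad
\Omega\text{-network: } O(N \log N);\quad \text{ring/bus/mesh: } O(N)
```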

Why I gave up on large smPs & DSMs
- Economics: perf/cost is lower… unless it's a commodity
- Economics: longer design time & life; complex => poorer technology tracking & end-of-life performance
- Economics: higher, uncompetitive costs for processor & switching; sole sourcing of the complete system
- DSMs… NUMA! Latency matters, and the compiler, run-time, and O/S locate the programs anyway
- They aren't scalable; reliability requires clusters, so start there
- They aren't needed for most apps… hence a small market, unless one can find a way to lock in a user base (important, as in the case of IBM Token Ring vs. Ethernet)

FVCORE Performance: Finite Volume Community Climate Model, joint code development by NASA, LLNL and NCAR (performance chart comparing SX-5, SX-4, C90-16 max, and T3E max).

Architectural Contrasts – Vector vs. Microprocessor
- Vector system: CPU fed from memory through 8 KBytes of vector registers.
- Microprocessor system: CPU fed from memory through 8 MBytes of 1st & 2nd level caches.
- Contrasts: 500 MHz vs. 600 MHz clock; two results per clock on each (will be 4 in the next-generation SGI); vector lengths arbitrary vs. fixed; vectors fed at low speed vs. high speed.
Notes: Cache-based systems are nothing more than "vector" processors with a highly programmable "vector" register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note that 512-CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.
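A small arithmetic check of the cache-size claims in the note above, using the figures the slide gives (8 KB vector registers, 8 MB per-CPU cache, 512-CPU Origin).

```python
# Rough check of the slide's cache-vs-vector-register claims.
vector_regs_bytes = 8 * 1024            # ~8 KB of vector registers per vector CPU
cache_bytes = 8 * 1024 * 1024           # ~8 MB of cache per microprocessor CPU
cpus = 512                              # a 512-CPU Origin system

print(cache_bytes // vector_regs_bytes)            # 1024  -> the "~1000x larger" claim
print(cpus * cache_bytes / 2**30, "GB of cache")   # 4.0 GB across 512 CPUs
```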

Convergence to one architecture: mPs continue to be the main line.

"Jim, what are the architectural challenges… for clusters?"
- WANs (and even LANs) are faster than backplanes, at 40 Gbps
- The end of busses (Fibre Channel = 100 MBps)… except on a chip (see the comparison sketched below)
- What are the building blocks or combinations of processing, memory, & storage?
- InfiniBand (http://www.infinibandta.org) starts at OC48, but it may not go far or fast enough, if it ever exists; OC192 is being deployed.
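A quick comparison, using only the numbers on this slide, of the 40 Gbps WAN figure against a ~100 MBps Fibre Channel "bus" (illustrative arithmetic, not a benchmark).

```python
wan_gbps = 40.0                         # WAN/LAN link speed cited on the slide
fc_bus_MBps = 100.0                     # Fibre Channel throughput cited on the slide
fc_bus_gbps = fc_bus_MBps * 8 / 1000    # convert MB/s to Gbit/s

print(f"Fibre Channel bus: {fc_bus_gbps:.1f} Gbps")          # 0.8 Gbps
print(f"WAN / bus ratio:   {wan_gbps / fc_bus_gbps:.0f}x")   # ~50x faster WAN
```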

What is the basic structure of these scalable systems? Overall structure; disk connection, especially with respect to Fibre Channel SANs, and especially with fast WANs & LANs.

Modern scalable switches… also hide a supercomputer. They scale from <1 to 120 Tbps of switch capacity; 1 Gbps Ethernet switches scale to 10s of Gbps; the SP2 switch scales from 1.2 Gbps.

GB plumbing from the baroque: evolving from the 2 dance-hall model. (PMS diagram: primary memory (Mp) and processors (Pc) joined by a switch (S), with a Fibre Channel switch (S.fc) to mass storage (Ms), a cluster switch (S.Cluster), and a WAN switch (S.WAN); evolving toward integrated MpPcMs nodes connected by a LAN/cluster/WAN switch.)

SNAP Architecture. With this introduction about technology, computing styles, and the chaos and hype around standards and openness, we can look at the Network & Nodes architecture I posit.

ISTORE Hardware Vision
- System-on-a-chip enables computer + memory without significantly increasing the size of the disk.
- 5-7 year target, MicroDrive (1.7" x 1.4" x 0.2"): 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek; 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr bandwidth).
- Integrated IRAM processor (2x height), connected via a crossbar switch growing like Moore's Law: 16 MBytes; 1.6 Gflops; 6.4 Gops.
- 10,000+ nodes in one rack! 100/board = 1 TB; 0.16 Tflops.
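The 1999-to-2006 MicroDrive projection follows from compounding the stated growth rates; a sketch of that arithmetic (the 7-year span between the two data points is assumed from the dates).

```python
# Compound the slide's growth rates from the 1999 MicroDrive baseline.
years = 7                          # 1999 -> 2006
capacity_MB = 340 * 1.6**years     # 1.6x/yr capacity growth
bandwidth_MBps = 5 * 1.4**years    # 1.4x/yr bandwidth growth

print(f"capacity  ~{capacity_MB/1000:.1f} GB")   # ~9.1 GB, matching the slide's 9 GB
print(f"bandwidth ~{bandwidth_MBps:.0f} MB/s")   # ~53 MB/s, near the slide's 50 MB/s
```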

The Disk Farm? Or a System On a Card? The 14-inch, 500 GB disc card: an array of discs that can be used as 100 discs, 1 striped disc, 50 fault-tolerant discs, etc., giving LOTS of accesses/second of bandwidth. A few disks are replaced by 10s of GBytes of RAM and a processor to run apps!
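One reading of the disc-card arithmetic above (treating "fault-tolerant" as mirrored pairs is an assumption; the slide does not say how the 50 are formed).

```python
card_GB = 500
discs = 100
per_disc_GB = card_GB / discs        # capacity of each logical disc
ft_discs = discs // 2                # assume fault tolerance = mirrored pairs

print(per_disc_GB, "GB per logical disc")   # 5.0
print(ft_discs, "fault-tolerant discs")     # 50
```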

Map of Gray Bell Prize results: single-thread, single-stream TCP/IP via 7 hops, desktop-to-desktop, out-of-the-box Win 2K performance*, from Redmond/Seattle, WA; other points on the map: New York; Arlington, VA; San Francisco, CA; 5626 km, 10 hops.

Ubiquitous 10 GBps SANs in 5 years. 1 Gbps Ethernet is a reality now, along with Fibre Channel, Myrinet, GigaNet, ServerNet, ATM,… 10 Gbps x4 WDM is deployed now (OC192), and 3 Tbps WDM is working in the lab. In 5 years, expect 10x, wow! (Chart of link speeds: 5, 20, 40, 80, 120 MBps (1 Gbps), and 1 GBps.)

The Promise of SAN/VIA: 10x in 2 years (http://www.ViArch.org/)
- Yesterday: 10 MBps (100 Mbps Ethernet); ~20 MBps TCP/IP saturates 2 CPUs; round-trip latency ~250 µs.
- Now: wires are 10x faster (Myrinet, Gbps Ethernet, ServerNet,…); fast user-level communication; TCP/IP ~100 MBps at 10% CPU; round-trip latency is 15 µs; 1.6 Gbps demoed on a WAN.
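A toy transfer-time model contrasting the "yesterday" and "now" figures on this slide: one-way time approximated as half the round trip plus serialization. The 64 KB message size is chosen only for illustration.

```python
def transfer_time_us(size_bytes: int, rtt_us: float, bandwidth_MBps: float) -> float:
    """Half the round-trip latency plus size/bandwidth, in microseconds."""
    return rtt_us / 2 + size_bytes / (bandwidth_MBps * 1e6) * 1e6

msg = 64 * 1024  # 64 KB message, illustrative only
print(f"yesterday: {transfer_time_us(msg, 250, 20):.0f} us")   # ~3402 us
print(f"now:       {transfer_time_us(msg, 15, 100):.0f} us")   # ~663 us
```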

Processor improvements… 90% of ISCA’s focus

We get more of everything

Mainframes, minis, micros, and RISC (a chart created originally in '78 at DEC). Will RISC continue at 60%/yr, or 2x per 18 months… Moore's speed law? What about GaAs, and when? When do we put the mainframe out of its misery? The speed increase from clock alone is typically only 26%/yr (clock x2 per 3 years), plus another 26%/yr (x2 per 3 years) from architecture.
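The 26% and 60% figures on this slide compound as follows (a sketch: 2x per 3 years for clock, an assumed similar rate for architecture, combining to roughly 2x per 18 months).

```python
clock_per_year = 2 ** (1/3) - 1      # clock doubles every 3 years
arch_per_year = 2 ** (1/3) - 1       # assume architecture contributes a similar rate
combined = (1 + clock_per_year) * (1 + arch_per_year) - 1

print(f"clock alone:   {clock_per_year:.0%}/yr")    # ~26%/yr
print(f"combined:      {combined:.0%}/yr")          # ~59%/yr
print(f"2x/18 months = {2**(12/18) - 1:.0%}/yr")    # ~59%/yr, i.e. the ~60%/yr figure
```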

Computer ops/sec x word length / $

Growth of microprocessor performance (chart, 1980-1998; performance in Mflop/s on a log scale from 0.01 to 10,000): micros (8087, 80287, 6881, 80387, R2000, i860, RS6000/540, RS6000/590, Alpha) climbing toward supers (Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray C90, Cray T90).

Albert Yu predictions '96:
                  2000      2006    Growth
Clock (MHz)        900      4000      4.4x
MTransistors        40       350     8.75x
Mops              2400    20,000      8.3x
Die (sq. in.)      1.1       1.4      1.3x
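The growth column can be recomputed directly from the two prediction columns (values exactly as given on the slide).

```python
yu_96 = {            # Albert Yu '96 predictions: (year-2000 value, year-2006 value)
    "Clock (MHz)":   (900, 4000),
    "MTransistors":  (40, 350),
    "Mops":          (2400, 20000),
    "Die (sq. in.)": (1.1, 1.4),
}
for metric, (y2000, y2006) in yu_96.items():
    print(f"{metric:14s} {y2006 / y2000:.2f}x")
# Clock 4.44x, MTransistors 8.75x, Mops 8.33x, Die 1.27x (slide rounds to 1.3x)
```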

Processor Limit: the DRAM Gap. (Chart, 1980-2000, performance vs. time on a log scale: CPU performance grows 60%/yr per "Moore's Law" while DRAM performance grows 7%/yr, so the processor-memory performance gap grows ~50%/year.) Latency cliché: note that x86 didn't have on-chip cache until 1989. An Alpha 21264 full cache miss costs 180 ns / 1.7 ns = 108 clocks, x4 issue = 432 instructions' worth of execution. Caches in the Pentium Pro: 64% of area, 88% of transistors. *Taken from a Patterson-Keeton talk to SIGMOD.
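The Alpha 21264 miss-cost arithmetic, redone explicitly. Straight division gives roughly 106 clocks where the slide quotes 108; the small difference is presumably rounding in the original.

```python
miss_ns = 180.0      # full cache miss latency, from the slide
cycle_ns = 1.7       # 21264 cycle time (~588 MHz)
issue_width = 4      # instructions issued per clock

clocks = miss_ns / cycle_ns
print(f"{clocks:.0f} clocks per miss")                  # ~106 (slide says 108)
print(f"~{clocks * issue_width:.0f} issue slots lost")  # ~424 (slide says 432)
```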

The "memory gap"
- Multiple (e.g. 4) processors per chip, to increase the ops/chip while waiting out the inevitable access delays
- Or, alternatively, multi-threading (MTA)
- Vector processors with a supporting memory system
- System-on-a-chip… to reduce chip-boundary crossings

If system-on-a-chip is the answer, what is the problem? Small, high-volume products: phones, PDAs, toys & games (to sell batteries), cars, home appliances, TV & video, communication infrastructure, plain old computers… and portables.

SOC Alternatives… not including C/C++ CAD tools
- The blank sheet of paper: FPGA
- Auto design of a basic system: Tensilica
- Standardized, committee-designed components*, cells, and custom IP
- Standard components including more application-specific processors*, IP add-ons, and custom
- One chip does it all: SMOP
(*Processors, memory, communication & memory links)

Xilinx: 10M gates, 500M transistors, 0.12 micron

Free 32 bit processor core

System-on-a-chip alternatives
- FPGA: sea of un-committed gate arrays (Xilinx, Altera)
- Compile a system: a unique processor for every app (Tensilica)
- Systolic | array: many pipelined or parallel processors + custom
- DSP | VLIW: special-purpose processor cores + custom (TI)
- Pc & Mp ASICs: general-purpose cores, specialized by I/O, etc. (IBM, Intel, Lucent)
- Universal Micro: multiprocessor array, programmable I/O (Cradle)

Cradle: Universal Microsystem, trading Verilog & hardware for C/C++
- UMS : VLSI = microprocessor : special systems = software : hardware
- Single part for all apps; app specified at run time using FPGA & ROM
- 5 quad mPs at 3 Gflops/quad = 15 Gflops
- Single shared memory space, caches
- Programmable periphery including 1 GB/s; 2.5 Gips; PCI, 100baseT, FireWire
- $4 per Gflops; 150 mW/Gflops
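A quick check of the throughput and power claims as stated on this slide (a sketch; it only multiplies out the numbers given).

```python
quads = 5
gflops_per_quad = 3.0
total_gflops = quads * gflops_per_quad        # 15 Gflops, as the slide states

mw_per_gflops = 150                           # power efficiency claim
print(total_gflops, "Gflops")
print(total_gflops * mw_per_gflops / 1000, "W total")   # ~2.25 W at 150 mW/Gflops
```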

UMS Architecture: memory bandwidth scales with processing. It must allow mix-and-match of applications; design reuse is important, thus scalability is a must; resources must be balanced. Cradle is developing such an architecture, which has multiple processors (MSPs) attached to private memories that can communicate with external devices through a DRAM controller and programmable I/O. The architecture is regular and modular, with processing close to memory and a high-speed bus: memory bandwidth scales with processing; processing, software, and I/O are all scalable; each app runs on its own pool of processors; and this enables durable, portable intellectual property.

Recapping the challenges
- Scalable systems
- Latency in a distributed memory
- Structure of the system and nodes
- Network performance for OC192 (10 Gbps)
- Processing nodes and legacy software
- Mobile systems… power, RF, voice, I/O
- Design time!

The End