Advanced Computer Architecture Introduction


Advanced Computer Architecture Introduction A.R. Hurson 128 EECH Building, Missouri S&T hurson@mst.edu

CpE6110 Advanced Computer Architecture
Instructor: A. R. Hurson, hurson@mst.edu
Office: EECH 128, 341-6201
Office Hours: By appointment
Class notes and reading materials are available at: https://hurson.weebly.com/cpe-6110-advanced-computer-architecture.html
Reference: Computer Architecture: A Quantitative Approach, Hennessy and Patterson

Outline:
1. Performance and Cost Analysis
2. Instruction Set Analysis
3. Scheduling/Load balancing
4. Memory hierarchies (revisit)
5. Advanced Caching
   a) Internet caching
   b) Cooperative caching
6. Concurrency: Classifications
   a) Vector processors
   b) SIMD array architectures
   c) MIMD architectures
7. Systolic design
8. VLIW/Super-scalar/Super-pipeline
   Data-flow processing
   Multithreaded architecture
   Transactional Memory
   Multicore architecture

Administrative:
Homework Assignments 10%
Quizzes 20%
Project (term paper) 20%
Midterm Exam 20%
Final Exam 30%

Race to the top: 2011
Rank. Site: Computer / Year, Vendor, Country; Cores; Rmax (Pflops); Rpeak (Pflops); Power (MW)
1. RIKEN Advanced Institute for Computational Science: K computer, SPARC64 VIIIfx 2.0 GHz / 2011, Fujitsu, Japan; 548,352; 8.162; 8.774; 9.899
2. National Supercomputing Center in Tianjin: Tianhe-1A - NUDT / 2010, NUDT, China; 186,368; 2.566; 4.701; 4.040
3. DOE/SC/Oak Ridge National Laboratory: Jaguar - Cray XT5-2.6 GHz / 2009, Cray Inc., USA; 224,162; 1.759; 2.331; 6.951
4. National Supercomputing Centre in Shenzhen: Nebulae - Dawning TC3600 Blade / 2010, Dawning; 120,640; 1.271; 2.984; 2.580
5. GSIC Center, Tokyo Institute of Technology: TSUBAME 2.0 / 2010, NEC/HP; 73,278; 1.192; 2.288; 1.399
6. DOE/NNSA/LANL/SNL: Cielo - Cray XE6 8-core 2.4 GHz / 2011, Cray Inc.; 142,272; 1.110; 1.365; 3.980
7. NASA/Ames Research Center/NAS: Pleiades - 2.93 GHz / 2011, SGI; 111,104; 1.088; 1.315; 4.102
8. DOE/SC/LBNL/NERSC: Hopper - Cray XE6 12-core 2.1 GHz / 2010, Cray Inc.; 153,408; 1.054; 1.289; 2.910
9. Commissariat a l'Energie Atomique (CEA): Tera-100 - Bull bullx super-node S6010/S6030 / 2010, Bull SA, France; 138,368; 1.050; 1.255; 4.590
10. DOE/NNSA/LANL: Roadrunner - 3.2 GHz / 2009, IBM; 122,400; 1.042; 1.376; 2.346

Race to the top: 2011 (figure; series labels: #1, #500, Sum)

Race to the top: 2012 (figure)

Race to the top
If the projection holds, we would expect an Exaflops system around 2019.
In 2012, the Top500 list was dominated by IBM Blue Gene/Q, with four systems in the top 10. The largest, at Lawrence Livermore National Laboratory, delivers more than 16 Petaflops of sustained performance. It replaced the K computer from Japan (the first 10-Petaflops machine).

Tianhe-2 (MilkyWay-2) has been the No. 1 system since 2013. It is used for simulation, analysis, and government security applications.
It is a collection of 16,000 compute nodes with a total of 3,120,000 cores. Each of the 16,000 nodes has 88 gigabytes of memory. The total CPU plus coprocessor memory is 1,375 TiB (approximately 1.34 PiB).
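
The memory total quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the slide's "88 gigabytes" per node means binary gigabytes (GiB), and the constant and function names are illustrative, not taken from the slides.

```python
# Sanity check of the Tianhe-2 figures quoted above.
# Assumption: "88 gigabytes" per node is read as 88 GiB.
NODES = 16_000
MEMORY_PER_NODE_GIB = 88
TOTAL_CORES = 3_120_000


def total_memory_tib(nodes: int, gib_per_node: int) -> float:
    """Total memory in TiB for `nodes` nodes with `gib_per_node` GiB each."""
    return nodes * gib_per_node / 1024


total_tib = total_memory_tib(NODES, MEMORY_PER_NODE_GIB)
total_pib = total_tib / 1024
cores_per_node = TOTAL_CORES / NODES

print(f"total memory: {total_tib:.0f} TiB ({total_pib:.2f} PiB)")
print(f"cores per node: {cores_per_node:.0f}")
```

Under that reading, the per-node figure reproduces the 1,375 TiB total stated on the slide.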

Four major challenges have been recognized for an Exaflops machine:
Energy and power
Memory and storage
Concurrency and locality
Resiliency

It is estimated that an Exaflops machine would consume about 20 Megawatts of power, which is equivalent to an efficiency of 50 Gflops/W.

Computer Architecture refers to the attributes of a system visible to a programmer, i.e., attributes that have a direct impact on the logical execution of a program. Architectural attributes include the instruction set, the number of bits used to represent various data types, I/O mechanisms, and techniques for addressing memory.

Computer Organization refers to the operational units and their interconnections that realize the architectural specifications. Organizational attributes include those hardware details transparent to the programmer, such as control signals, interfaces between the computer and peripherals, and the memory technology used.

Computer Hardware refers to the detailed hardware design (logic design) and the implementation (packaging, power, cooling, ...).

For a good design, architecture (instruction set design), organization, and hardware, as well as software (compiler and operating system) issues, must be considered.

Common Performance Metrics
Execution time, Bandwidth, Throughput, User CPU Time, MIPS, MFLOPS, Speed up, Efficiency

Scaleup: Handling a larger task by increasing the degree of parallelism. It is the ability to process larger tasks in the same amount of time by providing more resources.
Power Consumption: Becomes an important performance metric when we use mobile wireless access devices.
Network Connectivity: Becomes of interest when connectivity is through a wireless medium.
Data reliability and integrity: Becomes a factor of interest in the presence of mobility and wireless communication.

Summary
Computer architecture/Computer organization
Role of computer architect
Some performance metrics
How to improve performance? Hardware technology, innovative architectural features, and efficient resource management.

Performance improvement
Advances in Technology
Architectural Advances
Better Resource Management
Program behavior

Questions
Network Latency
Memory Latency
How to manage resources efficiently?

Grosch’s Law
In the 1940s, Grosch studied the relationship between the power (computational speed) P and the cost C of a computer. He postulated that:
$P = k \cdot C^{s}$
where k and s are positive constants. He also argued that s is close to 2.

According to this law, in order to sell a computer for twice as much, it must be four times as fast. With the advances in technology, it is easy to see that Grosch’s law is no longer valid.
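
To make the "twice the price, four times the speed" statement concrete, here is a minimal sketch of Grosch's relation. The function name `grosch_power` and the default k = 1 are illustrative assumptions; only the form of the law and s close to 2 come from the slide.

```python
def grosch_power(cost: float, k: float = 1.0, s: float = 2.0) -> float:
    """Computational power predicted by Grosch's law, P = k * C**s.

    k is an arbitrary positive constant (assumed 1.0 here);
    s = 2.0 is the value Grosch argued for.
    """
    if cost <= 0 or k <= 0 or s <= 0:
        raise ValueError("cost, k, and s must be positive")
    return k * cost ** s


# Doubling the cost multiplies the predicted power by 2**s, i.e. 4 for s = 2.
ratio = grosch_power(2.0) / grosch_power(1.0)
print(ratio)
```

The ratio does not depend on k, which is why the slide can state the law's consequence without fixing that constant.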

Performance Measures
Amdahl's law: The performance improvement gained by improving some portion of an architecture is limited by the fraction of the time the improved portion is used. In particular, a small number of sequential operations can effectively limit the speed up of a parallel algorithm.

Amdahl's law allows a quick way to calculate the speed up based on two factors:
The fraction of the computation time in the original task that is affected by the enhancement (F), and
The improvement gained by the enhanced execution mode, i.e., the speed up of the enhanced portion (S).

Amdahl's law
$\text{Speed up}_{\text{overall}} = \dfrac{1}{(1 - F) + \dfrac{F}{S}}$
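
A small numerical sketch may help here. It uses the standard form of Amdahl's law, overall speed up = 1 / ((1 - F) + F / S), where F is the fraction of the original time affected by the enhancement and S is the speed up of that portion. The function and parameter names are illustrative choices.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speed up when a fraction F of the original time is sped up by S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# 90% of the work is sped up by increasing factors; the remaining 10% is untouched.
for s in (10, 100, 1_000_000):
    print(s, round(amdahl_speedup(0.9, s), 3))
```

However large S becomes, the result stays below 1 / (1 - F), which is 10 in this example. That is the sense in which a small sequential portion limits the achievable speed up.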

Gustafson-Barsis’s Law
Gustafson and Barsis observed that parallel architectures comprising hundreds of processors can be built with substantial improvement in performance. They argued that, in practice, the problem size scales up with the number of processors (n).

If s and p are the serial and parallel times spent on a parallel system, then s + p * n represents the execution time of the same workload on a single processor. They introduced a new factor, the scaled speed up factor SS(n):
$SS(n) = \dfrac{s + p \cdot n}{s + p}$

Speed up should be measured by scaling the problem to the number of processors, not by fixing the problem size.
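
The scaled speed up factor SS(n) = (s + p * n) / (s + p) can be evaluated directly. The sketch below is illustrative only: the function name and the sample values of s and p are assumptions, not figures from the slides.

```python
def scaled_speedup(s: float, p: float, n: int) -> float:
    """Gustafson-Barsis scaled speed up, SS(n) = (s + p * n) / (s + p).

    s and p are the serial and parallel times measured on the parallel
    system; n is the number of processors.
    """
    if s < 0 or p < 0 or s + p <= 0:
        raise ValueError("s and p must be non-negative and not both zero")
    if n < 1:
        raise ValueError("n must be at least 1")
    return (s + p * n) / (s + p)


# With 10% serial time, the scaled speed up grows almost linearly with n.
for n in (1, 10, 100, 1000):
    print(n, scaled_speedup(0.1, 0.9, n))
```

Unlike the fixed-size view, the result is not capped by the serial fraction, because the parallel part of the workload grows with n.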

Performance Measures
Maximum Concurrency: For any computer there is a maximum number of bits or bit pairs, the maximum concurrency ($C_m$), that can be processed concurrently, whether it is under single-instruction or multiple-instruction control.

Average Concurrency: The maximum concurrency is an indication of the computer's processing capability. The actual utilization of this capability is indicated by the average concurrency, defined as:
$C_a = \dfrac{\sum_i C_i \, \Delta t_i}{\sum_i \Delta t_i}$
where $C_i$ is the concurrency at $\Delta t_i$.

If $\Delta t_i$ is set to one, then the average concurrency over a period of T time units is:
$C_a = \dfrac{1}{T} \sum_{i=1}^{T} C_i$

Hardware Utilization: The average hardware utilization is defined as:
$\mu = \dfrac{1}{T} \sum_{i=1}^{T} \mu_i$
where $\mu_i$ is the hardware utilization at time i.

$C_m$ is determined by the hardware design, whereas $C_a$ or $\mu$ is highly dependent on the software and applications. A general-purpose computer should achieve a high $\mu$ for as many applications as possible. A special-purpose computer would yield a high $\mu$ for at least the intended applications. In either case, maximizing the value of $\mu$ for a computer design is important.
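
The concurrency measures can be sketched numerically. The example below rests on two assumptions that the slide text does not spell out: the average concurrency is taken as the time-weighted mean of the per-interval concurrency, and the hardware utilization at each instant is taken as the ratio of concurrency to maximum concurrency. All names and sample values are hypothetical.

```python
from typing import Optional, Sequence


def average_concurrency(concurrency: Sequence[float],
                        durations: Optional[Sequence[float]] = None) -> float:
    """Time-weighted mean concurrency; unit-length intervals if durations is None."""
    if durations is None:
        durations = [1.0] * len(concurrency)
    if len(concurrency) != len(durations):
        raise ValueError("concurrency and durations must have the same length")
    total_time = sum(durations)
    if total_time <= 0:
        raise ValueError("total time must be positive")
    return sum(c * dt for c, dt in zip(concurrency, durations)) / total_time


def hardware_utilization(concurrency: Sequence[float],
                         max_concurrency: float,
                         durations: Optional[Sequence[float]] = None) -> float:
    """Average utilization, assuming utilization = concurrency / max concurrency."""
    if max_concurrency <= 0:
        raise ValueError("max_concurrency must be positive")
    return average_concurrency(concurrency, durations) / max_concurrency


# Hypothetical machine that can process 64 bits at once, observed over 4 time units.
observed = [64, 32, 16, 16]
print(average_concurrency(observed))       # mean of the four samples
print(hardware_utilization(observed, 64))  # fraction of the capability actually used
```

A design with a large maximum concurrency but a workload that rarely fills it shows up here as a low utilization, which is the situation the slides warn against.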

Performance Measures: Parallel Systems
For a parallel processor, the average parallelism is defined as:
$P_a = \dfrac{1}{T} \sum_{i=1}^{T} P_i$
for T time units.

Similarly, the average hardware utilization is defined as:
$\rho = \dfrac{1}{T} \sum_{i=1}^{T} \rho_i$
where $\rho_i$ is the hardware utilization for the parallel processor at time i.

If $\tilde{P}_a$ is the effective parallelism over a period of T, and $\tilde{\rho}$, $\tilde{P}_i$, and $\tilde{\rho}_i$ are the corresponding effective values, then the effective hardware utilization is:
$\tilde{\rho} = \dfrac{1}{T} \sum_{i=1}^{T} \tilde{\rho}_i$

A successful parallel processor design should yield a high effective hardware utilization, as well as the required throughput for, at least, the intended application(s). This involves not only a proper hardware and software design, but also the development of efficient parallel algorithms for these applications.

Summary
Amdahl's law
More performance metrics
CPU Time
Concurrency
Hardware Utilization

Performance Measures: Pipeline Systems
Latency (L) is defined as the number of time units separating two successive initiations of events. Naturally, the lower the latency, the higher the performance. Latency could be any integer value, including zero.

The average latency is defined as the average number of time units between two initiations. The initiation rate (I) is the average number of initiations per clock unit:
$I = \dfrac{1}{\text{average latency}}$

For stage $S_i$, the stage utilization ($US_i$) indicates, on average, how often $S_i$ has been used:
$US_i = I \cdot n_i$
where $n_i$ represents the number of times $S_i$ is used in one initiation.
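
The pipeline measures above can be sketched as follows. The example assumes that the initiation rate is the reciprocal of the average latency, which follows from the two definitions but is stated here as an assumption. The latency values and function names are hypothetical.

```python
from typing import Sequence


def initiation_rate(latencies: Sequence[int]) -> float:
    """Average initiations per clock unit, taken as 1 / (average latency)."""
    if not latencies:
        raise ValueError("at least one latency is required")
    average_latency = sum(latencies) / len(latencies)
    if average_latency <= 0:
        raise ValueError("average latency must be positive")
    return 1.0 / average_latency


def stage_utilization(rate: float, uses_per_initiation: int) -> float:
    """US_i = I * n_i, with n_i the number of times stage S_i is used per initiation."""
    return rate * uses_per_initiation


# Hypothetical latency cycle: initiations separated alternately by 1 and 3 time units.
rate = initiation_rate([1, 3])
print(rate)                        # initiations per clock unit
print(stage_utilization(rate, 2))  # a stage used twice per initiation
```

A stage whose utilization reaches 1 is busy in every clock unit, so it bounds how far the initiation rate can be raised.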

Performance Measures: Pipeline Systems
For a linear pipe, if $\delta_i$ denotes the execution time of stage $S_i$, then:
$US = \dfrac{1}{\mathrm{MAX}(\delta_i)}$

Performance Measures: Means to evaluate a system
Application programs: Workload.
Real programs: A collection of programs that are run often by the user.
Kernels: Small, key pieces from real programs.
Benchmarks: A set of familiar, small, and well-behaved programs known to the user.
Synthetic benchmarks: An artificial set of small programs that are intended to match the average frequency of operations and operands of a large set of programs.

Is one number enough?
So far in our discussion, performance has been the major design constraint. However, power is becoming a problem. Power consumption first became an issue with the growth of wireless technology and mobile devices. It is now also a concern for supercomputers, since feeding several Megawatts of power to run one is not a trivial task and requires a great amount of supporting infrastructure.

It is estimated that each Megawatt of power consumption increases the electricity cost by about $1 million each year. In addition to the cost, the environmental impact has become an issue as well: data centers now have a significant share of global CO2 emissions.

Is one number enough?
Alternatively, the so-called Top500 Green list, based on Flops/W efficiency, was developed.

Is one number enough?
On the Green500 list as of June 2012, the top 21 spots are held by IBM Blue Gene/Q systems with an efficiency of over 2.1 GFlops/Watt (a huge gap with the top non-Blue Gene/Q system).

As noted before, an Exaflops machine would consume about 20 Megawatts of power, which is equivalent to 50 Gflops/W. Relative to Blue Gene/Q, the power efficiency then needs to be improved by a factor of about 25.
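
The efficiency target and the gap to Blue Gene/Q follow from simple arithmetic, sketched below. The constant and function names are illustrative. The inputs (1 Exaflops, 20 Megawatts, 2.1 GFlops/W) are the figures quoted in the slides.

```python
EXAFLOPS = 1e18                  # floating-point operations per second
POWER_BUDGET_WATTS = 20e6        # 20 Megawatts
BLUE_GENE_Q_GFLOPS_PER_WATT = 2.1


def gflops_per_watt(flops: float, power_watts: float) -> float:
    """Energy efficiency in Gflops/W for a given performance and power draw."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return flops / power_watts / 1e9


target = gflops_per_watt(EXAFLOPS, POWER_BUDGET_WATTS)
improvement = target / BLUE_GENE_Q_GFLOPS_PER_WATT

print(f"target efficiency: {target:.1f} Gflops/W")
print(f"required improvement over Blue Gene/Q: {improvement:.1f}x")
```

The computed ratio is a little under 24, which the slides round to a factor of 25.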

Power and Energy
Power is an important factor for a data center, since the facility needs to be fed. Energy is of more concern from an application point of view.

It makes no sense to run an energy-efficient application in a data center where most of the power is consumed for cooling. Also, an inefficient software application can waste a great amount of energy, even on a low-power platform. One needs to develop a holistic approach to reduce power consumption (i.e., Green Computing/Green IT/…).

Power: The rate at which electricity (energy) is produced or consumed at a specific moment in time.
Energy: Power consumption accumulated over time.

Power Usage Effectiveness (PUE)
$P_{TOT}$ = Total power of the facility
$P_{IT}$ = Power consumed by IT equipment
$P_{TOT} = P_{IT} + P_{infrastructure}$
$PUE = P_{TOT} / P_{IT}$

Data Center infrastructure Efficiency (DCiE)
$DCiE = 1 / PUE = P_{IT} / P_{TOT}$
Typically $1.7 \le PUE \le 2$. The University of Illinois’ National Petascale Computing Facility is aiming at a PUE of 1.2.
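
As a sketch of how the two data-center metrics relate, the example below uses the usual definitions: PUE is total facility power divided by IT power, and DCiE is its reciprocal. The power figures in the example are made up for illustration.

```python
def pue(it_power: float, infrastructure_power: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_power <= 0 or infrastructure_power < 0:
        raise ValueError("it_power must be positive, infrastructure_power non-negative")
    return (it_power + infrastructure_power) / it_power


def dcie(it_power: float, infrastructure_power: float) -> float:
    """Data Center infrastructure Efficiency: IT equipment power / total facility power."""
    return 1.0 / pue(it_power, infrastructure_power)


# Hypothetical facility: 1000 kW of IT load plus 700 kW of cooling and distribution.
print(pue(1000.0, 700.0))   # PUE at the low end of the typical range
print(dcie(1000.0, 700.0))  # share of facility power that reaches the IT equipment
```

A PUE of 1.2 would mean only 200 kW of overhead for the same 1000 kW of IT load.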

To summarize: at the application level, the common metric is time-to-solution, with no indication of power consumption. How about energy-to-solution as a potential metric? Neither metric is satisfactory on its own: the first is not Green and the second is not high performance.

How about:
Flops/W (basis for the Green500 list)?
Energy-delay product (EDP): energy consumed by an application * runtime
Energy proportional computing: power consumed by a component scales with its utilization
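
The energy-delay product combines the two application-level views into one number. In the sketch below, the two runs and their measurements are hypothetical, chosen only to show a case where the lower-energy run is not the better one under EDP.

```python
def energy_delay_product(energy_joules: float, runtime_seconds: float) -> float:
    """EDP = energy consumed by an application * its runtime (lower is better)."""
    if energy_joules < 0 or runtime_seconds < 0:
        raise ValueError("energy and runtime must be non-negative")
    return energy_joules * runtime_seconds


# Hypothetical measurements of the same application under two configurations.
runs = {
    "fast":   {"energy_joules": 100.0, "runtime_seconds": 10.0},
    "frugal": {"energy_joules": 80.0,  "runtime_seconds": 15.0},
}

for name, measured in runs.items():
    print(name, energy_delay_product(**measured))

best = min(runs, key=lambda name: energy_delay_product(**runs[name]))
print("lowest EDP:", best)
```

Here the "frugal" run wins on energy-to-solution and the "fast" run wins on time-to-solution. EDP weighs both, and in this made-up case it favors the faster run.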

Power distribution in a high performance machine
About 50% of the power is consumed by the processor, 20% by memory, 20% by the interconnect and storage subsystem, and 10% by the other units (e.g., fans).

Power consumption at CPU level
When the frequency is reduced, the supply voltage can be reduced as well. Techniques:
Dynamic Voltage and Frequency Scaling (DVFS)
Power Gating
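
The benefit of lowering voltage together with frequency can be illustrated with the standard CMOS dynamic-power approximation, P = a * C * V^2 * f. This formula is not in the slide text and is brought in here as an assumption, along with all names and sample values.

```python
def dynamic_power(capacitance: float, voltage: float, frequency: float,
                  activity: float = 1.0) -> float:
    """Approximate dynamic power of CMOS logic: activity * C * V**2 * f.

    This is the textbook approximation, used as an assumption for this sketch.
    """
    if min(capacitance, voltage, frequency, activity) < 0:
        raise ValueError("all parameters must be non-negative")
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical operating point: 1 nF switched capacitance, 1.2 V, 2 GHz.
nominal = dynamic_power(1e-9, 1.2, 2e9)

# Frequency scaling alone versus scaling voltage and frequency together by 0.8.
frequency_only = dynamic_power(1e-9, 1.2, 0.8 * 2e9)
dvfs = dynamic_power(1e-9, 0.8 * 1.2, 0.8 * 2e9)

print(frequency_only / nominal)  # power falls linearly with frequency
print(dvfs / nominal)            # power falls with the cube of the scaling factor
```

Under this model, scaling both quantities by 0.8 cuts dynamic power to roughly half, whereas scaling frequency alone saves only 20%.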

Memory
Different power states (lowering the operating voltage), e.g., from the traditional 1.5 V to 1.35 V.

Network and Storage

Network: energy-performance trade-off
Congestion: a measure of performance
Dilation (# of hops): a measure of energy consumption

Storage
Effect of Remote Direct Memory Access (RDMA) on power consumption: RDMA-enabled networks vs. traditional communication (TCP/IP).
Replace traditional mechanical disks with solid state drives. This raises the issues of capacity and cost.

Source: Chris Johnson, University of Utah, IPDPS 2012

1 Bit = Binary Digit
8 Bits = 1 Byte
1000 Bytes = 1 Kilobyte
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Examples of scale: Mac Air disk (120 GB), company servers, supercomputers, the world.
We can store 3/4 of 1 Exabyte of data using all the trees on the entire planet.
Sources: http://www.whatsabyte.com/ and http://wiki.answers.com

Figure: data volumes in Exabytes (10^18 bytes), Feb. 2011. Labels: 295; all disk storage; all digital info; new digital info/yr; all human documents in 40,000 yrs; all spoken words in all lives; amount human minds can store in 1 yr.
Every two days we create as much data as we did from the beginning of mankind until 2003!
Sources: Lesk, Berkeley SIMS, Landauer, EMC, TechCrunch, Smart Planet

How many trees does it take to print out an Exabyte?
1 Exabyte = 1000 Petabytes, which could hold approximately 500,000,000,000,000 pages of standard printed text.
It takes one tree to produce 94,200 pages of a book.
Thus it will take 530,785,562,327 trees to store an Exabyte of data.
In 2005, there were 400,246,300,201 trees on Earth.
We can store .75 Exabytes of data using all the trees on the entire planet.
Sources: http://www.whatsabyte.com/ and http://wiki.answers.com

Brain Information Bandwidth
Source: Chris Johnson, University of Utah, IPDPS 2012