
CA 714CA Midterm Review

C5 Cache Optimization
Reduce miss penalty
–Hardware and software
Reduce miss rate
–Hardware and software
Reduce hit time
–Hardware
Complete list of techniques in Figure 5.26 on page 499

C5 AMAT
Average memory access time = hit time + miss rate * miss penalty
Useful for focusing on memory performance
–Not the overall system performance
–Leaves CPI_base out of the calculation
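As a quick illustration, here is a minimal sketch of the AMAT formula; the hit time, miss rate, and miss penalty values are made up for the example, not taken from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical values: 1-cycle hit, 5% miss rate, 50-cycle miss penalty.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=50))  # 3.5 cycles
```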

C5 CPI
CPI calculation is used for overall system performance comparison, such as the speedup of computer A over computer B
CPI = CPI_base + penalty
–CPI_base is the CPI without the special-case penalty
–Penalty is the penalty cycles per instruction

C5 CPI Example
CPI calculation example: CPI for a two-level cache. Assume a unified L2 and separate L1 instruction and data caches.
CPI = CPI_base + penalty
CPI_base depends on the program and processor.
Penalty = L1 penalty per instruction + L2 penalty per instruction
L1 penalty per instruction = data L1 penalty + instruction L1 penalty
Data L1 penalty = data L1 accesses per instruction * data L1 miss rate * data L1 miss penalty
Instruction L1 penalty = instruction L1 accesses per instruction * instruction L1 miss rate * instruction L1 miss penalty
L2 penalty per instruction = L2 accesses per instruction * L2 miss rate * L2 miss penalty
L2 accesses per instruction = instruction L1 accesses per instruction * instruction L1 miss rate + data L1 accesses per instruction * data L1 miss rate
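A minimal sketch of this calculation; all miss rates, penalties, and access counts below are hypothetical values chosen only to show the structure.

```python
def two_level_cache_cpi(cpi_base,
                        i_access, i_miss_rate, i_miss_penalty,   # L1 instruction cache
                        d_access, d_miss_rate, d_miss_penalty,   # L1 data cache
                        l2_miss_rate, l2_miss_penalty):          # unified L2
    # Per-instruction penalty from L1 instruction and data misses (serviced by L2).
    l1_penalty = (i_access * i_miss_rate * i_miss_penalty
                  + d_access * d_miss_rate * d_miss_penalty)
    # L2 is accessed once per L1 miss, whether instruction or data.
    l2_access = i_access * i_miss_rate + d_access * d_miss_rate
    l2_penalty = l2_access * l2_miss_rate * l2_miss_penalty
    return cpi_base + l1_penalty + l2_penalty

# Hypothetical numbers: 1 instruction fetch and 0.3 data accesses per instruction.
print(two_level_cache_cpi(cpi_base=1.0,
                          i_access=1.0, i_miss_rate=0.02, i_miss_penalty=10,
                          d_access=0.3, d_miss_rate=0.05, d_miss_penalty=10,
                          l2_miss_rate=0.2, l2_miss_penalty=100))  # ~2.05
```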

C5 Virtual Memory
Easier to program with virtual memory
Additional hardware is needed to translate between virtual and physical addresses
The OS is usually responsible for setting up the translation
The TLB is used to cache translation results for faster translation
[Diagram: the CPU issues a virtual address; the TLB translates it to a physical address; the cache ($) is accessed with the virtual/physical address; main memory is accessed with the physical address]
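A toy sketch of the translation flow described above; the page size, page-table contents, and miss handling are invented for illustration and are not from the slides.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

page_table = {0x10: 0x80, 0x11: 0x81}  # hypothetical virtual->physical page numbers
tlb = {}                               # cache of recent translations

def translate(virtual_address):
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    if vpn in tlb:            # TLB hit: fast path
        ppn = tlb[vpn]
    else:                     # TLB miss: consult the page table, then fill the TLB
        ppn = page_table[vpn]
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset

print(hex(translate(0x10ABC)))  # 0x80abc
```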

C5 VM
CPI calculation for memory with VM
CPI = CPI_base + penalty
Penalty = TLB miss penalty cycles per instruction
        = TLB misses per instruction * penalty cycles per TLB miss
        = TLB accesses per instruction * TLB miss rate * penalty cycles per TLB miss
TLB accesses per instruction count both data and instruction accesses.
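A minimal sketch of the TLB penalty term; the numbers below are assumed for illustration.

```python
def cpi_with_tlb(cpi_base, tlb_access_per_instr, tlb_miss_rate, tlb_miss_penalty):
    """CPI = CPI_base + TLB accesses/instr * TLB miss rate * penalty cycles per miss."""
    return cpi_base + tlb_access_per_instr * tlb_miss_rate * tlb_miss_penalty

# Hypothetical values: 1.3 TLB accesses per instruction (1 instruction + 0.3 data),
# 1% TLB miss rate, 30-cycle miss penalty.
print(cpi_with_tlb(1.0, 1.3, 0.01, 30))  # 1.39
```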

C7 Disk
Average disk access time = average seek time + average rotational delay + transfer time + controller overhead
Average rotational delay = time for the platter to rotate half a revolution
Transfer time = size of access / transfer speed
Transfer speed = rotational speed * track size
–Assuming the bits are read off continuously as the disk head passes over them
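A minimal sketch of the disk access time formula; the disk parameters below (seek time, RPM, access and track sizes, controller overhead) are invented for the example.

```python
def avg_disk_access_time(seek_ms, rpm, access_bytes, track_bytes, controller_ms):
    rotation_ms = 60_000 / rpm                   # time for one full revolution
    rotational_delay_ms = rotation_ms / 2        # on average, half a revolution
    transfer_speed = track_bytes * rpm / 60_000  # bytes per ms, continuous read
    transfer_ms = access_bytes / transfer_speed
    return seek_ms + rotational_delay_ms + transfer_ms + controller_ms

# Hypothetical disk: 5 ms seek, 10,000 RPM, 4 KB access, 512 KB track, 0.2 ms controller.
print(avg_disk_access_time(5, 10_000, 4096, 512 * 1024, 0.2))  # ~8.25 ms
```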

C7 RAID
Use small disks to build a large storage system
–Smaller disks are mass-produced and therefore cheaper
–A large number of disks results in a high failure rate
–Use redundancy to lower the failure rate
RAID 1: mirroring
RAID 3: bit-interleaved parity
RAID 4/5: block-interleaved parity, with RAID 5 distributing the parity across the disks

C7 RAID
Mean time to data loss:
MTTDL = MTTF_disk^2 / (N * (G - 1) * MTTR_disk)
N = total number of disks in the system
G = number of disks in the protected group
MTTR = mean time to repair = mean time to detection + mean time to replacement
1/MTTF_system = sum over all components of 1/MTTF_component
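A minimal sketch of these two formulas (the function and parameter names are mine, not from the slides):

```python
def mttdl(mttf_disk, n_disks, group_size, mttr_disk):
    """Mean time to data loss: MTTF_disk^2 / (N * (G - 1) * MTTR_disk)."""
    return mttf_disk**2 / (n_disks * (group_size - 1) * mttr_disk)

def system_mttf(component_mttfs):
    """1/MTTF_system = sum of 1/MTTF_component over all components."""
    return 1 / sum(1 / m for m in component_mttfs)
```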

C7 RAID MTTDL example
A mirrored (RAID 1) system with 10 GB total capacity. Individual disks are 1 GB each with a given MTTF_disk (hr). Assume an MTTR of 1 hr.
Solution
–G = 2 since each disk is mirrored
–N = 20: 10 disks for 10 GB of capacity, 10 for mirroring
–MTTDL = MTTF_disk^2 / (N * (G - 1) * MTTR_disk) = MTTF_disk^2 / (20 * (2 - 1) * 1) hr
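Plugging in a hypothetical MTTF_disk (the slide's actual number did not survive; 1,000,000 hr is assumed here purely for illustration):

```python
mttf_disk = 1_000_000   # hr -- assumed value, not from the slides
n, g, mttr = 20, 2, 1   # 20 disks, mirrored pairs, 1 hr repair time
mttdl = mttf_disk**2 / (n * (g - 1) * mttr)
print(mttdl)               # 5e10 hours
print(mttdl / (24 * 365))  # roughly 5.7 million years
```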

C7 Queuing Theory
Used to analyze system resource requirements and performance
+ More accurate than a simple extreme-case study
+ Less complicated than a full simulation study
Results are based on modeling the arrival rate with an exponential distribution

C8 Classes of connection, ordered by decreasing distance and latency
–WAN/Internet
–LAN
–SAN
–Bus

C8
Total transfer time = sender overhead + time of flight + message size / bandwidth + receiver overhead
–Latency = sender overhead + time of flight + receiver overhead
–Hard to improve latency, easy to improve bandwidth
Effective bandwidth = message size / total transfer time
–Total transfer time is dominated by latency as bandwidth increases
–Need a bigger message size to amortize the latency overhead
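A minimal sketch of these two formulas; the overheads, time of flight, message size, and link bandwidth below are invented numbers.

```python
def total_transfer_time(sender_oh, time_of_flight, msg_size, bandwidth, receiver_oh):
    return sender_oh + time_of_flight + msg_size / bandwidth + receiver_oh

def effective_bandwidth(msg_size, transfer_time):
    return msg_size / transfer_time

# Hypothetical link: 5 us overhead per side, 1 us time of flight, 1 GB/s bandwidth.
t = total_transfer_time(5e-6, 1e-6, msg_size=4096, bandwidth=1e9, receiver_oh=5e-6)
print(t)                             # ~1.5e-05 s
print(effective_bandwidth(4096, t))  # ~271 MB/s, far below the 1 GB/s raw bandwidth
```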

C6 Multiprocessor Limit
Amdahl's Law
–Speedup = 1 / [(fraction enhanced / speedup enhanced) + fraction not enhanced]
The fraction enhanced is limited by the sequential portion of the program
Communication overhead may limit the speedup of the parallel portion
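A minimal sketch of Amdahl's Law; the 90% parallel fraction and 16-processor speedup are hypothetical.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the work is sped up by a given factor."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Hypothetical: 90% of the program is parallel, run on 16 processors.
print(amdahl_speedup(0.9, 16))            # ~6.4
print(amdahl_speedup(0.9, float("inf")))  # 10.0 -- limited by the 10% sequential part
```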

C6 Scaling calculation for an application
Matrix multiplication. Take the squaring special case, A*A.
–A is a square matrix with n elements.
–p is the number of processors.
Computation scaling: A is an n^0.5 x n^0.5 matrix, so the calculation complexity is n^1.5. Dividing it among p processors gives n^1.5 / p per processor.
Communication scaling: assume A is tiled into squares with side n^0.5 / p^0.5. Each processor needs the corresponding block row and block column, roughly 2 * (n^0.5 / p^0.5) * n^0.5 ~ n / p^0.5 elements.
Computation-to-communication scaling: (n^1.5 / p) / (n / p^0.5) = n^0.5 / p^0.5
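A minimal sketch of the scaling argument; the problem sizes and processor count below are arbitrary examples.

```python
def matmul_scaling(n, p):
    """Per-processor computation and communication for tiled A*A (A has n elements)."""
    computation = n**1.5 / p       # each processor's share of the n^1.5 work
    communication = n / p**0.5     # block row + block column of A, to leading order
    return computation, communication, computation / communication

# Hypothetical sizes: quadrupling n at fixed p doubles the ratio (n^0.5 / p^0.5).
print(matmul_scaling(10_000, 16))  # (62500.0, 2500.0, 25.0)
print(matmul_scaling(40_000, 16))  # (500000.0, 10000.0, 50.0)
```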

C6 Types of multiprocessor
SISD: single instruction, single data
–5-stage pipelined RISC processor
SIMD: single instruction, multiple data
–Vector processor
MISD: multiple instruction, single data
MIMD: multiple instruction, multiple data
–Superscalar, clustered machines, VLIW

C6 Dealing with parallelism
Shared memory
–Communication between processors is implicit
Message passing
–Communication is explicit
They are equivalent in terms of functionality; one can be built on top of the other.