1 CSE 586 Computer Architecture Lecture 7
Jean-Loup Baer, CSE 586, Spring 2000

2 Highlights from last week
- Improving cache performance (c'ed from previous week)
  - Critical word first
  - Sector caches
  - Lock-up free caches
  - "Anatomy" of a cache predictor

3 Highlights from last week (c’ed)
- Main memory
  - DRAM basics
  - Interleaving
    - Low-order bits for reading consecutive words in parallel
    - "Middle" bits for banks, allowing concurrent access by several devices
  - Page-mode and SDRAMs
  - Processor-in-Memory paradigm (IRAM, Active Pages)
  - Rambus

4 Highlights from last week (c’ed)
- Virtual memory
  - Paging and segmentation
  - Page tables
  - TLB's
  - Address translation

5 From Virtual Address to Memory Location (highly abstracted; revisited)
[Figure: the ALU issues a virtual address, which is translated by the TLB; on a TLB hit the physical address indexes the cache; TLB misses and cache misses go to main memory.]

6 TLB Management
- TLB's are organized as caches
- Current trend: about 128 entries; separate TLB's for instructions and data; some part of the TLB reserved for the system. Most recent trend: a hierarchy of TLB's
- TLB's are write-back. The only thing that can change is the dirty bit, plus any other information needed by the page-replacement algorithm
- At a context switch, the virtual page translations in the TLB are not valid for the new task. Either:
  - Invalidate the TLB (set all valid bits to 0), or
  - Append a Process ID (PID) number to the tag in the TLB. When a new task takes over, the O.S. creates a new PID. (PID's are recycled, and entries corresponding to an "old" PID are invalidated.) A sketch of this scheme follows below.
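The PID-tagging scheme can be made concrete with a small sketch in C. This is a minimal illustration, not any real processor's TLB: the field widths, the fully associative organization, and the helper names are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of a PID-tagged TLB entry; field widths are illustrative. */
typedef struct {
    bool     valid;
    bool     dirty;    /* the only bit that changes while cached: TLB's are write-back */
    uint16_t pid;      /* process ID appended to the tag */
    uint64_t vpn;      /* virtual page number (the tag) */
    uint64_t pfn;      /* physical frame number */
} tlb_entry_t;

#define TLB_ENTRIES 128          /* "about 128 entries"; fully associative here for simplicity */
static tlb_entry_t tlb[TLB_ENTRIES];

/* With PID tags, no flush is needed at context switch: an entry hits only
   if both the virtual page number and the current PID match. */
static bool tlb_lookup(uint64_t vpn, uint16_t cur_pid, uint64_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].pid == cur_pid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;         /* TLB hit */
        }
    }
    return false;                /* TLB miss: walk the page table */
}

/* When a PID is recycled, the O.S. invalidates entries still tagged with it. */
static void tlb_invalidate_pid(uint16_t old_pid) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].pid == old_pid)
            tlb[i].valid = false;
}
```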

7 Paging systems -- Hardware/software Interactions
- Page tables
  - Managed by the O.S.
  - The address of the start of the page table for a given process is found in a special register which is part of the state of the process (unless the page table is "inverted", i.e., hashed on v.p.# + PID)
  - The O.S. has its own page table (unless an inverted-table scheme is used), and some pages of the O.S. are pinned in physical memory
  - The O.S. knows where the pages are stored on disk
- Page fault
  - When a program attempts to access a location in a page that is not in main memory, we have a page fault

8 Page Fault Detection (simplified)
- A page fault is detected by the hardware (invalid bit in the PTE)
- Resolving a page fault takes millions of cycles (disk I/O)
- The program that has the page fault must be interrupted; a context switch occurs
- A page fault is an exception
  - It occurs in the middle of an instruction
  - To restart the program later, the state of the program (registers, PC, special registers) must be saved
  - Instructions must be restartable; hence the need for precise exceptions

9 Page Fault Handler
Page fault exceptions are cleared by an O.S. program called the page fault handler, which will (conceptually):
- Grab a physical frame from a free list maintained by the O.S.
- Find out where the faulting page resides on disk
- Initiate a read for that page
- Choose a frame to free (if needed), i.e., run a replacement algorithm
- If the replaced frame is dirty, initiate a write of that frame to disk
- Context-switch, i.e., give the CPU to a task ready to proceed
A sketch of these steps appears below.
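The steps above can be sketched as code. Everything here is hypothetical: the types (process, frame) and helpers (frame_alloc, choose_victim, disk_read_async, schedule, ...) stand in for O.S. internals and are not a real kernel API.

```c
#include <stdint.h>
#include <stdbool.h>

struct process;                                         /* opaque task descriptor */
struct frame { bool dirty; uint64_t vpn, disk_addr; struct process *owner; };

struct frame *frame_alloc(void);                        /* pop the O.S. free list */
struct frame *choose_victim(void);                      /* run the replacement algorithm */
void disk_write(struct frame *f, uint64_t disk_addr);   /* write back a dirty frame */
void invalidate_pte(struct process *p, uint64_t vpn);
uint64_t find_on_disk(struct process *p, uint64_t vpn);
void disk_read_async(struct frame *f, uint64_t disk_addr);
void mark_waiting(struct process *p, uint64_t vpn, struct frame *f);
void schedule(void);

void page_fault_handler(struct process *p, uint64_t faulting_vpn) {
    struct frame *f = frame_alloc();          /* grab a frame from the free list */
    if (f == NULL) {
        f = choose_victim();                  /* choose a frame to free if none available */
        if (f->dirty)
            disk_write(f, f->disk_addr);      /* initiate write-back of the dirty frame */
        invalidate_pte(f->owner, f->vpn);     /* the old mapping is no longer valid */
    }
    uint64_t where = find_on_disk(p, faulting_vpn);  /* locate the faulting page on disk */
    disk_read_async(f, where);                /* initiate the read: millions of cycles */
    mark_waiting(p, faulting_vpn, f);         /* PTE made valid at the I/O-completion interrupt */
    schedule();                               /* context-switch to a task ready to proceed */
}
```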

10 Completion of Page Fault
When the faulting page has been read from disk (a few ms later):
- The disk controller raises an interrupt
- The O.S. takes over (context switch) and modifies the PTE (in particular, making it valid)
- The program that had the page fault is put on the queue of tasks ready to run

11 Speeding up L1 Access
- The cache can be (speculatively) accessed in parallel with the TLB if its indexing bits are not changed by the virtual-to-physical translation
- Cache access (for reads) becomes:
  - Cycle 1: access the TLB and the L1 cache (read data at the given index)
  - Cycle 2: compare tags and, on a hit, send the data to the register

12 Virtually Addressed Cache
[Figure: the virtual address is split into page number / offset on one side and tag / index / displacement on the other; step 1 accesses the TLB (yielding the PTE) and the cache (yielding tag + data) in parallel; step 2 compares the physical tag from the PTE with the cache tag.]

13 Virtually Addressed, Physically Tagged Caches
- Can be done for a small L1, i.e., capacity <= (page size * associativity)
- Can be done for larger caches if the O.S. does a form of page coloring such that the "index" bits are the same for synonyms
- The bits of the virtual address that are part of the index are kept in the L2 tag, to be checked on L1 misses
- Can also be done more generally (e.g., using "pointers" between L1 and L2)
A small check of the capacity constraint is sketched below.
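A minimal sketch of the capacity constraint; the example cache sizes are illustrative assumptions, not from the slides.

```c
#include <stdio.h>
#include <stdbool.h>

/* Virtually indexed, physically tagged L1 is safe when all index bits
   fall inside the page offset, i.e. capacity <= page_size * associativity. */
static bool vipt_safe(unsigned capacity, unsigned page_size, unsigned assoc) {
    return capacity <= page_size * assoc;
}

int main(void) {
    /* 16 KB 4-way with 4 KB pages: 16K <= 4K * 4, index bits all in the offset */
    printf("%d\n", vipt_safe(16 * 1024, 4 * 1024, 4));  /* prints 1 */
    /* 32 KB 2-way with 4 KB pages: 32K > 4K * 2, needs O.S. page coloring */
    printf("%d\n", vipt_safe(32 * 1024, 4 * 1024, 2));  /* prints 0 */
    return 0;
}
```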

14 Synonyms
[Figure: virtual page x of process A and virtual page y of process B map to the same physical page; if the index bits drawn from the v.p.# differ, the two map to synonyms in the cache. To avoid synonyms, the O.S. or the hardware enforces that these bits be the same.]

15 Virtually Addressed, Virtually Tagged Caches
- No easy fix for the synonym problem
- Need to flush the cache at context-switch time
  - A total flush can be avoided with PIDs
- Used mostly for I-caches
- Still need to look at the TLB (why? see a few slides ahead)

16 Choosing a Page Size
- A large page size gives:
  - A smaller page table
  - Better coverage in the TLB
  - Amortization of the time to bring a page in from disk
  - Fast access to a larger L1 (virtually addressed, physically tagged)
- BUT: more fragmentation (unused portions of the page) and longer start-up time

17 Allowing Multiple Page Sizes
- Recent micros support multiple page sizes; better use of the TLB is the main reason
- If there are only two page sizes (one basic, one large), a single bit in the PTE is sufficient (with don't-care masks)
- If there are more than two sizes, a field in the PTE indicates the size (a multiple of the basic size) of the page
- Physical frames of the various sizes need to be contiguous
- Heuristics are needed to coalesce/split pages dynamically

18 Protection
- Protect user processes from each other
  - Disallow an unauthorized process from accessing (reading, writing, executing) part of the address space of another process
- Disallow the use of some instructions by user processes
  - E.g., disallow process A from changing the access rights of process B
- This leads to:
  - Kernel mode (operating system) vs. user mode
  - Only kernel mode can use certain privileged instructions (e.g., disabling interrupts, I/O functions)
  - System calls whereby the CPU can switch between user and kernel mode

19 Where to Put Access Rights
- Associate protection bits with the PTE
  - E.g., disallow a user to read/write/execute in kernel space, or to read someone else's program (hence the TLB check even for virtually addressed I-caches)
  - E.g., allow one process to share-read some data with another, but not to modify it, etc.
- The user/kernel protection model can be extended with rings of protection
  - E.g., the Pentium has 4 levels of protection
- Further extension leads to the concept of capabilities: access rights to objects are passed from one process to another
A sketch of a PTE-bit permission check follows.
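A sketch of how such protection bits might be checked. The bit layout and names (PTE_READ, PTE_KERNEL, ...) are invented for illustration; real architectures differ.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical PTE protection-bit layout. */
#define PTE_READ   (1u << 0)
#define PTE_WRITE  (1u << 1)
#define PTE_EXEC   (1u << 2)
#define PTE_KERNEL (1u << 3)   /* page accessible in kernel mode only */

typedef enum { ACC_READ = PTE_READ, ACC_WRITE = PTE_WRITE, ACC_EXEC = PTE_EXEC } access_t;

/* Performed alongside translation (e.g., during the TLB check). */
static bool access_allowed(uint32_t pte_bits, access_t acc, bool kernel_mode) {
    if ((pte_bits & PTE_KERNEL) && !kernel_mode)
        return false;              /* user touching kernel space: protection fault */
    return (pte_bits & acc) != 0;  /* e.g., share-read data: READ set, WRITE clear */
}
```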

20 I/O and Caches (primer for cache coherence)
[Figure: processor, cache, main memory, and disk. The "green" page is transferred between disk and memory; the red "line" of that page is mapped in the cache.]

21 I/O Passes through the Cache
- I/O data passes through the cache on its way from/to memory (the previous figure does not reflect this possibility)
- No coherency problem, but contention for cache access between the processor and the I/O
- Wipes out the entire page area of the cache (i.e., replaces useful lines from other pages that conflict with the page being transferred)
- Implemented in Amdahl machines (circa 1980) without on-chip caches

22 I/O and Caches – Use of the O.S.
- I/O interacts directly with main memory
- Software solution
  - Output (memory to disk): (1) write-through cache: no problem, since the correct info is in memory; (2) write-back cache: purge the cache of all dirty lines in the page via O.S. interaction (the ISA needs a purge instruction)
  - Input (WT and WB): use a non-cacheable buffer space; then, after the input, flush the cache of these addresses and make the buffer cacheable
A sketch of these O.S. actions is given below.
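A sketch of the software solution in C. The line size and the two cache-maintenance operations (cache_purge_line, cache_invalidate_line) stand in for the ISA's purge/invalidate instructions and are assumptions.

```c
/* Hypothetical ISA cache-maintenance operations. */
#define LINE_SIZE 32u

void cache_purge_line(void *addr);        /* write back if dirty, then invalidate */
void cache_invalidate_line(void *addr);

/* Output (write-back cache): purge every line of the page before the disk
   reads it from memory, so memory holds the up-to-date data. */
void prepare_output(void *page, unsigned page_size) {
    for (unsigned off = 0; off < page_size; off += LINE_SIZE)
        cache_purge_line((char *)page + off);
}

/* Input (WT and WB): after the disk has written the buffer, invalidate any
   stale lines covering it before making the buffer cacheable again. */
void finish_input(void *buf, unsigned len) {
    for (unsigned off = 0; off < len; off += LINE_SIZE)
        cache_invalidate_line((char *)buf + off);
}
```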

23 Software solution: Output
- WT case: the contents of the "red" line are already in memory; no problem
- WB case: if the "red" line is dirty, first write it back to memory. It might also be wise to invalidate all cache entries belonging to the page

24 I/O Consistency -- Hardware Approach
- A subset of the shared-bus multiprocessor cache coherence protocol
- The cache has a duplicate set of tags
  - Allows concurrent checking of the bus and servicing of requests from the processor
- The cache controller snoops on the bus
  - On input, if there is a tag match, store the data in both the cache and main memory
  - On output, if there is a tag match: if WT, invalidate the entry; if WB, take the cache entry instead of the memory contents and invalidate

25 Hardware Solution: Output
The cache controller snoops on the bus. When it's time to transfer the red line, its contents come from the cache and the line is invalidated.

26 Hardware Solution: Input
The cache controller snoops on the bus. If there is a match in tags, then the data is stored both in the cache and in main memory.

27 Input-Output
- I/O is very much architecture dependent
- I/O requires cooperation between:
  - The processor, which issues I/O commands (read, write, etc.)
  - Buses, which provide the interconnection between processor, memory, and I/O devices
  - I/O controllers, which handle the specifics of controlling and interfacing each device
  - Devices, which store data or signal events

28 Basic (very simplified) I/O architecture
We concentrate first on the disk and disk controller.
[Figure: CPU and cache attached to a bus, along with a memory controller (main memory), a disk controller (disks), and a network controller (network).]

29 Basic Disk Terminology
[Figure: disk structure showing the platters, disk surfaces, tracks, sectors, a cylinder, and the read-write heads.]

30 Disk Physical Characteristics
- Platters (1 to 20) with diameters from 1.3 to 8 inches (recording on both sides)
- Tracks (2,500 to 5,000 tracks/inch)
- Cylinders (all the tracks in the same position on the platters)
- Sectors (many per track, with gaps and sector-related info between them; a typical sector is 512 bytes)
- Current trend: constant bit density (about 10^5 bits/inch), i.e., more info (sectors) on the outer tracks

31 Example: Seagate Barracuda
- A disk for servers
- 10 disks, hence 20 surfaces
- 7,500 cylinders, hence 7,500 * 20 = 150,000 tracks
- 237 sectors/track (on average)
- 512 bytes/sector
- Total capacity = 150,000 * 237 * 512 bytes = 18 GB

32 Disk Access Time
Arm(s) with a read/write head. Four components in an access:
- Seek time (to move the arm to the right cylinder)
- Rotation time (on average, half a rotation). At 3,600 RPM, that is 8.3 ms. Current disks spin at 3,600, 5,400, 7,200, or even 10,000 RPM
- Transfer time: depends on rotation time, the amount to transfer (minimum size: a sector), recording density, and the disk/memory connection
- Disk controller time: overhead to perform an access

33 Seek Time
- From 0 (if the arm is already positioned) to a maximum of 15-20 ms
- Not a linear function of distance (speed-up + coast + slow-down + settle)
- Even when reading tracks on the same cylinder, there is a minimal seek time (due to tight tolerances for head positioning)
- Barracuda example: average seek time = 8 ms; track-to-track seek = 1 ms; full-disk seek = 17 ms

34 Rotation and Transfer Times
- Rotation time
  - Barracuda example: 7,200 RPM, thus the average (half rotation) is 4.17 ms
- Transfer time
  - Depends on rotation time, the amount to transfer (minimum size: a sector), recording density, and the disk/memory connection
  - Today, transfers run at 2 to 16 MB/second

35 Disk Controller Overhead
- The disk controller contains a microprocessor + buffer memory + possibly a cache (for disk sectors)
- Overhead to perform an access (on the order of 1 ms):
  - Receiving orders from the CPU and interpreting them
  - Managing the transfer between disk and memory (e.g., managing the DMA)
- The transfer rate between disk and controller is smaller than between controller and memory, hence:
  - The need for a buffer in the controller
  - This buffer might take the form of a cache (mostly for read-ahead)
A worked example combining the four access-time components follows.
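Putting the four components together with the Barracuda numbers quoted in these slides; the 10 MB/s transfer rate is an assumption chosen from the 2-16 MB/s range given above.

```c
#include <stdio.h>

int main(void) {
    double seek_ms       = 8.0;                      /* average seek (Barracuda) */
    double rotation_ms   = 0.5 * 60000.0 / 7200.0;   /* half a rotation at 7,200 RPM = 4.17 ms */
    double transfer_ms   = 512.0 / (10.0 * 1024 * 1024) * 1000.0;  /* one 512-byte sector at 10 MB/s */
    double controller_ms = 1.0;                      /* controller overhead, ~1 ms */

    /* Average access = seek + rotation + transfer + controller, about 13.2 ms:
       dominated by seek and rotation, not by the transfer itself. */
    printf("average access = %.2f ms\n",
           seek_ms + rotation_ms + transfer_ms + controller_ms);
    return 0;
}
```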

36 Improvements in Disks
- Capacity (via density): same growth rate as DRAMs; price decreases have followed (today $10-$50/GB?)
- Access times have decreased, but not enormously
  - Higher density -> smaller drives -> smaller seek time
  - RPM has increased from 3,600 to 7,200 and beyond
  - Transfer time has improved (but is still slow compared to bus transfer times)
- CPU speed vs. DRAM access time is one "memory wall"; DRAM access time vs. disk access time is a "memory gap"
- Technologies to fill the gap have not succeeded (currently the most promising is more DRAM, backed up by batteries)

37 Connecting CPU, Memory and I/O
[Figure: the CPU and cache sit on a CPU-memory bus together with main memory; a bus adapter connects it to an I/O bus carrying I/O controllers for disk, graphics, and network.]

38 Buses
- The simplest interconnect
  - Low cost: a set of shared wires
  - Easy to add devices (although the variety of devices might make the design more complex or less efficient -- a longer bus and more electrical load; hence the distinction between I/O buses and CPU/memory buses)
  - But a bus is a single shared resource, so it can get saturated (both physically, because of electrical load, and performance-wise, because of contention for access)
- Key parameters:
  - Width (number of lines: data, addresses, control)
  - Speed (limited by length and electrical load)

39 Memory and I/O Buses
- CPU/memory bus: tailored to the particular CPU
  - Fast (separate address and data lines)
  - Often short and hence synchronous (governed by a clock)
  - Wide (up to 256 bits)
  - Expensive
- I/O bus: follows some standard, so many types of devices can be hooked onto it
  - Asynchronous (hand-shaking protocol)
  - Narrower

40 Bus Transactions
- A transaction consists of arbitration and commands
  - Arbitration: who gets control of the bus
  - Commands: the type of transaction (read, write, ack, etc.)
- Read, Write, Atomic Read-Modify-Write (atomic swap)
  - Read: send the address; the data is returned
  - Write: send the address and the data
  - Read-Modify-Write: keep the bus during the whole transaction. Used for synchronization between processes (but there are more clever ways of doing this today; see the multiprocessor lecture to come)

41 Bus Arbitration
- Arbitration: who gets the bus if several requests occur at the same time
  - Only one master (the processor): centralized arbitration
  - Multiple masters (the most common case): centralized arbitration (FIFO, daisy-chain, round-robin, or a combination of these) vs. decentralized arbitration (each device knows its own priority)
- Communication protocol between master and slave
  - Synchronous (for short buses with no clock skew, i.e., CPU/memory)
  - Asynchronous (hand-shaking finite-state machine; easier to accommodate many devices)

42 Split-transaction Buses (a little more detail)
- Split a read transaction into:
  - Send address (CPU is master)
  - Send data (memory is master)
- In between these two transactions (the memory access time), the bus is freed
- Requires "tagging" the transaction with the ids of the sender/receiver
- Can get even more concurrency by having different transactions use the data and address lines concurrently
- Useful for multiprocessor systems and for I/O

43 SGI Challenge Bus
- 256-bit data bus and 40-bit address bus
- Split-transaction
- 1.2 GB/s sustained transfer rate; 9.5 million transactions per second
- Synchronous: 47.6 MHz (21 ns per cycle)
- Can support 36 MIPS R4400 processors, 16 GB of memory, and 4 I/O bus adapters

44 I/O Hardware-software Interface
- I/O is best left to the O.S. (for protection and scheduling in particular)
- The O.S. provides routines that handle devices (or controllers)
- The CPU must be able to:
  - Tell a device what it wants done (e.g., read, write, etc.)
  - Start the operation (or tell the device controller to start it)
  - Find out when the operation has completed (with or without error)
- There is no unique way to do all this; it depends on the ISA and the I/O architecture

45 I/O Operations
- Specific I/O instructions
  - The I/O instruction specifies both the device number and a command (or an address where the I/O device can find a series of commands)
  - Example: Intel x86 (IN and OUT between the EAX register and an I/O port whose address is either an immediate or in the DX register)
- Memory-mapped I/O
  - Portions of the address space are devoted to I/O devices (reads/writes to these addresses transfer data or control the I/O devices)
  - Memory ignores these addresses
A sketch of memory-mapped device access follows.
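A minimal sketch of memory-mapped I/O. The device addresses, register layout, and ready bit are invented for illustration; the point is that loads and stores to these addresses reach the device rather than memory, and volatile keeps the compiler from caching or reordering the accesses.

```c
#include <stdint.h>

/* Hypothetical device registers mapped at fixed physical addresses. */
#define DEV_STATUS ((volatile uint32_t *)0xFFFF0000u)
#define DEV_DATA   ((volatile uint32_t *)0xFFFF0004u)
#define STATUS_READY 0x1u

static void mmio_putc(uint32_t c) {
    while ((*DEV_STATUS & STATUS_READY) == 0)
        ;              /* a load from this address reads the device's status register */
    *DEV_DATA = c;     /* a store to this address goes to the device, not to memory */
}
```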

46 I/O Termination
Two techniques to know when an I/O operation terminates: polling and interrupts.
- Polling: the CPU repeatedly checks whether a device has completed
  - Used for "slow" devices such as the mouse (polled 30 times a second)
  - Negligible contention for CPU resources. E.g., if polling takes 100 instructions and each instruction has an average CPI of 2, the overhead is 200 * 30 = 6K cycles/sec. On a 300 MHz machine, this is 2 thousandths of 1% (the arithmetic is spelled out below)
- Interrupts: when the I/O completes, it generates an (I/O) interrupt
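The polling arithmetic from the slide, spelled out; all numbers are the slide's own (100 instructions per poll, CPI of 2, 30 polls/s, 300 MHz clock).

```c
#include <stdio.h>

int main(void) {
    double cycles_per_poll = 100 * 2;                 /* 100 instructions at CPI 2 = 200 cycles */
    double cycles_per_sec  = cycles_per_poll * 30;    /* 30 polls/s -> 6,000 cycles/s */
    double fraction        = cycles_per_sec / 300e6;  /* of a 300 MHz machine */
    printf("%.5f%% of the CPU\n", fraction * 100);    /* 0.00200%, i.e., 2 thousandths of 1% */
    return 0;
}
```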

47 I/O Interrupts
- An interrupt is like an exception, but it occurs asynchronously as a consequence of an external stimulus (I/O, power failure, etc.)
- The presence of an interrupt is checked on every cycle
- Upon an interrupt, the O.S. takes over (context switch)
- Two basic schemes to handle the interrupt:
  - Vectored interrupts: the O.S. is told (by the hardware) where to handle the interrupt
  - Use of a cause register: the O.S. has to examine the contents of that register to transfer to the appropriate handler

48 DMA
- Having long blocks of I/O go through the processor via load-store is totally inefficient
- DMA (direct memory access) controller:
  - A specialized processor for transferring blocks between memory and I/O devices without intervention from the CPU (except at the beginning and the end)
  - Has registers, set up by the CPU, for the beginning memory address and the count
  - Interrupts the CPU at the end of the transfer
  - Is a master for the bus
A sketch of programming such a controller follows.
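A sketch of how the CPU might set up such a controller. The register block, its address, and the control bits are made up for illustration, not a real DMA device.

```c
#include <stdint.h>

/* Hypothetical DMA controller register block, memory-mapped. */
typedef struct {
    volatile uint32_t mem_addr;   /* beginning memory address, set by the CPU */
    volatile uint32_t count;      /* number of bytes to transfer */
    volatile uint32_t control;    /* bit 0 = start, bit 1 = direction (to memory) */
} dma_regs_t;

#define DMA ((dma_regs_t *)0xFFFF1000u)   /* assumed MMIO location */
#define DMA_START  (1u << 0)
#define DMA_TO_MEM (1u << 1)

static void dma_start_read(uint32_t phys_addr, uint32_t nbytes) {
    DMA->mem_addr = phys_addr;               /* CPU sets up the registers ... */
    DMA->count    = nbytes;
    DMA->control  = DMA_START | DMA_TO_MEM;  /* ... then the DMA engine masters the bus */
    /* The CPU is now free; the controller interrupts it at the end of the transfer. */
}
```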

49 DMA and Virtual Memory
- What if the block to transfer is larger than one page?
  - Address translation registers within the DMA device
- What if the O.S. replaces a page where the transfer is taking place?
  - Pages are "pinned" (locked) during the transfer
- More complex DMAs become I/O channels

50 Benchmarking (I/O intensive?)
- TPC-A and TPC-B
  - Simulate debit/credit transactions
  - Random reads/writes (100-byte records) with occasional sequential writes
  - Measure the database software as well as the disk I/O subsystem and memory hierarchy
- TPC-C: more complex query processing
- TPC-D (decision support system – data mining)
- SPEC SFS (synthetic benchmark)

51 Disk Arrays
- Reliability: is anything broken?
- Availability: is the system still usable?
  - Availability can be improved by adding more hardware (e.g., ECC, disk arrays) that provides some redundancy
- For I/O, the simplest redundant system is mirroring: write all data on two disks
  - Cost: double the amount of hardware
  - Performance: no increase (in fact, writes might be slower, since they must wait for the longer of the two writes to complete)

52 RAIDs (Redundant Arrays of Inexpensive Disks)
- Concept of striping: data written consecutively across N disks
  - Performance-wise: no improvement in latency, but improvement in throughput (parallelism)
  - But now the probability of failure is greater, so add disks (redundant arrays of inexpensive disks)
- Mirroring = RAID 1
- RAID 3: one disk holds the "sum" (parity) of the info on the other disks
  - Data is interleaved at the bit level
  - Requires reading/writing all the disks, even for "small" reads
  - Writes imply: read old data, merge with new, write all the data

53 RAID 4 and 5
- RAID 4: interleave data at the sector level
  - Correctness is checked at the sector level
  - Still uses an extra disk, as in bit-level interleaving, for correction
  - Writes can be done using only the sector of the disk to be written and the redundant disk (only one write at a time, and a "write" is in fact 2 reads (old data, old parity) and 2 writes (new data, new parity))
- RAID 5: interleave the parity sectors across the disks
  - Allows parallelism for small writes
The small-write parity update is sketched below.
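A sketch of the RAID 4/5 small-write parity update. The function is illustrative; it shows why a small write costs exactly two reads (old data, old parity) and two writes (new data, new parity), with no need to read the other data disks.

```c
#include <stddef.h>
#include <stdint.h>

/* new_parity = old_parity XOR old_data XOR new_data: XOR-ing out the old
   data and XOR-ing in the new keeps the parity consistent across the stripe. */
void raid_small_write(const uint8_t *new_data, const uint8_t *old_data,
                      uint8_t *parity, size_t n) {
    for (size_t i = 0; i < n; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
    /* Then write new_data to the data disk and parity to the parity disk. */
}
```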

54 RAID 5
[Figure: disks 0 through n, with data sectors and parity sectors rotated across the disks.]

