(Superficial!) Review of Uniprocessor Architecture
Parallel Architectures and Related Concepts
CS 433
Laxmikant Kale
University of Illinois at Urbana-Champaign
Department of Computer Science
Early machines
  We will present a series of idealized and simplified models
    – Read more about the real models in architecture textbooks (official prerequisites: CS 232, CS 333)
    – The idea here is to review the concepts and define our vocabulary
  [Figure: a processor connected to a memory with locations 0, 1, ..., k]
Early machines
  Early machines: complex instruction sets and (let's say) no registers
    – The processor can access any memory location equally fast
    – Instructions:
      Operations: Add L1, L2, L3 (add the contents of location L1 to those of location L2, and store the result in L3)
      Branching: Branch to L4 (note that some locations store program instructions)
      Conditional branching: If (L1 > L2) goto L3
  [Figure: a processor connected to a memory with locations 0, 1, ..., k]
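The three instruction kinds on this slide can be tried out with a small interpreter. This is only a sketch under stated assumptions: the tuple encoding, the `halt` instruction, and the names `run` and `SUM_PROGRAM` are hypothetical, and the program is kept in a separate list rather than in memory locations as a real machine of this kind would do.

```python
# Toy memory-to-memory machine (illustrative; the encoding is an assumption).
def run(program, memory, max_steps=1000):
    """Execute a list of instruction tuples against `memory` (a list).

    ("add", a, b, dst)          memory[dst] = memory[a] + memory[b]
    ("branch", target)          jump to instruction index `target`
    ("branch_gt", a, b, target) jump to `target` if memory[a] > memory[b]
    ("halt",)                   stop (hypothetical, added for convenience)
    """
    pc = 0      # index of the next instruction
    steps = 0   # guard against accidental infinite loops
    while pc < len(program) and steps < max_steps:
        op = program[pc]
        steps += 1
        if op[0] == "add":
            _, a, b, dst = op
            memory[dst] = memory[a] + memory[b]
            pc += 1
        elif op[0] == "branch":
            pc = op[1]
        elif op[0] == "branch_gt":
            _, a, b, target = op
            pc = target if memory[a] > memory[b] else pc + 1
        elif op[0] == "halt":
            break
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    return memory


# Memory layout for the example: [i, one, limit, total].
SUM_PROGRAM = [
    ("add", 0, 1, 0),        # i = i + one
    ("add", 3, 0, 3),        # total = total + i
    ("branch_gt", 2, 0, 0),  # if limit > i, go back to instruction 0
    ("halt",),
]
```

Running `run(SUM_PROGRAM, [0, 1, 5, 0])` adds the integers 1 through 5 into the last location. Every operand is a memory location, which is exactly the property the later slides move away from.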
Registers
  Processors are faster than memory
    – They can operate on data held inside the processor much faster
  So, create some locations in the processor for storing data
    – Called registers, often with a special register called the accumulator
  Now we need new instructions for dealing with data in registers:
    – Data movement instructions
      Move from register to memory, memory to register, register to register, and memory to memory
    – Computation instructions
      In addition to the previous ones, we now add instructions that allow one or more operands to be registers
  [Figure: a processor containing CPU registers, connected to memory]
Load-Store architectures (RISC)
  Do not allow memory locations to be operands
    – For computation as well as control instructions
  The only instructions that reference memory are:
    – Load R, L   # move contents of L into register R
    – Store R, L  # move contents of register R into memory location L
  Notice that the number of instructions is now dramatically reduced
    – Further, allow only relatively simple instructions to do register-to-register operations
    – More complex operations are implemented in software
    – The compiler has a bigger responsibility now
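As a rough illustration of the compiler's added responsibility, the sketch below rewrites a memory-to-memory `Add L1, L2, L3` into a load-store sequence and executes it. The register names `R1` to `R3`, the tuple encoding, and the function names `lower_mem_add` and `execute` are assumptions made for this example, not part of any real instruction set.

```python
def lower_mem_add(a, b, dst):
    """Rewrite memory-to-memory `Add a, b, dst` as load-store instructions."""
    return [
        ("load", "R1", a),          # R1 <- memory[a]
        ("load", "R2", b),          # R2 <- memory[b]
        ("add", "R3", "R1", "R2"),  # R3 <- R1 + R2 (registers only)
        ("store", "R3", dst),       # memory[dst] <- R3
    ]


def execute(instructions, memory):
    """Run load-store instructions; only load and store touch `memory`."""
    regs = {}
    for ins in instructions:
        if ins[0] == "load":
            _, reg, loc = ins
            regs[reg] = memory[loc]
        elif ins[0] == "store":
            _, reg, loc = ins
            memory[loc] = regs[reg]
        elif ins[0] == "add":
            _, rd, ra, rb = ins
            regs[rd] = regs[ra] + regs[rb]
        else:
            raise ValueError(f"unknown instruction: {ins!r}")
    return memory
```

One complex instruction becomes four simple ones, three of which reference memory. The instruction set is smaller, but the instruction count of the program grows.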
Caches
  The processor still has to wait for data from memory
    – i.e., Load and Store instructions are slower
    – Although more often the CPU is executing register-only instructions
    – Load and store latency
      Dictionary meaning: latency is the delay between stimulus and response
      Or: the delay between a data-transfer instruction and the beginning of the data transfer
  But faster SRAM memory is available (although expensive)
  Idea: just like registers, put some more of the data in faster memory
    – Which data?
    – Principle of locality (an empirical observation): the data accessed correlates with past accesses, spatially and temporally
      Without this, caches would be worthless (unless most of the data fits in the cache)
Caches
  The processor still issues load and store instructions as before, but the cache controller intercepts the requests and, if the location has been cached, serves them from the cache
  Data transfer between cache and memory is not seen by the processor
  [Figure: processor, cache with its cache controller, and memory]
Cache Issues
  Level 2 cache
  Cache lines
    – Bring a bunch of data “at once”: exploit spatial locality
      Block transfers are faster
      – byte cache lines typical
    – Trade-off: why larger and larger cache lines aren’t good either
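The line-size trade-off can be made concrete by counting how many lines an access pattern touches and how many bytes that moves. This sketch assumes a cache large enough that only first-touch (cold) misses occur. The function name `cold_miss_traffic` and the line sizes used below (8 and 64 bytes) are illustrative choices, not figures from the slide.

```python
def cold_miss_traffic(addresses, line_size):
    """Return (misses, bytes_transferred) for a stream of byte addresses.

    Assumes the cache never evicts anything, so each distinct line is
    fetched exactly once and every fetch moves `line_size` bytes.
    """
    lines_touched = {addr // line_size for addr in addresses}
    misses = len(lines_touched)
    return misses, misses * line_size


# Sequential scan of 1024 bytes: good spatial locality.
SEQUENTIAL = range(1024)
# One byte out of every 64: poor spatial locality.
STRIDED = range(0, 1024, 64)
```

For the sequential scan, 64-byte lines cut the misses from 128 to 16 compared with 8-byte lines while moving the same 1024 bytes. For the strided pattern, both line sizes miss 16 times, but the 64-byte lines move 1024 bytes to deliver 16 useful ones, where 8-byte lines move only 128.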
Cache blocks and Cache Lines
  A cache block is a physical part of the cache.
  A cache line is a section of the address space.
  A line is brought into a cache block.
  Of course, line size and block size are the same.
  [Figure: processor, cache with cache controller, and memory; labels: L1, block]
Cache Management
  How is the cache managed?
    – Its job: given an address, find whether it is in the cache, and return the contents if so. Also, write data back to memory when needed and bring data in from memory when needed
    – Ideally, a fully associative cache would be good
      Keep cache lines anywhere in the physical cache
      But looking up is hard
Cache management
  Alternative scheme:
    – Each cache line (i.e., address) has exactly one place in the cache memory where it can be stored
    – Of course, more than one cache line will have the same area of cache memory as its possible target
      Why?
    – Only one cache line can live inside a cache block at a time
    – If you want to bring in a new one, the old one must be “emptied”
  A trade-off: set-associative caches
    – Have each line map to more than one (say 4) physical locations
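The difference between the one-place-per-line scheme and a set-associative cache shows up clearly in a toy model. The sketch below tracks only which lines are resident, not their data. The class name `SetAssociativeCache`, the modulo mapping from line number to set, and the least-recently-used replacement policy are assumptions of this example; the slides do not specify a replacement policy.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy cache: ways=1 is direct-mapped, ways=num_blocks is fully associative."""

    def __init__(self, num_blocks, ways, line_size):
        if num_blocks % ways != 0:
            raise ValueError("num_blocks must be a multiple of ways")
        self.line_size = line_size
        self.ways = ways
        self.num_sets = num_blocks // ways
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Record one access; return True on a hit, False on a miss."""
        line = address // self.line_size          # which line of the address space
        resident = self.sets[line % self.num_sets]  # the set this line maps to
        if line in resident:
            resident.move_to_end(line)            # mark as most recently used
            self.hits += 1
            return True
        if len(resident) == self.ways:
            resident.popitem(last=False)          # evict least recently used line
        resident[line] = True
        self.misses += 1
        return False


def run_trace(cache, addresses):
    """Feed a sequence of addresses to `cache`; return (hits, misses)."""
    for address in addresses:
        cache.access(address)
    return cache.hits, cache.misses
```

With 4 blocks of 16 bytes, addresses 0 and 64 fall in lines 0 and 4, which map to the same block when `ways=1`. Alternating between them misses every time in the direct-mapped configuration, while `ways=2` lets both lines stay resident after the first two misses.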
Parallel Machines: an abstract introduction
  Our main focus will be on three kinds of machines
    – Bus-based shared memory machines
    – Scalable shared memory machines
      Cache coherent
      Hardware support for remote memory access
    – Distributed memory machines
Bus-based machines
  [Figure: processing elements PE0, PE1, ..., PE N-1 and memory modules Mem0, Mem1, ..., Memk]
Bus-based machines
  Any processor can access any memory location
    – Read and write
  Bus bandwidth is a limiting factor
  Also, how do you deal with two processors changing the same data?
    – Locks (more on this later)
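A minimal sketch of the lock idea, using Python threads as stand-ins for processors that share memory. The function name `locked_increment` and its parameters are hypothetical, and `threading.Lock` is a software lock from the Python standard library, not the hardware mechanism a bus-based machine would provide.

```python
import threading


def locked_increment(num_threads=4, iterations=10_000):
    """Have several threads increment one shared counter under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(iterations):
            # Only one thread at a time may run the read-modify-write below.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because the increment is a read followed by a write, two processors doing it at once can lose an update. Holding the lock around the pair makes the final count equal to `num_threads * iterations`.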
Scalable shared memory m/cs
  [Figure: PE0 and Mem0 attached to an interconnection network with support for remote memory access]
  Not popular, as all data is slow to access
Distributed memory m/cs
  [Figure: pairs PE0/Mem0, PE1/Mem1, ..., PEp/Memp attached to an interconnection network]
Introducing caches into the picture!
  Now we have more complex problems:
    – They can’t be fixed by locks alone
    – Copies of the same variable in two different caches may contain different values
  The cache controller must do more
  [Figure: PE0, PE1, ..., PE p-1 and Mem0, Mem1, ..., Mem p-1, with a cache]
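The stale-copy problem can be reproduced in a few lines. In this sketch each processing element has a private cache in front of a shared memory. The class `PrivateCache`, the write-through policy, and the invalidate-on-write fix are assumptions chosen for illustration; the slide only says the cache controller must do more and does not name a protocol.

```python
class PrivateCache:
    """Private cache in front of a shared memory (a plain dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> locally cached value

    def read(self, address):
        # On a miss, fetch from memory; on a hit, trust the local copy.
        if address not in self.lines:
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def write(self, address, value, peers=()):
        # Write-through: update the local copy and memory together.
        self.lines[address] = value
        self.memory[address] = value
        # The extra work: drop the copies held by other caches.
        for peer in peers:
            peer.lines.pop(address, None)


def stale_read_demo(coherent):
    """Return what PE1 reads after PE0 writes x = 1."""
    memory = {"x": 0}
    cache0, cache1 = PrivateCache(memory), PrivateCache(memory)
    cache1.read("x")  # PE1 caches x = 0
    cache0.write("x", 1, peers=[cache1] if coherent else [])
    return cache1.read("x")
```

Without invalidation, PE1 keeps reading its old copy even though memory holds the new value, and no lock around the write changes that. With invalidation, PE1 misses on its next read and fetches the updated value.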
Distributed memory m/cs
  [Figure: PE0, PE1, ..., PEp-1, each with its own cache and memory (Mem0, Mem1, ..., Memp-1), attached to an interconnection network]
Writing parallel programs
  Programming model
    – How should a programmer view the parallel machine?
    – Sequential programming: von Neumann model
  Parallel programming models:
    – Shared memory (shared address space) model
    – Message passing model
    – Shared objects model
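To give a feel for the message passing model, the sketch below sums a list by sending each worker its own chunk and collecting partial sums as messages. Threads and `queue.Queue` from the Python standard library stand in for processors and an interconnection network. The function name `message_passing_sum` is hypothetical, and a real distributed memory program would use separate address spaces rather than threads.

```python
import queue
import threading


def message_passing_sum(data, num_workers=2):
    """Sum `data` using workers that communicate only through queues."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()

    def worker(inbox):
        chunk = inbox.get()      # receive: block until a chunk arrives
        results.put(sum(chunk))  # send: reply with the partial sum

    threads = [threading.Thread(target=worker, args=(q,)) for q in inboxes]
    for t in threads:
        t.start()
    # Deal the elements out round-robin, one message per worker.
    for i, inbox in enumerate(inboxes):
        inbox.put(data[i::num_workers])
    total = sum(results.get() for _ in range(num_workers))
    for t in threads:
        t.join()
    return total
```

No worker touches another worker's chunk, so no locks are needed. All coordination happens through explicit sends and receives, which is the main contrast with the shared memory model.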