Chapter 2 Parallel Architecture
Moore’s Law
The number of transistors on a chip doubles roughly every two years.
– Has held for over 40 years
– Performance can no longer be improved simply by raising the clock frequency, because of heat issues
What to do with the additional transistors?
– Faster ways to compute operations
– Pipelining
– Cache
– MULTICORE!!
Levels of Parallelism
Bit Level
– Related to word size: how many bits are processed at one time (4, 8, 16, 32, 64)
– Parallel vs. serial addition
Pipelining
– Break an instruction into stages and process them like an assembly line: Fetch, Decode, Execute, Write-back
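As a rough illustration of the serial vs. parallel addition contrast under Bit Level, the sketch below adds two unsigned values one bit per step, the way a bit-serial adder would. The function name `serial_add` and the 8-bit default width are assumptions made for illustration; hardware with a word size of at least `width` bits would do the same addition in a single operation.

```python
def serial_add(a, b, width=8):
    """Add two unsigned integers one bit per step (hypothetical bit-serial adder)."""
    result, carry, steps = 0, 0, 0
    for i in range(width):
        bit_a = (a >> i) & 1          # i-th bit of each operand
        bit_b = (b >> i) & 1
        total = bit_a + bit_b + carry
        result |= (total & 1) << i    # sum bit for this position
        carry = total >> 1            # carry into the next position
        steps += 1
    return result, carry, steps


def word_add(a, b, width=8):
    """Same addition done on the whole word at once, as a width-bit ALU would."""
    total = a + b
    return total & ((1 << width) - 1), total >> width, 1
```

Both functions return the sum truncated to `width` bits, the carry out, and the number of steps taken, so the step counts (`width` versus 1) can be compared directly.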
Parallelism Levels
Pipelining (cont.)
– Each stage can work on a different instruction at the same time.
– Superpipelining: a pipeline with many stages.
Multiple functional units
– Separate units for different kinds of operations (integer arithmetic, floating-point arithmetic, load/store, etc.) can execute in parallel as long as there are no dependencies between the instructions.
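The benefit of pipelining can be sketched with a simple cycle count. The example assumes an idealized four-stage pipeline (Fetch, Decode, Execute, Write-back) in which every stage takes one cycle and there are no stalls or dependencies; the function names are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Execute", "Write-back"]


def unpipelined_cycles(n, stages=len(STAGES)):
    """Each instruction finishes all stages before the next one starts."""
    return n * stages


def pipelined_cycles(n, stages=len(STAGES)):
    """The first instruction fills the pipeline; each later one adds one cycle."""
    return stages + (n - 1) if n > 0 else 0


def schedule(n):
    """Map each cycle (1-based) to the (instruction, stage) pairs active in it."""
    table = {}
    for instr in range(n):
        for s, stage in enumerate(STAGES):
            table.setdefault(instr + s + 1, []).append((instr, stage))
    return table
```

Under these assumptions, `schedule(3)[3]` lists three different instructions in three different stages during the same cycle, which is the assembly-line overlap described above.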
Parallelism Levels (cont.)
Process/Thread Level
– Used for multicore and multiprocessor systems
– Raises the issue of shared memory/cache vs. distributed memory
– Done at the programming level
Flynn’s Taxonomy
Data \ Instruction    Single    Multiple
Single                SISD      MISD
Multiple              SIMD      MIMD
SISD: a standard single processor working on a single data item (pair) at a time
MISD: generally considered impractical
SIMD: one instruction is performed on multiple data items; also called a vector processor, since the operands look like vectors of data
MIMD: multiple processing elements, each executing its own instructions on its own data independently
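The difference between SISD and SIMD can be modeled by counting how many instructions must be issued to add two vectors. This is only a conceptual sketch in plain Python, not real vector hardware; the lane count of 4 and the function names are assumptions made for illustration.

```python
def sisd_add(a, b):
    """One add instruction per pair of data items."""
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def simd_add(a, b, lanes=4):
    """One add instruction covers up to `lanes` pairs of data items."""
    out, issued = [], 0
    for i in range(0, len(a), lanes):
        # Conceptually a single vector instruction over this slice.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        issued += 1
    return out, issued
```

Both functions produce the same sums; only the number of issued instructions differs.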
Memory Organization
Distributed Memory
– Memory is local and private to each processor
– Information is shared by passing messages between nodes
– Faster if the nodes have DMA (Direct Memory Access): a controller can get values from or put values into memory without using the processor, so the processor can keep processing while information is transferred
– Also faster with routers that handle data transfer without bothering the processor
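Message passing between nodes with private memory can be sketched as follows. Two threads and a `queue.Queue` stand in for two nodes and the link between them; that substitution, and all names used, are assumptions for illustration only. Each node touches only its own local dictionary, and the value moves solely through the message.

```python
import queue
import threading


def sender(outbox):
    local = {"x": 41}                  # private memory of the sending node
    outbox.put(("x", local["x"]))      # share the value by sending a message


def receiver(inbox, result):
    local = {}                         # private memory of the receiving node
    key, value = inbox.get()           # blocks until the message arrives
    local[key] = value + 1
    result.update(local)               # report the final local state


def run():
    channel = queue.Queue()            # stands in for the network link
    result = {}
    t_recv = threading.Thread(target=receiver, args=(channel, result))
    t_send = threading.Thread(target=sender, args=(channel,))
    t_recv.start()
    t_send.start()
    t_send.join()
    t_recv.join()
    return result
```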
Memory Organization (cont.)
Shared Memory
– Memory is global and “public”
– Processes communicate through shared variables
– Race conditions are a concern: different execution orders can give different results
Ex.:
Processor 1                Processor 2
LW   $t1, X       (A)      LW   $t1, X       (D)
ADDI $t1, $t1, 1  (B)      ADDI $t1, $t1, 1  (E)
SW   $t1, X       (C)      SW   $t1, X       (F)
The order ABCDEF gives a different result from ADBECF: in the first, X is incremented twice; in the second, both processors load the old value, so one update is lost and X is incremented only once.
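The race in the example can be checked by simulating the six labelled instructions (A to F) in a given order. The sketch assumes X starts at 0 and that each processor has its own `$t1` register; the function names are hypothetical.

```python
from itertools import combinations

# label -> (processor, operation), following the example's labels
OPS = {
    "A": (1, "LW"), "B": (1, "ADDI"), "C": (1, "SW"),
    "D": (2, "LW"), "E": (2, "ADDI"), "F": (2, "SW"),
}


def run_interleaving(order, x=0):
    """Execute the labelled instructions in `order`; return the final X."""
    memory = {"X": x}
    t1 = {1: 0, 2: 0}                  # each processor's private $t1
    for label in order:
        cpu, op = OPS[label]
        if op == "LW":
            t1[cpu] = memory["X"]      # load shared X into the register
        elif op == "ADDI":
            t1[cpu] += 1               # increment the register
        else:
            memory["X"] = t1[cpu]      # store the register back to X
    return memory["X"]


def all_interleavings():
    """Yield every order that keeps A,B,C and D,E,F in program order."""
    for slots in combinations(range(6), 3):
        p1, p2 = iter("ABC"), iter("DEF")
        yield "".join(next(p1) if i in slots else next(p2) for i in range(6))
```

Calling `run_interleaving("ABCDEF")` and `run_interleaving("ADBECF")` reproduces the two outcomes, and `all_interleavings()` can be used to see which final values are possible across every legal ordering.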
Shared Memory
– Generally easier to program
– Does not scale well (hardware issues when many processors hit memory at the same time)
– Cache coherence: if each processor/core has its own (non-shared) cache, the same global memory location may be held in two caches that get updated independently
– NUMA (non-uniform memory access): there is a hierarchy of memories with different access times
Memory Organization
Virtually shared memory
– The programmer’s view differs from the hardware
– The programmer writes code as if memory were shared, but in reality the memory is distributed
– The system automatically generates the messages needed to get values to the proper processor
– Definitely NUMA
Cache Memory
Small, fast memory between the processor and main memory
– Feasible because of temporal and spatial locality
– Holds a subset of main memory
– Cache hit vs. miss
– Cache mapping issues
– COHERENCY
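Hits, misses, and the effect of mapping can be illustrated with a small simulation. The sketch assumes a direct-mapped cache, which is only one of several possible mapping schemes; the sizes (4 lines, 4 addresses per block) and the function name are chosen purely for illustration.

```python
def simulate_direct_mapped(addresses, num_lines=4, block_size=4):
    """Count hits and misses for a sequence of memory addresses."""
    lines = [None] * num_lines         # tag of the block held in each line
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size     # which memory block the address is in
        index = block % num_lines      # the one line that block can occupy
        tag = block // num_lines       # identifies the block within that line
        if lines[index] == tag:
            hits += 1
        else:
            misses += 1
            lines[index] = tag         # load the block, evicting the old one
    return hits, misses
```

A sequential scan such as `range(16)` benefits from spatial locality (one miss per block, then hits), while alternating between addresses 0 and 16 maps both blocks to the same line, so every access misses.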
Thread Level Parallelism
Multithreading: multiple threads executing “simultaneously” on a single processor
– Cannot be truly simultaneous, since there is a single processor
– The CPU switches between threads based on time (timeslicing) and/or on events (switch-on-event, e.g. one thread waiting for I/O)
Multiple cores allow truly simultaneous execution.
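Timeslicing on a single processor can be sketched as a round-robin loop. This is a simplified model, not how any particular operating system schedules threads: each thread is reduced to a number of work units, the quantum of 2 units is an assumption, and switch-on-event is not modeled.

```python
from collections import deque


def round_robin(work, quantum=2):
    """Return the order in which threads get the CPU as (name, units_run) slices."""
    ready = deque(work.items())        # threads waiting for the processor
    trace = []
    while ready:
        name, remaining = ready.popleft()
        run = min(quantum, remaining)  # run until done or the slice expires
        trace.append((name, run))
        if remaining > run:
            ready.append((name, remaining - run))   # back of the queue
    return trace
```

For example, `round_robin({"T1": 3, "T2": 5})` shows the two threads alternating on the one processor: only one runs at any moment, even though both make progress.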
Simultaneous Multithreading
– Requires multiple functional units and replication of the PC register and all the general user registers (the state of the machine)
– Creates “logical processors”
– Allows multiple instructions from different threads to execute simultaneously as long as they do not need the same functional units
Multicore Processors
– Need an OS that recognizes the different cores and schedules tasks on them
– Can run different programs on different cores
– A single program improves in time on multiple cores only if it is written so that parts of it can run simultaneously on separate cores
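Writing a program so that parts of it can run on separate cores usually means splitting the work into independent pieces and combining the partial results. The sketch below does this for a sum using `concurrent.futures.ThreadPoolExecutor`; the helper names and the worker count of 4 are assumptions. In standard CPython the global interpreter lock keeps threads from running Python bytecode in parallel, so this shows the decomposition rather than a real speedup; `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface using separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, parts):
    """Split data into at most `parts` contiguous chunks of similar size."""
    size = -(-len(data) // parts)      # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, workers=4):
    """Sum each chunk as an independent task, then combine the partial sums."""
    if not data:
        return 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunked(data, workers)))
    return sum(partials)
```

The chunks share no state, so the tasks have no dependencies on each other and could be scheduled on different cores.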
Multicore Architecture
One main issue is the location of the cache(s)
– Each core has a local cache
– Every core shares a cache
– Each core has a local L1 cache and shares an L2 cache
Caches need to communicate for coherency
– Network communication
– Pipeline (go to the next)
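Why local caches need to communicate can be shown with a toy model of two cores. The sketch assumes write-through caches and uses write-invalidate as the communication step; that is one common approach chosen here for illustration, not necessarily the scheme the slides have in mind, and the `Core` class is hypothetical.

```python
class Core:
    """A core with a private cache in front of a shared memory dictionary."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def read(self, addr):
        if addr not in self.cache:             # miss: fetch from main memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value, peers=()):
        self.cache[addr] = value
        self.memory[addr] = value              # write-through to main memory
        for peer in peers:                     # tell the other caches
            peer.cache.pop(addr, None)         # invalidate their stale copy


def demo():
    memory = {"X": 0}
    c1, c2 = Core(memory), Core(memory)
    c1.read("X")
    c2.read("X")                               # both caches now hold X = 0
    c1.write("X", 5)                           # no communication
    stale = c2.read("X")                       # c2 still sees its old copy
    c1.write("X", 6, peers=[c2])               # with invalidation
    fresh = c2.read("X")                       # c2 misses and reloads
    return stale, fresh
```

Without the invalidation message the second core keeps reading its old copy; with it, the next read misses and picks up the current value.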
Interconnection Networks
Ideally, every node would be connected to every other node, so that any two disjoint pairs of nodes could communicate simultaneously.
– Requires O(n^2) connections, which does not scale well
Most networks use a significantly reduced set of connections (cost vs. speed)
– A routing technique is needed if the network is not fully connected
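The cost vs. speed trade-off can be made concrete by counting links. A fully connected network of n nodes needs n(n-1)/2 links. For comparison the sketch uses a ring, which is an assumed example of a reduced topology and is not named in the text above; the function names are hypothetical.

```python
def fully_connected_links(n):
    """Every pair of nodes has its own link: n(n-1)/2, which grows as O(n^2)."""
    return n * (n - 1) // 2


def ring_links(n):
    """Each node links only to its neighbours, so the count grows as O(n)."""
    return n if n > 2 else max(n - 1, 0)


def ring_hops(a, b, n):
    """Links a message crosses between nodes a and b, taking the shorter way round."""
    d = abs(a - b) % n
    return min(d, n - d)
```

With 64 nodes the fully connected network needs 2016 links against 64 for the ring, but in the ring a message may have to be routed through up to 32 links instead of always crossing just one.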