High Performance Computer Architecture Challenges Rajeev Balasubramonian School of Computing, University of Utah.

1 High Performance Computer Architecture Challenges Rajeev Balasubramonian School of Computing, University of Utah

2 Dramatic Clock Speed Improvements!! The 1st Intel processor: 108 KHz. The Intel Pentium 4: 3.2 GHz.

3 Clock Speed = Performance ? The Intel Pentium 4 has a higher clock speed than the IBM Power4 – does the Pentium 4 execute your program faster?

4 Clock Speed = Performance ? The Intel Pentium 4 has a higher clock speed than the IBM Power4 – does the Pentium 4 execute your program faster? [Diagram: two timelines marking clock ticks and completed instructions over time – Case 1 and Case 2]

5 Performance = Clock Speed x Parallelism
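
The slide's relation can be sketched numerically. A minimal sketch, with parallelism expressed as instructions per cycle (IPC); the chip names, IPC, and frequency values below are hypothetical illustrations, not measured Pentium 4 or Power4 data:

```python
# Performance = clock speed x parallelism: a higher-clocked chip can still
# be slower on a given program if it completes fewer instructions per cycle.

def time_seconds(instructions, ipc, clock_hz):
    """Execution time = instructions / (IPC * clock frequency)."""
    return instructions / (ipc * clock_hz)

program = 1e9  # a billion-instruction program

# Hypothetical: "Chip A" has the faster clock, "Chip B" the higher parallelism.
t_a = time_seconds(program, ipc=1.0, clock_hz=3.2e9)  # 3.2 GHz, 1 instr/cycle
t_b = time_seconds(program, ipc=3.0, clock_hz=1.3e9)  # 1.3 GHz, 3 instr/cycle

print(f"Chip A: {t_a:.4f} s, Chip B: {t_b:.4f} s")  # Chip B finishes first
```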

6 What About Parallelism?

7 Dramatic Clock Speed Improvements!! The 1st Intel processor: 108 KHz. The Intel Pentium 4: 3.2 GHz.

8 The Basic Pipeline Consider an automobile assembly line: [Diagram: four stages of 1 day each – a new car rolls out every day; eight stages of half a day each – a new car rolls out every half day] In each case, it takes 4 days to build a car, but… More stages → more parallelism and less time between cars
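
The assembly-line analogy can be sketched as arithmetic: the latency per car stays fixed, but more (shorter) stages raise throughput. The numbers mirror the slide:

```python
# Pipelining sketch: splitting fixed per-car work into more stages does not
# make any one car faster, but cars finish more often (higher throughput).

def pipeline(total_latency_days, n_stages, n_cars):
    """Return (time between finished cars, time until all cars are done)."""
    stage_time = total_latency_days / n_stages
    # The first car takes the full latency; each later car follows one stage behind.
    finish_time = total_latency_days + (n_cars - 1) * stage_time
    return stage_time, finish_time

print(pipeline(4, 4, 10))  # 4 stages: a car every 1.0 day, 10 cars in 13.0 days
print(pipeline(4, 8, 10))  # 8 stages: a car every 0.5 day, 10 cars in 8.5 days
```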

9 What Determines Clock Speed? Clock speed is a function of the work done in each stage – in the earlier examples, the clock speeds were 1 car/day and 2 cars/day Similarly, it takes plenty of “work” to execute an instruction, and this work is broken into stages [Diagram: execution of a single instruction divided into pipeline stages]

10 What Determines Clock Speed? Clock speed is a function of the work done in each stage – in the earlier examples, the clock speeds were 1 car/day and 2 cars/day Similarly, it takes plenty of “work” to execute an instruction, and this work is broken into stages [Diagram: execution of a single instruction divided into pipeline stages of 250 ps each → 4 GHz clock speed]
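
The slide's arithmetic in one line: the clock period equals the time per pipeline stage, so the frequency is its reciprocal:

```python
# 250 ps per pipeline stage -> a clock period of 250 ps -> a 4 GHz clock.

stage_delay_s = 250e-12          # 250 picoseconds of work per stage
clock_hz = 1.0 / stage_delay_s   # frequency is the reciprocal of the period

print(f"{clock_hz / 1e9:.0f} GHz")
```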

11 Clock Speed Improvements Why have we seen such dramatic improvements in clock speed? → work has been broken up into more stages – early Intel chips did work equivalent to approximately 56 logic gates in each stage; today’s chips do about 12 logic gates’ worth of work per stage → transistors have been becoming faster – as technology improves, we can draw smaller and smaller transistors/gates on a chip, and that improves their speed (doubling every 5–6 years)

12 Will these Improvements Continue? Transistors will continue to shrink and become faster for at least 10 more years Each pipeline stage is already pretty small – improvements from this factor will cease If clock speed improvements stagnate, should we turn our focus to parallelism?

13 Microprocessor Blocks [Block diagram: Branch Predictor, Decode & Rename, Issue Logic, Register File, ALU, L1 Instr Cache, L1 Data Cache, L2 Cache]

14 Innovations: Branch Predictor Improve prediction accuracy by detecting frequent patterns [Block diagram, highlighting the branch predictor]

15 Innovations: Out-of-order Issue Out-of-order issue: if later instructions do not depend on earlier ones, execute them first [Block diagram, highlighting the issue logic]
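
A minimal sketch of the out-of-order idea: each cycle, issue the oldest instruction whose inputs are ready, even if an earlier instruction is still stalled. The instruction tuples and latencies below are invented for illustration, not a real ISA:

```python
# Out-of-order issue sketch: an independent 'mul' issues ahead of an 'add'
# that is waiting on a long-latency 'load'. One instruction issues per cycle.

def schedule(instrs, latency):
    """instrs: list of (name, dest, srcs) in program order.
    Returns a dict mapping each instruction name to its issue cycle."""
    ready_at = {}  # register -> first cycle its value is available
    issued = {}
    remaining = list(range(len(instrs)))
    cycle = 0
    while remaining:
        for idx in remaining:
            name, dest, srcs = instrs[idx]
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                issued[name] = cycle
                ready_at[dest] = cycle + latency[name]
                remaining.remove(idx)
                break  # at most one issue per cycle in this sketch
        cycle += 1
    return issued

prog = [("load", "r1", []),      # long-latency load (e.g., a cache miss)
        ("add",  "r2", ["r1"]),  # depends on the load's result
        ("mul",  "r3", [])]      # independent of both
lat = {"load": 10, "add": 1, "mul": 1}

print(schedule(prog, lat))  # {'load': 0, 'mul': 1, 'add': 10}
```

An in-order machine would hold the `mul` behind the stalled `add`; here it issues at cycle 1 instead of cycle 11.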

16 Innovations: Superscalar Architectures Multiple ALUs: increase execution bandwidth [Block diagram, with a second ALU added]

17 Innovations: Data Caches 2K papers on caches: efficient data layout, stride prefetching [Block diagram, highlighting the L1 data cache]

18 Summary Historically, computer engineers have focused on performance Performance is a function of clock speed and parallelism As technology improves, clock speeds will improve, although at a slower rate Parallelism has been gradually improving, and much of the low-hanging fruit has been picked

19 Outline Recent Microprocessor History Current Trends and Challenges Solutions to Handling these Challenges

20 Trend I : An Opportunity Transistors on a chip have been doubling every two years (Moore’s Law) In the past, transistors have been used for out-of-order logic, large caches, etc… In the future, transistors can be employed for multiple processors on a single chip

21 Chip Multiprocessors (CMP) The IBM Power4 has two processors on a die Sun has announced the 8-processor Niagara [Diagram: processors P1–P4 sharing an L2 cache]

22 The Challenge Nearly every chip will have multiple processors, but where are the threads? Some applications will truly benefit – they can be easily decomposed into threads Some applications are inherently sequential – can we execute speculative threads to speed up these programs? (open problem!)

23 Trend II : Power Consumption Power ∝ a f C V², where a is the activity factor, f is the frequency, C is the capacitance, and V is the voltage Every new chip has higher frequency, more transistors (higher C), and slightly lower voltage – the net result is an increase in power consumption
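
The trend on this slide can be sketched with the dynamic-power relation itself; the activity factor, frequency, capacitance, and voltage values below are hypothetical, chosen only to show the scaling:

```python
# Dynamic power sketch: P is proportional to a * f * C * V^2, so gains in
# frequency and transistor count outrun the modest voltage reduction.

def dynamic_power(a, f_hz, c_farads, v_volts):
    """Dynamic switching power (up to a constant of proportionality)."""
    return a * f_hz * c_farads * v_volts ** 2

base = dynamic_power(a=0.2, f_hz=2.0e9, c_farads=1.0e-9, v_volts=1.2)

# Hypothetical next generation: 1.5x frequency, 1.5x capacitance, slightly
# lower voltage -- the net result is still an increase in power.
nxt = dynamic_power(a=0.2, f_hz=3.0e9, c_farads=1.5e-9, v_volts=1.1)

print(f"power grows by {nxt / base:.2f}x")
```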

24 Scary Slide! Power density cannot be allowed to increase at current rates (Source: Borkar et al., Intel)

25 Impact of Power Increases Well, Utah Power sends you fatter bills every month To maintain constant chip temperature, heat produced on a chip has to be dissipated away – every additional watt increases the cooling cost of a chip by approximately $4 !! If the temperature of a chip rises, the power it dissipates also increases (almost exponentially) → a vicious cycle!

26 Trend III : Wire Delays As technology improves, logic gates shrink → their speed increases and clock speeds improve As logic gates shrink, wires shrink too – unfortunately, their speed improves only marginally In relative terms, future chips will have fast transistors/gates and slow wires Computation is cheap, communication is expensive!

27 Impact of Wire Delays Crossing the chip used to take one cycle In the future, crossing the chip can take up to 30 cycles Many structures on a chip are wire-constrained (register file, cache) – their access times slow down → throughput decreases as instructions sit around waiting for values Long wires also consume power

28 Trend IV : Soft Errors High energy particles constantly collide with objects and deposit charge Transistors are becoming smaller and on-chip voltages are being lowered → it doesn’t take much to toggle the state of a transistor The frequency of this occurrence is projected to increase by nine orders of magnitude over a 20-year period

29 Impact of Soft Errors When a particle strike occurs, the component is not rendered permanently faulty – only the value it contains is erroneous Hence, this is termed a transient fault or soft error The error propagates when other instructions read this faulty value This is already a problem for mission-critical apps (space, defense, highly-available servers) and may soon be a problem in other domains

30 Summary of Trends More transistors, more processors on a single chip High power consumption Long wire delays Frequent soft errors We are attempting to exploit transistors to increase parallelism – in light of the above challenges, we’d be happy to even preserve parallelism

31 Transistors & Wire Delays Bring in a large window of instructions so you can find high parallelism Distribute instructions across processors so that communication is minimized [Diagram: a window of instructions mapped across processors]

32 Difficult Branches Mispredicted branches result in poor parallelism and wasted work (power) Solution: when you arrive at a fork, take both directions – execute on low-frequency units to control power dissipation levels [Diagram: both branch paths mapped across processors]

33 Thermal Emergencies Heterogeneous units allow you to reduce cooling costs If a chip’s peak power is 110W, allow enough cooling to handle 100W average – save $40/chip! If the application starts consuming more than 100W and temperature starts to rise, start favoring the low power processor cores – intelligent management allows you to make forward progress even in a thermal emergency

34 Handling Long Wire Delays Wires can be designed to have different properties Knob 1: wire width and spacing: fat wires are faster, but have low bandwidth

35 Handling Wire Capacitance Knob 2: wires have repeaters/buffers – many large buffers → low delay, but high power consumption

36 Mapping Data to Wires We can optimize wires for delay, bandwidth, power Different data transfers on a chip have different latency and bandwidth needs – an intelligent mapping of data to wires can improve performance and lower power consumption

37 Handling Soft Errors Errors can be detected and corrected by providing redundancy – execute two copies of a program (perhaps on a CMP) and compare results Note that this doubles power consumption! [Diagram: leading and trailing threads running the same program]

38 Handling Soft Errors The trailing thread is capable of higher performance than the leading thread – but there’s no point catching up – hence, artificially slow the trailing thread by lowering its frequency → lower power dissipation The trailing thread never fetches data from memory and never guesses at branches [Diagram: leading and trailing threads; peak throughputs 1 BIPS and 2 BIPS]
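
The detect-by-comparison idea can be sketched in a few lines. This is only a software analogy of the scheme on the slides: the "particle strike" is simulated by flipping one bit of one copy's result, and recovery is modeled as simple re-execution:

```python
# Redundant execution sketch: run the computation twice and compare.
# A transient fault corrupts a value, not the hardware, so re-running
# the computation produces the correct result again.

import random

def compute(x):
    return x * x + 1  # stand-in for any deterministic computation

def run_with_possible_flip(x, flip):
    result = compute(x)
    if flip:  # simulate a particle strike flipping one bit of the result
        result ^= 1 << random.randrange(16)
    return result

def redundant_execute(x, flip_one_copy=False):
    lead = run_with_possible_flip(x, flip=flip_one_copy)
    trail = run_with_possible_flip(x, flip=False)
    if lead != trail:  # mismatch -> a soft error occurred somewhere
        return compute(x), "error detected, re-executed"
    return lead, "ok"

print(redundant_execute(7))                      # (50, 'ok')
print(redundant_execute(7, flip_one_copy=True))  # mismatch caught, redone
```

Note the sketch captures only detection and recovery by comparison; the slides' frequency and fetch optimizations for the trailing thread are not modeled here.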

39 Summary of Solutions Heterogeneous wires and processors Instructions and data have different needs: map them to appropriate wires and processors Note how these solutions target multiple issues simultaneously: slow wires, many transistors, soft errors, power/thermal emergencies

40 Conclusions Performance has improved because of clock speed and parallelism advances Clock speed improvements will continue at a slower rate Parallelism is on a downward trend because of technology trends and because the low-hanging fruit has been picked We must find creative ways to preserve or even improve parallelism in the future

