MICROPROCESSOR ARCHITECTURE (ECE519) Prof. Lynn Choi Presented by Sam.


1 MICROPROCESSOR ARCHITECTURE (ECE519) Prof. Lynn Choi Presented by Sam

2 Outline  Main idea  Motivation  Challenges  Core design: a method for representing inter-instruction data dependences (SSR); the Forwardflow Dataflow Queue (DQ); the Forwardflow architecture  Related Work  Conclusion & problems

3 Motivation and Challenges Consider this vision: microarchitects hope to improve applications’ overall efficiency by focusing on thread-level parallelism (TLP) rather than instruction-level parallelism (ILP) within a single thread.

4 Challenges Two fundamental problems: Amdahl’s Law, and whether all cores can simultaneously operate at full speed.  Parallel speedup is limited by the parallel fraction f: Speedup = 1 / ((1 - f) + f/N)  e.g., only ~10x speedup at N = 512 when f = 90%; adding more cores yields only a small additional speedup
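The Amdahl bound on the slide is easy to check numerically. A minimal sketch (the function name is mine, not from the slides):

```python
def amdahl_speedup(n_cores, parallel_fraction):
    """Speedup = 1 / ((1 - f) + f / N): the serial fraction bounds the gain."""
    f = parallel_fraction
    return 1.0 / ((1.0 - f) + f / n_cores)

# With f = 90%, even 512 cores give only about a 10x speedup:
print(round(amdahl_speedup(512, 0.90), 1))  # 9.8
```

No matter how large N grows, the speedup here can never exceed 1 / (1 - f) = 10x.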

5 Challenges Two fundamental problems: Amdahl’s Law, and whether all cores can simultaneously operate at full speed. Simultaneously Active Fraction (SAF): “the fraction of the entire chip resources that can be active simultaneously.” SAF is bounded by the physical limits on power delivery and heat dissipation: in the long term, to maintain fixed power and area budgets as technology scales, the fraction of active transistors must decrease with each technology generation.

6 Motivation For single-thread performance  Exploit ILP For multiple threads  Save power  Exploit TLP

7 Motivation  CMPs will need scalable cores. A scalable core is a processor capable of operating in several different configurations, each offering a different power/performance point.  Scale UP for performance: use more resources (e.g., cores, caches, hardware accelerators) for more performance, allowing single-threaded applications to aggressively exploit ILP and MLP to the limits of available power.  Scale DOWN

8 Motivation  CMPs will need scalable cores.  Scale UP  Scale DOWN for energy conservation: when power is constrained, scalable cores can scale down to conserve per-core energy and exploit TLP with many small cores.

9 Scalable Cores CMP equipped with scalable cores: scaled up to run a few threads quickly (left), and scaled down to run many threads in parallel (right). Scalable cores have the potential to adapt their behavior to best match their current workload and operating conditions.

10 Ideas  Idea: Forwardflow  A new scalable core microarchitecture  Uses pointers to distribute values  Scales to large instruction window sizes: full-window scheduler  Scales dynamically: variable-sized instruction window

11 Forwardflow architecture Problem: in a scalable core, resource allocation changes over time. Designers of scalable cores should avoid structures that are difficult to scale, such as centralized register files and bypassing networks. This work focuses on scaling window size.

12 Serialized Successor Representation (SSR) A method for representing inter-instruction data dependences, called Serialized Successor Representation (SSR). Instead of maintaining value names, SSR describes values’ relationships to the operands of other instructions. Instructions in SSR are represented as three-operand tuples: SOURCE1 (S1), SOURCE2 (S2), and DESTINATION (D). Each operand consists of a value and a successor pointer; operand pointers are used to represent data dependences.

13 Serialized Successor Representation (SSR) The pointer field of the producing instruction’s D-operand designates the first successor operand, usually the S1- or S2-operand of a later instruction. If a second successor exists, the pointer field at the first successor operand designates the location of the second successor operand. The locations of subsequent operands are encoded in linked-list fashion: the pointer at successor i designates the location of successor i+1.

14 Serialized Successor Representation (SSR) SSR forms distributed chains of pointers, each terminated by a NULL pointer. Pros: no explicit register renaming; never requires a search or broadcast operation to locate the successors of any dynamic value; can be built from simple SRAMs.
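The pointer chains described above behave like singly linked lists of operands. A minimal sketch, assuming an illustrative in-memory layout (the class and function names are mine, not the actual DQ structures):

```python
# Each operand holds a value slot and a pointer to the *next* successor
# operand, so all consumers of a value form a singly linked list that
# starts at the producer's destination (D) operand.

class Operand:
    def __init__(self):
        self.value = None   # filled in when the value becomes available
        self.succ = None    # pointer to the next successor operand (or NULL)

class Instr:
    def __init__(self, name):
        self.name = name
        self.s1, self.s2, self.d = Operand(), Operand(), Operand()

def append_successor(head, operand):
    """Walk the chain from `head` and link `operand` as the last successor."""
    node = head
    while node.succ is not None:   # stop at the NULL terminator
        node = node.succ
    node.succ = operand

ld, add, mult = Instr("ld"), Instr("add"), Instr("mult")
append_successor(ld.d, add.s1)    # first consumer of the ld's result
append_successor(ld.d, mult.s1)   # second consumer chains off add.s1
assert ld.d.succ is add.s1 and add.s1.succ is mult.s1
```

Note that locating a new successor never broadcasts: it only follows pointers from the producer's D-operand, which is why simple SRAMs suffice.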

15 Forwardflow – Dataflow Queue (DQ) Instructions, values, and data dependences reside in a distributed Dataflow Queue (DQ). The DQ comprises independent banks and pipelines, which system software can activate or de-activate to scale the core’s execution resources.

16 Forwardflow architecture

17 Fetch Read instructions from the L1-I cache, predict branches, and pass instructions on to Decode. Fetch proceeds no differently than in other high-performance microarchitectures.

18 Decode Decode determines to which pointer chains, if any, each instruction belongs. It does this using the Register Consumer Table (RCT), which resembles a traditional rename table and is implemented as an SRAM-based table. The RCT also identifies registers last written by a committed instruction. Decode detects and handles potential data dependences, analogous to traditional renaming.
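One way to picture the RCT’s role at decode, using the ld/add/mult example from the following slides (the dictionary encoding, helper name, and the mult’s destination register R5 are assumptions for illustration, not the hardware’s actual format):

```python
# The RCT maps each architectural register to its most recent reference
# (the tail of that register's pointer chain), or marks the value as
# already committed to the architectural register file (ARF).

ARF = "ARF"                      # value available in the ARF
rct = {"R1": ARF, "R2": ARF}     # assumed initial state

def decode(instr, dest, sources):
    """Look up each source's most recent reference, then record this
    instruction's operands as the new chain tails."""
    tails = {r: rct.get(r, ARF) for r in sources}
    for i, r in enumerate(sources, start=1):
        rct[r] = (instr, f"S{i}")     # last reference: this instr's source i
    if dest:
        rct[dest] = (instr, "D")      # last reference: this instr's dest
    return tails

assert decode("ld", "R3", ["R1", "R2"]) == {"R1": ARF, "R2": ARF}
assert decode("add", "R4", ["R3"]) == {"R3": ("ld", "D")}
assert decode("mult", "R5", ["R3", "R4"]) == {"R3": ("add", "S1"),
                                              "R4": ("add", "D")}
```

The three assertions mirror slides 20-22: the ld finds both sources in the ARF, the add finds R3’s previous use at the ld’s destination, and the mult finds R3 and R4 last referenced by the add’s source 1 and destination operands.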

19 Dispatch Dispatch inserts instructions into the Dataflow Queue (DQ); instructions issue when their operands become available.

20 Dispatched/Executing The ld instruction is ready to issue because both source operands are available in the ARF. Decode updates the RCT to indicate that the ld produces R3. Dispatch reads the ARF to obtain R1’s value, writes both operands into the DQ, and issues the ld.

21 When the add is decoded, it consults the RCT and finds that R3’s previous use was as the ld’s destination field. Dispatch therefore sets the pointer from the ld’s destination to the add’s first source operand. The add’s immediate operand (55) is written into the DQ at dispatch.

22 The mult’s decode consults the RCT and discovers that both operands, R3 and R4, are not yet available and were last referenced by the add’s source 1 operand and the add’s destination operand, respectively. Dispatch of the mult therefore checks for available results in both the add’s source 1 value array and destination value array, and appends the mult to R3’s and R4’s pointer chains.

23 The sub appends itself to the R3 pointer chain and writes its dispatch-time-ready operand (66) into the DQ.

24 Wakeup, Selection, and Issue Upon completion of the ld, the memory value (99) is written into the DQ, and the ld’s destination pointer is followed to the first successor.

25 Wakeup, Selection, and Issue The add’s metadata and source 2 value are read and, coupled with the arriving value of 99, the add can now be issued. The update hardware reads the add’s source 1 pointer, discovering the mult as the next successor.

26 Wakeup, Selection, and Issue The mult’s metadata, other source operand, and next pointer field are read. The source 1 operand is unavailable, so the mult will issue at a later time.

27 Wakeup, Selection, and Issue Finally, following the mult’s source 2 pointer to the sub delivers 99 to the sub’s first operand, enabling the sub to issue.
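The whole wakeup walk of slides 24-27 can be sketched in a few lines. The classes and the `deliver` helper are illustrative stand-ins for the DQ’s update hardware, not its actual structure, and the chain mirrors the ld/add/mult/sub example above (with R3 feeding the mult’s source 2, as on slide 27):

```python
# When a value completes, hardware follows the producer's destination
# pointer through each successor operand, writing the value and issuing
# any instruction whose other operand is already available.

class Operand:
    def __init__(self, value=None):
        self.value, self.succ = value, None

class Instr:
    def __init__(self, name, s1=None, s2=None):
        self.name = name
        self.s1, self.s2, self.d = Operand(s1), Operand(s2), Operand()

    def ready(self):
        return self.s1.value is not None and self.s2.value is not None

# ld produces R3; add has immediate 55; mult still awaits R4 in its
# source 1; sub has immediate 66.
ld, add = Instr("ld"), Instr("add", s2=55)
mult, sub = Instr("mult"), Instr("sub", s2=66)
for op in (add.s1, mult.s2, sub.s1):
    op.owner = None
add.s1.owner, mult.s2.owner, sub.s1.owner = add, mult, sub

# R3's pointer chain: ld.D -> add.S1 -> mult.S2 -> sub.S1 -> NULL
ld.d.succ, add.s1.succ, mult.s2.succ = add.s1, mult.s2, sub.s1

def deliver(producer_d, value):
    """Write `value` into every successor operand; return instrs now ready."""
    issued, op = [], producer_d.succ
    while op is not None:
        op.value = value
        if op.owner.ready():
            issued.append(op.owner.name)
        op = op.succ
    return issued

print(deliver(ld.d, 99))  # ['add', 'sub']  (the mult still waits for R4)
```

Each chain step touches exactly one operand entry, so wakeup cost scales with the number of actual consumers rather than the window size.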

28 Methodology: target machine On each tile resides a single core, a private L1-I cache (32KB), a private write-through, write-invalidate L1-D cache (32KB), a private L2 cache (1MB) which maintains coherence in the L1-D via inclusion, and one bank of a shared L3 cache. It is assumed that cores and private caches can be powered off without affecting the shared L3; the L3 operates in its own voltage domain.
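For reference, the per-tile parameters above can be collected into a single configuration sketch (the field names are mine, not the simulator’s):

```python
# Per-tile configuration of the target machine, as described on the slide.
tile_config = {
    "l1i": {"size_kb": 32, "private": True},
    "l1d": {"size_kb": 32, "private": True,
            "policy": "write-through, write-invalidate"},
    "l2":  {"size_mb": 1, "private": True,
            "coherence": "maintains L1-D coherence via inclusion"},
    "l3":  {"shared": True, "per_tile": "one bank",
            "power": "own voltage domain; stays on when cores power off"},
}
assert tile_config["l1d"]["size_kb"] == 32
```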

29  Scalable Schedulers  Direct Instruction Wakeup [Ramirez04]:  Scheduler has a pointer to the first successor  Secondary table for matrix of successors  Hybrid Wakeup [Huang02]:  Scheduler has a pointer to the first successor  Each entry has a broadcast bit for multiple successors  Half Price [Kim02]:  Slice the scheduler in half  Second operand often unneeded Related Work

30  Dataflow & Distributed Machines  Tagged-Token [Arvind90]  Values (tokens) flow to successors  TRIPS [Sankaralingam03]:  Discrete Execution Tiles: X, RF, $, etc.  EDGE ISA  Clustered Designs [e.g. Palacharla97]  Independent execution queues Related Work

31 Conclusion and problems Conclusion:  Forwardflow allows the system to trade off power and performance. Problems:  What happens if the number of DQ banks is larger (>8) or smaller (<8)?  We have no idea how software must change to accommodate concurrency.

