NC STATE UNIVERSITY FabScalar Center for Efficient, Scalable and Reliable Computing (CESR) Department of Electrical and Computer Engineering North Carolina.


1 NC STATE UNIVERSITY
FabScalar
Center for Efficient, Scalable and Reliable Computing (CESR)
Department of Electrical and Computer Engineering
North Carolina State University
Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Sandeep S. Navada, Hashem H. Najaf-abadi, Eric Rotenberg

2 NC STATE UNIVERSITY Eric Rotenberg © 2009 WARP’09 6/20/09
High-Performance Superscalar Processor
- Generic pipeline configuration
  ↑ Good performance on wide range of applications
  ↓ Not highest-performing for any given application
  ↓ Power inefficient

3 Application-Specific Superscalar Processor
[Figure labels: App. X; generic superscalar processor; application-specific superscalar processor]

4 Propagation Delay
2-way to 4-way:
- Increase sizes of ILP-extracting units to expose and exploit more ILP
- Hide increase in propagation delays with deeper pipelining
- Except: worsened propagation delays not hidden for inter-instruction dependences
[Figure: propagation delay (ns), 2-way superscalar vs. 4-way superscalar]
[Figure: Execution Time for App. 1 and App. 2, 2-way vs. 4-way; bar labels: dependencies, independencies]

5 Heterogeneous Multi-core
[Figure labels: App. 1, App. 2, App. N]
Customize each core to an application, class of application, or class of application behavior.

6 Challenge
- Customization captures interplay between program, microarchitecture, and technology
- Need real superscalar designs …
- … and need many of them
Need tool for automatically composing physical designs of arbitrary superscalar processors.
Need to try out many real superscalar designs.

7 Target both R & D
- Research: High-fidelity designs improve discovery
- Development: Designs should be product-strength

8 Canonical Superscalar Processor
- Different superscalar processors have same canonical pipeline stages
- Their canonical stages differ in terms of:
  Complexity
    - Width, i.e., number of superscalar “ways”
    - Sizes of stage-specific structures
  Sub-pipelining
    - How deeply pipelined a canonical stage is

9 FabScalar
1) Define composable interfaces of canonical pipeline stages, so that they can be stitched together to compose an overall superscalar processor.
2) Pre-design multiple versions of each canonical pipeline stage that differ in their width and stage-specific structure sizes (complexity) and depth (sub-pipelining).
3) Develop a high-level superscalar synthesis tool that can automatically compose an arbitrary superscalar processor based on processor-level and stage-level constraints (frequency, power, and area), and output multiple representations (Verilog, cycle-accurate C++, netlist, and physical design) of the processor.
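The three steps on this slide can be illustrated with a small sketch. It is not the FabScalar tool: the `StageDesign` record, the `LIBRARY` entries, all delay and area numbers, and the `compose` function are hypothetical, and matching widths stand in for "composable interfaces". The sketch picks one pre-designed version per canonical stage so that the slowest sub-stage meets a clock-period constraint and the total area meets an area constraint, preferring the smallest area.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class StageDesign:
    stage: str        # canonical stage name, e.g. "fetch"
    width: int        # number of superscalar ways
    depth: int        # sub-pipelining depth of this stage
    delay_ns: float   # hypothetical delay of the slowest sub-stage
    area: float       # hypothetical area in arbitrary units


# Hypothetical pre-designed stage library (all numbers are made up).
LIBRARY = [
    StageDesign("fetch", 2, 1, 1.2, 1.0),
    StageDesign("fetch", 2, 2, 0.7, 1.3),
    StageDesign("decode", 2, 1, 0.6, 0.5),
    StageDesign("rename", 2, 1, 0.9, 0.8),
    StageDesign("rename", 2, 2, 0.5, 1.0),
]


def compose(library, stages, width, max_period_ns, max_area):
    """Pick one design per stage meeting the constraints; smallest area wins.

    Returns a tuple of StageDesign in pipeline order, or None if no
    combination satisfies the constraints.
    """
    # Only designs of the requested width can be stitched together here.
    candidates = [
        [d for d in library if d.stage == s and d.width == width]
        for s in stages
    ]
    best = None
    best_area = None
    for combo in product(*candidates):
        period = max(d.delay_ns for d in combo)   # slowest stage sets the clock
        area = sum(d.area for d in combo)
        if period <= max_period_ns and area <= max_area:
            if best is None or area < best_area:
                best, best_area = combo, area
    return best


if __name__ == "__main__":
    design = compose(LIBRARY, ("fetch", "decode", "rename"), 2, 1.0, 3.0)
    for d in design:
        print(d.stage, "depth", d.depth, "delay", d.delay_ns, "ns")
```

Exhaustive enumeration is used only because the toy library is tiny; a real tool would also have to account for power and physical design, which this sketch ignores.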

10 SSL and Composability
[Figure labels: fetch; scalar, 1 to 3 stages; 2-way superscalar, 1 to 3 stages; decode; rename]

11 Status
- Designed synthesizable Verilog for a baseline superscalar processor
  Starting point for populating SSL with pipeline stage designs
Stage            Description
Fetch            4-wide, 512-entry BTB, 128-entry bimodal branch predictor, 8-entry RAS, 16-instruction fetch buffer
Decode           4-wide, ISA = PISA (MIPS-like)
Rename           4-wide, 32-entry rename map table with 8 read and 4 write ports, 4 shadow map tables (checkpoints)
Dispatch         4-wide
Issue            4-wide issue, 32-entry issue queue
Register Read    4-wide, 128-entry physical register file with 8 read ports and 4 write ports
Execute          1 simple ALU, 1 complex ALU, 1 branch ALU, 1 AGEN + 1 port to load-store unit
Load-Store Unit  16-entry load queue, 16-entry store queue
Writeback        4-wide
Retire           4-wide, 128-entry active list with 4 read and 4 write ports, arch. map table with 4 read and 4 write ports
Niket

12 Status (cont.) Niket

13 Status (cont.) Niket

14 Status (cont.)
- Developed cycle-accurate C++ simulator and Verilog/C++ co-simulation environment
  Cycle-accurate at pipeline stage level
Benchmarks: gap, gcc, gzip, twolf, vortex, vpr
IPC: 0.45, 0.54, 0.44, 0.52, 0.48 (five values survive for the six benchmarks; the column alignment is not recoverable)
Salil

15 Status (cont.)
- Developed register file compiler
  Superscalar processor has many specialized and highly-ported RAM-based structures
[Figure: 16R8W bitcell layout]
Tanmay

16 Status (cont.)
- Began sub-pipelining key stages: fetch and issue
- Block-ahead pipelining [Seznec et al.]
[Figure: fetch of blocks A B C D under three schemes]
  Unpipelined Fetch: throughput = 1
  Pipelined Fetch (no block-ahead): throughput = 1
  Pipelined Fetch (with block-ahead): throughput = 2
Jayneel
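One way to read the three throughput figures on this slide is as fetch blocks per original (unpipelined) cycle time. The sketch below is a toy timing model under that assumption, not the authors' design: the function names are hypothetical, fetch is split into `depth` equal sub-cycles, and block-ahead is modeled only by its effect, namely that the next fetch can start one sub-cycle later instead of waiting for the current block to finish.

```python
def fetch_finish_times(num_blocks, depth, block_ahead):
    """Sub-cycle at which each fetch block completes (toy model)."""
    finish = []
    start = 0
    for _ in range(num_blocks):
        finish.append(start + depth)
        # Without block-ahead, the next fetch address is known only once the
        # current block has finished all `depth` sub-stages. With block-ahead
        # (assumed here), a new fetch can start every sub-cycle.
        start += 1 if block_ahead else depth
    return finish


def throughput(depth, block_ahead, num_blocks=8):
    """Steady-state blocks per original cycle (one original cycle = depth sub-cycles)."""
    finish = fetch_finish_times(num_blocks, depth, block_ahead)
    interval = finish[-1] - finish[-2]   # sub-cycles between completed blocks
    return depth / interval


if __name__ == "__main__":
    print("unpipelined:", throughput(1, False))
    print("pipelined, no block-ahead:", throughput(2, False))
    print("pipelined, with block-ahead:", throughput(2, True))
```

Under these assumptions, pipelining alone does not raise fetch throughput because successive fetches are serialized by the next-address dependence; removing that dependence is what lets the deeper pipeline pay off.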

17 Example Applications
- Superscalar customization, fast design-space exploration
Sandeep

18 Example Applications (cont.)
- Core-Selectability in Chip Multiprocessors
Tiled Het. Multi-cores
Configure parallel processor for parallel workload at hand.
Hashem

19 Example Applications (cont.)
- Revisit microarchitecture techniques
- Techniques discarded for limited applicability may be valuable in workload-customized cores

20 Example Applications (cont.)
- Conventional methodology flawed
    Arbitrarily pick a baseline (perhaps rules-of-thumb)
    Add gadget to baseline
    Speedup: (baseline+gadget) / (baseline)
    Influence of gadget depends on choice of baseline
    Example: Value prediction more important with undersized IQ
- OK methodology
    Baseline = custom core for each benchmark
    Add gadget to this baseline, per benchmark
    Speedup: (baseline+gadget) / (baseline)
- Better methodology
    Baseline = custom core for each benchmark
    Recustomize core with gadget in place (new global optimum)
    Speedup: (recustomized core) / (customized core)
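The three speedup definitions on this slide can be compared on a toy example. Everything in the sketch is an assumption made for illustration: a single hypothetical benchmark, a design space consisting only of the issue queue (IQ) size, and made-up IPC and frequency numbers chosen so that the gadget helps an undersized IQ most.

```python
IQ_SIZES = (16, 32, 64)

# Hypothetical numbers: a larger IQ raises IPC but lowers clock frequency.
BASE_IPC = {16: 1.0, 32: 1.4, 64: 1.6}
GADGET_IPC_GAIN = {16: 0.4, 32: 0.1, 64: 0.05}   # helps undersized IQ most
REL_FREQ = {16: 1.0, 32: 0.9, 64: 0.7}


def performance(iq_size, gadget):
    """Toy performance = IPC x relative frequency."""
    ipc = BASE_IPC[iq_size] + (GADGET_IPC_GAIN[iq_size] if gadget else 0.0)
    return ipc * REL_FREQ[iq_size]


def best_core(gadget):
    """Customize: pick the IQ size that maximizes performance."""
    return max(IQ_SIZES, key=lambda iq: performance(iq, gadget))


def conventional_speedup(arbitrary_iq):
    # Arbitrary baseline, gadget bolted on.
    return performance(arbitrary_iq, True) / performance(arbitrary_iq, False)


def ok_speedup():
    # Baseline is the customized core; gadget added without re-tuning.
    iq = best_core(gadget=False)
    return performance(iq, True) / performance(iq, False)


def better_speedup():
    # Re-customize the core with the gadget in place.
    tuned = performance(best_core(gadget=True), True)
    base = performance(best_core(gadget=False), False)
    return tuned / base


if __name__ == "__main__":
    print("conventional (IQ=16):", round(conventional_speedup(16), 3))
    print("OK:", round(ok_speedup(), 3))
    print("better:", round(better_speedup(), 3))
```

With these made-up numbers, the conventional speedup measured on an undersized IQ overstates the gadget's value, while re-customizing moves the optimum to a different IQ size and reports a gain that the "OK" method misses.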

21 Summary
- Customizing superscalar cores has value in application-specific designs and heterogeneous multi-core chips
- Customization captures interplay among program, microarchitecture, and technology
- FabScalar enables the composition of arbitrary superscalar processors, inclusive of technology
- Enabled by canonical view of superscalar pipeline, and a lot of “pre-fab” by students who aren’t paid enough
accepting donations
http://www.tinker.ncsu.edu/ericro/research/fabscalar.htm
Supported by NSF and IBM.

