Exploiting Detachability
Hashem H. Najaf-abadi, Eric Rotenberg

Different jobs, different tools. Different applications have different characteristics, and therefore different resource needs. A single fixed architecture thus compromises the performance of each individual application for the sake of performance across all of them.

Architectural changeability (transformation). In silicon-based technology, the performance of a changeable (polymorphic) design in any fixed configuration is lower than that of a non-changeable implementation of the same configuration.

Changeability in a subcomponent. Changeability can exist at any level of the design hierarchy: within a sub-component, or in the interconnect between sub-components. Interacting sub-components may need to change as well.

Changeability at the logic-circuit level. In an adder, for instance, the carry connections between full adders (F.A.) can be made configurable; likewise the bypass paths between functional units (F.U.).

Changeability at the pipeline level. In the execution stage, for instance: a pipeline of fetch, decode, dispatch, issue, execute, and write-back, where the execute stage can be exchanged for an alternate one.

Changeability at the processor level: an L2 cache shared by multiple cores (Core A, Core B, Core C). At this level, at least, there is no higher level for changeability to spread to.

Heterogeneity. Pros: no low-level changeability. Cons: poor scalability (die area is consumed, burdening access to system resources); inflexible (once configurations are placed in the system they are permanent, while the need for them is user dependent).

Spreading heterogeneity across numerous chips. Pros: increases the overall die area, ameliorating the lack of scalability. Cons: exacerbates the burden on access to system resources, and remains inflexible in the forms of architectural diversity made available.

Exploiting detachability. Detachability is a property that already exists (due to marketing and packaging considerations). Pros: no suboptimality due to limited die area or burdened access to system resources, and flexibility in the forms of architectural diversity offered.

Exploiting detachability, other advantages: a substrate for the gradual adoption of alternate technologies (which tend to be application dependent), and a paradigm in which architects can focus on innovations that enhance architectures for specific applications, rather than tweaking the same old design.

Changeability in real-world applications: a rough automatic design-space exploration for the SPEC2000 integer benchmarks, randomly varying the L1 and L2 cache sizes, the processor width, the issue-queue size, and the clock period.
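The rough random exploration described above can be sketched as follows. This is a minimal illustration, not the tool used in the work: the parameter ranges are invented, and `benchmark_runtime` is a hypothetical stand-in for a cycle-accurate simulator.

```python
import random

# Illustrative parameter ranges, mirroring the knobs varied in the talk
# (the actual ranges used in the study are not given in the transcript).
PARAM_SPACE = {
    "l1_kb":    [8, 16, 32, 64],
    "l2_kb":    [256, 512, 1024, 2048],
    "width":    [2, 3, 4, 5, 6],
    "issue_q":  [16, 32, 64, 128],
    "clock_ns": [0.5, 0.75, 1.0, 1.5],
}

def random_config(rng):
    """Draw one random point from the design space."""
    return {k: rng.choice(v) for k, v in PARAM_SPACE.items()}

def explore(benchmark_runtime, trials=100, seed=0):
    """Keep the configuration with the lowest reported runtime."""
    rng = random.Random(seed)
    best_cfg, best_t = None, float("inf")
    for _ in range(trials):
        cfg = random_config(rng)
        t = benchmark_runtime(cfg)  # stand-in for running a simulation
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t
```

Running `explore` once per benchmark yields one customized configuration per benchmark, which is what the following results tables summarize.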

Customization results. [Table: the best configuration found for each benchmark (bzip, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, crafty), one column per benchmark, with one row per parameter: no. memory-access cycles, no. front-end cycles, processor width, issue-queue size, back-to-back latency of dependent instructions, clock period, no. L1-access cycles, no. L1-cache lines, L1-cache line size, L1-cache associativity, no. L2-access cycles, no. L2-cache lines, L2-cache line size, L2-cache associativity. The numeric values were lost in transcription.]

On each other's architectures. [Table: rows are the benchmarks (bzip, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, crafty) and columns are their customized architectures; each entry gives the performance of the row benchmark on the column architecture, with * marking each benchmark on its own customized architecture, and a final row of column averages. The numeric values were lost in transcription.]
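A cross-performance matrix of this kind can be sketched as follows: each entry is the slowdown of one benchmark on another benchmark's customized architecture, relative to its own. The `runtime` callback is a hypothetical stand-in for simulation; this is an illustration, not the tooling from the talk.

```python
def cross_performance(benchmarks, custom_arch, runtime):
    """Build matrix[b][a]: slowdown of benchmark b on benchmark a's
    customized architecture, normalized to b on its own architecture
    (so the diagonal is 1.0)."""
    matrix = {}
    for b in benchmarks:
        own = runtime(b, custom_arch[b])  # b on its own best architecture
        matrix[b] = {a: runtime(b, custom_arch[a]) / own for a in benchmarks}
    return matrix
```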

Representative architectures. Assigning surrogates: [Figure: each benchmark (gcc, gzip, parser, perl, twolf, vortex, vpr, crafty, gap, bzip, mcf) is mapped to one of a small set of representative (surrogate) customized architectures; the mapping details were lost in transcription.]
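Given a cross-performance matrix (matrix[b][a] = slowdown of benchmark b on benchmark a's customized architecture), surrogate assignment can be sketched as choosing k representative architectures and mapping each benchmark to the representative that slows it down least. The exhaustive search below is a hypothetical illustration of that idea, not the selection method from the talk.

```python
from itertools import combinations

def assign_surrogates(matrix, k):
    """Pick k representative architectures minimizing total slowdown,
    then map every benchmark to its best representative."""
    benchmarks = list(matrix)
    best_reps, best_cost = None, float("inf")
    for reps in combinations(benchmarks, k):
        # Each benchmark uses whichever representative slows it down least.
        cost = sum(min(matrix[b][r] for r in reps) for b in benchmarks)
        if cost < best_cost:
            best_reps, best_cost = reps, cost
    surrogate = {b: min(best_reps, key=lambda r: matrix[b][r])
                 for b in benchmarks}
    return best_reps, surrogate
```

Exhaustive search is only feasible for small benchmark suites; a clustering heuristic would be the natural substitute at scale.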

Customization results for the representative architectures. [Table: configurations of the five representatives (gcc, mcf, parser, vortex, crafty) over the same parameters as the earlier per-benchmark table. Of the values, the processor widths (3, 5, 5, 2, and 6, respectively) and L1-cache associativities (1, 1, 1, 2, and 8, respectively) are recoverable; the remaining numeric values were lost in transcription.]