Reversible Logic for Supercomputing
How to Save the Earth with Reversible Computing

Erik P. DeBenedictis, Sandia National Laboratories
May 5, 2005

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND C

Applications and $100M Supercomputers

[Chart: system performance (100 Teraflops through 1 Zettaflops) versus year. Technology curves: Red Storm/cluster technology, µP (green), best-case logic (red), and nanotech + reversible logic; architecture points for IBM Cyclops, FPGA, and PIM. Application requirements (no schedule provided by source): MEMS optimization; "compute as fast as the engineers can think" [NASA 99]; 100x and 1000x goals [SCaLeS 03]; full global climate [Malone 03]; plasma fusion simulation [Jardin 03].]

References:
[Jardin 03] S. C. Jardin, "Plasma Science Contribution to the SCaLeS Report," Princeton Plasma Physics Laboratory, PPPL-3879 UC-70, available on the Internet.
[Malone 03] Robert C. Malone, John B. Drake, Philip W. Jones, and Douglas A. Rotman, "High-End Computing in Climate Modeling," contribution to the SCaLeS report.
[NASA 99] R. T. Biedron, P. Mehrotra, M. L. Nelson, F. S. Preston, J. J. Rehder, J. L. Rogers, D. H. Rudy, J. Sobieski, and O. O. Storaasli, "Compute as Fast as the Engineers Can Think!" NASA/TM, available on the Internet.
[SCaLeS 03] Workshop on the Science Case for Large-scale Simulation, June 24-25, 2003; proceedings on the Internet.
[DeBenedictis 04] Erik P. DeBenedictis, "Matching Supercomputing to Progress in Science," July 2004, presentation at Lawrence Berkeley National Laboratory; also published as Sandia National Laboratories SAND report SAND P. Sandia technical reports are available through the Sandia technical library.

Objectives and Challenges

Could reversible computing have a role in solving important problems?
– Maybe, because power is a limiting factor for computers and reversible logic cuts power

However, a complete computer system is more than "low power"
– Processing, memory, and communication in the right balance for the application
– Speed must match the user's impatience
– Must use a real device, not just an abstract reversible device

Outline

An Exemplary Zettaflops Problem
The Limits of Current Technology
Arbitrary Architectures for the Current Problem
– Searching the Architecture Space
– Bending the Rules to Find Something
– Exemplary Solution
Conclusions

FLOPS Increases for Global Climate

Issue                                            Scaling   Performance
Clusters now in use (100 nodes, 5% efficient)              10 Gigaflops
Spatial resolution (resolution)                  10^4x     100 Teraflops
Model completeness (more complex physics)        100x      10 Petaflops
New parameterizations (more complex physics)     100x      1 Exaflops
Run length (longer running time)                 100x      100 Exaflops
Ensembles, scenarios (embarrassingly parallel)   10x       1 Zettaflops

Ref. "High-End Computing in Climate Modeling," Robert C. Malone (LANL), John B. Drake (ORNL), Philip W. Jones (LANL), and Douglas A. Rotman (LLNL), 2004.
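The scaling factors above compound multiplicatively from the 10-Gigaflops baseline to 1 Zettaflops. A minimal consistency check of that arithmetic (Python; the factors and labels are taken directly from the table, nothing else is assumed):

```python
# Check that the quoted scaling factors multiply out to 1 Zettaflops.
baseline_flops = 10e9   # 10 Gigaflops: clusters now in use (100 nodes, 5% efficient)

scaling_steps = [
    ("Spatial resolution",    1e4),   # -> 100 Teraflops
    ("Model completeness",    1e2),   # -> 10 Petaflops
    ("New parameterizations", 1e2),   # -> 1 Exaflops
    ("Run length",            1e2),   # -> 100 Exaflops
    ("Ensembles, scenarios",  1e1),   # -> 1 Zettaflops
]

flops = baseline_flops
for issue, factor in scaling_steps:
    flops *= factor
    print(f"{issue:22s} x{factor:<7.0e} -> {flops:.0e} FLOPS")

assert abs(flops / 1e21 - 1) < 1e-9   # 1 Zettaflops, as the table claims
```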

Scientific Supercomputer Limits

Assumption: the supercomputer is the size and cost of Red Storm: US$100M budget; consumes 2 MW wall power; 750 kW to active components.

Physical factor (source of authority):
– Red Storm: 40 Teraflops (contract)
– Projected ITRS improvement to 22 nm: 100x (ITRS committee of experts)
– Lower supply voltage: 2x (expert opinion)
– Improved devices: 4x (estimate)
– Derate for manufacturing margin: 4x (estimate)
– Uncertainty: 6x (estimate; shown as a gap in the chart)
– Reliability limit: 750 kW / (80 k_B T) ≈ 2 x 10^24 logic ops/s (esteemed physicists; T = 60°C junction temperature)
– Derate by 20,000 to convert logic ops to floating point, 64-bit precision (floating-point engineering)

The chart tracks two columns, Microprocessor Architecture and Best-Case Logic, through these factors, marking levels of 80 Teraflops, 8 Petaflops, 32 Petaflops, 200 Petaflops, 800 Petaflops, 1 Exaflops, 4 Exaflops, and 25 Exaflops along the way. The microprocessor-architecture projection ends near 800 Petaflops to 1 Exaflops, the best-case logic limit near 100 Exaflops, with a 125:1 gap marked between the columns.
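A minimal sketch (Python) of the arithmetic behind the best-case-logic figure, using only the numbers quoted on this slide: 750 kW to active components, a reliability threshold of 80 k_B·T per logic operation at a 60°C junction, and 20,000 logic operations per 64-bit floating-point operation:

```python
# Best-case logic limit implied by the slide's assumptions.
k_B = 1.380649e-23            # Boltzmann constant, J/K
T   = 60 + 273.15             # 60 degC junction temperature, in kelvin
P   = 750e3                   # 750 kW delivered to active components

energy_per_logic_op = 80 * k_B * T                  # reliability threshold, J per op
logic_ops_per_s     = P / energy_per_logic_op       # ~2e24, as quoted
flops               = logic_ops_per_s / 20_000      # derate logic ops -> 64-bit flops

print(f"{logic_ops_per_s:.1e} logic ops/s")
print(f"{flops:.1e} FLOPS")                         # ~1e20, i.e. ~100 Exaflops
```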

Supercomputer Expert System

Expert System & Optimizer: looks for the best 3D mesh of generalized MPI-connected nodes, µP, and other options.

Inputs:
– Application/Algorithm: run-time model, as in applications modeling
– Logic & Memory Technology: design rules and performance parameters for various technologies (CMOS, quantum dots, carbon nanotubes, ...)
– Interconnect: speed, power, pin count, etc.
– Physical: cooling, packaging, etc.
– Time Trend: lithography as a function of years into the future

Results:
1. Block-diagram picture of the optimal system (model)
2. Report of FLOPS count as a function of years into the future
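A sketch of how the expert system's inputs and results might be represented in code (Python dataclasses). All field names here are illustrative assumptions for exposition, not the schema of the original tool:

```python
# Illustrative data model for the expert-system inputs and results listed above.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class TechnologyRules:              # logic & memory design rules and performance
    name: str                       # e.g. "CMOS", "Quantum Dots", "C Nanotubes"
    gate_delay_s: float
    energy_per_op_j: float

@dataclass
class Interconnect:                 # speed, power, pin count, etc.
    bandwidth_bytes_per_s: float
    pin_count: int
    power_w: float

@dataclass
class Physical:                     # cooling, packaging, etc.
    max_power_w: float
    max_power_density_w_m2: float

@dataclass
class Inputs:
    runtime_model: Callable         # application/algorithm run-time model
    technology: TechnologyRules
    interconnect: Interconnect
    physical: Physical
    year: int                       # lithography trend is a function of year

@dataclass
class Results:
    block_diagram: Dict             # optimal system (model)
    flops_by_year: Dict[int, float] = field(default_factory=dict)
```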

Sample Analytical Runtime Model

Simple case: finite-difference equation. Each node holds an n x n x n block of grid points.

Volume-area rule:
– Computing ∝ n^3 (volume: n^3 cells)
– Communications ∝ n^2 (face-to-face: n^2 cells per face)

T_step = 6 n^2 C_bytes T_byte + n^3 F_grind / floprate
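The runtime model transcribes directly into code. A minimal sketch (Python); the parameter values in the example call are invented purely for illustration:

```python
def t_step(n: int, c_bytes: float, t_byte: float,
           f_grind: float, floprate: float) -> float:
    """Per-timestep runtime for an n x n x n block per node (volume-area rule):
    six faces of n^2 cells are communicated, n^3 cells are computed."""
    communication = 6 * n**2 * c_bytes * t_byte
    computation   = n**3 * f_grind / floprate
    return communication + computation

# Example with purely illustrative parameters:
#   240 bytes/cell, 1 ns/byte, 10,000 flops/cell, 1 Teraflops per node
print(t_step(n=100, c_bytes=240, t_byte=1e-9, f_grind=1e4, floprate=1e12))
```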

Expert System for Future Supercomputers

Applications Modeling:
– Runtime T_run = f_1(n, design)

Technology Roadmap:
– Gate speed = f_2(year)
– Chip density = f_3(year)
– Cost = $(n, design), ...

Scaling Objective Function:
– I have $C_1 and can wait T_run = C_2 seconds. What is the biggest n I can solve in year Y?

Use the "expert system" to calculate: the maximum n, over all designs, such that $ < C_1 and T_run(n, design) < C_2. Report that n and illustrate the "design".
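A minimal sketch of the scaling objective function (Python), assuming simple stand-in cost and runtime models; the brute-force search and all names are illustrative, not the original expert system:

```python
# Largest problem size n that fits the budget C1 and the wait time C2 in year Y,
# searched over all candidate designs. cost_model and runtime_model stand in for
# the technology-roadmap and applications models described above.

def biggest_n(designs, cost_model, runtime_model,
              C1_dollars, C2_seconds, year,
              n_candidates=range(100, 100_000, 100)):
    best_n, best_design = 0, None
    for design in designs:
        for n in n_candidates:
            dollars = cost_model(n, design, year)
            t_run   = runtime_model(n, design, year)
            if dollars < C1_dollars and t_run < C2_seconds and n > best_n:
                best_n, best_design = n, design
    return best_n, best_design
```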

The Big Issue

Initially, the scaled climate model on the initial reversible nanotech did not meet the constraints. One barely plausible solution required both of the following.

More parallelism:
– 2D → 3D mesh, one cell per processor
– Parallelize the cloud-resolving model and the ensembles

More device speed:
– Consider only the highest-performance published nanotech device (QDCA)
– Consider special-purpose logic with fast logic and low-power memory

ITRS Device Review

Technology   Speed (min-max)    Dimension (min-max)   Energy per gate-op   Comparison
CMOS         30 ps - 1 µs       8 nm - 5 µm           4 aJ
RSFQ         1 ps - 50 ps       300 nm - 1 µm         2 aJ                 Larger
Molecular    10 ns - 1 ms       1 nm - 5 nm           10 zJ                Slower
Plastic      100 µs - 1 ms      100 µm - 1 mm         4 aJ                 Larger + Slower
Optical      100 as - 1 ps      200 nm - 2 µm         1 pJ                 Larger + Hotter
NEMS         100 ns - 1 ms      nm                    1 zJ                 Slower + Larger
Biological   100 fs - 100 µs    6 - 50 µm             0.3 yJ               Slower + Larger
Quantum      100 as - 1 fs      nm                    1 zJ                 Larger
QDCA         100 fs - 10 ps     1 - 10 nm             1 yJ                 Smaller, faster, cooler

Data from the ITRS ERD section; QDCA data from Notre Dame.

An Exemplary Device: Quantum Dots

Pairs of molecules create a memory cell or a logic gate.

Ref. "Clocked Molecular Quantum-Dot Cellular Automata," Craig S. Lent and Beth Isaksen, IEEE Transactions on Electron Devices, Vol. 50, No. 9, September 2003.

1 Zettaflops Scientific Supercomputer

[Chart: energy per operation (Energy/E_k) versus clock rate, 100 GHz to 100 THz, after Timler and Lent; annotations mark the 2004 device level, points about 30x and 3000x faster, and an overall factor of about 10^8.]

How could we increase "Red Storm" from 40 Teraflops to 1 Zettaflops, a factor of 2.5 x 10^7?
Answer:
– > 2.5 x 10^7 power reduction per operation
– Faster devices → more parallelism (> 2.5 x 10^7)
– Smaller devices to fit existing packaging

Ref. "Maxwell's demon and quantum-dot cellular automata," John Timler and Craig S. Lent, Journal of Applied Physics, 15 July 2003.
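The required improvement follows directly from the two performance targets; a one-line check (Python):

```python
# Factor needed to go from Red Storm to a 1-Zettaflops machine.
red_storm_flops = 40e12          # 40 Teraflops
target_flops    = 1e21           # 1 Zettaflops
print(target_flops / red_storm_flops)   # 2.5e7, the >2.5 x 10^7 quoted above
```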

Not Specifically Advocating Quantum Dots

[Chart: the same performance curves, 100 GHz to 100 THz, after Timler and Lent, with a helical-logic data point for comparison.]

– A number of post-transistor devices have been proposed
– The shape of the performance curves has been validated by a consensus of reputable physicists
– However, the validity of any data point can be questioned
– Cross-checking is appropriate; see the helical-logic comparison

Ref. "Maxwell's demon and quantum-dot cellular automata," John Timler and Craig S. Lent, Journal of Applied Physics, 15 July 2003.
Ref. "Helical logic," Ralph C. Merkle and K. Eric Drexler, Nanotechnology 7 (1996) 325–339.

CPU Design

Leading thoughts:
– Implement CPU logic using reversible logic: high efficiency for the component doing the most logic
– Implement state and memory using conventional (irreversible) logic: low efficiency, but not many operations
– Permits programming much like today

[Diagram: conventional memory and CPU state implemented in irreversible logic; CPU logic implemented in reversible logic.]
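A back-of-the-envelope sketch of why this split pays off (Python). The per-operation energies and the logic-to-memory operation ratio below are invented purely for illustration; only the structure of the argument comes from the slide:

```python
# Energy accounting for the reversible-CPU-logic / conventional-memory split.
# All numeric values are illustrative assumptions, not measured data.
E_CONVENTIONAL = 4e-18    # J per op, irreversible logic (illustrative)
E_REVERSIBLE   = 1e-21    # J per op, reversible logic   (illustrative)

logic_ops  = 1e4          # logic operations per memory/state operation (illustrative)
memory_ops = 1.0

all_conventional = (logic_ops + memory_ops) * E_CONVENTIONAL
split_design     = logic_ops * E_REVERSIBLE + memory_ops * E_CONVENTIONAL

print(f"{all_conventional / split_design:.0f}x lower energy with the split")
```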

Atmosphere Simulation at a Zettaflops

Supercomputer is 211K chips, each with 70.7K nodes of 5.77K cells of 240 bytes; solves an 86T = 44.1K x 44.1K x 44.1K cell problem.
System dissipates 332 kW from the faces of a cube 1.53 m on a side, for a power density of 47.3 kW/m^2.
Power: 332 kW active components; 1.33 MW refrigeration; 3.32 MW wall power; 6.65 MW from the power company.
System has been inflated by 2.57x over minimum size to provide enough surface area to avoid overheating.
Chips are 99.22% full, comprising 7.07G logic, 101M memory-decoder, and 6.44T memory transistors. Gate cell edge is 34.4 nm (logic) and 34.4 nm (decoder); memory cell edge is 4.5 nm.
Compute power is 768 EFLOPS, completing an iteration in 224 µs and a run in 9.88 s.
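The headline figures are mutually consistent; a quick arithmetic check (Python) using only the numbers quoted above:

```python
# Consistency checks on the quoted system parameters.
chips, nodes_per_chip, cells_per_node = 211e3, 70.7e3, 5.77e3
total_cells = chips * nodes_per_chip * cells_per_node
print(f"{total_cells:.2e} cells")        # ~8.6e13, i.e. ~86T
print(f"{44.1e3 ** 3:.2e} = 44.1K^3")    # same order, matching the quoted mesh

run_s, iteration_s = 9.88, 224e-6
print(f"{run_s / iteration_s:.0f} iterations per run")   # ~44,100
```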

Performance Curve

[Chart: FLOPS rate on the atmosphere simulation versus year, with curves labeled Cluster, Custom, Custom QDCA Rev. Logic, Rev. Logic, and Microprocessor.]

Conclusions

There are important applications that are believed to exceed the limits of irreversible logic
– At a US$100M budget
– E.g., a solution to global warming

Reversible logic and nanotech point in the right direction
– Low power

Device requirements
– Push the speed-of-light limit
– Substantially sub-k_B T energy per operation
– Molecular scales

Software and algorithms
– Must be much more parallel than today

With all this, it just barely works. The conclusions appear to apply generally.