One-Chip TeraArchitecture, Gheorghe Stefan, March 19, 2009

One-Chip TeraArchitecture Outline
• One-Chip Parallel Engines - an Emerging Market
• One-Chip Parallel Architecture & its Performance
• Integral Parallel Architecture
• A Case Study: BA1024
• Concluding Remarks

One-Chip TeraArchitecture One-Chip Parallel Engines - an Emerging Market
Parallelism is ubiquitous:
• Instruction-Level Parallelism
• Multi-Threaded Execution
• Multi-Core
• Many-Core Engines
• Many-Computer (Message Passing Interface)

One-Chip TeraArchitecture (Performance / Power or Price) & Market
1. Performance-only approach: supercomputing
2. Performance/Price approach: high-end PCs
3. Performance/Price & Performance/Power approach: embedded computing

One-Chip TeraArchitecture SoC Market & Programmability
SoC in the nanometer era asks for:
• High complexity
• High intensity
• High flexibility
In the one-giga-gate-per-chip era, design complexity cannot keep pace with chip size. The key word is therefore PROGRAMMABILITY.

One-Chip TeraArchitecture Embedded Parallel Computing
• “Flexible & Feasible ASIC” = programmable parallel engine
• An ASIC is a circuit, i.e., an inherently heterogeneous parallel system
• Flexibility = programmability
• Feasibility = segregating all kinds of simple parallel structures from the complex program

One-Chip TeraArchitecture One-Chip Parallel Architecture & its Performance
Because a programmable structure competes with the ASIC philosophy, a one-chip parallel architecture must be an integral parallel architecture.
Performance is evaluated according to the weight of each type of instruction: float, pointer, word, half-word, byte.
Examples: on a half-word machine a word instruction executes in 2 cycles; on a word machine a float instruction executes in 20 cycles.

One-Chip TeraArchitecture Weighted Tera Instructions Per Second (TIPS)
Cycle costs: Float op: 20 cycles, Word op: 2 cycles, Half-word op: 1 cycle, Byte op: 0.5 cycles
Medium-intensity float mix: Float op: 10%, Word op: 13%, Half-word op: 30%, Byte op: 37% -> 1 TIPS = 2.75 TOPS
High-intensity float mix: Float op: 25%, Word op: 35%, Half-word op: 12%, Byte op: 28% -> 1 TIPS = 5.96 TOPS
Then: 1 TIPS = 3 – 6 TOPS
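
The TIPS-to-TOPS ratios above are weighted averages of the per-type cycle costs. A minimal Python sketch of that computation, using the mixes and cycle counts from this slide (the remaining share of each mix, presumably pointer operations, is assumed to cost nothing here):

```python
# Cycle (operation) cost per instruction type, as listed on the slide.
CYCLES = {"float": 20, "word": 2, "half_word": 1, "byte": 0.5}

def tops_per_tips(mix):
    """Average operations per instruction for a given mix -> TOPS needed per TIPS."""
    return sum(share * CYCLES[op] for op, share in mix.items())

medium = {"float": 0.10, "word": 0.13, "half_word": 0.30, "byte": 0.37}
high   = {"float": 0.25, "word": 0.35, "half_word": 0.12, "byte": 0.28}

print(tops_per_tips(medium))  # ~2.75 -> 1 TIPS requires ~2.75 TOPS
print(tops_per_tips(high))    # 5.96  -> 1 TIPS requires ~5.96 TOPS
```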

One-Chip TeraArchitecture Integral Parallel Architecture (IPA)
Computation is:
• complex (control intensive)
• intense (data intensive)
Parallelism is:
• data parallelism (almost SIMD)
• time parallelism (a sort of MIMD)
• speculative parallelism (a true MISD)
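
As a rough illustration of the three forms of parallelism named above, a toy Python sketch (the data, functions, and selection rule are invented for illustration; only the three patterns themselves come from the slide):

```python
data = [3, 1, 4, 1, 5, 9, 2, 6]

# Data parallelism (almost SIMD): the same operation applied to every element.
squared = [x * x for x in data]

# Time parallelism (a sort of MIMD): a pipeline of different stages; in hardware,
# different data items occupy different stages at the same time.
def stage1(x): return x + 1
def stage2(x): return 2 * x
pipelined = [stage2(stage1(x)) for x in data]

# Speculative parallelism (a true MISD): several candidate computations on the same
# datum performed in parallel, with one result selected afterwards.
def candidates(x): return (x - 1, x, x + 1)
speculated = [max(candidates(x)) for x in data]  # the selection rule is arbitrary here

print(squared, pipelined, speculated)
```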

One-Chip TeraArchitecture Complex vs. Intense
Intense computation:
• high-latency functional pipe
• array computation
• buffer-based hierarchy
• 400 GOPS (half-word ops)
• 0.5 cm², 6 W, 0.4 GHz
• 800 GOPS/cm², ~66 GOPS/W
Complex computation:
• OS-oriented multi-threading
• cache-based hierarchy
• 4 GIPS & 2 GFLOPS
• 1.5 cm², 50 W, 2 GHz
• (2.6 GIPS + 1.3 GFLOPS)/cm²
• (0.08 GIPS + 0.04 GFLOPS)/W
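
The per-area and per-power figures above are simple ratios of the raw numbers on this slide; a quick check in Python:

```python
# Intense side: 400 GOPS in 0.5 cm^2 at 6 W
print(400 / 0.5)          # 800.0  GOPS/cm^2
print(400 / 6)            # ~66.7  GOPS/W

# Complex side: 4 GIPS + 2 GFLOPS in 1.5 cm^2 at 50 W
print(4 / 1.5, 2 / 1.5)   # ~2.67 GIPS/cm^2, ~1.33 GFLOPS/cm^2
print(4 / 50, 2 / 50)     # 0.08 GIPS/W, 0.04 GFLOPS/W
```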

One-Chip TeraArchitecture Embedded Parallel Organization
A coarse-grain Multi-Core engine for complex computation & a fine-grain Many-Core engine for intense computation:
• Multi-Core: 2 – 16 multi-threaded complex processors
• Many-Core: 256 – 4096 small & simple execution units (EU) or processing elements (PE)

One-Chip TeraArchitecture Chip Organization
Block diagram: Multi-Core with Cache, Many-Core with Buffer, Interconnection Fabric, DDR SDRAM Interface.

One-Chip TeraArchitecture A Case Study: BA1024
The organization of the BA1024:
• multi-core area of 4 MIPS processors
• many-core data-parallel area of 1024 simple PEs
• speculative time-parallel pipe of 8 PEs
• interfaces (DDR, PCI, video & audio interfaces for 2 HDTV channels)

One-Chip TeraArchitecture Overall performance of the BA1024
• ~400 GOPS
• 6.4 GB/sec: external bandwidth
• 800 GB/sec: internal bandwidth
• > 60 GOPS/Watt
• > 8 GOPS/mm²
• 65 nm, standard process
Note: 1 OP = 16-bit simple integer operation (excluding multiplication)
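
A rough plausibility check on these figures in Python. The 1024 PEs x 16-bit x 0.4 GHz operating point, and the assumption that internal bandwidth scales as PEs x operand width x clock, are inferences from the other slides rather than statements made here:

```python
pes, width_bytes, clock_ghz = 1024, 2, 0.4   # assumed operating point (16-bit PEs)

gops = pes * clock_ghz                        # one 16-bit op per PE per cycle
print(gops)                                   # ~410 GOPS, consistent with ~400 GOPS

print(gops / 6)                               # ~68 GOPS/W    for the ~6 W quoted earlier
print(gops / 50)                              # ~8  GOPS/mm^2 for ~50 mm^2 (0.5 cm^2)

print(pes * width_bytes * clock_ghz)          # ~819 GB/s internal bandwidth (~800 GB/s)
```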

One-Chip TeraArchitecture Full Vector Operations
Element-wise operations on 16-bit data operands, where OP is +, -, *, XOR, etc.:
• Line k = Line i OP Line j
• Line k = Line i OP scalar value (the scalar is repeated for all elements)
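
A sketch of the full-vector semantics in Python, with a NumPy array standing in for a hardware line of 16-bit operands (the line length and variable names are illustrative):

```python
import numpy as np

N = 1024                                      # one "line": one 16-bit element per PE
line_i = np.arange(N, dtype=np.int16)
line_j = np.full(N, 3, dtype=np.int16)

line_k = line_i + line_j                      # Line k = Line i OP Line j (element-wise)
line_k = line_i * np.int16(5)                 # Line k = Line i OP scalar (repeated for all elements)
```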

One-Chip TeraArchitecture Conditioned Operations
The same vector operations (Line k = Line i OP Line j, with OP = +, -, *, XOR, etc.) executed only in the positions selected by a condition. This enables selective processing based on data content.
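
The conditioned form is the same element-wise operation gated by a per-element condition; a sketch continuing the NumPy analogy (the mask mechanism here only illustrates the idea, it is not the hardware mechanism):

```python
import numpy as np

N = 1024
line_i = np.arange(N, dtype=np.int16)
line_j = np.full(N, 3, dtype=np.int16)
line_k = np.zeros(N, dtype=np.int16)

mask = line_i > 100                                 # condition derived from the data itself
line_k = np.where(mask, line_i + line_j, line_k)    # the operation takes effect only where the condition holds
```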

One-Chip TeraArchitecture Multi-Core Organization
• Multi-threaded programming model
• Each core supports: block multi-threading and interleaved multi-threading
• The number of cores is limited by the random access to the external memory
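
A toy Python sketch of the difference between the two threading disciplines (the thread contents and the "MISS" marker are invented; only the two switching policies come from the slide):

```python
from itertools import zip_longest

threads = {"A": ["a1", "a2", "MISS", "a3"],
           "B": ["b1", "b2", "b3", "b4"]}

def interleaved(threads):
    """Interleaved multi-threading: switch to the next thread every cycle."""
    order = []
    for ops in zip_longest(*threads.values()):
        order += [op for op in ops if op is not None]
    return order

def block(threads):
    """Block multi-threading: run one thread until a long-latency event, then switch."""
    order, pending = [], {t: list(ops) for t, ops in threads.items()}
    while any(pending.values()):
        for ops in pending.values():
            while ops:
                op = ops.pop(0)
                order.append(op)
                if op == "MISS":              # long-latency event: yield the core
                    break
    return order

print(interleaved(threads))   # ['a1', 'b1', 'a2', 'b2', 'MISS', 'b3', 'a3', 'b4']
print(block(threads))         # ['a1', 'a2', 'MISS', 'b1', 'b2', 'b3', 'b4', 'a3']
```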

One-Chip TeraArchitecture Extrapolating BA1024 performance
Medium float environment: 45 nm, standard process, 1 cm², 4096 EUs, 0.7 GHz, 2.8 TOPS = 1 TIPS, ~25 W
High float environment: 45 nm, standard process, 1 cm², 4096 EUs, 1.5 GHz, 6 TOPS = 1 TIPS, ~50 W
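
The EU counts and clock rates above are consistent with the earlier TIPS-to-TOPS ratios; a quick back-calculation in Python, assuming one 16-bit operation per EU per cycle:

```python
# TOPS targets (= 1 TIPS under each mix) and extrapolated clock rates from the slide.
for label, tops_target, clock_ghz in [("medium", 2.8, 0.7), ("high", 6.0, 1.5)]:
    eus_needed = tops_target * 1000 / clock_ghz   # GOPS needed / (GOPS per EU per GHz)
    print(label, round(eus_needed))               # ~4000 in both cases, i.e. about 4096 EUs
```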

One-Chip TeraArchitecture Concluding Remarks
1. Segregating the complex from the intense is the key.
2. Using all forms of parallelism allows competition with the ASIC approach.
3. Implementation issues limit the true scalability.
4. The organization must be kept as simple as possible so that it can be easily hidden from the user.
5. "The Landscape of Parallel Computing Research: A View from Berkeley" is a good tool to evaluate our approach.

One-Chip TeraArchitecture Main technical contributors to the project:
• Emanuele Altieri, BrightScale Inc., CA
• Frank Ho, BrightScale Inc., CA
• Mihaela Malita, St. Anselm College, NH
• Bogdan Mitu, BrightScale Inc., CA
• Marius Stoian, PUB, Romania
• Dominique Thiebaut, Smith College, MA
• Dan Tomescu, BrightScale Inc., CA