Giga-Scale System-On-A-Chip International Center on System-on-a-Chip (ICSOC) Jason Cong University of California, Los Angeles Tel: 310-206-2775, Email:

Presentation transcript:

Giga-Scale System-On-A-Chip International Center on System-on-a-Chip (ICSOC) Jason Cong University of California, Los Angeles Tel: , (Other participants are listed inside)

Project Summary
- Develop a new design methodology to enable efficient giga-scale integration for system-on-a-chip (SOC) designs
- The project includes three major components:
  - SOC synthesis tools and methodologies
  - SOC verification, test, and diagnosis
  - SOC design driver: network processor

Research Team by Institutions
- US
  - UCLA: Jason Cong
  - UC Santa Barbara: Tim Cheng
- Taiwan
  - NTHU: Shi-Yu Huang, Tingting Hwang, J. K. Lee, Youn-Long Lin, C. L. Liu, Cheng-Wen Wu, Allen Wu
  - NCTU: Jing-Yang Jou
- China
  - Tsinghua Univ.: Jinian Bian, Xianlong Hong, Zeyi Wang, Hongxi Xue
  - Peking Univ.: Xu Cheng
  - Zhejiang Univ.: Xiaolang Yan

Current Research Team
- US
  - UCLA: Jason Cong
  - UC Santa Barbara: Tim Cheng
- Taiwan
  - NTHU: Shi-Yu Huang, Tingting Hwang, J. K. Lee, Youn-Long Lin, C. L. Liu, Cheng-Wen Wu, Allen Wu
  - NCTU: Jing-Yang Jou
- China
  - Tsinghua Univ.: Jinian Bian, Xianlong Hong, Zeyi Wang, Hongxi Xue
  - Peking Univ.: Xu Cheng
  - Zhejiang Univ.: Xiaolang Yan
- Several new faculty members in the 7 institutions
- Guest members from National University of Singapore, Purdue Univ., and UCLA (EE Dept)

Thrust 1 -- SOC Synthesis Environment/Methodology (Led by Jason Cong)
[Figure: synthesis flow diagram with the following labeled blocks]
- Design Spec
- VHDL/C Co-Simulation
- Design Partitioning
- Code Generation for Retargetable Compiler and Assembler Generator
- DSP Synthesis and Optimization
- FPGA Synthesis and Technology Mapping
- ASIC Synthesis
- Interconnect-Driven High-level Synthesis
- Synthesis for IP Reuse
- Physical Synthesis for Full-Chip Assembly
- Targets: Embedded Processors, DSPs, Embedded FPGAs, Customized Logic

Interconnect Bottleneck in Nanometer Designs
- ITRS’ um Tech
  - 5.63 GHz across-chip clock
  - 800 mm² (28.3 mm x 28.3 mm)
- IPEM BIWS estimations
  - Buffer size: 100x
  - Driver/receiver size: 100x
- On semi-global layer (tier 3):
  - Can travel up to 11.4 mm in one cycle
  - Need 5 clock cycles from corner to corner
- 2nd challenge: single-cycle full-chip synchronization is no longer possible
  - Not supported by the current CAD toolset
  - About to happen soon
[Figure: regions of the chip labeled by clock count, up to 5 clocks]
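The corner-to-corner cycle count quoted on this slide can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the signal follows a Manhattan (rectilinear) route, so the distance is the sum of the two die edges. The slide does not state the routing model, and the function name is illustrative.

```python
import math

def cycles_to_cross(width_mm, height_mm, reach_mm_per_cycle):
    """Clock cycles needed to go corner to corner on a Manhattan route."""
    # Assumption: rectilinear routing, so distance = width + height.
    distance_mm = width_mm + height_mm
    return math.ceil(distance_mm / reach_mm_per_cycle)

# Values from the slide: 28.3 mm x 28.3 mm die, 11.4 mm reach per cycle.
# 56.6 mm / 11.4 mm is about 4.96, which rounds up to 5 cycles.
print(cycles_to_cross(28.3, 28.3, 11.4))
```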

Regular Distributed Register Architecture (2)
- Use register banks:
  - Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication
  - Highly regular
[Figure: grid of islands joined by global interconnect. Each island (width Wi, height Hi) holds a Local Computational Cluster (LCC) with ADD, MUX, and MUL units under a cluster area constraint, a register file split into 1-cycle, 2-cycle, ..., k-cycle banks, and an FSM]

MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture
[Figure: MCAS flow, with the following stages]
- Inputs: C / VHDL, RDR architecture specification, target clock period
- CDFG generation, producing the CDFG
- Resource allocation, producing resource constraints
- Functional unit binding, producing the Interconnected Component Graph (ICG)
- Scheduling-driven placement, producing location information
- Placement-driven rebinding & scheduling
- Register and port binding
- Datapath & FSM generation
- Outputs: RTL VHDL files, floorplan constraints, multi-cycle path constraints
[Figure example: units annotated Alu1 1,5,10; Alu2 2,6,9; Mul1 4,8,11; Mul2 3,7,12 before placement-driven rebinding, and Mul1 4,8,12; Mul2 3,7,11 after, with the schedule drawn over Cycle1 to Cycle7]
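To illustrate why placement information matters to scheduling in a flow like this, here is a minimal sketch of an as-soon-as-possible schedule in which values crossing between islands take extra cycles. It is not the MCAS algorithm. The assumptions are mine: every operation takes one cycle, each island hop costs one extra cycle, and resource limits are ignored. All names are hypothetical.

```python
def island_hops(a, b):
    """Manhattan distance between two island grid coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def asap_schedule(ops, preds, island):
    """Earliest start cycle for each operation.

    ops must be listed in topological order. Assumptions (not from the
    slides): 1 cycle per operation, 1 extra cycle per island hop for
    data transfer, no resource constraints.
    """
    start = {}
    for op in ops:
        ready = [
            start[p] + 1 + island_hops(island[p], island[op])
            for p in preds.get(op, [])
        ]
        start[op] = max(ready, default=0)
    return start

ops = ["mul1", "mul2", "add"]
preds = {"add": ["mul1", "mul2"]}

# Same dataflow, two placements: one operand produced two hops away,
# versus everything in one island.
far = {"mul1": (0, 0), "mul2": (1, 1), "add": (0, 0)}
near = {"mul1": (0, 0), "mul2": (0, 0), "add": (0, 0)}

print(asap_schedule(ops, preds, far)["add"])   # 3
print(asap_schedule(ops, preds, near)["add"])  # 1
```

Under these assumptions the same dataflow finishes two cycles later when one operand has to cross two islands, which is the kind of effect a placement-aware rebinding step would try to reduce.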

Experimental Results (3)
- MCAS basic flow vs. Synopsys’ Behavioral Compiler (BC) on Virtex-II
- Synopsys Behavioral Compiler setting: default (optimizing latency)
- Average latency ratio of MCAS vs. BC: 69%
[Figure: latency and resource comparison charts]

Optimality Study of Large-Scale Circuit Placement
- Construction of Placement Examples with Known Optimal (PEKO) [C. Chang et al., 2003]
  - Construct instances with a known optimal solution using the characteristics of the original problem
  - First quantitative evaluation of the optimality of the circuit placement problem
  - Existing placement algorithms can be 70% to 150% away from the optimal
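A gap such as "70% away from the optimal" can be measured once an instance with a known optimal placement exists. The sketch below assumes half-perimeter wirelength (HPWL) as the quality metric, a common choice in placement that the slide does not name. The tiny four-cell instance and all function names are made up for illustration.

```python
def hpwl(net, positions):
    """Half-perimeter wirelength of one net: bounding-box width + height."""
    xs = [positions[cell][0] for cell in net]
    ys = [positions[cell][1] for cell in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(nets, positions):
    return sum(hpwl(net, positions) for net in nets)

def gap_from_optimal(nets, positions, optimal_positions):
    """Relative excess wirelength, e.g. 0.70 means 70% above optimal."""
    best = total_wirelength(nets, optimal_positions)
    return (total_wirelength(nets, positions) - best) / best

# Hypothetical instance built so that every net joins adjacent cells in
# the reference placement; each net has length 1, which cannot be beaten.
nets = [("a", "b"), ("c", "d"), ("a", "c")]
optimal = {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (1, 1)}
swapped = {"a": (0, 0), "d": (1, 0), "c": (0, 1), "b": (1, 1)}

print(total_wirelength(nets, optimal))          # 3
print(total_wirelength(nets, swapped))          # 5
print(gap_from_optimal(nets, swapped, optimal)) # about 0.67
```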

High Interest in the Community
- Coverage in two EE Times articles
  - Placement tools criticized for hampering IC designs [Feb’03]
  - IC placement benchmarks needed, researchers say [April’03]
- More than 60 downloads from our website
  - Cadence, IBM, Intel, Magma, Mentor Graphics, Synopsys, etc.
  - CMU, SUNY, UCB, UCSB, UCSD, UIC, UMichigan, UWaterloo, etc.
- Used in every placement since its publication

1. Synthesis & Verification
- Hardware/Software Partition:
  - Propose an SSS-based H/S partition algorithm (ASICON2003)
  - Better solution than SA and less runtime than Tabu
- High-level Synthesis:
  - Re-synthesis algorithm after floorplanning for timing optimization (ASICON2003)
  - Do floorplanning based on the initial scheduling
  - After floorplanning, do re-scheduling and re-allocation by the force-balance method
- Controller Synthesis:
  - A Heuristic State Minimization Algorithm For Incompletely Specified Finite State Machine (ASICON2003, JCST)

2. Floorplanning & Interconnect Planning
- Based on the proposed Corner Block List (CBL) representation, propose several extended corner block lists (ECBL, CCBL, and SUB-CBL) to speed up floorplanning and handle more complicated L/T-shaped and rectilinear-shaped blocks.
- Propose floorplanning algorithms with geometric constraints such as boundary, abutment, and L/T-shaped blocks.
- Propose integrated floorplanning and buffer planning algorithms that take congestion into account.
- Using research results from UCLA on interconnect planning.
- About 30 papers published in DAC, ICCAD, ISPD, ASPDAC, ISCAS, and Transactions.

3. P/G Network Analysis & Optimization
- Propose an Area Minimization of Power Distribution Network Using Efficient Nonlinear Programming Techniques (ICCAD2001, accepted by IEEE Trans. on CAD)
- Propose a decoupling capacitance optimization algorithm for Robust On-Chip Power Delivery (ASPDAC2004, ASICON2003)

4. Global Routing & Special Routing
- Propose several global routing algorithms that optimize congestion, timing, or both
- Papers were published in ASPDAC, ISCAS, and IEEE Transactions.

5. Parasitic R/L/C Extraction
- 3-D R/C Extraction using Boundary Element Method (BEM)
- Quasi-Multiple Medium (QMM) BEM algorithms
- Hierarchical Block BEM (HBBEM) technique
- Fast 3-D Inductance Extraction (FIE)
- Papers were published in ASPDAC, ASICON, and IEEE Transactions on MTT

Thrust 2 -- SOC Verification, Test, and Diagnosis (Led by Tim Cheng)
[Figure: Verification and Testing diagram with the following labeled blocks]
- Enabling techniques for semi-formal functional verification
- Integrated framework for simulation, vector generation and model checking
- Testing and diagnosis for heterogeneous SOC
- Self-testing using on-chip programmable components
- Self-testing for on-chip analog/mixed-signal components
- New test techniques for deep-submicron embedded memories
- Scalable constraint-solving techniques
- Automatic/semi-automatic functional vector generation from HDL code

Key Results - Verification
- Developed and released ATPG-based SAT solvers for circuits (Univ. of California, Santa Barbara)
  - Integrating structural ATPG and SAT techniques with new conflict learning
  - CSAT: Fast combinational solver (released in March 2003)
    - Demonstrated X speedup over state-of-the-art SAT solvers on industrial test cases (reported by Intel and Calypto)
    - Has been integrated into Intel’s FV verification system and a startup’s verification engine
    - Publications: DATE2003 and DAC2003
  - Satori2: Fast sequential solver (released in Dec. 2003)
    - Demonstrated 10X-200X speedup over a commercial, sequential ATPG engine on public benchmark circuits
    - Publications: ICCAD2003, HLDVT2003 and ASPDAC2004
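For readers unfamiliar with what a circuit SAT problem looks like, the following sketch encodes a three-gate circuit as CNF clauses (the standard Tseitin encoding) and asks whether its output can be forced to 0. Exhaustive enumeration stands in for a real solver here. This is not how CSAT or Satori2 work, and the circuit and helper names are invented for illustration.

```python
from itertools import product

# Literals are signed integers: 3 means variable 3 is true, -3 means false.

def and_gate(out, a, b):
    """Clauses for out <-> (a AND b)."""
    return [[-out, a], [-out, b], [out, -a, -b]]

def or_gate(out, a, b):
    """Clauses for out <-> (a OR b)."""
    return [[out, -a], [out, -b], [-out, a, b]]

def not_gate(out, a):
    """Clauses for out <-> (NOT a)."""
    return [[-out, -a], [out, a]]

def solve(clauses, num_vars):
    """Return a satisfying assignment as a tuple of booleans, or None.

    Brute force over all assignments; only suitable for toy circuits.
    """
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return bits
    return None

# Hypothetical circuit: x3 = x1 AND x2, x4 = NOT x1, x5 = x3 OR x4.
circuit = and_gate(3, 1, 2) + not_gate(4, 1) + or_gate(5, 3, 4)

# Can the output x5 be 0? Add the unit clause [-5] as the objective.
print(solve(circuit + [[-5]], 5))
```

The only assignment that drives the output to 0 sets x1 true and x2 false, which is the kind of input vector an ATPG or verification engine would report.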

Key Results - Testing
A new Statistical Delay Testing and Diagnosis framework consisting of five major components (UCSB):
- Selection/generation of high-quality tests for target paths [ITC’01][DATE 2004]
  - Identifying tests that activate longer delay along the target path
- Delay fault diagnosis based on statistical timing model [DATE’03, VTS’03, DAC’03]
  - Ref: Krstic, Wang, Cheng, & Abadir, DATE’03 (Best Paper Award in Test)
- Statistical timing analysis
- Statistical critical path selection [DAC’02, ICCAD’02]
  - Selecting statistical long & true paths whose tests maximize detection of parametric failures
- Path coverage metric [ASPDAC’03]
  - Estimating the quality of a path set
[Figure: framework blocks: Defect Injection & Simulation; Statistical Timing Analysis Framework (cell-based characterization); Static Timing Analysis; Dynamic Timing Simulator; Path Filtering; Critical Path Selection; Diagnosis; ATPG/Pattern Selection]

Key Results - Testing
- On-Chip Jitter Extraction for Bit-Error-Rate (BER) Testing of Multi-GHz Signal (UCSB)
  - Using on-chip, single-shot measurement unit to sample signal periods for spectral analysis
  - Demonstrated, through simulation, accurate extraction of multiple sinusoids and random jitter components for a 3GHz signal
  - Publications: ASPDAC2004 and DATE2004
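The idea of recovering a sinusoidal jitter component from sampled signal periods can be sketched with a plain discrete Fourier transform. This is an illustration only, not the UCSB method. The synthetic data (a 3 GHz signal with one 5 ps sinusoidal component and no random jitter), the sample count, and the function names are all assumptions.

```python
import cmath
import math

def spectrum(samples):
    """Single-sided DFT magnitudes of the samples after removing the mean."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(centered))
        mags.append(abs(acc) / n)
    return mags

def dominant_jitter(periods_ps):
    """Return (bin index, amplitude in ps) of the strongest sinusoid.

    The factor of 2 converts a single-sided magnitude to an amplitude;
    it holds for bins strictly between 0 and n/2.
    """
    mags = spectrum(periods_ps)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k, 2 * mags[k]

# Synthetic measurement: nominal period of a 3 GHz signal plus a 5 ps
# sinusoidal jitter that completes 8 cycles over 64 sampled periods.
NOMINAL_PS = 1e12 / 3e9
periods = [NOMINAL_PS + 5.0 * math.sin(2 * math.pi * 8 * i / 64)
           for i in range(64)]

bin_index, amplitude_ps = dominant_jitter(periods)
print(bin_index, round(amplitude_ps, 3))
```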

Thrust 3 -- Design Driver: Network Security Processor (Led by Prof. C. W. Wu)
- Applications: IPSec, SSL, VPN, etc.
- Functionalities:
  - Public key: RSA, ECC
  - Secret key: AES
  - Hashing (message authentication): HMAC (SHA-1/MD5)
  - Truly random number generator (FIPS 140-1, 140-2 compliant)
- Target technology: 0.18 μm or below
- Clock rate: 200MHz or higher (internal)
- 32-bit data and instruction word
- 10Gbps (OC192)
- Power: 1 to 10mW/MHz at 3V (LP to HP)
- Die size: 50 mm²
- On-chip bus: AMBA (Advanced Microcontroller Bus Architecture)
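As a software reference point for the hashing function listed above, this sketch computes and checks an HMAC tag with SHA-1 or MD5 using Python's standard `hmac` and `hashlib` modules. It says nothing about the processor's hardware implementation. The key, message, and helper names are placeholders.

```python
import hashlib
import hmac

# The two hash functions named on the slide.
_DIGESTS = {"sha1": hashlib.sha1, "md5": hashlib.md5}

def hmac_tag(key: bytes, message: bytes, algorithm: str = "sha1") -> str:
    """Hex-encoded HMAC tag of the message under the given key."""
    return hmac.new(key, message, _DIGESTS[algorithm]).hexdigest()

def verify_tag(key: bytes, message: bytes, tag: str,
               algorithm: str = "sha1") -> bool:
    """Constant-time comparison of a received tag with the expected one."""
    return hmac.compare_digest(hmac_tag(key, message, algorithm), tag)

key = b"example-key"          # placeholder key
message = b"example payload"  # placeholder message

tag = hmac_tag(key, message, "sha1")
print(len(tag))                                   # 40 hex characters for SHA-1
print(verify_tag(key, message, tag))              # True
print(verify_tag(key, b"tampered payload", tag))  # False
```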

Encryption Modules (PKEM)
- Public key encryption module
  - Operations: 32-bit word-based modular multiplication
  - Multiplication over GF(p) and GF(2^m)
- An RSA cryptography engine with small area overhead and high speed
  - Scalable word-width
  - TSMC 0.35 μm
  - 34K gates (1.7 × 1.8 mm²)
  - 100MHz clock
  - Scalable key length
  - Throughput
    - 512-bit key: 1.79Kbps/MHz
    - 1024-bit key: 470bps/MHz
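To make "32-bit word-based modular multiplication" concrete, here is a software sketch that multiplies modulo n by consuming one operand 32 bits at a time, then builds modular exponentiation (the core RSA operation) on top of it. The module's actual algorithm is not given on the slide, so this word-serial Horner scheme is only one possible interpretation. The toy key values are for illustration.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1

def to_words(x):
    """Split a non-negative integer into 32-bit words, most significant first."""
    words = []
    while x:
        words.append(x & WORD_MASK)
        x >>= WORD_BITS
    return words[::-1] or [0]

def modmul_wordwise(a, b, n):
    """(a * b) mod n, processing a one 32-bit word per step."""
    r = 0
    for w in to_words(a):
        # Shift the running result by one word, add the partial product,
        # and reduce so intermediate values stay bounded.
        r = ((r << WORD_BITS) + w * b) % n
    return r

def modexp(base, exp, n):
    """Left-to-right square-and-multiply using the word-wise multiplier."""
    result = 1 % n
    for bit in bin(exp)[2:]:
        result = modmul_wordwise(result, result, n)
        if bit == "1":
            result = modmul_wordwise(result, base, n)
    return result

# Toy RSA parameters (far too small for real use): n = 61 * 53.
n, e, d = 3233, 17, 2753
ciphertext = modexp(65, e, n)
print(modexp(ciphertext, d, n))  # recovers 65
```

Because the key length only changes the number of words processed, the same loop covers 512-bit and 1024-bit moduli, which is one way to read the "scalable key length" bullet.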

Encryption Modules (SKEM)
- Secret key encryption module
  - Operations: matrix operations, manipulation
- AES cryptography
  - 32-bit external interface
  - 58K gates
  - Over 200MHz clock
  - Throughput: 2Gbps
  - Support key length of 128/192/256 bits

Technology: TSMC 0.25 μm CMOS
Package: 128CQFP
Core Size: 1,279 x 1,271 μm²
Gate Count: 63.4K
Max. Freq.: 250MHz
Throughput: Gbps (128-bit key), Gbps (192-bit key), Gbps (256-bit key)

Journal Publications
- C.-T. Huang and C.-W. Wu, "High-speed easily testable Galois-field inverter", IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 9, pp , Sept
- S.-A. Hwang and C.-W. Wu, "Unified VLSI systolic array design for LZ data compression", IEEE Trans. VLSI Systems, vol. 9, no. 4, pp , Aug
- C.-H. Wu, J.-H. Hong, and C.-W. Wu, "VLSI design of RSA cryptosystem based on the Chinese Remainder Theorem", J. Inform. Science and Engineering, vol. 17, no. 6, pp , Nov
- J.-H. Hong and C.-W. Wu, "Cellular array modular multiplier for the RSA public-key cryptosystem based on modified Booth's algorithm", IEEE Trans. VLSI Systems, vol. 11, no. 3, pp , June
- C.-P. Su, T.-F. Lin, C.-T. Huang, and C.-W. Wu, "A high-throughput low-cost AES processor", IEEE Communications Magazine, vol. 41, no. 12, pp , Dec

Conference Publications
- J.-H. Hong and C.-W. Wu, "Radix-4 modular multiplication and exponentiation algorithms for the RSA public-key cryptosystem", in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Yokohama, Jan. 2000, pp
- J.-H. Hong, P.-Y. Tsai, and C.-W. Wu, "Interleaving schemes for a systolic RSA public-key cryptosystem based on an improved Montgomery's algorithm", in Proc. 11th VLSI Design/CAD Symp., Pingtung, Aug. 2000, pp
- C.-H. Wu, J.-H. Hong, and C.-W. Wu, "An RSA cryptosystem based on the Chinese Remainder Theorem", in Proc. 11th VLSI Design/CAD Symp., Pingtung, Aug. 2000, pp
- C.-H. Wu, J.-H. Hong, and C.-W. Wu, "RSA cryptosystem design based on the Chinese Remainder Theorem", in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Yokohama, Jan. 2001, pp
- Y.-C. Lin, C.-P. Su, C.-W. Wang, and C.-W. Wu, "A word-based RSA public-key crypto-processor core", in Proc. 12th VLSI Design/CAD Symp., Hsinchu, Aug
- T.-F. Lin, C.-P. Su, C.-T. Huang, and C.-W. Wu, "A high-throughput low-cost AES cipher chip", in Proc. 3rd IEEE Asia-Pacific Conf. ASIC, Taipei, Aug. 2002, pp
- Y.-T. Lin, C.-P. Su, C.-T. Huang, C.-W. Wu, S.-Y. Huang, and T.-Y. Chang, "Low-power embedded memory architecture design for SOC", in Proc. 13th VLSI Design/CAD Symp., Taitung, Aug. 2002, pp
- M.-C. Sun, C.-P. Su, C.-T. Huang, and C.-W. Wu, "Design of a scalable RSA and ECC crypto-processor", in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Kitakyushu, Jan. 2003, pp , (Best Paper Award).
- C.-P. Su, T.-F. Lin, C.-T. Huang, and C.-W. Wu, "A highly efficient AES cipher chip", in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Kitakyushu, Jan. 2003, pp , (Design Contest Special Feature Award).
- J.-H. Hong, C.-L. Liu, B.-Y. Tsai, and C.-W. Wu, "A radix-4 modular multiplier for fast RSA public-key cryptosystem", in Proc. 14th VLSI Design/CAD Symp., Hualien, Aug. 2003, pp
- M.-Y. Wang, C.-P. Su, C.-T. Huang, and C.-W. Wu, "An HMAC processor with integrated SHA-1 and MD5 algorithms", in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Yokohama, Jan (to appear).