SPARK: Accelerating ASIC designs through parallelizing high-level synthesis
Sumit Gupta, Rajesh Gupta
© 2003 Spark Team, Confidential
Outline
The target
The problem
The technology
The competition
The market opportunity
The people
The status
The plan
A Chip Is A Wonderful Thing!
A typical chip, circa 2006
– 50 square millimeters
– 50 million transistors
– 1-10 GHz, MOP/sq mm, MIPS/mW
– 300 mm, 10,000 units/wafer, 20K wafers/month
– $5 per part
It does not matter what you build
– Processor, MEMS, Networking, Wireless, Memory
But it takes $20M to build one today, going to $50+M
– So there is a strong incentive to port your application, system, or box to the “chip”
But Design Decisions Matter!
Technical Target
Anyone and everyone with a technology IP to grind (build on-chip)
– E.g., WLAN, Cellphone Chips: about 50 GOPS in baseband (BB) processing
– and about 72 other application ‘markets’ enhanced by ASIC/FPGA parts
More technically
– Behavioral descriptions with complex and nested conditionals and loops.
The Problem
Doing chip design in a system house is an increasingly costly proposition
– Case Study: Conexant in a chip
  9 months from PRD to parts
  7 months from PRD to synthesizable RTL
The pain is in getting the algorithm right for the chip implementation
Would love a “compiler”
– but “push-buttons” just do not work.
Enter High-Level Synthesis
[Figure: system design flow. Labels: Task Analysis; HW/SW Partitioning; Hardware Behavioral Description; Software Behavioral Description; High Level Synthesis; Software Compiler; target blocks ASIC, Processor Core, Memory, FPGA, I/O.]
Poor QOR, even Poor Controllability
[Figure: the behavioral fragment below shown as a flow graph (an If Node on c with T and F branches) next to an architecture of Memory, ALU, Control and Data path.]
x = a + b;
c = a < b;
if (c) then d = e - f;
else g = h + i;
j = d * g;
l = e + x;
The Technology: Enter SPARK
[Figure: SPARK flow. Labels: C Input; Source-Level Compiler Transformations; Original CDFG; Scheduling Compiler & Dynamic Transformations; Optimized CDFG; Scheduling & Binding; VHDL Output.]
By the time you get to the control-data flow graph (CDFG), it is already too late
Parallelize (judiciously) and fold it into HLS.
Why SPARK, Why Now?
The chip designer is finally
– letting go of the cycle boundary in design
– being replaced by non-chip types
Education and awareness through
– Synopsys Behavioral Compiler
– But not ready to be the dominator…
SPARK changes the landscape
– Parallelizing compilation as the ‘power tool’
SPARK Core Strengths
Focus on
– Transformations that increase the amount of parallelism available in the source description
– Tight integration with parallelizing compiler transformations
Provide an HLS Toolbox for the micro-architect
– Fire the circuit designer.
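As a rough illustration of a parallelism-increasing transformation, the sketch below applies speculation to the small conditional fragment from the "Poor QOR" slide. It is a hand-written C sketch under assumptions, not SPARK output: the function names, the `result_t` struct, and the choice to pass in earlier values of `d` and `g` are all hypothetical.

```c
#include <assert.h>

/* Outputs of the fragment. d and g are also inputs (assumption: they hold
   earlier values, since each is assigned on only one branch). */
typedef struct { int d, g, j, l; } result_t;

/* As written on the slide: e - f or h + i runs only after c is known. */
result_t fragment_original(int a, int b, int e, int f, int h, int i,
                           int d, int g) {
    int x = a + b;
    int c = a < b;
    if (c) d = e - f;
    else   g = h + i;
    result_t r = { d, g, d * g, e + x };
    return r;
}

/* Speculated: both branch operations run unconditionally into temporaries;
   the condition only selects which result is committed. */
result_t fragment_speculated(int a, int b, int e, int f, int h, int i,
                             int d, int g) {
    int x  = a + b;
    int c  = a < b;
    int t1 = e - f;      /* speculated, independent of c */
    int t2 = h + i;      /* speculated, independent of c */
    d = c ? t1 : d;
    g = c ? g  : t2;
    result_t r = { d, g, d * g, e + x };
    return r;
}

/* Checks that both versions agree on one input vector. */
void check_same(int a, int b, int e, int f, int h, int i, int d, int g) {
    result_t p = fragment_original(a, b, e, f, h, i, d, g);
    result_t q = fragment_speculated(a, b, e, f, h, i, d, g);
    assert(p.d == q.d && p.g == q.g && p.j == q.j && p.l == q.l);
}
```

In the speculated version the subtraction and the addition no longer depend on the comparison, so a scheduler is free to place all three in the same cycle if resources allow; the conditional shrinks to two selects. Calling `check_same` with one input where `a < b` and one where it is not exercises both paths.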
The POC and The Experiments
Intel Instruction Length Decoder (ILD) design
– Produced a design that fundamentally restructures the input description (the way a designer would, and no tool could)
A bunch of other media benchmarks
– 40-70% improvement in delay for the same area
– Based on Synopsys backend
See appendix.
The Market Opportunity
The big picture
– Semi is $140B, Fabless Semi is $15B
– EDA currently is about $4B
Current EDA market
– $1B Synthesis and verification
  $400M synthesis, $400M verification, $200M E.
– $3B in PDA, IP and Design Services.
$400M Synthesis
– 90% is RTL and below.
Market movement and ‘structural’ changes.
Future ESL and Synthesis Market
Keys to growth
– ASIC focus (including structured ASICs)
– ‘Power tool’ key to commanding high ASPs
Challenge
– The raid of the FPGAs
  In which case, PHLS will be OEM’d
– ASICs mired in Nano swamp
  Attention shifts to PDA, stationary semi market
The Competition
The early educator: Synopsys BC
– Classical HLS that just does not work, fundamentally flawed
The improviser: Cadence Get2Chip A2C
– Done a good job at RTL
The others
– Celoxica, Forte, Synfora, BlueSpec
– “Boutiques” primarily targeted for “somebody else”
The Competition
Company | Product | Approach | Limitation
Synopsys | Behav. Compiler | Traditional HLS: synthesis from subset of SystemC and Behav VHDL | No parallelizing and beyond basic block (BBB) transformations
Cadence/Get2Chip | A2C | Traditional HLS; closely tied to logic synthesis | No parallelizing and BBB trafos
Celoxica | DK Design Suite | Uses explicitly parallelized input in Handel-C; traditional HLS | No pure behavioral input such as C or SystemC
Forte DS | Cynthesizer | Traditional HLS from SystemC with design space exploration | No parallel and BBB trafos
Synfora | NA | Maps applications to a VLIW processor and a pipelined array of processors; uses parallelizing transformations in VLIW compiler | Does not do HLS at all; it’s more of a mapping tool from C to a processor array
BlueSpec | NA | Based on term rewriting systems; starts from a description closer to RTL than to behav | Not HLS; input is behav code already scheduled into states
What Do We Want To Do?
Make it accessible to SystemC, SystemVerilog
– Front end architecture to port it across
Implement missing compiler passes
– Really standard stuff but missing piece now
Work out a design flow
– Build a path to existing RTL flow incl. validation
Industry strength characterization
Secure IP rights
Synergistic Activities
SPARK release on the web
– Mailing list
– Build the users group
– Expand to SystemC User Community
Kluwer book in preparation
– Announcement at DATE, Feb 2004
– Availability at DAC, June 2004
Exit Strategy
Not yet worked out, but…
Build a stand-alone EDA company
– As a standalone it would not work unless complemented by verification
Build to be bought
– As an HLS company
License technology
– Companies that have shown interest in licensing it: Poseidon Systems, Cadence
SPARK History
A joint project
– Rajesh Gupta, Nikil Dutt, Alex Nicolau
Kicked off in Fall 1999
– First Ph.D., Sumit Gupta, 2003
Supported by
– Semiconductor Research Corporation (SRC)
– Intel grant as a match to UC Micro
– National Science Foundation
Copyright Sumit Gupta
Case Study: Intel Instruction Length Decoder
[Figure: a Stream of Instructions enters the Instruction Length Decoder, which marks the First Insn, Second Insn and Third Instruction in the Instruction Buffer.]
ILD Synthesis: Resulting Architecture
[Figure: Multi-cycle Sequential Architecture, transformed by Speculate Operations, Fully Unroll Loop, and Eliminate Loop Index Variable into a Single cycle Parallel Architecture.]
Our toolbox approach enables us to develop a script to synthesize applications from different domains
Final design looks close to the actual implementation done by Intel
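The three steps named on this slide (speculate operations, fully unroll the loop, eliminate the loop index variable) can be sketched on a toy decoder. Everything here is an assumption made for illustration: the 4-byte buffer, the hypothetical `insn_len` rule (1 or 2 bytes, chosen by the low bit), and the function names. It is neither the real Intel encoding nor SPARK's actual output.

```c
#include <assert.h>
#include <string.h>

#define N 4  /* toy instruction buffer size (assumption) */

/* Hypothetical length rule, NOT the real encoding: 1 or 2 bytes. */
static int insn_len(unsigned char b) { return (b & 1) + 1; }

/* Sequential style: each iteration needs the previous instruction's
   length before it knows where the next instruction starts. */
void mark_starts_loop(const unsigned char buf[N], int start[N]) {
    memset(start, 0, N * sizeof start[0]);
    for (int i = 0; i < N; i += insn_len(buf[i]))
        start[i] = 1;
}

/* Speculated and fully unrolled: the length at every byte position is
   computed up front (whether or not an instruction starts there), and the
   loop index is replaced by fixed positions. */
void mark_starts_unrolled(const unsigned char buf[N], int start[N]) {
    int l0 = insn_len(buf[0]);   /* these three are mutually independent */
    int l1 = insn_len(buf[1]);
    int l2 = insn_len(buf[2]);
    start[0] = 1;
    start[1] = (l0 == 1);
    start[2] = (l0 == 2) || (start[1] && l1 == 1);
    start[3] = (start[1] && l1 == 2) || (start[2] && l2 == 1);
}

/* Exhaustive check: only the low bit of each byte affects the result,
   so 16 buffers cover every case of the toy rule. */
void check_equivalence(void) {
    for (int m = 0; m < 16; m++) {
        unsigned char buf[N];
        int a[N], b[N];
        for (int k = 0; k < N; k++)
            buf[k] = (unsigned char)((m >> k) & 1);
        mark_starts_loop(buf, a);
        mark_starts_unrolled(buf, b);
        assert(memcmp(a, b, sizeof a) == 0);
    }
}
```

The unrolled version has no loop-carried index: the length computations are independent of one another, and what remains is a short chain of boolean selects, which is the shape a single-cycle parallel implementation needs. A real decoder would have a far larger buffer and a much more involved length calculation.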
Target Applications
[Table: columns Design, # of Ifs, # of Loops, # Non-Empty Basic Blocks, # of Operations; rows MPEG-1 pred, MPEG-1 pred, MPEG-2 dp_frame, GIMP tiler.]
Scheduling & Logic Synthesis Results
[Figure: bar chart. Legend: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Data labels: 42%, 10%, 36%, 8%, 39%.]
Overall: % improvement in Delay
Almost constant Area
Scheduling & Logic Synthesis Results
[Figure: bar chart. Legend: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Data labels: 14%, 20%, 1%, 33%, 41%, 52%.]
Overall: % improvement in Delay
Almost constant Area