Presentation on theme: "Aladdin and gem5-Aladdin: Research Infrastructures for Specialized Architectures"— Presentation transcript:

1 Tutorial Outline
8:45 am – 9:00 am   Hands-on: Virtual Machine Setup
9:00 am – 9:20 am   Presentation: Accelerator Research Overview
9:20 am – 9:35 am   Presentation: Aladdin: Accelerator Pre-RTL Modeling
9:35 am – 10:15 am  Hands-on: Accelerator Design Space Exploration using Aladdin
10:15 am – 10:30 am Break
10:30 am – 11:00 am Presentation: gem5-Aladdin: Accelerator System Integration
11:00 am – 12:00 pm Hands-on: SoC Design Space Exploration using gem5-Aladdin

2 Aladdin and gem5-Aladdin: Research Infrastructures for Specialized Architectures
Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks (Harvard University)
Hello everyone, I'm Sophia Shao. I was a PhD student working with Professor David Brooks at Harvard for the past few years; I just graduated earlier this year and joined NVIDIA Research as a research scientist. Over the past few years, our group at Harvard has done quite a lot of work on developing architecture-level modeling and simulation infrastructures for specialized architectures, specifically the Aladdin tool for pre-RTL power-performance accelerator modeling and gem5-Aladdin, a system-level SoC simulator. Over the course of this tutorial, we will talk about what these two tools do, how they work, and how you can use them to explore different accelerator designs.
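To give a sense of the granularity these tools work at: Aladdin models accelerators starting from high-level C descriptions of kernels, so the hands-on sessions revolve around small functions of roughly the shape sketched below. This particular kernel, its names, and its size are illustrative only and are not taken from the tutorial materials; the idea is that a pre-RTL model estimates the power and performance of a dedicated hardware implementation of such a function, under a chosen set of design parameters, before any RTL is written.

```c
/* Illustrative accelerator kernel: dot product of two fixed-size vectors.
 * A pre-RTL model such as Aladdin profiles the dynamic behavior of a kernel
 * like this and estimates the cycle count, power, and area of a dedicated
 * hardware implementation under a given design configuration. */
#include <stdio.h>

#define N 64

int dot_product(const int a[N], const int b[N]) {
    int sum = 0;
    /* The loop is the unit of hardware specialization: unrolling it trades
     * area for parallelism, and partitioning arrays a and b trades memory
     * banks for bandwidth. */
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

int main(void) {
    int a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
    printf("%d\n", dot_product(a, b));
    return 0;
}
```

The appeal of working at this level is that one C function can stand in for many candidate hardware designs, one per choice of design parameters.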

3 Moore's Law
The semiconductor industry has recorded impressive achievements since 1965, when Gordon Moore published the observation that the number of transistors per unit area would double every other year. This simple plot has stood the test of time and set the pace of performance improvement in the semiconductor industry: just wait two more years, and we get machines that are twice as fast. It has been great, except that where we are today, it is very unlikely that Moore's Law will continue.

4 CMOS Scaling is Slowing Down
(Slide figure: Intel process nodes from 180 nm through 10 nm, plotted by the release date of the first microprocessor in each node.)
In fact, the transistor scaling that Gordon Moore predicted is already measurably slowing down. This slide shows Intel's historical technology scaling trend based on the release date of the first microprocessor in each technology node. Over the past decade, Intel followed Moore's Law quite closely, introducing a new generation of technology node every two years. However, the introduction of Intel's 14 nm process was delayed by half a year beyond the original projection: it was supposed to be released in the second quarter of 2014, but did not ship until the fourth quarter. Moreover, in July 2015 Intel announced that the 10 nm node would not be ready until mid-2017. In that case, the 14 nm technology will be around for at least three years, if there are no further delays at 10 nm.

5 CMOS Technology Scaling
(Slide figure: S-curves of successive transistor technologies over time, annotated "Technological Fallow Period".)
The slowdown of technology scaling looks very familiar. If we look at all the past generations of transistor technologies and how they have scaled over time, we see this kind of S-curve: when a new technology is first discovered, it is not as good as the existing one. It takes roughly 5 to 10 years to become mature enough to deliver better performance than the current generation, and that is when it replaces the old technology. After it becomes the mainstream technology, it eventually runs into either power or density issues that stop its performance growth, until it is finally replaced by a new type of technology. When the current technology stops delivering better performance and the new technology is not yet mature enough, we get what we call a technological fallow period, in which we do not get much performance improvement from technology scaling. Unfortunately, that is exactly where we are. We have observed the end of voltage scaling for CMOS transistors and the slowdown of CMOS density scaling. At the same time, we have not seen any new technology that is ready, or has the potential, to take over from CMOS yet. Even if such a technology has already been discovered in some physics or chemistry research lab, it will still take 10 to 15 years before it catches up with CMOS. So during this technological fallow period, when we cannot get more free rides from transistor scaling, what can we do at the architectural level to keep performance growing?

6 Potential for Specialized Architectures
(Slide figure: energy efficiency of 20 chips published in ISSCC, including encryption, hearing aid, FIR for disk read, MPEG encoder, and baseband designs.)
One promising direction is hardware specialization, where the hardware itself is designed for specific applications. Because we can customize the hardware to the application's requirements, and we can also remove a lot of the overhead of general-purpose processing, specialized architectures can deliver orders-of-magnitude performance and energy benefits compared to general-purpose solutions. This figure shows an energy-efficiency comparison between general-purpose processors, DSPs, and dedicated application-specific accelerators. The data was collected from 20 different chips across different architectures, published in ISSCC over five years. Compared to general-purpose processors, specialized processors like DSPs deliver 10x to 100x better energy efficiency, while dedicated application-specific accelerators are about 1000x more energy efficient. [Zhang and Brodersen]
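To make these ratios concrete, here is a back-of-the-envelope calculation using the energy-efficiency metric (throughput per unit power, e.g., MOPS/mW) that such surveys typically plot. The throughput and power numbers below are purely illustrative and are not taken from the chart; they simply show what a 1000x gap means in these units.

$$
\text{Efficiency} = \frac{\text{throughput}}{\text{power}}, \qquad
\frac{\text{Efficiency}_{\text{ASIC}}}{\text{Efficiency}_{\text{CPU}}}
= \frac{10\,\text{GOPS} / 10\,\text{mW}}{1\,\text{GOPS} / 1\,\text{W}}
= \frac{1000\,\text{MOPS/mW}}{1\,\text{MOPS/mW}}
= 1000\times
$$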

7 Cores, GPUs, and Accelerators: Apple A8 SoC
(Slide figure: Apple A8 SoC die photo highlighting out-of-core accelerators.)

8 Cores, GPUs, and Accelerators: Apple A8 SoC
(Slide figure: Apple A8 SoC die photo highlighting out-of-core accelerators.)

9 Cores, GPUs, and Accelerators: Apple A8 SoC
(Slide figure: Apple A8 SoC die photo highlighting out-of-core accelerators, with area estimates from Maltiel Consulting and our own estimates.)

10 Challenges in Accelerators
Flexibility: Fixed-function accelerators are designed only for their target applications.
Programmability: Today's accelerators are explicitly managed by programmers.

11 Today's SoC
(Slide figure: OMAP 4 SoC.)

12 Today's SoC
(Slide figure: OMAP 4 SoC block diagram with ARM cores, GPU, DSP, DMA, SD, USB, audio, video, face detection, and imaging blocks connected by system, secondary, and tertiary buses.)

13 Challenges in Accelerators
Flexibility: Fixed-function accelerators are designed only for their target applications.
Programmability: Today's accelerators are explicitly managed by programmers.
Design Cost: Accelerator (and RTL) implementation is inherently tedious and time-consuming.

14 Today's SoC
(Slide figure: simplified SoC diagram with CPU, GPU/DSP, accelerators, buses, and memory interface.)
We can use this diagram to represent the architecture of today's SoC: a couple of cores and a handful of accelerators connected through simple buses.

15 Future Accelerator-Centric Architectures
(Slide figure: heterogeneous SoC with big cores, small cores, GPU/DSP, a sea of fine-grained accelerators, shared resources, and a memory interface.)
The future accelerator-centric architecture is a heterogeneous system like this, with big cores, small cores, GPUs, and, more importantly, a sea of fine-grained accelerators to provide energy efficiency and application coverage. As architects, in order to quantitatively reason about novel architectural features in such systems, we need accelerator simulators that help us answer questions like the following.
How do we decompose applications into accelerators? (Flexibility) With a fast accelerator simulator, we can easily simulate accelerators at different granularities and compare their trade-offs.
How do we rapidly design lots of accelerators? (Design Cost) A fast accelerator simulator lets us do an early-stage design-space search before starting the time-consuming RTL design flow, as sketched in the example after this slide.
How do we design and manage the shared resources? (Programmability) We already have simulation infrastructure for the rest of the system; what is missing is a simulation framework for accelerators. With such a framework, we can easily model the interaction between cores, accelerators, and shared memory, and propose novel mechanisms that provide architectural support to make programming easier.
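To make the idea of an early-stage design-space search concrete, below is a minimal sketch, in C, of enumerating the cross product of two accelerator design knobs (a loop-unrolling factor and an array-partitioning factor). The knob names and value ranges are hypothetical and this is not Aladdin's actual configuration syntax; the point is only that a pre-RTL model can evaluate each design point quickly, whereas writing RTL for every point would take far longer.

```c
/* Illustrative design-space enumeration: each (unroll, partition) pair is one
 * accelerator design point that a pre-RTL model could evaluate in minutes,
 * versus days or weeks for a full RTL implementation of each point. */
#include <stdio.h>

int main(void) {
    const int unroll_factors[]    = {1, 2, 4, 8, 16};  /* hypothetical knob */
    const int partition_factors[] = {1, 2, 4, 8};      /* hypothetical knob */
    const int n_unroll = sizeof(unroll_factors) / sizeof(unroll_factors[0]);
    const int n_part   = sizeof(partition_factors) / sizeof(partition_factors[0]);
    int point = 0;

    for (int u = 0; u < n_unroll; u++) {
        for (int p = 0; p < n_part; p++) {
            /* In a real flow, each design point would be handed to the
             * pre-RTL model here for a (cycles, power, area) estimate. */
            printf("design %2d: unroll=%2d, partition=%d\n",
                   ++point, unroll_factors[u], partition_factors[p]);
        }
    }
    printf("total design points: %d\n", point);
    return 0;
}
```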

16 Contributions
(Slide figure: the same heterogeneous SoC diagram, annotated with the contributions listed here.)
WIICA: Accelerator Workload Characterization [ISPASS'13]
MachSuite: Accelerator Benchmark Suite [IISWC'14]
Aladdin: Accelerator Pre-RTL, Power-Performance Simulator [ISCA'14, TopPicks'15]
Accelerator Design w/ High-Level Synthesis [ISLPED'13]
gem5-Aladdin: Accelerator-System Co-Design [MICRO'16]

