Please do not distribute


Tutorial Outline (4/17/2017)

Time                    Topic
9:00 am – 9:30 am       Introduction
9:30 am – 10:10 am      Standalone Accelerator Simulation: Aladdin
10:10 am – 10:30 am     Standalone Accelerator Generation: High-Level Synthesis
10:30 am – 11:00 am     HLS-Based Accelerator-Rich Architecture Simulation: PARADE
11:00 am – 11:30 am     Break
11:30 am – 12:00 pm     Pre-RTL SoC Simulation: gem5-Aladdin
12:00 pm – 12:30 pm     FPGA Prototyping: ARACompiler
12:30 pm – 2:00 pm      Lunch
2:00 pm – 3:00 pm       Panel on Accelerator Research
3:00 pm – 3:30 pm       Accelerator Benchmarks and Workload Characterization
3:30 pm – 4:00 pm
4:00 pm – 5:00 pm       Hands-on Exercise

Integration for Heterogeneous SoC Modeling
Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks
Harvard University

Accelerator-CPU Integration: Today's Conventional SoCs
- Easy to integrate lots of IP, simple accelerator design
- Hard to program and share data
[Figure labels: Core, L2 $, …, L3 $, DMA, On-Chip System Bus, Acc #1, Scratchpad, Acc #n]
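The "hard to program and share data" point can be illustrated with a small host-side sketch. Nothing here comes from the slides: `scratchpad_t`, `dma_copy`, `acc_scale`, `offload_scale`, and the 256-element capacity are hypothetical stand-ins, and the DMA transfer is modeled as a plain `memcpy`. The sketch only shows that, with a private scratchpad, the programmer has to tile the data and move every tile in and out explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPAD_WORDS 256u   /* assumed scratchpad capacity, in floats */

/* Hypothetical private accelerator memory. */
typedef struct {
    float data[SPAD_WORDS];
} scratchpad_t;

/* Stand-in for a DMA transfer: modeled as a plain memcpy. */
static void dma_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Hypothetical accelerator kernel: scales the scratchpad contents in place. */
static void acc_scale(scratchpad_t *spad, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        spad->data[i] *= factor;
}

/* Host-side driver: the array is processed in scratchpad-sized tiles,
 * and each tile is copied in, processed, and copied back explicitly. */
static void offload_scale(float *array, size_t n, float factor)
{
    scratchpad_t spad;

    assert(array != NULL || n == 0);
    for (size_t base = 0; base < n; base += SPAD_WORDS) {
        size_t chunk = (n - base < SPAD_WORDS) ? (n - base) : SPAD_WORDS;

        dma_copy(spad.data, array + base, chunk * sizeof(float)); /* host -> scratchpad */
        acc_scale(&spad, chunk, factor);                          /* compute on the tile */
        dma_copy(array + base, spad.data, chunk * sizeof(float)); /* scratchpad -> host */
    }
}
```

Calling `offload_scale(a, 600, 2.0f)` on a 600-element array takes three tiles (256, 256, and 88 elements). A cache-coherent, virtually addressed accelerator would not need the two explicit copies per tile.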

Accelerator Integration Trend
- Users design application-specific hardware accelerators.
- System vendors provide a Host Service Layer with virtual memory and cache coherence support.
  - Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP)
  - IBM POWER8's Coherent Accelerator Processor Interface (CAPI)
[Figure labels: Main CPU/SoC, FPGA or user-defined ASIC, Core, …, Core, Accelerator, L2 $, L2 $, Acc Agent, Host Service Layer, L3 $]

IBM CAPI: Two-part solution
- Example of state-of-the-art: IBM POWER8's Coherent Accelerator Processor Interface (CAPI)
- Virtual Addressing & Data Caching
- Easier, Natural Programming Model

IBM CAPI: Two-part solution
- Coherent Accelerator Processor Proxy (CAPP)
  - Snoops PowerBus on behalf of accelerator
- Power Service Layer (PSL)
  - Performs address translations, page table walker support
  - Provides cache and interface logic
[Figure labels: Accelerator, Core, …, Core, PCIe, L2 $, L2 $, PSL, CAPP, L3 $, On-Chip Coherent PowerBus, …, Memory, Cache, TLB]
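The address-translation role described for the PSL (a TLB backed by a page table walker) can be sketched generically. This is not the PSL's actual design: the 4 KiB page size, the 16-entry direct-mapped TLB, the single-level toy page table, and all names (`tlb_t`, `page_table_t`, `page_walk`, `translate`) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12u                        /* assumed 4 KiB pages */
#define PAGE_MASK   ((1u << PAGE_SHIFT) - 1u)
#define TLB_ENTRIES 16u                        /* assumed direct-mapped TLB */
#define NUM_PAGES   64u                        /* size of the toy page table */

typedef struct {
    bool     valid;
    uint64_t vpn;   /* virtual page number */
    uint64_t ppn;   /* physical page number */
} tlb_entry_t;

typedef struct {
    tlb_entry_t entries[TLB_ENTRIES];
    uint64_t    hits;
    uint64_t    misses;
} tlb_t;

/* Toy single-level page table: index = VPN, value = PPN. */
typedef struct {
    bool     present[NUM_PAGES];
    uint64_t ppn[NUM_PAGES];
} page_table_t;

/* Page table walk; returns false when the page is unmapped (a fault). */
static bool page_walk(const page_table_t *pt, uint64_t vpn, uint64_t *ppn)
{
    if (vpn >= NUM_PAGES || !pt->present[vpn])
        return false;
    *ppn = pt->ppn[vpn];
    return true;
}

/* Translate a virtual address: try the TLB first, walk on a miss,
 * and refill the TLB entry after a successful walk. */
static bool translate(tlb_t *tlb, const page_table_t *pt,
                      uint64_t vaddr, uint64_t *paddr)
{
    uint64_t     vpn    = vaddr >> PAGE_SHIFT;
    uint64_t     offset = vaddr & PAGE_MASK;
    tlb_entry_t *e      = &tlb->entries[vpn % TLB_ENTRIES];

    assert(paddr != NULL);
    if (e->valid && e->vpn == vpn) {
        tlb->hits++;
    } else {
        uint64_t ppn;

        tlb->misses++;
        if (!page_walk(pt, vpn, &ppn))
            return false;              /* unmapped page: report a fault */
        e->valid = true;
        e->vpn   = vpn;
        e->ppn   = ppn;
    }
    *paddr = (e->ppn << PAGE_SHIFT) | offset;
    return true;
}
```

The first access to a page takes the slow walk path; later accesses to the same page hit in the TLB. That locality is what makes virtual addressing affordable for an accelerator.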

But… accelerators are not one-size-fits-all
- Problem: the PSL consumes ~20-30% of FPGA resources… for one accelerator.
- Applications have drastically different requirements.
- Memory design customization is often more important than datapath customization.

gem5-Aladdin Integration
[Figure labels: CPU, Acc Datapath, Cache, Scratchpad, TLB, DMA Engine, Cache, LLC, DRAM]

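Code example: Sift

    void imsmooth(F2D* array, float sigma, F2D* product);

    void sift() {
      …
      // Original software call to the kernel.
      imsmooth(I, temp, gss[0]);
      // Register the arrays the accelerator will access, by name.
      mapArrayToAccelerator(imsmooth, "array", (void *)I, sizeof(I));
      mapArrayToAccelerator(imsmooth, "product", (void *)product, sizeof(product));
      // Invoke the accelerator and wait for it to finish.
      invokeAcceleratorAndBlock(imsmooth);
    }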

Code example: Sift

    void imsmooth(F2D* array, float sigma, F2D* product);

    void sift() {
      …
      // The software call is commented out; the accelerator runs the kernel instead.
      // imsmooth(I, temp, gss[0]);
      mapArrayToAccelerator(imsmooth, "array", (void *)I, sizeof(I));
      mapArrayToAccelerator(imsmooth, "product", (void *)product, sizeof(product));
      invokeAccelerator(imsmooth);
    }

[Callout: Start Aladdin Simulation]
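The map-then-invoke pattern on this slide can be tried out with a self-contained mock. This is not the gem5-Aladdin interface: `mock_accel_t`, `mock_map_array`, `mock_find`, and `mock_invoke_and_block` are hypothetical names, and the "kernel" is an arbitrary stand-in that halves each input element. The sketch only shows how naming arrays up front lets the accelerator side look up host buffers by name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MAPPINGS 8   /* assumed limit on mapped arrays */

/* One named host buffer made visible to the mock accelerator. */
typedef struct {
    const char *name;
    void       *addr;
    size_t      size;   /* size in bytes */
} mapping_t;

typedef struct {
    mapping_t maps[MAX_MAPPINGS];
    size_t    count;
} mock_accel_t;

/* Record a (name, address, size) mapping; returns -1 if the table is full. */
static int mock_map_array(mock_accel_t *acc, const char *name,
                          void *addr, size_t size)
{
    assert(name != NULL);
    if (acc->count >= MAX_MAPPINGS)
        return -1;
    acc->maps[acc->count++] = (mapping_t){ name, addr, size };
    return 0;
}

/* Look up a mapping by array name; returns NULL if it was never mapped. */
static const mapping_t *mock_find(const mock_accel_t *acc, const char *name)
{
    for (size_t i = 0; i < acc->count; i++)
        if (strcmp(acc->maps[i].name, name) == 0)
            return &acc->maps[i];
    return NULL;
}

/* Stand-in kernel, run to completion before returning ("and block"):
 * writes 0.5 * array[i] into product[i]. Fails if either array is
 * unmapped or the output buffer is too small. */
static int mock_invoke_and_block(const mock_accel_t *acc)
{
    const mapping_t *in  = mock_find(acc, "array");
    const mapping_t *out = mock_find(acc, "product");

    if (in == NULL || out == NULL || out->size < in->size)
        return -1;

    const float *src = in->addr;
    float       *dst = out->addr;
    for (size_t i = 0; i < in->size / sizeof(float); i++)
        dst[i] = 0.5f * src[i];
    return 0;
}
```

When calling `mock_map_array`, the size passed should be the buffer's size in bytes. `sizeof` applied to a true array yields that, whereas `sizeof` applied to a pointer yields only the size of the pointer.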

Simulating Accelerator with Memory System using Aladdin
[Figure labels: Cache, Memory]

[Figure labels: Acc, Cache, Memory, CPU, Cache, Memory]

Modeling Accelerators in an SoC-like Environment
[Figure labels: Acc, Core, Core, Cache, Memory]

Modeling Accelerators in an SoC-like Environment
[Figure labels: Core, Cache, Memory]

Accelerator Research Infrastructure
[Figure labels: Standalone, System Integration, Modeling, Aladdin, gem5-Aladdin, High-Level Synthesis, PARADE, RTL, Prototyping, FPGA]

Tutorial References
- Y.S. Shao and D. Brooks, "ISA-Independent Workload Characterization and its Implications for Specialized Architectures," ISPASS'13.
- B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, "Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware," ISLPED'13.
- Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures," ISCA'14.
- B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, "MachSuite: Benchmarks for Accelerator Design and Customized Architectures," IISWC'14.