A Computing Origami: Folding Streams in FPGAs S. M. Farhad PhD Student University of Sydney DAC 2009, California, USA.

Outline
- Motivation
  - Stream programming
  - FPGA
  - Problem
- Stream Folding
- Results
- Conclusion

Stream Programming Paradigm
- Programs expressed as stream graphs
  - Streams: sequences of data elements
  - Actors: functions applied to streams
- Independent actors with explicit communication
- Regular and repeating computation
[Figure: stream graph with the labels "Actor/Filter" and "Streams"]
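As a rough illustration of the paradigm, the sketch below models actors as per-token functions and streams as Python iterators, so the only communication between actors is the stream that connects them. The helper names (`actor`, `pipeline`, `scale`, `offset`) are hypothetical and do not come from the talk; a real stream language would also declare how many tokens each actor consumes and produces per firing.

```python
from typing import Callable, Iterable, Iterator, List


def actor(fn: Callable[[int], int]) -> Callable[[Iterable[int]], Iterator[int]]:
    """Wrap a per-token function as an actor that maps an input stream to an output stream."""
    def run(stream: Iterable[int]) -> Iterator[int]:
        for token in stream:
            yield fn(token)  # one firing: consume one token, produce one token
    return run


def pipeline(actors, source: Iterable[int]) -> Iterator[int]:
    """Connect actors in sequence; each actor only sees its input stream."""
    stream = source
    for a in actors:
        stream = a(stream)
    return stream


# Two illustrative stateless actors (hypothetical names).
scale = actor(lambda x: 2 * x)
offset = actor(lambda x: x + 1)


def run_example() -> List[int]:
    # Stream graph: source -> scale -> offset -> sink
    return list(pipeline([scale, offset], range(5)))


print(run_example())  # [1, 3, 5, 7, 9]
```

Because each actor here keeps no state between firings, copies of it could process different tokens independently, which is the property that replication relies on.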

FPGA
- FPGAs are widely available as programmable coprocessors
- Opportunities to exploit FPGA-based acceleration
  - Multimedia, networking, graphics, and security codes

Problem
- Maximizing throughput subject to
  - Area and latency constraints
- Resolving bottleneck actors
  - The replicated filters do not require resynthesis

Motivating Example
[Figure slides; the images are not captured in this transcript]

Area/Throughput Design Folding

    foreach Filter f in S do
        workFactor[f] = f.latency * S.runs(f);
        designPointArea += f.area * workFactor[f];
    scaleLimit = min over f with f.hasState of (1 / workFactor[f]);
    scaling = min(AREA / designPointArea, scaleLimit);
    foreach Filter f in S do
        replication[f] = workFactor[f] * scaling;
    while area(replication) > AREA do
        replication = reduceThroughput(replication);
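The pseudocode above can be traced with a small Python sketch. Several parts are assumptions made only for illustration: the `Filter` fields and their units, an area model that rounds each replication factor up to a whole instance, and a `reduce_throughput` that shrinks every factor by 10%. The slide does not define `area` or `reduceThroughput`, so these stand-ins should not be read as the authors' method.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Filter:
    name: str
    latency: float           # time per firing (assumed unit: cycles)
    area: float              # area of one instance (assumed unit: LUTs)
    runs: int                # firings per steady-state iteration, S.runs(f)
    has_state: bool = False  # stateful filters limit the scaling factor


def area_of(filters: List[Filter], replication: Dict[str, float]) -> float:
    # Assumed area model: a fractional replica still occupies a whole instance.
    return sum(f.area * math.ceil(replication[f.name]) for f in filters)


def reduce_throughput(replication: Dict[str, float], step: float = 0.9) -> Dict[str, float]:
    # Hypothetical stand-in: shrink every replication factor uniformly.
    return {name: r * step for name, r in replication.items()}


def fold(filters: List[Filter], area_budget: float) -> Dict[str, float]:
    # Guard added for this sketch so the final loop always terminates.
    if area_budget < sum(f.area for f in filters):
        raise ValueError("budget cannot hold one instance of every filter")
    work = {f.name: f.latency * f.runs for f in filters}
    design_point_area = sum(f.area * work[f.name] for f in filters)
    # Stateful filters cap the scaling so their replication stays at or below 1.
    scale_limit = min((1.0 / work[f.name] for f in filters if f.has_state),
                      default=math.inf)
    scaling = min(area_budget / design_point_area, scale_limit)
    replication = {f.name: work[f.name] * scaling for f in filters}
    while area_of(filters, replication) > area_budget:
        replication = reduce_throughput(replication)
    return replication


filters = [
    Filter("A", latency=4, area=100, runs=1),
    Filter("B", latency=2, area=50, runs=2, has_state=True),
]
replication = fold(filters, area_budget=300)
print(replication)  # {'A': 1.0, 'B': 1.0}
```

In this toy input the stateful filter B has a work factor of 4, so the scaling is capped at 1/4 and both filters end up with a replication factor of 1, even though the area budget alone would have allowed a scaling of 0.5.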

Calculating Throughput

Calculating Latency
- FPGAs that are coupled to host processors
- Initiation interval (DMA)
- Replication improves throughput, but it often increases latency!
- Major factors for latency variation
  - Non-periodic data arrival
  - Data-token reordering
  - Local congestion

Latency-Constrained Design Folding

    latConf = null; T = ∞;
    while throughput(thrConf) ≤ T do
        if feasibleImprovement(thrConf) then
            candidates = simAnnealing(thrConf, T);
            foreach candidate in candidates do
                if throughput(candidate) < T then
                    latConf = candidate;
                    T = throughput(latConf);
        thrConf = reduceThroughput(thrConf);
    return latConf
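The control flow of this search can be sketched in Python with the undefined helpers passed in as callables. One reading is assumed here and is not stated on the slide: `throughput` is treated as a cost where smaller is better (for example the initiation interval), since the loop starts from T = ∞ and keeps candidates with a smaller value. The toy configurations, the stand-in callables, and the `max_steps` guard are hypothetical.

```python
import math
from typing import Callable, Iterable, Optional, TypeVar

C = TypeVar("C")


def latency_constrained_fold(
    thr_conf: C,
    cost: Callable[[C], float],                       # throughput(...), assumed smaller is better
    feasible_improvement: Callable[[C], bool],        # feasibleImprovement(...)
    sim_annealing: Callable[[C, float], Iterable[C]], # simAnnealing(...), returns candidates
    reduce_throughput: Callable[[C], C],              # reduceThroughput(...)
    max_steps: int = 1000,                            # guard added for this sketch only
) -> Optional[C]:
    lat_conf: Optional[C] = None
    best = math.inf  # T in the pseudocode
    for _ in range(max_steps):
        if cost(thr_conf) > best:
            break  # the relaxed design is already worse than the best candidate
        if feasible_improvement(thr_conf):
            for candidate in sim_annealing(thr_conf, best):
                if cost(candidate) < best:
                    lat_conf = candidate
                    best = cost(lat_conf)
        thr_conf = reduce_throughput(thr_conf)
    return lat_conf


# Toy stand-ins: a configuration is just its initiation interval (an int).
result = latency_constrained_fold(
    thr_conf=1,
    cost=lambda c: float(c),
    feasible_improvement=lambda c: c >= 3,  # pretend the latency bound needs II >= 3
    sim_annealing=lambda c, t: [c, c + 1],  # pretend search returns two nearby designs
    reduce_throughput=lambda c: c + 1,      # relaxing throughput raises the II
)
print(result)  # 3
```

With these stand-ins the loop relaxes the design from II = 1 to II = 3, records 3 as the first latency-feasible candidate, and stops once the relaxed design (II = 4) is worse than that candidate.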

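Results

Table columns: Benchmark | Minimum area (LUTs, Latency, II) | Best throughput (LUTs, Latency, II) | Constrained design (LUTs, Latency, II) | Constraint | Run time (s)
[The numeric values of the table are not preserved in this transcript]

Benchmark      Constraint
MatrixMult     Latency ≤ [bound missing]
Serpent        Latency ≤ [bound missing]
FFT            AREA ≤ [bound missing]
FMRadio        AREA ≤ [bound missing]
DCT            AREA ≤ [bound missing]
BitonicSort    AREA ≤ [bound missing]
Synthetic      AREA ≤ [bound missing]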

Questions?