Embedding Constraint Satisfaction using Parallel Soft-Core Processors on FPGAs
Prasad Subramanian, Brandon Eames, Department of Electrical Engineering, Utah State University


[Table: measured results per cluster configuration — time to propagate (ticks), number of distribution steps, time to first solution (ticks), cluster size, propagation speed-up, distribution speed-up, and percent stack usage; the single-processor configuration fails.]

Abstract

This work presents the architecture of a generic, flexible and scalable embedded finite domain constraint solver targeting an FPGA. We exploit the spatial parallelism offered by COTS FPGA architectures by instantiating multiple soft-core processors, which collectively implement the constraint solver. Soft-core processors facilitate the development of flexible, software-based algorithms for implementing individual constraints, while the multi-core architecture realized on the FPGA provides the tight inter-core synchronization required when solving constraints in parallel.

Introduction

Constraint satisfaction and optimization techniques are commonly employed in scheduling problems and in industrial manufacturing and automation processes, borrowing concepts from Operations Research (OR) and Artificial Intelligence (AI). Constraint satisfaction has also been applied to the design, synthesis and optimization of embedded systems. By modeling its scheduling problem as a constraint satisfaction problem (CSP), an embedded system becomes adaptable to dynamic changes in its environment. In this work, we model the task graph of dynamic occurrences for a space application in terms of precedence constraints and solve it with the multi-processor embedded CSP solver.

Solver Architecture

Propagation in a CSP whose variables range over domains of finite cardinality is amenable to parallel processing. We exploit the spatial parallelism of the FPGA by instantiating multiple soft-core MicroBlaze processors from Xilinx, each with its own distributed memory.
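To illustrate the kind of propagation the solver performs (this is a software sketch, not the actual MicroBlaze firmware), the code below models each finite domain variable as a [lo, hi] bound pair and runs precedence propagators to a fixpoint; the function names and data layout are hypothetical.

```python
# Illustrative sketch only: the variable layout and propagator shape are
# assumptions for exposition, not the poster's actual implementation.

def propagate_precedence(store, a, b, off):
    """Enforce start[a] + off <= start[b] over bounds store[v] = [lo, hi].
    Returns True if any bound was tightened."""
    changed = False
    lo_a, hi_a = store[a]
    lo_b, hi_b = store[b]
    if lo_a + off > lo_b:        # raise the lower bound of b
        store[b][0] = lo_a + off
        changed = True
    if hi_b - off < hi_a:        # lower the upper bound of a
        store[a][1] = hi_b - off
        changed = True
    return changed

def propagate_to_fixpoint(store, constraints):
    """Run every propagator each pass until no bound can be tightened."""
    while any([propagate_precedence(store, a, b, off)
               for (a, b, off) in constraints]):
        pass
    return store
```

On the FPGA each cluster of propagators runs on its own soft core; here they simply run in one sequential loop.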
The model in the above figure implies the need for a globally shared memory, simultaneously accessible by all propagation elements for sharing variable information. Without such sharing, propagators cannot cooperate to make joint progress towards a solution.

The Consolidator

Emulating shared memory with distributed memories and communication links raises coherency issues, since local copies of shared data are cached on each processor. Data coherency is maintained in two steps. First, when a finite domain variable is updated by a remote processor, the update is sent to the owning processor. Second, all updates are routed through a hardware unit called the consolidator. The consolidator is a comparator that acts as a checkpoint for all updates to the constraint store, accepting only updates that improve, i.e. tighten, the currently stored bounds of the variable. Since propagation approaches a solution through bounds analysis of finite domain variables, ordering between updates need not be maintained.

Results

To evaluate our embedded FD constraint solver, we used a previously published hypothetical task graph representing the events in an autonomous space mission planning algorithm. Our constraint model enforces temporal precedence constraints in order to derive a schedule of the events in the graph. Measurements from the solver implementation executing on a Xilinx Virtex-II Pro FPGA are provided in the table. The results indicate that the propagation speed-up grows as the number of processors increases. The solver failed to converge in the single-processor case, due to insufficient space in the local configuration stack. Online constraint satisfaction potentially opens the door to a variety of introspective dynamic optimizations for embedded systems.
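The consolidator's accept-only-if-tighter rule can be sketched as follows — a hypothetical software model of what the poster describes as a hardware comparator:

```python
# Hypothetical software model of the hardware consolidator: a remote update
# is accepted only if it tightens the bounds currently stored for the variable.

def consolidate(store, var, new_lo, new_hi):
    """Merge a remote bounds update; return True iff the store was tightened."""
    lo, hi = store[var]
    merged = (max(lo, new_lo), min(hi, new_hi))   # bounds only ever tighten
    if merged == (lo, hi):
        return False                              # stale/duplicate update: reject
    store[var] = merged
    return True
```

Because max and min are commutative and associative, the final bounds are independent of the order in which updates arrive — which is why ordering between updates need not be maintained.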
Architecture

A low-latency, globally shared memory accessible by several computational devices is not practical on an FPGA fabric, due to the limited number of read-write ports on internal memories. Our implementation instead emulates shared memory using on-chip BlockRAM: the constraint store is partitioned and distributed among the local memories associated with each soft-core processor.

Conclusion

We have developed an approach for embedding a finite domain constraint solver on an FPGA, using a network of soft-core processors, distributed memories and point-to-point communication. The solver framework is generic enough to allow propagators embodying various other constraints to be added to the system.
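A minimal sketch of the partitioned constraint store described under Architecture, assuming a simple static round-robin assignment of variables to cores (the actual partitioning policy is not specified in the poster):

```python
# Sketch of a constraint store partitioned across per-core local memories.
# The round-robin owner assignment is an assumption for illustration.

class PartitionedStore:
    def __init__(self, variables, n_cores):
        self.local = [dict() for _ in range(n_cores)]   # one local memory per core
        self.owner = {}                                 # variable -> owning core
        for i, (name, bounds) in enumerate(variables.items()):
            core = i % n_cores
            self.owner[name] = core
            self.local[core][name] = bounds

    def read(self, var):
        return self.local[self.owner[var]][var]

    def update(self, var, lo, hi):
        # Route the update to the owning core's local memory, tightening only.
        mem = self.local[self.owner[var]]
        cur_lo, cur_hi = mem[var]
        mem[var] = (max(cur_lo, lo), min(cur_hi, hi))
```

Routing every update to a single owning core is what lets the distributed memories stand in for one shared constraint store.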