On Implementing Sorting Network Machines with FPGAs

Presentation transcript:

On Implementing Sorting Network Machines with FPGAs
Rui Marcelino (UALG/EST), Horácio Neto (IST/INESC-ID), João M. P. Cardoso (IST/INESC-ID)
Jornadas REC 2007, IST, 8-9 February

Motivation
New kinds of devices, such as PDAs and mobile phones, bring new needs.
Database access from embedded devices is a reality and tends to grow.
Sorting is an integral component of most database systems.
Query performance in these systems is often dominated by the cost of the sorting algorithm.
Searching and sorting are becoming important features of embedded applications.

Idea
Develop a sorting machine, coupled to a microprocessor, to boost the overall performance of general embedded database applications.
[Figure: embedded microprocessor connected to the sorting machine.]

Sorting Network
A sorting network takes inputs x0, x1, ..., xn-1 and produces outputs y0, y1, ..., yn-1, where the outputs satisfy y0 ≤ y1 ≤ y2 ≤ … ≤ yn-1.

Sorting Network Algorithms
Odd-even transposition.
K. Batcher (1968): bitonic sort and odd-even merge sort.
Newer sorting algorithms have been proposed without significantly improving on Batcher's results.
[Figure: sort network block with port labels n, 2n, n.]

Graphical Representation (odd-even transposition)
[Figure: network with inputs x0 ... xn-1, outputs y0 ... yn-1, and stages 1, 2, 3, ..., n.]
Comp_Swap: comparator-swap element.
Stage: a number of disjoint Comp_Swap elements.
Depth: number of parallel steps.
Length (or size): total number of comparator-swap elements.
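The terms on this slide can be illustrated with a small software model. This is only a sketch in Python, not the FPGA design from the talk, and the function names are hypothetical. It assumes a network is stored as a list of stages, where each stage is a list of disjoint index pairs handled by one Comp_Swap each.

```python
def odd_even_transposition_network(n):
    """Build the n-stage odd-even transposition network for n inputs.

    Each stage is a list of disjoint (i, i + 1) pairs; stages alternate
    between pairs starting at index 0 and pairs starting at index 1.
    """
    stages = []
    for s in range(n):
        start = s % 2
        stages.append([(i, i + 1) for i in range(start, n - 1, 2)])
    return stages


def depth(stages):
    """Depth: number of parallel steps (stages)."""
    return len(stages)


def size(stages):
    """Length (size): total number of comparator-swap elements."""
    return sum(len(stage) for stage in stages)


def apply_network(stages, values):
    """Run the values through every comparator-swap, stage by stage."""
    data = list(values)
    for stage in stages:
        for i, j in stage:
            if data[i] > data[j]:
                data[i], data[j] = data[j], data[i]
    return data
```

For n = 6 this model has depth 6 and size 15, which agrees with n stages and n·(n-1)/2 comparators.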

Odd-Even Merge
[Figure: 16-input odd-even merge sorting network with inputs x0 to x15, intermediate signals a, b, a', b', outputs y0 to y15, and stages 1 to 10.]
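The comparator pattern of such a network can be generated in software. The sketch below uses the commonly published iterative formulation of Batcher's odd-even merge sort; it is a Python illustration with hypothetical function names, not the authors' hardware description, and it assumes the number of inputs is a power of two.

```python
def odd_even_merge_sort_network(n):
    """Return the stages of Batcher's odd-even merge sort for n inputs.

    n must be a power of two. Each stage is a list of (low, high)
    index pairs that can be compared and swapped in parallel.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = []
    p = 1
    while p < n:          # p: length of the sorted runs being merged
        k = p
        while k >= 1:     # k: distance between compared elements
            stage = []
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # Only compare elements inside the same merge block.
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        stage.append((i + j, i + j + k))
            stages.append(stage)
            k //= 2
        p *= 2
    return stages


def apply_network(stages, values):
    """Run the values through every comparator-swap, stage by stage."""
    data = list(values)
    for stage in stages:
        for i, j in stage:
            if data[i] > data[j]:
                data[i], data[j] = data[j], data[i]
    return data
```

Calling `odd_even_merge_sort_network(16)` yields a list with one entry per stage, so `len(...)` gives the depth and the summed stage lengths give the number of comparators.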

Proposal
Implementation of sorting networks on FPGA devices. Different implementations are possible:
Pipelined: more hardware resources, high data throughput (odd-even transposition, bitonic sort, odd-even merge).
Sequential: less hardware, with the network split into sequential stages.

Odd-Even
Comparators: n·(n-1)/2
Steps: n
Advantages: simplicity, locality, scalability
[Figure: network with inputs x0 ... xn-1, outputs y0 ... yn-1, and stages 1, 2, 3, ..., n.]
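The comparator count can be checked with a short derivation. It assumes the usual layout in which the n stages alternate between comparing pairs that start at the first input and pairs that start at the second input.

$$
\text{n even: } \frac{n}{2}\cdot\frac{n}{2} + \frac{n}{2}\left(\frac{n}{2}-1\right) = \frac{n}{2}\,(n-1) = \frac{n(n-1)}{2},
\qquad
\text{n odd: } n\cdot\frac{n-1}{2} = \frac{n(n-1)}{2}.
$$

For even n, half of the stages hold n/2 comparators and the other half hold n/2 - 1; for odd n, every stage holds (n-1)/2 comparators.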

Sequential - II
[Figure: inputs x0 ... xn-1 feed a clocked row of n/2 Comp-Swap elements, followed by a clocked row of n/2 - 1 Comp-Swap elements, producing outputs y0 ... yn-1.]

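Comparator Swap
[Figure: comparator-swap element with inputs A and B, a comparator (COMP) producing A > B, two multiplexers (MUX) with registers (REG) driving outputs L and H, a CHANGE (CHG) output, and a CLOCK input.]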
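The behavior of one comparator-swap element can be modeled in a few lines. This is a Python sketch of the combinational function only: the registers and clock are left out, and the assumption that L carries the smaller value and H the larger is a reading of the port names, not something the slide states.

```python
def comp_swap(a, b):
    """Model one comparator-swap element.

    Returns (low, high, change): the comparator result a > b selects,
    through the two multiplexers, which input goes to which output,
    and is also reported as the change flag.
    """
    change = a > b
    low = b if change else a
    high = a if change else b
    return low, high, change
```

The change flag is what allows a surrounding controller to notice that a pass produced no swaps.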

Sequential - II (animation)
[Animation frames for cycles 1 to 4 on the example values 2, 1, 3, 6, 5; Change = 1 in each of these cycles.]

Sequential - II (animation, continued)
[Animation frames for cycles 5 to 8; Change = 1 in cycles 5 and 6, Change = 0 in cycles 7 and 8.]

Sequential - I
[Figure: inputs x0 ... xn-1 pass through a switch network into n/2 Comp/Swap elements and n clocked registers (Regs), producing outputs y0 ... yn-1.]

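Comparator Swap
[Figure: comparator-swap element with n-bit inputs A and B, a comparator (COMP) producing A > B, two multiplexers (MUX) driving n-bit outputs L and H, and a CHANGE (CHG) output.]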

Switch Network

Sequential - I (animation)
[Animation frames for cycles 1 to 6 on the example values 2, 1, 3, 6, 5; Change = 1 in each of these cycles.]
Minimum latency: 2 cycles
Maximum latency: N cycles
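The latency bounds can be reproduced with a simple software model. The sketch below is an assumption about the control scheme rather than the authors' design: one odd-even transposition stage runs per clock cycle, and the machine stops once two consecutive cycles report no change, or after N cycles at the latest. The function name and the ascending sort order are illustrative.

```python
def sequential_sort(values):
    """Sort by running one odd-even transposition stage per cycle.

    Returns (sorted_values, cycles). Stops early after two consecutive
    cycles without a swap, and never runs more than len(values) cycles.
    """
    data = list(values)
    n = len(data)
    cycles = 0
    quiet = 0  # consecutive cycles in which no comparator swapped
    while cycles < n and quiet < 2:
        start = cycles % 2  # alternate between the two pair alignments
        changed = False
        for i in range(start, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                changed = True
        cycles += 1
        quiet = 0 if changed else quiet + 1
    return data, cycles
```

Under these assumptions, an already sorted input finishes in 2 cycles and a reversed input of N values needs N cycles.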

Latency

Experimental Results* and Conclusions
* Xilinx ISE 8.2i

Thanks!