A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS.

Slides:

Advertisements

Similar presentations

Computer Architecture

Advertisements

System Integration and Performance

Enhanced matrix multiplication algorithm for FPGA Tamás Herendi, S. Roland Major UDT2012.

An Integrated Reduction Technique for a Double Precision Accumulator Krishna Nagar, Yan Zhang, Jason Bakos Dept. of Computer Science and Engineering University.

Distributed Arithmetic

Parallell Processing Systems1 Chapter 4 Vector Processors.

Octavian Cret, Kalman Pusztai Cristian Vancea, Balint Szente Technical University of Cluj-Napoca, Romania CREC: A Novel Reconfigurable Computing Design.

Zheming CSCE715.  A wireless sensor network (WSN) ◦ Spatially distributed sensors to monitor physical or environmental conditions, and to cooperatively.

Using Cell Processors for Intrusion Detection through Regular Expression Matching with Speculation Author: C˘at˘alin Radu, C˘at˘alin Leordeanu, Valentin.

Computes the partial dot products for only the diagonal and upper triangle of the input matrix. The vector computed by this architecture is added to the.

1 A Tree Based Router Search Engine Architecture With Single Port Memories Author: Baboescu, F.Baboescu, F. Tullsen, D.M. Rosu, G. Singh, S. Tullsen, D.M.Rosu,

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department.

1 FPGA Lab School of Electrical Engineering and Computer Science Ohio University, Athens, OH 45701, U.S.A. An Entropy-based Learning Hardware Organization.

An FPGA Based Adaptive Viterbi Decoder Sriram Swaminathan Russell Tessier Department of ECE University of Massachusetts Amherst.

Presenter : Cheng-Ta Wu Antti Rasmus, Ari Kulmala, Erno Salminen, and Timo D. Hämäläinen Tampere University of Technology, Institute of Digital and Computer.

Sorting Chapter 10. Chapter 10: Sorting2 Chapter Objectives To learn how to use the standard sorting methods in the Java API To learn how to implement.

Deep Packet Inspection with Regular Expression Matching Min Chen, Danny Guo {michen, CSE Dept, UC Riverside 03/14/2007.

Pipelining By Toan Nguyen.

DETERMINATION OF THE TOPOLOGY OF HIGH SURVIVAL HF RADIO COMMUNICATION NETWORK Andrea Abrardo.

GPGPU platforms GP - General Purpose computation using GPU

Juanjo Noguera Xilinx Research Labs Dublin, Ireland Ahmed Al-Wattar Irwin O. Irwin O. Kennedy Alcatel-Lucent Dublin, Ireland.

Digital signature using MD5 algorithm Hardware Acceleration

Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.

Allen Michalski CSE Department – Reconfigurable Computing Lab University of South Carolina Microprocessors with FPGAs: Implementation and Workload Partitioning.

Efficient FPGA Implementation of QR

(TPDS) A Scalable and Modular Architecture for High-Performance Packet Classification Authors: Thilan Ganegedara, Weirong Jiang, and Viktor K. Prasanna.

High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.

Frank Casilio Computer Engineering May 15, 1997 Multithreaded Processors.

Sorting Chapter 10. Chapter Objectives  To learn how to use the standard sorting methods in the Java API  To learn how to implement the following sorting.

Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.

August 1, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 9: I/O Devices and Communication Buses * Jeremy R. Johnson Wednesday,

StrideBV: Single chip 400G+ packet classification Author: Thilan Ganegedara, Viktor K. Prasanna Publisher: HPSR 2012 Presenter: Chun-Sheng Hsueh Date:

4/19/20021 TCPSplitter: A Reconfigurable Hardware Based TCP Flow Monitor David V. Schuehler.

Floating-Point Divide and Square Root for Efficient FPGA Implementation of Image and Signal Processing Algorithms Xiaojun Wang, Miriam Leeser

Chapter 4 MARIE: An Introduction to a Simple Computer.

Company LOGO Final presentation Spring 2008/9 Performed by: Alexander PavlovDavid Domb Supervisor: Mony Orbach GPS/INS Computing System.

CPS 4150 Computer Organization Fall 2006 Ching-Song Don Wei.

Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.

Chapter One Introduction to Pipelined Processors

Reduction of Register File Power Consumption Approach: Value Lifetime Characteristics - Pradnyesh Gudadhe.

DECStation 3100 Block Instruction Data Effective Program Size Miss Rate Miss Rate Miss Rate 1 6.1% 2.1% 5.4% 4 2.0% 1.7% 1.9% 1 1.2% 1.3% 1.2% 4 0.3%

Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache And Pefetch Buffers Norman P. Jouppi Presenter:Shrinivas Narayani.

Liang, Introduction to Java Programming, Sixth Edition, (c) 2007 Pearson Education, Inc. All rights reserved Chapter 23 Algorithm Efficiency.

Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke

FPGA-Based System Design: Chapter 6 Copyright  2004 Prentice Hall PTR Topics n Low power design. n Pipelining.

Fast Lookup for Dynamic Packet Filtering in FPGA REPORTER: HSUAN-JU LI 2014/09/18 Design and Diagnostics of Electronic Circuits & Systems, 17th International.

SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems Abelardo Jara-Berrocal, Ann Gordon-Ross NSF.

DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:

Parallel tree search: An algorithmic approach for multi- field packet classification Authors: Derek Pao and Cutson Liu. Publisher: Computer communications.

Virtual-Channel Flow Control William J. Dally

Introduction to Computer Organization Pipelining.

Lecture Overview Shift Register Buffering Direct Memory Access.

CS Spring 2009 CS 414 – Multimedia Systems Design Lecture 27 – Media Server (Part 2) Klara Nahrstedt Spring 2009.

Introduction to Intrusion Detection Systems. All incoming packets are filtered for specific characteristics or content Databases have thousands of patterns.

Univ. of TehranIntroduction to Computer Network1 An Introduction to Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.

PipeliningPipelining Computer Architecture (Fall 2006)

Buffering Techniques Greg Stitt ECE Department University of Florida.

Buffering Techniques Greg Stitt ECE Department University of Florida.

Optimizing Interconnection Complexity for Realizing Fixed Permutation in Data and Signal Processing Algorithms Ren Chen, Viktor K. Prasanna Ming Hsieh.

Presenter: Darshika G. Perera Assistant Professor

Author: Yun R. Qu, Shijie Zhou, and Viktor K. Prasanna Publisher:

Backprojection Project Update January 2002

Hiba Tariq School of Engineering

School of Engineering University of Guelph

Buffer Management and Arbiter in a Switch

SLP1 design Christos Gentsos 9/4/2014.

Ming Liu, Wolfgang Kuehn, Zhonghai Lu, Axel Jantsch

Cache Memory Presentation I

CDA 3101 Spring 2016 Introduction to Computer Organization

A SRAM-based Architecture for Trie-based IP Lookup Using FPGA

Presentation transcript:

A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS David Andrews Computer Science and Computer Engineering The University of Arkansas 504 J.B. Hunt Building, Fayetteville, AR

Introduction

Introduction Sorting an important system function Popular sorting algorithms not efficient or fast in hardware implementations Linear sorters ideal for hardware, but sort at a rate of 1 value per cycle Sorting networks better at throughput, but with high area and latency cost Need a better solution for high throughput, low latency sorting

Contributions Expanding the linear sorter implementation and making it versatile, reconfigurable and better suited for streaming input and output Parallelizing the linear sorter for increased throughput Implementing the high-throughput linear sorter, and outmatching the performance of current linear sorter approaches

Background

Background Software quicksort, mergesort and heapsort use divide-and-conquer techniques to achieve efficiency Hardware sorting plagued with overhead from data movements, synchronization, bookkeeping and memory accesses Need better use of concurrent data comparisons and swaps, rather than the extended execution of multiple assembly instructions like its software counterpart

Sorting Networks Swap comparators sort pairs of values Sink lowest value, then operate on remaining S n-1 items Receive parallel data at inputs High #PE and latency, resort with each new insertion Bubble Sort

Sorting Networks Swap comparators sort pairs of values Sink lowest value, then operate on remaining S n-1 items Receive parallel data at inputs High #PE and latency, resort with each new insertion Bubble Sort

Sorting Networks Swap comparators sort pairs of values Sink lowest value, then operate on remaining S n-1 items Receive parallel data at inputs High #PE and latency, resort with each new insertion Bubble Sort

Sorting Networks Swap comparators sort pairs of values Sink lowest value, then operate on remaining S n-1 items Receive parallel data at inputs High #PE and latency, resort with each new insertion Bubble Sort

Sorting Networks Swap comparators sort pairs of values Sink lowest value, then operate on remaining S n-1 items Receive parallel data at inputs High #PE and latency, resort with each new insertion Bubble Sort

Sorting Networks Swap comparators sort pairs of values Sink lowest value, then operate on remaining S n-1 items Receive parallel data at inputs High #PE and latency, resort with each new insertion Bubble Sort

Sorting Networks Swap comparators sort pairs of values Sink lowest value, then operate on remaining S n-1 items Receive parallel data at inputs High #PE and latency, resort with each new insertion Bubble Sort

Linear Sorters Sorted insertions Forwards incoming value to all nodes Each node shifts autonomously depending on neighbors’ values Single clock latency, small logic & regular structure Streaming input & output Serial input, need higher throughput Input: Output:

Linear Sorters Sorted insertions Forwards incoming value to all nodes Each node shifts autonomously depending on neighbors’ values Single clock latency, small logic & regular structure Streaming input & output Serial input, need higher throughput Input: Output: 3 3

Linear Sorters Sorted insertions Forwards incoming value to all nodes Each node shifts autonomously depending on neighbors’ values Single clock latency, small logic & regular structure Streaming input & output Serial input, need higher throughput 3 3 Input: Output: 2 2

Linear Sorters Sorted insertions Forwards incoming value to all nodes Each node shifts autonomously depending on neighbors’ values Single clock latency, small logic & regular structure Streaming input & output Serial input, need higher throughput Input: Output: 5 5

Linear Sorters Sorted insertions Forwards incoming value to all nodes Each node shifts autonomously depending on neighbors’ values Single clock latency, small logic & regular structure Streaming input & output Serial input, need higher throughput Input: Output: 4 4

Linear Sorters Sorted insertions Forwards incoming value to all nodes Each node shifts autonomously depending on neighbors’ values Single clock latency, small logic & regular structure Streaming input & output Serial input, need higher throughput Input: Output: 1 1

Linear Sorters Sorted insertions Forwards incoming value to all nodes Each node shifts autonomously depending on neighbors’ values Single clock latency, small logic & regular structure Streaming input & output Serial input, need higher throughput Input: Output:

Configurable Linear Sorter

Increase versatility for linear sorters Configurable: ◦ Linear sorter depth ◦ Sorting direction ◦ Sort on tags (for example, timestamps) rather than data ◦ User-defined data and tag size

Configurable Linear Sorter Increase functionality for linear sorters 1. Detect full conditions 2. Buffer input while full 3. Retrieve output serially for streaming 4. Delete top value, freeing nodes 5. Augment with left shift functionality 6. Test tags before deleting them

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node 6

Extended Linear Sorter System 5 5 Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Interleaved Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Interleaved Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Interleaved Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Extended Linear Sorter System Clock Cycles Inserted Tags Sorted Output Node 1 Node 2 Node 3 Node 4 Node 5 Node

Sorter Node Architecture

Interleaved Linear Sorter

Increase throughput by using multiple linear sorters with parallel inputs Interleave parallel inputs into linear sorters through modulo arithmetic Distribute data evenly among linear sorters to avoid full conditions Service each linear sorter in round-robin fashion to resort their outputs

Interleaved Linear Sorter ILS width = 4 Four parallel inputs ◦ {13, 04, 03, 10} After interleaving mod 4 ◦ {04, 13, 10, 03} ◦ [00, 01, 02, 03] But, what happens for inputs ◦ {07, 05, 01, 06} ?

Interleaved Linear Sorter Clock Cycles LS 0 LS 1 LS 2 LS Inserted Tags

Interleaved Linear Sorter Clock Cycles LS 0 LS 1 LS 2 LS Inserted Tags Inserted

Interleaved Linear Sorter Clock Cycles LS 0 LS 1 LS 2 LS Inserted Tags Inserted Preempted

Interleaved Linear Sorter Clock Cycles LS 0 LS 1 LS 2 LS Inserted Tags Inserted Preempted

Interleaved Linear Sorter Clock Cycles LS 0 LS 1 LS 2 LS Inserted Tags Inserted Preempted

Interleaved Linear Sorter Clock Cycles LS 0 LS 1 LS 2 LS Inserted Tags Inserted Preempted

Interleaved Linear Sorter 0123 Clock Cycles LS 0 LS 1 LS 2 LS Inserted Tags Inserted Preempted

Interleaved Linear Sorter 0123 Clock Cycles LS 0 LS 1 LS 2 LS Inserted Tags Inserted Preempted

Input Distribution and Latency Assume uniformly distributed data With W = 2, 50% chance of interleaving delays, adding an extra clock cycle With W > 2, more probabilities of stalling, adding 1+ clock cycles ILS Width WClock Cycles Average latency for Interleaved Linear Sorters of length W Larger ILS widths allow parallel sorting and increase throughput. However, they have complex routing and additional delays

Output Logic Accumulate values before output becomes relevant Increase linear sorter depth to accumulate more data Re-sort top values from each linear sorter to ensure continuity (particularly for contiguous values) Service in round-robin fashion: Test the top tag of each linear sorter before deleting

Streaming output 0123 LS 0 LS 1 LS 2 LS OK Output {8,9}; Wait for 10

Hardware Implementation

FPGA Area Linear Sorters, W Total SlicesSlices/NodeArea Overhead Interleaving Area (slices) % % % % %1081 Xilinx Virtex-5 FPGA 8-bit tags, 8-bit data 17 slices per sorter node Linear Sorter depth of 16 Interleaved Linear Sorter FPGA Area

FPGA Throughput f clk x ILS width W Includes logic and routing delay for interleaving data for W linear sorters Averaged for sorter depth from 1 up to 256 nodes Interleaved Linear Sorter FPGA Frequency & Throughput Linear Sorters, W Frequency (MHz) Throughput (millions/sec)

FPGA Throughput W=16, large logic & routing delays at inputs First three cases need single 6-input LUT for routing. Two and four LUTs needed for W=8 and W=16, respectively. Interleaved Linear Sorter FPGA Frequency & Throughput Linear Sorters, W Frequency (MHz) Throughput (millions/sec)

Maximum ILS Throughput Average speedups of 1.0, 1.8, 3.7 and 3.5 against single linear sorter

Throughput considerations Interleaving contention results in an average latency which increases with ILS width W ILS Width WClock Cycles Average latency for Interleaved Linear Sorters of length W

Normalized ILS Throughput Average speedups of 1.0, 1.3, 1.8 and 1.4 against single linear sorter

Virtex II-Pro implementation 100 MHz bus frequency Data resided on BlockRAMs Tested Interleaved Linear Sorter W=4 Compared against MicroBlaze running quicksort in C Timing includes bus arbitration, read and writes over the OPB Final result is saved in BlockRAM

Three test scenarios 1. MicroBlaze BRAM write & read-back ◦ MB writes BRAM unsorted data ◦ MB sends start signal to ILS ◦ MB reads back sorted values over OPB 2. MicroBlaze BRAM write ◦ Same as scenario 1 ◦ No need for read back into MB 3. No MicroBlaze ◦ Hardware-only streaming output approach ◦ No need for OPB requests ◦ Output consumed by other hardware components immediately

ILS speedup over MicroBlaze SystemClock cyclesSpeedup MicroBlaze quicksort49,9821 ILS 1 – MB write & read ILS 2 – MB write73268 ILS 3 – Hardware-only bit data 16-bit tags 64 values ILS width of 4

ILS speedup against Sorting Network SystemExecution time (ns)Speedup Batcher odd-even951.0 ILS – Hardware only Sorting network requires static data set Re-sorts the full set of data upon a single new insertion 16-bit data-tags 32 values ILS width of 4

Conclusions

Conclusions Linear sorters appropriate to overcome sorting network disadvantages, but limited throughput Interleaved Linear Sorters allow high throughput, configuration of width, depth, tag and data size to match system requirements ILS speedup of 1.8 over regular linear sorter (3.7 for best case) ILS speedup of 68 against embedded software counterpart (1666 for best case)

Questions