Presentation transcript:

J. Harkins, 1 of 51, MAPLD2005/C178
Sorting on the SRC 6 Reconfigurable Computer
John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang
The George Washington University, Washington, DC

Algorithms
Quick Sort
Heap Sort
Radix Sort
Bitonic Sort
Odd/Even Merge

SRC System Architecture
16 Port Crossbar Switch, 1.6 GB/s Peak Port BW
Processor Node, FPGA Node, Memory Node
Up to 16 Nodes per Switch
Diagram annotation: 64

Example - Quick Sort
0: [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
med: [ 0][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][13]
QS1: [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8][ 9][12][15][14][11][10][13]
mL: [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8]
PS: [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8]

Quick Sort - MIMD Architecture
Bank A, Bank B, Bank C, Bank D, Bank E, Bank F
FPGA 1: QS 1, QS 2, QS 3, 90%
FPGA 2: QS 4, QS 5, QS 6, 84%
6 Instances
Median of 3 to select pivot
Pipeline Sort for partitions ≤ 10 vs. Insertion Sort ≤ 20
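The pivot selection and small-partition cutoff listed above can be sketched in plain C. This is a minimal software illustration, not the authors' MAP C code: the function names, the use of `long` indices, and the Hoare-style partition are assumptions, and an insertion sort with the slide's threshold of 20 stands in for both small-partition strategies. On the 16-key example, `median_of_3` reproduces the "med" row (0, 9, and 13 land at positions 0, 7, and 15).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CUTOFF 20 /* small-partition threshold, from "Insertion Sort <= 20" */

static void swap_u64(uint64_t *x, uint64_t *y)
{
    uint64_t t = *x; *x = *y; *y = t;
}

/* Sort a[lo..hi] (inclusive) by straight insertion; empty ranges are fine. */
void insertion_sort(uint64_t *a, long lo, long hi)
{
    for (long i = lo + 1; i <= hi; i++) {
        uint64_t key = a[i];
        long j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

/* Order a[lo], a[mid], a[hi] in place and return mid, where the median now sits. */
long median_of_3(uint64_t *a, long lo, long hi)
{
    long mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo]) swap_u64(&a[mid], &a[lo]);
    if (a[hi] < a[lo]) swap_u64(&a[hi], &a[lo]);
    if (a[hi] < a[mid]) swap_u64(&a[hi], &a[mid]);
    return mid;
}

/* Quick sort on a[lo..hi]; partitions of CUTOFF keys or fewer go to insertion sort. */
void quick_sort_range(uint64_t *a, long lo, long hi)
{
    if (hi - lo + 1 <= CUTOFF) {
        insertion_sort(a, lo, hi);
        return;
    }
    uint64_t pivot = a[median_of_3(a, lo, hi)];
    long i = lo, j = hi;
    while (i <= j) { /* Hoare-style partition around the pivot value */
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) { swap_u64(&a[i], &a[j]); i++; j--; }
    }
    quick_sort_range(a, lo, j);
    quick_sort_range(a, i, hi);
}

void quick_sort(uint64_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    if (n > 1) quick_sort_range(a, 0, (long)n - 1);
}

/* Helper for checking results: 1 if a[0..n-1] is non-decreasing. */
int is_sorted(const uint64_t *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (a[i - 1] > a[i]) return 0;
    return 1;
}
```

Calling `quick_sort(keys, n)` sorts in place. With only 16 keys the whole array falls under the cutoff, so a larger input is needed to exercise the partition step.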

Example - Heap Sort
0: [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
0 and 9 exchanged: [13][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][ 0]
6 and 11 exchanged: [13][ 3][14][15][10][ 2][11][ 9][ 8][ 4][12][ 7][ 5][ 6][ 1][ 0]
max: [15][13][14][ 9][12][ 7][11][ 3][ 8][ 4][10][ 2][ 5][ 6][ 1][ 0]

Heap Sort - MIMD Architecture
Bank A, Bank B, Bank C, Bank D, Bank E, Bank F
FPGA 1: HS 1, HS 2, HS 3, 55%
FPGA 2: HS 4, HS 5, HS 6, 5%
6 Instances
Almost identical to processor code
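Since the FPGA version is described as almost identical to the processor code, a conventional array-based heap sort is a reasonable reference point. The sketch below is a textbook formulation with assumed function names, not code taken from the slides. Its bottom-up heap construction reproduces the "max" row of the heap sort example when run on the same 16 keys.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Restore the max-heap property for the subtree rooted at i within a[0..n-1]. */
void sift_down(uint64_t *a, size_t i, size_t n)
{
    for (;;) {
        size_t largest = i;
        size_t left = 2 * i + 1, right = 2 * i + 2;
        if (left < n && a[left] > a[largest]) largest = left;
        if (right < n && a[right] > a[largest]) largest = right;
        if (largest == i) return;
        uint64_t t = a[i]; a[i] = a[largest]; a[largest] = t;
        i = largest;
    }
}

/* Bottom-up construction: sift down every internal node, last parent first. */
void build_max_heap(uint64_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t i = n / 2; i-- > 0; )
        sift_down(a, i, n);
}

/* Repeatedly move the maximum to the end and shrink the heap by one. */
void heap_sort(uint64_t *a, size_t n)
{
    build_max_heap(a, n);
    for (size_t end = n; end > 1; end--) {
        uint64_t t = a[0]; a[0] = a[end - 1]; a[end - 1] = t;
        sift_down(a, 0, end - 1);
    }
}
```

The data-dependent branches inside `sift_down` are the kind of loop-carried dependence that is hard to pipeline, which may be one reason a direct port gains little on an FPGA.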

Example - Radix Sort
1: [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
Pass1: count 1 = 4, count 2 = 4, count 3 = 4, count 4 = 4
index 0 = 0, index 1 = 4, index 2 = 8, index 3 = 12
$\mathrm{index}_0 = 0, \quad \mathrm{index}_n = \sum_{i=1}^{n} \mathrm{count}_i \quad (n > 0)$

2: [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
Pass2: index 0 = 0, index 1 = 4, index 2 = 8, index 3 = 12
count 0 = 0, count 1 = 0, count 2 = 0, count 3 = 0

2: [ ][ ][ ][ ][13][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
Pass2: index 0 = 0, index 1 = 5, index 2 = 8, index 3 = 12
count 0 = 0, count 1 = 0, count 2 = 0, count 3 = 1

2: [ ][ ][ ][ ][13][ ][ ][ ][ ][ ][ ][ ][ 3][ ][ ][ ]
Pass2: index 0 = 0, index 1 = 5, index 2 = 8, index 3 = 13
count 0 = 1, count 1 = 0, count 2 = 0, count 3 = 1

2: [ ][ ][ ][ ][13][ ][ ][ ][14][ ][ ][ ][ 3][ ][ ][ ]
Pass2: index 0 = 0, index 1 = 5, index 2 = 9, index 3 =
count 0 = 1, count 1 = 0, count 2 = 0, count 3 = 2

3: [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]
Pass3: index 0 = 4, index 1 = 8, index 2 = 12, index 3 =

Radix Sort - MIMD Architecture
Bank A, Bank B, Bank C, Bank D, Bank E, Bank F
FPGA 1: 33%
FPGA 2: 5%
Radix Sort 1, Radix Sort 2, Radix Sort 3
3 Instances
Uses enumeration sort
Radix 13 bits vs. 8 bits
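The radix sort example walks through a least-significant-digit sort with four buckets: one counting pass, then scatter passes that place each key at its bucket index while counting the next digit. A minimal C sketch of that scheme follows. It assumes a 2-bit digit to match the 16-key example (the slide above quotes 13 bits for the FPGA version), and the function names and scratch-buffer interface are illustrative rather than taken from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RADIX_BITS 2u               /* digit width used in the 16-key example */
#define BUCKETS (1u << RADIX_BITS)

/* Extract digit number `pass` (0 = least significant) from a key. */
static unsigned digit(uint64_t key, unsigned pass)
{
    return (unsigned)(key >> (pass * RADIX_BITS)) & (BUCKETS - 1u);
}

/* LSD radix sort of keys[0..n-1] over `passes` digits; scratch must hold n keys. */
void radix_sort(uint64_t *keys, uint64_t *scratch, size_t n, unsigned passes)
{
    assert(passes * RADIX_BITS <= 64);
    size_t count[BUCKETS] = {0};
    size_t next_count[BUCKETS];
    size_t index[BUCKETS];

    /* "Pass1" in the example: histogram of the lowest digit only. */
    for (size_t i = 0; i < n; i++)
        count[digit(keys[i], 0)]++;

    uint64_t *src = keys, *dst = scratch;
    for (unsigned p = 0; p < passes; p++) {
        /* index[b] = number of keys in buckets below b (running sum of counts). */
        index[0] = 0;
        for (unsigned b = 1; b < BUCKETS; b++)
            index[b] = index[b - 1] + count[b - 1];

        /* Scatter by the current digit while counting the next digit. */
        memset(next_count, 0, sizeof next_count);
        for (size_t i = 0; i < n; i++) {
            uint64_t k = src[i];
            dst[index[digit(k, p)]++] = k;
            if (p + 1 < passes)
                next_count[digit(k, p + 1)]++;
        }
        memcpy(count, next_count, sizeof count);

        uint64_t *t = src; src = dst; dst = t;
    }
    if (src != keys)
        memcpy(keys, src, n * sizeof *keys);
}
```

For the 4-bit keys of the example, `radix_sort(keys, scratch, 16, 2)` performs the counting pass plus two scatter passes. The first three scattered keys (13, 3, 14) land at positions 4, 12, and 8, as in the slides.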

MIMD Code Structure
main.c
int main( ) {
    int n = *6;
    int64 *buf;
    buf = cacheAlign(n);
    mapSort(buf, n);
    free(buf);
    exit(0);
}
mapSort.mc
void mapSort(int64 *buf, int n) {
    OBM_BANK_A (bufA, int64, n/6)
    OBM_BANK_B (bufB, int64, n/6)
    …
    OBM_BANK_F (bufF, int64, n/6)
    DMA_CPU(dir, bufA, stripes, buf, n);
    #pragma src parallel sections
    {
        #pragma src section
        {Xsort(bufA, n/6);}
        #pragma src section
        {Xsort(bufB, n/6);}
        …
        #pragma src section
        {Xsort(bufF, n/6);}
    }
    DMA_CPU(dir, bufA, stripes, buf, n);
    return;
}

Example - Bitonic Sort
[Figure: sorting network built from L/H elements, operating on rows 0 to 3]
Input Keys:
0: [13][ 3][14][15]
1: [10][ 2][ 6][ 0]
2: [ 8][ 4][12][ 7]
3: [ 5][11][ 1][ 9]
Schedule: (0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
Keys after the first two schedule steps, (0,1) and (3,2):
0: [ 0][ 2][ 3][ 6]
1: [10][13][14][15]
2: [ 8][ 9][11][12]
3: [ 1][ 4][ 5][ 7]

Bitonic Sort - SIMD Architecture
Bank A, Bank B, Bank C, Bank D, Bank E, Bank F
FPGA 1: 8 Input Bitonic Sorting Network 1, 27%
FPGA 2: 5%
4 Input Bitonic Sort 2
SIMD Controller
2 Instances
Parallel sorting network
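The schedule in the bitonic sort example can be modeled in software as repeated 8-key sorts over pairs of 4-key rows. The sketch below is an interpretation, not the authors' VHDL: it assumes that a schedule entry (a,b) means rows a and b are sorted together, with the four smallest keys returned to row a and the four largest to row b, which is consistent with the key values shown in the example. The function names and the loop formulation of the bitonic network are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROW_KEYS 4 /* keys per row in the slide example */
#define NET_KEYS 8 /* inputs of the sorting network */

/* 8-input bitonic sorting network written as nested compare-exchange loops. */
void bitonic_sort8(uint64_t v[NET_KEYS])
{
    for (unsigned k = 2; k <= NET_KEYS; k <<= 1) {
        for (unsigned j = k >> 1; j > 0; j >>= 1) {
            for (unsigned i = 0; i < NET_KEYS; i++) {
                unsigned p = i ^ j; /* partner wire for this stage */
                if (p <= i) continue;
                int ascending = (i & k) == 0;
                int out_of_order = ascending ? (v[i] > v[p]) : (v[i] < v[p]);
                if (out_of_order) {
                    uint64_t t = v[i]; v[i] = v[p]; v[p] = t;
                }
            }
        }
    }
}

/* Apply `steps` schedule entries; each entry names (low row, high row). */
void run_schedule(uint64_t rows[][ROW_KEYS], const unsigned schedule[][2], size_t steps)
{
    assert(rows != NULL && schedule != NULL);
    for (size_t s = 0; s < steps; s++) {
        unsigned lo = schedule[s][0], hi = schedule[s][1];
        uint64_t v[NET_KEYS];
        for (unsigned c = 0; c < ROW_KEYS; c++) {
            v[c] = rows[lo][c];
            v[ROW_KEYS + c] = rows[hi][c];
        }
        bitonic_sort8(v);
        for (unsigned c = 0; c < ROW_KEYS; c++) {
            rows[lo][c] = v[c];            /* four smallest keys */
            rows[hi][c] = v[ROW_KEYS + c]; /* four largest keys */
        }
    }
}
```

Running all six schedule entries on the example input leaves rows 0 to 3 holding 0 to 15 in order. Stopping after two entries gives the intermediate state shown in the example.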

Example - Odd/Even Merge
[Figure: merge network of L/H elements, a MUX, and Z -1 / Z -2 delay stages]
Input Keys:
A: [ 0][ 1][ 2][ 4][ 7][11][12][14]
B: [ 3][ 5][ 6][ 8][ 9][10][13][15]
Merged Keys:
C: [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]

A: [ ][ ][ 2][ 4][ 7][11][12][14]
B: [ ][ ][ 6][ 8][ 9][10][13][15]
C: [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]

A: [ ][ ][ ][ ][ 7][11][12][14]
B: [ ][ ][ 6][ 8][ 9][10][13][15]
C: [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]

A: [ ][ ][ ][ ][ ][ ][12][14]
B: [ ][ ][ 6][ 8][ 9][10][13][15]
C: [ 0][ 1][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]

A: [ ][ ][ ][ ][ ][ ][12][14]
B: [ ][ ][ ][ ][ 9][10][13][15]
C: [ 0][ 1][ 2][ 3][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]

Odd/Even Merge - SIMD Architecture
Bank A, Bank B, Bank C, Bank D, Bank E, Bank F
FPGA 1: 40%
FPGA 2: 5%
Odd Merge Two, Even Merge Two, Merge Out
1 Instance
Parallel sorting network
A/B = odd ; C/D = even

SIMD Code Structure
main.c
int main( ) {
    int n = *6;
    int64 *buf;
    buf = cacheAlign(n);
    mapSort(buf, n);
    free(buf);
    exit(0);
}
mapSort.mc
void mapSort(int64 *buf, int n) {
    OBM_BANK_A (AA, int64, n/6)
    OBM_BANK_B (BB, int64, n/6)
    …
    OBM_BANK_F (FF, int64, n/6)
    DMA_CPU(dir, AA, stripes, buf, n);
    for (i=0; i<rounds; i++) {
        schedule( &r1, &r2);
        bitonicSort8(AA[r1],BB[r1],CC[r1],DD[r1],
                     AA[r2],BB[r2],CC[r2],DD[r2],
                     &AA[r1],&BB[r1],&CC[r1],&DD[r1],
                     &AA[r2],&BB[r2],&CC[r2],&DD[r2]);
        bitonicSort4(EE[r1],FF[r1],EE[r2],FF[r2], … );
    }
    DMA_CPU(dir, bufA, stripes, buf, n);
    return;
}

Implementation Comparisons
Columns, in source order: Algorithm; Processor; Complexity; Language; Compiler; Lines Of Code; Recursion; FPGA Util. % Slices; MIMD / SIMD; Refactoring; Upper Bound x10^6 keys/s
Quick Sort: X86: N lg N, C, 81. FPGA: N lg N, MC, 97/96, n/a, 90,
Heap Sort: X86: N lg N, C, 55, -. FPGA: N lg N, MC, 56/54, n/a, 55,
Radix Sort: X86: N, C, 70, -. FPGA: N, MC, 81/64, n/a, 33,
Bitonic Sort: X86: N lg^2 N, C, 78. FPGA: lg^2 N, VHDL, 53/478/365, n/a, 27,06.32
O/E Merge: X86: N, C, 52, -. FPGA: N, MC, 71/120, n/a, 40,
Compiler key: icc v8.0 -fast; mcc v1.8; mcc v1.9
X86 = Dual Xeon 2.8GHz; FPGA = 100MHz; MC = MAP C
Refactoring scale: entirely; major changes; some; very little; almost none

Lesson Learned #1
[Table: columns Quick Sort, Heap Sort, Radix Sort, Bitonic Sort, O/E Merge; rows 2.8 GHz Xeon x10^6 keys/s (gcc, icc -fast), FPGA upper bound estimate x10^6 keys/s, Upper bound on speedup (vs gcc, vs icc)]
Know your tools
Develop accurate assessments early

Test Conditions
64 bit unsigned integer keys
Uniformly distributed
Randomly permuted
Scores average of 10 runs
FPGA configuration time ~65ms
DMA time ~18ms
Typical key quantity 3.14M
Processor comparison: Xeon 2.8GHz, 1GB mem

Experimental Results - 64 bit keys
[Chart: x10^6 keys/s by sorting algorithm]

mcc Compiler
Attempts to pipeline inner loops
–Maintains sequential behavior of C
–Reports dependencies/penalties
Quick Sort: 1 penalty*
Heap Sort: 12 penalties
Radix Sort: 2 penalties
Bitonic Sort: 5 penalties
Odd/Even Merge: 1 penalty
Easy to build embarrassingly parallel code
Resource usage ~2x HDL

Conclusion
FPGAs not best choice for sorting
Sorting is memory bound
–Tight loops, low computation suited to processor
–More parallel memory accesses
–Faster clock rates
Refactoring for better performance
–FPGAs underutilized
–Understand compiler limitations
–Eliminate dependencies

Tight Loop Example
Merge
a[N]=b[N]=infinity;
j=k=0;
Loop i = 0 to 2N-1 {
    if (a[j] > b[k]) merged[i] = b[k++];
    else merged[i] = a[j++];
}
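The merge loop above translates almost directly into C. In this sketch `UINT64_MAX` stands in for "infinity", so it assumes no real key has that value, and each input array is assumed to have one spare slot at index n for the sentinel. The function name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Merge two sorted runs of n keys each into merged[0..2n-1].
 * a and b must each have room for n + 1 keys: slot n receives a sentinel
 * so the loop body needs no bounds checks. Keys must be below UINT64_MAX.
 */
void merge_with_sentinels(uint64_t *a, uint64_t *b, size_t n, uint64_t *merged)
{
    assert(a != NULL && b != NULL && merged != NULL);
    a[n] = b[n] = UINT64_MAX; /* "infinity" */
    size_t j = 0, k = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        if (a[j] > b[k])
            merged[i] = b[k++];
        else
            merged[i] = a[j++];
    }
}
```

Each iteration performs one comparison, one store, and one index update, and the next iteration's loads depend on which index moved. That dependence is what makes this loop tight on a processor and awkward to pipeline.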

Future Work
More refactoring
–Greater use of block rams
–HW prediction to reduce penalties
FPGA performance gain = ƒ(computation density/memory access)