Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer
Allen Michalski, Reconfigurable Computing Lab, CSE Department, University of South Carolina
MAPLD 2005, Paper 253
Page 2: Outline
- Reconfigurable computing: introduction
- SRC-6e architecture and programming model
- Sorting algorithms and design guidelines
- Testing procedures and results
- Conclusions, future work, and lessons learned
Page 3: What is a Reconfigurable Computer?
A combination of:
- a microprocessor workstation for front-end processing
- an FPGA back end for specialized coprocessing
- a typical PC bus for communications
Page 4: What is a Reconfigurable Computer? (continued)
PC characteristics:
- High clock speed
- Superscalar, pipelined execution
- Out-of-order issue and speculative execution
- Programmed in a high-level language
FPGA characteristics:
- Low clock speed
- Large number of configurable elements: LUTs, block RAMs, CPAs, multipliers
- Programmed in an HDL
Page 5: What is the SRC-6e?
- SRC = Seymour R. Cray
- A reconfigurable computer with a high-throughput memory interface: 1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads
- For comparison, PCI-X (1.0) = 1.064 GB/s
Page 6: SRC-6e Development
- Programming does not require knowledge of hardware design
- C code can compile to hardware
Page 7: SRC Design Objectives
FPGA considerations:
- Superscalar design with parallel, pipelined execution
SRC considerations:
- High overall data throughput: streaming versus non-streaming data transfer?
- Reduction of FPGA data-processing stalls due to data dependencies and read/write delays: FPGA block RAM versus SRC OnBoard Memory?
- Evaluate software/hardware partitioning: algorithm partitioning and data-size partitioning
Page 8: Sorting Algorithms
Traditional algorithms:
- Comparison sorts (Ω(n lg n) comparisons in the worst case): insertion sort, merge sort, heapsort, quicksort
- Counting sorts: radix sort, Θ(d(n+k))
Baseline: the HPCS FORTRAN code, which uses radix sort in combination with heapsort.
This research focuses on 128-bit operands, which simplifies SRC data transfer and management.
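The Θ(d(n+k)) bound above comes from d stable counting-sort passes over n keys with k buckets per digit. As an illustrative software sketch (not the HPCS FORTRAN baseline), an LSD radix-256 sort of 32-bit keys needs d = 4 passes; the names `counting_pass` and `radix_sort_u32` are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* One stable counting-sort pass over an 8-bit digit; O(n + k) with k = 256. */
static void counting_pass(const uint32_t *src, uint32_t *dst, size_t n, int shift) {
    size_t count[256] = {0};
    for (size_t i = 0; i < n; i++)
        count[(src[i] >> shift) & 0xFF]++;
    size_t offset = 0;
    for (int b = 0; b < 256; b++) {          /* prefix sums -> bucket start offsets */
        size_t c = count[b];
        count[b] = offset;
        offset += c;
    }
    for (size_t i = 0; i < n; i++)
        dst[count[(src[i] >> shift) & 0xFF]++] = src[i];
}

/* LSD radix sort: d = 4 digit passes over 32-bit keys.
   After an even number of passes the sorted data ends up back in a[]. */
void radix_sort_u32(uint32_t *a, uint32_t *tmp, size_t n) {
    for (int shift = 0; shift < 32; shift += 8) {
        counting_pass(a, tmp, n, shift);
        uint32_t *t = a; a = tmp; tmp = t;   /* ping-pong between the two buffers */
    }
}
```

The ping-pong buffering mirrors the slide's k-bucket trade-off: a larger radix means fewer passes but more buckets per pass, which is exactly the radix-8 versus radix-16 tension discussed in the results.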
Page 9: Sorting – SRC FPGA Implementation
Memory constraints:
- SRC OnBoard Memory: 6 banks x 4 MB; pipelined read or write access with 5-clock latency
- FPGA BRAM: 144 blocks of 18 Kbit each; 1-clock read and write latency
Initial choices:
- Parallel insertion sort ("bubblesort"): produces sorted blocks; uses pipelined OnBoard Memory processing to minimize data-access stalls
- Parallel heapsort: random-access merge of sorted lists; uses BRAM for low-latency random data access
Page 10: Parallel Insertion Sort (Bubblesort)
- Systolic array of cells; pipelined SRC processing from OnBoard Memory
- Each cell keeps the highest value it has seen and passes the other values on
- Latency: 2x the number of cells
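The cell behavior above (keep the highest value seen, pass the rest downstream) can be modeled sequentially in C. In the hardware design each cell is its own pipeline stage and all cells compare in parallel, which is where the 2x-cells latency comes from; this sketch serializes that, and `cell_t`/`chain_push` are illustrative names, not the MAP C interface:

```c
#include <stdint.h>

#define NCELLS 8   /* illustrative; the FPGA build fits about 100 cells */

/* One systolic comparator cell: holds the largest key it has seen. */
typedef struct {
    uint64_t held;
    int occupied;
} cell_t;

/* Push one value into the chain.  Each occupied cell keeps the larger of
   (its held value, the incoming value) and passes the smaller downstream,
   so after all pushes the cells hold the values in descending order. */
void chain_push(cell_t cells[], uint64_t v) {
    for (int i = 0; i < NCELLS; i++) {
        if (!cells[i].occupied) {
            cells[i].held = v;
            cells[i].occupied = 1;
            return;
        }
        if (v > cells[i].held) {             /* swap: keep max, pass the rest */
            uint64_t t = cells[i].held;
            cells[i].held = v;
            v = t;
        }
    }
}
```

Reading the cells front to back after the last push yields the sorted block in descending order, matching the slide's note that results come out in reverse order of comparison.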
Page 11: Parallel Insertion Sort (Bubblesort), continued
- Systolic array of cells; results are passed out in reverse order of comparison
- N = number of comparator cells
- Sorting a full list of size L this way would take Θ(L²)
- Instead, limit the sort size to some number a < L (the list size): create multiple sorted lists, each sorted in Θ(a)
Page 12: Parallel Insertion Sort (Bubblesort) – MAP C source (abridged)

    #include
    void parsort_test(int arraysize, int sortsize, int transfer,
                      uint64_t datahigh_in[], uint64_t datalow_in[],
                      uint64_t datahigh_out[], uint64_t datalow_out[],
                      int64_t *start_transferin, int64_t *start_loop,
                      int64_t *start_transferout, int64_t *end_transfer,
                      int mapno)
    {
        OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
        OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

        /* DMA the input array from common memory into OnBoard Memory */
        DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
        wait_DMA(0);
        ...
        while (arrayindex < arraysize) {
            endarrayindex = arrayindex + sortsize - 1;
            if (endarrayindex > arraysize - 1)
                endarrayindex = arraysize - 1;
            while (arrayindex < endarrayindex) {
                for (i = arrayindex; i <= endarrayindex; i++) {
                    /* feed one 128-bit value (high/low halves) through the cells */
                    data_high_in = a[i];
                    data_low_in  = b[i];
                    parsort(i == endarrayindex, data_high_in, data_low_in,
                            &data_high_out, &data_low_out);
                    c[i] = data_high_out;
                    d[i] = data_low_out;
        ...
Page 13: Parallel Heapsort
- Tree structure of cells; asynchronous operation with acknowledged data transfers
- Merges sorted lists in Θ(n lg n)
- Designed for independent BRAM block accesses
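The merge step can be sketched in plain C as a binary min-heap holding the current head of each sorted list; each of the n output values costs O(lg k) heap work with k input lists, in line with the Θ(n lg n) figure above. This is a software sketch of the merge, not the tree-of-cells hardware design; `heap_merge` is a hypothetical name and the fixed heap capacity of 64 lists is an assumption:

```c
#include <stdint.h>
#include <stddef.h>

/* One heap entry: the current head of one sorted input list. */
typedef struct {
    uint64_t key;
    size_t list, pos;
} node_t;

static void sift_down(node_t h[], size_t n, size_t i) {
    for (;;) {
        size_t l = 2*i + 1, r = l + 1, m = i;
        if (l < n && h[l].key < h[m].key) m = l;
        if (r < n && h[r].key < h[m].key) m = r;
        if (m == i) return;
        node_t t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Merge k sorted lists (lists[i] of length len[i]) into out[]. */
void heap_merge(const uint64_t *const lists[], const size_t len[],
                size_t k, uint64_t out[]) {
    node_t h[64];                             /* assumes k <= 64 */
    size_t n = 0, o = 0;
    for (size_t i = 0; i < k; i++)
        if (len[i] > 0) h[n++] = (node_t){ lists[i][0], i, 0 };
    for (size_t i = n; i-- > 0; )             /* heapify */
        sift_down(h, n, i);
    while (n > 0) {
        node_t top = h[0];
        out[o++] = top.key;                   /* emit the global minimum */
        if (top.pos + 1 < len[top.list])      /* refill from the same list */
            h[0] = (node_t){ lists[top.list][top.pos + 1], top.list, top.pos + 1 };
        else
            h[0] = h[--n];                    /* list exhausted: shrink heap */
        if (n > 0) sift_down(h, n, 0);
    }
}
```

The "refill from the same list" step is the software analogue of the acknowledged data transfer up the cell tree: a node only requests a new value from the child whose value it just forwarded.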
Page 14: Parallel Heapsort – BRAM Limitations
- 144 block RAMs at 512 x 32-bit values each hold relatively few 128-bit values
- OnBoard Memory SRC constraint: up to 64 reads and 8 writes in one MAP C file, with cascading clock delays as the number of reads increases
- Explore muxed access: search and update only 6 of the 48 leaf nodes at a time, in round-robin fashion
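A quick capacity check, assuming all 144 blocks in the 512 x 32-bit configuration were devoted purely to key storage, shows why the slide calls this "not a whole lot" of 128-bit values; `bram_key_capacity` is an illustrative name:

```c
#include <stdint.h>

/* Block RAM budget from the slide: 144 blocks, each organized as
   512 x 32-bit words (the 18 Kbit figure includes parity bits). */
enum {
    BRAM_BLOCKS     = 144,
    WORDS_PER_BLOCK = 512,
    WORD_BITS       = 32,
    KEY_BITS        = 128
};

/* Total 128-bit keys that fit if every block stored nothing but keys. */
uint64_t bram_key_capacity(void) {
    uint64_t total_bits = (uint64_t)BRAM_BLOCKS * WORDS_PER_BLOCK * WORD_BITS;
    return total_bits / KEY_BITS;   /* 144 * 512 * 32 / 128 = 18432 keys */
}
```

About 18K keys of on-chip storage, against 500,000-value test lists, is why the heapsort design has to fall back to OnBoard Memory and its muxed leaf access.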
Page 15: FPGA Initial Results
Baseline: one V26000; PAR options: -ol high -t 1
Bubblesort results, 100 cells:
- 29,354 slices (86%); 37,131 LUTs (54%)
- 13.608 ns = 73 MHz (verified operational at 100 MHz)
Heapsort results, 95 cells (48 leaves):
- 21,011 slices (62%); 24,467 LUTs (36%)
- 11.770 ns = 85 MHz (verified operational at 100 MHz)
Page 16: Testing Procedures
- All tests use one chip for baseline results
- Evaluate the fastest software radix of operation
- Hardware/software partitioning: five cases; case 5 uses FPGA reconfiguration
- Data-size partitioning: 100, 500, 1000, 5000, 10000
- 10 runs for each test case / data-partitioning combination
- List size: 500,000 values
Page 17: Results – Fastest Software Operations (Baseline)
- Comparison of radixsort and heapsort combinations; radix 4, 8, and 16 evaluated
- Minimum time: radix-8 radixsort + heapsort (size = 5000 or 10000)
- Radix-16 has too many buckets for the sort-size partitions evaluated
- Heapsort comparisons are faster than radixsort index updates
Page 18: Results – Hardware/Software Partitioning
- Fastest SW-only time = 3.41 s; fastest time including HW = 3.89 s
- Best HW case: bubblesort (HW) + heapsort (SW), partition list size of 1000
- Heapsort times are dominated by data access and significantly slower than software
Page 19: Results – Bubblesort vs. Radixsort
- Some cases where HW is faster than SW: list sizes < 5000, due to SRC pipelined data access
- Fastest SW case was for list size = 10000
- MAP data transfer time is less significant than data processing time; for size = 1000: input 11.3%, analyze 76.9%, output 11.5%
Page 20: Results – Limitations
- Heapsort is limited by the overhead of input servicing: random accesses to OBM are not ideal, and the loop search is sequentially dependent processing
- Bubblesort is limited by the number of cells: roughly 13 more cells could fit, and two-chip streaming could extend this further
- Reconfiguration time is assumed to be a one-time setup factor; the reconfiguration case is the exception, solvable by placing one core per V26000
Page 21: Conclusions
- Pipelined, systolic designs are needed to overcome the microprocessor's speed advantage
- Bubblesort works well on small data sets
- Heapsort's random data access cannot exploit the SRC's benefits
- The SRC's high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs
Page 22: Future Work
- Heapsort's random data access cannot exploit SRC benefits; look for possible speedups using BRAM: unroll leaf memory access, exploit the SRC "periodic macro" paradigm
- Currently evaluating radix sort in hardware, which works better than bubblesort for larger sort sizes
- Compare MAP C to VHDL where baseline VHDL is faster than SW