On Implementing Sorting Network Machines with FPGAs. Rui Marcelino (UALG/EST), Horácio Neto (IST/INESC-ID), João M. P. Cardoso (IST/INESC-ID). Jornadas REC 2007, IST, 8-9 February
Motivation: New kinds of devices, such as PDAs and mobile phones, bring new needs. Database access from embedded devices is a reality and tends to grow. Sorting is an integral component of most database systems. Query performance in these systems is often dominated by the cost of the sorting algorithm. Searching and sorting are becoming important features for embedded applications.
Idea: Develop a sorting machine, coupled to a microprocessor, to boost the overall performance of general embedded database applications. [Figure: Embedded Microprocessor connected to Sorting Machine]
Sorting Network: a network with inputs x0, x1, …, xn-1 and outputs y0, y1, …, yn-1 in which the outputs satisfy y0 ≤ y1 ≤ y2 ≤ … ≤ yn-1.
Sorting Network Algorithms: Odd-Even transposition; K. Batcher (1968): Bitonic sort and Odd-Even Merge sort. New sorting algorithms have been proposed without significantly improving on Batcher's results. [Figure: Sort Network blocks labeled n, 2n, n]
Graphical Representation (odd-even transposition). [Figure: inputs x0 … xn-1 on the left, outputs y0 … yn-1 on the right, stages 1, 2, 3, …, n] Comp_Swap: comparator-swap element. Stage: a group of disjoint Comp_Swap elements. Depth: the number of parallel steps. Length (or size): the total number of comparator-swap elements.
Odd-Even Merge. [Figure: 16-input odd-even merge network with inputs x0 … x15, intermediate signals labeled a, a', b and b', outputs y0 … y15, and stages numbered 1 to 10]
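The comparator list of an odd-even merge sorting network can be generated recursively. The following is a minimal Python sketch, not the authors' hardware description: it assumes n is a power of two and ascending output order, and the function names (`oddeven_merge`, `oddeven_merge_sort`, `build_network`, `apply_network`) are hypothetical helpers introduced for illustration.

```python
from itertools import product


def oddeven_merge(lo, hi, r):
    """Yield comparators that merge two sorted halves of wires lo..hi (inclusive)."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)        # merge even subsequence
        yield from oddeven_merge(lo + r, hi, step)    # merge odd subsequence
        # final clean-up comparators between neighbouring odd/even wires
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def oddeven_merge_sort(lo, hi):
    """Yield comparators that sort wires lo..hi (inclusive)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def build_network(n):
    """Comparator list for n wires; n is assumed to be a power of two."""
    return list(oddeven_merge_sort(0, n - 1))


def apply_network(network, values):
    """Run the comparators in order on a copy of values (ascending order)."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


def sorts_all_binary_inputs(network, n):
    """Zero-one principle: a network that sorts all 0/1 inputs sorts everything."""
    return all(apply_network(network, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))


if __name__ == "__main__":
    print(len(build_network(16)))                      # comparator count, 16 wires
    print(sorts_all_binary_inputs(build_network(8), 8))
```

The comparators are emitted in dependency order, so grouping them into parallel stages is a separate scheduling step that this sketch leaves out.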
Proposal: implementation of sorting networks on FPGA devices. Different implementations are possible. Pipelined: more hardware resources, high data throughput (Odd-Even Transposition, Bitonic Sort, Odd-Even Merge). Sequential: less hardware, with the network split into sequential stages.
Odd-Even (transposition). Comparators: n·(n-1)/2. Steps: n. Advantages: simplicity, locality, scalability. [Figure: inputs x0 … xn-1, outputs y0 … yn-1, stages 1, 2, 3, …, n]
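The comparator and step counts above can be checked with a short software model. This is an illustrative Python sketch under stated assumptions: the even-numbered stages compare pairs starting at wire 0, the odd-numbered ones start at wire 1, and outputs are in ascending order. The names `transposition_network` and `apply_network` are hypothetical.

```python
from itertools import product


def transposition_network(n):
    """Return n stages; each stage is a list of disjoint (i, i + 1) comparators."""
    stages = []
    for s in range(n):
        start = s % 2  # alternate between pairs (0,1),(2,3),... and (1,2),(3,4),...
        stages.append([(i, i + 1) for i in range(start, n - 1, 2)])
    return stages


def apply_network(stages, values):
    """Apply every stage in order to a copy of values (ascending order)."""
    v = list(values)
    for stage in stages:
        for i, j in stage:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v


if __name__ == "__main__":
    n = 6
    net = transposition_network(n)
    depth = len(net)                           # number of steps
    size = sum(len(stage) for stage in net)    # number of comparators
    print(depth == n, size == n * (n - 1) // 2)
    # Zero-one principle: checking all 0/1 inputs is enough to verify sorting.
    print(all(apply_network(net, bits) == sorted(bits)
              for bits in product((0, 1), repeat=n)))
```

Every comparator connects neighbouring wires only, which is the locality property listed above.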
Sequential - II. [Figure: the inputs x feed a row of n/2 Comp-Swap units followed by a row of n/2 - 1 Comp-Swap units, both clocked (clk), producing the outputs y; wires indexed up to n-1]
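Comparator Swap. [Figure: block diagram with inputs A and B, a comparator (COMP) testing A > B and driving the CHANGE (CHG) signal, two multiplexers (MUX), registers (REG) driven by CLOCK, and outputs L and H]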
Sequential - II (animation). [Animation frames: cycles 1 to 4, each with Change=1]
Sequential - II (animation). [Animation frames: cycles 5 to 8; Change=1 in cycles 5 and 6, Change=0 in cycles 7 and 8]
Sequential - I. [Figure: blocks labeled Switch network, n/2 Comp/Swap, Switch network and n Regs (clocked by clk), connecting the inputs x to the outputs y; wires indexed up to n-1]
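Comparator Swap. [Figure: block diagram with inputs A and B, a comparator (COMP) testing A > B and driving the CHANGE (CHG) signal, and two multiplexers (MUX) producing outputs L and H; buses labeled n]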
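The behaviour of such a comparator-swap cell can be stated in a few lines of software. This is a behavioural Python sketch only, not a hardware description; it assumes L carries the smaller value, H the larger one, and that the change signal is asserted exactly when A > B. The name `comp_swap` is hypothetical.

```python
def comp_swap(a, b):
    """Model one comparator-swap cell: return (low, high, change)."""
    change = a > b                   # comparator output, also used as mux select
    low = b if change else a         # multiplexer driving output L
    high = a if change else b        # multiplexer driving output H
    return low, high, change


if __name__ == "__main__":
    print(comp_swap(5, 2))
    print(comp_swap(2, 5))
```

Equal inputs leave the change signal at zero, so a fully sorted sequence produces no change in any cell.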
Switch Network
Sequential - I (animation). [Animation frames: cycles 1 to 6, each with Change=1] Minimum latency: 2 cycles. Maximum latency: N cycles.
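The latency bounds can be reproduced with a cycle-level software model. This Python sketch rests on assumptions not stated on the slides: one comparator stage is executed per clock cycle, alternating between even and odd pairs, and the machine stops after two consecutive cycles with no change or after N cycles, whichever comes first. The name `sequential_sort` is hypothetical.

```python
def sequential_sort(values):
    """Return (sorted_values, cycles) for a one-stage-per-cycle model."""
    regs = list(values)              # the n data registers
    n = len(regs)
    cycles = 0
    idle = 0                         # consecutive cycles with Change=0
    while cycles < n and idle < 2:
        start = cycles % 2           # assumed: even pairs first, then odd pairs
        change = False
        for i in range(start, n - 1, 2):
            if regs[i] > regs[i + 1]:
                regs[i], regs[i + 1] = regs[i + 1], regs[i]
                change = True
        cycles += 1
        idle = 0 if change else idle + 1
    return regs, cycles


if __name__ == "__main__":
    print(sequential_sort([1, 2, 3, 4, 5]))   # already sorted input
    print(sequential_sort([5, 4, 3, 2, 1]))   # reversed input
```

Under these assumptions an already sorted input stops after two idle cycles, while a reversed input needs all N cycles.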
Latency
Experimental Results and Conclusions (results obtained with Xilinx ISE 8.2i)
Thanks!