Sorting Networks

Sorting networks are an interesting and still unsolved mathematical puzzle. They are quite different from the usual sorting algorithms one encounters in computer science, like bubble sort, quick sort, merge sort, heap sort and so on. Sequential algorithms are generic, in the sense that they work on an input set of items of any size, but their execution time grows with N, either as O(N·log₂N) for the fast algorithms or O(N²) for the more trivial ones, and the execution time is data dependent. Sorting networks, on the other hand, are parallel algorithms where multiple operations happen concurrently, and they are a very good match for hardware implementations. If properly pipelined they can complete an entire sorting operation every clock cycle, so in that sense we might say their execution time is 1. But they are not universal algorithms: a different sorting network must be used for every input size N.

More detailed information on sorting networks can be found at the Wikipedia page https://en.wikipedia.org/wiki/Sorting_network . The illustrations I used in this post are also taken from that source.

Sorting networks are defined in terms of a Compare and Exchange basic building block:

This block has two inputs and two outputs: it compares the two inputs and swaps them if they are out of order, all in a single operation, then passes the results to the outputs. Data moves from left to right and each pair of dots connected with a vertical bar represents a compare-exchange block. We can say that a compare-exchange block is a sorting network of size 2. In hardware terms it consists of a comparator and two 2:1 muxes, with the select inputs of the two muxes driven by the output of the comparator. Larger sorting networks can be built out of multiple compare-exchange blocks connected properly. Here is an example of a sorting network for N=4, with size 5 and depth 3:

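To make the idea concrete before we get to hardware, here is a small Python sketch of a compare-exchange block and the N=4 network just described. The wire pairs are read off the diagram; the function names are my own, chosen just for illustration.

```python
# Software sketch of a compare-exchange block and the size-5, depth-3
# sorting network for N=4.

def compare_exchange(data, i, j):
    """Swap data[i] and data[j] if they are out of order."""
    if data[i] > data[j]:
        data[i], data[j] = data[j], data[i]

# The N=4 network as a list of (i, j) wire pairs. Pairs in the same
# stage touch disjoint wires, so in hardware they run concurrently.
NETWORK_4 = [(0, 1), (2, 3),   # stage 1
             (0, 2), (1, 3),   # stage 2
             (1, 2)]           # stage 3

def sort4(values):
    data = list(values)
    for i, j in NETWORK_4:
        compare_exchange(data, i, j)
    return data

print(sort4([3, 1, 4, 2]))  # [1, 2, 3, 4]
```

Note that the sequence of compare-exchange operations is fixed in advance and does not depend on the data, which is exactly what makes a fully pipelined hardware implementation possible.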
While sorting networks for an arbitrary size N can be easily described, for example like this pyramidal sorting network that corresponds to the sequential bubble sort and insertion sort algorithms, like their sequential counterparts they are not efficient, having a size of N·(N−1)/2, or O(N²), and a latency of 2·N−3, or O(N):

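Both formulas are easy to check in software. The sketch below generates the pyramidal network for any N and computes its depth with a greedy scheduler that starts each comparator as soon as both of its wires are free; the function names are mine, for illustration only.

```python
# Generate the pyramidal (bubble-sort-like) network and check the
# size N*(N-1)/2 and depth 2*N-3 claims.

def bubble_network(n):
    """Comparator list of the pyramidal network: shrinking bubble passes."""
    return [(i, i + 1) for k in range(n - 1, 0, -1) for i in range(k)]

def depth(pairs, n):
    """Greedy depth: each comparator runs one stage after both its wires are free."""
    stage = [0] * n                     # last stage that touched each wire
    for i, j in pairs:
        s = max(stage[i], stage[j]) + 1
        stage[i] = stage[j] = s
    return max(stage)

net = bubble_network(6)
print(len(net), depth(net, 6))          # 15 blocks, depth 9 = 2*6-3
```

The size is just the comparator count, while the depth only emerges after rescheduling: consecutive bubble passes overlap, so the 15 comparators for N=6 pack into 9 stages rather than 15.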
An optimal sorting network can be defined in terms of minimal depth, which translates to latency in a hardware implementation, or minimal number of compare-exchange blocks, which is proportional to design area. In general we will consider size as the determining factor for an optimal sorting network, and if there is a tie on size then the one with the smallest depth is the best. The pyramidal sorting network above is far from optimal and becomes quite inefficient for large N. For N=6 its size is 15 and its depth is 9, while the best sorting network for 6 elements has a size of 12 and a depth of 5.

The challenge of designing an optimal sorting network for a given input size N is a very difficult one. No general method for doing that is known, and it is quite possibly an NP-hard problem, in which case no efficient general algorithm is likely to exist; in practice the complexity of the design problem grows at least exponentially with the size N. We only know the optimal sorting networks for sizes up to 10. For sizes between 11 and 17 we know the minimum depth and only bounds for the size. For example, we know that for N=16 a solution with 60 compare-exchange blocks exists and that no solution with fewer than 53 blocks is possible, but nobody knows what the actual best solution is:

N                  |  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
Depth              |  0   1   3   3   5   5   6   6   7   7   8   8   9   9   9   9  10
Size (upper bound) |  0   1   3   5   9  12  16  19  25  29  35  39  45  51  56  60  71
Size (lower bound) |  0   1   3   5   9  12  16  19  25  29  33  37  41  45  49  53  58

Faced with the task of implementing a general sorting network in hardware we will have to make some compromises. First of all, it will have to be a sorting network of N inputs, with N an arbitrary but fixed number. While we can sort fewer than N inputs by padding the input data with the largest possible input value and then dropping the same number of outputs, the design will not be able to sort more than N inputs; that will have to be a different design. The second compromise is the selection of a particular sorting algorithm, Batcher's odd-even mergesort. It is not as efficient as an optimal sorting network, its size being O(N·(log₂N)²), but at least for the relatively small N values that make sense for a hardware implementation it is quite close to the optimal solution. It also has the advantage that it can generate a sorting network of any size recursively, so we can describe a sorting network of any size with a single piece of reusable code, rather than providing a different sorting network for every possible value of N, which we do not have anyway for N larger than 17.
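As a small preview, here is a Python sketch of that recursive generation, following the standard formulation of Batcher's odd-even mergesort and assuming N is a power of two; the function names are mine, and the hardware version is a topic for later.

```python
# Sketch of Batcher's odd-even mergesort network generation for N a
# power of two, as a recursive generator of (i, j) comparator pairs.

def oddeven_merge(lo, hi, r):
    """Merge two sorted halves of wires lo..hi, comparing at stride r."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)        # merge even subsequence
        yield from oddeven_merge(lo + r, hi, step)    # merge odd subsequence
        for i in range(lo + r, hi - r, step):         # final clean-up stage
            yield (i, i + r)
    else:
        yield (lo, lo + r)

def oddeven_merge_sort(lo, hi):
    """Comparator pairs sorting wires lo..hi inclusive."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)        # sort lower half
        yield from oddeven_merge_sort(mid + 1, hi)    # sort upper half
        yield from oddeven_merge(lo, hi, 1)           # merge the two halves

net = list(oddeven_merge_sort(0, 15))   # network for N = 16
print(len(net))                         # 63 compare-exchange blocks
```

For N=16 the recursion produces 63 compare-exchange blocks, only a few more than the 60 of the best known hand-crafted solution, and for N=4 it reproduces exactly the size-5 network shown earlier.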

Next week we will look in more detail at this sorting algorithm and then we will try to implement it in FPGA hardware.

Back to the top: The Art of FPGA Design