Parallel Sorting in Theory and in Practice II
As promised, this is the second entry on parallel sorting. In this entry, we’ll implement merge_sort, and then give two different ways to make it run in parallel. The first one will be a bit simpler than the second one.
Without further ado, here’s our implementation of merge_sort in C++:
merge_sort()
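A minimal sketch, assuming a pointer-plus-length signature, int elements, and a std::vector for the scratch copy (the exact conventions are assumptions on our part):

```cpp
#include <vector>

// Defined below: merge two sorted runs back into array.
void serial_merge(int* array, int* l_array_copy, int l_length,
                  int* r_array_copy, int r_length);

void merge_sort(int* array, int length) {
    if (length < 2) return;  // a 0- or 1-element array is already sorted
    // Copy the input, then recursively sort each half of the copy.
    std::vector<int> array_copy(array, array + length);
    int h = length / 2;  // split point: the left half is indices 0 to h-1
    merge_sort(array_copy.data(), h);
    merge_sort(array_copy.data() + h, length - h);
    // Merge the two sorted halves of the copy back into array.
    serial_merge(array, array_copy.data(), h,
                 array_copy.data() + h, length - h);
}
```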
This is a great introduction to recursion. It starts by copying the original array to array_copy, then recursively sends the left half of array_copy to merge_sort, followed by the right half. So when those two calls finish, the left half of array_copy (everything from index 0 to h-1) is sorted among itself, and likewise, everything from index h to length-1 is sorted among itself. For example, if our input was array = {2,0,1,4,3,4,3,2,1,0}, then after the two recursive calls to merge_sort we would have array_copy = {0,1,2,3,4,0,1,2,3,4}, because the two halves have each been sorted.
The function is done once serial_merge returns. Here’s our implementation of that function:
serial_merge()
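A minimal sketch, using the same assumed signature declared above; its three while loops correspond exactly to the description that follows:

```cpp
// Merge the sorted runs l_array_copy[0..l_length-1] and
// r_array_copy[0..r_length-1] into array; ties go to the left run.
void serial_merge(int* array, int* l_array_copy, int l_length,
                  int* r_array_copy, int r_length) {
    int i = 0, j = 0, k = 0;
    // Compare the heads of the two runs; copy the smaller (left wins ties).
    while (i < l_length && j < r_length) {
        if (l_array_copy[i] <= r_array_copy[j]) array[k++] = l_array_copy[i++];
        else                                    array[k++] = r_array_copy[j++];
    }
    // One run is exhausted; copy whatever remains of the other.
    while (i < l_length) array[k++] = l_array_copy[i++];
    while (j < r_length) array[k++] = r_array_copy[j++];
}
```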
You’ll notice three while loops. In the first one we step through l_array_copy with index i and r_array_copy with index j; the loop exits when either i or j reaches the end of its respective array. At each step we compare what’s at the \(i^{th}\) position of l_array_copy with what’s at the \(j^{th}\) position of r_array_copy, and whichever is smaller (with ties going to what’s in l_array_copy) gets copied over to array. That’s the heart of the entire algorithm. The second while loop just takes care of copying whatever is left over in l_array_copy if j reached the end of r_array_copy before i reached the end of l_array_copy. Similarly, the final while loop takes care of copying over the remainder of r_array_copy if i reached the end of l_array_copy first.
Notice that at every step of every while loop, either i or j (and always k) is incremented. Thus the three while loops together take at most l_length + r_length steps to complete. We see from merge_sort that l_length = h and r_length = length - h, so their total is exactly equal to length.
Therefore, if \(T(n)\) represents the number of steps it takes merge_sort to finish on an input array of length \(n\), then we see that
\[ T(n) = cn + 2T(\left\lceil \frac{n}{2} \right\rceil) \tag{1}\label{1} \]
By our “Almost the Master” Theorem (from More on Sorting and Analyzing Run Time Complexity), we see that \(T(n) \in O(n\log_2(n)) \).
merge_sort is an excellent algorithm. You should use it whenever you have the available memory. That’s the main drawback with using it: it requires us to make a copy of our original input. If your memory budget is tight, this might be a deal breaker. It has been for me at times.
Now we’re going to dive into the much deeper waters of trying to make this thing run efficiently in parallel – something which at first probably seems impossible. It certainly seemed that way to us. Before we get going, though, we’ll need some useful functions. The first two are least_upper_bound and min_greater_than. Here’s our implementation of them in C++:
least_upper_bound() and min_greater_than()
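A minimal sketch of the pair, assuming the same int-and-pointer conventions as above; each function halves its search range on every step, which is where the recurrence below comes from:

```cpp
// Index of the first element of the sorted array with value <= array[idx]
// (i.e., the first element >= value); returns length if there is none.
int least_upper_bound(int value, const int* array, int length) {
    int low = 0, high = length;  // the answer always lies in [low, high]
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (value <= array[mid]) high = mid;  // array[mid] qualifies: go left
        else low = mid + 1;                   // array[mid] too small: go right
    }
    return low;
}

// Identical except for the strict inequality: index of the first element
// with value < array[idx]; returns length if there is none.
int min_greater_than(int value, const int* array, int length) {
    int low = 0, high = length;
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (value < array[mid]) high = mid;
        else low = mid + 1;
    }
    return low;
}
```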
The comments in the code explain exactly what these two functions do. They’re very similar; the only difference is that the first involves \( \leq\), while the second involves \( < \). If \(T(n) \) represents the number of steps it takes each one to finish with an input array of length \(n\), then it’s easily seen that \(T(n) = cn^{0} + T(\left\lceil \frac{n}{2}\right\rceil) \), which means that \(T(n) \in O( \log_2(n)) \) (again, using the “Almost the Master” Theorem).
In order to allow us to merge two arrays in parallel, we’re going to handle the indices i, j, and k from serial_merge differently. We’re going to step through the i index and the j index in completely separate for loops, using the value of one to compute the value of the other. That is, we will have a for (i = 0; i < h; i++) loop in which we calculate j and k as functions of i, and then a completely separate for (j = h; j < length; j++) loop in which we calculate i and k as functions of j. Calculating i as a function of j will take \(O(\log_2(\text{h})) \) steps, since it’s a binary search over the left half, and calculating j as a function of i will take \(O(\log_2(\text{length} - \text{h}))\) steps.
Here is our implementation.
parallel_merge()
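A minimal sketch of the two independent loops just described; the OpenMP pragmas are an assumption about how the loops get parallelized:

```cpp
// Merge the sorted halves array_copy[0..h-1] and array_copy[h..length-1]
// into array. Each element's final position is computed independently of
// all the others, so every iteration of both loops can run in parallel.
// Compile with -fopenmp (or your compiler's equivalent).
void parallel_merge(int* array, const int* array_copy, int h, int length) {
    #pragma omp parallel for
    for (int i = 0; i < h; i++) {
        // j = how many right-half elements are strictly less than array_copy[i]
        int j = least_upper_bound(array_copy[i], array_copy + h, length - h);
        array[i + j] = array_copy[i];  // k = i + j
    }
    #pragma omp parallel for
    for (int j = h; j < length; j++) {
        // i = how many left-half elements are <= array_copy[j] (ties go left)
        int i = min_greater_than(array_copy[j], array_copy, h);
        array[(j - h) + i] = array_copy[j];  // k = (j - h) + i
    }
}
```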
Since \(\text{length} - \text{h} \) and \(\text{h}\) are both \( \leq \left\lceil \frac{\text{length}}{2}\right\rceil\), each of the \(\text{length}\) loop iterations costs one binary search, i.e. \(O(\log_2(\text{length}))\) steps, so \(T(n)\), the number of steps it takes parallel_merge to finish on an input of size \(n = \text{length}\), is in \(O(n \log_2(n))\). We have to use the two different functions, least_upper_bound and min_greater_than, because we have to let ties in our merging function consistently go to one half of array_copy. We have arbitrarily chosen the left half of the array to win on ties. Therefore, for a given i, the j index should point to the earliest element of the right half of array_copy which is greater than or equal to array_copy[i], whereas, when we’re calculating i for a given j, we have to point to the earliest element of the left half which is strictly greater than array_copy[j].
Now we’re ready to implement parallel_merge_sort. Here’s our implementation:
parallel_merge_sort()
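A minimal sketch, assuming std::thread with a depth cutoff so that up to \(2^{\text{depth}}\) sorting threads get spawned; the original may manage its threads differently:

```cpp
#include <thread>
#include <vector>

// Relies on parallel_merge from above.
void parallel_merge_sort(int* array, int length, int depth) {
    if (length < 2) return;
    std::vector<int> array_copy(array, array + length);  // scratch copy
    int h = length / 2;
    if (depth > 0) {
        // Sort the two halves of the copy concurrently.
        std::thread left(parallel_merge_sort, array_copy.data(), h, depth - 1);
        parallel_merge_sort(array_copy.data() + h, length - h, depth - 1);
        left.join();
    } else {
        // Past the cutoff, keep recursing on the current thread.
        parallel_merge_sort(array_copy.data(), h, 0);
        parallel_merge_sort(array_copy.data() + h, length - h, 0);
    }
    parallel_merge(array, array_copy.data(), h, length);
}
```

For example, calling parallel_merge_sort(array, length, 4) would allow up to 16 sorting threads, one per hardware thread on the machine used for the timings below.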
Because the parallel_merge function we’ve provided above takes \(O(n\log_2(n))\) steps to complete, \(T(n)\) for the entire function satisfies \(T(n) = cn\log_2(n) + 2 T( \left\lceil \frac{n}{2}\right\rceil) \), which means that by case #4 of the “Almost the Master” Theorem – provided at the very end of Parallel Sorting in Theory and in Practice (I) – \(T(n) \in O(n \log_2(n)^{2}) \). This means that asymptotically this will perform worse than even parallel_quick_sort_plus, though it might require a very large input before it’s actually slower. For example, on an input array of 200 million elements, it runs in 6.5 seconds with 8 cores and 16 threads. This is the same Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz processor, and in fact the same test input, as before, so it’s fair to compare the times. That’s 13.5 times faster than parallel_quick_sort_plus!
Unfortunately, we won’t get to the even faster version of parallel_merge_sort in this entry. It will have to wait for the next one.
That’s all for this entry!