Parallel Sorting in Theory and in Practice II
As promised, this is the second entry on parallel sorting. In this entry, we’ll implement merge_sort, and then give two different ways to make it run in parallel. The first one will be a bit simpler than the second one.
Without further ado, here’s our implementation of merge_sort in C++:
merge_sort()
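A minimal sketch, assuming a pointer-plus-length signature, int elements, and a std::vector for the scratch copy (the exact conventions are assumptions on our part):

```cpp
#include <vector>

// Defined below: merge two sorted runs back into array.
void serial_merge(int* array, int* l_array_copy, int l_length,
                  int* r_array_copy, int r_length);

void merge_sort(int* array, int length) {
    if (length < 2) return;  // a 0- or 1-element array is already sorted
    // Copy the input, then recursively sort each half of the copy.
    std::vector<int> array_copy(array, array + length);
    int h = length / 2;  // split point: the left half is indices 0 to h-1
    merge_sort(array_copy.data(), h);
    merge_sort(array_copy.data() + h, length - h);
    // Merge the two sorted halves of the copy back into array.
    serial_merge(array, array_copy.data(), h,
                 array_copy.data() + h, length - h);
}
```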
This is a great introduction to recursion. It starts by copying the original array to array_copy, then recursively sends the left half of array_copy to merge_sort, followed by the right half. So when those two calls finish, the left half of array_copy (everything from index 0 to h-1) is sorted among itself, and likewise, everything from index h to length-1 is sorted among itself. For example, if our input was array = {2,0,1,4,3,4,3,2,1,0}, then after the two recursive calls to merge_sort we would have array_copy = {0,1,2,3,4,0,1,2,3,4}, because the two halves have each been sorted.
The function is done once serial_merge returns. Here’s our implementation of that function:
serial_merge()
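A minimal sketch, using the same assumed signature declared above; its three while loops correspond exactly to the description that follows:

```cpp
// Merge the sorted runs l_array_copy[0..l_length-1] and
// r_array_copy[0..r_length-1] into array; ties go to the left run.
void serial_merge(int* array, int* l_array_copy, int l_length,
                  int* r_array_copy, int r_length) {
    int i = 0, j = 0, k = 0;
    // Compare the heads of the two runs; copy the smaller (left wins ties).
    while (i < l_length && j < r_length) {
        if (l_array_copy[i] <= r_array_copy[j]) array[k++] = l_array_copy[i++];
        else                                    array[k++] = r_array_copy[j++];
    }
    // One run is exhausted; copy whatever remains of the other.
    while (i < l_length) array[k++] = l_array_copy[i++];
    while (j < r_length) array[k++] = r_array_copy[j++];
}
```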
You’ll notice three while loops. In the first one we step through l_array_copy with index i and r_array_copy with index j; the loop exits when either i or j reaches the end of its respective array. At each step we compare what’s at the \(i^{th}\) position of l_array_copy with what’s at the \(j^{th}\) position of r_array_copy, and whichever is smaller (with ties going to what’s in l_array_copy) gets copied over to array. That’s the heart of the entire algorithm. The second while loop just takes care of copying whatever is left over in l_array_copy if j reached the end of r_array_copy before i reached the end of l_array_copy. Similarly, the final while loop takes care of copying over the remainder of r_array_copy if i reached the end of l_array_copy first.
Notice that at every step of every while loop, either i or j (and always k) is incremented. Thus the three while loops together take at most l_length + r_length steps to complete. We see from merge_sort that l_length = h and r_length = length - h, so their total is exactly equal to length.
Therefore, if \(T(n)\) represents the number of steps it takes merge_sort to finish on an input array of length \(n\), then we see that
\[ T(n) = cn + 2T(\left\lceil \frac{n}{2} \right\rceil) \tag{1}\label{1} \]
By our “Almost the Master” Theorem (from More on Sorting and Analyzing Run Time Complexity), we see that \(T(n) \in O(n\log_2(n)) \).
merge_sort is an excellent algorithm. You should use it whenever you have the available memory. That’s the main drawback with using it: it requires us to make a copy of our original input. If your memory budget is tight, this might be a deal breaker. It has been for me at times.
Now we’re going to dive into the much deeper waters of trying to make this thing run efficiently in parallel – something which at first probably seems impossible. It certainly seemed that way to us. Before we get going, though, we’ll need some useful functions. The first two are least_upper_bound and min_greater_than. Here’s our implementation of them in C++:
least_upper_bound() and min_greater_than()
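A minimal sketch of the pair, assuming the same int-and-pointer conventions as above; each function halves its search range on every step, which is where the recurrence below comes from:

```cpp
// Index of the first element of the sorted array with value <= array[idx]
// (i.e., the first element >= value); returns length if there is none.
int least_upper_bound(int value, const int* array, int length) {
    int low = 0, high = length;  // the answer always lies in [low, high]
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (value <= array[mid]) high = mid;  // array[mid] qualifies: go left
        else low = mid + 1;                   // array[mid] too small: go right
    }
    return low;
}

// Identical except for the strict inequality: index of the first element
// with value < array[idx]; returns length if there is none.
int min_greater_than(int value, const int* array, int length) {
    int low = 0, high = length;
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (value < array[mid]) high = mid;
        else low = mid + 1;
    }
    return low;
}
```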
The comments in the code explain exactly what these two functions do. They’re very similar; the only difference is that the first involves \( \leq\), while the second involves \( < \). If \(T(n) \) represents the number of steps it takes each one to finish with an input array of length \(n\), then it’s easily seen that \(T(n) = cn^{0} + T(\left\lceil \frac{n}{2}\right\rceil) \), which means that \(T(n) \in O( \log_2(n)) \) (again, using the “Almost the Master” Theorem).
In order to allow us to merge two arrays in parallel, we’re going to handle the indices i, j, and k from serial_merge differently. We’re going to step through the i index and the j index in completely separate for loops, using the value of one to compute the value of the other. That is, we will have a for (i = 0; i < h; i++) loop in which we calculate j and k as functions of i, and then a completely separate for (j = h; j < length; j++) loop in which we calculate i and k as functions of j. Calculating i as a function of j will take \(O(\log_2(\text{h})) \) steps, since it’s a binary search over the left half, and calculating j as a function of i will take \(O(\log_2(\text{length} - \text{h}))\) steps.
Here is our implementation.
parallel_merge()
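A minimal sketch of the two independent loops just described; the OpenMP pragmas are an assumption about how the loops get parallelized:

```cpp
// Merge the sorted halves array_copy[0..h-1] and array_copy[h..length-1]
// into array. Each element's final position is computed independently of
// all the others, so every iteration of both loops can run in parallel.
// Compile with -fopenmp (or your compiler's equivalent).
void parallel_merge(int* array, const int* array_copy, int h, int length) {
    #pragma omp parallel for
    for (int i = 0; i < h; i++) {
        // j = how many right-half elements are strictly less than array_copy[i]
        int j = least_upper_bound(array_copy[i], array_copy + h, length - h);
        array[i + j] = array_copy[i];  // k = i + j
    }
    #pragma omp parallel for
    for (int j = h; j < length; j++) {
        // i = how many left-half elements are <= array_copy[j] (ties go left)
        int i = min_greater_than(array_copy[j], array_copy, h);
        array[(j - h) + i] = array_copy[j];  // k = (j - h) + i
    }
}
```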
Since \(\text{length} - \text{h} \) and \(\text{h}\) are both \( \leq \left\lceil \frac{\text{length}}{2}\right\rceil\), each of the \(\text{length}\) loop iterations costs one binary search, i.e. \(O(\log_2(\text{length}))\) steps, so \(T(n)\), the number of steps it takes parallel_merge to finish on an input of size \(n = \text{length}\), is in \(O(n \log_2(n))\). We have to use the two different functions, least_upper_bound and min_greater_than, because we have to let ties in our merging function consistently go to one half of array_copy. We have arbitrarily chosen the left half of the array to win on ties. Therefore, for a given i, the j index should point to the earliest element of the right half of array_copy which is greater than or equal to array_copy[i], whereas, when we’re calculating i for a given j, we have to point to the earliest element of the left half which is strictly greater than array_copy[j].
Now we’re ready to implement parallel_merge_sort. Here’s our implementation:
parallel_merge_sort()
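A minimal sketch, assuming std::thread with a depth cutoff so that up to \(2^{\text{depth}}\) sorting threads get spawned; the original may manage its threads differently:

```cpp
#include <thread>
#include <vector>

// Relies on parallel_merge from above.
void parallel_merge_sort(int* array, int length, int depth) {
    if (length < 2) return;
    std::vector<int> array_copy(array, array + length);  // scratch copy
    int h = length / 2;
    if (depth > 0) {
        // Sort the two halves of the copy concurrently.
        std::thread left(parallel_merge_sort, array_copy.data(), h, depth - 1);
        parallel_merge_sort(array_copy.data() + h, length - h, depth - 1);
        left.join();
    } else {
        // Past the cutoff, keep recursing on the current thread.
        parallel_merge_sort(array_copy.data(), h, 0);
        parallel_merge_sort(array_copy.data() + h, length - h, 0);
    }
    parallel_merge(array, array_copy.data(), h, length);
}
```

For example, calling parallel_merge_sort(array, length, 4) would allow up to 16 sorting threads, one per hardware thread on the machine used for the timings below.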
Because the parallel_merge function we’ve provided above takes \(O(n\log_2(n))\) steps to complete, \(T(n)\) for the entire function satisfies \(T(n) = cn\log_2(n) + 2 T( \left\lceil \frac{n}{2}\right\rceil) \), which means that by case #4 of the “Almost the Master” Theorem – provided at the very end of Parallel Sorting in Theory and in Practice (I) – \(T(n) \in O(n \log_2(n)^{2}) \). This means that asymptotically this will perform worse than even parallel_quick_sort_plus, though it might require a very large input before it’s actually slower. For example, on an input array of 200 million elements, it runs in 6.5 seconds with 8 cores and 16 threads. This is the same Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz processor, and in fact the same test input, as before, so it’s fair to compare the times. That’s 13.5 times faster than parallel_quick_sort_plus!
Unfortunately, we won’t get to the even faster version of parallel_merge_sort in this entry. It will have to wait for the next one.
That’s all for this entry!