Convert a C++ program
Note: This page will be updated soon to include cilk_scope
, introduced with OpenCilk 2.0.
Overview
Here is the sequence of steps to create a parallel program using OpenCilk.
-
Typically, you will start with a serial C or C++ program that implements the basic functions or algorithms that you want to parallelize. You will likely be most successful if the serial program is correct to begin with! Any bugs in the serial program will occur in the parallel program, but they will be more difficult to identify and fix.
-
Next, identify the program regions that will benefit from parallel operation. Operations that are relatively long-running and which can be performed independently are prime candidates.
-
Use the three OpenCilk keywords to identify tasks that can execute in parallel: *
cilk_spawn
indicates a call to a function (a "child") that can proceed in parallel with the caller (the "parent").*cilk_sync
indicates that all spawned children must complete before proceeding. *cilk_for
identifies a loop for which all iterations can execute in parallel. -
Build the program:
- Linux OS:* Use the
clang
orclang++
compiler command.
- Linux OS:* Use the
-
Run the program. If there are no race conditions, the parallel program will produce the same result as the serial program.
-
Even if the parallel and serial program results are the same, there may still be race conditions. Run the program under the cilksan race detector to identify possible race conditions introduced by parallel operations.
-
Correct any race conditions with reducers, locks, or recode to resolve conflicts.
-
Note that a traditional debugger can debug the serialization of a parallel program, which you can create easily with OpenCilk.
We will walk through this process in detail using a sort program as an example.
Start with a serial program
We'll demonstrate how to use write an OpenCilk program by parallelizing a simple implementation of Quicksort (http://en.wikipedia.org/wiki/Quicksort).
Note that the function name sample_qsort
avoids confusion with the
Standard C Library qsort
function.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <functional>
// Sort the range between begin and end.
// "end" is one past the final element in the range.
// This is pure C++ code before Cilk++ conversion.
void sample_qsort(int * begin, int * end)
{
if (begin != end) {
--end; // Exclude last element (pivot)
int * middle = std::partition(begin, end,
std::bind2nd(std::less<int(),*end));
std::swap(*end, *middle); // pivot to middle
sample_qsort(begin, middle);
sample_qsort(++middle, ++end); // Exclude pivot
}
}
// A simple test harness
int qmain(int n)
{
int *a = new int[n];
for (int i = 0; i < n; ++i)
a[i] = i;
std::random_shuffle(a, a + n);
std::cout << "Sorting " << n << " integers"
<< std::endl;
sample_qsort(a, a + n);
// Confirm that a is sorted and that each element
// contains the index.
for (int i = 0; i < n-1; ++i) {
if ( a[i] = a[i+1] || a[i] != i ) {
std::cout << "Sort failed at location i=" << i << " a[i] = "
<< a[i] << " a[i+1] = " << a[i+1] << std::endl;
delete[] a;
return 1;
}
}
std::cout << "Sort succeeded." << std::endl;
delete[] a;
return 0;
}
int main(int argc, char* argv[])
{
int n = 10*1000*1000;
if (argc 1)
n = std::atoi(argv[1]);
return qmain(n);
}
Convert to an OpenCilk program
Converting the C++ code to OpenCilk C++ code is very simple.
- Add a "
#include <cilk.h>
" statement to the source.cilk.h
declares all the entry points to the OpenCilk runtime.
The result is an OpenCilk program that has no parallelism yet.
Compile the program to ensure that the OpenCilk SDK development environment is setup correctly.
Typically, OpenCilk programs are built with optimized code for best performance.
Linux* OS
> clang++ qsort.cpp -o qsort –O3 -fopencilk
Add parallelism using cilk_spawn
We are now ready to introduce parallelism into our qsort
program.
The cilk_spawn
keyword indicates that a function (the child) may be
executed in parallel with the code that follows the cilk_spawn
statement (the parent). Note that the keyword allows but does not
require parallel operation. The OpenCilk scheduler will dynamically
determine what actually gets executed in parallel when multiple
processors are available. The cilk_sync
statement indicates that the
function may not continue until all cilk_spawn
requests in the same
function have completed. cilk_sync
does not affect parallel strands
spawned in other functions.
void sample_qsort(int * begin, int * end)
{
if (begin != end) {
--end; // Exclude last element (pivot)
int * middle = std::partition(begin, end,
std::bind2nd(std::less<int>(),*end));
std::swap(*end, *middle); // pivot to middle
cilk_spawn sample_qsort(begin, middle);
sample_qsort(++middle, ++end); // Exclude pivot
cilk_sync;
}
}
In line 8, we spawn a recursive invocation of sample_qsort
that can
execute asynchronously. Thus, when we call sample_qsort
again in line 9, the call at line 8 might not have completed. The cilk_sync
statement at line 10 indicates that this function will not continue
until all cilk_spawn
requests in the same function have completed.
There is an implicit cilk_sync
at the end of every function that waits
until all tasks spawned in the function have returned, so the cilk_sync
here is redundant, but written explicitly for clarity.
The above change implements a typical divide-and-conquer strategy for parallelizing recursive algorithms. At each level of recursion, we have two-way parallelism; the parent strand (line 9) continues executing the current function, while a child strand executes the other recursive call. This recursion can expose quite a lot of parallelism.
Build, execute, and test
With these changes, you can now build and execute the OpenCilk version of the qsort program. Build and run the program exactly as we did with the previous example:
Linux* OS:
> clang++ qsort.cpp -o qsort –O3 -fopencilk
Run qsort from the command line
> qsort
Sorting 10000000 integers
5.641 seconds
Sort succeeded.
By default, an OpenCilk program will query the operating system and use all available cores. You can control the number of workers by setting the CILK_NWORKERS environment variable:
CILK_NWORKERS=8 ./qsort
Observe speedup on a multicore system
Run qsort using one and then two cores:
> CILK_NWORKERS=1 qsort
Sorting 10000000 integers
2.909 seconds Sort succeeded.
> CILK_NWORKERS=2 qsort
Sorting 10000000 integers
1.468 seconds Sort succeeded.
Alternately, run cilkscale to get a more detailed performance graph.