MulticoreBSP for C: a quick-start guide

For the purpose of this guide, we assume you have downloaded both the MulticoreBSP for C library (MulticoreBSP-for-C.tar.xz) and the BSPedupack example software (BSPedupack1.02.tar).

These correspond to versions 2.0.4 and 1.02 of MulticoreBSP for C and BSPedupack, respectively. These might not be the latest versions at the time you read this; check the respective websites for the latest software updates (MulticoreBSP, BSPedupack).

Compiling the library

First extract all files with

  • tar xvfJ MulticoreBSP-for-C.tar.xz

and the library will appear in the new `./MulticoreBSP-for-C' directory. We proceed with compilation:

  • cd MulticoreBSP-for-C;
  • make

This will create three new subdirectories: `include', `lib', and `tools'.

Warning for OS X and Windows users

Some popular OSes do not support the POSIX standards we use here; these include Mac OS X (no POSIX realtime) and Microsoft Windows (no POSIX threads and no POSIX realtime). On OS X, after extraction but before issuing `make', please execute the following steps:

  • make include.mk
  • Uncomment the SHARED_LINKER_FLAGS and C_STANDARD_FLAGS appropriate for OS X systems in include.mk

Also note that modern OS X systems masquerade the LLVM clang compiler as a gcc compiler; it is therefore advisable to also enable the LLVM/clang compiler by uncommenting the respective lines in include.mk. After these steps, this guide also applies to OS X.

On Microsoft Windows, make use of the PThreads-win32 project; see this forum post for details. Please note that only basic support for Windows is maintained, through cross-compilation; see include.mk.

Optional: testing the library

MulticoreBSP for C comes with a basic testing suite. To check if your version of the library compiled correctly, you may want to run these tests with

  • make tests

This should output one SUCCESS message for each test. Some tests may require manual checking of output. Each test verifies functions in a separate internal file of the libraries, or checks the behaviour of MulticoreBSP against its specification. If a failure occurs, the output usually specifies its origin at the function level; please report it should this happen.

Writing, compiling, and running your own BSP application

Rather than relying on existing sources, we give a very short `Hello world!' example. We can compile a MulticoreBSP program using the bspcc and bsprun scripts that are available in the ./tools directory. To simplify their use, we add them to our path:

  • export BSPPATH=`pwd`/tools
  • export PATH=${BSPPATH}:${PATH}

To make this setting persist through multiple sessions, add the latter line to your startup script; for the Bourne-again shell (bash), this is ~/.bashrc. Be sure to use the expanded version of BSPPATH, which you can inspect via:

  • echo ${BSPPATH}
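
For example, if the echo above prints `/home/user/MulticoreBSP-for-C/tools' (a hypothetical path; yours will differ), the line to add to ~/.bashrc reads:

export PATH=/home/user/MulticoreBSP-for-C/tools:${PATH}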

A MulticoreBSP for C application starts as any other C program. Let us write example.c:

#include <stdlib.h>

int main( int argc, char ** argv ) {
    return EXIT_SUCCESS;
}

To use our library to spawn three threads which each print `Hello world!', we include the MulticoreBSP for C header file, bsp.h, write parallel code in a Single Program Multiple Data (SPMD) function, and have the main function call it:

#include <bsp.h>
#include <stdio.h>
#include <stdlib.h>

void spmd() {
    bsp_begin( 3 );
    printf( "Hello world!\n" );
    bsp_end();
}

int main( int argc, char ** argv ) {
    bsp_init( &spmd, argc, argv );
    spmd();
    return EXIT_SUCCESS;
}

We can now compile the example and indeed see it work:

  • bspcc -o example example.c
  • ./example

The output should read

Hello world!
Hello world!
Hello world!

This shows the basics of using the MulticoreBSP library, and uses only 3 out of the 22 BSP primitives available. Adding the bsp_nprocs and bsp_pid primitives enables us to start a variable number of threads, and to differentiate between threads within an SPMD program. We modify example.c to illustrate:

#include <bsp.h>
#include <stdlib.h>
#include <stdio.h>

static unsigned int P;

void spmd() {
    bsp_begin( P );    /* start P threads; each executes the remainder of this function */
    printf( "Hello world from thread %u out of %u!\n", bsp_pid(), bsp_nprocs() );
    bsp_end();
}

int main( int argc, char ** argv ) {
    /* outside the SPMD section, bsp_nprocs() returns the number of available cores */
    printf( "How many threads do you want started? There are %u cores available.\n", bsp_nprocs() );
    fflush( stdout );
    scanf( "%u", &P );
    if( P == 0 || P > bsp_nprocs() ) {
        fprintf( stderr, "Cannot start %u threads.\n", P );
        return EXIT_FAILURE;
    }
    bsp_init( &spmd, argc, argv );
    spmd();
    return EXIT_SUCCESS;
}

Note that BSPlib requires us to communicate between the sequential context and the SPMD function using a global static variable (static unsigned int P, in this example). We again compile and run it:

  • bspcc -o example example.c
  • ./example

Output may, for example, look like the following:

How many threads do you want started? There are 3 cores available.
3
Hello world from thread 2 out of 3!
Hello world from thread 0 out of 3!
Hello world from thread 1 out of 3!

This example writes its output in arbitrary order. For meaningful parallel programming, communication between threads is necessary. Sixteen of the remaining seventeen MulticoreBSP primitives provide various means of doing this; please refer to the documentation for a full description. The introductory text on the BSP model may help new BSP programmers as well.
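
As a minimal sketch of such communication (not part of the original example series), the following program lets each thread send its ID to its right neighbour using the bsp_push_reg, bsp_put, and bsp_sync primitives; see the documentation for their exact semantics:

#include <bsp.h>
#include <stdio.h>
#include <stdlib.h>

void spmd() {
    bsp_begin( 3 );
    unsigned int left = 0;                        /* will hold the ID of our left neighbour */
    bsp_push_reg( &left, sizeof(unsigned int) );  /* make `left' available for remote writes */
    bsp_sync();                                   /* registrations take effect after a superstep */
    const unsigned int pid = bsp_pid();
    const unsigned int right = (pid + 1) % bsp_nprocs();
    bsp_put( right, &pid, &left, 0, sizeof(unsigned int) ); /* write our ID into our neighbour's `left' */
    bsp_sync();                                   /* communication completes at the superstep boundary */
    printf( "Thread %u received a message from thread %u.\n", bsp_pid(), left );
    bsp_pop_reg( &left );
    bsp_end();
}

int main( int argc, char ** argv ) {
    bsp_init( &spmd, argc, argv );
    spmd();
    return EXIT_SUCCESS;
}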

Compiling in debug or profile mode

Sometimes your parallel program will not behave as expected. To help diagnose any issues due to bugs, one can compile in debug mode:

  • bspcc --debug -o example example.c

The compilation will emit a warning that you are compiling in debug mode; the resulting program still runs the same, but extra checks are performed on every call to a BSP primitive, which slows down overall execution. Try running the example program again and note that extra information is printed. Always make sure to compile in the default performance mode for production or benchmarking codes!

Another mode of interest is the profile mode:

  • bspcc --profile -o example example.c

Again, a warning will be emitted since profiling brings with it a performance overhead. Running a program in profiling mode will show, for each superstep, statistics such as the time spent in computation phases, the time spent buffering communication, and the time spent communicating. It also reports the number of bytes sent and the h-relation achieved, the number of calls made to BSP primitives, and an overall BSP signature: the ratio of useful (computational) work versus the total run time. Try running the example program to see an example.

Manual compilation, compiling C++ code, and system-wide installation

Compiling C++ code can be done using the bspcxx tool instead of bspcc. To manually compile your codes, use the --show flag to inspect all arguments bspcc and bspcxx pass through to the regular compiler.

In summary, when in C mode, bspcc falls back to ANSI C99, passes ./include via the -I flag, statically links against the library in ./lib, and links against POSIX Threads and POSIX Realtime. When compiling using -c, the static and dynamic linkage flags are omitted. When compiling in C++ mode using bspcxx, nothing changes except for the use of a C++ compiler and the ANSI C++98 standard (if no others were manually defined). On OS X, neither tool will link against POSIX Realtime.
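
For illustration only, a manual compile command could then resemble the following, assuming the library was built in ./MulticoreBSP-for-C (the exact library file name, paths, and flags differ per build and platform; run `bspcc --show -o example example.c' to obtain the authoritative command for your installation):

  • gcc -std=c99 -I./MulticoreBSP-for-C/include -o example example.c ./MulticoreBSP-for-C/lib/libmcbsp2.0.4.a -lpthread -lrt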

Both bspcc and bspcxx store the full paths to the include and lib directories; i.e., the path where MulticoreBSP was built is also its install directory. If you want to separate these, manually move the public headers and compiled libraries to your preferred installation path, and edit the paths in bspcc and bspcxx accordingly before installing those scripts in your preferred location. For users without super-user privileges, adding the local install directory to your path as described above suffices and is recommended.

Compiling older BSPlib programs

Suppose we have a BSP program written in C using the original BSPlib standard (Hill et al., 1998), and we want to use MulticoreBSP to run that same code. Full compatibility is ensured by a so-called compatibility mode, which we shall demonstrate using an earlier version of the BSPedupack software:

  • export PATH=${PATH}:`pwd`/tools/
  • cd ..
  • tar xvf BSPedupack1.02.tar
  • cd BSPedupack1.02

We adapt the BSPedupack makefile to this end:

  • change line 1 of `Makefile' from CC= bspcc to CC= bspcc --compat
  • change line 2 of `Makefile' to remove everything after -O3 (see the resulting lines below)
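
After these changes, the first two lines of the Makefile should look roughly as follows (line 2 is shown here as a CFLAGS assignment for illustration; whatever the variable is called in your version, only the flags up to and including -O3 remain):

CC= bspcc --compat
CFLAGS= -O3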

The software will now compile using a simple `make'. Note that newer BSPedupack versions (from 2.0 and up) are written against the 2014 BSPlib version by Yzelman et al.; for these, no compatibility mode needs to be set.

Recall that the tools directory of the MulticoreBSP installation was added to the PATH; otherwise, the above changes will not work.

Running BSP programs

If static linking is used (as in all the above examples), running BSP applications requires no additional effort regardless of which compilation modes were used, and regardless of how compilation modes may have been mixed. The five BSPedupack applications compiled in the previous section, for example, can be run as you would any application:

  1. ./bench, a BSP benchmarking utility
  2. ./ip, a parallel inner-product calculation
  3. ./lu, a parallel dense LU factorisation (note that the actual number of threads used is M times N)
  4. ./fft, a parallel Fast Fourier Transform for complex-valued vectors (note that both the vector length as well as the number of threads requested must be powers of 2)
  5. ./matvec, a parallel sparse matrix–vector multiplication benchmark (note that BSPedupack includes a test matrix for 2 threads. Use `test.mtx-P2' as matrix, `test.mtx-v2' as the v-vector distribution, and `test.mtx-u2' as the u-vector distribution. Use the Mondriaan software to try other matrices.)

The only restriction is that if one compilation unit in the final application was compiled using --profile, then the compilation unit that calls bsp_end must also be compiled using --profile.
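
As an illustration of this restriction, suppose a hypothetical application consists of kernel.c and driver.c, where driver.c contains the call to bsp_end. If kernel.c is compiled with --profile, then driver.c must be as well; for example:

  • bspcc --profile -c kernel.c
  • bspcc --profile -c driver.c
  • bspcc --profile -o application kernel.o driver.o

(The file names here are hypothetical; whether --profile must also be passed at link time depends on your build, so when in doubt pass it there as well.)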

To hyperthread or not to hyperthread (and how to pin threads)

Modern Intel processors include hyperthreading. Programs that do not incur high memory latencies typically do not benefit from hyperthreads, while using hyperthreads typically also leads to higher performance variability. A user may thus wish to control, on an application-by-application basis, whether hyperthreads are used.

MulticoreBSP allows controlling this by creating a file called machine.info in the working directory of each application. To disable hyperthreads on a processor with two hardware threads per core, the contents of such a file should read

threads_per_core 2
thread_numbering wrapped
unused_threads_per_core 1

The machine.info file controls affinity and thread pinning. To interactively create one appropriate for your specific use case, issue `make machine.info' from the MulticoreBSP-for-C root directory (you may need to remove an existing machine.info first). This tool allows optimising for bandwidth-bound versus compute-bound applications, and can also create pinning strategies suitable for the Intel Xeon Phi.

Common pinning strategies can be specified in machine.info using affinity compact, affinity scatter, or affinity manual. However, if Intel hyperthreads are active, MulticoreBSP needs to be made aware of the fact that multiple hardware threads share a single core.

For example, a compact strategy on a hyperthread-enabled processor is enabled via:

threads_per_core 2
thread_numbering wrapped
affinity compact

A manual pinning can be hard-coded as follows, assuming, for example, a processor with four hardware threads:

affinity manual
pinning 0 1 0 1

Manual thread pinning is unaffected by the presence of hyperthreads, though the user should of course be aware of which threads map to the same core. Manual pinning overrides internal checks and will not result in warnings if, for example, multiple BSP processes are pinned to the same hardware thread (as in the above).

More examples and benchmarking your current machine

For more examples of BSP programs, please see the MulticoreBSP-for-C/examples directory. A simple `make' in that directory builds several example applications, ranging from a simple `hello world' in both C and C++ to a numerical integration in the parallel_loop example.

To gauge the speed of your computer, for example to verify that MulticoreBSP performs as expected, navigate to the MulticoreBSP-for-C/benchmarks directory. When your system has an MPI implementation available, simply issue `make' to build a set of benchmark programs using both MulticoreBSP and your MPI installation. These are then run automatically; the results can be inspected by running the plot_results.m script in MATLAB (or Octave).

If you do not have MPI installed, issue `make nompi' in the benchmarks directory instead, which builds the non-MPI benchmarks only. These can be run manually; no automated runs nor plotting tools are available in this mode. The tools correspond to several common collective communication patterns, each implemented using one of the BSP communication primitives. The benchmark programs are named accordingly: put, get, hpput, and hpget. Each of these runs on various sizes of input data so as to measure behaviour at the various levels of cache.

The barrier program times the bsp_sync primitive, while stream-memcpy, stream-sw-triad, and stream-hw-triad each measure local memory speeds. The stream-memcpy benchmark uses the standard memory copy routine available on your system, while the other two measure bandwidth using variants of the STREAM benchmark. The latter program omits the barrier after each collective that the other two tools do perform; the use of this benchmark relies on an appropriate and successful thread pinning (and should thus not be used on OS X).