for (i = 0; i < MAX_NUM; i++) {
    for (j = 0; j < MAX_NUM; j++) {
        for (k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[k][j];
    }
}
The i and j loops are parallel. This means that for any given (i, j) pair, the computation in the loop body does not depend on values computed in any other iteration; in other words, each element of the result matrix can be computed independently.
Let's get our hands dirty and make the i loop parallel. I have chosen a matrix size of 1000x1000 so that we can see the speed-up clearly.
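For reference, here is one way the complete serial program might look. The main() scaffolding, the static double arrays, and the initialization values below are just filler of my own to make it compile and run; the triple loop is the only part we really care about.

/* parallel.c -- a minimal, self-contained serial version.
   The array types and initialization values are arbitrary filler. */
#include <stdio.h>

#define MAX_NUM 1000

static double A[MAX_NUM][MAX_NUM];
static double B[MAX_NUM][MAX_NUM];
static double Y[MAX_NUM][MAX_NUM];

int main(void)
{
    int i, j, k;

    /* Fill A and B with arbitrary values; Y starts at zero. */
    for (i = 0; i < MAX_NUM; i++) {
        for (j = 0; j < MAX_NUM; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
            Y[i][j] = 0.0;
        }
    }

    /* The matrix multiplication we want to parallelize. */
    for (i = 0; i < MAX_NUM; i++) {
        for (j = 0; j < MAX_NUM; j++) {
            for (k = 0; k < MAX_NUM; k++)
                Y[i][j] += A[i][k] * B[k][j];
        }
    }

    /* Print one element so the compiler can't throw the work away. */
    printf("Y[0][0] = %f\n", Y[0][0]);
    return 0;
}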
Now, all we have to do is add one #pragma line before the i loop. Here is what the parallel code looks like:
#pragma omp parallel for default(shared) private(i, j, k)
for (i = 0; i < MAX_NUM; i++) {
    for (j = 0; j < MAX_NUM; j++) {
        for (k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[k][j];
    }
}
Don't worry about the #pragma directive. These directives are actually quite simple to figure out, and a five-minute read of the spec is enough to get you going: http://www.openmp.org/mp-documents/cspec20.pdf
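That said, here is a quick gloss of the line we just added, so you don't have to open the spec right away:

/* parallel for     - fork a team of threads and split the iterations of
                      the following for loop among them
   default(shared)  - variables from the enclosing scope (Y, A, B) are
                      shared by all threads unless listed otherwise
   private(i, j, k) - each thread gets its own copy of the loop counters,
                      so threads don't trample each other's indices */
#pragma omp parallel for default(shared) private(i, j, k)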
Note that we haven't specified how many parallel threads to fork. That is controlled by the OMP_NUM_THREADS environment variable, which gives us great flexibility: we can change the number of threads without recompiling the program.
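If you would rather pick the thread count from inside the program, OpenMP also provides omp_set_num_threads(). A tiny sketch (the count of 4 is just an example, and it overrides OMP_NUM_THREADS):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Hard-codes the thread count; the environment variable is usually
       more convenient because it needs no rebuild. */
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        /* Each thread reports its id once. */
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}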
To enable OpenMP, use the "-fopenmp" switch with gcc:
$ gcc -o parallel -fopenmp parallel.c
Here are some test-runs which clearly show the speed-up:
Only one thread (serial program)
$ export OMP_NUM_THREADS=1
$ time ./parallel
real 0m13.633s
user 0m13.558s
sys 0m0.028s
Two threads
$ export OMP_NUM_THREADS=2
$ time ./parallel
real 0m8.144s
user 0m13.963s
sys 0m0.035s
Four threads
$ export OMP_NUM_THREADS=4
$ time ./parallel
real 0m7.960s
user 0m13.815s
sys 0m0.037s
Notice the speed-up?
With one thread, we have an execution time of 13.6s. With two threads it drops to 8.1s, a speed-up of roughly 1.7x. And with four threads it is 7.9s. So what happened in the four-thread case? Didn't we expect a quarter of the time? Well, mine is only a dual-core PC, so the maximum speed-up for a CPU-intensive task comes when the number of parallel tasks equals the number of cores/threads, which is 2 here.
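If you want to time just the multiplication loop instead of the whole process, omp_get_wtime() does the job, and omp_get_num_procs() tells you how many processors OpenMP sees. A sketch, assuming the arrays and MAX_NUM from the program above, dropped into main() after the arrays are initialized (needs #include <omp.h> and <stdio.h>):

int i, j, k;
double start = omp_get_wtime();

#pragma omp parallel for default(shared) private(i, j, k)
for (i = 0; i < MAX_NUM; i++)
    for (j = 0; j < MAX_NUM; j++)
        for (k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[k][j];

double elapsed = omp_get_wtime() - start;
printf("loop took %.3f s; OpenMP sees %d processor(s)\n",
       elapsed, omp_get_num_procs());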
You can see the running parallel threads with
$ ps -Haux | grep parallel
jitesh 7240 26.0 0.0 44536 1164 pts/0 Rl+ 20:10 0:00 ./parallel
jitesh 7240 25.5 0.0 44536 1164 pts/0 Rl+ 20:10 0:00 ./parallel
jitesh 7240 27.5 0.0 44536 1164 pts/0 Rl+ 20:10 0:00 ./parallel
jitesh 7240 27.5 0.0 44536 1164 pts/0 Rl+ 20:10 0:00 ./parallel
The "-H" switch displays the threads.
Cool, isn't it?
Thus, a parallel program with minimal effort and LOTS of time saved. We cut the execution time of the matrix multiplication by roughly 40% with a single pragma. On a 16-core machine with native support for threads, imagine how well parallel programs will perform!
gcc and OpenMP FTW!
Note: If you are interested in learning OpenMP, see the spec link I have posted in the body of the blog.