Wednesday, September 15, 2010

Writing parallel programs with C and gcc

I just found a cool gcc feature: gcc implements the OpenMP specification, which describes how to write shared-memory parallel programs in various languages. This got me interested, so I took the classic parallelizable example: matrix multiplication. Matrix multiplication is an O(n^3) algorithm. The loop structure of a matrix multiplication program looks like this:


for (i = 0; i < MAX_NUM; i++) {
     for (j = 0; j < MAX_NUM; j++) {
          for (k = 0; k < MAX_NUM; k++)
               Y[i][j] += A[i][k] * B[k][j];
     }
}


The i and j loops are parallel. This means that for a given (i, j) pair, the loop body does not depend on values computed for any other (i, j) pair, i.e. each element of the result matrix can be computed independently. (The k loop, on the other hand, accumulates into the same Y[i][j], so it is not parallel in the same trivial way.)

Let's get our hands dirty and make the i loop parallel. I have chosen a matrix size of 1000x1000 so that we can see the speed-up clearly.

Now, all we have to do is add one #pragma line before the i loop. Here is what the parallel code looks like:


#pragma omp parallel for default(shared) private(i, j, k)
for (i = 0; i < MAX_NUM; i++) {
     for (j = 0; j < MAX_NUM; j++) {
          for (k = 0; k < MAX_NUM; k++)
               Y[i][j] += A[i][k] * B[k][j];
     }
}


Don't worry about the #pragma line. These directives are actually quite simple to figure out and take about five minutes to learn from the spec: http://www.openmp.org/mp-documents/cspec20.pdf

Note that we haven't specified how many parallel threads to fork. That is controlled by the OMP_NUM_THREADS environment variable, which gives us the flexibility to change the thread count without recompiling the program.
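If you would rather control the thread count from inside the program, the OpenMP runtime library (omp.h) also provides calls such as omp_set_num_threads() and omp_get_num_threads(). Here is a tiny stand-alone sketch (separate from the matrix program; compile it with -fopenmp as shown below):


#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Override whatever OMP_NUM_THREADS says, from inside the program. */
    omp_set_num_threads(2);

    #pragma omp parallel
    {
        /* Only thread 0 reports the size of the team. */
        if (omp_get_thread_num() == 0)
            printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}
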

To enable OpenMP, use the "-fopenmp" switch with gcc:

$ gcc -o parallel -fopenmp parallel.c
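
Since the post only shows the loop, here is a minimal sketch of what the complete parallel.c might look like; the element type (double), the initialization values and the final printf are my own assumptions:


/* parallel.c -- minimal sketch; element type and initialization are assumptions */
#include <stdio.h>

#define MAX_NUM 1000

/* Static globals: three 1000x1000 double matrices (~24 MB) would overflow
   the stack if declared inside main(). */
static double A[MAX_NUM][MAX_NUM];
static double B[MAX_NUM][MAX_NUM];
static double Y[MAX_NUM][MAX_NUM];

int main(void)
{
    int i, j, k;

    /* Fill the input matrices with arbitrary values. */
    for (i = 0; i < MAX_NUM; i++) {
        for (j = 0; j < MAX_NUM; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
        }
    }

    /* The parallel matrix multiplication from above. */
    #pragma omp parallel for default(shared) private(i, j, k)
    for (i = 0; i < MAX_NUM; i++) {
        for (j = 0; j < MAX_NUM; j++) {
            for (k = 0; k < MAX_NUM; k++)
                Y[i][j] += A[i][k] * B[k][j];
        }
    }

    /* Print one element so the compiler cannot optimize the work away. */
    printf("Y[0][0] = %f\n", Y[0][0]);
    return 0;
}
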


Here are some test-runs which clearly show the speed-up:
Only one thread (serial program)

$ export OMP_NUM_THREADS=1
$ time ./parallel

real    0m13.633s
user    0m13.558s
sys    0m0.028s


Two threads

$ export OMP_NUM_THREADS=2
$ time ./parallel

real    0m8.144s
user    0m13.963s
sys    0m0.035s


Four threads

$ export OMP_NUM_THREADS=4
$ time ./parallel

real    0m7.960s
user    0m13.815s
sys    0m0.037s


Notice the speed-up?
With one thread, the execution time is 13.6s. With two threads it drops to 8.1s, which is almost half. With four threads it is 7.9s. So what happened in the four-thread case? Didn't we expect a quarter of the time? Well, mine is only a dual-core PC, so for a CPU-bound task the maximum speed-up is reached when the number of parallel threads equals the number of cores, which here is 2.

You can see the running parallel threads with

$ ps -Haux | grep parallel
jitesh    7240 26.0  0.0  44536  1164 pts/0    Rl+  20:10   0:00 ./parallel
jitesh    7240 25.5  0.0  44536  1164 pts/0    Rl+  20:10   0:00 ./parallel
jitesh    7240 27.5  0.0  44536  1164 pts/0    Rl+  20:10   0:00 ./parallel
jitesh    7240 27.5  0.0  44536  1164 pts/0    Rl+  20:10   0:00 ./parallel


The "-H" switch displays the threads.
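You can also check the work division from inside the program: omp_get_thread_num() tells each thread its own id within the team. A tiny sketch (again separate from parallel.c):


#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i;

    /* Each iteration reports which thread executed it, so you can see
       the loop being split across the team. */
    #pragma omp parallel for
    for (i = 0; i < 8; i++)
        printf("iteration %d handled by thread %d\n", i, omp_get_thread_num());

    return 0;
}
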


Cool, isn't it?
Thus, a parallel program with minimal effort and LOTS of time saved. We cut the execution time of matrix multiplication in half with proper parallelism. On a 16-core machine with native support for that many threads, imagine how parallel programs will perform!

gcc and OpenMP FTW!

Note: If you are interested in learning OpenMP, see the spec link I posted in the body of this post.

2 comments:

Vedang said...

nice! extremely interesting :)

Lloyd said...

Nice, I gotta try this one.


Thanks!
