for (i = 0; i < MAX_NUM; i++) {
    for (j = 0; j < MAX_NUM; j++) {
        for (k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[k][j];
    }
}
The i and j loops are parallel. This means that for any given (i, j) pair, the computation in the loop body does not depend on values computed in any other iteration; in other words, each element of the result matrix can be computed independently.
Let's get our hands dirty and make the i loop parallel. I have chosen a matrix size of 1000x1000 so that we can see the speed-up clearly.
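For reference, here is one way the complete serial program might look. The main() scaffolding, the static double arrays, and the initialization values below are just filler of my own to make it compile and run; the triple loop is the only part we really care about.

/* parallel.c -- a minimal, self-contained serial version.
   The array types and initialization values are arbitrary filler. */
#include <stdio.h>

#define MAX_NUM 1000

static double A[MAX_NUM][MAX_NUM];
static double B[MAX_NUM][MAX_NUM];
static double Y[MAX_NUM][MAX_NUM];

int main(void)
{
    int i, j, k;

    /* Fill A and B with arbitrary values; Y starts at zero. */
    for (i = 0; i < MAX_NUM; i++) {
        for (j = 0; j < MAX_NUM; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
            Y[i][j] = 0.0;
        }
    }

    /* The matrix multiplication we want to parallelize. */
    for (i = 0; i < MAX_NUM; i++) {
        for (j = 0; j < MAX_NUM; j++) {
            for (k = 0; k < MAX_NUM; k++)
                Y[i][j] += A[i][k] * B[k][j];
        }
    }

    /* Print one element so the compiler can't throw the work away. */
    printf("Y[0][0] = %f\n", Y[0][0]);
    return 0;
}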
Now, all we have to do is add one #pragma line before the i loop. Here is what the parallel code looks like:
#pragma omp parallel for default(shared) private(i, j, k)
for (i = 0; i < MAX_NUM; i++) {
    for (j = 0; j < MAX_NUM; j++) {
        for (k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[k][j];
    }
}
Don't worry about the #pragma directive. These directives are actually quite simple to figure out, and a five-minute read of the spec is enough to get you going: http://www.openmp.org/mp-documents/cspec20.pdf
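That said, here is a quick gloss of the line we just added, so you don't have to open the spec right away:

/* parallel for     - fork a team of threads and split the iterations of
                      the following for loop among them
   default(shared)  - variables from the enclosing scope (Y, A, B) are
                      shared by all threads unless listed otherwise
   private(i, j, k) - each thread gets its own copy of the loop counters,
                      so threads don't trample each other's indices */
#pragma omp parallel for default(shared) private(i, j, k)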
Note that we haven't specified how many parallel threads to fork. That is controlled by the OMP_NUM_THREADS environment variable, which gives us great flexibility: we can change the number of threads without recompiling the program.
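If you would rather pick the thread count from inside the program, OpenMP also provides omp_set_num_threads(). A tiny sketch (the count of 4 is just an example, and it overrides OMP_NUM_THREADS):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Hard-codes the thread count; the environment variable is usually
       more convenient because it needs no rebuild. */
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        /* Each thread reports its id once. */
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}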
To enable OpenMP, use the "-fopenmp" switch with gcc:
$ gcc -o parallel -fopenmp parallel.c
Here are some test-runs which clearly show the speed-up:
Only one thread (serial program)
$ export OMP_NUM_THREADS=1
$ time ./parallel
real 0m13.633s
user 0m13.558s
sys 0m0.028s
Two threads
$ export OMP_NUM_THREADS=2
$ time ./parallel
real 0m8.144s
user 0m13.963s
sys 0m0.035s
Four threads
$ export OMP_NUM_THREADS=4
$ time ./parallel
real 0m7.960s
user 0m13.815s
sys 0m0.037s
Notice the speed-up?
With one thread, we have an execution time of 13.6s. With two threads it drops to 8.1s, a speed-up of roughly 1.7x. And with four threads it is 7.9s. So what happened in the four-thread case? Didn't we expect a quarter of the time? Well, mine is only a dual-core PC, so the maximum speed-up for a CPU-intensive task comes when the number of parallel tasks equals the number of cores/threads, which is 2 here.
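If you want to time just the multiplication loop instead of the whole process, omp_get_wtime() does the job, and omp_get_num_procs() tells you how many processors OpenMP sees. A sketch, assuming the arrays and MAX_NUM from the program above, dropped into main() after the arrays are initialized (needs #include <omp.h> and <stdio.h>):

int i, j, k;
double start = omp_get_wtime();

#pragma omp parallel for default(shared) private(i, j, k)
for (i = 0; i < MAX_NUM; i++)
    for (j = 0; j < MAX_NUM; j++)
        for (k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[k][j];

double elapsed = omp_get_wtime() - start;
printf("loop took %.3f s; OpenMP sees %d processor(s)\n",
       elapsed, omp_get_num_procs());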
You can see the running parallel threads with
$ ps -Haux | grep parallel
jitesh 7240 26.0 0.0 44536 1164 pts/0 Rl+ 20:10 0:00 ./parallel
jitesh 7240 25.5 0.0 44536 1164 pts/0 Rl+ 20:10 0:00 ./parallel
jitesh 7240 27.5 0.0 44536 1164 pts/0 Rl+ 20:10 0:00 ./parallel
jitesh 7240 27.5 0.0 44536 1164 pts/0 Rl+ 20:10 0:00 ./parallel
The "-H" switch displays the threads.
Cool, isn't it?
Thus, a parallel program with minimal effort and LOTS of time saved. We cut the execution time of the matrix multiplication by roughly 40% with a single pragma. On a 16-core machine with native support for threads, imagine how well parallel programs will perform!
gcc and OpenMP FTW!
Note: If you are interested in learning OpenMP, see the spec link I have posted in the body of the blog.