Tuesday, October 26, 2010


The newer gdb (gdb 7 and above) is supporting memory debugging.
Quoting from http://fedoraproject.org/wiki/Features/MemoryDebuggingTools

"The new "gdb-heap" package adds a new "heap" command to /usr/bin/gdb.

The command allows you to get a breakdown of how that process is using
dynamic memory.

It allows for unplanned memory usage debugging: if a process
unexpectedly starts using large amounts of memory you can attach to it
with gdb, and use the heap command to figure out where the memory is
going. You should also be able to use it on core dumps.

We believe this approach is entirely new, and is unique to Fedora 14. "

Sounds promising!

Saturday, October 2, 2010

Writing cache efficient programs

Just recently I discovered how much effect cache misses can have on the running time of your program! Its truly amazing how much savings you can make, yet I had ignored it for more than 4 years now.

Consider the matrix multiplication program again

for (i = 0; i < MAX_NUM; i++) {
     for (j = 0; j < MAX_NUM; j++) {
          for (k = 0; k < MAX_NUM; k++)
               Y[i][j] += A[i][k] * B[k][j];

Note that when you have a cache miss (depending on your cache properties), you'll fetch in 32 bytes of consecutive data from main memory. Now see the B[][] array in the program. We are accessing it column-wise. Now, once we access B[0][0], we fetch in B[0][0] tp B[0][7]. But, we only end-up using B[0][0]. Only 12.5% cache usage. REALLY bad!

When MAX_NUM = 1000, this code takes ~6.778ms to run.

Now, how can we optimise this for good cache usage. One method is to transpose B[][] matrix and use the transpose in the multiplication. (Transpose = Interchange rows and columns). This way, we'll access B[][] by its rows and improve cache usage.

New code will look something like this (For the moment, ignore the time taken to transpose the matrix. We'll come to it later):

for (i = 0; i < MAX_NUM; i++) {
     for (j = 0; j < MAX_NUM; j++) {
          for (k = 0; k < MAX_NUM; k++)
                Y[i][j] += A[i][k] * B[j][k];

Notice how we access B[][] row-wise now. Our cache-efficient version takes 5.629ms for a 1000 dimension array.
So how much did we save? We saved 16.9% of execution time!! Isn't that cool?
17% savings is too much to ignore!!

Coming back to transpose of the matrix. There are multiple solutions to this problem:
1) Store it as the transpose right from the start. Thus, you don't have to do that operation before a multiplication. Neat solution.
2) There are cache-efficient ways to transpose a matrix which take less than 0.15ms for a 1000-element array. You still end-up with ~12% savings! I'll try to write a post on that someday too.

Ofcourse this is a very crude solution to use cache efficiently and can be optimised further to get even lower execution times. There is one called "Blocking factor" which I am still learning. I never thought so much about cache hits/misses when writing a program. For 17% savings, I certainly should!

Combined with my earlier post on parallelizing a program, you could effectively have double the cache size (assuming dual core) and hence more the savings!