Saturday, October 2, 2010

Writing cache efficient programs

Just recently I discovered how much of an effect cache misses can have on the running time of your program! It's truly amazing how much you can save, yet I had ignored it for more than four years.

Consider the matrix multiplication program again:


for (i = 0; i < MAX_NUM; i++) {
    for (j = 0; j < MAX_NUM; j++) {
        for (k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[k][j];
    }
}


Note that when you have a cache miss, you'll fetch a whole cache line of consecutive data from main memory (32 bytes on my machine; the exact size depends on your cache). Now look at the B[][] array in the program: we are accessing it column-wise. Once we access B[0][0], we fetch in B[0][0] to B[0][7] (eight 4-byte ints per 32-byte line), but the inner loop then jumps to B[1][0], and by the time we need B[0][1] again the line has typically been evicted. We only end up using B[0][0]. Only 12.5% cache usage. REALLY bad!
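If you want to see this effect in isolation, here is a small standalone test (my own sketch, not part of the timings below) that walks the same array row-wise and then column-wise; on most machines the column-wise walk is noticeably slower:

#include <stdio.h>
#include <time.h>

#define N 1000

static int a[N][N];

int main(void)
{
    clock_t t0;
    long sum = 0;

    /* Row-wise walk: consecutive addresses, so every element of each
       fetched cache line gets used. */
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-wise:    %f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Column-wise walk: each access jumps N*sizeof(int) bytes, so a
       fetched cache line contributes only one element before we move on. */
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    printf("column-wise: %f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return (int)sum; /* use sum so the compiler can't delete the loops */
}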

When MAX_NUM = 1000, this code takes ~6.778ms to run.

Now, how can we optimise this for good cache usage? One method is to transpose the B[][] matrix and use the transpose in the multiplication (transpose = interchange rows and columns). This way, we'll access B[][] by its rows and improve cache usage.
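For completeness, here is one simple way to do the transpose: an in-place swap across the diagonal (my sketch, assuming B[][] is square, so we can keep using the name B below):

/* In-place transpose: swap B[i][j] with B[j][i] across the diagonal. */
for (i = 0; i < MAX_NUM; i++) {
    for (j = i + 1; j < MAX_NUM; j++) {
        int tmp = B[i][j];
        B[i][j] = B[j][i];
        B[j][i] = tmp;
    }
}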

The new code will look something like this; note that B[][] now holds the transpose, so its indices are swapped (for the moment, ignore the time taken to transpose the matrix; we'll come to that later):


for (i = 0; i < MAX_NUM; i++) {
    for (j = 0; j < MAX_NUM; j++) {
        for (k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[j][k];
    }
}


Notice how we access B[][] row-wise now. Our cache-efficient version takes 5.629ms for the same 1000×1000 matrices.
So how much did we save? (6.778 − 5.629) / 6.778 ≈ 16.9% of the execution time!! Isn't that cool?
17% savings is too much to ignore!!

Coming back to the transpose of the matrix: there are multiple solutions to this problem:
1) Store the matrix as its transpose right from the start. That way you never have to do the operation before a multiplication. Neat solution.
2) There are cache-efficient ways to transpose a matrix which take less than 0.15ms for a 1000×1000 matrix, so you still end up with ~12% savings! I'll try to write a full post on that someday; a rough sketch follows below.
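Until then, here is the general shape of one such method, a blocked (tiled) transpose (my sketch; BT is a made-up name for the output array, and BLOCK is a tuning constant, not a measured optimum):

#define BLOCK 32 /* hypothetical tile size; worth tuning per machine */

/* Process the matrix in BLOCK x BLOCK tiles so that both the rows
   being read from B and the rows being written to BT stay within a
   small set of cache lines at a time. */
for (int ii = 0; ii < MAX_NUM; ii += BLOCK)
    for (int jj = 0; jj < MAX_NUM; jj += BLOCK)
        for (int i = ii; i < ii + BLOCK && i < MAX_NUM; i++)
            for (int j = jj; j < jj + BLOCK && j < MAX_NUM; j++)
                BT[j][i] = B[i][j];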

Of course, this is a very crude way to use the cache efficiently, and it can be optimised further to get even lower execution times. There is a technique based on a "blocking factor" (also called loop tiling) which I am still learning; a rough sketch of its shape is below. I never thought much about cache hits/misses when writing a program. For 17% savings, I certainly should!
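As far as I understand it so far, blocking restructures the multiplication so that small tiles of Y, A and B are worked on together while they all fit in cache (again just a sketch, reusing the hypothetical BLOCK constant from above; no transpose is needed in this version):

/* Blocked ("tiled") matrix multiplication: the three inner loops
   multiply one BLOCK x BLOCK tile of A against one tile of B,
   accumulating into one tile of Y, so all three tiles stay cached. */
for (int ii = 0; ii < MAX_NUM; ii += BLOCK)
    for (int jj = 0; jj < MAX_NUM; jj += BLOCK)
        for (int kk = 0; kk < MAX_NUM; kk += BLOCK)
            for (int i = ii; i < ii + BLOCK && i < MAX_NUM; i++)
                for (int j = jj; j < jj + BLOCK && j < MAX_NUM; j++)
                    for (int k = kk; k < kk + BLOCK && k < MAX_NUM; k++)
                        Y[i][j] += A[i][k] * B[k][j];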

Combined with my earlier post on parallelizing a program, you could effectively have double the cache size to work with (assuming a dual core, where each core has its own cache) and hence even more savings!
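A minimal sketch of what that combination could look like, using OpenMP (my choice here; the earlier post may have parallelized differently). Compile with -fopenmp; each thread handles its own rows of Y, so no locking is needed:

/* Parallel, cache-friendly multiply: B[][] already holds the transpose. */
#pragma omp parallel for
for (int i = 0; i < MAX_NUM; i++) {
    for (int j = 0; j < MAX_NUM; j++) {
        for (int k = 0; k < MAX_NUM; k++)
            Y[i][j] += A[i][k] * B[j][k];
    }
}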

1 comment:

Ani Rudh said...

kewl man! I find your blog full of interesting stuff