As usual, I’m knee deep in CUDA optimising a fair few algorithms from various papers. Recently, I’ve been implementing the algorithms from this paper with the aim of improving them later/creating my own based from their concepts. The algorithm is an All Pairs Shortest Path algorithm with a nested loop in the kernel. Each time the second loop executes, two values from shared memory are added together and the resulted is evaluated against another variable stored in a register on the appropriate core. For some reason the code was running a lot slower than the results posted in the paper.
My friend here at Durham who is also working with CUDA suggested taking the addition out of the loop and storing the result in a register before the conditional. Much to my surprise, this worked a treat and instantly gave me comparable results with the paper.
Here is the original code before the change:
-
{
-
__shared__ int row[blockWidth][blockHeight];
-
__shared__ int column[blockWidth][blockHeight];
-
-
//Code here fills row and column
-
-
__syncthreads();
-
-
for(int k = 0; k < blockWidth; k ++)
-
{
-
if(row[threadIdx.y][k] + column[k][threadIdx.x] < value)
-
{
-
value = row[threadIdx.y][k] + column[k][threadIdx.x];
-
}
-
}
-
}
Here, we can see the change needed to drastically improve the running time of the algorithm:
-
for(unsigned int k = 0; k < blockWidth; k ++)
-
{
-
sum = row[threadIdx.y][k] + column[k][threadIdx.x];
-
if(sum < value)
-
{
-
value = sum;
-
}
-
}
Given that shared memory is so quick on CUDA, similar to an L1 cache on CPU, I wouldn’t have thought that it would have made any difference at all. Obviously, I was wrong! So watch out for things like this when using CUDA or any parallel computing platform.
