<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jeremy Kemp &#187; CUDA</title>
	<atom:link href="http://www.jeremykemp.co.uk/tag/cuda/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.jeremykemp.co.uk</link>
	<description>//TODO</description>
	<lastBuildDate>Sun, 15 Jan 2012 15:32:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Shared Memory Tip</title>
		<link>http://www.jeremykemp.co.uk/07/02/2011/shared-memory-tip/</link>
		<comments>http://www.jeremykemp.co.uk/07/02/2011/shared-memory-tip/#comments</comments>
		<pubDate>Mon, 07 Feb 2011 15:39:37 +0000</pubDate>
		<dc:creator>Jeremy</dc:creator>
				<category><![CDATA[CUDA]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Uni]]></category>

		<guid isPermaLink="false">http://www.jeremykemp.co.uk/?p=195</guid>
		<description><![CDATA[As usual, I&#8217;m knee-deep in CUDA, optimising a fair few algorithms from various papers. Recently, I&#8217;ve been implementing the algorithms from this paper with the aim of improving them later or creating my own based on their concepts. The algorithm is an All Pairs Shortest Path algorithm with a nested loop in the kernel. Each time [...]]]></description>
			<content:encoded><![CDATA[<p>As usual, I&#8217;m knee-deep in CUDA, optimising a fair few algorithms from various papers. Recently, I&#8217;ve been implementing the algorithms from <a href="http://www.computer.org/portal/web/csdl/doi/10.1109/ITNG.2010.230" target="_blank">this</a> paper with the aim of improving them later or creating my own based on their concepts. The algorithm is an All Pairs Shortest Path algorithm with a nested loop in the kernel. Each time the inner loop executes, two values from shared memory are added together and the result is compared against another value held in a register by the executing thread. For some reason, my code was running a lot slower than the results reported in the paper.</p>
<p>My <a href="http://laurencedawson.com/" target="_blank">friend</a> here at Durham, who is also working with CUDA, suggested taking the addition out of the conditional and storing the result in a register before the comparison. Much to my surprise, this worked a treat and instantly gave me results comparable with those in the paper.</p>
<p>Here is the original code before the change:</p>
<div class="geshi no cpp">
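<ol>
<li class="li1">
<div class="de1"><span class="kw1">for</span><span class="br0">&#40;</span><span class="kw4">int</span> i <span class="sy1">=</span> <span class="nu0">0</span>; i <span class="sy1">&lt;</span> gridDim.<span class="me1">x</span>; i <span class="sy2">++</span><span class="br0">&#41;</span></div>
</li>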
<li class="li1">
<div class="de1"><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; __shared__ <span class="kw4">int</span> row<span class="br0">&#91;</span>blockWidth<span class="br0">&#93;</span><span class="br0">&#91;</span>blockHeight<span class="br0">&#93;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; __shared__ <span class="kw4">int</span> column<span class="br0">&#91;</span>blockWidth<span class="br0">&#93;</span><span class="br0">&#91;</span>blockHeight<span class="br0">&#93;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; <span class="co1">//Code here fills row and column</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; __syncthreads<span class="br0">&#40;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; <span class="kw1">for</span><span class="br0">&#40;</span><span class="kw4">int</span> k <span class="sy1">=</span> <span class="nu0">0</span>; k <span class="sy1">&lt;</span> blockWidth; k <span class="sy2">++</span><span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; <span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp;<span class="kw1">if</span><span class="br0">&#40;</span>row<span class="br0">&#91;</span>threadIdx.<span class="me1">y</span><span class="br0">&#93;</span><span class="br0">&#91;</span>k<span class="br0">&#93;</span> <span class="sy2">+</span> column<span class="br0">&#91;</span>k<span class="br0">&#93;</span><span class="br0">&#91;</span>threadIdx.<span class="me1">x</span><span class="br0">&#93;</span> <span class="sy1">&lt;</span> value<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp;<span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp; value <span class="sy1">=</span> row<span class="br0">&#91;</span>threadIdx.<span class="me1">y</span><span class="br0">&#93;</span><span class="br0">&#91;</span>k<span class="br0">&#93;</span> <span class="sy2">+</span> column<span class="br0">&#91;</span>k<span class="br0">&#93;</span><span class="br0">&#91;</span>threadIdx.<span class="me1">x</span><span class="br0">&#93;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp;<span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; <span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="br0">&#125;</span></div>
</li>
</ol>
</div>
<p>Here is the change to the inner loop that drastically improved the running time of the algorithm:</p>
<div class="geshi no cpp">
<ol>
<li class="li1">
<div class="de1"><span class="kw4">unsigned</span> <span class="kw4">int</span> sum;</div>
</li>
<li class="li1">
<div class="de1"><span class="kw1">for</span><span class="br0">&#40;</span><span class="kw4">unsigned</span> <span class="kw4">int</span> k <span class="sy1">=</span> <span class="nu0">0</span>; k <span class="sy1">&lt;</span> blockWidth; k <span class="sy2">++</span><span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; sum <span class="sy1">=</span> row<span class="br0">&#91;</span>threadIdx.<span class="me1">y</span><span class="br0">&#93;</span><span class="br0">&#91;</span>k<span class="br0">&#93;</span> <span class="sy2">+</span> column<span class="br0">&#91;</span>k<span class="br0">&#93;</span><span class="br0">&#91;</span>threadIdx.<span class="me1">x</span><span class="br0">&#93;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; <span class="kw1">if</span><span class="br0">&#40;</span>sum <span class="sy1">&lt;</span> value<span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; <span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp; &nbsp;value <span class="sy1">=</span> sum;</div>
</li>
<li class="li1">
<div class="de1">&nbsp; <span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="br0">&#125;</span></div>
</li>
</ol>
</div>
<p>Given that shared memory is so quick on CUDA devices, similar to an L1 cache on a CPU, I wouldn&#8217;t have thought that it would make any difference at all. Obviously, I was wrong! So watch out for things like this when using CUDA or any other parallel computing platform.</p>
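<p>For anyone who wants to try the pattern in isolation, here is a minimal, self-contained sketch of the hoisted-sum loop. It is not the kernel from the paper or my actual code: it assumes a single square tile, and the names <code>TILE</code>, <code>minPlusTile</code>, <code>rowIn</code>, <code>colIn</code> and <code>out</code> are made up for illustration.</p>
<pre><code>#include &lt;cstdio&gt;
#include &lt;cuda_runtime.h&gt;

#define TILE 16  // hypothetical tile size, standing in for blockWidth/blockHeight

// One min-plus update of a single TILE x TILE tile:
// value = min(value, row[y][k] + column[k][x]) over all k.
__global__ void minPlusTile(const int *rowIn, const int *colIn, int *out)
{
    __shared__ int row[TILE][TILE];
    __shared__ int column[TILE][TILE];

    const int x = threadIdx.x;
    const int y = threadIdx.y;
    const int idx = y * TILE + x;

    // Each thread loads one element of each tile into shared memory.
    row[y][x] = rowIn[idx];
    column[y][x] = colIn[idx];
    int value = out[idx];  // running minimum, local to this thread

    __syncthreads();

    for (int k = 0; k &lt; TILE; k++)
    {
        // Add once, keep the sum in a local variable, then compare and assign from it.
        const int sum = row[y][k] + column[k][x];
        if (sum &lt; value)
        {
            value = sum;
        }
    }

    out[idx] = value;
}

int main()
{
    const int n = TILE * TILE;
    int hRow[n], hCol[n], hOut[n];
    for (int i = 0; i &lt; n; i++)
    {
        hRow[i] = i % 7 + 1;   // arbitrary test weights
        hCol[i] = i % 5 + 1;
        hOut[i] = 1000;        // large initial distance
    }

    int *dRow, *dCol, *dOut;
    cudaMalloc(&amp;dRow, n * sizeof(int));
    cudaMalloc(&amp;dCol, n * sizeof(int));
    cudaMalloc(&amp;dOut, n * sizeof(int));
    cudaMemcpy(dRow, hRow, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, hCol, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dOut, hOut, n * sizeof(int), cudaMemcpyHostToDevice);

    minPlusTile&lt;&lt;&lt;1, dim3(TILE, TILE)&gt;&gt;&gt;(dRow, dCol, dOut);

    // The blocking copy back also waits for the kernel to finish.
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[0] = %d\n", hOut[0]);

    cudaFree(dRow);
    cudaFree(dCol);
    cudaFree(dOut);
    return 0;
}</code></pre>
<p>Swapping the body of the loop back to the version with the addition inside the <code>if</code> gives a quick way to compare the two variants on your own hardware. Whether you see the same speed-up will likely depend on the compiler version and the GPU.</p>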
]]></content:encoded>
			<wfw:commentRss>http://www.jeremykemp.co.uk/07/02/2011/shared-memory-tip/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>CUDA cuPrintf</title>
		<link>http://www.jeremykemp.co.uk/08/02/2010/cuda-cuprintf/</link>
		<comments>http://www.jeremykemp.co.uk/08/02/2010/cuda-cuprintf/#comments</comments>
		<pubDate>Mon, 08 Feb 2010 12:06:08 +0000</pubDate>
		<dc:creator>Jeremy</dc:creator>
				<category><![CDATA[CUDA]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[-deviceemu]]></category>
		<category><![CDATA[cuPrintf]]></category>

		<guid isPermaLink="false">http://www.jeremykemp.co.uk/?p=124</guid>
		<description><![CDATA[I finally got an Nvidia developer account a few days ago which gave me access to a very useful library to use with CUDA. cuPrintf allows printf equivalent statements to be placed inside CUDA kernels without the need for -deviceemu. The following example demonstrates a simple use for cuPrintf and displays the current thread ID. [...]]]></description>
			<content:encoded><![CDATA[<p>I finally got an Nvidia developer account a few days ago which gave me access to a very useful library to use with CUDA.</p>
<p>cuPrintf allows printf-equivalent statements to be placed inside CUDA kernels without the need to compile with -deviceemu (device emulation mode).</p>
<p>The following example demonstrates a simple use for cuPrintf and displays the current thread ID.</p>
<div class="geshi no cpp">
<ol>
<li class="li1">
<div class="de1"><span class="co2">#include &lt;cuda.h&gt;</span></div>
</li>
<li class="li1">
<div class="de1"><span class="co2">#include &quot;cuPrintf.cu&quot;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1">__global__ <span class="kw4">void</span> cuPrintfExample<span class="br0">&#40;</span><span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw4">int</span> tid;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;tid <span class="sy1">=</span> blockIdx.<span class="me1">x</span> <span class="sy2">*</span> blockDim.<span class="me1">x</span> <span class="sy2">+</span> threadIdx.<span class="me1">x</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;cuPrintf<span class="br0">&#40;</span><span class="st0">&quot;%d<span class="es0">\n</span>&quot;</span>, tid<span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#125;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;</div>
</li>
<li class="li1">
<div class="de1"><span class="kw4">int</span> main<span class="br0">&#40;</span><span class="br0">&#41;</span></div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#123;</span></div>
</li>
<li class="li1">
<div class="de1">&nbsp;cudaPrintfInit<span class="br0">&#40;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;cuPrintfExample <span class="sy1">&lt;&lt;&lt;</span> <span class="nu0">5</span>, <span class="nu0">2</span> <span class="sy1">&gt;&gt;&gt;</span> <span class="br0">&#40;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;cudaPrintfDisplay<span class="br0">&#40;</span><span class="kw2">stdout</span>, <span class="kw2">true</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;cudaPrintfEnd<span class="br0">&#40;</span><span class="br0">&#41;</span>;</div>
</li>
<li class="li1">
<div class="de1">&nbsp;<span class="kw1">return</span> <span class="nu0">0</span>;</div>
</li>
<li class="li1">
<div class="de1"><span class="br0">&#125;</span></div>
</li>
</ol>
</div>
<p>cudaPrintfInit and cudaPrintfEnd only need to be called once in your entire project.</p>
<p>Output is not automatically displayed on the screen; it is stored in a buffer, which is displayed and then cleared when cudaPrintfDisplay is called. The size of the buffer can be specified with the optional argument to cudaPrintfInit(size_t bufferLen).</p>
<p>cudaPrintfEnd simply frees the memory allocated by cudaPrintfInit.</p>
<p>When cudaPrintfDisplay is called, the output stored in the buffer is written out. The first argument, stdout in this example, is the file stream that the cuPrintf log is sent to. The second argument controls whether each line is prefixed with the ID of the thread that printed it (true) or not (false).</p>
<p>On another note, I&#8217;ve found that using cuPrintf reduces the performance of my kernels, presumably due to the data transfer performed every time cudaPrintfDisplay() is called.</p>
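<p>One rough way to put a number on that overhead is to time the same kernel with and without printing, using CUDA events. The sketch below is only an illustration under a few assumptions: it uses the built-in device <code>printf</code> (available on compute capability 2.0 and later) as a stand-in for cuPrintf so that it compiles without cuPrintf.cu, and <code>threadIdKernel</code> and <code>timeLaunch</code> are hypothetical names. It times the kernel launch only, not the host-side cost of displaying the buffer.</p>
<pre><code>#include &lt;cstdio&gt;
#include &lt;cuda_runtime.h&gt;

// Hypothetical kernel: stores the global thread ID and prints it only when verbose is true.
__global__ void threadIdKernel(bool verbose, int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = tid;
    if (verbose)
    {
        printf("%d\n", tid);  // built-in device printf, standing in for cuPrintf
    }
}

// Time a single launch with CUDA events and return the elapsed milliseconds.
static float timeLaunch(bool verbose, int *dOut)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&amp;start);
    cudaEventCreate(&amp;stop);

    cudaEventRecord(start);
    threadIdKernel&lt;&lt;&lt;5, 2&gt;&gt;&gt;(verbose, dOut);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&amp;ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    int *dOut;
    cudaMalloc(&amp;dOut, 5 * 2 * sizeof(int));  // one slot per launched thread

    timeLaunch(false, dOut);                // warm-up launch, result discarded
    float quiet = timeLaunch(false, dOut);
    float noisy = timeLaunch(true, dOut);

    cudaDeviceSynchronize();
    fprintf(stderr, "without printing: %f ms, with printing: %f ms\n", quiet, noisy);

    cudaFree(dOut);
    return 0;
}</code></pre>
<p>With only ten threads the difference may be lost in the noise, so it is worth raising the launch configuration (and the size of <code>dOut</code> to match) before drawing any conclusions.</p>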
]]></content:encoded>
			<wfw:commentRss>http://www.jeremykemp.co.uk/08/02/2010/cuda-cuprintf/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
	</channel>
</rss>

