06 Jun, 2025
In this blogpost I will show you, step by step, how to implement a highly efficient transpose kernel for the Hopper architecture using Mojo. The best kernel achieves a bandwidth of 2775.49 GB/s, i.e. 84.1056% of the theoretical maximum. The optimisations are the same ones I applied to achieve a bandwidth of 2771.35 GB/s in pure CUDA on the same H100 used here, which shows that Mojo can achieve CUDA-like performance on exactly the same task. You may want to compare these kernels with the CUDA kernels I wrote previously and read my other blogpost as well, where I explain the concepts in detail; here I will only review them briefly and focus on the implementation details instead. Readers who don't yet know how to use TMA in Mojo are referred to my previous blogpost on that topic.
Naive approach
Before calling the kernel we need to initialise two TMA descriptors; this is similar to the cuTensorMapEncodeTiled call we would use in CUDA.
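Roughly, the host-side setup looks like the sketch below. The create_tma_descriptor helper, its module path and its exact signature are assumptions on my part (they follow the style of the API from my previous TMA blogpost and may have changed since), and the 32768 x 32768 fp32 matrix size and 32 x 32 tile size are simply the values I use for illustration.

```mojo
from gpu.host import DeviceContext
# Assumed location/signature of the descriptor helper; adjust to your Mojo version.
from gpu.host._nvidia_cuda import create_tma_descriptor

alias GMEM_WIDTH = 32768
alias GMEM_HEIGHT = 32768
alias TILE_SIZE = 32


def main():
    var ctx = DeviceContext()
    # Input matrix and output (transposed) matrix on the device.
    var gmem_dev = ctx.enqueue_create_buffer[DType.float32](GMEM_WIDTH * GMEM_HEIGHT)
    var gmem_tr_dev = ctx.enqueue_create_buffer[DType.float32](GMEM_WIDTH * GMEM_HEIGHT)

    # One descriptor per matrix, both row major, each describing a
    # TILE_SIZE x TILE_SIZE box that the TMA moves in a single transfer.
    var descriptor = create_tma_descriptor[DType.float32, 2](
        gmem_dev,
        (GMEM_HEIGHT, GMEM_WIDTH),  # global shape (rows, cols)
        (GMEM_WIDTH, 1),            # row-major strides in elements
        (TILE_SIZE, TILE_SIZE),     # shared-memory box
    )
    var descriptor_tr = create_tma_descriptor[DType.float32, 2](
        gmem_tr_dev,
        (GMEM_WIDTH, GMEM_HEIGHT),
        (GMEM_HEIGHT, 1),
        (TILE_SIZE, TILE_SIZE),
    )
```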
We have two descriptors, both in row-major format, one describing the transpose of the other; the corresponding shared-memory tiles are related by a transpose as well. As a quick reminder, here is the algorithm we are going to implement: we take a tile, transpose it within the tile, and write it to the opposite position in the matrix, i.e. to the transposed position.
![[Screenshot 2025-06-06 at 19.07.36.png]]
Below is the code that achieves that; since the full listing lives in the repo linked at the end, I'll only sketch the key parts here.
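First the load phase (the transpose and store follow below). The names of the PTX-level wrappers I use here — TMADescriptor, cp_async_bulk_tensor_shared_cluster_global and the mbarrier_* helpers — and their exact signatures are assumptions and may differ between Mojo versions.

```mojo
from gpu import barrier, block_idx, thread_idx
from gpu.host._nvidia_cuda import TMADescriptor  # assumed location
from gpu.memory import AddressSpace, cp_async_bulk_tensor_shared_cluster_global
from gpu.sync import (
    mbarrier_arrive_expect_tx_shared,
    mbarrier_init,
    mbarrier_try_wait_parity_shared,
)
from memory import UnsafePointer, stack_allocation
from utils.index import Index

alias TILE_SIZE = 32


@__llvm_arg_metadata(descriptor, `nvvm.grid_constant`)
@__llvm_arg_metadata(descriptor_tr, `nvvm.grid_constant`)
fn transpose_naive(descriptor: TMADescriptor, descriptor_tr: TMADescriptor):
    var desc_ptr = UnsafePointer.address_of(descriptor).bitcast[NoneType]()

    # One smem tile for the loaded data, one for its transpose.
    var tile = stack_allocation[
        TILE_SIZE * TILE_SIZE,
        Scalar[DType.float32],
        address_space = AddressSpace.SHARED,
        alignment=128,
    ]()
    var tile_tr = stack_allocation[
        TILE_SIZE * TILE_SIZE,
        Scalar[DType.float32],
        address_space = AddressSpace.SHARED,
        alignment=128,
    ]()
    # An mbarrier is an 8-byte synchronisation object in shared memory.
    var mbar = stack_allocation[
        1, Int64, address_space = AddressSpace.SHARED, alignment=8
    ]()

    # Upper-left corner of the tile this block works on ...
    var x = block_idx.x * TILE_SIZE
    var y = block_idx.y * TILE_SIZE
    # ... and the element within the tile this thread is responsible for.
    var row = thread_idx.y
    var col = thread_idx.x

    alias expected_bytes = TILE_SIZE * TILE_SIZE * 4  # one fp32 tile

    if row == 0 and col == 0:
        mbarrier_init(mbar, 1)
        mbarrier_arrive_expect_tx_shared(mbar, expected_bytes)
        # A single TMA bulk-tensor copy moves the whole tile into shared memory.
        cp_async_bulk_tensor_shared_cluster_global(tile, desc_ptr, mbar, Index(x, y))
    barrier()
    # Block until the TMA transaction has deposited expected_bytes into smem.
    mbarrier_try_wait_parity_shared(mbar, 0, 10_000_000)
```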
We annotate the descriptors with nvvm.grid_constant, similar to what we would do in CUDA. After allocating the shared memory arrays we define the upper-left coordinate of the tile using x and y and get the row and column the current thread is responsible for. We then copy the tile over to the shared memory array. This kernel achieves a bandwidth of 1056.08 GB/s, which is faster than the 875.46 GB/s we achieved using CUDA. I believe the reason is that in Mojo we use the PTX api for TMA transfers. You can read about the difference between the two apis in this excellent blogpost.
We compute the transpose using our two arrays. We then issue a fence to let the TMA unit know we are finished with the computation.
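Continuing the sketch from above (tma_store_fence is my assumed name for the async-proxy fence wrapper in gpu.memory):

```mojo
    # Element (row, col) of the input tile becomes element (col, row) of the
    # output tile.
    tile_tr[col * TILE_SIZE + row] = tile[row * TILE_SIZE + col]

    # Make the smem writes visible to the async (TMA) proxy before storing.
    tma_store_fence()
    barrier()
```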
Store to GMEM
We store the transposed result to GMEM using the transposed TMA descriptor.
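Again a fragment of the same sketch; cp_async_bulk_tensor_global_shared_cta and the commit/wait helpers are assumed names for the PTX store wrappers:

```mojo
    var desc_tr_ptr = UnsafePointer.address_of(descriptor_tr).bitcast[NoneType]()
    if row == 0 and col == 0:
        # Note the swapped coordinates: the tile lands at the mirrored position.
        cp_async_bulk_tensor_global_shared_cta(tile_tr, desc_tr_ptr, Index(y, x))
        cp_async_bulk_commit_group()
        cp_async_bulk_wait_group[0]()
```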
Swizzling
For a more detailed explanation of what swizzling is and how it works, please see my previous blogpost on matrix transpose; the concept is the same in Mojo. The repo I link to at the end also contains a small program you can use to explore swizzling yourself. Only two things need to be adjusted to make swizzling work:
- The descriptors need to be provided with the appropriate swizzling mode
- Inside the kernel we need to use swizzled indices
This can be implemented as follows:
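For the first point the only change is an extra argument when building the descriptors. The TensorMapSwizzle enum and the swizzle_mode parameter name are assumptions here and depend on the Mojo version:

```mojo
from gpu.host._nvidia_cuda import TensorMapSwizzle  # assumed location

# As before, but with a 128-byte swizzle: each 128-byte row of the
# 32 x 32 fp32 tile is then stored bank-conflict free in shared memory.
var descriptor = create_tma_descriptor[DType.float32, 2](
    gmem_dev,
    (GMEM_HEIGHT, GMEM_WIDTH),
    (GMEM_WIDTH, 1),
    (TILE_SIZE, TILE_SIZE),
    swizzle_mode=TensorMapSwizzle.SWIZZLE_128B,
)
```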
We can compute swizzled indices like this:
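One way to write the index calculation, assuming the 128-byte swizzle mode and a 32 x 32 fp32 tile (so each tile row is exactly 128 bytes); the helper name swizzled_col is mine:

```mojo
fn swizzled_col(row: Int, col: Int) -> Int:
    # With the 128-byte swizzle each 128-byte row of the tile is split into
    # eight 16-byte groups (4 floats each). The group index is XORed with
    # (row % 8); the position inside the group stays the same.
    var group = col // 4
    var lane = col % 4
    var group_swz = group ^ (row % 8)
    return group_swz * 4 + lane
```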
and then use the swizzled indices inside our kernel like so:
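Again as a fragment of the kernel sketch, using the swizzled_col helper from above:

```mojo
    # Read the logical element (row, col) from its swizzled location and write
    # it to the swizzled location of (col, row) in the output tile.
    var v = tile[row * TILE_SIZE + swizzled_col(row, col)]
    tile_tr[col * TILE_SIZE + swizzled_col(col, row)] = v
```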
Everything else is exactly the same.
This kernel achieves 1437.55 GB/s, compared to the 1251.76 GB/s we get in CUDA.
Processing a batch of columns per thread
An important and common optimisation for memory-bound kernels is thread coarsening, which essentially means putting more work on each thread. We can modify the previous kernel as follows to do that:
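A sketch of the coarsened inner loop, using a hypothetical batch_size of 8 columns per thread (the actual value is a tuning parameter):

```mojo
    alias BATCH_SIZE = 8  # hypothetical value; columns handled per thread

    # After the TMA load has completed (see the sketch above):
    var row = thread_idx.y
    for i in range(BATCH_SIZE):
        var col = thread_idx.x * BATCH_SIZE + i
        var v = tile[row * TILE_SIZE + swizzled_col(row, col)]
        tile_tr[col * TILE_SIZE + swizzled_col(col, row)] = v
```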
Note that we launch fewer threads with this approach (we divide by a factor of batch_size) to account for the fact that each thread now processes multiple columns.
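Schematically, the launch then looks something like this (transpose_batched is a placeholder name for the coarsened kernel, and the call is shown with illustrative dimensions only):

```mojo
ctx.enqueue_function[transpose_batched](
    descriptor,
    descriptor_tr,
    grid_dim=(GMEM_WIDTH // TILE_SIZE, GMEM_HEIGHT // TILE_SIZE),
    # One thread per (row, batch of columns): a factor BATCH_SIZE fewer
    # threads in x than in the un-coarsened kernel.
    block_dim=(TILE_SIZE // BATCH_SIZE, TILE_SIZE),
)
```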
This kernel achieves a bandwidth of 2775.49 GB/s, compared to the 2771.35 GB/s we achieved with the equivalent CUDA kernel.
Conclusion
I hope this blogpost showed you how to achieve high performance on a common GPU computing task using Mojo. Feel free to contact me on LinkedIn to chat about GPU programming or other topics related to MLSys.
The full code for this blogpost can be found on my GitHub.