Solving Triton Puzzles

July 30, 2025, 8:20 p.m. · 3 min read · 🌐︎ en

GPU

TL;DR: I solved all the puzzles in the Triton-Puzzles repository. You can find my solutions here.

Why Triton? Why Now?

I learned CUDA a while back, but recently I noticed most people are using Triton instead, which is apparently a much simpler yet still effective way of writing GPU kernels. That made me decide to give it a try.
The Triton-Puzzles repository looked like a good place to start. It's basically a collection of coding challenges for GPU kernels.

What I Learned

No Fine-Grained Thread Control

Coming from CUDA, the biggest difference was that Triton doesn't let you control individual threads. You can't use threadIdx.x or similar constructs. Instead, you write vector operations with tl.arange and the compiler handles the thread mapping for you. One quirk is that block sizes must be powers of 2, presumably to simplify internal optimization. Bounds checking is usually handled at load and store time: you pass a mask to tl.load and tl.store that marks the valid region, so you don't need piles of if-statements to avoid stepping out of bounds.
Also, instead of CUDA blocks, Triton uses an abstraction called programs, which is essentially the same concept under a different name. You access them with tl.program_id(0) and similar functions. It took me a while to get used to, but once I did, it was definitely easier: all I had to do was decide what each program in the grid does, without worrying about individual threads' behavior.
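To make the mapping concrete, here is a small pure-NumPy sketch (my own CPU emulation, not real Triton code; the names `add_program`, `launch_add`, and the block size are made up) of how a vector-add splits into programs, each handling one power-of-2 block with a mask instead of if-statements:

```python
import numpy as np

BLOCK_SIZE = 8  # Triton block sizes must be powers of 2

def add_program(pid, x, y, out, n):
    """What one Triton 'program' does: operate on a whole block at once."""
    offs = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE)  # like tl.arange
    mask = offs < n                                  # bound check as a mask
    idx = offs[mask]                                 # masked load/store,
    out[idx] = x[idx] + y[idx]                       # like tl.load/tl.store

def launch_add(x, y):
    n = len(x)
    out = np.empty_like(x)
    grid = (n + BLOCK_SIZE - 1) // BLOCK_SIZE        # number of programs
    for pid in range(grid):                          # the GPU runs these in parallel
        add_program(pid, x, y, out, n)
    return out
```

In actual Triton, the body of `add_program` is what you write inside a `@triton.jit` kernel, with `pid = tl.program_id(0)`, and the sequential loop over `pid` is replaced by the launch grid.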

Debugging

Indexing errors in Triton felt similar to debugging CUDA. It's still annoying, but there's something nostalgic about it. It reminded me of my early CUDA days when I would spend hours figuring out memory access bugs. Not exactly fun, but oddly satisfying when you finally fix them. However, one big relief was that I could use triton-viz, which was really useful for identifying silly indexing mistakes!
One important thing to note is that on modern GPUs equipped with tensor cores, operations like tl.dot automatically downcast their inputs to TF32, which accelerates the computation but sacrifices precision, unless you explicitly pass allow_tf32=False.
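TF32 keeps float32's 8-bit exponent but only 10 of its 23 mantissa bits, so small low-order terms silently disappear. Here's a tiny NumPy sketch (my own emulation of the rounding, not Triton or tensor-core code) that rounds float32 values to TF32 precision to show what gets lost:

```python
import numpy as np

def to_tf32(x):
    """Round float32 to TF32 precision (10 mantissa bits) by rounding
    away the low 13 bits of the significand."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bits = (bits + np.uint32(0x1000)) & np.uint32(0xFFFFE000)  # round, then truncate
    return bits.view(np.float32)
```

For example, `to_tf32(np.float32(1.0 + 2**-20))` collapses back to exactly 1.0, while `1.0 + 2**-10` survives — the same kind of error you accumulate inside a TF32 tl.dot.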

Interesting Puzzles

Most puzzles were standard stuff like matrix multiplication, but some were genuinely interesting:
- FlashAttention: I got to implement a scalar version of FlashAttention, and it was cool to see how the tiled computation actually works in practice. The previous puzzle, an efficient softmax kernel, set it up nicely; together the two puzzles gave me a good sense of how to build more complex functions in Triton.
- Quantization: This was especially relevant since I'm working on quantization projects. I got to write a custom kernel for the forward pass of a quantized layer.
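The trick that links the softmax and FlashAttention puzzles is the online (streaming) softmax: you can process the input block by block, keeping only a running max and a rescaled running sum, and never materialize the full row. A minimal NumPy sketch of the idea (my own, with an arbitrary block size — in a real kernel each block would live in a Triton program's registers):

```python
import numpy as np

def online_softmax(x, block=4):
    """Streaming softmax over blocks, tracking a running max and sum."""
    m = -np.inf   # running max seen so far
    s = 0.0       # running sum of exp(x - m)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        m_new = max(m, blk.max())
        # rescale the old sum to the new max before adding the block's terms
        s = s * np.exp(m - m_new) + np.exp(blk - m_new).sum()
        m = m_new
    return np.exp(x - m) / s
```

FlashAttention applies the same rescaling to the running weighted sum of values, which is what lets it tile attention without ever storing the full score matrix.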

Final Thoughts

Triton won't replace CUDA entirely, but it makes certain (actually, most) kinds of kernel development much easier. The higher-level abstractions let you focus on the algorithm instead of low-level optimization details. It also integrates much more easily into deep learning projects, since it's a Python-based framework with no separate build or linking step.
I'd recommend the puzzle series if you want to learn Triton. It's a good way to get started without jumping into a big project right away.
