During this semester's Scalable High-Performance Computing course, I worked on optimizing matrix multiplication with various libraries and APIs, including pthreads, OpenMP, OpenCL, and CUDA. In the final project, I gained valuable experience accelerating deep learning model inference on a cluster of GPU nodes, weighing various ideas for accelerating different kernels and applying optimization techniques. There, too, optimizing the matrix multiplication (matmul) kernel was the crucial part, and it is where I spent most of my time and effort.
In this article, I summarize the optimization techniques applied to a CUDA matmul implementation step by step, starting from a naive approach, and document how much performance each technique adds. Note that all performance measurements were conducted with $A\in \mathbb{R}^{M\times K}, B \in \mathbb{R}^{K \times N}, C \in \mathbb{R}^{M\times N}, M=N=K=4096$ on an NVIDIA GeForce RTX 3060. All code can be found at https://github.com/vantaa89/cuda-matmul/tree/master.
Kernel 0: Naive Matrix Multiplication
__global__ void matmul_kernel_naive(float *A, float *B, float *C, int M, int N, int K) {
int row = threadIdx.x + blockDim.x * blockIdx.x;
int col = threadIdx.y + blockDim.y * blockIdx.y;
float acc = 0;
for(int k = 0; k < K; ++k){
acc += A[row * K + k] * B[k * N + col];
}
C[row * N + col] = acc;
}
// Host code for launch
dim3 blockDim(32, 32);
dim3 gridDim((M+31)/32, (N+31)/32);
matmul_kernel_naive<<<gridDim, blockDim>>>(a_d, b_d, c_d, M, N, K);

This is the most basic matrix multiplication kernel. It launches a grid of $M\times N$ threads in which the $(i, j)$-th thread computes $C_{ij}$. It doesn't use shared memory, so matrices A and B are repeatedly fetched directly from global memory, and the access pattern itself is not optimized either: since row is derived from threadIdx.x, consecutive threads in a warp read elements of A (and write elements of C) that lie K (and N) floats apart, so these accesses are not coalesced. A throughput of 103 GFLOPS was measured.
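For reference, the GFLOPS figures quoted in this article can be obtained by timing the kernel launch with CUDA events and dividing the operation count by the elapsed time. The snippet below is a minimal sketch of such a measurement (assuming a single, warmed-up launch); it is not the exact harness used for the numbers in this article.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
matmul_kernel_naive<<<gridDim, blockDim>>>(a_d, b_d, c_d, M, N, K);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// 2*M*N*K floating-point operations: one multiply and one add per inner-loop iteration.
double gflops = 2.0 * M * N * K / (ms * 1e-3) / 1e9;
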
Kernel 1: Block Tiling
template <int BLOCK_SIZE>
__global__ void matmul_kernel_block_tiling(float *A, float *B, float *C, int M, int N, int K) {
int row = threadIdx.x;
int col = threadIdx.y;
int global_row = BLOCK_SIZE * blockIdx.x + threadIdx.x;
int global_col = BLOCK_SIZE * blockIdx.y + threadIdx.y;
__shared__ float A_block[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float B_block[BLOCK_SIZE][BLOCK_SIZE];
float acc = 0.0f;
const int num_tiles = (K + BLOCK_SIZE - 1) / BLOCK_SIZE;
for(int t = 0; t < num_tiles; ++t){
const int tiled_row = BLOCK_SIZE * t + row;
const int tiled_col = BLOCK_SIZE * t + col;
A_block[row][col] = A[global_row * K + tiled_col];
B_block[row][col] = B[tiled_row * N + global_col];
__syncthreads();
for(int k = 0; k < BLOCK_SIZE; ++k){
acc += A_block[row][k] * B_block[k][col];
}
__syncthreads();
}
C[global_row * N + global_col] = acc;
}
// Host code for launch
const int BLOCK_SIZE = 16;
blockDim = dim3(BLOCK_SIZE, BLOCK_SIZE);
gridDim = dim3((M+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE);
matmul_kernel_block_tiling<BLOCK_SIZE><<<gridDim, blockDim>>>(a_d, b_d, c_d, M, N, K);

This is the kernel where a basic optimization called tiling is applied. In particular, the number of global memory accesses is significantly reduced by employing shared memory. Since each thread block processes a single tile of the resultant matrix $C$, this can be dubbed block tiling. The figure above shows a thread block that computes the $3\times 3$ sub-matrix of $C$ colored in blue.
As demonstrated in the code, a thread block comprises BLOCK_SIZE * BLOCK_SIZE ($16\times 16$) threads. In the figure, this is reduced to $3\times 3$ for convenience, meaning there are $3\times 3$ threads in a thread block.
First, the thread block reads the elements of $A$ and $B$ belonging to ① from global memory and copies them to A_block and B_block, which reside in shared memory. Since there are $3\times 3$ elements to copy from each of $A$ and $B$, and the thread block also comprises $3\times 3$ threads, each thread needs to read one element from $A$ and one from $B$.
After the elements are copied to shared memory, the computation is performed on these elements and the results are accumulated into each thread's buffer acc. The same process is repeated for ②, ③, and so on, by the for-loop over t. Here, __syncthreads() is employed to guarantee that the computation starts only after the loads into shared memory have completed. The throughput was measured to be 411 GFLOPS.
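The reduction in global memory traffic can be quantified roughly (ignoring caches). In kernel 0, each of the $M\times N$ threads reads $K$ elements of $A$ and $K$ elements of $B$, for about $2MNK$ global loads in total. With block tiling, every element brought into shared memory is reused by BLOCK_SIZE threads, so the traffic drops by a factor of BLOCK_SIZE, to about $2MNK/16$ global loads here.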
Kernel 2: Thread Tiling
template <int BLOCK_SIZE, int THREAD_TILE_SIZE>
__global__ void matmul_kernel_thread_tiling(float *A, float *B, float *C, int M, int N, int K) {
int row = threadIdx.x * THREAD_TILE_SIZE;
int col = threadIdx.y;
int global_row = BLOCK_SIZE * blockIdx.x + row;
int global_col = BLOCK_SIZE * blockIdx.y + col;
__shared__ float A_block[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float B_block[BLOCK_SIZE][BLOCK_SIZE];
float acc[THREAD_TILE_SIZE];
for (int wm=0; wm<THREAD_TILE_SIZE; wm++) {
acc[wm] = 0.0f;
}
const int num_tiles = (K + BLOCK_SIZE - 1) / BLOCK_SIZE;
for(int t = 0; t < num_tiles; ++t){
const int tiled_row = BLOCK_SIZE * t + row;
const int tiled_col = BLOCK_SIZE * t + col;
for(int w1 = 0; w1 < THREAD_TILE_SIZE; w1++){
A_block[row+w1][col] = A[(global_row+w1) * K + tiled_col];
B_block[row+w1][col] = B[(tiled_row+w1) * N + global_col];
}
__syncthreads();
for(int k = 0; k < BLOCK_SIZE; ++k){
for(int w1 = 0; w1 < THREAD_TILE_SIZE; w1++){
acc[w1] += A_block[row+w1][k] * B_block[k][col];
}
}
__syncthreads();
}
for(int w1 = 0; w1 < THREAD_TILE_SIZE; w1++){
C[(global_row+w1) * N + global_col] = acc[w1];
}
}
// Host code for launch
const int BLOCK_SIZE = 32, THREAD_TILE_SIZE = 8;
blockDim = dim3(BLOCK_SIZE/THREAD_TILE_SIZE, BLOCK_SIZE);
gridDim = dim3((M+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE);
matmul_kernel_thread_tiling<BLOCK_SIZE, THREAD_TILE_SIZE><<<gridDim, blockDim>>>(a_d, b_d, c_d, M, N, K);

This is very similar to kernel 1, where block tiling is applied. However, while a thread in kernel 1 calculates a single element of $C$, each thread in kernel 2 computes THREAD_TILE_SIZE adjacent elements in the row direction of $C$. This increases the workload per thread, effectively reducing the overhead that is irrelevant to the actual computation.
The figure above shows an example for BLOCK_SIZE=4 and THREAD_TILE_SIZE=2. The thread block currently being executed computes the $4\times 4$ sub-matrix of $C$ colored blue. However, since two elements are computed per thread, the thread block size becomes (4, 2), not (4, 4).
One can notice that acc has changed into an array, as opposed to a scalar in the original code. This is because each thread needs one accumulator for each element of $C$ it is in charge of. Also, since the number of threads within a thread block no longer matches the number of elements of $A$ and $B$ it accesses, a for-loop is now needed to load the input elements into shared memory (A_block and B_block).
As a result, throughput increases to 745 GFLOPS.
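One way to see where the gain comes from is to count shared-memory loads per fused multiply-add (FMA). In kernel 1, every FMA reads one value from A_block and one from B_block, i.e. two loads per FMA. In kernel 2's inner loop, B_block[k][col] is the same for all THREAD_TILE_SIZE accumulators, so, assuming the compiler keeps it in a register, each k step performs $8 + 1 = 9$ shared loads for 8 FMAs, roughly 1.1 loads per FMA.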
Kernel 3: 2D Thread Tiling
template <int BLOCK_SIZE, int THREAD_TILE_SIZE>
__global__ void matmul_kernel_2d_thread_tiling(float *A, float *B, float *C, int M, int N, int K) {
int row = threadIdx.x * THREAD_TILE_SIZE;
int col = threadIdx.y * THREAD_TILE_SIZE;
int global_row = BLOCK_SIZE * blockIdx.x + row;
int global_col = BLOCK_SIZE * blockIdx.y + col;
__shared__ float A_block[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float B_block[BLOCK_SIZE][BLOCK_SIZE];
float acc[THREAD_TILE_SIZE][THREAD_TILE_SIZE];
for (int wm=0; wm<THREAD_TILE_SIZE; wm++) {
for (int wn=0; wn<THREAD_TILE_SIZE; wn++) {
acc[wm][wn] = 0.0f;
}
}
const int num_tiles = (K + BLOCK_SIZE - 1) / BLOCK_SIZE;
for(int t = 0; t < num_tiles; ++t){
const int tiled_row = BLOCK_SIZE * t + row;
const int tiled_col = BLOCK_SIZE * t + col;
for(int w1 = 0; w1 < THREAD_TILE_SIZE; w1++){
for(int w2 = 0; w2 < THREAD_TILE_SIZE; w2++){
A_block[row+w1][col+w2] = A[(global_row+w1) * K + tiled_col + w2];
B_block[row+w1][col+w2] = B[(tiled_row+w1) * N + global_col + w2];
}
}
__syncthreads();
for(int k = 0; k < BLOCK_SIZE; ++k){
for(int w2 = 0; w2 < THREAD_TILE_SIZE; w2++){
for(int w1 = 0; w1 < THREAD_TILE_SIZE; w1++){
acc[w1][w2] += A_block[row+w1][k] * B_block[k][col+w2];
}
}
}
__syncthreads();
}
for(int w1 = 0; w1 < THREAD_TILE_SIZE; w1++){
for(int w2 = 0; w2 < THREAD_TILE_SIZE; w2++){
if(global_col+w2 < N && global_row + w1 < M)
C[(global_row+w1) * N + global_col+w2] = acc[w1][w2];
}
}
}
// Host code for launch
const int BLOCK_SIZE = 64, THREAD_TILE_SIZE = 8;
blockDim = dim3(BLOCK_SIZE/THREAD_TILE_SIZE, BLOCK_SIZE/THREAD_TILE_SIZE);
gridDim = dim3((M+BLOCK_SIZE-1)/BLOCK_SIZE, (N+BLOCK_SIZE-1)/BLOCK_SIZE);
matmul_kernel_2d_thread_tiling<BLOCK_SIZE, THREAD_TILE_SIZE><<<gridDim, blockDim>>>(a_d, b_d, c_d, M, N, K);

In the previous kernel 2, we improved performance by increasing the number of elements of $C$ each thread computes, grouping adjacent elements along one direction. To optimize the kernel further, we can apply such thread tiling in two dimensions. In the figure above, the thread block computes a $4\times 4$ sub-matrix of $C$ (colored blue), but since each thread now computes a $2\times 2$ patch rather than a one-dimensional strip, blockDim becomes $(2, 2)$ instead of $(4, 4)$. This further improved the throughput to 2150 GFLOPS!
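Repeating the load-per-FMA count from kernel 2: with THREAD_TILE_SIZE = 8, each k step of the inner loops performs $8\times 8 = 64$ FMAs while touching only 8 values of A_block and 8 values of B_block. Assuming the compiler keeps those 16 values in registers across the unrolled loops, that is about 0.25 shared-memory loads per FMA, compared with 2 in kernel 1 and about 1.1 in kernel 2.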
Kernel 4: Tensor Core
#include <mma.h>        // WMMA API (nvcuda::wmma)
using namespace nvcuda;

#define WMMA_SIZE 16
__global__ void matmul_tc(half *A, half *B, float *C, int M, int N, int K){
int global_row = blockIdx.x * WMMA_SIZE;
int global_col = blockIdx.y * WMMA_SIZE;
wmma::fragment<wmma::matrix_a, WMMA_SIZE, WMMA_SIZE, WMMA_SIZE, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, WMMA_SIZE, WMMA_SIZE, WMMA_SIZE, half, wmma::row_major> b_frag;
wmma::fragment<wmma::accumulator, WMMA_SIZE, WMMA_SIZE, WMMA_SIZE, float> c_frag;
wmma::fill_fragment(c_frag, 0.0f);
const int num_tiles = (K + WMMA_SIZE - 1) / WMMA_SIZE;
for(int t = 0; t < num_tiles; ++t){
int tiled_col = t * WMMA_SIZE, tiled_row = t * WMMA_SIZE;
wmma::load_matrix_sync(a_frag, &A[global_row * K + tiled_col], K);
wmma::load_matrix_sync(b_frag, &B[tiled_row * N + global_col], N);
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}
wmma::store_matrix_sync(&C[global_row * N + global_col], c_frag, N, wmma::mem_row_major);
}
// Host code for launch
blockDim = dim3(32, 1);
gridDim = dim3((M+WMMA_SIZE-1)/WMMA_SIZE, (N+WMMA_SIZE-1)/WMMA_SIZE);  // x covers rows (blockIdx.x -> global_row)
matmul_tc<<<gridDim, blockDim>>>(a_d, b_d, c_d, M, N, K);
Now tensor cores come into play. Kernels 4 and 5 use tensor cores to perform matrix multiplication on half-precision (FP16) inputs. Tensor cores are specialized compute units that exist separately from the CUDA cores in each streaming multiprocessor of the GPU, allowing the threads within a warp to cooperatively perform a fast multiplication of two $16\times 16$ matrices. However, tensor cores only accept inputs in reduced precision, such as FP16, BF16, or even sub-8-bit data types. The output (matrix C) can be kept in full precision, but some loss of precision must be accepted.
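Because the earlier kernels take float inputs, the matrices have to be converted to half before kernels 4 and 5 can be launched. Below is a minimal sketch of a device-side conversion; the buffer names a_f32_d and a_half_d are hypothetical and not part of the project code.
#include <cuda_fp16.h>

// Convert an FP32 buffer to FP16 element-wise.
__global__ void float_to_half(const float *src, half *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __float2half(src[i]);
}
// Example launch for A (M*K elements); B is converted the same way.
// float_to_half<<<(M * K + 255) / 256, 256>>>(a_f32_d, a_half_d, M * K);
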
I used the WMMA (Warp Matrix Multiply-Accumulate) API to utilize tensor cores, partly because CUTLASS was not allowed for the project. The basic principle is similar to block tiling (kernel 1), but blockDim must be set to (32, 1) so that a thread block is exactly one warp. Thus, a thread block (= one warp) processes a $16\times 16$ sub-matrix of the resultant matrix $C$.
WMMA operates by multiplying wmma::matrix_a (the left matrix) by wmma::matrix_b (the right matrix) and accumulating the result into wmma::accumulator. Therefore, wmma::fragment is employed to define the input and output matrices recognized by the tensor cores; the buffers defined this way are actually distributed across each thread's registers. Then, as in kernel 1, tiles are slid along the K dimension while the resulting sub-matrix is accumulated. wmma::load_matrix_sync loads a $16\times 16$ tile from global memory into a wmma::fragment, wmma::mma_sync performs the actual matrix multiply-accumulate on the loaded values, and wmma::store_matrix_sync writes the data in c_frag back to its global memory location. Using tensor cores, the throughput increased to 3066 GFLOPS.
Kernel 5: Tensor Core + Warp Tiling
template<int WARP_TILE_SIZE1, int WARP_TILE_SIZE2>
__global__ void matmul_tc_2d_warp_tiling(half *A, half *B, float *C, int M, int N, int K) {
int global_row = blockIdx.x * WMMA_SIZE * WARP_TILE_SIZE1; // 0 ~ M
int global_col = blockIdx.y * WMMA_SIZE * WARP_TILE_SIZE2; // 0 ~ N
wmma::fragment<wmma::matrix_a, WMMA_SIZE, WMMA_SIZE, WMMA_SIZE, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, WMMA_SIZE, WMMA_SIZE, WMMA_SIZE, half, wmma::row_major> b_frag[WARP_TILE_SIZE2];
wmma::fragment<wmma::accumulator, WMMA_SIZE, WMMA_SIZE, WMMA_SIZE, float> c_frag[WARP_TILE_SIZE1][WARP_TILE_SIZE2];
for(int i = 0; i < WARP_TILE_SIZE1; i++){
for(int j = 0; j < WARP_TILE_SIZE2; j++)
wmma::fill_fragment(c_frag[i][j], 0.0f);
}
const int num_tiles = (K + WMMA_SIZE - 1) / WMMA_SIZE;
for(int t = 0; t < num_tiles; ++t){
int tiled_col = t * WMMA_SIZE, tiled_row = t * WMMA_SIZE;
for(int j = 0; j < WARP_TILE_SIZE2; j++)
wmma::load_matrix_sync(b_frag[j], &B[tiled_row * N + global_col + j * WMMA_SIZE], N);
for(int i = 0; i < WARP_TILE_SIZE1; i++){
wmma::load_matrix_sync(a_frag, &A[(global_row + i * WMMA_SIZE) * K + tiled_col], K);
for(int j = 0; j < WARP_TILE_SIZE2; j++){
wmma::mma_sync(c_frag[i][j], a_frag, b_frag[j], c_frag[i][j]);
}
}
}
for(int i = 0; i < WARP_TILE_SIZE1; i++){
for(int j = 0; j < WARP_TILE_SIZE2; j++){
wmma::store_matrix_sync(&C[(global_row + i * WMMA_SIZE) * N + global_col + j * WMMA_SIZE], c_frag[i][j], N, wmma::mem_row_major);
}
}
}
// Host code for launch
const int WARP_TILE_SIZE1 = 4, WARP_TILE_SIZE2 = 4;
blockDim = dim3(32, 1);
gridDim = dim3((M+WMMA_SIZE-1)/WMMA_SIZE/WARP_TILE_SIZE1, (N+WMMA_SIZE-1)/WMMA_SIZE/WARP_TILE_SIZE2);  // x covers rows (blockIdx.x -> global_row)
matmul_tc_2d_warp_tiling<WARP_TILE_SIZE1, WARP_TILE_SIZE2><<<gridDim, blockDim>>>(a_d, b_d, c_d, M, N, K);
Considering the powerful performance of tensor cores, the throughput increase in kernel 4 does not seem significant. In particular, given that we sacrificed precision, a roughly 1.5x improvement is not very satisfying. This can be attributed to significant overhead incurred outside the actual WMMA computation.
Therefore, applying 2D tiling similar to what was used in kernel 3 increases the workload per warp, thereby effectively reducing the non-computational overhead. With this applied, each thread block computes a larger sub-matrix than the previous $16\times 16$. In the code, this is achieved by defining arrays of wmma::fragment's and performing multiple computations per load. Applying this optimization increased the final throughput to 6688 GFLOPS.
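Counting WMMA operations per K-tile makes the benefit concrete: kernel 4 issues two wmma::load_matrix_sync calls for every wmma::mma_sync, whereas kernel 5 issues WARP_TILE_SIZE1 + WARP_TILE_SIZE2 = 8 loads for $4\times 4 = 16$ mma_sync calls, i.e. only 0.5 loads per multiply-accumulate.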
Conclusion

While the naive kernel 0 achieved only 103 GFLOPS, kernel 5, with tensor cores and warp tiling applied, achieved 6688 GFLOPS, a 65-fold speedup. Even kernel 3, which doesn't sacrifice any precision, achieved a 21-fold speedup over kernel 0, demonstrating how important kernel optimizations are. I also note that, while implementing each kernel, I considered many factors, including but not limited to the number of global memory accesses, coalesced memory access, and shared memory bank conflicts.
The code above doesn't implement boundary checks, so, depending on the kernel, it generally produces correct results only when the matrix dimensions are multiples of 32. Boundary checks can be implemented by writing 0 for out-of-bounds elements when loading values into shared memory, or by zero-padding the input matrices.
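As a minimal sketch of the first approach, kernel 1's shared-memory loads and final store could be guarded as follows (an illustration, not the project's actual code):
// Guarded loads: out-of-bounds elements contribute 0 to the accumulation.
A_block[row][col] = (global_row < M && tiled_col < K) ? A[global_row * K + tiled_col] : 0.0f;
B_block[row][col] = (tiled_row < K && global_col < N) ? B[tiled_row * N + global_col] : 0.0f;
// ... accumulation loop unchanged ...
// Guarded store: only threads mapped to valid elements of C write a result.
if (global_row < M && global_col < N)
    C[global_row * N + global_col] = acc;
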
References
- Otomo Hiroyuki, "Tensorコアを使ってみた," Fixstars Tech Blog
- "OpenCL matrix-multiplication SGEMM tutorial"
- Lei Mao, "CUDA Matrix Multiplication Optimization"
- Jeremy Appleyard and Scott Yokim, "Programming Tensor Cores in CUDA 9," NVIDIA Technical Blog
- Andrew Kerr, Duane Merrill, Julien Demouth and John Tran, "CUTLASS: Fast Linear Algebra in CUDA C++," NVIDIA Technical Blog