LUT Tensor Core

Aug. 5, 2025, 9:34 p.m. · 7 min read

architecture paper review MLSys quantization

Z. Mo et al., “LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference”, ISCA ‘25

Recently, I have been doing some research on model compression, including quantization. This got me interested in architectural support for the area and led me to this paper. "LUT Tensor Core" implements tensor-core-like hardware that uses lookup tables for extreme quantization scenarios such as binary or ternary (1.58-bit) quantization. It was fascinating to see how a HW-SW co-design built around lookup tables can implement mixed-precision GEMM (mpGEMM) so efficiently.

Introduction

Mixed-precision GEMM (mpGEMM) occurs during quantized inference when activation and weight precisions differ. For example, input activations may be in FP16/INT8 while weights are INT4/2/1. In practice, however, hardware dequantizes the low-bit values before the actual computation, which is inefficient.
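To make the baseline concrete, here is a minimal NumPy sketch, my own illustration rather than anything from the paper, of dequantization-based mpGEMM: INT4 weight codes are expanded back to FP16 and then fed to an ordinary GEMM, so compute and memory bandwidth are spent on the full-precision copy.

```python
import numpy as np

# Minimal sketch of dequantization-based mpGEMM (illustrative, not the paper's code).
# A: FP16 activations, shape (M, K)
# q_w: unsigned INT4 weight codes, shape (K, N), with scale s_w and zero-point z_w.
def dequant_mpgemm(A, q_w, s_w, z_w):
    # Dequantize: expand low-bit weights back to FP16 before the GEMM.
    W = (q_w.astype(np.float16) - z_w) * s_w  # (K, N) in FP16
    return A.astype(np.float16) @ W           # ordinary FP16 GEMM

# Example usage with random data.
M, K, N = 4, 8, 5
A = np.random.randn(M, K).astype(np.float16)
q_w = np.random.randint(0, 16, size=(K, N)).astype(np.uint8)  # 4-bit codes
s_w, z_w = np.float16(0.1), np.float16(8)
out = dequant_mpgemm(A, q_w, s_w, z_w)
```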

Look-up table (LUT)-based mpGEMM, which becomes feasible in the low-bit regime, has recently gained a lot of attention. The idea is to precompute the dot-product results of a group of activations against every possible low-bit weight pattern and store them in a table, which the weight bits then index. This is especially effective in extreme quantization scenarios like INT1 (e.g., BitNet).

However, the authors point out that LUT-based mpGEMM does not perform well in practice:
- SW issue: LUT kernels are not well supported by existing instructions, so their memory access patterns are poorly optimized and they end up slower than dequantization-based kernels.
- HW issue: conventional LUT hardware was not designed with mpGEMM in mind and thus lacks the relevant optimizations; moreover, storing the precomputed results incurs significant overhead.

LUT Tensor Core copes with these problems through HW-SW co-design: precomputation and table storage management, which are tricky to handle in hardware, are delegated to software. The design consists of the following three parts:

  1. SW optimization:
     - Precomputation is split out as an independent operator and fused with neighboring operators, reducing the number of memory accesses.
     - {0, 1} weights are reinterpreted as {-1, +1} to halve the table size, mitigating the storage overhead.
  2. HW customization:
     - A bit-serial-like circuit supports various mixed-precision combinations.
     - Design space exploration (DSE) determines the shape of the LUT-based tensor core, finding that elongated tiles enable efficient table reuse.
  3. Instruction and compilation support:
     - The existing matrix multiply-accumulate (MMA) instruction set is extended with LUT-based MMA (LMMA) instructions, and a compilation stack generates kernels that use them.

The authors validated the effectiveness of LUT Tensor Core by evaluating it on Accel-Sim¹, a GPU simulator.

Background and Motivation

LLM Inference and Low-Bit Quantization
The authors first discuss the importance of quantization, including PTQ and QAT, specifically mentioning BitNet and ParetoQ. This low bit-width regime is what their method is optimized for. They also note the difficulty of activation quantization, which stems from the nature of activations: they are generated on the fly and have high variance due to outliers.

LUT-Based mpGEMM for Low-Bit LLM
LUT-based mpGEMM breaks a matrix multiplication down into smaller tiles and precomputes, within each tile, the dot products of the tile's actual activation values with every possible low-bit weight pattern. Because the table is built from the activations that actually occur, only the weight patterns need to be enumerated; there is no need to tabulate over every possible activation value. For comparison, precomputing all possible FP16 × INT4 products would require $2^{16} \times 2^4$ entries, which is far more expensive.
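For contrast with the dequantization baseline above, here is a minimal sketch (again my own, with an assumed group size of K = 4 and 1-bit weights) of the LUT approach for a single activation group: the table is built once from the activations, and each weight group then needs only a table lookup.

```python
import numpy as np

K = 4  # weights per group (table index width)

def build_lut(act_group):
    """Precompute all 2**K dot products between act_group and every
    possible 1-bit weight pattern in {0, 1}^K."""
    lut = np.zeros(2 ** K, dtype=np.float32)
    for pattern in range(2 ** K):
        bits = [(pattern >> i) & 1 for i in range(K)]
        lut[pattern] = sum(a * b for a, b in zip(act_group, bits))
    return lut

def lut_dot(act_group, weight_bits):
    """Dot product via a single table lookup instead of K multiply-adds."""
    lut = build_lut(act_group)
    index = sum(b << i for i, b in enumerate(weight_bits))
    return lut[index]

acts = np.random.randn(K).astype(np.float32)
w = [1, 0, 1, 1]  # one group of 1-bit weights
assert np.isclose(lut_dot(acts, w), np.dot(acts, w))
```

In a real kernel the same table is reused across all weight columns of the tile, which is where the savings come from.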

Gaps in Current LUT-based Solutions
The existing HW/SW implementations are not sufficient to support LUT-based mpGEMM: on the software side, existing LUT kernels underperform dequantization-based ones, and on the hardware side, existing LUT designs are not tailored to mpGEMM and suffer from table storage overhead.

LUT Tensor Core Design

SW-Based Table optimization

Naively, $(2^{\text{W\_BIT}})^K$ table entries are required for a group of $K$ weights of bit width W\_BIT. LUT Tensor Core addresses this overhead with an optimization similar to bit-serial computation: a $W$-bit integer is decomposed into $W$ INT1 values, and the multiplication is carried out on them with bit shifts. This reduces the table size to $2^K$ but incurs significant HW overhead in a naive design. LUT Tensor Core improves upon this with four optimizations: (1) dataflow graph (DFG) transformation, (2) operator fusion, (3) weight reinterpretation, and (4) table quantization.
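A small sketch of the bit-serial idea, under the same toy setup as above (my own illustration): a W_BIT-bit unsigned weight is split into W_BIT one-bit planes, each plane indexes the same $2^K$-entry table, and the per-plane results are combined with shifts, so the table no longer grows with the weight bit width.

```python
import numpy as np

K = 4      # weights per group
W_BIT = 4  # weight bit width

def build_lut(act_group):
    # 2**K entries: dot products with every {0,1}^K pattern (same table as above).
    lut = np.zeros(2 ** K, dtype=np.float32)
    for pattern in range(2 ** K):
        lut[pattern] = sum(((pattern >> i) & 1) * act_group[i] for i in range(K))
    return lut

def bit_serial_dot(act_group, q_w):
    """Dot product of act_group with W_BIT-bit unsigned weights q_w,
    using one 2**K-entry table instead of a (2**W_BIT)**K-entry one."""
    lut = build_lut(act_group)
    acc = 0.0
    for b in range(W_BIT):                        # one "cycle" per bit plane
        plane = [(int(q) >> b) & 1 for q in q_w]  # extract bit b of every weight
        idx = sum(bit << i for i, bit in enumerate(plane))
        acc += lut[idx] * (1 << b)                # shift-and-accumulate
    return acc

acts = np.random.randn(K).astype(np.float32)
q_w = np.random.randint(0, 2 ** W_BIT, size=K)
assert np.isclose(bit_serial_dot(acts, q_w), np.dot(acts, q_w))
```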

Precomputing the lookup table with DFG transformation and operator fusion

Table precomputation is split out as an independent operator in the dataflow graph and fused with the preceding element-wise operator, so the table is generated without additional memory round trips.

Reinterpreting weights for table symmetrization

The original quantized weight $q_w$ is an unsigned integer, and the real value it represents is:
$$r_w = s_w(q_w - z_w)$$
By transforming to $$q'_w = 2q_w - (2^K - 1), \quad s'_w = s_w/2, \quad z'_w = 2z_w + 1 - 2^K$$,
$r_w =s_w^{\prime}(q_w^{\prime}-z_w^{\prime})$ remains unchanged while $q_w^\prime$ now becomes symmetric around 0 (symmetrization).

With this representation, the dot product with the activations becomes $$DP = \sum_i Act_i\, s_w(q_{wi} - z_w) = \sum_i Act_i\, s'_w (q'_{wi} - z'_w).$$ Through symmetrization, it holds that

$$\text{LUT}[W_3 W_2 W_1 W_0] = \begin{cases} -\text{LUT}[\sim (W_2 W_1 W_0)], & \text{if } W_3 = 1 \\ \text{LUT}[W_2 W_1 W_0], & \text{if } W_3 = 0, \end{cases}$$

which reduces the number of table entries by half ($2^K \rightarrow 2^{K-1}$).
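A quick numeric check of both identities (my own, with K = 4): the reinterpretation leaves $r_w$ unchanged, and with the index bits read as ±1 the table becomes antisymmetric, so only the half with $W_3 = 0$ needs to be stored.

```python
import numpy as np

K = 4
rng = np.random.default_rng(0)

# (1) The reparameterization preserves the real weight value r_w.
q_w, s_w, z_w = 5, 0.1, 3                  # example K-bit quantized weight
q_w2 = 2 * q_w - (2 ** K - 1)              # q'_w
s_w2 = s_w / 2                             # s'_w
z_w2 = 2 * z_w + 1 - 2 ** K                # z'_w
assert np.isclose(s_w * (q_w - z_w), s_w2 * (q_w2 - z_w2))

# (2) With index bits reinterpreted as {-1, +1}, the table is antisymmetric.
acts = rng.standard_normal(K)
lut = np.zeros(2 ** K)
for pattern in range(2 ** K):
    signs = [1.0 if (pattern >> i) & 1 else -1.0 for i in range(K)]
    lut[pattern] = float(np.dot(acts, signs))

mask = 2 ** K - 1  # 0b1111
for pattern in range(2 ** K):
    assert np.isclose(lut[pattern], -lut[pattern ^ mask])
# Entries with the top bit set are recoverable from those with it clear,
# halving storage from 2**K to 2**(K-1).
```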

Table quantization

When the activation precision is high (e.g., FP16/FP32), LUT Tensor Core quantizes the precomputed table itself to a lower precision (e.g., INT8). By keeping the group size small (e.g., 4), accuracy is maintained while the hardware is simplified.
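A minimal sketch of the idea (my own; the exact rounding and scale granularity are assumptions, not the paper's scheme): the FP table built for a small activation group is itself quantized to INT8 with a per-table scale, and a lookup returns the rescaled entry.

```python
import numpy as np

K = 4  # a small group size keeps the table's quantization error low

def build_fp_lut(act_group):
    # FP32 table of all 2**K partial sums for 1-bit weights (as in the sketches above).
    return np.array(
        [sum(((p >> i) & 1) * act_group[i] for i in range(K)) for p in range(2 ** K)],
        dtype=np.float32,
    )

def quantize_lut(lut_fp, n_bits=8):
    # Symmetric per-table quantization to INT8 (assumed scheme, for illustration).
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(float(np.max(np.abs(lut_fp))), 1e-8) / qmax
    lut_q = np.clip(np.round(lut_fp / scale), -qmax, qmax).astype(np.int8)
    return lut_q, scale

acts = np.random.randn(K).astype(np.float32)
lut_fp = build_fp_lut(acts)
lut_q, scale = quantize_lut(lut_fp)

# A lookup now reads an INT8 entry and rescales it.
idx = 0b1011
print(lut_fp[idx], lut_q[idx] * scale)
```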

LUT-based Tensor Core Microarchitecture

Simplified LUT unit design with bit-serial

As discussed above, a bit-serial-like circuit is introduced to support various precision combinations. That is, a weight of bit width W_BIT is processed over W_BIT cycles, so bit widths greater than 1 are handled serially.

Elongated LUT tiling

Akin to earlier tensor cores, which process matmuls tile by tile, the LUT Tensor Core needs appropriate tile dimensions.

Consider an $M \times N \times K$ tile with $M$ tables, $N$ weight sets, and $K$ weights per set.

This makes elongated tile shapes advantageous (see the sketch below), since:
- A large $K$ causes an exponential increase in table entries, which is undesirable.
- A large $N$ lets the MUX units reuse table entries extensively, which is beneficial.
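A back-of-the-envelope sketch (my own toy model, not the paper's actual DSE cost model) that captures this trade-off: table storage grows exponentially in $K$, while each additional weight set in $N$ reuses the already-built tables.

```python
# Toy tile cost model (illustrative only, not the paper's DSE formulation).
def tile_stats(M, N, K):
    table_entries = M * 2 ** (K - 1)  # storage per tile, after symmetrization
    reuse_per_table = N               # each table is read by all N weight sets
    return table_entries, reuse_per_table

# A square-ish tile vs. an elongated one (large N, small K) with the same M.
for M, N, K in [(8, 8, 8), (8, 64, 4)]:
    entries, reuse = tile_stats(M, N, K)
    print(f"M={M} N={N} K={K}: table entries={entries}, reuse per table={reuse}")
```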

Instruction and Compilation

LUT-based MMA instructions

The existing GPU architecture is extended to integrate LUT Tensor Core. Specifically, the following instruction set is created:

`lmma.{M}{N}{K}.{A_dtype}{W_dtype}{Accum_dtype}{O_dtype}`

Compilation support and optimizations

End-to-end LLM compilation is implemented using TVM, Roller, and Welder to automatically generate kernels that utilize LUT-based Tensor Core.

The compilation process can be summarized as follows:

  1. DFG Transformation: Given the model's DFG, each mpGEMM operator is decomposed into a precompute operator plus a LUT-mpGEMM operator. The transformed graph is passed to Welder's graph optimization.
  2. Operator Fusion: Welder fuses the precompute operator with the preceding element-wise operator.
  3. LUT-mpGEMM Scheduling: a tiling strategy is determined with the memory hierarchy in mind. Since the inputs of LUT-mpGEMM have different data types, tiles are described by their data sizes rather than by their shapes alone (see the sketch after this list); they are registered through Roller's rTile interface, which then searches for the optimal configuration.
  4. Code Generation: Once the final scheduling plan is determined, TVM is used for code generation. LMMA instructions are registered as TVM intrinsics, allowing TVM to generate kernel code that incorporates LMMA instructions according to the scheduling plan.
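Step 3 can be illustrated with a toy byte-size calculation (my own; not Roller's actual rTile interface): because the activation and weight operands have different data types, the same logical tile shape has very different per-operand footprints, which is why tiles are described by size rather than by shape alone.

```python
# Toy illustration of size-based (rather than shape-based) tile description.
BYTES_PER_ELEM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5, "int1": 0.125}

def tile_footprint(M, N, K, a_dtype, w_dtype):
    a_bytes = M * K * BYTES_PER_ELEM[a_dtype]  # activation tile footprint
    w_bytes = K * N * BYTES_PER_ELEM[w_dtype]  # weight tile footprint
    return a_bytes, w_bytes

# The same logical 64x64x64 tile: FP16 activations dominate the footprint when
# weights are INT1, so a size-aware tiler will pick different shapes than a
# shape-only one.
print(tile_footprint(64, 64, 64, "fp16", "int1"))  # (8192.0, 512.0)
```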

Evaluation


  1. https://accel-sim.github.io, a simulator providing A100 simulation and other capabilities. 
