Triton简介: Open-Source GPU Programming for Neural Networks
我们发布了 Triton 1.0, 一个类似于python的开源编程语言,使研究者即使没有CUDA经验也能编写高效的GPU代码,大部分时候能取得专家级的产出。Triton使达到硬件性能的峰值成为可能,仅需要付出较小的努力。例如,可以编写 FP16 matrix multiplication kernels能够达到cuBLAS的执行效率—是很多 GPU programmers无法做到的—只需要 25 行代码。我们的研究者已使用其构建出相对Torch2倍以上效率的内核。我们非常高兴与社区一起工作使GPU编程能够为每个人所用。
在深度学习领域新奇的设计思想通常使用combination of native framework operators来实现。通常,该方法要求创建(且/或移动)大量的临时张量,会影响到神经网络在伸缩时的性能。该问题可以通过编写特殊的GPU内核来减缓,但是因为GPU编程的复杂性而引起惊人的困难。 1 2 3 并且,尽管不同的系统出现使其变得更为容易 4 5 ,我们发现要么冗长、要么缺乏灵活性或者创建的代码明显要慢于手工调优的基线。这使我们扩展和改进 Triton 6, 最近的语言和编译器的最初创建者现在已经在OpenAI开展工作。GPU 编程的挑战
现代 GPUs的架构可以大概分为三个部分—DRAM, SRAM 和 ALUs—当优化 CUDA 代码都必须要进行考虑:
-
来自DRAM的内存传输必须经过合并,从而利用现代内存接口的总线带宽。
-
Memory transfers from DRAM must be coalesced into large transactions to leverage the large bus width of modern memory interfaces.
-
-
数据在被重新使用之前必须被手动存储到SRAM中,从而在检索时减少共享内存库的冲突。
-
Data must be manually stashed to SRAM prior to being re-used, and managed so as to minimize shared memory bank conflicts upon retrieval.
-
-
计算必须在流式多处理器(SM)之间和内部仔细分区和调度,从而完成指令或线程级的并行处理,以及对专用ALU的利用。
- Computations must be partitioned and scheduled carefully, both across and within Streaming Multiprocessors (SMs), so as to promote instruction/thread-level parallelism and leverage special-purpose ALUs (e.g., tensor cores).
推论这些因子是一个挑战,即便是有多年CUDA经验的编程者。Triton的目的是完全自动化这个优化过程,因此开发者能够聚焦于并行代码的高级别的逻辑。Triton 帮助能够广泛可用从而跨SMs的调度不能自动化的部分 -- 留下一些重要的算法考虑 (e.g. tiling, inter-SM synchronization) 给开发者进行决策。如下图所示:
CUDA | Triton | |
---|---|---|
Memory Coalescing | Manual | Automatic |
Shared Memory Management | Manual | Automatic |
Scheduling (Within SMs) | Manual | Automatic |
Scheduling (Across SMs) | Manual | Manual |
编程模型-Programming Model
Domain Specific Languages 和 JIT-compilers 可用, Triton 也许最像Numba: kernels 被定义为装饰的Python functions, 通过不同的 program_id
’s 并行启动在一个 grid of so-called instances。但是,如下的代码片段, 相似性在这里停止: Triton exposes intra-instance parallelism via operations on blocks—small arrays whose dimensions are powers of two—rather than a Single Instruction, Multiple Thread (SIMT)7 execution model。如此,Triton 有效滴提取出来并行相关的问题( within CUDA thread blocks )(如memory coalescing, shared memory synchronization/conflicts, tensor core scheduling)。
BLOCK = 512
# This is a GPU kernel in Numba.
# Different instances of this
# function may run in parallel.
@jit
def add(X, Y, Z, N):
# In Numba/CUDA, each kernel
# instance itself uses an SIMT execution
# model, where instructions are executed in
# parallel for different values of threadIdx
tid = threadIdx.x
bid = blockIdx.x
# scalar index
idx = bid * BLOCK + tid
if id < N:
# There is no pointer in Numba.
# Z,X,Y are dense tensors
Z[idx] = X[idx] + Y[idx]
...
grid = (ceil_div(N, BLOCK),)
block = (BLOCK,)
add[grid, block](x, y, z, x.shape[0])
BLOCK = 512
# This is a GPU kernel in Triton.
# Different instances of this
# function may run in parallel.
@jit
def add(X, Y, Z, N):
# In Triton, each kernel instance
# executes block operations on a
# single thread: there is no construct
# analogous to threadIdx
pid = program_id(0)
# block of indices
idx = pid * BLOCK + arange(BLOCK)
mask = idx < N
# Triton uses pointer arithmetics
# rather than indexing operators
x = load(X + idx, mask=mask)
y = load(Y + idx, mask=mask)
store(Z + idx, x + y, mask=mask)
...
grid = (ceil_div(N, BLOCK),)
# no thread-block
add[grid](x, y, z, x.shape[0])
这也许不是特别地有帮助,对于那些令人尴尬的并行计算,但这将极大地简化复杂GPU程序的开发。
考虑一个栗子,fused softmax kernel (如下) 。每一个示例 normalizes a different row of the given input tensor X∈RM×NX \in \mathbb{R}^{M \times N}X∈RM×N. 标准的 CUDA 对于该并行策略的实现 can be challenging to write, requiring explicit synchronization between threads as they concurrently reduce the same row of XXX。复杂性的大部分在Triton中不复存在,每一个 kernel instance使用NumPy类似的元语载入兴趣列并正则化。
import triton
import triton.language as tl
@triton.jit
def softmax(Y, stride_ym, stride_yn, X, stride_xm, stride_xn, M, N):
# row index
m = tl.program_id(0)
# col indices
# this specific kernel only works for matrices that
# have less than BLOCK_SIZE columns
BLOCK_SIZE = 1024
n = tl.arange(0, BLOCK_SIZE)
# the memory address of all the elements
# that we want to load can be computed as follows
X = X + m * stride_xm + n * stride_xn
# load input data; pad out-of-bounds elements with 0
x = tl.load(X, mask=n < N, other=-float('inf'))
# compute numerically-stable softmax
z = x - tl.max(x, axis=0)
num = tl.exp(z)
denom = tl.sum(num, axis=0)
y = num / denom
# write back to Y
Y = Y + m * stride_ym + n * stride_yn
tl.store(Y, y, mask=n < N)
import torch
# Allocate input/output tensors
X = torch.normal(0, 1, size=(583, 931), device='cuda')
Y = torch.empty_like(X)
# SPMD launch grid
grid = (X.shape[0], )
# enqueue GPU kernel
softmax[grid](Y, Y.stride(0), Y.stride(1),
X, X.stride(0), X.stride(1),
X.shape[0] , X.shape[1])
注意 Triton JIT 对待 X 和 Y 为 pointers 而不是 tensors; 我们觉得保留内存访问的低端控制是重要的,从而可以去寻求更复杂的数据结构(e.g., block-sparse tensors)。
重要的是,该 softmax的特殊实现保留 rows of XXX 在 SRAM 包括整个正则化处理过程,最大化了数据的重用(当可用时,~<32K columns). 这不同于from PyTorch’s内部CUDA 代码,使用临时内存使其保持通用但是明显慢了不少(如下). 这儿下面一行显示Triton并不是更好,但是这简化了特殊内属内核的开发,从而能够比通目的库更快。
5,00010,000N05001000 GB/s
The lower performance of the Torch (v1.9) JIT highlights the difficulty of automatic CUDA code generation from sequences of high-level tensor operations.
@torch.jit.script
def softmax(x):
x_max = x.max(dim=1)[0]
z = x - x_max[:, None]
numerator = torch.exp(x)
denominator = numerator.sum(dim=1)
return numerator / denominator[:, None]
Matrix Multiplication
Being able to write fused kernels for element-wise operations and reductions is important, but not sufficient given the prominence of matrix multiplication tasks in neural networks. Triton 工作的很好,只需要大概 ~25 Python 代码就可以达到峰值。换句话说,实现类似的CUDA代码将需要 a lot more effort 而且可能性能更低。
@triton.jit
def matmul(A, B, C, M, N, K, stride_am, stride_ak,
stride_bk, stride_bn, stride_cm, stride_cn,
**META):
# extract metaparameters
BLOCK_M, GROUP_M = META['BLOCK_M'], META['GROUP_M']
BLOCK_N = META['BLOCK_N']
BLOCK_K = META['BLOCK_K']
# programs are grouped together to improve L2 hit rate
_pid_m = tl.program_id(0)
_pid_n = tl.program_id(1)
pid_m = _pid_m // GROUP_M
pid_n = (_pid_n * GROUP_M) + (_pid_m % GROUP_M)
# rm (resp. rn) denotes a range of indices
# for rows (resp. col) of C
rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
# rk denotes a range of indices for columns
# (resp. rows) of A (resp. B)
rk = tl.arange(0, BLOCK_K)
# the memory addresses of elements in the first block of
# A and B can be computed using numpy-style broadcasting
A = A + (rm[:, None] * stride_am + rk[None, :] * stride_ak)
B = B + (rk [:, None] * stride_bk + rn[None, :] * stride_bn)
# initialize and iteratively update accumulator
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
for k in range(K, 0, -BLOCK_K):
a = tl.load(A)
b = tl.load(B)
# block level matrix multiplication
acc += tl.dot(a, b)
# increment pointers so that the next blocks of A and B
# are loaded during the next iteration
A += BLOCK_K * stride_ak
B += BLOCK_K * stride_bk
# fuse leaky ReLU if desired
# acc = tl.where(acc >= 0, acc, alpha * acc)
# write back result
C = C + (rm[:, None] * stride_cm + rn[None, :] * stride_cn)
mask = (rm[:, None] < M) & (rn[None, :] < N)
tl.store(C, acc, mask=mask)
编写矩阵乘法内核的一个重要优势在于可以定制化去适应输入(e.g., slicing) 输出 (e.g., Leaky ReLU)的转换。如果没有 Triton这类系统, 矩阵乘法的不平凡的修改是没有GPU丰富经验的开发者难以达到的。
1,0002,0003,0004,000M = N = K020406080100 TFLOPscuBLASTritonTriton + LeakyReLUcuBLAS +torch.nn.LeakyReLU
High-Level System Architecture
Triton的良好性能来自于围绕Triton-IR的模块化系统结构,是基于LLVM的即时表达,在其中多维值块是一等公民。如下所示:
@jit
def add(X, Y, Z, N):
pid = program_id(0)
idx= pid * 512 + arange(512)
mask = idx < N
x = load(X + idx, mask=mask)
y = load(Y + idx, mask=mask)
store(Z + idx, x + y, mask=mask)
def void add(i32* X .aligned(16) , i32* Y .aligned(16) , i32* Z .aligned(16) , i32 N .multipleof(2) )
{
entry:
%0 = get_program_id[0] i32;
%1 = mul i32 %0, 512;
%3 = make_range[0 : 512] i32<512>;
%4 = splat i32<512> %1;
%6 = add i32<512> %4, %3;
%9 = splat i32<512> N;
%11 = icmp_slt i1<512> %6, %9;
%14 = splat i32*<512> X;
%16 = getelementptr i32*<512> %14, %6;
%19 = broadcast i1<512> %11;
%21 = splat i32<512> undef;
%22 = masked_load i32<512> %16, %19, %21;
%26 = splat i32*<512> Y;
%28 = getelementptr i32*<512> %26, %6;
%31 = broadcast i1<512> %11;
%33 = splat i32<512> undef;
%34 = masked_load i32<512> %28, %31, %33;
%38 = splat i32*<512> Z;
%40 = getelementptr i32*<512> %38, %6;
%43 = add i32<512> %22, %34;
%46 = broadcast i32<512> %43;
%48 = broadcast i1<512> %11;
masked_store void %40, %46, %48;
ret void;
}
.visible .entry add(
.param .u64 add_param_0, .param .u64 add_param_1,
.param .u64 add_param_2, .param .u32 add_param_3
)
.maxntid 128, 1, 1
{
.reg .pred %p<4>;
.reg .b32 %r<18>;
.reg .b64 %rd<8>;
ld.param.u64 %rd4, [add_param_0];
ld.param.u64 %rd5, [add_param_1];
mov.u32 %r13, %tid.x;
ld.param.u32 %r14, [add_param_3];
shl.b32 %r15, %r13, 2;
mov.u32 %r16, %ctaid.x;
mad.lo.s32 %r17, %r16, 512, %r15;
setp.ge.s32 %p3, %r17, %r14;
setp.lt.s32 %p1, %r17, %r14;
mul.wide.s32 %rd7, %r17, 4;
add.s64 %rd2, %rd4, %rd7;
@%p1 ld.global.cg.v4.b32 {%r5,%r6,%r7,%r8}, [ %rd2 + 0];
add.s64 %rd3, %rd5, %rd7;
@%p1 ld.global.cg.v4.b32 {%r9,%r10,%r11,%r12}, [ %rd3 + 0];
@%p3 bra LBB0_2;
ld.param.u64 %rd6, [add_param_2];
add.s64 %rd1, %rd6, %rd7;
add.s32 %r1, %r5, %r9;
add.s32 %r2, %r6, %r10;
add.s32 %r3, %r7, %r11;
add.s32 %r4, %r8, %r12;
st.global.v4.u32 [%rd1], {%r1, %r2, %r3, %r4};
LBB0_2:
ret;
}
The @triton.jit
decorator works by walking the Abstract Syntax Tree (AST) of the provided Python function so as to generate Triton-IR on-the-fly using a common SSA construction algorithm.8 最后 IR code 在编译器后台被简化、优化和自动并行化,再转换为高质量的 LLVM-IR—最终是 PTX—在最近版本的 NVIDIA GPUs中执行。CPUs 和 AMD GPUs 目前换不支持,但是我们欢迎社区贡献者一起做出努力。
Compiler Backend
我们发现通过Triton-IR的块化程序的使用允许我们的编译器自动执行广泛的重要程序的优化。例如,data can be automatically stashed to shared memory by looking at the operands of computationally intensive block-level operations (e.g., tl.dot
)—and allocated/synchronized using standard liveness analysis techniques.
S1 float A[4,4] = ...
S2 float B[4,4] = ...
S3 float C[4,4] = A + B
S1 half A[4,2] = ...
S2 half B[2,2] = ...
S3 float C[4,2] = dot(A,B)
- Definition of a Triton program P composed of three statements
S1
,S2
,S3
- Iteration space of
S3
- Mapping of
S3
onto a Stream Multiprocessor (SM)
- Mapping of P onto the GPU
贡献
我们期待Triton 成为community-driven project. 可以自由地fork our repository on GitHub!
如果感兴趣,欢迎加入我们的队伍为Triton & GPU kernels工作, we’re hiring!
-
Gray, S. (2017). SGEMM Walkthrough. URL https://github.com/NervanaSystems/maxas/wiki/SGEMM.
-
Kerr, A. (2020). Developing CUDA kernels to push Tensor Cores to the Absolute Limit on NVIDIA A100. URL https://developer.nvidia.com/gtc/2020/video/s21745-vid.
-
Yan, D., Wang, W., & Chu, X. (2020, May). Demystifying tensor cores to optimize half-precision matrix multiply. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.
-
Tillet, P., Kung, H. T., & Cox, D. (2019, June). Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (pp. 10-19).
-
Lin, Y. & Grover, V. (2018). Using CUDA Warp-Level Primitives. URL https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/.
-
Braun, M., Buchwald, S., Hack, S., Leißa, R., Mallon, C., & Zwinkau, A. (2013, March). Simple and efficient construction of static single assignment form. In International Conference on Compiler Construction (pp. 102-122). Springer, Berlin, Heidelberg.