Since 2015, NVIDIA's stock has increased roughly 40-fold, driven primarily by its dominance in the GPU market, which is crucial for AI advancements. Its data-center GPUs, such as the A100 and H100, feature Tensor Cores tailored for AI and high-performance computing, giving NVIDIA a significant edge. These AI-focused GPUs are substantially more expensive than NVIDIA's gaming GPUs, contributing to a vendor lock-in in which AI giants rely heavily on NVIDIA.
NVIDIA's CUDA platform further solidifies this lock-in, since CUDA code runs only on NVIDIA GPUs. Our research introduces Tulip, a transpilation framework that converts CUDA code to run on other vendors' GPUs, such as AMD's. Rather than simply rewriting source code, Tulip uses a compile-then-decompile approach to translate CUDA's parallelism into other programming models, such as OpenMP and OpenACC. This translation exposes additional parallelism and can even outperform the native CUDA application.
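To make the idea concrete, here is a minimal sketch of the kind of CUDA code such a framework would ingest. The `saxpy` kernel below is illustrative only, not taken from Tulip itself; it shows the hallmark of the CUDA model, where the programmer explicitly maps work onto a grid of thread blocks:

```cuda
#include <cuda_runtime.h>

// A typical CUDA kernel: work is explicitly mapped onto a grid of
// thread blocks via blockIdx/blockDim/threadIdx arithmetic.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Host-side launch: the grid and block geometry is hand-picked by
// the programmer and baked into the launch configuration.
void launch_saxpy(int n, float a, const float *x, float *y) {
    int threads = 256;                       // hand-picked block size
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, x, y);
}
```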
Our results show that Tulip-generated code running on an AMD GPU outperforms the original CUDA code on NVIDIA hardware by 46% and the HIP port on AMD by 54%. This is largely because OpenMP, unlike CUDA, doesn't require the programmer to specify an explicit work schedule, leaving the compiler free to apply more aggressive optimizations.
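To illustrate that scheduling difference, here is an OpenMP target-offload version of the same kernel. This is our own sketch of what such a translation could look like, not Tulip's actual output; the point is that no grid or block geometry appears anywhere, so the compiler and runtime are free to choose the work distribution:

```cuda
// OpenMP target-offload equivalent (plain C++ plus pragmas).
// No grid/block geometry is specified: the compiler and runtime
// pick the number of teams and threads per team, which is exactly
// the scheduling freedom that enables extra optimization.
void saxpy_omp(int n, float a, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```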