tl;dr – the world is as expected. AMD’s software needs lots of hand-holding and support, while Nvidia’s runs very well right out of the box. Nvidia maintains a performance advantage in nearly every test we ran. https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-benchmark-part-1-training/
Key Findings
– Comparing on-paper FLOP/s and HBM bandwidth/capacity is akin to comparing cameras merely by megapixel count. The only way to tell actual performance is to run benchmarks (a minimal GEMM timing sketch follows this list).
– Nvidia’s Out of the Box Performance & Experience is amazing, and we did not run into any Nvidia-specific bugs during our benchmarks. Nvidia assigned a single engineer to us for technical support, but because we hit no Nvidia software bugs, we needed very little of it.
– AMD’s Out of the Box Experience is very difficult to work with and can require considerable patience and elbow grease to reach a usable state. On most of our benchmarks, AMD’s public stable releases of AMD PyTorch are still broken, and we needed workarounds.
– Had we not been supported by multiple teams of AMD engineers triaging and fixing the bugs in AMD software that we ran into, AMD’s results would have been much lower than Nvidia’s.
– We ran an unofficial MLPerf Training GPT-3 175B benchmark on 256 H100s, in collaboration with Sustainable Metal Cloud, to test the effects of different VBoost settings.
– For AMD, real-world performance on publicly released stable software is nowhere close to its on-paper marketed TFLOP/s. Nvidia’s real-world performance also undershoots its marketed TFLOP/s, but not by nearly as much.
– The MI300X has a lower total cost of ownership (TCO) than the H100/H200, but its training performance per dollar of TCO is still worse when using public stable releases of AMD software. This changes if one uses custom development builds of AMD software.
– Training performance is weaker, as demonstrated by the MI300X’s matrix multiplication micro-benchmarks, and its single-node training throughput on AMD’s public release software still lags that of Nvidia’s H100 and H200.
– MI300X performance is held back by AMD software. AMD’s BF16 development branches deliver better performance but have not yet been merged into the main branch of AMD’s internal repos. By the time they are merged into the main branch and into the PyTorch stable release, Nvidia’s Blackwell will already be available to everyone.
– AMD’s training performance is also held back because the MI300X does not deliver strong scale-out performance. This is due to its weaker ROCm Communication Collectives Library (RCCL) and AMD’s lower degree of vertical integration with networking and switching hardware, compared to Nvidia’s tight integration of its Nvidia Collective Communications Library (NCCL) with its InfiniBand/Spectrum-X network fabric and switches (see the all-reduce bandwidth sketch after this list).
– Many of AMD’s AI libraries are forks of Nvidia’s AI libraries, leading to suboptimal outcomes and compatibility issues.
– AMD customers tend to use hand-crafted kernels only for inference, which means their performance outside of very narrow, well-defined use cases is poor, and their flexibility to adapt to rapidly shifting workloads is non-existent.
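To make the “run benchmarks, don’t compare spec sheets” point concrete, here is a minimal sketch of a BF16 GEMM throughput micro-benchmark in PyTorch. It is illustrative only: the function name, matrix shapes, iteration counts, and timing method are assumptions on our part, not the harness behind the article’s numbers. The same script runs on both vendors, since PyTorch’s ROCm builds expose AMD GPUs through the torch.cuda API.

```python
# Minimal sketch of a BF16 GEMM throughput micro-benchmark (illustrative only;
# shapes, warmup counts, and timing method are assumptions, not the article's harness).
import torch

def gemm_tflops(m: int = 8192, n: int = 8192, k: int = 8192, iters: int = 50) -> float:
    """Return achieved BF16 TFLOP/s for an (m x k) @ (k x n) matmul."""
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)

    # Warm up so kernel selection/autotuning does not pollute the timing.
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1e3  # elapsed_time() returns milliseconds
    flops = 2 * m * n * k * iters            # 2*M*N*K FLOPs per GEMM
    return flops / seconds / 1e12

if __name__ == "__main__":
    print(f"Achieved: {gemm_tflops():.1f} TFLOP/s")
```

The achieved figure can then be compared directly against each vendor’s marketed TFLOP/s, which is exactly the gap the findings above describe.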
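The scale-out point about RCCL versus NCCL can likewise be checked empirically with a collective micro-benchmark. The sketch below is an assumption-laden illustration rather than the article’s actual test setup: the function name, message size, and bus-bandwidth conversion are ours. It times a BF16 all-reduce with torch.distributed; on ROCm the “nccl” backend maps to RCCL, so the same script exercises both libraries.

```python
# Minimal sketch of an all-reduce bus-bandwidth check with torch.distributed
# (illustrative only; not the article's RCCL/NCCL test setup).
import os
import torch
import torch.distributed as dist

def allreduce_busbw_gbps(numel: int = 256 * 1024 * 1024, iters: int = 20) -> float:
    """Return approximate all-reduce bus bandwidth in GB/s for a BF16 tensor."""
    world = dist.get_world_size()
    x = torch.randn(numel, device="cuda", dtype=torch.bfloat16)

    for _ in range(5):        # warmup (summed values may overflow; only timing matters here)
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_reduce(x)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1e3
    bytes_moved = x.numel() * x.element_size() * iters
    algbw = bytes_moved / seconds
    # Convert algorithm bandwidth to bus bandwidth for ring all-reduce: 2*(n-1)/n
    return algbw * 2 * (world - 1) / world / 1e9

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # on ROCm this backend is backed by RCCL
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    bw = allreduce_busbw_gbps()
    if dist.get_rank() == 0:
        print(f"All-reduce bus bandwidth: {bw:.1f} GB/s")
    dist.destroy_process_group()
```

Launched with, e.g., `torchrun --nproc_per_node=8 allreduce_bench.py` (the filename is hypothetical), this gives a rough per-node collective bandwidth number of the kind the RCCL-versus-NCCL comparison turns on.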