Intel Ponte Vecchio Seemingly Offers 2.5x Higher Performance Than Nvidia’s A100

Intel has detailed its Ponte Vecchio Xe-HPC GPU at Hot Chips 34. In the provided benchmarks, the chipmaker claims that Ponte Vecchio delivers up to 2.5x more performance than the Nvidia A100. But, as is customary, take vendor-provided benchmarks with a pinch of salt.

Ponte Vecchio outperformed the A100 by significant margins in several Intel-selected benchmarks. Intel’s powerhouse also flaunted a 2x lead in miniBUDE and a 1.5x lead in ExaSMR. It’s an interesting comparison considering that Ponte Vecchio isn’t even out yet, and the A100 (Ampere) has been on the market since 2020. And let’s not forget that AMD’s Instinct MI250X (Aldebaran) is reportedly three times faster than the A100. So Intel should worry about both AMD’s and Nvidia’s next-generation HPC products.

If Intel’s numbers are accurate, Ponte Vecchio could be a potential competitor against Nvidia’s next-generation H100 (Hopper). Based on the specifications we have so far, the H100 should be at least twice as fast as the A100. What’s even more menacing is AMD’s Instinct MI300, which fuses both Zen 4 CPU and CDNA 3 GPU chiplets into a single product. AMD dubs the Instinct MI300 the world’s first data center APU and claims that it represents an 8x uplift in AI training performance compared to the Instinct MI250X.

Ponte Vecchio will come in three flavors: OAM, x4 subsystem with Xe links, and x4 subsystem with Xe links on a dual-socket Sapphire Rapids platform. Unfortunately, Sapphire Rapids has suffered so many delays that it’s not funny anymore. Barring further setbacks, some Sapphire Rapids products could finally debut in October. However, the high-volume chips may not arrive until February 2023.

In its OAM form factor, Ponte Vecchio supports both four-GPU and eight-GPU platforms. A two-stack Ponte Vecchio configuration pumps out 52 TFLOPS in both FP32 and FP64. For comparison, a single H100 SXM5 module peaks at 60 TFLOPS of FP32 and 30 TFLOPS of FP64 performance.
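
To put those peak figures side by side, here is a minimal Python sketch that divides one quoted number by the other. It uses only the TFLOPS values cited above; the dictionary layout and the `peak_ratio` helper are our own illustrative names, not anything from Intel or Nvidia. Bear in mind that these are vendor-quoted theoretical peaks, so the ratios say nothing definitive about real-world performance.

```python
# Peak throughput figures quoted in the article, in TFLOPS.
# These are vendor-stated theoretical peaks, not measured results.
PEAK_TFLOPS = {
    "Ponte Vecchio (two-stack)": {"FP32": 52.0, "FP64": 52.0},
    "H100 SXM5": {"FP32": 60.0, "FP64": 30.0},
}


def peak_ratio(numerator, denominator, precision):
    """Return the ratio of quoted peak TFLOPS at the given precision.

    Hypothetical helper for illustration only.
    """
    return PEAK_TFLOPS[numerator][precision] / PEAK_TFLOPS[denominator][precision]


for precision in ("FP32", "FP64"):
    ratio = peak_ratio("Ponte Vecchio (two-stack)", "H100 SXM5", precision)
    print(f"{precision}: Ponte Vecchio / H100 = {ratio:.2f}x")
```

On paper, that works out to Ponte Vecchio trailing the H100 slightly in FP32 (roughly 0.87x) while leading it in FP64 (roughly 1.73x).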

Ponte Vecchio features a 64MB register file that delivers up to 419 TBps of bandwidth. The L1 and L2 caches are 64MB and 408MB, respectively. The large L2 cache on Ponte Vecchio benefits specific workloads, such as the 2D-FFT and DNN cases Intel highlighted. In the presentation, Intel’s results show a substantial performance improvement in both scenarios when the L2 cache grows from 80MB to 408MB.