Nvidia’s Grace server CPU trades blows with AMD and Intel in detailed review — outperforms Bergamo, Genoa, and Emerald Rapids in over half of the benchmarks

Nvidia’s Grace server CPU appears to be very competitive, according to Phoronix’s review of the GH200, which includes a single Grace chip. Although Nvidia’s 72-core Arm CPU lagged behind AMD’s and Intel’s flagships in overall performance, it won more benchmarks than it lost against both the top-end Epyc 9754 and the Xeon Platinum 8592+. With more optimization for the Arm architecture, Grace could prove to be a very potent datacenter processor.

The GH200 includes a Hopper GPU and a 72-core Grace CPU with 480GB of LPDDR5X RAM. Since Nvidia doesn’t sell single Grace chips on their own, Grace Hopper systems such as the GH200 are really the only devices that can be tested to ascertain the performance of just one Grace CPU. Phoronix obtained access to a GH200 via GPTshop.ai, but only remotely. No power statistics were exposed to the remote computer, and since the publication couldn’t measure power draw at the wall, no power figures are quoted in the review.

The benchmarks were run on Linux, the most common server operating system. The review includes comparisons to many different CPUs, including dual-socket setups. For the tables below, we’ve taken the results comparing Grace to AMD’s flagship Bergamo-based Epyc 9754 and Intel’s top-end Emerald Rapids Xeon Platinum 8592+.

GH200 CPU Benchmarks

| Benchmark | Grace-Hopper GH200 | Epyc 9754 | Xeon Platinum 8592+ |
| --- | --- | --- | --- |
| High Performance Conjugate Gradient | 41.69 | 25.89 | 35.42 |
| Algebraic Multi-Grid Benchmark 1.2 | 1,997,929,111 | 2,291,049,667 | 1,839,912,667 |
| LULESH 2.0.3 | 23,185.18 | 22,356.75 | 39,468.91 |
| Xmrig 6.18.1 | 17,253 | 29,356.1 | 40,381.2 |
| John The Ripper 2023.03.14 | 68,817 | 204,828 | 178,108 |
| ACES DGEMM 1.0 | 17.94 | 43.68 | 29.14 |
| GraphicsMagick 1.3.38 Sharpen | 1,363 | 924 | 749 |
| GraphicsMagick 1.3.38 Enhance | 1,761 | 1,451 | 1,192 |
| Graph500 3.0 Median | 1,239,790,000 | 1,147,090,000 | 1,238,670,000 |
| Graph500 3.0 Max | 1,315,650,000 | 1,184,510,000 | 1,304,200,000 |
| Stress-NG 0.16.04 Matrix | 512,759.08 | 552,067.04 | 301,894.53 |
| Stress-NG 0.16.04 Matrix 3D | 17,483.02 | 8,009.21 | 13,854.38 |

These tests, where higher is better, were all measured in different units, ranging from GFLOPS to calculations per second to points. Grace’s heaviest losses fall in this set of benchmarks, which is why the CPU might not look that impressive at first glance. Still, there are workloads where Grace has big leads, like High Performance Conjugate Gradient and GraphicsMagick.

GH200 CPU Benchmarks (Lower is better)

| Benchmark | Grace-Hopper GH200 | Epyc 9754 | Xeon Platinum 8592+ |
| --- | --- | --- | --- |
| Rodinia 3.1 | 30.31 | 25.15 | 39.89 |
| NWChem 7.0.2 | 1,403.5 | 1,700.8 | 1,850.8 |
| Xompact3d Incompact3d | 254.49 | 493.5 | 323.53 |
| Xompact3d Incompact3d | 9.81 | 9.03 | 10.18 |
| Godot Compilation 4.0 | 139.1 | 118.25 | 111.96 |
| Primesieve 8.0 | 35.49 | 21.76 | 49.06 |
| Helsing 1.0-beta | 67.61 | 48.95 | 84.95 |
| DuckDB 0.9.1 IMDB | 92.08 | 147.6 | 96.87 |
| DuckDB 0.9.1 TPC-H Parquet | 148.76 | 177.13 | 134.73 |
| RawTherapee | 46.72 | 66.13 | 45.53 |
| Timed Gem 5 Compilation 23.0.1 | 180.62 | 208.58 | 174.18 |
| Overall Average Performance (higher is better) | 2,175.03 | 2,459.11 | 2,242.9 |

Grace picks up more steam in this second set of tests scored on completion time, where lower is better. In the end, the single Grace chip racks up 15 wins against Emerald Rapids and 13 wins against both Bergamo and Genoa (which isn’t included in the table, but the results are very similar). There were even some cases where Nvidia’s server CPU beat AMD’s or Intel’s in dual-socket systems. Grace was also very fast compared to Ampere’s aging Altra Max M128-30, which also uses Arm.
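
For readers who want to check the win counts, here is a minimal Python sketch that tallies Grace’s wins from the 23 results reproduced in the two tables. The list names and the `count_wins` helper are our own illustrative choices, not anything from Phoronix’s tooling, and the sketch assumes the first table is higher-is-better and the second is lower-is-better.

```python
# Each row: (benchmark, Grace GH200, Epyc 9754, Xeon Platinum 8592+).
# Values are copied from the two tables in this article.
HIGHER_IS_BETTER = [
    ("HPCG", 41.69, 25.89, 35.42),
    ("Algebraic Multi-Grid", 1_997_929_111, 2_291_049_667, 1_839_912_667),
    ("LULESH", 23_185.18, 22_356.75, 39_468.91),
    ("Xmrig", 17_253, 29_356.1, 40_381.2),
    ("John The Ripper", 68_817, 204_828, 178_108),
    ("ACES DGEMM", 17.94, 43.68, 29.14),
    ("GraphicsMagick Sharpen", 1_363, 924, 749),
    ("GraphicsMagick Enhance", 1_761, 1_451, 1_192),
    ("Graph500 Median", 1_239_790_000, 1_147_090_000, 1_238_670_000),
    ("Graph500 Max", 1_315_650_000, 1_184_510_000, 1_304_200_000),
    ("Stress-NG Matrix", 512_759.08, 552_067.04, 301_894.53),
    ("Stress-NG Matrix 3D", 17_483.02, 8_009.21, 13_854.38),
]
LOWER_IS_BETTER = [
    ("Rodinia", 30.31, 25.15, 39.89),
    ("NWChem", 1_403.5, 1_700.8, 1_850.8),
    ("Xompact3d Incompact3d (first row)", 254.49, 493.5, 323.53),
    ("Xompact3d Incompact3d (second row)", 9.81, 9.03, 10.18),
    ("Godot Compilation", 139.1, 118.25, 111.96),
    ("Primesieve", 35.49, 21.76, 49.06),
    ("Helsing", 67.61, 48.95, 84.95),
    ("DuckDB IMDB", 92.08, 147.6, 96.87),
    ("DuckDB TPC-H Parquet", 148.76, 177.13, 134.73),
    ("RawTherapee", 46.72, 66.13, 45.53),
    ("Timed Gem 5 Compilation", 180.62, 208.58, 174.18),
]


def count_wins(rows, higher_is_better):
    """Return (wins vs Epyc, wins vs Xeon) for Grace over the given rows."""
    vs_epyc = vs_xeon = 0
    for _name, grace, epyc, xeon in rows:
        if higher_is_better:
            vs_epyc += grace > epyc
            vs_xeon += grace > xeon
        else:
            vs_epyc += grace < epyc
            vs_xeon += grace < xeon
    return vs_epyc, vs_xeon


hi_epyc, hi_xeon = count_wins(HIGHER_IS_BETTER, True)
lo_epyc, lo_xeon = count_wins(LOWER_IS_BETTER, False)
wins_vs_epyc = hi_epyc + lo_epyc
wins_vs_xeon = hi_xeon + lo_xeon
total = len(HIGHER_IS_BETTER) + len(LOWER_IS_BETTER)

print(f"Grace wins {wins_vs_epyc} of {total} against the Epyc 9754")
print(f"Grace wins {wins_vs_xeon} of {total} against the Xeon Platinum 8592+")
```

On the tabulated numbers this gives 13 of 23 against the Epyc 9754 and 15 of 23 against the Xeon Platinum 8592+, matching the counts quoted above.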

However, because many of Grace’s losses were large, its overall average is about 3% behind the Emerald Rapids-powered Xeon Platinum 8592+ and about 13% behind the Bergamo-based Epyc 9754 and the Genoa-based Epyc 9654. According to Phoronix, “there still are some workloads not too well optimized for AArch64 [Arm],” which is a key reason why Grace often lost by a large margin when it did lose.
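
The percentages can be reproduced from the Overall Average Performance row, treating it as a higher-is-better score. The short Python sketch below is our own arithmetic rather than Phoronix’s method, and the function names are illustrative. It also shows that the figure depends on the baseline: the rival’s lead over Grace and Grace’s deficit relative to the rival are not the same number.

```python
# Overall Average Performance scores from the table (higher is better).
GRACE = 2175.03
EPYC_9754 = 2459.11
XEON_8592 = 2242.9


def lead_pct(rival, baseline):
    """How much faster the rival is, as a percentage of the baseline score."""
    return (rival / baseline - 1) * 100


def deficit_pct(score, rival):
    """How far the score falls short, as a percentage of the rival's score."""
    return (1 - score / rival) * 100


epyc_lead = lead_pct(EPYC_9754, GRACE)
xeon_lead = lead_pct(XEON_8592, GRACE)
grace_deficit_vs_epyc = deficit_pct(GRACE, EPYC_9754)
grace_deficit_vs_xeon = deficit_pct(GRACE, XEON_8592)

print(f"Epyc 9754 lead over Grace: {epyc_lead:.1f}%")
print(f"Xeon 8592+ lead over Grace: {xeon_lead:.1f}%")
print(f"Grace deficit vs Epyc 9754: {grace_deficit_vs_epyc:.1f}%")
print(f"Grace deficit vs Xeon 8592+: {grace_deficit_vs_xeon:.1f}%")
```

Measured as the rival’s lead, the gaps come out at roughly 13% and 3%, which lines up with the figures quoted here. Measured as Grace’s shortfall against the Epyc 9754’s score, the gap is a little under 12%.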

It’s hard to evaluate how good Grace will be as a server CPU based solely on performance, though, as efficiency is also a key metric. However, we know that the Grace superchip, which combines two Grace CPUs, has a TDP of 500 watts, implying that a single Grace likely isn’t using more than 350 watts. Early benchmarks for the superchip certainly suggest it’s very efficient, which will probably also be true for single-chip configurations.