When Tachyum unveiled the concept of its Prodigy Universal Processor at Hot Chips in 2018, it made quite a splash with a chip designed to run any code using a dynamic binary translator, demonstrating high performance on both native and translated code. It took the company a while to design the actual hardware, but it is now taking pre-orders on evaluation kits and has disclosed the exact specifications of its Prodigy. They certainly look impressive, but they are also scary, with a 950W thermal design power per chip.
Formidable Performance at Formidable Power
Each Tachyum Prodigy processor has up to 128 proprietary cores mated with 16 DDR5 memory channels (for a 1,024-bit interface) supporting data transfer rates of up to 7200 MT/s (and therefore providing up to 921.6 GB/s of bandwidth) as well as 64 PCIe 5.0 lanes. In addition, the chip supports up to 8TB of DDR5 memory in total, which is in line with what we will see from upcoming server CPUs from other makers. As for clock rates, Tachyum's Prodigy is designed to run at up to 5.7 GHz and is a product of TSMC's performance-optimized N5P process technology.
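The quoted memory bandwidth follows directly from the interface width and data rate. A quick back-of-the-envelope check, assuming 16 independent 64-bit DDR5 channels (the 1,024-bit aggregate interface above):

```python
# Sanity check of Prodigy's quoted peak DDR5 bandwidth.
# Assumption: 16 channels x 64 bits each at 7200 MT/s.
channels = 16
bits_per_channel = 64
transfers_per_sec = 7200e6  # 7200 MT/s

bytes_per_transfer = channels * bits_per_channel / 8   # 128 bytes per transfer
bandwidth_gbs = bytes_per_transfer * transfers_per_sec / 1e9

print(f"{bandwidth_gbs:.1f} GB/s")  # 921.6 GB/s
```

This matches the 921.6 GB/s figure the company cites, confirming it is a peak theoretical number, not a measured one.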
When it comes to performance, Tachyum expects its flagship Prodigy T16128-AIX processor to offer up to 90 FP64 TFLOPS for HPC as well as up to 12 'AI PetaFLOPS' for inference and training, presumably when running native code and consuming up to 950W (with liquid cooling), according to specifications published by the company and by Golem.de. Meanwhile, Tachyum's Prodigy processors can work in 2-way and 4-way configurations. To put the numbers into context, AMD's Instinct MI250X has a peak throughput of 96 FP64 TFLOPS for HPC at about 560W, while Nvidia's H100 SXM5 can provide roughly 2 INT8/FP8 PetaOPS/PetaFLOPS for AI (about 4 PetaOPS/PetaFLOPS with sparsity) at 700W. Yet neither compute GPU can run general-purpose workloads. And this is exactly where it gets interesting.
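Taking the peak figures above at face value, the FP64 efficiency gap is easy to quantify (peak numbers only; real workloads will differ, and Prodigy's numbers are simulated projections):

```python
# Rough FP64 efficiency comparison from the peak figures quoted above.
chips = {
    "Tachyum Prodigy T16128-AIX": (90, 950),   # (FP64 TFLOPS, watts)
    "AMD Instinct MI250X":        (96, 560),
}
for name, (tflops, watts) in chips.items():
    print(f"{name}: {tflops / watts * 1000:.0f} GFLOPS/W")
```

By this crude measure the MI250X delivers roughly 171 GFLOPS/W versus about 95 GFLOPS/W for Prodigy, which is why the "universal CPU" pitch, rather than raw efficiency, is the chip's main selling point.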
A New CPU Is Born
Tachyum’s Prodigy is a universal homogeneous processor packing up to 128 proprietary 64-bit VLIW cores that feature two 1024-bit vector units per core and one 4096-bit matrix unit per core. In addition, each core features a 64KB instruction cache, a 64KB data cache, 1MB L2 cache, and can utilize unused L2 caches of other cores as a victim L3 cache.
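The per-core cache figures imply a sizable aggregate on-chip pool. A small sketch, assuming the full 128-core configuration:

```python
# Aggregate on-chip cache implied by the per-core figures above,
# assuming the full 128-core configuration.
cores = 128
l1i_kb, l1d_kb = 64, 64   # per-core instruction and data caches
l2_mb = 1                 # per-core L2

total_l1_mb = cores * (l1i_kb + l1d_kb) / 1024
total_l2_mb = cores * l2_mb   # also the maximum victim-L3 pool

print(f"L1 total: {total_l1_mb:.0f} MB, L2/victim-L3 pool: {total_l2_mb} MB")
```

In other words, a fully populated die carries 16 MB of L1 and up to 128 MB of L2 that idle cores can lend out as a shared victim L3.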
Tachyum’s VLIW cores are in-order, but with proper compiler optimizations they can support 4-way out-of-order issue, according to Radoslav Danilak, chief executive and co-founder of Tachyum, who spoke with Golem.de. He also re-emphasized that the Prodigy instruction set architecture can achieve very high instruction-level parallelism in software using so-called poison bits.
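Poison bits are a classic trick (Itanium called them NaT bits) for letting the compiler hoist instructions above branches without risking spurious faults. The conceptual sketch below is not Tachyum's actual ISA, just an illustration of the idea: a speculative load that would fault yields a poisoned value instead of trapping, and the trap fires only if that value is actually consumed.

```python
# Conceptual model of poison-bit control speculation (illustrative only,
# not Tachyum's real ISA). A hoisted load never traps early; the fault
# is deferred to the point where the result is genuinely used.
POISON = object()  # sentinel standing in for a hardware poison bit

def speculative_load(memory, addr):
    # Compiler hoisted this above its guarding branch; a bad address
    # must not trap here, so it returns poison instead.
    return memory.get(addr, POISON)

def check_use(value):
    # The non-speculative consumer: this is where a bad access surfaces.
    if value is POISON:
        raise MemoryError("deferred fault: poisoned value consumed")
    return value

memory = {0x100: 42}
v = speculative_load(memory, 0x200)  # would have faulted; just poisoned
# Guard turned out false, so the poisoned value is never consumed: no fault.
print(check_use(speculative_load(memory, 0x100)))  # prints 42
```

The payoff is that the compiler can schedule loads far ahead of the branches that guard them, recovering much of the parallelism an out-of-order core would find in hardware.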
These cores run native code written and explicitly optimized for Prodigy (where the VLIW architecture promises to shine) as well as x86, Arm, and RISC-V binaries via software emulation, reportedly without major performance degradation. Historically, attempts to make VLIW processors execute x86 code have failed (e.g., Transmeta’s Crusoe, Intel’s Itanium), largely because of architectural mismatches and emulation inefficiencies. The head of Tachyum admits that Qemu binary translation degrades performance by 30% to 40% (without disclosing any baselines) but hopes that real-world performance will still be high enough to be competitive. Meanwhile, some programs are already supported natively.
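That admitted overhead translates directly into the effective throughput of emulated binaries. A trivial sketch of the arithmetic, using an arbitrary baseline of 100:

```python
# Effective throughput of translated code under the 30-40% Qemu
# overhead Danilak cites (arbitrary baseline of 100 native units).
native_score = 100.0
for overhead in (0.30, 0.40):
    effective = native_score * (1 - overhead)
    print(f"{overhead:.0%} overhead -> {effective:.0f} effective units")
```

So an emulated x86 workload would land at roughly 60-70% of native Prodigy speed, which is why the company's competitiveness hinges on the native number being high to begin with.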
“We support GCC and Linux natively, and FreeBSD now also runs [on Prodigy],” said Danilak. “Apache, MongoDB or Python already run natively, Pytorch and Tensorflow frameworks are also available.”
Tachyum stresses that Prodigy is not an accelerator but an actual CPU that will compete against AMD, Intel, and others. To ensure that the processor can deliver competitive performance across general-purpose and AI workloads, the company has made numerous alterations to its design since its first introduction in 2018.
“We are a CPU replacement and not an AI accelerator company, we are targeting cloud/hyperscalers and telcos,” said Danilak. “Over time we plan to win some supercomputer customers, so we doubled the width of the vector/MAC units from 512 bits to 1,024 bits [which also brings in necessary data paths for the 4,096-bit matrix operations for artificial intelligence].”
Indeed, one particular advantage that Tachyum’s Prodigy promises is its ability to execute different kinds of code. Assuming it can provide decent performance at decent power on general-purpose workloads (instances), it may give additional flexibility to AWS, Microsoft Azure, and the like, since they would be able to use the same machines for AI, HPC, and general-purpose instances as needed. It will, of course, require some actual software work from various parties, but this might work, at least in theory.
Still Not Here
It should be noted that Tachyum still does not have any Prodigy silicon. As a result, all performance projections are a product of simulations, and the only thing the company has now is an FPGA prototype of its processor.
Meanwhile, the company recently began taking pre-orders on Tachyum’s Prodigy Evaluation Platform, which will be based on actual Prodigy silicon. Companies must place orders before July 31, 2022, and delivery of the hardware is slated for ‘six to nine months after receipt of order.’
Tachyum expects to tape out the first Prodigy silicon (which could be smaller than 500 mm²) in mid-August if everything goes as planned. After that, the company expects to get the first samples of its chip around December, and if the chip works as intended, the company plans to start sampling (i.e., sending out evaluation kits). Typically, silicon bring-up takes about a year after the initial chip returns from the fab. Still, Tachyum hopes its first processor will work as planned and that it will be able to kick off mass production in the first half of 2023.
Looking further ahead, Danilak envisions a Prodigy 2 processor built on one of TSMC’s N3 nodes that would deliver twice the performance at the same power, along with PCIe 6.0 support.