  • Technical Conference: 30 March – 03 April 2025
  • Exhibition: 01 – 03 April 2025
  • Moscone Center, San Francisco, California, USA

Short and Sweet: How Do We Cost-Optimize a 10 Meter Link for Scaling Up Machine Learning Clusters?

Moscone Center

In response to the explosive growth in large language model size, machine learning clusters are rapidly scaling up in both node injection bandwidth and node count. GPU injection bandwidths are currently ~10 Tbps per GPU and roughly doubling every two years. Cluster sizes are trending upwards as well, in some cases spanning many racks. Link reliability becomes ever more important as clusters grow. The industry therefore needs cost-effective, reliable, and extremely dense intra-cluster scale-up links. Our speakers will take their best shots at the following aspirational goals in the 2030 timeframe:

  • 10 meter reach;
  • > 1 Tbps/mm (per direction) of host ASIC die-edge bandwidth density;
  • Link energy well under 5 pJ/bit, host ASIC to host ASIC, including any DSP and laser sources;
  • Reliability better than 10 FIT per link (normalized to a 400 Gbps link) [note: copper today is roughly 1 FIT];
  • Cost < 0.10 $/Gbps host to host, including fiber and connectors.
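
To give a feel for what these targets imply per GPU, the arithmetic can be sketched as below. This is an illustration only, not a figure from the organizers: the function names, the 2025 baseline year, and the choice to treat each goal as a ceiling applied to one direction of a single GPU's injection bandwidth are all assumptions. One FIT is one failure per 10^9 device-hours.

```python
# Illustrative arithmetic only. Inputs are the figures quoted in the text above;
# the function names and the 2025 baseline year are assumptions for this sketch.

def projected_bandwidth_tbps(base_tbps, base_year, target_year, doubling_years=2.0):
    """Extrapolate bandwidth assuming it doubles every `doubling_years` years."""
    return base_tbps * 2 ** ((target_year - base_year) / doubling_years)


def link_budget(bandwidth_tbps, pj_per_bit=5.0, usd_per_gbps=0.10,
                tbps_per_mm=1.0, fit_per_400g=10.0):
    """Ceilings implied by the stated goals for one direction of traffic."""
    gbps = bandwidth_tbps * 1000.0
    return {
        # Tbps x pJ/bit = 1e12 bit/s x 1e-12 J/bit = W
        "power_w": bandwidth_tbps * pj_per_bit,
        "cost_usd": gbps * usd_per_gbps,
        # Die edge needed at the target bandwidth density
        "die_edge_mm": bandwidth_tbps / tbps_per_mm,
        # FIT scales with the number of 400 Gbps-equivalent links
        "fit": gbps / 400.0 * fit_per_400g,
    }


if __name__ == "__main__":
    today = link_budget(10.0)
    print(today)
    future_tbps = projected_bandwidth_tbps(10.0, 2025, 2030)
    print(round(future_tbps, 1), link_budget(future_tbps))
```

At today's ~10 Tbps per GPU, the goals correspond to ceilings of 50 W of link power, $1,000 of link cost, 10 mm of die edge, and 250 FIT. If the doubling trend holds, the same GPU would inject roughly 57 Tbps by 2030, and each ceiling scales in proportion.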

Organizers

Trey Greer (Lead), NVIDIA, United States

Connie Chang-Hasnain, HC Meta Ple. Ltd., United States