• Technical Conference: 30 March – 03 April 2025
  • Exhibition: 01 – 03 April 2025
  • Moscone Center, San Francisco, California, USA

Short and Sweet: How Do We Cost-Optimize a 10 Meter Link for Scaling Up Machine Learning Clusters?

Sunday, 30 March, 16:00 – 18:30

Rooms 211-212 (Level 2)

In response to the exploding size of large language models, machine learning clusters are rapidly scaling up in node injection bandwidth and in number of nodes. GPU injection bandwidths are currently ~10 Tbps per GPU and roughly doubling every two years. Cluster sizes are trending upward as well, spanning many racks in some cases, and link reliability becomes ever more important as clusters grow. The industry therefore needs cost-effective, reliable, and extremely dense intra-cluster scale-up links. Our speakers will take their best shots at the following aspirational goals for the 2030 timeframe (a rough budget sketch follows the list):

  • 10 meter reach;
  • > 1 Tbps/mm (per direction) of host ASIC die-edge bandwidth density;
  • Link energy well under 5 pJ/bit, host ASIC to host ASIC, including any DSP and laser sources;
  • Reliability better than 10 FIT per link (normalized to a 400 Gbps link) [note: copper today is roughly 1 FIT];
  • Cost < $0.10/Gbps host to host, including fiber and connectors.
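To put these targets in perspective, the short Python sketch below works out the rough per-GPU budget they imply. The 2025 baseline (~10 Tbps per GPU) and the two-year doubling cadence come from the abstract above; the 2030 projection and the 1,000-GPU cluster size are illustrative assumptions, not figures from the session.

```python
# Illustrative back-of-the-envelope budget implied by the session's targets.
# Assumptions (hypothetical, for illustration only): ~10 Tbps/GPU in 2025,
# bandwidth doubling every two years (per the abstract), and a 1,000-GPU cluster.

GPU_BW_2025_TBPS = 10.0          # per-GPU injection bandwidth today (from abstract)
DOUBLING_PERIOD_YEARS = 2.0      # bandwidth doubling cadence (from abstract)
TARGET_YEAR = 2030

# Projected per-GPU injection bandwidth in 2030.
bw_2030_tbps = GPU_BW_2025_TBPS * 2 ** ((TARGET_YEAR - 2025) / DOUBLING_PERIOD_YEARS)

# Link energy: target is well under 5 pJ/bit, host ASIC to host ASIC.
ENERGY_PJ_PER_BIT = 5.0
io_power_w = ENERGY_PJ_PER_BIT * 1e-12 * bw_2030_tbps * 1e12   # watts per GPU, per direction

# Die-edge density: target is > 1 Tbps/mm per direction.
die_edge_mm = bw_2030_tbps / 1.0                               # mm of beachfront per direction

# Cost: target is < $0.10 per Gbps, host to host.
io_cost_usd = 0.10 * bw_2030_tbps * 1e3                        # dollars of scale-up I/O per GPU

# Reliability: target is < 10 FIT per 400 Gbps-equivalent link.
FIT_PER_400G_LINK = 10.0
links_per_gpu = bw_2030_tbps * 1e3 / 400.0                     # 400G-equivalent links per GPU
CLUSTER_GPUS = 1_000                                           # hypothetical cluster size
cluster_fit = FIT_PER_400G_LINK * links_per_gpu * CLUSTER_GPUS
hours_between_failures = 1e9 / cluster_fit                     # FIT = failures per 1e9 hours

print(f"Projected 2030 bandwidth : {bw_2030_tbps:6.1f} Tbps per GPU")
print(f"I/O power at 5 pJ/bit    : {io_power_w:6.1f} W per GPU, per direction")
print(f"Die edge at 1 Tbps/mm    : {die_edge_mm:6.1f} mm per direction")
print(f"I/O cost at $0.10/Gbps   : ${io_cost_usd:,.0f} per GPU")
print(f"Link failure interval    : ~{hours_between_failures:,.0f} h across {CLUSTER_GPUS} GPUs")
```

Under these illustrative assumptions, the targets correspond to roughly 280 W of link power per GPU per direction, several thousand dollars of scale-up I/O cost per GPU, and, even at 10 FIT per 400 Gbps-equivalent link, a link failure on the order of once a month across a 1,000-GPU cluster, which helps explain the emphasis on energy, cost, and reliability in the goals above.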

Organizers

Trey Greer (Lead), NVIDIA, United States

Connie Chang-Hasnain, Berxel Photonics, United States

Speakers

Karl Bois, NVIDIA, United States

Tzu-Hao Chow, Broadcom, United States

Julie Eng, Coherent, United States

Ben Foo, Microsoft, United States

Thomas Liljeberg, Intel Corp., United States

Nhat Nguyen, Ayar Labs, United States

Matt Sysak, Lumentum, United States