Moscone Center
In response to exploding large language model sizes, machine learning clusters are rapidly scaling up in both per-node injection bandwidth and node count. GPU injection bandwidths are currently ~10 Tbps per GPU and roughly doubling every two years. Cluster sizes are trending upward as well, spanning many racks in some cases, and link reliability grows ever more important as clusters scale. The industry therefore needs cost-effective, reliable, and extremely dense intra-cluster scale-up links. Our speakers will take their best shots at the following aspirational goals in the 2030 timeframe:
- 10 m reach;
- >1 Tbps/mm (per direction) of host-ASIC die-edge bandwidth density;
- Link energy well under 5 pJ/bit, host ASIC to host ASIC, including any DSP and laser sources;
- Reliability better than 10 FIT per link, normalized to a 400 Gbps link (for reference, copper today is roughly 1 FIT);
- Cost < $0.10/Gbps host to host, including fiber and connectors.
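To put these targets in perspective, a back-of-the-envelope sketch (assuming the ~10 Tbps-per-GPU injection bandwidth cited above; the variable names are illustrative, not from the source) shows what the energy, cost, and reliability goals imply per GPU:

```python
# Illustrative arithmetic only: per-GPU implications of the 2030 targets,
# assuming ~10 Tbps of injection bandwidth per GPU as cited above.
injection_bw_gbps = 10_000   # 10 Tbps per GPU (assumed from the text)
energy_pj_per_bit = 5        # target: well under 5 pJ/bit
cost_per_gbps = 0.10         # target: < $0.10/Gbps host to host

# Power: 5 pJ/bit * 10e12 bit/s = 50 W of link power per GPU.
link_power_w = energy_pj_per_bit * 1e-12 * injection_bw_gbps * 1e9

# Cost: $0.10/Gbps * 10,000 Gbps = $1,000 of link cost per GPU.
link_cost_usd = cost_per_gbps * injection_bw_gbps

# Reliability: 10 FIT per 400 Gbps link -> 25 such links per GPU.
links_per_gpu = injection_bw_gbps / 400
fit_per_gpu = 10 * links_per_gpu

print(link_power_w)   # 50.0
print(link_cost_usd)  # 1000.0
print(fit_per_gpu)    # 250.0
```

Even at these aspirational numbers, link power and cost remain a meaningful fraction of a GPU's budget, which is why the targets are framed as upper bounds.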
Organizers
Trey Greer (Lead), NVIDIA, United States
Connie Chang-Hasnain, HC Meta Pte. Ltd., United States