Sunday, 30 March, 16:00–18:30
Rooms 211-212 (Level 2)
In response to exploding large language model sizes, machine learning clusters are rapidly scaling up in both node injection bandwidth and node count. GPU injection bandwidths are currently ~10 Tbps and roughly double every two years. Cluster sizes are trending upward as well, in some cases spanning many racks. Link reliability grows ever more important as clusters scale. The industry therefore needs cost-effective, reliable, and extremely dense intra-cluster scale-up links. Our speakers will take their best shots at the following aspirational goals for the 2030 timeframe:
- 10 meter reach;
- >1 Tbps/mm (per direction) of host-ASIC die-edge bandwidth density;
- Link energy well under 5 pJ/bit, host ASIC to host ASIC, including any DSP and laser sources;
- Reliability better than 10 FIT per link (normalized to a 400 Gbps link); for reference, copper links today are roughly 1 FIT;
- Cost < $0.10/Gbps host to host, including fiber and connectors.
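To put these targets in perspective, here is a back-of-envelope calculation of what they imply for a single GPU at today's ~10 Tbps injection bandwidth. The inputs come from the goals above; the derived figures (link power, link cost, die-edge length) are illustrative ceilings, not numbers from the program.

```python
# Back-of-envelope implications of the 2030 workshop targets,
# evaluated at today's ~10 Tbps per-GPU injection bandwidth.

injection_tbps = 10        # ~10 Tbps injection bandwidth per GPU (from text)
energy_pj_per_bit = 5      # energy target ceiling: well under 5 pJ/bit
cost_per_gbps = 0.10       # cost target: < $0.10/Gbps host to host
density_tbps_per_mm = 1    # density target: > 1 Tbps/mm per direction

bits_per_s = injection_tbps * 1e12

# 5 pJ/bit * 10e12 bit/s = 50 W of link power per GPU at the ceiling
link_power_w = bits_per_s * energy_pj_per_bit * 1e-12

# 10,000 Gbps * $0.10/Gbps = $1,000 of link cost per GPU at the ceiling
link_cost_usd = injection_tbps * 1e3 * cost_per_gbps

# 10 Tbps / (1 Tbps/mm) = 10 mm of die edge per direction at the floor
die_edge_mm = injection_tbps / density_tbps_per_mm

print(link_power_w, link_cost_usd, die_edge_mm)
```

With bandwidths roughly doubling every two years, these per-GPU figures scale accordingly toward 2030, which is why the energy, cost, and density targets are stated per bit, per Gbps, and per mm rather than per link.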
Organizers
Trey Greer (Lead), NVIDIA, United States
Connie Chang-Hasnain, Berxel Photonics, United States
Speakers
Karl Bois, NVIDIA, United States
Tzu-Hao Chow, Broadcom, United States
Julie Eng, Coherent, United States
Ben Foo, Microsoft, United States
Thomas Liljeberg, Intel Corp., United States
Nhat Nguyen, Ayar Labs, United States
Matt Sysak, Lumentum, United States