Sunday, 05 March,
Machine learning models grow tenfold every 18 months. Custom processors (TPUs, NPUs, xPUs) used for ML tasks support significantly higher I/O bandwidth than CPUs. However, scalability and training time depend heavily on network performance. Current multi-layer electronic networks, because of over-subscription and large network diameter, introduce significant overheads that could limit the scale and efficiency of both the system and the ML applications running on it. This workshop will explore the methods used to scale models (data, model, and hybrid parallelism) across hundreds or thousands of processing units, discuss existing network solutions, and examine the potential and challenges of optical networks. Both current and future technologies will be presented. The questions we aim to answer include, but are not limited to:
- Can electronic packet switching and traditional pluggable transceivers sustain the performance and power-consumption demands created by the rapid growth of ML models?
- What are the requirements for broad adoption of fast all-optical switching and networking?
- Will optical networks change how we design distributed deep learning training systems and processes?
Part 1. Session on methods and current systems to support large-scale ML models.
Part 2. Potential and challenges of optical networks for ML systems.
Hitesh Ballani, Microsoft, United Kingdom
Manya Ghobadi, Massachusetts Institute of Technology, USA
Georgios Zervas, University College London, United Kingdom
Larry Dennison, NVIDIA, USA
Michael Geiselmann, LIGENTEC, Switzerland
Alessandro Ottino, University College London, United Kingdom
Sergey Shumarayev, Intel Corporation, USA
Vladimir Stojanovic, University of California, Berkeley, USA
Cen Wang, KDDI, Japan