Georgios Zervas, University College London, UK
Hitesh Ballani, Microsoft, UK
Manya Ghobadi, MIT, USA
Machine learning models grow roughly 10x in size every 18 months. Custom processors (TPUs, NPUs, xPUs) used for ML tasks support significantly higher I/O bandwidth than CPUs. However, scalability and training time depend heavily on network performance. Current multi-layer electronic networks, due to over-subscription and large network diameter, introduce significant overheads and can limit the scale and efficiency of the system and of ML applications. This workshop will explore methods used to scale models (data/model/hybrid parallelism) across hundreds and thousands of processing units, discuss existing network solutions, and examine the potential and challenges of optical networks. Both current and future technologies will be presented. Questions we aim to answer include, but are not limited to:
• Can electronic packet switching and traditional pluggable transceivers sustain the performance and power consumption demanded by the rapid growth of ML models?
• What are the requirements for broad adoption of fast all-optical switching and networking?
• Will optical networks change the way we design distributed deep learning training systems and processes?
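As background for the scaling methods discussed in the workshop, a minimal sketch of data parallelism: each worker computes a gradient on its shard of the batch, and an "all-reduce" averages the results. The toy model, names, and NumPy stand-in for real workers are all illustrative assumptions, not an actual training system:

```python
import numpy as np

# Toy least-squares model: loss = ||Xw - y||^2 summed over samples,
# so the full-batch gradient is the sum of per-sample gradients.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch: 8 samples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)

def local_gradient(X_shard, y_shard, w):
    """Gradient of the squared error on one worker's data shard."""
    return 2.0 * X_shard.T @ (X_shard @ w - y_shard)

# Data parallelism: split the batch across workers (equal-size shards here).
num_workers = 4
local_grads = [local_gradient(Xs, ys, w)
               for Xs, ys in zip(np.array_split(X, num_workers),
                                 np.array_split(y, num_workers))]

# "All-reduce": average the per-worker gradients. In a real cluster this is
# the communication step whose cost depends on network performance.
global_grad = np.mean(local_grads, axis=0)

# With equal shards, this equals the single-machine gradient / num_workers.
single_machine = local_gradient(X, y, w) / num_workers
```

Model and hybrid parallelism instead partition the model's parameters (or both parameters and data) across workers, trading this single averaging step for more structured communication patterns.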
Part 1. Session on methods and current systems to support large-scale ML models.
Part 2. Potential and challenges of optical networks for ML systems.
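To make the network-bottleneck concern concrete, a back-of-envelope sketch of the time one ring all-reduce takes at different link speeds. The model size, worker count, and bandwidths below are illustrative assumptions only:

```python
# Back-of-envelope: time for one ring all-reduce of model gradients.
# In a ring all-reduce, each worker sends and receives 2*(N-1)/N times
# the gradient size, so per-step time is roughly that volume / link rate.

def ring_allreduce_seconds(model_bytes: float, num_workers: int,
                           link_gbps: float) -> float:
    """Idealized ring all-reduce time, ignoring latency and congestion."""
    volume_bits = 2 * (num_workers - 1) / num_workers * model_bytes * 8
    return volume_bits / (link_gbps * 1e9)

model_bytes = 10e9 * 4          # hypothetical 10B-parameter model, fp32 grads
for gbps in (100, 400, 1600):   # slower electronic vs. faster optical links
    t = ring_allreduce_seconds(model_bytes, num_workers=1024, link_gbps=gbps)
    print(f"{gbps:5d} Gb/s -> {t:.2f} s per all-reduce")
```

Under these assumptions, per-step communication time scales inversely with link bandwidth, which is why faster (e.g. optical) interconnects translate directly into shorter training iterations once compute is no longer the bottleneck.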