Module 1: Fundamentals of ML, e.g., scalability, batch size
Lecture Session: Sep. 28, 15:00-17:00 [slides] [video]
Discussion Session: Oct. 5, 15:00-17:00 [slides] [video]
Required Reading
- Measuring the Effects of Data Parallelism on Neural Network Training [pdf]
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour [pdf]
- CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers [pdf]
- Don’t Use Large Mini-Batches, Use Local SGD [pdf]
Optional Reading
- Scaling SGD Batch Size to 32K for ImageNet Training [pdf]
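As a concrete anchor for this module's batch-size discussion, below is a minimal sketch of the linear learning-rate scaling rule with gradual warm-up described in the "Accurate, Large Minibatch SGD" reading. The base batch size, base learning rate, and warm-up length are illustrative assumptions, not values prescribed by the papers.

```python
# Sketch of the linear scaling rule with gradual warm-up
# (cf. "Accurate, Large Minibatch SGD"). Constants are illustrative.

BASE_BATCH = 256      # reference batch size
BASE_LR = 0.1         # learning rate tuned for BASE_BATCH
WARMUP_EPOCHS = 5     # epochs over which the LR ramps up


def scaled_lr(batch_size: int) -> float:
    """Linear scaling: multiply the LR by the batch-size ratio k."""
    return BASE_LR * (batch_size / BASE_BATCH)


def warmup_lr(epoch: int, batch_size: int) -> float:
    """Ramp the LR linearly from BASE_LR to the scaled LR during warm-up."""
    target = scaled_lr(batch_size)
    if epoch >= WARMUP_EPOCHS:
        return target
    return BASE_LR + (target - BASE_LR) * (epoch / WARMUP_EPOCHS)


if __name__ == "__main__":
    for epoch in range(8):
        print(epoch, round(warmup_lr(epoch, batch_size=8192), 4))
```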
Module 2: Distributed learning: data-parallelization
Lecture Session: Oct. 12, 15:00-17:00 [slides] [video]
Discussion Session: Oct. 19, 15:00-17:00 [slides] [video]
Required Reading
- Communication-Efficient Distributed Deep Learning: A Comprehensive Survey [pdf]
- TicTac: Accelerating Distributed Deep Learning with Communication Scheduling [pdf]
- Caramel: Accelerating Decentralized Distributed Deep Learning with Computation Scheduling [pdf]
- CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning [pdf]
Optional Reading
- More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server [pdf]
- Scaling Distributed Machine Learning with the Parameter Server [pdf]
- MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms [pdf]
- Gradient Coding: Avoiding Stragglers in Distributed Learning [pdf]
- Asynchronous Decentralized Parallel Stochastic Gradient Descent [pdf]
- GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent [pdf]
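Before the readings' communication-scheduling and coding details, here is a toy sketch of the synchronous data-parallel pattern they all build on: each worker computes a gradient on its own data shard, the gradients are averaged (in practice via a parameter server or all-reduce), and every worker applies the same update. The worker count, model, and data below are hypothetical.

```python
import numpy as np

# Toy synchronous data-parallel SGD on a linear least-squares model.
# Each "worker" holds a shard of the data and computes a local gradient;
# the averaged gradient (in practice obtained via all-reduce or a
# parameter server) is applied identically on every worker.

rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.01 * rng.normal(size=1000)

NUM_WORKERS = 4
shards = list(zip(np.array_split(X, NUM_WORKERS), np.array_split(y, NUM_WORKERS)))

w = np.zeros(2)
lr = 0.1
for step in range(100):
    # Each worker computes the gradient of 0.5*||Xw - y||^2 / n on its shard.
    grads = [(Xs.T @ (Xs @ w - ys)) / len(ys) for Xs, ys in shards]
    w -= lr * np.mean(grads, axis=0)   # synchronous averaging step

print("recovered weights:", w)
```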
Module 3: Distributed learning: model-parallelization
Lecture Session: Oct. 26, 15:00-17:00 [slides] [video]
Discussion Session: Nov. 2, 15:00-17:00 [slides]
Required Reading
- The TensorFlow Partitioning and Scheduling Problem [pdf]
- Device Placement Optimization with Reinforcement Learning [pdf]
- A Hierarchical Model for Device Placement [pdf]
- Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning [pdf]
- A Single-Shot Generalized Device Placement for Large Dataflow Graphs [pdf]
- Spotlight: Optimizing Device Placement for Training Deep Neural Networks [pdf]
Optional Reading
- GDP: Generalized Device Placement for Dataflow Graphs [pdf]
- Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization [pdf]
- Graph Representation Matters in Device Placement [pdf]
- Inductive Representation Learning on Large Graphs [pdf]
- Position-aware Graph Neural Networks [pdf]
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [pdf]
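The readings in this module treat device placement as learning where to put each operation of a dataflow graph. As a point of reference, the snippet below shows the simplest possible baseline for that problem: a greedy load-balancing heuristic over made-up per-op compute costs that ignores communication entirely. It is not any of the learned placers from the papers; op names, costs, and device names are assumptions.

```python
# Illustrative baseline for the device placement problem: greedily
# assign each operation to the currently least-loaded device, using
# only per-op compute costs and ignoring communication. The learned
# placers in the readings (RL, hierarchical, GNN-based) replace this
# heuristic; op names and costs here are made up.

op_costs = {"embed": 4.0, "lstm_1": 9.0, "lstm_2": 9.0,
            "attention": 6.0, "softmax": 2.0}
devices = ["gpu:0", "gpu:1"]

load = {d: 0.0 for d in devices}
placement = {}
for op, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
    d = min(load, key=load.get)       # least-loaded device so far
    placement[op] = d
    load[d] += cost

print(placement)   # e.g. {'lstm_1': 'gpu:0', 'lstm_2': 'gpu:1', ...}
print(load)        # resulting per-device compute load
```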
Module 4: Robust learning, e.g., Byzantine-resilient learning
Lecture Session: Nov. 9, 15:00-17:00 [slides] [video]
Discussion Session: Nov. 16, 15:00-17:00 [slides] [video]
Required Reading
- Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent [pdf]
- The Hidden Vulnerability of Distributed Learning in Byzantium [pdf]
- AGGREGATHOR: Byzantine Machine Learning via Robust Gradient Aggregation [pdf]
- SGD: Decentralized Byzantine Resilience [pdf]
- Fast Machine Learning with Byzantine Workers and Servers [pdf]
- DRACO: Byzantine-resilient Distributed Training via Redundant Gradients [pdf]
Optional Reading
- Generalized Byzantine-tolerant SGD [pdf]
- SoK: Security and Privacy in Machine Learning [pdf]
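To make the aggregation rules in this module concrete, here is a small sketch of Krum, the rule introduced in the first required reading: each submitted gradient is scored by the summed squared distances to its n − f − 2 nearest neighbours, and the lowest-scoring gradient is selected. The worker count and the single adversarial gradient below are illustrative.

```python
import numpy as np

# Sketch of the Krum aggregation rule ("Machine Learning with
# Adversaries: Byzantine Tolerant Gradient Descent"): score each
# gradient by the summed squared distances to its n - f - 2 nearest
# neighbours and keep the lowest-scoring one. Worker count and the
# adversarial gradient are illustrative.

def krum(grads: np.ndarray, f: int) -> np.ndarray:
    """grads: (n, d) array of worker gradients; f: Byzantine workers tolerated."""
    n = len(grads)
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1) ** 2
    scores = []
    for i in range(n):
        d = np.delete(dists[i], i)          # distances to the other workers
        scores.append(np.sort(d)[: n - f - 2].sum())
    return grads[int(np.argmin(scores))]

honest = np.random.default_rng(1).normal(loc=1.0, scale=0.1, size=(6, 3))
byzantine = np.full((1, 3), 100.0)          # one adversarial gradient
print(krum(np.vstack([honest, byzantine]), f=1))   # an honest gradient, near 1.0
```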
Module 5: Automated ML, e.g., hyperparameter optimization, neural architecture search
Lecture Session: Nov. 23, 15:00-17:00 [slides] [video]
Discussion Session: Nov. 30, 15:00-17:00 [slides] [video]
Required Reading
- Automated Machine Learning: State-of-The-Art and Open Challenges [pdf]
- BOHB: Robust and Efficient Hyperparameter Optimization at Scale [pdf]
- A System for Massively Parallel Hyperparameter Tuning [pdf]
- DARTS: Differentiable Architecture Search [pdf]
- ASAP: Architecture Search, Anneal and Prune [pdf]
Optional Reading
- AutoML Book - Hyperparameter Optimization [pdf]
- AutoML Book - Meta-Learning [pdf]
- AutoML Book - Neural Architecture Search [pdf]
- Non-stochastic Best Arm Identification and Hyperparameter Optimization [pdf]
- Hyperband: A Novel Bandit-based Approach to Hyperparameter Optimization [pdf]
- Random Search and Reproducibility for Neural Architecture Search [pdf]
- Maggy: Scalable Asynchronous Parallel Hyperparameter Search [pdf]
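Several of the readings above (Hyperband, ASHA, and the best-arm-identification paper) build on successive halving. The sketch below shows that core loop with a stand-in training function: evaluate many configurations on a small budget, keep the best fraction, and re-evaluate the survivors with a larger budget. The training function, configuration space, and constants are assumptions for illustration only.

```python
import random

# Sketch of successive halving, the subroutine underlying Hyperband and
# ASHA: evaluate many configurations cheaply, keep the best 1/eta, and
# repeat with a larger budget. The "train" function is a stand-in.

def train(config: float, budget: int) -> float:
    """Pretend validation loss: more budget -> closer to the config's true loss."""
    return config + 1.0 / budget

def successive_halving(configs, min_budget=1, eta=2, rounds=4):
    budget = min_budget
    survivors = list(configs)
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: train(c, budget))
        survivors = scored[: max(1, len(scored) // eta)]   # keep the top 1/eta
        budget *= eta                                      # grow the budget
    return survivors[0]

random.seed(0)
configs = [random.random() for _ in range(16)]
print("best config found:", successive_halving(configs))
```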
Module 6: Distributed ML frameworks and systems
Lecture Session: Dec. 7, 15:00-17:00 [slides] [video]
Discussion Session: Dec. 14, 15:00-17:00 [slides]
Required Reading
- BigDL: A Distributed Deep Learning Framework for Big Data [pdf]
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training [pdf]
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [pdf]
- Beyond Data and Model Parallelism for Deep Neural Networks [pdf]
Optional Reading
- Caffe: Convolutional Architecture for Fast Feature Embedding [pdf]
- MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems [pdf]
- TensorFlow: A system for large-scale machine learning [pdf]
- Horovod: fast and easy distributed deep learning in TensorFlow [pdf]
- Mesh-TensorFlow: Deep Learning for Supercomputers [pdf]
- MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for Scaling Deep Learning [pdf]
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [pdf]
- PyTorch: An Imperative Style, High-Performance Deep Learning Library [pdf]
- HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow [pdf]
- Towards a Scalable and Distributed Infrastructure for Deep Learning Applications [pdf]
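As a closing illustration of one systems idea from the required readings, here is a toy, single-process simulation of ZeRO-style optimizer-state sharding: each of N simulated workers keeps only its 1/N shard of the optimizer state and updates its own slice of the parameters, which are then re-assembled (a stand-in for an all-gather). The toy objective, the momentum optimizer, and all sizes are illustrative assumptions, not the ZeRO implementation itself.

```python
import numpy as np

# Conceptual, single-process simulation of optimizer-state sharding in
# the spirit of ZeRO stage 1: each worker stores only its shard of the
# momentum buffer and updates its slice of the parameters. Objective,
# optimizer, and sizes are illustrative.

NUM_WORKERS = 4
DIM = 8                      # parameter count, divisible by NUM_WORKERS
SHARD = DIM // NUM_WORKERS

params = np.zeros(DIM)
momentum_shards = [np.zeros(SHARD) for _ in range(NUM_WORKERS)]  # 1/N state each
lr, beta = 0.1, 0.9

def full_gradient(p):
    """Toy objective: pull the parameters toward the target vector 0..DIM-1."""
    return p - np.arange(DIM, dtype=float)

for step in range(200):
    g = full_gradient(params)                # in practice: all-reduced data-parallel grad
    new_shards = []
    for w in range(NUM_WORKERS):
        sl = slice(w * SHARD, (w + 1) * SHARD)
        momentum_shards[w] = beta * momentum_shards[w] + g[sl]
        new_shards.append(params[sl] - lr * momentum_shards[w])
    params = np.concatenate(new_shards)      # stand-in for the parameter all-gather

print(params.round(3))                       # approx. [0, 1, 2, ..., 7]
```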