Module 1: Fundamentals of ML, e.g., scalability and batch size

Lecture Session: Sep. 28, 15:00-17:00 [slides] [video]

Discussion Session: Oct. 5, 15:00-17:00 [slides] [video]

Required Reading

  • Measuring the Effects of Data Parallelism on Neural Network Training [pdf]
  • Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour [pdf]
  • CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers [pdf]
  • Don’t Use Large Mini-Batches, Use Local SGD [pdf]

Optional Reading

  • Scaling SGD Batch Size to 32K for ImageNet Training [pdf]
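
To make the batch-size theme of this module concrete, below is a minimal sketch of the linear learning-rate scaling rule with gradual warmup discussed in the large-minibatch SGD reading. The base learning rate, reference batch size, and warmup length are illustrative assumptions, not values prescribed by the course.

    # Sketch of linear LR scaling with warmup (cf. "Accurate, Large Minibatch SGD").
    # base_lr, base_batch_size, and warmup_epochs are illustrative assumptions.
    def scaled_lr(epoch, batch_size, base_lr=0.1, base_batch_size=256, warmup_epochs=5):
        """Learning rate for a given epoch and global batch size."""
        target_lr = base_lr * batch_size / base_batch_size   # linear scaling rule
        if epoch < warmup_epochs:
            # Gradual warmup: ramp linearly from base_lr up to target_lr.
            return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
        return target_lr

    for epoch in range(8):
        print(epoch, round(scaled_lr(epoch, batch_size=8192), 3))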




Module 2: Distributed learning: data parallelism

Lecture Session: Oct. 12, 15:00-17:00 [slides] [video]

Discussion Session: Oct. 19, 15:00-17:00 [slides] [video]

Required Reading

  • Communication-Efficient Distributed Deep Learning: A Comprehensive Survey [pdf]
  • TicTac: Accelerating Distributed Deep Learning with Communication Scheduling [pdf]
  • Caramel: Accelerating Decentralized Distributed Deep Learning with Computation Scheduling [pdf]
  • CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning [pdf]

Optional Reading

  • More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server [pdf]
  • Scaling Distributed Machine Learning with the Parameter Server [pdf]
  • MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms [pdf]
  • Gradient Coding: Avoiding Stragglers in Distributed Learning [pdf]
  • Asynchronous Decentralized Parallel Stochastic Gradient Descent [pdf]
  • GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent [pdf]
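
As a minimal illustration of the data-parallel training these papers optimize, the sketch below simulates synchronous data-parallel SGD in a single process: each worker computes a gradient on its own shard, and averaging the gradients stands in for the all-reduce step. The linear model and synthetic data are illustrative assumptions.

    # Single-process simulation of synchronous data-parallel SGD on a toy
    # linear-regression problem; the gradient average stands in for all-reduce.
    import numpy as np

    def local_gradient(w, x_shard, y_shard):
        # Mean-squared-error gradient for a linear model on one worker's shard.
        return x_shard.T @ (x_shard @ w - y_shard) / len(y_shard)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1024, 4))
    y = x @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.normal(size=1024)

    num_workers, lr, w = 4, 0.1, np.zeros(4)
    shards = list(zip(np.array_split(x, num_workers), np.array_split(y, num_workers)))

    for step in range(200):
        grads = [local_gradient(w, xs, ys) for xs, ys in shards]  # parallel in practice
        w -= lr * np.mean(grads, axis=0)                          # "all-reduce" then update

    print(np.round(w, 2))  # recovers roughly [1.0, -2.0, 0.5, 3.0]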




Module 3: Distributed learning: model parallelism

Lecture Session: Oct. 26, 15:00-17:00 [slides] [video]

Discussion Session: Nov. 2, 15:00-17:00 [slides]

Required Reading

  • The TensorFlow Partitioning and Scheduling Problem [pdf]
  • Device Placement Optimization with Reinforcement Learning [pdf]
  • A Hierarchical Model for Device Placement [pdf]
  • Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning [pdf]
  • A Single-Shot Generalized Device Placement for Large Dataflow Graphs [pdf]
  • Spotlight: Optimizing Device Placement for Training Deep Neural Networks [pdf]

Optional Reading

  • GDP: Generalized Device Placement for Dataflow Graphs [pdf]
  • Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization [pdf]
  • Graph Representation Matters in Device Placement [pdf]
  • Inductive Representation Learning on Large Graphs [pdf]
  • Position-aware Graph Neural Networks [pdf]
  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [pdf]
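
The readings above treat device placement as a learning problem; as a point of reference, the toy sketch below solves the same assignment problem with a naive greedy heuristic that balances estimated per-operator runtimes across devices. The operator costs and device count are made-up illustrative numbers.

    # Toy device placement: greedily assign each operator of a dataflow graph
    # to the least-loaded device. Costs (arbitrary units) are illustrative.
    ops = {"embed": 4.0, "lstm_1": 8.0, "lstm_2": 8.0, "attention": 6.0, "softmax": 2.0}
    num_devices = 2

    load = [0.0] * num_devices
    placement = {}
    for name, cost in sorted(ops.items(), key=lambda kv: -kv[1]):  # biggest ops first
        device = min(range(num_devices), key=lambda d: load[d])    # least-loaded device
        placement[name] = device
        load[device] += cost

    print(placement, load)  # the RL approaches in the readings learn this mapping instead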




Module 4: Robust learning, e.g., Byzantine-resilient learning

Lecture Session: Nov. 9, 15:00-17:00 [slides] [video]

Discussion Session: Nov. 16, 15:00-17:00 [slides] [video]

Required Reading

  • Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent [pdf]
  • The Hidden Vulnerability of Distributed Learning in Byzantium [pdf]
  • AGGREGATHOR: Byzantine Machine Learning via Robust Gradient Aggregation [pdf]
  • SGD: Decentralized Byzantine Resilience [pdf]
  • Fast Machine Learning with Byzantine Workers and Servers [pdf]
  • DRACO: Byzantine-resilient Distributed Training via Redundant Gradients [pdf]

Optional Reading

  • Generalized Byzantine-tolerant SGD [pdf]
  • SoK: Security and Privacy in Machine Learning [pdf]
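
A one-line intuition for this module: robust aggregation rules replace the plain gradient mean, which a single Byzantine worker can corrupt arbitrarily, with estimators such as the coordinate-wise median. The sketch below shows the effect on made-up gradient vectors.

    # One Byzantine worker can move the mean arbitrarily far; the coordinate-wise
    # median (one of the aggregation rules studied in this module) stays close to
    # the honest gradients. All values are illustrative.
    import numpy as np

    honest = [np.array([1.0, -0.5]), np.array([1.1, -0.4]), np.array([0.9, -0.6])]
    byzantine = np.array([100.0, 100.0])          # arbitrary malicious update
    grads = np.stack(honest + [byzantine])

    print("mean  :", grads.mean(axis=0))          # dragged toward the attacker
    print("median:", np.median(grads, axis=0))    # close to the honest gradients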




Module 5: AutoML, e.g., hyperparameter optimization, meta-learning, and neural architecture search

Lecture Session: Nov. 23, 15:00-17:00 [slides] [video]

Discussion Session: Nov. 30, 15:00-17:00 [slides] [video]

Required Reading

  • Automated Machine Learning: State-of-The-Art and Open Challenges [pdf]
  • BOHB: Robust and Efficient Hyperparameter Optimization at Scale [pdf]
  • A System for Massively Parallel Hyperparameter Tuning [pdf]
  • DARTS: Differentiable Architecture Search [pdf]
  • ASAP: Architecture Search, Anneal and Prune [pdf]

Optional Reading

  • AutoML Book - Hyperparameter Optimization [pdf]
  • AutoML Book - Meta-Learning [pdf]
  • AutoML Book - Neural Architecture Search [pdf]
  • Non-stochastic Best Arm Identification and Hyperparameter Optimization [pdf]
  • Hyperband: A Novel Bandit-based Approach to Hyperparameter Optimization [pdf]
  • Random Search and Reproducibility for Neural Architecture Search [pdf]
  • Maggy: Scalable Asynchronous Parallel Hyperparameter Search [pdf]
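
Several of the readings (Hyperband, ASHA, BOHB) build on successive halving: run many configurations on a small budget, keep the best fraction, and repeat with a larger budget. Below is a minimal sketch in which train_and_score is a hypothetical stand-in for real training.

    # Successive halving over random learning rates; train_and_score is a
    # hypothetical, illustrative objective, not a real training run.
    import random

    def train_and_score(config, budget):
        # Pretend accuracy improves with budget and peaks near lr = 0.01.
        return budget / (budget + 1) - abs(config["lr"] - 0.01)

    random.seed(0)
    configs = [{"lr": 10 ** random.uniform(-4, 0)} for _ in range(16)]
    budget, eta = 1, 2

    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: train_and_score(c, budget), reverse=True)
        configs = scored[: max(1, len(configs) // eta)]   # keep the top 1/eta fraction
        budget *= eta                                     # give survivors more budget

    print("selected:", configs[0], "final budget:", budget)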




Module 6: ML platforms, e.g., BigDL, PyTorch Distributed, ZeRO

Lecture Session: Dec. 7, 15:00-17:00 [slides] [video]

Discussion Session: Dec. 14, 15:00-17:00 [slides]

Required Reading

  • BigDL: A Distributed Deep Learning Framework for Big Data [pdf]
  • PyTorch Distributed: Experiences on Accelerating Data Parallel Training [pdf]
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [pdf]
  • Beyond Data and Model Parallelism for Deep Neural Networks [pdf]

Optional Reading

  • Caffe: Convolutional Architecture for Fast Feature Embedding [pdf]
  • MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems [pdf]
  • TensorFlow: A system for large-scale machine learning [pdf]
  • Horovod: fast and easy distributed deep learning in TensorFlow [pdf]
  • Mesh-TensorFlow: Deep Learning for Supercomputers [pdf]
  • MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for Scaling Deep Learning [pdf]
  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [pdf]
  • PyTorch: An Imperative Style, High-Performance Deep Learning Library [pdf]
  • HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow [pdf]
  • Towards a Scalable and Distributed Infrastructure for Deep Learning Applications [pdf]
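
To connect the platform papers to code, here is a minimal sketch of data-parallel training with PyTorch DistributedDataParallel, the system described in the "PyTorch Distributed" reading. To stay runnable without a launcher it starts a world of size 1 on the gloo backend and CPU; real jobs launch one process per GPU (e.g. via torchrun) and typically use NCCL. The model, data, and port are illustrative assumptions.

    # Minimal DistributedDataParallel example: world size 1, gloo backend, CPU.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="gloo",
                            init_method="tcp://127.0.0.1:29500",  # illustrative port
                            rank=0, world_size=1)

    model = DDP(torch.nn.Linear(8, 1))   # DDP all-reduces gradients during backward()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(32, 8), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                      # gradient synchronization happens here
    opt.step()

    dist.destroy_process_group()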