Module 1: Fundamentals of ML, e.g., scalability, batch size
Lecture Session: Sep. 28, 15:00-17:00 [slides] [video]
Discussion Session: Oct. 5, 15:00-17:00 [slides] [video]
Required Reading
- Measuring the Effects of Data Parallelism on Neural Network Training [pdf]
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour [pdf]
- CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers [pdf]
- Don’t Use Large Mini-Batches, Use Local SGD [pdf]
Optional Reading
- Scaling SGD Batch Size to 32K for ImageNet Training [pdf]
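As a concrete anchor for this module's batch-size discussion, below is a minimal sketch of the linear learning-rate scaling rule with gradual warm-up described in the "Accurate, Large Minibatch SGD" reading. The base batch size, base learning rate, and warm-up length are illustrative assumptions, not values prescribed by the papers.

```python
# Sketch of the linear scaling rule with gradual warm-up
# (cf. "Accurate, Large Minibatch SGD"). Constants are illustrative.

BASE_BATCH = 256      # reference batch size
BASE_LR = 0.1         # learning rate tuned for BASE_BATCH
WARMUP_EPOCHS = 5     # epochs over which the LR ramps up


def scaled_lr(batch_size: int) -> float:
    """Linear scaling: multiply the LR by the batch-size ratio k."""
    return BASE_LR * (batch_size / BASE_BATCH)


def warmup_lr(epoch: int, batch_size: int) -> float:
    """Ramp the LR linearly from BASE_LR to the scaled LR during warm-up."""
    target = scaled_lr(batch_size)
    if epoch >= WARMUP_EPOCHS:
        return target
    return BASE_LR + (target - BASE_LR) * (epoch / WARMUP_EPOCHS)


if __name__ == "__main__":
    for epoch in range(8):
        print(epoch, round(warmup_lr(epoch, batch_size=8192), 4))
```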
Module 2: Distributed learning: data-parallelization
Lecture Session: Oct. 12, 15:00-17:00 [slides] [video]
Discussion Session: Oct. 19, 15:00-17:00 [slides] [video]
Required Reading
- Communication-Efficient Distributed Deep Learning: A Comprehensive Survey [pdf]
- TicTac: Accelerating Distributed Deep Learning with Communication Scheduling [pdf]
- Caramel: Accelerating Decentralized Distributed Deep Learning with Computation Scheduling [pdf]
- CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning [pdf]
Optional Reading
- More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server [pdf]
- Scaling Distributed Machine Learning with the Parameter Server [pdf]
- MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms [pdf]
- Gradient Coding: Avoiding Stragglers in Distributed Learning [pdf]
- Asynchronous Decentralized Parallel Stochastic Gradient Descent [pdf]
- GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent [pdf]
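Before the readings' communication-scheduling and coding details, here is a toy sketch of the synchronous data-parallel pattern they all build on: each worker computes a gradient on its own data shard, the gradients are averaged (in practice via a parameter server or all-reduce), and every worker applies the same update. The worker count, model, and data below are hypothetical.

```python
import numpy as np

# Toy synchronous data-parallel SGD on a linear least-squares model.
# Each "worker" holds a shard of the data and computes a local gradient;
# the averaged gradient (in practice obtained via all-reduce or a
# parameter server) is applied identically on every worker.

rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.01 * rng.normal(size=1000)

NUM_WORKERS = 4
shards = list(zip(np.array_split(X, NUM_WORKERS), np.array_split(y, NUM_WORKERS)))

w = np.zeros(2)
lr = 0.1
for step in range(100):
    # Each worker computes the gradient of 0.5*||Xw - y||^2 / n on its shard.
    grads = [(Xs.T @ (Xs @ w - ys)) / len(ys) for Xs, ys in shards]
    w -= lr * np.mean(grads, axis=0)   # synchronous averaging step

print("recovered weights:", w)
```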
Module 3: Distributed learning: model-parallelization
Lecture Session: Oct. 26, 15:00-17:00 [slides] [video]
Discussion Session: Nov. 2, 15:00-17:00 [slides]
Required Reading
- The TensorFlow Partitioning and Scheduling Problem [pdf]
- Device Placement Optimization with Reinforcement Learning [pdf]
- A Hierarchical Model for Device Placement [pdf]
- Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning [pdf]
- A Single-Shot Generalized Device Placement for Large Dataflow Graphs [pdf]
- Spotlight: Optimizing Device Placement for Training Deep Neural Networks [pdf]
Optional Reading
- GDP: Generalized Device Placement for Dataflow Graphs [pdf]
- Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization [pdf]
- Graph Representation Matters in Device Placement [pdf]
- Inductive Representation Learning on Large Graphs [pdf]
- Position-aware Graph Neural Networks [pdf]
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [pdf]
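The readings in this module treat device placement as learning where to put each operation of a dataflow graph. As a point of reference, the snippet below shows the simplest possible baseline for that problem: a greedy load-balancing heuristic over made-up per-op compute costs that ignores communication entirely. It is not any of the learned placers from the papers; op names, costs, and device names are assumptions.

```python
# Illustrative baseline for the device placement problem: greedily
# assign each operation to the currently least-loaded device, using
# only per-op compute costs and ignoring communication. The learned
# placers in the readings (RL, hierarchical, GNN-based) replace this
# heuristic; op names and costs here are made up.

op_costs = {"embed": 4.0, "lstm_1": 9.0, "lstm_2": 9.0,
            "attention": 6.0, "softmax": 2.0}
devices = ["gpu:0", "gpu:1"]

load = {d: 0.0 for d in devices}
placement = {}
for op, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
    d = min(load, key=load.get)       # least-loaded device so far
    placement[op] = d
    load[d] += cost

print(placement)   # e.g. {'lstm_1': 'gpu:0', 'lstm_2': 'gpu:1', ...}
print(load)        # resulting per-device compute load
```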
Module 4: Robust learning, e.g., Byzantine-resilient learning
Lecture Session: Nov. 9, 15:00-17:00 [slides] [video]
Discussion Session: Nov. 16, 15:00-17:00 [slides] [video]
Required Reading
- Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent [pdf]
- The Hidden Vulnerability of Distributed Learning in Byzantium [pdf]
- AGGREGATHOR: Byzantine Machine Learning via Robust Gradient Aggregation [pdf]
- SGD: Decentralized Byzantine Resilience [pdf]
- Fast Machine Learning with Byzantine Workers and Servers [pdf]
- DRACO: Byzantine-resilient Distributed Training via Redundant Gradients [pdf]
Optional Reading
- Generalized Byzantine-tolerant SGD [pdf]
- SoK: Security and Privacy in Machine Learning [pdf]
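To make the aggregation rules in this module concrete, here is a small sketch of Krum, the rule introduced in the first required reading: each submitted gradient is scored by the summed squared distances to its n − f − 2 nearest neighbours, and the lowest-scoring gradient is selected. The worker count and the single adversarial gradient below are illustrative.

```python
import numpy as np

# Sketch of the Krum aggregation rule ("Machine Learning with
# Adversaries: Byzantine Tolerant Gradient Descent"): score each
# gradient by the summed squared distances to its n - f - 2 nearest
# neighbours and keep the lowest-scoring one. Worker count and the
# adversarial gradient are illustrative.

def krum(grads: np.ndarray, f: int) -> np.ndarray:
    """grads: (n, d) array of worker gradients; f: Byzantine workers tolerated."""
    n = len(grads)
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1) ** 2
    scores = []
    for i in range(n):
        d = np.delete(dists[i], i)          # distances to the other workers
        scores.append(np.sort(d)[: n - f - 2].sum())
    return grads[int(np.argmin(scores))]

honest = np.random.default_rng(1).normal(loc=1.0, scale=0.1, size=(6, 3))
byzantine = np.full((1, 3), 100.0)          # one adversarial gradient
print(krum(np.vstack([honest, byzantine]), f=1))   # an honest gradient, near 1.0
```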
Module 5: Automated ML, e.g., hyperparameter optimization, neural architecture search
Lecture Session: Nov. 23, 15:00-17:00 [slides] [video]
Discussion Session: Nov. 30, 15:00-17:00 [slides] [video]
Required Reading
- Automated Machine Learning: State-of-The-Art and Open Challenges [pdf]
- BOHB: Robust and Efficient Hyperparameter Optimization at Scale [pdf]
- A System for Massively Parallel Hyperparameter Tuning [pdf]
- DARTS: Differentiable Architecture Search [pdf]
- ASAP: Architecture Search, Anneal and Prune [pdf]
Optional Reading
- AutoML Book - Hyperparameter Optimization [pdf]
- AutoML Book - Meta-Learning [pdf]
- AutoML Book - Neural Architecture Search [pdf]
- Non-stochastic Best Arm Identification and Hyperparameter Optimization [pdf]
- Hyperband: A Novel Bandit-based Approach to Hyperparameter Optimization [pdf]
- Random Search and Reproducibility for Neural Architecture Search [pdf]
- Maggy: Scalable Asynchronous Parallel Hyperparameter Search [pdf]
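Several of the readings above (Hyperband, ASHA, and the best-arm-identification paper) build on successive halving. The sketch below shows that core loop with a stand-in training function: evaluate many configurations on a small budget, keep the best fraction, and re-evaluate the survivors with a larger budget. The training function, configuration space, and constants are assumptions for illustration only.

```python
import random

# Sketch of successive halving, the subroutine underlying Hyperband and
# ASHA: evaluate many configurations cheaply, keep the best 1/eta, and
# repeat with a larger budget. The "train" function is a stand-in.

def train(config: float, budget: int) -> float:
    """Pretend validation loss: more budget -> closer to the config's true loss."""
    return config + 1.0 / budget

def successive_halving(configs, min_budget=1, eta=2, rounds=4):
    budget = min_budget
    survivors = list(configs)
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: train(c, budget))
        survivors = scored[: max(1, len(scored) // eta)]   # keep the top 1/eta
        budget *= eta                                      # grow the budget
    return survivors[0]

random.seed(0)
configs = [random.random() for _ in range(16)]
print("best config found:", successive_halving(configs))
```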
Module 6: Distributed ML frameworks and systems
Lecture Session: Dec. 7, 15:00-17:00 [slides] [video]
Discussion Session: Dec. 14, 15:00-17:00 [slides]
Required Reading
- BigDL: A Distributed Deep Learning Framework for Big Data [pdf]
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training [pdf]
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [pdf]
- Beyond Data and Model Parallelism for Deep Neural Networks [pdf]
Optional Reading
- Caffe: Convolutional Architecture for Fast Feature Embedding [pdf]
- MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems [pdf]
- TensorFlow: A system for large-scale machine learning [pdf]
- Horovod: fast and easy distributed deep learning in TensorFlow [pdf]
- Mesh-TensorFlow: Deep Learning for Supercomputers [pdf]
- MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for Scaling Deep Learning [pdf]
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [pdf]
- PyTorch: An Imperative Style, High-Performance Deep Learning Library [pdf]
- HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow [pdf]
- Towards a Scalable and Distributed Infrastructure for Deep Learning Applications [pdf]
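As a closing illustration of one systems idea from the required readings, here is a toy, single-process simulation of ZeRO-style optimizer-state sharding: each of N simulated workers keeps only its 1/N shard of the optimizer state and updates its own slice of the parameters, which are then re-assembled (a stand-in for an all-gather). The toy objective, the momentum optimizer, and all sizes are illustrative assumptions, not the ZeRO implementation itself.

```python
import numpy as np

# Conceptual, single-process simulation of optimizer-state sharding in
# the spirit of ZeRO stage 1: each worker stores only its shard of the
# momentum buffer and updates its slice of the parameters. Objective,
# optimizer, and sizes are illustrative.

NUM_WORKERS = 4
DIM = 8                      # parameter count, divisible by NUM_WORKERS
SHARD = DIM // NUM_WORKERS

params = np.zeros(DIM)
momentum_shards = [np.zeros(SHARD) for _ in range(NUM_WORKERS)]  # 1/N state each
lr, beta = 0.1, 0.9

def full_gradient(p):
    """Toy objective: pull the parameters toward the target vector 0..DIM-1."""
    return p - np.arange(DIM, dtype=float)

for step in range(200):
    g = full_gradient(params)                # in practice: all-reduced data-parallel grad
    new_shards = []
    for w in range(NUM_WORKERS):
        sl = slice(w * SHARD, (w + 1) * SHARD)
        momentum_shards[w] = beta * momentum_shards[w] + g[sl]
        new_shards.append(params[sl] - lr * momentum_shards[w])
    params = np.concatenate(new_shards)      # stand-in for the parameter all-gather

print(params.round(3))                       # approx. [0, 1, 2, ..., 7]
```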