2024 AAALGO AI Bootcamp

Foundations of Deep Learning

Wei Dong

wdong@aaalgo.com

Outline

  1. Basic Concepts: Models, Layers, and Activation Functions
  2. Gradient Computation & Backpropagation
  3. Learning Process: Loss Functions, Metrics and a Minimal Example
  4. Optimization Methods: SGD & Momentum-Based Methods
  5. Data Preparation: Normalization, Augmentation, and Batching
  6. Regularization & Initialization Techniques
  7. DeepSpeed and Distributed Training

1. Basic Concepts: Models, Layers, and Activation Functions

Objectives:

  • Understand what a model is in deep learning
  • Introduce nn.Module in PyTorch
  • Explore basic layer types and activation functions
  • Examine a simple MLP example

What is a Model?

  • A model is a function (both in math and in code) that:
    • maps inputs to outputs
    • through learnable parameters
    • likely constructed from smaller models (called layers, or modules)
  • Parameters are trained to minimize a loss function.
  • Inputs, outputs and parameters are all tensors.
  • In PyTorch, models are usually built as classes extending nn.Module.

PyTorch nn.Module

  • Base class for all neural network components
    • layers, loss functions, entire models
  • Implement forward and PyTorch automatically computes gradients.
import torch
import torch.nn as nn

class MyLinearModel(nn.Module):

    def __init__(self, dim_in, dim_out):
        super(MyLinearModel, self).__init__()
        # Wrapping tensors in nn.Parameter registers them with the module
        self.weights = nn.Parameter(torch.randn(dim_in, dim_out))
        self.bias = nn.Parameter(torch.randn(dim_out))

    def forward(self, x):
        # x's shape is [batch_size, dim_in]; the result is [batch_size, dim_out]
        return x @ self.weights + self.bias

Important Methods of nn.Module

  • Documentation
  • Train/Inference: .forward(), .zero_grad() (.backward() is called on the loss tensor, not on the module)
  • Setting phase: .train(), .eval()
  • Parameters: .parameters(), .named_parameters()
  • Load and save: .load_state_dict(), .state_dict()
  • Transfer to device: .to(device), .cpu(), .cuda()
  • Data type conversion: .float(), .double(), .half(), .bfloat16()

Parameters

  • Documentation
  • Parameters are tensors that are adjusted to reduce the loss.
      self.weights = nn.Parameter(torch.randn(dim_in, dim_out), requires_grad=True)
      self.bias = nn.Parameter(torch.randn(dim_out), requires_grad=True)
    
  • Example:
    model = MyLinearModel(100, 2)
    for name, a in model.named_parameters():
      print(f"{name}: {a.shape}")
    
    weights: torch.Size([100, 2])
    bias: torch.Size([2])
    

Lab: Play with the Llama 3.2 1B model.

Example: Initialize a Model from an Existing Model

from transformers import AutoConfig, AutoModelForCausalLM

# load pretrained standard model
pretrained = AutoModelForCausalLM.from_pretrained('Llama-3.2-3B')
# get parameters from pretrained model
state_dict = pretrained.model.state_dict()

# load config from standard model
config = AutoConfig.from_pretrained('Llama-3.2-3B')
model = MyCustomModelBasedOnLlama(config)
model.load_state_dict(state_dict, strict=False)
model.init_weights()  # initialize non-standard parameters

Example: Partial Finetune

from safetensors.torch import load_file, save_file

def is_trainable (name):
  return '.self_attn.' in name

params_to_save = {name: param for name, param in model.named_parameters()
                  if is_trainable(name)}
metadata = {"format": "pt"}
save_file(params_to_save, "delta.safetensors", metadata=metadata)

# load delta
state_dict = load_file("delta.safetensors")
model.load_state_dict(state_dict, strict=False)

Tensors: Multi-Dimensional Arrays

  • Documentation
  • .dtype: torch.float32, torch.float16, torch.int32, torch.int64, ...
  • .shape: [batch_size, dim_in], [batch_size, dim_in, dim_out], ...
  • .device (an attribute, not a method), .cuda(), .cpu(), .to(device)
  • .detach().cpu().numpy(): convert to numpy array

Tensor Operations

  • Creation: torch.randn(), torch.randint(), torch.zeros(), torch.ones()
  • Reduction: torch.mean(), torch.sum(), torch.max(), torch.min()
  • Shaping: torch.reshape(), torch.transpose(), and the tensor methods .view(), .repeat(), .expand()
  • Shaping: torch.squeeze(), torch.unsqueeze()
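
A small sketch of the shaping and reduction operations listed above. The tensor values and variable names are made up for illustration.

```python
import torch

x = torch.arange(6.0)        # values 0..5, shape [6]
a = x.reshape(2, 3)          # shape [2, 3]
b = a.transpose(0, 1)        # shape [3, 2]; a view, no data copied
c = a.unsqueeze(0)           # shape [1, 2, 3]; adds a leading dimension
d = c.squeeze(0)             # shape [2, 3]; removes it again
e = c.expand(4, 2, 3)        # shape [4, 2, 3]; broadcasts without copying
f = c.repeat(4, 1, 1)        # shape [4, 2, 3]; copies the data 4 times

col_sum = a.sum(dim=0)       # reduce over rows
row_mean = a.mean(dim=1)     # reduce over columns
print(col_sum)               # tensor([3., 5., 7.])
print(row_mean)              # tensor([1., 4.])
```

expand() and repeat() produce the same values here; expand() is cheaper because it only changes strides, but the result should not be written to in place.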

Common Layer Types

  • torch.nn
  • Linear (Fully-Connected): nn.Linear(in_features, out_features)
  • Convolutional: nn.Conv2d(in_channels, out_channels, kernel_size, ...) for image tasks
  • Pooling: nn.MaxPool2d, nn.AvgPool2d, ...
  • Recurrent: nn.RNN, nn.LSTM, nn.GRU for sequence data
  • Activation Functions: nn.ReLU, nn.Sigmoid, nn.Tanh
  • Residual connections (ResNet)

Example: Multi-Layer Perceptron (MLP)

# https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html
class NeuralNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        # x can be a batch of anything that has 28x28 values.
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Convolutional Layer


conv = nn.Conv2d(
    in_channels=3,
    out_channels=16,
    kernel_size=3,
    stride=1,
    padding=1
)

x = torch.randn(32, 3, 224, 224)
y = conv(x)
print(y.shape)


torch.Size([32, 16, 224, 224])

Exercise: Convolution Parameters Calculation

  • Input image size: 224x224x3

  • Kernel size: 3x3

  • Stride: 2

  • Padding: 1

  • Output channels: 32

  • Task: calculate the weight dimensions & number of parameters

  • Task: calculate the output tensor shape
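
One way to check the exercise numerically is a sketch in plain Python. The helper names `conv2d_output_size` and `conv2d_num_params` are hypothetical, and a bias term and no dilation are assumed.

```python
def conv2d_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv2d_num_params(in_channels, out_channels, kernel_size, bias=True):
    """One kernel_size x kernel_size filter per (input, output) channel pair, plus biases."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


out_hw = conv2d_output_size(224, kernel_size=3, stride=2, padding=1)
n_params = conv2d_num_params(in_channels=3, out_channels=32, kernel_size=3)
print(out_hw, n_params)  # 112 896
```

Under these assumptions the weight tensor has shape [32, 3, 3, 3] and the output has shape [batch_size, 32, 112, 112].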

Pooling Layer

  • Reduce the spatial size of the representation
  • Similar to convolution, with kernel_size, stride, padding
    • MaxPool2d, AvgPool2d
  • Adaptive (global) pooling: outputs a fixed size regardless of the input size
    • nn.AdaptiveMaxPool2d(output_size)
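
A quick sketch of the fixed-size behaviour; the input sizes below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveMaxPool2d(7)                 # always produce a 7x7 spatial map

small = pool(torch.randn(1, 512, 14, 14))      # square input
large = pool(torch.randn(1, 512, 20, 31))      # larger, non-square input
print(small.shape, large.shape)
```

Both outputs have shape [1, 512, 7, 7], which is what lets a fixed-size linear layer follow a convolutional stack fed with images of different sizes.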

Activation Functions: Introducing Non-Linearity

Dropout: Regularization to Prevent Overfitting

  • Introduced by Hinton et al. (2012) and used in AlexNet (2012).
  • Randomly sets values to 0 during training.
  • In PyTorch, use nn.Dropout(p=0.4) to randomly drop 40% of the units.
  • During training, the next layer sees only (1-p) = 60% of the units; PyTorch scales the surviving values by 1/(1-p) so the expected input is unchanged.
  • At inference time, nothing is dropped and nothing is scaled: nn.Dropout is the identity.
  • PyTorch needs to know whether it is training or inference.
model.train()   # training mode
model.eval()    # inference mode
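
A minimal sketch of how nn.Dropout behaves in the two modes, using an all-ones input so the scaling is easy to see. The seed and tensor size are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.4)
x = torch.ones(10000)

drop.train()                      # training mode: drop units and rescale survivors
y_train = drop(x)

drop.eval()                       # inference mode: dropout does nothing
y_eval = drop(x)

kept = y_train[y_train != 0]
drop_rate = (y_train == 0).float().mean().item()
print(kept.unique())              # survivors are scaled to 1 / (1 - 0.4)
print(drop_rate)                  # close to 0.4
print(torch.equal(y_eval, x))     # True
```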

ResNet: When you are not certain that a layer will help

  # transformers/models/llama/modeling_llama.py
  residual = hidden_states

  hidden_states = self.input_layernorm(hidden_states)

  # Self Attention
  hidden_states, self_attn_weights, present_key_value = self.self_attn(
      hidden_states=hidden_states,
      attention_mask=attention_mask,
      position_ids=position_ids,
      ...
  )
  hidden_states = residual + hidden_states

  # Fully Connected
  residual = hidden_states
  hidden_states = self.post_attention_layernorm(hidden_states)
  hidden_states = self.mlp(hidden_states)
  hidden_states = residual + hidden_states

Convolutional Neural Network

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.model = nn.Sequential(
            *self._conv_block(2, 3, 64),
            *self._conv_block(2, 64, 128),
            *self._conv_block(3, 128, 256),
            *self._conv_block(3, 256, 512),
            *self._conv_block(3, 512, 512),
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def _conv_block(self, num_convs, in_channels, out_channels):
        layers = []
        for _ in range(num_convs):
            layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
            layers.append(nn.ReLU(True))
            in_channels = out_channels
        layers.append(nn.MaxPool2d(2, 2))
        return layers

    def forward(self, x):
        return self.model(x)

Arbitrary Input Size with Adaptive Pooling

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.model = nn.Sequential(
            *self._conv_block(2, 3, 64),    nn.MaxPool2d(2, 2),
            *self._conv_block(2, 64, 128),  nn.MaxPool2d(2, 2),
            *self._conv_block(3, 128, 256), nn.MaxPool2d(2, 2),
            *self._conv_block(3, 256, 512), nn.MaxPool2d(2, 2),
            *self._conv_block(3, 512, 512), nn.AdaptiveMaxPool2d(7),
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def _conv_block(self, num_convs, in_channels, out_channels):
        layers = []
        for _ in range(num_convs):
            layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
            layers.append(nn.ReLU(True))
            in_channels = out_channels
        # layers.append(nn.MaxPool2d(2, 2))  # moved into __init__ so the last pool can be adaptive
        return layers

    def forward(self, x):
        return self.model(x)

Weight Initialization Strategies

  • Xavier (Glorot) Initialization: Scales weights based on fan-in and fan-out.
  • He Initialization: Good for ReLU-based networks.
    def init_weights(m):
        if isinstance(m, nn.Linear):
            # Xavier; for He initialization use nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
            nn.init.xavier_uniform_(m.weight)
    model.apply(init_weights)
    

2. Gradient Computation & Backpropagation

Objectives:

  • Understand gradients and the chain rule
  • Learn how backpropagation works
  • See how PyTorch’s autograd mechanism simplifies gradient calculation
  • Recognize exploding/vanishing gradients and how to mitigate them

The Math of Derivatives

Properties:

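  • Derivative is a mapping from functions to functions: $D: f \mapsto f'$
  • Derivative is linear: $D(af + bg) = a\,D(f) + b\,D(g)$
  • Leibniz rule: $D(fg) = D(f)\,g + f\,D(g)$

(If a linear mapping satisfies the Leibniz rule, it must be a derivative.)

In deep learning we always talk about the derivative at a particular point $x_0$.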

The Math of Gradient

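  • Vector of partial derivatives: $\nabla_\theta L = \left( \frac{\partial L}{\partial \theta_1}, \ldots, \frac{\partial L}{\partial \theta_n} \right)$
  • It tells how much each parameter affects the loss.
  • Minimizing a function involves moving the parameters opposite to the gradient direction.

  • Basic update rule: $\theta \leftarrow \theta - \eta\, \nabla_\theta L$, where $\eta$ is the learning rate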
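
A toy sketch of the update rule in plain Python, on a hypothetical one-parameter function $f(\theta) = (\theta - 3)^2$ chosen only because its minimum is known.

```python
def grad(theta):
    # Derivative of f(theta) = (theta - 3)^2
    return 2 * (theta - 3)


theta, lr = 0.0, 0.1
for _ in range(100):
    theta = theta - lr * grad(theta)   # step opposite to the gradient

print(theta)   # very close to the minimizer 3
```

Each step shrinks the distance to the minimum by a factor of 0.8 here; a learning rate above 1.0 would make this particular example diverge.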

Chain Rule and Automatic Differentiation

  • Chain rule: $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$

  • A PyTorch tensor optionally stores a gradient (.grad) in addition to its value.

  • PyTorch uses autograd to compute gradients automatically:

    import torch

    x = torch.tensor(1.0, requires_grad=True)
    y = x * 2
    z = y**2
    z.backward()        # automatically computes gradients
    print(x.grad)       # tensor(8.): dz/dx = dz/dy * dy/dx = 2y * 2 = 8x
    

Gradient Calculation in Deep Learning

  • Gradient is always that of the scalar loss $L$, i.e. $\partial L / \partial x$ for each tensor $x$

  • Each tensor has both a value and a gradient

  • Forward pass: fill in values

  • Backward pass: fill in gradients

Exercise 1: Linear Layer

  • Suppose forward pass is done and backward pass is done up till .
  • Derive the following:

  • Why do we want to calculate ?
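
A sketch that checks a hand-derived backward pass for a linear layer against autograd. The notation is an assumption, since the slide's symbols are not shown: the layer is taken to be y = x W + b with x of shape [batch_size, dim_in], and `grad_y` stands in for the gradient already computed for the layer's output.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3, requires_grad=True)   # [batch_size, dim_in]
W = torch.randn(3, 2, requires_grad=True)   # [dim_in, dim_out]
b = torch.randn(2, requires_grad=True)      # [dim_out]

y = x @ W + b
grad_y = torch.randn_like(y)                # dL/dy arriving from later layers
y.backward(grad_y)                          # autograd fills x.grad, W.grad, b.grad

# Manual backward pass
grad_W = x.detach().T @ grad_y              # dL/dW = x^T (dL/dy)
grad_b = grad_y.sum(dim=0)                  # dL/db = sum of dL/dy over the batch
grad_x = grad_y @ W.detach().T              # dL/dx = (dL/dy) W^T

print(torch.allclose(W.grad, grad_W),
      torch.allclose(b.grad, grad_b),
      torch.allclose(x.grad, grad_x))
```

grad_W and grad_b are used to update this layer. grad_x is not used for any update; it is what the previous layer receives as its own output gradient.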

Exercise 1b: Linear Layer

  • Let's say .
  • Calculate and .

Exercise 2: ReLU

  • Both and are vectors.
  • Calculate
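
A sketch of the ReLU backward pass under assumed notation (y = relu(x), with `grad_y` standing in for the gradient of the loss with respect to y). The input values are made up.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
y = torch.relu(x)

grad_y = torch.tensor([1.0, 2.0, 3.0, 4.0])   # dL/dy from later layers
y.backward(grad_y)

# Manual rule: the gradient passes through where x > 0 and is blocked elsewhere
grad_x = grad_y * (x.detach() > 0)

print(x.grad)     # tensor([0., 0., 3., 4.])
print(grad_x)     # tensor([0., 0., 3., 4.])
```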

Exercise 2b: ReLU

  • Let's say .
  • Calculate .

Exercise 3: Pooling

  • Calculate

Exercise 3b: Pooling

Let's say .

Calculate

Exercise 4: Convolution

Exercise 4b: Convolution

Suppose .

Calculate and .

3. Learning Process: Loss Functions, Metrics and a Minimal Example

Objectives:

  • Loss functions
    • Distinguish loss functions from evaluation metrics
    • Learn common loss functions (MSE, Cross-Entropy)
    • Choose appropriate metrics (accuracy, F1, etc.)
  • From PyTorch to HuggingFace

Review of a Minibatch Training Loop

  • Set the input to a new batch.
  • Forward pass: Compute predictions and loss.
  • Backward pass: Compute gradients with autograd by calling loss.backward().
    • At this point, each parameter tensor has a gradient in .grad.
  • After backward pass, call optimizer.step() to update parameters.

  • Optimizer: How to update with gradients?

A Minimal Training Loop in PyTorch

outputs = model(inputs)
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
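
The loop above needs a model, a criterion, an optimizer and data to run. A self-contained sketch follows, using a hypothetical toy regression problem where the targets are a fixed linear function of the inputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy data (an assumption for illustration): y = x @ w_true + 0.3
inputs = torch.randn(64, 3)
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.3

losses = []
for step in range(200):
    outputs = model(inputs)               # forward pass
    loss = criterion(outputs, targets)
    optimizer.zero_grad()                 # clear gradients from the previous step
    loss.backward()                       # backward pass
    optimizer.step()                      # parameter update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

Calling optimizer.zero_grad() matters because .backward() accumulates into .grad rather than overwriting it.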

Training VGG16 in Pure PyTorch


model = VGG16().cuda()  # Move model to GPU
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for batch_idx, (images, labels) in enumerate(train_loader):
        # Move data to GPU
        images = images.cuda()  # shape: [batch_size, 3, 224, 224]
        labels = labels.cuda()  # shape: [batch_size]
        
        outputs = model(images) # shape: [batch_size, 1000]
        loss = criterion(outputs, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if batch_idx % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], '
                  f'Loss: {loss.item():.4f}')

From PyTorch to HuggingFace

  • HuggingFace provides a high-level API for training models.
  • The de facto standard implementation of Llama 3.x.
  • Has its own ecosystem: datasets, deepspeed, trl, etc.
  • The APIs usually expose a huge number of parameters.
  • Some integration paths or modes of operation are broken.
    • Sometimes very difficult to debug.
    • Known working configurations are invaluable.
  • Alternative for CNNs and small models (but not LLMs): Lightning

The HuggingFace Trainer

    training_args = TrainingArguments(
        output_dir='./output',
        logging_dir='./logs',
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        # num_train_epochs=200,
        max_steps=20000,
        warmup_steps=400,
        weight_decay=0.01,
        logging_steps=100,
        bf16=True,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        metric_for_best_model="eval_F1",
        greater_is_better=True,  # a higher F1 is better
    )
    trainer = Trainer(model=model, args=training_args, ...)
    trainer.train()

Training Process in Weights & Biases (wandb)

  • Integrated with HuggingFace
import wandb
wandb.init(project="project", name="curve")
... Training Code ...

Loss Functions vs. Metrics

  • Loss functions:
    • Must be differentiable.
    • Smaller is better.
    • The model’s parameters are updated to minimize loss.
  • Metrics:
    • Not required to be differentiable.
    • Do not affect parameter updates.
    • Track performance during training and evaluation.

Common Loss Functions

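  • MSELoss: For regression tasks. $L = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$

  • CrossEntropyLoss: For classification tasks. $L = -\sum_c y_c \log p_c$, where $p = \mathrm{softmax}(z)$ for logits $z$

  • Dice Loss: For segmentation tasks. $L = 1 - \frac{2\,|A \cap B|}{|A| + |B|}$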

Exercise: Calculate the Gradient of CrossEntropyLoss
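
A sketch that checks the well-known result for this exercise numerically: for a single example, the gradient of the cross-entropy loss with respect to the logits is softmax(logits) minus the one-hot target. The logits and target class below are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)   # scores for 5 classes, one example
target = torch.tensor(2)                      # index of the true class

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

probs = torch.softmax(logits.detach(), dim=0)
manual_grad = probs - F.one_hot(target, num_classes=5).float()

print(torch.allclose(logits.grad, manual_grad, atol=1e-6))   # True
```

The gradient sums to zero across classes: the true class is pushed up by exactly as much as the others are pushed down in total.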

Evaluation Metrics and Tools

  • Accuracy: Proportion of correctly classified examples.
  • MAE, RMSE: For regression performance checks.
  • Confusion Matrix: For classification performance checks.
  • Binary Classification:
    • True/False Positives/Negatives (TP, FP, TN, FN)
    • ROC Curve (AUC) and Precision-Recall Curve
    • F1-score: Harmonic mean of precision and recall.
  • scikit-learn metrics

Confusion Matrix

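Predicted\Actual  Class 1  Class 2  Class 3  Class 4  Class 5
Class 1                85        7        3        4        1
Class 2                 5       90        6        2        2
Class 3                 2        4       88        5        1
Class 4                 3        2        4       82        4
Class 5                 1        3        2        3       91
sklearn.metrics.confusion_matrix(y_true, y_pred, ...)
Note: sklearn returns the transpose of this layout (rows are actual classes, columns are predicted classes).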

Confusion Matrix of Binary Classification

Predicted\Actual  False  True
False             TN     FN
True              FP     TP
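
A plain-Python sketch of how the four cells turn into the usual metrics. The function name `binary_metrics` and the labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Count the confusion-matrix cells and derive common metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

Here TP=3, FP=1, FN=1, TN=5, so precision, recall and F1 are all 0.75 while accuracy is 0.8.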

ROC Curve and Precision-Recall Curve

  • ROC Curve: for classification tasks.
    • AUROC: Area Under ROC Curve
    • All curves are monotonic.
  • Precision-Recall Curve: for information retrieval tasks.
    • Not necessarily monotonic.

4. Optimization Methods: SGD & Momentum-Based Methods

Objectives:

  • Understand basic SGD and its update rules
  • Introduce momentum and weight decay
  • Learn about learning rate scheduling
  • Understand advantages and limitations of basic SGD variants

Stochastic Gradient Descent

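  • Update rule: $\theta_{t+1} = \theta_t - \eta\, g_t$, where $g_t$ is the mini-batch gradient and $\eta$ the learning rate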

  • Learning rate controls step size.
  • Uses mini-batches of data for gradient estimates.
  • Advantages:
    • Simplicity, strong theoretical foundations.
    • Often good generalization properties with proper tuning.
  • Limitations:
    • Can get stuck in sharp minima or plateaus.
    • Requires careful tuning of LR and momentum.

Momentum

  • Accelerates updates along consistent gradient directions and dampens oscillations.
  • Update rule with momentum (conceptually): $m_t = \mu\, m_{t-1} + g_t$, $\theta_{t+1} = \theta_t - \eta\, m_t$

  • Common choice: $\mu = 0.9$
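
A sketch comparing a hand-written momentum update with torch.optim.SGD on a toy loss, sum(w^2). The buffer name `m` and the starting values are assumptions for illustration.

```python
import torch

lr, mu = 0.1, 0.9
w_ref = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([w_ref], lr=lr, momentum=mu)

w = torch.tensor([1.0, -2.0])
m = torch.zeros_like(w)             # momentum buffer

for _ in range(5):
    # Reference: PyTorch optimizer
    opt.zero_grad()
    (w_ref ** 2).sum().backward()
    opt.step()

    # Manual: the gradient of sum(w^2) is 2w
    g = 2 * w
    m = mu * m + g                  # accumulate a decaying sum of past gradients
    w = w - lr * m

print(torch.allclose(w, w_ref.detach()))   # True
```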

Weight Decay

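  • For plain SGD, equivalent to L2 regularization; helps prevent overfitting.

  • Update rule with weight decay: $\theta_{t+1} = \theta_t - \eta\, (g_t + \lambda\, \theta_t)$

  • Common choice: e.g. $\lambda = 0.01$, the value used in the TrainingArguments example
torch.optim.SGD(params, lr=0.001, momentum=0, weight_decay=0, ...)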

Practical Considerations: Learning Rate Schedules

  • Constant LR may not be optimal.
  • Schedules:
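    • Step decay: multiply LR by a factor every few epochs.
    • Linear decay: decrease LR linearly to 0 over training.
    • Exponential decay: $\eta_t = \eta_0\, \gamma^t$.
    • Cosine annealing: LR follows a cosine curve down to a minimum; with warm restarts it oscillates, which can help escape poor minima.
  • Warmup: Gradually increase LR from 0 to the initial LR.
    from transformers import get_scheduler

    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=400,
        num_training_steps=20000,
    )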
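
A plain-Python sketch of what a linear schedule with warmup computes: a multiplier applied to the base LR at each step. The function name `linear_warmup_decay` is hypothetical; it is meant to mirror the shape of the "linear" schedule, not to reproduce the library's code.

```python
def linear_warmup_decay(step, num_warmup_steps=400, num_training_steps=20000):
    """LR multiplier: ramps 0 -> 1 during warmup, then decays linearly 1 -> 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
for step in (0, 200, 400, 10200, 20000):
    print(step, base_lr * linear_warmup_decay(step))
```

A function of this form can also be passed to torch.optim.lr_scheduler.LambdaLR as lr_lambda.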

Cosine Annealing



Example Code

# PyTorch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

# HuggingFace
from transformers import Trainer, get_scheduler

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=400,
    # total number of optimizer steps (not samples); matches max_steps in TrainingArguments
    num_training_steps=training_args.max_steps,
)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))
trainer.train()

Adaptive Optimizers: Adam, RMSProp, and Others

  • Different parameters may require different step sizes.

  • Adaptive optimizers adjust the effective LR for each parameter based on historical gradients.

  • Often converge faster or require less manual tuning than plain SGD.

Root Mean Square Propagation (RMSProp)

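  • Maintains a moving average of squared gradients: $v_t = \beta\, v_{t-1} + (1 - \beta)\, g_t^2$

  • Update: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$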

  • Good for non-stationary environments

  • Widely used in RNNs

Adam Optimizer

  • Combines RMSProp and Momentum ideas.
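  • Maintains moving averages of gradients ($m_t$) and squared gradients ($v_t$): $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$

  • Bias-corrected updates: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

  • Parameter update: $\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$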
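
A sketch that writes the Adam update out by hand and compares it with torch.optim.Adam on a toy loss, sum(w^2). The standard Adam equations are assumed (moving averages, bias correction, then the scaled step); the starting values are arbitrary.

```python
import torch

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
w_ref = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.Adam([w_ref], lr=lr, betas=(beta1, beta2), eps=eps)

w = torch.tensor([1.0, -2.0])
m = torch.zeros_like(w)                    # moving average of gradients
v = torch.zeros_like(w)                    # moving average of squared gradients

for t in range(1, 6):
    # Reference: PyTorch optimizer
    opt.zero_grad()
    (w_ref ** 2).sum().backward()
    opt.step()

    # Manual Adam step; the gradient of sum(w^2) is 2w
    g = 2 * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)

print(torch.allclose(w, w_ref.detach(), atol=1e-6))   # True
```

Note that each early step moves every parameter by roughly lr regardless of the gradient's magnitude, which is why Adam is less sensitive to gradient scale than SGD.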

Adam Default Parameters

  • Common defaults, rarely need to change.
    • $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
  • Learning rate is usually smaller than for SGD
    • lr = 1e-3 for training CNN models from scratch
    • lr = 1e-5 for finetuning LLM
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Adam vs AdamW

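  • Adam: weight decay is coupled with the gradient: $g_t \leftarrow g_t + \lambda\, \theta_t$ before the moment updates

  • AdamW: weight decay decoupled and directly applied to the weights: $\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_t \right)$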
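
A sketch of the difference on a toy loss, sum(w^2): the same weight_decay value is passed to torch.optim.Adam (coupled) and torch.optim.AdamW (decoupled), and a hand-written decoupled update is compared against AdamW. The hyperparameter values are arbitrary choices for illustration.

```python
import torch

lr, wd, beta1, beta2, eps = 0.01, 0.1, 0.9, 0.999, 1e-8
init = [1.0, -2.0]

w_adam = torch.tensor(init, requires_grad=True)
w_adamw = torch.tensor(init, requires_grad=True)
opt_adam = torch.optim.Adam([w_adam], lr=lr, weight_decay=wd)     # decay added to the gradient
opt_adamw = torch.optim.AdamW([w_adamw], lr=lr, weight_decay=wd)  # decay applied to the weights

w = torch.tensor(init)
m = torch.zeros_like(w)
v = torch.zeros_like(w)

for t in range(1, 6):
    for p, opt in ((w_adam, opt_adam), (w_adamw, opt_adamw)):
        opt.zero_grad()
        (p ** 2).sum().backward()
        opt.step()

    # Manual decoupled update
    g = 2 * w                       # gradient of the loss only, no decay term
    w = w * (1 - lr * wd)           # weight decay applied directly to the weights
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    w = w - lr * (m / (1 - beta1 ** t)) / ((v / (1 - beta2 ** t)).sqrt() + eps)

print(torch.allclose(w, w_adamw.detach(), atol=1e-6))        # True: manual matches AdamW
print(torch.allclose(w_adam.detach(), w_adamw.detach()))     # False: Adam and AdamW differ
```

With coupled decay the decay term is normalized away by the adaptive scaling; with decoupled decay every weight shrinks by the same relative amount per step.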

Other Variants: Adagrad, Adadelta

  • Adagrad: Accumulates historical gradients, good for sparse data but LR decreases over time.

  • Adadelta: Tries to fix Adagrad’s diminishing LR issue.

  • Generally, AdamW and Adam are more popular due to good defaults.

Advantages of Adaptive Methods

  • Often faster initial convergence.
  • Less sensitive to initial LR.
  • Good when dealing with sparse gradients or varying feature scales.
  • May lead to slightly worse generalization in some cases.
  • Experimentation is key.
  • Start with AdamW or Adam

5. Scaling Up Computation

Objectives:

  • Single GPU Techniques
    • Low-precision data type
    • CPU offloading: offload optimizer states to CPU
    • Gradient checkpointing
    • LoRA
  • Generic paradigms of parallelization
  • DeepSpeed ZeRO

Computer Science is All About Tradeoffs

Sacrifice \Gain Space Time Accuracy
Space -
Time -
Accuracy -

Generalized CAP Theorem: out of three, you can only have two.

(The original CAP theorem is about Consistency, Availability, and Partition tolerance of distributed systems.)

Reducing Memory Usage: Single GPU Techniques

  • Low-precision data type
  • CPU offloading: offload optimizer states to CPU
  • Gradient checkpointing
  • LoRA

Gradient Checkpointing

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layer1 = nn.Linear(1024, 1024)
        self.layer2 = nn.Linear(1024, 1024)
        self.layer3 = nn.Linear(1024, 1024)

    def forward(self, x):
        # Use checkpointing on layers to save memory: activations inside the
        # checkpointed call are recomputed during the backward pass
        x = checkpoint(self.layer1, x, use_reentrant=False)  # originally: x = self.layer1(x)
        x = checkpoint(self.layer2, x, use_reentrant=False)  # originally: x = self.layer2(x)
        x = self.layer3(x)              # No checkpointing for layer3
        return x

Gradient Checkpointing: Implementation

import torch

class CheckpointFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, run_function, *args):
        ctx.run_function = run_function
        ctx.save_for_backward(*args)  # save *args for backward
        with torch.no_grad():         # do not keep intermediate activations
            outputs = run_function(*args)
        return outputs

    @staticmethod
    def backward(ctx, *grad_outputs):
        inputs = ctx.saved_tensors
        with torch.enable_grad():
            # Re-run the forward pass, this time recording the graph
            inputs = [x.detach().requires_grad_(True) for x in inputs]
            outputs = ctx.run_function(*inputs)
        # backward() also accumulates gradients into the parameters used inside
        # run_function; autograd.grad w.r.t. the inputs alone would drop them
        torch.autograd.backward(outputs, grad_outputs)
        return (None, *(x.grad for x in inputs))

def checkpoint(run_function, *args):
    return CheckpointFunction.apply(run_function, *args)

LoRA: Low-Rank Adaptation of Linear Layers

LoRA: Mathematics

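  • Problem: the weight update $\Delta W \in \mathbb{R}^{d_{out} \times d_{in}}$ is too large.
  • Solution: factor it as $\Delta W = BA$ with $B \in \mathbb{R}^{d_{out} \times r}$, $A \in \mathbb{R}^{r \times d_{in}}$, $r \ll d_{in}, d_{out}$, and use associativity of matrix multiplication: $(W + BA)\,x = Wx + B(Ax)$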

Exercise: Calculate LoRA Cost

  • Original:
  • LoRA:
  • Cost of matrix multiplication

  • (1). Computation cost of:
  • (2). Computation cost of:
  • (3). Computation cost of:
  • (4). What's the cost ratio of LoRA to the original?
  • (5). What's the memory cost ratio of LoRA to the original?

Exercise: Calculate LoRA Cost

  • Assuming and

  • Computation cost of:

  • Computation cost of:

  • Computation cost of:

  • What's the cost ratio of LoRA to the original?

  • What's the memory cost ratio of LoRA to the original?
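
A plain-Python sketch of the cost comparison. The slide's concrete sizes are not shown, so the values below are assumptions borrowed from the implementation example (a 4096 x 4096 layer with r=32), and cost is counted as multiply-adds.

```python
def matmul_cost(m, n, p):
    """Multiply-add count for an (m x n) @ (n x p) product."""
    return m * n * p


batch, d_in, d_out, r = 1, 4096, 4096, 32   # assumed sizes

cost_original = matmul_cost(batch, d_in, d_out)                         # x @ W
cost_lora = matmul_cost(batch, d_in, r) + matmul_cost(batch, r, d_out)  # (x @ A) @ B

params_original = d_in * d_out
params_lora = d_in * r + r * d_out

print(cost_lora / cost_original)       # 0.015625
print(params_lora / params_original)   # 0.015625
```

Both ratios equal r (d_in + d_out) / (d_in d_out), about 1.6% here. The LoRA branch is computed in addition to the frozen base layer, so the saving is in trainable parameters, gradients and optimizer states, not in forward-pass compute.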

LoRA: Implementation


class LoRAWrapper(nn.Module):
    def __init__(self, base_layer, r=32):
        super(LoRAWrapper, self).__init__()
        self.base_layer = base_layer
        self.r = r  # Rank of the low-rank approximation
        # Freeze the pretrained layer; only the low-rank layers are trained
        for p in self.base_layer.parameters():
            p.requires_grad_(False)
        # Create low-rank layers
        self.lora_A = nn.Linear(base_layer.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_layer.out_features, bias=False)

        nn.init.kaiming_uniform_(self.lora_A.weight)
        # Zero-init B so the wrapped layer initially behaves exactly like base_layer
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        return self.base_layer(x) + self.lora_B(self.lora_A(x))
  

# Define a basic model
class BasicModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(BasicModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.linear(x)

# Instantiate the model
basic_model = BasicModel(4096, 4096)

basic_model.linear = LoRAWrapper(basic_model.linear, r=32)

# Forward pass with the modified model
x = torch.randn(8, 4096)  # example batch of 8 inputs
output_with_lora = basic_model(x)

import torch.nn as nn

def patch_model_with_lora(model, r=32):
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            setattr(model, name, LoRAWrapper(module, r=r))
        else:
            patch_model_with_lora(module, r)  # Recursively apply to child modules
    return model

model = SimpleModel()
patched_model = patch_model_with_lora(model, r=32)

(We should be able to apply this to Llama 3.x)

Applying LoRA to Self-Attention

General Multi-GPU Parallelization Paradigms

  • DP: Data Parallelization
  • TP: Tensor Parallelization
  • PP: Pipeline Parallelization

DP: Data Parallelization

TP: Tensor Parallelization

TP: MLP

TP: Self-Attention

PP: Pipeline Parallelization

DeepSpeed

  • Microsoft Research
  • Reducing memory footprint for training large models
  • Compatible with HuggingFace's accelerate
  • ZeRO: Zero Redundancy Optimizer
    • Stage 1 -> Stage 2 -> Stage 3

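What is in the optimizer states?

In addition to parameters and gradients, maintain $m_t$ and $v_t$ for Adam: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$

  • Bias-corrected updates: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

  • Parameter update: $\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$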

DeepSpeed ZeRO Stages

  • Stage 1: Optimizer states are partitioned

    • Each process updates only its partition of the optimizer states
    • Shared forward & backward passes
  • Stage 2: Gradients are also partitioned

    • Each process only retains its partition of the gradients
    • Forward pass shared, backward pass partitioned
  • Stage 3: Model parameters are also partitioned

    • Model parameters are partitioned
    • Forward pass also partitioned
  • Time cost increases with each stage due to more communication.

DeepSpeed ZeRO Savings
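
A rough per-GPU memory estimate in plain Python. It assumes the accounting commonly used for mixed-precision Adam (2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 weights, momentum and variance) and ignores activations and buffers. The function name `zero_memory_gb` and the example sizes (7.5B parameters, 64 GPUs) are assumptions for illustration.

```python
def zero_memory_gb(num_params, num_gpus, stage, k=12):
    """Per-GPU memory for parameters, gradients and optimizer states, in GB (1e9 bytes)."""
    params = 2 * num_params        # 16-bit parameters
    grads = 2 * num_params         # 16-bit gradients
    opt = k * num_params           # fp32 parameters, momentum and variance (Adam)
    if stage >= 1:
        opt /= num_gpus            # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus          # Stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus         # Stage 3: also partition parameters
    return (params + grads + opt) / 1e9


for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

Under these assumptions the estimate drops from 120 GB without ZeRO to about 31.4, 16.6 and 1.9 GB for stages 1, 2 and 3.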

Adapting Training Script

from accelerate import Accelerator

accelerator = Accelerator()

training_args = TrainingArguments(
  ...
  bf16=True,
  deepspeed = "deepspeed.json",
  ...
)

trainer = Trainer(model=model, args=training_args)
trainer.train()

unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir,
  is_main_process=accelerator.is_main_process,
  save_function=accelerator.save)        

deepspeed.json

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "none"
    },
    "allgather_partitions": true,
    "reduce_scatter": true,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "train_batch_size": 40,
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 8,
  "bf16": {
    "enabled": true
  }
}

deepspeed.yaml

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deepspeed.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 5
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The Commandline

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
accelerate launch --config_file=deepspeed.yaml ./train.py ......

**Goal:** Provide a strong foundation in how deep learning models are built, how they learn from data via gradients and backpropagation, how to choose and measure progress with loss functions and metrics, how to prepare data efficiently, and how regularization and initialization techniques influence training.
