Backpropagation¶
Backpropagation is the algorithm that computes how the loss (error) changes with respect to each weight using the chain rule; the resulting gradients are then used to update the weights via gradient descent to minimize the loss. One training iteration consists of four steps:
- Forward Pass: Compute the output of the network given the input.
- Loss Calculation: Measure the error using a loss function (e.g., mean squared error, cross-entropy).
- Backward Pass: Calculate the gradients of the loss with respect to the weights and biases using the chain rule.
- Weight Update: Adjust the weights and biases using gradient descent.
Given a loss function $L$, the gradient of the loss with respect to a weight $w$ can be computed by the chain rule, multiplying the partial derivatives through the intermediate quantities $u_1, u_2, \dots, u_k$ (the activations and pre-activations) that lie between the output $\hat{y}$ and $w$: $$ \frac{\partial L(\hat{y}, y)}{\partial w} = \frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial u_1} \, \frac{\partial u_1}{\partial u_2} \cdots \frac{\partial u_k}{\partial w} $$ and the weight update is ($\alpha$ is the learning rate): $$w = w - \alpha \frac{\partial L(\hat{y}, y)}{\partial w}$$
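A minimal numeric sketch of these four steps for a single linear neuron $\hat{y} = w x$ with squared-error loss (plain Python; the toy values and variable names are illustrative only):

```python
# Toy example: one weight, one sample, squared-error loss L = (y_hat - y)^2
x, y = 2.0, 6.0      # single training example
w = 1.0              # initial weight
alpha = 0.1          # learning rate

for step in range(3):
    y_hat = w * x                      # 1. forward pass
    L = (y_hat - y) ** 2               # 2. loss calculation
    dL_dyhat = 2.0 * (y_hat - y)       # 3. backward pass: chain rule
    dyhat_dw = x                       #    dL/dw = dL/dy_hat * dy_hat/dw
    dL_dw = dL_dyhat * dyhat_dw
    w = w - alpha * dL_dw              # 4. weight update (gradient descent)
    print(f"step {step}: loss = {L:.4f}, w = {w:.4f}")
```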
Homework: Backpropagation Derivation for a Three-Layer Feedforward Neural Network¶
We consider a three-layer feedforward neural network with the following setup:
Architecture¶
Input layer:
$$ \mathbf{x} \in \mathbb{R}^4 $$
Hidden layer 1: 3 neurons, ReLU activation
Hidden layer 2: 2 neurons, ReLU activation
Output layer: 1 neuron
- Identity activation → regression
- Sigmoid $\sigma$ activation → classification
Parameters¶
First hidden layer:
$$ W_1 \in \mathbb{R}^{3 \times 4}, \quad \mathbf{b}_1 \in \mathbb{R}^3 $$
Second hidden layer:
$$ W_2 \in \mathbb{R}^{2 \times 3}, \quad \mathbf{b}_2 \in \mathbb{R}^2 $$
Output layer:
$$ W_3 \in \mathbb{R}^{1 \times 2}, \quad b_3 \in \mathbb{R} $$
Activation function (ReLU):
$$ \text{ReLU}(z) = \max(0, z) $$
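For reference, ReLU and the elementwise derivative you will need in the backward pass could be written as follows (a NumPy sketch; the function names are our own):

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Elementwise derivative of ReLU: 1 where z > 0, 0 otherwise."""
    return (z > 0).astype(float)
```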
Forward Pass¶
Step 1 – Hidden layer 1
$$ \mathbf{z}_{(1)} = W_1 \mathbf{x} + \mathbf{b}_1 \quad \in \mathbb{R}^3 $$
$$ \mathbf{a}_{(1)} = \text{ReLU}(\mathbf{z}_{(1)}) \quad \in \mathbb{R}^3 $$
Step 2 – Hidden layer 2
$$ \mathbf{z}_{(2)} = W_2 \mathbf{a}_{(1)} + \mathbf{b}_2 \quad \in \mathbb{R}^2 $$
$$ \mathbf{a}_{(2)} = \text{ReLU}(\mathbf{z}_{(2)}) \quad \in \mathbb{R}^2 $$
Step 3 – Output layer
$$ z_{(3)} = W_3 \mathbf{a}_{(2)} + b_3 \quad \in \mathbb{R} $$
Predicted output:
- Regression: $\hat{y} = z_{(3)}$
- Classification: $\hat{y} = \sigma(z_{(3)}) =\frac{1}{1+e^{-z_{(3)}}}$
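A NumPy sketch of this forward pass with the shapes given above (random weights are used only to make the snippet runnable; `y_hat_reg` and `y_hat_clf` are illustrative names for the two output choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters with the shapes from the setup
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)   # hidden layer 1
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)   # hidden layer 2
W3, b3 = rng.standard_normal((1, 2)), 0.0            # output layer

x = rng.standard_normal(4)                            # input sample in R^4

# Step 1 - hidden layer 1
z1 = W1 @ x + b1                  # shape (3,)
a1 = np.maximum(0.0, z1)          # ReLU
# Step 2 - hidden layer 2
z2 = W2 @ a1 + b2                 # shape (2,)
a2 = np.maximum(0.0, z2)          # ReLU
# Step 3 - output layer
z3 = float(W3 @ a2 + b3)          # scalar

y_hat_reg = z3                             # regression: identity activation
y_hat_clf = 1.0 / (1.0 + np.exp(-z3))      # classification: sigmoid
```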
Tasks¶
Forward Pass: Write down the expression of the loss function $L$ for both regression and binary classification. Use the MSE loss for regression and the cross-entropy loss for binary classification.
Backward Pass (Gradients): Compute the following gradients for both cases. $$ \frac{\partial L}{\partial W_3},\ \frac{\partial L}{\partial b_3},\quad \frac{\partial L}{\partial W_2},\ \frac{\partial L}{\partial \mathbf{b}_2},\quad \frac{\partial L}{\partial W_1},\ \frac{\partial L}{\partial \mathbf{b}_1} $$
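Once you have derived these gradients, you can sanity-check them numerically with a central finite-difference estimate. The sketch below assumes you wrap your own forward pass and loss in a zero-argument closure; all names here are illustrative, not part of the assignment:

```python
import numpy as np

def numerical_grad(loss_fn, W, eps=1e-6):
    """Central-difference estimate of dL/dW.

    loss_fn: zero-argument callable that recomputes the loss from the
             current contents of W (e.g. a closure over the forward pass).
    W:       parameter array, perturbed in place one entry at a time.
    """
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        orig = W[idx]
        W[idx] = orig + eps
        loss_plus = loss_fn()
        W[idx] = orig - eps
        loss_minus = loss_fn()
        W[idx] = orig                                   # restore the entry
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Usage idea: compare numerical_grad(lambda: my_loss(), W2) to your analytic dL/dW2.
```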
Batch Version of Forward Pass and Backward Pass: Extend the single-sample forward and backward derivations to the batch case, where the network processes a batch of $m$ input samples in parallel. Let the parameters be as defined earlier.
Inputs (each row is a sample): $$ X = \big[ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(m)} \big]^\top \in \mathbb{R}^{m \times 4} $$
Targets: $$ \mathbf{y} = \big[ y^{(1)}, y^{(2)}, \dots, y^{(m)} \big] \in \mathbb{R}^{m} $$
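Because each row of $X$ is one sample, the batched forward pass multiplies by the transposed weight matrices. A NumPy sketch (shapes in comments; the batch size $m = 5$ and random weights are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                               # batch size (illustrative)

X = rng.standard_normal((m, 4))                     # one sample per row
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)
W3, b3 = rng.standard_normal((1, 2)), 0.0

Z1 = X @ W1.T + b1                                  # (m, 3)
A1 = np.maximum(0.0, Z1)                            # ReLU
Z2 = A1 @ W2.T + b2                                 # (m, 2)
A2 = np.maximum(0.0, Z2)                            # ReLU
Z3 = A2 @ W3.T + b3                                 # (m, 1)

y_hat_reg = Z3[:, 0]                                # regression outputs, shape (m,)
y_hat_clf = 1.0 / (1.0 + np.exp(-Z3[:, 0]))         # classification outputs, shape (m,)
```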