Homework: Backpropagation Derivation for A Three-Layer Feedforward Neural Network (Multi-Class)¶
We consider a three-layer feedforward neural network with the following setup (the same architecture as in the binary backpropagation homework, except that the output layer has 5 neurons because we perform 5-class classification):
Architecture¶
Input layer:
$$ \mathbf{x} \in \mathbb{R}^4 $$
Hidden layer 1: 3 neurons, ReLU activation
Hidden layer 2: 2 neurons, ReLU activation
Output layer: 5 neurons, softmax activation
Parameters¶
First hidden layer:
$$ W_1 \in \mathbb{R}^{3 \times 4}, \quad \mathbf{b}_1 \in \mathbb{R}^3 $$
Second hidden layer:
$$ W_2 \in \mathbb{R}^{2 \times 3}, \quad \mathbf{b}_2 \in \mathbb{R}^2 $$
Output layer:
$$ W_3 \in \mathbb{R}^{5 \times 2}, \quad \mathbf{b}_3 \in \mathbb{R}^5 $$
Activation function (ReLU):
$$ \text{ReLU}(z) = \max(0, z) $$
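For concreteness, a minimal NumPy sketch of ReLU and the elementwise derivative convention used in the backward pass below (the helper names `relu` and `relu_grad` are our own):

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Elementwise derivative 1_{z > 0} (taking the subgradient 0 at z = 0)."""
    return (z > 0).astype(float)
```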
Forward Pass¶
Step 1 – Hidden layer 1
$$ \mathbf{z}_{(1)} = W_1 \mathbf{x} + \mathbf{b}_1 \quad \in \mathbb{R}^3 $$
$$ \mathbf{a}_{(1)} = \text{ReLU}(\mathbf{z}_{(1)}) \quad \in \mathbb{R}^3 $$
Step 2 – Hidden layer 2
$$ \mathbf{z}_{(2)} = W_2 \mathbf{a}_{(1)} + \mathbf{b}_2 \quad \in \mathbb{R}^2 $$
$$ \mathbf{a}_{(2)} = \text{ReLU}(\mathbf{z}_{(2)}) \quad \in \mathbb{R}^2 $$
Step 3 – Output layer
$$ \mathbf{z}_{(3)} = W_3 \mathbf{a}_{(2)} + \mathbf{b}_3 \quad \in \mathbb{R}^5 $$
Predicted output: $\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}_{(3)})$, where $\hat{y}_i = \frac{e^{z_i}}{\sum_{k=1}^5 e^{z_k}}, \quad i = 1,\dots,5$, and $z_i$ denotes the $i$-th component of $\mathbf{z}_{(3)}$
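A minimal NumPy sketch of this single-sample forward pass (the parameter values are random placeholders, since no concrete numbers are given in the homework):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters with the stated shapes.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
W3, b3 = rng.normal(size=(5, 2)), np.zeros(5)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def forward(x):
    z1 = W1 @ x + b1                 # (3,)
    a1 = np.maximum(0.0, z1)         # ReLU
    z2 = W2 @ a1 + b2                # (2,)
    a2 = np.maximum(0.0, z2)         # ReLU
    z3 = W3 @ a2 + b3                # (5,)
    return z1, a1, z2, a2, z3, softmax(z3)

x = rng.normal(size=4)
print(forward(x)[-1])                # 5 class probabilities summing to 1
```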
Tasks¶
Forward Pass: Write down the expression for the loss function $L$.
Backward Pass (Gradients): Compute the following gradients. $$ \frac{\partial L}{\partial W_3},\ \frac{\partial L}{\partial \mathbf{b}_3},\quad \frac{\partial L}{\partial W_2},\ \frac{\partial L}{\partial \mathbf{b}_2},\quad \frac{\partial L}{\partial W_1},\ \frac{\partial L}{\partial \mathbf{b}_1} $$
Batch Version of Forward Pass and Backward Pass: Extend the single-sample forward and backward derivations to the batch case, where the network processes a batch of $m$ input samples in parallel. Let the parameters be as defined earlier.
Inputs (each row is a sample): $$ X = \begin{bmatrix} \mathbf{x}^{(1)\top} \\ \mathbf{x}^{(2)\top} \\ \vdots \\ \mathbf{x}^{(m)\top} \end{bmatrix} \in \mathbb{R}^{m \times 4} $$
Targets (class labels): $$ \mathbf{y} = \big[ y^{(1)}, y^{(2)}, \dots, y^{(m)} \big]^\top \in \mathbb{R}^{m}, \qquad y^{(i)} \in \{1,\dots,5\}. $$
Answer¶
1. Loss Functions¶
Let the target be one-hot $\mathbf{y}\in\mathbb{R}^5$ (i.e., $y_c=1$ for the correct class $c$, else $0$). With $\hat{\mathbf{y}}=\mathrm{softmax}(\mathbf{z}_{(3)})$, $$ L = -\sum_{i=1}^{5} y_i \log \hat y_i $$
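As a quick numeric illustration (the probabilities below are made up), the one-hot cross-entropy picks out $-\log \hat y_c$ for the correct class $c$:

```python
import numpy as np

def cross_entropy(y_onehot, y_hat):
    # For a one-hot target this reduces to -log(y_hat[c]) for the true class c.
    return -np.sum(y_onehot * np.log(y_hat))

y_hat = np.array([0.1, 0.2, 0.5, 0.1, 0.1])   # example softmax output
y     = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # correct class c = 3
print(cross_entropy(y, y_hat))                # -log(0.5) ≈ 0.693
```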
2. Backward Pass (Gradients)¶
Let $\mathbf{1}_{(\cdot)}$ denote the indicator function. The derivative of ReLU can be expressed as
$$ \text{ReLU}'(z)=\mathbf{1}_{z>0}. $$
The gradient of the loss with respect to the raw scores $\mathbf{z}_{(3)}$ reduces to (the same form as in the binary-classification and regression homework) $$ \boxed{ \boldsymbol{\delta}_{(3)}=\frac{\partial L}{\partial \mathbf{z}_{(3)}}=\hat{\mathbf{y}}-\mathbf{y} } $$ To derive this, first compute $$ \frac{\partial L}{\partial \hat y_i} = -\frac{y_i}{\hat y_i}, \qquad \frac{\partial \hat y_i}{\partial z_j}=\hat y_i(\mathbf{1}_{i=j}-\hat y_j) \quad\text{(directly by the quotient rule)}. $$ Thus $$ \frac{\partial L}{\partial z_j} =\sum_{i=1}^{5}\frac{\partial L}{\partial \hat y_i}\frac{\partial \hat y_i}{\partial z_j} =\sum_{i=1}^{5}\left(-\frac{y_i}{\hat y_i}\right)\hat y_i(\mathbf{1}_{i=j}-\hat y_j) =-\sum_{i=1}^{5}y_i(\mathbf{1}_{i=j}-\hat y_j) =\hat y_j\sum_{i=1}^{5}y_i-y_j =\hat y_j-y_j, $$ where the last step uses $\sum_{i=1}^{5} y_i = 1$ for a one-hot target.
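The boxed identity is easy to sanity-check numerically. A minimal finite-difference sketch (random scores and an arbitrary target class; the helper names are our own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(1)
z3 = rng.normal(size=5)              # raw scores z_(3)
y = np.eye(5)[2]                     # one-hot target

analytic = softmax(z3) - y           # delta_(3) = y_hat - y

# Central finite differences on each component of z_(3).
eps = 1e-6
numeric = np.array([(loss(z3 + eps * e, y) - loss(z3 - eps * e, y)) / (2 * eps)
                    for e in np.eye(5)])

print(np.max(np.abs(analytic - numeric)))   # tiny, ~1e-9 or smaller
```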
Output layer ($W_3\in\mathbb{R}^{5\times 2},\, \mathbf{b}_3\in\mathbb{R}^5$)¶
$$ \frac{\partial L}{\partial W_3}=\boldsymbol{\delta}_{(3)}\,\mathbf{a}_{(2)}^{\!\top},\qquad \frac{\partial L}{\partial \mathbf{b}_3}=\boldsymbol{\delta}_{(3)}. $$
Hidden layer 2 (ReLU)¶
$$ \boldsymbol{\delta}_{(2)} = \frac{\partial L}{\partial \mathbf{z}_{(2)}} = \frac{\partial L}{\partial \mathbf{z}_{(3)}} \cdot \frac{\partial \mathbf{z}_{(3)}}{\partial \mathbf{a}_{(2)}} \cdot \frac{\partial \mathbf{a}_{(2)}}{\partial \mathbf{z}_{(2)}} = \big(W_3^{\!\top}\boldsymbol{\delta}_{(3)}\big)\;\odot\;\mathbf{1}_{\mathbf{z}_{(2)}>0}, $$ $$ \frac{\partial L}{\partial W_2}=\boldsymbol{\delta}_{(2)}\,\mathbf{a}_{(1)}^{\!\top},\qquad \frac{\partial L}{\partial \mathbf{b}_2}=\boldsymbol{\delta}_{(2)}. $$
Hidden layer 1 (ReLU)¶
$$ \boldsymbol{\delta}_{(1)} = \frac{\partial L}{\partial \mathbf{z}_{(1)}} = \frac{\partial L}{\partial \mathbf{z}_{(2)}} \cdot \frac{\partial \mathbf{z}_{(2)}}{\partial \mathbf{a}_{(1)}} \cdot \frac{\partial \mathbf{a}_{(1)}}{\partial \mathbf{z}_{(1)}} = \big(W_2^{\!\top}\boldsymbol{\delta}_{(2)}\big)\;\odot\;\mathbf{1}_{\mathbf{z}_{(1)}>0}, $$ $$ \frac{\partial L}{\partial W_1}=\boldsymbol{\delta}_{(1)}\,\mathbf{x}^{\!\top},\qquad \frac{\partial L}{\partial \mathbf{b}_1}=\boldsymbol{\delta}_{(1)}. $$
Note: $\odot$ denotes elementwise multiplication (Hadamard product).
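Putting the three layers together, a self-contained NumPy sketch of the single-sample backward pass (parameters, input, and target are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
W3, b3 = rng.normal(size=(5, 2)), np.zeros(5)

x = rng.normal(size=4)
y = np.eye(5)[1]                      # one-hot target

# Forward pass, caching the intermediates needed for backprop.
z1 = W1 @ x + b1;  a1 = np.maximum(0.0, z1)
z2 = W2 @ a1 + b2; a2 = np.maximum(0.0, z2)
z3 = W3 @ a2 + b3
y_hat = np.exp(z3 - z3.max()); y_hat /= y_hat.sum()

# Backward pass, following the formulas above.
d3 = y_hat - y                        # delta_(3)
dW3, db3 = np.outer(d3, a2), d3
d2 = (W3.T @ d3) * (z2 > 0)           # delta_(2)
dW2, db2 = np.outer(d2, a1), d2
d1 = (W2.T @ d2) * (z1 > 0)           # delta_(1)
dW1, db1 = np.outer(d1, x), d1

print(dW1.shape, dW2.shape, dW3.shape)  # (3, 4) (2, 3) (5, 2)
```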
3. Batch Version of Forward Pass and Backward Pass¶
We compute the forward pass for the entire batch; $\mathbf{1}\in\mathbb{R}^{m\times 1}$ denotes the all-ones column vector used to broadcast the bias terms across samples.
Layer 1 (Hidden 1): $$ \mathbf{z}^{(i)}_{(1)} = W_1 \mathbf{x}^{(i)} + \mathbf{b}_1 \quad \in \mathbb{R}^3 $$ $$ \mathbf{a}^{(i)}_{(1)} = \text{ReLU}(\mathbf{z}^{(i)}_{(1)}) \quad \in \mathbb{R}^3 $$ $$ Z_{(1)} = X W_1^\top + \mathbf{1}\mathbf{b}_1^\top \in \mathbb{R}^{m \times 3} $$ $$ A_{(1)} = \text{ReLU}(Z_{(1)}) \in \mathbb{R}^{m \times 3} $$
Layer 2 (Hidden 2): $$ \mathbf{z}^{(i)}_{(2)} = W_2 \mathbf{a}^{(i)}_{(1)} + \mathbf{b}_2 \quad \in \mathbb{R}^2 $$ $$ \mathbf{a}^{(i)}_{(2)} = \text{ReLU}(\mathbf{z}^{(i)}_{(2)}) \quad \in \mathbb{R}^2 $$ $$ Z_{(2)} = A_{(1)} W_2^\top + \mathbf{1}\mathbf{b}_2^\top \in \mathbb{R}^{m \times 2} $$ $$ A_{(2)} = \text{ReLU}(Z_{(2)}) \in \mathbb{R}^{m \times 2} $$
Output Layer: $$ \mathbf{z}^{(i)}_{(3)} = W_3 \mathbf{a}^{(i)}_{(2)} + \mathbf{b}_3 \quad \in \mathbb{R}^5 $$ $$ Z_{(3)} = A_{(2)}W_3^{\top} + \mathbf{1}\mathbf{b}_3^{\top}\ \in \mathbb{R}^{m\times 5} $$ Predicted outputs: $$ \hat Y = \mathrm{softmax}(Z_{(3)})\ \in \mathbb{R}^{m\times 5} \ \text{(apply softmax for every row)} $$ that is, for the position $(i, k)$ in $\hat Y$: $$ \hat Y_{i, k}=\frac{\exp(Z_{(3),i,k})}{\sum_{c=1}^{5}\exp(Z_{(3),i,c})}. $$
Loss function (mean cross-entropy over the batch): $$ L = \frac{1}{m} \sum_{i=1}^{m} L^{(i)} = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{5} Y_{i,k} \log \hat Y_{i,k}, $$ where $Y \in \mathbb{R}^{m\times 5}$ is the one-hot target matrix ($Y_{i,k}=1$ if sample $i$ belongs to class $k$, else $0$).
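A minimal batched NumPy sketch of the forward pass and the mean cross-entropy loss (the batch size, inputs, and parameters are arbitrary placeholders; NumPy broadcasting plays the role of the all-ones vector $\mathbf{1}$):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8                                             # hypothetical batch size

W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
W3, b3 = rng.normal(size=(5, 2)), np.zeros(5)

X = rng.normal(size=(m, 4))                       # one sample per row
labels = rng.integers(0, 5, size=m)               # integer class labels
Y = np.eye(5)[labels]                             # one-hot targets, (m, 5)

# Batched forward pass; adding b broadcasts the bias over all rows.
Z1 = X @ W1.T + b1;  A1 = np.maximum(0.0, Z1)     # (m, 3)
Z2 = A1 @ W2.T + b2; A2 = np.maximum(0.0, Z2)     # (m, 2)
Z3 = A2 @ W3.T + b3                               # (m, 5)

# Row-wise softmax.
E = np.exp(Z3 - Z3.max(axis=1, keepdims=True))
Y_hat = E / E.sum(axis=1, keepdims=True)          # (m, 5)

L = -np.mean(np.sum(Y * np.log(Y_hat), axis=1))   # mean cross-entropy
print(L)
```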
Backward Pass¶
Output Layer: The gradient of the loss with respect to the raw scores $Z_{(3)}$ reduces to (the single-sample result, scaled by the $1/m$ factor from the batch average) $$\boxed{\boldsymbol{\delta}_{(3)} = \frac{\partial L}{\partial Z_{(3)}} = \frac{1}{m}(\hat Y - Y)}$$
This matches the single-sample formula: for each sample $i$, $$ \boldsymbol{\delta}^{(i)}_{(3)} = \frac{\partial L}{\partial \mathbf{z}^{(i)}_{(3)}} = \frac{1}{m}\frac{\partial L^{(i)}}{\partial \mathbf{z}^{(i)}_{(3)}} = \frac{1}{m}\big(\hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)}\big), $$ where $\hat{\mathbf{y}}^{(i)} = \hat Y_{i,:}^{\top}$ and $\mathbf{y}^{(i)} = Y_{i,:}^{\top}$ are the $i$-th rows of $\hat Y$ and $Y$ written as column vectors. Stacking the (transposed) per-sample gradients as rows gives the whole-batch matrix: $$ \frac{\partial L}{\partial Z_{(3)}} = \begin{bmatrix} \boldsymbol{\delta}^{(1)\top}_{(3)} \\ \boldsymbol{\delta}^{(2)\top}_{(3)} \\ \vdots \\ \boldsymbol{\delta}^{(m)\top}_{(3)} \end{bmatrix} = \frac{1}{m} (\hat{Y} - Y). $$
- Gradients for $W_3$, $\mathbf{b}_3$ $$\frac{\partial L}{\partial W_3} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(3)} \cdot \mathbf{a}^{(i)}_{(2)}{}^\top,\quad \frac{\partial L}{\partial b_3} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(3)}$$ $$\frac{\partial L}{\partial W_3} = \boldsymbol{\delta}_{(3)}^\top A_{(2)},\quad \frac{\partial L}{\partial b_3} = \mathbf{1}^\top \boldsymbol{\delta}_{(3)}$$
Hidden Layer 2 $$\boldsymbol{\delta}^{(i)}_{(2)} = \big(W_3^{\top}\boldsymbol{\delta}^{(i)}_{(3)}\big) \odot \mathbf{1}_{\mathbf{z}^{(i)}_{(2)} > 0}$$ $$\boldsymbol{\delta}_{(2)} = (\boldsymbol{\delta}_{(3)} W_3) \odot \mathbf{1}_{Z_{(2)} > 0} \in \mathbb{R}^{m \times 2}$$
- Gradients for $W_2$, $\mathbf{b}_2$ $$\frac{\partial L}{\partial W_2} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(2)} \cdot \mathbf{a}^{(i)}_{(1)}{}^\top,\quad \frac{\partial L}{\partial \mathbf{b}_2} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(2)}$$ $$\frac{\partial L}{\partial W_2} = \boldsymbol{\delta}_{(2)}^\top A_{(1)},\quad \frac{\partial L}{\partial \mathbf{b}_2} = \mathbf{1}^\top \boldsymbol{\delta}_{(2)}$$
Hidden Layer 1 $$\boldsymbol{\delta}^{(i)}_{(1)} = \big(W_2^{\top}\boldsymbol{\delta}^{(i)}_{(2)}\big) \odot \mathbf{1}_{\mathbf{z}^{(i)}_{(1)} > 0}$$ $$\boldsymbol{\delta}_{(1)} = (\boldsymbol{\delta}_{(2)} W_2) \odot \mathbf{1}_{Z_{(1)} > 0} \in \mathbb{R}^{m \times 3}$$
- Gradients for $W_1$, $\mathbf{b}_1$ $$\frac{\partial L}{\partial W_1} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(1)} \cdot \mathbf{x}^{(i)}{}^\top,\quad \frac{\partial L}{\partial \mathbf{b}_1} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(1)}$$ $$\frac{\partial L}{\partial W_1} = \boldsymbol{\delta}_{(1)}^\top X,\quad \frac{\partial L}{\partial \mathbf{b}_1} = \mathbf{1}^\top \boldsymbol{\delta}_{(1)}$$
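Finally, a self-contained NumPy sketch of the batched backward pass using the matrix formulas above (again with random placeholder data; `D1`, `D2`, `D3` hold the row-stacked deltas):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8

W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
W3, b3 = rng.normal(size=(5, 2)), np.zeros(5)

X = rng.normal(size=(m, 4))
Y = np.eye(5)[rng.integers(0, 5, size=m)]         # one-hot targets, (m, 5)

# Batched forward pass (same as the sketch above).
Z1 = X @ W1.T + b1;  A1 = np.maximum(0.0, Z1)
Z2 = A1 @ W2.T + b2; A2 = np.maximum(0.0, Z2)
Z3 = A2 @ W3.T + b3
E = np.exp(Z3 - Z3.max(axis=1, keepdims=True))
Y_hat = E / E.sum(axis=1, keepdims=True)

# Batched backward pass.
D3 = (Y_hat - Y) / m                              # (m, 5)
dW3, db3 = D3.T @ A2, D3.sum(axis=0)              # (5, 2), (5,)
D2 = (D3 @ W3) * (Z2 > 0)                         # (m, 2)
dW2, db2 = D2.T @ A1, D2.sum(axis=0)              # (2, 3), (2,)
D1 = (D2 @ W2) * (Z1 > 0)                         # (m, 3)
dW1, db1 = D1.T @ X, D1.sum(axis=0)               # (3, 4), (3,)

print(dW3.shape, dW2.shape, dW1.shape)            # (5, 2) (2, 3) (3, 4)
```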