Backpropagation Derivation Answer
1. Loss Functions
Part A – Regression (MSE)
$$ L = \frac{1}{2}(y - \hat{y})^2 $$
Part B – Classification (Binary Cross-Entropy)
$$ L = -\Big[y \log(\hat{y}) + (1-y)\log(1-\hat{y})\Big] $$
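For concreteness, here is a minimal NumPy sketch of the two per-sample losses; the helper names `mse_loss` and `bce_loss` and the `eps` clipping are our own additions, not part of the derivation:

```python
import numpy as np

def mse_loss(y, y_hat):
    """Per-sample squared-error loss, L = 1/2 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2

def bce_loss(y, y_hat, eps=1e-12):
    """Per-sample binary cross-entropy; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```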
2. Backward Pass (Gradients)
Let $\mathbf{1}_{(\cdot)}$ denote the indicator function. The derivative of ReLU can be expressed as
$$ \text{ReLU}'(z)=\mathbf{1}_{z>0}. $$
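A matching sketch of ReLU and its (sub)gradient, with the indicator implemented as a boolean mask cast to float (the function names are our own):

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def relu_prime(z):
    """Indicator 1_{z > 0}, the (sub)gradient of ReLU used below."""
    return (z > 0).astype(float)
```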
Whether we use MSE with an identity output (regression) or sigmoid with BCE (classification), the output-layer error term can be unified as $$ \delta_{(3)} \;=\; \frac{\partial L}{\partial z_{(3)}} \;=\; \hat y - y. $$
$$ \boxed{ \begin{aligned} &\textbf{For regression:}\\ &\quad \hat y=z_{(3)},\quad L=\tfrac12(\hat y-y)^2=\tfrac12(z_{(3)}-y)^2\\ &\quad \text{Differentiating w.r.t. }z_{(3)}\text{:}\\ &\quad \frac{\partial L}{\partial z_{(3)}}=\frac{\partial}{\partial z_{(3)}}\tfrac12(z_{(3)}-y)^2 =(z_{(3)}-y)=\hat y-y.\\[1em] &\textbf{For binary classification:}\\ &\quad \frac{\partial L}{\partial \hat y} = -\left(\frac{y}{\hat y}-\frac{1-y}{1-\hat y}\right) = \frac{\hat y - y}{\hat y(1-\hat y)}.\\ &\quad \text{Since the sigmoid derivative is }\frac{\partial \hat y}{\partial z_{(3)}}=\sigma'(z_{(3)})=\hat y(1-\hat y),\\ &\quad \text{the chain rule gives}\\ &\quad \frac{\partial L}{\partial z_{(3)}} =\frac{\partial L}{\partial \hat y}\cdot\frac{\partial \hat y}{\partial z_{(3)}} =\frac{\hat y - y}{\hat y(1-\hat y)}\cdot \hat y(1-\hat y) =\hat y-y. \end{aligned} } $$
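As a sanity check on the unified error term, a quick finite-difference comparison at an arbitrary test point (the values of `z3` and `y` below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z3, eps = 0.7, 1e-6

# Regression: identity output + MSE, L = 1/2 * (z3 - y)^2
y = 1.3
numeric = (0.5 * (z3 + eps - y) ** 2 - 0.5 * (z3 - eps - y) ** 2) / (2 * eps)
print(numeric, z3 - y)           # both ~ -0.6, i.e. y_hat - y

# Classification: sigmoid output + BCE
y = 1.0
def bce(z):
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
numeric = (bce(z3 + eps) - bce(z3 - eps)) / (2 * eps)
print(numeric, sigmoid(z3) - y)  # both ~ -0.332, i.e. y_hat - y
```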
Output layer ($W_3\in\mathbb{R}^{1\times 2},\, b_3\in\mathbb{R}$)
$$ \frac{\partial L}{\partial W_3}=\delta_{(3)}\,\mathbf{a}_{(2)}^{\!\top},\qquad \frac{\partial L}{\partial b_3}=\delta_{(3)}. $$
Hidden layer 2 (ReLU)
$$ \boldsymbol{\delta}_{(2)} = \frac{\partial L}{\partial \mathbf{z}_{(2)}} = \frac{\partial L}{\partial z_{(3)}} \cdot \frac{\partial z_{(3)}}{\partial \mathbf{a}_{(2)}} \cdot \frac{\partial \mathbf{a}_{(2)}}{\partial \mathbf{z}_{(2)}} = \big(W_3^{\!\top}\delta_{(3)}\big)\;\odot\;\mathbf{1}_{\mathbf{z}_{(2)}>0}, $$ $$ \frac{\partial L}{\partial W_2}=\boldsymbol{\delta}_{(2)}\,\mathbf{a}_{(1)}^{\!\top},\qquad \frac{\partial L}{\partial \mathbf{b}_2}=\boldsymbol{\delta}_{(2)}. $$
Hidden layer 1 (ReLU)
$$ \boldsymbol{\delta}_{(1)} = \frac{\partial L}{\partial \mathbf{z}_{(1)}} = \frac{\partial L}{\partial \mathbf{z}_{(2)}} \cdot \frac{\partial \mathbf{z}_{(2)}}{\partial \mathbf{a}_{(1)}} \cdot \frac{\partial \mathbf{a}_{(1)}}{\partial \mathbf{z}_{(1)}} = \big(W_2^{\!\top}\boldsymbol{\delta}_{(2)}\big)\;\odot\;\mathbf{1}_{\mathbf{z}_{(1)}>0}, $$ $$ \frac{\partial L}{\partial W_1}=\boldsymbol{\delta}_{(1)}\,\mathbf{x}^{\!\top},\qquad \frac{\partial L}{\partial \mathbf{b}_1}=\boldsymbol{\delta}_{(1)}. $$
Note: $\odot$ denotes elementwise multiplication (Hadamard product).
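Putting Section 2 together, here is a minimal single-example sketch in NumPy, assuming a 4-dimensional input (the input size is not fixed by the derivation) and the regression head; variable names such as `d3` for $\delta_{(3)}$ are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 4                                   # assumed input size (not fixed above)
x = rng.normal(size=(n_in, 1))
y = 1.5

# Parameters for the 4 -> 3 -> 2 -> 1 architecture
W1, b1 = rng.normal(size=(3, n_in)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(2, 3)),    np.zeros((2, 1))
W3, b3 = rng.normal(size=(1, 2)),    np.zeros((1, 1))

relu = lambda z: np.maximum(0.0, z)

# Forward pass
z1 = W1 @ x + b1;  a1 = relu(z1)
z2 = W2 @ a1 + b2; a2 = relu(z2)
z3 = W3 @ a2 + b3; y_hat = z3              # regression: identity output

# Backward pass (formulas from Section 2)
d3 = y_hat - y                             # delta_(3) = y_hat - y
dW3, db3 = d3 @ a2.T, d3
d2 = (W3.T @ d3) * (z2 > 0)                # delta_(2)
dW2, db2 = d2 @ a1.T, d2
d1 = (W2.T @ d2) * (z1 > 0)                # delta_(1)
dW1, db1 = d1 @ x.T, d1
```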
3. Batch Version of Forward Pass and Backward Pass
Compute the forward pass for the entire batch, where $\mathbf{1}\in\mathbb{R}^{m\times 1}$ represents an all-ones column vector, used for broadcasting the bias terms.
Layer 1 (Hidden 1): $$ \mathbf{z}^{(i)}_{(1)} = W_1 \mathbf{x}^{(i)} + \mathbf{b}_1 \quad \in \mathbb{R}^3 $$ $$ \mathbf{a}^{(i)}_{(1)} = \text{ReLU}(\mathbf{z}^{(i)}_{(1)}) \quad \in \mathbb{R}^3 $$ $$ Z_{(1)} = X W_1^\top + \mathbf{1}\mathbf{b}_1^\top \in \mathbb{R}^{m \times 3} $$ $$ A_{(1)} = \text{ReLU}(Z_{(1)}) \in \mathbb{R}^{m \times 3} $$
Layer 2 (Hidden 2): $$ \mathbf{z}^{(i)}_{(2)} = W_2 \mathbf{a}^{(i)}_{(1)} + \mathbf{b}_2 \quad \in \mathbb{R}^2 $$ $$ \mathbf{a}^{(i)}_{(2)} = \text{ReLU}(\mathbf{z}^{(i)}_{(2)}) \quad \in \mathbb{R}^2 $$ $$ Z_{(2)} = A_{(1)} W_2^\top + \mathbf{1}\mathbf{b}_2^\top \in \mathbb{R}^{m \times 2} $$ $$ A_{(2)} = \text{ReLU}(Z_{(2)}) \in \mathbb{R}^{m \times 2} $$
Output Layer: $$ z^{(i)}_{(3)} = W_3 \mathbf{a}^{(i)}_{(2)} + b_3 \quad \in \mathbb{R} $$ $$ Z_{(3)} = A_{(2)} W_3^\top + b_3 \in \mathbb{R}^{m \times 1} $$
Predicted outputs:
- Regression: $$ \hat{y}_i = z^{(i)}_{(3)} $$ $$ \hat{\mathbf{y}} = Z_{(3)} \in \mathbb{R}^{m \times 1} $$
- Classification: $$ \hat{y}_i = \sigma(z^{(i)}_{(3)}) = \frac{1}{1 + e^{-z^{(i)}_{(3)}}} $$ $$ \hat{\mathbf{y}} = \sigma(Z_{(3)}) = \frac{1}{1 + e^{-Z_{(3)}}} \in \mathbb{R}^{m \times 1} $$
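A minimal batched forward-pass sketch under the same assumptions (batch size `m = 5` and input size `n_in = 4` are arbitrary choices for illustration); NumPy broadcasting of the bias vectors plays the role of the $\mathbf{1}\mathbf{b}^\top$ terms:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in = 5, 4                              # assumed batch size and input size
X = rng.normal(size=(m, n_in))              # rows are the examples x^(i)

W1, b1 = rng.normal(size=(3, n_in)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)),    np.zeros(2)
W3, b3 = rng.normal(size=(1, 2)),    np.zeros(1)

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Z1 = X @ W1.T + b1;  A1 = relu(Z1)          # (m, 3)
Z2 = A1 @ W2.T + b2; A2 = relu(Z2)          # (m, 2)
Z3 = A2 @ W3.T + b3                         # (m, 1)

Y_hat_reg = Z3                              # regression head
Y_hat_clf = sigmoid(Z3)                     # classification head
```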
Compute the loss over the batch.
- Regression (MSE): $$ L = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 = \frac{1}{m} \| \hat{\mathbf{y}} - \mathbf{y} \|^2 $$ Note that the per-sample $\tfrac{1}{2}$ factor from Part A is dropped here, which is why a factor of $\tfrac{2}{m}$ appears in $\boldsymbol{\delta}_{(3)}$ below.
- Binary Classification (Cross-Entropy): $$ L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$
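A small sketch of the two batch losses as functions, taking `y_hat` and `y` of shape $(m, 1)$ (the function names and the `eps` clipping are ours):

```python
import numpy as np

def batch_mse(y_hat, y):
    """Mean squared error over the batch: (1/m) * ||y_hat - y||^2."""
    return np.mean((y_hat - y) ** 2)

def batch_bce(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy over the batch; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```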
Compute the backward pass.
Output Layer
- Regression $$\boldsymbol{\delta}^{(i)}_{(3)} = \frac{2}{m} (\hat{y}_i - y_i)$$ $$\boldsymbol{\delta}_{(3)} = \frac{2}{m} (\hat{\mathbf{y}} - \mathbf{y}) \in \mathbb{R}^{m \times 1}$$
- Classification $$\boldsymbol{\delta}^{(i)}_{(3)} = \frac{1}{m} (\hat{y}_i - y_i)$$ $$\boldsymbol{\delta}_{(3)} = \frac{1}{m} (\hat{\mathbf{y}} - \mathbf{y}) \in \mathbb{R}^{m \times 1}$$
- Gradients for $W_3$, $b_3$ $$\frac{\partial L}{\partial W_3} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(3)} \cdot \mathbf{a}^{(i)}_{(2)}{}^\top,\quad \frac{\partial L}{\partial b_3} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(3)}$$ $$\frac{\partial L}{\partial W_3} = \boldsymbol{\delta}_{(3)}^\top A_{(2)},\quad \frac{\partial L}{\partial b_3} = \mathbf{1}^\top \boldsymbol{\delta}_{(3)}$$
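As a quick check that the matrix form $\boldsymbol{\delta}_{(3)}^\top A_{(2)}$ really equals the per-example sum, using random placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
D3 = rng.normal(size=(m, 1))                 # per-example output errors
A2 = rng.normal(size=(m, 2))                 # per-example hidden-2 activations

# Per-example sum vs. the single matrix product used in the batch formula
per_example = sum(D3[i] * A2[i:i+1] for i in range(m))   # (1, 2)
batched = D3.T @ A2                                       # (1, 2)
print(np.allclose(per_example, batched))                  # True
```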
Hidden Layer 2 $$\boldsymbol{\delta}^{(i)}_{(2)} = (W_3^\top \boldsymbol{\delta}^{(i)}_{(3)}) \odot \mathbf{1}_{\mathbf{z}^{(i)}_{(2)} > 0}$$ $$\boldsymbol{\delta}_{(2)} = (\boldsymbol{\delta}_{(3)} W_3) \odot \mathbf{1}_{Z_{(2)} > 0} \in \mathbb{R}^{m \times 2}$$
- Gradients for $W_2$, $\mathbf{b}_2$ $$\frac{\partial L}{\partial W_2} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(2)} \cdot \mathbf{a}^{(i)}_{(1)}{}^\top,\quad \frac{\partial L}{\partial \mathbf{b}_2} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(2)}$$ $$\frac{\partial L}{\partial W_2} = \boldsymbol{\delta}_{(2)}^\top A_{(1)},\quad \frac{\partial L}{\partial \mathbf{b}_2} = \mathbf{1}^\top \boldsymbol{\delta}_{(2)}$$
Hidden Layer 1 $$\boldsymbol{\delta}^{(i)}_{(1)} = (W_2^\top \boldsymbol{\delta}^{(i)}_{(2)}) \odot \mathbf{1}_{\mathbf{z}^{(i)}_{(1)} > 0}$$ $$\boldsymbol{\delta}_{(1)} = (\boldsymbol{\delta}_{(2)} W_2) \odot \mathbf{1}_{Z_{(1)} > 0} \in \mathbb{R}^{m \times 3}$$
- Gradients for $W_1$, $\mathbf{b}_1$ $$\frac{\partial L}{\partial W_1} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(1)} \cdot \mathbf{x}^{(i)}{}^\top,\quad \frac{\partial L}{\partial \mathbf{b}_1} = \sum_{i=1}^{m} \boldsymbol{\delta}^{(i)}_{(1)}$$ $$\frac{\partial L}{\partial W_1} = \boldsymbol{\delta}_{(1)}^\top X,\quad \frac{\partial L}{\partial \mathbf{b}_1} = \mathbf{1}^\top \boldsymbol{\delta}_{(1)}$$
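Finally, a minimal batched forward/backward sketch for the regression case, with a finite-difference check of one weight entry; the batch size, input size, and random data are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in = 5, 4                               # assumed batch size and input size
X = rng.normal(size=(m, n_in))
y = rng.normal(size=(m, 1))                  # regression targets

W1, b1 = rng.normal(size=(3, n_in)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)),    np.zeros(2)
W3, b3 = rng.normal(size=(1, 2)),    np.zeros(1)
relu = lambda z: np.maximum(0.0, z)

# Batch forward pass (Section 3)
Z1 = X @ W1.T + b1;  A1 = relu(Z1)
Z2 = A1 @ W2.T + b2; A2 = relu(Z2)
Z3 = A2 @ W3.T + b3; Y_hat = Z3              # regression head

# Batch backward pass for L = (1/m) * ||Y_hat - y||^2
D3 = (2.0 / m) * (Y_hat - y)                 # (m, 1)
dW3, db3 = D3.T @ A2, D3.sum(axis=0)         # (1, 2), (1,)
D2 = (D3 @ W3) * (Z2 > 0)                    # (m, 2)
dW2, db2 = D2.T @ A1, D2.sum(axis=0)         # (2, 3), (2,)
D1 = (D2 @ W2) * (Z1 > 0)                    # (m, 3)
dW1, db1 = D1.T @ X, D1.sum(axis=0)          # (3, n_in), (3,)

# Finite-difference check of one weight gradient
def loss(W1_):
    A1_ = relu(X @ W1_.T + b1)
    A2_ = relu(A1_ @ W2.T + b2)
    return np.mean((A2_ @ W3.T + b3 - y) ** 2)

eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
print(dW1[0, 0], (loss(W1p) - loss(W1m)) / (2 * eps))   # should agree closely
```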