Takeaways
- Understand what an MLP is (see Backpropagation and Implementing Three-Layer Feedforward Network)
- Learn how to implement an MLP (see Implementing Three-Layer Feedforward Network)
  - From scratch
  - Using PyTorch
  - Using TensorFlow
- Understand what perceptrons are (see Perceptron)
- Review common activation functions (see Activation Functions)
Implementing a Three-Layer Feedforward Neural Network¶
This section presents a complete implementation of a three-layer feedforward neural network classifier trained on mini-batches of data (the same network as in the Backpropagation note). We cover three versions with the same architecture and training process:
- Manual (from scratch): the exact gradients derived by hand in the Backpropagation note, coded step by step.
- PyTorch
- TensorFlow
Compare how the implementation differs across these approaches — same model, different tools.
# --- Imports (shared across the three versions) ---
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

import tensorflow as tf
# Reproducibility
torch.manual_seed(0)
# --- Simulated dataset ---
num_samples = 1000
input_dim = 4
X_full = torch.randn(num_samples, input_dim)
true_W = torch.tensor([[1.0, -2.0, 0.5, 1.5]])
true_b = torch.tensor([0.2])
logits = X_full @ true_W.T + true_b
y_full = (torch.sigmoid(logits) > 0.5).float() # Binary labels
print("Label distribution:", torch.bincount(y_full.view(-1).long()))
Label distribution: tensor([472, 528])
1. Manual Implementation¶
We'll manually implement:
- Forward pass
- Loss computation (Binary Cross-Entropy)
- Manual backpropagation (no autograd), using the gradient equations summarized below
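For reference, these are the per-batch gradient formulas coded below (the same ones derived in the Backpropagation note), written in matrix form for a batch of size $m$, ReLU hidden activations, and a sigmoid output trained with binary cross-entropy $L = -\frac{1}{m}\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$:
$$
\begin{aligned}
\delta_3 &= \tfrac{1}{m}(\hat{y} - y), & \frac{\partial L}{\partial W_3} &= \delta_3^\top A_2, & \frac{\partial L}{\partial b_3} &= \textstyle\sum_i \delta_{3,i} \\
\delta_2 &= (\delta_3 W_3) \odot \mathbf{1}[Z_2 > 0], & \frac{\partial L}{\partial W_2} &= \delta_2^\top A_1, & \frac{\partial L}{\partial b_2} &= \textstyle\sum_i \delta_{2,i} \\
\delta_1 &= (\delta_2 W_2) \odot \mathbf{1}[Z_1 > 0], & \frac{\partial L}{\partial W_1} &= \delta_1^\top X, & \frac{\partial L}{\partial b_1} &= \textstyle\sum_i \delta_{1,i}
\end{aligned}
$$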
# --- Network architecture ---
hidden1_dim = 3
hidden2_dim = 2
output_dim = 1
# --- Hyperparameters ---
lr = 0.0001  # small learning rate; over 20 epochs the loss decreases only slightly (see output below)
batch_size = 32
epochs = 20
# --- Initialize weights (manual, no autograd) ---
def init_weights(shape):
return torch.randn(shape) * 0.1
W1 = init_weights((hidden1_dim, input_dim))
b1 = torch.zeros(hidden1_dim)
W2 = init_weights((hidden2_dim, hidden1_dim))
b2 = torch.zeros(hidden2_dim)
W3 = init_weights((output_dim, hidden2_dim))
b3 = torch.zeros(1)
loss_history = []
# --- Training loop (mini-batch SGD) ---
for epoch in range(epochs):
perm = torch.randperm(num_samples)
total_loss = 0
for i in range(0, num_samples, batch_size):
idx = perm[i:i+batch_size]
X = X_full[idx]
y = y_full[idx]
m = X.shape[0] # actual batch size (may be < batch_size at the end)
# --- Forward ---
Z1 = X @ W1.T + b1
A1 = F.relu(Z1)
Z2 = A1 @ W2.T + b2
A2 = F.relu(Z2)
Z3 = A2 @ W3.T + b3
y_pred = torch.sigmoid(Z3)
# --- Loss ---
eps = 1e-8
loss = -(1/m) * torch.sum(y * torch.log(y_pred + eps) + (1 - y) * torch.log(1 - y_pred + eps))
total_loss += loss.item()
# --- Manual Backpropagation ---
delta3 = (1/m) * (y_pred - y) # (m, 1)
dW3 = delta3.T @ A2 # (1, 2)
db3 = torch.sum(delta3, dim=0) # (1,)
dA2 = delta3 @ W3 # (m, 2)
dZ2 = dA2 * (Z2 > 0).float() # (m, 2)
dW2 = dZ2.T @ A1 # (2, 3)
db2 = torch.sum(dZ2, dim=0) # (2,)
dA1 = dZ2 @ W2 # (m, 3)
dZ1 = dA1 * (Z1 > 0).float() # (m, 3)
dW1 = dZ1.T @ X # (3, 4)
db1 = torch.sum(dZ1, dim=0) # (3,)
# --- Gradient Descent ---
W3 -= lr * dW3
b3 -= lr * db3
W2 -= lr * dW2
b2 -= lr * db2
W1 -= lr * dW1
b1 -= lr * db1
# Record and print loss per epoch
avg_loss = total_loss / (num_samples / batch_size)
loss_history.append(avg_loss)
print(f"Epoch {epoch+1}/{epochs} | Avg Loss: {avg_loss:.4f}")
Epoch 1/20 | Avg Loss: 0.7098 Epoch 2/20 | Avg Loss: 0.7098 Epoch 3/20 | Avg Loss: 0.7098 Epoch 4/20 | Avg Loss: 0.7098 Epoch 5/20 | Avg Loss: 0.7098 Epoch 6/20 | Avg Loss: 0.7098 Epoch 7/20 | Avg Loss: 0.7098 Epoch 8/20 | Avg Loss: 0.7098 Epoch 9/20 | Avg Loss: 0.7098 Epoch 10/20 | Avg Loss: 0.7098 Epoch 11/20 | Avg Loss: 0.7098 Epoch 12/20 | Avg Loss: 0.7098 Epoch 13/20 | Avg Loss: 0.7097 Epoch 14/20 | Avg Loss: 0.7097 Epoch 15/20 | Avg Loss: 0.7098 Epoch 16/20 | Avg Loss: 0.7097 Epoch 17/20 | Avg Loss: 0.7097 Epoch 18/20 | Avg Loss: 0.7097 Epoch 19/20 | Avg Loss: 0.7097 Epoch 20/20 | Avg Loss: 0.7097
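As a sanity check, the hand-derived gradients can be compared against autograd. The sketch below is not part of the loop above; it re-runs the forward pass on a small batch with a fresh copy of the output-layer weights (the `c`-suffixed names are introduced only for this check) and compares autograd's gradient for `W3` with the manual formula. The difference should be essentially zero, up to the small epsilon added inside the logs.
# --- Gradient check (optional sketch) ---
Xc, yc = X_full[:8], y_full[:8]                 # small batch just for the check
W3c = W3.clone().requires_grad_(True)           # track gradients for W3 only
A1c = F.relu(Xc @ W1.T + b1)
A2c = F.relu(A1c @ W2.T + b2)
pc = torch.sigmoid(A2c @ W3c.T + b3)
m_c = Xc.shape[0]
loss_c = -(1/m_c) * torch.sum(yc * torch.log(pc + eps) + (1 - yc) * torch.log(1 - pc + eps))
loss_c.backward()
manual_dW3 = ((1/m_c) * (pc.detach() - yc)).T @ A2c   # same formula as in the loop
print("Max |autograd - manual| for dW3:", (W3c.grad - manual_dW3).abs().max().item())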
2. PyTorch Version¶
This is the standard PyTorch version of the manual implementation above: the same architecture and training process, but with autograd, nn.Module, and an optimizer in place of hand-coded gradients.
# --- Dataset and Dataloader ---
batch_size = 32
dataset = TensorDataset(X_full, y_full)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
# --- Model ---
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(4, 3)
self.fc2 = nn.Linear(3, 2)
self.out = nn.Linear(2, 1)
def forward(self, x):
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = torch.sigmoid(self.out(x))
return x
model = SimpleNet()
# --- Loss and Optimizer ---
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
# --- Training Loop ---
epochs = 20
loss_history = []
for epoch in range(epochs):
total_loss = 0
for X_batch, y_batch in loader:
y_pred = model(X_batch)
loss = criterion(y_pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(loader)
loss_history.append(avg_loss)
print(f"Epoch {epoch+1}/{epochs} | Avg Loss: {avg_loss:.4f}")
Epoch 1/20 | Avg Loss: 0.7113 Epoch 2/20 | Avg Loss: 0.7044 Epoch 3/20 | Avg Loss: 0.6970 Epoch 4/20 | Avg Loss: 0.6895 Epoch 5/20 | Avg Loss: 0.6833 Epoch 6/20 | Avg Loss: 0.6776 Epoch 7/20 | Avg Loss: 0.6733 Epoch 8/20 | Avg Loss: 0.6680 Epoch 9/20 | Avg Loss: 0.6636 Epoch 10/20 | Avg Loss: 0.6596 Epoch 11/20 | Avg Loss: 0.6534 Epoch 12/20 | Avg Loss: 0.6478 Epoch 13/20 | Avg Loss: 0.6424 Epoch 14/20 | Avg Loss: 0.6362 Epoch 15/20 | Avg Loss: 0.6293 Epoch 16/20 | Avg Loss: 0.6217 Epoch 17/20 | Avg Loss: 0.6125 Epoch 18/20 | Avg Loss: 0.6034 Epoch 19/20 | Avg Loss: 0.5911 Epoch 20/20 | Avg Loss: 0.5806
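For a quick check of what the trained PyTorch model learned, a minimal evaluation sketch (not part of the training loop above) computes training-set accuracy by thresholding the sigmoid outputs:
# --- Evaluation sketch: training-set accuracy ---
model.eval()
with torch.no_grad():
    preds = (model(X_full) > 0.5).float()       # threshold the sigmoid outputs
    accuracy = (preds == y_full).float().mean().item()
print(f"Training-set accuracy: {accuracy:.3f}")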
3. TensorFlow Version¶
# --- Define model ---
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),          # declare the input shape explicitly
    tf.keras.layers.Dense(3, activation='relu'),
    tf.keras.layers.Dense(2, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
loss='binary_crossentropy',
metrics=['accuracy']
)
# --- Training ---
history = model.fit(
X_full, y_full,
batch_size=32,
epochs=20,
verbose=1
)
Epoch 1/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 697us/step - accuracy: 0.5020 - loss: 0.7126 Epoch 2/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 657us/step - accuracy: 0.5280 - loss: 0.7089 Epoch 3/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 660us/step - accuracy: 0.5280 - loss: 0.7058 Epoch 4/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 571us/step - accuracy: 0.5280 - loss: 0.7029 Epoch 5/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 550us/step - accuracy: 0.5280 - loss: 0.7003 Epoch 6/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 567us/step - accuracy: 0.5280 - loss: 0.6979 Epoch 7/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 542us/step - accuracy: 0.5280 - loss: 0.6959 Epoch 8/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 562us/step - accuracy: 0.5280 - loss: 0.6941 Epoch 9/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 549us/step - accuracy: 0.5280 - loss: 0.6930 Epoch 10/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 553us/step - accuracy: 0.5280 - loss: 0.6925 Epoch 11/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 559us/step - accuracy: 0.5280 - loss: 0.6923 Epoch 12/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 555us/step - accuracy: 0.5280 - loss: 0.6922 Epoch 13/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 561us/step - accuracy: 0.5280 - loss: 0.6921 Epoch 14/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 549us/step - accuracy: 0.5280 - loss: 0.6920 Epoch 15/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 531us/step - accuracy: 0.5280 - loss: 0.6919 Epoch 16/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 552us/step - accuracy: 0.5280 - loss: 0.6919 Epoch 17/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 543us/step - accuracy: 0.5280 - loss: 0.6918 Epoch 18/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 534us/step - accuracy: 0.5280 - loss: 0.6918 Epoch 19/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 540us/step - accuracy: 0.5280 - loss: 0.6918 Epoch 20/20 32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 564us/step - accuracy: 0.5280 - loss: 0.6918
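matplotlib was imported earlier but has not been used yet; the minimal sketch below plots the per-epoch training loss. Here `loss_history` holds the PyTorch per-epoch losses from section 2 (it was overwritten after the manual run), and `history` is the Keras History object returned by `model.fit`.
# --- Plotting sketch: training loss per epoch ---
plt.plot(loss_history, label="PyTorch (BCELoss)")
plt.plot(history.history["loss"], label="TensorFlow (binary_crossentropy)")
plt.xlabel("Epoch")
plt.ylabel("Average training loss")
plt.legend()
plt.title("Training loss per epoch")
plt.show()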
Perceptron¶
A perceptron is a single unit (neuron) in a neural network. It takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function to produce a one-dimensional output.
The difference between a perceptron and a neural network layer: a layer is a collection of perceptrons that perform the same operation independently, each with its own weights.
For example, in the network above, the first layer has 3 perceptrons; each contributes one dimension of the output:
Input (dim = 4) ───► Linear Layer ───► Output (dim = 3)
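The sketch below makes this distinction concrete with the same dimensions as the diagram: a single perceptron built from nn.Linear(4, 1) produces a one-dimensional output, while a layer of 3 perceptrons (nn.Linear(4, 3)) applies the same operation three times with different weights.
# --- Perceptron vs. layer sketch ---
x = torch.randn(4)                          # one input sample with 4 features

perceptron = nn.Linear(4, 1)                # weighted sum + bias for a single unit
out_single = torch.sigmoid(perceptron(x))   # activation -> one-dimensional output

layer = nn.Linear(4, 3)                     # 3 perceptrons, each with its own weights
out_layer = torch.sigmoid(layer(x))         # each unit contributes one output dimension

print(out_single.shape, out_layer.shape)    # torch.Size([1]) torch.Size([3])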

Activation Functions¶
An activation function transforms a neuron's weighted-sum output before it is passed on. It is usually nonlinear, which is what allows neural networks to represent and learn nonlinear functions. Common examples are listed below, followed by a short numerical comparison.
Sigmoid (Logistic) Activation Function:
- Formula: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Use Case: Often used in binary classification problems and in the output layer of neural networks for binary outputs.
- Drawbacks:
- Vanishing gradient problem, which can slow down or halt the learning process.
- Outputs are not zero-centered, which can lead to inefficiencies during training.
Hyperbolic Tangent (Tanh) Activation Function:
- Formula: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- Use Case: Commonly used in hidden layers of neural networks.
- Drawbacks:
- Also suffers from the vanishing gradient problem, although less severe compared to the sigmoid function.
- Computationally more expensive than ReLU.
Rectified Linear Unit (ReLU) Activation Function:
- Formula: $\text{ReLU}(x) = \max(0, x)$
- Use Case: Widely used in hidden layers of neural networks, especially in convolutional neural networks (CNNs).
- Drawbacks:
- The dying ReLU problem, where neurons can become inactive and stop learning if they only output zero.
Leaky ReLU Activation Function:
- Formula: $\text{Leaky ReLU}(x) = \max(\alpha x, x)$ where $\alpha$ is typically 0.01
- Use Case: Used to address the dying ReLU problem, ensuring that neurons have a small gradient even when inactive.
- Drawbacks:
- The slope of the negative part (determined by $\alpha$) needs to be set manually and may not be optimal for all tasks.
Parametric ReLU (PReLU) Activation Function:
- Formula: $\text{PReLU}(x) = \max(\alpha x, x)$ where $\alpha$ is a learnable parameter
- Use Case: Similar to Leaky ReLU but with a learnable parameter that adapts during training.
- Drawbacks:
- Introduces additional parameters, increasing the model complexity and training time.
Exponential Linear Unit (ELU) Activation Function:
- Formula: $$ \text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{if } x \leq 0 \end{cases} $$ where $\alpha$ is a hyperparameter.
- Use Case: Used to improve learning speed and performance of deep neural networks.
- Drawbacks:
- Computationally more expensive than ReLU.
- The parameter $\alpha$ needs to be set carefully.
Scaled Exponential Linear Unit (SELU) Activation Function:
- Formula: $$ \text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{if } x \leq 0 \end{cases} $$ where $\alpha$ and $\lambda$ are fixed parameters.
- Use Case: Used in self-normalizing neural networks (SNNs) to keep the mean and variance of the inputs to each layer close to zero and one respectively.
- Drawbacks:
- May not be suitable for all types of architectures.
- Computationally expensive.
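A minimal numerical comparison of several of the activations above, using PyTorch's built-in implementations (the alpha values shown are the common defaults, not tuned for any particular task):
# --- Activation function comparison sketch ---
x = torch.linspace(-3, 3, 7)
print("sigmoid:   ", torch.sigmoid(x))
print("tanh:      ", torch.tanh(x))
print("relu:      ", F.relu(x))
print("leaky_relu:", F.leaky_relu(x, negative_slope=0.01))
print("prelu:     ", nn.PReLU()(x))           # learnable alpha, initialized to 0.25
print("elu:       ", F.elu(x, alpha=1.0))
print("selu:      ", F.selu(x))               # fixed lambda and alpha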