
Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass

What will I learn
- You will learn network architecture -- layers, neurons, and connections represented as matrix operations;
- weight matrices and bias vectors -- the parameters a network learns;
- the forward pass -- transforming input through layers to produce output;
- activation functions -- sigmoid, tanh, ReLU -- and why ReLU won the hidden layer battle;
- implementing a complete multi-layer forward pass in pure NumPy;
- the softmax function for multi-class classification;
- weight initialization strategies -- He vs. Xavier and why getting this wrong silently breaks training;
- the universal approximation theorem -- what it actually promises and what it conveniently leaves out.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass (this post)
Last episode we built a single perceptron and slammed face-first into the XOR wall. A single neuron can learn AND, OR, NOT -- anything that's linearly separable -- but the moment you need a non-linear boundary, it's stuck. We solved XOR by stacking neurons into a multi-layer perceptron with hand-crafted weights, and that was the critical insight: the hidden layer re-represents the data in a space where the problem becomes trivially separable.
But hand-crafting weights for a network with millions of parameters? Obviously that's not going to happen. The network needs to learn its own weights. And before it can learn, it first needs to compute. That's what this episode is about: the forward pass -- taking input data and pushing it through the network, layer by layer, to produce an output. No learning yet. No gradient descent, no backpropagation, no training loop. Just pure computation.
We'll build the entire forward pass in NumPy from scratch, understand every matrix multiplication along the way, and see exactly how activation functions turn a boring sequence of linear transformations into something that can (in principle) approximate any function. This is the computational backbone of every neural network, from a toy 3-layer net to GPT-4's trillion-parameter behemoth.
Here we go!
Network architecture as matrices
A neural network is a sequence of layers. Each layer has three components:
- A weight matrix W of shape (inputs, outputs) -- one weight for every connection between the previous layer and the current one
- A bias vector b of shape (outputs,) -- one bias per neuron in this layer
- An activation function that introduces nonlinearity after the linear transformation
The forward pass for a single layer is exactly one line of math: output = activation(input @ W + b). The @ is matrix multiplication -- the dot products from episode #8. Every neuron in the layer computes a weighted sum of all its inputs, adds its own bias term, and then applies the activation function. And because we're using matrices, all neurons in a layer compute simultaneously. No explicit loops needed.
If you remember logistic regression from episode #12, this should feel familiar. Logistic regression was one neuron: weighted sum, bias, sigmoid activation. A neural network is many of these neurons arranged in layers, where the output of one layer feeds into the next. Same building block, scaled up.
import numpy as np
def layer_forward(x, W, b, activation):
    """Forward pass through a single layer.

    Returns both the pre-activation (z) and the activated output (a).
    We need z later for backpropagation."""
    z = x @ W + b      # linear transformation
    a = activation(z)  # nonlinear activation
    return z, a
Why return both z and a? Because during training (which we'll implement in the next episode), the backpropagation algorithm needs access to the pre-activation values z to compute gradients. For now, just trust that storing these intermediate results is essential. The forward pass isn't just about getting the output -- it's about recording the path through the network so we can later figure out which weights were responsible for which errors.
A full network chains multiple layers together. The output of layer 1 becomes the input to layer 2, and so on. For a network with 4 input features, a hidden layer of 8 neurons, and 1 output neuron:
- W1: shape (4, 8) -- 32 weights connecting 4 inputs to 8 hidden neurons
- b1: shape (8,) -- 8 biases, one per hidden neuron
- W2: shape (8, 1) -- 8 weights connecting 8 hidden neurons to 1 output
- b2: shape (1,) -- 1 bias for the output neuron
- Total parameters: 32 + 8 + 8 + 1 = 49
That's a tiny network. GPT-4 is rumored to have roughly 1.8 trillion parameters. But the forward pass is the same operation at every scale: multiply, add, activate, repeat. The difference is only size -- the math is identical.
# Manual forward pass for a tiny network: 4 -> 8 -> 1
np.random.seed(42)
# Random weights (we'll discuss proper initialization later)
W1 = np.random.randn(4, 8) * 0.1
b1 = np.zeros(8)
W2 = np.random.randn(8, 1) * 0.1
b2 = np.zeros(1)
# One input sample with 4 features
x = np.array([1.5, -0.3, 2.1, 0.8])
# Layer 1: input -> hidden
z1 = x @ W1 + b1
print(f"Pre-activation z1: {z1.round(4)}")
print(f"Shape: {z1.shape}") # (8,) -- one value per hidden neuron
# Apply ReLU activation (we'll explain this properly below)
a1 = np.maximum(0, z1)
print(f"After ReLU a1: {a1.round(4)}")
# Layer 2: hidden -> output
z2 = a1 @ W2 + b2
# Sigmoid for binary classification output
a2 = 1 / (1 + np.exp(-z2))
print(f"\nOutput: {a2.round(4)}")
print(f"Total parameters: {W1.size + b1.size + W2.size + b2.size}")
Walk through what happens to that single input of shape (4,):
- Input -> Hidden: (4,) @ (4, 8) + (8,) = (8,). Four features become eight hidden values via 32 multiply-add operations.
- ReLU activation: any negative value in those 8 values gets zeroed out. The 8 values are now all non-negative.
- Hidden -> Output: (8,) @ (8, 1) + (1,) = (1,). Eight values collapse to one via 8 multiply-add operations.
- Sigmoid activation: the single value gets squished to the range (0, 1) -- our prediction probability.
Each step is a simple matrix operation that NumPy handles in a few microseconds. The elegance of neural networks is that this same operation works identically whether you have 1 sample or 10,000 -- you just change the shape of x from (4,) to (10000, 4), and all the matrix multiplications broadcast correctly. Batched computation for free.
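Here's a minimal sketch of that batching behavior (the weights are arbitrary random values, only the shapes matter):

```python
import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 8) * 0.1  # hypothetical layer: 4 inputs -> 8 neurons
b1 = np.zeros(8)

x_single = np.random.randn(4)        # one sample, shape (4,)
x_batch = np.random.randn(10000, 4)  # a whole batch, shape (10000, 4)

# Identical code for both -- NumPy broadcasting handles the batch dimension
a_single = np.maximum(0, x_single @ W1 + b1)
a_batch = np.maximum(0, x_batch @ W1 + b1)

print(a_single.shape)  # (8,)
print(a_batch.shape)   # (10000, 8)
```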
Activation functions: the source of all the power
Now we need to talk about why activation functions matter so much. Without them, a neural network is just a sequence of matrix multiplications. And here's the problem: matrix multiplications compose linearly. (x @ W1) @ W2 = x @ (W1 @ W2). That means you could mathematically collapse any deep network into a single layer by multiplying all the weight matrices together. A 100-layer network without activations is no more powerful than a single layer.
That's not a minor inconvenience -- it's a death sentence for the architecture. If deep networks are just fancy linear models, they can't learn curves, they can't learn edges in images, they can't learn anything that a plain linear regression from episode #10 couldn't already do. Activation functions break this linearity, giving the network the ability to model arbitrarily complex nonlinear relationships. They are, quite literally, what makes neural networks work.
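You can verify the collapse numerically -- and see that a single ReLU breaks it. A quick sketch with arbitrary random matrices:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(5, 4)
W1 = np.random.randn(4, 8)
W2 = np.random.randn(8, 3)

# Two stacked linear layers without activations...
two_layers = (x @ W1) @ W2
# ...are exactly one linear layer with weights W1 @ W2
collapsed = x @ (W1 @ W2)
print(np.allclose(two_layers, collapsed))  # True

# Insert a ReLU between the layers and the collapse no longer works
with_relu = np.maximum(0, x @ W1) @ W2
print(np.allclose(with_relu, collapsed))  # False
```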
Three activation functions have dominated the field historically, and understanding their tradeoffs explains a lot about why neural networks took decades to become practical:
Sigmoid
We already know sigmoid from episode #12 -- it squishes any real number into the range (0, 1). Smooth, differentiable everywhere, and has a nice probabilistic interpretation (output looks like a probability). It was the default activation for decades.
But sigmoid has a serious flaw: saturation. For very large or very small inputs, the output is essentially flat -- sigmoid(-10) is approximately 0.00005, and the gradient there is nearly zero. When the gradient is nearly zero, the weight update during training becomes nearly zero too, and the network stops learning. In deep networks this cascades: each layer shrinks the gradient a little more, until by the time you reach the early layers, the gradient is effectively zero. This is the infamous vanishing gradient problem, and it's the main reason deep sigmoid networks were practically untrainable for decades.
Tanh
Tanh squishes output to (-1, 1) instead of (0, 1). It's zero-centered, which helps gradient flow compared to sigmoid (because the outputs can be both positive and negative, the weight updates aren't consistently biased in one direction). But tanh still saturates at the extremes -- the vanishing gradient problem persists, just slightly less severely than with sigmoid.
ReLU
ReLU (Rectified Linear Unit) is embarrassingly simple: max(0, x). If the input is positive, return it unchanged. If negative, return zero. That's it. No exponentials, no divisions, just a comparison and a max.
Why did this trivially simple function revolutionize deep learning? Two reasons. First, for positive inputs the gradient is always exactly 1 -- no saturation, no vanishing, no matter how large the input gets. Gradients flow cleanly through ReLU layers without shrinking. Second, it's computationally cheap -- just a conditional, no transcendental math functions. When you're doing billions of activations per second across GPU cores, the difference between max(0, x) and 1/(1+exp(-x)) adds up massively.
ReLU has one known issue: dying neurons. If a neuron's weights push all possible inputs into negative territory, the neuron always outputs zero, gets zero gradient, and never recovers. It's permanently dead. In practice this happens sometimes but rarely causes serious problems, especially with proper weight initialization (which we'll cover below). Variants like Leaky ReLU (max(0.01*x, x)) fix the dying problem by allowing a small gradient for negative inputs, but standard ReLU remains the default for hidden layers in most architectures.
def sigmoid(z):
    """Sigmoid: squishes to (0, 1). Good for output layers in
    binary classification. Bad for hidden layers (vanishing gradients)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def tanh_act(z):
    """Tanh: squishes to (-1, 1). Zero-centered, but still saturates."""
    return np.tanh(z)

def relu(z):
    """ReLU: max(0, x). The modern default for hidden layers.
    No saturation for positive inputs. Computationally cheap."""
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: small slope for negatives prevents dead neurons."""
    return np.where(z > 0, z, alpha * z)
# Compare behaviors across a range of inputs
x = np.linspace(-5, 5, 11)
print(f"{'x':>6s} {'sigmoid':>8s} {'tanh':>8s} {'ReLU':>6s} {'LeakyReLU':>10s}")
print("-" * 46)
for xi in x:
    s = sigmoid(np.array([xi]))[0]
    t = tanh_act(np.array([xi]))[0]
    r = relu(np.array([xi]))[0]
    lr = leaky_relu(np.array([xi]))[0]
    print(f"{xi:>6.1f} {s:>8.4f} {t:>8.4f} {r:>6.1f} {lr:>10.3f}")
Look at the sigmoid column for x = -5 and x = 5: values of 0.0067 and 0.9933. Nearly flat. The gradient at those points is approximately 0.007 -- basically nothing. Now look at ReLU: for x = 5, the output is 5 and the gradient is 1. For x = -5, the output is 0 and the gradient is 0 (the dead zone). The gradient doesn't gradually vanish -- it's either full (1) or completely off (0). This binary behavior turns out to be much better for training than the slow sigmoid decay.
Let me also show you the gradient comparison explicitly, because this is the core reason ReLU won:
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu_grad(z):
    return (z > 0).astype(float)
z_vals = np.array([-10, -5, -2, -1, 0, 1, 2, 5, 10])
print(f"{'z':>5s} {'sig_grad':>9s} {'relu_grad':>10s}")
print("-" * 28)
for z in z_vals:
    sg = sigmoid_grad(np.array([z]))[0]
    rg = relu_grad(np.array([z]))[0]
    print(f"{z:>5.0f} {sg:>9.6f} {rg:>10.1f}")
print(f"\nMax sigmoid gradient: {sigmoid_grad(np.array([0]))[0]:.4f} (at z=0)")
print(f"ReLU gradient for any z>0: 1.0 (constant)")
print(f"Sigmoid gradient at z=10: {sigmoid_grad(np.array([10]))[0]:.8f}")
print(f"That's {1.0 / sigmoid_grad(np.array([10]))[0]:.0f}x weaker than ReLU")
At z = 10, sigmoid's gradient is roughly 0.000045. ReLU's gradient is 1.0. That's a factor of about 22,000. Now imagine stacking 50 layers deep -- each layer multiplies the gradient by its local gradient. With sigmoid, you're multiplying by at most 0.25 per layer (the sigmoid gradient peaks at 0.25, at z=0), so after 50 layers the gradient is at most 0.25^50, which is about 10^-30. With ReLU, you're multiplying by 1.0 for every active neuron, so the gradient passes through unchanged. This is why deep networks became trainable once people switched to ReLU -- the gradient doesn't vanish.
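The 50-layer arithmetic is worth checking directly. A small sketch, using the best possible case for sigmoid (every pre-activation sitting exactly at z = 0, where its gradient peaks):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

depth = 50
# Best case for sigmoid: every layer at z = 0, gradient 0.25 per layer
best_sigmoid = sigmoid_grad(0.0) ** depth
# ReLU: gradient is exactly 1 through every active neuron
relu_chain = 1.0 ** depth

print(f"sigmoid chain ({depth} layers): {best_sigmoid:.2e}")
print(f"ReLU chain ({depth} layers):    {relu_chain:.1f}")
```

Even under ideal conditions the sigmoid chain is around 10^-30 -- far below float precision for any useful weight update.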
Choosing the right activation for each layer
For hidden layers, ReLU is the default. No contest. Unless you have a specific architectural reason to use something else (and in 2024+ you rarely do), use ReLU.
For the output layer, the activation depends on the task:
- Binary classification: sigmoid (outputs a probability between 0 and 1)
- Multi-class classification: softmax (outputs a probability distribution across classes)
- Regression: no activation at all -- just the raw linear output
This separation matters. The hidden layers need nonlinearity to learn complex representations. The output layer needs to produce values in a format that matches your loss function and your target variable. Mixing these up is a surprisingly common beginner mistake.
The softmax function
For classification problems with more than two classes, we need the output layer to produce a probability distribution -- a vector where all values are positive and sum exactly to 1. Softmax does this:
def softmax(z):
    """Convert raw scores (logits) to a probability distribution.
    The max subtraction is a numerical stability trick -- without it,
    exp(1000) overflows to infinity. Subtracting the max doesn't
    change the result because it cancels in the numerator/denominator."""
    exp_z = np.exp(z - z.max(axis=-1, keepdims=True))
    return exp_z / exp_z.sum(axis=-1, keepdims=True)
# Example: 3-class classification
logits = np.array([2.0, 1.0, 0.5])
probs = softmax(logits)
print(f"Raw logits: {logits}")
print(f"Probabilities: {probs.round(4)}")
print(f"Sum: {probs.sum():.4f}")
# Softmax amplifies differences -- the largest logit gets
# a disproportionately large probability
logits2 = np.array([5.0, 1.0, 0.5])
probs2 = softmax(logits2)
print(f"\nLarger gap logits: {logits2}")
print(f"Probabilities: {probs2.round(4)}")
print(f"Class 0 probability jumped from {probs[0]:.2f} to {probs2[0]:.2f}")
The z.max() subtraction is worth understanding because it's a pattern you'll see everywhere in numerical computing. Without it, np.exp(1000) overflows to infinity and everything breaks. With it, the largest exponent is always exp(0) = 1, and all other exponents are smaller. The subtraction doesn't change the mathematical result because it cancels out: exp(z_i - c) / sum(exp(z_j - c)) = exp(z_i) * exp(-c) / (sum(exp(z_j)) * exp(-c)) = exp(z_i) / sum(exp(z_j)). Same answer, no overflow.
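You can watch the failure mode happen. A sketch comparing a naive softmax (hypothetical `softmax_naive`, written here only to demonstrate the overflow) against the stable version:

```python
import numpy as np

def softmax_naive(z):
    exp_z = np.exp(z)  # overflows to inf for large logits
    return exp_z / exp_z.sum()

def softmax_stable(z):
    exp_z = np.exp(z - z.max())  # largest exponent is exp(0) = 1
    return exp_z / exp_z.sum()

logits = np.array([1000.0, 999.0, 998.0])
with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax_naive(logits)
stable = softmax_stable(logits)

print(naive)                # [nan nan nan] -- inf / inf
print(stable.round(4))      # ~[0.665, 0.245, 0.090], sums to 1
```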
Softmax also has a nice property: it's monotonic -- if logit A is larger than logit B before softmax, probability A will be larger than probability B after softmax. The ranking is preserved, only the scale changes. This means the network can learn to rank classes correctly even if the raw magnitudes are off. The training process will adjust the magnitudes to produce calibrated probabilities, but the ranking is free.
# Softmax with batched data (multiple samples at once)
batch_logits = np.array([
    [2.0, 1.0, 0.5],  # sample 1
    [0.1, 3.0, 0.2],  # sample 2
    [0.5, 0.5, 2.0],  # sample 3
])
batch_probs = softmax(batch_logits)
print("Batched softmax:")
for i, (l, p) in enumerate(zip(batch_logits, batch_probs)):
    pred_class = np.argmax(p)
    print(f"  Sample {i}: logits={l} -> probs={p.round(3)} "
          f"-> class {pred_class}")
Building the complete forward pass
Now we put everything together: a complete neural network class that performs the forward pass through an arbitrary number of layers with proper activation functions, weight storage, and intermediate value caching (for backpropagation later).
class NeuralNetwork:
    def __init__(self, layer_sizes, task='binary'):
        """Initialize a neural network.

        Args:
            layer_sizes: list of ints, e.g. [4, 8, 4, 1]
                First element is input size, last is output size,
                everything in between is hidden layers.
            task: 'binary', 'multiclass', or 'regression'
                Determines the output activation.
        """
        self.layer_sizes = layer_sizes
        self.task = task
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            # He initialization: scale by sqrt(2 / fan_in)
            # Designed for ReLU networks (explained below)
            scale = np.sqrt(2.0 / layer_sizes[i])
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * scale
            b = np.zeros(layer_sizes[i+1])
            self.weights.append(W)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass: push X through all layers.
        Stores intermediate values for backpropagation."""
        self.activations = [X]     # input is "activation 0"
        self.pre_activations = []  # z values before activation
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            z = self.activations[-1] @ W + b
            self.pre_activations.append(z)
            if i < len(self.weights) - 1:
                # Hidden layers: ReLU
                a = relu(z)
            else:
                # Output layer: depends on task
                if self.task == 'binary':
                    a = sigmoid(z)
                elif self.task == 'multiclass':
                    a = softmax(z)
                else:  # regression
                    a = z  # no activation
            self.activations.append(a)
        return self.activations[-1]

    def count_parameters(self):
        total = sum(W.size + b.size
                    for W, b in zip(self.weights, self.biases))
        return total

    def summary(self):
        print(f"Network: {' -> '.join(str(s) for s in self.layer_sizes)}")
        print(f"Task: {self.task}")
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            act = 'ReLU' if i < len(self.weights) - 1 else {
                'binary': 'sigmoid',
                'multiclass': 'softmax',
                'regression': 'linear'
            }[self.task]
            print(f"  Layer {i+1}: {W.shape[0]} -> {W.shape[1]}"
                  f" ({W.size} weights + {b.size} biases)"
                  f" [{act}]")
        print(f"Total parameters: {self.count_parameters()}")
Now let's use it:
# Binary classification: 4 inputs -> 8 hidden -> 4 hidden -> 1 output
np.random.seed(42)
nn = NeuralNetwork([4, 8, 4, 1], task='binary')
nn.summary()
X_demo = np.random.randn(5, 4) # 5 samples, 4 features
output = nn.forward(X_demo)
print(f"\nInput shape: {X_demo.shape}")
print(f"Output shape: {output.shape}")
print(f"Predictions: {output.flatten().round(4)}")
print(f"\nStored activations: {len(nn.activations)} "
f"(input + {len(nn.weights)} layers)")
print(f"Stored pre-activations: {len(nn.pre_activations)}")
Let me walk through what happens to one input sample of shape (4,) in detail:
- Input -> Hidden 1: (4,) @ (4, 8) + (8,) = (8,). Four input features become eight hidden values. That's 32 multiply-add operations -- each of the 8 hidden neurons computes a weighted sum of all 4 inputs, then adds its bias.
- ReLU: any negative value among those 8 becomes zero. Some neurons "fire" (positive output), others stay silent (zero). This creates a sparse representation -- not all neurons contribute to every input, and this sparsity is actually beneficial for learning.
- Hidden 1 -> Hidden 2: (8,) @ (8, 4) + (4,) = (4,). Eight values become four. The second hidden layer combines the representations from the first layer into higher-level patterns.
- ReLU again: zeroing negatives in the 4 values.
- Hidden 2 -> Output: (4,) @ (4, 1) + (1,) = (1,). Four values collapse to one.
- Sigmoid: the single value gets squished to (0, 1) -- our prediction probability.
The network stores all of this: self.activations contains [X, a1, a2, a3] (the input and each layer's output), and self.pre_activations contains [z1, z2, z3] (the linear values before activation). These seem redundant for just making predictions, but they're essential for backpropagation. We'll use them in the next episode.
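For a standalone view of that caching pattern, here's the same loop condensed outside the class (architecture and seed chosen to mirror the demo above):

```python
import numpy as np

np.random.seed(42)
sizes = [4, 8, 4, 1]
weights = [np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

X = np.random.randn(5, 4)  # 5 samples, 4 features
activations, pre_activations = [X], []
for i, (W, b) in enumerate(zip(weights, biases)):
    z = activations[-1] @ W + b
    pre_activations.append(z)
    # ReLU for hidden layers, sigmoid for the output layer
    a = np.maximum(0, z) if i < len(weights) - 1 else 1 / (1 + np.exp(-z))
    activations.append(a)

# One cached z and one cached a per layer, each shaped (batch, layer_size)
for i, z in enumerate(pre_activations):
    print(f"z{i+1}: {z.shape}, a{i+1}: {activations[i+1].shape}")
```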
# Multi-class example: 10 inputs -> 32 -> 16 -> 5 classes
nn_multi = NeuralNetwork([10, 32, 16, 5], task='multiclass')
nn_multi.summary()
X_multi = np.random.randn(3, 10)
probs = nn_multi.forward(X_multi)
print(f"\nOutput probabilities (3 samples, 5 classes):")
for i, p in enumerate(probs):
    print(f"  Sample {i}: {p.round(3)} "
          f"(sum={p.sum():.4f}, predicted class={np.argmax(p)})")
The multi-class output is a proper probability distribution for each sample. Each row sums to 1, all values are positive, and argmax gives you the predicted class. This is exactly what you'd use for image classification (10 classes of objects), sentiment analysis (5 sentiment levels), or any other multi-class problem.
Solving XOR with our network
Let's prove our network class can handle what the single perceptron couldn't. Remember the XOR problem from episode #37? Let me use our new NeuralNetwork class with hand-tuned weights to verify the forward pass produces the right results:
# Build a 2 -> 2 -> 1 network and set weights manually
# (training will learn these automatically -- next episode)
nn_xor = NeuralNetwork([2, 2, 1], task='binary')
# Hidden layer (ReLU): neuron 1 is positive iff OR is true,
# neuron 2 is positive iff AND is true
nn_xor.weights[0] = np.array([[1.0, 1.0],
                              [1.0, 1.0]])
nn_xor.biases[0] = np.array([-0.5, -1.5])
# Output layer: XOR = OR AND NOT(AND) -- reward the OR neuron,
# strongly punish the AND neuron
nn_xor.weights[1] = np.array([[4.0],
                              [-12.0]])
nn_xor.biases[1] = np.array([-1.0])
# Test all XOR inputs
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
predictions = nn_xor.forward(X_xor)
print("XOR with our NeuralNetwork class:")
print(f"{'Input':>10s} {'Output':>8s} {'Rounded':>8s} {'Target':>7s}")
for x, pred, target in zip(X_xor, predictions.flatten(), y_xor):
    print(f" {str(x):>8s} {pred:>8.4f} {round(pred):>8.0f} {target:>7d}")
# Show the hidden layer representation
print(f"\nHidden layer values (the re-representation):")
hidden = nn_xor.activations[1]
for x, h in zip(X_xor, hidden):
    print(f" {x} -> hidden: {h.round(3)}")
The hidden layer transforms the XOR inputs into a new space where the output neuron can separate them linearly -- exactly what we demonstrated by hand in episode #37, now running through our general-purpose forward pass code. The weights we set manually encode the same OR/AND gate trick, but now the sigmoid output gives us soft probabilities instead of hard 0/1. Values close to 0 or close to 1 mean the network is confident; values near 0.5 would mean uncertainty.
Weight initialization: the silent make-or-break
I glossed over the scale = np.sqrt(2.0 / layer_sizes[i]) line in the constructor earlier. This is He initialization (named after Kaiming He, first author of the 2015 paper), and getting initialization right is one of those things that makes the difference between a network that trains smoothly and one that silently fails to learn anything useful.
Why does it matter? Consider what happens when data flows through layers during the forward pass. If weights are too large, the output of each layer is larger than its input -- the activations explode exponentially through the layers until you get inf values and NaN gradients. If weights are too small, the output of each layer is smaller than its input -- the activations shrink until they're effectively zero and the network can't distinguish between different inputs.
The goal of smart initialization is to keep the variance of activations roughly constant across layers. He initialization sets each weight's variance to 2/fan_in (where fan_in is the number of inputs to that layer). The factor of 2 accounts for ReLU zeroing out roughly half the values -- if half the activations are zero, you need the surviving half to have twice the variance to maintain the overall signal strength.
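That factor-of-2 argument is easy to check numerically -- push a million standard-normal pre-activations through ReLU and measure what survives:

```python
import numpy as np

np.random.seed(0)
z = np.random.randn(1_000_000)  # pre-activations, variance ~1
a = np.maximum(0, z)            # ReLU

print(f"fraction zeroed by ReLU: {np.mean(a == 0):.3f}")  # ~0.5
print(f"signal power before (E[z^2]): {np.mean(z**2):.3f}")  # ~1.0
print(f"signal power after  (E[a^2]): {np.mean(a**2):.3f}")  # ~0.5
```

ReLU halves the expected squared activation, which is exactly the loss the extra factor of 2 in He initialization compensates for.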
Let me show you the concrete difference:
def test_initialization(n_layers, init_scale, n_features=100):
    """Forward pass through many layers to see activation behavior."""
    x = np.random.randn(1, n_features)
    for i in range(n_layers):
        W = np.random.randn(n_features, n_features) * init_scale
        b = np.zeros(n_features)
        x = relu(x @ W + b)
    return x
np.random.seed(42)
print("Effect of initialization scale on deep networks:\n")
for scale_name, scale in [("Too small (0.01)", 0.01),
                          ("Too large (1.0)", 1.0),
                          ("He init", np.sqrt(2.0 / 100))]:
    result = test_initialization(20, scale)
    mean_val = np.mean(np.abs(result))
    nonzero = np.count_nonzero(result)
    has_nan = np.any(np.isnan(result))
    has_inf = np.any(np.isinf(result))
    print(f"  {scale_name:>20s}: "
          f"mean |activation| = {mean_val:.6f}, "
          f"nonzero = {nonzero}/100, "
          f"NaN/Inf = {has_nan or has_inf}")
With scale 0.01, the activations after 20 layers are essentially zero. The network is dead -- it can't distinguish any input from any other input because all information has been squashed away. With scale 1.0, the activations explode (you'll see enormously large values or infinities). With He initialization (sqrt(2/100) = 0.1414), the activations stay in a reasonable range and maintain a healthy mix of zero and non-zero values.
For sigmoid or tanh hidden layers (which you'd rarely use today but should understand), the appropriate initialization is Xavier (or Glorot) initialization: variance = 1/fan_in rather than 2/fan_in. The factor of 2 is absent because sigmoid and tanh don't kill half the values like ReLU does.
# Xavier vs He initialization -- when to use which
print("Initialization strategy cheat sheet:")
print("  ReLU hidden layers:         He     -> scale = sqrt(2 / fan_in)")
print("  Sigmoid/tanh hidden layers: Xavier -> scale = sqrt(1 / fan_in)")
print("  Output layer:               doesn't matter much (1 layer)")
print()

# Quick demo with our NeuralNetwork class
nn_deep = NeuralNetwork([784, 256, 128, 64, 10], task='multiclass')
print("Deep network for image classification (like MNIST digits):")
nn_deep.summary()
print("\nWeight scale per layer (should be sqrt(2/fan_in)):")
for i, W in enumerate(nn_deep.weights):
    actual_std = W.std()
    expected_std = np.sqrt(2.0 / W.shape[0])
    print(f"  Layer {i+1}: std={actual_std:.4f} "
          f"(expected ~{expected_std:.4f})")
Getting initialization wrong doesn't crash your program -- it silently makes training fail or take orders of magnitude longer. Your loss might plateau early and never improve, or it might oscillate wildly. These symptoms look like hyperparameter problems or architecture problems, but the real cause is bad initialization. This is one of those debugging nightmares that experienced practitioners learn to check first.
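Since the failure mode is silent, a quick defensive check pays off before you spend hours tuning hyperparameters. Here's a minimal sketch of such a check -- `check_init_scale` is a hypothetical helper for illustration, not part of the article's NeuralNetwork class:

```python
import numpy as np

def check_init_scale(weights, kind="he", tol=0.2):
    """Warn if any layer's weight std is far from the expected
    He or Xavier scale. Hypothetical debugging helper."""
    factor = 2.0 if kind == "he" else 1.0
    ok = True
    for i, W in enumerate(weights):
        expected = np.sqrt(factor / W.shape[0])  # sqrt(2/fan_in) or sqrt(1/fan_in)
        ratio = W.std() / expected
        if not (1 - tol < ratio < 1 + tol):
            print(f"  Layer {i+1}: std={W.std():.4f}, "
                  f"expected ~{expected:.4f} -- suspicious!")
            ok = False
    return ok

np.random.seed(0)
good = [np.random.randn(100, 64) * np.sqrt(2.0 / 100)]  # He-scaled
bad  = [np.random.randn(100, 64) * 0.5]                 # far too large
print(check_init_scale(good))  # True
print(check_init_scale(bad))   # prints a warning, then False
```

Running something like this once, right after building a network, turns the silent failure into a loud one.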
Batch processing: scaling to real datasets
So far we've been feeding individual samples or tiny batches through the network. In practice, you process data in mini-batches -- groups of samples that are computed together. This isn't just an optimization trick; it's fundamental to how modern neural networks train (as we'll see in upcoming episodes).
# Simulate a real workflow: batch processing
np.random.seed(42)
n_samples = 1000
n_features = 20

# Synthetic dataset
X_data = np.random.randn(n_samples, n_features)

# Network
nn_batch = NeuralNetwork([20, 64, 32, 1], task='binary')

# Process in batches of 128
batch_size = 128
all_predictions = []
for start in range(0, n_samples, batch_size):
    X_batch = X_data[start:start + batch_size]
    preds = nn_batch.forward(X_batch)
    all_predictions.append(preds)
all_predictions = np.vstack(all_predictions)

print(f"Processed {n_samples} samples in "
      f"{(n_samples + batch_size - 1) // batch_size} batches")
print(f"Prediction shape: {all_predictions.shape}")
print(f"Prediction range: [{all_predictions.min():.4f}, "
      f"{all_predictions.max():.4f}]")
print(f"Mean prediction: {all_predictions.mean():.4f}")

# Verify: single forward pass gives same results
preds_all = nn_batch.forward(X_data)
match = np.allclose(all_predictions, preds_all)
print(f"Batched matches single-pass: {match}")
The last check is important: processing in batches of 128 gives exactly the same results as processing all 1000 at once. The forward pass is deterministic -- same input, same weights, same output. (During training, randomness comes from weight updates and data shuffling, not from the forward pass itself.) This determinism is essential for reproducibility, and it's something that trips people up when they move to GPU computing where floating-point ordering can differ.
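The GPU caveat comes down to floating-point addition not being associative: a reduction that sums the same numbers in a different order can differ in the last bits. A two-line illustration:

```python
# Floating-point addition is not associative: the same three numbers
# summed in a different order give results that differ in the last bits.
# This is why GPU reductions (which reorder sums) may not bit-match CPU runs.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)              # False -- the two orders differ
print(abs(left - right) < 1e-15)  # True -- but only in the last bits
```

This is why comparisons across hardware use `np.allclose` with a tolerance rather than exact equality.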
Putting it all together: a multi-task demonstration
Let's use our NeuralNetwork class for three different tasks to show that the same forward pass machinery handles completely different problems just by changing the output activation and architecture:
np.random.seed(42)

# Task 1: Binary classification (is it spam or not?)
nn_binary = NeuralNetwork([50, 32, 16, 1], task='binary')
X_spam = np.random.randn(10, 50)
probs_spam = nn_binary.forward(X_spam)
print("Task 1: Spam detection (binary)")
print(f"  Predictions: {probs_spam.flatten().round(3)}")
print(f"  All in (0,1): {(probs_spam > 0).all() and (probs_spam < 1).all()}")

# Task 2: Multi-class (which digit is it? 0-9)
nn_digits = NeuralNetwork([784, 128, 64, 10], task='multiclass')
X_digit = np.random.randn(5, 784)  # 28x28 flattened images
probs_digit = nn_digits.forward(X_digit)
print("\nTask 2: Digit classification (10 classes)")
print(f"  Output shape: {probs_digit.shape}")
for i, p in enumerate(probs_digit):
    print(f"  Sample {i}: predicted class {np.argmax(p)} "
          f"(confidence {p.max():.3f})")

# Task 3: Regression (predict house price)
nn_price = NeuralNetwork([8, 32, 16, 1], task='regression')
X_house = np.random.randn(5, 8)
prices = nn_price.forward(X_house)
print("\nTask 3: House price regression")
print(f"  Predictions: {prices.flatten().round(4)}")
print("  (Can be negative -- no sigmoid constraint)")

print("\nAll three use the same NeuralNetwork class.")
print("Same forward pass. Different output activation.")
print(f"Total params: binary={nn_binary.count_parameters()}, "
      f"digits={nn_digits.count_parameters()}, "
      f"price={nn_price.count_parameters()}")
Three fundamentally different prediction tasks, three different output formats (probability, distribution, raw value), all handled by the same forward pass code. The architecture and output activation change, but the core computation -- multiply, add, activate, repeat -- is identical. This universality is what makes neural networks so powerful: one framework handles an enormous range of problems just by adjusting the structure.
The universal approximation theorem
With the forward pass implemented, we can state something remarkable: a network with a single hidden layer containing enough neurons can approximate any continuous function to arbitrary precision. This is the universal approximation theorem, and it's both reassuring and (as you'll quickly discover) somewhat misleading.
Reassuring: it tells us that neural networks are not fundamentally limited in what they can represent. There's no continuous function that a sufficiently wide single-hidden-layer network couldn't, in principle, model. You're not wasting your time building these things -- the model class is expressive enough for anything.
Misleading: it says absolutely nothing about finding the right weights. It's an existence theorem, not a construction theorem. Knowing that good weights exist and actually finding them via gradient descent are wildly different problems. It also says nothing about how many neurons you'd need -- for complex functions, the required width could be astronomically large, making the network impractical even if it's theoretically capable.
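You can see the geometric intuition behind the theorem without any training at all: hand-picked ReLU neurons in a single hidden layer already build piecewise-linear shapes, and stacking enough such "bumps" approximates any continuous curve. A small sketch with weights chosen by hand rather than learned:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.linspace(-2, 2, 9)

# Two hidden ReLU neurons reproduce |x| exactly:
abs_approx = relu(x) + relu(-x)
print(np.allclose(abs_approx, np.abs(x)))  # True

# Three ReLU neurons build a triangular "bump" on (0, 1), zero elsewhere.
# Sum enough shifted, scaled bumps and you can trace out any continuous
# function -- that is the constructive idea behind universal approximation.
bump = relu(x) - 2 * relu(x - 0.5) + relu(x - 1.0)
print(bump.max())  # peaks at 0.5 in the middle of the bump
```

The theorem says nothing about *learning* these weights, which is exactly the gap between existence and construction described above.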
In practice, deep networks with moderate width learn much more efficiently than shallow networks with enormous width. We showed this in episode #37 with the depth vs. width comparison. The reason is compositionality: deep networks build complex functions by composing simple transformations, where each layer builds on the previous one's representation. First layer learns simple features, second layer combines them into patterns, third layer builds higher abstractions from those patterns. This hierarchical composition maps naturally onto the structured patterns in real-world data (edges -> textures -> objects in images, characters -> words -> grammar in text), and it's far more parameter-efficient than trying to learn everything in one giant flat layer.
The theorem justifies using neural networks as a modeling framework. The practical challenge -- actually training them to find good weights -- is what the rest of Arc 3 will address, starting with backpropagation in the very next episode.
Before you close this tab
Here's what to take away:
- A neural network layer computes `output = activation(input @ W + b)` -- that's the entire forward pass for one layer. Matrix multiply, add bias, apply nonlinearity. Chain multiple layers and you have a deep network;
- Weight matrices connect layers and bias vectors add per-neuron offsets. Their shapes are determined by the neuron counts in consecutive layers -- `W` has shape `(fan_in, fan_out)` and `b` has shape `(fan_out,)`;
- Activation functions break linearity -- without them, any deep network collapses mathematically to a single layer. Sigmoid and tanh saturate (vanishing gradients), ReLU doesn't. ReLU is the modern default for hidden layers, period;
- Softmax converts raw output scores to a probability distribution for multi-class classification. The `max` subtraction trick prevents numerical overflow;
- He initialization (variance = `2/fan_in`) keeps activations stable across layers in ReLU networks. Xavier initialization (variance = `1/fan_in`) is for sigmoid/tanh. Getting this wrong doesn't crash -- it silently kills learning;
- The forward pass stores intermediate activations and pre-activations. These values seem unnecessary for prediction, but they're essential for backpropagation -- the learning algorithm that makes all of this actually useful;
- The universal approximation theorem guarantees that a wide enough network can represent any continuous function. It does NOT guarantee you can find the right weights -- that's the training problem;
- Everything we built today works identically for binary classification, multi-class classification, and regression. Same code, different output activation.
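The "collapses to a single layer" point is easy to verify numerically: two linear layers without an activation between them are exactly one linear layer.

```python
import numpy as np

# Without a nonlinearity, two stacked layers collapse into one:
# (x @ W1) @ W2 == x @ (W1 @ W2). The "deep" network is a single
# linear map in disguise -- activations are what prevent this.
np.random.seed(42)
x = np.random.randn(4, 10)
W1 = np.random.randn(10, 8)
W2 = np.random.randn(8, 3)

two_layers = (x @ W1) @ W2
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))  # True
```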
The forward pass is only half the story. A network that can compute but can't learn is just an expensive random number generator ;-) In the next episode we'll implement backpropagation -- the algorithm that computes how much each weight contributed to the error, so we can update them to make better predictions. That's where the real magic happens.
Greetings, and see you in the next one!