
Learn AI Series (#37) - The Perceptron - Where It All Started

What will I learn
- the perceptron -- the simplest artificial neuron and the ancestor of all neural networks;
- the perceptron learning algorithm -- how a single neuron adjusts its weights from mistakes;
- implementing a perceptron from scratch that learns AND, OR, and NOT gates;
- the XOR problem -- what a single neuron fundamentally cannot learn, and why it matters;
- multi-layer perceptrons -- how adding a hidden layer solves the XOR problem by re-representing data;
- why depth matters -- the universal approximation theorem vs. practical efficiency of deeper networks;
- the AI winters and the boom-bust cycles that shaped the field we're entering now.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- [Learn AI Series (#1) - What Machine Learning Actually Is](@scipio/learn-ai-series-1-what-machine-learning-actually-is)
- [Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy](@scipio/learn-ai-series-2-setting-up-your-ai-workbench-python-and-numpy)
- [Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World](@scipio/learn-ai-series-3-your-data-is-just-numbers-how-machines-see-the-world)
- [Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition](@scipio/learn-ai-series-4-your-first-prediction-no-math-just-intuition)
- [Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like](@scipio/learn-ai-series-5-patterns-in-data-what-learning-actually-looks-like)
- [Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas](@scipio/learn-ai-series-6-from-intuition-to-math-why-we-need-formulas)
- [Learn AI Series (#7) - The Training Loop - See It Work Step by Step](@scipio/learn-ai-series-7-the-training-loop-see-it-work-step-by-step)
- [Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra](@scipio/learn-ai-series-8-the-math-you-actually-need-part-1-linear-algebra)
- [Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability](@scipio/learn-ai-series-9-the-math-you-actually-need-part-2-calculus-and-probability)
- [Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch](@scipio/learn-ai-series-10-your-first-ml-model-linear-regression-from-scratch)
- [Learn AI Series (#11) - Making Linear Regression Real](@scipio/learn-ai-series-11-making-linear-regression-real)
- [Learn AI Series (#12) - Classification - Logistic Regression From Scratch](@scipio/learn-ai-series-12-classification-logistic-regression-from-scratch)
- [Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works](@scipio/learn-ai-series-13-evaluation-how-to-know-if-your-model-actually-works)
- [Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About](@scipio/learn-ai-series-14-data-preparation-the-80-nobody-talks-about)
- [Learn AI Series (#15) - Feature Engineering and Selection](@scipio/learn-ai-series-15-feature-engineering-and-selection)
- [Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML](@scipio/learn-ai-series-16-scikit-learn-the-standard-library-of-ml)
- [Learn AI Series (#17) - Decision Trees - How Machines Make Decisions](@scipio/learn-ai-series-17-decision-trees-how-machines-make-decisions)
- [Learn AI Series (#18) - Random Forests - Wisdom of Crowds](@scipio/learn-ai-series-18-random-forests-wisdom-of-crowds)
- [Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion](@scipio/learn-ai-series-19-gradient-boosting-the-kaggle-champion)
- [Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary](@scipio/learn-ai-series-20-support-vector-machines-drawing-the-perfect-boundary)
- [Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes](@scipio/learn-ai-series-21-mini-project-predicting-crypto-market-regimes)
- [Learn AI Series (#22) - K-Means Clustering - Finding Groups](@scipio/learn-ai-series-22-k-means-clustering-finding-groups)
- [Learn AI Series (#23) - Advanced Clustering - Beyond K-Means](@scipio/learn-ai-series-23-advanced-clustering-beyond-k-means)
- [Learn AI Series (#24) - Dimensionality Reduction - PCA](@scipio/learn-ai-series-24-dimensionality-reduction-pca)
- [Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP](@scipio/learn-ai-series-25-advanced-dimensionality-reduction-t-sne-and-umap)
- [Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong](@scipio/learn-ai-series-26-anomaly-detection-finding-what-doesnt-belong)
- [Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."](@scipio/learn-ai-series-27-recommendation-systems-users-like-you-also-liked)
- [Learn AI Series (#28) - Time Series Fundamentals - When Order Matters](@scipio/learn-ai-series-28-time-series-fundamentals-when-order-matters)
- [Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next](@scipio/learn-ai-series-29-time-series-forecasting-predicting-what-comes-next)
- [Learn AI Series (#30) - Natural Language Processing - Text as Data](@scipio/learn-ai-series-30-natural-language-processing-text-as-data)
- [Learn AI Series (#31) - Word Embeddings - Meaning in Numbers](@scipio/learn-ai-series-31-word-embeddings-meaning-in-numbers)
- [Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities](@scipio/learn-ai-series-32-bayesian-methods-thinking-in-probabilities)
- [Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending](@scipio/learn-ai-series-33-ensemble-methods-deep-dive-stacking-and-blending)
- [Learn AI Series (#34) - ML Engineering - From Notebook to Production](@scipio/learn-ai-series-34-ml-engineering-from-notebook-to-production)
- [Learn AI Series (#35) - Data Ethics and Bias in ML](@scipio/learn-ai-series-35-data-ethics-and-bias-in-ml)
- [Learn AI Series (#36) - Mini Project - Complete ML Pipeline](@scipio/learn-ai-series-36-mini-project-complete-ml-pipeline)
- [Learn AI Series (#37) - The Perceptron - Where It All Started](@scipio/learn-ai-series-37-the-perceptron-where-it-all-started) (this post)
Welcome to Arc 3. Everything changes here.
For 36 episodes, every model you built operated on features you designed. You decided what to measure, how to transform it, which interactions to create. The model's job was to learn the mapping from your features to the target. Linear regression, random forests, gradient boosting, SVMs -- all of them depend on you, the engineer, to craft the right representation of the data. We spent an entire episode on feature engineering (#15) because in the classical ML world, your features make or break the model. The algorithm is secondary.
Neural networks flip this completely. Given enough data and the right architecture, they learn their own features directly from raw input. Feed a neural network raw pixels and it discovers edges, textures, shapes, and objects -- without you ever defining what an "edge" is. Feed it raw text and it learns syntax, semantics, and reasoning patterns. This ability to learn representations is what makes deep learning fundamentally different from classical ML, and it's the reason we just spent 36 episodes building the foundation to understand it.
But to understand neural networks, we start where the field started: with a single artificial neuron, invented in 1958 by Frank Rosenblatt at Cornell. It's beautifully simple, and its limitations sparked decades of research that led to everything we call "AI" today.
Here we go!
The perceptron: one artificial neuron
A perceptron takes multiple inputs, multiplies each by a weight, sums the results, adds a bias, and passes the sum through an activation function. If the output exceeds a threshold, it fires (outputs 1); otherwise it doesn't (outputs 0).
Mathematically: output = step(w1*x1 + w2*x2 + ... + wn*xn + b) where the step function returns 1 if the argument is positive, 0 otherwise.
This should look familiar. Back in episode #12, logistic regression did almost exactly this -- weighted sum plus bias, passed through a sigmoid instead of a step. The perceptron is even simpler: binary output, no probabilities. It's the stripped-down ancestor of everything that came after.
import numpy as np

def perceptron(x, weights, bias):
    """Single perceptron: weighted sum + bias -> step function"""
    z = np.dot(x, weights) + bias
    return 1 if z > 0 else 0

# AND gate
weights = np.array([1.0, 1.0])
bias = -1.5
print("AND gate:")
for x1, x2 in [(0,0), (0,1), (1,0), (1,1)]:
    result = perceptron(np.array([x1, x2]), weights, bias)
    print(f"  {x1} AND {x2} = {result}")
The perceptron with weights [1, 1] and bias -1.5 implements AND: the weighted sum exceeds 0 only when both inputs are 1 (sum = 1 + 1 - 1.5 = 0.5 > 0). For any other input combination, the sum stays at or below zero.
By choosing different weights and bias, you can implement other logic gates too. This is worth seeing, because it shows how the same architecture can express fundamentally different functions just by changing parameters -- which is, when you think about it, exactly what "learning" means:
# OR gate: fires when at least one input is 1
weights_or = np.array([1.0, 1.0])
bias_or = -0.5
print("\nOR gate:")
for x1, x2 in [(0,0), (0,1), (1,0), (1,1)]:
    result = perceptron(np.array([x1, x2]), weights_or, bias_or)
    print(f"  {x1} OR {x2} = {result}")

# NOT gate: inverts a single input
weights_not = np.array([-1.0])
bias_not = 0.5
print("\nNOT gate:")
for x1 in [0, 1]:
    result = perceptron(np.array([x1]), weights_not, bias_not)
    print(f"  NOT {x1} = {result}")

# NAND gate: not(and) -- fires unless both inputs are 1
weights_nand = np.array([-1.0, -1.0])
bias_nand = 1.5
print("\nNAND gate:")
for x1, x2 in [(0,0), (0,1), (1,0), (1,1)]:
    result = perceptron(np.array([x1, x2]), weights_nand, bias_nand)
    print(f"  {x1} NAND {x2} = {result}")
A single neuron can learn any linearly separable function -- any function where a straight line (or hyperplane, in higher dimensions) can separate the positive examples from the negative ones. If you visualize the 2D input space, AND draws a line that isolates the top-right corner (both inputs = 1), while OR draws a line that cuts off only the bottom-left corner (both inputs = 0). Different lines, same mechanism. The weights and bias define where the line sits.
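To make that geometry concrete, here is a small sketch (using the AND/OR weights from earlier) that solves w1*x1 + w2*x2 + b = 0 for x2, i.e. the actual line each gate draws in the input plane:

```python
import numpy as np

def boundary_x2(x1, weights, bias):
    """Solve w1*x1 + w2*x2 + b = 0 for x2: a point on the decision line."""
    return -(weights[0] * x1 + bias) / weights[1]

# AND: weights [1, 1], bias -1.5  ->  line x1 + x2 = 1.5
# OR:  weights [1, 1], bias -0.5  ->  line x1 + x2 = 0.5
for name, w, b in [("AND", np.array([1.0, 1.0]), -1.5),
                   ("OR",  np.array([1.0, 1.0]), -0.5)]:
    pts = [(x1, boundary_x2(x1, w, b)) for x1 in (0.0, 1.0)]
    print(f"{name} boundary passes through {pts}")
```

The AND line (x1 + x2 = 1.5) leaves only the corner (1,1) on the firing side; the OR line (x1 + x2 = 0.5) leaves only (0,0) on the silent side.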
If you remember episode #20 on SVMs, this should ring a bell. SVMs also draw separating hyperplanes in feature space. The perceptron is the oldest, simplest version of that same idea -- except SVMs find the optimal hyperplane (maximum margin), while the perceptron just finds any hyperplane that works ;-)
The perceptron learning algorithm
Rosenblatt didn't just design a neuron -- he designed a learning algorithm. The perceptron learning algorithm is arguably the simplest learning algorithm that exists, and it connects directly to the training loop concept from episode #7:
- Initialize weights to small random values
- For each training example: compute the prediction, compare to the true label
- If correct: do nothing
- If wrong: adjust weights in the direction that would have given the right answer
The update rule: w = w + lr * (true - predicted) * x. When the perceptron predicts 0 but should predict 1, (true - predicted) = 1, so the weights increase in the direction of the input -- making it more likely to fire for similar inputs next time. When it predicts 1 but should predict 0, the weights decrease. The bias gets the same treatment.
Compare this to gradient descent from episodes #7 and #10. The perceptron update isn't technically gradient descent (it uses the raw error signal, not a differentiable loss function), but the spirit is identical: look at how wrong you are, adjust parameters to be less wrong. The same core loop you've been seeing since episode #7 -- predict, compare, adjust, repeat.
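Before wrapping this in a class, here is the update rule traced for a single mistake (the starting values are chosen purely for illustration): weights [0, 0], bias 0, learning rate 0.1, and the example x = [1, 1] with true label 1. The weighted sum is 0, which is not > 0, so the perceptron predicts 0, the error is 1, and every parameter moves toward the input:

```python
import numpy as np

lr = 0.1
w = np.array([0.0, 0.0])
b = 0.0
x = np.array([1.0, 1.0])
true_label = 1

pred = 1 if np.dot(x, w) + b > 0 else 0   # weighted sum is 0 -> predicts 0
error = true_label - pred                  # 1 - 0 = 1: we under-fired
w = w + lr * error * x                     # weights become [0.1, 0.1]
b = b + lr * error                         # bias becomes 0.1
print(w, b)  # -> [0.1 0.1] 0.1
```

Next time the same input arrives, the weighted sum is 0.1 + 0.1 + 0.1 = 0.3 > 0, and the perceptron fires correctly.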
class Perceptron:
    def __init__(self, n_features, lr=0.1):
        self.weights = np.random.randn(n_features) * 0.01
        self.bias = 0.0
        self.lr = lr
        self.history = []

    def predict(self, x):
        return 1 if np.dot(x, self.weights) + self.bias > 0 else 0

    def train(self, X, y, epochs=100):
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                pred = self.predict(xi)
                error = yi - pred
                if error != 0:
                    self.weights += self.lr * error * xi
                    self.bias += self.lr * error
                    errors += 1
            self.history.append(errors)
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                return
        print(f"Did not converge after {epochs} epochs")

# Train on AND
X_and = np.array([[0,0], [0,1], [1,0], [1,1]])
y_and = np.array([0, 0, 0, 1])
p = Perceptron(2)
p.train(X_and, y_and)
print(f"Learned weights: {p.weights.round(3)}")
print(f"Learned bias: {p.bias:.3f}")
for xi, yi in zip(X_and, y_and):
    print(f"  {xi} -> {p.predict(xi)} (true: {yi})")
There's a beautiful theorem behind this: if the data is linearly separable, the perceptron learning algorithm is guaranteed to converge in a finite number of steps. No hyperparameter tuning needed (well, the learning rate affects speed but not convergence), no learning rate scheduling, no early stopping. It just works -- as long as a solution exists. This is called the Perceptron Convergence Theorem, and it was one of the first formal guarantees in machine learning.
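A quick sketch of that learning-rate claim (a standalone loop equivalent to the Perceptron class, so it runs on its own): on AND, any positive learning rate reaches zero errors. In fact, with zero-initialized weights the learning rate merely rescales all parameters by the same factor, so the predictions -- and the convergence trajectory -- are identical; with random initialization it affects speed, but never whether a separable problem converges.

```python
import numpy as np

def epochs_to_converge(X, y, lr, max_epochs=100):
    """Plain perceptron loop; returns the first epoch with zero errors."""
    w, b = np.zeros(X.shape[1]), 0.0
    for epoch in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(xi, w) + b > 0 else 0
            if pred != yi:
                w += lr * (yi - pred) * xi
                b += lr * (yi - pred)
                errors += 1
        if errors == 0:
            return epoch
    return None  # did not converge within max_epochs

X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
for lr in (0.01, 0.1, 1.0):
    print(f"lr={lr}: zero errors at epoch {epochs_to_converge(X_and, y_and, lr)}")
```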
Let's verify this works on OR too, and track how quickly it converges:
# Train on OR
X_or = np.array([[0,0], [0,1], [1,0], [1,1]])
y_or = np.array([0, 1, 1, 1])
p_or = Perceptron(2, lr=0.1)
p_or.train(X_or, y_or)
print("\nOR gate learned:")
print(f"  Weights: {p_or.weights.round(3)}, Bias: {p_or.bias:.3f}")
for xi, yi in zip(X_or, y_or):
    print(f"  {xi} -> {p_or.predict(xi)} (true: {yi})")

# Convergence speed comparison
print("\nConvergence history (errors per epoch):")
print(f"  AND: {p.history}")
print(f"  OR:  {p_or.history}")
Both learn their respective functions in just a handful of epochs. The algorithm is dead simple and it works. So why aren't we done? Why do we need 100+ more episodes? Because of XOR.
The XOR problem: the wall
Now try XOR -- the function that outputs 1 when inputs differ and 0 when they're the same:
| x1 | x2 | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single straight line can separate the 1s from the 0s. Plot it mentally: the 1s are at (0,1) and (1,0), the 0s at (0,0) and (1,1). They're diagonally opposite. Any line you draw will always have at least one point on the wrong side. This is the geometry -- XOR is NOT linearly separable.
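Before training on it, you can sanity-check that claim by brute force. This sketch scans a grid of candidate weights and biases and records the best any single linear threshold manages on XOR; it tops out at 3 of 4 correct. (A grid search is an illustration, not a proof -- but Minsky and Papert's result, discussed below, guarantees the same holds for all real-valued weights.)

```python
import itertools
import numpy as np

X = np.array([[0,0], [0,1], [1,0], [1,1]])
y_xor = np.array([0, 1, 1, 0])

best = 0
vals = np.linspace(-2, 2, 21)  # candidate weights and biases, step 0.2
for w1, w2, b in itertools.product(vals, vals, vals):
    preds = (X[:, 0] * w1 + X[:, 1] * w2 + b > 0).astype(int)
    best = max(best, int((preds == y_xor).sum()))
print(f"Best any linear threshold scores on XOR: {best}/4")  # -> 3/4
```

Getting 3 right is easy (an OR-like line fails only on (1,1)); getting the fourth requires a boundary no straight line can draw.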
X_xor = np.array([[0,0], [0,1], [1,0], [1,1]])
y_xor = np.array([0, 1, 1, 0])
p_xor = Perceptron(2, lr=0.1)
p_xor.train(X_xor, y_xor, epochs=1000)
print("\nXOR results (after 1000 epochs):")
for xi, yi in zip(X_xor, y_xor):
    print(f"  {xi} -> {p_xor.predict(xi)} (true: {yi})")
print(f"Final errors per epoch (last 5): {p_xor.history[-5:]}")
The perceptron fails. It oscillates forever, never converging, because no solution exists within its model class (a single linear boundary). The convergence theorem says "if separable, converge." The contrapositive is equally true: if not separable, never converge.
This is the result that nearly killed neural network research. In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a book that proved mathematically that single-layer perceptrons cannot learn XOR or any non-linearly separable function. The book was enormously influential -- and it was interpreted (correctly or not) as a death sentence for the entire neural network approach. Funding for neural network research dried up almost completely. The first AI winter had begun.
(Having said that, Minsky and Papert were careful in their book to note that multi-layer networks might solve the problem. But that nuance was lost in the public reception. The headline was "perceptrons can't learn XOR" and funders heard "neural networks don't work." The subtlety between "this specific architecture" and "the entire paradigm" got crushed under institutional momentum.)
The solution: multiple layers
Minsky and Papert were right about single neurons. But they were wrong to dismiss the entire approach -- because adding just one hidden layer solves the problem completely.
A multi-layer perceptron (MLP) chains neurons together: input layer -> hidden layer -> output layer. The hidden layer transforms the input into a new representation where the problem becomes linearly separable. For XOR, one hidden neuron can compute "x1 AND x2" and another can compute "x1 OR x2". The output neuron then combines them: "OR AND (NOT AND)" -- which is exactly XOR.
def mlp_xor(x):
    """Hand-crafted 2-layer network that solves XOR.
    Hidden layer: 2 neurons (OR gate + AND gate).
    Output: OR AND (NOT AND) = XOR."""
    # Hidden layer: 2 neurons
    h1 = 1 if (x[0] + x[1] - 0.5) > 0 else 0  # OR gate
    h2 = 1 if (x[0] + x[1] - 1.5) > 0 else 0  # AND gate
    # Output: OR AND (NOT AND) = XOR
    out = 1 if (h1 - h2 - 0.5) > 0 else 0
    return out

print("MLP solving XOR:")
for x1, x2 in [(0,0), (0,1), (1,0), (1,1)]:
    result = mlp_xor([x1, x2])
    expected = x1 ^ x2
    print(f"  {x1} XOR {x2} = {result} (expected: {expected})")
The hidden layer re-represents the input. The raw 2D input space has no linear separation for XOR. But the hidden layer maps each input to a new 2D point (h1, h2) where the XOR classes are linearly separable. Let me show you exactly what this transformation looks like:
# Visualize the hidden representation
print("\nHidden layer representation:")
print(f"  {'Input':>10s} {'h1 (OR)':>8s} {'h2 (AND)':>9s} {'XOR':>4s}")
for x1, x2 in [(0,0), (0,1), (1,0), (1,1)]:
    h1 = 1 if (x1 + x2 - 0.5) > 0 else 0
    h2 = 1 if (x1 + x2 - 1.5) > 0 else 0
    xor = mlp_xor([x1, x2])
    print(f"  ({x1}, {x2}) {h1:>8d} {h2:>9d} {xor:>4d}")
print("\nIn hidden space (h1, h2):")
print("  (0,0) -> XOR=0  (both inputs 0)")
print("  (1,0) -> XOR=1  (inputs (0,1) and (1,0) both land here)")
print("  (1,1) -> XOR=0  (both inputs 1)")
print("\nNow the output neuron just separates (1,0) from (0,0) and (1,1)")
print("with one straight line. A linearly separable problem!")
This is the core insight of deep learning, and it's worth pausing to let it sink in: each layer learns a new representation of the data that makes the next layer's job easier. Raw pixels -> edges -> textures -> parts -> objects. Each transformation makes the classification problem slightly more linearly separable than the last. The hidden layer doesn't just process data -- it re-represents it. That's a fundamentally different operation from anything we've done in the classical ML world, where the feature representation was fixed and hand-crafted by us.
Remember from episode #15 on feature engineering how we spent all that effort crafting good features? In a neural network, the hidden layers ARE the feature engineering -- learned automatically from data. The features you'd spend days hand-crafting, a sufficiently deep network can discover on its own (given enough data and compute).
Why depth matters
One hidden layer is theoretically sufficient -- the universal approximation theorem proves that a single hidden layer with enough neurons can approximate any continuous function to arbitrary precision. So why go deep? Why do modern networks have dozens or even hundreds of layers?
Because width (many neurons in one layer) and depth (many layers) have fundamentally different computational properties. A deep network can represent hierarchical compositions efficiently: "the third feature of the second transformation of the first abstraction." A wide shallow network can approximate the same function but needs exponentially more neurons to do it.
Consider a concrete example -- learning a composite function over 2D input:
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

np.random.seed(42)
X_demo = np.random.randn(500, 2)
y_demo = (np.sin(X_demo[:,0] * X_demo[:,1]) +
          np.cos(X_demo[:,0] + X_demo[:,1]))

# Shallow: 50 neurons in a single hidden layer
shallow = MLPRegressor(
    hidden_layer_sizes=(50,), max_iter=500, random_state=42
)
# Deep: 30 neurons spread across three hidden layers of 10
deep = MLPRegressor(
    hidden_layer_sizes=(10, 10, 10), max_iter=500, random_state=42
)
shallow.fit(X_demo, y_demo)
deep.fit(X_demo, y_demo)

print("Width vs depth comparison:")
print(f"  Shallow (50 neurons, 1 layer): "
      f"MSE = {mean_squared_error(y_demo, shallow.predict(X_demo)):.4f}")
print(f"  Deep (10+10+10, 3 layers):     "
      f"MSE = {mean_squared_error(y_demo, deep.predict(X_demo)):.4f}")
print("\n  Shallow total neurons: 50")
print("  Deep total neurons:    30")
print("  (Depth achieves a comparable or better fit with fewer neurons)")
Depth gives you compositionality -- the ability to build complex functions from simple parts, where each part builds on the previous one. The first layer learns simple features. The second layer combines those into higher-level features. The third layer combines those into even more abstract features. This hierarchical composition is remarkably efficient for the kinds of structured patterns that exist in real-world data (images, text, audio), which is why modern networks have dozens or hundreds of layers rather than one enormous layer.
(This connects to something we saw with decision trees in episode #17 and random forests in episode #18. A single decision tree builds a hierarchical representation too -- each split partitions the data into finer regions. But trees are greedy and axis-aligned. Neural networks learn smooth, continuous transformations that can capture much richer structure.)
Seeing it work with scikit-learn
We've built perceptrons from scratch. Now let's use scikit-learn's MLPClassifier to see how a proper multi-layer perceptron handles a real (non-trivial) classification problem:
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# make_moons: a classic non-linearly separable dataset
X_moons, y_moons = make_moons(n_samples=500, noise=0.2,
                              random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=42
)

# Single perceptron (no hidden layer -- linear classifier)
linear = MLPClassifier(
    hidden_layer_sizes=(), max_iter=1000, random_state=42
)
# MLP with one hidden layer
mlp_small = MLPClassifier(
    hidden_layer_sizes=(10,), max_iter=1000, random_state=42
)
# MLP with two hidden layers
mlp_deep = MLPClassifier(
    hidden_layer_sizes=(10, 10), max_iter=1000, random_state=42
)

results = {}
for name, model in [('No hidden layer', linear),
                    ('1 hidden (10)', mlp_small),
                    ('2 hidden (10,10)', mlp_deep)]:
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    results[name] = (train_acc, test_acc)
    print(f"{name:>20s}: train={train_acc:.3f}, "
          f"test={test_acc:.3f}")

print("\nThe moons dataset is non-linearly separable")
print("(like XOR but continuous). No single line can do it.")
print("Hidden layers solve it by re-representing the data.")
The pattern is clear: without a hidden layer, the MLP is just a linear classifier (like logistic regression from episode #12) and can't handle the non-linear boundary. Add even a single hidden layer with 10 neurons and accuracy jumps dramatically. The hidden neurons learn to bend the decision boundary into a curve that follows the moon shapes. That's representation learning in action.
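You can watch that re-representation happen directly. This sketch (the training settings are illustrative, not tuned) trains a small MLP on the moons data, computes its hidden-layer activations by hand via the fitted `coefs_` and `intercepts_` attributes, and then fits a purely linear classifier on the raw inputs versus the learned hidden features:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=42)

# Train a small MLP; its hidden layer learns the representation
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                    random_state=42).fit(X, y)

# Hidden activations computed by hand: relu(X @ W0 + b0)
# (relu is MLPClassifier's default activation)
H = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

# Same linear model, two different representations of the same data
acc_raw = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
acc_hidden = LogisticRegression(max_iter=1000).fit(H, y).score(H, y)
print(f"Linear model on raw inputs:         {acc_raw:.3f}")
print(f"Linear model on hidden activations: {acc_hidden:.3f}")
```

The linear model itself never changed; only the representation did. That gap between the two accuracies is the value the hidden layer added.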
The AI winters and the comeback
The history of neural networks is a story of boom-bust cycles driven by three factors: theory, hardware, and data. Understanding this history matters because it explains why deep learning took 60 years to dominate, and it gives you perspective on whether the current boom is different from the previous ones (spoiler: I think it is, but the reasoning matters more than the conclusion).
1958-1969: The first wave. Rosenblatt's perceptron generates enormous excitement. The New York Times reports "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." That's... a bit optimistic for a single linear classifier ;-) Then Minsky and Papert's book proves the limitations. Funding collapses. First AI winter.
1986: The second wave. Backpropagation (which we'll implement from scratch in episode #39) is rediscovered and popularized by Rumelhart, Hinton, and Williams. This solves the training problem for multi-layer networks -- you can now automatically learn the hidden layer weights, not just hand-craft them like we did with our XOR solution above. Excitement returns. But training is slow (no GPUs), networks are small (limited memory), and SVMs (episode #20) often work better on the problems people actually cared about. The second wave fades.
2006-2012: The third wave. Geoffrey Hinton demonstrates that deep networks can be pre-trained layer by layer using unsupervised methods. But the real breakthrough comes in 2012: AlexNet -- a convolutional neural network -- wins the ImageNet image classification competition by a massive margin. Three things had changed since the 1980s:
- GPUs made training fast enough (10-100x speedup over CPUs for matrix operations)
- The internet had created massive labeled datasets (ImageNet: 14 million images, 20,000 categories)
- Decades of small algorithmic improvements (ReLU activation, dropout, batch normalization) made training stable enough for very deep networks
Since 2012, neural networks have dominated virtually every perceptual task: image recognition, speech recognition, machine translation, game playing, protein folding, and -- since 2022 -- general language understanding and generation through large language models. The architectures are vastly more sophisticated than Rosenblatt's perceptron, but the core principle is identical: weighted sums, nonlinear activations, gradient-based learning.
The difference between this wave and the previous two? Scale. The 1960s perceptron had tens of parameters. The 1980s networks had thousands. AlexNet had 60 million. GPT-4 has (reportedly) over a trillion. And unlike the previous waves, this one has commercial products generating billions in revenue -- which means the funding isn't coming from government research grants that can be cut. It's coming from markets that demand the technology. That changes the dynamics considerably.
What the perceptron teaches us
The perceptron is trivial by modern standards. You'd never use one in production. But it teaches three lessons that remain relevant throughout everything we'll build in Arc 3:
Representation is everything. The perceptron fails on XOR not because the algorithm is bad, but because the raw input space doesn't support a linear solution. Adding a hidden layer changes the representation, and suddenly the problem is trivial. This principle scales to every level of modern deep learning: the right representation makes hard problems easy. That's why feature learning (what neural networks do automatically) is so powerful -- it finds representations that no human would think to create.
Limitations drive progress. Minsky and Papert's proof told the field exactly what needed to be solved: learning multi-layer representations. Without a clear understanding of the limitation, there's no direction for research. The XOR problem was a gift disguised as a death sentence -- it gave researchers a precise target, and backpropagation was the bullseye.
Simple components compose into complex systems. A single neuron computes a weighted sum and a threshold. That's it. Stack thousands of them in layers, train them on data with backpropagation, and they learn to see, read, translate, and reason. The complexity doesn't come from complicated individual pieces -- it comes from the interactions between many simple pieces. This is a deep principle that shows up everywhere in computing, biology, and physics, and it's the foundational insight that makes neural networks work.
Before you close this tab
Here's what to take away from this episode:
- The perceptron is a single artificial neuron: weighted sum of inputs, bias, step activation -- outputs 0 or 1. Invented by Frank Rosenblatt in 1958;
- The perceptron learning algorithm adjusts weights by the error signal and is guaranteed to converge for linearly separable data (the Perceptron Convergence Theorem);
- XOR is not linearly separable -- no single perceptron can learn it, proven by Minsky and Papert in 1969. This result triggered the first AI winter;
- Adding a hidden layer solves XOR by re-representing the input in a space where linear separation is possible. This is the core idea behind all of deep learning;
- Depth matters because it enables hierarchical composition -- building complex functions from simple sequential transformations. The universal approximation theorem says one wide layer suffices in theory, but depth is far more efficient in practice;
- Neural network history follows boom-bust cycles. The current boom (since 2012) is different from previous ones because it's driven by commercial products and massive compute, not just research grants;
- Everything you learned in Arc 2 -- evaluation, regularization, ethics, production engineering -- carries forward and becomes MORE important with neural networks, not less.
Thanks for reading! Tot de volgende ;-)