
Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities

What will I learn
- Bayesian vs frequentist thinking -- two fundamentally different views of what probability even means;
- Bayes' theorem applied to real ML problems, step by step from scratch;
- Naive Bayes classifier -- simple, fast, and surprisingly effective for text classification;
- building Bayesian inference from scratch so you see exactly how priors update to posteriors;
- Bayesian optimization for hyperparameter tuning -- smarter than grid search;
- when Bayesian approaches outperform point estimates, and when they're overkill;
- the connection between Bayesian priors and the regularization we already know from episode #11.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- [Learn AI Series (#1) - What Machine Learning Actually Is](/@scipio/learn-ai-series-1-what-machine-learning-actually-is)
- [Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy](/@scipio/learn-ai-series-2-setting-up-your-ai-workbench-python-and-numpy)
- [Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World](/@scipio/learn-ai-series-3-your-data-is-just-numbers-how-machines-see-the-world)
- [Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition](/@scipio/learn-ai-series-4-your-first-prediction-no-math-just-intuition)
- [Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like](/@scipio/learn-ai-series-5-patterns-in-data-what-learning-actually-looks-like)
- [Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas](/@scipio/learn-ai-series-6-from-intuition-to-math-why-we-need-formulas)
- [Learn AI Series (#7) - The Training Loop - See It Work Step by Step](/@scipio/learn-ai-series-7-the-training-loop-see-it-work-step-by-step)
- [Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra](/@scipio/learn-ai-series-8-the-math-you-actually-need-part-1-linear-algebra)
- [Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability](/@scipio/learn-ai-series-9-the-math-you-actually-need-part-2-calculus-and-probability)
- [Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch](/@scipio/learn-ai-series-10-your-first-ml-model-linear-regression-from-scratch)
- [Learn AI Series (#11) - Making Linear Regression Real](/@scipio/learn-ai-series-11-making-linear-regression-real)
- [Learn AI Series (#12) - Classification - Logistic Regression From Scratch](/@scipio/learn-ai-series-12-classification-logistic-regression-from-scratch)
- [Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works](/@scipio/learn-ai-series-13-evaluation-how-to-know-if-your-model-actually-works)
- [Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About](/@scipio/learn-ai-series-14-data-preparation-the-80-nobody-talks-about)
- [Learn AI Series (#15) - Feature Engineering and Selection](/@scipio/learn-ai-series-15-feature-engineering-and-selection)
- [Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML](/@scipio/learn-ai-series-16-scikit-learn-the-standard-library-of-ml)
- [Learn AI Series (#17) - Decision Trees - How Machines Make Decisions](/@scipio/learn-ai-series-17-decision-trees-how-machines-make-decisions)
- [Learn AI Series (#18) - Random Forests - Wisdom of Crowds](/@scipio/learn-ai-series-18-random-forests-wisdom-of-crowds)
- [Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion](/@scipio/learn-ai-series-19-gradient-boosting-the-kaggle-champion)
- [Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary](/@scipio/learn-ai-series-20-support-vector-machines-drawing-the-perfect-boundary)
- [Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes](/@scipio/learn-ai-series-21-mini-project-predicting-crypto-market-regimes)
- [Learn AI Series (#22) - K-Means Clustering - Finding Groups](/@scipio/learn-ai-series-22-k-means-clustering-finding-groups)
- [Learn AI Series (#23) - Advanced Clustering - Beyond K-Means](/@scipio/learn-ai-series-23-advanced-clustering-beyond-k-means)
- [Learn AI Series (#24) - Dimensionality Reduction - PCA](/@scipio/learn-ai-series-24-dimensionality-reduction-pca)
- [Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP](/@scipio/learn-ai-series-25-advanced-dimensionality-reduction-t-sne-and-umap)
- [Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong](/@scipio/learn-ai-series-26-anomaly-detection-finding-what-doesnt-belong)
- [Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."](/@scipio/learn-ai-series-27-recommendation-systems-users-like-you-also-liked)
- [Learn AI Series (#28) - Time Series Fundamentals - When Order Matters](/@scipio/learn-ai-series-28-time-series-fundamentals-when-order-matters)
- [Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next](/@scipio/learn-ai-series-29-time-series-forecasting-predicting-what-comes-next)
- [Learn AI Series (#30) - Natural Language Processing - Text as Data](/@scipio/learn-ai-series-30-natural-language-processing-text-as-data)
- [Learn AI Series (#31) - Word Embeddings - Meaning in Numbers](/@scipio/learn-ai-series-31-word-embeddings-meaning-in-numbers)
- [Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities](/@scipio/learn-ai-series-32-bayesian-methods-thinking-in-probabilities) (this post)
Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
Every model we've built so far gives you a single answer. Linear regression (episode #10) spits out "the predicted price is $350K." Logistic regression (episode #12) says "87% probability of spam." Random forests (episode #18) return one class label. But here's what none of them tell you: how confident is the model in that prediction? A model trained on 10,000 examples might produce the exact same number as one trained on just 10 examples -- but your trust in those two predictions should be wildly different. The prediction is the same; the uncertainty around it is not.
Bayesian methods provide a framework for reasoning about that uncertainty. Instead of producing a single point estimate, they produce a distribution -- a range of plausible values with associated probabilities. The model doesn't just say "the weight is 2.3"; it says "the weight is probably between 1.8 and 2.8, most likely around 2.3." And as you feed it more data, that range narrows. The model becomes more confident -- and it can tell you exactly HOW confident it is. That distinction turns out to be critical in any situation where decisions depend on predictions, which is (let's be honest) most of the time.
If you've been following along since episode #9, where we covered probability basics, you already have the mathematical foundation. Today we build on that foundation and connect it to everything we've learned since.
Here we go!
Two philosophies of probability
Before we write any code, we need to settle a philosophical question that has divided statisticians for over two centuries. Seriously -- people have had very heated arguments about this ;-)
Frequentist probability: probability is the long-run frequency of events. When I say "this coin has a 50% probability of heads," I mean that if you flip it millions of times, roughly half will be heads. Parameters (like the weights in a linear regression) are fixed but unknown constants. Our job is to estimate them from data. The data is random (drawn from some underlying process); the parameters are not.
Bayesian probability: probability represents our degree of belief. When I say "I'm 50% confident this coin is fair," I'm making a statement about my personal uncertainty, not about long-run frequencies. Parameters are random variables with their own probability distributions. I start with a prior (what I believed before seeing data), I observe data, and I compute a posterior (what I believe after seeing data). The data is fixed (we observed it); my belief about the parameters is what changes.
The practical difference? A frequentist asks "what's the probability of seeing this data, given a specific parameter value?" (the likelihood). A Bayesian asks "what's the probability of this parameter value, given the data I've observed?" (the posterior). The second question is usually the one you actually care about.
import numpy as np
# Frequentist vs Bayesian in action:
# We flip a coin 10 times and observe 7 heads.
# Frequentist estimate: maximum likelihood
n_flips = 10
n_heads = 7
mle_estimate = n_heads / n_flips
print(f"Observed: {n_heads} heads in {n_flips} flips")
print(f"Frequentist (MLE): p(heads) = {mle_estimate:.2f}")
print(f" (The single value that maximizes P(data | p))")
# Bayesian approach: compute a distribution over possible p values
# Prior: uniform (all values 0 to 1 equally likely)
# Likelihood: binomial
# Posterior: Beta distribution (conjugate prior for binomial)
from scipy.stats import beta as beta_dist
# Beta(alpha, beta) is the conjugate prior for binomial
# Uniform prior = Beta(1, 1)
# After observing k heads in n flips: posterior = Beta(1+k, 1+n-k)
alpha_prior, beta_prior = 1, 1
alpha_post = alpha_prior + n_heads
beta_post = beta_prior + (n_flips - n_heads)
posterior = beta_dist(alpha_post, beta_post)
mean_post = posterior.mean()
ci_low, ci_high = posterior.interval(0.90)
print(f"\nBayesian (Beta posterior):")
print(f" Prior: Beta({alpha_prior}, {beta_prior}) -- uniform")
print(f" Posterior: Beta({alpha_post}, {beta_post})")
print(f" Mean: {mean_post:.3f}")
print(f" 90% credible interval: [{ci_low:.3f}, {ci_high:.3f}]")
print(f" --> We believe p is between {ci_low:.2f} and "
      f"{ci_high:.2f} with 90% probability")
Notice the key difference: the frequentist gives you 0.70 and that's it. The Bayesian gives you 0.70 as the most likely value, but ALSO tells you the uncertainty around it -- with only 10 flips, the 90% credible interval is quite wide. If we had 1,000 flips with 700 heads, the interval would shrink dramatically. The Bayesian answer naturally captures sample size effects that the frequentist point estimate hides.
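You can check that claim directly -- the same Beta update, just with a hundred times more data:

```python
from scipy.stats import beta as beta_dist

# Same uniform Beta(1, 1) prior, two different sample sizes
post_small = beta_dist(1 + 7, 1 + 3)       # 7 heads in 10 flips
post_large = beta_dist(1 + 700, 1 + 300)   # 700 heads in 1,000 flips

lo_s, hi_s = post_small.interval(0.90)
lo_l, hi_l = post_large.interval(0.90)
print(f"10 flips:    90% interval [{lo_s:.3f}, {hi_s:.3f}], "
      f"width {hi_s - lo_s:.3f}")
print(f"1,000 flips: 90% interval [{lo_l:.3f}, {hi_l:.3f}], "
      f"width {hi_l - lo_l:.3f}")
```

Both posteriors peak near 0.70, but the second interval is roughly a tenth the width of the first: same point estimate, very different confidence.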
Bayes' theorem: the formula that connects it all
The entire Bayesian framework reduces to one formula. We first saw this in episode #9 during our probability discussion, but now we'll use it as a working tool:
P(model | data) = P(data | model) x P(model) / P(data)
In words:
- P(model | data) = the posterior -- our updated belief about the model after seeing data
- P(data | model) = the likelihood -- how probable the data is under this model
- P(model) = the prior -- our belief about the model before seeing any data
- P(data) = the evidence -- a normalizing constant (often intractable to compute directly)
In practice, we usually work with the proportional form:
P(model | data) is proportional to P(data | model) x P(model)
The posterior combines what we knew before (prior) with what the data tells us (likelihood). More data makes the likelihood dominate, so the prior becomes less important. With enough data, the prior is essentially irrelevant -- the data speaks for itself. With very little data, the prior matters a lot.
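A tiny sketch of that trade-off, using the coin setup from above (the Beta(50, 50) "fair coin" prior is an illustrative choice, not a recommendation):

```python
from scipy.stats import beta as beta_dist

# Observe 3 heads in 4 flips, under two different priors
k, n = 3, 4
weak = beta_dist(1 + k, 1 + (n - k))       # uniform Beta(1, 1) prior
strong = beta_dist(50 + k, 50 + (n - k))   # strong "fair coin" Beta(50, 50) prior

print(f"Weak prior posterior mean:   {weak.mean():.3f}")   # 0.667 -- follows the data
print(f"Strong prior posterior mean: {strong.mean():.3f}") # 0.510 -- stays near 0.5
```

With only 4 flips, the strong prior barely budges; feed either posterior a few hundred more flips and they converge to the same answer.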
Let me make this concrete with a spam filtering example:
# Bayes' theorem for spam filtering
p_spam = 0.30 # prior: 30% of emails are spam
p_ham = 0.70 # prior: 70% are legitimate
# Likelihood: P("free money" | class)
p_free_money_given_spam = 0.80 # 80% of spam contains "free money"
p_free_money_given_ham = 0.02 # 2% of legit email has "free money"
# Evidence: P("free money") -- total probability
p_free_money = (p_free_money_given_spam * p_spam +
p_free_money_given_ham * p_ham)
# Posterior: P(spam | "free money")
p_spam_given_free_money = (p_free_money_given_spam * p_spam) / p_free_money
p_ham_given_free_money = (p_free_money_given_ham * p_ham) / p_free_money
print(f"Prior P(spam) = {p_spam:.0%}")
print(f"Observed: email contains 'free money'")
print(f"Posterior P(spam | 'free money') = {p_spam_given_free_money:.1%}")
print(f"Posterior P(ham | 'free money') = {p_ham_given_free_money:.1%}")
print(f"\n--> Without content: 30% chance of spam")
print(f"--> With 'free money': {p_spam_given_free_money:.0%} chance of spam")
Without any email content, our belief is 30% spam (the prior). After observing "free money" in the email, our belief jumps to roughly 94% spam. The observed evidence (the words) updated our prior belief through Bayes' theorem. This is the core insight: Bayesian reasoning is belief updating. You start with a prior, observe evidence, and compute an updated belief. Each new piece of evidence refines your estimate further.
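That chaining is mechanical: the posterior from one update becomes the prior for the next. Here is a sketch reusing the numbers above, plus a second, entirely made-up piece of evidence (a suspicious link, with hypothetical likelihoods):

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """One Bayes-rule update: returns P(hypothesis | evidence)."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / evidence

p1 = bayes_update(0.30, 0.80, 0.02)   # saw "free money"
p2 = bayes_update(p1, 0.60, 0.05)     # saw a suspicious link (made-up numbers)
print(f"After 'free money':    P(spam) = {p1:.3f}")
print(f"After suspicious link: P(spam) = {p2:.3f}")
```

Each observation ratchets the belief further; note that the second update's "prior" is the first update's posterior.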
Sequential updating: beliefs evolve with data
One of the most powerful properties of Bayesian reasoning is that it's sequential -- you can update your beliefs one data point at a time, and the order doesn't matter. The posterior from the first update becomes the prior for the next update. This is exactly how you'd want a system to learn from streaming data (it connects directly to the online-learning ideas we touched on in episode #29):
from scipy.stats import beta as beta_dist
def bayesian_coin_update(observations, alpha_prior=1, beta_prior=1):
    """Watch beliefs evolve as we observe more coin flips."""
    alpha = alpha_prior
    beta_val = beta_prior
    print(f"{'Flip':>5s} {'Result':>7s} {'Alpha':>6s} "
          f"{'Beta':>6s} {'Mean':>6s} {'95% CI':>16s}")
    print("-" * 56)
    dist = beta_dist(alpha, beta_val)
    ci = dist.interval(0.95)
    print(f"{'prior':>5s} {'':>7s} {alpha:>6.1f} {beta_val:>6.1f} "
          f"{dist.mean():>6.3f} [{ci[0]:.3f}, {ci[1]:.3f}]")
    for i, obs in enumerate(observations, 1):
        if obs == 1:  # heads
            alpha += 1
        else:
            beta_val += 1
        dist = beta_dist(alpha, beta_val)
        ci = dist.interval(0.95)
        result = "H" if obs == 1 else "T"
        if i <= 10 or i % 10 == 0 or i == len(observations):
            print(f"{i:>5d} {result:>7s} {alpha:>6.1f} "
                  f"{beta_val:>6.1f} {dist.mean():>6.3f} "
                  f"[{ci[0]:.3f}, {ci[1]:.3f}]")
# Simulate a biased coin (true p = 0.65)
np.random.seed(42)
true_p = 0.65
flips = (np.random.random(100) < true_p).astype(int)
print("Bayesian sequential updating -- biased coin (true p=0.65)")
print("Starting with uniform prior Beta(1,1)\n")
bayesian_coin_update(flips)
Watch how the 95% credible interval narrows with each flip. After 10 flips, the interval is still quite wide -- you're not confident yet. After 50 flips, it's tightening around the true value. After 100 flips, the posterior mean is close to 0.65 and the interval is narrow. The prior (uniform, meaning "I have no idea") is completely overwhelmed by the data. This is what people mean when they say the prior "washes out" with enough data.
This sequential property is enormously practical. You don't need to retrain from scratch every time new data arrives -- you just update your posterior. For applications like A/B testing, clinical trials, or sensor fusion, this is exactly the behavior you want.
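For instance, a minimal Bayesian A/B test takes only a few lines (the conversion counts below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical A/B test: conversions out of visitors for two page variants
conv_a, n_a = 42, 500
conv_b, n_b = 64, 500

# Beta(1, 1) prior on each conversion rate; the posteriors are Beta too
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 100_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 100_000)

# Monte Carlo estimate of P(variant B's true rate > variant A's)
p_b_better = (samples_b > samples_a).mean()
print(f"P(B beats A | data) = {p_b_better:.3f}")
```

Instead of a p-value, you get the question answered directly: "what's the probability B is actually better?" -- and you can recompute it after every new visitor.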
Bayesian inference from scratch: estimating a distribution's mean
Let's build something more directly connected to ML. Suppose you're measuring the weight of a product coming off a manufacturing line, and you want to estimate the true mean weight. A frequentist would compute the sample mean and a confidence interval. A Bayesian starts with a prior belief about the mean and updates it with each measurement:
def bayesian_mean_estimation(data, prior_mean=0, prior_var=100,
                             noise_var=1.0):
    """Bayesian estimation of a Gaussian mean with known variance.
    This is the 'hello world' of Bayesian inference."""
    mu = prior_mean
    var = prior_var
    print(f"Prior: mean = {mu:.2f}, std = {np.sqrt(var):.2f}")
    print(f"Noise variance (known): {noise_var}")
    print(f"\n{'n':>4s} {'Obs':>6s} {'Post mean':>10s} "
          f"{'Post std':>9s} {'95% CI':>18s}")
    print("-" * 52)
    for i, x in enumerate(data, 1):
        # Bayesian update for a Gaussian with known variance:
        # posterior precision = prior precision + data precision
        prec_prior = 1.0 / var
        prec_data = 1.0 / noise_var
        prec_post = prec_prior + prec_data
        # Posterior mean = precision-weighted average of prior mean and data
        mu = (prec_prior * mu + prec_data * x) / prec_post
        var = 1.0 / prec_post
        std = np.sqrt(var)
        ci_low = mu - 1.96 * std
        ci_high = mu + 1.96 * std
        if i <= 5 or i % 10 == 0 or i == len(data):
            print(f"{i:>4d} {x:>6.2f} {mu:>10.4f} "
                  f"{std:>9.4f} [{ci_low:.3f}, {ci_high:.3f}]")
    return mu, var
# Generate measurements of a product (true weight = 5.2 kg)
np.random.seed(42)
true_weight = 5.2
measurements = np.random.normal(true_weight, 1.0, 50)
print("Bayesian mean estimation -- product weight")
print(f"True weight: {true_weight} kg\n")
final_mean, final_var = bayesian_mean_estimation(
    measurements, prior_mean=0, prior_var=100, noise_var=1.0
)
print(f"\nFinal estimate: {final_mean:.3f} +/- {np.sqrt(final_var):.3f}")
print(f"Sample mean: {measurements.mean():.3f}")
print(f"(They converge because the prior was weak)")
Two things to notice here. First, the posterior mean is a weighted average of the prior mean and the data mean, weighted by their respective precisions (inverse variances). A strong prior (small variance) pulls the estimate toward the prior. A weak prior (large variance, like our 100) barely influences the result -- the data dominates almost immediately. Second, the posterior variance shrinks with every observation, regardless of the observation's value. More data always means more confidence, even if the measurements are noisy. This is a direct consequence of Bayes' theorem.
(Having said that, this particular derivation assumes known noise variance, which is a simplification. In real problems you'd typically estimate both the mean AND the variance jointly, which requires more sophisticated inference methods. But the intuition is the same.)
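The precision-weighted pull of the prior is easy to verify in closed form for the whole batch at once (same known-noise simplification, simulated data):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(5.2, 1.0, 50)   # simulated measurements, true mean 5.2

def gaussian_posterior_mean(data, prior_mean, prior_var, noise_var=1.0):
    """Closed-form posterior mean of a Gaussian mean, known noise variance."""
    prec_post = 1 / prior_var + len(data) / noise_var
    return (prior_mean / prior_var + data.sum() / noise_var) / prec_post

weak = gaussian_posterior_mean(data, prior_mean=0, prior_var=100)
strong = gaussian_posterior_mean(data, prior_mean=0, prior_var=0.01)
print(f"Sample mean:              {data.mean():.3f}")
print(f"Weak prior (var=100):     {weak:.3f}   # hugs the data")
print(f"Strong prior (var=0.01):  {strong:.3f}   # dragged toward 0")
```

A prior variance of 0.01 means "I'm nearly certain the mean is 0" -- and 50 data points aren't enough to fully overcome that conviction.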
Naive Bayes classifier
Now let's connect Bayesian thinking to a practical ML classifier you'll use all the time. The Naive Bayes classifier applies Bayes' theorem to classification by making one key assumption: all features are conditionally independent given the class label. For text: the presence of one word doesn't affect the probability of another word, given the email is spam or ham.
This assumption is obviously, hilariously wrong -- "free" and "money" are highly correlated in spam emails. But the classifier works remarkably well despite the violated assumption, especially for text classification. This is one of those cases in ML where a wrong model with very few parameters beats a correct model with too many parameters (episode #13 anyone? ;-)):
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# Email classification dataset
emails = [
"Free money click here now limited offer",
"Meeting at 3pm in conference room B today",
"Win a free iPhone congratulations winner",
"Project update attached please review document",
"Earn cash from home no experience needed",
"Can we reschedule tomorrow morning meeting",
"Claim your prize free gift card today now",
"Budget report Q3 numbers look good overall",
"Discount deals savings sale limited time only",
"Please send the invoice by Friday thanks",
"Congratulations you have been selected winner",
"Team lunch on Thursday at the usual place",
"Act now before this exclusive offer expires",
"Here are the meeting notes from yesterday",
"Double your income work from home easy",
"Quick question about the API documentation",
] * 15
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 15)
# Shuffle
idx = np.random.RandomState(42).permutation(len(emails))
emails = [emails[i] for i in idx]
labels = labels[idx]
pipe = Pipeline([
('vec', CountVectorizer()),
('nb', MultinomialNB())
])
scores = cross_val_score(pipe, emails, labels, cv=5)
print(f"Naive Bayes spam classifier:")
print(f" Accuracy: {scores.mean():.1%} +/- {scores.std():.1%}")
# What did it learn? Inspect the log-probabilities
pipe.fit(emails, labels)
feature_names = pipe['vec'].get_feature_names_out()
log_probs_spam = pipe['nb'].feature_log_prob_[1]
log_probs_ham = pipe['nb'].feature_log_prob_[0]
# Most spammy and most hammy words
log_ratio = log_probs_spam - log_probs_ham
top_spam_idx = np.argsort(log_ratio)[::-1][:8]
top_ham_idx = np.argsort(log_ratio)[:8]
print(f"\nMost indicative of SPAM:")
for i in top_spam_idx:
    print(f"  {feature_names[i]:>15s} log-ratio: {log_ratio[i]:>+.2f}")
print(f"\nMost indicative of HAM:")
for i in top_ham_idx:
    print(f"  {feature_names[i]:>15s} log-ratio: {log_ratio[i]:>+.2f}")
Naive Bayes has several practical advantages that make it a go-to choice for text tasks even in 2026. It's extremely fast (both training and prediction are essentially just counting), requires very little data to estimate parameters, handles high-dimensional data naturally (a vocabulary of 100,000 words is no problem), and provides calibrated probability estimates out of the box. For email filtering, document categorization, and language detection, Naive Bayes is often the first thing you try -- and sometimes the last, because it just works.
The "naive" independence assumption actually helps prevent overfitting on small datasets. By treating each word independently, the model has fewer parameters to estimate than one that considers word interactions. This simplicity acts as implicit regularization -- a concept we first discussed in episode #11 with Ridge and Lasso, and that keeps showing up in different disguises throughout ML. Here it shows up as a modeling assumption rather than a penalty term, but the effect is the same: fewer parameters, less overfitting, better generalization on small data.
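To demystify what MultinomialNB computes under the hood, here is a hand-rolled sketch of the decision rule on toy word counts (all numbers hypothetical):

```python
import numpy as np

# Toy training counts: how often each vocabulary word appeared
# in spam vs ham emails (hypothetical numbers)
vocab = ["free", "money", "meeting", "report"]
spam_counts = np.array([30, 25, 1, 1])
ham_counts = np.array([2, 3, 28, 27])

def log_posterior(word_counts, class_counts, log_prior, alpha=1.0):
    """log P(class) + sum of count-weighted log P(word | class),
    with Laplace (add-alpha) smoothing."""
    smoothed = class_counts + alpha
    log_likelihood = np.log(smoothed / smoothed.sum())
    return log_prior + word_counts @ log_likelihood

email = np.array([1, 1, 0, 0])   # the email contains "free" and "money"
score_spam = log_posterior(email, spam_counts, np.log(0.5))
score_ham = log_posterior(email, ham_counts, np.log(0.5))
print("spam" if score_spam > score_ham else "ham")   # -> spam
```

Training really is just counting words per class; prediction is a dot product of counts with log-probabilities. That's why both are so fast.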
Naive Bayes vs other classifiers on text
Let's compare Naive Bayes directly against the classifiers we covered in earlier episodes. Remember the TF-IDF pipeline from episode #30? Same approach, different classifiers:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
# Larger dataset for a fair comparison
pos_reviews = [
"Excellent product amazing quality highly recommend",
"Great value fantastic results very satisfied customer",
"Outstanding performance works perfectly every time",
"Wonderful experience best purchase this year by far",
"Superb quality fast shipping will buy again soon",
"Really impressed with this product exceeded expectations",
"Love it absolutely amazing just what I needed here",
"Top quality brilliant design very well made product",
] * 25
neg_reviews = [
"Terrible quality waste of money do not buy this",
"Awful experience horrible product completely useless junk",
"Very disappointed broken on arrival total garbage item",
"Worst purchase ever poor quality returning for refund",
"Cheap junk fell apart after one week never again",
"Disgusting customer service will never purchase from them",
"Complete ripoff does not work as described at all",
"Horrible product terrible fit not worth a single penny",
] * 25
texts = pos_reviews + neg_reviews
y = np.array([1] * len(pos_reviews) + [0] * len(neg_reviews))
idx = np.random.RandomState(42).permutation(len(texts))
texts = [texts[i] for i in idx]
y = y[idx]
classifiers = {
'Naive Bayes': MultinomialNB(),
'Logistic Regression': LogisticRegression(max_iter=1000),
'Linear SVM': LinearSVC(max_iter=2000),
'Random Forest': RandomForestClassifier(
n_estimators=100, random_state=42),
}
print(f"Sentiment classification comparison (5-fold CV):")
print(f" {'Classifier':>22s} {'Accuracy':>12s} {'Train time':>12s}")
print("-" * 50)
import time
for name, clf in classifiers.items():
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), stop_words='english'
        )),
        ('clf', clf),
    ])
    t0 = time.time()
    scores = cross_val_score(pipe, texts, y, cv=5)
    elapsed = time.time() - t0
    print(f" {name:>22s} {scores.mean():.1%} +/- {scores.std():.1%}"
          f" {elapsed:.3f}s")
You'll see that Naive Bayes is competitive with logistic regression and SVM on accuracy, but trains noticeably faster. On larger datasets (thousands or millions of documents), that speed advantage becomes enormous. This is why Naive Bayes remains the default first attempt for text classification -- it gives you a strong baseline in seconds that you can then try to beat with more complex models.
The Bayesian view of regularization
Here's something beautiful that connects Bayesian thinking to techniques you already know (and this blew my mind when I first learned it). Remember Ridge regression from episode #11? We added a penalty term lambda * sum(w^2) to the loss function, which pushed weights toward zero and prevented overfitting. That penalty seemed like a pragmatic engineering trick -- just punish big weights to keep things simple.
From the Bayesian perspective, Ridge regression is exactly equivalent to Bayesian linear regression with a Gaussian prior on the weights centered at zero. The penalty strength lambda is the inverse of the prior variance. Saying "I want small weights" (regularization) is the same as saying "I believe the weights are probably near zero" (Gaussian prior). Same math, different interpretation.
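One way to convince yourself the equivalence is exact is a quick numerical check: the Ridge solution equals the MAP estimate under a zero-mean Gaussian prior (toy data, no intercept so the closed forms line up):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=60)

alpha = 2.0   # penalty strength, playing the role of 1 / prior variance
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

# MAP estimate under a zero-mean Gaussian prior on the weights:
# w = (X^T X + alpha * I)^(-1) X^T y
w_map = np.linalg.solve(X.T @ X + alpha * np.eye(4), X.T @ y)

print(np.allclose(ridge.coef_, w_map))   # -> True
```

Same numbers, two stories: "penalize big weights" and "I believe the weights are near zero."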
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
# Generate a regression problem with many irrelevant features
X, y_reg = make_regression(
n_samples=100, n_features=50, n_informative=5,
noise=10, random_state=42
)
# Ridge = Bayesian linear regression with Gaussian prior on weights
# alpha in sklearn = lambda = 1/sigma^2_prior
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
print("Ridge regression == Bayesian regression with Gaussian prior")
print(f"\n {'alpha':>8s} {'1/alpha (prior var)':>18s} "
f"{'Interpretation':>35s} {'CV R^2':>8s}")
print("-" * 76)
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X, y_reg, cv=5, scoring='r2')
    prior_var = 1.0 / alpha
    if alpha <= 0.01:
        interp = "Very weak prior (trust data fully)"
    elif alpha <= 1.0:
        interp = "Moderate prior (mild regularization)"
    elif alpha <= 100:
        interp = "Strong prior (heavy regularization)"
    else:
        interp = "Very strong prior (weights -> 0)"
    print(f" {alpha:>8.3f} {prior_var:>18.3f} "
          f"{interp:>35s} {scores.mean():>8.3f}")
This connection runs deeper than Ridge. L1 regularization (Lasso, also from episode #11) corresponds to a Laplace prior -- a prior that's more peaked at zero and has heavier tails than the Gaussian. This explains why Lasso produces sparse solutions (weights exactly equal to zero): the Laplace prior has a sharp spike at zero that "pulls" small weights all the way down. Dropout in neural networks (which we'll cover in future episodes) can be interpreted as approximate Bayesian inference. The Bayesian perspective provides a unified framework for understanding why and how all these regularization tricks work.
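You can see the two priors' fingerprints directly by counting exact zeros (same kind of setup as the Ridge demo above; alpha=10 is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 of which are informative
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10, random_state=42)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(f"Ridge (Gaussian prior): {ridge_zeros} of 50 weights exactly zero")
print(f"Lasso (Laplace prior):  {lasso_zeros} of 50 weights exactly zero")
```

The Gaussian prior shrinks everything a little but never to exactly zero; the Laplace prior's spike at zero snaps the irrelevant weights all the way down.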
Bayesian optimization: smart hyperparameter tuning
In episode #16, we used grid search and random search to tune hyperparameters -- trying combinations from a predefined set and picking the best. Grid search is exhaustive but wastes time on unpromising regions. Random search is better (Bergstra and Bengio, 2012, showed it beats grid search on most problems), but it's still blindly sampling without learning from previous results.
Bayesian optimization uses Bayesian reasoning to choose which hyperparameters to try next. It builds a surrogate model (typically a Gaussian process) of the objective function (e.g., validation accuracy as a function of learning rate and max depth), then uses that model to decide where to sample next -- balancing exploitation (sampling where the model predicts high performance) with exploration (sampling where the model is uncertain):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
X_opt, y_opt = make_classification(
n_samples=500, n_features=20,
n_informative=10, random_state=42
)
def objective(learning_rate, max_depth, n_estimators):
    """Objective function: cross-val accuracy."""
    clf = GradientBoostingClassifier(
        learning_rate=learning_rate,
        max_depth=int(max_depth),
        n_estimators=int(n_estimators),
        random_state=42
    )
    return cross_val_score(clf, X_opt, y_opt, cv=3).mean()
# Simulate Bayesian optimization with a simple strategy:
# try random points, then focus around the best region
np.random.seed(42)
results = []
# Phase 1: exploration (random sampling)
print("Phase 1: Exploration (random sampling)")
for i in range(10):
    lr = 10 ** np.random.uniform(-3, 0)
    depth = np.random.randint(2, 8)
    n_est = np.random.choice([50, 100, 200, 300])
    score = objective(lr, depth, n_est)
    results.append((lr, depth, n_est, score))
    print(f"  Trial {i+1:>2d}: lr={lr:.4f}, depth={depth}, "
          f"n_est={n_est:>3d}, score={score:.4f}")
# Find best so far
results.sort(key=lambda x: -x[3])
best = results[0]
print(f"\nBest after exploration: lr={best[0]:.4f}, "
f"depth={best[1]}, n_est={best[2]}, score={best[3]:.4f}")
# Phase 2: exploitation (search near best region)
print(f"\nPhase 2: Exploitation (refining near best)")
best_lr, best_depth = best[0], best[1]
for i in range(10):
    # Sample near the best parameters (this is what the
    # Gaussian process surrogate model does more sophisticatedly)
    lr = best_lr * 10 ** np.random.uniform(-0.5, 0.5)
    lr = np.clip(lr, 1e-4, 1.0)
    depth = max(2, min(8, best_depth + np.random.randint(-1, 2)))
    n_est = np.random.choice([100, 150, 200, 250, 300])
    score = objective(lr, depth, n_est)
    results.append((lr, depth, n_est, score))
    print(f"  Trial {i+11:>2d}: lr={lr:.4f}, depth={depth}, "
          f"n_est={n_est:>3d}, score={score:.4f}")
results.sort(key=lambda x: -x[3])
best = results[0]
print(f"\nBest overall: lr={best[0]:.4f}, depth={best[1]}, "
f"n_est={best[2]}, score={best[3]:.4f}")
print(f"Total evaluations: {len(results)} "
f"(vs {5*6*4}=120 for exhaustive grid search)")
In practice, libraries like optuna and scikit-optimize implement proper Bayesian optimization with Gaussian process surrogate models and acquisition functions (Expected Improvement, Upper Confidence Bound). They typically find better hyperparameters in fewer evaluations than grid search or random search -- which matters enormously when each evaluation (training + cross-validation) takes minutes or hours.
The intuition: Bayesian optimization is like a smart scientist who designs each experiment based on what previous experiments revealed, rather than a brute-force search that ignores all prior results. After 5 trials, the surrogate model has a rough estimate of which parameter regions are promising, and it focuses subsequent trials there while occasionally exploring uncertain regions to make sure it hasn't missed a better area.
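To make that surrogate-model loop concrete, here is a minimal sketch of Gaussian-process-based optimization with an Expected Improvement acquisition function -- on a toy 1-D objective rather than the gradient boosting problem above, so you can see the mechanism in isolation:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    """Toy 1-D objective to maximize (stand-in for validation accuracy)."""
    return -(x - 2.0) ** 2 + 1.0

rng = np.random.default_rng(0)
candidates = np.linspace(0, 5, 201).reshape(-1, 1)

# Start with 3 random evaluations
X_seen = rng.uniform(0, 5, 3).reshape(-1, 1)
y_seen = f(X_seen).ravel()

for _ in range(10):
    # Fit the surrogate model to everything observed so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-6, normalize_y=True)
    gp.fit(X_seen, y_seen)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Expected Improvement: where is improvement over the best point
    # both likely (cdf term) and large (pdf term)?
    best = y_seen.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X_seen = np.vstack([X_seen, x_next.reshape(1, -1)])
    y_seen = np.append(y_seen, f(x_next)[0])

best_x = X_seen[np.argmax(y_seen), 0]
print(f"Best x found: {best_x:.2f} (true optimum at 2.00)")
```

With only 13 total evaluations, the loop homes in on the optimum -- the GP's uncertainty estimates tell it where looking is worthwhile, which is exactly the exploration/exploitation balance described above.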
Bayesian prediction: uncertainty that matters
Let's build a complete example showing where Bayesian uncertainty quantification genuinely helps. Consider a regression problem where you need to predict house prices, but some regions of the feature space have very little training data. A standard model gives you a point prediction everywhere with equal confidence. A Bayesian model tells you "I'm confident here (lots of data) but uncertain there (sparse data)":
from sklearn.linear_model import BayesianRidge

# Simulated house price data with a gap in the middle
np.random.seed(42)

# Houses in two size clusters, with a gap from 200-350 sqm
sqm_low = np.random.uniform(50, 200, 80)
sqm_high = np.random.uniform(350, 500, 40)
X_houses = np.concatenate([sqm_low, sqm_high]).reshape(-1, 1)

# True relationship: price = 2000 * sqm + noise
true_slope = 2000
y_houses = true_slope * X_houses.ravel() + np.random.randn(120) * 30000

# Bayesian Ridge regression
brr = BayesianRidge()
brr.fit(X_houses, y_houses)

# Predict across the full range, including the gap
X_pred = np.linspace(30, 550, 100).reshape(-1, 1)
y_pred, y_std = brr.predict(X_pred, return_std=True)

# Show predictions with uncertainty
print("Bayesian Ridge Regression -- uncertainty matters!\n")
print(f"{'Area (sqm)':>12s} {'Predicted':>12s} "
      f"{'Std Dev':>10s} {'Data density':>14s}")
print("-" * 52)
for sqm_val in [75, 150, 275, 400, 500]:
    idx = np.argmin(np.abs(X_pred.ravel() - sqm_val))
    # Count nearby training points
    nearby = np.sum(np.abs(X_houses.ravel() - sqm_val) < 50)
    density = "Dense" if nearby > 10 else ("Sparse" if nearby > 0
                                           else "NO DATA")
    print(f"{sqm_val:>12d} {y_pred[idx]:>12,.0f} "
          f"{y_std[idx]:>10,.0f} {density:>14s}")
print("\n--> Notice how uncertainty INCREASES in the data gap!")
print("    At 275 sqm (no training data), the model is "
      "honest about its ignorance")
This is enormously useful for decision-making. If someone asks "should I invest in a 275 sqm house?", a standard regression says "it's worth $550K" with false precision. The Bayesian model says "it's probably worth between $450K and $650K -- but I'm really not sure because I haven't seen houses in that size range." That uncertainty communication can prevent bad decisions, and it's something that point estimates fundamentally cannot provide.
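Because BayesianRidge's predictive distribution is Gaussian, the mean and standard deviation it returns convert directly into an approximate 95% credible interval. A minimal sketch on freshly simulated data (the numbers here are illustrative):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(50, 500, 60).reshape(-1, 1)
y = 2000 * X.ravel() + rng.normal(0, 30000, 60)

model = BayesianRidge().fit(X, y)

# Predictive distribution at one query point is Gaussian(mean, std^2)
mean, std = model.predict(np.array([[275.0]]), return_std=True)

# ~95% credible interval from the Gaussian quantiles
low, high = mean[0] - 1.96 * std[0], mean[0] + 1.96 * std[0]
print(f"Point estimate: {mean[0]:,.0f}")
print(f"~95% interval:  [{low:,.0f}, {high:,.0f}]")
```

The interval is what you would report to the decision-maker instead of the bare point estimate.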
When Bayesian thinking helps most (and when it doesn't)
Bayesian methods shine in specific situations:
Small data. When you have limited training examples, the prior genuinely helps. It encodes your domain knowledge ("learning rates are usually between 0.001 and 0.1", "house prices are positive") and prevents the model from fitting noise. With 10 data points and 50 features, a Bayesian model with informative priors dramatically outperforms a frequentist model that treats all parameter values as equally plausible.
Uncertainty matters. If your prediction will inform a high-stakes decision (medical diagnosis, financial investment, safety-critical system), knowing the confidence is as important as the prediction itself. A Bayesian model says "I'm 70% confident this is benign, 30% it could be malignant" -- giving the doctor crucial information that a point prediction ("benign") hides entirely.
Sequential decision-making. Bayesian updating is natural for problems where data arrives over time: online learning, A/B testing, clinical trials, reinforcement learning. You update your beliefs as new evidence arrives, rather than retraining from scratch each time.
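As a tiny illustration of sequential updating, here is a Beta-Binomial conjugate update for an A/B-test conversion rate (the batch numbers are made up):

```python
from scipy.stats import beta

# Beta(1, 1) is a uniform prior over the conversion rate
a, b = 1, 1

# Evidence arrives in batches; yesterday's posterior is today's prior
for conversions, visitors in [(12, 100), (18, 100), (9, 100)]:
    a += conversions
    b += visitors - conversions

# After 300 visitors and 39 conversions the posterior is Beta(40, 262)
print(f"Posterior mean: {a / (a + b):.3f}")
print(f"95% credible interval: {beta.ppf([0.025, 0.975], a, b).round(3)}")
```

No retraining from scratch: each batch just increments two counts, and you can compute a credible interval at any point to decide whether the test can stop early.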
Model comparison. Bayesian model comparison (via Bayes factors or marginal likelihood) provides a principled way to choose between models of different complexity, automatically penalizing overly complex models. This connects to the AIC/BIC criteria we used for ARIMA order selection in episode #29 -- BIC is, in fact, an approximation of the Bayesian marginal likelihood.
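To see the complexity penalty at work, here is a hypothetical sketch comparing a linear and a degree-5 polynomial fit on truly linear data, using a plain Gaussian-residual BIC (the `bic` helper is ours, not a library function):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)
y = 2.0 * x + rng.normal(0, 0.1, 200)  # the true relationship is linear

def bic(y, y_hat, k):
    """BIC for Gaussian residuals: n*ln(RSS/n) + k*ln(n) (up to a constant)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

scores = {}
for degree in [1, 5]:
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    scores[degree] = bic(y, y_hat, degree + 1)
    print(f"degree {degree}: BIC = {scores[degree]:.1f}")
# The degree-5 fit has a slightly lower RSS, but the k*ln(n) penalty
# typically outweighs that small gain, so the simpler (true) model wins.
```

This is the same logic ARIMA order selection used in episode #29: fit quality minus a complexity penalty that grows with the number of parameters.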
The limits are real, though. Full Bayesian inference is computationally expensive. Computing the exact posterior requires integrating over all possible parameter values, which is intractable for all but the simplest models. For a neural network with millions of parameters, fully Bayesian treatment is impractical -- you'd need to maintain a distribution over millions of weights instead of just point estimates. Approximation methods exist (MCMC sampling, variational inference), but they're slower and more complex than standard optimization.
In practice, most ML practitioners use "Bayesian-inspired" approaches: Bayesian optimization for hyperparameters, Naive Bayes for fast text classification, Bayesian priors as a conceptual model for regularization, and prediction intervals from Bayesian regression when uncertainty quantification matters. Full Bayesian inference is reserved for small models where getting the uncertainty exactly right is critical -- clinical trials, A/B testing, and certain financial models where the cost of overconfidence is very high.
# Summary: when to go Bayesian
scenarios = [
    ("Small dataset (n < 100)", "Bayesian",
     "Prior regularizes; prevents overfitting"),
    ("Large dataset (n > 10000)", "Frequentist (usually)",
     "Prior washes out; Bayesian adds compute cost"),
    ("Uncertainty critical", "Bayesian",
     "Credible intervals > point estimates"),
    ("Speed critical", "Frequentist / Naive Bayes",
     "Point estimates are cheaper to compute"),
    ("Hyperparameter tuning", "Bayesian optimization",
     "Smarter search than grid/random"),
    ("Text classification", "Naive Bayes",
     "Fast, few parameters, handles sparse data"),
    ("Deep learning (millions params)", "Frequentist + dropout",
     "Full Bayesian intractable at scale"),
    ("A/B testing", "Bayesian",
     "Sequential updating, stop early with confidence"),
]
print(f"{'Scenario':>32s} {'Approach':>22s}")
print(f"{'':>32s} {'Reason':>22s}")
print("=" * 58)
for scenario, approach, reason in scenarios:
    print(f"{scenario:>32s} {approach:>22s}")
    print(f"{'':>32s} ({reason})")
    print()
So, what have we learned?
We've taken a philosophical concept -- "probability as degree of belief" -- and turned it into practical ML tools that complement everything we've built so far. Here's the full picture:
- Bayesian reasoning updates beliefs by combining prior knowledge with observed data through Bayes' theorem. The posterior distribution captures both the best estimate AND the uncertainty around it -- something point estimates fundamentally cannot provide;
- Sequential updating means you can incorporate new data one observation at a time, with the posterior from each update becoming the prior for the next. The order doesn't matter, and with enough data, the prior washes out completely;
- Naive Bayes classifier is fast, simple, and effective for text classification despite its independence assumption being obviously wrong. The simplicity prevents overfitting on small datasets -- the same bias-variance tradeoff from episode #13 in disguise;
- Bayesian regularization reveals that Ridge regression (episode #11) IS Bayesian regression with a Gaussian prior. L1/Lasso corresponds to a Laplace prior. Regularization and Bayesian priors are two names for the same mathematical operation;
- Bayesian optimization for hyperparameter tuning (episode #16) builds a surrogate model of the objective function and intelligently chooses where to sample next, balancing exploration and exploitation. More efficient than grid or random search, especially when evaluations are expensive;
- Bayesian prediction with uncertainty quantification is critical when the cost of overconfidence is high -- medical diagnosis, financial decisions, safety-critical systems. The model honestly reports when it doesn't know;
- Full Bayesian inference is computationally expensive and impractical for large-scale models. In practice, most ML uses Bayesian-inspired approximations: Naive Bayes for classification, Bayesian optimization for tuning, dropout as approximate inference, and prediction intervals when uncertainty matters.
The Bayesian perspective ties together quite a few concepts we've seen before into a coherent framework: regularization (episode #11), cross-validation and model selection (episode #13), feature engineering priors (episode #15), and the AIC/BIC criteria from time series (episode #29). It also points forward -- the concept of maintaining and updating distributions over parameters will show up again when we get to ensemble methods, where combining multiple models is conceptually similar to maintaining uncertainty over which model is correct.
Thanks for reading! Until next time ;-)