Learn AI Series (#36) - Mini Project - Complete ML Pipeline

What will I learn

You will learn how to build a complete ML pipeline from raw messy data to scored predictions -- tying together everything from 35 episodes into one coherent system;
simulating a realistic dataset with mixed feature types, missing values, and engineered features;
building a preprocessing pipeline with ColumnTransformer for heterogeneous data;
systematic model comparison across fundamentally different algorithm families;
proper evaluation with cross-validation, multiple metrics, and per-class analysis;
inspecting feature importance to understand what drives predictions;
saving and loading the complete pipeline as one deployable artifact with joblib.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

@scipio/learn-ai-series-1-what-machine-learning-actually-is" target="_blank" rel="noopener noreferrer">Learn AI Series (#1) - What Machine Learning Actually Is
@scipio/learn-ai-series-2-setting-up-your-ai-workbench-python-and-numpy" target="_blank" rel="noopener noreferrer">Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
@scipio/learn-ai-series-3-your-data-is-just-numbers-how-machines-see-the-world" target="_blank" rel="noopener noreferrer">Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
@scipio/learn-ai-series-4-your-first-prediction-no-math-just-intuition" target="_blank" rel="noopener noreferrer">Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
@scipio/learn-ai-series-5-patterns-in-data-what-learning-actually-looks-like" target="_blank" rel="noopener noreferrer">Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
@scipio/learn-ai-series-6-from-intuition-to-math-why-we-need-formulas" target="_blank" rel="noopener noreferrer">Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
@scipio/learn-ai-series-7-the-training-loop-see-it-work-step-by-step" target="_blank" rel="noopener noreferrer">Learn AI Series (#7) - The Training Loop - See It Work Step by Step
@scipio/learn-ai-series-8-the-math-you-actually-need-part-1-linear-algebra" target="_blank" rel="noopener noreferrer">Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
@scipio/learn-ai-series-9-the-math-you-actually-need-part-2-calculus-and-probability" target="_blank" rel="noopener noreferrer">Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
@scipio/learn-ai-series-10-your-first-ml-model-linear-regression-from-scratch" target="_blank" rel="noopener noreferrer">Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
@scipio/learn-ai-series-11-making-linear-regression-real" target="_blank" rel="noopener noreferrer">Learn AI Series (#11) - Making Linear Regression Real
@scipio/learn-ai-series-12-classification-logistic-regression-from-scratch" target="_blank" rel="noopener noreferrer">Learn AI Series (#12) - Classification - Logistic Regression From Scratch
@scipio/learn-ai-series-13-evaluation-how-to-know-if-your-model-actually-works" target="_blank" rel="noopener noreferrer">Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
@scipio/learn-ai-series-14-data-preparation-the-80-nobody-talks-about" target="_blank" rel="noopener noreferrer">Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
@scipio/learn-ai-series-15-feature-engineering-and-selection" target="_blank" rel="noopener noreferrer">Learn AI Series (#15) - Feature Engineering and Selection
@scipio/learn-ai-series-16-scikit-learn-the-standard-library-of-ml" target="_blank" rel="noopener noreferrer">Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
@scipio/learn-ai-series-17-decision-trees-how-machines-make-decisions" target="_blank" rel="noopener noreferrer">Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
@scipio/learn-ai-series-18-random-forests-wisdom-of-crowds" target="_blank" rel="noopener noreferrer">Learn AI Series (#18) - Random Forests - Wisdom of Crowds
@scipio/learn-ai-series-19-gradient-boosting-the-kaggle-champion" target="_blank" rel="noopener noreferrer">Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
@scipio/learn-ai-series-20-support-vector-machines-drawing-the-perfect-boundary" target="_blank" rel="noopener noreferrer">Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
@scipio/learn-ai-series-21-mini-project-predicting-crypto-market-regimes" target="_blank" rel="noopener noreferrer">Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
@scipio/learn-ai-series-22-k-means-clustering-finding-groups" target="_blank" rel="noopener noreferrer">Learn AI Series (#22) - K-Means Clustering - Finding Groups
@scipio/learn-ai-series-23-advanced-clustering-beyond-k-means" target="_blank" rel="noopener noreferrer">Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
@scipio/learn-ai-series-24-dimensionality-reduction-pca" target="_blank" rel="noopener noreferrer">Learn AI Series (#24) - Dimensionality Reduction - PCA
@scipio/learn-ai-series-25-advanced-dimensionality-reduction-t-sne-and-umap" target="_blank" rel="noopener noreferrer">Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
@scipio/learn-ai-series-26-anomaly-detection-finding-what-doesnt-belong" target="_blank" rel="noopener noreferrer">Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
@scipio/learn-ai-series-27-recommendation-systems-users-like-you-also-liked" target="_blank" rel="noopener noreferrer">Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
@scipio/learn-ai-series-28-time-series-fundamentals-when-order-matters" target="_blank" rel="noopener noreferrer">Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
@scipio/learn-ai-series-29-time-series-forecasting-predicting-what-comes-next" target="_blank" rel="noopener noreferrer">Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
@scipio/learn-ai-series-30-natural-language-processing-text-as-data" target="_blank" rel="noopener noreferrer">Learn AI Series (#30) - Natural Language Processing - Text as Data
@scipio/learn-ai-series-31-word-embeddings-meaning-in-numbers" target="_blank" rel="noopener noreferrer">Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
@scipio/learn-ai-series-32-bayesian-methods-thinking-in-probabilities" target="_blank" rel="noopener noreferrer">Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
@scipio/learn-ai-series-33-ensemble-methods-deep-dive-stacking-and-blending" target="_blank" rel="noopener noreferrer">Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
@scipio/learn-ai-series-34-ml-engineering-from-notebook-to-production" target="_blank" rel="noopener noreferrer">Learn AI Series (#34) - ML Engineering - From Notebook to Production
@scipio/learn-ai-series-35-data-ethics-and-bias-in-ml" target="_blank" rel="noopener noreferrer">Learn AI Series (#35) - Data Ethics and Bias in ML
@scipio/learn-ai-series-36-mini-project-complete-ml-pipeline" target="_blank" rel="noopener noreferrer">Learn AI Series (#36) - Mini Project - Complete ML Pipeline (this post)

Learn AI Series (#36) - Mini Project - Complete ML Pipeline

This is it -- the Arc 2 capstone. Thirty-five episodes ago we started with "what is a number, really?" and worked our way through linear regression, classification, data preparation, feature engineering, decision trees, random forests, gradient boosting, SVMs, clustering, dimensionality reduction, anomaly detection, recommendations, time series, NLP, word embeddings, Bayesian methods, ensemble stacking, production engineering, and ethics. Every episode introduced one concept in isolation, built it from scratch, and moved on. Today we bring them all together.

Back in episode #21 we built a focused mini-project -- one dataset, one problem (crypto market regimes), a handful of models compared head to head. That was about getting your hands dirty with the classification workflow. This time we go wider and deeper: a complete ML pipeline that handles messy real-world data with mixed feature types, missing values, proper preprocessing, systematic model comparison across algorithm families, detailed evaluation, feature importance analysis, and artifact saving. The full journey from "I have raw data" to "I have a deployable prediction system."

The task: predict content quality scores. Given metadata about a piece of content -- its length, readability metrics, author history, category, and engagement signals -- predict whether it's high quality or not. This is the kind of problem you'd encounter building a content platform, a recommendation engine, or a curation system for a publishing platform. If you've been following this series, you'll see exactly how a real-world classification problem like this pulls from virtually every episode we've done.

We'll simulate the dataset because the goal here is the pipeline, not the data source. In production you'd swap the simulation for a real data source, but the pipeline stays identical. That's the whole point -- the engineering pattern is reusable regardless of the domain.

Here we go!

Step 1: Generating a realistic messy dataset

Real data is messy. It has numerical features with different scales and distributions, categorical features with variable cardinality, missing values scattered unpredictably, and correlations between features that sometimes help and sometimes confuse your model. Our simulation reflects all of this:

import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 2000

# Numerical features with realistic distributions
word_count = np.random.lognormal(6.5, 0.8, n).astype(int)
avg_sentence_len = 10 + np.random.exponential(5, n)
unique_word_ratio = np.clip(
    np.random.normal(0.45, 0.12, n), 0.1, 0.9
)
author_posts = np.random.poisson(20, n)
author_rep = np.clip(np.random.normal(50, 15, n), 1, 80)
has_images = np.random.binomial(1, 0.4, n)
has_code = np.random.binomial(1, 0.25, n)

# Categorical feature
categories = np.random.choice(
    ['tech', 'science', 'tutorial', 'opinion', 'news'], n
)

# Derived feature (reading time approximation)
reading_time = word_count / 200 + np.random.normal(0, 0.5, n)

# Introduce missing values (5-7% per feature, realistic)
avg_sentence_len[np.random.random(n) < 0.07] = np.nan
author_rep[np.random.random(n) < 0.05] = np.nan

print(f"Dataset: {n} samples")
print(f"Missing avg_sentence_len: "
      f"{np.isnan(avg_sentence_len).sum()} "
      f"({np.isnan(avg_sentence_len).mean():.1%})")
print(f"Missing author_rep: "
      f"{np.isnan(author_rep).sum()} "
      f"({np.isnan(author_rep).mean():.1%})")

Notice the deliberate distribution choices. Log-normal for word counts -- because most posts are short, a few are very long, and nobody writes negative words (episode #14 on data preparation taught us that understanding your data's shape determines your preprocessing strategy). Poisson for author post counts -- discrete, non-negative, typically small but with occasional prolific outliers. Missing values at realistic percentages (5-7%), not the 50% chaos you sometimes see in tutorials that makes the problem harder than it needs to be.

Why these specific features? Because they mirror what you'd actually collect for a content quality system. Word count and sentence length capture writing effort. Unique word ratio measures vocabulary richness (a sign of quality writing vs. copy-paste repetition). Author history features (post count, reputation) give you signal about whether this author tends to produce good content. Binary flags (images, code) indicate effort investment. Category is the only categorical feature here, and it matters because "tutorial" content is structured differently from "opinion" pieces -- a model that learns this distinction is capturing something real.

Step 2: Feature engineering and target creation

Raw features are a starting point. As we covered extensively in episode #15, engineered features often carry more signal than raw inputs because they encode domain knowledge into a form the model can use directly:

# Engineered features
log_wc = np.log1p(word_count)
density = unique_word_ratio * log_wc
experience = (np.log1p(author_posts) *
              np.nan_to_num(author_rep, nan=50) / 50)

# Create binary target: high quality or not
# Weighted combination of signals + noise
quality_score = (
    0.30 * (log_wc - 5) / 3 +
    0.20 * unique_word_ratio +
    0.15 * np.nan_to_num(author_rep, nan=50) / 80 +
    0.10 * has_images +
    0.10 * has_code +
    0.15 * (categories == 'tutorial').astype(float) +
    np.random.normal(0, 0.15, n)
)
y = (quality_score > np.median(quality_score)).astype(int)

print(f"Target distribution: {y.mean():.1%} positive")
print(f"Engineered features: log_wc, density, experience")
print(f"Total feature count: 13 "
      f"(10 numeric + 1 categorical + 2 engineered)")

The log1p transform for word count compresses the long tail -- remember from episode #14, many ML algorithms (especially linear models and SVMs) perform better when features are on similar scales, and log-transforming skewed distributions helps enormously. The density feature multiplies vocabulary uniqueness by log word count, capturing the idea that a long AND lexically diverse post is a stronger quality signal than either alone. experience combines author post count with reputation -- a prolific author with high reputation is different from a prolific author with low reputation.

The target is binary -- high quality or not -- based on a weighted combination of feature signals plus noise. That added noise (np.random.normal(0, 0.15, n)) is critical: it ensures no model achieves perfect accuracy, which is realistic. In the real world, quality is partly subjective, partly random, and no feature set captures it completely. Your model has to work with imperfect information, just like every real ML system does.

Step 3: Building the preprocessing pipeline

This is where scikit-learn's pipeline and ColumnTransformer machinery (episode #16) really shines. Different feature types need fundamentally different preprocessing: numerical features get imputed and scaled, categorical features get encoded. Doing this in a pipeline means the entire preprocessing logic travels with the model -- no chance of training-serving skew (the silent killer from episode #34):

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Assemble feature matrix
X_num = np.column_stack([
    log_wc, avg_sentence_len, unique_word_ratio,
    author_posts, author_rep, has_images, has_code,
    density, experience, reading_time
])

num_features = list(range(10))
cat_features = [10]

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), num_features),
    ('cat', OneHotEncoder(
        drop='first', sparse_output=False
    ), cat_features)
])

# Combine numerical and categorical into one matrix
X = np.column_stack([X_num, categories.reshape(-1, 1)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {len(X_train)} samples")
print(f"Test:  {len(X_test)} samples")
print(f"Preprocessing: median imputation + standard scaling "
      f"(numeric), one-hot encoding (categorical)")

The ColumnTransformer applies different transformation pipelines to different columns in a single step. Numerical columns get median imputation (robust to outliers -- mean imputation would be pulled by extreme values in our log-normal word count distribution) followed by standard scaling (zero mean, unit variance). The categorical column gets one-hot encoded with one category dropped to avoid multicollinearity (which matters for logistic regression and SVMs but is harmless for tree-based models).

The stratify=y parameter in train_test_split ensures both the training and test sets have the same class distribution -- a detail from episode #14 that becomes critical when your target isn't perfectly 50/50 balanced.

The entire preprocessor is a single scikit-learn object that can be fitted once on training data and applied identically to any new data. Fit it, forget about the details, and it handles imputation medians, scaler means and standard deviations, and encoder categories all in one place. This is the consistency guarantee from episode #34 -- one artifact, zero mismatch risk.

Step 4: Systematic model comparison

We compare across algorithm families -- not just hyperparameter variants within one family. This is the central lesson from episodes #18, #19, #20, and #33: different algorithms have fundamentally different inductive biases, and diversity of approach catches what any single method misses:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier,
                               GradientBoostingClassifier)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

models = {
    'LogReg': LogisticRegression(max_iter=1000),
    'RF': RandomForestClassifier(
        n_estimators=200, random_state=42),
    'GBM': GradientBoostingClassifier(
        n_estimators=200, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=7),
}

results = {}
print(f"{'Model':>8s}  {'AUC-ROC':>12s}  {'Accuracy':>12s}")
print("-" * 36)

for name, model in models.items():
    pipe = Pipeline([
        ('prep', preprocessor),
        ('model', model)
    ])
    auc_scores = cross_val_score(
        pipe, X_train, y_train, cv=5, scoring='roc_auc'
    )
    acc_scores = cross_val_score(
        pipe, X_train, y_train, cv=5, scoring='accuracy'
    )
    results[name] = {
        'auc': auc_scores,
        'acc': acc_scores
    }
    print(f"{name:>8s}  "
          f"{auc_scores.mean():.3f} +/- {auc_scores.std():.3f}  "
          f"{acc_scores.mean():.3f} +/- {acc_scores.std():.3f}")

best_name = max(results, key=lambda k: results[k]['auc'].mean())
print(f"\nBest model by AUC: {best_name}")

We use AUC-ROC rather than plain accuracy because, as we discussed at lenght in episode #13, AUC gives you a threshold-independent measure of how well the model separates classes. A model with 0.85 AUC is genuinely better at distinguishing high-quality from low-quality content than one with 0.82, regardless of what classification threshold you eventually pick. Accuracy, by contrast, depends on the threshold and can be misleading when class proportions are unbalanced.

Cross-validation (5-fold here) gives us both the mean performance and the variance -- a model with AUC 0.85 +/- 0.01 is much more trustworthy than one with AUC 0.86 +/- 0.05. The first is consistently good; the second might be great on some data splits and terrible on others. Stability matters as much as peak performance, especially if you're deploying this thing ;-)

Notice that every model is wrapped in a Pipeline with the preprocessor. This means preprocessing is fitted inside each cross-validation fold -- no data leakage, no cheating. The scaler learns its means from training folds only, never peeking at validation data. This is exactly the leakage prevention we discussed in episode #33 on stacking.

Step 5: Detailed evaluation on the held-out test set

The best model from cross-validation deserves a thorough examination on data it has never seen. Multiple metrics, per-class breakdown, the full picture:

from sklearn.metrics import (classification_report,
                              roc_auc_score,
                              confusion_matrix)

best_pipe = Pipeline([
    ('prep', preprocessor),
    ('model', models[best_name])
])
best_pipe.fit(X_train, y_train)

y_pred = best_pipe.predict(X_test)
y_proba = best_pipe.predict_proba(X_test)[:, 1]

print(classification_report(
    y_test, y_pred,
    target_names=['Low Quality', 'High Quality']
))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"\nConfusion matrix:")
print(f"  True Neg:  {tn:>4d}  |  False Pos: {fp:>4d}")
print(f"  False Neg: {fn:>4d}  |  True Pos:  {tp:>4d}")

print(f"\nError analysis:")
print(f"  False positive rate: {fp / (fp + tn):.1%} "
      f"(promoted low-quality content)")
print(f"  False negative rate: {fn / (fn + tp):.1%} "
      f"(missed high-quality content)")

The classification report shows precision, recall, and F1 for each class -- and these numbers tell different stories depending on your application. For a content quality scorer, false positives (promoting low-quality content) and false negatives (burying high-quality content) have different costs. If you're building a front-page curator, false positives are worse -- showing bad content damages user trust. If you're building a discovery tool to find hidden gems, false negatives are worse -- missing good content defeats the purpose.

This connects directly to episode #35 on ethics and bias: who suffers from your model's errors? If your model disproportionately flags "opinion" pieces as low quality while favoring "tutorial" pieces, you've built a system that's biased toward one content type. The per-class metrics and the confusion matrix make these patterns visible. A single accuracy number hides them.

Step 6: Feature importance -- understanding the WHY

Understanding why the model makes its predictions is just as important as the predictions themselves. Episode #35 drove this home: if a model relies on features that are proxies for things you don't want to encode, you catch it here. For tree-based models, we can inspect feature importances directly:

if hasattr(best_pipe.named_steps['model'],
           'feature_importances_'):
    importances = best_pipe.named_steps['model'] \
        .feature_importances_

    # Build feature name list matching preprocessor output
    feat_names = [
        'log_wc', 'sent_len', 'uniq_ratio',
        'author_posts', 'author_rep', 'has_img',
        'has_code', 'density', 'experience', 'read_time'
    ]
    # OneHotEncoder drops first category alphabetically
    # Categories sorted: news, opinion, science, tech, tutorial
    # Dropped: news (first alphabetically with drop='first')
    cat_names = ['cat_opinion', 'cat_science',
                 'cat_tech', 'cat_tutorial']
    all_features = feat_names + cat_names

    sorted_idx = np.argsort(importances)[::-1]
    print("Feature importance (top 10):\n")
    for rank, i in enumerate(sorted_idx[:10], 1):
        bar = "#" * int(importances[i] * 100)
        print(f"  {rank:>2d}. {all_features[i]:>16s}: "
              f"{importances[i]:.3f}  {bar}")

    # Bottom features -- candidates for removal
    print(f"\nBottom 3 (candidates for removal):")
    for i in sorted_idx[-3:]:
        print(f"      {all_features[i]:>16s}: "
              f"{importances[i]:.3f}")

If the model relies heavily on cat_tutorial -- meaning it just learned "tutorials get high scores" -- that might be fine (if tutorials genuinely are higher quality in your system) or it might be a problem (if you want the model to evaluate quality independent of content type). Feature importance gives you the diagnostic to make that judgment call. This is where the evaluation mindset from episode #13 meets the ethics perspective from episode #35 -- you're not just checking IF the model works, but HOW it works.

Features that contribute near zero importance are candidates for removal. Simpler models are easier to maintain, faster to run, and (as we learned from the bias-variance tradeoff discussions) sometimes more robust. If cat_opinion adds nothing, drop it -- one less feature for the preprocessor to manage, one less potential source of bugs in production.

Step 7: Threshold tuning for your specific use case

The default classification threshold (0.5) assumes that false positives and false negatives are equally costly. They almost never are. Tuning the threshold lets you trade off between precision and recall based on what actually matters for your application:

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(
    y_test, y_proba
)

# Find threshold for specific precision target
target_precision = 0.85
valid = precisions[:-1] >= target_precision
if valid.any():
    # Among thresholds achieving target precision,
    # pick the one with best recall
    best_idx = np.where(valid)[0][np.argmax(recalls[:-1][valid])]
    optimal_thresh = thresholds[best_idx]
    print(f"Target precision: >= {target_precision:.0%}")
    print(f"Optimal threshold: {optimal_thresh:.3f}")
    print(f"Achieved precision: {precisions[best_idx]:.3f}")
    print(f"Achieved recall: {recalls[best_idx]:.3f}")

    # Apply custom threshold
    y_custom = (y_proba >= optimal_thresh).astype(int)
    print(f"\nWith custom threshold:")
    print(classification_report(
        y_test, y_custom,
        target_names=['Low Quality', 'High Quality']
    ))
else:
    print(f"Cannot achieve {target_precision:.0%} precision "
          f"with this model")

This is the precision-recall tradeoff from episode #13 made concrete. If you set a high threshold (say 0.8), the model only flags content as "high quality" when it's very confident -- fewer false positives, but you miss some genuinely good content. If you set a low threshold (say 0.3), you catch most good content but promote a lot of mediocre stuff too. There's no universally correct answer -- it depends entirely on the downstream impact of each error type.

Step 8: Save the complete pipeline

The final pipeline is one self-contained object. Save it, ship it, and anyone can load it and make predictions without knowing a single detail about the preprocessing:

import joblib
import os

joblib.dump(best_pipe, 'content_quality_pipeline.joblib')
file_size = os.path.getsize(
    'content_quality_pipeline.joblib'
) / 1024

# Verify: identical predictions after reload
loaded = joblib.load('content_quality_pipeline.joblib')
loaded_preds = loaded.predict(X_test[:10])
original_preds = best_pipe.predict(X_test[:10])
match = np.array_equal(loaded_preds, original_preds)

print(f"Pipeline saved: content_quality_pipeline.joblib")
print(f"File size: {file_size:.1f} KB")
print(f"Model type: "
      f"{best_pipe.named_steps['model'].__class__.__name__}")
print(f"Predictions match after reload: {match}")

# What's inside the artifact
print(f"\nArtifact contents:")
print(f"  - Imputer (learned medians for 10 numeric features)")
print(f"  - Scaler (learned means + stds for 10 features)")
print(f"  - Encoder (learned categories for 1 categorical)")
print(f"  - Model ({best_name} with trained weights)")
print(f"\nOne file. Zero chance of preprocessing mismatch.")

This artifact -- a single .joblib file -- contains the imputer (with learned median values for each numerical feature), the scaler (with learned means and standard deviations), the encoder (with learned category mappings), and the trained model (with all its learned parameters). As we disscussed in episode #34, this is the production pattern: save the entire pipeline, not just the model. Deploy it with FastAPI, schedule batch predictions with cron, share it with a colleague. The pipeline guarantees identical preprocessing everywhere it's used.

Step 9: Simulating production inference

A pipeline that only works on your training data isn't a pipeline -- it's a prototype. Let's simulate what production inference actually looks like, including the messy reality of incomplete and unexpected inputs:

def predict_content_quality(pipeline, raw_input):
    """Score a single piece of content.
    Handles the real-world messiness of production data."""
    try:
        features = np.array([[
            np.log1p(max(0, raw_input.get('word_count', 0))),
            raw_input.get('avg_sentence_len', np.nan),
            raw_input.get('unique_word_ratio', 0.3),
            raw_input.get('author_posts', 0),
            raw_input.get('author_rep', np.nan),
            raw_input.get('has_images', 0),
            raw_input.get('has_code', 0),
            (raw_input.get('unique_word_ratio', 0.3) *
             np.log1p(max(0, raw_input.get('word_count', 0)))),
            (np.log1p(raw_input.get('author_posts', 0)) *
             raw_input.get('author_rep', 50) / 50),
            raw_input.get('word_count', 0) / 200,
            raw_input.get('category', 'news'),
        ]], dtype=object)

        pred = pipeline.predict(features)[0]
        prob = pipeline.predict_proba(features)[0]
        return {
            'prediction': 'high' if pred == 1 else 'low',
            'confidence': float(max(prob)),
            'probability': float(prob[1]),
        }
    except Exception as e:
        return {'prediction': 'error', 'error': str(e)}


# Test with realistic inputs
test_articles = [
    {
        'word_count': 3500,
        'avg_sentence_len': 18.5,
        'unique_word_ratio': 0.62,
        'author_posts': 45,
        'author_rep': 65,
        'has_images': 1,
        'has_code': 1,
        'category': 'tutorial',
    },
    {
        'word_count': 150,
        'unique_word_ratio': 0.25,
        'author_posts': 2,
        'has_images': 0,
        'has_code': 0,
        'category': 'opinion',
        # Note: missing avg_sentence_len and author_rep
    },
    {
        'word_count': 1200,
        'avg_sentence_len': 22.0,
        'unique_word_ratio': 0.48,
        'author_posts': 100,
        'author_rep': 72,
        'has_images': 1,
        'has_code': 0,
        'category': 'science',
    },
]

print("Production inference simulation:\n")
for i, article in enumerate(test_articles):
    result = predict_content_quality(loaded, article)
    print(f"Article {i+1} ({article['category']}, "
          f"{article['word_count']} words):")
    print(f"  Quality: {result['prediction']} "
          f"(probability: {result.get('probability', 'N/A'):.2f})")
    print()

Notice the second test article -- it's missing avg_sentence_len and author_rep. In production, this happens ALL the time. Maybe the readability analyzer timed out, maybe the author is brand new and doesn't have a reputation score yet. Because our pipeline includes SimpleImputer with strategy='median', those missing values get filled with the median learned from training data, and the prediction proceeds normally. No crashes, no special-case code, no "if feature is missing then..." branches. The pipeline handles it.

This is the power of investing upfront in proper data preparation (episode #14) and pipeline design (episode #16). You pay the complexity cost once, during training setup, and every subsequent prediction -- whether it's the first or the millionth -- just works.

Reflecting on Arc 2

Thirty-six episodes in. We started with "what is a number to a machine" and built up to a complete ML pipeline that handles every stage of the prediction journey. Let me map the episodes to the pipeline steps you just built, because this is the real payoff of following the series from the start:

Data understanding (episodes #3, #5): knowing that log-normal distributions exist, that features have different shapes, that patterns hide in correlations between variables -- this is why we chose specific distributions for our simulation and specific engineered features.

Preprocessing (episodes #14, #15, #16): imputation, scaling, encoding, feature engineering, and the ColumnTransformer pattern. Eighty percent of the real work in any ML project, and we built it to be reusable and leak-free.

Model selection (episodes #10-12, #17-20): linear models, trees, SVMs, ensembles -- we compared across families because diversity catches what uniformity misses (episode #33 made this explicit with stacking).

Evaluation (episode #13): AUC-ROC, precision-recall tradeoffs, per-class metrics, confusion matrices. A model with great aggregate accuracy can still be terrible for specific subgroups -- we measure what matters, not just what's easy.

Ethics (episode #35): feature importance isn't just a diagnostic tool -- it's an audit. If the model learned "tutorial = high quality" as its strongest signal, that's a bias you need to decide whether to keep or remove.

Production (episode #34): saving the complete pipeline as one artifact, validating inputs, handling missing data gracefully. The code you wrote today could serve real predictions with a FastAPI wrapper and nothing else.

The models you've built across Arc 2 are genuinely powerful -- gradient boosting on tabular data solves an enormous range of real-world problems, and many production ML systems use exactly what you now know how to build. But they all share a fundamental limitation: they operate on hand-crafted features. You, the engineer, decide what features to compute. The model learns the mapping from your features to predictions, but the feature design is entirely manual.

For images, audio, raw text, and other unstructured data, hand-crafting features is impractical or flat-out impossible. How do you manually define the features that distinguish a cat from a dog in a photograph? What features capture the emotion in a voice recording? You can try (and people did for decades -- SIFT features for images, MFCCs for audio, bag-of-words for text), but the results plateau far below what's possible.

Arc 3 changes the game. Neural networks learn their own features directly from raw data. No manual feature engineering. No domain expert deciding which transformations to apply. The network discovers what matters -- often finding representations that no human would have thought to create. We'll build them from scratch, starting with a single artificial neuron, and work our way up to the architectures that power modern AI systems. Everything you've learned in Arc 2 -- evaluation, regularization, ethics, production engineering -- carries forward. Those skills don't become obsolete with neural networks. They become even MORE important.

Zo, wat hebben we geleerd?

Here's what this pipeline project ties together:

A complete ML pipeline covers the full journey: raw data with missing values and mixed types, preprocessing, feature engineering, model selection, evaluation, interpretation, and artifact saving. Every step connects to specific episodes in this series;
Use ColumnTransformer to apply different preprocessing to different feature types in a single step. Numerical features get imputed and scaled, categorical features get encoded. The pipeline object carries all learned parameters;
Compare across fundamentally different algorithm families (linear, tree-based, SVM, distance-based) -- not just hyperparameter variants. Diversity catches what any single approach misses (episode #33);
Evaluate with multiple metrics (AUC, precision, recall, F1) because accuracy alone hides important tradeoffs. The confusion matrix shows you exactly where the model fails and for whom;
Inspect feature importance to understand model behavior AND audit for unwanted biases. If a feature contributes nothing, remove it. If a feature encodes something problematic, address it;
Tune the classification threshold for your specific application -- the default 0.5 assumes equal error costs, which is almost never true in practice;
Save the entire pipeline as one joblib artifact -- imputer medians, scaler parameters, encoder mappings, and model weights bundled together. One file, zero preprocessing mismatch;
This pipeline pattern is the foundation: swap the data source and the problem changes, but the engineering stays the same. From content quality to fraud detection to medical triage -- the structure you built today adapts to any tabular classification problem.

Bedankt voor het lezen! See you in Arc 3 ;-)

@scipio

Learn AI Series (#36) - Mini Project - Complete ML Pipeline

@scipio

Learn AI Series (#36) - Mini Project - Complete ML Pipeline

What will I learn

Requirements

Difficulty

Curriculum (of the Learn AI Series):

Learn AI Series (#36) - Mini Project - Complete ML Pipeline

Step 1: Generating a realistic messy dataset

Step 2: Feature engineering and target creation

Step 3: Building the preprocessing pipeline

Step 4: Systematic model comparison

Step 5: Detailed evaluation on the held-out test set

Step 6: Feature importance -- understanding the WHY

Step 7: Threshold tuning for your specific use case

Step 8: Save the complete pipeline

Step 9: Simulating production inference

Reflecting on Arc 2

Zo, wat hebben we geleerd?

Bedankt voor het lezen! See you in Arc 3 ;-)

Discussion

Curriculum (of the `Learn AI Series`):