Statistics for ML: Evaluation and Inference

Summary

In machine learning, we don’t just want a model that works; we want to know why it works and how much we can trust it. Statistics provides the mathematical framework for this rigor. This article explores the essential statistical tools for ML: hypothesis testing to compare models, confidence intervals to quantify uncertainty, and the bias‑variance tradeoff to diagnose performance. You’ll learn how to go beyond point estimates and start thinking in terms of probability distributions and statistical significance.

Why Statistics Matters in ML

Building a model is only half the battle. Evaluation is where we validate our assumptions and ensure our system generalizes to the real world. Without statistics, we are flying blind:

Quantifying Uncertainty: How sure are we that our 92% accuracy isn’t just luck?
Model Comparison: Is Model A statistically better than Model B, or is the difference just noise?
Root Cause Analysis: Is our model failing because it’s too simple (bias) or too complex (variance)?

In causal inference and finance, statistics is the primary language. We don’t just predict if a stock will go up; we infer the effect of a policy or an interest rate change on that stock with a specific level of confidence.

Core Statistical Concepts with ML Applications

1. Hypothesis Testing: The A/B Test of ML

In ML, we often use hypothesis testing to compare two models.

Null Hypothesis ( $H_0$ ): The two models have the same expected performance; any observed difference is due to chance.
Alternative Hypothesis ( $H_1$ ): The models' expected performance differs (two-sided test), or one specific model is better (one-sided test).

We calculate a p-value: the probability of observing results at least as extreme as ours if the null hypothesis were true. If $p < 0.05$ (a conventional but arbitrary threshold), we typically reject $H_0$ .

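For two independent samples, the (Welch) test statistic is:

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

When both models are scored on the same cross-validation folds, the scores are paired, and the statistic is computed from the per-fold differences $d_i = x_{1,i} - x_{2,i}$ instead:

$t = \frac{\bar{d}}{s_d / \sqrt{n}}$

where $\bar{d}$ is the mean difference, $s_d$ is the sample standard deviation of the differences, and $n$ is the number of folds. The example below uses the paired form.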

import numpy as np
from scipy import stats

# Accuracy scores for two different models over 10 cross-validation folds
model_a_scores = [0.88, 0.89, 0.91, 0.87, 0.88, 0.90, 0.89, 0.88, 0.92, 0.88]
model_b_scores = [0.91, 0.92, 0.93, 0.90, 0.91, 0.92, 0.94, 0.91, 0.93, 0.92]

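# Perform a paired t-test (both models were scored on the same folds)
t_stat, p_val = stats.ttest_rel(model_a_scores, model_b_scores)
mean_diff = np.mean(model_b_scores) - np.mean(model_a_scores)

print(f"t-statistic: {t_stat:.3f}, p-value: {p_val:.4f}")
if p_val < 0.05:
    # The test is two-sided, so the sign of the mean difference gives the direction
    better, worse = ("B", "A") if mean_diff > 0 else ("A", "B")
    print(f"Reject H0: Model {better} is significantly better than Model {worse}.")
else:
    print("Fail to reject H0: No significant difference detected.")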
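
To make the paired statistic less of a black box, the sketch below recomputes it by hand from the per-fold differences and compares it with SciPy's result. It reuses the fold scores from the example above; the variable names (`diffs`, `t_manual`, `p_manual`) are illustrative. Scores from cross-validation folds are not fully independent, so the resulting p-value is best read as approximate.

```python
import numpy as np
from scipy import stats

# Fold scores copied from the example above
model_a_scores = [0.88, 0.89, 0.91, 0.87, 0.88, 0.90, 0.89, 0.88, 0.92, 0.88]
model_b_scores = [0.91, 0.92, 0.93, 0.90, 0.91, 0.92, 0.94, 0.91, 0.93, 0.92]

# Per-fold differences (B minus A), so a positive t favors Model B
diffs = np.array(model_b_scores) - np.array(model_a_scores)
n = diffs.size

# t = mean(d) / (s_d / sqrt(n)), with the sample standard deviation (ddof=1)
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's paired test (same argument order: B, then A)
t_scipy, p_scipy = stats.ttest_rel(model_b_scores, model_a_scores)

print(f"manual: t = {t_manual:.3f}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.6f}")
```

Swapping the argument order flips the sign of the statistic but leaves the two-sided p-value unchanged.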

2. Confidence Intervals: Mapping the Error Bars

A single accuracy score (e.g., 95%) is a point estimate. A confidence interval (CI) gives a range of plausible values for the true population parameter: a 95% CI comes from a procedure that captures the true value in about 95% of repeated samples.

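For a large enough sample, the CI for the mean is: $\bar{x} \pm z \frac{s}{\sqrt{n}}$ , where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $z$ the critical value of the standard normal distribution ( $z \approx 1.96$ for a 95% interval).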
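
As a rough illustration of this formula, the sketch below computes a 95% interval for the mean of ten fold scores (the Model B values from the earlier example). Ten folds is not really a "large enough" sample, so the same interval is also shown with a t critical value, which is an addition here and not part of the formula above. The names `ci_z` and `ci_t` are illustrative.

```python
import numpy as np
from scipy import stats

# Fold scores for Model B, copied from the earlier example
scores = np.array([0.91, 0.92, 0.93, 0.90, 0.91, 0.92, 0.94, 0.91, 0.93, 0.92])

n = scores.size
mean = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(n)  # standard error of the mean: s / sqrt(n)

# Normal approximation: z is the 97.5th percentile of the standard normal
z = stats.norm.ppf(0.975)
ci_z = (mean - z * sem, mean + z * sem)

# Small-sample alternative: t critical value with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_t = (mean - t_crit * sem, mean + t_crit * sem)

print(f"z-based 95% CI: [{ci_z[0]:.4f}, {ci_z[1]:.4f}]")
print(f"t-based 95% CI: [{ci_t[0]:.4f}, {ci_t[1]:.4f}]")
```

The t-based interval is always somewhat wider, reflecting the extra uncertainty in estimating $s$ from few observations.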

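In ML, we often use bootstrapping to calculate non-parametric confidence intervals for complex metrics: resample the data with replacement many times, recompute the metric on each resample, and take percentiles of the resulting values.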

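def bootstrap_ci(data, n_iterations=1000, confidence=0.95):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    data = np.asarray(data)
    # Named `boot_means` so it does not shadow the `scipy.stats` import
    boot_means = []
    for _ in range(n_iterations):
        # Resample with replacement, keeping the original sample size
        sample = np.random.choice(data, size=len(data), replace=True)
        boot_means.append(np.mean(sample))

    lower = np.percentile(boot_means, (1 - confidence) / 2 * 100)
    upper = np.percentile(boot_means, (1 + confidence) / 2 * 100)
    return lower, upper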

ci_low, ci_high = bootstrap_ci(model_b_scores)
print(f"95% Confidence Interval for Model B: [{ci_low:.3f}, {ci_high:.3f}]")
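
The same idea extends to metrics that have no simple closed-form interval. The sketch below bootstraps an F1 score by resampling (label, prediction) pairs together. The labels and predictions are synthetic, and `f1_score_binary` and `bootstrap_metric_ci` are hypothetical helpers written for this illustration, not library functions.

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1 for binary 0/1 labels: 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def bootstrap_metric_ci(y_true, y_pred, metric, n_iterations=2000,
                        confidence=0.95, seed=0):
    """Percentile bootstrap CI for any metric(y_true, y_pred)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = np.empty(n_iterations)
    for i in range(n_iterations):
        # Resample indices so each label stays paired with its prediction
        idx = rng.integers(0, n, size=n)
        values[i] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.percentile(
        values, [(1 - confidence) / 2 * 100, (1 + confidence) / 2 * 100]
    )
    return float(lower), float(upper)

# Synthetic data (assumption): 200 labels, with about 15% of predictions flipped
data_rng = np.random.default_rng(42)
y_true = data_rng.integers(0, 2, size=200)
flipped = data_rng.random(200) < 0.15
y_pred = np.where(flipped, 1 - y_true, y_true)

point = f1_score_binary(y_true, y_pred)
low, high = bootstrap_metric_ci(y_true, y_pred, f1_score_binary)
print(f"F1 = {point:.3f}, 95% bootstrap CI: [{low:.3f}, {high:.3f}]")
```

Resampling indices, not the two arrays separately, is the key design choice: it preserves the pairing between each true label and its prediction.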

3. The Bias-Variance Tradeoff

This is a central idea in ML generalization. For squared-error loss, the expected error on unseen data can be decomposed into:

$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$

Bias: Error from erroneous assumptions (Underfitting). The model is too simple (e.g., Linear Regression on non-linear data).
Variance: Error from sensitivity to small fluctuations in the training set (Overfitting). The model is too complex (e.g., a deep decision tree on a small dataset).

Feature	High Bias	High Variance
Training Error	High	Low
Test Error	High	High
Diagnosis	Underfitting	Overfitting
Solution	More features, more complex model	More data, regularization, simpler model
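
The pattern in the table can be reproduced on toy data. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve and reports training and test error. The data-generating function, noise level, sample sizes, and chosen degrees are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (synthetic, for illustration)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # larger held-out set

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 15):
    # Least-squares polynomial fit; degree 1 is too rigid for a sine curve,
    # degree 15 has almost as many coefficients as there are training points
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree:2d}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

Training error can only fall as the degree grows, so it is the gap between training and test error, not training error alone, that signals high variance.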

4. Rigorous Evaluation: The Confusion Matrix

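For classification, accuracy is often misleading, especially with imbalanced classes: on a dataset with 95% negatives, a model that always predicts "negative" scores 95% accuracy while catching no positives at all. We use the confusion matrix to derive better metrics: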

Precision: $\frac{TP}{TP + FP}$ (Quality: “Of all predicted positives, how many were actually positive?”)
Recall: $\frac{TP}{TP + FN}$ (Quantity: “Of all actual positives, how many did we catch?”)
F1-Score: Harmonic mean of Precision and Recall: $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

from sklearn.metrics import confusion_matrix, classification_report

y_true = [0, 1, 0, 1, 0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 0, 1, 1, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print("\nDetailed Report:")
print(classification_report(y_true, y_pred))
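
To connect the formulas to the report, the counts can also be tallied by hand. This is a minimal sketch in plain Python using the same labels and predictions as the example above, with class 1 treated as the positive class.

```python
# Labels and predictions copied from the example above
y_true = [0, 1, 0, 1, 0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 0, 1, 1, 1, 1, 1]

pairs = list(zip(y_true, y_pred))

# Tally the four cells of the confusion matrix (positive class = 1)
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
```

Here TP = 5, FP = 1, FN = 1, and TN = 3, so precision, recall, and F1 for class 1 all equal 5/6, matching the class-1 row of the report.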

Conclusion

Statistics transforms machine learning from an experimental art into an engineering science. By applying hypothesis testing, constructing confidence intervals, and understanding the bias-variance tradeoff, you ensure that your models are not just high-performing on your laptop, but reliable and robust in production.

In the next part of this series, we will dive into Multivariable Calculus and Optimization to understand exactly how complex neural networks navigate their loss landscapes.