📖 Introduction

Model training is an iterative process of optimization—adjusting model parameters to minimize error on your training data while ensuring the model generalizes to new examples. The patterns in this page focus on how to execute this optimization effectively.

These patterns address core training mechanics: choosing the right optimization algorithm, determining when to stop training, managing learning rates, and distributing computation across multiple machines. Each decision impacts training speed, final model quality, and resource efficiency.

🎪 Useful overfitting

Useful Overfitting intentionally trains models to overfit the training dataset by removing generalization mechanisms like regularization, dropout, and validation-based early stopping.

🎯 When to Use

Useful Overfitting applies when two conditions are met:

  1. No noise in labels: All training instances have accurate labels
  2. Complete dataset: You have all possible examples (the full input space can be enumerated)

In these cases, overfitting becomes interpolation—there's no unseen data to generalize to, so overfitting isn't a concern.
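The interpolation framing can be sketched in a few lines of NumPy (a toy illustration, not tied to any particular library API): with a complete, noise-free set of points, a model with enough capacity fits every example exactly, and that exact fit is the desired behavior.

```python
import numpy as np

# Complete, noise-free dataset: every input the model will ever see.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.sin(xs)  # illustrative noise-free labels

# Deliberate "overfitting": a degree-4 polynomial through 5 points
# interpolates them exactly -- zero training error is the goal here.
coeffs = np.polyfit(xs, ys, deg=len(xs) - 1)
preds = np.polyval(coeffs, xs)
max_error = float(np.max(np.abs(preds - ys)))
```

Because there is no unseen data, "memorizing" the five points is indistinguishable from solving the task perfectly.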

🔧 Primary Applications

Approximating numerical systems: Use ML models to approximate solutions to partial differential equations (PDEs) or complex dynamical systems in climate science, computational biology, or finance. The trained model then serves as a fast, lookup-table-like substitute when classical numerical methods are too slow.
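A minimal sketch of this idea, where the hypothetical `expensive_solver` stands in for a slow numerical method: the surrogate is built once from the fully enumerated input grid and then queried cheaply. (Here the "model" is a plain interpolation table; a network overfit to the grid plays the same role at scale.)

```python
import numpy as np

def expensive_solver(t):
    # Hypothetical stand-in for a slow PDE or dynamical-system solve.
    return np.exp(-t) * np.cos(2 * np.pi * t)

# Enumerate the full input grid once, offline.
grid = np.linspace(0.0, 1.0, 1001)
solutions = expensive_solver(grid)

# The surrogate answers queries by interpolating the precomputed grid,
# trading a single expensive enumeration for fast repeated lookups.
def fast_surrogate(t):
    return np.interp(t, grid, solutions)
```

On any query inside the enumerated range, the surrogate reproduces the solver's answer to within interpolation error, at a fraction of the cost.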

Model distillation: Train a smaller model to mimic a larger model's predictions. The smaller model overfits on synthetic data labeled by the larger model, learning its soft outputs rather than ground truth labels.

Debugging and validation: Overfit on a small batch as a sanity check: a properly configured model should drive its loss to (near) zero on that batch. If it can't, there's likely a bug in the model code, input pipeline, or loss function.
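The sanity check looks like this in miniature (a hand-rolled linear model in NumPy stands in for your real model and training loop):

```python
import numpy as np

# One small batch with noise-free, linearly generated labels.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = X @ np.array([1.0, -2.0, 0.5])

# Tiny model + plain gradient descent on mean squared error.
w = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.5 * grad

final_loss = float(np.mean((X @ w - y) ** 2))
# If final_loss is not ~0, suspect the model code, input pipeline, or loss.
```

If a model with enough capacity cannot memorize four examples, no amount of hyperparameter tuning will fix the underlying bug.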

<aside> 💡

Model Distillation: How It Works

  1. Train the teacher model: First, train a large, high-capacity model on your dataset
  2. Generate synthetic data: Use the teacher model to label a large amount of generated or augmented data
  3. Train the student model: Train the smaller model on this synthetic dataset, learning from the teacher's soft outputs (probability distributions) rather than hard labels

Why Soft Outputs Matter

The teacher model's probability distribution contains richer information than hard labels. For example, a teacher might output [0.85, 0.10, 0.05] for three classes, revealing that the input resembles the second class more than the third. A hard label would just say "class 1," losing this nuance.

</aside>
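A minimal NumPy sketch of step 3, matching a teacher's soft outputs for a single input: gradient descent on the student's logits under cross-entropy (whose gradient with respect to the logits is `softmax(z) - target`) recovers the teacher's full distribution, not just its argmax.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

teacher = np.array([0.85, 0.10, 0.05])  # soft targets from the teacher
z = np.zeros(3)                         # student logits for one input

# Cross-entropy gradient w.r.t. the logits is softmax(z) - target.
for _ in range(5000):
    z -= 0.5 * (softmax(z) - teacher)

student = softmax(z)
# The student reproduces the teacher's ranking AND relative confidences;
# training on the hard label [1, 0, 0] would discard that nuance.
```

In a real distillation setup the logits come from the student network and the update flows back through its weights, but the loss and the information being transferred are exactly as shown.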

🔑 Key Considerations