Machine learning models operate on numbers, not the messy reality of raw inputs. Data representation — how you transform inputs into features — often determines whether training is stable, predictive power is unlocked, and results hold up in production.
This guide presents practical patterns for data representation and feature engineering: scaling numbers, encoding categories, learning embeddings, crossing features to capture interactions, handling variable-length arrays, and combining modalities. Along the way, it highlights trade-offs, when each method shines, and how to avoid common pitfalls.
Inputs vs Features
Many ML algorithms are sensitive to the magnitude and range of their input features. Feeding in features with vastly different scales (e.g. age in years alongside income in dollars) can lead to poor performance and slow convergence.
<aside> 💡
Why?
Gradient descent efficiency
The speed and stability of gradient descent depend heavily on the scale of the input features. When features are on a similar scale, gradient descent can converge to the optimal solution more quickly and smoothly. Without scaling, the algorithm may take much longer to find good parameters, or fail to converge at all.
Mitigating exploding and vanishing gradients
Scaling your inputs to a small range like [-1, 1] ensures that the numbers entering the network at the very beginning are neither excessively large nor excessively small, which keeps activations and gradients in a healthy range through the early layers.
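The effect on convergence is easy to see in a minimal hand-rolled sketch (toy single-feature data; the learning rates are illustrative, chosen near each setting's stability limit):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1000.0, 100)  # a feature on a large scale
y = 3.0 * x + 5.0                  # exact linear target

def gd_steps(x, y, lr, tol=1e-3, max_steps=10_000):
    """Gradient descent on MSE for y ~ w*x + b; steps until loss < tol."""
    w = b = 0.0
    for step in range(1, max_steps + 1):
        resid = w * x + b - y
        w -= lr * 2.0 * np.mean(resid * x)
        b -= lr * 2.0 * np.mean(resid)
        if np.mean((w * x + b - y) ** 2) < tol:
            return step
    return max_steps  # did not converge within the budget

# Standardising x makes the loss surface well conditioned: the same fit
# is reachable, but gradient descent gets there in a handful of steps.
x_std = (x - x.mean()) / x.std()
steps_scaled = gd_steps(x_std, y, lr=0.4)

# On the raw feature the learning rate must be tiny to avoid divergence,
# and the bias term then crawls: the run exhausts its step budget.
steps_raw = gd_steps(x, y, lr=1e-6)
```

With the standardised feature the loss surface is nearly spherical, so a large step size is stable; on the raw feature the curvature along the weight and bias directions differs by orders of magnitude, and no single learning rate serves both.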
Improved performance for distance-based algorithms
Methods like k-NN, SVM, or clustering rely on distance calculations. Without scaling, features with larger ranges will dominate the distance computation, making smaller-scale features virtually irrelevant.
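The dominance effect shows up immediately in a toy example: three hypothetical customers described by age and income, with standardisation done by hand in NumPy (the same transformation scikit-learn's StandardScaler applies):

```python
import numpy as np

# Three hypothetical customers: (age in years, income in dollars).
X = np.array([[25.0, 50_000.0],
              [35.0, 52_000.0],
              [26.0, 90_000.0]])

# Unscaled: the income column dominates the Euclidean distance, so
# customer 0 looks far closer to 1 than to 2, regardless of age.
d_raw_01 = np.linalg.norm(X[0] - X[1])
d_raw_02 = np.linalg.norm(X[0] - X[2])

# Standardise each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = np.linalg.norm(X_std[0] - X_std[1])
d_std_02 = np.linalg.norm(X_std[0] - X_std[2])
```

After standardisation both features contribute comparably, and the two distances end up in the same ballpark instead of differing by an order of magnitude.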
More effective regularisation
Regularisation techniques like L1 (Lasso) and L2 (Ridge) regression add a penalty to the model based on the magnitude of the coefficients. If features are on different scales, the penalty is applied unevenly: large-scale features get tiny coefficients that are barely penalised, while small-scale features need large coefficients that the penalty shrinks aggressively.
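A sketch with closed-form ridge regression (hand-rolled in NumPy; the feature names, effect sizes, and penalty strength are all illustrative) shows this uneven penalty in action:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
age = rng.uniform(20, 70, n)              # spans tens
income = rng.uniform(20_000, 120_000, n)  # spans tens of thousands

# Both features have the same effect size in *standardised* units.
age_s = (age - age.mean()) / age.std()
inc_s = (income - income.mean()) / income.std()
y = 2.0 * age_s + 2.0 * inc_s + rng.normal(0, 0.1, n)

def ridge(X, y, lam):
    # Closed-form ridge (no intercept; X is centred):
    # w = (X'X + lam*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X_raw = np.column_stack([age - age.mean(), income - income.mean()])
lam = 20_000.0
w_pen = ridge(X_raw, y, lam)
w_unpen = ridge(X_raw, y, 0.0)

# Relative shrinkage per coefficient: the penalty bites hard on the
# small-scale feature (age) and barely touches the large-scale one.
shrink_age = 1.0 - w_pen[0] / w_unpen[0]
shrink_income = 1.0 - w_pen[1] / w_unpen[1]

# After standardisation, the same penalty shrinks both equally.
w_std = ridge(np.column_stack([age_s, inc_s]), y, lam)
```

Even though both features carry identical predictive signal, the raw-scale fit shrinks the age coefficient substantially while the income coefficient is essentially untouched; on standardised features the two coefficients come out nearly identical.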
</aside>
<aside> 🔥
When scaling is a necessity
Gradient-based learners (neural networks and linear models trained with gradient descent), distance-based methods (k-NN, SVM, clustering), and regularised models all need their inputs on comparable scales, for the reasons above.
</aside>
<aside> 🔥
When you don't need scaling
Tree-based models: Random forests, gradient boosting, and decision trees are generally scale-invariant because they make splitting decisions based on feature rankings, not absolute values.
Already normalised data: If your features are already on similar scales (like percentages, or standardised measurements), scaling may be unnecessary.
Domain-specific reasons: Sometimes the natural scale of your data is meaningful and should be preserved for interpretability.
In credit scoring or fraud detection, the actual dollar amounts often matter more than relative scales: a $10,000 transaction might be routine for one customer but suspicious for another, and transactions over $10,000 trigger specific reporting requirements. Mapping that amount to, say, 0.5 on a [-1, 1] scale would obscure this critical business logic.
</aside>
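The scale-invariance of trees follows from monotonicity: a tree splits on "feature <= threshold", and any linear rescaling preserves the ordering of values, so the identical split exists at the transformed threshold. A small sketch (synthetic income values and an arbitrary threshold, both illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(10_000, 100_000, size=100)

# Min-max scaling is a monotone (order-preserving) map, so applying it
# to both the feature and the threshold leaves every split's partition
# of the data unchanged.
threshold = 60_000.0
lo, hi = income.min(), income.max()
income_scaled = (income - lo) / (hi - lo)
threshold_scaled = (threshold - lo) / (hi - lo)

left_raw = income <= threshold
left_scaled = income_scaled <= threshold_scaled
same_partition = bool(np.array_equal(left_raw, left_scaled))
```

Since a trained tree only ever compares values against thresholds, rescaling its inputs merely relabels the thresholds; the learned partitioning, and hence the predictions, stay the same.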
Two main types of scalers exist for numerical features:
Linear scalers
Use when: the distribution is roughly symmetric or uniform, outliers are rare or already handled, and you mainly need features on a common range.
Non-linear scalers
Use when: the distribution is heavily skewed or long-tailed, or extreme outliers would otherwise dominate a linear rescaling.
Linear scaling adjusts the range of your features without changing the shape of their distribution.
The choice of linear scaler depends on the data distribution, the presence of outliers, and the requirements of your specific algorithm.
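As a quick illustration of shape preservation (hand-rolled min-max scaling on toy values), a linear transformation changes the range while leaving the ordering and relative spacing of the values, that is, the distribution's shape, intact:

```python
import numpy as np

x = np.array([3.0, 7.0, 8.0, 20.0, 41.0])

# Min-max scaling to [0, 1] is an affine transformation: it shifts and
# rescales values but does not reshape their distribution.
x_scaled = (x - x.min()) / (x.max() - x.min())

# The gaps between consecutive values keep exactly the same proportions;
# only the overall range has changed.
relative_gaps_raw = np.diff(x) / (x.max() - x.min())
relative_gaps_scaled = np.diff(x_scaled)  # scaled range is already 1
```

Because ranks and relative gaps survive the transformation, a histogram of the scaled values has the same shape as a histogram of the originals, just with relabelled axes.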
The most commonly used linear transformations are: