Machine learning models operate on numbers, not the messy reality of raw inputs. Data representation — how you transform inputs into features — often determines whether training is stable, predictive power is unlocked, and results hold up in production.
This guide presents practical patterns for data representation and feature engineering: scaling numbers, encoding categories, learning embeddings, crossing features to capture interactions, handling variable-length arrays, and combining modalities. Along the way, it highlights trade-offs, when each method shines, and how to avoid common pitfalls.
Inputs vs Features
Many ML algorithms are sensitive to the magnitude and range of their input features. Feeding in features with vastly different scales (e.g. age in years alongside income in dollars) can lead to poor performance and slow convergence.
<aside> 💡
Why?
Gradient descent efficiency
The speed and stability of gradient descent depend heavily on the scale of the input features. When features are on a similar scale, gradient descent can converge to the optimal solution more quickly and smoothly. Without scaling, the algorithm may take much longer to find good parameters, or fail to converge at all.
Mitigating exploding and vanishing gradients
Scaling your inputs to a small range like [-1, 1] ensures that the numbers entering the network at the very beginning are neither excessively large nor excessively small, which keeps activations and gradients in a healthy range through the early layers.
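The effect on convergence is easy to see in a minimal hand-rolled sketch (toy single-feature data; the learning rates are illustrative, chosen near each setting's stability limit):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1000.0, 100)  # a feature on a large scale
y = 3.0 * x + 5.0                  # exact linear target

def gd_steps(x, y, lr, tol=1e-3, max_steps=10_000):
    """Gradient descent on MSE for y ~ w*x + b; steps until loss < tol."""
    w = b = 0.0
    for step in range(1, max_steps + 1):
        resid = w * x + b - y
        w -= lr * 2.0 * np.mean(resid * x)
        b -= lr * 2.0 * np.mean(resid)
        if np.mean((w * x + b - y) ** 2) < tol:
            return step
    return max_steps  # did not converge within the budget

# Standardising x makes the loss surface well conditioned: the same fit
# is reachable, but gradient descent gets there in a handful of steps.
x_std = (x - x.mean()) / x.std()
steps_scaled = gd_steps(x_std, y, lr=0.4)

# On the raw feature the learning rate must be tiny to avoid divergence,
# and the bias term then crawls: the run exhausts its step budget.
steps_raw = gd_steps(x, y, lr=1e-6)
```

With the standardised feature the loss surface is nearly spherical, so a large step size is stable; on the raw feature the curvature along the weight and bias directions differs by orders of magnitude, and no single learning rate serves both.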
Improved performance for distance-based algorithms
Methods like k-NN, SVM, or clustering rely on distance calculations. Without scaling, features with larger ranges will dominate the distance computation, making smaller-scale features virtually irrelevant.
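The dominance effect shows up immediately in a toy example: three hypothetical customers described by age and income, with standardisation done by hand in NumPy (the same transformation scikit-learn's StandardScaler applies):

```python
import numpy as np

# Three hypothetical customers: (age in years, income in dollars).
X = np.array([[25.0, 50_000.0],
              [35.0, 52_000.0],
              [26.0, 90_000.0]])

# Unscaled: the income column dominates the Euclidean distance, so
# customer 0 looks far closer to 1 than to 2, regardless of age.
d_raw_01 = np.linalg.norm(X[0] - X[1])
d_raw_02 = np.linalg.norm(X[0] - X[2])

# Standardise each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = np.linalg.norm(X_std[0] - X_std[1])
d_std_02 = np.linalg.norm(X_std[0] - X_std[2])
```

After standardisation both features contribute comparably, and the two distances end up in the same ballpark instead of differing by an order of magnitude.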
More effective regularisation
Regularisation techniques like L1 (Lasso) and L2 (Ridge) regression add a penalty to the model based on the magnitude of the coefficients. If features are on different scales, the penalty is applied unevenly: large-scale features get tiny coefficients that are barely penalised, while small-scale features need large coefficients that the penalty shrinks aggressively.
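A sketch with closed-form ridge regression (hand-rolled in NumPy; the feature names, effect sizes, and penalty strength are all illustrative) shows this uneven penalty in action:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
age = rng.uniform(20, 70, n)              # spans tens
income = rng.uniform(20_000, 120_000, n)  # spans tens of thousands

# Both features have the same effect size in *standardised* units.
age_s = (age - age.mean()) / age.std()
inc_s = (income - income.mean()) / income.std()
y = 2.0 * age_s + 2.0 * inc_s + rng.normal(0, 0.1, n)

def ridge(X, y, lam):
    # Closed-form ridge (no intercept; X is centred):
    # w = (X'X + lam*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X_raw = np.column_stack([age - age.mean(), income - income.mean()])
lam = 20_000.0
w_pen = ridge(X_raw, y, lam)
w_unpen = ridge(X_raw, y, 0.0)

# Relative shrinkage per coefficient: the penalty bites hard on the
# small-scale feature (age) and barely touches the large-scale one.
shrink_age = 1.0 - w_pen[0] / w_unpen[0]
shrink_income = 1.0 - w_pen[1] / w_unpen[1]

# After standardisation, the same penalty shrinks both equally.
w_std = ridge(np.column_stack([age_s, inc_s]), y, lam)
```

Even though both features carry identical predictive signal, the raw-scale fit shrinks the age coefficient substantially while the income coefficient is essentially untouched; on standardised features the two coefficients come out nearly identical.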
</aside>
<aside> 🔥
When scaling is a necessity
Gradient-based learners (neural networks and linear models trained with gradient descent), distance-based methods (k-NN, SVM, clustering), and regularised models all need their inputs on comparable scales, for the reasons above.
</aside>
<aside> 🔥
When you don't need scaling
Tree-based models: Random forests, gradient boosting, and decision trees are generally scale-invariant because they make splitting decisions based on feature rankings, not absolute values.
Already normalised data: If your features are already on similar scales (like percentages, or standardised measurements), scaling may be unnecessary.
Domain-specific reasons: Sometimes the natural scale of your data is meaningful and should be preserved for interpretability.
In credit scoring or fraud detection, the actual dollar amounts often matter more than relative scales: a $10,000 transaction might be routine for one customer but suspicious for another, and transactions over $10,000 trigger specific reporting requirements. Mapping that amount to, say, 0.5 on a [-1, 1] scale would obscure this critical business logic.
</aside>
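The scale-invariance of trees follows from monotonicity: a tree splits on "feature <= threshold", and any linear rescaling preserves the ordering of values, so the identical split exists at the transformed threshold. A small sketch (synthetic income values and an arbitrary threshold, both illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(10_000, 100_000, size=100)

# Min-max scaling is a monotone (order-preserving) map, so applying it
# to both the feature and the threshold leaves every split's partition
# of the data unchanged.
threshold = 60_000.0
lo, hi = income.min(), income.max()
income_scaled = (income - lo) / (hi - lo)
threshold_scaled = (threshold - lo) / (hi - lo)

left_raw = income <= threshold
left_scaled = income_scaled <= threshold_scaled
same_partition = bool(np.array_equal(left_raw, left_scaled))
```

Since a trained tree only ever compares values against thresholds, rescaling its inputs merely relabels the thresholds; the learned partitioning, and hence the predictions, stay the same.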
Two main types of scalers exist for numerical features:
Linear scalers
Use when: the distribution is roughly symmetric or uniform, outliers are rare or already handled, and you mainly need features on a common range.
Non-linear scalers
Use when: the distribution is heavily skewed or long-tailed, or extreme outliers would otherwise dominate a linear rescaling.
Linear scaling adjusts the range of your features without changing the shape of their distribution.
The choice of linear scaler depends on the data distribution, the presence of outliers, and the requirements of your specific algorithm.
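As a quick illustration of shape preservation (hand-rolled min-max scaling on toy values), a linear transformation changes the range while leaving the ordering and relative spacing of the values, that is, the distribution's shape, intact:

```python
import numpy as np

x = np.array([3.0, 7.0, 8.0, 20.0, 41.0])

# Min-max scaling to [0, 1] is an affine transformation: it shifts and
# rescales values but does not reshape their distribution.
x_scaled = (x - x.min()) / (x.max() - x.min())

# The gaps between consecutive values keep exactly the same proportions;
# only the overall range has changed.
relative_gaps_raw = np.diff(x) / (x.max() - x.min())
relative_gaps_scaled = np.diff(x_scaled)  # scaled range is already 1
```

Because ranks and relative gaps survive the transformation, a histogram of the scaled values has the same shape as a histogram of the originals, just with relabelled axes.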
The most commonly used linear transformations are: