Introduction

Machine learning models operate on numbers, not the messy reality of raw inputs. Data representation — how you transform inputs into features — often determines whether training is stable, predictive power is unlocked, and results hold up in production.

This guide presents practical patterns for data representation and feature engineering: scaling numbers, encoding categories, learning embeddings, crossing features to capture interactions, handling variable-length arrays, and combining modalities. Along the way, it highlights trade-offs, when each method shines, and how to avoid common pitfalls.

Key Concepts 🔄

Inputs vs Features

An input is the raw value you collect (a timestamp, a free-text field, a price); a feature is the numeric representation derived from that input that the model actually consumes.

Numerical Inputs 🔢

Many ML algorithms are sensitive to the magnitude and range of input features. Feeding in features with vastly different scales (e.g. age in years and income in dollars) can lead to poor performance and slower convergence.

<aside> 💡

Why?

  1. Gradient descent efficiency

    The speed and stability of gradient descent is highly dependent on the scale of the input features. When features are on a similar scale, the gradient descent algorithm can converge to the optimal solution more quickly and smoothly. Without scaling, the algorithm might take much longer to find the best parameters, or it might even fail to converge at all.

    Scaling also helps mitigate exploding and vanishing gradients: keeping inputs in a small range like [-1, 1] ensures that the numbers entering the network at the very beginning are neither excessively large nor excessively small.

  2. Improved performance for distance-based algorithms

    Methods like k-NN, SVM, or clustering rely on distance calculations. Without scaling, features with larger ranges will dominate the distance computation, making smaller-scale features virtually irrelevant.

  3. More effective regularisation

    Regularisation techniques like L1 (Lasso) and L2 (Ridge) regression add a penalty to the model based on the magnitude of the coefficients. If features are on different scales, the penalty is applied unevenly across coefficients, biasing the model for or against features purely because of their units.

</aside>
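As a minimal illustration of point 2, the sketch below (using scikit-learn's `StandardScaler` and hypothetical age/income values) shows how the large-range income feature dominates a Euclidean distance until both features are standardised:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two people: (age in years, income in dollars) -- hypothetical values
X = np.array([[25.0, 50_000.0],
              [55.0, 52_000.0]])

# Raw Euclidean distance is dominated by the income column:
# a 30-year age gap barely registers next to a $2,000 income gap.
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardisation, both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(raw_dist, scaled_dist)
```

The same effect applies to any distance-based method (k-NN, SVM with RBF kernels, k-means): whichever feature has the widest numeric range effectively decides the answer.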

<aside> 🔥

When scaling is a necessity

Gradient-based learners (neural networks, linear and logistic regression) and distance-based methods (k-NN, SVM, k-means) benefit from, and often require, scaled inputs.

</aside>

<aside> 🔥

When you don't need scaling

Tree-based models (decision trees, random forests, gradient-boosted trees) split on thresholds and are invariant to monotonic transformations of a feature, so scaling is unnecessary.

</aside>

Numerical Scalers

Two main types of scalers exist for numerical features:

  1. Linear scalers

    Use when: the distribution is roughly symmetric with few extreme outliers, and you want to change the range of the data while preserving the shape of its distribution.

  2. Non-linear scalers

    Use when: the distribution is heavily skewed or contains extreme outliers, and you want to reshape it (e.g. towards uniform or Gaussian) before modelling.
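As a sketch of the non-linear case, scikit-learn's `PowerTransformer` can reshape a heavy-tailed distribution towards Gaussian (the log-normal sample here is a hypothetical stand-in for skewed data such as incomes):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical heavily right-skewed data (e.g. incomes, transaction amounts)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Box-Cox is a non-linear transform (requires strictly positive values);
# by default PowerTransformer also standardises the output to zero mean, unit variance.
pt = PowerTransformer(method="box-cox")
transformed = pt.fit_transform(skewed)
```

Unlike a linear scaler, this changes the shape of the distribution, not just its range.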

Linear scaling 📏

Linear scaling adjusts the range of your features without changing the shape of their distribution.

The choice of linear scaler depends on the data distribution, the presence of outliers, and the requirements of your specific algorithm.
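A quick sketch of how three common scikit-learn linear scalers behave on the same hypothetical column with one outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical column: four ordinary values plus one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max: maps to [0, 1], but the outlier crushes the other values together
mm = MinMaxScaler().fit_transform(x).ravel()

# Standardisation: zero mean, unit variance; mean and std are both outlier-sensitive
std = StandardScaler().fit_transform(x).ravel()

# Robust: centres on the median and divides by the IQR, so the
# ordinary values keep a usable spread despite the outlier
rob = RobustScaler().fit_transform(x).ravel()

print(mm)
print(std)
print(rob)
```

All three are linear maps of the input, so the shape of the distribution is unchanged; only the location and spread differ.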

The most commonly used linear transformations are: