We’ve all been there.
You build a shiny machine learning model, train it with love, validate it with care, and—boom!—the accuracy looks great. You’re already daydreaming about deploying it into the real world… only to discover that outside your neat little validation sandbox, the model flops. Miserably.
So, what’s going wrong?
It might not be your model architecture, your feature engineering, or even your hyperparameters. The silent culprit could be how you split your dataset—specifically, how you create your test set.
The Usual Routine (and Its Hidden Trap)
In a typical ML project, we start by cleaning data:
- Fixing formats
- Handling missing values
- Checking data types (a quick sketch follows this list)
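If it helps to picture that step, here's a tiny, purely illustrative pandas sketch; the columns and values are made up:

```python
# Illustrative first-pass cleanup on a tiny made-up DataFrame.
import pandas as pd

df = pd.DataFrame({
    "sale_date": ["2023-01-05", "2023-02-10", "2023-03-15"],
    "median_income": [5.2, None, 3.8],
    "rooms": ["4", "3", "5"],
})

df["sale_date"] = pd.to_datetime(df["sale_date"])                                # fix formats
df["median_income"] = df["median_income"].fillna(df["median_income"].median())  # handle missing values
df["rooms"] = df["rooms"].astype(int)                                            # fix data types
print(df.dtypes)                                                                 # confirm everything looks right
```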
Then, before analysis, we (correctly!) set aside a test set. This prevents us from accidentally peeking at the answers and overfitting our model selection to the test data.
That’s good practice. But here’s the kicker: even if you do this right, your model can still fail in the real world.
Why? Let’s take the classic house price prediction project as an example.
Random Sampling: Looks Fair, Acts Unfair
Say one of your features is median_income. It’s continuous data, so when creating your test set, you probably:
- Take 20% of the dataset (less if the dataset is huge).
- Do this randomly (something like the split sketched below).
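In code, the usual routine looks something like this minimal sketch with scikit-learn's `train_test_split`. The `housing` DataFrame here is synthetic stand-in data, and the column names are just assumptions for illustration:

```python
# Minimal sketch of the usual random 80/20 split on synthetic housing data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.gamma(shape=2.0, scale=2.0, size=1_000),   # skewed, like real incomes
    "median_house_value": rng.normal(200_000, 50_000, size=1_000),
})

# Purely random 20% test set -- no guarantee that every income range shows up.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))  # 800 200
```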
This seems fine, right? After all, randomness = fairness… or so we think.
But here’s the problem: random sampling can introduce something called sampling bias. Imagine your test set ends up with very few (or even zero!) samples from households with incomes between $60k and $80k.
Now, your model looks great on validation because it’s never really tested on those missing ranges. But when it hits the real world—where people do earn $60k–$80k—it performs poorly.
Congrats, you’ve just built a model that’s brilliant… for everyone except the people you accidentally left out. 😅
Stratified Sampling: The Hero We Overlook
Enter stratified sampling—a simple but powerful fix.
Instead of picking samples randomly, we:
- Divide our continuous feature (median_income) into categories (or “strata”), as sketched below.
- Ensure the test set contains representatives from each stratum, proportional to their presence in the dataset.
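Here’s a sketch of how that can look with pandas and scikit-learn, reusing the synthetic `housing` DataFrame from the random-split snippet above. The bin edges for the strata are illustrative choices, not canonical values; the `stratify` argument (or scikit-learn's `StratifiedShuffleSplit`) handles the proportional sampling:

```python
# Stratified sampling sketch: bucket `median_income` into strata,
# then sample 20% from each stratum proportionally.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Turn the continuous feature into income categories ("strata").
#    Bin edges are illustrative assumptions, not canonical values.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# 2. Split so the test set mirrors the strata proportions of the full dataset.
strat_train_set, strat_test_set = train_test_split(
    housing,
    test_size=0.2,
    stratify=housing["income_cat"],
    random_state=42,
)

# The helper column has served its purpose; drop it before training.
strat_train_set = strat_train_set.drop(columns="income_cat")
strat_test_set = strat_test_set.drop(columns="income_cat")
```

Binning first is the key trick: stratification needs discrete groups, so the continuous median_income is temporarily turned into a handful of categories wide enough that every stratum has plenty of samples.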
In short: your test set becomes a mini-me of the whole dataset.
This way, when you validate, the performance actually reflects how the model would generalize to all categories—not just the lucky ones that random sampling picked.
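A quick sanity check, continuing the sketch above: compare the income-category proportions in the full dataset with those in the stratified test set. The two columns should line up almost exactly, whereas a purely random test set can drift noticeably:

```python
# Stratified test-set proportions should mirror the full dataset.
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
stratified = (
    housing.loc[strat_test_set.index, "income_cat"]
    .value_counts(normalize=True)
    .sort_index()
)
print(pd.DataFrame({"overall": overall, "stratified_test": stratified}))
```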
This technique isn’t just for ML. It’s widely used in surveys, polls, and studies where representation matters. And in machine learning, it’s a lifesaver for ensuring your model isn’t just book-smart but also street-smart.
