How to Avoid Overfitting in Machine Learning: 4 Best Tips
The Core Concept: Understanding the Signal and the Noise
Machine learning serves a single purpose: it identifies meaningful patterns in historical data to predict future outcomes. You can think of the data as a crowded party where some conversations provide valuable information (the signal) while others represent random chatter (the noise).
A successful model listens only to the signal, but an overfit model acts like an eager student who memorizes every cough, sneeze, and random whisper in the room. This student performs perfectly on the "practice test" because they remember every specific detail, yet they fail the "real exam" because they never understood the actual subject.
Overfitting essentially occurs when your algorithm learns the training data too well, capturing the idiosyncratic quirks of the sample rather than the underlying truth of the population.
The model essentially mistakes stochastic fluctuations for significant features. This error happens frequently when you use an algorithm that is too complex for the amount of data you have available. Imagine I ask you to draw a line through two points; a straight line is a simple and likely accurate representation.
However, if I give you a thousand points and you draw a complex, zig-zagging line that hits every single one, you probably captured a lot of random jitter that will not appear in the next batch of data. Generalization remains the gold standard of success, representing the model's ability to stay accurate when it encounters a fresh test set it has never seen before.
The process of training involves minimizing a loss function, which measures how far off the predictions are from the reality. Developers often feel a sense of triumph when the training error drops to zero, but this is usually the moment where the danger begins.
A model with zero error is often just a high-speed memory machine that lacks any predictive power for the real world. You must constantly audit your system to ensure it is learning robust patterns rather than fragile coincidences.
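The memorization trap described above is easy to reproduce. The following NumPy sketch (synthetic data, illustrative polynomial degrees chosen for demonstration) fits both a simple and an overly flexible polynomial to the same noisy linear trend, then compares their errors on training versus fresh data:

```python
import numpy as np

rng = np.random.default_rng(0)

# The "truth" is a simple line; the noise is random jitter around it
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = 2 * x_test + rng.normal(0, 0.2, size=x_test.size)

def train_test_mse(degree):
    # Fit a polynomial of the given degree to the training points only
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

simple_train, simple_test = train_test_mse(1)   # matches the true shape
complex_train, complex_test = train_test_mse(9) # enough capacity to chase noise

# The flexible fit "wins" on the training data and loses on fresh data
```

The degree-9 polynomial threads almost exactly through every noisy training point, so its training error collapses toward zero, yet the zig-zags it learned are pure jitter that the test data does not share.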
Several distinct factors contribute to this phenomenon:
- Excessive model parameters
- Insufficient training samples
- High levels of label noise
- Non-representative data distributions
- Over-tuning of hyperparameters
- Lack of regularization constraints
- Long training durations
- Redundant input features
- Class imbalance issues
- Data leakage during preprocessing
When it focuses on the noise, the model draws a beautiful map of a city that no longer exists. You want a map that shows the permanent landmarks, not the temporary road construction or the parked cars.
Finding this balance requires a disciplined approach to model selection and training. I will show you exactly how to navigate these challenges in the following sections.
The Great Balance: Why You Need to Know How to Avoid Overfitting in Machine Learning
Statistical learning hinges on a concept known as the bias-variance tradeoff, which is the fundamental law governing model performance. I like to imagine this as a target on a wall where you are throwing darts.
Bias represents a model that makes very strong, often wrong, assumptions about the data, leading it to miss the target consistently because it is too simple. On the other hand, variance represents a model that is so flexible and sensitive that it reacts wildly to every tiny change in the training data.
This leads to the model hitting the bullseye on the training set but scattering shots everywhere else once the environment shifts slightly.
The total error in any prediction is the sum of squared bias, variance, and an irreducible amount of noise that no model can fix. Your mission is to find the "Goldilocks Zone" where the model is complex enough to capture the real relationship but simple enough to ignore the random fluff.
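In symbols, this is the standard bias-variance decomposition of the expected squared error for a target y = f(x) + ε with noise variance σ²:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Simplifying the model shrinks the variance term at the cost of some bias; the σ² term is the floor no model can get below.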
This specific balance is the core reason why learning how to avoid overfitting in machine learning is the most important skill for a data scientist. You essentially trade a small amount of accuracy on your training data for a massive gain in reliability on your future data. A slightly "imperfect" fit on the training data often leads to a much more perfect fit in the real world.
The Bias-Variance Tradeoff: Striking the Perfect Equilibrium
The bias-variance tradeoff describes the struggle between a model's simplicity and its flexibility. An underfit model has high bias because it fails to capture the complexity of the data, like trying to model a curved road with a straight line.
Conversely, an overfit model has high variance because it changes its entire logic based on which specific data points it sees during training. You can visualize this by imagining two students: one who only learns the "A" and "B" answers for every question (high bias) and another who memorizes the exact ink smudges on the page to identify the question (high variance).
A high-variance model typically achieves near-perfect accuracy on its training set while showing a massive drop in performance on the test set. This gap is the most reliable red flag you can look for during the development process.
The error on the training set might be 1%, while the error on the test set is 25%. This discrepancy tells you that the model is latching onto features that do not generalize. You should always aim for a scenario where both errors are low and relatively close to each other, indicating that the model has captured the true signal.
The following signals help you identify where you stand on the curve:
- High train error and high test error (a sign of underfitting)
- Low train error and high test error (a sign of overfitting)
- Increasing validation loss over time
- Erratic decision boundaries
- High sensitivity to small data changes
- Extremely large weight coefficients
- Overly complex feature interactions
- Diverging learning curves
- Perfect accuracy on tiny datasets
- Poor performance on out-of-distribution samples
The model stabilizes when you successfully balance these two competing forces. You essentially "chill out" the model, preventing it from overreacting to outliers or random measurement errors. This process results in a smoother decision boundary that better reflects the actual truth of the problem.
I find that most beginners focus too much on bias and end up with variance problems, so it helps to start simple and only add complexity when the data demands it.
The model reaches its peak utility when it can comfortably handle the messiness of real life. A robust algorithm accepts that some data points are just weird and refuses to let those weird points dictate the entire strategy.
You essentially build a system that is "mostly right" everywhere rather than "exactly right" in one small, unrepeatable instance.
Data-Driven Solutions: Building a Robust Foundation
Data quality and volume act as the first line of defense in the war against variance. I always tell my team that a model is only as smart as the examples it sees. If you only show a model ten pictures of cats, it might decide that a "cat" is anything with a specific shadow in the background.
If you show it ten million pictures of cats in every possible lighting, angle, and environment, the model is forced to learn what a cat actually looks like. More data makes it statistically impossible for the model to "cheat" by memorizing the noise, as the noise tends to cancel itself out over a large number of samples.
The Law of Large Numbers suggests that as you increase your sample size, the average of your results becomes a better representation of the true population. In the context of machine learning, a massive dataset acts like a heavy anchor that keeps the model from drifting into the territory of overfitting.
However, gathering millions of high-quality, labeled examples is often expensive or impossible. This is where creative strategies like data augmentation or synthetic data generation become vital. You essentially create "new" data out of thin air by applying logical transformations to what you already have.
How to Avoid Overfitting in Machine Learning Using Data Augmentation
Data augmentation is the clever process of creating modified versions of your existing data to increase the diversity of the training set. If you are working with images, you can flip them, rotate them, or change the brightness.
The model sees these as "new" examples, but since the label (e.g., "dog") remains the same, the model learns that a dog is still a dog even if it is upside down or in low light. This technique is a powerful way to avoid overfitting in machine learning because it forces the model to ignore non-essential details like orientation or color intensity.
For text data, you can use synonym replacement or back-translation. In back-translation, you translate an English sentence into French and then back to English. The resulting sentence often has the same meaning but different wording.
This prevents the model from overfitting to specific vocabulary or sentence structures. You essentially teach the model the "essence" of the language rather than just the specific strings of characters it saw during its first pass.
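A minimal image-augmentation sketch looks like the following. It uses plain NumPy on an H x W x C array; real pipelines usually lean on a library such as torchvision transforms or Keras preprocessing layers, and the specific probabilities and brightness range here are illustrative choices:

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of an H x W x C image array
    with pixel values in [0, 1]. Every transform preserves the label."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]        # horizontal flip
    if rng.random() < 0.5:
        out = np.rot90(out)       # 90-degree rotation of the image plane
    # Random brightness shift, clipped back into the valid pixel range
    out = np.clip(out + rng.uniform(-0.1, 0.1), 0.0, 1.0)
    return out

rng = np.random.default_rng(42)
image = rng.random((32, 32, 3))                   # stand-in for a real photo
batch = [augment(image, rng) for _ in range(8)]   # eight "new" training examples
```

Each call produces a slightly different picture with the same meaning, which is exactly the diversity the training set needs.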
These transformations offer several benefits:
- Artificial expansion of limited datasets
- Reduced sensitivity to camera angles
- Increased robustness to lighting shifts
- Prevention of pixel memorization
- Better handling of varied accents
- Improved semantic understanding in NLP
- Reduction in class imbalance issues
- Simulation of real-world noise
- Higher performance on distorted inputs
- Enhanced generalization to new environments
The model becomes much more resilient when it has practiced on these "hallucinated" variations. It is like an athlete who trains in the rain, the heat, and the cold so they are ready for any conditions on game day. You are essentially building "mental flexibility" into your algorithm.
I have seen augmentation turn a mediocre model into a world-class system without changing a single line of the actual algorithm code.
Scaling the Volume: Why More Data Solves Almost Everything
A larger dataset provides a more comprehensive map of the feature space. When you have only a few data points, the gaps between them are large, and the model can take a wild, erratic path to connect them.
As you fill those gaps with more observations, the path becomes much more constrained and predictable. You essentially "crowd out" the possibility for the model to find a spurious relationship that only exists by chance.
The model begins to see the underlying distribution rather than the individual samples. This is particularly important for complex models like deep neural networks, which have millions of parameters and a nearly infinite capacity to memorize things.
Without enough data, a neural network is like a massive library with only one book; it knows every word of that book perfectly, but it has no idea how the rest of the world works. Increasing the data volume is often the single most effective intervention you can make.
More data improves the model in these ways:
- Averaging out of measurement errors
- Greater representation of rare events
- Clearer separation of signal from noise
- Higher statistical significance of features
- Better estimation of population parameters
- Reduction in the impact of outliers
- Improved stability of the loss surface
- Narrower confidence intervals for predictions
- Easier detection of subtle patterns
- Stronger resistance to random correlations
The model gains a sense of perspective that it simply cannot have with a small sample. You should always prioritize data collection over architectural tweaking whenever possible. I find that a simple model with a mountain of data almost always beats a complex model with a molehill of data. It is the most reliable "cheat code" in the entire field of data science.
The model finally learns to ignore the "quirks" of individual records. When it sees a specific pattern repeated across a million different people, it can safely assume that the pattern is real. If it only sees it in three people, it might just be a coincidence. This statistical safety is why we always push for more information before we start messing with the math.
Evaluation Frameworks: Testing for Real-World Success
You cannot fix what you cannot measure, and you cannot measure overfitting if you only look at your training scores. Proper evaluation requires a rigorous setup that mimics the "unseen" nature of the real world.
I always insist on a strict separation of data into different pools: training, validation, and testing. This setup is the "firewall" that protects your model from becoming a memorization machine. If the model knows the answers to the test before it takes it, you have no way of knowing if it actually learned anything.
A common mistake is using the validation set so many times for tuning that the model eventually "overfits" to the validation set itself. This is a subtle and dangerous form of data leakage. To prevent this, the test set must remain locked away in a "vault" until the very end of the project.
It is the final judge of your work. If your model performs well on the training data and the validation data but fails on the test data, you know you have accidentally leaked information somewhere in the process.
The Infrastructure of Validation: Splitting Your Data Correctly
A standard data split usually follows a 70/15/15 or 80/10/10 ratio. The training set is what the model uses to adjust its weights. The validation set is used during the training process to check for overfitting and to tune hyperparameters like the learning rate or the number of layers.
Finally, the test set provides the unbiased estimate of how the model will perform in production. You should never, under any circumstances, use the test set to make decisions about the model's design.
The model's performance on the validation set tells you when to stop training. If you see the training error going down while the validation error starts to climb, you have hit the wall of overfitting. This visual "divergence" is the most important graph you will ever look at.
It is the moment the model stops being a learner and starts being a memorizer. Maintaining these boundaries is the only way to ensure your metrics are honest.
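The three-pool split described above can be sketched with scikit-learn's `train_test_split`, applied twice to get a 70/15/15 division (the toy data and exact sizes here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # toy features
y = np.tile([0, 1], 500)             # toy binary labels

# Step 1: lock 15% away in the test "vault", stratified by class
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=0)

# Step 2: split the remainder into 70% train / 15% validation overall
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=150, stratify=y_rest, random_state=0)
```

Stratification keeps the class balance identical in every pool, and the fixed `random_state` makes the split reproducible so teammates evaluate against the same vault.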
Proper data splitting requires these steps:
- Shuffle the data thoroughly
- Ensure a representative distribution
- Apply stratification for imbalanced classes
- Isolate the test set immediately
- Use hash-based splits for consistency
- Avoid temporal bleeding in time-series
- Keep grouped data in the same fold
- Monitor the validation-test gap
- Never look at the test set early
- Re-split if the distribution changes
The model gains credibility when its performance holds up across different slices of data. You want to see consistent results whether you are looking at the first 10% of the data or the last 10%. This consistency is the hallmark of a generalized model.
I find that most projects fail not because of bad algorithms, but because of sloppy data handling during the evaluation phase.
Cross-Validation Techniques: A Proven Method for How to Avoid Overfitting in Machine Learning
K-Fold Cross-Validation is the "gold standard" for evaluation, especially when you have a limited amount of data. Instead of splitting the data once, you divide it into K equal parts, or "folds" (usually 5 or 10). You then train the model K times, each time holding out a different fold for evaluation and training on the rest. You average the scores from all K runs to get a much more stable and reliable performance estimate. This is a brilliant way to avoid overfitting in machine learning because it ensures that the model can perform well across every single corner of the dataset.
This method effectively uses every single data point for both training and testing, which is a massive advantage when data is scarce. It reduces the chance that you just got "lucky" with a single random split.
If the model performs well in all five folds, you can be very confident that it has captured a universal truth. If it performs well in four but fails in the fifth, you know there is something weird about that specific subset that the model cannot handle yet.
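In scikit-learn the whole procedure is one call to `cross_val_score`; the synthetic dataset and logistic-regression model below are placeholders for your own:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Train and score 5 times; each fold takes one turn as the held-out set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores.mean(), scores.std())  # average accuracy and its spread
```

A low standard deviation across the five scores is the consistency signal discussed above; one fold scoring far below the others points at a problematic subset of the data.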
Cross-validation offers these strategic advantages:
- Reduction in performance estimate variance
- Maximized use of small datasets
- Detection of hidden data biases
- Better hyperparameter tuning reliability
- Identification of unstable model architectures
- Unbiased error estimation
- Validation of the entire training pipeline
- Protection against "lucky" random splits
- Clearer insight into model consistency
- Robustness against outliers in specific folds
The model's final score becomes a much more trustworthy reflection of reality. You are no longer guessing based on one roll of the dice; you are looking at the average of ten rolls.
This statistical rigor is what separates the professionals from the hobbyists. I never deploy a model without at least a 5-fold cross-validation check to ensure it is not just a fluke.
The model achieves a level of stability that makes it much easier to trust in a production environment. When you know the model can handle any subset of the data you throw at it, you can sleep better at night knowing it won't break the first time it sees a new customer. This confidence is the ultimate goal of the entire validation process.
Regularization: The Mathematical Diet for Complex Models
If a model is too complex, we need a way to put it on a "diet." Regularization is the mathematical technique of adding a penalty to the model's loss function based on the size of its weights. I like to think of this as a tax on complexity.
If the model wants to use a very large weight for a specific feature, it has to "pay" for it with a higher loss. This forces the model to only use features that are absolutely necessary to get the job done. It encourages the model to stay simple, sleek, and focused on the big picture.
The most common forms of regularization are L1 (Lasso) and L2 (Ridge). They work by slightly modifying the goal of the training process. Instead of just trying to minimize the error, the model is now trying to minimize the error plus the penalty term.
This "tug-of-war" between fitting the data and staying simple is what creates a generalized model. It prevents any single feature from dominating the entire prediction and keeps the model from chasing every tiny fluctuation in the training set.
L1 and L2 Norms: Pruning the Weights of Complexity
L1 Regularization, also known as Lasso, adds the sum of the absolute values of the weights to the loss function. This has a fascinating geometric effect: it tends to push the weights of unimportant features all the way to zero. This makes L1 a built-in feature selection tool.
If you have a thousand input variables but only ten are actually useful, L1 will "prune" the other 990, leaving you with a sparse, interpretable model. It is perfect for situations where you suspect that most of your data is just noise.
L2 Regularization, or Ridge, adds the sum of the squared weights to the loss. Unlike L1, it rarely pushes weights to exactly zero. Instead, it shrinks all of them proportionally, ensuring that no single weight becomes too large.
This "democratizes" the influence of the features, making the model much more stable and less sensitive to outliers. In deep learning, this is often called weight decay. It is like a friction force that keeps the model's weights from spinning out of control.
These mathematical constraints provide these results:
- Prevention of extreme coefficient values
- Automatic selection of relevant features
- Improved model interpretability
- Higher stability against multicollinearity
- Reduced sensitivity to noisy outliers
- Smoother decision surfaces
- Better performance on high-dimensional data
- Implicit simplification of the hypothesis space
- Consistent results across different samples
- Controlled model capacity expansion
The model stays "humble" because it is penalized for being too confident or too complex. You essentially force the algorithm to prove that a feature is truly valuable before it is allowed to influence the final answer.
This skepticism is healthy and prevents the model from being "fooled" by random patterns. I always start with a small amount of L2 regularization as a default setting in every project.
The Dropout Method: Improving Generalization Through Randomness
Dropout is one of the most brilliant inventions in the history of deep learning. During each training step, you randomly "turn off" a percentage of the neurons in the network (usually 20% to 50%). This means the network cannot rely on any single neuron or specific path to get the answer.
It is forced to learn redundant representations and spread the "knowledge" across the entire system. This prevents co-adaptation, where neurons start working together to memorize specific training examples.
Imagine a team of employees where you randomly tell half of them to stay home every day. If the company still runs smoothly, it means everyone knows how to do multiple jobs and the system is robust. Dropout effectively trains an ensemble of thousands of different "thinned" networks simultaneously.
When it comes time for the real test, you turn all the neurons back on, and the resulting "collective intelligence" is far more robust and accurate than any single path could ever be. This is a primary strategy for avoiding overfitting in machine learning in the modern era.
The dropout technique delivers these benefits:
- Elimination of neuron co-dependence
- Forced learning of robust features
- Implicit training of model ensembles
- Reduced sensitivity to individual data points
- Higher accuracy on complex datasets
- Prevention of "memorization" pathways
- Better performance in deep architectures
- Stabilized internal activation distributions
- Faster convergence in many scenarios
- Increased robustness to noisy inputs
The model becomes much more resilient because it has practiced "thinking" with parts of its brain missing. This mental toughness leads to incredible generalization on unseen data.
I find that dropout is almost mandatory for any neural network with more than a few layers. It is the closest thing we have to a "magic button" for fixing variance in deep learning.
The model reaches a state of "distributed wisdom." No single part of the network is indispensable, which makes the entire structure incredibly hard to break.
This redundancy is the secret to why modern AI can handle the chaos of real-world images and speech so effectively. You are essentially building a system that is too diverse to overfit.
Architectural and Training Controls: When to Stop and Simplify
Sometimes the best way to avoid a problem is to just stop before it starts. Early stopping is a simple but incredibly effective technique where you monitor the model's performance on the validation set and halt the training process the moment the error stops improving.
I like to think of this as a chef who pulls the cake out of the oven the second it is done. If you leave it in too long, it gets burnt (overfit). If you take it out too soon, it is undercooked (underfit). Early stopping finds that perfect middle ground.
Another powerful tool is model simplification. If your model is consistently overfitting, it might just be too big for the task. Reducing the number of layers, the number of neurons, or the complexity of the features is a direct application of Occam's Razor.
The simplest explanation that fits the data is usually the correct one. You don't need a supercomputer to predict if it will rain based on three variables. Matching the "size" of the model to the "size" of the problem is a fundamental skill.
Early Stopping: Knowing Exactly When to Halt Your Progress
The beauty of early stopping is that it requires no complex math; you just need to watch the learning curves. Typically, the training error will keep going down forever as the model memorizes the data. However, the validation error will go down for a while and then start to rise again.
That "inflection point" is the exact moment the model has learned everything it needs to know. Stopping right there ensures you get the best possible generalization without the extra fluff of memorized noise.
You can set a patience parameter to tell the system to wait for a few epochs before stopping, just in case the error is just going through a temporary "bump." This ensures you don't stop too early during a noisy training phase.
It is a low-cost, high-reward strategy that saves you both time and compute power. Why spend hours training a model to be worse? Early stopping is a win-win for everyone involved.
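The bookkeeping behind early stopping fits in a short loop. The sketch below is framework-agnostic; in practice a ready-made callback (such as Keras' `EarlyStopping`) does the same thing, and the simulated loss curve here is purely illustrative:

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=100, patience=5):
    """Run epochs until validation loss fails to improve for `patience`
    consecutive epochs, then report the best epoch and its loss."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                  # one epoch of weight updates
        loss = val_loss_fn()          # measure on the validation set
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: reset patience
        elif epoch - best_epoch >= patience:
            break                     # patience exhausted: stop training
    return best_epoch, best_loss

# Simulated validation curve: improves, bottoms out, then starts to rise
curve = [1.0, 0.6, 0.4, 0.35, 0.36, 0.38, 0.41, 0.45, 0.5, 0.6]
losses = iter(curve)
best_epoch, best_loss = train_with_early_stopping(
    lambda: None, lambda: next(losses), max_epochs=len(curve), patience=3)
```

With a patience of 3, the loop tolerates the small "bump" after epoch 3 and halts once three epochs in a row fail to beat the best loss, returning the epoch at the bottom of the curve. Real implementations also snapshot the weights at that best epoch so you can restore them.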
Using early stopping provides these advantages:
- Direct prevention of over-training
- Significant savings in compute time
- Automatic discovery of optimal epochs
- Reduction in energy consumption
- Protection against "noisy" loss surfaces
- Seamless integration with other techniques
- Preservation of model simplicity
- Consistent results across different runs
- Higher reliability of final weights
- Easier debugging of training cycles
The model remains in its peak state because you captured it at its most intelligent moment. You are essentially "freezing" the model at its highest point of wisdom. I never run a long training job without an early stopping callback; it is like having an insurance policy for your accuracy. It is the smartest way to manage your time and your results.
Simplifying Your Model Architecture for Better Results
The temptation in machine learning is always to go bigger, but bigger is often the enemy of generalization. If you have a thousand data points, you don't need a 50-layer neural network.
A simple logistic regression or a shallow decision tree might actually perform better because it is forced to focus on the most important trends. This architectural discipline is a key way to avoid overfitting in machine learning when you are dealing with smaller or noisier datasets.
Pruning is another way to simplify. In a decision tree, you can "prune" the branches that only account for a tiny number of samples. These branches are almost certainly capturing noise rather than real patterns.
By cutting them off, you make the tree smaller, faster, and much more accurate on new data. You are essentially cleaning the "clutter" out of your model's brain so it can see the big picture more clearly.
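A quick scikit-learn comparison makes the effect of capping tree depth concrete. The dataset is synthetic with 10% label noise deliberately injected, and the depth limit of 4 is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of labels flipped to simulate noise
X, y = make_classification(n_samples=600, n_features=20,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# The unconstrained tree aces its own training data (including the noise);
# the depth-capped tree cannot memorize and must settle for the trends
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```

The unconstrained tree grows until it perfectly classifies every training point, flipped labels included, which is exactly the noise-capturing branching that pruning removes. Cost-complexity pruning via the `ccp_alpha` parameter is the more principled alternative to a hard depth cap.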
Simplification strategies include:
- Reducing the number of hidden layers
- Lowering the neuron count per layer
- Pruning decision tree depth
- Limiting polynomial feature degrees
- Using linear models as a baseline
- Removing redundant input variables
- Applying global average pooling
- Simplifying activation functions
- Restricting model capacity through weight constraints
- Consolidating correlated features
The model gains a level of clarity that complex systems often lack. When you have fewer moving parts, there are fewer things that can go wrong.
It is much easier to explain a simple model to a stakeholder than a black-box neural network. I always tell my students to start with the simplest possible model and only move to something complex if the simple one fails to meet the requirements.
The model achieves a robust and "common sense" understanding of the data. It is not distracted by the tiny, irrelevant details that plague more complex algorithms.
This focused intelligence is what allows a simple model to outperform a giant one in many real-world scenarios. It is the ultimate expression of algorithmic efficiency.
Ensemble Methods: Harnessing the Wisdom of the Collective
Why rely on one model when you can rely on a hundred? Ensemble methods combine the predictions of multiple different models to produce a final answer that is more accurate and robust than any single model could be. I like to think of this as a committee of experts.
If one expert makes a mistake because they overfit to a specific detail, the other 99 experts will outvote them. This "averaging" effect is one of the most powerful ways to cancel out the errors caused by variance.
The two main types of ensembling are Bagging and Boosting. Bagging (like in a Random Forest) trains many models in parallel on different random subsets of the data. Since each model "sees" a different part of the noise, their errors are random and tend to cancel each other out when you average them.
Boosting (like in XGBoost) trains models sequentially, where each new model tries to fix the mistakes of the previous one. Both are industry-standard tools for building high-performance systems.
Bagging and Boosting: Diversifying Your Predictions
Bagging, or Bootstrap Aggregating, is the ultimate "variance reducer." By training trees on different "bootstrap" samples of your data, you create a diverse forest where no single tree has too much power. This is a classic method for avoiding overfitting in machine learning because it essentially "smooths out" the erratic behavior of individual decision trees.
If one tree decides that everyone with a blue shirt is a high-risk borrower because of a weird fluke in its training sample, the rest of the forest will correct that assumption.
Boosting works a bit differently. It starts with a simple model and then builds another model to predict the errors of the first one. It repeats this process hundreds of times. While boosting can sometimes overfit if you go too far, modern versions use regularization and early stopping to stay in the safe zone.
Boosting is incredibly powerful because it can turn "weak learners" into a "strong learner" that is nearly impossible to beat in terms of accuracy.
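Both families are one import away in scikit-learn. The sketch below scores a bagging ensemble (Random Forest) and a boosting ensemble (Gradient Boosting) with cross-validation on a synthetic dataset; the estimator counts are illustrative defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=20,
                           flip_y=0.05, random_state=1)

# Bagging: 200 trees trained in parallel on bootstrap samples, votes averaged
bagging = RandomForestClassifier(n_estimators=200, random_state=1)
# Boosting: shallow trees added sequentially, each correcting prior errors
boosting = GradientBoostingClassifier(random_state=1)

bag_scores = cross_val_score(bagging, X, y, cv=5)
boost_scores = cross_val_score(boosting, X, y, cv=5)

print("bagging:", bag_scores.mean().round(3))
print("boosting:", boost_scores.mean().round(3))
```

For gradient boosting, the `learning_rate` and `n_estimators` parameters play the regularization role: a smaller learning rate with early stopping keeps the sequential error-fixing from tipping into overfitting.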
Ensemble techniques provide these distinct benefits:
- Drastic reduction in model variance
- Higher accuracy than single algorithms
- Greater robustness to noisy data
- Built-in feature importance rankings
- Parallel training capabilities (Bagging)
- High precision on complex patterns (Boosting)
- Ability to handle non-linear relationships
- Cancellation of individual model errors
- Improved stability across different datasets
- Effective handling of high-dimensional inputs
The model becomes a "super-expert" that has seen the problem from every possible angle. It is like having a team of doctors consult on a case instead of just one. The collective decision is almost always better. I find that for most tabular data problems (like spreadsheets), an ensemble of trees is almost always the winning solution.
The model reaches a level of performance that is truly impressive. By combining the strengths of many different perspectives, you create something that is greater than the sum of its parts. This is the final and most sophisticated level of the defense against overfitting. When you have a diversified portfolio of models, you are much better protected against the random fluctuations of the world.
Frequently Asked Questions About How to Avoid Overfitting in Machine Learning
The following questions frequently arise when practitioners attempt to stabilize their models and improve their real-world accuracy. I have provided clear, actionable answers based on the latest industry standards.
Is it always better to have more data?
More data is generally a massive advantage because it forces the model to learn the signal while the noise cancels itself out. However, more data only helps if it is high-quality and representative of the problem you are trying to solve. If you just add a million rows of random, irrelevant, or incorrectly labeled "garbage," you are actually making the problem worse by adding even more noise for the model to get distracted by. You should always prioritize the "cleanliness" and "diversity" of your data before you focus purely on the volume.
How do I know if my model is overfitting or just underfitting?
You can tell the difference by looking at your training and validation scores. If your training error is high and your validation error is also high, your model is too simple and you are underfitting; you need a more complex model or better features. If your training error is extremely low but your validation error is high (or rising), you are overfitting. A "good fit" is characterized by low error on both sets, with the two scores being relatively close to each other.
Which regularization technique is the best?
The "best" technique depends entirely on your specific data and model type. L2 regularization (Weight Decay) is a great "all-purpose" choice that works well in most scenarios to keep weights small and stable. L1 regularization (Lasso) is the way to go if you have a lot of features and you want the model to automatically pick the most important ones. Dropout is almost always the preferred choice for deep neural networks. Often, the most robust models use a combination of several techniques, such as Elastic Net (which combines L1 and L2).
Does early stopping always work?
Early stopping is incredibly effective, but it relies on having a high-quality validation set. If your validation set is too small or doesn't represent the real world, you might stop training at the "wrong" time. Additionally, the loss curve can sometimes be "bumpy," leading the model to stop because of a temporary dip before it has actually reached its full potential. Using a patience parameter (waiting for 5 or 10 epochs of no improvement) is the best way to ensure you don't stop prematurely.
Can a model generalize well even if it overfits slightly?
Yes, in some cases, a small amount of overfitting is acceptable if the overall accuracy on the test set is still high. The real world is never perfect, and sometimes a model that is a tiny bit "too attached" to the training data still performs better than a model that is too simple to capture the core trends. However, you should always be wary of a large gap between training and testing scores. If the gap is growing, your model is becoming less reliable, even if the individual scores still look "good." Always prioritize the generalization gap as your primary health metric.
