Common Pitfalls of Machine Learning

In supervised learning, historical data is used to train a model. The model is then deployed to produce predictions on future data, a step called scoring.

But what if future data is unlabelled? How do we know if the predictions are any good?

A good model should generalize well.

1. Generalization:

A model’s performance should be measured on data that it hasn’t seen during training.

We say a good model should generalize to out-of-sample data (data not used for training). At training time we try to find parameters that minimize error on the training data, but there’s no guarantee that this will also minimize error on out-of-sample data.

A model that generalizes well achieves similar performance on the training and validation data.

How well does a learned model generalize from the data it was trained on to a new test set?

The answer is to split the data into training and test sets.

A small random portion of the historical data is set aside to stand in for future data; this is the test data. The remaining portion is the training data. Unlike future data, test data is labeled, so we can compare predictions with the observed values, also called the ground truth. We use the training data to train a model, then evaluate the trained model’s performance on the test data (data that the model didn’t see at training time).
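A minimal sketch of such a split, using only the Python standard library. The function name, the 20% test fraction, and the toy `history` data are assumptions made for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and hold out a fraction of it as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled historical data: (feature, label) pairs.
history = [(x, x % 2) for x in range(100)]
train, test = train_test_split(history, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Every row ends up in exactly one of the two sets, so the test rows really are unseen at training time.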

2. Overfitting:

A model is overfitting if it performs well on the training data but poorly on the test data. The model is so “complex” that it learns not only the signal but also irrelevant characteristics (noise) in the training data, and hence fails to generalize.

This results in a low training error and a high validation error (a big gap between training and validation performance).
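As a small illustration of this gap, the sketch below fits a nearest-neighbour classifier to synthetic noisy data. The data generator, the 20% label noise, and all function names are assumptions made for the example. With k = 1 every training point is its own nearest neighbour, so the model memorizes the training set, noise included.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (one feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train, data, k):
    """Fraction of rows in `data` that a k-NN model built on `train` gets wrong."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)

def make_data(n, rng, noise=0.2):
    """Label is 1 when x > 0.5, then flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        rows.append((x, y))
    return rows

rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)
for k in (1, 15):
    print(f"k={k}: train error={error_rate(train, train, k):.2f}, "
          f"test error={error_rate(train, test, k):.2f}")
```

The k = 1 model reports zero training error by construction, which says nothing about how it handles unseen points; comparing the two columns is what reveals the gap.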

Reducing overfitting:

Decrease model complexity, e.g.:
- Prune a decision tree
- Reduce the number of hidden layers in a neural network.
- Increase the number of neighbors (k) in k-NN
Decrease the number of features
- More aggressive feature selection
Regularization, a popular way of controlling overfitting by limiting model complexity:
- Penalize high weights.
- L1 regularization (LASSO) is very effective at pushing the weights of non-informative features to 0.
Gather more training data if possible
In iterative training algorithms, stop training earlier to prevent “memorization” of training data
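To see why L1 regularization zeroes out weak weights, consider the simplified case of orthonormal features (an assumption made only to keep the example short). There, the LASSO solution is the unregularized weight passed through a soft-threshold, while an L2 (ridge) penalty merely rescales it. The weights and penalty strength below are made up.

```python
def soft_threshold(w, lam):
    """LASSO solution for one weight when features are orthonormal:
    minimizes 0.5 * (w_ols - w)**2 + lam * abs(w)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0   # weights smaller than the penalty are set exactly to zero

def ridge_shrink(w, lam):
    """Ridge solution in the same setting:
    minimizes 0.5 * (w_ols - w)**2 + 0.5 * lam * w**2."""
    return w / (1.0 + lam)

# Hypothetical unregularized weights: two informative, two nearly useless.
ols_weights = [3.0, -0.2, 0.05, -1.5]
lam = 0.5

lasso = [soft_threshold(w, lam) for w in ols_weights]
ridge = [ridge_shrink(w, lam) for w in ols_weights]
print(lasso)  # [2.5, 0.0, 0.0, -1.0]
print(ridge)  # all weights shrunk, none exactly zero
```

Both penalties shrink large weights, but only the L1 penalty removes the small ones entirely, which is why it doubles as a form of feature selection.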

3. Underfitting:

A model is underfitting when it performs poorly on the training data; it will then almost certainly perform poorly on the test data as well. The model is too “simple” to represent all the relevant class characteristics. This results in a high training error and a high validation error.

Reducing underfitting:

Add more data (on its own this rarely helps, because a model that is too simple cannot make use of it)
Increase the number of features, or create more relevant features
Increase model complexity, e.g.:
- Increase the number of levels in a decision tree
- Increase the number of hidden layers in a neural network.
- Decrease the number of neighbors (k) in k-NN
In iterative training algorithms, iterate long enough for the objective function to converge.
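The last point can be sketched with plain gradient descent on a one-parameter quadratic loss. The loss, the learning rate, the tolerance, and the function name are all assumptions chosen to keep the example tiny; the idea is only to stop when the objective has stopped improving rather than after a fixed, too-small number of steps.

```python
def gradient_descent(grad, loss, w0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Iterate until the loss changes by less than `tol` or `max_iter` is hit.
    Returns (final weight, steps taken, whether the loss converged)."""
    w, prev = w0, loss(w0)
    for step in range(1, max_iter + 1):
        w = w - lr * grad(w)
        cur = loss(w)
        if abs(prev - cur) < tol:
            return w, step, True
        prev = cur
    return w, max_iter, False

def loss(w):
    return (w - 3.0) ** 2      # toy objective with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)

# Stopping after 3 steps leaves the parameter far from the optimum (underfit).
w_short, _, converged_short = gradient_descent(grad, loss, w0=0.0, max_iter=3)
# Letting the loop run until the loss flattens out gets close to w = 3.
w_long, steps, converged_long = gradient_descent(grad, loss, w0=0.0)
print(w_short, converged_short)
print(w_long, steps, converged_long)
```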

Indicators of Underfitting and Overfitting

Model performs poorly on both training and testing data:
- Underfitting, or
- The data is not relevant to the problem
Model performs well on training, but poorly on testing:
- Overfitting

A good model is one that neither underfits nor overfits.
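These indicators can be turned into a rough rule of thumb. In the sketch below, the function name and both thresholds (`max_error`, `max_gap`) are hypothetical; sensible values depend entirely on the problem.

```python
def diagnose(train_error, test_error, max_error=0.10, max_gap=0.05):
    """Rough diagnosis from training and test error rates (both in [0, 1]).
    The default thresholds are illustrative assumptions, not standard values."""
    if train_error > max_error:
        # Poor even on data the model has seen.
        return "underfitting (or data not relevant)"
    if test_error - train_error > max_gap:
        # Good on training data, much worse on unseen data.
        return "overfitting"
    return "reasonable fit"

print(diagnose(train_error=0.30, test_error=0.32))  # underfitting (or data not relevant)
print(diagnose(train_error=0.01, test_error=0.25))  # overfitting
print(diagnose(train_error=0.05, test_error=0.07))  # reasonable fit
```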

4. Data leakage:

What if a model performs well during testing but fails to achieve the same level of performance in real-world usage?

Data leakage occurs when information is accidentally shared between the test and training data-sets. Typically, this happens at the step where a data-set is split into training and testing sets.

Causes:

Pre-processing errors: this happens when a pre-processor is fitted on the entire data-set. Doing so leaks information from the test set, since the parameters of the pre-processing model are fitted with knowledge of the test set.
Duplicates: this occurs when the data-set contains several points with identical or near-identical data, so copies of the same point can land in both the training and test sets.
Dependencies between the test and training sets: a common example occurs with temporal data, where a random split lets the model train on observations recorded after those it is tested on.
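The pre-processing case is the easiest to show in code. The sketch below standardizes a single feature; the helper names and the toy numbers are assumptions. The leaky version computes the mean and standard deviation from all rows, while the safe version fits them on the training rows only and reuses them for the test rows.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn standardization parameters (mean, standard deviation)."""
    return mean(values), pstdev(values)

def transform(values, params):
    """Apply previously learned parameters; nothing is re-fitted here."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values, already split into train and test.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: the parameters are computed with knowledge of the test rows.
leaky_params = fit_scaler(train + test)

# Safe: fit on the training rows only, then apply the same parameters to test.
clean_params = fit_scaler(train)
train_scaled = transform(train, clean_params)
test_scaled = transform(test, clean_params)

print(leaky_params[0], clean_params[0])  # the two means differ
```

The same pattern applies to any fitted pre-processing step (imputation, encoding, feature selection): fit on the training set, apply to the test set.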

5. Class Imbalance:

In classification problems, data sets are said to be balanced if there are approximately as many positive examples of the concept as there are negative ones.

The problem with class imbalance is that standard learning algorithms are often biased towards the majority class. That is because these classifiers attempt to reduce global quantities such as the error rate, without taking the data distribution into consideration. As a result, examples from the minority class tend to be misclassified.

Class imbalance and model evaluation:

Class imbalance usually implies that not all errors (misclassifications) have the same importance, so looking at accuracy (the percentage of correct classifications) can be overly optimistic. Instead, we look at other metrics such as precision, recall, or AUC.
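A short worked example, with made-up labels, of how accuracy can mislead. A classifier that always predicts the majority (negative) class on a data set with 5% positives is 95% accurate yet never finds a single positive. The function name is an assumption, and returning 0.0 when a ratio is undefined is a convention chosen for this sketch.

```python
def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Convention for this sketch: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

print(metrics(y_true, always_negative))  # (0.95, 0.0, 0.0)
```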

Resampling techniques to deal with class imbalance:

Oversampling the minority class by adding more copies of minority-class examples. This can be a good choice when there isn’t much data.
Undersampling the majority class by removing some observations of the majority class. Undersampling can be a good choice when you have a ton of data (think millions of rows). A drawback is that we are removing information that may be valuable, which could lead to underfitting and poor generalization to the test set.
Generating synthetic samples with SMOTE (Synthetic Minority Oversampling Technique), a widely used resampling approach. SMOTE uses a nearest-neighbours algorithm to generate new, synthetic minority-class examples; this informed oversampling of the minority class can be combined with random undersampling of the majority class.
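The core idea behind SMOTE can be sketched in a few lines: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. This is a simplified illustration, not the reference algorithm; the function name, the value of `k`, and the toy points are assumptions. In practice a library such as imbalanced-learn would normally be used instead.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Needs at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()   # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class points with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))  # 6
```

Because each synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies instead of being exact copies.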