Data Augmentation: A Tactic to Improve the Performance of ML
Learning about data augmentation will help you train machine learning models when quality training data is scarce
Machine learning models can perform wonderful things if they have enough training data. Unfortunately, for many applications, access to quality data remains a barrier. One solution to this problem is data augmentation, a technique that generates new training examples from existing ones. Data augmentation is a low-cost and effective method to improve the performance and accuracy of machine learning models in data-constrained environments.
When machine learning models are trained on limited examples, they tend to overfit. Overfitting happens when an ML model performs accurately on its training examples but fails to generalize to unseen data. There are several ways to avoid overfitting in machine learning, such as choosing a different algorithm, modifying the model’s architecture, and adjusting hyperparameters. But ultimately, the main remedy for overfitting is adding more quality data to the training dataset. However, gathering extra training examples can be expensive, time-consuming, or sometimes impossible. This challenge becomes even more difficult in supervised learning applications, where training examples must be labeled by human experts.
One of the ways to increase the diversity of the training dataset is to create copies of the existing data and make small modifications to them. This is called data augmentation. For example, say you have twenty images of ducks in your image classification dataset. By creating copies of your duck images and flipping them horizontally, you have doubled the training examples for the “duck” class. You can use other transformations such as rotation, cropping, zooming, and translation. You can also combine the transformations to further expand your collection of unique training examples.
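To make this concrete, here is a minimal sketch of geometric augmentation using NumPy, where an image is treated as a 2D array. The function name `augment_image` and the specific transforms chosen are illustrative; in practice you would typically use a library such as torchvision or Albumentations, which offer many more transforms.

```python
import numpy as np

def augment_image(image):
    """Return a few geometrically transformed copies of an image (H x W array)."""
    return [
        np.fliplr(image),       # horizontal flip: mirrors the image left-right
        np.rot90(image),        # 90-degree counterclockwise rotation
        image[1:-1, 1:-1],      # simple center crop: removes a 1-pixel border
    ]

# One original image yields several new labeled examples
image = np.arange(16).reshape(4, 4)
augmented = augment_image(image)
```

Each transformed copy keeps the original label, so the "duck" images stay ducks after flipping or rotating.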
Data augmentation does not need to be limited to geometric manipulation. Adding noise, changing color settings, and other effects such as blur and sharpening filters can also help in repurposing existing training examples as new data. Data augmentation is especially useful for supervised learning because you already have the labels and don’t need to put in extra effort to annotate the new examples. Data augmentation is also useful for other classes of machine learning algorithms such as unsupervised learning, contrastive learning, and generative models.
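A sketch of two such photometric augmentations in NumPy, assuming pixel values in the 0–255 range. The function names and parameter defaults here are illustrative choices, not a standard API:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def add_gaussian_noise(image, sigma=10.0):
    """Add zero-mean Gaussian noise, clipped back to the valid pixel range."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255)

def adjust_brightness(image, factor=1.2):
    """Scale pixel intensities to simulate different lighting conditions."""
    return np.clip(image * factor, 0, 255)

image = np.full((4, 4), 128.0)   # a flat mid-gray test image
noisy = add_gaussian_noise(image)
brighter = adjust_brightness(image, factor=1.5)
```

Clipping after each transform keeps the augmented images within the valid intensity range, so they remain plausible inputs for the model.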
Data augmentation has become a standard practice for training machine learning models for computer vision applications. Popular machine learning and deep learning programming libraries have easy-to-use functions to integrate data augmentation into the ML training pipeline. Data augmentation is not limited to images and can be applied to other types of data. For text datasets, nouns and verbs can be replaced with their synonyms. In audio data, training examples can be modified by adding noise or changing the playback speed.
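For text, synonym replacement can be sketched in a few lines. The tiny hand-built dictionary below is purely illustrative; real pipelines typically draw synonyms from a lexical resource such as WordNet:

```python
# Toy synonym table for illustration only
SYNONYMS = {"big": "large", "fast": "quick"}

def synonym_augment(sentence, synonyms=SYNONYMS):
    """Create a new training example by swapping known words for synonyms."""
    return " ".join(synonyms.get(word, word) for word in sentence.split())

original = "the big dog runs fast"
augmented = synonym_augment(original)  # "the large dog runs quick"
```

Because the label (say, a sentiment class) is unchanged by the substitution, the augmented sentence can be added to the training set as-is.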
Data augmentation is not a silver bullet that solves all your data problems; think of it instead as a low-cost performance booster. Depending on your target application, you still need a fairly large training dataset with enough examples. In some applications, training data might be too limited for data augmentation to help. In these cases, you must collect more data until you reach a minimum threshold before data augmentation becomes useful. Sometimes, you can use transfer learning, where you train an ML model on a large general dataset and then repurpose it by fine-tuning its higher layers on the limited data you have for your target application.
Data augmentation also doesn’t address other problems, such as biases that exist in the training dataset, and the augmentation process itself may need adjustment to deal with issues like class imbalance.