Synthetic Data Generation
Improving a model using synthetic training data
Any form of training requires good examples of what is to be learnt, and training a machine learning model is no different. The quality of a model's understanding generally increases with the quantity and variety of relevant examples available. Sometimes, however, there are not enough real-life examples of a certain type. In this case we can generate synthetic examples, provided their features accurately reflect those of real-life examples. Take the image of a kitten in Figure 1. Figure 2 shows six images generated from it by zooming in a little, turning the head, or rotating the kitten slightly. We now have six new examples of kitten images, all different from the original but created from its properties. These can then be used to train a machine learning model to recognise a broader variety of examples.
Figure 1. The original image of a kitten
Figure 2. Synthetic images of a kitten, generated from the original image in Figure 1
Our situation with merchants is no different. For some merchants we have a large variety of examples on which to train our models; for others the training set is not so rich. This leads to a class imbalance, which ultimately affects the overall quality of our model. Below is an example of a string that we vectorise for Dunelm.
Original String
Original string split into n-grams
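As a rough illustration of the splitting step, the sketch below breaks a string into overlapping character n-grams. It makes several assumptions that are not stated in this post: that the n-grams are character-level, that n is 3, and that the text is lower-cased with whitespace collapsed first. The function name `char_ngrams` and the example string are hypothetical; the string is not a real Dunelm transaction.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`, in order.

    Assumption: lower-casing and collapsing whitespace is an illustrative
    normalisation step, not necessarily the one used in production.
    """
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        # Too short to form a full n-gram: keep the whole string, if any.
        return [cleaned] if cleaned else []
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


# Hypothetical transaction string, for illustration only.
example = "ACME  Store 42"
print(char_ngrams(example))
```

Each n-gram overlaps its neighbour by n - 1 characters, so small differences between two transaction strings change only a few n-grams rather than the whole representation.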
This set of n-grams can be converted to a vector, as discussed in our Word to Vector blog. Once represented as a vector, the string sits in vector space alongside all the other real-life examples of transactions from Dunelm. To generate synthetic samples for this merchant we use the Synthetic Minority Over-sampling TEchnique (SMOTE). This technique interpolates new points in the vector space that lie on the line segments between a real example and its nearest neighbours from the same class, as can be seen in Figure 3.
Figure 3. With SMOTE, synthetic instances of the minority class are generated by imputing points between the existing points
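The interpolation shown in Figure 3 can be sketched in a few lines of plain Python. This is a minimal illustration of the core SMOTE idea under stated assumptions, not the implementation used for our models: the function name `smote_sample`, the parameters `k`, `n_new` and `seed`, and the toy vectors are all hypothetical. For each synthetic point it picks a real minority example, picks one of that example's k nearest minority neighbours, and places a new point at a random position on the line segment between the two.

```python
import math
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate `n_new` synthetic points from a list of minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # A gap in [0, 1) places the new point between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy 2-D vectors standing in for vectorised transaction strings.
real_points = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
for point in smote_sample(real_points):
    print(point)
```

Because every synthetic point is a convex combination of two real points, it always stays inside the region already covered by the minority class. In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn library, would usually be preferable to a hand-rolled version like this one.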
These imputed points in the vector space can then be converted back into n-grams to give us synthetic representations such as the one below.
Synthetic n-gram string
Synthetic strings such as this, along with the real-world examples, can then be used to train our models, ultimately increasing their effectiveness.