Data Preparation
The data is split into 3 + K sets, where K is the number of augmentations.
As our labelled data is tiny, it's important to stratify the data on the labels.
It's up to the user to decide how to augment, how much labelled data to use, how large the validation split should be, and so on.
Data Augmentation
The data augmentation itself is simple.
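The source does not pin down the exact transforms, so here is a hedged sketch of a typical CIFAR-10-style augmentation (pad-and-crop plus random horizontal flip) written with NumPy only; the function name and the padding amount are illustrative, not part of the original pipeline.

```python
import numpy as np

def augment(img: np.ndarray, pad: int = 4, rng=None) -> np.ndarray:
    """Random pad-and-crop plus horizontal flip on an HxWxC image.

    A common CIFAR-10 recipe, shown here as an assumption; in a
    torchvision pipeline this would be a transforms.Compose instead.
    """
    rng = rng or np.random.default_rng()
    h, w, _ = img.shape
    # Reflect-pad, then crop back to the original size at a random offset.
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = int(rng.integers(0, 2 * pad + 1))
    left = int(rng.integers(0, 2 * pad + 1))
    out = padded[top:top + h, left:left + w]
    # Flip left-right with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    return out
```

Because the crop offset and flip are drawn fresh each call, calling this K times on the same image yields K distinct views.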
Implementation
We ran into some friction working with torchvision's CIFAR10, since the dataset loads images on demand. We implement the approach below; however, it's not the only way to do this.
In order to stratify and shuffle the data without loading the entire dataset, we pass indices into the Dataset initialization, and pre-emptively shuffle and filter out the indices.
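The index-passing idea can be sketched generically. The class name below is hypothetical; in the actual implementation the indices would go into a subclass of torch.utils.data.Dataset wrapping torchvision.datasets.CIFAR10, but the same pattern works with any indexable dataset.

```python
class IndexedDataset:
    """Expose only the given indices of an indexable base dataset.

    Minimal sketch (name and shape are assumptions): in practice this
    would subclass torch.utils.data.Dataset so a DataLoader can use it.
    The indices are shuffled/filtered up front, so the base dataset is
    never iterated in full.
    """

    def __init__(self, base, indices):
        self.base = base
        self.indices = list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Translate the dense position i into the pre-selected base index.
        return self.base[self.indices[i]]
```

For example, `IndexedDataset(cifar, labelled_indices)` only ever touches the labelled images, even though `cifar` still holds all 50,000.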
Notably, torchvision.datasets.CIFAR10 has:
50,000 training images
10,000 test images
So, the task is to split the range(50000) indices into indices for the labelled and unlabelled data, using the targets to stratify the split.
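The splitting step can be sketched with NumPy alone. The source leaves the exact split sizes to the user, so expressing the labelled budget as a per-class count here is an assumption; the function and parameter names are illustrative.

```python
import numpy as np

def stratified_split(targets, n_labelled_per_class, seed=42):
    """Split indices into labelled/unlabelled index arrays, stratified on targets.

    For CIFAR10, `targets` would be the dataset's 50,000 training labels,
    so the returned arrays partition range(50000).
    """
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    labelled, unlabelled = [], []
    for cls in np.unique(targets):
        # All indices belonging to this class, in random order.
        idx = np.flatnonzero(targets == cls)
        rng.shuffle(idx)
        labelled.extend(idx[:n_labelled_per_class])
        unlabelled.extend(idx[n_labelled_per_class:])
    # Final shuffle so neither split is ordered by class.
    return rng.permutation(labelled), rng.permutation(unlabelled)
```

Because the same number of indices is taken from every class, the labelled set keeps the original class balance even when it is tiny.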
Splitting Unlabelled to K Augmentations
Instead of trying to create K datasets, we create a single dataset, and make it return K augmented versions of the same image.
This works by overriding the __getitem__ method and returning a tuple of augmented images. Take care to update all downstream code to handle the new return type.
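A minimal sketch of that override, with no torch dependency so the idea stands alone: in practice this class would subclass torch.utils.data.Dataset (or the CIFAR10 subclass above), and `augment` would be the chosen transform; both names are illustrative.

```python
class KAugmentDataset:
    """Return K independently augmented views of each image.

    Sketch under stated assumptions: `images` is any indexable
    collection, `augment` is a callable applying a random transform.
    """

    def __init__(self, images, augment, k):
        self.images = images
        self.augment = augment
        self.k = k

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        img = self.images[i]
        # One tuple of K views, each drawn from an independent random
        # application of the augmentation.
        return tuple(self.augment(img) for _ in range(self.k))
```

Downstream, the training loop (and any collate function) must unpack K items per sample instead of one, which is exactly the update the paragraph above warns about.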