Pipeline
There are several crucial details about the MixMatch pipeline that aren't mentioned in the paper but make or break the model's performance. This document explains the pipeline in detail.
Shortforms
X: input data, in this case, images
Y: labels of the input data
K: number of augmentations, or to refer to the kth augmentation
Lab.: labeled data
Unl.: unlabeled data
Data Preparation
The data is split into 3 + K sets, where K is the number of augmentations.
See Data Preparation for more details.
Model Architecture
We use a Wide ResNet 28-2 as the base model. This is a custom implementation based on YU1ut's PyTorch Implementation and Google Research's TensorFlow Implementation.
Training
Training is rather complex. The key steps are illustrated below.
To highlight certain steps, we use the following notation:
This is the pipeline of the training process.
We have both Data and Data List, as the augmentations create a new axis in the data.
A few things to note:
Concat is on the Batch axis, the 1st axis.
Predict uses the model's forward pass.
The Label Guessing Prediction, Predict(X Unl. K), doesn't use gradients.
The Mix Up Shuffling is on the Batch axis, which includes the augmentations. If the data is of shape (B, K, C, H, W), the shuffling happens on both B and K.
CIFAR10 (and most datasets) don't divide evenly into batches, so use drop_last on the DataLoader to avoid errors (see the sketch below).
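A minimal sketch of the no-gradient label guessing and the drop_last setting, assuming a standard PyTorch model and dataset; the function names, the batch size, and the averaging over the K augmented predictions (taken from the MixMatch paper) are illustrative, not this repository's exact code:

```python
import torch
from torch.utils.data import DataLoader

def make_unlabelled_loader(dataset, batch_size: int = 64) -> DataLoader:
    # drop_last discards the final, smaller batch so every batch has a fixed size.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

@torch.no_grad()  # label guessing must not contribute gradients
def guess_labels(model, x_unl: torch.Tensor) -> torch.Tensor:
    # x_unl: (B, K, C, H, W). Flatten the augmentation axis for the forward pass.
    b, k, c, h, w = x_unl.shape
    preds = model(x_unl.view(b * k, c, h, w)).softmax(dim=-1).view(b, k, -1)
    # Average the K augmented predictions into one guessed label per image;
    # sharpening is applied to this average afterwards.
    return preds.mean(dim=1)
```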
Sharpening
This is a step to make the Unlabelled Predictions more confident. This is done by raising the predictions to a power, then normalizing the predictions.
A higher tau value will make the predictions more confident.
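A minimal sketch of this step, assuming tau is the exponent applied element-wise before renormalization (the function name is illustrative):

```python
import torch

def sharpen(preds: torch.Tensor, tau: float) -> torch.Tensor:
    # preds: (..., num_classes) probabilities from label guessing.
    # Raise each probability to the power tau, then renormalize so rows sum to 1.
    powered = preds ** tau
    return powered / powered.sum(dim=-1, keepdim=True)
```

With tau = 1 the predictions are unchanged; as tau grows, the distribution approaches a one-hot vector.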
Mix Up
Mix Up mixes the original data with a shuffled version of the data. The ratio of this mix is determined by a modified sample drawn from a Beta distribution: the modified sample is the maximum of the raw sample and its complement.
Notably, when we modify the sample, we're effectively always taking the larger value, making the original sample more prevalent during the mix.
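A minimal sketch of the mixing step, assuming the batch and augmentation axes have already been flattened together as described above (the names mix_up and the default alpha=0.75 are illustrative):

```python
import torch

def mix_up(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.75):
    # x: inputs of shape (N, ...) where N is the flattened batch axis, e.g. (B * K, C, H, W).
    # y: one-hot / probability labels of shape (N, num_classes).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1 - lam)  # the modified sample: max of the sample and its complement

    idx = torch.randperm(x.size(0))  # shuffle along the batch axis
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix
```

Because lam is always at least 0.5 after taking the maximum, the original sample always carries the larger weight in the mix.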
Unlabelled Loss Scaler
The unlabelled loss scaler is a scalar weight on the unlabelled loss. It increases linearly from 0 to 100 over the course of training.
The implementation is simple: divide the current epoch by the total number of epochs, then multiply by the maximum weight of 100.
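A minimal sketch, assuming the maximum weight is 100 and the ramp is computed once per epoch (names are illustrative):

```python
def unlabelled_loss_scale(epoch: int, total_epochs: int, max_scale: float = 100.0) -> float:
    # Linear ramp from 0 to max_scale over the course of training.
    return max_scale * epoch / total_epochs
```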
Interleaving
Interleaving is not a well-documented step in the paper. See our Interleaving document for more details.
Evaluation
The evaluation is simple. We just take the accuracy of the model on the test set.
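A minimal sketch of this evaluation, assuming a standard PyTorch test DataLoader (names are illustrative):

```python
import torch

@torch.no_grad()
def evaluate(model, test_loader, device: str = "cuda") -> float:
    # Top-1 accuracy over the test set.
    model.eval()
    correct, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        correct += (model(x).argmax(dim=-1) == y).sum().item()
        total += y.size(0)
    return correct / total
```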