import fastbook
fastbook.setup_book()
from fastbook import *

Image Classification

Now that we know how to create and deploy a basic model, in this chapter we will dig deeper into image classification, in two directions:

  • Make our models better.
  • Apply them to a wider variety of types of data.

From Dogs and Cats to Pet Breeds

from fastai.vision.all import *
path = untar_data(URLs.PETS)

Data for most deep learning datasets is provided in one of these two ways:

  • Individual files representing items of data, such as text documents or images, possibly organized into folders or with filenames representing information about those items
  • A table of data, such as in CSV format, where each row is an item which may include filenames providing a connection between the data in the table and data in other formats, such as text documents and images

The vast majority of the datasets you'll work with will use some combination of these two formats.

Path.BASE_PATH = path
path.ls()
(#2) [Path('images'),Path('annotations')]
(path/"images").ls()
(#7393) [Path('images/Bombay_13.jpg'),Path('images/beagle_193.jpg'),Path('images/Ragdoll_8.jpg'),Path('images/boxer_106.jpg'),Path('images/keeshond_56.jpg'),Path('images/american_pit_bull_terrier_162.jpg'),Path('images/saint_bernard_136.jpg'),Path('images/staffordshire_bull_terrier_76.jpg'),Path('images/pug_173.jpg'),Path('images/american_pit_bull_terrier_117.jpg')...]

By examining these filenames, we can see how they appear to be structured. Each filename contains the pet breed, and then an underscore (_), a number, and finally the file extension.

We need to create a piece of code that extracts the breed from a single Path, so let's grab one filename to experiment with:

fname = (path/"images").ls()[0]

So head over to Google and search for "regular expressions tutorial" now, and then come back here after you've had a good look around.

When you are writing a regular expression, the best way to start is just to try it against one example at first.

Let's use the findall method to try a regular expression against the filename of the fname object:

re.findall(r'(.+)_\d+.jpg$', fname.name)
['Bombay']

For labeling with regular expressions, we can use the RegexLabeller class.

pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items=get_image_files, 
                 splitter=RandomSplitter(seed=42),
                 get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                 item_tfms=Resize(460),
                 batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = pets.dataloaders(path/"images")

One important piece of this DataBlock call that we haven't seen before is in these two lines:

item_tfms=Resize(460),
batch_tfms=aug_transforms(size=224, min_scale=0.75)
  • These lines implement a fastai data augmentation strategy which we call presizing.
  • Presizing is a particular way to do image augmentation that is designed to minimize data destruction while maintaining good performance.

Presizing

Presizing adopts two strategies:

  1. Resize images to relatively "large" dimensions—that is, dimensions significantly larger than the target training dimensions.
  2. Compose all of the common augmentation operations (including a resize to the final target size) into one, and perform the combined operation on the GPU only once at the end of processing, rather than performing the operations individually and interpolating multiple times.
  • The first step, the resize, creates images large enough that they have spare margin to allow further augmentation transforms on their inner regions without creating empty zones.
  • This transformation works by resizing to a square, using a large crop size.
  • On the training set, the crop area is chosen randomly, and the size of the crop is selected to cover the entire width or height of the image, whichever is smaller.
  • In the second step, the GPU is used for all data augmentation, and all of the potentially destructive operations are done together, with a single interpolation at the end.

These two steps work as follows:

  1. Crop full width or height: This is in item_tfms, so it's applied to each individual image before it is copied to the GPU. It's used to ensure all images are the same size. On the training set, the crop area is chosen randomly. On the validation set, the center square of the image is always chosen.
  2. Random crop and augment: This is in batch_tfms, so it's applied to a batch all at once on the GPU, which means it's fast. On the validation set, only the resize to the final size needed for the model is done here. On the training set, the random crop and any other augmentations are done first.

To implement this process in fastai you use Resize as an item transform with a large size, and RandomResizedCrop as a batch transform with a smaller size. RandomResizedCrop will be added for you if you include the min_scale parameter in your aug_transforms function, as was done in the DataBlock call in the previous section. Alternatively, you can use pad or squish instead of crop (the default) for the initial Resize.
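
As a concrete example of that last sentence, here is a sketch (an assumption, not code from the book's pets walkthrough) of the same DataBlock using padding instead of the default crop for the initial large Resize; the per-item Resize runs on the CPU, while aug_transforms (with min_scale adding a random resized crop) runs once per batch on the GPU:

pets_pad = DataBlock(blocks=(ImageBlock, CategoryBlock),
                     get_items=get_image_files,
                     splitter=RandomSplitter(seed=42),
                     get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                     # pad to a square with zeros instead of cropping (assumed alternative)
                     item_tfms=Resize(460, ResizeMethod.Pad, pad_mode='zeros'),
                     batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls_pad = pets_pad.dataloaders(path/"images")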

  • In the book's comparison of presizing with a more traditional augmentation pipeline, the traditionally augmented image is less well defined and has reflection padding artifacts in the bottom-left corner; also, the grass at the top left has disappeared entirely.
  • We find that in practice using presizing significantly improves the accuracy of models, and often results in speedups too.

The fastai library also provides simple ways to check your data looks right before training a model, which is an extremely important step.

Checking and Debugging a DataBlock

  • Writing a DataBlock is just like writing a blueprint.
  • Before training a model, you should always check your data.
  • You can do this using the show_batch method:
dls.show_batch(nrows=1, ncols=5)

Take a look at each image, and check that each one seems to have the correct label for that breed of pet.

To debug this, we encourage you to use the summary method. It will attempt to create a batch from the source you give it, with a lot of details.

  • For instance, one common mistake is to forget to use a Resize transform, so you end up with pictures of different sizes and are not able to batch them.
pets1 = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items=get_image_files, 
                 splitter=RandomSplitter(seed=42),
                 get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'))
pets1.summary(path/"images")

You can see exactly how we gathered the data and split it, how we went from a filename to a sample (the tuple (image, category)), then what item transforms were applied and how it failed to collate those samples in a batch (because of the different shapes).

Once we think our data looks right, the next step is to use it to train a simple model:

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2)
epoch train_loss valid_loss error_rate time
0 1.494869 0.353126 0.113667 00:17
epoch train_loss valid_loss error_rate time
0 0.506801 0.321486 0.106901 00:19
1 0.333137 0.242048 0.075778 00:19

As we've briefly discussed before, the table shown when we fit a model shows us the results after each epoch of training. Remember, an epoch is one complete pass through all of the images in the data. The columns shown are the average loss over the items of the training set, the loss on the validation set, and any metrics that we requested—in this case, the error rate.

Remember that loss is whatever function we've decided to use to optimize the parameters of our model. But we haven't actually told fastai what loss function we want to use. So what is it doing? fastai will generally try to select an appropriate loss function based on what kind of data and model you are using. In this case we have image data and a categorical outcome, so fastai will default to using cross-entropy loss.
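
We can check which loss function fastai actually picked by inspecting the learner (a quick check, not shown in the book); for this image classifier it should be fastai's flattened version of cross-entropy loss:

learn.loss_func   # expected to be CrossEntropyLossFlat()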

Cross-Entropy Loss

Cross-entropy loss is similar to the loss we used in the previous chapter, but it has two benefits:

  • It works even when our dependent variable has more than two categories.
  • It results in faster and more reliable training.

In order to understand how cross-entropy loss works for dependent variables with more than two categories, we first have to understand what the actual data and activations that are seen by the loss function look like.

Viewing Activations and Labels

Let's take a look at the activations of our model. To actually get a batch of real data from our DataLoaders, we can use the one_batch method:

x,y = dls.one_batch()

As you see, this returns the dependent and independent variables, as a mini-batch. Let's see what is actually contained in our dependent variable:

y
TensorCategory([21, 10, 30,  4, 24,  7, 15, 17,  0, 10, 31, 14,  9, 34, 17,  8, 31, 22, 24,  1,  0,  0, 20, 14, 21, 24,  8, 12, 34, 15,  9,  8, 13, 15, 19, 21, 12,  7, 21, 26, 31, 23,  2, 24,  9, 23, 26, 26,
        32, 28, 12,  0, 35, 24,  3, 26, 27,  8, 36, 11, 13, 18,  2,  1], device='cuda:0')

Our batch size is 64, so we have 64 rows in this tensor. Each row is a single integer between 0 and 36, representing our 37 possible pet breeds.
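
As a quick sanity check (not part of the book's text), the vocab of the DataLoaders maps these integers back to breed names:

len(dls.vocab), dls.vocab[0]   # 37 classes; the breed name corresponding to label 0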

We can view the predictions (that is, the activations of the final layer of our neural network) using Learner.get_preds.

  • This function either takes a dataset index (0 for train and 1 for valid) or an iterator of batches.
  • Thus, we can pass it a simple list with our batch to get our predictions.
  • It returns predictions and targets by default, but since we already have the targets, we can effectively ignore them by assigning to the special variable _
preds,_ = learn.get_preds(dl=[(x,y)])
preds[0]
TensorBase([2.1906e-09, 4.9218e-09, 1.0061e-06, 5.2513e-07, 5.3143e-08, 3.6090e-08, 3.1996e-07, 6.9169e-07, 2.7921e-07, 1.1637e-08, 1.2519e-07, 3.2140e-08, 2.7920e-06, 3.6968e-07, 2.2905e-07, 7.1837e-07,
        1.0868e-07, 1.3916e-06, 1.6785e-07, 1.2505e-05, 9.1398e-08, 9.9979e-01, 4.0886e-06, 4.7811e-06, 1.7009e-06, 2.3698e-06, 1.6723e-07, 7.4736e-06, 1.3465e-06, 3.0198e-07, 4.1827e-06, 1.5687e-04,
        1.5037e-06, 1.1295e-07, 9.4217e-07, 1.1124e-06, 1.7568e-08])

The actual predictions are 37 probabilities between 0 and 1, which add up to 1 in total:

len(preds[0]),preds[0].sum()
(37, TensorBase(1.))

To transform the activations of our model into predictions like this, we used something called the softmax activation function.

Softmax

In our classification model, we use the softmax activation function in the final layer to ensure that the activations are all between 0 and 1, and that they sum to 1.

Softmax is similar to the sigmoid function, which we saw earlier. As a reminder sigmoid looks like this:

plot_function(torch.sigmoid, min=-4,max=4)

We can apply this function to a single column of activations from a neural network, and get back a column of numbers between 0 and 1, so it's a very useful activation function for our final layer.

Now think about what happens if we want to have more categories in our target (such as our 37 pet breeds). That means we'll need more activations than just a single column: we need an activation per category. We can create, for instance, a neural net that predicts 3s and 7s and returns two activations, one for each class; this will be a good first step toward the more general approach.

Let's just use some random numbers with a standard deviation of 2 (so we multiply randn by 2) for this example, assuming we have 6 images and 2 possible categories (where the first column represents 3s and the second is 7s):

acts = torch.randn((6,2))*2
acts
tensor([[ 0.6734,  0.2576],
        [ 0.4689,  0.4607],
        [-2.2457, -0.3727],
        [ 4.4164, -1.2760],
        [ 0.9233,  0.5347],
        [ 1.0698,  1.6187]])

We can't just take the sigmoid of this directly, since we don't get rows that add to 1 (i.e., we want the probability of being a 3 plus the probability of being a 7 to add up to 1):

acts.sigmoid()
tensor([[0.6623, 0.5641],
        [0.6151, 0.6132],
        [0.0957, 0.4079],
        [0.9881, 0.2182],
        [0.7157, 0.6306],
        [0.7446, 0.8346]])

In the MNIST example from the previous chapter, our neural net created a single activation per image, which we passed through the sigmoid function.

Binary problems are a special case of classification problems, because the target can be treated as a single boolean value, as we did in mnist_loss.

But binary problems can also be thought of in the context of the more general group of classifiers with any number of categories: in this case, we happen to have two categories. As we saw in the bear classifier, our neural net will return one activation per category.

Since this is just another way of representing the same problem, we would expect to be able to use sigmoid directly on the two-activation version of our neural net.

We can just take the difference between the neural net activations, because that reflects how much more sure we are of the input being a 3 than a 7, and then take the sigmoid of that:

(acts[:,0]-acts[:,1]).sigmoid()
tensor([0.6025, 0.5021, 0.1332, 0.9966, 0.5959, 0.3661])

The second column (the probability of it being a 7) will then just be that value subtracted from 1. Now, we need a way to do all this that also works for more than two columns. It turns out that this function, called softmax, is exactly that:

def softmax(x): return exp(x) / exp(x).sum(dim=1, keepdim=True)
  • Exponential function (exp): Literally defined as e**x, where e is a special number approximately equal to 2.718. It is the inverse of the natural logarithm function. Note that exp is always positive, and it increases very rapidly!

Let's check that softmax returns the same values as sigmoid for the first column, and those values subtracted from 1 for the second column:

sm_acts = torch.softmax(acts, dim=1)
sm_acts
tensor([[0.6025, 0.3975],
        [0.5021, 0.4979],
        [0.1332, 0.8668],
        [0.9966, 0.0034],
        [0.5959, 0.4041],
        [0.3661, 0.6339]])
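
As a quick check (not from the book), the first softmax column does match the sigmoid of the difference of activations we computed above, and each row sums to 1:

torch.allclose(sm_acts[:,0], (acts[:,0]-acts[:,1]).sigmoid())   # expect True
sm_acts.sum(dim=1)                                              # expect a column of ones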

softmax is the multi-category equivalent of sigmoid—we have to use it any time we have more than two categories and the probabilities of the categories must add to 1, and we often use it even when there are just two categories, just to make things a bit more consistent.

We could create other functions that have the properties that all activations are between 0 and 1, and sum to 1; however, no other function has the same relationship to the sigmoid function, which we've seen is smooth and symmetric.

What does this function do in practice? Taking the exponential ensures all our numbers are positive, and then dividing by the sum ensures we are going to have a bunch of numbers that add up to 1. The exponential also has a nice property: if one of the numbers in our activations x is slightly bigger than the others, the exponential will amplify this (since it grows, well... exponentially), which means that in the softmax, that number will be closer to 1.

Intuitively, the softmax function really wants to pick one class among the others, so it's ideal for training a classifier when we know each picture has a definite label.

Note: it may be less ideal during inference, as you might want your model to sometimes tell you it doesn’t recognize any of the classes that it has seen during training, and not pick a class because it has a slightly bigger activation score. In this case, it might be better to train a model using multiple binary output columns, each using a sigmoid activation.

Softmax is the first part of the cross-entropy loss—the second part is log likelihood.

Log Likelihood

When we calculated the loss for our MNIST example in the last chapter we used:

def mnist_loss(inputs, targets):
    inputs = inputs.sigmoid()
    return torch.where(targets==1, 1-inputs, inputs).mean()
  • Just as we moved from sigmoid to softmax, we need to extend the loss function to work with more than just binary classification—it needs to be able to classify any number of categories (in this case, we have 37 categories).
  • Our activations, after softmax, are between 0 and 1, and sum to 1 for each row in the batch of predictions. Our targets are integers between 0 and 36.
  • In the binary case, we used torch.where to select between inputs and 1-inputs.
  • When we treat a binary classification as a general classification problem with two categories, it actually becomes even easier, because (as we saw in the previous section) we now have two columns, containing the equivalent of inputs and 1-inputs.
  • So, all we need to do is select from the appropriate column.
  • Let's try to implement this in PyTorch.
targ = tensor([0,1,0,1,1,0])

and these are the softmax activations:

sm_acts
tensor([[0.6025, 0.3975],
        [0.5021, 0.4979],
        [0.1332, 0.8668],
        [0.9966, 0.0034],
        [0.5959, 0.4041],
        [0.3661, 0.6339]])

Then for each item of targ we can use that to select the appropriate column of sm_acts using tensor indexing, like so:

idx = range(6)
sm_acts[idx, targ]
tensor([0.6025, 0.4979, 0.1332, 0.0034, 0.4041, 0.3661])

To see exactly what's happening here, let's put all the columns together in a table. Here, the first two columns are our activations, then we have the targets, the row index, and finally the result shown immediately above:

3 7 targ idx loss
0.602469 0.397531 0 0 -0.602469
0.502065 0.497935 1 1 -0.497935
0.133188 0.866811 0 2 -0.133188
0.996640 0.003360 1 3 -0.003360
0.595949 0.404051 1 4 -0.404051
0.366118 0.633882 0 5 -0.366118

Looking at this table, you can see that the final column can be calculated by taking the targ and idx columns as indices into the two-column matrix containing the 3 and 7 columns. That's what sm_acts[idx, targ] is actually doing.

  • The really interesting thing here is that this actually works just as well with more than two columns.
  • To see this, consider what would happen if we added an activation column for every digit (0 through 9), and then targ contained a number from 0 to 9.
  • As long as the activation columns sum to 1 (as they will, if we use softmax), then we'll have a loss function that shows how well we're predicting each digit, as the short sketch after this list illustrates.
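
Here is that tiny sketch (an illustration with made-up numbers, not from the book), using 10 classes instead of 2:

acts10 = torch.randn(4, 10)           # pretend activations for 4 images, 10 digits
sm10 = torch.softmax(acts10, dim=1)   # each row sums to 1
targ10 = tensor([3, 0, 9, 2])         # made-up correct labels
sm10[range(4), targ10]                # probability assigned to each correct label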

We're only picking the loss from the column containing the correct label. We don't need to consider the other columns, because by the definition of softmax, they add up to 1 minus the activation corresponding to the correct label. Therefore, making the activation for the correct label as high as possible must mean we're also decreasing the activations of the remaining columns.

PyTorch provides a function that does exactly the same thing as sm_acts[range(n), targ] (except it takes the negative, because when applying the log afterward, we will have negative numbers), called nll_loss (NLL stands for negative log likelihood):

-sm_acts[idx, targ]
tensor([-0.6025, -0.4979, -0.1332, -0.0034, -0.4041, -0.3661])
F.nll_loss(sm_acts, targ, reduction='none')
tensor([-0.6025, -0.4979, -0.1332, -0.0034, -0.4041, -0.3661])

Despite its name, this PyTorch function does not take the log. We'll see why in the next section, but first, let's see why taking the logarithm can be useful.

Taking the Log

The function we saw in the previous section works quite well as a loss function, but we can make it a bit better.

The problem is that we are using probabilities, and probabilities cannot be smaller than 0 or greater than 1.

  • That means that our model will not care whether it predicts 0.99 or 0.999.
  • Indeed, those numbers are so close together—but in another sense, 0.999 is 10 times more confident than 0.99.

So, we want to transform our numbers between 0 and 1 to instead be between negative infinity and 0.

There is a mathematical function that does exactly this: the logarithm (available as torch.log).

  • It is not defined for numbers less than 0, and looks like this:
plot_function(torch.log, min=0,max=4)

Does "logarithm" ring a bell? The logarithm function has this identity:

y = b**a
a = log(y,b)

In this case, we're assuming that log(y,b) returns log y base b. However, PyTorch actually doesn't define log this way: log in Python uses the special number e (2.718...) as the base.

The key thing to know about logarithms is this relationship:

log(a*b) = log(a)+log(b)

When we see it in that format, it looks a bit boring;

  • but think about what this really means.
  • It means that logarithms increase linearly when the underlying signal increases exponentially or multiplicatively.

This is used, for instance, in the Richter scale of earthquake severity, and the dB scale of noise levels.

  • It's also often used on financial charts, where we want to show compound growth rates more clearly.
  • Computer scientists love using logarithms, because it means that multiplication, which can create really really large and really really small numbers, can be replaced by addition, which is much less likely to result in scales that are difficult for our computers to handle.
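
A quick numeric check of that identity (not from the book):

a, b = tensor(3.), tensor(7.)
torch.log(a*b), torch.log(a) + torch.log(b)   # both equal log(21), about 3.0445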

Taking the mean of the positive or negative log of our probabilities (depending on whether it's the correct or incorrect class) gives us the negative log likelihood loss.

In PyTorch, nll_loss assumes that you already took the log of the softmax, so it doesn't actually do the logarithm for you.

When we first take the softmax, and then the log likelihood of that, that combination is called cross-entropy loss. In PyTorch, this is available as nn.CrossEntropyLoss (which, in practice, actually does log_softmax and then nll_loss):

loss_func = nn.CrossEntropyLoss()

As you see, this is a class. Instantiating it gives you an object which behaves like a function:

loss_func(acts, targ)
tensor(1.8045)

All PyTorch loss functions are provided in two forms, the class just shown above, and also a plain functional form, available in the F namespace:

F.cross_entropy(acts, targ)
tensor(1.8045)

By default PyTorch loss functions take the mean of the loss of all items. You can use reduction='none' to disable that:

nn.CrossEntropyLoss(reduction='none')(acts, targ)
tensor([0.5067, 0.6973, 2.0160, 5.6958, 0.9062, 1.0048])
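
As a quick check (not from the book), we can confirm that cross-entropy really is log_softmax followed by nll_loss:

log_sm_acts = torch.log_softmax(acts, dim=1)
F.nll_loss(log_sm_acts, targ, reduction='none')   # matches the values just above
F.cross_entropy(acts, targ, reduction='none')     # same values again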

Important: An interesting feature about cross-entropy loss appears when we consider its gradient. The gradient of cross_entropy(a,b) is just softmax(a)-b. Since softmax(a) is just the final activation of the model, that means that the gradient is proportional to the difference between the prediction and the target. This is the same as mean squared error in regression (assuming there's no final activation function such as that added by y_range), since the gradient of (a-b)**2 is 2(a-b). Because the gradient is linear, we won't see sudden jumps or exponential increases in gradients, which should lead to smoother training of models.
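
The gradient claim can be checked with autograd; this is a sketch (not from the book), assuming we undo the mean reduction by multiplying by the batch size:

acts_g = acts.clone().requires_grad_(True)
F.cross_entropy(acts_g, targ).backward()    # default reduction='mean'
acts_g.grad * len(targ)                     # undo the 1/N factor from the mean
torch.softmax(acts, dim=1) - F.one_hot(targ, 2).float()   # softmax(a) - b, same values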

We have now seen all the pieces hidden behind our loss function. But while this puts a number on how well (or badly) our model is doing, it does nothing to help us know if it's actually any good. Let's now see some ways to interpret our model's predictions.

Model Interpretation

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

Oh dear—in this case, a confusion matrix is very hard to read. We have 37 different breeds of pet, which means we have 37×37 entries in this giant matrix! Instead, we can use the most_confused method, which just shows us the cells of the confusion matrix with the most incorrect predictions (here, at least 5):

interp.most_confused(min_val=5)
[('Ragdoll', 'Birman', 10),
 ('Bengal', 'Egyptian_Mau', 7),
 ('american_pit_bull_terrier', 'staffordshire_bull_terrier', 6)]

A little bit of Googling tells us that the most common category errors shown here are actually breed differences that even expert breeders sometimes disagree about.

Improving Our Model

In the rest of this chapter we will explain a little bit more about transfer learning and how to fine-tune our pretrained model as well as possible, without breaking the pretrained weights.

The first thing we need to set when training a model is the learning rate.

The Learning Rate Finder

One of the most important things we can do when training a model is to make sure that we have the right learning rate. If our learning rate is too low, it can take many, many epochs to train our model. Not only does this waste time, but it also means that we may have problems with overfitting, because every time we do a complete pass through the data, we give our model a chance to memorize it.

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1, base_lr=0.1)
epoch train_loss valid_loss error_rate time
0 2.557064 4.890897 0.518268 00:17
epoch train_loss valid_loss error_rate time
0 3.624976 1.880442 0.502706 00:19
  • That doesn't look good. Here's what happened.
  • The optimizer stepped in the correct direction, but it stepped so far that it totally overshot the minimum loss.
  • Repeating that multiple times makes it get further and further away, not closer and closer!

In 2015 the researcher Leslie Smith came up with a brilliant idea, called the learning rate finder.

His idea was to start with a very, very small learning rate, something so small that we would never expect it to be too big to handle.

  • We use that for one mini-batch, find what the losses are afterwards, and then increase the learning rate by some percentage (e.g., doubling it each time).
  • Then we do another mini-batch, track the loss, and double the learning rate again.
  • We keep doing this until the loss gets worse, instead of better.
  • This is the point where we know we have gone too far.
  • We then select a learning rate a bit lower than this point.

Our advice is to pick either:

  • One order of magnitude less than where the minimum loss was achieved (i.e., the minimum divided by 10)
  • The last point where the loss was clearly decreasing

Both these rules usually give around the same value.

Earlier, when we trained our first model, we didn't specify a learning rate, and instead used the default value from the fastai library (which is 1e-3). Let's now use the learning rate finder to pick a better one:

learn = cnn_learner(dls, resnet34, metrics=error_rate)
lr_min,lr_steep = learn.lr_find()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-77-24d2d25f8259> in <module>
      1 learn = cnn_learner(dls, resnet34, metrics=error_rate)
----> 2 lr_min,lr_steep = learn.lr_find()

ValueError: not enough values to unpack (expected 2, got 1)

This fails with recent versions of fastai because lr_find now returns a single SuggestedLRs object (just the "valley" suggestion) by default, so it can't be unpacked into two values. With a fastai version that returns both suggestions, the book's code prints the two values:

print(f"Minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")
Minimum/10: 1.00e-02, steepest point: 5.25e-03
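
As a workaround (an assumption about recent fastai versions, not code from the book), you can ask lr_find explicitly for those two suggestions via its suggest_funcs argument, which accepts the minimum and steep suggestion functions:

lrs = learn.lr_find(suggest_funcs=(minimum, steep))
print(f"Minimum/10: {lrs.minimum:.2e}, steepest point: {lrs.steep:.2e}")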

  • We can see on this plot that in the range 1e-6 to 1e-3, nothing really happens and the model doesn't train.
  • Then the loss starts to decrease until it reaches a minimum, and then increases again.
  • We don't want a learning rate greater than 1e-1 as it will give a training that diverges like the one before (you can try for yourself), but 1e-1 is already too high: at this stage we've left the period where the loss was decreasing steadily.

In this learning rate plot it appears that a learning rate around 3e-3 would be appropriate, so let's choose that:

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2, base_lr=3e-3)
epoch train_loss valid_loss error_rate time
0 1.304318 0.356317 0.108254 00:17
epoch train_loss valid_loss error_rate time
0 0.542090 0.335903 0.099459 00:19
1 0.335094 0.241663 0.075101 00:19

Focus on the error_rate column: it is our metric, while the losses are just what the optimizer uses.

Note: Logarithmic Scale: The learning rate finder plot has a logarithmic scale, which is why the middle point between 1e-3 and 1e-2 is between 3e-3 and 4e-3. This is because we care mostly about the order of magnitude of the learning rate.

Unfreezing and Transfer Learning

Our challenge when fine-tuning is to replace the random weights in our added linear layers with weights that correctly achieve our desired task (classifying pet breeds) without breaking the carefully pretrained weights of the other layers.

There is actually a very simple trick to allow this to happen: tell the optimizer to only update the weights in those randomly added final layers.

Don't change the weights in the rest of the neural network at all.

This is called freezing those pretrained layers.

When we create a model from a pretrained network fastai automatically freezes all of the pretrained layers for us. When we call the fine_tune method fastai does two things:

  • Trains the randomly added layers for one epoch, with all other layers frozen
  • Unfreezes all of the layers, and trains them all for the number of epochs requested
learn.fine_tune??
Signature:
learn.fine_tune(
    epochs,
    base_lr=0.002,
    freeze_epochs=1,
    lr_mult=100,
    pct_start=0.3,
    div=5.0,
    lr_max=None,
    div_final=100000.0,
    wd=None,
    moms=None,
    cbs=None,
    reset_opt=False,
)
Source:   
@patch
@delegates(Learner.fit_one_cycle)
def fine_tune(self:Learner, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100,
              pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `Learner.freeze` for `freeze_epochs`, then with `Learner.unfreeze` for `epochs`, using discriminative LR."
    self.freeze()
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    base_lr /= 2
    self.unfreeze()
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
File:      ~/anaconda3/envs/csy/lib/python3.8/site-packages/fastai/callback/schedule.py
Type:      method

The fine_tune method has a number of parameters you can use to change its behavior, but it might be easiest for you to just call the underlying methods directly if you want to get some custom behavior

  • So let's try doing this manually ourselves. First of all we will train the randomly added layers for three epochs, using fit_one_cycle.
  • fit_one_cycle is the suggested way to train models without using fine_tune.
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)
epoch train_loss valid_loss error_rate time
0 1.144811 0.352312 0.104871 00:17
1 0.523307 0.293994 0.088633 00:17
2 0.325098 0.246233 0.074425 00:17

Then we'll unfreeze the model:

learn.unfreeze()

and run lr_find again, because having more layers to train, and weights that have already been trained for three epochs, means our previously found learning rate isn't appropriate any more:

learn.lr_find()
SuggestedLRs(valley=1.4454397387453355e-05)

Here we have a somewhat flat area before a sharp increase, and we should take a point well before that sharp increase—for instance, 1e-5. The point with the maximum gradient isn't what we look for here and should be ignored.

Let's train at a suitable learning rate:

learn.fit_one_cycle(6, lr_max=1e-5)
epoch train_loss valid_loss error_rate time
0 0.269834 0.237671 0.070365 00:19
1 0.242716 0.228370 0.064953 00:20
2 0.230646 0.225023 0.071042 00:19
3 0.197709 0.216351 0.068336 00:19
4 0.188970 0.214318 0.066306 00:19
5 0.180490 0.214578 0.066306 00:19

This has improved our model a bit, but there's more we can do. The deepest layers of our pretrained model might not need as high a learning rate as the last ones, so we should probably use different learning rates for those—this is known as using discriminative learning rates.

Discriminative Learning Rates

The later layers learn much more complex concepts, like "eye" and "sunset," which might not be useful in your task at all (maybe you're classifying car models, for instance). So it makes sense to let the later layers fine-tune more quickly than earlier layers.

fastai lets you pass a Python slice object anywhere that a learning rate is expected. The first value passed will be the learning rate in the earliest layer of the neural network, and the second value will be the learning rate in the final layer. The layers in between will have learning rates that are multiplicatively equidistant throughout that range. Let's use this approach to replicate the previous training, but this time we'll only set the lowest layer of our net to a learning rate of 1e-6; the other layers will scale up to 1e-4. Let's train for a while and see what happens:

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)
learn.unfreeze()
learn.fit_one_cycle(12, lr_max=slice(1e-6,1e-4))
epoch train_loss valid_loss error_rate time
0 1.173200 0.293170 0.100812 00:17
1 0.507859 0.241154 0.080514 00:17
2 0.325587 0.211869 0.070365 00:17
epoch train_loss valid_loss error_rate time
0 0.265301 0.207439 0.071042 00:19
1 0.248191 0.200019 0.065629 00:19
2 0.245084 0.195803 0.066306 00:19
3 0.206206 0.196536 0.062923 00:19
4 0.203215 0.196698 0.066982 00:19
5 0.169679 0.196622 0.064953 00:19
6 0.166411 0.181361 0.061570 00:19
7 0.155452 0.177654 0.062246 00:20
8 0.135744 0.176545 0.059540 00:19
9 0.133612 0.177804 0.061570 00:19
10 0.122526 0.179314 0.058187 00:19
11 0.122826 0.180650 0.058863 00:19

fastai can show us a graph of the training and validation loss:

learn.recorder.plot_loss()
  • As you can see, the training loss keeps getting better and better.
  • But notice that eventually the validation loss improvement slows, and sometimes even gets worse!
  • This is the point at which the model is starting to overfit.
  • In particular, the model is becoming overconfident of its predictions.
  • But this does not mean that it is getting less accurate, necessarily.
  • Take a look at the table of training results per epoch, and you will often see that the accuracy continues improving, even as the validation loss gets worse.
  • In the end what matters is your accuracy, or more generally your chosen metrics, not the loss.
  • The loss is just the function we've given the computer to help us to optimize.

Another decision you have to make when training the model is for how long to train for. We'll consider that next.

Selecting the Number of Epochs

if you find that you have overfit, what you should actually do is retrain your model from scratch, and this time select a total number of epochs based on where your previous best results were found.

If you have the time to train for more epochs, you may want to instead use that time to train more parameters—that is, use a deeper architecture.

Deeper Architectures

The ResNet architecture that we are using in this chapter comes in variants with 18, 34, 50, 101, and 152 layers, pretrained on ImageNet.

A larger version of a ResNet will always be able to give us a better training loss, but it can suffer more from overfitting, because it has more parameters to overfit with.

In general, a bigger model has the ability to better capture the real underlying relationships in your data, and also to capture and memorize the specific details of your individual images.

However, using a deeper model is going to require more GPU RAM, so you may need to lower the size of your batches to avoid an out-of-memory error. This happens when you try to fit too much inside your GPU and looks like:

Cuda runtime error: out of memory

The way to solve it is to use a smaller batch size, which means passing smaller groups of images at any given time through your model. You can pass the batch size you want to the call creating your DataLoaders with bs=.
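
For example (a sketch, not from the book's text), we could rebuild the DataLoaders for the pets DataBlock defined earlier with a smaller batch size:

dls_small = pets.dataloaders(path/"images", bs=32)   # the default batch size is 64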

Another trick that helps with deeper models is mixed-precision training, which uses half-precision floating point (fp16) where possible during training. To enable this feature in fastai, just add to_fp16() after your Learner creation.

from fastai.callback.fp16 import *
learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs=3)
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/csy/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
epoch train_loss valid_loss error_rate time
0 1.278750 0.353822 0.112991 00:19
1 0.604646 0.288618 0.098782 00:19
2 0.435047 0.319332 0.096076 00:19
epoch train_loss valid_loss error_rate time
0 0.272431 0.319684 0.086604 00:21
1 0.292204 0.460287 0.111637 00:21
2 0.246758 0.323761 0.094723 00:21
3 0.158297 0.239570 0.068336 00:21
4 0.079137 0.211848 0.058187 00:21
5 0.057617 0.208058 0.054804 00:21

This is useful to remember—bigger models aren't necessarily better models for your particular case! Make sure you try small models before you start scaling up.

Conclusion

The choices made in the implementation of cross-entropy loss are not the only possible ones, just as when we looked at regression we could choose between mean squared error and mean absolute difference (L1).

Questionnaire

  1. Why do we first resize to a large size on the CPU, and then to a smaller size on the GPU?
  • Resizing to a large size first gives the images spare margin to allow further augmentation transforms on their inner regions without creating empty zones; the GPU then applies all the destructive operations together, with a single interpolation at the end.
  2. If you are not familiar with regular expressions, find a regular expression tutorial, and some problem sets, and complete them. Have a look on the book's website for suggestions.
  • using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name')
  3. What are the two ways in which data is most commonly provided, for most deep learning datasets?
  • Individual files representing items of data, such as text documents or images, possibly organized into folders or with filenames representing information about those items
  • A table of data, such as in CSV format, where each row is an item which may include filenames providing a connection between the data in the table and data in other formats, such as text documents and images
  4. Look up the documentation for L and try using a few of the new methods that it adds.
  5. Look up the documentation for the Python pathlib module and try using a few methods of the Path class.
  6. Give two examples of ways that image transformations can degrade the quality of the data.
  • Rotation and zooming can leave empty zones (filled with padding) at the edges of the image.
  • Each interpolation step loses detail, so applying augmentation operations separately, with multiple interpolations, leaves the image less well defined (e.g., reflection padding artifacts, fine details disappearing).
  7. What method does fastai provide to view the data in a DataLoaders?
  8. What method does fastai provide to help you debug a DataBlock?
  9. Should you hold off on training a model until you have thoroughly cleaned your data?
  10. What are the two pieces that are combined into cross-entropy loss in PyTorch?
  11. What are the two properties of activations that softmax ensures? Why is this important?
  12. When might you want your activations to not have these two properties?
  13. Calculate the exp and softmax columns of <> yourself (i.e., in a spreadsheet, with a calculator, or in a notebook).
  14. Why can't we use torch.where to create a loss function for datasets where our label can have more than two categories?
  15. What is the value of log(-2)? Why?
  16. What are two good rules of thumb for picking a learning rate from the learning rate finder?
  17. What two steps does the fine_tune method do?
  18. In Jupyter Notebook, how do you get the source code for a method or function?
  19. What are discriminative learning rates?
  20. How is a Python slice object interpreted when passed as a learning rate to fastai?
  21. Why is early stopping a poor choice when using 1cycle training?
  22. What is the difference between resnet50 and resnet101?
  23. What does to_fp16 do?