Goal

The goal of this exercise is to train a (deep) neural net using transfer learning to classify brand logo images. Specifically, the network should detect whether an image contains the Coca-Cola or the Pepsi logo.

To easily train our model, we rely on the open-source Python library keras, which is available in R through the keras package.

# load keras
library(keras)

# for reproducibility
myseed <- 42
set.seed(myseed)
# note: full determinism would additionally require disabling the GPU and
# parallel CPU operations (i.e., disable_gpu = TRUE, disable_parallel_cpu = TRUE)
use_session_with_seed(myseed, disable_gpu = FALSE, disable_parallel_cpu = FALSE)
## Set session seed to 42

The data

The dataset is based on the Flickr Logos 27 dataset [1]. After some preprocessing (see flickr27_data_preprocessing.R), 120 training and 40 validation images were sampled for both Pepsi and Coke (i.e., a 75:25 train/validation split).

Let’s check out some of the images.

Preparation

The data is stored in different folders for training, validation, and test, and each folder contains a subfolder with the images of each brand (e.g., Pepsi and Cocacola). Thus, we have to set the corresponding directories first.

# set data directories (i.e., location of images)
#
# beware of the folder structure!
# the labels (i.e., brand names) will be inferred from the folder names
train_directory <- "flickr27/training/"
val_directory <- "flickr27/validation/"
test_directory <- "flickr27/test/"

# how many images in the training, validation, and test set
# (note: these counts include all subfolders, see below)
train_samples <- length(list.files(path = train_directory, recursive = TRUE))
validation_samples <- length(list.files(path = val_directory, recursive = TRUE))
test_samples <- length(list.files(path = test_directory, recursive = TRUE))

Next, we have to decide on the image size and the batch size (because we will sample subsets of images for training, see below).

# set parameters: image width and height to use for the model
img_width <- 100
img_height <- 100

# batch size: how many samples (i.e., images) should be used in one training iteration
batch_size <- 10

Because we have more than just a few images, it makes sense to sample images on the fly when they are needed instead of loading all images at once. To this end, we can define an image_data_generator function that performs specific operations when loading the images. Here, we just rescale the pixel values to lie between 0 and 1. Note that we could set many more parameters here if we wanted to do data augmentation (see the sketch below).

datagen_train <- image_data_generator(rescale = 1/255)
datagen_val <- image_data_generator(rescale = 1/255)
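
As a side note, here is a minimal sketch of what an augmenting training generator could look like (the specific parameter values are illustrative, not tuned; all of them are regular image_data_generator arguments):

# hypothetical augmenting generator (not used in this exercise)
datagen_train_aug <- image_data_generator(
  rescale = 1/255,           # same scaling as above
  rotation_range = 20,       # random rotations of up to 20 degrees
  width_shift_range = 0.2,   # random horizontal shifts (fraction of width)
  height_shift_range = 0.2,  # random vertical shifts (fraction of height)
  zoom_range = 0.2,          # random zooming
  horizontal_flip = TRUE     # random horizontal flips
)

Whether horizontal flips are a good idea for logos that contain text is debatable; augmentation choices should always respect the invariances of the task.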

Note that there are three subfolders within the training, test, and validation folders (Cocacola, Pepsi, Heineken). We’ll need all three for a later example. However, for now we just want to use the images from Pepsi and Coke. Therefore, we define a list of classes we want to use in our model.

Furthermore, we call the flow_images_from_directory function, which samples images from a folder with the given parameters (i.e., data generator function, image size, batch size, …). Because we are doing a binary classification task, we also have to specify binary as class mode.

class_list <- c("Pepsi", "Cocacola")

train_generator <- flow_images_from_directory(train_directory, generator = datagen_train,
                                              target_size = c(img_width, img_height),
                                              class_mode = "binary", batch_size = batch_size,
                                              classes = class_list,
                                              seed = myseed)

validation_generator <- flow_images_from_directory(val_directory, generator = datagen_val,
                                                   target_size = c(img_width, img_height),
                                                   class_mode = "binary", batch_size = batch_size,
                                                   classes = class_list,
                                                   seed = myseed)

As a sanity check, we test whether all training images (120 per class) were detected and how the labels were coded.

# check label coding
train_generator$class_indices
## $Cocacola
## [1] 1
## 
## $Pepsi
## [1] 0
# note that coke is coded as 1 and pepsi as 0
table(train_generator$classes)
## 
##   0   1 
## 120 120
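
Also note that the generator only counts images of the classes in class_list, whereas the list.files counts above include all subfolders (i.e., also Heineken). If in doubt, the generator's own sample count can be inspected directly:

# total number of images the training generator draws from
train_generator$n
## [1] 240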

Build the neural network using transfer learning

Building a deep learning model includes two essential steps. First, you have to specify your model’s network architecture (i.e., define the layers). Second, you have to compile the model.

In this example, we make use of transfer learning. The idea of transfer learning in image classification is that we use the convolutional base of an existing network that was trained on a large dataset, typically on a large-scale image-classification task. If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world [2]. Its features can prove useful for many different new tasks, even though these new tasks may involve completely different classes than those of the original task. Typically, you want to use a network that was trained on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for your new task.

Choosing the convolutional base and adding model layers

As the convolutional base we use the very popular VGG16 network architecture. Luckily, VGG16 is pre-defined in keras, and we can load the net directly with the application_vgg16 function (note that keras offers many more pre-defined, popular network architectures). Because we want to replace the original classifier with a new one, we load VGG16 without the “top” part of the network (include_top = FALSE). Furthermore, we initialize the network with the imagenet weights, as described above.

# we use VGG16 as convolutional base with the imagenet weights
conv_base <- application_vgg16(
  weights = "imagenet",
  include_top = FALSE,
  input_shape = c(img_width, img_height, 3)
)

# inspect model
summary(conv_base)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## input_1 (InputLayer)             (None, 100, 100, 3)           0           
## ___________________________________________________________________________
## block1_conv1 (Conv2D)            (None, 100, 100, 64)          1792        
## ___________________________________________________________________________
## block1_conv2 (Conv2D)            (None, 100, 100, 64)          36928       
## ___________________________________________________________________________
## block1_pool (MaxPooling2D)       (None, 50, 50, 64)            0           
## ___________________________________________________________________________
## block2_conv1 (Conv2D)            (None, 50, 50, 128)           73856       
## ___________________________________________________________________________
## block2_conv2 (Conv2D)            (None, 50, 50, 128)           147584      
## ___________________________________________________________________________
## block2_pool (MaxPooling2D)       (None, 25, 25, 128)           0           
## ___________________________________________________________________________
## block3_conv1 (Conv2D)            (None, 25, 25, 256)           295168      
## ___________________________________________________________________________
## block3_conv2 (Conv2D)            (None, 25, 25, 256)           590080      
## ___________________________________________________________________________
## block3_conv3 (Conv2D)            (None, 25, 25, 256)           590080      
## ___________________________________________________________________________
## block3_pool (MaxPooling2D)       (None, 12, 12, 256)           0           
## ___________________________________________________________________________
## block4_conv1 (Conv2D)            (None, 12, 12, 512)           1180160     
## ___________________________________________________________________________
## block4_conv2 (Conv2D)            (None, 12, 12, 512)           2359808     
## ___________________________________________________________________________
## block4_conv3 (Conv2D)            (None, 12, 12, 512)           2359808     
## ___________________________________________________________________________
## block4_pool (MaxPooling2D)       (None, 6, 6, 512)             0           
## ___________________________________________________________________________
## block5_conv1 (Conv2D)            (None, 6, 6, 512)             2359808     
## ___________________________________________________________________________
## block5_conv2 (Conv2D)            (None, 6, 6, 512)             2359808     
## ___________________________________________________________________________
## block5_conv3 (Conv2D)            (None, 6, 6, 512)             2359808     
## ___________________________________________________________________________
## block5_pool (MaxPooling2D)       (None, 3, 3, 512)             0           
## ===========================================================================
## Total params: 14,714,688
## Trainable params: 14,714,688
## Non-trainable params: 0
## ___________________________________________________________________________

The next step is to add our own custom layers on top of the existing network. After flattening the extracted features from VGG16, we add a fully connected (dense) layer with 128 units (i.e., neurons). Because of the binary nature of our classification task, the last (output) layer uses a sigmoid activation function and only one unit.

The last step is to freeze the convolutional base so that only the weights of our new classifier are updated. Otherwise we would re-train the whole network, and the benefit of using a pre-trained, generic model as base would be lost.

# add your custom layers
model <- keras_model_sequential() %>%
  conv_base %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# freeze the convolutional base
cat("Trainable weights before freezing:",length(model$trainable_weights), "\n")
## Trainable weights before freezing: 30
freeze_weights(conv_base)

cat("Trainable weights after freezing:",length(model$trainable_weights), "\n")
## Trainable weights after freezing: 4
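
As an aside, keras also provides the inverse operation unfreeze_weights. If you later wanted to fine-tune the model, you could selectively unfreeze the top layers of the convolutional base; a minimal sketch (not used in this exercise; the layer name refers to the VGG16 summary above):

# hypothetical fine-tuning step: unfreeze only the last convolutional block;
# typically done after the new classifier has been trained, followed by
# re-compiling the model with a very low learning rate
unfreeze_weights(conv_base, from = "block5_conv1")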

We can plot the newly created model using Andrie de Vries’ deepviz package.

# devtools::install_github("andrie/deepviz")
deepviz::plot_model(model)

Compile the model

Next, we compile the model with the appropriate settings. We need to define the loss function, an optimizer, and the metric with which we want to monitor the training steps (here accuracy). Note that because of the binary classification, we have to specify binary_crossentropy for the loss function.

# compile the model
# 
# note that we use "binary_crossentropy" as loss function because of our binary
# classification task
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 5e-5, decay = 1e-6),
  metrics = "accuracy"
)

Train the model

Finally, we train our neural network. Note that depending on your hardware setup this may take a while. If you’re using RStudio, the training progress is automatically visualized.

# train the model (this may take some time ...)
# 
# note that the number of epochs defines how many full passes over the
# training data are done. we derive the steps from the generators' own sample
# counts (train_generator$n), because the list.files counts above also include
# the Heineken subfolder
hist <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = as.integer(train_generator$n/batch_size), 
  epochs = 10,
  validation_data = validation_generator,
  validation_steps = as.integer(validation_generator$n/batch_size)
)

# You can also plot the results of the training
plot(hist)
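
If training takes too long or you worry about overfitting, you can pass callbacks to fit_generator. A minimal sketch, assuming you want to stop once the validation loss stops improving and keep the best weights on disk (the file name is just an example):

# hypothetical callbacks (not used above)
my_callbacks <- list(
  callback_early_stopping(monitor = "val_loss", patience = 3),
  callback_model_checkpoint(filepath = "logo_model.h5",
                            monitor = "val_loss",
                            save_best_only = TRUE)
)
# pass callbacks = my_callbacks in the fit_generator() call above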

Model evaluation

After training the model, its performance can be evaluated and predictions can be made.

Make predictions

First, we want to make predictions on two simple images downloaded from Google. To this end, we define an image prediction function. The function does some image preprocessing (so that the data matches the format with which the network was trained) and predicts the probability that the image contains a Coke logo (remember that Coke was coded as 1 and Pepsi as 0, see above).

pred_img <- function(path) {
  img <- image_load(path, target_size = c(img_width, img_height))  # load & resize
  x <- image_to_array(img)             # convert to a 3D array
  x <- x/255                           # rescale as in training
  x <- array_reshape(x, c(1, dim(x)))  # add the batch dimension
  return(paste0(round(model %>% predict(x)*100, 3), "%"))
}

Let’s see the verdict.

# create a helper function to plot an image
img_plot <- function(path) {
  img <- image_load(path, target_size = c(img_width, img_height))
  x <- image_to_array(img)
  x <- x/255
  grid::grid.raster(x)
}
# reminder: Coca-Cola: 1, Pepsi: 0
# pred_img gives the probability of the image containing a Coca-Cola logo
img_plot("cocacola.jpg")

pred_img("cocacola.jpg")
## [1] "66.336%"
img_plot("pepsi.jpg")

pred_img("pepsi.jpg")
## [1] "0.001%"

Test set performance

We can also make predictions on the official test set of the Flickr 27 dataset. To this end, we first have to create another generator that samples the images from the corresponding directory. Then, we can evaluate the model’s performance on the test dataset.

Here are some of the test images.



Let’s see the verdict!

# evaluate the model on the test set (5 batches of 10 images each)
test_generator <- flow_images_from_directory(
  test_directory, generator = datagen_val,
  target_size = c(img_width, img_height),
  class_mode = "binary", batch_size = batch_size,
  classes = class_list,
  seed = myseed)

test_performance <- model %>% evaluate_generator(test_generator, steps = 5)
print(paste0("Test accuracy: ", round(test_performance$acc*100,4), "%"))
## [1] "Test accuracy: 80%"

  [1] The example is inspired by the book Deep Learning with R by François Chollet and J.J. Allaire.

  [2] Cf. Chollet & Allaire (2018). Deep Learning with R, p. 132. Manning.