A beginner’s step to Deep Learning — Dog breed classification

V Mukundan
7 min read · Jun 25, 2021


Figure 1: Of course there are a lot more than 100 breeds on earth!

In today's post, I will briefly explain a 'cool' project built with neural networks (NN), OpenCV, ImageNet and more. Given a picture, the goal is to detect either a human face or a dog in it and to predict the breed with good accuracy. It is nearly impossible for humans to know the anatomy and subtle differences among the 250 or more breeds on our planet!

Problem Statement

With ever-faster internet and never-before-seen memory capacities, the amount of information we consume as text, images, videos and other forms is gargantuan. With the rise of artificial intelligence and big data, the goal is to classify and analyze these oceans of data to draw out relevant information while automating the process. In this post, we delve into building and training a convolutional neural network (CNN), with a major chunk of the starter code provided by Udacity as part of a capstone project. In the spirit of classification, we piece together a series of nearly-perfect image classification algorithms to detect humans and dogs, with a target accuracy of 80%.

What is a CNN?

CNNs are regularized versions of multilayer perceptrons (MLPs) and are shift-invariant artificial neural networks. MLPs are fully connected and are prone to overfitting the data. Each neuron estimates an output by applying a specific function determined by weight and bias terms, and learning requires iteratively adjusting those weights and biases. CNNs are defined by many hyperparameters such as input channels, output channels, kernel dimensions, padding, stride and dilation. Each filter in a CNN is attributed to a certain feature. Each layer receives input from a restricted area of the previous layer and is responsible for delineating/learning a particular feature, starting with the simplest ones. The convolution operation in these layers reduces the number of free parameters, takes into account the spatial relationship between different features, and permits the network to be deeper. Pooling consists of downsampling features in sub-regions of a feature map.
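As a rough illustration of these hyperparameters, here is a minimal Keras sketch (not the project's code) of a single convolution-plus-pooling stage; the filter count and kernel size are arbitrary choices for demonstration.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# One convolution + pooling stage; sizes are illustrative only.
stage = Sequential([
    # 16 filters, 3x3 kernel, stride 1; 'same' padding preserves the spatial size,
    # dilation_rate 1 means an ordinary (non-dilated) convolution
    Conv2D(filters=16, kernel_size=3, strides=1, padding='same',
           dilation_rate=1, activation='relu', input_shape=(224, 224, 3)),
    # 2x2 max pooling downsamples each feature map by half in each dimension
    MaxPooling2D(pool_size=2),
])
stage.summary()  # note how few parameters the conv layer has compared to a dense layer
```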

Metrics:

The predictions from these models are used to calculate the accuracy score. Accuracy measures how often our model predicts the correct label. As we improve the algorithm, accuracy is found to increase.
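As a quick numeric sketch (with made-up labels), accuracy is simply the fraction of predictions that match the ground truth:

```python
import numpy as np

# Hypothetical breed indices for six test images
y_true = np.array([3, 1, 2, 2, 0, 1])
y_pred = np.array([3, 1, 2, 0, 0, 1])

accuracy = np.mean(y_pred == y_true)               # 5 correct out of 6
print('Accuracy: {:.1f}%'.format(accuracy * 100))  # -> 83.3%
```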

Problem solving strategy

In building the classifier, we follow seven steps: import the datasets, detect humans, detect dogs, create a CNN to classify dog breeds, train a CNN from scratch, use award-winning pre-trained NNs, adapt a CNN to our 'cute' problem via transfer learning, and finally test the CNN with a few interesting images. As with any task, no perfect algorithm exists, but decoding a NN architecture can be a great learning experience. We use Keras, an open-source Python library for NNs, with a TensorFlow backend, run in a Jupyter notebook environment.

Exploring the dataset (EDA)
The provided dataset has 8,351 dog images, which are divided into a training set to fit the model, a validation set to tune the parameters and finally a test set to estimate the accuracy. The number of canine breed classes is restricted to 133 for this project. The number of human images provided was 13,233. The dog dataset is split into 6,680, 835 and 836 images for the training, validation and test sets respectively.
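A minimal sketch of how the image paths and one-hot targets can be loaded, assuming a Udacity-style dogImages/train, dogImages/valid and dogImages/test folder layout (the paths and the helper name are assumptions):

```python
import numpy as np
from glob import glob
from sklearn.datasets import load_files
from keras.utils import to_categorical

def load_dataset(path, num_classes=133):
    # load_content=False: we only need the file names and integer targets
    data = load_files(path, load_content=False)
    files = np.array(data['filenames'])
    targets = to_categorical(np.array(data['target']), num_classes)
    return files, targets

train_files, train_targets = load_dataset('dogImages/train')
valid_files, valid_targets = load_dataset('dogImages/valid')
test_files, test_targets = load_dataset('dogImages/test')

# breed names come from the training sub-folder names
dog_names = [p.split('/')[-2] for p in sorted(glob('dogImages/train/*/'))]
print('%d breeds, %d train / %d valid / %d test images'
      % (len(dog_names), len(train_files), len(valid_files), len(test_files)))
```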

Data Visualization
OpenCV, a library of ML algorithms, was used here. The Haar feature-based classifier selects the best features (eyes, nose etc.) from a pool of 160,000+ candidates. To achieve both accuracy and speed, the AdaBoost and cascade-of-classifiers concepts are used. We use one of the many pre-trained detectors stored as XML files on OpenCV's GitHub. The code snippet shown below helps in detecting human faces in a sample image. The images are converted to grayscale, which is what the OpenCV function 'detectMultiScale' expects, and the detector returns bounding-box coordinates as a NumPy array.
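A minimal sketch of that face detector, assuming the cascade XML has been downloaded from OpenCV's repository into a local haarcascades/ folder:

```python
import cv2

face_cascade = cv2.CascadeClassifier('haarcascades/haarcascade_frontalface_alt.xml')

def face_detector(img_path):
    """Return True if at least one human face is found in the image."""
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # Haar cascades work on grayscale
    faces = face_cascade.detectMultiScale(gray)    # array of (x, y, w, h) boxes
    return len(faces) > 0
```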

Data preprocessing
This step detects dogs, and we use a pre-trained ResNet-50 model in Keras. We convert the RGB encoding to BGR and normalize with the weights obtained from the ImageNet database. ImageNet is a very popular dataset with over 10 million images classified into 1,000 categories, and its large-scale visual recognition competition, held annually, has produced sophisticated, groundbreaking CNN architectures. The classification labels between 151 and 268 correspond to different dog breeds. Keras with a TensorFlow backend requires a 4D array of (number of samples, rows, columns, channels) as input. The 'path_to_tensor' function takes a string-valued file path to a color image and returns a 4D tensor, which is fed as input to the Keras CNN. Using the predict method of ResNet-50, the model estimates the probability that the image belongs to each ImageNet category. By taking the argmax of the predicted probability vector, we obtain an integer corresponding to the model's predicted object class, which we can map to an object category through a lookup dictionary. The 'dog_detector' function is assessed on a smaller subset of 100 files from both the human and dog datasets.
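Below is a sketch of the preprocessing and dog-detection steps just described, closely following that workflow; treat the exact helper details as assumptions:

```python
import numpy as np
from keras.preprocessing import image
from keras.applications.resnet50 import ResNet50, preprocess_input

ResNet50_model = ResNet50(weights='imagenet')

def path_to_tensor(img_path):
    # load an RGB image, resize to 224x224 and add a batch dimension
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)        # shape (224, 224, 3)
    return np.expand_dims(x, axis=0)   # shape (1, 224, 224, 3)

def dog_detector(img_path):
    # preprocess_input converts RGB to BGR and subtracts the ImageNet channel means
    x = preprocess_input(path_to_tensor(img_path))
    prediction = np.argmax(ResNet50_model.predict(x))
    # ImageNet labels 151-268 correspond to dog breeds
    return 151 <= prediction <= 268
```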

Model Implementation

Figure 3: Model architecture

Here we build a base Sequential model with the details shown in figure 3 to attain >1% accuracy in 5 epochs. There are three convolutional layers, each followed by MaxPooling2D, ending in global average pooling. Pooling is a downsampling concept that reduces the spatial size of the representation, cutting the number of parameters, the memory use and the amount of computation, and helping control overfitting. This architecture takes in a 4D tensor of shape (1, 224, 224, 3) and predicts the canine breed with probabilities. We use the ReLU activation function and the RMSProp optimizer. After compiling, training the model and loading the weights with the best validation loss, we move to testing. This yielded a test accuracy of 11.8% with 100 epochs and 1.4% for 5 epochs. Training for 100 epochs took almost two hours even with GPU support, emphasizing the need for a faster approach that does not compromise test accuracy.
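A sketch of that from-scratch architecture; the filter counts follow the spirit of figure 3 but are assumptions rather than the exact values:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

model = Sequential([
    Conv2D(16, 2, padding='same', activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D(2),
    Conv2D(32, 2, padding='same', activation='relu'),
    MaxPooling2D(2),
    Conv2D(64, 2, padding='same', activation='relu'),
    MaxPooling2D(2),
    GlobalAveragePooling2D(),           # one value per feature map
    Dense(133, activation='softmax'),   # one probability per breed
])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```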

Figure 4: Training the model
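The training step of figure 4 can be sketched as follows; 'train_tensors', 'valid_tensors' and 'test_tensors' stand for the preprocessed 4D arrays from the earlier steps, and the checkpoint path is an assumption:

```python
import numpy as np
from keras.callbacks import ModelCheckpoint

# keep only the weights with the best validation loss
checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.from_scratch.hdf5',
                               monitor='val_loss', save_best_only=True, verbose=1)

model.fit(train_tensors, train_targets,
          validation_data=(valid_tensors, valid_targets),
          epochs=5, batch_size=20, callbacks=[checkpointer], verbose=1)

model.load_weights('saved_models/weights.best.from_scratch.hdf5')

# evaluate on the held-out test set
predictions = [np.argmax(model.predict(np.expand_dims(t, axis=0))) for t in test_tensors]
test_accuracy = 100 * np.mean(np.array(predictions) == np.argmax(test_targets, axis=1))
print('Test accuracy: %.1f%%' % test_accuracy)
```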

Hyperparameter Tuning and refinement

VGGNet is a prize-winning CNN from the 2014 ImageNet Large Scale Visual Recognition Challenge with an accuracy of 92.7%; it was developed by researchers at the University of Oxford to strengthen classification accuracy by increasing the depth of the CNN.
Here we use a pre-trained VGG16 to classify dog breeds via transfer learning. Bottleneck features are the outputs of a pre-trained model with its final classification layers removed; they are fed as the input to our own model, so the pre-trained network effectively works as a feature extractor. Here a small Sequential model with a global average pooling layer and a softmax layer is added on top. Compiling and training the model over 20 epochs yields a test accuracy of 41.5%.
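A sketch of this transfer-learning step, assuming the VGG16 bottleneck features have been pre-computed and saved to an .npz file (the file path is an assumption), and reusing the target arrays loaded earlier:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense

bottleneck = np.load('bottleneck_features/DogVGG16Data.npz')
train_VGG16, valid_VGG16, test_VGG16 = (bottleneck['train'],
                                        bottleneck['valid'],
                                        bottleneck['test'])

VGG16_model = Sequential([
    GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]),
    Dense(133, activation='softmax'),
])
VGG16_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                    metrics=['accuracy'])
VGG16_model.fit(train_VGG16, train_targets,
                validation_data=(valid_VGG16, valid_targets),
                epochs=20, batch_size=20, verbose=1)
```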

Results
Transfer learning is a research problem concerned with reusing the information from previously learned tasks to learn new tasks efficiently.
Other CNNs available with bottleneck features in Keras are VGG19, ResNet-50, Inception and Xception. Xception was chosen, topped with a global average pooling layer. This model was compiled and trained for 20 epochs, yielding a test accuracy of 85%. Next we build a function that extracts the bottleneck features for an image and returns the dog breed predicted by the model.
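That prediction function can be sketched as below; 'Xception_model' is the small GAP + softmax model trained on the Xception bottleneck features, and 'path_to_tensor' and 'dog_names' come from the earlier steps:

```python
import numpy as np
from keras.applications.xception import Xception, preprocess_input

def extract_Xception(tensor):
    # the pre-trained Xception without its top classifier acts as a feature extractor
    return Xception(weights='imagenet', include_top=False).predict(preprocess_input(tensor))

def Xception_predict_breed(img_path):
    bottleneck_feature = extract_Xception(path_to_tensor(img_path))
    predicted_vector = Xception_model.predict(bottleneck_feature)
    return dog_names[np.argmax(predicted_vector)]
```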

Figure 5: Higher accuracy attained

Model evaluation, validation & justification

The model was tested on a few other images and, surprisingly, did not achieve perfect classification. Here we used a cat image, two dog breeds of the southern Indian subcontinent ('Rajapalayam' and 'Kombai') and an image containing both a human and a dog. The results of the tests are shown in the figure below. The supplementary code is available here.
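These tests were run with a routine along the following lines, a minimal sketch that ties the detectors and the breed predictor together (helper names as defined earlier, test image path hypothetical):

```python
def classify_image(img_path):
    # dog -> predict its breed; human face -> the breed it most resembles; else neither
    if dog_detector(img_path):
        return 'Dog detected - predicted breed: %s' % Xception_predict_breed(img_path)
    if face_detector(img_path):
        return 'Human detected - this face resembles a %s' % Xception_predict_breed(img_path)
    return 'Neither a dog nor a human face was detected.'

print(classify_image('images/sample.jpg'))  # hypothetical test image
```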

Figure 6: A few interesting tests

Reflection and Improvement

Deep neural networks recognize patterns in multidimensional data surprisingly well, but it is not fully understood how they do it. The scope of the learned features is also restricted to the training datasets. Perhaps more images with more subtle features could help develop algorithms with higher accuracy (>90%). Some breeds may also be cross-breeds or sub-breeds. In this project, transfer learning, data augmentation and the bottleneck features of pre-trained networks were leveraged with the ResNet-50 and Xception models. I have used the CNN here as a black-box model, and further study is required to develop a theoretical understanding of these neural networks. Improvements in accuracy could be tested by probing other combinations of pre-trained networks such as ResNet-50, Inception and VGG19 along with the VGG16 and Xception explored here. It is definitely interesting to develop a CNN with a few lines of code that answers, with high accuracy, questions that are challenging for a human brain. I am interested in replicating this project on birds in the near future, to see whether the neural networks perform with the same or higher efficiency.
