In this post I would like to introduce some specific CNN implementations that offer image recognition very close to the state of the art. The example code is based on the MXNet framework combined with the Python language.

MXNet is an open source project, distributed under the Apache License Version 2.0, that results from a collaboration between research and development groups at important institutions such as CMU, NYU, NUS and MIT. It is a DMLC project. DMLC is a group of companies, laboratories and universities whose projects have for many years helped define the leading edge in the field of machine learning.

MXNet is essentially a framework for implementing deep learning solutions; it is natively developed in C++ and includes interfaces to the most popular languages and environments (e.g. Python, Go, R, Matlab, etc.). It runs efficiently both on CPUs and on high-performance GPU/CUDA systems, and can scale out on distributed architectures; it is also light enough to run on mobile devices. It is also an integral part of other frameworks, such as Turi.

MXNet is based on a combination of two programming approaches: imperative programming (the classic approach) and declarative, or symbolic, programming. In particular, symbolic programming lets us easily implement prototypes that are decoupled from the particular nature of the raw data, which brings more generality to the resulting models. A good introduction to MXNet is provided in: MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems.

How to use pre-trained models

This section describes how to use pre-trained models with MXNet. Using pre-trained models is easy and saves a lot of training time.

Researchers have demonstrated steady progress in computer vision by validating their work against ImageNet — an academic benchmark for computer vision. In the following example we learn how to load 3 pre-trained models available from DMLC MXNet Model Gallery. These models are in line with the state of the art and they are all trained on the ImageNet collection.

We start with the Inception-v3 model, which is trained for the ImageNet Large Scale Visual Recognition Challenge using the data from 2012 (ILSVRC 2012). This is a standard task in computer vision, where models try to classify entire images into 1000 classes. This model can be visualized with this schematic diagram:


Inception-v3 is able to achieve 76.88% Top-1 Accuracy and 93.344% Top-5 Accuracy on ILSVRC2012-Validation Set. Furthermore, in the 2015 ImageNet Challenge, an ensemble of 4 of these models came in 2nd in the image classification task.

In the following code blocks, we assume that the pre-trained files have already been downloaded and copied to our local disk (e.g. in the folder ./datasets/preloaded/).

# import mxnet
import mxnet as mx
# import numpy (np.argsort is used below to rank the predictions)
import numpy as np
# import image preprocessing functions
import preprocessing

# pretrained model file settings
prefix = "./datasets/preloaded/model_v3/Inception-7"

# get the first epoch
num_round = 1

# load model
model = mx.model.FeedForward.load(prefix=prefix, iteration=num_round)

# load textual tags for each class
# used below to translate the predicted classes in a human-readable way
synset = [l.strip() for l in open('./datasets/preloaded/model_v3/synset.txt').readlines()]

# test image preprocessing
batch = preprocessing.PreprocessImage_V3('./test-images/candle.jpg', True)
# batch = preprocessing.PreprocessImage_V3('./test-images/funny-cat.jpg', True)

# get prediction probability of 1000 classes from model
prob = model.predict(batch)[0]
# argsort, get prediction index from largest prob to lowest
pred = np.argsort(prob)[::-1]
# get top1 label
top1 = synset[pred[0]]
print("Top1: ", top1)
# get top5 label
top5 = [synset[pred[i]] for i in range(5)]
print("Top5: ", top5)

With reference to the previous code block: we start by calling the mx.model.FeedForward.load function with two input parameters. The first, prefix, is an arbitrary string used to build the names of the files that store the model to be loaded (e.g. prefix-symbol.json and prefix-{epoch}.params). The second, iteration, is the epoch number at which the model parameters were saved. An epoch is simply a unit of measure for the training of a neural network: one complete pass over the training set. You may train your own network for a number of epochs and check at the end whether training has produced positive results or not. Usually, during the training of a neural network, the learned parameters are periodically stored on disk (e.g. at the end of each epoch, every 10 epochs, and so on). For the model in question, DMLC provides the parameters saved at the first epoch; thus, we set num_round = 1 to load the model at that epoch.
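As a small illustration of the naming convention just described, the sketch below builds the two file names that MXNet looks for, given a prefix and an epoch. The helper name checkpoint_files is hypothetical (it is not part of MXNet); the four-digit zero padding of the epoch reflects how MXNet names its parameter files.

```python
def checkpoint_files(prefix, epoch):
    """Return the (symbol, params) file names derived from a prefix and an epoch.

    Hypothetical helper, for illustration only: it mirrors the
    'prefix-symbol.json' / 'prefix-{epoch}.params' convention, with the
    epoch zero-padded to four digits.
    """
    symbol_file = "%s-symbol.json" % prefix
    params_file = "%s-%04d.params" % (prefix, epoch)
    return symbol_file, params_file


symbol_file, params_file = checkpoint_files(
    "./datasets/preloaded/model_v3/Inception-7", 1)
print(symbol_file)
print(params_file)
```

With the prefix and epoch used in this post, the two files expected on disk are therefore Inception-7-symbol.json and Inception-7-0001.params.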

To obtain “human understandable” predictions, we also load into memory the synset list, which associates the progressive numeric index assigned to each class with a list of words that describe it. This list is used to translate the network output (one probability per class, which we sort to obtain an ordered sequence of class indices) into a list of understandable words describing each inferred class.
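To make the translation step concrete, here is a minimal, self-contained sketch that ranks a probability vector and maps the best indices to labels. The function name top_k_labels, the three-entry synset list and the probabilities are all made up for illustration; the real list has 1000 entries.

```python
import numpy as np


def top_k_labels(prob, synset, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(prob)[::-1][:k]
    return [(synset[i], float(prob[i])) for i in order]


# Toy data: three made-up classes instead of the 1000 ImageNet ones
synset = ["n00000001 cat", "n00000002 airplane", "n00000003 candle"]
prob = np.array([0.1, 0.2, 0.7])

for label, p in top_k_labels(prob, synset, k=2):
    print(label, p)
```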

To be processed, the images must meet specific requirements dictated by the network and are therefore modified through a specific pre-processing function preprocessing.PreprocessImage_V3.

The prediction is performed by invoking the model.predict function.

Now we can do some prediction using the following two test images:

Classification results of the candle’s image are:

('Top1: ', 'n02948072 candle, taper, wax light')
('Top5: ', ['n02948072 candle, taper, wax light', 'n02699494 altar', 'n03666591 lighter, light, igniter, ignitor', 'n03729826 matchstick', 'n04456115 torch'])

Classification results of the cat’s image are:

('Top1: ', 'n03126707 crane')
('Top5: ', ['n03126707 crane', 'n04467665 trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi', 'n02123045 tabby, tabby cat', 'n02123159 tiger cat', 'n04252225 snowplow, snowplough'])

Other pre-trained models, BN-Inception-v2 and Inception-21k [1], available from the DMLC model gallery, can be loaded with small changes to the previous code, as shown in the following code excerpts.


# pretrained model file settings
prefix = "./datasets/preloaded/model_bn/Inception_BN"
num_round = 39

# read textual tags for each class
# used below to translate the predicted classes in a human-readable way
synset = [l.strip() for l in open('./datasets/preloaded/model_bn/synset.txt').readlines()]

# preprocessing of an image
batch = preprocessing.PreprocessImage_BN('./test-images/candle.jpg', True)
# or batch = preprocessing.PreprocessImage_BN('./test-images/funny-cat.jpg', True)


# pretrained model file settings
prefix = "./datasets/preloaded/model_21k/Inception"
num_round = 9

# read textual tags for each class
# used below to translate the predicted classes in a human-readable way
synset = [l.strip() for l in open('./datasets/preloaded/model_21k/synset.txt').readlines()]

# preprocessing of an image
batch = preprocessing.PreprocessImage_21k('./test-images/candle.jpg', True)
# or batch = preprocessing.PreprocessImage_21k('./test-images/funny-cat.jpg', True)

All pre-trained models are based on CNN configurations of varying complexity and depth (the depth is the number of layers that compose them). The depth influences the network’s capacity to recognize and automatically extract the features that compose an image. Normally, after preparing a dataset that may contain images of all sizes, we proceed with simple tasks such as resizing and cropping each image. This treatment is external, i.e. it is done before the network’s training starts. Once pre-treated, the images are ready to be run through the network. Each of the pre-trained models requires a specific type of image processing. To simplify, I collected the treatment functions for each model in the preprocessing module, which you can request from me by filling out the contact form.
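The preprocessing module itself is not reproduced in this post, so the following is only a rough sketch of the kind of steps such a function might perform: a central square crop, a resize, a reordering of the axes to (channels, height, width), a normalization and the addition of a batch dimension. The function name preprocess, the 299×299 target size and the mean/scale value of 128 are assumptions made for illustration, and the nearest-neighbour resize is a simplification used to keep the example dependent on NumPy alone.

```python
import numpy as np


def preprocess(img, size=299, mean=128.0, scale=128.0):
    """Turn an H x W x 3 uint8 image into a 1 x 3 x size x size float32 batch.

    Illustrative only: size, mean and scale are assumed values, not the
    settings of the author's preprocessing module.
    """
    h, w = img.shape[:2]
    short = min(h, w)
    # central square crop
    top, left = (h - short) // 2, (w - short) // 2
    crop = img[top:top + short, left:left + short]
    # nearest-neighbour resize via integer index sampling
    idx = np.arange(size) * short // size
    resized = crop[idx][:, idx]
    # (height, width, channels) -> (channels, height, width)
    chw = resized.transpose(2, 0, 1).astype(np.float32)
    # shift and scale the pixel values to roughly [-1, 1]
    normed = (chw - mean) / scale
    # add the batch dimension expected by the network
    return normed[np.newaxis, ...]


# Random pixels stand in for a real photo
fake_image = np.random.randint(0, 256, size=(120, 200, 3), dtype=np.uint8)
batch = preprocess(fake_image)
print(batch.shape)
```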

A CNN network built from scratch for image recognition

At this point, the objective is to implement from scratch a demo CNN able to operate on two classes of images: cat and airplane. To create the training, validation and testing sets, I used the STL-10 dataset, which is a subset of the ImageNet collection. STL-10 consists of 13,000 labeled color images of 96 × 96 pixels, arranged in 10 classes, plus 100,000 unlabeled images usable for unsupervised learning. The labeled images are organized into 5000 images (500 per class) that can be used for training and 8000 images (800 per class) that can be used for validation and testing.

Training, validation and testing sets preparation

Starting from the functions collected in the STL-10 Utils module, I extracted the 2600 images belonging to the cat and airplane classes. 1820 images (70%) are used for the training set, 520 (20%) for the validation set and 260 (10%) for the testing set. Each set contains an equal number of images of both classes. I also created three “.lst” files for the training, validation and testing sets, each containing the list of images and the class each belongs to (0 for airplane and 1 for cat). Lastly, using the im2rec tool distributed with the MXNet framework, I packed the images of each set into a binary archive (with “.rec” extension). I configured im2rec to resize the images to 28 × 28 and convert them to grayscale while creating the archives.

The resulting datasets should be organized as follows:

  • train.lst: includes associations between classes and images of the training set
  • validation.lst: includes associations between classes and images of the validation set
  • test.lst: includes associations between classes and images of the testing set
  • train_28_gray.rec: archive of 28×28 grayscale images of training set
  • val_28_gray.rec: archive of 28×28 grayscale images of validation set
  • test_28_gray.rec: archive of 28×28 grayscale images of test set

If you are interested in the Python code for the image extraction and file generation, you can request it by filling out the contact form. You can also visit the MXNet Python Data Loading API for further information about using im2rec and the “.lst” file format.
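To give an idea of how the 70/20/10 split and the “.lst” lines can be produced, here is a minimal sketch. It is not the code used for this post: the function names split_balanced and lst_lines and the image file names are hypothetical, and the line layout (index, class, path separated by tabs) is the one im2rec expects.

```python
import random


def split_balanced(paths_by_class, fractions=(0.7, 0.2, 0.1), seed=100):
    """Split every class with the same fractions so each set stays balanced."""
    rng = random.Random(seed)
    splits = ([], [], [])  # train, validation, test
    for label, paths in sorted(paths_by_class.items()):
        paths = list(paths)
        rng.shuffle(paths)
        n_train = int(round(fractions[0] * len(paths)))
        n_val = int(round(fractions[1] * len(paths)))
        bounds = (0, n_train, n_train + n_val, len(paths))
        for part, (start, stop) in zip(splits, zip(bounds, bounds[1:])):
            part.extend((label, p) for p in paths[start:stop])
    return splits


def lst_lines(items):
    """Format (label, path) pairs as tab-separated 'index, class, path' lines."""
    return ["%d\t%d\t%s" % (i, label, path)
            for i, (label, path) in enumerate(items)]


# Hypothetical file names: 1300 images per class, 2600 in total
data = {0: ["airplane_%04d.png" % i for i in range(1300)],
        1: ["cat_%04d.png" % i for i in range(1300)]}
train_set, val_set, test_set = split_balanced(data)
print(len(train_set), len(val_set), len(test_set))
print(lst_lines(train_set[:2]))
```

Within each set the images are listed class by class; shuffling can be left to the tool or iterator that consumes the list.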

Creation and training of the CNN network

In the next code block we first import the necessary libraries, among which is, obviously, the Python wrapper for MXNet. Afterwards, we create two instances of the mx.io.ImageRecordIter class. ImageRecordIter objects are basically iterators over batches of images. The first iterator is configured to work on the training set, the second on the validation set. For the training set, the iterator is configured to randomly crop rand_crop = True and horizontally mirror rand_mirror = True the input images (this helps the network to learn in a more robust way); these treatments are not required on the validation set. On both sets (training and validation), we set the image size data_shape = (1,28,28), where 1 is the number of image channels (grayscale) and the following values are the height and width respectively.

We also set batch_size = 10 on both sets. batch_size is the number of input elements, i.e. images in this case, that are run through the network at a time (important for the training of a network). This parameter is set according to the available memory of your computer (memory of the video card if you work in GPU mode, or RAM if you work in CPU mode). Within reasonable ranges, changing this parameter has little effect on classification accuracy. A lower batch size will certainly make training slower, because fewer images are processed at a time, but this is in line with having more limited hardware.

# import libraries
import mxnet as mx
import seaborn as sns
# matplotlib is needed for the charts drawn further below
import matplotlib.pyplot as plt
import sys,logging,numpy,random,csv
from sklearn.metrics import roc_auc_score, auc, precision_recall_curve, roc_curve, average_precision_score

# iterator on the training set (with random crop and mirror)
train = mx.io.ImageRecordIter(
	path_imgrec = './datasets/stl10/images/train_28_gray.rec',
	data_shape = (1,28,28),
	batch_size = 10,
	rand_crop = True,
	rand_mirror = True
)

# iterator on the validation set (no augmentation)
val = mx.io.ImageRecordIter(
	path_imgrec = './datasets/stl10/images/val_28_gray.rec',
	rand_crop = False,
	rand_mirror = False,
	data_shape = (1,28,28),
	batch_size = 10
)

In the following block we configure the logger so that debug messages are displayed during training. For reproducibility of the results, we seed the random number generator with mx.random.seed(100). We also set devs = mx.cpu() to specify which processor to use (in my case the CPU). I suggest reading Run MXNet on Multiple CPU/GPUs with Data Parallel to better understand how to make the most of multi-core CPUs and GPUs.

# configure logger (debug level, to follow the training progress)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# set seeds for reproducibility
mx.random.seed(100)

# device used. CPU in my case.
devs = mx.cpu()

Finally, in the next code block, we define the architecture of the CNN (symbolic programming). We have 4 layers with trainable parameters: 2 convolutional layers followed by 2 FC (fully connected) layers. Each convolutional layer is followed by a tanh activation layer act_type="tanh" and by a max pooling layer pool_type="max". Note: in my experiments this network performs well if the activation layers are all tanh-based, or if the convolutional layers are ReLU-based and the FC layers are tanh-based.

All the pooling layers use a 2×2 region kernel=(2,2) and a stride of 2 stride=(2,2). Since the stride equals the kernel size, the pooling windows tile the activation map without overlapping; overlapping pooling, which has been reported to slightly improve network performance over the non-overlapping kind, would require a stride smaller than the kernel (e.g. a 3×3 kernel with a stride of 2). About the 2 FC layers: the first has 500 neurons num_hidden=500, whereas the last has 2 units num_hidden=2, corresponding to the number of classes of interest (cat and airplane).

A single phase of forward propagation involves the following main operations. The first convolutional layer accepts the 28x28x1 input image (the generic input of a convolutional layer is usually called a volume) and applies to it 20 filters num_filter=20, each of 5×5 size kernel=(5,5), with a stride of 1 (default setting) and no zero-padding (default setting), producing 20 activation maps of 24×24 size. In other words, the size of the output volume is 24x24x20. To calculate the size of the output volume, we can use the following formulas:

W_{2}=\frac{W_{1}-F+2P}{S}+1, \quad H_{2}=\frac{H_{1}-F+2P}{S}+1, \quad D_{2}=K

where W1, H1 are the width and height of the input volume, W2, H2 and D2 are the width, height and depth of the output volume, F is the side of the receptive field, S is the stride, P is the amount of zero-padding and K is the number of filters. In this particular case: W2=24, H2=24 and D2=20, with W1=H1=28, F=5, S=1, P=0, K=20.

The second convolutional layer takes as input the 12x12x20 volume obtained from the 24x24x20 volume by applying the tanh and max pooling functions (the latter is what reduces the dimensions). The second convolutional layer convolves the 12x12x20 volume with 50 filters of 5x5x20 size, a stride of 1 and no zero-padding, obtaining an 8x8x50 volume. Then, after the tanh activation, another pooling layer identical to the previous one is applied, which yields a 4x4x50 volume.

At this point the 4x4x50 volume is flattened into a vector of 800 values and the first of the 2 FC layers, which has 500 neurons, performs its normal work, i.e. the products and sums for the activations of its 500 neurons, producing an output vector of 1×500 size. The same occurs with the second and last FC layer, which has 2 units (the number of classes of interest) and therefore produces an output vector of 1×2 size.
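The sizes quoted in this walk-through can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: the helper names conv_out and pool_out are made up, and both use the usual (size - kernel + 2·padding) / stride + 1 rule with integer division.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Side of the output of a convolution along one spatial dimension."""
    return (size - kernel + 2 * pad) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Side of the output of a pooling layer along one spatial dimension."""
    return (size - kernel) // stride + 1


side = 28                      # input image: 28x28x1
side = conv_out(side, 5)       # 1st convolution, 20 filters -> 24x24x20
after_conv1 = side
side = pool_out(side)          # 1st pooling                 -> 12x12x20
after_pool1 = side
side = conv_out(side, 5)       # 2nd convolution, 50 filters -> 8x8x50
after_conv2 = side
side = pool_out(side)          # 2nd pooling                 -> 4x4x50
flat = side * side * 50        # values entering the 1st FC layer

print(after_conv1, after_pool1, after_conv2, side, flat)
```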

The softmax activation function of the output neurons NN_model=mx.symbol.SoftmaxOutput(data=fc2, name='softmax') maps each output into the range [0, 1] with the sum of all outputs equal to 1, which allows us to interpret the network response as probability estimates.
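For readers who want to see what softmax computes, here is a minimal NumPy sketch, independent of MXNet. The input scores are made-up values standing in for the two outputs of the last FC layer.

```python
import numpy as np


def softmax(scores):
    """Map raw scores to values in [0, 1] that sum to 1 along the last axis."""
    # subtracting the maximum does not change the result but avoids overflow
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# One sample with two made-up scores (airplane, cat)
scores = np.array([[2.0, -1.0]])
probs = softmax(scores)
print(probs, probs.sum())
```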

For more insights about mxnet.symbol.Convolution and parameters’ default values, please see MXNet Python Symbolic API.

data = mx.symbol.Variable('data')

# 1st convolutional layer
conv1 = mx.symbol.Convolution(data=data, kernel=(5,5), num_filter=20)
tanh1 = mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 = mx.symbol.Pooling(data=tanh1, pool_type="max", kernel=(2,2), stride=(2,2))

# 2nd convolutional layer
conv2 = mx.symbol.Convolution(data=pool1, kernel=(5,5), num_filter=50)
tanh2 = mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 = mx.symbol.Pooling(data=tanh2, pool_type="max",kernel=(2,2), stride=(2,2))

# 1st fully connected layer
flatten = mx.symbol.Flatten(data=pool2)
fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 = mx.symbol.Activation(data=fc1, act_type="tanh")
# 2nd fully connected layer
fc2 = mx.symbol.FullyConnected(data=tanh3, num_hidden=2)

# Output. Softmax output since we'd like to get some probabilities.
NN_model = mx.symbol.SoftmaxOutput(data=fc2, name='softmax')

This model can be visualized with this schematic diagram [2]:

After defining the network architecture, we create the model by instantiating the mx.model.FeedForward class. During initialization we specify the execution context ctx=devs, the network architecture symbol=NN_model, the number of epochs num_epoch=400 and the learning rate learning_rate=0.0001. The learning rate indicates what portion of the error gradient is applied when updating the weights of the network during gradient descent. Another specified parameter is the momentum momentum=0.9. Momentum (or “persistence of change of synaptic connections”) indicates the proportion of the last weight update that is carried over into the current weight update (during gradient descent). Momentum is an effective way to move a network towards a good generalization, but it can also be a problem if it carries us away from the optimum weights. Generally, the value is left at 0.9 because it has been shown to give good results. Finally, the last specified parameter is the weight decay wd = 0.00001; it is a parameter that scales the regularization term which appears in the formula of the cost function:

C(W,b)=\frac{1}{2 \cdot m}\cdot \sum_{i=1}^{m} [y^{(i)}-f_{W,b}(x^{(i)})]^{2}+wd\cdot \left \| W \right \|^{2}

where f_{W,b}(x) is the network output, y^{(i)} is the target output for the input x^{(i)}, and m is the number of training samples. W represents all the network weights and is regarded as a vector. The second addend is the regularization term. The weight decay (wd) regulates the effect of the regularization term: a high value makes the network unable to model non-linearities because it cancels the effect of a large part of the weights (in practice, it is as if we were using a simpler network), whereas lower values make the network more robust, moderately shrinking the weights while still allowing it to model non-linearities without running into overfitting. The weight decay thus influences the degrees of freedom of the network. Here too, it is usually left around a fixed value (0.0005 or nearby values).
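As a rough numerical illustration of the cost function and of the role of learning rate, momentum and weight decay, consider the NumPy sketch below. The function names are made up, and the update rule is the common textbook form of gradient descent with momentum (with the constant factor of the penalty gradient folded into wd); it is an assumption and is not guaranteed to match MXNet's optimizer line by line.

```python
import numpy as np


def regularized_cost(y, y_hat, weights, wd):
    """Half mean squared error plus the weight decay penalty wd * ||W||^2."""
    m = len(y)
    data_term = np.sum((y - y_hat) ** 2) / (2.0 * m)
    penalty = wd * np.sum(weights ** 2)
    return data_term + penalty


def momentum_step(w, grad, velocity, lr=0.0001, momentum=0.9, wd=0.00001):
    """One gradient descent update with momentum and weight decay."""
    # keep a fraction of the previous update, then move against the gradient
    velocity = momentum * velocity - lr * (grad + wd * w)
    return w + velocity, velocity


# Toy numbers, chosen only to make the arithmetic easy to follow
y = np.array([1.0, 0.0])
y_hat = np.array([0.5, 0.5])
weights = np.array([1.0, 2.0])
print(regularized_cost(y, y_hat, weights, wd=0.1))

w, v = np.array([1.0]), np.array([0.0])
w, v = momentum_step(w, np.array([1.0]), v, lr=0.1, momentum=0.9, wd=0.0)
print(w, v)
```

In the toy example the data term is 0.125 and the penalty is 0.5, which shows how a large wd can dominate the cost and push the weights towards zero.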

# create the model: execution context, architecture and training settings
model = mx.model.FeedForward(
	ctx = devs,
	symbol = NN_model,
	num_epoch = 400,
	learning_rate = 0.0001,
	momentum = 0.9,
	wd = 0.00001
)

The next step is to train the network by invoking the model.fit function. model.fit is a method of the mx.model.FeedForward class and requires certain parameters. First you need to provide the training set X=train and the validation set eval_data=val. The list of metrics eval_metric=['accuracy',roc_auc_score] determines which performance measures are computed. You can specify built-in metrics in eval_metric (e.g. accuracy, top_k_accuracy, f1, mae, mse, rmse, cross-entropy) or metrics implemented in external functions, such as roc_auc_score, which is imported from the sklearn.metrics package. The built-in accuracy metric is the number of correctly classified samples (true positives + true negatives) divided by the total number of samples. The roc_auc_score metric calculates the area under the ROC curve. At the end of each batch, we can hook in a tracking function batch_end_callback = mx.callback.Speedometer(batch_size = 10, frequent = 50), which displays the learning speed (number of samples processed per second) together with the values of these metrics on the training set. So, if we have 1000 samples and the batch size is 10, each epoch consists of 100 batches, and every 50 batches frequent = 50 the training statistics are shown. At the end of each epoch, the same metrics listed in eval_metric are also displayed for the validation set.

After each epoch, it is possible to invoke a checkpoint function that saves the parameters learned by the network to a file on disk: epoch_end_callback = mx.callback.do_checkpoint(prefix = "rot_tanh_28_5_400", period = 10), where prefix is an arbitrary string used to automatically generate the name of the checkpoint file (completed by a progressive numerical index) and period indicates every how many epochs the saving is performed. Saving the learned parameters to disk allows you to distribute pre-trained network models at the epoch considered most appropriate. To load a pre-trained model, we can proceed exactly as we did above for the DMLC models.

model.fit(X=train, eval_data=val, eval_metric=['accuracy', roc_auc_score],
	batch_end_callback=mx.callback.Speedometer(batch_size=10, frequent=50),
	epoch_end_callback=mx.callback.do_checkpoint(prefix="rot_tanh_28_5_400", period=10))
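To make the accuracy metric and the reporting frequency more tangible, the sketch below recomputes them by hand with NumPy. It is independent of MXNet: the function name accuracy, the labels and the probabilities are made up for illustration, and the second part simply repeats the arithmetic of the 1000-sample example.

```python
import numpy as np


def accuracy(labels, probs):
    """Fraction of samples whose most probable class equals the real class."""
    predicted = np.argmax(probs, axis=1)
    return float(np.mean(predicted == labels))


# Four made-up samples with two class probabilities each (airplane, cat)
labels = np.array([0, 1, 1, 0])
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],   # a cat mistaken for an airplane
                  [0.7, 0.3]])
print(accuracy(labels, probs))

# How often the statistics are reported during one epoch
num_samples, batch_size, frequent = 1000, 10, 50
batches_per_epoch = num_samples // batch_size
reports_per_epoch = batches_per_epoch // frequent
print(batches_per_epoch, reports_per_epoch)
```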

The two charts in the figure below allow us to quickly analyze the network performance on the training and validation sets in ROC (Receiver Operating Characteristic) space and in PR (Precision-Recall) space, with the AUC (Area Under the Curve) as the evaluation metric. For further information about ROC, PR and AUC, I suggest reading The Relationship Between Precision-Recall and ROC Curves.


The next excerpt contains the code for the generation of such charts.

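# read the real classes (assumed to be the 2nd tab-separated column of each ".lst" line)
y_train = []
with open('./datasets/stl10/images/train.lst','r') as csvfile:
	spamreader = csv.reader(csvfile, delimiter='\t')
	for row in spamreader:
		y_train.append(int(float(row[1])))
y_val = []
with open('./datasets/stl10/images/val.lst','r') as csvfile:
	spamreader = csv.reader(csvfile, delimiter='\t')
	for row in spamreader:
		y_val.append(int(float(row[1])))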


y_eval_pred = model.predict( val )[:,1]
y_train_pred = model.predict( train )[:,1]
t_fpr, t_tpr, _ = roc_curve( y_train, y_train_pred )
t_precision, t_recall, _ = precision_recall_curve( y_train, y_train_pred )
e_fpr, e_tpr, _ = roc_curve( y_val, y_eval_pred )
e_precision, e_recall, _ = precision_recall_curve( y_val, y_eval_pred )

plt.figure( figsize=(15, 6) )
# left chart: ROC curves
plt.subplot( 1, 2, 1 )
train_auc = numpy.around( auc(t_fpr, t_tpr), 4 )
eval_auc = numpy.around( auc(e_fpr, e_tpr), 4 )
plt.title( "ROC" )
plt.xlabel( "FPR" )
plt.ylabel( "TPR" )
plt.plot( t_fpr, t_tpr, alpha=0.6, c='m', label="Training AUC = {}".format( train_auc ) )
plt.plot( e_fpr, e_tpr, alpha=0.6, c='c', label="Evaluation AUC = {}".format( eval_auc ) )
plt.plot( [0,1], [0, 1], c='k', alpha=0.6 )
plt.legend( loc=2 )
# right chart: Precision-Recall curves
plt.subplot( 1, 2, 2 )
train_auc = numpy.around( average_precision_score( y_train, y_train_pred ), 4 )
eval_auc = numpy.around( average_precision_score( y_val, y_eval_pred ), 4 )
plt.title( "PR" )
plt.xlabel( "Recall" )
plt.ylabel( "Precision" )
plt.plot( t_recall, t_precision, alpha=0.6, c='m', label="Training AUC = {}".format( train_auc ) )
plt.plot( e_recall, e_precision, alpha=0.6, c='c', label="Evaluation AUC = {}".format( eval_auc ) )
plt.plot( [0,1], [0.5, 0.5], c='k', alpha=0.6 )
plt.ylim(0.5, 1.0)
plt.legend( loc=3 )
plt.show()

CNN network test

Finally, it’s time to try the predictive capabilities of the network. To do this, we can use the testing set (previously described), which consists of 260 images equally distributed between the two classes. This set includes images that have never been shown to the network. Images can be extracted from the binary “.rec” archive and, through the “.lst” files, we can trace back the class each one belongs to [3]. Iterating over the images, we invoke the model.predict function to get the probabilities of the two classes for each one, and by comparing predicted and real classes we obtain a concrete idea of the overall performance.

Briefly, these are the results I have achieved:

success: 237, total: 260, cat: 129, plane 108

Thus, out of 260 images about 91% (237 of 260) have been correctly recognized, with 99% of the cat images (129 of 130) and 83% of the airplane images (108 of 130) classified correctly. A very interesting result, and certainly one that can be improved by working on the network parameters. But this is another story.
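The comparison between predicted and real classes can be sketched as follows. This is not the code used to obtain the results above: the function name tally is hypothetical and the four predictions are made up; only the final percentages are recomputed from the counts reported in this post.

```python
import numpy as np


def tally(true_labels, probs, class_names=("plane", "cat")):
    """Count the correct predictions overall and for each class."""
    predicted = np.argmax(probs, axis=1)
    hits = predicted == true_labels
    report = {"success": int(hits.sum()), "total": int(len(true_labels))}
    for index, name in enumerate(class_names):
        report[name] = int(hits[true_labels == index].sum())
    return report


# Four made-up test images: two airplanes (class 0) and two cats (class 1)
true_labels = np.array([0, 0, 1, 1])
probs = np.array([[0.8, 0.2],
                  [0.4, 0.6],   # an airplane mistaken for a cat
                  [0.1, 0.9],
                  [0.3, 0.7]])
print(tally(true_labels, probs))

# Percentages derived from the counts reported in this post
overall = round(100.0 * 237 / 260)
cats = round(100.0 * 129 / 130)
planes = round(100.0 * 108 / 130)
print(overall, cats, planes)
```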

You can ask me for datasets, pre-trained models and custom software modules by filling out the contact form (this is just a way, completely free for you, for me to get to know my blog readers).

  1. This model is a pretrained model on full ImageNet dataset with 14,197,087 images in 21,841 classes.
  2. Visualizing CNN architectures side by side with mxnet
  3. These are referred to as the real classes of the images.

Posted by lorenzo

Full-time engineer. I like to write about data science and artificial intelligence.


  1. Hi Lorenzo,
    Thanks for the post. I’m working on a problem where I have to recognise products which have similar packaging but with different colours. I did train a VGG net and an Alexnet with my images with a test accuracy of 60%, however, when an unknown image is fed into the network for classification, the results are not great.
    I get only 2 out of 10 images correctly identified and sometimes none. Also sometimes It also considers background and gives me wrong results. For example, class A product colour is white and there’s a grey background and class B, product is grey in colour while the background is white. Now when I feed in an image from class B which is grey in colour, it gets identified as class A.
    Can you tell me how can I fix this ?

    PS: I’m working to get the bounding box and train the images with FAST RCNN which might work, but not sure on that either as there’s very little available on the internet.



    1. Hi Pramod.
      The first problem could be related to the image preprocessing (normalization and so on).
      About bounding box, lastly, I found very useful https://github.com/Microsoft/CNTK on a Kaggle problem.

