Transfer Learning with Convolutional Neural Networks in PyTorch

How to use a pre-trained convolutional neural network for object recognition with PyTorch

Although Keras is a great library with a simple API for building neural networks, the recent excitement about PyTorch finally got me interested in exploring this library. While I’m not one to blindly follow the hype, the adoption by researchers convinced me there must be something behind this new entry in deep learning.

Since the best way to learn a new technology is by using it to solve a problem, my efforts to learn PyTorch started out with a simple project: use a pre-trained convolutional neural network for an object recognition task. In this article, we’ll see how to use PyTorch to accomplish this goal, along the way, learning a little about the library and about the important concept of transfer learning.

While PyTorch might not be for everyone, at this point it’s impossible to say which deep learning library will come out on top, and being able to quickly learn and use different tools is crucial to succeed as a data scientist.

The complete code for this project is available as a Jupyter Notebook on GitHub. This project was born out of my participation in the Udacity PyTorch scholarship challenge.

Predicted from trained network

Approach to Transfer Learning

Our task will be to train a convolutional neural network (CNN) that can identify objects in images. We’ll be using the Caltech 101 dataset, which has images in 101 categories. Most categories have only 50 images, which typically isn’t enough for a neural network to learn to high accuracy. Therefore, instead of building and training a CNN from scratch, we’ll use a pre-built and pre-trained model, applying transfer learning.

The basic premise of transfer learning is simple: take a model trained on a large dataset and transfer its knowledge to a smaller dataset. For object recognition with a CNN, we freeze the early convolutional layers of the network and only train the last few layers which make a prediction. The idea is the convolutional layers extract general, low-level features that are applicable across images — such as edges, patterns, gradients — and the later layers identify specific features within an image such as eyes or wheels.

Thus, we can use a network trained on unrelated categories in a massive dataset (usually Imagenet) and apply it to our own problem because there are universal, low-level features shared between images. The images in the Caltech 101 dataset are very similar to those in the Imagenet dataset and the knowledge a model learns on Imagenet should easily transfer to this task.

Idea behind Transfer Learning (source).

Following is the general outline for transfer learning for object recognition:

  1. Load in a pre-trained CNN model trained on a large dataset
  2. Freeze parameters (weights) in model’s lower convolutional layers
  3. Add custom classifier with several layers of trainable parameters to model
  4. Train classifier layers on training data available for task
  5. Fine-tune hyperparameters and unfreeze more layers as needed

This approach has proven successful for a wide range of domains. It’s a great tool to have in your arsenal and generally the first approach that should be tried when confronted with a new image recognition problem.

Data Set Up

As with all data science problems, formatting the data correctly will determine the success or failure of the project. Fortunately, the Caltech 101 dataset images are clean and stored in the correct format. If we correctly set up the data directories, PyTorch makes it simple to associate the correct labels with each class. I separated the data into training, validation, and testing sets with a 50%, 25%, 25% split and then structured the directories as follows:
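
The layout follows the standard convention expected by PyTorch’s ImageFolder, with one folder per split and one subfolder per class (class names here are illustrative):

```
/datadir
    /train
        /class_1
        /class_2
        ...
    /valid
        /class_1
        ...
    /test
        /class_1
        ...
```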


The number of training images by classes is below (I use the terms classes and categories interchangeably):

Number of training images by category.

We expect the model to do better on classes with more examples because it can better learn to map features to labels. To deal with the limited number of training examples, we’ll use data augmentation during training (more on this shortly).

As another bit of data exploration, we can also look at the size distribution.

Distribution of average image sizes (in pixels) by category.

Imagenet models need an input size of 224 x 224 so one of the preprocessing steps will be to resize the images. Preprocessing is also where we will implement data augmentation for our training data.

Data Augmentation

The idea of data augmentation is to artificially increase the number of training images our model sees by applying random transformations to the images. For example, we can randomly rotate or crop the images or flip them horizontally. We want our model to distinguish the objects regardless of orientation, and data augmentation can make a model invariant to these transformations of the input data.

An elephant is still an elephant no matter which way it’s facing!

Image transformations of training data.

Augmentation is generally only done during training (although test time augmentation is also possible). Each epoch — one iteration through all the training images — a different random transformation is applied to each training image. This means that if we iterate through the data 20 times, our model will see 20 slightly different versions of each image. The overall result should be a model that learns the objects themselves and not how they are presented or artifacts in the image.

Image Preprocessing

This is the most important step of working with image data. During image preprocessing, we simultaneously prepare the images for our network and apply data augmentation to the training set. Each model will have different input requirements, but if we read through what the Imagenet models require, we find that our images need to be 224 x 224 and normalized with a set mean and standard deviation.

To process an image in PyTorch, we use transforms, simple operations applied to the image arrays. The validation (and testing) transforms are as follows:

  • Resize
  • Center crop to 224 x 224
  • Convert to a tensor
  • Normalize with mean and standard deviation

The end result of passing through these transforms is a tensor that can go into our network. The training transformations are similar but with the addition of random augmentations.

First up, we define the training and validation transformations:

Then, we create datasets and DataLoaders. By using datasets.ImageFolder to make a dataset, PyTorch will automatically associate images with the correct labels provided our directory is set up as above. The datasets are then passed to a DataLoader, an iterator that yields batches of images and labels.

We can see the iterative behavior of the DataLoader using the following:

# Iterate through the dataloader once
trainiter = iter(dataloaders['train'])
features, labels = next(trainiter)
features.shape, labels.shape
**(torch.Size([128, 3, 224, 224]), torch.Size([128]))**

The shape of a batch is (batch_size, color_channels, height, width). During training, validation, and eventually testing, we’ll iterate through the DataLoaders, with one pass through the complete dataset comprising one epoch. Every epoch, the training DataLoader will apply a slightly different random transformation to the images for training data augmentation.

Pre-Trained Models for Image Recognition

With our data in shape, we next turn our attention to the model. For this, we’ll use a pre-trained convolutional neural network. PyTorch has a number of models that have already been trained on millions of images from 1000 classes in Imagenet. The complete list of models can be seen here. The performance of these models on Imagenet is shown below:

Pretrained models in PyTorch and performance on Imagenet (Source).

For this implementation, we’ll be using the VGG-16. Although it didn’t record the lowest error, I found it worked well for the task and was quicker to train than other models. The process to use a pre-trained model is well-established:

  1. Load in pre-trained weights from a network trained on a large dataset
  2. Freeze all the weights in the lower (convolutional) layers: the layers to freeze are adjusted depending on similarity of new task to original dataset
  3. Replace the upper layers of the network with a custom classifier: the number of outputs must be set equal to the number of classes
  4. Train only the custom classifier layers for the task thereby optimizing the model for smaller dataset

Loading in a pre-trained model in PyTorch is simple:

from torchvision import models
model = models.vgg16(pretrained=True)

This model has over 130 million parameters, but we’ll train only the very last few fully-connected layers. Initially, we freeze all of the model’s weights:

# Freeze model weights
for param in model.parameters():
    param.requires_grad = False

Then, we add on our own custom classifier with the following layers:

  • Fully connected with ReLU activation, shape = (n_inputs, 256)
  • Dropout with 40% chance of dropping
  • Fully connected with log softmax output, shape = (256, n_classes)

import torch.nn as nn

# Add on classifier
model.classifier[6] = nn.Sequential(
                      nn.Linear(n_inputs, 256),
                      nn.ReLU(),
                      nn.Dropout(0.4),
                      nn.Linear(256, n_classes),
                      nn.LogSoftmax(dim=1))

When the extra layers are added to the model, they are set to trainable by default ( requires_grad=True ). For the VGG-16, we’re only changing the very last original fully-connected layer. All of the weights in the convolutional layers and the first 5 fully-connected layers are not trainable.

# Only training classifier[6]
  (0): Linear(in_features=25088, out_features=4096, bias=True)
  (1): ReLU(inplace)
  (2): Dropout(p=0.5)
  (3): Linear(in_features=4096, out_features=4096, bias=True)
  (4): ReLU(inplace)
  (5): Dropout(p=0.5)
  (6): Sequential(
    (0): Linear(in_features=4096, out_features=256, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.4)
    (3): Linear(in_features=256, out_features=100, bias=True)
    (4): LogSoftmax()
  )

The final outputs from the network are log probabilities for each of the 100 classes in our dataset. The model has a total of 135 million parameters, of which just over 1 million will be trained.

# Find total parameters and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
print(f'{total_params:,} total parameters.')
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f'{total_trainable_params:,} training parameters.')
**135,335,076 total parameters.
1,074,532 training parameters.**

Moving Model to GPU(s)

One of the best aspects of PyTorch is the ease of moving different parts of a model to one or more GPUs so you can make full use of your hardware. Since I’m using 2 GPUs for training, I first move the model to cuda and then create a DataParallel model distributed over the GPUs:

# Move to gpu
model ='cuda')
# Distribute across 2 gpus
model = nn.DataParallel(model)

(This notebook should be run on a GPU to complete in a reasonable amount of time. The speedup over a CPU can easily be 10x or more.)

Training Loss and Optimizer

The training loss (the error or difference between predictions and true values) is the negative log likelihood (NLL). (The NLL loss in PyTorch expects log probabilities, which is exactly what our model’s final LogSoftmax layer outputs.) PyTorch uses automatic differentiation, which means that tensors keep track of not only their value, but also every operation (multiply, addition, activation, etc.) which contributes to the value. This means we can compute the gradient for any tensor in the network with respect to any prior tensor.

What this means in practice is that the loss tracks not only the error, but also the contribution to the error by each weight and bias in the model. After we calculate the loss, we can then find the gradients of the loss with respect to each model parameter, a process known as backpropagation. Once we have the gradients, we use them to update the parameters with the optimizer. (If this doesn’t sink in at first, don’t worry, it takes a little while to grasp! This powerpoint helps to clarify some points.)
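
As a tiny illustration of autograd (a toy example, not from the notebook):

```python
import torch

# A tensor that tracks the operations contributing to its value
x = torch.tensor(2.0, requires_grad=True)
loss = x ** 2 + 3 * x

# Backpropagation: compute d(loss)/dx = 2x + 3
loss.backward()
print(x.grad)  # tensor(7.)
```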

The optimizer is Adam, an efficient variant of gradient descent that generally does not require hand-tuning the learning rate. During training, the optimizer uses the gradients of the loss to try and reduce the error (“optimize”) of the model output by adjusting the parameters. Only the parameters we added in the custom classifier will be optimized.

The loss and optimizer are initialized as follows:

from torch import optim
# Loss and optimizer
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters())

With the pre-trained model, the custom classifier, the loss, the optimizer, and most importantly, the data, we’re ready for training.


Training

Model training in PyTorch is a little more hands-on than in Keras because we have to do the backpropagation and parameter update steps ourselves. The main loop iterates over a number of epochs, and on each epoch we iterate through the train DataLoader. The DataLoader yields one batch of data and targets, which we pass through the model. After each training batch, we calculate the loss, backpropagate the gradients of the loss with respect to the model parameters, and then update the parameters with the optimizer.

I’d encourage you to look at the notebook for the complete training details, but the basic pseudo-code is as follows:
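
A condensed sketch of that training loop (omitting the metric tracking and validation pass in the notebook):

```python
def train_epoch(model, criterion, optimizer, train_loader):
    """One pass through the training data with backprop and parameter updates."""
    model.train()
    total_loss = 0.0
    for data, targets in train_loader:
        # Clear gradients accumulated from the previous batch
        optimizer.zero_grad()
        # Forward pass: log probabilities for each class
        output = model(data)
        # Compute the loss, backpropagate, and update trainable parameters
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * data.size(0)
    # Average loss per training example
    return total_loss / len(train_loader.dataset)
```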

We can continue to iterate through the data until we reach a given number of epochs. However, one problem with this approach is that our model will eventually start overfitting to the training data. To prevent this, we use our validation data and early stopping.

Early Stopping

Early stopping means halting training when the validation loss has not decreased for a number of epochs. As we continue training, the training loss will only decrease, but the validation loss will eventually reach a minimum and plateau or start to increase. We ideally want to stop training when the validation loss is at a minimum in the hope that this model will generalize best to the testing data. When using early stopping, every epoch in which the validation loss decreases, we save the parameters so we can later retrieve those with the best validation performance.

We implement early stopping by iterating through the validation DataLoader at the end of each training epoch. We calculate the validation loss and compare this to the lowest validation loss. If the loss is the lowest so far, we save the model. If the loss has not improved for a certain number of epochs, we halt training and return the best model which has been saved to disk.

Again, the complete code is in the notebook, but pseudo-code is:
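
The early stopping logic can be sketched independently of the model; here train_one_epoch, validate, and save_checkpoint stand in for the real training, validation, and model-saving code:

```python
def train_with_early_stopping(train_one_epoch, validate, save_checkpoint,
                              max_epochs=100, patience=3):
    """Halt training when validation loss stops improving for `patience` epochs."""
    best_loss = float('inf')
    best_epoch = 0
    epochs_no_improve = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            # New best validation loss: save the model and reset the counter
            best_loss = val_loss
            best_epoch = epoch
            epochs_no_improve = 0
            save_checkpoint()
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                print(f'Early stopping at epoch {epoch}; '
                      f'best epoch was {best_epoch}.')
                break
    return best_epoch, best_loss
```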

To see the benefits of early stopping, we can look at the training curves showing the training and validation losses and accuracy:

Negative log likelihood and accuracy training curves

As expected, the training loss only continues to decrease with further training. The validation loss, on the other hand, reaches a minimum and plateaus. At a certain epoch, there is no return (or even a negative return) to further training. Our model will only start to memorize the training data and will not be able to generalize to testing data.

Without early stopping, our model will train for longer than necessary and will overfit to the training data.

Another point we can see from the training curves is that our model is not overfitting greatly. There is some overfitting, as is almost always the case, but the dropout after the first trainable fully connected layer prevents the training and validation losses from diverging too much.

Making Predictions: Inference

In the notebook I take care of some boring — but necessary — details of saving and loading PyTorch models, but here we’ll move right to the best part: making predictions on new images. We know our model does well on training and even validation data, but the ultimate test is how it performs on a hold-out testing set it has not seen before. We saved 25% of the data for the purpose of determining if our model can generalize to new data.

Predicting with a trained model is pretty simple. We use the same syntax as for training and validation:

# Put the model in evaluation mode and turn off gradient tracking
model.eval()
with torch.no_grad():
    for data, targets in testloader:
        log_ps = model(data)
        # Convert log probabilities to probabilities
        ps = torch.exp(log_ps)
ps.shape
**(128, 100)**

The shape of our probabilities is ( batch_size , n_classes ) because we have a probability for every class. We can find the accuracy by finding the highest probability for each example and comparing these to the labels:

# Find predictions and correct
_, pred = torch.max(ps, dim=1)
equals = pred == targets
# Calculate accuracy
accuracy = torch.mean(equals.float())

When diagnosing a network used for object recognition, it can be helpful to look at both overall performance on the test set and individual predictions.

Model Results

Here are two predictions the model nails:

These are pretty easy, so I’m glad the model has no trouble!

We don’t just want to focus on the correct predictions and we’ll take a look at some wrong outputs shortly. For now let’s evaluate the performance on the entire test set. For this, we want to iterate over the test DataLoader and calculate the loss and accuracy for every example.

Convolutional neural networks for object recognition are generally measured in terms of topk accuracy. This refers to whether or not the real class was in the k most likely predicted classes. For example, top 5 accuracy is the percentage of examples for which the right class was among the 5 highest probability predictions. You can get the topk most likely probabilities and classes from a PyTorch tensor as follows:

top_5_ps, top_5_classes = ps.topk(5, dim=1)
**(128, 5)**
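
From those results, top-k accuracy is just whether the target appears in any of the k predicted columns; a sketch with made-up tensors:

```python
import torch

def topk_accuracy(ps, targets, k=5):
    """Fraction of examples whose true class is among the k most likely."""
    _, topk_classes = ps.topk(k, dim=1)
    # Compare the target against each of the k predictions per row
    correct = (topk_classes == targets.view(-1, 1)).any(dim=1)
    return correct.float().mean().item()
```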

Evaluating the model on the entire testing set, we calculate the metrics:

**Final test top 1 weighted accuracy = 88.65%
Final test top 5 weighted accuracy = 98.00%
Final test cross entropy per image = 0.3772.**

These compare favorably to the near 90% top 1 accuracy on the validation data. Overall, we conclude our pre-trained model was able to successfully transfer its knowledge from Imagenet to our smaller dataset.

Model Investigation

Although the model does well, there are likely steps to take which can make it even better. Often, the best way to figure out how to improve a model is to investigate its errors. (Note: this is also an effective self-improvement method.)

Our model isn’t great at identifying crocodiles, so let’s look at some test predictions from this category:

Given the subtle distinction between crocodile and crocodile_head , and the difficulty of the second image, I’d say our model is not entirely unreasonable in these predictions. The ultimate goal in image recognition is to exceed human capabilities, and our model is nearly there!

Finally, we’d expect the model to perform better on categories with more images, so we can look at a graph of accuracy in a given category versus the number of training images in that category:

There does appear to be a positive correlation between the number of training images and the top 1 test accuracy. This indicates that more training data augmentation could be helpful, or even that we should use test time augmentation. We could also try a different pre-trained model or build another custom classifier. At the moment, deep learning is still an empirical field, meaning experimentation is often required!


Conclusions

While there are easier deep learning libraries to use, the benefits of PyTorch are speed, control over every aspect of model architecture / training, efficient implementation of backpropagation with tensor auto differentiation, and ease of debugging code due to the dynamic nature of PyTorch graphs. For production code or your own projects, I’m not sure there is yet a compelling argument for using PyTorch instead of a library with a gentler learning curve such as Keras, but it’s helpful to know how to use different options.

Through this project, we were able to see the basics of using PyTorch as well as the concept of transfer learning, an effective method for object recognition. Instead of training a model from scratch, we can use existing architectures that have been trained on a large dataset and then tune them for our task. This reduces the time to train and often results in better overall performance. The outcome of this project is some knowledge of transfer learning and PyTorch that we can use to build more complex applications.

We truly live in an incredible age for deep learning, where anyone can build deep learning models with easily available resources! Now get out there and take advantage of these resources by building your own project.

As always, I welcome feedback and constructive criticism. I can be reached on Twitter @koehrsen_will or through my personal website

Read More


Deploying a Python Web App on AWS

How to share your Python project with the world

While I enjoy doing data science and programming projects for the personal thrill that comes with building something of my own, there is also a certain joy in sharing your project online with anyone in the world. Fortunately, thanks to Amazon Web Services (AWS), in a few minutes, we can deploy a Python web application to the entire world for free.

In this article, we’ll see how to deploy a deep learning web app to AWS on a free EC2 instance. This article will work with the app built in Deploying a Keras Deep Learning Model as a Web Application in Python using the model developed in Recurrent Neural Networks by Example in Python. Neither of these is required, just know that our application generates novel patent abstracts with an RNN. All the code for the project can be found on GitHub.

Amazon Web Services EC2

Amazon Web Services is the umbrella term for the range of Amazon’s cloud computing offerings. We’ll be using Amazon Elastic Compute Cloud (EC2), a service where we rent virtual computers in the cloud to run applications. AWS EC2 offers a free tier so we can deploy without spending a cent.

To get started, create an AWS account and head to the EC2 console. Click on the Launch Instance button, which takes you to choose an Amazon Machine Image (AMI), “a template that contains the software configuration (operating system) required to launch your instance.” You can use any OS you’re familiar with (although some aren’t eligible for the free tier), but I’ll be using Ubuntu Server 18.04:

AMI type (Ubuntu 18.04)

Hit Select, then on the next page choose the free tier eligible t2.micro instance (an instance is the hardware for our AMI). This only has 1 CPU and 1 GB of RAM, but it will actually be enough to run our pre-trained recurrent neural network application! If you’re expecting more traffic or running a CPU-intensive application, you’ll probably have to shell out for a larger instance.

Security Groups

Select the instance type you want and then go to tab 6. Configure Security Group at the top of the page. Security groups filter the traffic into and out of our instance — basically, who can access our virtual computer.

You (and only you) will need to access the instance via ssh, so add a rule that allows Source “My IP” for SSH. We want others to be able to access our app in a web browser, so add a rule to allow HTTP access for all sources. The final security configuration is:

Security group rules

Next, hit Review and Launch and then Launch. This brings up the options for using a key pair. You need this to access the server via ssh, so make sure to create a new key pair and save the private key somewhere you remember it. If you lose this, you will not be able to access your instance again!

Finally, hit Launch Instances and Amazon will start up your very own virtual machine which is physically located…somewhere. Wait a few minutes for the instance to boot before heading to the next step: connecting to your instance.

Connecting to Server via SSH

Once the instance is up and running, select it on the EC2 Instance dashboard (Services > EC2 > Running Instances) and hit Connect. This will give us the exact commands to connect to the instance.

Connect dialog from EC2 running instances dashboard.

Copy the example code, and paste it into Bash or a command prompt running in the folder with your private key (you generate this when launching your instance). Assuming everything goes well, you’ll be logged into your instance and see a familiar terminal command prompt.

Installing Requirements

This AMI comes equipped with Python 3.6, so we just need to clone the repository and install the app dependencies. First, get the repository:

git clone

Then install pip, move into the repository, and install the requirements.

sudo apt-get update
sudo apt-get install python3-pip
cd recurrent-neural-networks
pip3 install --user -r requirements.txt

Running and Accessing the Web Application

Running the app is simple (you might need sudo for the second command):

cd deployment

(If you want to understand what’s going on in the web application, take a look at the previous article for the development process).

You should see the following output in the terminal:

While it looks like the app is running on localhost:80/, that’s on the virtual machine. For us to access the web app, we’ll have to use the instance’s Public DNS (IPv4), which can be found on the running instance dashboard.

Public DNS for running instance.

Copy and paste the address into your browser, and you’ll see the application!

Homepage of the web application.

Feel free to play around with the recurrent neural network application. What it’s doing is generating new patent abstracts with a recurrent neural network trained on thousands of abstracts with the keyword “neural network”. You can either enter random for a random starting sequence, or your own sequence. (To see the development, check out this article or this notebook.)

Keras recurrent neural network application.

Your application can now be reached by anyone in the world via the IPv4. If you want the app to keep running even after you log out of the instance, run it in a Screen session. (Screen is a handy program that lets you run terminal sessions from a single terminal window using virtual consoles.)

# From within recurrent-neural-networks/deployment
screen -R deploy

My application (if I haven’t shut it down or run into errors) should still be running. Because I’m using a t2.micro instance, the cost to run this web application in perpetuity is precisely $0.00! If you want a domain name, you can pick one up from a domain name registrar such as Hover.

Next Steps

Although this is a decent solution to quickly deploy a personal project, this is not a production-ready deployment! For that, you’ll want to make sure to use proper security (HTTPS with a valid certificate). You’ll also want to make sure your application can handle expected traffic. Only use this specific solution for small projects without sensitive data.


Conclusions

We truly live in incredible times: with Flask we can develop a Python web app in a few minutes and then deploy it to the world for free with AWS. The general process we followed was: develop a web application (preferably in Python), rent commodity hardware from a cloud provider, and deploy the web application to the world.

If you were able to follow all the tutorials from the implementation of a recurrent neural network to developing a local web application to deploying on AWS, then you’ll have completed an impressive project! The overall process from a blank file to a running web application may be daunting, but like most technology problems, if you break it down, each step isn’t overwhelming and there are many open-source tools to make the process easy. If you’re a data scientist bored with doing self-contained analysis in Jupyter Notebooks, take the initiative to do a project you can deploy as an application. It’s good to branch out into other disciplines, and building and deploying a web application is a great opportunity to learn some new skills.

As always, I welcome feedback and constructive criticism. I can be reached on Twitter @koehrsen_will or through my personal website

Read More


Deploying a Keras Deep Learning Model as a Web Application in Python

Deep learning, web apps, Flask, HTML, and CSS in one project

Building a cool machine learning project is one thing, but at the end of the day, you want other people to be able to see your hard work. Sure, you could put the whole project on GitHub, but how are your grandparents supposed to figure that out? No, what we want is to deploy our deep learning model as a web application accessible to anyone in the world.

In this article, we’ll see how to write a web application that takes a trained Keras recurrent neural network and allows users to generate new patent abstracts. This project builds on work from the Recurrent Neural Networks by Example article, but knowing how to create the RNN isn’t necessary. We’ll just treat it as a black box for now: we put in a starting sequence, and it outputs an entirely new patent abstract that we can display in the browser!

Traditionally, data scientists develop the models and front end engineers show them to the world. In this project, we’ll have to play both roles, and dive into web development (almost all in Python though).

This project requires joining together numerous topics:

  • Flask: creating a basic web application in Python
  • Keras: deploying a trained recurrent neural network
  • Templating with the Jinja template library
  • HTML and CSS for writing web pages

The final result is a web application that allows users to generate entirely new patent abstracts with a trained recurrent neural network:

The complete code for this project is available on GitHub.


The goal was to get a web application up and running as quickly as possible. For that, I went with Flask, which allows us to write the app in Python. I don’t like to mess with styling (which clearly shows), so almost all of the CSS is copied and pasted. This article by the Keras team was helpful for the basics, and this article is a useful guide as well.

Overall, this project adheres to my design principles: get a prototype up and running quickly — copying and pasting as much as required — and then iterate to make a better product.

A Basic Web Application with Flask

The quickest way to build a web app in Python is with Flask. To make our own app, we can use just the following:

from flask import Flask
app = Flask(__name__)

# Serve a simple page at the root URL
def hello():
    return "<h1>Not Much Going On Here</h1>"

# Run the app on the specified address and port'', port=50000)
If you copy and paste this code and run it, you’ll be able to view your own web app at localhost:50000. Of course, we want to do more than that, so we’ll use a slightly more complicated function which basically does the same thing: handles requests from your browser and serves up some content as HTML. For our main page, we want to present the user with a form to enter some details.

User Input Form

When our users arrive at the main page of the application, we’ll show them a form with three parameters to select:

  1. Input a starting sequence for RNN or select randomly
  2. Choose diversity of RNN predictions
  3. Choose the number of words RNN outputs

To build a form in Python, we’ll use wtforms. The code to make the form is:

This creates a form shown below (with styling from main.css):

The validators in the code make sure the user enters the correct information. For example, we check that all boxes are filled and that the diversity is between 0.5 and 5. These conditions must be met for the form to be accepted.

Validation error

The way we actually serve the form with Flask is by using templates.


A template is a document with a basic framework that we need to fill in with details. For a Flask web application, we can use the Jinja templating library to pass Python code to an HTML document. For example, in our main function, we’ll send the contents of the form to a template called index.html.

When the user arrives on the home page, our app will serve up index.html with the details from the form. The template is a simple HTML scaffolding where we refer to Python variables with `{{ variable }}` syntax.
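To make that substitution concrete, here is a standalone sketch driving the Jinja engine directly; the template string is a simplified stand-in for index.html:

```python
from jinja2 import Template

# Simplified stand-in for index.html: Jinja swaps {{ form }} for
# whatever Python value we pass to render().
page = Template("""<h1>Writing Patent Abstracts with RNNs</h1>
<form method="POST">{{ form }}</form>""")

html = page.render(form="<input name='seed'>")
```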

For each of the errors in the form (those entries that can’t be validated) an error will flash. Other than that, this file will show the form as above.

When the user enters information and hits submit (a POST request), we want to divert the input, if it validates, to the appropriate function to make predictions with the trained RNN. This means modifying home().

Now, when the user hits submit and the information is correct, the input is sent either to generate_random_start or generate_from_seed depending on the input. These functions use the trained Keras model to generate a novel patent with a diversity and num_words specified by the user. The output of these functions in turn is sent to either of the templates random.html or seeded.html to be served as a web page.
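The routing just described can be sketched as follows; the two generation functions here are hypothetical stand-ins for the real RNN code, and form validation is omitted for brevity:

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical stand-ins for the real RNN generation functions.
def generate_random_start(diversity, num_words):
    return f"<p>random abstract, diversity={diversity}</p>"

def generate_from_seed(seed, diversity, num_words):
    return f"<p>abstract seeded with '{seed}'</p>"

@app.route("/", methods=["GET", "POST"])
def home():
    if request.method == "POST":
        seed = request.form["seed"]
        diversity = float(request.form["diversity"])
        num_words = int(request.form["words"])
        # Divert the input to the appropriate generation function
        if seed == "random":
            return generate_random_start(diversity, num_words)
        return generate_from_seed(seed, diversity, num_words)
    # On a GET request, serve the input form
    return "<form method='POST'>...</form>"
```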

Making Predictions with a Pre-Trained Keras Model

The model parameter is the trained Keras model, which we load in once when the app starts.

(The tf.get_default_graph() call is a workaround based on this gist.)

I won’t show the entirety of the two util functions (here is the code), and all you need to understand is they take the trained Keras model along with the parameters and make predictions of a new patent abstract.

These functions both return a Python string with formatted HTML. This string is sent to another template to be rendered as a web page. For example, the generate_random_start returns formatted html which goes into random.html:

Here we are again using the Jinja template engine to display the formatted HTML. Since the Python string is already formatted as HTML, all we have to do is use `{{ input }}` (where input is the Python variable) to display it. We can then style this page in main.css as with the other HTML templates.
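As a sketch, the relevant part of a template like random.html could look like the following. Note that a Flask template autoescapes by default, so the |safe filter is what keeps the pre-formatted HTML from being escaped; the template string here is illustrative:

```python
from jinja2 import Template

# Illustrative slice of random.html: the |safe filter tells Jinja
# not to escape the HTML string built by the generation function.
snippet = Template("<div>{{ input|safe }}</div>")
rendered = snippet.render(input="<p>A generated patent abstract</p>")
```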


The function generate_random_start picks a random patent abstract as the starting sequence and makes predictions building from it. It then displays the starting sequence, RNN-generated output, and the actual output:

Random starting sequence output.

The function generate_from_seed takes a user-supplied starting sequence and then builds off of it using the trained RNN. The output appears as follows:

Output from starting seed sequence

While the results are not always entirely on-point, they do show the recurrent neural network has learned the basics of English. It was trained to predict the next word from the previous 50 words and has picked up how to write a slightly-convincing patent abstract! Depending on the diversity of the predictions, the output might appear to be completely random or a loop.

Running the App

To run the app for yourself, all you need to do is download the repository, navigate to the deployment directory, and run the app script with python. This will immediately make the web app available at localhost:10000.

Depending on how your home WiFi is configured, you should be able to access the application from any computer on the network using your IP address.

Next Steps

The web application running on your personal computer is great for sharing with friends and family. I'd definitely not recommend opening this up to everyone on your home network though! For that, we'll want to set the app up on an AWS EC2 instance and serve it to the world (coming later).

To improve the app, we can alter the styling (through main.css) and perhaps add more options, such as the ability to choose the pre-trained network. The great thing about personal projects is you can take them as far as you want. If you want to play around with the app, download the code and get started.


In this article, we saw how to deploy a trained Keras deep learning model as a web application. This requires bringing together a number of different technologies including recurrent neural networks, web applications, templating, HTML, CSS, and of course Python.

While this is only a basic application, it shows that you can start building web applications using deep learning with relatively little effort. There aren’t many people who can say they’ve deployed a deep learning model as a web application, but if you follow this article, count yourself among them!

As always, I welcome feedback and constructive criticism. I can be reached on Twitter @koehrsen_will or through my personal website


Modeling: Teaching a Machine Learning Algorithm to Deliver Business Value

How to train, tune, and validate a machine learning model

This is the fourth in a four-part series on how we approach machine learning at Feature Labs. The complete set of articles can be found below:

  1. Overview: A General-Purpose Framework for Machine Learning
  2. Prediction Engineering: How to Set Up Your Machine Learning Problem
  3. Feature Engineering: What Powers Machine Learning
  4. Modeling: Teaching an Algorithm (this article)

These articles cover the concepts and a full implementation as applied to predicting customer churn. The project Jupyter Notebooks are all available on GitHub. (Full disclosure: I work for Feature Labs, a startup developing tooling, including Featuretools, for solving problems with machine learning. All of the work documented here was completed with open-source tools and data.)

The Machine Learning Modeling Process

The outputs of prediction and feature engineering are a set of label times, historical examples of what we want to predict, and features, predictor variables used to train a model to predict the label. The process of modeling means training a machine learning algorithm to predict the labels from the features, tuning it for the business need, and validating it on holdout data.

Inputs and outputs of the modeling process.

The output from modeling is a trained model that can be used for inference, making predictions on new data points.

The objective of machine learning is not a model that does well on training data, but one that demonstrates it satisfies the business need and can be deployed on live data.

Similar to feature engineering, modeling is independent of the previous steps in the machine learning process and has standardized inputs which means we can alter the prediction problem without needing to rewrite all our code. If the business requirements change, we can generate new label times, build corresponding features, and input them into the model.

Implementation of Modeling for Customer Churn

In this series, we are using machine learning to solve the customer churn problem. There are several ways to formulate the task, but our definition is:

Predict on the first of each month which customers will churn during the month. Use a lead time of one month, and define churn as 31 days with no active subscription. With a lead time of 1 month, this means we make predictions 1 month in advance: on January 1, we make predictions of churn during the month of February.

Although machine learning algorithms may sound technically complex, implementing them in Python is simple thanks to standard machine learning libraries like Scikit-Learn. As a bit of practical advice, empirical results have shown that the choice of machine learning model and hyperparameters matters, but not as much as feature engineering.

Therefore, the rational decision is to put most of the effort into prediction and feature engineering, and insert a pre-built solution for machine learning.

In this project, I went with Scikit-Learn to rapidly implement a few models. To get the data ready for machine learning, we have to take some basic steps: missing value imputation, encoding of categorical variables, and optionally feature selection if the input dimension is too large (see notebook for full details). Then, we can create a model with standard modeling syntax:
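The modeling code did not survive formatting; below is a self-contained sketch of the standard Scikit-Learn syntax, with synthetic data standing in for the engineered churn features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the engineered churn features
X, y = make_classification(n_samples=500, weights=[0.95], random_state=50)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=50)

# Standard Scikit-Learn modeling syntax: instantiate, fit, predict
model = RandomForestClassifier(n_estimators=100, random_state=50, n_jobs=-1)
model.fit(train_X, train_y)
probs = model.predict_proba(test_X)[:, 1]   # predicted churn probabilities
```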

Metrics and Baseline Results

Before applying machine learning, it’s best to establish a naive baseline to determine if machine learning is actually helping. With a classification problem, this can be as simple as guessing the majority label in the training data for all examples in the hold-out testing data. For the customer churn data, guessing every test label is not a churn yields an accuracy of 96.5%.

This high accuracy may sound impressive, but for an imbalanced classification problem — where one class is represented more than another — accuracy is not an adequate metric. Instead, we want to use recall, precision, or the F1 score.

Recall represents the percentage of actual churns in the data that our model identifies; the naive guess records 3.5%. Precision measures the percentage of churns predicted by our model that actually were churns, with a naive score of 1.0%. The F1 score is the harmonic mean of precision and recall.
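These three metrics follow directly from the counts of true positives, false positives, and false negatives; a small helper makes the definitions concrete:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from prediction counts."""
    precision = tp / (tp + fp)   # predicted churns that were real churns
    recall = tp / (tp + fn)      # real churns the model caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

For example, catching 3 of 4 actual churns while raising 1 false alarm gives a precision, recall, and F1 of 0.75 each.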

Since this is a classification problem, for a machine learning baseline I tried a logistic regression which did not perform well. This indicates the problem is likely non-linear, so my second attempt used a Random Forest Classifier with better results. The random forest is quick to train, relatively interpretable, highly accurate and is usually a solid model choice.

The metrics for no machine learning, logistic regression, and the random forest with default hyperparameters are shown below:

Metrics recorded by baseline models

Each model was evaluated using about 30% of the data for holdout testing based on a time-series split. (This is crucial when evaluating a model in a time-series problem because it prevents training data leakage and should provide a good estimate of the actual model performance on new data.)
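A time-based holdout can be as simple as splitting on row order; this listing uses a toy list in place of the chronologically sorted label times:

```python
# Rows are assumed sorted chronologically; the newest 30% form the holdout.
rows = list(range(100))          # stand-in for time-ordered observations
cutoff = int(0.7 * len(rows))
train, holdout = rows[:cutoff], rows[cutoff:]
# Every training row precedes every holdout row, so no future data leaks in.
```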

Aligning the Model with the Business Requirement

Even though the metrics for the machine learning models are better than with no machine learning, we want to optimize a model for one or more metrics in line with the business need. In this example, we'll focus on recall and precision. We will tune the model to achieve a certain recall by adjusting the threshold, the probability above which an observation is classified as positive — a churn.

Precision and Recall Tuning

There is a fundamental tradeoff in machine learning between recall and precision, which means we can increase one only at the cost of decreasing the other. For example, if we want to find every instance of churn — a recall of 100% — then we would have to accept a low precision — many false positives. Conversely, if we limit the false positives by increasing the precision, then we will identify fewer of the actual churns lowering the recall.

The balance between these two is altered by adjusting the model’s threshold. We can visualize this in the model’s precision-recall curve.

Precision-recall curve tuned for 75% recall.

This shows the precision versus the recall for different values of the threshold. The default threshold in Scikit-Learn is 0.5, but depending on the business needs, we can adjust this to achieve desired performance.

For customer churn we’ll tune the threshold to achieve a recall of 75%. By inspecting the predicted probabilities (the actual values), we determine the threshold should be 0.39 to hit this mark. At a threshold of 0.39, our recall is 75% and our precision is 8.31%.
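Mechanically, tuning the threshold just means comparing the predicted probabilities against a cutoff other than 0.5; the probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical predicted churn probabilities and true labels
probs = np.array([0.10, 0.45, 0.39, 0.80, 0.20])
labels = np.array([0, 1, 1, 1, 0])

threshold = 0.39   # lowered from the 0.5 default to raise recall
preds = (probs >= threshold).astype(int)

recall = (preds & labels).sum() / labels.sum()
```

At the 0.5 default only one of the three churns above would be caught; lowering the threshold trades false positives for recall.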

Choosing the recall or precision lies in the business domain. It requires determining which is more costly, false positives — predicting a customer will churn when in fact they will not — or false negatives — predicting a customer will not churn when in fact they will — and adjusting appropriately.

A recall of 75% was chosen as an example optimization but this can be changed. At this value, compared to the naive baseline, we have achieved a 20x improvement in recall and an 8x improvement in precision.

Model Validation

Once we have selected the threshold for classifying a churn, we can plot the confusion matrix from the holdout testing set to examine the predictions.

Confusion Matrix for Tuned Random Forest

At this threshold, we identify more than half the churns (75%) although with a significant number of false positives (upper right). Depending on the relative cost of false negatives vs false positives, our model might not actually be an improvement!

To make sure our model has solved the problem, we need to use the holdout results to calculate the return from implementing the model.

Validating Business Value

Using the model’s metrics on the hold-out testing set as an estimate of performance on new data, we can calculate the value of deploying this model before deploying it. Using the historical data, we first calculate the typical revenue lost to churn and then the reduced amount of revenue lost to churn with a model that achieves 75% recall and 8% precision.

Making a few assumptions about customer conversions (see notebook for details) we arrive at the following conclusion:

Machine learning increases the number of active monthly subscribers and recoups 13.5% of the monthly losses from customer churns.

Considering a subscription cost, this represents $130,000 (USD) per month.

With these numbers, we conclude that machine learning has solved the business need of increasing monthly subscribers and delivered a positive solution.
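The shape of that calculation can be sketched in a few lines. Every figure below is hypothetical (the real analysis, with its assumptions about customer conversions, is in the notebook):

```python
# All figures are hypothetical; the real analysis is in the project notebook.
monthly_price = 10.0        # subscription cost per customer (USD)
churns_per_month = 1000     # customers lost in a typical month
recall = 0.75               # fraction of churns the model flags in advance
save_rate = 0.18            # assumed fraction of flagged customers retained

baseline_loss = churns_per_month * monthly_price
recouped = baseline_loss * recall * save_rate   # revenue saved per month
```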

As a final piece of model interpretation, we can look at the most important features to get a sense of the variables most relevant to the problem. The 10 most important variables from the random forest model are shown below:

Most important features from random forest model.

The most important variables agree with our intuition for the problem. For instance, the most important feature is the total spending in the month before the cutoff time. Because we are using a lead time of 1 month, this represents the spending two months prior to the month of prediction. The more customers spent in this period, the less likely they were to churn. We also see top features like the average time between transactions or method of payment id, which could be important to monitor for our business.

Making Predictions and Deployment

With our machine learning pipeline complete and the model validated, we are ready to make predictions of future customer churn. We don’t have live data for this project, but if we did, we could make predictions like the following:

Predictions for new data based on threshold.

These predictions and feature importances can go to the customer engagement team where they will do the hard work of retaining members.

In addition to making predictions each time we get new data, we’ll want to continue to validate our solution once it has been deployed. This means comparing model predictions to actual outcomes and looking at the data to check for concept drift. If performance decreases below the level of providing value, we can gather and train on more data, change the prediction problem, optimize the model settings, or adjust the tuned threshold.

Modeling Notes

As with prediction and feature engineering, the modeling stage is adaptable to new prediction problems and uses common tools in data science. Each step in the machine learning framework we use is segmented, meaning we are able to implement solutions to numerous problems without needing to rewrite all the code. Moreover, the APIs — Pandas, Featuretools, and Scikit-Learn — are user-friendly, have great documentation, and abstract away the tedious details.

Conclusions for the Machine Learning Process

The future of machine learning lies not in one-off solutions but in a general-purpose framework allowing data scientists to rapidly develop solutions for all the problems they face. This scaffolding functions in much the same way as website templates: each time we build a website, we don’t start from scratch, we use an existing template and fill in the details.

The same methodology should apply to solving problems with machine learning: instead of building a new solution for each problem, adapt an existing scaffolding and fill in the details with user-friendly tooling.

In this series of articles, we walked through the concepts and use of a general-purpose framework for solving real-world machine learning problems.

The process is summarized in three steps:

  1. Prediction Engineering: Define a business need, translate the need into a supervised machine learning problem, and create labeled examples
  2. Feature Engineering: Use label times and raw historical data to build predictor variables for each label
  3. Modeling: Train, tune for the business need, validate the value of solution, and make predictions with a machine learning algorithm

A general purpose framework for solving problems with machine learning.

While machine learning is not a sacred art available only to a select few, it has remained out of the reach of many organizations because of the lack of standardized processes. The objective of this framework is to make machine learning solutions easier to develop and deploy, which will allow more organizations to see the benefits of leveraging this powerful technology.

If building meaningful, high-performance predictive models is something you care about, then get in touch with us at Feature Labs. While this project was completed with the open-source Featuretools, the commercial product offers additional tools and support for creating machine learning solutions.
