Feature Engineering: What Powers Machine Learning

How to Extract Features from Raw Data for Machine Learning

This is the third in a four-part series on how we approach machine learning at Feature Labs. The complete set of articles is:

  1. Overview: A General-Purpose Framework for Machine Learning
  2. Prediction Engineering: How to Set Up Your Machine Learning Problem
  3. Feature Engineering (this article)
  4. Modeling: Teaching an Algorithm to Make Predictions

These articles cover the concepts and a full implementation as applied to predicting customer churn. The project Jupyter Notebooks are all available on GitHub. (Full disclosure: I work for Feature Labs, a startup developing tooling, including Featuretools, for solving problems with machine learning. All of the work documented here was completed with open-source tools and data.)

Feature Engineering

It’s often said that “data is the fuel of machine learning.” This isn’t quite true: data is more like the crude oil of machine learning, which means it has to be refined into features — predictor variables — to be useful for training a model. Without relevant features, you can’t train an accurate model, no matter how complex the machine learning algorithm. The process of extracting features from a raw dataset is called feature engineering.

The Feature Engineering Process

Feature engineering, the second step in the machine learning pipeline, takes in the label times from the first step — prediction engineering — and a raw dataset that needs to be refined. Feature engineering means building features for each label, filtering the data used for each feature to before the label’s cutoff time so the features are valid. These features and labels are then passed to modeling, where they are used to train a machine learning algorithm.

The process of feature engineering.

While feature engineering requires label times, in our general-purpose framework, it is not hard-coded for specific labels corresponding to only one prediction problem. If we wrote our feature engineering code for a single problem — as feature engineering is traditionally approached — then we would have to redo this laborious step every time the parameters change.

Instead, we use APIs like Featuretools that can build features for any set of labels without requiring changes to the code. This means for the customer churn dataset, we can solve multiple prediction problems — predicting churn every month, every other week, or with a lead time of two rather than one month — using the exact same feature engineering code.

This fits with the principles of our machine learning approach: we segment each step of the pipeline while standardizing inputs and outputs. This independence means we can change the problem in prediction engineering without needing to alter the downstream feature engineering and machine learning code.

The key to making this step of the machine learning process repeatable across prediction problems is automated feature engineering.

Automated Feature Engineering: Build Better Predictive Models Faster

Traditionally, feature engineering is done by hand, building features one at a time using domain knowledge. However, this manual process is error-prone, tedious, must be started from scratch for each dataset, and ultimately is limited by constraints on human creativity and time. Furthermore, in time-dependent problems where we have to filter every feature based on a cutoff time, it’s hard to avoid errors that can invalidate an entire machine learning solution.

Automated feature engineering overcomes these problems through a reusable approach to automatically building hundreds of relevant features from a relational dataset. Moreover, this method filters the features for each label based on the cutoff time, creating a rich set of valid features. In short, automated feature engineering enables data scientists to build better predictive models in a fraction of the time.

Manual vs Automated Feature Engineering Pipelines.

Motivation for Automated Feature Engineering

After solving a few problems with machine learning, it becomes clear that many of the operations used to build features are repeated across datasets. For instance, we often find the weekday of an event — be it a transaction or a flight — and then find the average transaction amount or flight delay by day of the week for each customer or airline. Once we realize that these operations don’t depend on the underlying data, why not abstract this process into a framework that can build features for any relational dataset?

This is the idea behind automated feature engineering. We can apply the same basic building blocks — called feature primitives — to different relational datasets to build predictor variables. As a concrete example, the “max” feature primitive applied to customer transactions can also be applied to flight delays. In the former case, this will find the largest transaction for each customer, and in the latter, the longest flight delay for a given flight number.

Source: How Deep Feature Synthesis Works
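To make the reuse concrete, here is a minimal pandas sketch (with made-up toy tables) of the same “max” primitive applied to the two unrelated datasets described above:

```python
import pandas as pd

# Toy tables: column names are illustrative, not from the real datasets
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [25.0, 40.0, 10.0],
})
flights = pd.DataFrame({
    "flight_number": ["UA1", "UA1", "DL2"],
    "delay_minutes": [5, 32, 12],
})

# Largest transaction per customer ...
max_transaction = transactions.groupby("customer_id")["amount"].max()

# ... and longest delay per flight number, from the exact same operation
max_delay = flights.groupby("flight_number")["delay_minutes"].max()
```

The primitive itself never changes; only the table and the key it is grouped by do, which is what makes it reusable across relational datasets.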

This is an embodiment of the idea of abstraction: remove the need to deal with the details — writing specific code for each dataset — by building higher level tools that take advantage of operations common to many problems.

Ultimately, automated feature engineering makes us more efficient as data scientists by removing the need to repeat tedious operations across problems.

Implementation of Feature Engineering

Currently, the only open-source Python library for automated feature engineering using multiple tables is Featuretools, developed and maintained by Feature Labs. For the customer churn problem, we can use Featuretools to quickly build features for the label times that we created in prediction engineering. (Full code available in this Jupyter Notebook).

We have three tables of data: customer background info, transactions, and user listening logs. If we were using manual feature engineering, we’d brainstorm and build features by hand, such as the average value of a customer’s transactions, or her total spending on weekends in the previous year. For each feature, we’d first have to filter the data to before the cutoff time for the label. In contrast, in our framework, we make use of Featuretools to automatically build hundreds of relevant features in a few lines of code.

We won’t go through the details of Featuretools, but the heart of the library is an algorithm called Deep Feature Synthesis which stacks the feature engineering building blocks known as primitives (simple operations like “max” or finding the “weekday” of a transaction) to build “deep features”. The library also automatically filters data for features based on the cutoff time.

Featuretools requires some background code to link together the tables through relationships, but then we can automatically make features for customer churn using the following code (see notebook for complete details):

This one line of code gives us over 200 features for each label in cutoff_times. Each feature is a combination of feature primitives and is built with only data from before the associated cutoff time.

Sample of features from Featuretools automated feature engineering.

The features built by Featuretools are explainable in natural language because they are built up from basic operations. For example, we see the feature AVG_TIME_BETWEEN(transactions.transaction_date). This represents the average time between transactions for each customer. When we plot this feature colored by the label, we see that customers who churned appear to have a slightly longer average time between transactions.

Distribution of time between transactions colored by the label.
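For intuition, this feature can be computed by hand in pandas on a toy table (Featuretools does this automatically, including the cutoff-time filtering): stack a “time between” primitive on the transaction dates, then an “average” primitive per customer, using only data from before the cutoff.

```python
import pandas as pd

# Toy transactions; the March 15 row falls after the cutoff time
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "transaction_date": pd.to_datetime(
        ["2018-01-03", "2018-02-01", "2018-03-15", "2018-01-10"]),
})
cutoff = pd.Timestamp("2018-03-01")

# 1. Only data from before the label's cutoff time may contribute
valid = transactions[transactions["transaction_date"] < cutoff]

# 2. Stack primitives: days between consecutive transactions, then the
#    average per customer (AVG_TIME_BETWEEN)
valid = valid.sort_values("transaction_date")
days_between = valid.groupby("customer_id")["transaction_date"].diff().dt.days
avg_time_between = days_between.groupby(valid["customer_id"]).mean()
```

Customer 1 has two valid transactions 29 days apart, so the feature is 29.0; customer 2 has only one valid transaction, so the feature is missing, exactly as an automated tool would report it.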

In addition to getting hundreds of valid, relevant features, developing an automated feature engineering pipeline in Featuretools means we can use the same code for different prediction problems with our dataset. We just need to pass in the correct label times to the cutoff_times parameter and we’ll be able to build features for a different prediction problem.

Automated feature engineering means we can solve multiple problems in the time it would normally take to complete just one. A change in parameters means tweaking a few lines of code instead of implementing an entirely new solution.

To solve a different problem, rather than rewrite the entire pipeline, we:

  1. Tweak the prediction engineering code to create new label times
  2. Input the label times to feature engineering and output features
  3. Use the features to train a supervised machine learning model

(As a brief note: the feature engineering code can be run in parallel using either Dask or Spark with PySpark. For the latter approach, see this notebook or this article on the Feature Labs engineering blog.)

Next Steps

Just as the label times from prediction engineering flowed into feature engineering, the features serve as inputs to the next stage, modeling: training an algorithm to predict the label from the features. In the final article in this series, we’ll look at how to train, tune, validate, and predict with a machine learning model to solve the customer churn problem.

As a preview, pictured is the tuned precision-recall curve from machine learning. (Full notebook available on GitHub.)

Precision Recall Curve for Machine Learning


Feature engineering has tended to be a tedious aspect of solving problems with machine learning and a source of errors preventing solutions from being successfully implemented. By using automated feature engineering in a general-purpose machine learning framework we:

  • Automatically build hundreds of features for any relational dataset
  • Create only valid features by filtering data on cutoff times

Furthermore, the feature engineering code is not hard-coded for the inputs from prediction engineering, which means we can use the exact same code to make features for multiple prediction problems. By applying automated feature engineering in a structured framework, we turn feature engineering from a painful process into a quick, reusable procedure that lets us solve many valuable machine learning problems.

If building meaningful, high-performance predictive models is something you care about, then get in touch with us at Feature Labs. While this project was completed with the open-source Featuretools, the commercial product offers additional tools and support for creating machine learning solutions.

Read More

Prediction Engineering: How to Set Up Your Machine Learning Problem

An explanation and implementation of the first step in solving problems with machine learning

This is the second in a four-part series on how we approach machine learning at Feature Labs. The other articles can be found below:

  1. Overview: A General-Purpose Framework for Machine Learning
  2. Feature Engineering: What Powers Machine Learning (coming soon)
  3. Modeling: Teaching an Algorithm to Make Predictions (coming soon)

These articles will cover the concepts and a full implementation as applied to predicting customer churn. The project Jupyter Notebooks are all available on GitHub. (Full disclosure: I work for Feature Labs, a startup developing tooling, including Featuretools, for solving problems with machine learning. All of the work documented here was completed with open-source tools and data.)

When working with real-world data on a machine learning task, we define the problem, which means we have to develop our own labels — historical examples of what we want to predict — to train a supervised model. The idea of making our own labels may initially seem foreign to data scientists (myself included) who got started on Kaggle competitions or textbook datasets where the answers are already included.

The concept behind prediction engineering — making labels to train a supervised machine learning model — is not new. However, it currently is not a standardized process and is done by data scientists on an as-needed basis. This means that for each new problem, even with the same dataset, a new script must be developed, resulting in solutions that cannot be adapted to different prediction problems.

A better solution is to write functions that are flexible to changing business parameters, allowing us to quickly generate labels for many problems. This is one area where data science can learn from software engineering: solutions should be reusable and accept changing inputs. In this article, we’ll see how to implement a reusable approach to the first step in solving problems with machine learning — prediction engineering.

The Process of Prediction Engineering

Prediction engineering requires guidance both from the business side, to figure out the right problem to solve, and from the data scientist, to determine how to translate the business need into a machine learning problem. The inputs to prediction engineering are the parameters that define the prediction problem for the business requirement, and the historical dataset for finding examples of what we want to predict.

Process of prediction engineering.

The output of prediction engineering is a label times table: a set of labels with negative and positive examples made from past data along with an associated cutoff time indicating when we have to stop using data to make features for that label (more on this shortly).

For the use case we’ll work through in this series — customer churn — we defined the business problem as increasing monthly active subscribers by reducing rates of churn. The machine learning problem is building a model to predict which customers will churn using historical data. The first step in this task is making a set of labels of past examples of customer churn.

The parameters for what constitutes a churn and how often we want to make predictions will vary depending on the business need, but in this example, let’s say we want to make predictions on the first of each month for which customers will churn one month out from the time of prediction. Churn will be defined as going more than 31 days without an active membership.

It’s important to remember this is only one definition of churn corresponding to one business problem. When we write functions to make labels, they should take in parameters so they can be quickly changed to different prediction problems.

Our goal for prediction engineering is a label times table as follows:

Example of label times table

The labels correspond to whether a customer churned or not based on historical data. Each customer serves as a training example multiple times because they have multiple months of data. But even if we used each customer only once, this is a time-dependent problem, so we would still have to correctly implement the concept of cutoff times.

Cutoff Times: How to Ensure Your Features are Valid

The labels are not complete without the cutoff time which represents when we have to stop using data to make features for a label. Since we are making predictions about customer churn on the first of each month, we can’t use any data after the first to make features for that label. Our cutoff times are therefore all on the first of the month as shown in the label times table above.

All the features for each label must use data from before this time to prevent the problem of data leakage. Cutoff times are a crucial part of building successful solutions to time-series problems, one that many companies fail to account for. Using invalid data to make features leads to models that do well in development but fail in deployment.

Imagine we did not limit our features to data that occurred before the first of the month for each label. Our model would figure out that customers who had a paid transaction during the month could not have churned in that month and would thus record high metrics. However, when it came time to deploy the model and make predictions for a future month, we do not have access to the future transactions and our model would perform poorly. It’s like a student who does great on homework because she has the answer key but then is lost on the test without the same information.
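The difference between a leaky and a valid feature comes down to one filter. A minimal pandas sketch with made-up data (column names illustrative):

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "transaction_date": pd.to_datetime(
        ["2018-11-20", "2018-12-05", "2018-11-28"]),
    "amount": [15.0, 15.0, 30.0],
})
cutoff = pd.Timestamp("2018-12-01")  # predicting churn during December

# Leaky feature: counts the December 5 payment, which would not exist
# at prediction time in deployment
leaky_count = transactions.groupby("customer_id")["amount"].count()

# Valid feature: built only from data before the cutoff time
before_cutoff = transactions[transactions["transaction_date"] < cutoff]
valid_count = before_cutoff.groupby("customer_id")["amount"].count()
```

Customer 1’s leaky count is 2 but the valid count is 1: a model trained on the leaky version learns from information it will never have when deployed.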

Dataset for Customer Churn

Now that we have the concepts, let’s go through the details. KKBOX is Asia’s leading music streaming service, offering both a free and a pay-per-month subscription option to over 10 million members. KKBOX has made available a dataset for predicting customer churn. There are three data tables, coming in at just over 30 GB, represented by the schema below:

Relational diagram of data.

The three tables consist of:

  • customers: Background information such as age and city (msno is the customer id):

  • transactions: Transaction data for each payment for each customer:

  • activity logs: Logs of customer listening behavior:

This is a typical dataset for a subscription business and is an example of structured, relational data: observations in the rows, features in the columns, and tables tied together by primary and foreign keys: the customer id.
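A quick pandas sketch of how such tables join on the shared key (the msno column is from the real schema; the other columns and values here are made up for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"msno": ["a", "b"], "city": [1, 13]})
transactions = pd.DataFrame({"msno": ["a", "a", "b"],
                             "payment_plan_days": [30, 30, 90]})

# msno is the primary key of customers and a foreign key in
# transactions, so the tables tie together on it
joined = transactions.merge(customers, on="msno", how="left")
```

The one-to-many relationship (one customer, many transactions) is exactly the structure that aggregation primitives like “max” and “mean” operate over.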

Finding Historical Labels

The key to making prediction engineering adaptable to different problems is to follow a repeatable process for extracting training labels from a dataset. At a high-level this is outlined as follows:

  1. Define positive and negative labels in terms of key business parameters
  2. Search through past data for positive and negative examples
  3. Make a table of cutoff times and associate each cutoff time with a label

For customer churn, the parameters are:

  • prediction date (cutoff time): the point at which we make a prediction and when we stop using data to make features for the label
  • number of days without a subscription before a user is considered a churn
  • lead time: the number of days or months in the future we want to predict
  • prediction window: the period of time we want to make predictions for

The following diagram shows each of these concepts while filling in the details with the problem definition we’ll work through.

Parameters defining the customer churn prediction problem.

In this case, the customer has churned during the month of January because they went without a subscription for more than 31 days. Because our lead time is one month and the prediction window is also one month, the label of churn is associated with the cutoff time of December 1. For this problem, we are thus teaching our model to predict customer churn one month in advance, giving the customer satisfaction team sufficient time to engage with customers.
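The relationship between cutoff times, lead time, and prediction window can be sketched in a few lines of pandas (dates here are illustrative, not from the KKBOX data):

```python
import pandas as pd

# One prediction per month, on the first of the month ("MS" = month start)
cutoff_times = pd.date_range("2018-01-01", "2018-06-01", freq="MS")
lead = pd.DateOffset(months=1)      # predict one month out ...
window = pd.DateOffset(months=1)    # ... for a one-month window

# Each cutoff maps to the window of time its label describes,
# e.g. cutoff Jan 1 -> predict churn during Feb 1 to Mar 1
schedule = [(c, c + lead, c + lead + window) for c in cutoff_times]
```

Changing the prediction frequency or lead time is then just a change to `freq` or the offsets, not a rewrite of the pipeline.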

From a given dataset, we can frame numerous prediction problems. We might want to predict churn at different dates or frequencies, such as every two weeks or with a lead time of two months, or define churn as a shorter duration without an active membership. Moreover, there are other problems unrelated to churn we could solve with this dataset: predicting how many songs a customer will listen to in the next month; predicting the rate of growth of the customer base; or segmenting customers into groups based on listening habits to tailor a more personal experience.

When we develop the functions for creating labels, we make our inputs parameters so we can quickly make multiple sets of labels from the dataset.

If we develop a pipeline that has parameters instead of hard-coded values, we can rapidly adapt it to different problems. When we want to change the definition of churn, all we need to do is alter the parameter input to our pipeline and re-run it.

Labeling Implementation

To make labels, we develop two functions (full code in the notebook):

label_customer(customer_transactions, prediction_date = "first of month", days_to_churn = 31, lead_time = "1 month", prediction_window = "1 month")

make_labels(all_transactions, prediction_date = "first of month", days_to_churn = 31, lead_time = "1 month", prediction_window = "1 month")

The label_customer function takes in a customer’s transactions and the specified parameters and returns a label times table. This table has a set of prediction times — the cutoff times — and the label during the prediction window for each cutoff time corresponding to a single customer.

As an example, our labels for a customer look like the following:

Label times for one customer.
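To make the labeling logic tangible, here is a deliberately simplified, toy stand-in for label_customer — not the notebook’s implementation — that handles one cutoff time and walks the prediction window checking whether the gap since the last transaction ever exceeds the churn threshold:

```python
import pandas as pd

def label_customer(transaction_dates, cutoff, days_to_churn=31,
                   lead_days=31, window_days=31):
    """Toy labeler: label = 1 (churn) if at any point in the prediction
    window [cutoff + lead, cutoff + lead + window] the customer has gone
    more than days_to_churn days since their last transaction."""
    dates = sorted(pd.to_datetime(d) for d in transaction_dates)
    cutoff = pd.to_datetime(cutoff)
    start = cutoff + pd.Timedelta(days=lead_days)
    end = start + pd.Timedelta(days=window_days)
    for day in pd.date_range(start, end):
        prior = [d for d in dates if d <= day]
        if not prior or (day - prior[-1]).days > days_to_churn:
            return {"cutoff_time": cutoff, "label": 1}
    return {"cutoff_time": cutoff, "label": 0}

# Last payment on November 1, then nothing: churned during January,
# labeled at the December 1 cutoff
churned = label_customer(["2018-10-01", "2018-11-01"], "2018-12-01")
```

Because every business parameter is a function argument, redefining churn or the lead time means changing an input, not rewriting the function.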

The make_labels function then takes in the transactions for all customers along with the parameters and returns a table with the cutoff times and the label for every customer.

When implemented correctly, the end outcome of prediction engineering is a function that can create label times for multiple prediction problems by changing input parameters. These label times — a cutoff time and associated label — are the input to the next stage, in which we make features for each label. The next article will document this step, but for those who want a head start, the Jupyter Notebook is already available!


We haven’t invented the process of prediction engineering, just given it a name and defined a reusable approach for this first part in the pipeline.

The process of prediction engineering is captured in three steps:

  1. Identify a business need that can be solved with available data
  2. Translate the business need into a supervised machine learning problem
  3. Create label times from historical data

Getting prediction engineering right is crucial and requires input from both the business and data science sides of a company. By writing the code for prediction engineering to accept different parameters, we can rapidly change the prediction problem if the needs of the business change.

More generally, our approach to solving problems with machine learning segments the different parts of the pipeline while standardizing each input and output. The end result, as we’ll see, is we can quickly change the prediction problem in the prediction engineering stage without needing to rewrite the subsequent steps. In this article, we’ve developed the first step in a framework that can be used to solve many problems with machine learning.

If building meaningful, high-performance predictive models is something you care about, then get in touch with us at Feature Labs. While this project was completed with the open-source Featuretools, the commercial product offers additional tools and support for creating machine learning solutions.

Read More

How to Create Value with Machine Learning

A General-Purpose Framework for Defining and Solving Meaningful Problems in 3 Steps

Imagine the following scenario: your boss asks you to build a machine learning model to predict every month which customers of your subscription service will churn during the month with churn defined as no active membership for more than 31 days. You painstakingly make labels by finding historical examples of churn, brainstorm and engineer features by hand, then train and manually tune a machine learning model to make predictions.

Pleased with the metrics on the holdout testing set, you return to your boss with the results, only to be told now you must develop a different solution: one that makes predictions every two weeks with churn defined as 14 days of inactivity. Dismayed, you realize none of your previous work can be reused because it was designed for a single prediction problem.

You wrote a labeling function for a narrow definition of churn, and the downstream steps in the pipeline — feature engineering and modeling — were also dependent on the initial parameters and will have to be redone. Due to hard-coding a specific set of values, you’ll have to build an entirely new pipeline to address what is only a small change in the problem definition.

Structuring the Machine Learning Process

This situation is indicative of how solving problems with machine learning is currently approached. The process is ad-hoc and requires a custom solution for each parameter set even when using the same data. The result is companies miss out on the full benefits of machine learning because they are limited to solving a small number of problems with a time-intensive approach.

A lack of standardized methodology means there is no scaffolding for solving problems with machine learning that can be quickly adapted and deployed as parameters to a problem change.

How can we improve this process? Making machine learning more accessible will require a general-purpose framework for setting up and solving problems. This framework should accommodate existing tools, be rapidly adaptable to changing parameters, be applicable across different industries, and provide enough structure to give data scientists a clear path for laying out and working through meaningful problems with machine learning.

At Feature Labs, we’ve put a lot of thought into this issue and developed what we think is a better way to solve useful problems with machine learning. In the next three parts of this series, I’ll lay out how we approach framing and building machine learning solutions in a structured, repeatable manner built around the steps of prediction engineering, feature engineering, and modeling.

We’ll walk through the approach as applied in full to one use case — predicting customer churn — and see how we can adapt the solution if the parameters of the problem change. Moreover, we’ll be able to utilize existing tools — Pandas, Scikit-Learn, Featuretools — commonly used for machine learning.

The general machine learning framework is outlined below:

  1. Prediction Engineering: State the business need, translate into a machine learning problem, and generate labeled examples from a dataset
  2. Feature Engineering: Extract predictor variables — features — from the raw data for each of the labels
  3. Modeling: Train a machine learning model on the features, tune for the business need, and validate predictions before deploying to new data

A general-purpose framework for defining and solving meaningful problems with machine learning

We’ll walk through the basics of each step as well as how to implement them in code. The complete project is available as Jupyter Notebooks on GitHub. (Full disclosure: I work for Feature Labs, a startup developing tooling, including Featuretools, for solving problems with machine learning. All of the work documented here was completed with open-source tools and data.)

Although this project discusses only one application, the same process can be applied across industries to build useful machine learning solutions. The end deliverable is a framework you can use to solve problems with machine learning in any field, and a specific solution that could be directly applied to your own customer churn dataset.

Business Motivation: Make Sure You Solve the Right Problem

The most sophisticated machine learning pipeline will have no impact unless it creates value for a company. Therefore, the first step in framing a machine learning task is understanding the business requirement so you can determine the right problem to solve. Throughout this series, we’ll work through the common problem of addressing customer churn.

For subscription-based business models, predicting which customers will churn — stop paying for a service for a specified period of time — is crucial. Accurately predicting if and when customers will churn lets businesses engage with those who are at risk for unsubscribing or offer them reduced rates as an incentive to maintain a subscription. An effective churn prediction model allows a company to be proactive in growing the customer base.

For the customer churn problem the business need is:

increase the number of paying subscribers by reducing customer churn rates.

Traditional methods of reducing customer churn forecast which customers will churn using survival-analysis techniques, but given the abundance of historical customer behavior data, this presents an ideal application of supervised machine learning.

We can address the business problem with machine learning by building a supervised algorithm that learns from past data to predict customer churn.

Stating the business goal and expressing it in terms of a machine learning-solvable task is the critical first step in the pipeline. Once we know what we want to have the model predict, we can move on to using the available data to develop and solve a supervised machine learning problem.

Next Steps

Over the next three articles, we’ll apply the prediction engineering, feature engineering, and modeling framework to solve the customer churn problem on a dataset from KKBOX, Asia’s largest subscription music streaming service.

Look for the following posts (or check out the GitHub repository):

  1. Prediction Engineering: How to Set Up Your Machine Learning Problem
  2. Feature Engineering: What Powers Machine Learning (coming soon)
  3. Modeling: Training an Algorithm to Make Predictions (coming soon)

We’ll see how to fill in the details with existing data science tools and how to change the prediction problem without rewriting the complete pipeline. By the end, we’ll have an effective model for predicting churn that is tuned to satisfy the business requirement.

Precision-recall curve for model tuned to business need.

Through these articles, we’ll see an approach to machine learning that lets us rapidly build solutions for multiple prediction problems. The next time your boss changes the problem parameters, you’ll be able to have a new solution up and running with only a few lines of changes to the code.

If building meaningful, high-performance predictive models is something you care about, then get in touch with us at Feature Labs. While this project was completed with the open-source Featuretools, the commercial product offers additional tools and support for creating machine learning solutions.

Read More

Recurrent Neural Networks by Example in Python

Using a Recurrent Neural Network to Write Patent Abstracts

The first time I attempted to study recurrent neural networks, I made the mistake of trying to learn the theory behind things like LSTMs and GRUs first. After several frustrating days looking at linear algebra equations, I happened on the following passage in Deep Learning with Python:

In summary, you don’t need to understand everything about the specific architecture of an LSTM cell; as a human, it shouldn’t be your job to understand it. Just keep in mind what the LSTM cell is meant to do: allow past information to be reinjected at a later time.

This was the author of the library Keras (Francois Chollet), an expert in deep learning, telling me I didn’t need to understand everything at the foundational level! I realized that my mistake had been starting at the bottom, with the theory, instead of just trying to build a recurrent neural network.

Shortly thereafter, I switched tactics and decided to try the most effective way of learning a data science technique: find a problem and solve it!

This top-down approach means learning how to implement a method before going back and covering the theory. This way, I’m able to figure out what I need to know along the way, and when I return to study the concepts, I have a framework into which I can fit each idea. In this mindset, I decided to stop worrying about the details and complete a recurrent neural network project.

This article walks through how to build and use a recurrent neural network in Keras to write patent abstracts. The article is light on the theory, but as you work through the project, you’ll find you pick up what you need to know along the way. The end result is you can build a useful application and figure out how a deep learning method for natural language processing works.

The full code is available as a series of Jupyter Notebooks on GitHub. I’ve also provided all the pre-trained models so you don’t have to train them for several hours yourself! To get started as quickly as possible and investigate the models, see the Quick Start to Recurrent Neural Networks, and for in-depth explanations, refer to Deep Dive into Recurrent Neural Networks.

Recurrent Neural Network

It’s helpful to understand at least some of the basics before getting to the implementation. At a high level, a recurrent neural network (RNN) processes sequences — whether daily stock prices, sentences, or sensor measurements — one element at a time while retaining a memory (called a state) of what has come previously in the sequence.

Recurrent means the output at the current time step becomes the input to the next time step. At each element of the sequence, the model considers not just the current input, but what it remembers about the preceding elements.

Overview of RNN (Source)

This memory allows the network to learn long-term dependencies in a sequence, which means it can take the entire context into account when making a prediction, whether that be the next word in a sentence, a sentiment classification, or the next temperature measurement. An RNN is designed to mimic the human way of processing sequences: we consider the entire sentence when forming a response instead of words by themselves. For example, consider the following sentence:

“The concert was boring for the first 15 minutes while the band warmed up but then was terribly exciting.”

A machine learning model that considers the words in isolation — such as a bag of words model — would probably conclude this sentence is negative. An RNN by contrast should be able to see the words “but” and “terribly exciting” and realize that the sentence turns from negative to positive because it has looked at the entire sequence. Reading a whole sequence gives us a context for processing its meaning, a concept encoded in recurrent neural networks.

At the heart of an RNN is a layer made of memory cells. The most popular cell at the moment is the Long Short-Term Memory (LSTM) which maintains a cell state as well as a carry for ensuring that the signal (information in the form of a gradient) is not lost as the sequence is processed. At each time step the LSTM considers the current word, the carry, and the cell state.

LSTM (Long Short Term Memory) Cell (Source)

The LSTM has three gates, each with its own weight vectors: a “forget” gate for discarding irrelevant information, an “input” gate for handling the current input, and an “output” gate for producing predictions at each time step. However, as Chollet points out, it is fruitless trying to assign specific meanings to each of the elements in the cell.

The function of each cell element is ultimately decided by the parameters (weights) which are learned during training. Feel free to label each cell part, but it’s not necessary for effective use! Recall that the benefit of a recurrent neural network for sequence learning is that it maintains a memory of the entire sequence, preventing prior information from being lost.

Problem Formulation

There are several ways we can formulate the task of training an RNN to write text, in this case patent abstracts. However, we will choose to train it as a many-to-one sequence mapper. That is, we input a sequence of words and train the model to predict the very next word. The words will be mapped to integers and then to vectors using an embedding matrix (either pre-trained or trainable) before being passed into an LSTM layer.

When we go to write a new patent, we pass in a starting sequence of words, make a prediction for the next word, update the input sequence, make another prediction, add the word to the sequence and continue for however many words we want to generate.

The steps of the approach are outlined below:

  1. Convert abstracts from list of strings into list of lists of integers (sequences)
  2. Create feature and labels from sequences
  3. Build LSTM model with Embedding, LSTM, and Dense layers
  4. Load in pre-trained embeddings
  5. Train model to predict next word in sequence
  6. Make predictions by passing in starting sequence

Keep in mind this is only one formulation of the problem: we could also use a character level model or make predictions for each word in the sequence. As with many concepts in machine learning, there is no one correct answer, but this approach works well in practice.

Data Preparation

Even with a neural network’s powerful representation ability, getting a quality, clean dataset is paramount. The raw data for this project comes from USPTO PatentsView, where you can search for information on any patent applied for in the United States. I searched for the term “neural network” and downloaded the resulting patent abstracts — 3500 in all. I found it best to train on a narrow subject, but feel free to try with a different set of patents.

Patent Abstract Data

We’ll start out with the patent abstracts as a list of strings. The main data preparation steps for our model are:

  1. Remove punctuation and split strings into lists of individual words
  2. Convert the individual words into integers

These two steps can both be done using the Keras [Tokenizer](https://keras.io/preprocessing/text/#tokenizer) class. By default, this removes all punctuation, lowercases words, and then converts words to sequences of integers. A Tokenizer is first fit on a list of strings and then converts this list into a list of lists of integers. This is demonstrated below:
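As a minimal sketch, the fit-then-convert workflow looks like the following (the two-item abstracts list is a toy stand-in for the real 3,500 patent abstracts):

```python
from keras.preprocessing.text import Tokenizer

# Toy stand-in for the real list of patent abstract strings
abstracts = ['The neural network processes sequences of words.',
             'The network learns a representation for each word.']

# Fit the tokenizer on the strings, then convert each string to integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(abstracts)
sequences = tokenizer.texts_to_sequences(abstracts)

# index_word maps each integer back to its (lowercased) word
words = [tokenizer.index_word[i] for i in sequences[0]]
```

By default the most frequent word gets the lowest index, so the first integer in a sequence of common English text is usually “the”.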

The output of the first cell shows the original abstract and the output of the second the tokenized sequence. Each abstract is now represented as integers.

We can use the idx_word attribute of the trained tokenizer to figure out what each of these integers means:

If you look closely, you’ll notice that the Tokenizer has removed all punctuation and lowercased all the words. If we use these settings, then the neural network will not learn proper English! We can adjust this by changing the filters to the Tokenizer to not remove punctuation.

# Don't remove punctuation or convert to lowercase
tokenizer = Tokenizer(num_words=None, filters='',
                      lower=False, split=' ')

See the notebooks for different implementations, but, when we use pre-trained embeddings, we’ll have to lowercase the words because there are no uppercase letters in the embeddings. When training our own embeddings, we don’t have to worry about this because the model will learn different representations for lower and upper case.

Features and Labels

The previous step converts all the abstracts to sequences of integers. The next step is to create a supervised machine learning problem with which to train the network. There are numerous ways you can set up a recurrent neural network task for text generation, but we’ll use the following:

Give the network a sequence of words and train it to predict the next word.

The number of words is left as a parameter; we’ll use 50 for the examples shown here which means we give our network 50 words and train it to predict the 51st. Other ways of training the network would be to have it predict the next word at each point in the sequence — make a prediction for each input word rather than once for the entire sequence — or train the model using individual characters. The implementation used here is not necessarily optimal — there is no accepted best solution — but it works well!

Creating the features and labels is relatively simple and for each abstract (represented as integers) we create multiple sets of features and labels. We use the first 50 words as features with the 51st as the label, then use words 2–51 as features and predict the 52nd and so on. This gives us significantly more training data which is beneficial because the performance of the network is proportional to the amount of data that it sees during training.

The implementation of creating features and labels is below:
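The windowing logic can be sketched in plain Python (toy numbers here; the real model slides 50-word windows over all tokenized abstracts):

```python
import numpy as np

# Toy stand-in: one "abstract" of 59 integer tokens; real windows are 50 long
sequences = [list(range(1, 60))]
training_length = 5

features, labels = [], []
for seq in sequences:
    # Slide a window over each abstract: words [i, i + length) are the
    # features and the word at position i + length is the label
    for i in range(len(seq) - training_length):
        features.append(seq[i:i + training_length])
        labels.append(seq[i + training_length])

features = np.array(features)
labels = np.array(labels)
# A 59-token abstract yields 54 (feature, label) pairs of length 5
```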

The features end up with shape (296866, 50) which means we have almost 300,000 sequences each with 50 tokens. In the language of recurrent neural networks, each sequence has 50 timesteps each with 1 feature.

We could leave the labels as integers, but a neural network is able to train most effectively when the labels are one-hot encoded. We can one-hot encode the labels with numpy very quickly using the following:

To find the word corresponding to a row in label_array , we use:
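Both the one-hot step and the reverse lookup can be sketched with numpy (the sizes and the idx_word dictionary here are toy stand-ins):

```python
import numpy as np

num_words = 10                           # toy vocabulary size
labels = np.array([3, 1, 7])             # toy integer labels
idx_word = {i: 'word_%d' % i for i in range(num_words)}  # toy mapping

# One-hot encode: one row per label with a single 1 in the label's column
label_array = np.zeros((len(labels), num_words), dtype=np.int8)
label_array[np.arange(len(labels)), labels] = 1

# Recover the word for a row by taking the argmax of that row
first_word = idx_word[np.argmax(label_array[0])]
```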

After getting all of our features and labels properly formatted, we want to split them into a training and validation set (see notebook for details). One important point here is to shuffle the features and labels simultaneously so the same abstracts do not all end up in one set.

Building a Recurrent Neural Network

Keras is an incredible library: it allows us to build state-of-the-art models in a few lines of understandable Python code. Although other neural network libraries may be faster or allow more flexibility, nothing can beat Keras for development time and ease-of-use.

The code for a simple LSTM is below with an explanation following:
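A sketch of the architecture (hyperparameter values such as the 64-cell LSTM are illustrative, and the random embedding_matrix stands in for the real pre-trained weights):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Masking, LSTM, Dense, Dropout

num_words = 5000                                   # toy vocabulary size
embedding_matrix = np.random.rand(num_words, 100)  # stand-in for GloVe vectors

model = Sequential()
# Map each word index to a 100-d vector; trainable=False freezes the weights
model.add(Embedding(input_dim=num_words, output_dim=100, trainable=False))
# Mask timesteps whose embedded vector is all zeros (words with no embedding)
model.add(Masking(mask_value=0.0))
# The recurrent layer; a single LSTM layer does not return sequences
model.add(LSTM(64, return_sequences=False, dropout=0.1,
               recurrent_dropout=0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
# One probability per vocabulary word
model.add(Dense(num_words, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Load the pre-trained vectors into the (frozen) Embedding layer
model.build(input_shape=(None, 50))
model.layers[0].set_weights([embedding_matrix])
```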

We are using the Keras Sequential API which means we build the network up one layer at a time. The layers are as follows:

  • An Embedding which maps each input word to a 100-dimensional vector. The embedding can use pre-trained weights (more in a second), which we supply in the weights parameter. trainable can be set to False if we don’t want to update the embeddings.
  • A Masking layer to mask any words that do not have a pre-trained embedding which will be represented as all zeros. This layer should not be used when training the embeddings.
  • The heart of the network: a layer of LSTM cells with dropout to prevent overfitting. Since we are using only one LSTM layer, it does not return the sequences; when stacking two or more LSTM layers, make sure every layer except the last returns sequences.
  • A fully-connected Dense layer with relu activation. This adds additional representational capacity to the network.
  • A Dropout layer to prevent overfitting to the training data.
  • A Dense fully-connected output layer. This produces a probability for every word in the vocab using softmax activation.

The model is compiled with the Adam optimizer (a variant on Stochastic Gradient Descent) and trained using the categorical_crossentropy loss. During training, the network will try to minimize the log loss by adjusting the trainable parameters (weights). As always, the gradients of the parameters are calculated using back-propagation and updated with the optimizer. Since we are using Keras, we don’t have to worry about how this happens behind the scenes, only about setting up the network correctly.

LSTM network layout.

Without updating the embeddings, there are many fewer parameters to train in the network. The input to the [LSTM layer](https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/) is (None, 50, 100), which means that for each batch (the first dimension), each sequence has 50 timesteps (words), each of which has 100 features after embedding. Input to an LSTM layer always has the (batch_size, timesteps, features) shape.

There are many ways to structure this network and there are several others covered in the notebook. For example, we can use two LSTM layers stacked on each other, a Bidirectional LSTM layer that processes sequences from both directions, or more Dense layers. I found the set-up above to work well.

Pre-Trained Embeddings

Once the network is built, we still have to supply it with the pre-trained word embeddings. There are numerous embeddings you can find online trained on different corpora (large bodies of text). The ones we’ll use are available from Stanford and come in 100, 200, or 300 dimensions (we’ll stick to 100). These embeddings are from the GloVe (Global Vectors for Word Representation) algorithm and were trained on Wikipedia.

Even though the pre-trained embeddings contain 400,000 words, there are some words in our vocab that are not included. When we represent these words with embeddings, they will have 100-d vectors of all zeros. This problem can be overcome by training our own embeddings or by setting the Embedding layer’s trainable parameter to True (and removing the Masking layer).

We can quickly load in the pre-trained embeddings from disk and make an embedding matrix with the following code:

What this does is assign a 100-dimensional vector to each word in the vocab. If the word has no pre-trained embedding then this vector will be all zeros.

To explore the embeddings, we can use the cosine similarity to find the words closest to a given query word in the embedding space:
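That lookup can be sketched as follows (the helper name and the toy 2-d vectors are illustrative):

```python
import numpy as np

def closest_words(query, embedding_matrix, word_index, idx_word, n=5):
    """Return the n words most cosine-similar to the query word."""
    vec = embedding_matrix[word_index[query]]
    # Cosine similarity of the query vector against every row at once
    norms = np.linalg.norm(embedding_matrix, axis=1) * np.linalg.norm(vec)
    sims = embedding_matrix @ vec / np.where(norms == 0, 1, norms)
    # Most similar first, skipping rows with no word and the query itself
    order = np.argsort(sims)[::-1]
    return [idx_word[i] for i in order
            if i in idx_word and idx_word[i] != query][:n]

# Toy "embeddings": 'queen' points nearly the same way as 'king'
word_index = {'king': 1, 'queen': 2, 'apple': 3}
idx_word = {v: k for k, v in word_index.items()}
emb = np.array([[0.0, 0.0], [1.0, 0.9], [0.9, 1.0], [-1.0, 0.1]])
```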

Embeddings are learned which means the representations apply specifically to one task. When using pre-trained embeddings, we hope the task the embeddings were learned on is close enough to our task so the embeddings are meaningful. If these embeddings were trained on tweets, we might not expect them to work well, but since they were trained on Wikipedia data, they should be generally applicable to a range of language processing tasks.

If you have a lot of data and the computer time, it’s usually better to learn your own embeddings for a specific task. In the notebook I take both approaches and the learned embeddings perform slightly better.

Training the Model

With the training and validation data prepared, the network built, and the embeddings loaded, we are almost ready for our model to learn how to write patent abstracts. However, good steps to take when training neural networks are to use ModelCheckpoint and EarlyStopping in the form of Keras callbacks:

  • Model Checkpoint: saves the best model (as measured by validation loss) to disk so the best-performing weights can be loaded later
  • Early Stopping: halts training when validation loss is no longer decreasing

Using Early Stopping means we won’t overfit to the training data and waste time training for extra epochs that don’t improve performance. The Model Checkpoint means we can access the best model and, if our training is disrupted 1000 epochs in, we won’t have lost all the progress!

The model can then be trained with the following code:
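A runnable sketch of the callbacks plus the training call, with a tiny toy model and random data standing in for the real network and features (batch size and epoch counts here are illustrative):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Toy stand-ins so the sketch runs end to end
num_words, seq_len = 20, 5
X_train = np.random.randint(1, num_words, (32, seq_len))
y_train = np.eye(num_words)[np.random.randint(0, num_words, 32)]
X_valid = np.random.randint(1, num_words, (8, seq_len))
y_valid = np.eye(num_words)[np.random.randint(0, num_words, 8)]

model = Sequential([Embedding(num_words, 8), LSTM(8),
                    Dense(num_words, activation='softmax')])
model.compile(optimizer='adam', loss='categorical_crossentropy')

callbacks = [
    # Stop when validation loss has not improved for 5 straight epochs
    EarlyStopping(monitor='val_loss', patience=5),
    # Keep only the best model (by validation loss) on disk
    ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True),
]

history = model.fit(X_train, y_train, batch_size=16, epochs=2,
                    callbacks=callbacks,
                    validation_data=(X_valid, y_valid), verbose=0)
```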

On an Amazon p2.xlarge instance ($0.90 / hour reserved), this took just over 1 hour to finish. Once the training is done, we can load back in the best saved model and evaluate a final time on the validation data.

from keras.models import load_model
# Load in model and evaluate on validation data
model = load_model('../models/model.h5')
model.evaluate(X_valid, y_valid)

Overall, the model using pre-trained word embeddings achieved a validation accuracy of 23.9%. This is pretty good considering as a human I find it extremely difficult to predict the next word in these abstracts! A naive guess of the most common word (“the”) yields an accuracy around 8%. The metrics for all the models in the notebook are shown below:

The best model used pre-trained embeddings and the same architecture as shown above. I’d encourage anyone to try training with a different model!

Patent Abstract Generation

Of course, while high metrics are nice, what matters is whether the network can produce reasonable patent abstracts. Using the best model, we can explore its generation ability. If you want to run this on your own hardware, you can find the notebook here and the pre-trained models are on GitHub.

To produce output, we seed the network with a random sequence chosen from the patent abstracts, have it make a prediction of the next word, add the prediction to the sequence, and continue making predictions for however many words we want. Some results are shown below:

One important parameter for the output is the diversity of the predictions. Instead of always using the predicted word with the highest probability, we rescale the predicted distribution by a diversity parameter (often called the temperature) and sample the next word from the rescaled probabilities. Too high a diversity and the generated output starts to seem random, but too low and the network can get into recursive loops of output.
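This sampling step can be sketched as a small function (the name and default value are illustrative):

```python
import numpy as np

def sample_next_word(probs, diversity=0.75):
    """Sample the index of the next word from the network's predicted
    probabilities, rescaled by a diversity (temperature) parameter.

    diversity > 1 flattens the distribution toward random choices;
    diversity < 1 sharpens it toward the highest-probability word."""
    probs = np.asarray(probs, dtype='float64')
    # Rescale in log space, then renormalize to a valid distribution
    logits = np.log(probs + 1e-12) / diversity
    rescaled = np.exp(logits) / np.sum(np.exp(logits))
    return int(np.random.choice(len(rescaled), p=rescaled))
```

In the generation loop, this replaces a plain argmax over the softmax output; the chosen index is appended to the input sequence before the next prediction.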

The output isn’t too bad! Some of the time it’s tough to determine which is computer generated and which is written by a human. Part of this is due to the nature of patent abstracts, which, most of the time, don’t sound like they were written by a person.

Another use of the network is to seed it with our own starting sequence. We can use any text we want and see where the network takes it:

Again, the results are not entirely believable but they do resemble English.

Human or Machine?

As a final test of the recurrent neural network, I created a game to guess whether the model or a human generated the output. Here’s the first example where two of the options are from a computer and one is from a human:

What’s your guess? The answer is that the second is the actual abstract written by a person (well, it’s what was actually in the abstract. I’m not sure these abstracts are written by people). Here’s another one:

This time the third had a flesh and blood writer.

There are additional steps we can use to interpret the model such as finding which neurons light up with different input sequences. We can also look at the learned embeddings (or visualize them with the Projector tool). We’ll leave those topics for another time, and conclude that we know now how to implement a recurrent neural network to effectively mimic human text.


It’s important to recognize that the recurrent neural network has no concept of language understanding. It is effectively a very sophisticated pattern recognition machine. Nonetheless, unlike methods such as Markov chains or frequency analysis, the RNN makes predictions based on the ordering of elements in the sequence. Getting a little philosophical here, you could argue that humans are simply extreme pattern recognition machines and therefore the recurrent neural network is only acting like a human.

The uses of recurrent neural networks go far beyond text generation to machine translation, image captioning, and authorship identification. Although the application covered here will not displace any humans, it’s conceivable that with more training data and a larger model, a neural network would be able to synthesize new, reasonable patent abstracts.

A Bi-Directional LSTM Cell (Source)

It can be easy to get stuck in the details or the theory behind a complex technique, but a more effective method for learning data science tools is to dive in and build applications. You can always go back later and catch up on the theory once you know what a technique is capable of and how it works in practice. Most of us won’t be designing neural networks, but it’s worth learning how to use them effectively. This means putting away the books, breaking out the keyboard, and coding up your very own network.

As always, I welcome feedback and constructive criticism. I can be reached on Twitter @koehrsen_will or through my website at willk.online.

Read More



Overcome Your Biases with Data

We’re awful at viewing the world objectively. Data can help.

There’s a pervasive myth — perhaps taught to you by an economics course — that humans are rational. The traditional view is we objectively analyze the world, draw accurate conclusions, and make decisions in our best interest. While few people completely buy into this argument anymore, we are still often unaware of our cognitive biases, with the result that we vote, spend money, and form opinions based on a distorted view of the world.

A recent personal experience where I badly misjudged reality — due to cognitive illusions — brought home this point and demonstrated the importance of fact-checking our views of the world. While this situation had no negative consequences, it was a great reminder that we are all subject to powerful biases and personal opinions are no substitute for checking the data.

Shortly after moving to Boston, I thought I noticed a striking phenomenon: loads of people smoking. After a few days, it seemed to me that every street corner was filled with people lighting up cigarettes. Having come from a small midwestern town where it was exceedingly rare to see anyone smoking, I was dismayed: maybe the big city encouraged negative vices I would eventually pick up, or worse, smoking rates were on the rise nationwide.

While a few decades ago I would have had no option but to either persist in this belief or painstakingly look for demographic data in a library, now I was able to find verified data from the Centers for Disease Control and Prevention within seconds. To my surprise, and dealing a large blow to my rational view of myself, I found the following table comparing smoking rates in the metro area nearest my small town (Peoria, IL) to those in Boston:

Source: CDC

Not only was I wrong, I was significantly wrong as indicated by the non-overlapping 95% confidence intervals. (Although we tend to focus on a single number, considering uncertainty estimates is crucial especially when dealing with real-world demographic data). To show visually how wrong I was, even accounting for the uncertainty, I made the following smoking rate boxplots:

Boxplot of Smoking Rates

Why was I so wrong? I’m a firm believer in analyzing your mistakes so you don’t make them again and, in this process, I came up with three reasons:

  1. Availability heuristic: we judge how likely something is by the number of occurrences of it we can bring to memory.
  2. Confirmation bias: once we have a belief, we unconsciously seek out evidence that confirms it and ignore evidence that contradicts it.
  3. Denominator neglect: we look at only the numerator — the number of smokers — and ignore the denominator — the total people we see over the course of a day.

These are all examples of cognitive biases — mistakes in reasoning and deviations from rational decision making — or heuristics — mental shortcuts (rules of thumb) we use to quickly make judgements. While these served us well in our evolutionary past, they often fail us in today’s world. We are now required to process many streams of information with complex interacting factors and our fast intuitions are not adapted for this purpose.

(For the definitive reference on cognitive biases and how to overcome them, read Daniel Kahneman’s masterwork Thinking, Fast and Slow. A less intimidating format for learning these is the You Are Not So Smart podcast.)

One of the simplest ways to correct for our innate shortcomings is to fact-check ourselves. Especially in an age with so much accurate information freely available there is no excuse for persisting in false beliefs. Instead of reasoning from personal experience/anecdotes, look up the actual numbers!

Moreover, in addition to just figuring out the right answer, it’s important to think about why we erred. We’re never going to rid ourselves of cognitive biases, but we can learn to recognize when they occur and how to overcome them. For example, I should have noticed the people who weren’t smoking (the denominator), or thought about the total number of people I see in my small town every day compared to the number of people I observed in Boston.

Building on that last point, while looking at data by itself is useful, trying to understand what it means in the context of your life can be more helpful. This is where some statistics and basic data manipulation can go a long way.

On an average day in my home town, I probably saw about 50 people walking around (okay no one walks in the Midwest but stay with me) compared to Boston with maybe 100 times as many at 5,000. Knowing the smoking rate and the total number of people I expect to see, I simulated 10,000 days to find out how many smokers I would expect to see on a day in Boston versus my hometown. (Jupyter Notebook available on GitHub).
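The simulation boils down to one binomial draw per day; a sketch with illustrative smoking rates (substitute the CDC figures from the table above for the real numbers):

```python
import numpy as np

rng = np.random.default_rng(42)
days = 10_000

# People seen per day and assumed smoking rates (illustrative stand-ins
# for the CDC figures)
hometown_people, hometown_rate = 50, 0.22
boston_people, boston_rate = 5_000, 0.15

# Each day, the number of smokers seen is one binomial draw
hometown_smokers = rng.binomial(hometown_people, hometown_rate, size=days)
boston_smokers = rng.binomial(boston_people, boston_rate, size=days)

# Despite the lower rate, the far larger denominator means many more
# smokers are seen per day in Boston
```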

Simulation results for 10,000 days in my hometown versus Boston.

Even though my hometown has a statistically greater proportion of smokers, in terms of raw numbers, on an average day, I’d see about 100 times the number of smokers in Boston. These graphs, coupled with my neglect of the denominator, show why I was so susceptible to the availability heuristic.

Why This Matters

While this small example is innocuous, the general impact of our biases is pervasive and often detrimental. For instance, because of a predominantly negative news cycle (manifesting in the availability heuristic), people generally think the world is getting worse. In fact, by almost all objective measures, we are living at the best time in human history and things are improving. (All graphs from Enlightenment Now by Steven Pinker.)

Graphs showing positive gains in numerous measures worldwide.

This has real-world implications: people vote for leaders who promise a return to better times because they haven’t looked at the data and realized the best time is now! Moving away from politics, think about your personal life: is there a relationship you’ve spent too long in, a negative job you’ve stuck with, or even a book you continue reading despite not enjoying? Then you’ve fallen victim to the Sunk-Cost Fallacy, where we continue squandering time on an endeavor because of the effort we’ve already put in.

As another example, if you find yourself worrying about air travel, instead of reading about the minuscule number of plane crashes, look at the data showing flying is the safest way to travel.

I’ve tried to adopt two simple rules for myself to mitigate cognitive biases:

  1. Look up relevant data: try to find multiple reputable sources, consider uncertainty estimates, and explore the data yourself when available.
  2. Seek out disconfirming evidence: when everything you read confirms your beliefs, it’s probably time to read something else.

Following these guidelines won’t make me perfect, but they are helping me gradually become less wrong. I don’t believe there’s always one objective truth, but I do think facts are much better than our subjective judgements.

To end this article on a happy note, here’s a graph showing the decline in smoking rates in the United States (despite my misbelief)!

National Smoking Rates in the United States (Source: Gallup). There’s no substitute for accurate data.

As always, I welcome feedback and constructive criticism. I can be reached on Twitter @koehrsen_will or through my website at willk.online.
