How To Avoid Common Difficulties In Your Data Science Programming Environment


Reduce the incidental issues in your programming environment so you can focus on the important data science problems.

Consider the following situation: you’re trying to practice your soccer skills, but each time you take to the field, you encounter some problems: your shoes are on the wrong feet, the laces aren’t tied correctly, your socks are too short, your shorts are too long, and the ball is the wrong size. This is a ridiculous situation, but it’s analogous to the one many data scientists find themselves in due to a few common, easily solvable issues:

  • Failure to manage library dependencies
  • Inconsistent code style
  • Inconsistent naming conventions
  • Different development environments across a team
  • Not using an integrated development environment for code editing

All of these mistakes “trip” you up, costing you time and valuable mental resources worrying about small details. Instead of solving data science problems, you find yourself struggling with incidental difficulties trying to set up your environment or get your code to run. Fortunately, the above issues are simple to fix with the right tooling and approach. In this article, we’ll look at best practices for a data science programming environment that will give you more time and concentration for working on the problems that matter.

Read More

Data Scientists: Your Variable Names Are Awful. Here’s How To Fix Them

Wading your way through data science code is like hacking through a jungle. (Source)

A Simple Way to Greatly Improve Code Quality

Quick, what does the following code do?

for i in range(n):
    for j in range(m):
        for k in range(l):
            temp_value = X[i][j][k] * 12.5
            new_array[i][j][k] = temp_value + 150

It’s impossible to tell, right? If you were trying to modify or debug this code, you’d be at a loss unless you could read the author’s mind. Even if you were the author, a few days after writing this code you wouldn’t know what it does because of the unhelpful variable names and use of “magic” numbers.

Working with data science code, I often see examples like the one above (or worse): code with variable names such as X, y, xs, x1, x2, tp, tn, clf, reg, xi, yi, ii and numerous unnamed constant values. To put it frankly, data scientists (myself included) are terrible at naming variables, when we go to the trouble of naming them at all.

As I’ve grown from writing research-oriented data science code for one-off analyses to production-level code (at Cortex Building Intel), I’ve had to improve my programming by unlearning practices from data science books, courses, and the lab. There are many differences between machine learning code that can be deployed and how data scientists are taught to program, but we’ll start here by focusing on two common problems with a large impact:

  • Unhelpful/confusing/vague variable names
  • Unnamed “magic” constant numbers

Both these problems contribute to the disconnect between data science research (or Kaggle projects) and production machine learning systems. Yes, you can get away with them in a Jupyter Notebook that runs once, but when you have mission-critical machine-learning pipelines running hundreds of times per day with no errors, you have to write readable and understandable code. Fortunately, there are best practices from software engineering we data scientists can adopt to this end, including the ones we’ll cover in this article.
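As a minimal sketch of what fixing both problems might look like, here is the triple loop from earlier rewritten with descriptive names and named constants. The meanings given to 12.5 and 150 (a calibration gain and offset) and every name in the block are hypothetical, chosen only for illustration; the original snippet does not say what its values represent.

```python
# Hypothetical meanings: the original snippet never says what 12.5 and 150 are.
CALIBRATION_GAIN = 12.5     # assumed: scale factor applied to each raw reading
CALIBRATION_OFFSET = 150    # assumed: constant added after scaling


def calibrate_readings(raw_readings):
    """Apply the gain and offset to every value in a three-level nested list."""
    return [
        [
            [reading * CALIBRATION_GAIN + CALIBRATION_OFFSET for reading in row]
            for row in grid
        ]
        for grid in raw_readings
    ]


# Small made-up input with the same nesting depth as X[i][j][k].
raw_readings = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
calibrated_readings = calibrate_readings(raw_readings)
```

The arithmetic is unchanged; the difference is that a reader can now guess the intent from the names alone and change a constant in one place.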

Read More

Notes On Software Construction From Code Complete


Lessons from “Code Complete: A Practical Handbook of Software Construction” with applications to data science

When people ask about the hardest part of my job as a data scientist, they often expect me to say building machine learning models. Given that all of our ML modeling is done in about 3 lines:

from sklearn import model

model.fit(training_features, training_targets)
predictions = model.predict(testing_features)

I reply that machine learning is one of the easier parts of the job. Rather, the hardest part of being a data scientist in industry is the software engineering required to build the infrastructure that goes into running machine learning models continuously in production.
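The three-line pattern above is schematic (scikit-learn has no literal `model` object to import). A runnable version might look like the sketch below, which assumes `LinearRegression` as the estimator and uses synthetic, noise-free data; both choices are illustrative and not taken from the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(0)
true_weights = np.array([2.0, -1.0, 0.5])
training_features = rng.normal(size=(100, 3))
training_targets = training_features @ true_weights  # noise-free linear target
testing_features = rng.normal(size=(10, 3))

# The same fit/predict pattern, with a concrete estimator swapped in.
model = LinearRegression()
model.fit(training_features, training_targets)
predictions = model.predict(testing_features)
```

Even in runnable form, the modeling itself stays at a handful of lines, which is the point being made: the effort goes into everything around these calls.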

Starting out (at Cortex Building Intel), I could write a good Jupyter Notebook for a one-time machine learning project, but I had no idea what it meant to “run machine learning in production,” let alone how to do it. Half a year in, and having built several ML systems making predictions around the clock to help engineers run buildings more efficiently, I’ve learned it takes a whole lot of software construction and a tiny bit of data science. Moreover, while there are not yet standard practices in data science, there are time-tested best practices for writing software that can help you be more effective as a programmer.

With a relative lack of software engineering skills entering my job, I’ve had to learn quickly. Much of that came from interacting with other software engineers and soaking up their knowledge, but some of it has also come from resources such as textbooks and online tutorials. One of those textbooks is the 900-page masterwork on constructing quality software, Code Complete: A Practical Handbook of Software Construction by Steve McConnell. In this article, I want to outline the high-level points regarding software construction I took away from reading this book. These are as follows:

  1. Thoroughly plan your project before touching a keyboard
  2. Write readable code because it’s read more than it’s written
  3. Reduce the complexity of your programs to free mental capacity
  4. Test and review every line of code in a program
  5. Be an egoless programmer
  6. Iterate on your designs and repeatedly measure progress

Read More

Master’s In Computer Science At Georgia Tech: Personal Statement


Why I’m pursuing an advanced degree in computer science

Author’s Note: this is my personal statement for application to Georgia Tech’s Online Master’s in Computer Science (OMSCS). This degree, ranked 8th in the country for Computer Science, is the best deal in graduate education (at least in the United States), coming in under $7,000 (compared to over $70,000 for degrees at lower-rated institutions). It’s designed for working professionals, which means I’ll be working full-time at Cortex Building Intel as I pursue the degree. While this is still a work in progress, and I haven’t yet been accepted, I thought I’d share it; any feedback is much appreciated. If I get in, I’m very much looking forward to continuing my education, which is a must for anyone in the field of data science!

Update July 12, 2019: I have been accepted into the program. I will be attending the Online Master’s in Computer Science starting January 2020.

Read More

How To Generate Prediction Intervals With Scikit Learn And Python


Using the Gradient Boosting Regressor to show uncertainty in machine learning estimates

“All models are wrong but some are useful” (George Box). It’s critical to keep this sage advice in mind when we present machine learning predictions. With all machine learning pipelines, there are limitations: features which affect the target that are not in the data (latent variables), or assumptions made by the model which don’t align with reality. These are overlooked when we show a single exact number for a prediction (the house will be $450,300.01), which gives the impression we are entirely confident our model is a source of truth.

A more honest way to show predictions from a model is as a range of estimates: there might be a most likely value, but there is also a wide interval where the real value could be. This isn’t a topic typically addressed in data science courses, but it’s crucial that we show uncertainty in predictions and don’t oversell the capabilities of machine learning. While people crave certainty, I think it’s better to show a wide prediction interval that does contain the true value than an exact estimate which is far from reality.

In this article, we’ll walk through one method of producing uncertainty intervals in Scikit-Learn. The full code is available on GitHub with an interactive version of the Jupyter Notebook on nbviewer. We’ll focus primarily on implementation, with a brief section and resources for understanding the theory at the end. Generating prediction intervals is another tool in the data science toolbox, one critical for earning the trust of non-data-scientists.
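As a rough preview of the idea, one common way to get an interval from scikit-learn’s `GradientBoostingRegressor` is to fit separate models with the quantile loss for a lower and an upper quantile, plus a default model for the central estimate. The sketch below assumes that approach; the synthetic data, the 0.1 and 0.9 quantiles, and all variable names are illustrative assumptions rather than details taken from the full article.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic one-feature regression data, assumed for illustration only.
rng = np.random.default_rng(42)
features = rng.uniform(0, 10, size=(500, 1))
targets = 3.0 * features[:, 0] + rng.normal(scale=2.0, size=500)

LOWER_QUANTILE = 0.1  # assumed lower bound of the interval
UPPER_QUANTILE = 0.9  # assumed upper bound of the interval

# One model per quantile, plus a default (squared-error) model for the middle.
lower_model = GradientBoostingRegressor(
    loss="quantile", alpha=LOWER_QUANTILE, random_state=0
)
upper_model = GradientBoostingRegressor(
    loss="quantile", alpha=UPPER_QUANTILE, random_state=0
)
mid_model = GradientBoostingRegressor(random_state=0)

for fitted_model in (lower_model, mid_model, upper_model):
    fitted_model.fit(features, targets)

# Predict a lower bound, central estimate, and upper bound for new points.
new_features = np.array([[2.0], [5.0], [8.0]])
lower_bound = lower_model.predict(new_features)
mid_estimate = mid_model.predict(new_features)
upper_bound = upper_model.predict(new_features)
```

Each prediction can then be reported as a range from `lower_bound` to `upper_bound` around `mid_estimate` instead of as a single exact number.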

Prediction intervals we’ll make in this walkthrough.

Read More