Featuretools on Spark

Distributed feature engineering in Featuretools with Spark

Apache Spark is one of the most popular technologies in the big data landscape. As a framework for distributed computing, it allows users to scale to massive datasets by running computations in parallel, either on a single machine or on clusters of thousands of machines. Spark can be used with Scala, R, Java, SQL, or Python code, and its capabilities have led to rapid adoption as the size of datasets, and the need for methods to work with them, increases.

After using Dask to scale automated feature engineering with Featuretools by running calculations in parallel on a single multi-core machine, we wanted to see if we could use a similar approach with Spark to scale to a cluster of multiple machines. While Dask can also be used for cluster computing, we wanted to demonstrate that Featuretools can run on multiple distributed computing frameworks. The same feature engineering code that runs in parallel using Dask requires no modification to also be distributed with Spark.
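
To make that last point concrete, here is a minimal sketch in plain Python. It does not use the Featuretools, Dask, or Spark APIs. It assumes the data has already been split into independent partitions, and the toy data and the function names `features_for_partition` and `run` are hypothetical. Because the feature function depends only on its input, it can be handed to any executor that offers a `map`; a thread pool from the standard library stands in for the cluster here.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical toy data: each partition maps a customer id to purchase amounts.
partitions = [
    {"c1": [10.0, 20.0], "c2": [5.0]},
    {"c3": [1.0, 2.0, 3.0]},
]

def features_for_partition(partition):
    """Compute simple aggregate features for every customer in one partition."""
    return {
        customer: {"count": len(amounts), "mean": mean(amounts), "total": sum(amounts)}
        for customer, amounts in partition.items()
    }

def run(map_fn, parts):
    """Apply the feature function with whatever `map` the caller provides."""
    merged = {}
    for result in map_fn(features_for_partition, parts):
        merged.update(result)
    return merged

# Serial run with the built-in map.
serial = run(map, partitions)

# Parallel run: only the map implementation changes, not the feature code.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = run(pool.map, partitions)
```

In this sketch the serial and parallel runs produce the same feature dictionary. With a real framework, `map_fn` would be replaced by that framework's own mechanism for applying a function to each partition.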

Apache Spark is a framework for distributed computing and big data processing.

In this article, we’ll see how to use Spark with PySpark to run Featuretools on a computing cluster to scale to even larger datasets. The code for this article is available as a Jupyter Notebook on GitHub.

Read More

Wikipedia Data Science: Working with the World’s Largest Encyclopedia

How to programmatically download and parse Wikipedia

Wikipedia is one of modern humanity’s most impressive creations. Who would have thought that in just a few years, anonymous contributors working for free could create the greatest source of online knowledge the world has ever seen? Not only is Wikipedia the best place to get information for writing your college papers, but it’s also an extremely rich source of data that can fuel numerous data science projects from natural language processing to supervised machine learning.

The size of Wikipedia makes it both the world’s largest encyclopedia and slightly intimidating to work with. However, size is not an issue with the right tools, and in this article, we’ll walk through how we can programmatically download and parse all of the English-language Wikipedia.

Along the way, we’ll cover a number of useful topics in data science:

  1. Finding and programmatically downloading data from the web
  2. Parsing web data (HTML, XML, MediaWiki) using Python libraries
  3. Running operations in parallel with multiprocessing/multithreading
  4. Benchmarking methods to find the optimal solution to a problem
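
As a rough illustration of topics 2 through 4, the sketch below parses small XML strings with the standard library, runs the parser both serially and with a thread pool, and times each approach. The XML chunks are hypothetical in-memory stand-ins for downloaded files, and the helper names `parse_titles` and `timed` are made up for this example.

```python
import time
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for downloaded XML chunks.
chunks = [
    "<pages><page><title>Python</title></page><page><title>Spark</title></page></pages>",
    "<pages><page><title>Wikipedia</title></page></pages>",
]

def parse_titles(xml_text):
    """Return the text of every <title> element in one XML chunk."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.iter("title")]

def timed(label, fn):
    """Run fn once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.6f} s")
    return result

def parse_serial():
    return [parse_titles(chunk) for chunk in chunks]

def parse_threaded():
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(parse_titles, chunks))

serial_titles = timed("serial", parse_serial)
threaded_titles = timed("threaded", parse_threaded)
```

On inputs this small the thread pool is unlikely to be faster, which is exactly why benchmarking matters. For CPU-bound parsing of large files, a process pool is usually the more promising option to measure.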

Read More

Converting Medium Articles to Markdown

How to quickly export Medium articles to your blog

If, like me, you got your start blogging on Medium but also want to build your own website to display your articles, you’ll need a way to convert articles from Medium to Markdown. Markdown is a lightweight markup language meant to be converted into HTML for the web, and there are several tools that allow you to go from existing Medium articles to Markdown for a blog.

(If you don’t yet have a blog, then follow this guide to build your own website in five minutes using Jekyll and GitHub Pages.)

Medium to Markdown Tools

There are both a Chrome extension and a command-line tool for converting your Medium posts to Markdown. Unfortunately, I’ve found the Chrome extension to be unreliable, and even when it does work, it makes a number of formatting errors that require correcting.

If you can get the Chrome extension to work and you aren’t comfortable at the command line, then that is probably the best choice for you. However, I’ve found the command-line tool to be better for my use because it works every time and requires fewer adjustments to the text after running.

Read More

Five Minutes to Your Own Website

How to use GitHub Pages and Jekyll to get started with your own (entirely free) blog

Building your own website is rewarding on several levels. There’s the opportunity to showcase your work to friends, family, and potential employers, the pride in making something, and the freedom to shape a (very small) part of the web to your tastes.

While Medium is a great option to start blogging because the limited features let you focus on writing, eventually, like me, you’ll want your own website to serve as a central location for your work. Fortunately, we live in a great age for creativity where you can use free tools to build a website in minutes.

In this post, we’ll see how to use the Jekyll site generator and GitHub Pages to build and publish a simple blog with no coding required. If you want an idea of the end product, you can take a look at my (work in progress) site.

Read More

Another Machine Learning Walkthrough and a Challenge

Don’t just read about machine learning — practice it!

After spending considerable time and money on courses, books, and videos, I’ve arrived at one conclusion: the most effective way to learn data science is by doing data science projects. Reading, listening, and taking notes are valuable, but it’s not until you work through a problem that concepts solidify from abstractions into tools you feel confident using.

In this article, I’ll present another machine learning walk-through in Python and also leave you with a challenge: try to develop a better solution (some helpful tips are included)! The complete Jupyter Notebook for this project can be run on Kaggle — no download required — or accessed on GitHub.

Read More