Introduction to Machine Learning, Part 4: Getting Started with Machine Learning

From the series: Introduction to Machine Learning

With machine learning, there’s rarely a straight line from start to finish—you’ll find yourself trying different ideas and approaches.

Today, we’ll walk through a machine learning workflow step by step, and we’ll focus on a few key decision points along the way.

Every machine learning workflow begins with three questions:

  • What kind of data are you working with?
  • What insights do you want to get from it?
  • How and where will those insights be applied?

The example in this video is based on a cell phone health-monitoring app. The input consists of sensor data from the phone’s accelerometer and gyroscope.

The responses are the activities performed: walking, standing, running, climbing stairs, or lying down. We want to use the sensor data to train a classification model to identify these activities.

Now let’s step through each part of the workflow to see how we can get our fitness app working.

We’ll start with data from the sensors in the phone.

A flat file format such as text or CSV is easy to work with and makes importing data straightforward.

Now we import all that data into MATLAB and plot each labeled set to get a feel for what's in the data.
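As a rough illustration of the import step, here is a minimal Python sketch (the video itself uses MATLAB). The column names `activity`, `ax`, `ay`, and `az` and the sample values are assumptions made up for this example, not the app's real data layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: column names and values are invented for illustration.
SAMPLE_CSV = """activity,ax,ay,az
walking,0.12,9.70,0.31
walking,0.18,9.91,0.25
standing,0.01,9.81,0.02
"""

def load_labeled_samples(text):
    """Group accelerometer readings (ax, ay, az) by their activity label."""
    by_activity = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        reading = (float(row["ax"]), float(row["ay"]), float(row["az"]))
        by_activity[row["activity"]].append(reading)
    return dict(by_activity)

samples = load_labeled_samples(SAMPLE_CSV)
for activity, readings in samples.items():
    # A quick per-label count is a first sanity check before plotting.
    print(activity, len(readings), "readings")
```

Grouping by label first makes it easy to plot or summarize each activity separately.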

To preprocess the data, we look for missing data or outliers. In this case, we might also look at using signal processing techniques to remove the low-frequency gravitational effects. That would help the algorithm focus on the movement of the subject, not the orientation of the phone.
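One simple way to sketch this preprocessing in Python is shown below. It assumes missing readings are stored as `None` and uses an exponential moving average as the low-frequency gravity estimate. The smoothing factor `alpha=0.9` is an arbitrary choice for illustration, and the video does not say which filter it actually uses.

```python
def drop_missing(signal):
    """Remove missing readings, assumed here to be stored as None."""
    return [x for x in signal if x is not None]

def remove_gravity(signal, alpha=0.9):
    """Subtract a slowly varying (low-frequency) estimate from the signal."""
    if not signal:
        return []
    gravity = signal[0]  # start the estimate at the first reading
    movement = []
    for x in signal:
        # The exponential moving average tracks the low-frequency component.
        gravity = alpha * gravity + (1 - alpha) * x
        # What remains after subtraction is the faster movement of the subject.
        movement.append(x - gravity)
    return movement

raw = [9.81, 9.81, None, 9.81, 12.0, 9.81]
cleaned = remove_gravity(drop_missing(raw))
print(cleaned)
```

A constant reading, such as a phone at rest, maps to values near zero, so only changes in acceleration remain.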

Finally, we divide the data into two sets. We save part of the data for testing and use the rest to build the models.
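A holdout split along these lines could look like the following Python sketch. The 30% test fraction and the fixed seed are assumptions chosen for the example, not values given in the video.

```python
import random

def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows, then hold out a fraction of them for testing."""
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for ten labeled observations
train, test = split_train_test(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before splitting matters when the file is ordered by activity, because otherwise the test set could contain only one class.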

Feature engineering is one of the most important parts of machine learning. It turns raw data into information that a machine learning algorithm can use.

For the activity tracker, we want to extract features that capture the frequency content of the accelerometer data.

These features will help the algorithm distinguish between walking (low frequency) and running (high frequency).

We create a new table that includes the selected features.
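To make the idea of a frequency feature concrete, the Python sketch below finds the dominant frequency of a signal with a naive discrete Fourier transform. The 50 Hz sample rate and the 2 Hz and 3 Hz test signals are hypothetical stand-ins for walking and running, not measured values.

```python
import cmath
import math

def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-constant DFT component."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # drop the constant (0 Hz) component
    best_bin, best_magnitude = 0, 0.0
    for k in range(1, n // 2 + 1):
        # Naive O(n^2) DFT coefficient for frequency bin k.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        if abs(coeff) > best_magnitude:
            best_bin, best_magnitude = k, abs(coeff)
    return best_bin * sample_rate_hz / n

sample_rate_hz = 50  # assumed sensor sample rate
t = [i / sample_rate_hz for i in range(100)]
walking_like = [math.sin(2 * math.pi * 2.0 * s) for s in t]  # hypothetical 2 Hz motion
running_like = [math.sin(2 * math.pi * 3.0 * s) for s in t]  # hypothetical 3 Hz motion

features = {
    "walking_like": dominant_frequency(walking_like, sample_rate_hz),
    "running_like": dominant_frequency(running_like, sample_rate_hz),
}
print(features)
```

In practice an FFT routine would replace the naive loop, but the feature itself is the same: one number per window that summarizes how fast the motion repeats.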

The number of features that you could derive is limited only by your imagination. However, many established techniques are commonly used for different types of data.

Now it's time to build and train the model.

It’s a good idea to start with something simple like a basic decision tree. This will run fast and be easy to interpret.
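To show why a simple tree is easy to interpret, here is a Python sketch of the smallest possible one: a single split on one feature, often called a decision stump. This is an illustrative stand-in, not the model trained in the video, and the dominant-frequency values are made up.

```python
def fit_stump(values, labels):
    """Find the single threshold that misclassifies the fewest training samples.

    Assumes at least two distinct feature values.
    """
    ordered = sorted(set(values))
    # Candidate thresholds sit halfway between neighboring feature values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    classes = sorted(set(labels))
    best = None
    for threshold in candidates:
        for low_label in classes:
            for high_label in classes:
                errors = sum(
                    (low_label if v <= threshold else high_label) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low_label, high_label)
    return best[1:]

def predict_stump(stump, value):
    """Apply the learned rule: one comparison decides the class."""
    threshold, low_label, high_label = stump
    return low_label if value <= threshold else high_label

# Hypothetical dominant frequencies (Hz) for six labeled windows.
freqs = [1.8, 2.0, 2.1, 2.9, 3.0, 3.2]
labels = ["walking"] * 3 + ["running"] * 3
stump = fit_stump(freqs, labels)
print(stump)
```

The learned rule reads as a plain sentence ("at or below the threshold, predict walking"), which is the interpretability advantage mentioned above. A full decision tree repeats this split recursively across several features.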

To see how well it performs, we look at the confusion matrix, a table that compares the classifications made by the model with the actual class labels.
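A confusion matrix can be built by counting (actual, predicted) pairs, as in this Python sketch. The labels and predictions are invented for illustration and are not results from the video.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["walking", "running", "standing"]
# Hypothetical labels and model outputs for six test samples.
actual = ["walking", "walking", "running", "running", "standing", "standing"]
predicted = ["walking", "running", "running", "running", "standing", "walking"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:>9}: {row}")
```

Correct classifications fall on the diagonal. Off-diagonal entries show which pairs of activities the model confuses.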

The confusion matrix shows that our model is having trouble distinguishing between dancing and running.

Maybe a decision tree doesn’t work well for this type of data. We’ll try something else.

Let’s try a multiclass support vector machine (SVM).

With this method, we now get 99% accuracy, which is a big improvement.

We achieved our goal by iterating on the model and trying different algorithms. However, it's rarely this simple.

If our classifier still couldn’t reliably differentiate between dancing and running, we’d look into other ways to improve the model.

Improving a model can take two different directions: making the model simpler to avoid overfitting, or adding complexity to improve accuracy.

A good model includes only the features with the most predictive power, so to simplify the model, we should first try to reduce the number of features.

Sometimes, we look at ways to reduce the model itself. We can do this by pruning branches from a decision tree or removing learners from an ensemble.

If our model still can’t tell the difference between running and dancing, it may be due to over-generalizing. So, to fine-tune our model, we can add more features.

In our example, the gyroscope records the orientation of the cell phone during activity.

This data might provide unique signatures for the different activities.

For example, there might be a combination of acceleration and rotation that’s unique to running.

Now that we’ve adjusted our model, we can validate its performance against the test data we set aside in preprocessing. If the model can reliably classify the activities, we’re ready to move it to the phone and start tracking.
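The final check against held-out data reduces to comparing predictions with the true labels. The Python sketch below uses made-up labels and a hypothetical list of model outputs. It illustrates the calculation only and does not reproduce the video's result.

```python
def accuracy(actual, predicted):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical held-out labels and the model's predictions for them.
test_actual = ["walking", "running", "running", "standing"]
test_predicted = ["walking", "running", "walking", "standing"]

score = accuracy(test_actual, test_predicted)
print(f"Test accuracy: {score:.0%}")
```

Because the test set played no part in training or tuning, this score is a fairer estimate of how the model will behave on the phone than the training accuracy.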

So, that wraps up our machine learning example and our overview video series about machine learning. For more information, check out the links below.

In our next series, we’re going to look at some advanced topics related to machine learning, such as feature engineering and hyperparameter tuning.