# Data Science (Machine Learning) 101

Data Science, or Machine Learning, is a scary topic. It’s hard to know where to get started, and hard to even find a good definition of what it is and what you have to do. And there’s always the risk of unleashing the singularity or Skynet by mistake.

As I’ve given a few ad hoc presentations on Machine Learning (focused on implementing it with Azure, though the basics apply to other platforms), I thought I’d take my random notes and present them as a primer. You don’t need to be a rocket scientist to get started, but a basic understanding of linear algebra will help. As this isn’t focused purely on Azure Machine Learning (AML), and there are good tutorials on getting started there, this isn’t a step-by-step guide for AML Studio.

The first thing to understand is that there are two main types of machine learning model:

**Supervised:**

Supervised models learn from a ground truth. Each row in your training data has a column whose value the model learns to predict from the other columns’ values. The column you are trying to predict is often referred to as the label, and the columns that help predict the label are the features.

The three main types of Supervised learning are:

- **Classification**: Groups data into a simple Yes/No classification with a confidence factor. For instance, given the following medical features, does the label indicate the subject will suffer from Deep Vein Thrombosis?
- **Regression**: Predicts a number or value for the label of each observation. For instance, given features such as square footage, number of floors, and zip code, what is the expected sale price? Weather forecasts are another good example of regression models.
- **Anomaly detection**: Used to build models that detect outliers by learning the “normal” and “abnormal” aspects of a dataset. Given a set of features, it is then possible to predict deviation from the norm. Examples include valid vs. fraudulent transactions, network intrusion (unusual traffic patterns), or water leaks.

**Unsupervised**:

Unsupervised models start without a ground truth. They rely largely on having sufficient data for statistical analysis to determine clusters based on patterns and similarity. They compare Euclidean distance within a space to determine affinity, and once a model is defined, they can assign a new set of features to a given sub-set.

The primary type of Unsupervised learning is:

**Clustering**: Examines the features of a dataset that has no defined label to produce clusters based on given parameters (number of target clusters, size/spread, etc.). Movie or restaurant recommendations are the canonical example – people who like X also like Y. You can only cluster numerical data, not categorical (see *ingestion* below for a definition of categorical data).

**How do I get started?**

To start off you need two things: a source of data, and the question you are trying to answer. The type of question (true/false, numerical value, classification) will help you determine what model you are trying to build, and thus what algorithms to try. The data you work with should be as clean and complete as possible, but you also want to remove redundant data which may overcomplicate or confuse the model. The process of balancing accuracy and simplicity goes hand in hand with choosing the correct algorithms, and is largely iterative, but luckily with the computing power available to Data Scientists today you can rapidly perform multiple experiments and refinements on a model and compare the results.

**Ingest and clean data.**

The more data you have available to model the better, but each column in the data needs to add value. An important first step is to make sure you are working with valuable data. The process of data cleaning and transformation is sometimes referred to as munging, and getting things right here – demonstrating a good understanding of the meaning of the data – provides a solid foundation for the work to come, so it’s not something to skimp on.

As part of this process you will perform one or more of the following steps to clean the data:

- Flag values as categorical (i.e. indicate that the value is just a reference value, not a number that can be operated on)
- Quantization (bin) values to create discrete groupings (e.g. age ranges).
- Clean missing and repeated values. You can do this in a number of ways depending on the value and size of the dataset: remove row; substitute fixed values; interpolate (e.g. using trend or time series); forward or backward fill (take previous/next value); impute (interpolate statistically from rest of data sets).
- Identify outliers and errors (erroneous measurements, entry errors, transpositions). The easiest way to do this is to review a scatter plot to visually identify outliers; at that point you can censor, trim, interpolate, or substitute values.
- Normalization: Scale data to keep values on similar scales, helping to avoid bias. Scales can be logarithmic; 0-to-1; min-to-max; etc. Z-score normalization (zero mean, unit standard deviation) is good for physical measures.
- Identify collinear features: These are features that are tightly coupled, so you can remove one without losing value (and to avoid skewing the model through over-correlation).
- Identify categorical features that don’t add value. Removing those features avoids clutter and potential distraction.
- Feature Engineering: You may need to compute new features, and weight/drop features based on performance (pruning).
- Apply polynomial transformations: Fit curves with simpler linear models by adding the square or cube of a given feature as a new feature.
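As a minimal sketch of that last step, here is one way to expand a single feature with its square and cube in NumPy (the feature name and values are made up for illustration):

```python
import numpy as np

# Hypothetical feature: square footage of a house.
sqft = np.array([850.0, 1200.0, 2300.0])

# Stack the feature with its square and cube; a linear model trained on
# these three columns can now fit a cubic curve in the original feature.
expanded = np.column_stack([sqft, sqft**2, sqft**3])
print(expanded.shape)  # one row per house, three columns per feature
```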

This isn’t a single operation as you may need to do more than one process step for each column. You may need to cascade the steps (e.g. remove outliers, set invalid values to “missing”, then clean and quantize).
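A cascade like that might look something like this in pandas (the column names, sentinel value, and bin edges are all hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; 300 and -999 are entry errors.
df = pd.DataFrame({"age": [23, 41, -999, 67, 35, 300],
                   "reading": [1.2, 1.4, 1.3, np.nan, 1.5, 1.1]})

# 1. Set invalid/outlier values to missing.
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = np.nan

# 2. Clean missing values: forward fill one column, interpolate another.
df["age"] = df["age"].ffill()
df["reading"] = df["reading"].interpolate()

# 3. Quantize: bin ages into discrete ranges.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                        labels=["young", "middle", "senior"])
print(df)
```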

As well as cleaning, you may need to combine data from multiple sources (e.g. reference tables) or transform data to fit the same model (e.g. different formats from older systems).

In the Azure Machine Learning Studio (AML Studio), many of these steps can be applied via the GUI, or programmatically via R or Python modules. You can include these to take advantage of code you may have already written. This also gives the ability to chain a number of steps that you have already verified into a single block.

As a very important part of this process you will conduct exploratory visual data analysis, both to understand the data and to find potential errors in the model.

Using either the AML Studio GUI or R/Python code blocks, you can obtain multiple visualizations of your data to help you understand the relationships between features, or to identify outliers or anomalous values.

Conditioning (faceting data and displaying it in a trellis or lattice format) is where you project multiple dimensions onto a two-dimensional plane (the results of a group-by operation) and display the data as a number of different plots.

Other plotting/charting styles supported by R and Python to help you visualize the data include: scatter and line; histogram (binned, continuous variables); box; violin (multi-modal data); q-q (compare observed quantile distribution to an ‘ideal’).

For example, data without clear classifications (e.g. violin plots with very similar facets, or near-identical histograms) won’t give good results, so those features are good candidates for pruning.
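A quick sketch of two of those plot styles in Python with matplotlib (the data here is synthetic, invented purely for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to view interactively
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sqft = rng.normal(1500, 400, 200)                # hypothetical house sizes
price = sqft * 150 + rng.normal(0, 20000, 200)   # hypothetical sale prices

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(sqft, bins=20)        # histogram: distribution of one feature
ax1.set(title="Square footage", xlabel="sqft")
ax2.scatter(sqft, price, s=10) # scatter: feature vs. label, outliers stand out
ax2.set(title="Price vs. sqft", xlabel="sqft", ylabel="price")
fig.savefig("explore.png")
```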

**Choosing an Algorithm**

In AML Studio there are many pre-defined algorithms you can apply to your dataset. As long as you know what sort of question you are trying to answer (what sort of model you’re going to build) you will either find an algorithm in the pre-packaged set (one of the advantages of Microsoft having a large pool of Data Scientists, and many years’ experience running Machine Learning models) or develop your own in either R or Python to plug into code blocks in the GUI.

To help you pick an algorithm to try, you can follow the flowchart at aka.ms/AzureMachineLearningCheatSheet which guides you through some possible selections based on your data sources, and what you’re trying to predict.

For the curious, a few types of algorithm are described below (though you don’t really need to know what they’re doing, as long as you evaluate and tune the output to maximize the accuracy of the models).

**Adaptive Boosting** (AdaBoost): Combines multiple weak learners and weights them to improve performance.

**Boosted Decision Tree**: AdaBoost plus Decision Tree. Good performance and improved accuracy.

**Decision Trees**: Built from the top down, creating paths that classify factors. Theoretically easier to interpret, but influenced by noisy data.

**Hierarchical Agglomerative Clustering**: Initially treats each point as a “cluster”, then merges the clusters with the least distance between them. Stops when the merge distance crosses an average threshold. An adaptive distance measure changes with the density of the data so that dense regions don’t overwhelm the merging.

**K-Means Clustering**: Starts with “k” randomly placed cluster centers (where “k” is simply the number of buckets to cluster the data into), assigns each point to its nearest center by coordinate distance, and repeats (moving each cluster center to the center of its current group) until the cluster centers are “true” and no longer moving.
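Outside AML Studio, the same algorithm is available in scikit-learn; a minimal sketch with made-up 2-D points forming two obvious blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of made-up 2-D points.
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# k=2: look for two cluster centers.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.cluster_centers_)        # the final, settled centers
print(kmeans.predict([[1.1, 1.0]]))   # assign a new point to a cluster
```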

**Logistic Regression**: The data is compressed using a logistic transformation (to minimize the effect of extreme values) and used as an input to a neural model. Good for models with many factors.

**Matrix Factorization**: Decomposes observations into latent factors – the “essence” of, say, what a user likes – which lets you guess a movie score based on movie type.

**Multi-class Classification**: Compares each item to the rest of the dataset to determine affinity or grouping. Once groups are determined, prediction is based on affinity to an existing cluster.

**Support Vector Machine** (SVM): Essentially a quadratic optimization solver. It maps examples as points in space, determines a boundary based on the widest clear gap between two groups of data, then predicts which category new items fall into.
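A short scikit-learn sketch of that idea, fitting a linear SVM to synthetic, separable data (the dataset here is generated, not from any real scenario):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable groups of synthetic 2-D points.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# A linear kernel looks for the widest-margin separating line.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(X[:5]))  # categories predicted for the first five points
```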

In AML Studio there are detailed descriptions for each of the algorithms. You can further research the math and theory behind each to understand in detail how it applies to the model you are building.

**Review/Validate**

Arguably this is the most important part of the process, as this is where you assess the accuracy of the model at predicting results. In doing so you validate both your initial assumptions about the data – the munging you performed – and the algorithm(s) that you selected.

You do this by splitting the source data into a number of sets and using them independently to train, test, and evaluate your model. Evaluating on data the model has never seen helps remove bias toward known values.

If you know what metrics you want to measure at the start of the process (or discover possible scenarios during the original data analysis), your model evaluation can use those metrics to evaluate the results. You can also investigate the sources of errors and improve the result by further modification of the initial data, or by fine-tuning the model. Perhaps the errors come from outliers, over- or under-pruning, or undetected errors in the source data.

In AML you can use the Evaluate Model module to review a single model, or to compare two scored models, to help get closer to an accurate prediction.

When assessing the different model types there are a number of factors to look for:

**Regression models** give you the mean absolute error, root mean squared error, relative absolute error, relative squared error, and the coefficient of determination (how much of the variance in the original data has been explained). You want the errors to be as close to 0 as possible, and the coefficient of determination to be as close to 1 as possible.
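Computing a few of these by hand clarifies what they mean (the sale prices below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted sale prices.
actual = np.array([200_000, 250_000, 320_000, 410_000])
predicted = np.array([210_000, 240_000, 330_000, 400_000])

mae = mean_absolute_error(actual, predicted)        # average error size
rmse = np.sqrt(mean_squared_error(actual, predicted))
r2 = r2_score(actual, predicted)                    # coefficient of determination

print(mae, rmse, r2)  # small errors, r2 close to 1 = good fit
```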

**Binary (two-class) classification models** provide metrics showing accuracy, precision, recall, F1 score (a combination of precision and recall), and AUC (area under the curve). You want all of these numbers to be as close to 1 as possible (a value of 0.5 indicates that the model is no better than flipping a coin!). It also provides the number of true positives, false positives, false negatives, and true negatives. You want the number of true positives and true negatives to be high, and the number of false positives and false negatives to be low.

Given that TP = True Positive; TN = True Negative; FP = False Positive; and FN = False Negative then:

- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN) – the fraction of actual positives correctly identified; maximizing it helps minimize false negatives
- F1 = 2 × (Precision × Recall) / (Precision + Recall)
- ROC AUC (Area Under the Curve) – the area under a graph of True Positive Rate vs. False Positive Rate
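The formulas above are trivial to compute directly; a sketch with hypothetical confusion-matrix counts:

```python
# Hypothetical counts from a two-class confusion matrix.
TP, TN, FP, FN = 80, 90, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of all correct calls
precision = TP / (TP + FP)                   # of predicted positives, how many were real
recall = TP / (TP + FN)                      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```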

**Multiclass classification models** provide a confusion matrix of actual vs. predicted instances, showing how often each actual class was predicted as each possible class.

At all stages, be guided by Occam’s razor. The simplest model, with the fewest assumptions, that fits the data well with the lowest residuals (the difference between the observed and predicted labels) is the best. At each stage you should strive to balance accuracy and simplicity.

Remember too, that this process is iterative. Be prepared at any stage to question your assumptions and see if a change earlier in the model (be it data cleaning or algorithm selection or tuning) improves the outcome.

Compute resources while you’re training the model are relatively cheap and fast, so don’t be afraid to experiment. AML Studio includes a module called Sweep Parameters to help with this. It can automatically re-run the chosen algorithm with different parameters to determine which combination gives (statistically) the best result.
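Outside AML Studio, the closest scikit-learn analogue to Sweep Parameters is a grid search; a minimal sketch (the parameter grid here is arbitrary, chosen just to show the mechanism):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination in the grid with 5-fold cross-validation,
# and keep whichever parameters score best.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```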

When you have trained your model you may need to revisit the experiment and:

- Evaluate (review root mean square errors) and understand (and possibly trim) residuals;
- Modify the rules that filter/transform data;
- Improve feature engineering (polynomials);
- Improve feature selection (prune);
- Test alternate models/algorithms (e.g. an alternative to a linear regression might be a Support Vector Machine or Decision Tree);
- Adjust parameters for selected algorithms

Note that you can get to a point of diminishing returns with small residuals (clustered around zero) and no significant improvement in root mean square errors.

Before going into production, you should re-train the model using all data (including the test data) to give it even more to work from.

In Azure Machine Learning Studio, it is a fairly simple process to publish the trained and optimized model as a web service that you can then consume, either ad hoc, or in batch to return predictions. You can then evaluate performance of your work against fresh data, and see how good it is. At any time, you can re-train the model on a new or expanded data set to improve the results over time.

The best way to get started though… is to get started! To that end there are sites like Kaggle or the Cortana Intelligence Competitions which have sample datasets and scenarios for you to hone your skills on. Go give it a try, and come back and comment on ways I can improve the explanations above…