Condensed into a set of short tips and diagrams, a cheat sheet is not only a source of visual inspiration but also a quick way to learn something new or to refresh your knowledge of a particular subject.

Here’s a collection of **Machine Learning** cheat sheets. Enjoy it, and don’t forget to bookmark this page for quick access to this exclusive cheat sheet list.

Let’s start!

The Microsoft Azure Machine Learning Algorithm Cheat Sheet^{1} helps you choose the right machine learning algorithm. It includes some generalizations and oversimplifications, and many algorithms are not listed, but it provides “breadcrumbs” that point you in a safe direction. There are several revisions of this cheat sheet; this is the version that I suggest for the moment:

The scikit-learn developers put together a flow chart on how to approach machine learning^{2}. Here’s the resulting picture:

*In the previous picture, “ensemble classifiers/regressors” means random forests, extremely randomized trees, gradient boosted^{3} trees and the AdaBoost classifier.*

The table below includes information that helps in choosing an algorithm: each algorithm has been analyzed in terms of accuracy, training time, linearity, number of (hyper)parameters and other peculiarities (e.g. memory footprint and required dataset size). This table is a variation of an original scheme published by Microsoft^{4}:

Algorithm | Accuracy | Training time | Linearity | Number of Parameters | Notes |
---|---|---|---|---|---|
Two-class classification | | | | | |
logistic regression | | ▲ | ▲ | medium | |
decision forest | ▲ | △ | | medium | |
decision jungle^{5} | ▲ | △ | | medium | low memory footprint |
boosted decision tree | ▲ | △ | | medium | large memory footprint |
neural network | ▲ | | | high | |
averaged perceptron^{6} | △ | △ | ▲ | low | |
support vector machine | | △ | ▲ | medium | good for large feature sets |
locally deep support vector machine^{7} | △ | | | high | good for large feature sets |
Bayes’ point machine^{8} | | △ | ▲ | low | |
Multi-class classification | | | | | |
logistic regression | | ▲ | ▲ | medium | |
decision forest | ▲ | △ | | medium | |
decision jungle | ▲ | △ | | medium | low memory footprint |
neural network | ▲ | | | high | |
one-v-all | – | – | – | – | see properties of the two-class method selected |
Regression | | | | | |
linear | | ▲ | ▲ | medium | |
Bayesian linear | | △ | ▲ | low | |
decision forest | ▲ | △ | | medium | |
boosted decision tree | ■ | △ | | medium | large memory footprint |
fast forest quantile^{9} | ▲ | △ | | high | distributions rather than point predictions |
neural network | ▲ | | | high | |
Poisson^{10} | | | ▲ | medium | technically log-linear; for predicting counts |
ordinal regression^{11} | | | | low | for predicting rank-ordering |
Anomaly detection | | | | | |
support vector machine | △ | △ | | low | especially good for large feature sets |
PCA-based anomaly detection | | △ | ▲ | low | |
K-means | | △ | ▲ | medium | a clustering algorithm |

- **accuracy**: the algorithm’s ability to produce correct predictions across all examined cases;
- **training time**: the time required to train a model;
- **linearity**: linear classification algorithms assume that classes can be linearly separated. A linearity assumption is not bad under some circumstances, but on the other hand it can bring accuracy down;
- **number of parameters**: the degrees of freedom available to tune the underlying algorithm. The number of parameters depends on the specific algorithm implementation. I propose a simplified set of labels (low, medium, high) just to recall how tunable each type of algorithm is;
- symbol ▲ means excellent accuracy, fast training times, and use of linearity;
- symbol △ means good accuracy and moderate training times.
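To make the accuracy, training time and linearity criteria more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: it uses a plain (not averaged) perceptron, the function names (`accuracy`, `train_perceptron`, `evaluate`) are made up for this example, and accuracy is measured on the training points for brevity. It trains the same linear classifier on a linearly separable problem (AND) and on one that is not (XOR).

```python
import time


def accuracy(y_true, y_pred):
    """Fraction of examined cases that were predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def predict(weights, bias, x):
    """Linear decision rule: class 1 if the weighted sum is positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20):
    """Classic perceptron updates; `epochs` is the only tunable parameter here."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    return weights, bias


def evaluate(samples, labels):
    """Return (accuracy on the training points, training time in seconds)."""
    start = time.perf_counter()
    weights, bias = train_perceptron(samples, labels)
    training_time = time.perf_counter() - start
    predictions = [predict(weights, bias, x) for x in samples]
    return accuracy(labels, predictions), training_time


POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_LABELS = [0, 0, 0, 1]  # linearly separable
XOR_LABELS = [0, 1, 1, 0]  # not linearly separable

and_accuracy, and_time = evaluate(POINTS, AND_LABELS)
xor_accuracy, xor_time = evaluate(POINTS, XOR_LABELS)
print(f"AND: accuracy={and_accuracy:.2f}, training time={and_time:.6f}s")
print(f"XOR: accuracy={xor_accuracy:.2f}, training time={xor_time:.6f}s")
```

The linear model fits AND perfectly but cannot classify all four XOR points, which is the kind of accuracy loss a linearity assumption can cause.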

For Python and R programmers the cheat sheet below could be a timesaver:

Which algorithm can meet my needs?

To answer this question, a comprehensive table^{12} is shown in the following picture (the image is large, so consider downloading it first and then zooming in):

If you are planning to use decision trees, then you should have a look at this figure:

There are many types of artificial neural networks (ANNs). The topology of a neural network refers to the way its neurons are connected, and it is an important factor in how the network functions and learns. If you need an overview of ANN topologies, the following picture could be helpful:
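As a small, hedged illustration of why topology matters, the sketch below wires up a tiny feed-forward network by hand in plain Python. Everything here is an assumption made for the example: the `forward` and `layer_sizes` helpers, the step activation, and the hand-picked weights. The point is only that adding one hidden layer (a 2-2-1 topology) lets the network compute XOR, which a single linear neuron cannot.

```python
def step(value):
    """Threshold activation: the neuron fires when its input is positive."""
    return 1 if value > 0 else 0


def forward(layers, inputs):
    """Propagate inputs through fully connected layers.

    Each layer is a list of neurons; each neuron is a (weights, bias) pair.
    """
    activations = list(inputs)
    for layer in layers:
        activations = [
            step(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations


def layer_sizes(layers, n_inputs):
    """Describe the topology as the number of units per layer."""
    return [n_inputs] + [len(layer) for layer in layers]


# Hand-picked weights (illustrative, not learned): 2 inputs -> 2 hidden -> 1 output.
XOR_NETWORK = [
    [([1, 1], -0.5), ([1, 1], -1.5)],  # hidden layer: an OR-like and an AND-like neuron
    [([1, -1], -0.5)],                 # output: fires for "OR but not AND"
]

outputs = {
    point: forward(XOR_NETWORK, point)[0]
    for point in [(0, 0), (0, 1), (1, 0), (1, 1)]
}
print(layer_sizes(XOR_NETWORK, n_inputs=2), outputs)
```

Changing the nested lists changes the topology, for example by adding a layer or more neurons per layer, without touching `forward`.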

Another resource, quite basic but good, is the post Machine Learning for Dummies Cheat Sheet. I found its “Locating the algorithm you need for Machine Learning” table particularly useful: it gives the online location of information about the algorithms (for both Python and R) used in machine learning.

If you have other “on-topic” cheat sheets, please feel free to leave a comment. I will be more than happy to take your suggestions into account.

- https://docs.microsoft.com/en-in/azure/machine-learning/machine-learning-algorithm-cheat-sheet
- http://peekaboo-vision.blogspot.it/2013/01/machine-learning-cheat-sheet-for-scikit.html
- **Boosting** means that each tree is dependent on prior trees, and learns by fitting the residual of the trees that preceded it. Thus, boosting in a decision tree ensemble tends to improve accuracy with some small risk of less coverage.
- https://docs.microsoft.com/en-in/azure/machine-learning/machine-learning-algorithm-choice
- Decision jungles are a recent extension to decision forests. A decision jungle consists of an ensemble of decision directed acyclic graphs (DAGs). Link: Decision Jungles: Compact and Rich Models for Classification (Supplementary Material).
- The averaged perceptron method is an early and very simple version of a neural network. In this supervised learning method, inputs are classified into several possible outputs based on a linear function, and then combined with a set of weights that are derived from the feature vector, hence the name perceptron. The simpler perceptron models are suited to learning linearly separable patterns, whereas neural networks (especially deep neural networks) can model more complex class boundaries. However, perceptrons are faster, and because they process cases serially, perceptrons can be used with continuous training.
- In this implementation from Microsoft Research, the kernel function that is used for mapping data points to feature space is specifically designed to reduce the time needed for training while maintaining most of the classification accuracy. This model is a supervised learning method, and therefore requires a tagged dataset, which includes a label column.
- The Bayes Point Machine is a Bayesian approach to linear classification. It efficiently approximates the theoretically optimal Bayesian average of linear classifiers (in terms of generalization performance) by choosing one “average” classifier, the Bayes Point. Because the Bayes Point Machine is a Bayesian classification model, it is not prone to overfitting to the training data. For more information, read the original research paper: Bayes Point Machines.
- You can use the Fast Forest Quantile Regression module to create a regression model that can predict values for a specified number of quantiles. Quantile regression is useful if you want to understand more about the distribution of the predicted value, rather than get a single mean prediction value. This method has many applications, including: Predicting prices; Estimating student performance or applying growth charts to assess child development; Discovering predictive relationships in cases where there is only a weak relationship between variables. This regression method is a supervised learning method, and therefore requires a tagged dataset, which includes a label column. The label column must contain numerical values.
- You can use the Poisson Regression module to create a regression model that can be used to predict numeric values, typically counts. You should use this model only if the values you are trying to predict fit a Poisson distribution and cannot be negative.
- You can use the Ordinal Regression module to create a regression model that can be used to predict ranked values. An example of ranked values might be survey responses that capture user’s preferred brands on a 1 to 5 scale, or the order of finishers in a race.
- https://blogs.technet.microsoft.com/machinelearning/2015/09/01/which-algorithm-family-can-answer-my-question/
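The boosting note above says that each tree learns by fitting the residual of the trees that preceded it. A rough sketch of that idea in plain Python follows. It is an illustration under assumptions, not the Azure implementation: it uses one-split regression trees ("stumps"), squared error, a made-up learning rate of 0.5, a toy dataset, and hypothetical helper names such as `fit_stump` and `boost`.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (a stump) by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides of the split non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Start from the mean, then let each new stump fit the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, stumps, learning_rate


def ensemble_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - ensemble_predict(model, x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


# Toy data (assumed for illustration): y = x^2 on ten points.
XS = list(range(10))
YS = [x ** 2 for x in XS]

errors = [mean_squared_error(boost(XS, YS, n_rounds=r), XS, YS) for r in (0, 5, 20)]
print(errors)
```

On this toy data the training error shrinks as more rounds are added, because every stump corrects part of what the previous ones left unexplained.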
