This is the third in a series of articles intended to make Machine Learning more approachable to those without technical training. Prior articles introduce the concept of Machine Learning and discuss how the process of learning works in general. You can start the series here.
This installment describes a number of commonly used machine learning algorithm families. This list is far from exhaustive, and is focused mostly on two-class classification algorithms (the ones I am most familiar with). Descriptions are intended to be brief but intuitive, targeted at unsophisticated or inexperienced readers. A brief description and illustration of each algorithm family is provided, along with some examples of algorithms from each family. A further reading list is provided for a more detailed mathematical treatment of these algorithms and to discover many more.
Naive Bayes Classifiers
Naive Bayes Classifiers are a family of classification algorithms which are based on fairly simple principles. These algorithms learn which input attributes have high and low probabilities of being associated with a class of interest, and then calculates whether or not an instance falls into a certain class based on these attribute probabilities. For example, a motor vehicle may be classified as motorbike if it has two wheels, is between 1900mm and 2500mm in length, and is between 700mm and 1000mm in width.
Naive Bayes Classifiers are so-called because of the statistical assumption that each variable is independent from the other, though in practice this is rarely the case, for example the number of wheels is usually correlated with the width.
Generally, these algorithms require a relatively small amount of data to estimate parameters, which is considered an advantage. However, due to the simplicity of the underlying mathematical model, these algorithms are often outperformed by others.
One of the earliest and most common uses of this family of algorithms is text filtering. In Spam Filtering, an email may be classified as spam based on a certain word, combination of words or random text string being found in the email. In a similar way, these algorithms can be used to categorize documents into different subject categories.
A number of alternate algorithms exist which try to overcome the independence assumption of the Naive Bayes Classifier. One of the strongest examples is Average One-Dependence Estimators (AODE), which can often produce lower error levels than the Naive Bayes while only requiring modest additional computational power.
Decision Tree Algorithms
This is a large family of algorithms which operate using the Decision Tree principle, and are known to be highly successful in a number of contexts. These algorithms categorize an instance into a certain class (leaves) based on a progression of decisions (branches) made based on the attributes of the instance.
For example, the diagram below is a simple decision tree algorithm for determining what happened to a passenger on the Titanic based on three inputs — sex, age and number of siblings and spouses on board (sibsp). The numbers underneath each leaf represent the probability of survival and the percentage of instances which fell into that class:
Decision Tree algorithms learn using a top-down induction, often called a greedy algorithm (due to the fact that it always decides to take the largest share). Starting at each class (leaf) the input data is split into subsets according to whether it associates with the class or not. This is then repeated on each subset (or decision node) in a process known as recursive partitioning. The learning stops when the algorithm reaches a decision node for which there is no further value in splitting the data. In our Titanic example, the algorithm would split the data between male and female, and it would then determine that there is no further value in splitting the female data. It would then determine that it is valuable to split the male data, first according to age, then according to sibsp.
Decision Tree algorithms are robust, easy to interpret and can handle large datasets well. However, there are many problems for which they will not work because the underlying data is too complex. These algorithms can also have a tendency to create over-complex trees from the training data that do not generalize well and end up overfitting the data. It is possible to go back and simplify the tree in some cases. This is known as pruning.
Examples of algorithms based around the Decision Tree method include Bagging Trees, Random Forests and Boosted Trees.
Support Vector Machines
Support Vector Machines are non-probabilistic algorithms that are founded on the principles of multi-dimensional linear algebra. These algorithms are strongly suited to the binary problem of classifying a sample into one of two classes, and can be particularly useful when there is a limited structure to the input data.
SVM algorithms take the attributes of an instance in the data set and transform these attributes into a vector in n-dimensional space, of the form X = (x1, x2, x3, … , xn).
The algorithm will then attempt to compute a hyperplane or ‘line’ in n-dimensional space that separates the two classes by the greatest distance possible.
A simple example in two dimensions is shown below. In this example, the vectors lying on the dotted-line margins are known as the support vectors and are used to calculate the line of greatest separation, marked with a solid line.
Modern SVM algorithms are able to determine the number of dimensions and attributes by mining text strings and identifying the subsets and families of text which can be used to categorize the sample. Dimensions, and hence vector lengths, can be in the hundreds or even thousands. The complexity of performing computations on such data can very easily and quickly ascend past the limits of computational power available. However, modern SVM algorithms make use of a kernel function which can reduce the complexity of the problem significantly.
Multiple developments on linear SVM have led to a number of alternative options for situations when the training set is not easily linearly separable (i.e., it is not possible to identify a hyperplane of separation). For example, soft margins may be used in cases when some of the data may be misclassified. There are also recent algorithms that develop SVM principles to work with more than two classes, as well as in non-linear and regression problems.
Similar to SVM, the Perceptron algorithm operates by converting input attributes into a vector of the form X = (x1, x2, x3, … , xn). For each instance in the training set, the algorithm calculates a vector of weights W = (w1, w2, w3, …, wn) to form a linear prediction function W.X which correctly predicts all prior training set instances .
Different from most other algorithms, Perceptron performs online learning, which means that it anayzes each training instance one by one and updates the weight vector W after each instance is processed, rather than having to process all instances first. This can be more efficient and lean compared to other algorithms.
However, the Perceptron algorithm will only converge on a weight vector W if the training set is linearly separable. If no hyperplane exists to separate the positive from the negative examples in the training data, the algorithm will not be able to converge on a W that will correctly classify all training set examples.
Variants of Perceptron can be used to deal with multi-class problems, and even non-linear problems. The Pocket Algorithm is a potentially useful variant which deals with non-linearly separable training data. As it processes each training instance, this algorithm keeps the best solution seen so far ‘in its pocket’. At each instance, the algorithm returns the ‘pocket’ solution rather than the last solution, which allows it to iterate to the solution with the least error on the training set.
Neural Network algorithms are among the most complex algorithms currently in use. They are modeled after the structure of neurons in the brain, and they are designed to learn in a similar way to how the brain learns. They are not usually pre-programmed with any set rules, but learn through processing examples.
Image recognition is a common example of where neural networks are often used. As an example, the network may be fed a collection of images of dogs and told which breed each dog is. It can then learn characteristics of the dog images which allows it to predict the breed for new images it is given.
It’s natural to ask how a computer can simulate a network of neurons. In reality, the signals between the ‘neurons’ is usually a number of some form, and the ‘neurons’ themselves are some sort of mathematical function which processes the signal it receives and spits out a new number — so in fact these algorithms mathematically model the signal to node behavior of a network of neurons.
I’ve only scratched the surface here, but I’ve covered the major classes of classification algorithms which you may hear about. If you want to really get into the delightful universe of algorithms like BLAST, nauty, Saucy or the Mersenne Twister (yes, they are all algorithm names), here are a few places you can look. Enter at your own risk, especially if you are not a mathematician, statistician or computer scientist.
- Introduction to Machine Learning, Ethem Alpaydin, ISBN 0262028182. This is a very good broad text on Machine Learning, covering a wide range of topics. It is a very common textbook for University programs.
- Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Peter Flach, ISBN 1107422221. This is a very practical treatment of the topic, with the use of specific examples to illustrate how algorithms operate.
- Foundations of Machine Learning, Mehryar Mohri and others, ISBN 026201825X. This book is quite theoretical in nature, with a focus on proofs of key concepts. It also covers the mathematical foundations of several common ML algorithms.
- Machine Learning, A Probabilistic Perspective, Kevin P Murphy, ISBN 0262018020. This book is a good mix between theory (in particular probability and statistics) and practice, with examples of algorithm codes and lots of illustrations.
- Information Theory, Inference and Learning Algorithms, David MacKay, ISBN 0521642981. The focus of this book is coding for science and engineering purposes. However, there are some entertaining diversions on algorithms for crossword puzzles and biological evolution.
- The Elements of Statistical Learning, Data Mining, Inference and Prediction, Trevor Hastie, ISBN 0387848576. An academic text focused on combining various concepts related to Machine Learning into a single statistical framework.
- Introduction to Information Retrieval, Christopher D Manning and others, ISBN 0521865719. This book is focused on applications related to text classification and clustering.
The next article will focus on how the effectiveness of a predictive algorithm is measured. Read it here.