Bluff the bots — a glossary of key terms in Machine Learning

This is the fifth and final installment in a series of articles intended to make Machine Learning more approachable to those without technical training. Prior articles introduce the concept of Machine Learning, show how the process of learning works in general, and describe commonly used algorithms. You can start the series here.

To wrap up the series, I provide below a glossary of basic Machine Learning terms and concepts. This is not exhaustive (otherwise it would break Medium!), but it contains many of the most fundamental terms in the field. Consider bookmarking or printing this if you are having frequent conversations with data scientists, or reading reports on Machine Learning results.

Accuracy — The proportion of all instances which were predicted correctly by a classification algorithm.
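
For readers who like to see the arithmetic, here is a minimal Python sketch of accuracy. The function name and the sample labels (1 for positive, 0 for negative) are made up for illustration.

```python
def accuracy(actual, predicted):
    """Proportion of instances whose predicted class matches the real class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical labels: 1 = positive, 0 = negative
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(actual, predicted))  # 6 of 8 correct -> 0.75
```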

Algorithm — A procedure for solving a problem, usually understood in the Machine Learning context as an iterative computer program.

Association — The process of linking objects based on an identified statistical relationship.

AUC — Area under the ROC curve for a probabilistic classifier. AUC is an indicator of the strength of a classifier, with strong classifiers usually achieving an AUC of 80% or more.
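
One way to build intuition for AUC: it equals the share of (positive, negative) pairs in which the classifier gives the positive instance the higher score. The sketch below computes it that way rather than by measuring an area. The function name, labels and scores are invented for illustration, and counting ties as half a pair is a common convention that I am assuming here.

```python
def auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Labels are assumed to be 1 (positive) or 0 (negative).
    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical probabilistic classifier output
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auc(actual, scores))  # 8 of the 9 pairs are ranked correctly
```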

Classification — The process of assigning objects into pre-defined groups or classes.

Confusion Matrix — An array (2 x 2 when there are two classes) used to illustrate the accuracy and error in a classification algorithm by counting predictions against real outcomes. Also known as an Error Matrix or Contingency Table.
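
As a rough illustration, the following Python sketch counts the four cells of a two-class confusion matrix and prints them as a small table. The function name, the 1/0 labels and the layout (rows for real classes, columns for predictions) are my own choices for the example.

```python
def confusion_counts(actual, predicted):
    """Count the four outcomes for labels 1 (positive) and 0 (negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["TP" if a == 1 else "FP"] += 1
        else:
            counts["FN" if a == 1 else "TN"] += 1
    return counts

# Hypothetical labels
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
c = confusion_counts(actual, predicted)

# Rows are the real classes, columns are the predictions
print("                 predicted +  predicted -")
print(f"actual positive  {c['TP']:^11}  {c['FN']:^11}")
print(f"actual negative  {c['FP']:^11}  {c['TN']:^11}")
```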

Contingency Table — See Confusion Matrix.

Decision Trees — Models of nested decision making which take the shape of a tree with branches, and which certain algorithms build from data.

Discrete classifier — A classification algorithm which outputs the class that an instance belongs to.

Error — The extent to which the predicted output of a supervised learning algorithm failed to match the desired (or real) output. In classification, this is simply the number of incorrect predictions.

Error Matrix — See Confusion Matrix.

False Negative — In classification learning, this is a negative prediction which disagreed with real data.

False Positive — In classification learning, this is a positive prediction which disagreed with real data.

FP Rate — False Positive rate. In a classification algorithm, the proportion of negative instances which were incorrectly predicted to be positive.

Generalization — In supervised learning, the application of an algorithm to new data outside the training set.

Gini coefficient — See AUC. Gini is a macroeconomic measure of wealth distribution which is a linear function of AUC and is sometimes used in its place. Gini = 2 x AUC - 1.
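
A quick worked example of the Gini formula, as a small Python sketch (the function name is hypothetical). It shows that an AUC of 0.5, which is what random guessing achieves, maps to a Gini of 0, while a perfect AUC of 1 maps to a Gini of 1.

```python
def gini_from_auc(auc):
    """Convert an AUC between 0 and 1 into the equivalent Gini value."""
    return 2 * auc - 1

for auc in (0.5, 0.75, 1.0):
    print(auc, "->", gini_from_auc(auc))
# 0.5 -> 0.0
# 0.75 -> 0.5
# 1.0 -> 1.0
```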

Hyperplane — In linear algebra and geometry, this is the generalization of the idea of a line or plane into higher dimensional space. In n-dimensional space, a hyperplane is a subspace of n-1 dimensions.

Inductive bias — In supervised learning, the ingoing hypotheses made by the data scientist about which type of algorithm or function best fits the training data.

Input attributes — The data points or properties of examples in the training set for supervised learning.

Linear separability — The existence of lines or ‘hyperplanes’ in multidimensional space that can divide data into classes.
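
To make this concrete, here is a minimal Python sketch that checks whether one given line (or hyperplane) divides two classes. It assumes the hyperplane is written in the usual form w . x + b = 0, which the glossary does not spell out, and the function names and data points are invented. Note that it only tests a single candidate: a False result does not prove that no dividing hyperplane exists.

```python
def side(weights, bias, point):
    """Return +1, -1 or 0 depending on which side of the hyperplane the point is.

    The hyperplane is assumed to be all points x where w . x + b = 0.
    """
    value = sum(w * x for w, x in zip(weights, point)) + bias
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0

def separates(weights, bias, positives, negatives):
    """True if every positive lies on one side and every negative on the other."""
    return (all(side(weights, bias, p) > 0 for p in positives)
            and all(side(weights, bias, n) < 0 for n in negatives))

# Hypothetical 2-dimensional data and the candidate line x + y - 5 = 0
positives = [(4, 4), (5, 3), (3, 6)]
negatives = [(1, 1), (2, 1), (0, 3)]
print(separates((1, 1), -5, positives, negatives))  # True
```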

Lorenz Curve — A curve which plots a cumulative probability distribution. An ROC curve is a Lorenz curve.

Most general hypothesis — In supervised learning, this is the function that fits the training data to the ‘loosest’ possible degree.

Most specific hypothesis — In supervised learning, this is the function that fits the training data to the ‘tightest’ possible degree.

Naïve Bayes Classifiers — Simple classification algorithms that operate around basic probability calculations.

Noise — Unwanted anomalies in data which disguise underlying relationships or structure.

Online learning — Learning that happens incrementally, with the algorithm updating itself each time new data arrives rather than learning once from a fixed data set. An algorithm which performs online learning keeps correcting its solution during the learning process, so that the output at any point is the best result possible given the data seen so far.

Overfit — In supervised learning, errors on new data due to the algorithm being too complex, so that it fits the noise in the training set rather than the underlying relationship.

Precision — In a classification algorithm, the proportion of predicted positive instances which were correct.
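
Precision is easy to confuse with Recall (TP rate), so the Python sketch below computes both for a deliberately cautious classifier that only predicts positive once. The function name and labels are made up, and returning 0.0 when a ratio is undefined is a convention I am assuming for the example.

```python
def precision_and_recall(actual, predicted):
    """Return (precision, recall) for labels 1 (positive) and 0 (negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    # Assumed convention: report 0.0 when there is nothing to divide by
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical cautious classifier: one positive prediction, and it is right
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0]
print(precision_and_recall(actual, predicted))  # (1.0, 0.25)
```

Every positive prediction was correct, so precision is perfect, yet the classifier found only one of the four real positives.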

Probabilistic classifier — A classification algorithm which outputs a probability that an instance is a member of a certain class.

Recall — See TP rate.

Recursive Partitioning — A process used in decision tree algorithms where data is repeatedly broken into smaller subsets based on its probabilistic relationship with the outcome.
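
The Python sketch below shows the repeated splitting on a single numeric input. It is a simplification under stated assumptions: it picks the split that leaves the fewest misclassified rows, whereas real decision tree algorithms typically use other splitting criteria, and all names and data are hypothetical.

```python
def majority(labels):
    """Most common label among 0/1 labels (ties go to 1)."""
    return 1 if sum(labels) * 2 >= len(labels) else 0

def errors(labels):
    """Rows misclassified if every row were given the majority label."""
    ones = sum(labels)
    return min(ones, len(labels) - ones)

def build(rows, max_depth=3):
    """Recursively split (value, label) rows into smaller subsets."""
    labels = [label for _, label in rows]
    if max_depth == 0 or errors(labels) == 0:
        return {"predict": majority(labels)}
    best = None
    # Try each distinct value (except the smallest) as a split point
    for threshold in sorted({value for value, _ in rows})[1:]:
        left = [r for r in rows if r[0] < threshold]
        right = [r for r in rows if r[0] >= threshold]
        cost = errors([l for _, l in left]) + errors([l for _, l in right])
        if best is None or cost < best[0]:
            best = (cost, threshold, left, right)
    if best is None:  # all values identical, nothing left to split on
        return {"predict": majority(labels)}
    _, threshold, left, right = best
    return {"threshold": threshold,
            "below": build(left, max_depth - 1),
            "at_or_above": build(right, max_depth - 1)}

def predict(tree, value):
    """Follow the branches until a leaf is reached."""
    while "predict" not in tree:
        tree = tree["below"] if value < tree["threshold"] else tree["at_or_above"]
    return tree["predict"]

# Hypothetical data: small values belong to class 0, large values to class 1
rows = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
tree = build(rows)
print(tree["threshold"], predict(tree, 2), predict(tree, 7))  # 6 0 1
```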

Reinforcement Learning — Learning by ‘trying’ a response and being ‘punished’ or ‘rewarded’ depending on whether the response was the desired one.

ROC curve — A curve plotted on an ROC graph to illustrate the relationship between TP rate and FP rate for a probabilistic classifier.
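
A sketch of how the points on an ROC curve can be produced: treat each score as a cut-off, predict positive for everything at or above it, and record the resulting FP rate and TP rate. The function name, labels (1 and 0) and scores are invented for illustration.

```python
def roc_points(actual, scores):
    """Return (FP rate, TP rate) pairs, from the strictest cut-off to the loosest."""
    positives = sum(1 for a in actual if a == 1)
    negatives = len(actual) - positives  # assumes labels are only 1 or 0
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative instance")
    points = [(0.0, 0.0)]  # a cut-off above every score predicts no positives
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical probabilistic classifier output
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for fp_rate, tp_rate in roc_points(actual, scores):
    print(round(fp_rate, 2), round(tp_rate, 2))
```

The curve always starts at (0, 0) and ends at (1, 1); the closer it bends towards the top-left corner in between, the stronger the classifier.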

ROC graph — Receiver Operating Characteristic graph. A 2-dimensional graph used to plot the effectiveness of classification algorithms, usually with FP rate on the x axis and TP rate on the y axis.

Sensitivity — See TP rate.

Specificity — In a classification algorithm, the proportion of negative instances which were predicted correctly. Specificity = 1 - FP rate.
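
The relationship between specificity and FP rate can be checked on a small example. This is a minimal Python sketch with made-up function names and labels (1 for positive, 0 for negative).

```python
def specificity(actual, predicted):
    """Proportion of real negatives which were predicted negative."""
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tn / (tn + fp)

def fp_rate(actual, predicted):
    """Proportion of real negatives which were predicted positive."""
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return fp / (tn + fp)

# Hypothetical labels: four real negatives, one of them wrongly flagged
actual    = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 1, 0]
print(specificity(actual, predicted))  # 0.75
print(fp_rate(actual, predicted))      # 0.25
```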

Supervised Learning — Learning from a sample of data which contains a specific ‘answer’ or outcome, and using this to predict the outcome for new data.

Support Vector Machines — A family of classification algorithms that plot data into multi-dimensional space and attempt to find dividing lines or ‘hyperplanes’ between the classes.

Test set — In supervised learning, a set of data used to calculate the error of an algorithm.

TP Rate — True Positive rate. In a classification algorithm, the proportion of positive instances which were predicted correctly. Also known as Recall or Sensitivity.

Training set — The set of data from which a supervised learning algorithm will learn.

True Negative — In classification learning, this is a negative prediction which agreed with real data.

True Positive — In classification learning, this is a positive prediction which agreed with real data.

Underfit — In supervised learning, errors on new data due to the algorithm being too simple to capture the underlying relationship.

Unsupervised Learning — Learning underlying relationships or structure in data where no specific ‘answer’ or output is expected.

Validation set — In supervised learning, a set of data used to test the generalization of an algorithm that has been trained on a training set.
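
In practice the training, validation and test sets are usually carved out of one data set. The Python sketch below shows one simple way to do that. The 60/20/20 shares, the fixed random seed and the function name are assumptions for illustration, not a rule.

```python
import random

def split(rows, train_share=0.6, validation_share=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation and test sets."""
    rows = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)   # fixed seed makes the split repeatable
    n_train = round(len(rows) * train_share)
    n_validation = round(len(rows) * validation_share)
    training = rows[:n_train]
    validation = rows[n_train:n_train + n_validation]
    test = rows[n_train + n_validation:]  # whatever is left over
    return training, validation, test

# Hypothetical data set of ten examples, identified by number
training, validation, test = split(range(10))
print(len(training), len(validation), len(test))  # 6 2 2
```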
