# Machine learning glossary

| Term | Meaning |
| --- | --- |
| association | The extent to which values of one field depend on or are predicted by values of another field. |
| bagging | A modeling technique that is designed to enhance the stability of the model and avoid overfitting. See also boosting, overfitting. |
| batch scoring | Running model predictions offline (asynchronously) on a large data set. |
| Bayesian network | A graphical model that displays variables in a data set and the probabilistic or conditional independencies between them. |
| binomial logistic regression | A logistic regression that is used for targets with two discrete categories. See also multinomial logistic regression, target. |
| boosting | A modeling technique that creates a sequence of models, rather than a single model, to obtain more accurate predictions. Cases are classified by applying the whole set of models to them, and then combining the separate predictions into one overall prediction. See also bagging. |
| classification and regression tree algorithm | A decision tree algorithm that uses recursive partitioning to split the training records into segments by minimizing the impurity at each step. See also Quick, Unbiased, Efficient Statistical Tree algorithm. |
| confidence score | An estimate of the accuracy of a prediction, usually expressed as a number from 0.0 to 1.0. |
| correlation | A statistical measure of the association between two numeric fields. Values range from -1 to +1. A correlation of 0 means that there is no linear relationship between the two fields. |
| Cox regression algorithm | An algorithm that produces a survival function that predicts the probability that the event of interest has occurred at a given time for given values of the predictor variables. |
| cross-validation | A technique for testing how well a model generalizes in the absence of a holdout test sample. Cross-validation divides the training data into a number of subsets, and then builds the same number of models, with each subset held out in turn. Each model is tested on its held-out subset, and the average accuracy of the models across those subsets is used to estimate the accuracy of the model when applied to new data. See also overfitting. |
| data quality | The extent to which data has been accurately coded and stored. Factors that adversely affect data quality include missing values, data entry errors, measurement errors, and coding inconsistencies. |
| data set | A collection of data, usually in the form of rows (records) and columns (fields) and contained in a file or database table. |
| data visualization | The process of presenting data patterns in graphical format, including the use of traditional plots as well as advanced interactive graphics. In many cases, visualization reveals patterns that would be difficult to find using other methods. |
| decision list | An algorithm that identifies subgroups or segments that show a higher or lower likelihood of a given binary (yes/no) outcome relative to the overall population. |
| decision tree algorithm | An algorithm that classifies data, or predicts future outcomes, based on a set of decision rules. |
| deployment | The process of enabling the widespread use of a predictive analytics project within an organization. |
| evaluate | To determine whether a model will accurately predict the target on new and future data. |
| heat map | A graphical representation of data values in a two-dimensional table format, in which higher values are represented by darker colors and lower values by lighter ones. |
| histogram | A graphical display of the distribution of values for a numeric field, in the form of a vertical bar chart in which taller bars indicate greater numbers of cases. |
| linear regression | A statistical technique for estimating a linear model for a continuous (numeric) output field. Linear models predict a continuous target based on linear relationships between the target and one or more predictors. See also regression. |
| linear regression model | A modeling algorithm that assumes that the relationship between the input and the output for the model is of a particular, simple form. The model fits the best line to the data and generates a linear mapping between the input variables and each output variable. |
| logistic regression | A statistical technique for classifying records based on the values of the input fields. Logistic regression is similar to linear regression, but takes a categorical target field instead of a numeric one. See also regression. |
| misclassification cost | A specification of the relative importance of different kinds of classification errors, such as classifying a high-risk credit applicant as low risk. Costs are specified in the form of weights applied to specific incorrect predictions. |
| model building | The process of creating data models by using algorithms. Model building typically consists of several stages: training, testing, and (optionally) validation. See also testing, training, validation. |
| multinomial logistic regression | A logistic regression that is used for targets with more than two categories. See also binomial logistic regression, target. |
| neural network | A mathematical model for predicting or classifying cases by using a complex mathematical scheme that simulates an abstract version of brain cells. A neural network is trained by presenting it with a large number of observed cases, one at a time, and allowing it to update itself repeatedly until it learns the task. |
| online scoring | Applying model predictions in real time to a single record through a published endpoint, within or outside the organization. Online scoring typically requires a fast response, on the order of milliseconds. |
| overfitting | The unintentional modeling of chance variations in data, leading to models that do not work well when applied to other data sets. Bagging and cross-validation are two methods for detecting or preventing overfitting. See also bagging, cross-validation. |
| partition | To divide a data set into separate subsets or samples for the training, testing, and validation stages of model building. |
| predictive analytics | A business process and a set of related technologies that are concerned with the prediction of future possibilities and trends. Predictive analytics applies such diverse disciplines as probability, statistics, machine learning, and artificial intelligence to business problems to find the best action for a given situation. |
| Predictive Model Markup Language (PMML) | An XML-based language defined by the Data Mining Group that provides a way for companies to define predictive models and share models between compliant vendors' applications. |
| probability | A measure of the likelihood that an event will occur. Probability values range from 0 to 1; 0 implies that the event never occurs, and 1 implies that the event always occurs. A probability of 0.5 indicates that the event has an even chance of occurring or not occurring. |
| Quick, Unbiased, Efficient Statistical Tree algorithm (QUEST) | A decision tree algorithm that provides a binary classification method for building the tree. The algorithm is designed to reduce the processing time required for large C & R tree analyses while also reducing the tendency found in classification tree methods to favor inputs that allow more splits. See also classification and regression tree algorithm, decision tree algorithm. |
| regression | A statistical technique for estimating the value of a target field based on the values of one or more input fields. See also linear regression, logistic regression. |
| regression tree algorithm | A tree-based algorithm that splits a sample of cases repeatedly to derive homogeneous subsets, based on values of a numeric output field. See also Chi-squared Automatic Interaction Detector algorithm. |
| score | To apply a predictive model to a data set with the intention of producing a classification or prediction for a new, untested case. |
| script | A series of commands, combined in a file, that carry out a particular function when the file is run. Scripts are interpreted as they are run. |
| testing | The stage of model building in which the model produced by the training stage is tested against a data subset for which the outcome is already known. See also model building, training, validation. |
| training | The initial stage of model building, involving a subset of the source data. The model can then be tested against a further, different subset for which the outcome is already known. See also model building, testing, validation. |
| transformation | A formula that is applied to the values of a field to alter the distribution of values. Some statistical methods require that fields have a particular distribution. When a field's distribution differs from what is required, a transformation (such as taking logarithms of values) can often remedy the problem. |
| unrefined model | A model that contains information extracted from the data but which is not designed for generating predictions directly. |
| validation | An optional final stage of model building in which the refined model from the testing stage is validated against a further subset of the source data. See also model building, testing, training. |
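To make the *correlation* entry concrete, here is a minimal pure-Python sketch of the Pearson correlation coefficient. The function name `pearson_correlation` is illustrative, not from any particular library:

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Returns a value in the range [-1, +1]; 0 means no linear relationship.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: covariance term; denominator: product of standard deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# A perfectly linear increasing relationship gives +1.
r = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])  # -> 1.0
```

In practice you would use a library routine (for example `numpy.corrcoef`), but the arithmetic above is the whole definition.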
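The *cross-validation* entry describes a procedure that is easy to sketch in code. The following pure-Python outline assumes generic `fit` and `score` callables standing in for any model and accuracy measure; the toy mean-predictor at the end is purely illustrative:

```python
def k_fold_indices(n, k):
    """Assign record indices 0..n-1 to k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(data, targets, k, fit, score):
    """Hold each fold out in turn, train on the rest, and average the scores."""
    scores = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train_idx = [i for i in range(len(data)) if i not in held]
        model = fit([data[i] for i in train_idx], [targets[i] for i in train_idx])
        scores.append(score(model,
                            [data[i] for i in held_out],
                            [targets[i] for i in held_out]))
    return sum(scores) / len(scores)

# Toy "model" that always predicts the mean of its training targets,
# scored by negative mean absolute error (higher is better).
fit_mean = lambda X, y: sum(y) / len(y)
neg_mae = lambda model, X, y: -sum(abs(model - t) for t in y) / len(y)
avg_score = cross_validate(list(range(10)), [2.0 * v for v in range(10)], 5,
                           fit_mean, neg_mae)
```

The averaged score estimates how the model would perform on new data, without sacrificing any records to a permanent holdout sample.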
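Similarly, the mechanics behind the *bagging* entry (bootstrap aggregating) can be sketched in a few lines. The `fit` and `predict` parameters below are placeholders for any learner, and the mean-predictor usage is a stand-in, not a real model:

```python
import random

def bagging_predict(train_X, train_y, fit, predict, x, n_models=25, seed=0):
    """Fit n_models on bootstrap resamples of the training data and
    average their predictions for a single new case x."""
    rng = random.Random(seed)
    n = len(train_X)
    predictions = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        model = fit([train_X[i] for i in idx], [train_y[i] for i in idx])
        predictions.append(predict(model, x))
    return sum(predictions) / len(predictions)

# Placeholder learner: predict the mean training target, ignoring x.
fit_mean = lambda X, y: sum(y) / len(y)
predict_mean = lambda model, x: model
```

Averaging over many resampled models is what stabilizes the prediction and reduces overfitting relative to a single model.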
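Finally, the *partition* entry corresponds to a simple random split. This sketch assumes a conventional 60/20/20 division into the training, testing, and validation subsets described above; the exact fractions vary by project:

```python
import random

def partition(records, train=0.6, test=0.2, validation=0.2, seed=0):
    """Randomly divide a data set into train/test/validation subsets."""
    if abs(train + test + validation - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (shuffled[:n_train],                  # training
            shuffled[n_train:n_train + n_test],  # testing
            shuffled[n_train + n_test:])         # validation

train_set, test_set, validation_set = partition(range(100))
```

Shuffling before slicing ensures each subset is a random sample, so the testing and validation stages see data that is representative of the whole.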