|association
||The extent to which values of one field depend on or are
predicted by values of another field.
|bagging
||A modeling technique that is designed to enhance the stability
of the model and avoid overfitting. See also boosting.
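A minimal Python sketch of the idea, using a toy nearest-neighbor base model (the function names and data are illustrative, not part of any particular product): each model is trained on a bootstrap resample of the training data, and the ensemble classifies by majority vote.

```python
import random

def bagged_predict(train, x, build_model, n_models=25, seed=0):
    """Train n_models on bootstrap resamples of `train` and combine
    their class predictions for `x` by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Sample with replacement: same size as the training set.
        resample = [rng.choice(train) for _ in train]
        votes.append(build_model(resample)(x))
    # Majority vote; with an odd n_models and two classes, no ties occur.
    return max(set(votes), key=votes.count)

# Toy base model: predict the label of the nearest training point.
def build_nearest(train):
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

data = [(-2.0, "no"), (-1.0, "no"), (1.0, "yes"), (2.0, "yes")]
print(bagged_predict(data, 1.5, build_nearest))
```

Because each model sees a slightly different resample, the combined vote is less sensitive to chance variation in any single sample.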
|batch scoring
||Running the model predictions offline (asynchronously) on a
batch of records.
|Bayesian network
||A graphical model that displays variables in a data set and the
probabilistic or conditional independencies between them.
|binomial logistic regression
||A logistic regression that is used for targets with two
discrete categories. See also multinomial logistic regression.
|boosting
||A modeling technique that creates a sequence of models, rather
than a single model, to obtain more accurate predictions. Cases are
classified by applying the whole set of models to them, and then
combining the separate predictions into one overall prediction. See
also bagging.
|classification and regression tree algorithm
||A decision tree algorithm that uses recursive partitioning to
split the training records into segments by minimizing the impurity
at each step. See also Quick, Unbiased, Efficient Statistical Tree
algorithm.
|confidence
||An estimate of the accuracy of a prediction, usually expressed
as a number from 0.0 to 1.0.
|correlation
||A statistical measure of the association between two numeric
fields. Values range from -1 to +1. A correlation of 0 means that
there is no relationship between the two fields.
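For illustration, the Pearson correlation coefficient, the most common such measure, can be computed directly (a self-contained Python sketch; the function name is ours):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric fields."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear: 1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly inverse: -1.0
```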
|Cox regression algorithm
||An algorithm that produces a survival function that predicts
the probability that the event of interest has occurred at a given
time for given values of the predictor variables.
|cross-validation
||A technique for testing how well a model generalizes in the
absence of a holdout test sample. Cross-validation divides the
training data into a number of subsets, and then builds the same
number of models, with each subset held out in turn. Each of those
models is tested on its holdout subset, and the average accuracy of
the models on those holdout subsets is used to estimate the
accuracy of the model when applied to new data. See also
overfitting.
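The procedure can be sketched in a few lines of Python (the majority-class base model and helper names here are illustrative):

```python
def cross_validate(data, k, fit, accuracy):
    """k-fold cross-validation: hold out each fold in turn, train on
    the rest, and average accuracy over the k holdout folds."""
    folds = [data[i::k] for i in range(k)]  # round-robin split into k subsets
    scores = []
    for i, holdout in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i
                    for row in fold]
        model = fit(training)
        scores.append(accuracy(model, holdout))
    return sum(scores) / k

# Toy model: always predict the majority class of the training labels.
def fit_majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def accuracy(model, rows):
    return sum(1 for _, label in rows if label == model) / len(rows)

data = [(x, "yes" if x > 2 else "no") for x in range(10)]
print(cross_validate(data, 5, fit_majority, accuracy))  # 0.7
```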
|data quality
||The extent to which data has been accurately coded and stored.
Factors that adversely affect data quality include missing values,
data entry errors, measurement errors, and coding errors.
|data set
||A collection of data, usually in the form of rows (records) and
columns (fields) and contained in a file or database table.
|data visualization
||The process of presenting data patterns in graphical format,
including the use of traditional plots as well as advanced
interactive graphics. In many cases, visualization reveals patterns
that would be difficult to find using other methods.
|decision list algorithm
||An algorithm that identifies subgroups or segments that show a
higher or lower likelihood of a given binary (yes/no) outcome
relative to the overall population.
|decision tree algorithm
||An algorithm that classifies data, or predicts future outcomes,
based on a set of decision rules.
|deployment
||The process of enabling the widespread use of a predictive
analytics project within an organization.
|evaluation
||The process of determining whether a model will accurately
predict the target on new and future data.
|heat map
||A graphical representation of data values in a two-dimensional
table format, in which higher values are represented by darker
colors and lower values by lighter ones.
|histogram
||A graphical display of the distribution of values for a numeric
field, in the form of a vertical bar chart in which taller bars
indicate higher frequencies.
|linear regression
||A statistical technique for estimating a linear model for a
continuous (numeric) output field. Linear models predict a
continuous target based on linear relationships between the target
and one or more predictors. See also regression.
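With a single predictor, the least-squares coefficients have a closed form, sketched here in Python (function name ours):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (one predictor).
    Returns the intercept a and slope b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lie exactly on y = 1 + 2x
print(a, b)  # 1.0 2.0
```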
|linear regression model
||A modeling algorithm that assumes that the relationship between
the input and the output for the model is of a particular, simple
form. The model fits the best line through linear regression and
generates a linear mapping between the input variables and each
output variable.
|logistic regression
||A statistical technique for classifying records based on the
values of the input fields. Logistic regression is similar to
linear regression, but takes a categorical target field instead of
a numeric one. See also regression.
|misclassification cost
||A specification of the relative importance of different kinds
of classification errors, such as classifying a high-risk credit
applicant as low risk. Costs are specified in the form of weights
applied to specific incorrect predictions.
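A small numeric sketch of the idea, using the credit example (the cost weights and labels are invented): misclassifying a high-risk applicant as low risk is weighted far more heavily than the reverse error.

```python
# Cost matrix: (actual, predicted) -> weight; correct predictions cost 0.
costs = {
    ("high_risk", "low_risk"): 10.0,  # costly: risky applicant approved
    ("low_risk", "high_risk"): 1.0,   # mild: safe applicant declined
}

def total_cost(pairs):
    """Sum the misclassification cost over (actual, predicted) pairs."""
    return sum(costs.get(pair, 0.0) for pair in pairs)

predictions = [("high_risk", "low_risk"),
               ("low_risk", "high_risk"),
               ("low_risk", "low_risk")]
print(total_cost(predictions))  # 11.0
```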
|model building
||The process of creating data models by using algorithms. Model
building typically consists of several stages: training, testing,
and (optionally) validation or evaluation. See also testing,
training, validation.
|multinomial logistic regression
||A logistic regression that is used for targets with more than
two categories. See also binomial logistic regression, target.
|neural network
||A mathematical model for predicting or classifying cases by
using a complex mathematical scheme that simulates an abstract
version of brain cells. A neural network is trained by presenting
it with a large number of observed cases, one at a time, and
allowing it to update itself repeatedly until it learns the
patterns in the data.
|online scoring
||To apply model predictions in real time to a single record
through a published endpoint, within or outside the organization;
a fast response, on the order of milliseconds, is expected.
|overfitting
||The unintentional modeling of chance variations in data,
leading to models that do not work well when applied to other data
sets. Bagging and cross-validation are two methods for detecting or
preventing overfitting. See also bagging, cross-validation.
|partition
||To divide a data set into separate subsets or samples for the
training, testing, and validation stages of model building.
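A minimal sketch of such a three-way split (the 60/20/20 proportions and function name are illustrative choices, not a prescribed standard):

```python
import random

def partition(rows, train=0.6, test=0.2, seed=42):
    """Shuffle and split rows into training, testing, and validation
    subsets. The validation share is whatever remains."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n_train = int(len(rows) * train)
    n_test = int(len(rows) * test)
    return (rows[:n_train],
            rows[n_train:n_train + n_test],
            rows[n_train + n_test:])

train_set, test_set, valid_set = partition(range(100))
print(len(train_set), len(test_set), len(valid_set))  # 60 20 20
```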
|predictive analytics
||A business process and a set of related technologies that are
concerned with the prediction of future possibilities and trends.
Predictive analytics applies such diverse disciplines as
probability, statistics, machine learning, and artificial
intelligence to business problems to find the best action for a
given situation.
|Predictive Model Markup Language (PMML)
||An XML-based language defined by the Data Mining Group that
provides a way for companies to define predictive models and share
models between compliant vendors' applications.
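As an illustrative fragment (the field names, coefficients, and model name below are invented), a minimal PMML document describing a linear regression model might look like this:

```xml
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative linear regression model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="amount" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="amount_model" functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="amount" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="50.0">
      <NumericPredictor name="age" coefficient="3.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

A model exported in this form by one compliant application can be imported and scored by another.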
|probability
||A measure of the likelihood that an event will occur.
Probability values range from 0 to 1; 0 implies that the event
never occurs, and 1 implies that the event always occurs. A
probability of 0.5 indicates that the event has an even chance of
occurring or not occurring.
|Quick, Unbiased, Efficient Statistical Tree algorithm
||A decision tree algorithm that provides a binary classification
method for building the tree. The algorithm is designed to reduce
the processing time required for large C&R Tree analyses
while also reducing the tendency found in classification tree
methods to favor inputs that allow more splits. See also
classification and regression tree algorithm, decision tree
algorithm.
|regression
||A statistical technique for estimating the value of a target
field based on the values of one or more input fields. See also
linear regression, logistic regression.
|regression tree algorithm
||A tree-based algorithm that splits a sample of cases repeatedly
to derive homogeneous subsets, based on values of a numeric output
field. See also Chi-squared Automatic Interaction Detector
algorithm.
|score
||To apply a predictive model to a data set with the intention of
producing a classification or prediction for a new, untested case.
|script
||A series of commands, combined in a file, that carry out a
particular function when the file is run. Scripts are interpreted
as they are run.
|testing
||The stage of model building in which the model produced by the
training stage is tested against a data subset for which the
outcome is already known. See also model building, training,
validation.
|training
||The initial stage of model building, involving a subset of the
source data. The model can then be tested against a further,
different subset for which the outcome is already known. See also
model building, testing, validation.
|transformation
||A formula that is applied to the values of a field to alter the
distribution of values. Some statistical methods require that
fields have a particular distribution. When a field's distribution
differs from what is required, a transformation (such as taking
logarithms of values) can often remedy the problem.
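The log transformation mentioned above can be sketched in a few lines (Python; the function name is ours), pulling a strongly right-skewed field toward a more even distribution:

```python
from math import log

def log_transform(values):
    """Apply a natural-log transformation to reduce right skew.
    Requires strictly positive values."""
    return [log(v) for v in values]

skewed = [1, 10, 100, 1000]      # values span three orders of magnitude
print(log_transform(skewed))     # evenly spaced on the log scale
```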
|unrefined model
||A model that contains information extracted from the data but
which is not designed for generating predictions directly.
|validation
||An optional final stage of model building in which the refined
model from the testing stage is validated against a further subset
of the source data. See also model building, testing, training.