Predictive Analytics, Big Data, Hadoop, PMML: April 2009

Friday, April 3, 2009

PMML - Data Pre-Processing Primer

PMML defines several kinds of data transformations. These are:

Normalization: map values to numbers; the input can be continuous (element NormContinuous) or discrete (element NormDiscrete).
Discretization: map continuous values to discrete values.
Value Mapping: map discrete values to discrete values.
Text Indexing (text mining element introduced in PMML 4.2): derive a frequency-based value for a given term (not covered in this primer).
Functions: derive a value by applying a function to one or more parameters.

Below we will see how these transformations can be used and combined to allow for powerful data pre-processing in PMML and the Zementis products: ADAPA for real-time scoring and UPPI for Big Data scoring.

PMML transformations and much more are covered in the PMML book "PMML in Action", available on Amazon.com. A companion to the book is the Transformations Generator, a graphical interface for learning data pre-processing in PMML.

Transformations in PMML are also covered in depth in a PMML course being offered online by UCSD Extension. For more information about this course, please visit the UCSD Predictive Models with PMML page.

Normalization: NormContinuous

This is a general method to normalize a continuous variable to another continuous variable. In the PMML code below, two normalizations take place and two derived variables are created. The first derived variable is named "DerivedNormalizedVar1" and the second "DerivedNormalizedVar2". The normalization itself is defined under the element NormContinuous. Typically, NormContinuous is used to normalize input values to the range between 0 and 1, as shown in the first normalization example. However, as shown in the second normalization example, one can have as many LinearNorm elements as necessary, and these do not need to normalize values to between 0 and 1; the only restriction is that the LinearNorm elements must be listed in increasing order of their orig attribute. The normalized variable is then a linearly interpolated value between the LinearNorm elements.

Note that in the second example, we are using the attribute outliers to specify that any outliers to the normalization should be treated as extreme values. In this case, all "InputVar2" values lower than 100 will be assigned value "0" and all values greater than 900 will be assigned value "4".
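
To make the interpolation concrete, here is a minimal Python sketch of the NormContinuous semantics (an illustration, not PMML itself). Only the end points of the second example (100 maps to 0, 900 maps to 4) come from the text above; the intermediate breakpoints, the function name norm_continuous and the 0 to 100 input range of the first example are assumptions made for illustration.

```python
from bisect import bisect_right

def norm_continuous(x, points, outliers="asIs"):
    """Piecewise-linear normalization over (orig, norm) pairs sorted by orig."""
    origs = [o for o, _ in points]
    if outliers == "asExtremeValues":
        # Clamp to the first/last orig, so the result is the first/last norm.
        x = min(max(x, origs[0]), origs[-1])
    # Pick the segment containing x; outside the range, extend the end segment.
    i = min(max(bisect_right(origs, x) - 1, 0), len(points) - 2)
    (o1, n1), (o2, n2) = points[i], points[i + 1]
    return n1 + (x - o1) * (n2 - n1) / (o2 - o1)

# First example: normalize to [0, 1]; the input range 0..100 is assumed.
VAR1_POINTS = [(0, 0), (100, 1)]
# Second example: only 100 -> 0 and 900 -> 4 come from the text;
# the two middle breakpoints are hypothetical.
VAR2_POINTS = [(100, 0), (300, 1), (500, 3), (900, 4)]

print(norm_continuous(25, VAR1_POINTS))
print(norm_continuous(50, VAR2_POINTS, outliers="asExtremeValues"))
print(norm_continuous(1000, VAR2_POINTS, outliers="asExtremeValues"))
```

With outliers treated as extreme values, inputs below 100 or above 900 are clamped before interpolation, which is why they end up at 0 and 4 respectively.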

Normalization: NormDiscrete

This method is used to transform string values to numeric values. Many models encode string values into numeric values in order to perform mathematical functions. For example, regression and neural network models often split categorical and ordinal variables into multiple dummy variables.

The PMML code below implements the following logic:

IF CategoricalInputVar1 == Partner
THEN
DerivedVar1 = 1
ELSE
DerivedVar1 = 0

IF CategoricalInputVar1 == Associate
THEN
DerivedVar2 = 1
ELSE
DerivedVar2 = 0

IF CategoricalInputVar1 == Colleague
THEN
DerivedVar3 = 1
ELSE
DerivedVar3 = 0

In this example, variable "CategoricalInputVar1" has three possible valid values: "Partner", "Associate", and "Colleague". By using the NormDiscrete element, we can create derived variables (the dummy variables in this case) that will be assigned values "1" or "0" depending on the value of the categorical input variable being processed.

Note that we are also handling missing values in this example. The mapMissingTo attribute means that if the input value "CategoricalInputVar1" is missing then all three derived variables will be assigned "0".
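
As a rough illustration of this logic, the Python sketch below mimics what NormDiscrete does for each category (it is not PMML). A missing input is represented by None, and the helper names norm_discrete and dummy_encode are hypothetical.

```python
def norm_discrete(value, target, map_missing_to=None):
    """Return 1.0 if value equals target, else 0.0; None stands for a missing input."""
    if value is None:
        return map_missing_to
    return 1.0 if value == target else 0.0

def dummy_encode(value):
    """Create one dummy variable per valid category of CategoricalInputVar1."""
    categories = ["Partner", "Associate", "Colleague"]
    return {
        f"DerivedVar{i}": norm_discrete(value, category, map_missing_to=0.0)
        for i, category in enumerate(categories, start=1)
    }

print(dummy_encode("Associate"))
print(dummy_encode(None))  # missing input: all three dummies get 0.0
```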

Discretization

This method maps continuous values to discrete string values. For a continuous variable we define intervals, and if the variable falls inside an interval, we assign to it the string value defined for that interval. In the PMML code below, the Discretize element defines the continuous field "InputVar" which is to be discretized. The DiscretizeBin elements define the intervals. The first one defines a bin which has a rightMargin of 1000. By default, its left margin (attribute leftMargin) is negative infinity. The closure attribute finally defines the interval as (-infinity,1000], i.e. the interval is open on the left and includes its right margin, 1000. So if the value of the variable "InputVar" falls in this interval, we assign to the derived variable "DerivedVar" the string value "low". Similarly, if the value of "InputVar" is in (1000,100000], the derived value is equal to "medium" and if it is in (100000,1000000), the derived value is equal to "high".

The mapping from a continuous to a discrete variable can be many-to-one but not one-to-many. In other words, two intervals can be mapped to the same string value but the same interval may not have more than one string value. This implies that the intervals defined must be disjoint; there can be no overlap between two defined intervals.

One can establish what should happen if the input is not in any of the defined intervals. The attribute defaultValue in the Discretize element establishes that if the input does not lie in any of the intervals (as for example, if "InputVar"=2,000,000), then by default it is assigned the value "extreme".
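
The binning described above can be sketched in Python as follows. This is only an illustration of the semantics, not PMML; the tuple layout and the function name discretize are assumptions, while the intervals, bin values and default value are the ones from the text.

```python
import math

# (left margin, right margin, left closed?, right closed?, bin value)
BINS = [
    (-math.inf, 1000, False, True, "low"),        # (-infinity, 1000]
    (1000, 100000, False, True, "medium"),        # (1000, 100000]
    (100000, 1000000, False, False, "high"),      # (100000, 1000000)
]

def discretize(x, bins=BINS, default_value="extreme"):
    """Return the bin value of the interval containing x, or default_value."""
    for left, right, left_closed, right_closed, bin_value in bins:
        above = x >= left if left_closed else x > left
        below = x <= right if right_closed else x < right
        if above and below:
            return bin_value
    return default_value

print(discretize(1000))
print(discretize(2000000))
```

Because the intervals are disjoint, the order in which the bins are checked does not affect the result.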

Discretization can also be paired with the element NormDiscrete to generate dummy variables which indicate whether an input variable belongs to a certain interval. The PMML example shown below implements just that. We assume a numeric input variable named "InputVar" which leads to the creation of three dummy variables: "DerivedVar1", "DerivedVar2", and "DerivedVar3". These are used to indicate membership of "InputVar" in one of the following intervals: (-infinity,100], (100,200], and (200,+infinity).

Note that in this example, we are using the Discretize element to map continuous values to an integer which is then used by the NormDiscrete element to indicate interval membership.
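
A Python sketch of this two-step idea (not PMML) might look as follows. The integer codes 1, 2 and 3 assigned to the intervals and both function names are assumptions made for illustration.

```python
def interval_code(x):
    """Discretize step: map x to the integer code of its interval (codes are assumed)."""
    if x <= 100:
        return 1      # (-infinity, 100]
    if x <= 200:
        return 2      # (100, 200]
    return 3          # (200, +infinity)

def interval_dummies(x):
    """NormDiscrete step: one dummy variable per interval code."""
    code = interval_code(x)
    return {f"DerivedVar{k}": 1 if code == k else 0 for k in (1, 2, 3)}

print(interval_dummies(150))
```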

Value Mapping

This method is used to map discrete values to discrete values. This is done by using a table which lists the input values and the mapped output values. Each row in the table refers to a possible value for the input variable. Each column in the table has a name which is used to refer to that column. This is defined inside the element MapValues.

In the PMML code below, the attribute outputColumn in element MapValues establishes that the mapped output value will be found in the column named "color". The element FieldColumnPair uses the attribute field to define the input for mapping, which consists of one or more input variables. In this example, we use two variables, "InputVar1" and "InputVar2". We find the input value in the "animal" column for "InputVar1" and the input value in the "property" column for "InputVar2" and map them to the output value in the "color" column.

The InlineTable element finally defines the table to be used for value mapping. So for example, if variable "InputVar1" has the value "dog" and variable "InputVar2" has the value "smart", we find "dog" in column "animal" and "smart" in column "property", both in the first row. These are then mapped to the value in the column named "color" in the same row. In this way, derived field "DerivedVar" is assigned the value "red".

As in the Discretize element, the MapValues element can have a defaultValue attribute which specifies what to map the input to if it does not have a matching value in any row.
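
The table lookup can be sketched in Python as shown below (an illustration of the semantics, not PMML). Only the first row (dog, smart, red) comes from the text; the second row, the default value "unknown" and the function name map_values are hypothetical.

```python
ROWS = [
    {"animal": "dog", "property": "smart", "color": "red"},   # row described in the text
    {"animal": "cat", "property": "lazy", "color": "blue"},   # hypothetical extra row
]

def map_values(inputs, rows, output_column, default_value=None):
    """inputs maps a column name to the input value that must match in that column."""
    for row in rows:
        if all(row.get(column) == value for column, value in inputs.items()):
            return row[output_column]
    return default_value

# InputVar1 is paired with column "animal", InputVar2 with column "property".
print(map_values({"animal": "dog", "property": "smart"}, ROWS, "color"))
print(map_values({"animal": "dog", "property": "lazy"}, ROWS, "color", default_value="unknown"))
```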

In the PMML code below, we use elements MapValues and NormDiscrete to group small sets of categorical values. More specifically, we want to find out if input variable "VarInputColor" belongs to a specific group of colors. We do that by using the InlineTable element of MapValues to map different colors to the same number. We then use the element NormDiscrete to create dummy variables which are used to indicate group membership. This problem can be represented schematically in the following way:

IF VarInputColor is in ("Yellow","Red")
THEN
VarColorGroup_1 = 1
ELSE
VarColorGroup_1 = 0

IF VarInputColor is in ("Blue","Green")
THEN
VarColorGroup_2 = 1
ELSE
VarColorGroup_2 = 0

Note that in the example shown above we are mapping missing values to value "2" and setting the default value to "1" inside the MapValues transformation element. Derived field "GroupedInputColor" is assigned the result of the mapping between the input strings and the number representing the group. This variable is then used to generate the desired output. If "GroupedInputColor" contains value "1", variable "VarColorGroup_1" is assigned value "1", otherwise it is assigned value "0". If "GroupedInputColor" contains value "2", variable "VarColorGroup_2" is assigned value "1", otherwise it is assigned value "0". This is accomplished by the use of the NormDiscrete element as described earlier.
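
The grouping logic, including the treatment of missing and unlisted colors, can be sketched in Python as follows (not PMML). The dictionary COLOR_GROUP stands in for the inline table, None represents a missing input, and the function names are hypothetical.

```python
COLOR_GROUP = {"Yellow": 1, "Red": 1, "Blue": 2, "Green": 2}

def grouped_input_color(color, map_missing_to=2, default_value=1):
    """MapValues step: map a color to its group number."""
    if color is None:
        return map_missing_to          # missing input
    return COLOR_GROUP.get(color, default_value)  # unlisted colors get the default

def color_group_dummies(color):
    """NormDiscrete step: one dummy variable per group."""
    group = grouped_input_color(color)
    return {"VarColorGroup_1": int(group == 1), "VarColorGroup_2": int(group == 2)}

print(color_group_dummies("Red"))
print(color_group_dummies(None))
```

One consequence of these settings is that a missing color is counted in group 2 and an unlisted color in group 1.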

Large sets of values can be handled in ADAPA by an external mapping table. For example, imagine that you would like to map all existing colors (that you can imagine) into four different groups and then create dummy variables to indicate such grouping. You can design this table externally (ADAPA allows external tables to be referred from a PMML file) and use the PMML element TableLocator to reference it from inside the MapValues transformation element. The PMML code below implements this functionality.

Note that we are mapping missing values to value "4" and setting the default value to "3" inside the MapValues transformation element. Derived field "GroupedInputColor" is assigned the result of the mapping between the input colors and the number representing the group they belong to. As before, variable "GroupedInputColor" is then used to generate the desired output through the use of the element NormDiscrete.

The mapping table itself is only being referenced inside the element TableLocator. We are using a PMML extension in this case to indicate to ADAPA that the mapping table we want to access is named "GroupedInputColor".

Functions

If a certain transformation is to be applied to input data many times and to multiple fields, it makes sense to encapsulate the transformation inside a function and use it as many times as necessary. This reduces the complexity of the PMML model and greatly simplifies its application. PMML provides a number of built-in functions and also allows users to define their own functions.

Built-in Functions

ADAPA supports all PMML built-in functions. The complete list is shown below.

+, -, * and /
min, max, sum, avg, median, product
log10, ln, sqrt, abs, exp, pow, threshold, floor, ceil, round
isMissing, isNotMissing
equal, notEqual, lessThan, lessOrEqual, greaterThan, greaterOrEqual
and, or
not
concat
matches, replace (regular expressions)
uppercase
substring
trimBlanks
formatNumber
formatDatetime
dateDaysSinceYear
dateSecondsSinceYear
dateSecondsSinceMidnight

Note that functions such as "min", "max", "sum" and "avg" take a variable number of parameters (derived fields or input fields) and return a single value which can then be assigned to a new derived field. Please refer to the DMG website - Built-in Functions for code examples and descriptions.

The PMML code shown below implements the following arithmetic operation:
ResultVar=maximum(round(InputVar1/3.3),2^(1+log(1.3*InputVar2+1)))

Note that it uses two different input variables: "InputVar1" and "InputVar2" as well as a number of numeric constants. The example shows many built-in functions which are used together to implement such a complex operation. The end result is assigned to derived variable "ResultVar".
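
For reference, the same operation can be written as a short Python sketch (not PMML). Two assumptions are made here: "log" is taken to be the natural logarithm (PMML's "ln"; "log10" would be the alternative reading), and "round" is taken to round halves up to the nearest integer.

```python
import math

def result_var(input_var1, input_var2):
    # "round" applied to InputVar1 / 3.3 (half-up rounding is assumed)
    rounded = math.floor(input_var1 / 3.3 + 0.5)
    # "pow" with base 2; the natural logarithm is assumed for "log"
    power = 2 ** (1 + math.log(1.3 * input_var2 + 1))
    # "max" over the two intermediate results
    return max(rounded, power)

print(result_var(33, 0))
```

In PMML, each of these steps would be a nested function application, with the numeric constants and the two input fields as the innermost arguments.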

PMML also defines many functions that support boolean operations. These are used to compare parameters which are required to be of identical type (e.g., strings or dates) or of compatible type for numeric variables (e.g., double vs. integer). The result is of boolean type: "true" or "false", which is evaluated by functions such as "if" and "not".

PMML functions which operate on two input attributes, e.g. "a" and "b", of identical or compatible types are:

equal: evaluates to "true" if "a" is equal to "b", "false" otherwise.
notEqual: evaluates to "true" if "a" is not equal to "b", "false" otherwise.
lessThan: evaluates to "true" if "a" is less than "b", "false" otherwise.
lessOrEqual: evaluates to "true" if "a" is less than or equal to "b", "false" otherwise.
greaterThan: evaluates to "true" if "a" is greater than "b", "false" otherwise.
greaterOrEqual: evaluates to "true" if "a" is greater than or equal to "b", "false" otherwise.
isIn: evaluates to "true" if "a" is in "b", "false" otherwise. Attribute "b" in this case is an array of values.
isNotIn: evaluates to "true" if "a" is not in "b", "false" otherwise. Attribute "b" in this case is an array of values.

PMML functions which operate on a single input attribute, e.g. "a", are:

isMissing: evaluates to "true" if "a" is missing, i.e. equal to NULL, "false" otherwise.
isNotMissing: evaluates to "true" if "a" is not missing, "false" otherwise.

PMML functions which evaluate boolean operations are:

not: operates on a single boolean attribute. Negates existing boolean evaluation result.
and: summarizes the results of two or more independent boolean operations. Evaluates to "true" only if all operations are "true", "false" otherwise.
or: summarizes the results of two or more independent boolean operations. Evaluates to "true" if a single operation is "true", "false" only if all operations are "false".
if: implements IF-THEN-ELSE logic. The ELSE part is optional.

Below, we give a few examples of how these functions can be used to implement logical operations. We start with the PMML code below which implements the following logical and arithmetic operations:

IF InputVar1 == "Partner"
THEN
DerivedVar1 = "P"
ELSE
DerivedVar2 = 2 * InputVar2

In this example, we are using "InputVar1" which contains a string to assign values to two very different derived variables: "DerivedVar1" which is a string and "DerivedVar2" which is an integer.

Note that the code uses the functions "if", "equal", and "not" as well as the built-in function "*". The "else" part of the "if" function is not used because we want to assign the "then" result to "DerivedVar1" and the "else" result to a different variable, "DerivedVar2". Since each data transformation in PMML is encapsulated under a single DerivedField element, two DerivedField elements are needed, with "not" negating the condition for the second one.
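
A Python sketch of this pattern (not PMML) is shown below. The helper pmml_if is hypothetical; it returns None, standing for a missing value, when the condition is false and no "else" part is given, which is how we understand the PMML "if" function to behave in that case.

```python
def pmml_if(condition, then_value, else_value=None):
    """Sketch of "if": without an else part, a false condition yields missing (None)."""
    return then_value if condition else else_value

def derive(input_var1, input_var2):
    is_partner = input_var1 == "Partner"                      # "equal"
    derived_var1 = pmml_if(is_partner, "P")                   # first derived field
    derived_var2 = pmml_if(not is_partner, 2 * input_var2)    # second field: "not" and "*"
    return derived_var1, derived_var2

print(derive("Partner", 5))
print(derive("Associate", 5))
```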

The PMML code below assumes that both "then" and "else" parts of the "if" use the same derived variable "DerivedVar1" to implement the following operations:

IF InputVar1 == "Partner"
THEN
DerivedVar1 = 5.1 * InputVar2
ELSE
DerivedVar1 = InputVar2 / 3.3

Note that in this case, "DerivedVar2" is not being used. The "then" and "else" part are being used to assign the same variable "DerivedVar1" the result of two different computations.
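
In Python terms, this second variant reduces to a single conditional expression. The sketch below only illustrates the logic and is not PMML; the function name is hypothetical.

```python
def derived_var1(input_var1, input_var2):
    """Both branches of the "if" feed the same derived variable."""
    if input_var1 == "Partner":
        return 5.1 * input_var2    # "then" part
    return input_var2 / 3.3        # "else" part

print(derived_var1("Partner", 10))
print(derived_var1("Associate", 6.6))
```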

We showed earlier how to use the Discretize element in conjunction with NormDiscrete to create dummy variables to indicate membership to numeric intervals. What if we would like to do just that, but this time for strings? The PMML code below exemplifies how this could be accomplished by using functions "if" and "lessOrEqual" in conjunction with the NormDiscrete element.

The PMML code below implements the following operations:

IF InputVar less or equal to "Denmark"
THEN
DerivedVar1 = 1 (otherwise = 0)
ELSE
IF InputVar less or equal to "France"
THEN
DerivedVar2 = 1 (otherwise = 0)
ELSE
DerivedVar3 = 1 (otherwise = 0)

Note that the three dummy variables: "DerivedVar1", "DerivedVar2", and "DerivedVar3" are used to indicate the membership of input variable "InputVar" to three different string intervals.
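
The string-interval logic can be sketched in Python as follows (not PMML). Python compares strings by code point, so the exact ordering used by a given PMML engine (for example, with respect to case or locale) is an assumption here; the intermediate code values 1 to 3 and the function name are also hypothetical.

```python
def string_interval_dummies(input_var):
    # Nested "if" / "lessOrEqual": assign an interval code to the string.
    if input_var <= "Denmark":
        code = 1
    elif input_var <= "France":
        code = 2
    else:
        code = 3
    # NormDiscrete step: one dummy variable per interval code.
    return {f"DerivedVar{k}": int(code == k) for k in (1, 2, 3)}

print(string_interval_dummies("Estonia"))
```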

Finally, we end our list of PMML code examples by showing the use of functions "isMissing" and "isIn" combined with function "if". The example implements the following operations:

IF InputVar is missing
THEN
DerivedVar = 1
ELSE
IF InputVar is in ("Partner", "Associate", "Colleague")
THEN
DerivedVar = 2
ELSE
DerivedVar = 3
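
The same decision logic, written as a small Python sketch (not PMML), with None standing for a missing input and a hypothetical function name:

```python
def derived_var(input_var):
    if input_var is None:                                    # "isMissing"
        return 1
    if input_var in ("Partner", "Associate", "Colleague"):   # "isIn"
        return 2
    return 3

print(derived_var(None), derived_var("Associate"), derived_var("Client"))
```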

When defining a PMML document, the pre-processing of the input variables is mainly located inside the following PMML elements: TransformationDictionary and LocalTransformations, although the TransformationDictionary element is mostly used for user-defined functions (element DefineFunction).

For the formal PMML schema definition of the transformations covered here, please refer to the PMML Transformations page on the DMG website.

PMML 101

The Predictive Model Markup Language (PMML) is an XML-based language developed by the Data Mining Group (DMG) which provides a way for applications to define statistical and data mining models and to share models between PMML compliant applications.

PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application, and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is now straightforward.

Since PMML is an XML-based standard, the specification comes in the form of an XML Schema.

PMML Components

PMML follows a very intuitive structure to describe a data mining model, be it an artificial neural network or a logistic regression model. Sequentially, it can be described by the following components:

PMML Elements - A PMML file is highly structured. The list of PMML elements allows for data manipulation and the model to be expressed in a single PMML file.

Header: contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model such as name and version. It also contains an attribute for a timestamp which can be used to specify the date of model creation.

Data Dictionary: contains definitions for all the possible fields used by the model. It is in the data dictionary that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined as well as the data type (such as, string or double).

Data Transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of data transformations.

Normalization: map values to numbers; the input can be continuous or discrete.

Discretization: map continuous values to discrete values.

Value mapping: map discrete values to discrete values.

Functions: derive a value by applying a function to one or more parameters.

Aggregation: used to summarize or collect groups of values.

The ability to represent data transformations (as well as outlier and missing value treatment methods) in conjunction with the model itself is a major advantage of PMML. When it comes to the actual use of a PMML model, pre- and post-processing are embedded into the PMML file itself. All that is needed is the raw input data and users are on the go (see useful links for a primer on how to represent data pre-processing in PMML).

Data transformations and predictive models are represented in a single PMML file, which facilitates model deployment.

Model: contains the definition of the data mining model. A multi-layered feed-forward neural network is the most common neural network representation in contemporary applications, given the popularity and efficacy associated with its training algorithm known as Backpropagation. Such a network is represented in PMML by a "NeuralNetwork" element which contains attributes such as:

Model Name (attribute modelName)

Function Name (attribute functionName)

Algorithm Name (attribute algorithmName)

Activation Function (attribute activationFunction)

Number of Layers (attribute numberOfLayers)

This information is then followed by three kinds of neural layers which specify the architecture of the neural network model being represented in the PMML document. These elements are NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other data mining models including support vector machines, association rules, naive Bayes classifiers, clustering models, text models, decision trees, and different regression models.

Mining Schema: the mining schema lists all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:

Name (attribute name): must refer to a field in the data dictionary.

Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.

Outlier Treatment (attribute outliers): defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.

Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified then a missing value is automatically replaced by the given value.

Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).

Targets: the Targets element allows for the scaling of predicted variables. It is a straightforward way to represent post-processing of raw outputs.
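
To illustrate the outlier and missing value attributes of the mining schema described above, here is a minimal Python sketch (not PMML). The order of operations (outlier treatment first, then missing value replacement), the parameter names and the function name prepare_field are assumptions; None represents a missing value.

```python
def prepare_field(value, replacement=None, outliers="asIs", low=None, high=None):
    """Apply outlier treatment, then missing value replacement; None means missing."""
    if value is not None and outliers != "asIs":
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if outliers == "asExtremeValues":
            # Clamp to the low/high values defined for the field.
            value = low if too_low else high if too_high else value
        elif outliers == "asMissingValues" and (too_low or too_high):
            value = None
    # missingValueReplacement: substitute the given value for a missing one.
    return replacement if value is None else value

print(prepare_field(150, outliers="asExtremeValues", low=0, high=100))
print(prepare_field(None, replacement=0.0))
```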

Supported Modeling Techniques

The list of modeling techniques supported by the PMML standard is constantly being updated. Version 4.1 supports the following techniques:

Neural Networks (Feedforward neural networks as well as radial-basis)

Decision Trees (with coding for several missing value strategies)

Support Vector Machines

Linear and Logistic Regression Models (via a generic representation or a simplified one)

Association Rules

Clustering

Naive Bayes

Sequences

Text Models

Time Series

Rulesets

Scorecards

K-Nearest Neighbors

Baseline Models

PMML Example

The example below shows a PMML file used to represent a logistic regression model. In this model, the predicted variable is named honcomp. Note that this is a very simple model. There are only three input variables (female, read_score, and science_score), which are all of type double. There is no pre-processing of the raw input variables, so these are fed directly into the regression model, which produces a value for honcomp (0 or 1).

PMML Example - File containing a simple regression model expressed in PMML.
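
As a rough idea of what a scoring engine computes for such a model, the Python sketch below evaluates a logistic regression for the three input fields. The coefficients and intercept are purely hypothetical placeholders (they are not taken from the PMML file), and turning the probability into a 0 or 1 value for honcomp with a 0.5 cutoff is also an assumption.

```python
import math

def logistic_score(inputs, coefficients, intercept):
    """Return the logistic function applied to intercept + sum(coefficient * input)."""
    z = intercept + sum(coefficients[name] * inputs[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters, for illustration only.
COEFFICIENTS = {"female": 1.0, "read_score": 0.05, "science_score": 0.05}
INTERCEPT = -6.0

record = {"female": 1.0, "read_score": 60.0, "science_score": 55.0}
probability = logistic_score(record, COEFFICIENTS, INTERCEPT)
honcomp = 1 if probability >= 0.5 else 0   # assumed cutoff
print(probability, honcomp)
```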

A comprehensive list of PMML examples can be found at the Zementis website - Examples Page.

PMML Products

A range of products are being offered to produce and consume PMML. Please check the following page at the DMG website for an updated list of PMML-powered products:

http://www.dmg.org/products.html

Useful Pages

PMML Examples - A list of PMML 3.2 files including neural network models, support vector machines, decision trees, regression models and clustering.

ADAPA Predictive Analytics Engine - Available as a Service through the Amazon Elastic Compute Cloud, the ADAPA engine can import several PMML models. After uploading, models are available for scoring or verification.

Data Pre-Processing in PMML and ADAPA - A Primer: Contains several examples on how to manipulate data in PMML.

Data Mining Group Home
