PMML defines several kinds of data transformations. These are:
- Normalization: map values to numbers; the input can be continuous (element NormContinuous) or discrete (element NormDiscrete).
- Discretization: map continuous values to discrete values.
- Value Mapping: map discrete values to discrete values.
- Text Indexing (text mining element introduced in PMML 4.2): derive a frequency-based value for a given term (not covered in this primer).
- Functions: derive a value by applying a function to one or more parameters.
Below we will see how these transformations can be used and combined to allow for powerful data pre-processing in PMML and the Zementis products: ADAPA for real-time scoring and UPPI for Big Data scoring.
PMML transformations and much more are covered in the PMML book "PMML in Action", available on Amazon.com. A companion to the book and a learning tool for data pre-processing in PMML is the Transformations Generator, a graphical interface for learning PMML transformations.
Transformations in PMML are also covered in depth in a PMML course offered online by UCSD Extension. For more information about this course, please visit the UCSD Predictive Models with PMML page.
This is a general method to normalize a continuous variable to another continuous variable. In the PMML code below, two normalizations take place and two derived variables are created. The first derived variable is named "DerivedNormalizedVar1" and the second "DerivedNormalizedVar2". The normalization itself is defined under the element NormContinuous. Typically, NormContinuous is used to normalize input values between 0 and 1, as shown in the first normalization example. However, as shown in the second normalization example, one can have as many LinearNorm elements as necessary, and these do not need to normalize values between 0 and 1; the only restriction is that the values of the attribute orig must be in increasing order. The normalized variable is then a linearly interpolated value between the LinearNorm elements.
Note that in the second example, we are using the attribute outliers to specify that any outliers to the normalization should be treated as extreme values. In this case, all "InputVar2" values lower than 100 will be assigned value "0" and all values greater than 900 will be assigned value "4".
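As a rough sketch of the two normalizations described above, such NormContinuous definitions might look like the fragment below. The field names, the outliers treatment, and the end points (100 maps to 0, 900 maps to 4) come from the text; the breakpoints of the first example (0 and 100) and the intermediate breakpoints of the second (200 and 500) are assumptions chosen purely for illustration. The enclosing TransformationDictionary is used here only as a container.

```xml
<TransformationDictionary>
  <!-- First example: rescale InputVar1 to the range [0, 1].
       The orig values 0 and 100 are assumed for illustration. -->
  <DerivedField name="DerivedNormalizedVar1" optype="continuous" dataType="double">
    <NormContinuous field="InputVar1">
      <LinearNorm orig="0" norm="0"/>
      <LinearNorm orig="100" norm="1"/>
    </NormContinuous>
  </DerivedField>

  <!-- Second example: piecewise-linear mapping with several LinearNorm
       elements; orig values must be listed in increasing order.
       The breakpoints at 200 and 500 are assumed for illustration. -->
  <DerivedField name="DerivedNormalizedVar2" optype="continuous" dataType="double">
    <NormContinuous field="InputVar2" outliers="asExtremeValues">
      <LinearNorm orig="100" norm="0"/>
      <LinearNorm orig="200" norm="1"/>
      <LinearNorm orig="500" norm="3"/>
      <LinearNorm orig="900" norm="4"/>
    </NormContinuous>
  </DerivedField>
</TransformationDictionary>
```

With these assumed breakpoints, an "InputVar2" value of 350 lies halfway between 200 and 500 and would therefore be interpolated to 2, halfway between 1 and 3.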
This method is used to transform string values to numeric values. Many models encode string values into numeric values in order to perform mathematical functions. For example, regression and neural network models often split categorical and ordinal variables into multiple dummy variables.
The PMML code below implements the following logic:
IF CategoricalInputVar1 == "Partner"
    DerivedVar1 = 1
ELSE
    DerivedVar1 = 0
IF CategoricalInputVar1 == "Associate"
    DerivedVar2 = 1
ELSE
    DerivedVar2 = 0
IF CategoricalInputVar1 == "Colleague"
    DerivedVar3 = 1
ELSE
    DerivedVar3 = 0
In this example, variable "CategoricalInputVar1" has three possible valid values: "Partner", "Associate", and "Colleague". By using the NormDiscrete element, we can create derived variables (the dummy variables in this case) that will be assigned values "1" or "0" depending on the value of the categorical input variable being processed.
Note that we are also handling missing values in this example. The mapMissingTo attribute means that if the input value "CategoricalInputVar1" is missing then all three derived variables will be assigned "0".
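The dummy-variable logic above could be expressed roughly as follows. This is a sketch rather than the original listing: the names "DerivedVar1" and "DerivedVar2" come from the text, the third derived field is assumed to be named "DerivedVar3", and the TransformationDictionary wrapper is only a container.

```xml
<TransformationDictionary>
  <!-- 1 if the input equals "Partner", 0 otherwise (and 0 if missing) -->
  <DerivedField name="DerivedVar1" optype="continuous" dataType="double">
    <NormDiscrete field="CategoricalInputVar1" value="Partner" mapMissingTo="0"/>
  </DerivedField>

  <!-- 1 if the input equals "Associate", 0 otherwise (and 0 if missing) -->
  <DerivedField name="DerivedVar2" optype="continuous" dataType="double">
    <NormDiscrete field="CategoricalInputVar1" value="Associate" mapMissingTo="0"/>
  </DerivedField>

  <!-- 1 if the input equals "Colleague", 0 otherwise (and 0 if missing);
       the name DerivedVar3 is assumed -->
  <DerivedField name="DerivedVar3" optype="continuous" dataType="double">
    <NormDiscrete field="CategoricalInputVar1" value="Colleague" mapMissingTo="0"/>
  </DerivedField>
</TransformationDictionary>
```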
This method maps continuous values to discrete string values. For a continuous variable we define intervals, and if the variable falls inside an interval, we assign to it the string value defined for that interval. In the PMML code below, the Discretize element defines the continuous field "InputVar" which is to be discretized. The DiscretizeBin elements define the intervals. The first one defines a bin which has a rightMargin of 1000. By default, its left margin (attribute leftMargin) is negative infinity. The closure attribute finally defines the interval as (-infinity,1000], i.e. the bin contains all values less than or equal to 1000. So if the value of the variable "InputVar" falls in this interval, we assign to the derived variable "DerivedVar" the string value "low". Similarly, if the value of "InputVar" is in (1000,100000], the derived value is equal to "medium" and if it is in (100000,1000000), the derived value is equal to "high".
The mapping from a continuous to a discrete variable can be many-to-one but not one-to-many. In other words, two intervals can be mapped to the same string value but the same interval may not have more than one string value. This implies that the intervals defined must be disjoint; there can be no overlap between two defined intervals.
One can establish what should happen if the input is not in any of the defined intervals. The attribute defaultValue in the Discretize element establishes that if the input does not lie in any of the intervals (as for example, if "InputVar"=2,000,000), then by default it is assigned the value "extreme".
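A minimal sketch of a Discretize definition matching the intervals and bin values described above is given below. All margins, bin values, and the default value are taken from the text; the TransformationDictionary wrapper is only a container.

```xml
<TransformationDictionary>
  <DerivedField name="DerivedVar" optype="categorical" dataType="string">
    <!-- Inputs outside every interval fall back to "extreme" -->
    <Discretize field="InputVar" defaultValue="extreme">
      <!-- (-infinity, 1000]: leftMargin omitted, so unbounded below -->
      <DiscretizeBin binValue="low">
        <Interval closure="openClosed" rightMargin="1000"/>
      </DiscretizeBin>
      <!-- (1000, 100000] -->
      <DiscretizeBin binValue="medium">
        <Interval closure="openClosed" leftMargin="1000" rightMargin="100000"/>
      </DiscretizeBin>
      <!-- (100000, 1000000) -->
      <DiscretizeBin binValue="high">
        <Interval closure="openOpen" leftMargin="100000" rightMargin="1000000"/>
      </DiscretizeBin>
    </Discretize>
  </DerivedField>
</TransformationDictionary>
```

Because the last interval is open on the right, an input of exactly 1,000,000 also falls outside all bins and would receive the default value.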
Discretization can also be paired up with element NormDiscrete to generate dummy variables which indicate if an input variable belongs to a certain interval. The PMML example shown below implements just that. We assume a numeric input variable named "InputVar" which leads to the creation of three dummy variables: "DerivedVar1", "DerivedVar2", and "DerivedVar3". These are used to indicate membership of "InputVar" to one of the following intervals: (-infinity,100], (100,200], and (200,+infinity).
Note that in this example, we are using the Discretize element to map continuous values to an integer which is then used by the NormDiscrete element to indicate interval membership.
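The two-step approach just described might be sketched as follows. The intermediate field name "BinnedInputVar" and the bin codes 1, 2, and 3 are assumptions made for illustration; the intervals and the three dummy variable names come from the text.

```xml
<TransformationDictionary>
  <!-- Step 1: map InputVar to an integer bin code (hypothetical field name) -->
  <DerivedField name="BinnedInputVar" optype="categorical" dataType="integer">
    <Discretize field="InputVar">
      <DiscretizeBin binValue="1">
        <Interval closure="openClosed" rightMargin="100"/>
      </DiscretizeBin>
      <DiscretizeBin binValue="2">
        <Interval closure="openClosed" leftMargin="100" rightMargin="200"/>
      </DiscretizeBin>
      <DiscretizeBin binValue="3">
        <Interval closure="openOpen" leftMargin="200"/>
      </DiscretizeBin>
    </Discretize>
  </DerivedField>

  <!-- Step 2: one dummy variable per bin code -->
  <DerivedField name="DerivedVar1" optype="continuous" dataType="double">
    <NormDiscrete field="BinnedInputVar" value="1"/>
  </DerivedField>
  <DerivedField name="DerivedVar2" optype="continuous" dataType="double">
    <NormDiscrete field="BinnedInputVar" value="2"/>
  </DerivedField>
  <DerivedField name="DerivedVar3" optype="continuous" dataType="double">
    <NormDiscrete field="BinnedInputVar" value="3"/>
  </DerivedField>
</TransformationDictionary>
```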
This method is used to map discrete values to discrete values. This is done by using a table which lists the input values and the mapped output values. Each row in the table refers to a possible value for the input variable. Each column in the table has a name which is used to refer to that column. This is defined inside the element MapValues.
In the PMML code below, the attribute outputColumn in element MapValues establishes that the mapped output value will be found in the column named "color". The element FieldColumnPair uses the attribute field to define the input for mapping, which consists of one or more input variables. In this example, we use two variables, "InputVar1" and "InputVar2". We find the input value in the "animal" column for "InputVar1" and the input value in the "property" column for "InputVar2" and map them to the output value in the "color" column.
The InlineTable element finally defines the table to be used for value mapping. So for example, if variable "InputVar1" has the value "dog" and variable "InputVar2" has the value "smart", we find "dog" in column "animal" and "smart" in column "property", both in the first row. These are then mapped to the value in the column named "color" in the same row. In this way, derived field "DerivedVar" is assigned the value "red".
As in the Discretize element, the MapValues element can have a defaultValue attribute which specifies what to map the input to if it does not have a matching value in any row.
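A sketch of such a MapValues definition is shown below. The column names and the first table row (dog, smart, red) come from the text; the second row and the default value "unknown" are made up for illustration only.

```xml
<TransformationDictionary>
  <DerivedField name="DerivedVar" optype="categorical" dataType="string">
    <!-- defaultValue is a hypothetical choice, used when no row matches -->
    <MapValues outputColumn="color" defaultValue="unknown">
      <!-- Each pair ties an input field to the table column it is looked up in -->
      <FieldColumnPair field="InputVar1" column="animal"/>
      <FieldColumnPair field="InputVar2" column="property"/>
      <InlineTable>
        <!-- Row described in the text -->
        <row><animal>dog</animal><property>smart</property><color>red</color></row>
        <!-- Additional row invented for illustration -->
        <row><animal>cat</animal><property>quiet</property><color>blue</color></row>
      </InlineTable>
    </MapValues>
  </DerivedField>
</TransformationDictionary>
```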
In the PMML code below, we use the MapValues element to group small sets of categorical values. More specifically, we want to find out if input variable "VarInputColor" belongs to a specific group of colors. We do that by using the InlineTable element of MapValues to map different colors to the same number. We then use the element NormDiscrete to create dummy variables which are used to indicate group membership. This problem can be represented schematically in the following way:
IF VarInputColor is in ("Yellow","Red")
    VarColorGroup_1 = 1
ELSE
    VarColorGroup_1 = 0
IF VarInputColor is in ("Blue","Green")
    VarColorGroup_2 = 1
ELSE
    VarColorGroup_2 = 0
Note that in the example shown above we are mapping missing values to value "2" and setting the default value to "1" inside the MapValues transformation element. Derived field "GroupedInputColor" is assigned the result of the mapping between the input strings and the number representing the group. This variable is then used to generate the desired output. If "GroupedInputColor" contains value "1", variable "VarColorGroup_1" is assigned value "1", otherwise it is assigned value "0". If "GroupedInputColor" contains value "2", variable "VarColorGroup_2" is assigned value "1", otherwise it is assigned value "0". This is accomplished by the use of the NormDiscrete element as described earlier.
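The grouping described above could be written along the following lines. Field names, group numbers, and the missing and default values come from the text; the column names "color" and "group" are assumptions made for this sketch.

```xml
<TransformationDictionary>
  <!-- Step 1: map each color to its group number.
       Column names "color" and "group" are hypothetical. -->
  <DerivedField name="GroupedInputColor" optype="categorical" dataType="integer">
    <MapValues outputColumn="group" mapMissingTo="2" defaultValue="1" dataType="integer">
      <FieldColumnPair field="VarInputColor" column="color"/>
      <InlineTable>
        <row><color>Yellow</color><group>1</group></row>
        <row><color>Red</color><group>1</group></row>
        <row><color>Blue</color><group>2</group></row>
        <row><color>Green</color><group>2</group></row>
      </InlineTable>
    </MapValues>
  </DerivedField>

  <!-- Step 2: dummy variables indicating group membership -->
  <DerivedField name="VarColorGroup_1" optype="continuous" dataType="double">
    <NormDiscrete field="GroupedInputColor" value="1"/>
  </DerivedField>
  <DerivedField name="VarColorGroup_2" optype="continuous" dataType="double">
    <NormDiscrete field="GroupedInputColor" value="2"/>
  </DerivedField>
</TransformationDictionary>
```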
Large sets of values can be handled in ADAPA by an external mapping table. For example, imagine that you would like to map all existing colors (that you can imagine) into four different groups and then create dummy variables to indicate such grouping. You can design this table externally (ADAPA allows external tables to be referred from a PMML file) and use the PMML element TableLocator to reference it from inside the MapValues transformation element. The PMML code below implements this functionality.
Note that we are mapping missing values to value "4" and setting the default value to "3" inside the MapValues transformation element. Derived field "GroupedInputColor" is assigned the result of the mapping between the input colors and the number representing the group they belong to. As before, variable "GroupedInputColor" is then used to generate the desired output through the use of the element NormDiscrete.
The mapping table itself is only being referenced inside the element TableLocator. We are using a PMML extension in this case to indicate to ADAPA that the mapping table we want to access is named "GroupedInputColor".
If a certain transformation is to be applied to input data many times and to multiple fields, it makes sense to encapsulate the transformation inside a function and use it as many times as necessary. This reduces the complexity of the PMML model and greatly simplifies its application. PMML provides a number of built-in functions and also allows users to define their own functions.
ADAPA supports all PMML built-in functions. The complete list is shown below.
- +, -, * and /
- min, max, sum, avg, median, product
- log10, ln, sqrt, abs, exp, pow, threshold, floor, ceil, round
- isMissing, isNotMissing
- equal, notEqual, lessThan, lessOrEqual, greaterThan, greaterOrEqual
- matches, replace (regular expressions)
Note that functions such as "min", "max", "sum" and "avg" take a variable number of parameters (derived fields or input fields) and return a single value which can then be assigned to a new derived field. Please refer to the DMG website - Built-in Functions for code examples and descriptions.
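To illustrate how a built-in function with a variable number of parameters is invoked, the sketch below averages two input fields with the Apply element. The derived field name "AverageVar" is hypothetical, and the choice of "avg" over two fields is just an example.

```xml
<TransformationDictionary>
  <!-- AverageVar (hypothetical name) = avg(InputVar1, InputVar2) -->
  <DerivedField name="AverageVar" optype="continuous" dataType="double">
    <Apply function="avg">
      <FieldRef field="InputVar1"/>
      <FieldRef field="InputVar2"/>
    </Apply>
  </DerivedField>
</TransformationDictionary>
```

Further FieldRef or Constant children could be added to the same Apply element to include more values in the average.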
The PMML code shown below implements the following arithmetic operation:
Note that it uses two different input variables: "InputVar1" and "InputVar2" as well as a number of numeric constants. The example shows many built-in functions which are used together to implement such a complex operation. The end result is assigned to derived variable "ResultVar".
PMML also defines many functions that support boolean operations. These are used to compare parameters which are required to be of identical type (e.g., strings or dates) or of compatible type for numeric variables (e.g., double vs. integer). The result is of boolean type: "true" or "false", which is evaluated by functions such as "if" and "not".
PMML functions which operate on two input attributes, e.g. "a" and "b", of identical or compatible types are:
- equal: evaluates to "true" if "a" is equal to "b", "false" otherwise.
- notEqual: evaluates to "true" if "a" is not equal to "b", "false" otherwise.
- lessThan: evaluates to "true" if "a" is less than "b", "false" otherwise.
- lessOrEqual: evaluates to "true" if "a" is less than or equal to "b", "false" otherwise.
- greaterThan: evaluates to "true" if "a" is greater than "b", "false" otherwise.
- greaterOrEqual: evaluates to "true" if "a" is greater than or equal to "b", "false" otherwise.
- isIn: evaluates to "true" if "a" is in "b", "false" otherwise. Attribute "b" in this case is an array of values.
- isNotIn: evaluates to "true" if "a" is not in "b", "false" otherwise. Attribute "b" in this case is an array of values.
PMML functions which operate on a single input attribute, e.g. "a", are:
- isMissing: evaluates to "true" if "a" is missing, i.e. equal to NULL, "false" otherwise.
- isNotMissing: evaluates to "true" if "a" is not missing, "false" otherwise.
PMML functions which evaluate boolean operations are:
- not: operates on a single boolean attribute. Negates existing boolean evaluation result.
- and: summarizes the results of two or more independent boolean operations. Evaluates to "true" only if all operations are "true", "false" otherwise.
- or: summarizes the results of two or more independent boolean operations. Evaluates to "true" if a single operation is "true", "false" only if all operations are "false".
- if: implements IF-THEN-ELSE logic. The ELSE part is optional.
Below, we give a few examples of how these functions can be used to implement logical operations. We start with the PMML code below which implements the following logical and arithmetic operations:
IF InputVar1 == "Partner"
    DerivedVar1 = "P"
ELSE
    DerivedVar2 = 2 * InputVar2
In this example, we are using "InputVar1" which contains a string to assign values to two very different derived variables: "DerivedVar1" which is a string and "DerivedVar2" which is an integer.
Note that the code uses the functions "if", "equal", and "not" as well as the built-in function "*". The "else" part of the "if" function is not used because we want to assign the "then" result to "DerivedVar1" and the "else" result to a different variable, "DerivedVar2". Each data transformation in PMML is encapsulated under a single DerivedField element.
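One way this logic might be written is sketched below. Since each DerivedField produces one variable, the "else" branch is expressed as a second "if" whose condition is negated with "not". The data types follow the description above (a string and an integer); the TransformationDictionary wrapper is only a container.

```xml
<TransformationDictionary>
  <!-- DerivedVar1 = "P" when InputVar1 equals "Partner"; no else part -->
  <DerivedField name="DerivedVar1" optype="categorical" dataType="string">
    <Apply function="if">
      <Apply function="equal">
        <FieldRef field="InputVar1"/>
        <Constant dataType="string">Partner</Constant>
      </Apply>
      <Constant dataType="string">P</Constant>
    </Apply>
  </DerivedField>

  <!-- DerivedVar2 = 2 * InputVar2 when InputVar1 does NOT equal "Partner" -->
  <DerivedField name="DerivedVar2" optype="continuous" dataType="integer">
    <Apply function="if">
      <Apply function="not">
        <Apply function="equal">
          <FieldRef field="InputVar1"/>
          <Constant dataType="string">Partner</Constant>
        </Apply>
      </Apply>
      <Apply function="*">
        <Constant dataType="integer">2</Constant>
        <FieldRef field="InputVar2"/>
      </Apply>
    </Apply>
  </DerivedField>
</TransformationDictionary>
```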
The PMML code below assumes that both "then" and "else" parts of the "if" use the same derived variable "DerivedVar1" to implement the following operations:
IF InputVar1 == "Partner"
    DerivedVar1 = 5.1 * InputVar2
ELSE
    DerivedVar1 = InputVar2 / 3.3
Note that in this case, "DerivedVar2" is not being used. The "then" and "else" part are being used to assign the same variable "DerivedVar1" the result of two different computations.
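A sketch of this single-variable version follows, using both the "then" and the "else" argument of "if". The constants 5.1 and 3.3 come from the text; the double data type for the result is an assumption.

```xml
<TransformationDictionary>
  <DerivedField name="DerivedVar1" optype="continuous" dataType="double">
    <Apply function="if">
      <!-- condition -->
      <Apply function="equal">
        <FieldRef field="InputVar1"/>
        <Constant dataType="string">Partner</Constant>
      </Apply>
      <!-- then: 5.1 * InputVar2 -->
      <Apply function="*">
        <Constant dataType="double">5.1</Constant>
        <FieldRef field="InputVar2"/>
      </Apply>
      <!-- else: InputVar2 / 3.3 -->
      <Apply function="/">
        <FieldRef field="InputVar2"/>
        <Constant dataType="double">3.3</Constant>
      </Apply>
    </Apply>
  </DerivedField>
</TransformationDictionary>
```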
We showed earlier how to use the Discretize element in conjunction with NormDiscrete to create dummy variables to indicate membership to numeric intervals. What if we would like to do just that, but this time for strings? The PMML code below exemplifies how this could be accomplished by using functions "if" and "lessOrEqual" in conjunction with the NormDiscrete element.
The PMML code below implements the following operations:
IF InputVar less or equal to "Denmark"
    DerivedVar1 = 1 (otherwise = 0)
ELSE IF InputVar less or equal to "France"
    DerivedVar2 = 1 (otherwise = 0)
ELSE
    DerivedVar3 = 1 (otherwise = 0)
Note that the three dummy variables: "DerivedVar1", "DerivedVar2", and "DerivedVar3" are used to indicate the membership of input variable "InputVar" to three different string intervals.
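One possible way to express these string intervals is sketched below: nested "if" and "lessOrEqual" calls first assign an interval code, and NormDiscrete then turns that code into dummy variables. The intermediate field name "StringInterval" and the codes "1", "2", and "3" are assumptions made for this sketch.

```xml
<TransformationDictionary>
  <!-- Step 1: assign an interval code by string comparison
       (hypothetical field name and codes) -->
  <DerivedField name="StringInterval" optype="categorical" dataType="string">
    <Apply function="if">
      <Apply function="lessOrEqual">
        <FieldRef field="InputVar"/>
        <Constant dataType="string">Denmark</Constant>
      </Apply>
      <Constant dataType="string">1</Constant>
      <Apply function="if">
        <Apply function="lessOrEqual">
          <FieldRef field="InputVar"/>
          <Constant dataType="string">France</Constant>
        </Apply>
        <Constant dataType="string">2</Constant>
        <Constant dataType="string">3</Constant>
      </Apply>
    </Apply>
  </DerivedField>

  <!-- Step 2: one dummy variable per interval code -->
  <DerivedField name="DerivedVar1" optype="continuous" dataType="double">
    <NormDiscrete field="StringInterval" value="1"/>
  </DerivedField>
  <DerivedField name="DerivedVar2" optype="continuous" dataType="double">
    <NormDiscrete field="StringInterval" value="2"/>
  </DerivedField>
  <DerivedField name="DerivedVar3" optype="continuous" dataType="double">
    <NormDiscrete field="StringInterval" value="3"/>
  </DerivedField>
</TransformationDictionary>
```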
Finally, we end our list of PMML code examples by showing the use of functions "isMissing" and "isIn" combined with function "if". The example implements the following operations:
IF InputVar is missing
    DerivedVar = 1
ELSE IF InputVar is in ("Partner", "Associate", "Colleague")
    DerivedVar = 2
ELSE
    DerivedVar = 3
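These operations might be written as the nested "if" shown in the sketch below. In it, "isIn" receives the field as its first argument and the candidate values as the remaining arguments; the integer data type of the result is an assumption.

```xml
<TransformationDictionary>
  <DerivedField name="DerivedVar" optype="categorical" dataType="integer">
    <Apply function="if">
      <!-- missing input: 1 -->
      <Apply function="isMissing">
        <FieldRef field="InputVar"/>
      </Apply>
      <Constant dataType="integer">1</Constant>
      <Apply function="if">
        <!-- input is one of the listed values: 2 -->
        <Apply function="isIn">
          <FieldRef field="InputVar"/>
          <Constant dataType="string">Partner</Constant>
          <Constant dataType="string">Associate</Constant>
          <Constant dataType="string">Colleague</Constant>
        </Apply>
        <Constant dataType="integer">2</Constant>
        <!-- any other value: 3 -->
        <Constant dataType="integer">3</Constant>
      </Apply>
    </Apply>
  </DerivedField>
</TransformationDictionary>
```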
When defining a PMML document, the pre-processing of the input variables is mainly located inside the following PMML elements: TransformationDictionary and LocalTransformations, although the TransformationDictionary element is mostly used for user-defined functions (element DefineFunction).
For the formal PMML schema definition of the transformations covered here, please refer to the PMML Transformations page on the DMG website.