Predictive Analytics, Big Data, Hadoop, PMML: 2009

Friday, November 6, 2009

PMML and Open Source Data Mining

Open source tools provide a cost-effective, yet powerful option for data mining. The following contenders adhere to the PMML standard which facilitates model exchange among open source and commercial vendors, providing a definitive route for production deployment of predictive models.

The R Project

The R Project for Statistical Computing is arguably the most widely used and revered statistical package among advocates of open-source and community computing projects. Much like the iPhone App Store, CRAN, the Comprehensive R Archive Network, offers almost anything you need (anything statistical, that is; there is no navigation system for R). It is in CRAN that you will find the R PMML Package. This package allows R users to export PMML for a variety of models, including decision trees and neural networks (among many others). We recently co-authored an article with Graham Williams, the original author and maintainer of the package. It can be downloaded directly from The R Journal website. If you are interested in contributing code for the package, please contact us.

KNIME

Developed by the University of Konstanz, KNIME is an open-source platform that enables users to visually create and execute data flows. Since KNIME 2.0 (available as of December 2008), users can import and export PMML models into and out of KNIME. Given that users can use R within KNIME, the R PMML package can also be used to export and convert R models to PMML within KNIME. New versions of KNIME will most certainly expand its support for PMML even further.

Weka

Developed by the University of Waikato, Weka provides a large collection of machine learning algorithms for solving data mining problems. Although Weka currently has no export functionality for PMML, Mark Hall is working on implementing import functionality for PMML. Weka can already import models such as regression, decision trees and neural networks. PMML support in Weka is constantly expanding with the addition of transformations and built-in functions.

RapidMiner

Most recently, Rapid-I announced that it will extend the latest version of its RapidMiner software to include support for PMML. RapidMiner, formerly known as YALE, is an open-source platform that offers operators for all aspects of data mining. As with KNIME, Rapid-I is one of the latest companies to join the ranks of the Data Mining Group (DMG) beside companies like IBM, Microstrategy, SPSS, SAS and Zementis. The DMG is already busy at work refining and adding yet more capabilities and power to PMML.

PMML Discussion Forums

For an on-going discussion and to read about the latest PMML news, we would like to invite you to join the PMML group in LinkedIn or the discussion forum in the PMML group on Analytic Bridge, a social network community for analytics professionals. For PMML resources, examples, and useful links, please take a look at the PMML page on the Zementis website.

Thursday, October 15, 2009

Latest issue of ACM SIGKDD Explorations focuses on open source analytics, PMML and cloud computing.

The latest issue of ACM SIGKDD Explorations is out! This issue is relevant in many ways, since it not only gives special attention to open source analytics (including articles on Weka and KNIME), but it also discusses PMML and cloud computing.

PMML, in particular, gets special treatment. It is described in a full article written by Rick Pechter from Microstrategy. As Rick puts it, "the Predictive Model Markup Language data mining standard has arguably become one of the most widely adopted data mining standards in use today."

PMML is also discussed in most of the other articles, including the one by Zementis, entitled: "Efficient Deployment of Predictive Analytics through Open Standards and Cloud Computing". In this article, we use the ADAPA scoring engine to illustrate how the benefits of PMML and cloud computing can be combined to offer a platform that leverages these elements to deliver an efficient deployment process for statistical models.

So, don't miss out on this issue of SIGKDD Explorations. We invite you to explore all the peer-reviewed articles in detail.

Wednesday, October 14, 2009

Open standards for data mining and the need for training material

PMML awareness is growing. Many companies have recently joined the DMG (Data Mining Group) and others are already in the process of adopting PMML as their main vehicle to represent models and data manipulation. The availability of PMML resources is key to its success.

Zementis and the other DMG members are committed to publishing training material on PMML. We have posted many PMML tutorials on our support pages which are already being used by the community at large.

We realized though that there aren't that many presentations about PMML out there. So, we decided to make one available. To download it, click HERE. Hope you find it useful! And, feel free to pass it around.

The same presentation is also available in the Analytic Bridge website (PMML discussion group).

PMML Interest Group in LinkedIn - Join to read the latest PMML news.

The Predictive Model Markup Language (PMML) is the leading standard for representing statistical and data mining models. With PMML, it is straightforward to develop a model on one system using one application and deploy the model on another system using another application. PMML reduces complexity and bridges the gap between development and production deployment of predictive analytics.

PMML is governed by the Data Mining Group (DMG), an independent, vendor led consortium that develops data mining standards. PMML is currently supported by over 20 vendors and organizations and awareness as well as use of the standard is growing quickly. To establish a conduit in which people can come together to learn and discuss topics related to PMML, we have recently created a PMML interest group in LinkedIn. The group aims to serve as a central resource regarding the practical application of PMML, its benefits for business and IT. PMML increases business agility by eliminating the need for proprietary solutions or custom code development. For this reason, it is a critical element in the quest for business process optimization and automated, intelligent decisions.

We encourage active participation in the PMML group from the entire community. Please post your questions! The group already contains postings related to:

The value of PMML for business and IT

PMML powered products

Links to a general introduction and overview presentation

If your organization is already supporting the PMML standard, please feel welcome to share information about your products which do so.

To join the Predictive Model Markup Language (PMML) group on LinkedIn, please follow this link: http://www.linkedin.com/groupRegistration?gid=2328634

Wednesday, June 17, 2009

PMML 4.0 is here!

The DMG (Data Mining Group) has just released PMML 4.0, the latest and greatest version of the Predictive Model Markup Language.

Zementis, together with SPSS, SAS, IBM, Open Data Group, Salford Systems, Microstrategy and all the other contributing members of the DMG is proud to be part of the making of PMML, the de facto standard to represent data mining models.

Not only can PMML represent a wide range of statistical techniques, but it can also be used to represent the data transformations necessary to transform raw data into meaningful feature detectors. In this way, PMML offers a standard to represent data manipulation and modeling in a single concise way.

Improved Pre-Processing Capabilities

PMML 4.0 extends the range of pre-processing capabilities supported by older versions by adding a range of boolean operations (e.g., and, or, not, equal, notEqual, greaterOrEqual, ...) to the list of built-in functions. These, combined with an IF-THEN-ELSE function which is also new to PMML, allow for the representation of a wide range of feature detectors.

For examples on how to use these new pre-processing capabilities as well as all the standard PMML transformations, please check the PMML Data Pre-Processing Primer.

Time Series Models

PMML 4.0 also extends the existing standard by allowing for the representation of Time Series Models. In particular, it allows for data miners and data mining tools to represent Exponential Smoothing models and offers place holders for ARIMA, Seasonal Trend Decomposition, and Spectral Analysis which are to be supported in the near future.

Model Explanation

Other additions are Model Explanation and Multiple Models. Model Explanation allows for evaluation and model performance measures to be part of the PMML file itself. In this way, a PMML file can define not only data manipulation and models, but also the associated ROC graph, gains/lift charts, confusion matrix, field correlations, univariate statistics, and more.

Multiple Models

Multiple Models allows for model composition, ensembles, and segmentation. It replaces the old Model Composition element to offer great flexibility for combining different model types, such as regression and decision trees.

Extending Existing Elements

Last, but not least, PMML 4.0 offers a range of extensions to existing elements, such as the addition of multi-class classification for Support Vector Machines, improved representation for Association Rules, and the addition of Cox Regression Models.

There is no doubt that PMML is here to stay. The announcement of PMML 4.0 attests to the commitment of the leading data mining vendors to be able to represent their solutions through a single language, a language that can be understood by all. It is our vision that users will be free to share models among many solutions, benefiting from an environment in which interoperability is truly attainable.

For more information on PMML and a list of useful links, please check PMML 101. Also, check the article "PMML: An Open Standard for Sharing Models" just published in The R Journal.

We also invite the entire community to join our on-going PMML discussion at the AnalyticBridge website.

Wednesday, June 3, 2009

Check our PMML article - The R Journal

Zementis, in collaboration with Togaware, produced an article describing PMML and the PMML Package available for exporting predictive models built in R.

The article is featured in the first edition of The R Journal. Check it out:

PMML: An Open Standard for Sharing Models

Friday, April 3, 2009

PMML - Data Pre-Processing Primer

PMML defines several kinds of data transformations. These are:

Normalization: map values to numbers, the input can be continuous (element NormContinuous) or discrete (element NormDiscrete).
Discretization: map continuous values to discrete values.
Value Mapping: map discrete values to discrete values.
Text Indexing (text mining element introduced in PMML 4.2): derive a frequency-based value for a given term (not covered in this primer).
Functions: derive a value by applying a function to one or more parameters.

Below we will see how these transformations can be used and combined to allow for powerful data pre-processing in PMML and the Zementis products: ADAPA for real-time scoring and UPPI for Big Data scoring.

PMML transformations and much more are covered in the PMML book "PMML in Action" available on Amazon.com. Companion to the book and a learning tool for data pre-processing in PMML is the Transformations Generator, which is a graphical interface for learning PMML transformations.

Transformations in PMML are also covered in depth in a PMML course being offered online by UCSD Extension. For more information about this course, please visit the UCSD Predictive Models with PMML page.

Normalization: NormContinuous

This is a general method to normalize a continuous variable to another continuous variable. In the PMML code below, two normalizations take place and two derived variables are created. The first derived variable is named "DerivedNormalizedVar1" and the second "DerivedNormalizedVar2". The normalization itself is defined under the element NormContinuous. Typically, NormContinuous is used to normalize input values between 0 and 1 as shown in the first normalization example. However, as shown in the second normalization example, one can have as many LinearNorm elements as necessary and these do not need to normalize values to between 0 and 1; the only restriction is that the values of the first attribute, orig, must be in increasing order. The normalized variable is then a linearly interpolated value between the LinearNorm elements.

Note that in the second example, we are using the attribute outliers to specify that any outliers to the normalization should be treated as extreme values. In this case, all "InputVar2" values lower than 100 will be assigned value "0" and all values greater than 900 will be assigned value "4".
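As a rough illustration of the interpolation rule described above, the following Python sketch mimics what a NormContinuous element computes. It is not PMML and it is not the original listing: the function name norm_continuous, its parameters, and the two interior knots for "InputVar2" are assumptions made for illustration. Only the end points (100 maps to 0 and 900 maps to 4) come from the text.

```python
from bisect import bisect_right

def norm_continuous(x, knots, outliers="asIs"):
    """Piecewise-linear normalization in the spirit of PMML NormContinuous.

    knots: list of (orig, norm) pairs with strictly increasing orig values.
    outliers: "asIs" (extrapolate), "asExtremeValues" (clamp),
              or "asMissingValues" (return None).
    """
    origs = [o for o, _ in knots]
    if any(a >= b for a, b in zip(origs, origs[1:])):
        raise ValueError("orig values must be in increasing order")
    if x is None:
        return None
    lo, hi = knots[0], knots[-1]
    if x < lo[0] or x > hi[0]:
        if outliers == "asMissingValues":
            return None
        if outliers == "asExtremeValues":
            return lo[1] if x < lo[0] else hi[1]
    # Pick the segment containing x (or the nearest one when extrapolating).
    i = min(max(bisect_right(origs, x) - 1, 0), len(knots) - 2)
    (x0, y0), (x1, y1) = knots[i], knots[i + 1]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Hypothetical knots; only 100 -> 0 and 900 -> 4 are stated in the text.
VAR2_KNOTS = [(100, 0.0), (300, 1.0), (500, 3.0), (900, 4.0)]
```

With these assumed knots, calling norm_continuous(50, VAR2_KNOTS, "asExtremeValues") clamps to the lowest normalized value, while an in-range input such as 200 is interpolated between its two neighboring knots.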

Normalization: NormDiscrete

This method is used to transform string values to numeric values. Many models encode string values into numeric values in order to perform mathematical functions. For example, regression and neural network models often split categorical and ordinal variables into multiple dummy variables.

The PMML code below implements the following logic:

IF CategoricalInputVar1 == Partner
THEN
DerivedVar1 = 1
ELSE
DerivedVar1 = 0

IF CategoricalInputVar1 == Associate
THEN
DerivedVar2 = 1
ELSE
DerivedVar2 = 0

IF CategoricalInputVar1 == Colleague
THEN
DerivedVar3 = 1
ELSE
DerivedVar3 = 0

In this example, variable "CategoricalInputVar1" has three possible valid values: "Partner", "Associate", and "Colleague". By using the NormDiscrete element, we can create derived variables (the dummy variables in this case) that will be assigned values "1" or "0" depending on the value of the categorical input variable being processed.

Note that we are also handling missing values in this example. The mapMissingTo attribute means that if the input value "CategoricalInputVar1" is missing then all three derived variables will be assigned "0".
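The dummy-variable logic above, including the missing-value rule, can be sketched in Python as follows. This is only an emulation of the behavior described for NormDiscrete; the helper names norm_discrete and dummy_variables are hypothetical, and a missing input is modeled as None.

```python
def norm_discrete(value, category, map_missing_to=None):
    """Return 1.0 if value equals category, 0.0 otherwise.

    A missing input (None) yields map_missing_to instead.
    """
    if value is None:
        return map_missing_to
    return 1.0 if value == category else 0.0

def dummy_variables(categorical_input_var1):
    """Build the three dummy variables for CategoricalInputVar1."""
    categories = ["Partner", "Associate", "Colleague"]
    return {
        f"DerivedVar{i}": norm_discrete(categorical_input_var1, c, map_missing_to=0.0)
        for i, c in enumerate(categories, start=1)
    }
```

At most one of the three derived variables is 1 for any valid input, and all three are 0 when the input is missing.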

Discretization

This method maps continuous values to discrete string values. For a continuous variable we define intervals and if the variable falls inside that interval, we assign to it the string value defined for that interval. In the PMML code below, the Discretize element defines the continuous field "InputVar" which is to be discretized. The DiscretizeBin elements define the intervals. The first one defines a bin which has a rightMargin of 1000. By default, its left margin (attribute leftMargin) is negative infinity. The closure attribute finally defines the interval as (-infinity,1000], i.e. it contains all values less than or equal to 1000. So if the value of the variable "InputVar" falls in this interval, we assign to the derived variable "DerivedVar" the string value "low". Similarly, if the value of "InputVar" is in (1000,100000], the derived value is equal to "medium" and if it is in (100000,1000000), the derived value is equal to "high".

The mapping from a continuous to a discrete variable can be many-to-one but not one-to-many. In other words, two intervals can be mapped to the same string value but the same interval may not have more than one string value. This implies that the intervals defined must be disjoint; there can be no overlap between two defined intervals.

One can establish what should happen if the input is not in any of the defined intervals. The attribute defaultValue in the Discretize element establishes that if the input does not lie in any of the intervals (as for example, if "InputVar"=2,000,000), then by default it is assigned the value "extreme".
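To make the interval and default-value rules concrete, here is a minimal Python sketch of the discretization just described. The bins and labels come from the text; the function names and the tuple layout are assumptions, and the closure strings simply reuse the PMML closure names (openClosed, openOpen, closedOpen, closedClosed) as plain labels.

```python
import math

# (left margin, right margin, closure, label)
BINS = [
    (-math.inf, 1000, "openClosed", "low"),
    (1000, 100000, "openClosed", "medium"),
    (100000, 1000000, "openOpen", "high"),
]

def in_interval(x, left, right, closure):
    """Check interval membership, honoring open or closed margins."""
    left_ok = x >= left if closure.startswith("closed") else x > left
    right_ok = x <= right if closure.endswith("Closed") else x < right
    return left_ok and right_ok

def discretize(x, bins=BINS, default_value="extreme", map_missing_to=None):
    """Return the label of the first bin containing x, else default_value."""
    if x is None:
        return map_missing_to
    for left, right, closure, label in bins:
        if in_interval(x, left, right, closure):
            return label
    return default_value
```

Because the last interval is open on the right, an input of exactly 1000000 falls outside every bin and receives the default value, just like 2,000,000 does.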

Discretization can also be paired up with element NormDiscrete to generate dummy variables which indicate if an input variable belongs to a certain interval. The PMML example shown below implements just that. We assume a numeric input variable named "InputVar" which leads to the creation of three dummy variables: "DerivedVar1", "DerivedVar2", and "DerivedVar3". These are used to indicate membership of "InputVar" to one of the following intervals: (-infinity,100], (100,200], and (200,+infinity).

Note that in this example, we are using the Discretize element to map continuous values to an integer which is then used by the NormDiscrete element to indicate interval membership.
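The two-step idea (discretize to an integer, then flag membership) can be sketched in Python as shown here. The integer codes 1, 2 and 3 and both function names are assumptions for illustration; the intervals are the ones listed above.

```python
def interval_index(x):
    """Map a number to 1, 2 or 3 for (-inf,100], (100,200], (200,+inf)."""
    if x <= 100:
        return 1
    if x <= 200:
        return 2
    return 3

def interval_dummies(input_var):
    """Turn the interval index into three 0/1 membership flags."""
    idx = interval_index(input_var)
    return {f"DerivedVar{k}": 1 if idx == k else 0 for k in (1, 2, 3)}
```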

Value Mapping

This method is used to map discrete values to discrete values. This is done by using a table which lists the input values and the mapped output values. Each row in the table refers to a possible value for the input variable. Each column in the table has a name which is used to refer to that column. This is defined inside the element MapValues.

In the PMML code below, the attribute outputColumn in element MapValues establishes that the mapped output value will be found in the column named "color". The element FieldColumnPair uses the attribute field to define the input for mapping which consists of one or more input variables. In this example, we use two variables "InputVar1" and "InputVar2". We find the input value in the "animal" column for "InputVar1" and the input value in the "property" column for "InputVar2" and map them to the output value in the "color" column.

The InlineTable element finally defines the table to be used for value mapping. So for example, if variable "InputVar1" has the value "dog" and variable "InputVar2" has the value "smart", we find "dog" in column "animal" and "smart" in column "property", both in the first row. These are then mapped to the value in the column named "color" in the same row. In this way, derived field "DerivedVar" is assigned the value "red".

As in the Discretize element, the MapValues element can have a defaultValue attribute which specifies what to map the input to if it does not have a matching value in any row.
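The table lookup performed by MapValues can be sketched in Python roughly as follows. Only the first table row (dog, smart, red) is taken from the text; the second row, the function name map_values and its parameters are made up for illustration.

```python
# Only the first row comes from the text; the second row is hypothetical.
TABLE = [
    {"animal": "dog", "property": "smart", "color": "red"},
    {"animal": "cat", "property": "lazy", "color": "blue"},
]

# Plays the role of the FieldColumnPair elements: input field -> table column.
FIELD_COLUMNS = {"InputVar1": "animal", "InputVar2": "property"}

def map_values(inputs, field_columns, table, output_column,
               default_value=None, map_missing_to=None):
    """Return the output column of the first row matching all inputs."""
    if any(inputs.get(field) is None for field in field_columns):
        return map_missing_to
    for row in table:
        if all(row[col] == inputs[field] for field, col in field_columns.items()):
            return row[output_column]
    return default_value
```

For the inputs "dog" and "smart" the lookup lands on the first row and returns its "color" entry; inputs that match no row fall back to default_value.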

In the PMML code below, we use elements MapValues and NormDiscrete to group small sets of categorical values. More specifically, we want to find out if input variable "VarInputColor" belongs to a specific group of colors. We do that by using the InlineTable element of MapValues to map different colors to the same number. We then use the element NormDiscrete to create dummy variables which are used to indicate group membership. This problem can be represented schematically in the following way:

IF VarInputColor is in ("Yellow","Red")
THEN
VarColorGroup_1 = 1
ELSE
VarColorGroup_1 = 0

IF VarInputColor is in ("Blue","Green")
THEN
VarColorGroup_2 = 1
ELSE
VarColorGroup_2 = 0

Note that in the example shown above we are mapping missing values to value "2" and setting the default value to "1" inside the MapValues transformation element. Derived field "GroupedInputColor" is assigned the result of the mapping between the input strings and the number representing the group. This variable is then used to generate the desired output. If "GroupedInputColor" contains value "1", variable "VarColorGroup_1" is assigned value "1", otherwise it is assigned value "0". If "GroupedInputColor" contains value "2", variable "VarColorGroup_2" is assigned value "1", otherwise it is assigned value "0". This is accomplished by the use of the NormDiscrete element as described earlier.
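A small Python sketch of this grouping logic is given here. It follows the values stated in the text (missing maps to 2, unlisted colors default to 1); the dictionary and function names are hypothetical.

```python
GROUP_OF = {"Yellow": 1, "Red": 1, "Blue": 2, "Green": 2}

def grouped_input_color(var_input_color):
    """Map a color to its group number (the MapValues step)."""
    if var_input_color is None:
        return 2  # missing values are mapped to group 2
    return GROUP_OF.get(var_input_color, 1)  # unlisted colors default to 1

def color_group_dummies(var_input_color):
    """Flag group membership (the NormDiscrete step)."""
    group = grouped_input_color(var_input_color)
    return {
        "VarColorGroup_1": int(group == 1),
        "VarColorGroup_2": int(group == 2),
    }
```

One consequence worth noticing is that a missing color ends up flagged as a member of group 2, and an unlisted color such as "Purple" ends up in group 1.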

Large sets of values can be handled in ADAPA by an external mapping table. For example, imagine that you would like to map all existing colors (that you can imagine) into four different groups and then create dummy variables to indicate such grouping. You can design this table externally (ADAPA allows external tables to be referred from a PMML file) and use the PMML element TableLocator to reference it from inside the MapValues transformation element. The PMML code below implements this functionality.

Note that we are mapping missing values to value "4" and setting the default value to "3" inside the MapValues transformation element. Derived field "GroupedInputColor" is assigned the result of the mapping between the input colors and the number representing the group they belong to. As before, variable "GroupedInputColor" is then used to generate the desired output through the use of the element NormDiscrete.

The mapping table itself is only being referenced inside the element TableLocator. We are using a PMML extension in this case to indicate to ADAPA that the mapping table we want to access is named "GroupedInputColor".

Functions

If a certain transformation is to be applied to input data many times and to multiple fields, it makes sense to encapsulate the transformation inside a function and just use it as many times as necessary. This reduces the complexity of the PMML model and greatly simplifies its application. PMML provides a number of built-in functions and also allows users to define their own functions.

Built-in Functions

ADAPA supports all PMML built-in functions. The complete list is shown below.

+, -, * and /
min, max, sum, avg, median, product
log10, ln, sqrt, abs, exp, pow, threshold, floor, ceil, round
isMissing, isNotMissing
equal, notEqual, lessThan, lessOrEqual, greaterThan, greaterOrEqual
and, or
not
concat
matches, replace (regular expressions)
uppercase
substring
trimBlanks
formatNumber
formatDatetime
dateDaysSinceYear
dateSecondsSinceYear
dateSecondsSinceMidnight

Note that functions such as "min", "max", "sum" and "avg" take a variable number of parameters (derived fields or input fields) and return a single value which can then be assigned to a new derived field. Please refer to the DMG website - Built-in Functions for code examples and descriptions.

The PMML code shown below implements the following arithmetic operation:
ResultVar=maximum(round(InputVar1/3.3),2^(1+log(1.3*InputVar2+1)))

Note that it uses two different input variables: "InputVar1" and "InputVar2" as well as a number of numeric constants. The example shows many built-in functions which are used together to implement such a complex operation. The end result is assigned to derived variable "ResultVar".
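For readers who want to check the arithmetic by hand, the same operation can be written in Python as below. Two points are assumptions rather than facts from the text: "log" is read as the natural logarithm, and rounding is implemented as round-half-up (Python's built-in round() uses banker's rounding, so it is avoided here). The function names are hypothetical.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, with halves going up (an assumption)."""
    return math.floor(x + 0.5)

def result_var(input_var1, input_var2):
    """ResultVar = maximum(round(InputVar1/3.3), 2^(1+log(1.3*InputVar2+1)))."""
    left = round_half_up(input_var1 / 3.3)
    # Assumes "log" in the formula means the natural logarithm.
    right = math.pow(2, 1 + math.log(1.3 * input_var2 + 1))
    return max(left, right)
```

With InputVar2 equal to 0 the second term reduces to 2^1, so the result is driven by the first term whenever the rounded quotient exceeds 2.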

PMML also defines many functions that support boolean operations. These are used to compare parameters which are required to be of identical type (e.g., strings or dates) or of compatible type for numeric variables (e.g., double vs. integer). The result is of boolean type: "true" or "false", which is evaluated by functions such as "if" and "not".

PMML functions which operate on two input attributes, e.g. "a" and "b", of identical or compatible types are:

equal: evaluates to "true" if "a" is equal to "b", "false" otherwise.
notEqual: evaluates to "true" if "a" is not equal to "b", "false" otherwise.
lessThan: evaluates to "true" if "a" is less than "b", "false" otherwise.
lessOrEqual: evaluates to "true" if "a" is less or equal to "b", "false" otherwise.
greaterThan: evaluates to "true" if "a" is greater than "b", "false" otherwise.
greaterOrEqual: evaluates to "true" if "a" is greater or equal to "b", "false" otherwise.
isIn: evaluates to "true" if "a" is in "b", "false" otherwise. Attribute "b" in this case is an array of values.
isNotIn: evaluates to "true" if "a" is not in "b", "false" otherwise. Attribute "b" in this case is an array of values.

PMML functions which operate on a single input attribute, e.g. "a", are:

isMissing: evaluates to "true" if "a" is missing, i.e. equal to NULL, "false" otherwise.
isNotMissing: evaluates to "true" if "a" is not missing, "false" otherwise.

PMML functions which evaluate boolean operations are:

not: operates on a single boolean attribute. Negates existing boolean evaluation result.
and: summarizes the results of two or more independent boolean operations. Evaluates to "true" only if all operations are "true", "false" otherwise.
or: summarizes the results of two or more independent boolean operations. Evaluates to "true" if a single operation is "true", "false" only if all operations are "false".
if: implements IF-THEN-ELSE logic. The ELSE part is optional.

Below, we give a few examples of how these functions can be used to implement logical operations. We start with the PMML code below which implements the following logical and arithmetic operations:

IF InputVar1 == "Partner"
THEN
DerivedVar1 = "P"
ELSE
DerivedVar2 = 2 * InputVar2

In this example, we are using "InputVar1" which contains a string to assign values to two very different derived variables: "DerivedVar1" which is a string and "DerivedVar2" which is an integer.

Note that the code uses the functions "if", "equal", and "not" as well as the built-in function "*". The "else" part of the "if" function is not used simply because we want to assign the "then" result to "DerivedVar1" and the "else" result to a different variable, "DerivedVar2". Each data transformation in PMML is encapsulated under a single DerivedField element.
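The behavior of two separate "if" calls without an "else" part can be emulated in Python as follows. This is a sketch of the semantics rather than PMML: a missing result is modeled as None, and the names pmml_if and derive are hypothetical.

```python
def pmml_if(condition, then_value, else_value=None):
    """Mimic an "if" whose "else" part is optional.

    When the condition is false and no else value is given,
    the result is missing (None).
    """
    if condition is None:
        return None
    return then_value if condition else else_value

def derive(input_var1, input_var2):
    """Return (DerivedVar1, DerivedVar2) for the logic described above."""
    is_partner = input_var1 == "Partner"
    derived_var1 = pmml_if(is_partner, "P")
    derived_var2 = pmml_if(not is_partner, 2 * input_var2)
    return derived_var1, derived_var2
```

Exactly one of the two derived variables receives a value for any given record; the other one stays missing.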

The PMML code below assumes that both "then" and "else" parts of the "if" use the same derived variable "DerivedVar1" to implement the following operations:

IF InputVar1 == "Partner"
THEN
DerivedVar1 = 5.1 * InputVar2
ELSE
DerivedVar1 = InputVar2 / 3.3

Note that in this case, "DerivedVar2" is not being used. The "then" and "else" part are being used to assign the same variable "DerivedVar1" the result of two different computations.
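PMML expresses such a computation as nested function applications over field references and constants. The Python sketch below imitates that nesting with plain tuples and a tiny evaluator; the tuple encoding, the FUNCTIONS table and the name evaluate are all assumptions made for illustration, and both branches of "if" are evaluated eagerly here for simplicity.

```python
import operator

FUNCTIONS = {
    "equal": operator.eq,
    "*": operator.mul,
    "/": operator.truediv,
    "if": lambda cond, then, other=None: then if cond else other,
}

def evaluate(expr, fields):
    """Evaluate a nested (function, arg, ...) expression tree."""
    kind = expr[0]
    if kind == "field":   # reference to an input field
        return fields[expr[1]]
    if kind == "const":   # literal constant
        return expr[1]
    args = [evaluate(arg, fields) for arg in expr[1:]]
    return FUNCTIONS[kind](*args)

# if(equal(InputVar1, "Partner"), 5.1 * InputVar2, InputVar2 / 3.3)
DERIVED_VAR1 = (
    "if",
    ("equal", ("field", "InputVar1"), ("const", "Partner")),
    ("*", ("const", 5.1), ("field", "InputVar2")),
    ("/", ("field", "InputVar2"), ("const", 3.3)),
)
```

Calling evaluate(DERIVED_VAR1, {"InputVar1": "Partner", "InputVar2": 2.0}) takes the "then" branch, while any other value of "InputVar1" takes the "else" branch.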

We showed earlier how to use the Discretize element in conjunction with NormDiscrete to create dummy variables to indicate membership to numeric intervals. What if we would like to do just that, but this time for strings? The PMML code below exemplifies how this could be accomplished by using functions "if" and "lessOrEqual" in conjunction with the NormDiscrete element.

The PMML code below implements the following operations:

IF InputVar less or equal to "Denmark"
THEN
    DerivedVar1 = 1 (otherwise = 0)
ELSE
    IF InputVar less or equal to "France"
    THEN
        DerivedVar2 = 1 (otherwise = 0)
    ELSE
        DerivedVar3 = 1 (otherwise = 0)

Note that the three dummy variables: "DerivedVar1", "DerivedVar2", and "DerivedVar3" are used to indicate the membership of input variable "InputVar" to three different string intervals.
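The string-interval idea can be tried out in Python with ordinary string comparison, as sketched here. Python compares strings by Unicode code point, which may differ from the ordering a particular PMML engine applies, so treat the ordering as an assumption; the function name is hypothetical.

```python
def string_interval_dummies(input_var):
    """Flag which of three string intervals input_var falls into."""
    if input_var <= "Denmark":
        idx = 1
    elif input_var <= "France":
        idx = 2
    else:
        idx = 3
    return {f"DerivedVar{k}": int(idx == k) for k in (1, 2, 3)}
```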

Finally, we end our list of PMML code examples by showing the use of functions "isMissing" and "isIn" combined with function "if". The example implements the following operations:

IF InputVar is missing
THEN
    DerivedVar = 1
ELSE
    IF InputVar is in ("Partner", "Associate", "Colleague")
    THEN
        DerivedVar = 2
    ELSE
        DerivedVar = 3
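The same decision logic reads as follows in Python, with a missing input modeled as None. This is only an illustration of the operations listed above, and the function name is hypothetical.

```python
def derived_var(input_var):
    """Return 1 for missing input, 2 for a known role, 3 otherwise."""
    if input_var is None:  # isMissing
        return 1
    if input_var in ("Partner", "Associate", "Colleague"):  # isIn
        return 2
    return 3
```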

When defining a PMML document, the pre-processing of the input variables is mainly located inside the following PMML elements: TransformationDictionary and LocalTransformations, although the TransformationDictionary element is mostly used for user-defined functions (element DefineFunction).

For the formal PMML schema definition of the transformations covered here, please refer to the PMML Transformations page on the DMG website.

PMML 101

The Predictive Model Markup Language (PMML) is an XML-based language developed by the Data Mining Group (DMG) which provides a way for applications to define statistical and data mining models and to share models between PMML compliant applications.

PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application, and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is now straightforward.

Since PMML is an XML-based standard, the specification comes in the form of an XML Schema.

PMML Components

PMML follows a very intuitive structure to describe a data mining model, be it an artificial neural network or a logistic regression model. Sequentially, it can be described by the following components:

PMML Elements - A PMML file is highly structured. The list of PMML elements allows for data manipulation and models to be expressed in a single PMML file.

Header: contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model such as name and version. It also contains an attribute for a timestamp which can be used to specify the date of model creation.

Data Dictionary: contains definitions for all the possible fields used by the model. It is in the data dictionary that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined as well as the data type (such as, string or double).

Data Transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of data transformations.

Normalization: map values to numbers, the input can be continuous or discrete.

Discretization: map continuous values to discrete values.

Value mapping: map discrete values to discrete values.

Functions: derive a value by applying a function to one or more parameters.

Aggregation: used to summarize or collect groups of values.

The ability to represent data transformations (as well as outlier and missing value treatment methods) in conjunction with the model itself is a major advantage of PMML. When it comes to the actual use of a PMML model, pre- and post-processing are embedded into the PMML file itself. All that is needed is the raw input data and users are on the go (see useful links for a primer on how to represent data pre-processing in PMML).

Data transformations and predictive models are represented in a single PMML file, which facilitates model deployment.

Model: contains the definition of the data mining model. A multi-layered feed-forward neural network is the most common neural network representation in contemporary applications, given the popularity and efficacy associated with its training algorithm known as Backpropagation. Such a network is represented in PMML by a "NeuralNetwork" element which contains attributes such as:

Model Name (attribute modelName)

Function Name (attribute functionName)

Algorithm Name (attribute algorithmName)

Activation Function (attribute activationFunction)

Number of Layers (attribute numberOfLayers)

This information is then followed by three kinds of neural layers which specify the architecture of the neural network model being represented in the PMML document. These elements are NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other data mining models including support vector machines, association rules, naive Bayes classifiers, clustering models, text models, decision trees, and different regression models.

Mining Schema: the mining schema lists all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:

Name (attribute name): must refer to a field in the data dictionary.

Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.

Outlier Treatment (attribute outliers): defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.

Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified then a missing value is automatically replaced by the given value.

Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).
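How the outlier and missing-value attributes of a mining field might interact can be sketched in Python as follows. The order of operations (outlier treatment first, then missing value replacement) is an assumption of this sketch, as are the function and parameter names; a missing value is modeled as None.

```python
def prepare_field(value, low_value, high_value, outliers="asIs",
                  missing_value_replacement=None):
    """Apply outlier treatment, then missing value replacement."""
    if value is not None and not (low_value <= value <= high_value):
        if outliers == "asExtremeValues":
            # Clamp to the nearest of the low and high values.
            value = low_value if value < low_value else high_value
        elif outliers == "asMissingValues":
            value = None
        # "asIs" leaves the value untouched.
    if value is None:
        value = missing_value_replacement
    return value
```

Under these assumptions, an outlier treated as a missing value ends up replaced by the configured replacement value, exactly like an input that was missing to begin with.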

Targets: The targets element allows for the scaling of predicted variables. It is a straight-forward way to represent post-processing of raw outputs.

Supported Modeling Techniques

The list of modeling techniques supported by the PMML standard is constantly being updated. Version 4.1 supports the following techniques:

Neural Networks (Feedforward neural networks as well as radial-basis)

Decision Trees (with coding for several missing value strategies)

Support Vector Machines

Linear and Logistic Regression Models (via a generic representation or a simplified one)

Association Rules

Clustering

Naive Bayes

Sequences

Text Models

Time Series

Rulesets

Scorecards

K-Nearest Neighbors

Baseline Models

PMML Example

The example below shows a PMML file used to represent a logistic regression model. In this model, the predicted variable is named honcomp. Note that this is a very simple model. There are only three input variables (female, read_score, and science_score) which are all double. There is no pre-processing of the raw input variables and so these are fed directly into the regression model which produces a value for honcomp (0 or 1).

PMML Example - File containing a simple regression model expressed in PMML.
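To give a feel for what a scoring engine does with such a model, the Python sketch below scores a record with a logistic regression over the same three inputs. The intercept, the coefficients and the 0.5 cutoff are hypothetical values chosen for illustration; they are not the values from the original PMML file.

```python
import math

# Hypothetical parameters; the original model's values are not given in the text.
INTERCEPT = -12.0
COEFFICIENTS = {"female": 1.0, "read_score": 0.1, "science_score": 0.1}

def probability_honcomp(record):
    """Logistic function applied to the linear combination of the inputs."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * record[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))

def predict_honcomp(record, cutoff=0.5):
    """Return 1 when the predicted probability reaches the cutoff, else 0."""
    return 1 if probability_honcomp(record) >= cutoff else 0
```

Because no pre-processing is involved, the raw field values go straight into the linear combination before the logistic function is applied.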

A comprehensive list of PMML examples can be found at the Zementis website - Examples Page.

PMML Products

A range of products are being offered to produce and consume PMML. Please check the following page at the DMG website for an updated list of PMML-powered products:

http://www.dmg.org/products.html

Useful Pages

PMML Examples - A list of PMML 3.2 files including neural network models, support vector machines, decision trees, regression models and clustering.

ADAPA Predictive Analytics Engine - Available as a Service through the Amazon Elastic Compute Cloud, the ADAPA engine can import several PMML models. After uploading, models are available for scoring or verification.

Data Pre-Processing in PMML and ADAPA - A Primer: Contains several examples on how to manipulate data in PMML.

Data Mining Group Home