Predictive Analytics, Big Data, Hadoop, PMML: PMML 101

The Predictive Model Markup Language (PMML) is an XML-based language developed by the Data Mining Group (DMG) which provides a way for applications to define statistical and data mining models and to share models between PMML compliant applications.

PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application, and use another vendors' applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is now straightforward.

Since PMML is an XML-based standard, the specification comes in the form of an XML Schema.

PMML Components

PMML follows a very intuitive structure to describe a data mining model, be it an artificial neural network or a logistic regression model. Sequentially, it can be described by the following components:

PMML Elements - A PMML file is highly structured. The list of PMML elements allows for data manipulation and model to be expressed in a single PMML file.

Header: contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model such as name and version. It also contains an attribute for a timestamp which can be used to specify the date of model creation.

Data Dictionary: contains definitions for all the possible fields used by the model. It is in the data dictionary that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined as well as the data type (such as, string or double).

Data Transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of data transformations.

Normalization: map values to numbers, the input can be continuous or discrete.

Discretization: map continuous values to discrete values.

Value mapping: map discrete values to discrete values.

Functions: derive a value by applying a function to one or more parameters.

Aggregation: used to summarize or collect groups of values.

The ability to represent data transformations (as well as outlier and missing value treatment methods) in conjunction with the model itself is a major advantage of PMML. When it comes to the actual use of a PMML model, pre- and post-processing are embedded into the PMML file itself. All that is needed is the raw input data and users are on the go (see useful links for a primer on how to represent data pre-processing in PMML).

Data transformations and predictive models are represented in a single PMML file, which facilitates model deployment.

Model: contains the definition of the data mining model. A multi-layered feed-forward neural network is the most common neural network representation in contemporary applications, given the popularity and efficacy associated with its training algorithm known as Backpropagation. Such a network is represented in PMML by a "NeuralNetwork" element which contains attributes such as:

Model Name (attribute modelName)

Function Name (attribute functionName)

Algorithm Name (attribute algorithmName)

Activation Function (attribute activationFunction)

Number of Layers (attribute numberOfLayers)

This information is then followed by three kinds of neural layers which specify the architecture of the neural network model being represented in the PMML document. These attributes are NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other data mining models including support vector machines, association rules, naive bayes classifier, clustering models, text models, decision trees, and different regression models.

Mining Schema: the mining schema lists all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:

Name (attribute name): must refer to a field in the data dictionary.

Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.

Outlier Treatment (attribute outliers): defines the outlier treatment to be use. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.

Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified then a missing value is automatically replaced by the given values.

Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).

Targets: The targets element allows for the scaling of predicted variables. It is a straight-forward way to represent post-processing of raw outputs.

Supported Modeling Techniques

The list of modeling techniques supported by the PMML standard is constantly being updated. Version 4.1 supports the following techniques:

Neural Networks (Feedforward neural networks as well as radial-basis)

Decision Trees (with coding for several missing value strategies)

Support Vector Machines

Linear and Logistic Regression Models (via a generic representation or a simplified one)

Association Rules

Clustering

Naive Bayes

Sequences

Text Models

Time Series

Rulesets

Scorecards

K-Nearest Neighbors

Baseline Models

PMML Example

The example below shows a PMML file used to represent a logistic regression model. In this model, the predicted variable is named honcomb. Note that this is a very simple model. There are only three input variables (female, read_score, and science_score) which are all double. There is no pre-processing of the raw input variables and so these are fed directly into the regression model which produces a value for honcomp (0 or 1).

PMML Example - File containing a simple regression model expressed in PMML.

A comprehensive list of PMML examples can be found at the Zementis website - Examples Page.

PMML Products

A range of products are being offered to produce and consume PMML. Please check the following page at the DMG website for an updated list of PMML-powered products:

http://www.dmg.org/products.html

Useful Pages

PMML Examples - A list of PMML 3.2 files including neural network models, support vector machines, decision trees, regression models and clustering.

ADAPA Predictive Analytics Engine - Available as a Service through the Amazon Elastic Compute Cloud, the ADAPA engine can import several PMML models. After uploading, models are available for scoring or verification.

Data Pre-Processing in PMML and ADAPA - A Primer: Contains several examples on how to manipulate data in PMML.

Data Mining Group Home

Predictive Analytics, Big Data, Hadoop, PMML

Friday, April 3, 2009

PMML 101

No comments:

Post a Comment

Welcome to the World of Predictive Analytics!