Predictive Analytics, Big Data, Hadoop, PMML: January 2012

The Predictive Model Markup Language (PMML) is an XML-based language developed by the Data Mining Group (DMG). It gives applications a way to define statistical and data mining models and to share those models with other PMML-compliant applications.

PMML gives applications a vendor-independent way to define models, so that proprietary issues and incompatibilities are no longer a barrier to exchanging models between applications. It allows users to develop models within one vendor's application and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is now straightforward.

The adoption of PMML by the major analytic vendors is a great example of companies embracing interoperability. IBM, SAS, MicroStrategy, FICO, Equifax, NASA, Salford Systems and Zementis, for example, are part of the Data Mining Group (DMG), the committee shaping PMML. Open-source companies such as KNIME, Open Data Group, and Rapid-I are also part of the committee.

PMML made its debut in 1997. Today, it is a mature and refined language. The latest version, PMML 4.1, released in December 2011, adds three new model elements:

Scorecard: This new element is used to represent scorecards, a commercially significant formulation of predictive models. Scorecards are used extensively in retail banking to estimate and rank-order consumer credit risk. Scorecards are usually associated with adverse or reason codes, so PMML 4.1 also introduced the ability to represent reason codes for explaining any adverse actions derived from a scorecard.

NearestNeighborModel: This new element is used to represent k-Nearest Neighbors (k-NN) models. k-NN is an instance-based learning algorithm: the prediction is based on the k training instances closest to the case being scored. Therefore, all training cases have to be stored inside the PMML file itself. For cases in which the amount of data is quite large, PMML allows for it to be referenced externally.

BaselineModel: This new element is used to represent baseline models, which are used for change detection.
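To make the Scorecard description above a little more concrete, here is a minimal sketch of how a scorecard might be evaluated: partial scores are summed per characteristic, and reason codes are ranked by how far each partial score falls below a baseline. This is plain Python written for illustration, not PMML and not any vendor's scoring API. The characteristic names, point values, baselines, and reason codes are all invented, and the ranking rule is only loosely modeled on the points-below idea.

```python
# Hypothetical scorecard: every name, point value, and reason code is invented.
SCORECARD = {
    "initial_score": 100.0,
    "characteristics": [
        {
            "name": "income",
            "baseline": 30.0,
            # (predicate, partial score, reason code); the first match wins
            "attributes": [
                (lambda r: r["income"] < 20000, 5.0, "RC_LOW_INCOME"),
                (lambda r: r["income"] < 50000, 20.0, "RC_MID_INCOME"),
                (lambda r: True, 30.0, "RC_INCOME_OK"),
            ],
        },
        {
            "name": "age",
            "baseline": 25.0,
            "attributes": [
                (lambda r: r["age"] < 25, 10.0, "RC_YOUNG"),
                (lambda r: True, 25.0, "RC_AGE_OK"),
            ],
        },
    ],
}


def score_with_reasons(record, scorecard, max_reasons=2):
    """Return (total score, reason codes ranked by points lost versus baseline)."""
    total = scorecard["initial_score"]
    shortfalls = []
    for characteristic in scorecard["characteristics"]:
        for predicate, points, reason in characteristic["attributes"]:
            if predicate(record):
                total += points
                shortfall = characteristic["baseline"] - points
                if shortfall > 0:
                    # Only characteristics that cost points yield a reason code.
                    shortfalls.append((shortfall, reason))
                break
    # Largest shortfall first; the stable sort keeps scorecard order on ties.
    shortfalls.sort(key=lambda item: item[0], reverse=True)
    return total, [reason for _, reason in shortfalls[:max_reasons]]


if __name__ == "__main__":
    print(score_with_reasons({"income": 18000, "age": 45}, SCORECARD))
```

In a real PMML file, the characteristics, attributes, and reason codes would be declared in the Scorecard element itself, and a compliant scoring engine would carry out the evaluation.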

PMML 4.1 also adds to the language:

Generic Post-Processing Capabilities: In previous PMML versions, the "Targets" element got all the attention for its ability to implement scaling. PMML 4.1 takes post-processing to a new level by expanding the role of the "Output" element. This element can now be used to represent not only scaling but any type of data manipulation, since it allows transformations and built-in functions to be applied to any output value. It also allows for the definition of thresholds and business decisions, which can be used as the final model output.

Simplified Multiple Models Capabilities: Besides making the representation of multiple models simpler, PMML 4.1 also made it more generic. The latest PMML release has deprecated the existing model composition approach and now allows for composition to take place inside a generic "Segmentation" element. In this way, a single element can now be used to represent model segmentation, model ensemble, and model composition.

New Built-in Functions: Three new functions were added to the language's existing plethora of built-in functions. Through its logical, arithmetic and string operators, PMML is capable of representing a myriad of data pre-processing steps.
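The idea behind the generic "Segmentation" element described above, one structure covering both model segmentation and model ensembles, can be sketched in a few lines. The following is a simplified Python illustration, not a PMML engine: each segment pairs a predicate with a model, and a method name decides how the active segments are combined. The method names mirror values PMML uses for combining multiple models, but the function, its signature, and the toy models are assumptions made up for this example. Model chaining is left out to keep the sketch short.

```python
from collections import Counter
from statistics import mean


def evaluate_segmentation(record, segments, method):
    """Combine the models whose predicates accept the record.

    segments: list of (predicate, model) pairs, both plain callables.
    method:   "selectFirst", "average", or "majorityVote".
    """
    active = [model for predicate, model in segments if predicate(record)]
    if not active:
        return None  # no segment applies to this record
    if method == "selectFirst":
        # Segmentation: the first matching segment alone produces the result.
        return active[0](record)
    if method == "average":
        # Ensemble for numeric predictions.
        return mean(model(record) for model in active)
    if method == "majorityVote":
        # Ensemble for categorical predictions.
        votes = Counter(model(record) for model in active)
        return votes.most_common(1)[0][0]
    raise ValueError(f"unsupported method: {method}")


def always(record):
    """Predicate that accepts every record."""
    return True


if __name__ == "__main__":
    # Segmentation: a different (toy) model per region.
    by_region = [
        (lambda r: r["region"] == "north", lambda r: "north_model"),
        (always, lambda r: "default_model"),
    ]
    print(evaluate_segmentation({"region": "north"}, by_region, "selectFirst"))

    # Ensemble: every (toy) model scores every record.
    ensemble = [(always, lambda r: 0.25), (always, lambda r: 0.75)]
    print(evaluate_segmentation({"region": "south"}, ensemble, "average"))
```

The same list of segments serves both uses; only the predicates and the combination method change, which is the simplification the single element is meant to provide.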

PMML 4.1 also adds a new "isScorable" attribute to all existing model elements to signal whether a model is production-ready. It also offers a new document that gathers all the rules around field scope and field names, which were previously scattered over several documents. Scope becomes an important issue when a PMML file is used to represent multiple nested models.
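As an illustration of how a consuming application might use "isScorable", the sketch below reads the attribute from a heavily stripped-down PMML fragment with Python's standard XML parser. The fragment is not schema-valid (it omits the header, data dictionary, and model contents), the model names are invented, and identifying model elements by the presence of a "functionName" attribute is a shortcut assumed for this example. A missing attribute is treated as true here, on the understanding that this is its default.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: real PMML files need a Header, a DataDictionary,
# and full model definitions. The model names are hypothetical.
PMML_FRAGMENT = """
<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <TreeModel modelName="draft_tree" functionName="classification" isScorable="false"/>
  <RegressionModel modelName="prod_reg" functionName="regression"/>
</PMML>
"""


def scorable_models(pmml_text):
    """Map each top-level model name to whether it is flagged as scorable."""
    root = ET.fromstring(pmml_text)
    result = {}
    for child in root:
        # Assumed shortcut: treat any child with a functionName as a model.
        if "functionName" not in child.attrib:
            continue
        # Fall back to the tag name (without its namespace) if unnamed.
        name = child.get("modelName", child.tag.split("}")[-1])
        flag = child.get("isScorable", "true").strip().lower()
        result[name] = flag in ("true", "1")
    return result


if __name__ == "__main__":
    for name, ok in scorable_models(PMML_FRAGMENT).items():
        print(name, "scorable" if ok else "not scorable")
```

A deployment system could run a check like this before loading a model and refuse to score with anything flagged as not scorable.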

As the de facto standard to represent predictive solutions, PMML allows models and data transformations to be represented together in a single, concise way. When used to represent all the computations that make up a predictive solution, PMML becomes the bridge not only between data analysis, model building, and deployment systems, but also between all the people and teams involved in the analytical process inside a company. Needless to say, PMML is already shaping the world of predictive analytics.

Resources

Check out the DMG website to review all new and pre-4.1 PMML language elements

Read the series of articles about PMML published by IBM.

Visit the Zementis PMML Resources page to explore complete PMML examples and access PMML tools

Check out the PMML book on Amazon.com
Join the PMML discussion group on LinkedIn.


Wednesday, January 11, 2012

PMML 4.1 is here! Mature standard for predictive analytics
