Friday, October 23, 2015

Converting arbitrary R expressions to PMML

The pmmlTransformations R package can be used to transform data and create new features for use in predictive PMML models.
In this blog post, we will focus on FunctionXform, a function introduced in version 1.3.0 of pmmlTransformations, and present a few examples of using it to create new data features.

How it works

Transformations in the pmmlTransformations package work in the following manner: given a wrapper object created by WrapData and a transformation name, the code calculates data for a new feature and creates a new wrapper object. The wrapped data is then passed in as the data argument when training an R model with a compatible R package. When PMML is produced with pmml::pmml() (passing the wrapper via the transforms argument), the transformation is inserted into the LocalTransformations node as a DerivedField. Any original fields used by transformations are added to the appropriate nodes in the resulting PMML file.
While other transformations in the package transform only one field, FunctionXform makes it possible to use multiple data fields and functions to produce a new feature.
Note that while FunctionXform is part of the pmmlTransformations package, the code to produce PMML from R is in the pmml package. The following examples require both packages to be installed to work.
To make tables more readable in this blog post, we are using the kable function (part of knitr).

Single numeric field

Using the iris dataset as an example, let’s construct a new numeric feature by transforming one variable.
First, load the required libraries:
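library(pmmlTransformations)
library(pmml)
library(knitr)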
Then load the data and display the first 3 lines:
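data(iris)
kable(head(iris, 3))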
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
Create the irisBox wrapper object with WrapData:
irisBox <- WrapData(iris)
irisBox contains the data and transform information that will be used to produce PMML later. The original data is in irisBox$data. Any new features created with a transformation are added as columns to this data frame.
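kable(head(irisBox$data, 3))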
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
Transform and field information is in irisBox$fieldData. The fieldData data frame contains information on every field in the dataset, as well as every transform used. The functionXform column contains expressions used in the FunctionXform transform. Here we’ll show only a few of the columns:
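# Column names as shown in the output below
kable(irisBox$fieldData[, c("type", "dataType", "origFieldName", "sampleMin", "sampleMax")])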
              type      dataType  origFieldName  sampleMin  sampleMax
Sepal.Length  original  numeric   NA             NA         NA
Sepal.Width   original  numeric   NA             NA         NA
Petal.Length  original  numeric   NA             NA         NA
Petal.Width   original  numeric   NA             NA         NA
Species       original  factor    NA             NA         NA
Now add a new feature, Sepal.Length.Sqrt, using FunctionXform:
irisBox <- FunctionXform(irisBox, origFieldName="Sepal.Length",
                         newFieldName="Sepal.Length.Sqrt",
                         formulaText="sqrt(Sepal.Length)")
The new feature is calculated and added as a column to the irisBox$data data frame:
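kable(head(irisBox$data, 3))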
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species  Sepal.Length.Sqrt
5.1           3.5          1.4           0.2          setosa   2.258318
4.9           3.0          1.4           0.2          setosa   2.213594
4.7           3.2          1.3           0.2          setosa   2.167948
irisBox$fieldData now contains a new row with the transformation expression in the functionXform column:
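# Show the new row (selected columns)
kable(irisBox$fieldData["Sepal.Length.Sqrt", c("type", "dataType", "origFieldName", "functionXform")])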
                   type     dataType  origFieldName  functionXform
Sepal.Length.Sqrt  derived  numeric   Sepal.Length   sqrt(Sepal.Length)
Construct a linear model to predict Petal.Width using this new feature, and convert it to PMML:
fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data=irisBox$data)
fit_pmml <- pmml(fit, transforms=irisBox)
Since the model predicts Petal.Width using a variable based on Sepal.Length, Sepal.Length will be added to the DataDictionary and MiningSchema nodes in the resulting PMML. We can take a look at the relevant parts of the output like so:
fit_pmml[[2]] #Data Dictionary node
#> <DataDictionary numberOfFields="2">
#>  <DataField name="Petal.Width" optype="continuous" dataType="double"/>
#>  <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
#> </DataDictionary>
fit_pmml[[3]][[1]] #Mining Schema node
#> <MiningSchema>
#>  <MiningField name="Petal.Width" usageType="predicted"/>
#>  <MiningField name="Sepal.Length" usageType="active"/>
#> </MiningSchema>
The LocalTransformations node contains Sepal.Length.Sqrt as a derived field:
#> <LocalTransformations>
#>   <DerivedField name="Sepal.Length.Sqrt" dataType="double" optype="continuous">
#>     <Apply function="sqrt">
#>       <FieldRef field="Sepal.Length"/>
#>     </Apply>
#>   </DerivedField>
#> </LocalTransformations>
The PMML model can now be deployed and consumed. For any input data, the new Sepal.Length.Sqrt feature will be created when the data is scored against the model.
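To write the PMML to disk for deployment, one option (a minimal sketch; saveXML comes from the XML package, which pmml depends on, and the file name is just an example) is:
library(XML)
saveXML(fit_pmml, file = "iris_model.pmml")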

Multiple input fields

It is also possible to create new features by combining several fields. Using the same iris dataset, let's create a new field from the squared ratio of Sepal.Length to Petal.Length:
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox, origFieldName="Sepal.Length,Petal.Length",
                         newFieldName="Squared.Length.Ratio",
                         formulaText="(Sepal.Length / Petal.Length)^2")
As before, the new field is added as a column to the irisBox$data data frame:
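kable(head(irisBox$data, 3))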
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species  Squared.Length.Ratio
5.1           3.5          1.4           0.2          setosa   13.27041
4.9           3.0          1.4           0.2          setosa   12.25000
4.7           3.2          1.3           0.2          setosa   13.07101
Fit a linear model for Petal.Width using this new feature, and convert it to PMML:
fit <- lm(Petal.Width ~ Squared.Length.Ratio, data=irisBox$data)
fit_pmml <- pmml(fit, transforms=irisBox)
The PMML will contain Sepal.Length and Petal.Length in the DataDictionary and MiningSchema, since these were used in FunctionXform:
fit_pmml[[2]] #Data Dictionary node
#> <DataDictionary numberOfFields="3">
#>  <DataField name="Petal.Width" optype="continuous" dataType="double"/>
#>  <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
#>  <DataField name="Petal.Length" optype="continuous" dataType="double"/>
#> </DataDictionary>
fit_pmml[[3]][[1]] #Mining Schema node
#> <MiningSchema>
#>  <MiningField name="Petal.Width" usageType="predicted"/>
#>  <MiningField name="Sepal.Length" usageType="active"/>
#>  <MiningField name="Petal.Length" usageType="active"/>
#> </MiningSchema>
The LocalTransformations node contains Squared.Length.Ratio as a derived field:
#> <LocalTransformations>
#>   <DerivedField name="Squared.Length.Ratio" dataType="double" optype="continuous">
#>     <Apply function="pow">
#>       <Apply function="/">
#>         <FieldRef field="Sepal.Length"/>
#>         <FieldRef field="Petal.Length"/>
#>       </Apply>
#>       <Constant dataType="double">2</Constant>
#>     </Apply>
#>   </DerivedField>
#> </LocalTransformations>

PMML for arbitrary functions

The function functionToPMML (part of the pmml package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values. This can be useful for debugging.
As long as the expression passed to functionToPMML is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R. Constants in the expression are always assumed to be of type double. Variables in the expression are always assumed to be field names and are not substituted. That is, even if x has a value in the R environment, the resulting PMML will still use a field reference to x.
functionToPMML("1 + 2")
#> <Apply function="+">
#>   <Constant dataType="double">1</Constant>
#>   <Constant dataType="double">2</Constant>
#> </Apply>

x <- 3
functionToPMML("foo(bar(x * y))")
#> <Apply function="foo">
#>   <Apply function="bar">
#>     <Apply function="*">
#>       <FieldRef field="x"/>
#>       <FieldRef field="y"/>
#>     </Apply>
#>   </Apply>
#> </Apply>

functionToPMML("if(a<2) {x+3} else if (a>3) {'four'} else {5}")
#> <Apply function="if">
#>   <Apply function="lessThan">
#>     <FieldRef field="a"/>
#>     <Constant dataType="double">2</Constant>
#>   </Apply>
#>   <Apply function="+">
#>     <FieldRef field="x"/>
#>     <Constant dataType="double">3</Constant>
#>   </Apply>
#>   <Apply function="if">
#>     <Apply function="greaterThan">
#>       <FieldRef field="a"/>
#>       <Constant dataType="double">3</Constant>
#>     </Apply>
#>     <Constant dataType="string">four</Constant>
#>     <Constant dataType="double">5</Constant>
#>   </Apply>
#> </Apply>


FunctionXform makes it easy to create new features for PMML models in R.
The pmmlTransformations FunctionXform vignette contains additional examples, including transforming categorical data, using transformed features in another transform, unsupported functions, and notes on the limitations of the function.

Wednesday, May 28, 2014

Online PMML Course @ UCSD Extension: Register today!

The Predictive Model Markup Language (PMML) standard is touted as the standard for predictive analytics and data mining models. It allows predictive models built in one application to be moved to another without any re-coding. PMML has become essential for companies wanting to extract value and insight from Big Data. In the Big Data era, the agile deployment of predictive models is imperative: given the volume and velocity associated with Big Data, one cannot spend weeks or months re-coding a predictive model into the IT operational environment where it actually produces value (the fourth V in Big Data).

Also, as predictive models become more complex through the use of random forest models, model ensembles, and deep learning neural networks, PMML becomes even more relevant since model recoding is simply not an option.

Zementis has paired up with UCSD Extension to offer the first online PMML course. This is a great opportunity for individuals and companies alike to master PMML so that they can rally their predictive analytics resources around a single standard and, in doing so, benefit from all it has to offer.

Course Benefits
  • Learn how to represent an entire data mining solution using open standards
  • Understand how to use PMML effectively as a vehicle for model logging, versioning and deployment
  • Identify and correct issues with PMML code as well as add missing computations to auto-generated PMML code

Course Dates

07/14/14 - 08/25/14

PMML is supported by most commercial and open-source data mining tools. Companies and tools that support PMML include IBM SPSS, SAS, R, SAP KXEN, Zementis, KNIME, RapidMiner, FICO, StatSoft, Angoss, Microstrategy ... The standard itself is very mature and its latest release is version 4.2.

For more details about PMML, please visit the Zementis PMML Resources page.

Scoring Data from MySQL or SQL Server using KNIME and ADAPA

The video below shows the use of KNIME for handling data (reading data from a flat file and/or a database) as well as model building (data massaging and training a neural network). It also highlights how easy and straightforward it is to move a predictive model represented in PMML, the Predictive Model Markup Language, into the Zementis ADAPA Scoring Engine. ADAPA is then used for model deployment and scoring. PMML is the de facto standard to represent data mining models. It allows for predictive models to be moved between applications and systems without the need for model re-coding.

When training a model, scientists rely on historical data, but when the model is used on a regular basis, it is moved, or deployed, into production, where it is presented with new data. ADAPA provides a scalable and blazing-fast scoring engine for models in production. And, although KNIME data mining nodes are typically used by scientists to build models, its database and REST nodes can just as easily be used to create a flow that reads data from a database (MySQL, SQL Server, Oracle, ...) and passes it to ADAPA for scoring via its REST API.


Use-cases are:

  1. Read data from a flat file, use KNIME for data pre-processing and for building a neural network model. Export the entire predictive workflow as a PMML file, then upload and score it in ADAPA via its Admin Web Console.
  2. Read data from a database (MySQL, SQL Server, Oracle, ...), build a model in KNIME, export the model as a PMML file, and deploy it in ADAPA using the REST API. This use-case also shows new or testing data flowing from the database into ADAPA for scoring via a sequence of KNIME nodes. The video also shows how KNIME nodes can simply read a PMML file produced by any PMML-compliant data mining tool (R, SAS EM, SPSS, ...), upload it to ADAPA using the REST API, and score new data from MySQL, also through the REST interface. Note that in this case the model has already been trained; KNIME is used only to deploy the existing PMML file in ADAPA for scoring.


Zementis and SAP HANA: Real-time Scoring for Big Data

The Zementis partnership with SAP is manifesting itself in a number of ways. Two weeks ago we were part of the SAP Big Data Bus parked outside Wells Fargo in San Francisco. This week, we would like to share with you three new developments.

1) ADAPA is now being offered at the SAP HANA Marketplace.

2) An interview with our CEO, Mike Zeller, was just featured by SAP on the SAP Blogs.

3) Zementis was again part of the SAP Big Data Bus and the "Big Data Theatre". This time, the bus was parked outside US Bank in Englewood, Colorado. We engaged in a myriad of conversations with the many people who came through the bus about how ADAPA and SAP HANA work together to bring predictive analytics and real-time scoring to transactional data and millions of accounts, in any industry.

Visit the Zementis ADAPA for SAP HANA page for more details on the Zementis and SAP real-time solution for predictive analytics.

Friday, April 18, 2014

Real-time scoring of transactional data with ADAPA for SAP HANA

At the recent DEMO Enterprise 2014 conference, Zementis announced its participation in the SAP® Startup Focus program and launched ADAPA for SAP HANA, a standards-based predictive analytics scoring engine. 

ADAPA for SAP HANA provides a simple plug-and-play platform to deploy the most complex predictive models and execute them in real-time, even in the context of Big Data.

In joining the SAP HANA Startup Focus program, Zementis set out to address two key challenges related to the operational deployment of predictive analytics: agile deployment and scalable execution.

Transactional data has for years pushed the boundaries of predictive analytics. The financial industry, for example, has been using transactional data to detect fraud and abuse for decades with complex custom solutions. Real-time scoring is paramount for companies to be able to predict and prevent fraudulent activity before it actually happens.  Likewise, the Internet of Things (IoT) demands effective processing of sensor data to employ predictive maintenance for detecting issues before they turn into device failures.

To solve these challenges, Zementis combined its ADAPA predictive analytics scoring engine with SAP HANA in a true plug-and-play platform which is universally applicable across all industries: ADAPA serves scoring requests and executes predictive models, while HANA offloads complex model preprocessing and the computation of aggregates.

In this scenario, real-time execution critically depends on HANA serving complex data lookups and aggregate profile computation in a few milliseconds.  In a high-volume environment, such aggregates or lookups may have to be computed over millions of transactions.

ADAPA provides scalable real-time scoring of the core model, plus agility for model deployment through the Predictive Model Markup Language (PMML) industry standard.  Clients are able to instantly deploy existing predictive models from various data mining tools.  For example, you can take a complex predictive model from SAS Enterprise Miner, export it in PMML format and simply make it available for real-time scoring in ADAPA for SAP HANA.  The same process, of course, applies to most commercial tools, e.g. SAP Predictive Analysis, KXEN, IBM SPSS, as well as open source tools like R and KNIME.

The unique aspect of the Zementis / SAP platform is that it combines the benefits of an open standard for predictive analytics with the power of in-memory computing.

For more product details, please see

Thursday, February 27, 2014

PMML 4.2 is here! What changed? What is new?

PMML 4.2 is out! That's really great. The DMG (Data Mining Group) has been working on this new version of PMML for over two years now. And, I can truly say, it is the best PMML ever! If you haven't seen the press release for the new version, please see the posting below:

What changed?

PMML is a very mature language, so there aren't really any dramatic changes at this point. One noteworthy change is that older versions of PMML used to call the target field of a predictive model "predicted". This was confusing, since a predicted field is usually the result of scoring or executing a model (the score, so to speak). Well, PMML 4.2 clears things up a bit: the target field is now simply "target". A small change, but a huge step towards making it clear that the Output element is where the predicted outputs should be defined.
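
For example, here is how the same target field might be declared before and after the change (a hypothetical snippet for a model predicting Petal.Width, not taken from any specific tool):

<!-- PMML 4.1 and earlier -->
<MiningField name="Petal.Width" usageType="predicted"/>
<!-- PMML 4.2 -->
<MiningField name="Petal.Width" usageType="target"/>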

Continuous Inputs for Naive Bayes Models

This is a great new enhancement to the NaiveBayes model element. We wrote an entire paper about this new feature and presented it at the KDD 2013 PMML Workshop. If you use Naive Bayes models, you should definitely take a look at our article.

And, now you can benefit from actually having our proposed changes in PMML itself! This is really remarkable and we are all already benefiting from it. The Zementis Py2PMML (Python to PMML) Converter uses the proposed changes to convert Gaussian Naive Bayes models from scikit-learn to PMML.
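
As a sketch, continuous inputs are represented with per-target-value distributions inside a BayesInput element; the field name and statistics below are purely illustrative:

<BayesInput fieldName="Sepal.Length">
  <TargetValueStats>
    <TargetValueStat value="setosa">
      <GaussianDistribution mean="5.006" variance="0.124"/>
    </TargetValueStat>
    <TargetValueStat value="versicolor">
      <GaussianDistribution mean="5.936" variance="0.266"/>
    </TargetValueStat>
  </TargetValueStats>
</BayesInput>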

Complex Point Allocation for Scorecards

The Scorecard model element was introduced to PMML in version 4.1. It was a good element then, but it is really great now in PMML 4.2. We added a way to compute complex values for the points allocated to an attribute (under a certain characteristic) through the use of expressions. That means you can use input or derived fields to compute the actual value of the points. Very cool!
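
As a sketch (field name, values, and reason code are hypothetical), an attribute using the new ComplexPartialScore element might look like this:

<Attribute reasonCode="RC1">
  <SimplePredicate field="age" operator="lessThan" value="30"/>
  <ComplexPartialScore>
    <Apply function="*">
      <FieldRef field="age"/>
      <Constant dataType="double">0.5</Constant>
    </Apply>
  </ComplexPartialScore>
</Attribute>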

Andy Flint (FICO) and I wrote a paper about the Scorecard element for the KDD 2011 PMML Workshop. So, if you haven't seen it yet, it will get you started into how to use PMML to represent scorecards and reason codes.

Revised Output Element

The output element was completely revised. It is much simpler to use. With PMML 4.2, the attribute "feature" gives you direct access to all model outputs as well as all post-processing.

The attribute segmentId also allows users to output particular fields from segments in a multiple model scenario. 
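
As a sketch (the output field and segment names are hypothetical), an Output element using both the feature attribute and segmentId might look like this:

<Output>
  <OutputField name="predictedSpecies" feature="predictedValue"/>
  <OutputField name="segment1Prediction" feature="predictedValue" segmentId="1"/>
</Output>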

The newly revised output element spells flexibility. It allows you to get what you need out of your predictive solutions.

For a complete list of all the changes in PMML 4.2 (small and large), see:

What is new?

PMML 4.2 introduces the use of regular expressions to PMML, solely so that users can process text more efficiently. The most straightforward additions are three new built-in functions for concatenating, replacing, and matching strings using regular expressions.
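
As a sketch (the field name is hypothetical), the new replace function can be applied to collapse runs of whitespace into single spaces:

<Apply function="replace">
  <FieldRef field="customerComment"/>
  <Constant dataType="string">\s+</Constant>
  <Constant dataType="string"> </Constant>
</Apply>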

The more elaborate addition is the incorporation of a brand new transformation element in PMML to extract term frequencies from text. The ideas for this element were presented at the KDD 2013 PMML Workshop by Benjamin De Boe, Misha Bouzinier, Dirk Van Hyfte (InterSystems). Their paper is a great resource for finding out the details behind the ideas that led to the new text mining element in PMML. 

Obviously, the changes described above are also new, but it was nice to break the news into two pieces. For the grand finale, though, nothing is better than taking a look at PMML 4.2 itself.


Wednesday, January 29, 2014

Standards in Predictive Analytics: R, Hadoop and PMML (a white paper by James Taylor)

James Taylor (@jamet123) is remarkable in capturing the nuances and mood of the data analytics and decision management industry and community. As a celebrated author and an avid writer, James has been writing more and more about the technologies that transform Big Data into real value and insights that can then drive smart business decisions. It is not a surprise then that James has just made available a white paper entitled "Standards in Predictive Analytics" focusing on PMML, the Predictive Model Markup Language, R, and Hadoop.

Why R? 

Well, you can use R for pretty much anything in analytics these days. Besides allowing users to do data discovery, it also provides a myriad of packages for model building and predictive analytics.

Why Hadoop? 

It almost goes without saying. Hadoop is an amazing platform for processing predictive analytic models on top of Big Data.

Why PMML? 

PMML is really the glue between model building (say, R, SAS EM, IBM SPSS, KXEN, KNIME, Python scikit-learn, .... ) and the production system. With PMML, moving a model from the scientist's desktop to production (say, Hadoop, Cloud, in-database, ...) is straightforward. It boils down to this:

R -> PMML -> Hadoop
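
As a minimal sketch of that first arrow (assuming the rpart package for the model, with the XML package supplying saveXML):

library(rpart)
library(pmml)
library(XML)

# Train any model type supported by the pmml exporter...
fit <- rpart(Species ~ ., data = iris)

# ...then export it as PMML, ready for a scoring engine to consume.
saveXML(pmml(fit), file = "iris_tree.pmml")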

But, I should stop here and let you read James' wise words yourself. The white paper is available through the Zementis website. To download it, simply click below.


And, if you would like to read James' latest writings, make sure to visit his website:
