Predictive Analytics, Big Data, Hadoop, PMML: February 2014

Thursday, February 27, 2014

PMML 4.2 is here! What changed? What is new?

PMML 4.2 is out! That's really great. The DMG (Data Mining Group) has been working on this new version of PMML for over two years now. And, I can truly say, it is the best PMML ever! If you haven't seen the press release for the new version, please see posting below:

http://www.kdnuggets.com/2014/02/data-mining-group-pmml-v42-predictive-modeling-standard.html

What changed?

PMML is a very mature language. And so, there isn't really dramatic changes in the language at this point. One noteworthy change is that old PMML used to call the target field on a predictive model "predicted". This was confusing since a predicted field is usually the result of scoring or executing a model. The score so to speak. Well, PMML 4.2 clears things up a bit. The target field is now simply "target". A small change, but a huge step towards making it clear that the Output element is where the predicted outputs should be defined.

Continuous Inputs for Naive Bayes Models

This is a great new enhancement to the NaiveBayes model element. We wrote an entire paper about this new feature and presented it at the KDD 2013 PMML Workshop. If you use Naive Bayes models, you should definitely take a look at our article.

http://kdd13pmml.files.wordpress.com/2013/07/guazzelli_et_al.pdf

And, now you can benefit from actually having our proposed changes in PMML itself! This is really remarkable and we are all already benefiting from it. The Zementis Py2PMML (Python to PMML) Converter uses the proposed changes to convert Gaussian Naive Bayes models from scikit-learn to PMML.

https://zementis.zendesk.com/entries/37045093-Exporting-PMML-for-Class-GaussianNB

Complex Point Allocation for Scorecards

The Scorecard model element was introduced to PMML in version 4.1. It was a good element then, but it is really great now in PMML 4.2. We added to it a way for computing complex values for the allocation of points for an attribute (under a certain characteristic) through the use of expressions. That means, you can use input or derived values to derive the actual value for the points. Very cool!

Andy Flint (FICO) and I wrote a paper about the Scorecard element for the KDD 2011 PMML Workshop. So, if you haven't seen it yet, it will get you started into how to use PMML to represent scorecards and reason codes.

http://kdd11pmml.files.wordpress.com/2011/09/p2_flint_guazzelli_kdd_20112.pdf

Revised Output Element

The output element was completely revised. It is much simpler to use. With PMML 4.2, you have direct access to all the model outputs + all post-processing directly from the attribute "feature".

The attribute segmentId also allows users to output particular fields from segments in a multiple model scenario.

The newly revised output element spells flexibility. It allows you to get what you need out of your predictive solutions.

For a complete list of all the changes in PMML 4.2 (small and large), see:

http://www.dmg.org/v4-2/Changes.html

What is new?

PMML 4.2 introduces the use of regular expressions to PMML. This is solely so that users can process text more efficiently. The most straightforward additions are simple: 3 new built-in functions for concatenating, replacing and matching strings using regular expressions.

The more elaborate addition is the incorporation of a brand new transformation element in PMML to extract term frequencies from text. The ideas for this element were presented at the KDD 2013 PMML Workshop by Benjamin De Boe, Misha Bouzinier, Dirk Van Hyfte (InterSystems). Their paper is a great resource for finding out the details behind the ideas that led to the new text mining element in PMML.

http://kdd13pmml.files.wordpress.com/2013/07/extending_the_pmml_text_model_for_text_categorization.pdf

Obviously, the changes described above are also new, but it was nice to break the news into two pieces. For the grand-finale though, nothing better than taking a look at PMML 4.2 itself.

http://www.dmg.org/v4-2/GeneralStructure.html

Enjoy!