
Thursday, February 27, 2014

PMML 4.2 is here! What changed? What is new?

PMML 4.2 is out! That's really great. The DMG (Data Mining Group) has been working on this new version of PMML for over two years now. And, I can truly say, it is the best PMML ever! If you haven't seen the press release for the new version, please see the posting below:

http://www.kdnuggets.com/2014/02/data-mining-group-pmml-v42-predictive-modeling-standard.html

What changed?

PMML is a very mature language. And so, there aren't really any dramatic changes in the language at this point. One noteworthy change is that old PMML used to call the target field of a predictive model "predicted". This was confusing, since a predicted field is usually the result of scoring or executing a model, the score so to speak. Well, PMML 4.2 clears things up a bit. The target field is now simply "target". A small change, but a huge step towards making it clear that the Output element is where the predicted outputs should be defined.

Continuous Inputs for Naive Bayes Models

This is a great new enhancement to the NaiveBayes model element. We wrote an entire paper about this new feature and presented it at the KDD 2013 PMML Workshop. If you use Naive Bayes models, you should definitely take a look at our article.


And, now you can benefit from actually having our proposed changes in PMML itself! This is really remarkable, and we are already benefiting from it. The Zementis Py2PMML (Python to PMML) Converter uses the proposed changes to convert Gaussian Naive Bayes models from scikit-learn to PMML.
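To give you a flavor of what this looks like (the field name, class names, and numbers below are purely illustrative), a continuous input can now be described inside a BayesInput element with one distribution per target class:

  <BayesInput fieldName="petalLength">
    <TargetValueStats>
      <!-- one Gaussian per target class; mean/variance values are made up -->
      <TargetValueStat value="setosa">
        <GaussianDistribution mean="1.46" variance="0.03"/>
      </TargetValueStat>
      <TargetValueStat value="versicolor">
        <GaussianDistribution mean="4.26" variance="0.22"/>
      </TargetValueStat>
    </TargetValueStats>
  </BayesInput>

In earlier versions, continuous inputs typically had to be discretized first; now the distribution itself travels with the model.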

Complex Point Allocation for Scorecards

The Scorecard model element was introduced to PMML in version 4.1. It was a good element then, but it is really great now in PMML 4.2. We added a way to compute complex values for the allocation of points for an attribute (under a certain characteristic) through the use of expressions. That means you can use input or derived values to compute the actual value of the points. Very cool!
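For illustration (the characteristic, field name, and numbers below are hypothetical), an Attribute can now carry a ComplexPartialScore whose expression computes the points from the input itself, instead of a fixed partialScore:

  <Characteristic name="incomeScore" reasonCode="RC_INCOME" baselineScore="20">
    <Attribute>
      <SimplePredicate field="income" operator="greaterOrEqual" value="50000"/>
      <!-- points = 10 + 0.0002 * income; all values are made up -->
      <ComplexPartialScore>
        <Apply function="+">
          <Constant dataType="double">10</Constant>
          <Apply function="*">
            <Constant dataType="double">0.0002</Constant>
            <FieldRef field="income"/>
          </Apply>
        </Apply>
      </ComplexPartialScore>
    </Attribute>
  </Characteristic>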

Andy Flint (FICO) and I wrote a paper about the Scorecard element for the KDD 2011 PMML Workshop. So, if you haven't seen it yet, it will get you started into how to use PMML to represent scorecards and reason codes.

Revised Output Element

The Output element was completely revised. It is much simpler to use. With PMML 4.2, you have direct access to all the model outputs as well as to all post-processing, directly from the attribute "feature".

The attribute segmentId also allows users to output particular fields from segments in a multiple model scenario. 
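As a quick sketch (the output names and the segment id below are made up), a single Output element can now expose the predicted class, a class probability, and a result computed by one particular segment:

  <Output>
    <!-- the "feature" attribute selects what each output field carries -->
    <OutputField name="predictedClass" feature="predictedValue"/>
    <OutputField name="probabilityYes" dataType="double" feature="probability" value="yes"/>
    <!-- pull a result out of one segment of a multiple model -->
    <OutputField name="segmentScore" dataType="double" feature="predictedValue" segmentId="segment_3"/>
  </Output>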

The newly revised output element spells flexibility. It allows you to get what you need out of your predictive solutions.

For a complete list of all the changes in PMML 4.2 (small and large), see:

What is new?

PMML 4.2 introduces the use of regular expressions to PMML. This is solely so that users can process text more efficiently. The most straightforward additions are simple: 3 new built-in functions for concatenating, replacing and matching strings using regular expressions.
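As a small sketch (the field names below are hypothetical), one of the new functions, replace, can be used inside a DerivedField to apply a regular expression to a text input:

  <DerivedField name="cleanedText" optype="categorical" dataType="string">
    <!-- replace(input, regex, replacement): replace runs of digits with "#" -->
    <Apply function="replace">
      <FieldRef field="rawText"/>
      <Constant dataType="string">[0-9]+</Constant>
      <Constant dataType="string">#</Constant>
    </Apply>
  </DerivedField>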

The more elaborate addition is the incorporation of a brand new transformation element in PMML to extract term frequencies from text. The ideas for this element were presented at the KDD 2013 PMML Workshop by Benjamin De Boe, Misha Bouzinier, and Dirk Van Hyfte (InterSystems). Their paper is a great resource for finding out the details behind the ideas that led to the new text mining element in PMML.
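To give a flavor of the new element (the field name and term below are made up), a TextIndex expression returns how often a term occurs in a text field:

  <DerivedField name="freq_good" optype="continuous" dataType="double">
    <!-- count occurrences of the term "good" in the review text -->
    <TextIndex textField="review" localTermWeights="termFrequency" isCaseSensitive="false">
      <Constant dataType="string">good</Constant>
    </TextIndex>
  </DerivedField>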


Obviously, the changes described above are also new, but it was nice to break the news into two pieces. For the grand finale though, nothing is better than taking a look at PMML 4.2 itself.


Enjoy!

Wednesday, January 29, 2014

Standards in Predictive Analytics: R, Hadoop and PMML (a white paper by James Taylor)

James Taylor (@jamet123) is remarkable in capturing the nuances and mood of the data analytics and decision management industry and community. As a celebrated author and an avid writer, James has been writing more and more about the technologies that transform Big Data into real value and insights that can then drive smart business decisions. It is not a surprise then that James has just made available a white paper entitled "Standards in Predictive Analytics" focusing on PMML, the Predictive Model Markup Language, R, and Hadoop.



Why R? 

Well, you can use R for pretty much anything in analytics these days. Besides allowing users to do data discovery, it also provides a myriad of packages for model building and predictive analytics.

Why Hadoop? 

It almost goes without saying. Hadoop is an amazing platform for processing predictive analytic models on top of Big Data.

Why PMML? 

PMML is really the glue between model building (say, R, SAS EM, IBM SPSS, KXEN, KNIME, Python scikit-learn, .... ) and the production system. With PMML, moving a model from the scientist's desktop to production (say, Hadoop, Cloud, in-database, ...) is straightforward. It boils down to this:

R -> PMML -> Hadoop


But, I should stop here and let you read James' wise words yourself. The white paper is available through the Zementis website. To download it, simply click below.

DOWNLOAD WHITE PAPER

And, if you would like to read James' latest writings, make sure to visit his website: JTonEDM.com


Monday, July 16, 2012

Predicting the future ... in four parts

I recently finished writing a four-part article series about predictive analytics entitled Predicting the Future. The topic is near and dear to my heart, since I have been working in the field since my undergrad years back in Brazil (more than 20 years ago) and, lately, through my work with PMML, the Predictive Model Markup Language.

The four articles have just been published by IBM in their entirety in the developerWorks website together with a video in which I introduce each article.



The articles themselves can be found here:
  1. Predicting the future, Part 1: What is predictive analytics?
  2. Predicting the future, Part 2: Predictive modeling techniques
  3. Predicting the future, Part 3: Create a predictive solution
  4. Predicting the future, Part 4: Put a predictive solution to work
And, if you are interested in learning about open-standards and predictive analytics, I would also recommend the following articles:

Enjoy!

Friday, July 13, 2012

Webcast: Predictive Analytics on Hadoop

UPDATE: Thanks for your interest in our joint webinar with Datameer: Predictive Analytics on Hadoop. If you were not able to attend or would like to watch it again at your own pace, just click HERE.


To extract value and insight from "Big Data", leading organizations increasingly leverage predictive analytics. By using statistical techniques that uncover important patterns present in historical data, companies are able to predict the future. In doing so, they become more precise, consistent and automated in everyday business decisions.


Please join the Datameer/Zementis webcast entitled Predictive Analytics on Hadoop: Gaining Faster Insights through Open Standards to learn to efficiently derive predictions from very large volumes of structured and unstructured data.

WHEN: Thursday, July 19, 2012, 10:00 am PT / 1:00 pm ET

Free registration 

In this webinar, we showcase the technical capabilities of the Universal PMML Plug-in for Datameer, a solution that combines open standards and Hadoop to reduce complexity and accelerate time-to-market for predictive analytics in any industry and for any business application.

Leave this webinar knowing:

  • The benefits of the Predictive Model Markup Language (PMML) standard as a data science best practice for data mining 
  • How to leverage predictive analytics in the context of big data 
  • How to reduce the cost and complexity of predictive analytics 

 You can register HERE

Wednesday, January 26, 2011

Predictive analytics and the power of open standards and cloud computing

Organizations around the globe increasingly recognize the value that predictive analytics offers to their business. The complexity of development, integration, and deployment of predictive models, however, is often considered cost-prohibitive for many projects. In light of mature open source solutions, open standards, and SOA principles, we offer an agile model development life cycle that allows us to quickly leverage predictive analytics in operational environments.

Starting with data analysis and model development, you can effectively use the Predictive Model Markup Language (PMML) standard to move complex decision models from the scientist's desktop into a scalable production environment hosted on the Amazon Elastic Compute Cloud (Amazon EC2).

Expressing Models in PMML

PMML is an XML-based language used to define predictive models. It was specified by the Data Mining Group (DMG), an independent group of leading technology companies including Zementis. By providing a uniform standard to represent such models, PMML allows for the exchange of predictive solutions between different applications and various vendors.

Open source statistical tools such as R can be used to develop data mining models based on historical data. R allows for models to be exported into PMML which can then be imported into an operational decision platform and be ready for production use in a matter of minutes.

On-Demand Predictive Analytics

Amazon EC2 is a reliable, on-demand infrastructure on which we offer the ADAPA Predictive Decisioning Engine based on the Software as a Service (SaaS) paradigm. ADAPA imports models expressed in PMML and executes these in batch mode, or real-time via web-services.

Our service is implemented as a private, dedicated Amazon EC2 instance of ADAPA. Each client has access to his/her own ADAPA instance via HTTP/HTTPS. In this way, models and data for one client never share the same engine with other clients.

Using a SaaS solution to break down traditional barriers that currently slow the adoption of predictive analytics, our strategy translates predictive models into operational assets with minimal deployment costs and leverages the inherent scalability of utility computing.

In summary, ADAPA allows for:
  • Cost-effective and reliable service based on Amazon’s EC2 infrastructure
  • Secure execution of predictive models through dedicated and controlled instances, including HTTPS and web-services security
  • On-demand computing: choice of instance type (small, large, extra-large, ...) and launch of multiple instances
  • Superior time-to-market through rapid deployment of predictive models and an agile enterprise decision management environment
For a practical guide, watch:

Friday, November 6, 2009

PMML and Open Source Data Mining

Open source tools provide a cost-effective, yet powerful option for data mining. The following contenders adhere to the PMML standard which facilitates model exchange among open source and commercial vendors, providing a definitive route for production deployment of predictive models.
The R Project
The R Project for Statistical Computing is definitely the most used and revered statistical package among advocates of open-source and community computing projects. Like the iPhone App Store, you can find basically anything you need in CRAN, the Comprehensive R Archive Network (anything statistical, that is ... no navigation system for R just yet). It is in CRAN that you will find the R PMML Package. This package allows R users to export PMML for a variety of models, including decision trees and neural networks (among many others). We recently co-authored an article with Graham Williams, the original author and maintainer of the package. It can be downloaded directly from The R Journal website. If you are interested in contributing code to the package, please contact us.
KNIME
Developed by the University of Konstanz, KNIME is an open-source platform that enables users to visually create and execute data flows. Since KNIME 2.0 (available as of December 2008), users can import and export PMML models into and out of KNIME. And, since R can be run from within KNIME, the R PMML package can also be used to export R models to PMML there. New versions of KNIME will most certainly expand its PMML support even further.
Weka
Developed by the University of Waikato, Weka provides a large collection of machine learning algorithms for solving data mining problems. Although Weka currently has no export functionality for PMML, Mark Hall is working on implementing import functionality for it. Weka can already import models such as regression, decision trees and neural networks. PMML support in Weka is constantly expanding with the addition of transformations and built-in functions.

RapidMiner
Most recently, Rapid-I announced that it will extend the latest version of its RapidMiner software to include support for PMML. RapidMiner, formerly known as YALE, is an open-source platform that offers operators for all aspects of data mining. As with KNIME, Rapid-I is one of the latest companies to join the ranks of the Data Mining Group (DMG), alongside companies like IBM, Microstrategy, SPSS, SAS and Zementis. The DMG is already busy at work refining and adding yet more capabilities and power to PMML.

PMML Discussion Forums
For an on-going discussion and to read about the latest PMML news, we would like to invite you to join the PMML group in LinkedIn or the discussion forum in the PMML group on Analytic Bridge, a social network community for analytics professionals. For PMML resources, examples, and useful links, please take a look at the PMML page on the Zementis website.

Thursday, October 15, 2009

Latest issue of ACM SIGKDD Explorations focuses on open source analytics, PMML and cloud computing.


The latest issue of ACM SIGKDD Explorations is out! This issue is relevant in many ways, since it not only gives special attention to open source analytics (including articles on Weka and KNIME), but it also discusses PMML and cloud computing.

PMML, in particular, gets special treatment. It is described in a full article written by Rick Pechter from Microstrategy. As Rick puts it, "the Predictive Model Markup Language data mining standard has arguably become one of the most widely adopted data mining standards in use today."

PMML is also discussed in most of the other articles, including the one by Zementis, entitled "Efficient Deployment of Predictive Analytics through Open Standards and Cloud Computing". In this article, we use the ADAPA scoring engine to illustrate how PMML and cloud computing can be combined into a platform that delivers an efficient deployment process for statistical models.

So, don't miss out on this issue of SIGKDD Explorations. We invite you to explore all the peer-reviewed articles in detail.

Friday, April 3, 2009

PMML 101

The Predictive Model Markup Language (PMML) is an XML-based language developed by the Data Mining Group (DMG) that provides a way for applications to define statistical and data mining models and to share those models between PMML-compliant applications.

PMML provides applications with a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application and use other vendors' applications to visualize, analyze, evaluate or otherwise use those models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is now straightforward.

Since PMML is an XML-based standard, the specification comes in the form of an XML Schema.

PMML Components

PMML follows a very intuitive structure to describe a data mining model, be it an artificial neural network or a logistic regression model. Sequentially, it can be described by the following components:

PMML Elements - A PMML file is highly structured. The list of PMML elements allows for both data manipulation and the model itself to be expressed in a single PMML file.

Header: contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model such as name and version. It also contains an attribute for a timestamp which can be used to specify the date of model creation.
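A minimal sketch of a Header (the application name, dates, and text are illustrative):

  <Header copyright="Copyright (c) 2009 Example Corp." description="Sample logistic regression model">
    <Application name="SomeModelingTool" version="1.0"/>
    <Timestamp>2009-04-03T10:00:00</Timestamp>
  </Header>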

Data Dictionary: contains definitions for all the possible fields used by the model. It is in the data dictionary that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined as well as the data type (such as, string or double).
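A sketch of a Data Dictionary, reusing the field names from the example at the end of this post (ranges and values are illustrative):

  <DataDictionary numberOfFields="4">
    <!-- continuous inputs -->
    <DataField name="female" optype="continuous" dataType="double"/>
    <DataField name="read_score" optype="continuous" dataType="double">
      <Interval closure="closedClosed" leftMargin="0" rightMargin="100"/>
    </DataField>
    <DataField name="science_score" optype="continuous" dataType="double"/>
    <!-- the categorical target with its admissible values -->
    <DataField name="honcomp" optype="categorical" dataType="string">
      <Value value="0"/>
      <Value value="1"/>
    </DataField>
  </DataDictionary>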

Data Transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of data transformations.
  • Normalization: map values to numbers, the input can be continuous or discrete.

  • Discretization: map continuous values to discrete values.

  • Value mapping: map discrete values to discrete values.

  • Functions: derive a value by applying a function to one or more parameters.

  • Aggregation: used to summarize or collect groups of values.
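Here is a short sketch of two of these transformations expressed as derived fields (the field names, ranges, and bins are illustrative):

  <TransformationDictionary>
    <!-- Normalization: map read_score from 0-100 onto 0-1 -->
    <DerivedField name="read_score_norm" optype="continuous" dataType="double">
      <NormContinuous field="read_score">
        <LinearNorm orig="0" norm="0"/>
        <LinearNorm orig="100" norm="1"/>
      </NormContinuous>
    </DerivedField>
    <!-- Discretization: bin read_score into "low" and "high" -->
    <DerivedField name="read_band" optype="categorical" dataType="string">
      <Discretize field="read_score">
        <DiscretizeBin binValue="low">
          <Interval closure="closedOpen" leftMargin="0" rightMargin="50"/>
        </DiscretizeBin>
        <DiscretizeBin binValue="high">
          <Interval closure="closedClosed" leftMargin="50" rightMargin="100"/>
        </DiscretizeBin>
      </Discretize>
    </DerivedField>
  </TransformationDictionary>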
The ability to represent data transformations (as well as outlier and missing value treatment methods) in conjunction with the model itself is a major advantage of PMML. When it comes to the actual use of a PMML model, pre- and post-processing are embedded in the PMML file itself. All that is needed is the raw input data, and users are good to go (see useful links for a primer on how to represent data pre-processing in PMML).

Data transformations and predictive models are represented in a single PMML file, which facilitates model deployment.

Model: contains the definition of the data mining model. A multi-layered feed-forward neural network is the most common neural network representation in contemporary applications, given the popularity and efficacy associated with its training algorithm known as Backpropagation. Such a network is represented in PMML by a "NeuralNetwork" element which contains attributes such as:
  • Model Name (attribute modelName)

  • Function Name (attribute functionName)

  • Algorithm Name (attribute algorithmName)

  • Activation Function (attribute activationFunction)

  • Number of Layers (attribute numberOfLayers)

This information is then followed by three kinds of neural layers which specify the architecture of the neural network model being represented in the PMML document. These elements are NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other data mining models, including support vector machines, association rules, naive Bayes classifiers, clustering models, text models, decision trees, and different regression models.
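As a structural sketch only (the names, sizes, weights, and biases are made up, and most of the network is elided), a NeuralNetwork element looks like this:

  <NeuralNetwork modelName="sampleNet" functionName="classification"
                 algorithmName="backPropagation" activationFunction="logistic"
                 numberOfLayers="2">
    <MiningSchema>
      <MiningField name="read_score" usageType="active"/>
      <MiningField name="honcomp" usageType="predicted"/>
    </MiningSchema>
    <NeuralInputs numberOfInputs="1">
      <NeuralInput id="0">
        <DerivedField optype="continuous" dataType="double">
          <FieldRef field="read_score"/>
        </DerivedField>
      </NeuralInput>
    </NeuralInputs>
    <NeuralLayer numberOfNeurons="2">
      <Neuron id="1" bias="0.1"><Con from="0" weight="0.5"/></Neuron>
      <Neuron id="2" bias="-0.3"><Con from="0" weight="1.2"/></Neuron>
    </NeuralLayer>
    <!-- ... the output layer and the NeuralOutputs mapping back to honcomp are omitted ... -->
  </NeuralNetwork>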

Mining Schema: the mining schema lists all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:
  • Name (attribute name): must refer to a field in the data dictionary.

  • Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.

  • Outlier Treatment (attribute outliers): defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.

  • Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified, then a missing value is automatically replaced by the given value.

  • Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).
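Putting a few of these attributes together, a sketch of a mining schema (all names and numbers are illustrative):

  <MiningSchema>
    <!-- an active input with outlier clipping and missing value replacement -->
    <MiningField name="read_score" usageType="active"
                 outliers="asExtremeValues" lowValue="0" highValue="100"
                 missingValueReplacement="52" missingValueTreatment="asMean"/>
    <MiningField name="female" usageType="active"/>
    <!-- the field whose value the model predicts -->
    <MiningField name="honcomp" usageType="predicted"/>
  </MiningSchema>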
Targets: The targets element allows for the scaling of predicted variables. It is a straightforward way to represent the post-processing of raw outputs.
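For example (the field name and factors below are hypothetical), the raw output of a regression model could be rescaled like this:

  <Targets>
    <!-- scaled output = raw output * 1000 + 500 -->
    <Target field="amount" rescaleFactor="1000" rescaleConstant="500"/>
  </Targets>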


Supported Modeling Techniques

The list of modeling techniques supported by the PMML standard is constantly being updated. Version 4.1 supports the following techniques:
  • Neural Networks (feed-forward neural networks as well as radial basis function networks)
  • Decision Trees (with coding for several missing value strategies)
  • Support Vector Machines
  • Linear and Logistic Regression Models (via a generic representation or a simplified one)
  • Association Rules
  • Clustering
  • Naive Bayes
  • Sequences
  • Text Models
  • Time Series
  • Rulesets
  • Scorecards
  • K-Nearest Neighbors
  • Baseline Models


PMML Example

The example below shows a PMML file used to represent a logistic regression model. In this model, the predicted variable is named honcomp. Note that this is a very simple model. There are only three input variables (female, read_score, and science_score), all of type double. There is no pre-processing of the raw input variables, so they are fed directly into the regression model, which produces a value for honcomp (0 or 1).

PMML Example - File containing a simple regression model expressed in PMML.
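The original file is linked above rather than inlined here, but the sketch below shows what such a model can look like (the coefficients and intercepts are made up for illustration, and the Header and Data Dictionary are omitted):

  <RegressionModel modelName="HonorsCompositionLogit" functionName="classification"
                   algorithmName="logisticRegression" normalizationMethod="softmax">
    <MiningSchema>
      <MiningField name="female" usageType="active"/>
      <MiningField name="read_score" usageType="active"/>
      <MiningField name="science_score" usageType="active"/>
      <MiningField name="honcomp" usageType="predicted"/>
    </MiningSchema>
    <!-- one table per target category; coefficients are illustrative -->
    <RegressionTable intercept="-9.5" targetCategory="1">
      <NumericPredictor name="female" coefficient="1.48"/>
      <NumericPredictor name="read_score" coefficient="0.10"/>
      <NumericPredictor name="science_score" coefficient="0.05"/>
    </RegressionTable>
    <RegressionTable intercept="0" targetCategory="0"/>
  </RegressionModel>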
A comprehensive list of PMML examples can be found at the Zementis website - Examples Page.

PMML Products

A range of products are being offered to produce and consume PMML. Please check the following page at the DMG website for an updated list of PMML-powered products:

http://www.dmg.org/products.html

Useful Pages

  • PMML Examples - A list of PMML 3.2 files including neural network models, support vector machines, decision trees, regression models and clustering.

  • ADAPA Predictive Analytics Engine - Available as a Service through the Amazon Elastic Compute Cloud, the ADAPA engine can import several PMML models. After uploading, models are available for scoring or verification.

Welcome to the World of Predictive Analytics!
