Wednesday, December 14, 2011

Operational Deployment of Predictive Solutions: Lost in Translation? Not with PMML

Traditionally, the deployment of predictive solutions has been, to put it mildly, cumbersome. As shown in the figure below, data mining scientists work hard to analyze historical data and to build the best predictive solutions out of it. Engineers, on the other hand, are usually responsible for bringing these solutions to life by recoding them into a format suitable for production deployment. Given that data mining scientists and engineers tend to inhabit different information worlds, the process of moving a predictive solution from the scientist's desktop to production can get lost in translation.

Luckily, the advent of PMML (Predictive Model Markup Language) changed this scenario radically. PMML is the de facto standard for representing predictive solutions. There is no need for scientists to write a Word document describing the solution; they can simply export it as a PMML file. Today, all major data mining tools and statistical packages support PMML. These include IBM SPSS, SAS, R, KNIME, RapidMiner, KXEN, ... Also, tools such as the Zementis Transformations Generator and KNIME allow for easy PMML coding of pre- and post-processing steps.

Great! Once a PMML file exists, it can be easily deployed in production with ADAPA, the Zementis scoring engine. ADAPA even allows models to be deployed in the Amazon Cloud and accessed from anywhere via web services. Zementis also offers in-database scoring via its Universal PMML Plug-in, which is also available for Hadoop. In this way, a process that could take six months now takes minutes.

PMML and ADAPA have transformed model deployment forever. If you or your company are still spending time and resources deploying your predictive analytics the traditional way, make sure to contact us. The secret behind exceptional predictive analytics is out!

Friday, December 9, 2011

PMML and Association Rules

An association rule describes a relation between one group of objects and another. Put another way: "If condition A is satisfied, then so is condition B." As an example, consider the items people purchase in a grocery store. Suppose most people who buy milk also buy juice, and most people who buy chicken and beef also buy bread. Then two association rules exist:

[Milk] --> [Juice]
If you buy milk, then you will also buy juice

[Chicken,Beef] --> [Bread]
If you buy chicken and beef, then you will also buy bread
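In PMML, such rules live inside an AssociationModel element: each distinct item, each itemset, and each rule is declared explicitly, along with statistics such as support and confidence. A minimal sketch for the two rules above might look like the following (the numberOfTransactions, support, and confidence values here are illustrative only, not derived from real data):

```xml
<AssociationModel functionName="associationRules"
                  numberOfTransactions="100" numberOfItems="5"
                  minimumSupport="0.1" minimumConfidence="0.5"
                  numberOfItemsets="4" numberOfRules="2">
  <MiningSchema>
    <MiningField name="ID" usageType="group"/>
    <MiningField name="item" usageType="active"/>
  </MiningSchema>
  <!-- every distinct item gets an id -->
  <Item id="1" value="Milk"/>
  <Item id="2" value="Juice"/>
  <Item id="3" value="Chicken"/>
  <Item id="4" value="Beef"/>
  <Item id="5" value="Bread"/>
  <!-- itemsets group items referenced by the rules -->
  <Itemset id="1"><ItemRef itemRef="1"/></Itemset>
  <Itemset id="2"><ItemRef itemRef="2"/></Itemset>
  <Itemset id="3"><ItemRef itemRef="3"/><ItemRef itemRef="4"/></Itemset>
  <Itemset id="4"><ItemRef itemRef="5"/></Itemset>
  <!-- [Milk] --> [Juice] and [Chicken,Beef] --> [Bread] -->
  <AssociationRule support="0.3" confidence="0.8" antecedent="1" consequent="2"/>
  <AssociationRule support="0.2" confidence="0.7" antecedent="3" consequent="4"/>
</AssociationModel>
```

At scoring time, the engine matches the items in an incoming transaction against the antecedent itemsets and fires the corresponding rules.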

Data Processing

Normally, as in a typical regression model, one data row (or record) is read at a time and one output is given back. In particular, one input value is read for each of the input variables required by the model, each positioned in a different column of the same row. Once read in, the input record is processed through the model, and the result, or output, is appended to the data as an extra column holding the predicted value or score. For association rules, on the other hand, multiple items of a single transaction need to be read in and processed before an output can be returned. As suggested by the example above, "Chicken" and "Beef" need to be read in before "Bread" is produced as an output. In the usual one-value-per-column format, there is no single cell that can hold an entire transaction. For association rules, two different data processing methods can therefore be used to read all the items belonging to a single transaction.

These two methods allow the data to be expressed either in a "rectangular" format or in a "transactional" format.

Rectangular Format

The rectangular format devotes a separate column to each possible item, with one transaction per row. For the above example, if customers purchase from a list of five possible items (Milk, Juice, Chicken, Beef, and Bread), the input data might be represented as:
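The original table image is not shown here, but a plausible rectangular data file for this example (the item name repeated in its own column when purchased, blank otherwise) would look like:

```
Milk,Juice,Chicken,Beef,Bread
Milk,Juice,,,
,,Chicken,Beef,Bread
```

The first row is the header, the second row is a transaction containing Milk and Juice, and the third row is a transaction containing Chicken, Beef, and Bread.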


Note that the first row specifies the header, while the third row, for example, specifies that Chicken, Beef, and Bread were purchased together. Of course, it is not clear from the data alone whether chicken and beef imply bread, or whether chicken implies bread and beef; but together with the PMML file, the scoring engine is able to deduce the correct relationships. And so, for a "rectangular" data file, the output is added to the same row as an extra column.

The PMML file for each format is different as well. In a "rectangular" PMML file, all the possible values or items are defined as separate fields, i.e., as separate "MiningFields" under the "MiningSchema" element. For the example above, instead of a single "MiningField" for the entire purchase, one would have five "MiningFields": Milk, Juice, Chicken, Beef, and Bread, as follows:

<MiningField name="Milk" usageType="active"/>
<MiningField name="Juice" usageType="active"/>
<MiningField name="Chicken" usageType="active"/>
<MiningField name="Beef" usageType="active"/>
<MiningField name="Bread" usageType="active"/>

PMML Example - Association Rules in Rectangular Format

For an example of a PMML file and its corresponding data file in rectangular format, click HERE.

Transactional Format

The "transactional" format, on the other hand, allows the input data to be specified in just two columns: the first contains a transaction identifier and the second contains the items. For the example above, the data file might be represented as:
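The original table image is missing, but based on the description that follows, the transactional data file would look like:

```
ID,item
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread
```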


The identifier (column "ID") indicates which items belong together. And so, in this example, ID = 1 specifies that the first two items (Milk and Juice) belong to the same input group or transaction, while ID = 2 indicates that Chicken, Beef and Bread belong to a different group. In this case, for the "transactional" data file, the predicted value is added as an extra column in the first row of each group only.

A "transactional" PMML file defines two "MiningFields". One is of type "group", which indicates which group (transaction) the items belong to. The second is of type "active", which covers, as in our example, all the possible items that were purchased. Note that it is not necessary to list all items one by one. And so, the "MiningSchema" in a "transactional" PMML file might look like:

<MiningField name="ID" usageType="group"/>
<MiningField name="item" usageType="active"/>

In this case, the rows with the same "ID" belong together: since Milk and Juice in our example both have ID = 1, they are both in the same group. The second column, titled "item" in the data file, lists all the items for that group: Milk and Juice. One can thus read the first group as: "Milk and juice are purchased together."

PMML Example - Association Rules in Transactional Format

For an example of a PMML file and its corresponding data file in transactional format, click HERE.

Real-Time Recommendations, KNIME, and PMML

What is KNIME?

According to the KNIME website:

"KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform."

Yes, KNIME is user-friendly, not only because it offers an intuitive GUI for analyzing data, but also because it is open source. KNIME is also standards-friendly. KNIME 2.0, released in 2008, was the first release to offer PMML support. Since then, PMML support in KNIME has matured considerably, from the import and export of predictive models all the way to the pre-processing of input variables. KNIME 2.5, released on December 1, 2011, offers a series of PMML-enabled pre-processing nodes which can be embedded automatically in the final PMML model. All these features are documented in a paper presented at the KDD 2011 PMML Workshop:

Peer-reviewed article: KDD 2011 - PMML Pre-processing in KNIME

The picture below shows part of a typical workflow in KNIME. Note that KNIME nodes now come with "blue" ports which signify PMML support. In this way, one can link a series of PMML-enabled pre-processing nodes to a model and obtain not only the model but also all the pre-processing steps in the resulting PMML file.

Want to see more? Take a look at a step-by-step example of KNIME and PMML at work.

Whenever a PMML file is exported by KNIME, it can be directly deployed in any of the Zementis scoring products, including the ADAPA Scoring Engine or the Universal PMML Plug-in for in-database scoring. This enables models to be ready for operational use right away.

Social Media, Recommendations, and Real-Time Execution with KNIME and ADAPA

There is a lot of theory and hype around social media, recommendation engines, and real-time modeling, but until now there have been few practical examples that can be measured in terms of ROI. KNIME AG and Zementis have teamed up to provide a white paper which summarizes a practical case study combining all three topics and delivers a measured, solid business case.

Our case study is just one example of how advanced analytics combined with real-time execution delivers real-world benefits for organizations. Whether the requirement is to control risk, increase customer personalization, or maximize sales and margin, the combination of KNIME and ADAPA is ideal for leveraging the power of data, providing an end-to-end solution from model development to operational deployment and real-time execution within any business process.

Download our white paper today: Social Media, Recommendation Engines and Real-Time Model Execution with KNIME and ADAPA

Welcome to the World of Predictive Analytics!

© Predictive Analytics by Zementis, Inc. - All Rights Reserved.
