Welcome to the World of Predictive Analytics!

© Predictive Analytics by Zementis, Inc. - All Rights Reserved.

Wednesday, January 11, 2012

PMML 4.1 is here! Mature standard for predictive analytics

The Predictive Model Markup Language (PMML) is an XML-based language developed by the Data Mining Group (DMG) which provides a way for applications to define statistical and data mining models and to share models between PMML compliant applications.

PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application and use another vendor's applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is now straightforward.

The adoption of PMML by the major analytic vendors is a great example of companies embracing interoperability. IBM, SAS, MicroStrategy, FICO, Equifax, NASA, Salford Systems and Zementis, for example, are part of the Data Mining Group (DMG), the committee shaping PMML. Open-source companies such as KNIME, Open Data Group, and Rapid-I are also part of the committee.

PMML made its debut in 1997. Today, it is a mature and refined language. The latest version of PMML, 4.1, released in December 2011, adds three new model elements. These are:
  1. Scorecard: This new element is used to represent Scorecards, a commercially significant formulation of predictive models. Scorecards are used extensively in retail banking to estimate and rank-order consumer credit risk. Scorecards are usually associated with adverse or reason codes, and so PMML 4.1 also introduced the ability to represent reason codes for explaining any adverse actions derived from a scorecard.

  2. NearestNeighborModel: This new element is used to represent k-Nearest Neighbors. k-NN is an instance-based learning algorithm. In a k-NN model, the prediction is based on the k training instances closest to the case being scored. Therefore, all training cases have to be stored inside the PMML file itself. For cases in which the amount of data is quite large, PMML allows for it to be referenced externally.

  3. BaselineModel: This element is used to represent Baseline Models. These types of models are used for defining a change detection model.
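
The k-NN scoring described in item 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not PMML or ADAPA code: the function name, the sample data, the use of Euclidean distance and the majority vote are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(training, case, k=3):
    """Classify `case` by majority vote among its k nearest training instances.

    `training` is a list of (features, label) pairs. Euclidean distance and
    majority voting are assumptions made for this sketch.
    """
    # Sort the stored training instances by their distance to the case being scored.
    by_distance = sorted(training, key=lambda row: math.dist(row[0], case))
    # Count the labels of the k closest instances and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training instances: two numeric inputs and a class label.
training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((5.0, 5.0), "high"),
    ((5.5, 4.5), "high"),
    ((0.9, 1.1), "low"),
]

print(knn_predict(training, (1.1, 1.0), k=3))
```

Because every prediction needs the training instances themselves, the sketch keeps them in memory, which mirrors why a k-NN PMML file has to carry (or reference) the training data.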

PMML 4.1 also adds to the language:
  • Generic Post-Processing Capabilities: In previous PMML versions, element "Targets" got all the attention for its ability to implement scaling. PMML 4.1 brought the post-processing capabilities of PMML to a new higher level by expanding the role of element Output. This element can now be used not only to represent scaling, but also any type of data manipulation, since it allows for transformations and built-in functions to be applied to any output values. It also allows for the definition of thresholds and business decisions which can be used as the final model output.

  • Simplified Multiple Models Capabilities: Besides making the representation of multiple models simpler, PMML 4.1 also made it more generic. The latest PMML release has deprecated the existing model composition approach and now allows for composition to take place inside a generic "Segmentation" element. In this way, a single element can now be used to represent model segmentation, model ensemble, and model composition.

  • New Built-in Functions: Three new functions were added to the language's existing plethora of built-in functions. Through its logical, arithmetic and string operators, PMML is capable of representing a myriad of data pre-processing steps.
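
To make the post-processing idea from the first bullet concrete, here is a small Python sketch of scaling a raw score and then turning it into a business decision through a threshold. The function name, the scale factor, the threshold and the decision labels are all hypothetical; in PMML itself this logic would be expressed inside the Output element rather than in code.

```python
def post_process(raw_score, scale=100.0, threshold=60.0):
    """Scale a raw model score, then map it to a business decision.

    The scale factor, threshold and decision labels are hypothetical values
    chosen for illustration only.
    """
    scaled = raw_score * scale
    # Scores at or above the threshold are routed for manual review.
    decision = "refer" if scaled >= threshold else "approve"
    return {"scaledScore": scaled, "decision": decision}

print(post_process(0.75))
print(post_process(0.25))
```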

PMML 4.1 also adds a new "isScorable" attribute to all existing model elements to signal whether a model is production ready or not. It also offers a new document that specifies all the rules around field scope and field names that were previously scattered over several documents. Scope becomes an important issue when a PMML file is used to represent multiple nested models.

As the de facto standard to represent predictive solutions, PMML allows model(s) and data transformations to be represented together in a single and concise way. When used to represent all the computations that make up a predictive solution, PMML becomes the bridge not only between data analysis, model building, and deployment systems, but also between all the people and teams involved in the analytical process inside a company. Needless to say, PMML is already shaping the world of predictive analytics.

Resources
  • Check out the DMG website to review all new and pre-4.1 PMML language elements

Wednesday, December 14, 2011

Operational Deployment of Predictive Solutions: Lost in Translation? Not with PMML

Traditionally, the deployment of predictive solutions has been, to put it mildly, cumbersome. As shown in the Figure below, data mining scientists work hard to analyze historical data and to build the best predictive solutions out of it. Engineers, on the other hand, are usually responsible for bringing these solutions to life by recoding them into a format suitable for production deployment. Given that data mining scientists and engineers tend to inhabit different information worlds, the process of moving a predictive solution from the scientist's desktop to production can get lost in translation.


Luckily, the advent of PMML (Predictive Model Markup Language) changed this scenario radically. PMML is the de facto standard used to represent predictive solutions. In this way, there is no need for scientists to write a Word document describing the solution. They can just export it as a PMML file. Today, all major data mining tools and statistical packages support PMML. These include IBM SPSS, SAS, R, KNIME, RapidMiner, KXEN, ... Also, tools such as the Zementis Transformations Generator and KNIME allow for easy PMML coding for pre- and post-processing steps.

Great! Once a PMML file exists, it can be easily deployed in production with ADAPA, the Zementis scoring engine. ADAPA even allows for models to be deployed in the Amazon Cloud and be accessed from anywhere via web services. Zementis also offers in-database scoring via its Universal PMML Plug-in, which is also available for Hadoop. In this way, a process that could take 6 months now takes minutes.


PMML and ADAPA have transformed model deployment forever. If you or your company are still spending time and resources in deploying your predictive analytics the traditional way, make sure to contact us. The secret behind exceptional predictive analytics is out!

Friday, December 9, 2011

PMML and Association Rules

An association rule describes a relation between one group of objects and another group of objects. This may be said in another way: "If a condition A is satisfied, then so is condition B". As an example, consider items people purchase in a grocery store. Suppose most people who buy milk also buy juice. Also, most people who buy chicken and beef also buy bread. Then two association rules exist:

[Milk] --> [Juice]
If you buy milk, then you will also buy juice

[Chicken,Beef] --> [Bread]
If you buy chicken and beef, then you will also buy bread
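
The two rules above can be applied to a shopping basket with a simple subset check. The Python sketch below is only an illustration: the rule table, the function name and the choice to skip items already in the basket are assumptions for the example, not the way a PMML scoring engine is implemented.

```python
# Each rule pairs an antecedent item set with a single consequent item.
RULES = [
    (frozenset({"Milk"}), "Juice"),
    (frozenset({"Chicken", "Beef"}), "Bread"),
]

def recommend(basket, rules=RULES):
    """Return the consequent of every rule whose antecedent is fully contained in the basket."""
    items = set(basket)
    return [
        consequent
        for antecedent, consequent in rules
        # A rule fires only if all of its antecedent items were purchased;
        # items already in the basket are not recommended again.
        if antecedent <= items and consequent not in items
    ]

print(recommend(["Chicken", "Beef"]))
print(recommend(["Chicken"]))
```

Note that a basket holding only "Chicken" triggers nothing, because the second rule needs both chicken and beef before it produces "Bread".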


Data Processing

Normally, as in a typical regression model, one data row (or record) is read at a time and one output is given back. In particular, one input value is read for each of the input variables required by the model, which are positioned in different columns but in the same row. Once read in, the input record is processed through the model. The result, or output, is then appended to the data as an extra column as the predicted value or score. For association rules, on the other hand, multiple items of a single transaction need to be read in and processed before an output can be returned. As suggested by the example above, "Chicken" and "Beef" need to be read in before "Bread" is produced as an output. In the usual data format, the entire transaction will have its unique value in one column. For association rules, two different data processing methods can be used to read all the items under a single transaction.

These two methods allow for the data to be expressed either in a "rectangular" format or in a "transactional" format.

Rectangular Format

The rectangular format gives every possible item its own column, with each row representing a single transaction. For the above example, if customers purchase from a list of five possible items: Milk, Juice, Chicken, Beef, and Bread, the input data might be represented as:

Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1

Note that the first row specifies the header, while the third row, for example, specifies that Chicken, Beef and Bread were purchased together. Of course, it is not clear from the data alone whether chicken and beef imply bread, or whether chicken implies bread and beef; but together with the PMML file, the scoring engine is able to deduce the correct relationships. And so, for a "rectangular" data file, the output is added to the same row as a different column.
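
As a rough illustration of how a rectangular file can be read, the Python sketch below turns each row into the list of items purchased in that transaction. The function name is hypothetical, and treating a "1" as "purchased" is an assumption based on the sample data above.

```python
import csv
import io

RECTANGULAR = """Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1
"""

def read_rectangular(text):
    """Turn each row of a rectangular file into the list of items flagged with 1."""
    reader = csv.DictReader(io.StringIO(text))
    # Every column is one possible item; a "1" means the item is in the transaction.
    return [[item for item, flag in row.items() if flag == "1"] for row in reader]

print(read_rectangular(RECTANGULAR))
```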

The PMML file for each format is different as well. For a "rectangular" PMML file, all the possible values or items are defined as different fields. And so, these are defined as different "MiningFields" under the "MiningSchema" element. For the example above, instead of a single "MiningField" for the entire purchase, one would have five "MiningFields": Milk, Juice, Chicken, Beef, and Bread, as follows:

<MiningSchema>
  <MiningField name="Milk" usageType="active"/>
  <MiningField name="Juice" usageType="active"/>
  <MiningField name="Chicken" usageType="active"/>
  <MiningField name="Beef" usageType="active"/>
  <MiningField name="Bread" usageType="active"/>
</MiningSchema>

PMML Example - Association Rules in Rectangular Format

For an example of a PMML file and its corresponding data file in rectangular format, click HERE.

Transactional Format

The "transactional" format, on the other hand, allows for the input data to be specified in two columns: the first one is the identifier and the second one contains the possible items. For the example above, the data file might be represented as:

ID,item
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread

The identifier (column "ID") indicates which items belong together. And so, in this example, ID = 1 specifies that the first two items (Milk and Juice) belong to the same input group or transaction, while ID = 2 indicates that Chicken, Beef and Bread belong to a different group. In this case, for the "transactional" data file, the predicted value is added as an extra column in the first row of each group only.
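
The grouping step described above can be sketched in Python as follows. The rows are the sample data without its header line, and the function name is hypothetical; the sketch only shows how items sharing an identifier are collected into one transaction.

```python
from collections import defaultdict

# (identifier, item) pairs taken from the sample data above, header omitted.
ROWS = [
    ("1", "Milk"),
    ("1", "Juice"),
    ("2", "Chicken"),
    ("2", "Beef"),
    ("2", "Bread"),
]

def group_transactions(rows):
    """Collect items under their group identifier, preserving input order."""
    groups = defaultdict(list)
    for group_id, item in rows:
        groups[group_id].append(item)
    return dict(groups)

print(group_transactions(ROWS))
```

Only once a whole group has been collected can the rules be evaluated, which is why the predicted value is reported once per group rather than once per row.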

A "transactional" PMML file defines two "MiningFields". One is of type "group", which indicates which group the items belong to. The second is of type "active", which includes, as in our example, all the possible items that were purchased. Note that it is not necessary to list all items one by one. And so, the "MiningSchema" in a "transactional" PMML file might look like:

<MiningSchema>
  <MiningField name="ID" usageType="group"/>
  <MiningField name="item" usageType="active"/>
</MiningSchema>

In this case, the rows with the same "ID" belong together: since Milk and Juice in our example both have ID = 1, they both are in the same group. The second column, titled "item" in the data file, lists all the items for that group: Milk and Juice. One can thus read the first group as: “Milk and juice are purchased together”.
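
Since the two formats differ in their "MiningSchema", a consumer could tell them apart by looking for a field of type "group". The Python sketch below does this with the standard library XML parser. The function name is hypothetical, and the fragments are parsed without an XML namespace; a complete PMML file declares the PMML namespace, so the element lookup would need to account for it.

```python
import xml.etree.ElementTree as ET

TRANSACTIONAL_SCHEMA = """<MiningSchema>
  <MiningField name="ID" usageType="group"/>
  <MiningField name="item" usageType="active"/>
</MiningSchema>"""

RECTANGULAR_SCHEMA = """<MiningSchema>
  <MiningField name="Milk" usageType="active"/>
  <MiningField name="Juice" usageType="active"/>
</MiningSchema>"""

def data_format(schema_xml):
    """Guess the expected data format from the usage types in a MiningSchema fragment."""
    fields = ET.fromstring(schema_xml).findall("MiningField")
    # usageType defaults to "active" in PMML when the attribute is omitted.
    usage_types = {field.get("usageType", "active") for field in fields}
    return "transactional" if "group" in usage_types else "rectangular"

print(data_format(TRANSACTIONAL_SCHEMA))
print(data_format(RECTANGULAR_SCHEMA))
```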

PMML Example - Association Rules in Transactional Format

For an example of a PMML file and its corresponding data file in transactional format, click HERE.

Real-Time Recommendations, KNIME, and PMML

What is KNIME?

According to knime.com:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform.

Yes, KNIME is user-friendly, not only because it offers an intuitive GUI to analyze data, but also because it is open-source. KNIME is also standards-friendly. KNIME 2.0, released in 2008, was the first release to offer PMML support. Since then, PMML support in KNIME has matured considerably, from the import and export of predictive models all the way to the pre-processing of input variables. KNIME 2.5, released December 1, 2011, offers a series of PMML-enabled pre-processing nodes which can be embedded automatically in the final PMML model. All these features are documented in a paper presented at the KDD 2011 PMML Workshop:

Peer-reviewed article: KDD 2011 - PMML Pre-processing in KNIME

The picture below shows part of a typical workflow in KNIME. Note that KNIME nodes now come with "blue" ports which signify PMML support. In this way, one can link a series of PMML-enabled pre-processing nodes to a model and obtain not only the model but also all the pre-processing steps in the resulting PMML file.


Want to see more? Take a look at a step-by-step example of KNIME and PMML at work.

Whenever a PMML file is exported by KNIME, it can be directly deployed in any of the Zementis scoring products, including the ADAPA Scoring Engine or the Universal PMML Plug-in for in-database scoring. This enables models to be ready for operational use right away.

Social Media, Recommendations, and Real-Time Execution with KNIME and ADAPA

There is a lot of theory and hype around the topics of social media, recommendation engines and real time modeling, but until now not many practical examples that can be measured in terms of ROI. KNIME AG and Zementis have joined together to provide a white paper, which summarizes a practical case study that combines all three topics, and delivers a measured and solid business case.

Our case study is just one example of how advanced analytics combined with real-time execution has real-world benefits for organizations. Regardless of whether the requirement is to control risk, increase personalization with the customer or maximize sales and margin, the combination of KNIME and ADAPA is ideal for leveraging the power of data by providing an end-to-end solution, from model development to operational deployment and real-time execution within any business process.

Download our white-paper today: Social Media, Recommendation Engines and Real-Time Model Execution with KNIME and ADAPA

Thursday, November 3, 2011

In-database Scoring with PMML, Zementis, and Sybase IQ: Big Data Analytics Made Easy

Not all analytic tasks are born the same. If one is confronted with massive volumes of data that need to be scored on a regular basis, in-database scoring sounds like the logical thing to do. In all likelihood, the data in these cases is already stored in a database and, with in-database scoring, there is no data movement. Data and models reside together; hence, scores and predictions flow at an accelerated pace.

So, wouldn't it be great if you could now benefit from the flexibility of a standard such as PMML combined with in-database scoring? Zementis is offering just such a solution. It is called the Universal PMML Plug-in™ and it is truly amazing!

Here is why: for starters, it is simple to deploy and maintain. Our Universal PMML Plug-in was designed from the ground up to take advantage of efficient in-database execution, and, as its name suggests, it is PMML-based. PMML, the Predictive Model Markup Language, is the standard for representing predictive models and is currently exported from all major commercial and open-source data mining tools. So, if you build your models in SAS, IBM/SPSS, or R, you are ready to start benefiting from in-database scoring right away.

Announcing the Universal PMML Plug-in for Sybase IQ

It is our pleasure to announce, together with Sybase, the availability of the Zementis Universal PMML Plug-in for Sybase IQ 15.4 (Press Release: Sybase Does More Big Data Analytics). This solution allows external predictive models created in the PMML standard to be parsed, ingested and executed in-database in Sybase IQ. This unique capability is extremely appealing to most enterprises that leverage multiple data mining tools or seek to deploy their existing predictive models closer to the data for better performance and broader applicability.


The PMML Plug-in seamlessly embeds models within Sybase IQ. In this way, data scoring requires nothing more than adding a simple function call into your SQL statements. You can score data against one model or against multiple models at the same time. There is no need to code connection weights, regression equations or other more complex calculations in SQL or stored procedures. PMML and our Universal Plug-in can easily take care of that.

PMML execution combined with Sybase IQ's existing capabilities for text and multimedia analytics provides enterprises with a breadth of available techniques for analyzing big data.

For more details about the Universal PMML Plug-in for Sybase IQ, contact Zementis, or download the product data sheet.

Tuesday, April 19, 2011

KDD 2011 PMML Workshop - Call for Papers

Predictive Model Markup Language (PMML) Workshop at KDD 2011

Organized by the Data Mining Group (DMG – www.dmg.org ), Sunday August 21, 2011


A half-day workshop on the Predictive Model Markup Language (PMML), including PMML deployment success stories, PMML-based applications, PMML-based architectures, extensions to the PMML standard, and related topics.

The annual ACM SIGKDD conference ( http://www.sigkdd.org/kdd2011/ ) is the premier international forum for data mining researchers and practitioners from academia, industry, and government to share their ideas, research results and experiences. KDD-2011 will feature keynote presentations, oral paper presentations, poster sessions, workshops, tutorials, panels, exhibits, demonstrations, and the KDD Cup competition.

We invite submission of papers describing implementations of the Predictive Model Markup Language (PMML), including PMML deployment success stories, PMML-based applications, PMML-based architectures, proposed extensions to the PMML standard, and related topics.

Key Dates:
- Abstracts due: April 30, 2011
- Papers due: May 15, 2011

Please visit the PMML Workshop web site for details:

http://kdd2011-pmml.dmg.org/

Organizers:
- Rick Pechter (MicroStrategy), Chair
- Robert Grossman (Open Data Partners and University of Chicago)
- Christoph Lingenfelder (IBM)
- Ashok Savasere (SAS)
- Michael Zeller (Zementis)

Tuesday, April 12, 2011

Universal PMML Plug-in for EMC Greenplum Database

It is our pleasure to announce a new Zementis product, the Universal PMML Plug-in for in-database scoring. Available now for the EMC Greenplum Database, a high-performance massively parallel processing (MPP) database, the plug-in leverages the Predictive Model Markup Language (PMML) to execute predictive models directly within EMC Greenplum, for highly optimized in-database scoring.
Developed by the Data Mining Group (DMG), PMML is supported by all major data mining vendors, e.g., IBM SPSS, SAS, Teradata, FICO, STATISTICA, MicroStrategy, TIBCO and Revolution Analytics as well as open source tools like R, KNIME and RapidMiner. With PMML, models built in any of these data mining tools can now instantly be deployed in the EMC Greenplum database. The net result is the ability to leverage the power of standards-based predictive analytics on a massive scale, right where the data resides.
"By partnering with Zementis, a true PMML innovator, we are able to offer a vendor-agnostic solution for moving enterprise-level predictive analytics into the database execution environment," said Dr. Steven Hillion, Vice President of Analytics at EMC Greenplum. "With Zementis and PMML, the de-facto standard for representing data mining models, we are eliminating the need to recode predictive analytic models in order to deploy them within our database. In turn, this enables an analyst to reduce the time to insight required in most businesses today."

Want to learn more?

The Universal PMML Plug-in for the EMC Greenplum Database is available now. Contact us today for more information.




