Predictive Analytics, Big Data, Hadoop, PMML: 2011

Wednesday, December 14, 2011

Operational Deployment of Predictive Solutions: Lost in Translation? Not with PMML

Traditionally, the deployment of predictive solutions has been, to put it mildly, cumbersome. As shown in the Figure below, data mining scientists work hard to analyze historical data and to build the best predictive solutions out of it. Engineers, on the other hand, are usually responsible for bringing these solutions to life by recoding them into a format suitable for production deployment. Given that data mining scientists and engineers tend to inhabit different information worlds, the process of moving a predictive solution from the scientist's desktop to production can get lost in translation.

Luckily, the advent of PMML (Predictive Model Markup Language) changed this scenario radically. PMML is the de facto standard used to represent predictive solutions. In this way, there is no need for scientists to write a Word document describing the solution. They can just export it as a PMML file. Today, all major data mining tools and statistical packages support PMML. These include IBM SPSS, SAS, R, KNIME, RapidMiner, KXEN, ... Also, tools such as the Zementis Transformations Generator and KNIME allow for easy PMML coding of pre- and post-processing steps.

Great! Once a PMML file exists, it can be easily deployed in production with ADAPA, the Zementis scoring engine. ADAPA even allows for models to be deployed in the Amazon Cloud and be accessed from anywhere via web-services. Zementis also offers in-database scoring via its Universal PMML Plug-in, which is also available for Hadoop. In this way, a process that could take 6 months now takes minutes.

PMML and ADAPA have transformed model deployment forever. If you or your company are still spending time and resources on deploying your predictive analytics the traditional way, make sure to contact us. The secret behind exceptional predictive analytics is out!

Friday, December 9, 2011

PMML and Association Rules

An association rule describes a relation between one group of objects and another group of objects. This may be said in another way: "If a condition A is satisfied, then so is condition B". As an example, consider items people purchase in a grocery store. Suppose most people who buy milk also buy juice. Also, most people who buy chicken and beef also buy bread. Then two association rules exist:

[Milk] --> [Juice]
If you buy milk, then you will also buy juice

[Chicken,Beef] --> [Bread]
If you buy chicken and beef, then you will also buy bread
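
As a rough illustration of how such rules can be applied, the following Python sketch stores each rule as an antecedent/consequent pair and returns the consequents of every rule whose antecedent is fully contained in a basket. The names RULES and recommend are hypothetical, and this is a simplification for illustration only, not the scoring procedure defined by PMML or used by any particular engine.

```python
# Hypothetical rule table built from the two rules in the text:
# each entry is (antecedent item set, consequent item set).
RULES = [
    (frozenset({"Milk"}), frozenset({"Juice"})),
    (frozenset({"Chicken", "Beef"}), frozenset({"Bread"})),
]


def recommend(basket):
    """Return items implied by every rule whose antecedent is in the basket."""
    items = set(basket)
    suggested = set()
    for antecedent, consequent in RULES:
        # A rule fires only if ALL of its antecedent items were purchased.
        if antecedent <= items:
            # Do not suggest items the customer already has.
            suggested |= consequent - items
    return suggested


print(recommend(["Milk"]))             # the first rule fires
print(recommend(["Chicken"]))          # no rule fires: Beef is missing
print(recommend(["Chicken", "Beef"]))  # the second rule fires
```

Note that a basket containing only chicken produces no output, which is why all items of a transaction must be read before a result can be returned.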

Data Processing

Normally, as in a typical regression model, one data row (or record) is read at a time and one output is given back. In particular, one input value is read for each of the input variables required by the model, which are positioned in different columns but in the same row. Once read in, the input record is processed through the model. The result, or output, is then appended to the data as an extra column containing the predicted value or score. For association rules, on the other hand, multiple items of a single transaction need to be read in and processed before an output can be returned. As suggested by the example above, "Chicken" and "Beef" need to be read in before "Bread" is produced as an output. In the usual data format, the entire transaction would have a single value in one column. For association rules, two different data processing methods can be used to read all the items under a single transaction.

These two methods allow for the data to be expressed either in a "rectangular" format or in a "transactional" format.

Rectangular Format

The rectangular format gives every possible item its own column and uses one row per transaction. For the above example, if customers purchase from a list of five possible items (Milk, Juice, Chicken, Beef, and Bread), the input data might be represented as:

Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1

Note that the first row specifies the header, while the third row, for example, specifies that Chicken, Beef and Bread were purchased together. Of course, the data alone does not make clear whether chicken and beef imply bread, or whether chicken implies bread and beef; together with the PMML file, however, the scoring engine is able to deduce the correct relationships. For a "rectangular" data file, the output is added to the same row as an additional column.
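
To make the layout concrete, here is a minimal Python sketch that reads the rectangular file shown above and turns each row into the set of purchased items. The function name read_rectangular is hypothetical, and the sketch assumes that a purchased item is flagged with the value 1, as in the sample data.

```python
import csv
import io

# The rectangular sample from the text: a header row, then one row per transaction.
RECTANGULAR = """Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1
"""


def read_rectangular(text):
    """Return one set of purchased items per data row (assumes 1 = purchased)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name for name, flag in row.items() if flag.strip() == "1"}
        for row in reader
    ]


transactions = read_rectangular(RECTANGULAR)
for number, items in enumerate(transactions, start=1):
    print(number, sorted(items))
```

Because each row is a complete transaction, a score can be written back to that same row.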

The PMML file for each format is different as well. For a "rectangular" PMML file, all the possible values or items are defined as different fields. And so, these are defined as different "MiningFields" under the "MiningSchema" element. For the example above, instead of a single "MiningField" for the entire purchase, one would have five "MiningFields": Milk, Juice, Chicken, Beef, and Bread, as follows:

 <MiningSchema>
    <MiningField name="Milk" usageType="active"/>
    <MiningField name="Juice" usageType="active"/>
    <MiningField name="Chicken" usageType="active"/>
    <MiningField name="Beef" usageType="active"/>
    <MiningField name="Bread" usageType="active"/>
 </MiningSchema>

PMML Example - Association Rules in Rectangular Format

For an example of a PMML file and its corresponding data file in rectangular format, click HERE.

Transactional Format

The "transactional" format, on the other hand, allows for the input data to be specified in two columns: the first one is the identifier and the second one contains the possible items. For the example above, the data file might be represented as:

ID,item
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread

The identifier (column "ID") indicates which items belong together. And so, in this example, ID = 1 specifies that the first two items (Milk and Juice) belong to the same input group or transaction, while ID = 2 indicates that Chicken, Beef and Bread belong to a different group. In this case, for the "transactional" data file, the predicted value is added as an extra column in the first row of each group only.

A "transactional" PMML file defines two "MiningFields". One is of type "group", which indicates which group the items belong to. The second is of type "active", which includes, as in our example, all the possible items that were purchased. Note that it is not necessary to list all items one by one. And so, the "MiningSchema" in a "transactional" PMML file might look like:

 <MiningSchema>
    <MiningField name="ID" usageType="group"/>
    <MiningField name="item" usageType="active"/>
 </MiningSchema>

In this case, the rows with the same "ID" belong together: since Milk and Juice in our example both have ID = 1, they are both in the same group. The second column, titled "item" in the data file, lists all the items for that group: Milk and Juice. One can thus read the first group as: “Milk and juice are purchased together”.
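
The grouping step can be sketched in a few lines of Python. The function name read_transactional is hypothetical, and the column names are parameters that default to the MiningField names "ID" and "item" used in the schema above; this is an illustration of the idea, not the way any scoring engine is implemented.

```python
import csv
import io

# Transactional sample: one row per purchased item, tagged with its group.
TRANSACTIONAL = """ID,item
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread
"""


def read_transactional(text, group_field="ID", item_field="item"):
    """Collect the items of each group into one set, keyed by group identifier."""
    groups = {}
    for row in csv.DictReader(io.StringIO(text)):
        groups.setdefault(row[group_field], set()).add(row[item_field])
    return groups


baskets = read_transactional(TRANSACTIONAL)
for group_id, items in baskets.items():
    print(group_id, sorted(items))
```

The result is the same pair of item sets that the rectangular layout describes, which is why both formats can carry the same transactions.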

PMML Example - Association Rules in Transactional Format

For an example of a PMML file and its corresponding data file in transactional format, click HERE.

Real-Time Recommendations, KNIME, and PMML

What is KNIME?

According to knime.com:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform.

Yes, KNIME is user-friendly, not only because it offers an intuitive GUI to analyze data, but also because it is open-source. KNIME is also standards-friendly. KNIME 2.0, released in 2008, was the first release to offer PMML support. Since then, PMML support in KNIME has matured considerably, from the import and export of predictive models all the way to the pre-processing of input variables. KNIME 2.5, released December 1, 2011, offers a series of PMML-enabled pre-processing nodes which can be embedded automatically in the final PMML model. All these features are documented in a paper presented at the KDD 2011 PMML Workshop:

Peer-reviewed article: KDD 2011 - PMML Pre-processing in KNIME

The picture below shows part of a typical workflow in KNIME. Note that KNIME nodes now come with "blue" ports which signify PMML support. In this way, one can link a series of PMML-enabled pre-processing nodes to a model and obtain not only the model but also all the pre-processing steps in the resulting PMML file.

Want to see more? Take a look at a step-by-step example of KNIME and PMML at work.

Whenever a PMML file is exported by KNIME, it can be directly deployed in any of the Zementis scoring products, including the ADAPA Scoring Engine or the Universal PMML Plug-in for in-database scoring. This enables models to be ready for operational use right away.

Social Media, Recommendations, and Real-Time Execution with KNIME and ADAPA

There is a lot of theory and hype around the topics of social media, recommendation engines and real-time modeling, but until now not many practical examples that can be measured in terms of ROI. KNIME AG and Zementis have joined together to provide a white paper which summarizes a practical case study that combines all three topics and delivers a measured and solid business case.

Our case study is just one example of how advanced analytics combined with real-time execution has real-world benefits for organizations. Regardless of whether the requirement is to control risk, increase personalization with the customer or maximize sales and margin, the combination of KNIME and ADAPA is ideal for leveraging the power of data by providing an end-to-end solution, from model development to operational deployment and real-time execution within any business process.

Download our white-paper today: Social Media, Recommendation Engines and Real-Time Model Execution with KNIME and ADAPA

Thursday, November 3, 2011

In-database Scoring with PMML, Zementis, and Sybase IQ: Big Data Analytics Made Easy

Not all analytic tasks are born the same. If one is confronted with massive volumes of data that need to be scored on a regular basis, in-database scoring sounds like the logical thing to do. In all likelihood, the data in these cases is already stored in a database and, with in-database scoring, there is no data movement. Data and models reside together; hence, scores and predictions flow at an accelerated pace.

So, wouldn't it be great if you could now benefit from the flexibility of a standard such as PMML combined with in-database scoring? Zementis is offering just such a solution. It is called the Universal PMML Plug-in™ and it is truly amazing!

Here is why: for starters, it is simple to deploy and maintain. Our Universal PMML Plug-in was designed from the ground up to take advantage of efficient in-database execution, and, as its name suggests, it is PMML-based. PMML, the Predictive Model Markup Language, is the standard for representing predictive models currently exported from all major commercial and open-source data mining tools. So, if you build your models in SAS, IBM/SPSS, or R, you are ready to start benefiting from in-database scoring right away.

Announcing the Universal PMML Plug-in for Sybase IQ

It is our pleasure to announce, together with Sybase, the availability of the Zementis Universal PMML Plug-in for Sybase IQ 15.4 (Press Release: Sybase Does More Big Data Analytics). This solution allows external predictive models created in the PMML standard to be parsed, ingested and executed in-database in Sybase IQ. This unique capability is extremely appealing to most enterprises that leverage multiple data mining tools or seek to deploy their existing predictive models closer to the data for better performance and broader applicability.

The PMML Plug-in seamlessly embeds models within Sybase IQ. In this way, data scoring requires nothing more than adding a simple function call into your SQL statements. You can score data against one model or against multiple models at the same time. There is no need to code connection weights, regression equations or other more complex calculations in SQL or stored procedures. PMML and our Universal Plug-in can easily take care of that.

PMML execution combined with Sybase IQ's existing capabilities for text and multimedia analytics provides enterprises with a breadth of available techniques for analyzing big data.

For more details about the Universal PMML Plug-in for Sybase IQ, contact Zementis, or download the product data sheet.

Tuesday, April 19, 2011

KDD 2011 PMML Workshop - Call for Papers

Predictive Model Markup Language (PMML) Workshop at KDD 2011

Organized by the Data Mining Group (DMG – www.dmg.org ), Sunday August 21, 2011

A half-day workshop on the Predictive Model Markup Language (PMML), including PMML deployment success stories, PMML-based applications, PMML-based architectures, extensions to the PMML standard, and related topics.

The annual ACM SIGKDD conference ( http://www.sigkdd.org/kdd2011/ ) is the premier international forum for data mining researchers and practitioners from academia, industry, and government to share their ideas, research results and experiences. KDD-2011 will feature keynote presentations, oral paper presentations, poster sessions, workshops, tutorials, panels, exhibits, demonstrations, and the KDD Cup competition.

We invite submission of papers describing implementations of the Predictive Model Markup Language (PMML), including PMML deployment success stories, PMML-based applications, PMML-based architectures, proposed extensions to the PMML standard, and related topics.

Key Dates:
- Abstracts due: April 30, 2011
- Papers due: May 15, 2011

Please visit the PMML Workshop web site for details:

http://kdd2011-pmml.dmg.org/

Organizers:
- Rick Pechter (MicroStrategy), Chair
- Robert Grossman (Open Data Partners and University of Chicago)
- Christoph Lingenfelder (IBM)
- Ashok Savasere (SAS)
- Michael Zeller (Zementis)

Tuesday, April 12, 2011

Universal PMML Plug-in for EMC Greenplum Database

It is our pleasure to announce a new Zementis product, the Universal PMML Plug-in for in-database scoring. Available now for the EMC Greenplum Database, a high-performance massively parallel processing (MPP) database, the plug-in leverages the Predictive Model Markup Language (PMML) to execute predictive models directly within EMC Greenplum, for highly optimized in-database scoring.

Developed by the Data Mining Group (DMG), PMML is supported by all major data mining vendors, e.g., IBM SPSS, SAS, Teradata, FICO, STATISTICA, MicroStrategy, TIBCO and Revolution Analytics, as well as open source tools like R, KNIME and RapidMiner. With PMML, models built in any of these data mining tools can now instantly be deployed in the EMC Greenplum database. The net result is the ability to leverage the power of standards-based predictive analytics on a massive scale, right where the data resides.

"By partnering with Zementis, a true PMML innovator, we are able to offer a vendor-agnostic solution for moving enterprise-level predictive analytics into the database execution environment," said Dr. Steven Hillion, Vice President of Analytics at EMC Greenplum. "With Zementis and PMML, the de-facto standard for representing data mining models, we are eliminating the need to recode predictive analytic models in order to deploy them within our database. In turn, this enables an analyst to reduce the time to insight required in most businesses today."

Want to learn more?

To learn more about how the EMC Greenplum Database and the Universal PMML Plug-in work together, feel free to:

Visit the PMML Plug-in product page
Download the white paper

The Universal PMML Plug-in for the EMC Greenplum Database is available now. Contact us today for more information.

Thursday, March 17, 2011

Data Transformations in PMML: New Interactive Tool Available

PMML (Predictive Model Markup Language) is the de facto standard used to represent and share predictive analytic solutions between applications. It enables data mining scientists and users alike to easily build, visualize, and deploy their solutions using different platforms and systems.

PMML is the brainchild of the DMG (Data Mining Group), an independent, vendor-led consortium that develops data mining standards. Besides developing PMML, another DMG goal is to make learning material and PMML tools available to the data mining community. Last year, the first book about PMML was published. This book, entitled "PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics", is now available for purchase on Amazon.com.

And so, as part of the DMG's continuing commitment to the success and adoption of the PMML standard, it is our pleasure to announce today the availability of a PMML tool, a companion to the book "PMML in Action", that provides an interactive, hands-on approach to the generation of data transformations.

To access the tool, simply click HERE (or click on the "Transformations Generator" link available on the DMG website).

PMML provides a variety of data transformations, including value mapping, normalization, and discretization. It also offers several built-in functions as well as arithmetic and logical operators which can be combined to represent complex pre-processing steps. Here we show two of the operations available in the interactive PMML tool: "Value Mapping" and the "Generic Operation".

Value Mapping

The figure below shows the "Value Mapping" transformation as represented by the transformations generator tool. The left side shows the graphical representation of the mapping of primary colors to numbers. The right side shows the corresponding PMML code generated by the tool.
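
In spirit, a value mapping is a lookup table. The Python sketch below illustrates the idea only; the specific colors and numbers are assumptions made for this example (the figure's actual values are not reproduced here), and the names COLOR_TO_NUMBER and map_value are hypothetical.

```python
# Assumed lookup table for illustration: the colors and the numbers
# they map to are NOT taken from the original figure.
COLOR_TO_NUMBER = {"red": 1, "green": 2, "blue": 3}


def map_value(value, table, default=None):
    """Return the mapped number, or the default when the value is not in the table."""
    return table.get(value, default)


print(map_value("red", COLOR_TO_NUMBER))     # mapped through the table
print(map_value("purple", COLOR_TO_NUMBER))  # unmapped input falls back to the default
```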

Generic Operation

The "Generic Operation" allows users to use the "IF-THEN" and "IF-THEN-ELSE" constructs in addition to logical and arithmetic operators. The figure below shows the graphical representation and expression tree for the following construct:

IF isMissing (primaryColor)
THEN processedColor = 4

The right side shows the interface together with the expression tree, while the left side shows the respective PMML code.
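
For readers who think in code, the IF-THEN construct above can be paraphrased in Python as follows. This is a sketch under two assumptions: a missing value is represented by None, and the result is left missing (None) when the condition is false, since the construct has no ELSE branch. The function names are hypothetical.

```python
def is_missing(value):
    """Assumption for this sketch: a missing input is represented by None."""
    return value is None


def processed_color(primary_color):
    """IF isMissing(primaryColor) THEN processedColor = 4."""
    if is_missing(primary_color):
        return 4
    # No ELSE branch in the construct: the result stays missing (None here).
    return None


print(processed_color(None))   # the condition holds, so the result is 4
print(processed_color("red"))  # the condition fails, so the result stays missing
```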

Start building your own transformations in PMML today! To access the transformations generator tool, simply click HERE.

Wednesday, February 23, 2011

Zementis forms partnership with Revolution Analytics

We are proud to announce our strategic partnership with Revolution Analytics, the leading commercial provider of software and support for the popular open source R statistics language. With this partnership, predictive models developed on Revolution R Enterprise are now accessible for real-time scoring through the ADAPA Decisioning Engine by Zementis.

ADAPA is an extremely fast and scalable predictive platform. Models deployed in ADAPA are automatically available for execution in real-time and batch-mode as Web Services. ADAPA allows Revolution R Enterprise to leverage the Predictive Model Markup Language (PMML) for better decision management. With PMML, models built in R can be used in a wide variety of real-world scenarios without requiring laborious or expensive proprietary processes to convert them into applications capable of running on an execution system.

Got demo?

Yes, we do! Revolution Analytics and Zementis have put together a demo which combines the building of models in R with automatic deployment and execution in ADAPA. It uses Revolution Analytics' RevoDeployR, a new Web Services framework that allows for data analysts working in R to publish R scripts to a server-based installation of Revolution R Enterprise.

Action Items:

Try our INTERACTIVE DEMO
DOWNLOAD the white paper
Try the ADAPA FREE TRIAL

RevoDeployR & ADAPA allow analysis and predictions from R to be used effectively, in real time, by existing Excel spreadsheets, BI dashboards and Web-based applications.

Predictive analytics with RevoDeployR from Revolution Analytics and ADAPA from Zementis put model building and real-time scoring into a league of their own. Seriously!

Wednesday, January 26, 2011

Predictive analytics and the power of open standards and cloud computing

Organizations around the globe increasingly recognize the value that predictive analytics offers to their business. The complexity of development, integration, and deployment of predictive models, however, is often considered cost-prohibitive for many projects. In light of mature open source solutions, open standards, and SOA principles, we offer an agile model development life cycle that allows us to quickly leverage predictive analytics in operational environments.

Starting with data analysis and model development, you can effectively use the Predictive Model Markup Language (PMML) standard to move complex decision models from the scientist's desktop into a scalable production environment hosted on the Amazon Elastic Compute Cloud (Amazon EC2).

Expressing Models in PMML

PMML is an XML-based language used to define predictive models. It was specified by the Data Mining Group (DMG), an independent group of leading technology companies including Zementis. By providing a uniform standard to represent such models, PMML allows for the exchange of predictive solutions between different applications and various vendors.

Open source statistical tools such as R can be used to develop data mining models based on historical data. R allows for models to be exported into PMML which can then be imported into an operational decision platform and be ready for production use in a matter of minutes.

On-Demand Predictive Analytics

Amazon EC2 is a reliable, on-demand infrastructure on which we offer the ADAPA Predictive Decisioning Engine based on the Software as a Service (SaaS) paradigm. ADAPA imports models expressed in PMML and executes these in batch mode, or real-time via web-services.

Our service is implemented as a private, dedicated Amazon EC2 instance of ADAPA. Each client has access to his/her own ADAPA instance via HTTP/HTTPS. In this way, models and data for one client never share the same engine with other clients.

Using a SaaS solution to break down traditional barriers that currently slow the adoption of predictive analytics, our strategy translates predictive models into operational assets with minimal deployment costs and leverages the inherent scalability of utility computing.

In summary, ADAPA allows for:

- Cost-effective and reliable service based on Amazon’s EC2 infrastructure
- Secure execution of predictive models through dedicated and controlled instances including HTTPS and Web-Services security
- On-demand computing: choice of instance type (small, large, extra-large, ...) and launch of multiple instances
- Superior time-to-market by providing rapid deployment of predictive models and an agile enterprise decision management environment


Monday, January 24, 2011

Predictions in the Cloud

ADAPA is the first standards-based, real-time predictive decisioning engine available on the market and the first scoring engine accessible on the Amazon Cloud as a service. ADAPA on the Cloud combines the benefits of Software as a Service (SaaS), the scalability of cloud computing and the extensive feature set of ADAPA on Site.

What do you mean by standards-based?

ADAPA executes predictive models represented in PMML (Predictive Model Markup Language). PMML is the standard for representing predictive models currently exported from all major commercial and open-source data mining tools. If you'd like to use ADAPA on the Cloud, but do not want to bother with PMML, contact us. We'll be happy to help you so that you start benefiting from ADAPA right away.

Is ADAPA really fast?

ADAPA is very fast. We recently published a study in the ACM SIGKDD Newsletter in which we show that ADAPA can easily score thousands of transactions per second. On the High-CPU Extra-Large instance, ADAPA can score 300 million transactions per hour. FAST!
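
To relate the hourly figure to a per-second rate, a quick back-of-the-envelope conversion in Python is shown below. It uses only the 300 million transactions per hour quoted above and assumes the rate is sustained evenly over the hour.

```python
# Convert the quoted hourly throughput into an average per-second rate.
TRANSACTIONS_PER_HOUR = 300_000_000
SECONDS_PER_HOUR = 60 * 60

per_second = TRANSACTIONS_PER_HOUR / SECONDS_PER_HOUR
print(f"{per_second:,.0f} transactions per second")  # prints "83,333 transactions per second"
```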

What kind of models does it support?

Modeling techniques currently supported are:

Neural Networks
Association Rules
Support Vector Machines
Naive Bayes Classifiers
Ruleset Models
Clustering Models (including Two-Step Clustering)
Decision Trees
Regression Models (including Cox Regression Models)
Scorecards

How about data pre- and post-processing?

ADAPA transforms your raw data into meaningful feature detectors before scoring it. It post-processes the output of your predictive model so that it conforms to your requirements. ADAPA supports all the PMML built-in functions and data manipulations (as well as user defined functions). To learn more about how to represent pre- and post-processing operations in PMML, please take a look at our PMML data manipulation primer or simply contact us.

Can I combine predictive analytics with business rules?

ADAPA provides seamless integration of predictive analytics and rules. Simply put, ADAPA allows data-driven insight and expert knowledge to be combined into a single and powerful decision strategy. That is because, in addition to a sophisticated predictive analytics engine, ADAPA also incorporates the full functionality of a rules engine.

How do I pay for it? Is it expensive?

Once you sign up for ADAPA on the Cloud through Amazon.com, ADAPA charges show up on your credit card bill. Amazon handles all the billing. You can even use the same account you use to buy books. ADAPA on the Cloud does not cost an arm and a leg. Check out our pricing! And, the best part, you pay only for what you actually use.

Friday, January 14, 2011

Zoey on Model Deployment, PMML, and ADAPA

PMML (Predictive Model Markup Language) is the de facto standard used by all the top analytic vendors (commercial and open-source) to represent data mining models.

When represented in PMML format, a predictive model can be uploaded into ADAPA, where it is readily available for execution from anywhere at any time. ADAPA makes it easy for predictive models to be put to work right away, wherever they are needed, in real-time or batch-mode.

Zementis, the maker of ADAPA, is a company committed to the success of its clients. With PMML and ADAPA, you are on the right track to predictive analytic bliss.

In the video below, Zoey takes you on a PMML journey featuring ADAPA and Zementis. Enjoy!

Friday, January 7, 2011

Predictive Analytics with R, PMML, ADAPA, and Excel

PMML (Predictive Model Markup Language) is the standard language used by all the top analytic vendors (commercial and open-source) to represent predictive models. These include IBM/SPSS, SAS, KXEN, TIBCO, STATISTICA, R, KNIME, and RapidMiner (for details, see the list of PMML-powered tools at DMG.org).

Once a predictive model is exported in PMML format, it can be easily deployed in ADAPA, the predictive decisioning platform from Zementis. ADAPA is PMML-based and is able to upload new and older versions of PMML files and make them available for execution, right away.

Model execution can be performed via the ADAPA Console, Web-services or from within Excel. ADAPA makes predictive models accessible from anywhere at any time.

See for yourself. Watch our new R to PMML to ADAPA video and learn how to execute your predictive models from within Excel by using ADAPA on the Amazon Cloud.