Predictive Analytics, Big Data, Hadoop, PMML: October 2012

Wednesday, October 31, 2012

When Big Data and Predictive Analytics Collide

Big Data is usually defined in terms of Volume, Variety, and Velocity (the so-called 3 Vs). Volume implies breadth and depth, while variety is simply the nature of the beast: on-line transactions, tweets, text, video, sound, ... Velocity, on the other hand, implies that data is being produced amazingly fast (according to IBM, 90% of the data that exists today was generated in the last 2 years), but also that it gets old pretty fast. In fact, some varieties of data tend to age quicker than others.

To be able to tackle Big Data, systems and platforms need to be robust, scalable, and agile.

It is in this context that IntelliFest 2012 came to be. The conference theme this year was "Intelligence in the Cloud", exploring the use of applied AI in cloud computing, mobile apps, Big Data, and many other application areas. Among several amazing speakers at IntelliFest were Stephen Grossberg from Boston University, Rajat Monga from Google, Carlos Serrano-Morales from Sparkling Logic, Paul Vincent from TIBCO, and Alex Guazzelli from Zementis.

Dr. Alex Guazzelli's talk on Big Data, Predictive Analytics, and PMML is now available for on-demand viewing on YouTube. The abstract follows below, together with several resources including the presentation slides and files used in the live demo.

Abstract:

Predictive analytics has been used for many years to learn patterns from historical data to literally predict the future. Well-known techniques include neural networks, decision trees, and regression models. Although these techniques have been applied to a myriad of problems, the advent of big data, cost-efficient processing power, and open standards has propelled predictive analytics to new heights.

Big data involves large amounts of structured and unstructured data that are captured from people (e.g., on-line transactions, tweets, ...) as well as sensors (e.g., GPS signals in mobile devices). With big data, companies can now start to assemble a 360-degree view of their customers and processes. Luckily, powerful and cost-efficient computing platforms such as the cloud and Hadoop are here to address the processing requirements imposed by the combination of big data and predictive analytics.

But creating predictive solutions is just part of the equation. Once built, they need to be transitioned to the operational environment where they are actually put to use. In the agile world we live in today, the Predictive Model Markup Language (PMML) delivers the necessary representational power for solutions to be quickly and easily exchanged between systems, allowing for predictions to move at the speed of business.

This talk will give an overview of the colliding worlds of big data and predictive analytics. It will do that by delving into the technologies and tools available in the market today that allow us to truly benefit from the barrage of data we are gathering at an ever-increasing pace.

Resources:

Download the presentation slides
Download the KNIME workflow used to generate a sample neural network for predicting churn
Download the PMML file created during the demo

Wednesday, October 17, 2012

Big data insights through predictive analytics, open-standards and cloud computing

Organizations increasingly recognize the value that predictive analytics and big data offer to their business. The complexity of development, integration, and deployment of predictive solutions, however, is often considered cost-prohibitive for many projects. In light of mature open source solutions, open standards, and SOA principles, we propose an agile model development life cycle that quickly puts predictive analytics to work in operational environments.

Starting with data analysis and model development, you can use the Predictive Model Markup Language (PMML) standard to move complex decision models from the scientist's desktop into a scalable production environment hosted in the cloud (Amazon EC2 and IBM SmartCloud Enterprise).

Expressing Models in PMML

PMML is an XML-based language used to define predictive models. It was specified by the Data Mining Group, an independent group of leading technology companies including Zementis. By providing a uniform standard to represent such models, PMML allows for the exchange of predictive solutions between different applications and various vendors.
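To make the idea of an XML-based model format concrete, here is a minimal sketch in Python. The PMML document is hand-written for illustration (the field names, coefficients, and model name are made up), and the tiny scorer only handles a single regression table with numeric predictors. It is not how ADAPA or any other engine evaluates PMML; it simply shows that the model is plain, readable XML.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative PMML document: a linear regression with
# made-up field names and coefficients.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <Header description="Illustrative linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="spend_model" functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def score_linear_regression(pmml_text, record):
    """Evaluate the first RegressionTable: intercept + sum(coef * x ** exponent)."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    if table is None:
        raise ValueError("no RegressionModel/RegressionTable found")
    result = float(table.get("intercept"))
    for predictor in table.findall("pmml:NumericPredictor", NS):
        # The exponent attribute is optional and defaults to 1.
        exponent = int(predictor.get("exponent", "1"))
        coefficient = float(predictor.get("coefficient"))
        result += coefficient * float(record[predictor.get("name")]) ** exponent
    return result


prediction = score_linear_regression(PMML_DOC, {"age": 40, "income": 50000})
```

Because the model lives entirely in the XML, the same file could be handed to any PMML consumer without touching the code that produced it.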

Open source PMML-compliant statistical tools such as R, KNIME, and RapidMiner can be used to develop data mining models based on historical data. Once models are exported into a PMML file, they can then be imported into an operational decision platform and be ready for production use in a matter of minutes.

On-Demand Predictive Analytics

Both Amazon and IBM offer a reliable, on-demand cloud computing infrastructure on which we offer the ADAPA® Predictive Decisioning Engine based on the Software as a Service (SaaS) paradigm. ADAPA imports models expressed in PMML and executes them in batch mode or in real time via web-services.

Our service is implemented as a private, dedicated instance of ADAPA. Each client has access to their own ADAPA Engine instance via HTTP/HTTPS. In this way, models and data for one client never share the same engine with other clients.

The ADAPA Web Console

Each instance executes a single version of the ADAPA engine. The engine is accessible through the ADAPA Web Console, which makes it easy to manage predictive models and data files. The instance owner can use the console to upload new models as well as to score or classify records in data files in batch mode. Real-time execution of predictive models is achieved through web-services. The console offers an intuitive interface divided into two main sections: model management and data management. Together, these allow existing models to be used for generating decisions on different data sets. New models can be uploaded, and existing models removed, in a matter of seconds.

Predicting in the Cloud

Using a SaaS solution to break down traditional barriers that currently slow the adoption of predictive analytics, our strategy translates predictive solutions into operational assets with minimal deployment costs and leverages the inherent scalability of utility computing.

In summary, ADAPA revolutionizes the world of predictive analytics and cracks the big data code, since it allows for:

Cost-effective and reliable service based on two outstanding cloud computing infrastructures: Amazon and IBM.

Secure execution of predictive models through dedicated and controlled instances, including HTTPS and Web-Services security.

On-demand computing. Choice of instance type and launch of multiple instances.

Superior time-to-market by providing rapid deployment of predictive solutions and an agile enterprise decision management environment.

Monday, October 8, 2012

ADAPA in the Cloud: Feature List

Broad support for predictive algorithms

ADAPA supports an extensive collection of statistical and data mining algorithms. These are:

Ruleset Models (flat Decision Trees)
Clustering Models (Distribution-Based, Center-Based, and 2-Step Clustering)
Decision Trees (for classification and regression) together with multiple missing value handling strategies (Default Child, Last Prediction, Null Prediction, Weighted Confidence, Aggregate Nodes)
Naive Bayes Classifiers
Association Rules
Neural Networks (Back-Propagation, Radial-Basis Function, and Neural-Gas)
Regression Models (Linear, Polynomial, and Logistic) and General Regression Models (General Linear, Ordinal Multinomial, Generalized Linear, Cox)
Support Vector Machines (for regression and multi-class and binary classification)
Scorecards (including reason codes and point allocation for categorical, continuous, and complex attributes)
Multiple Models (Segmentation, Ensembles - including Random Forest Models and Stochastic Boosting, Chaining and Model Composition)
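As one illustration of the model types listed above, a scorecard with reason codes can be sketched in a few lines of Python. Everything in this example is hypothetical: the characteristics, bins, points, baselines, and reason-code names are made up, and ranking reason codes by how far each characteristic falls below its baseline is an assumption chosen for the sketch (similar in spirit to a "points below baseline" approach), not a description of ADAPA's internals.

```python
# Hypothetical scorecard. Each characteristic has a baseline score and a list
# of bins: (exclusive upper bound, partial score, reason code).
SCORECARD = {
    "initial_score": 300,
    "characteristics": {
        "age": {
            "baseline": 60,
            "bins": [
                (25, 10, "RC_AGE"),
                (45, 40, "RC_AGE"),
                (float("inf"), 60, "RC_AGE"),
            ],
        },
        "late_payments": {
            "baseline": 80,
            "bins": [
                (1, 80, "RC_LATE"),
                (3, 40, "RC_LATE"),
                (float("inf"), 0, "RC_LATE"),
            ],
        },
    },
}


def score(record, card=SCORECARD):
    """Return (total score, reason codes ordered by largest shortfall first)."""
    total = card["initial_score"]
    shortfalls = []
    for name, spec in card["characteristics"].items():
        value = record[name]
        # Pick the first bin whose upper bound exceeds the value; the last
        # bin is unbounded, so a match is always found.
        for upper, points, code in spec["bins"]:
            if value < upper:
                break
        total += points
        gap = spec["baseline"] - points
        if gap > 0:
            shortfalls.append((gap, code))
    shortfalls.sort(key=lambda item: item[0], reverse=True)
    return total, [code for _, code in shortfalls]


total, reasons = score({"age": 22, "late_payments": 3})
```

The reason codes explain which characteristics pulled the score down the most, which is what makes scorecards attractive when a decision has to be justified to an applicant.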

Model interfaces: pre- and post-processing

Additionally, ADAPA supports a myriad of functions for implementing data pre- and post-processing. These include:

Text Mining
Value Mapping
Discretization
Normalization
Scaling
Logical and Arithmetic Operators
Business Rules
Lookup Tables
Regular Expressions
Custom Functions

and much, much more.

If you think of anything ADAPA cannot do or something else you need to do in terms of data manipulation, let us know.
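A few of the transformations listed above can be sketched in plain Python to show what they do to a single input value. The function names, bin edges, and lookup table here are illustrative assumptions; in a PMML deployment these steps would be declared inside the model file rather than hand-coded.

```python
from bisect import bisect_right


def normalize(value, low, high):
    """Min-max scale value into [0, 1], clamping values outside [low, high]."""
    if high == low:
        raise ValueError("low and high must differ")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def discretize(value, edges, labels):
    """Map a number to a label; edges are sorted interior cut points.

    A value equal to an edge falls into the bin to its right.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


def map_value(value, table, default=None):
    """Value mapping via a lookup table, with a default for unseen values."""
    return table.get(value, default)


scaled = normalize(30, 20, 60)
bucket = discretize(45, [30, 60], ["young", "middle", "senior"])
region = map_value("CA", {"CA": "West", "NY": "East"}, default="Other")
```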

Automatic conversion (and correction) for older versions of PMML

ADAPA consumes model files that conform to PMML, versions 2.0 through 4.2. If your model development environment exports an older version, ADAPA will automatically convert your file into a 4.2-compliant format. It will also correct a number of common problems found in PMML generated by some popular modeling tools, allowing the models to work as intended.
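The conversion itself happens inside ADAPA and is not reproduced here, but the first step, finding out which version a file declares, is easy to sketch. The Python snippet below reads the version attribute from the PMML root element and reports whether the file would need upgrading to 4.2. The function names are hypothetical and the check is a simplification.

```python
import xml.etree.ElementTree as ET

# PMML versions named in this post.
SUPPORTED_VERSIONS = ("2.0", "2.1", "3.0", "3.1", "3.2", "4.0", "4.1", "4.2")


def pmml_version(pmml_text):
    """Return the version declared on the PMML root element."""
    root = ET.fromstring(pmml_text)
    # ElementTree writes namespaced tags as "{namespace}PMML"; strip the prefix.
    tag = root.tag.rsplit("}", 1)[-1]
    if tag != "PMML":
        raise ValueError("root element is not PMML")
    return root.get("version")


def needs_conversion(pmml_text, target="4.2"):
    """True if the file declares a supported version older than the target."""
    version = pmml_version(pmml_text)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported PMML version: {version}")
    return version != target


old_model = '<PMML xmlns="http://www.dmg.org/PMML-3_2" version="3.2"/>'
flag = needs_conversion(old_model)
```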

Web-based management and interactive execution of predictive models and business rules

Model management: Models and rule sets are deployed and managed through an intuitive, Web-based management console, the ADAPA Console.

Model verification: The ADAPA Console includes a model validation test, allowing models to be verified for correctness. Given a test file containing input data and expected results for a model, the engine reports any deviations from the expected results, greatly enhancing traceability of errors and debugging of model deployment issues. The console also provides easy access to our rules testing framework, in which business rules are submitted to regression testing and acceptance.

Batch-scoring: The console also provides functionality to upload a (compressed) CSV data file and batch-score it against any of the deployed models. Results are returned in the same format and may be downloaded for further processing and visualization.
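The idea behind model verification can be sketched as follows: score every row of a test file and report the rows where the result deviates from the expected value. This is a minimal Python illustration under stated assumptions, not ADAPA's implementation. The column name "expected", the tolerance, and the stand-in model are all hypothetical.

```python
import csv
import io
import math


def verify(csv_text, model, expected_col="expected", tol=1e-6):
    """Return (row number, expected, actual) for every row that deviates."""
    mismatches = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_num, row in enumerate(reader, start=1):
        expected = float(row.pop(expected_col))
        # All remaining columns are treated as numeric model inputs.
        inputs = {name: float(value) for name, value in row.items()}
        actual = model(inputs)
        if not math.isclose(actual, expected, abs_tol=tol):
            mismatches.append((row_num, expected, actual))
    return mismatches


def stand_in_model(inputs):
    """Made-up model used only to exercise the check: y = 2x + 1."""
    return 2 * inputs["x"] + 1


TEST_FILE = "x,expected\n1,3\n2,5\n3,8\n"
report = verify(TEST_FILE, stand_in_model)
```

In this made-up test file the third row expects 8 while the stand-in model returns 7, so it is the only row reported.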

Simplified integration via SOA

Service Oriented Architecture (SOA) principles simplify integration with existing IT infrastructure. Since ADAPA publishes all deployed models as a Web-Service, you can score data records from within your own environment. With the simple execution of a web service call (SOAP or REST), you are able to leverage the power of predictive models and business rules on-demand or in real-time.
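As a rough sketch of what a REST-style scoring call could look like from the client side, the Python snippet below builds (but does not send) an HTTP request carrying one data record as JSON. The host name, the URL layout, and the payload shape are placeholders invented for this example; they are not ADAPA's documented interface, so consult the product documentation for the actual endpoints.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_scoring_request(base_url, model_name, record):
    """Build a POST request for scoring one record (hypothetical URL layout)."""
    # Placeholder path, not a documented ADAPA endpoint.
    url = f"{base_url.rstrip('/')}/models/{quote(model_name, safe='')}/score"
    body = json.dumps(record).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scoring_request(
    "https://adapa.example.com",  # placeholder host
    "churn model",
    {"age": 40, "income": 50000},
)
# Sending it would be a single call, e.g. urllib.request.urlopen(request),
# which is omitted here because the endpoint is fictional.
```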

Data scoring from inside Excel

The ADAPA Add-in for Microsoft Office Excel 2007, 2010, and 2013 allows you to easily score data using ADAPA in the Cloud. Once the Add-in is installed, all you need to do is select your data in Excel, connect to ADAPA, and start scoring right away. Your predictions will be made available as new columns.

On-demand predictive analytics solution

ADAPA in the Cloud is a fully hosted Software-as-a-Service (SaaS) solution. You only pay for the service and the capacity that is used, eliminating the need for expensive software licenses and in-house hardware resources. As the business grows, ADAPA in the Cloud provides a cost-effective expansion path, for example, by adding multiple ADAPA instances for scalability or failover. The SaaS model relieves you of the burden of managing a scalable, on-demand computing infrastructure.

Private instance for all your decisioning needs

We provide you with a single-tenant architecture. The service is implemented as a private, dedicated instance of ADAPA that encapsulates your predictive models and business rules. Only you have access to your private ADAPA instance(s) via HTTPS. Your decisioning files and data never share the same engine with other clients.

Trusted, secure, scalable cloud infrastructure

Zementis leverages FICO and Amazon EC2 for providing on-demand infrastructure for ADAPA in the Cloud. Cloud computing offers utility computing with virtually unlimited scalability.

Friday, October 5, 2012

Seamless Integration of Predictive Analytics and Business Rules

Operational deployment of predictive solutions includes exporting the data mining models you built in SAS, IBM SPSS, STATISTICA, KNIME, R, ... into PMML, the Predictive Model Markup Language. Once expressed in the PMML standard, these models can be easily moved into production: on-site, in the cloud, in Hadoop, or in-database. Zementis offers a range of products that make this possible. These include the ADAPA Decisioning Engine and the Universal PMML Plug-in. Besides providing a predictive analytics engine, ADAPA also encapsulates a rules engine, which allows predictive models to be seamlessly integrated with business rules.

In this demo, we show a pre-qualification app that uses predictive models and rules to analyze the risk of mortgage default on loan applications. An application is accepted or referred for a variety of loan products depending on its perceived risk. ADAPA is the engine driving this application in the back-end.

Once logged in, we use the ADAPA Web Console to download the mortgage solution files, which are used throughout the demo. Predictive models expressed in PMML format are uploaded and verified in ADAPA along with rulesets expressed in tabular format. The ADAPA Web Console is used for managing predictive models, rulesets, and resource files as well as for batch-scoring. Real-time scoring is obtained via web-services or the Java API.

Finally, we show how the ADAPA Add-in for Excel is used to score data directly from within Excel. This part of the demo features the scoring of loan and tax data as well as the visualization of results via dashboards.

Tuesday, October 2, 2012

Amazing In-database Analytics with PMML and UPPI

Not all analytic tasks are born the same. If one is confronted with massive volumes of data that need to be scored on a regular basis, in-database scoring sounds like the logical thing to do. In all likelihood, the data in these cases is already stored in a database and, with in-database scoring, there is no data movement. Data and models reside together; hence scores and predictions flow at an accelerated pace.

A new day has come!

Zementis is now offering its amazing Universal PMML Plug-in™ (UPPI) for in-database scoring for the IBM Netezza appliance, SAP Sybase IQ, EMC Greenplum, Teradata and Teradata Aster.

Amazing! Why?

For starters, it won't break your budget (feel free to contact us for details). Also, it is simple to deploy and maintain. Our Universal PMML Plug-in was designed from the ground up to take advantage of efficient in-database execution. Last but not least, as its name suggests, it is PMML-based. PMML, the Predictive Model Markup Language, is the standard for representing predictive models currently exported from all major commercial and open-source data mining tools. So, if you build your models in SAS, IBM SPSS, STATISTICA, or R, you are ready to start benefiting from in-database scoring right away.

The Universal PMML Plug-in seamlessly embeds models within your database. Data scoring requires nothing more than adding a simple function call to your SQL statements. You can score data against one model or against multiple models at the same time. There is no need to code complex data transformations and calculations in SQL or stored procedures. PMML and our Universal Plug-in can easily take care of that.
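The pattern of scoring through a function call inside SQL can be illustrated with Python's built-in sqlite3 module, which lets you register a user-defined function and call it from a query. This is only an analogy under stated assumptions: SQLite stands in for the supported databases, and the function name, table, and scoring formula are made up. UPPI's actual function names and deployment steps are not shown here.

```python
import sqlite3


def churn_score(age, monthly_spend):
    """Stand-in for a deployed model: a made-up linear score."""
    return 0.01 * age + 0.002 * monthly_spend


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("churn_score", 2, churn_score)

conn.execute("CREATE TABLE customers (id INTEGER, age REAL, monthly_spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 30, 100), (2, 50, 250)],
)

# Scoring is just another expression in the SELECT list; the data never
# leaves the database.
rows = conn.execute(
    "SELECT id, churn_score(age, monthly_spend) AS score "
    "FROM customers ORDER BY id"
).fetchall()
```

Scoring against several models at once would simply mean adding more function calls to the same SELECT list.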

Modeling techniques currently supported are:

Neural Networks
Support Vector Machines
Naive Bayes Classifiers
Ruleset Models
Clustering Models (including Two-Step Clustering)
Decision Trees
Regression Models (including Cox Regression Models)
Scorecards (including reason codes)
Association Rules
Multiple Models (model composition, chaining, segmentation, and ensemble - including Random Forest models)

UPPI also offers extensive data pre- and post-processing capabilities.

In addition to all these predictive techniques, UPPI accepts PMML models of all versions (2.0, 2.1, 3.0, 3.1, 3.2, 4.0, 4.1 and 4.2) generated by any of the major commercial and open source mining tools (SAS, SPSS/IBM, STATISTICA, MicroStrategy, Microsoft, Oracle, KXEN, Salford Systems, TIBCO, R/Rattle, KNIME, RapidMiner, etc.). It does not get more universal than this!