Showing posts with label Model Deployment. Show all posts

Wednesday, May 28, 2014

Online PMML Course @ UCSD Extension: Register today!

The Predictive Model Markup Language (PMML) is the de facto standard for representing predictive analytics and data mining models. It allows a predictive model built in one application to be moved to another without any re-coding, which makes it imperative for companies wanting to extract value and insight from Big Data. Given the volume and velocity associated with Big Data, one cannot spend weeks or months re-coding a predictive model into the IT operational environment where it actually produces value (the fourth V in Big Data); model deployment must be agile.

Also, as predictive models become more complex through the use of random forest models, model ensembles, and deep learning neural networks, PMML becomes even more relevant since model recoding is simply not an option.

Zementis has teamed up with UCSD Extension to offer the first online PMML course. This is a great opportunity for individuals and companies alike to master PMML, rally their predictive analytics resources around a single standard, and, in doing so, benefit from all it has to offer.

http://extension.ucsd.edu/studyarea/index.cfm?vAction=singleCourse&vCourse=CSE-41184

Course Benefits
  • Learn how to represent an entire data mining solution using open-standards
  • Understand how to use PMML effectively as a vehicle for model logging, versioning and deployment
  • Identify and correct issues with PMML code as well as add missing computations to auto-generated PMML code

Course Dates

07/14/14 - 08/25/14

PMML is supported by most commercial and open-source data mining tools. Companies and tools that support PMML include IBM SPSS, SAS, R, SAP KXEN, Zementis, KNIME, RapidMiner, FICO, StatSoft, Angoss, Microstrategy ... The standard itself is very mature and its latest release is version 4.2.

For more details about PMML, please visit the Zementis PMML Resources page.


Scoring Data from MySQL or SQL Server using KNIME and ADAPA

The video below shows the use of KNIME for handling data (reading data from a flat file and/or a database) as well as model building (data massaging and training a neural network). It also highlights how easy and straightforward it is to move a predictive model represented in PMML, the Predictive Model Markup Language, into the Zementis ADAPA Scoring Engine. ADAPA is then used for model deployment and scoring. PMML is the de facto standard to represent data mining models. It allows for predictive models to be moved between applications and systems without the need for model re-coding.

When training a model, data scientists rely on historical data; but once the model is in regular use, it is moved, or deployed, into production, where it is presented with new data. ADAPA provides a scalable and blazingly fast scoring engine for models in production. And, although KNIME's data mining nodes are typically used by scientists to build models, its database and REST nodes can just as easily be used to create a flow that reads data from a database (MySQL, SQL Server, Oracle, ...) and passes it to ADAPA for scoring via its REST API.

 

The use cases covered are:

  1. Read data from a flat file, use KNIME for data pre-processing and for building a neural network model. Export the entire predictive workflow as a PMML file, then upload and score that file in ADAPA via its Admin Web Console. 
  2. Read data from a database (MySQL, SQL Server, Oracle, ...), build the model in KNIME, export it as a PMML file, and deploy it in ADAPA using the REST API. This use case also shows new or testing data flowing from the database into ADAPA for scoring via a sequence of KNIME nodes. The video also shows how KNIME nodes can simply read a PMML file produced by any PMML-compliant data mining tool (R, SAS EM, SPSS, ...), upload it to ADAPA using the REST API, and score new data from MySQL, again through the REST interface. Note that in this case the model has already been trained; KNIME is used only to deploy the existing PMML file in ADAPA for scoring.
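The deploy-then-score flow in the second use case can be sketched in a few lines of Python. The endpoint paths, parameter names, and payload shape below are placeholders for illustration only, not the actual ADAPA REST API; consult the ADAPA documentation for the real interface.

```python
import json
import urllib.request

# Hypothetical base URL -- replace with your scoring engine's address.
BASE_URL = "https://adapa.example.com/api"

def upload_request(pmml_bytes, model_name):
    """Build a request that would upload a PMML model to the scoring engine."""
    return urllib.request.Request(
        f"{BASE_URL}/model/{model_name}",
        data=pmml_bytes,
        headers={"Content-Type": "application/xml"},
        method="PUT",
    )

def score_request(model_name, record):
    """Build a request that would score one record against a deployed model."""
    body = json.dumps({"record": record}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/apply/{model_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = score_request("neural_net", {"sepal_length": 5.1, "sepal_width": 3.5})
print(req.full_url)       # https://adapa.example.com/api/apply/neural_net
print(req.get_method())   # POST
```

The same two calls are all a KNIME REST node performs under the hood: one PUT-style request to deploy the PMML file, then one POST per batch of rows read from the database.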

 

Zementis and SAP HANA: Real-time Scoring for Big Data

The Zementis partnership with SAP is manifesting itself in a number of ways. Two weeks ago we were part of the SAP Big Data Bus parked outside Wells Fargo in San Francisco. This week, we would like to share with you three new developments.

1) ADAPA is now being offered at the SAP HANA Marketplace.

2) An interview with our CEO, Mike Zeller, was just featured by SAP on the SAP Blogs.


3) Zementis was again part of the SAP Big Data Bus and the "Big Data Theatre". This time, the bus was parked outside US Bank in Englewood, Colorado. We had a myriad of conversations with the many people who came through the bus about how ADAPA and SAP HANA work together to bring predictive analytics and real-time scoring to transactional data and millions of accounts, in any industry.

Visit the Zementis ADAPA for SAP HANA page for more details on the Zementis and SAP real-time solution for predictive analytics.




Thursday, February 27, 2014

PMML 4.2 is here! What changed? What is new?

PMML 4.2 is out! That's really great news. The DMG (Data Mining Group) has been working on this new version of PMML for over two years. And I can truly say, it is the best PMML ever! If you haven't seen the press release for the new version, take a look:

http://www.kdnuggets.com/2014/02/data-mining-group-pmml-v42-predictive-modeling-standard.html

What changed?

PMML is a very mature language, so there are no dramatic changes at this point. One noteworthy change is that older versions of PMML called the target field of a predictive model "predicted". This was confusing, since a predicted field is usually the result of scoring or executing a model: the score, so to speak. PMML 4.2 clears things up: the target field is now simply "target". A small change, but a huge step towards making it clear that the Output element is where the predicted outputs should be defined.
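As a sketch of what this looks like in a model's mining schema (field names here are made up for illustration):

```xml
<MiningSchema>
  <MiningField name="petal_length" usageType="active"/>
  <!-- PMML 4.2: usageType "target" replaces the older value "predicted" -->
  <MiningField name="species" usageType="target"/>
</MiningSchema>
```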

Continuous Inputs for Naive Bayes Models

This is a great new enhancement to the NaiveBayes model element. We wrote an entire paper about this new feature and presented it at the KDD 2013 PMML Workshop. If you use Naive Bayes models, you should definitely take a look at our article.


And now you can benefit from having our proposed changes in PMML itself! This is really remarkable, and we are all already benefiting from it. The Zementis Py2PMML (Python to PMML) Converter uses the proposed changes to convert Gaussian Naive Bayes models from scikit-learn to PMML.

Complex Point Allocation for Scorecards

The Scorecard model element was introduced to PMML in version 4.1. It was a good element then, but it is really great now in PMML 4.2. We added a way to compute complex values for the allocation of points for an attribute (under a certain characteristic) through the use of expressions. That means you can use input or derived values to derive the actual value of the points. Very cool!

Andy Flint (FICO) and I wrote a paper about the Scorecard element for the KDD 2011 PMML Workshop. So, if you haven't seen it yet, it will get you started into how to use PMML to represent scorecards and reason codes.

Revised Output Element

The Output element was completely revised and is much simpler to use. With PMML 4.2, you have direct access to all the model outputs, plus all post-processing, directly from the attribute "feature".

The attribute segmentId also allows users to output particular fields from segments in a multiple model scenario. 

The newly revised output element spells flexibility. It allows you to get what you need out of your predictive solutions.
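A sketch of the revised element (field names, values, and the segment id below are invented for illustration): the "feature" attribute selects what to output, and "segmentId" reaches into an individual segment of a multiple-model solution:

```xml
<Output>
  <OutputField name="predicted_species" feature="predictedValue"/>
  <OutputField name="prob_setosa" feature="probability" value="setosa"/>
  <!-- pull the prediction of one member model out of a segmentation -->
  <OutputField name="tree_3_vote" feature="predictedValue" segmentId="3"/>
</Output>
```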

For a complete list of all the changes in PMML 4.2 (small and large), see:

What is new?

PMML 4.2 introduces regular expressions to PMML, solely so that users can process text more efficiently. The most straightforward additions are three new built-in functions for concatenating, replacing and matching strings using regular expressions.
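The semantics these built-ins expose are the familiar regex operations. In Python terms (this illustrates only the behavior such functions compute, not PMML syntax):

```python
import re

# Matching: does a field value match a pattern? (e.g. validate a ZIP code)
assert re.search(r"^\d{5}$", "92121") is not None

# Replacing: normalize free text before feature extraction
cleaned = re.sub(r"\s+", " ", "too   many    spaces")
assert cleaned == "too many spaces"

# Concatenation then matching, as a chain of PMML Apply elements might compose them
area, number = "858", "5550101"
assert re.fullmatch(r"\d{10}", area + number) is not None
```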

The more elaborate addition is the incorporation of a brand new transformation element in PMML to extract term frequencies from text. The ideas for this element were presented at the KDD 2013 PMML Workshop by Benjamin De Boe, Misha Bouzinier, Dirk Van Hyfte (InterSystems). Their paper is a great resource for finding out the details behind the ideas that led to the new text mining element in PMML. 


Obviously, the changes described above are also new, but it was nice to break the news into two pieces. For the grand finale though, nothing better than taking a look at PMML 4.2 itself.


Enjoy!

Wednesday, October 9, 2013

CIO Review: Zementis selected as one of the top 20 most promising big data companies

Selected by a distinguished panel of CEOs, CIOs, VCs, industry analysts and the editorial board of CIO Review, Zementis has been named one of the "Top 20 Most Promising Big Data Companies in 2013." Congratulations, Zementis!

Read CIO Review - FULL ARTICLE


That comes as no surprise, since Zementis is all about kicking down barriers to the fast deployment and execution of predictive solutions. By leveraging the PMML (Predictive Model Markup Language) standard, Zementis' products allow predictive models built anywhere (IBM SPSS, KXEN, KNIME, R, SAS, ...) to be deployed right away on-site, in the cloud (Amazon, IBM, FICO), in-database (Pivotal/Greenplum, SAP Sybase IQ, IBM PureData for Analytics/Netezza, Teradata and Teradata Aster) or in Hadoop (Hive or Datameer).


Predictive analytics has been used for many years to learn patterns from historical data to literally predict the future. Well known techniques include neural networks, decision trees, and regression models. Although these techniques have been applied to a myriad of problems, the advent of big data, cost-efficient processing power, and open standards have propelled predictive analytics to new heights.


Big data involves large amounts of structured and unstructured data that are captured from people (e.g., on-line transactions, tweets, ... ) as well as sensors (e.g., GPS signals in mobile devices). With big data, companies can now start to assemble a 360 degree view of their customers and processes. Luckily, powerful and cost-efficient computing platforms such as the cloud and Hadoop are here to address the processing requirements imposed by the combination of big data and predictive analytics.

Creating predictive solutions is just part of the equation. Once built, they need to be transitioned to the operational environment where they are actually put to use. In the agile world we live in today, the Predictive Model Markup Language (PMML) delivers the representational power needed for solutions to be quickly and easily exchanged between systems, allowing predictions to move at the speed of business.

Zementis' PMML-based products, ADAPA for real-time scoring and UPPI for big data scoring, are designed from the ground up to deliver the agility necessary for models to be easily deployed on a variety of platforms and put to work right away.


Zementis ADAPA and UPPI kick down the barriers for big data adoption!

Wednesday, October 2, 2013

R PMML Support: BetteR than EveR

How does it work? Simple! Once you build your model in R using any of the PMML-supported model types, pass the model object as an input parameter to the pmml function, as shown in the figure below.

pmml package

The pmml package offers export for a variety of model types, including:

   •   ksvm (kernlab): Support Vector Machines 
   •   nnet: Neural Networks 
   •   rpart: C&RT Decision Trees 
   •   lm & glm (stats): Linear and Binary Logistic Regression Models 
   •   arules: Association Rules 
   •   kmeans and hclust: Clustering Models 
   •   multinom (nnet): Multinomial Logistic Regression Models 
   •   glm (stats): Generalized Linear Models for classification and regression with 
         a wide variety of link functions 
   •   randomForest: Random Forest Models for classification and regression 
   •   coxph (survival): Cox Regression Models to calculate survival and stratified 
         cumulative hazards 
   •   naiveBayes (e1071): Naive Bayes Classifiers 
   •   glmnet: Linear ElasticNet Regression Models 
   •   ada: Stochastic Boosting (coming soon) 
   •   svm (e1071): Support Vector Machines (coming soon)

The pmml package can also export data transformations built with the pmmlTransformations package (see below). In addition, it can be used to merge two distinct PMML files into one: for example, if transformations and model were saved into separate PMML files, it can combine both, as described in Chapter 5 of the PMML book, PMML in Action.

Data Transformations - the R pmmlTransformations Package

The pmmlTransformations package transforms data and, when used in conjunction with the pmml package, allows for data transformations to be exported together with the predictive model in a single PMML file. Transformations currently supported are:

   •   Min-max normalization 
   •   Z-score normalization 
   •   Dummy-fication of categorical variables 
   •   Value Mapping 
   •   Variable renaming
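In the generated PMML, such transformations become DerivedField elements. For instance, min-max normalization of a hypothetical input field maps through NormContinuous (the field name and range below are made up):

```xml
<LocalTransformations>
  <!-- scale raw income from [0, 200000] into [0, 1] -->
  <DerivedField name="income_scaled" dataType="double" optype="continuous">
    <NormContinuous field="income">
      <LinearNorm orig="0" norm="0"/>
      <LinearNorm orig="200000" norm="1"/>
    </NormContinuous>
  </DerivedField>
</LocalTransformations>
```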

To learn more about this package, check out the paper we presented at the KDD 2013 PMML Workshop.

Tuesday, September 10, 2013

Predictive model deployment with PMML

Model deployment used to be a big task. Predictive models, once built, needed to be re-coded into production to be able to score new data. This process was prone to errors and could easily take up to six months. Re-coding of predictive models has no place in the big data era we live in. Since data is changing rapidly, model deployment needs to be instantaneous and error-free.

PMML, the Predictive Model Markup Language, is the standard to represent predictive models. Given that PMML can be produced by all the top commercial and open-source data mining tools (e.g., FICO Model Builder, SAS EM, IBM SPSS, R, KNIME, ...), a predictive model can be easily moved into the production environment once it is represented as a PMML file.

Zementis offers ADAPA for real-time scoring and UPPI for big data scoring, which make the entire model deployment process a no-brainer. Since ADAPA and UPPI are universal PMML consumers (they accept any version of PMML produced by any PMML-compliant tool), they can make predictive models instantly available for execution inside the production environment.


Check out the Zementis website for details.

Predictive Models with PMML - Upcoming workshop at UCSD Extension - Oct 24-25

October 24-25, 2013
San Diego Supercomputer Center (SDSC), UC San Diego Campus
TO REGISTER, FOLLOW THE LINK BELOW:

The Predictive Model Markup Language (PMML) is the de facto standard to represent data mining and predictive analytic models. With PMML, one can easily share a predictive solution among PMML-compliant applications and systems.
Developed in partnership with the San Diego Supercomputer Center’s (SDSC) Predictive Analytics Center of Excellence (PACE), this 2-day, hands-on workshop will explore how the PMML language allows models to be deployed in minutes. You will get to know its business value and the data mining tools and companies supporting PMML. You will also begin to understand the language's elements and capabilities and learn how to effectively extract the most out of your PMML code.


Workshop Benefits
  • Practice PMML on SDSC’s Gordon with the guidance of world class instructors from industry and academia.
  • Learn how to represent an entire data mining solution using open-standards
  • Understand how to use PMML effectively as a vehicle for model logging, versioning and deployment
  • Identify and correct issues with PMML code as well as add missing computations to auto-generated PMML code
  • PLUS…Receive a comprehensive tour of SDSC to discover its inner workings, extensive capabilities and current projects.
Instructors
  • Alex Guazzelli, Ph.D., Vice President of Analytics, Zementis, Inc.
  • Natasha Balac, Ph.D., Director of PACE, SDSC, UC San Diego
  • Paul Rodriguez, Ph.D., Research Programmer Analyst, SDSC, UC San Diego
Scholarships Available!
Thanks to the generous underwriting of Zementis, three (3) half-tuition scholarships are available.
 Learn more and apply
Note: Students should have a fundamental knowledge of data mining methods and basic experience with a computer programming language. Students must bring a laptop (Mac or PC) each day to fully participate in the hands-on portion of the workshop.
Course Number: CSE-41184   Credit: 2 units
This course is part of the following Certificate Program(s):

Wednesday, August 21, 2013

R and PMML Support


A PMML package for R that exports all kinds of predictive models is available directly from CRAN.
Traditionally, the pmml package offered support for the following data mining algorithms:
  • ksvm (kernlab): Support Vector Machines
  • nnet: Neural Networks
  • rpart: C&RT Decision Trees 
  • lm & glm (stats): Linear and Binary Logistic Regression Models 
  • arules: Association Rules
  • kmeans and hclust: Clustering Models 
Recently, it has been expanded to support: 
  • multinom (nnet): Multinomial Logistic Regression Models
  • glm (stats): Generalized Linear Models for classification and regression with a wide variety of link functions 
  • randomForest: Random Forest Models for classification and regression
  • coxph (survival): Cox Regression Models to calculate survival and stratified cumulative hazards
  • naiveBayes (e1071): Naive Bayes Classifiers
  • glmnet: Linear ElasticNet Regression Models
The pmml package can also export data transformations built with the pmmlTransformations package (see below). It can also be used to merge two distinct PMML files into one. For example, if transformations and model were saved into separate PMML files, it can combine both files into one, as described in Chapter 5 of the PMML book, PMML in Action.

How does it work?

Simple: once you build your model using any of the supported model types, pass the model object as an input parameter to the pmml function, as shown in the figure below:


Example - sequence of R commands used to build a linear regression model using lm and the Iris dataset:

Documentation

For more on the pmml package, please take a look at the paper we published in The R Journal. For that, just follow the link below:
1) Paper: PMML: An Open Standard for Sharing Models
Also, make sure to check out the package's documentation from CRAN:
2) CRAN: pmml Package

R PMML Transformations Package

This is a brand new R package. Called pmmlTransformations, this package transforms data and, when used in conjunction with the pmml package, allows data transformations to be exported together with the predictive model in a single PMML file. Transformations currently supported are:
  • Min-max normalization
  • Z-score normalization
  • Dummy-fication of categorical variables
  • Value Mapping
  • Discretization (binning)
  • Variable renaming
If you would like to contribute code to the pmmlTransformations package, please feel free to contact us.

How does it work?

The pmmlTransformations package works in tandem with the pmml package so that data pre-processing can be represented together with the model in the resulting PMML code. 
In R, as shown in the figure below, this process includes three steps:
  1. With the use of the pmmlTransformations package, transform the raw input data as appropriate
  2. Use transformed and raw data as inputs to the modeling function/package (hclust, nnet, glm, ...)
  3. Output the entire solution (data pre-processing + model) in PMML using the pmml package

Example - sequence of R commands used to build a linear regression model using lm with transformed data


Documentation

For more on the pmmlTransformations package, please take a look at the paper we wrote for the KDD 2013 PMML Workshop. For that, just follow the link below:
1) KDD Paper: The R pmmlTransformations Package
Also, make sure to check out the package's documentation from CRAN:
2) CRAN: pmmlTransformations Package

Wednesday, July 10, 2013

PMML Workshop at KDD 2013 and UCSD Extension PMML Class

KDD 2013 PMML Workshop

Join us for the KDD PMML Workshop to be held in Chicago on August 11. Organized by the Data Mining Group (DMG), this workshop will feature invited talks and presentations of selected papers.

Zementis will be presenting two papers about PMML support in R: coding and representing data transformations and models through the pmmlTransformations and pmml packages.


UCSD PMML Class (Coming this Fall)

UCSD Extension has teamed up with the San Diego Supercomputer Center Predictive Analytics Center of Excellence (PACE) and Zementis to offer a PMML class to the data mining community on October 24 and 25.


For more information about this great opportunity to learn the standard that is revolutionizing how predictive solutions are documented and deployed, refer to the UCSD Extension catalog.

Tuesday, May 7, 2013

The Zementis Partnership with FICO


Stuart Wells, FICO CTO, announced the strategic partnership between Zementis and FICO at FICO World on May 2, 2013. FICO clients will now benefit from the outstanding Zementis scoring technology.

How? The Zementis ADAPA scoring engine provides a highly scalable framework to deploy, integrate, and execute complex data mining and predictive models based on the PMML standard. Models built in most commercial and open-source data mining tools, such as FICO Model Builder or R, can now be instantly deployed in the FICO Analytic Cloud.

Customers, application developers and FICO partners will be able to extract value and insight from their predictive models and data immediately, using ADAPA and PMML. This will result in quicker time to innovation and value on their analytic applications.

Read the press release!

Predictive Analytics Deployment

Zementis offers software solutions that enable scalable, real-time execution of predictive analytics across a variety of platforms based on the PMML standard. These include:

ADAPA Scoring Engine: Our solution for real-time scoring. ADAPA is available for on-site deployment as a traditional license or as a service in the Amazon Elastic Compute Cloud (EC2) and IBM SmartCloud Enterprise. And now, with our FICO partnership, ADAPA will also be available in the FICO Analytic Cloud.

UPPI, the Universal PMML Plug-in: The leading solution for Big Data, UPPI provides scoring in-database and for Hadoop. It is available for EMC Greenplum, IBM Netezza, SAP Sybase IQ, Teradata/Aster as well as Hadoop/Hive and Datameer. 

Friday, April 12, 2013

The Zementis Partnership with Infocom in Japan


It is our pleasure to announce a strategic partnership with Infocom. If you missed out on our press release, here is the headline:

Zementis and Infocom partner to deliver predictive analytic solutions in Japan.

Partnership

Dedicated to the Japanese market, Infocom combines strong expertise in data mining and predictive analytics with extensive delivery and consulting capabilities.

Zementis offers software solutions that enable scalable, real-time execution of predictive analytics across a variety of platforms based on the PMML standard. These include the ADAPA Scoring Engine available for on-site deployment or in the cloud, and UPPI, the Universal PMML Plug-in for in-database scoring and Hadoop (available for IBM Netezza, Teradata/Aster, EMC Greenplum, SAP Sybase IQ as well as Hadoop and Datameer).

Infocom will market, distribute and support Zementis' predictive analytics software in Japan.

To take a look at the press release, click HERE.


Thursday, March 21, 2013

R PMML Support: BetteR than EveR

Once represented as a PMML file, a predictive solution (data transformations + model) can be readily moved into the operational environment where it can be put to work immediately. That's the promise of PMML.
R 2 PMML

R is living up to that promise through its strong PMML export capabilities. The latest addition to the list of supported model types is Naive Bayes classifiers. More specifically, the R pmml package allows PMML export for Naive Bayes models built using the naiveBayes function of the e1071 package.

For more details and for a complete list of supported model types (as well as data pre-processing), click HERE.

Thursday, March 7, 2013

Making the case for PMML and ADAPA

If you are not familiar with PMML, the Predictive Model Markup Language, you may be wondering what all the fuss is about ...

PMML is the de facto standard to represent data mining and predictive analytic solutions. With PMML, one can easily share a predictive solution among PMML-compliant applications and systems. For example, you can build your model in R, export it in PMML, and use ADAPA, the Zementis Scoring Engine, to deploy it in production.

Many data mining models are a one-time affair. You use historical data to build the model and use it to analyze ... historical data. Wait! That sounds more like descriptive analytics, not predictive analytics. Well, that is sort of true. To be truly predictive, a data mining model needs to be applied to new data. These are the models that need to be operationally deployed and, from my point of view, these are the solutions that are truly revolutionizing the way we do business and live in the Big Data world.

If you want to use your data mining model to make predictions on new data, it needs to be a dynamic asset, not a static one. You need to be able to build it and instantly put it to use. And that's where PMML and ADAPA come in handy.
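Because PMML is plain XML, a deployed model is also a transparent one: inspecting what you are about to put into production takes only a few lines. A minimal Python sketch (the tiny sample document below is handwritten for illustration, not produced by any real tool):

```python
import xml.etree.ElementTree as ET

SAMPLE_PMML = """\
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="toy example"/>
  <DataDictionary numberOfFields="1">
    <DataField name="x" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression"/>
</PMML>"""

NS = "{http://www.dmg.org/PMML-4_2}"

root = ET.fromstring(SAMPLE_PMML)
version = root.get("version")
# Top-level children whose tag ends in "Model" are the model elements.
model_tags = [child.tag.replace(NS, "") for child in root
              if child.tag.endswith("Model")]

print(version)      # 4.2
print(model_tags)   # ['RegressionModel']
```

A quick check like this (which PMML version, which model type, which fields) is often all that is needed before handing the file to a scoring engine.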

Obviously, a few data mining tools try to lock you in. You happily build the model using tool A, just to realize that you need the same tool to execute it. In this case, you are missing out. Here are some of the benefits of moving your predictive model to ADAPA:
  • Overcome speed/memory limitations
  • Dramatically lower your infrastructure cost
  • Tap into all the advantages of cloud computing with ADAPA on the Cloud (IBM SmartCloud or Amazon EC2)
  • Produce scores in real-time (using Web Services or Java API), on-demand, or batch-mode
  • Execute your models directly from Excel, by using the ADAPA Add-in for Excel
  • Benefit from using a set of PMML-compliant model development tools (best of breed)
  • Deploy your models in minutes
  • Manage models via Web Services or a Web console
  • Upload one or many models into ADAPA at once
  • Benefit from the seamless integration of business rules and predictive models (yes, for those who need it, ADAPA comes with a business rules engine)
PMML and ADAPA allow you to use best of breed tools (not the same old tool) for the job at hand. Also, you can leverage the expertise from a diverse group of data scientists. That means, not all your data scientists need to be experts on a single tool. They can use different tools that share one thing in common, the PMML standard. And, once represented in PMML, models can be easily understood by all team members. PMML allows for transparency and, in doing so, fosters best practices.



Why not benefit from: 1) an open standard to represent data mining models; and 2) a proven scoring engine that consumes any version of PMML and makes it available for execution right away, in real-time?

Keep also in mind that ADAPA's sister product, the Universal PMML Plug-in (UPPI), allows you to move the same PMML file in-database or Hadoop. UPPI is currently available for EMC Greenplum, SAP Sybase IQ, IBM Netezza, and Teradata/Aster. With UPPI for in-database scoring, there is no need to move your data outside the database. Data and models reside inside it and so there is minimal data movement and maximum scoring speed. UPPI is also available for Datameer and will soon be available for Hadoop/Hive.

Making a model operational in minutes has never been easier! And, it is all because of PMML and scoring tools such as ADAPA and UPPI.

Monday, February 25, 2013

The Zementis Partnership with Teradata

The partnership between Zementis and Teradata allows customers with a variety of data mining tools to efficiently deploy predictive models based on the Predictive Model Markup Language (PMML) standard.  Focused on Big Data applications, the Universal PMML Plug-in (UPPI) for Teradata enables scalable execution of standards-based predictive analytics directly within the Teradata data warehouse.

To read more about the benefits of running your predictive solutions inside Teradata and Teradata Aster, please visit:

http://www.teradata.com/templates/Partners/PartnerProfile.aspx?id=12884902321


PMML Scoring

Zementis offers a range of products that make possible the deployment of predictive solutions and data mining models built in all the top commercial and open-source data mining vendors. Our products include the ADAPA Scoring Engine for real-time scoring and UPPI, which is currently available for a host of database platforms as well as Hadoop/Datameer. For a list of available platforms, please visit our in-database products page.

Rationale

Not all analytic tasks are born the same. If one is confronted with massive volumes of data that need to be scored on a regular basis, in-database scoring is the logical thing to do. In all likelihood, the data in this case is already stored in a database and, with in-database scoring, there is no data movement. Data and models reside together, so scores and predictions flow at an accelerated pace.





Wednesday, January 9, 2013

PMML, Big Data, and Hadoop: Predictive Analytics at Work!


Big Data and Hadoop are somewhat synonymous terms these days, since the latter offers an important technological platform for tackling the challenge of analyzing large volumes of data. By the same token, predictive analytics is paramount for companies to extract value and insight from big data. It is in this context that Zementis brings its standards-based predictive scoring engine to a variety of Big Data platforms, including the cloud as well as in-database. By offering the Universal PMML Plug-in (UPPI) for Hadoop, Zementis takes a big step toward letting companies around the globe easily deploy, execute, and integrate scalable, standards-based predictive analytics on a massively parallel scale, through the use of Hive, a data warehouse system for Hadoop, and Datameer, an end-to-end BI solution that works on top of Hadoop.
UPPI brings together essential technologies, offering the best combination of open standards and scalability for the application of predictive analytics. It fully supports the Predictive Model Markup Language (PMML), the de facto standard for data mining applications, which enables the integration of predictive models from IBM/SPSS, SAS, R, and many more.
UPPI for Hadoop/Hive
Hive makes it possible for large datasets stored in Hadoop compatible systems to be easily analyzed. Since it provides a mechanism to project structure onto the data, Hive allows for queries to be made using a SQL-like language called HiveQL.
Once deployed in UPPI, predictive models turn into UDFs (user-defined functions), which can then be invoked directly in HiveQL. In this way, UPPI offers Hadoop users the best combination of open standards and scalability for the application of predictive analytics.
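Invoking such a UDF looks like ordinary HiveQL. In this sketch, the jar name, function name, class, table, and columns are all hypothetical, purely to illustrate the pattern:

```sql
-- Register the (hypothetical) UDF generated for a deployed PMML model,
-- then score a table of customer records in ordinary HiveQL.
ADD JAR uppi-hive.jar;
CREATE TEMPORARY FUNCTION score_churn AS 'com.example.ScoreChurnUDF';

SELECT customer_id,
       score_churn(tenure, monthly_charges, num_complaints) AS churn_score
FROM   customer_features;
```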

UPPI for Hadoop/Hive delivers instant and scalable scoring for Big Data while retaining compatibility with most major data mining tools through the PMML standard. It also brings the scalability of Hadoop to the execution of predictive analytics.



UPPI for Datameer
Zementis and Datameer have partnered to deliver standards-based execution of predictive analytics on a massively parallel scale. This joint solution combines the Zementis plug-in for execution of predictive models with the power and scale of Datameer, an end-to-end BI solution that includes data source integration, an analytics engine, visualization, and dashboarding.
Datameer uses Apache Hadoop, a Java-based framework that supports the parallel storage and processing of large data sets in a distributed environment, as its back-end storage and processing engine, allowing it to scale cost-effectively to 4,000 servers and petabytes of data. It provides wizard-based integration of large structured and unstructured datasets, integrated analytics with a familiar spreadsheet-like interface and over 200 built-in analytic functions, and drag-and-drop reporting and dashboard visualization for end users. Open APIs for data integration, analytics, and dashboarding make it easy to access custom data sources, utilize advanced or custom analytics such as predictive modeling, and build custom visualizations.
Predictive Scoring for Hadoop - Advantages
UPPI for Datameer delivers instant and scalable scoring for Big Data while retaining compatibility with most major data mining tools through the PMML Standard. Through its versatile deployment solution, the Zementis/Datameer partnership:
  • Brings the scalability of Hadoop to the execution of predictive analytics
  • Supports PMML to avoid time-consuming and expensive one-off predictive analytics projects
  • Integrates data from multiple data sources and formats without complex data and schema mappings that are time consuming to set up and difficult to change
  • Provides cost effective storage and processing of large volumes of highly granular data that predictive applications often require
  • Brings together a 100% standards-based approach to analytics that lowers total cost of ownership and increases reuse control and flexibility for orchestrating critical day-to-day business decisions.

Thursday, November 8, 2012

Model Deployment with PMML, the Predictive Model Markup Language


The idea behind this demo is to show you how easy it is to operationally deploy a predictive solution once it is represented in PMML, the Predictive Model Markup Language.

As a model building environment, I use KNIME to generate a neural network model for predicting customer churn. Once the data pre-processing steps and the model are represented in PMML, I go on to deploy them in the Amazon Cloud using the ADAPA Scoring Engine and on top of Hadoop using the Universal PMML Plug-in (UPPI) for Datameer. So, the very same model is readily available for execution in two very distinct Big Data platforms: cloud and Hadoop.



This ease of model deployment and interoperability between platforms is the power of PMML, the de facto standard for predictive analytics and data mining models.

Resources:

  1. Download the KNIME workflow used to generate a sample neural network for predicting churn
  2. Download the PMML file created during the demo

Wednesday, October 31, 2012

When Big Data and Predictive Analytics Collide


Big Data is usually defined in terms of Volume, Variety, and Velocity (the so-called 3 Vs). Volume implies breadth and depth, while variety is simply the nature of the beast: on-line transactions, tweets, text, video, sound, ... Velocity, on the other hand, implies that data is being produced amazingly fast (according to IBM, 90% of the data that exists today was generated in the last 2 years), but also that it gets old pretty fast. In fact, some data varieties tend to age more quickly than others.

To be able to tackle Big Data, systems and platforms need to be robust, scalable, and agile.

It is in this context that IntelliFest 2012 came to be. The conference theme this year was "Intelligence in the Cloud", exploring the use of applied AI in cloud computing, mobile apps, Big Data, and many other application areas. Among the several amazing speakers at IntelliFest were Stephen Grossberg from Boston University, Rajat Monga from Google, Carlos Serrano-Morales from Sparkling Logic, Paul Vincent from TIBCO, and Alex Guazzelli from Zementis.

Dr. Alex Guazzelli's talk on Big Data, Predictive Analytics, and PMML is now available for on-demand viewing on YouTube. The abstract follows below, together with several resources including the presentation slides and files used in the live demo.



Abstract:

Predictive analytics has been used for many years to learn patterns from historical data to literally predict the future. Well known techniques include neural networks, decision trees, and regression models. Although these techniques have been applied to a myriad of problems, the advent of big data, cost-efficient processing power, and open standards have propelled predictive analytics to new heights.

Big data involves large amounts of structured and unstructured data that are captured from people (e.g., on-line transactions, tweets, ... ) as well as sensors (e.g., GPS signals in mobile devices). With big data, companies can now start to assemble a 360 degree view of their customers and processes. Luckily, powerful and cost-efficient computing platforms such as the cloud and Hadoop are here to address the processing requirements imposed by the combination of big data and predictive analytics.

But, creating predictive solutions is just part of the equation. Once built, they need to be transitioned to the operational environment where they are actually put to use. In the agile world we live today, the Predictive Model Markup Language (PMML) delivers the necessary representational power for solutions to be quickly and easily exchanged between systems, allowing for predictions to move at the speed of business.

This talk will give an overview of the colliding worlds of big data and predictive analytics. It will do that by delving into the technologies and tools available in the market today that allow us to truly benefit from the barrage of data we are gathering at an ever-increasing pace.

Resources:

  1. Download the presentation slides
  2. Download the KNIME workflow used to generate a sample neural network for predicting churn
  3. Download the PMML file created during the demo




Wednesday, October 17, 2012

Big data insights through predictive analytics, open-standards and cloud computing

Organizations increasingly recognize the value that predictive analytics and big data offer to their business. The complexity of development, integration, and deployment of predictive solutions, however, is often considered cost-prohibitive for many projects. In light of mature open source solutions, open standards, and SOA principles we propose an agile model development life cycle that quickly leverages predictive analytics in operational environments.

Starting with data analysis and model development, you can effectively use the Predictive Model Markup Language (PMML) standard to move complex decision models from the scientist's desktop into a scalable production environment hosted in the cloud (Amazon EC2 and IBM SmartCloud Enterprise).

Expressing Models in PMML

PMML is an XML-based language used to define predictive models. It was specified by the Data Mining Group, an independent group of leading technology companies including Zementis. By providing a uniform standard to represent such models, PMML allows for the exchange of predictive solutions between different applications and various vendors.

Open source PMML-compliant statistical tools such as R, KNIME, and RapidMiner can be used to develop data mining models based on historical data. Once models are exported into a PMML file, they can then be imported into an operational decision platform and be ready for production use in a matter of minutes.
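Because PMML is plain XML, an exported model can be inspected with ordinary tooling before it is imported into a scoring engine. Below is a minimal sketch using Python's standard library; the PMML fragment is hand-written for illustration, as real files exported by R, KNIME, or RapidMiner carry far more detail (mining schema, transformations, model parameters).

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written PMML fragment for illustration only.
pmml = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Sample churn model"/>
  <DataDictionary numberOfFields="1">
    <DataField name="tenure" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TreeModel modelName="churn" functionName="classification"/>
</PMML>"""

# PMML elements live in a versioned namespace, so lookups need it.
ns = {"p": "http://www.dmg.org/PMML-4_2"}
root = ET.fromstring(pmml)
model = root.find("p:TreeModel", ns)
print(root.get("version"), model.get("functionName"))
```

Reading the `version` attribute and the model element this way is often enough to sanity-check which tool produced a file and what kind of model it contains.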

On-Demand Predictive Analytics

Both Amazon and IBM offer a reliable and on-demand cloud computing infrastructure on which we offer the ADAPA® Predictive Decisioning Engine based on the Software as a Service (SaaS) paradigm. ADAPA imports models expressed in PMML and executes these in batch mode, or real-time via web-services.

Our service is implemented as a private, dedicated instance of ADAPA. Each client has access to his/her own ADAPA Engine instance via HTTP/HTTPS. In this way, models and data for one client never share the same engine with other clients.

The ADAPA Web Console

Each instance executes a single version of the ADAPA engine. The engine itself is accessible through the ADAPA Web Console, which allows for easy management of predictive models and data files. The instance owner can use the console to upload new models as well as score or classify records from data files in batch mode. Real-time execution of predictive models is achieved through the use of web services. The ADAPA Console offers a very intuitive interface divided into two main sections: model management and data management. These allow existing models to be used for generating decisions on different data sets. New models can also be easily uploaded and existing models removed in a matter of seconds.

Predicting in the Cloud

By using a SaaS solution to break down the traditional barriers that slow the adoption of predictive analytics, our strategy translates predictive solutions into operational assets with minimal deployment costs and leverages the inherent scalability of utility computing.

In summary, ADAPA revolutionizes the world of predictive analytics and cracks the big data code, since it allows for:

  • Cost-effective and reliable service based on two outstanding cloud computing infrastructures: Amazon and IBM.

  • Secure execution of predictive models through dedicated and controlled instances including HTTPS and Web-Services security

  • On-demand computing. Choice of instance type and launch of multiple instances.

  • Superior time-to-market by providing rapid deployment of predictive solutions and an agile enterprise decision management environment.

Monday, October 8, 2012

ADAPA in the Cloud: Feature List

Broad support for predictive algorithms

ADAPA supports an extensive collection of statistical and data mining algorithms. These are:

  • Ruleset Models (flat Decision Trees)
  • Clustering Models (Distribution-Based, Center-Based, and 2-Step Clustering)
  • Decision Trees (for classification and regression) together with multiple missing value handling strategies (Default Child, Last Prediction, Null Prediction, Weighted Confidence, Aggregate Nodes)
  • Naive Bayes Classifiers
  • Association Rules
  • Neural Networks (Back-Propagation, Radial-Basis Function, and Neural-Gas)
  • Regression Models (Linear, Polynomial, and Logistic) and General Regression Models (General Linear, Ordinal Multinomial, Generalized Linear, Cox)
  • Support Vector Machines (for regression and multi-class and binary classification)
  • Scorecards (including reason codes and point allocation for categorical, continuous, and complex attributes)
  • Multiple Models (Segmentation, Ensembles - including Random Forest Models and Stochastic Boosting, Chaining and Model Composition)
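To make one entry in the list above concrete, here is a sketch (with made-up coefficients) of what evaluating a PMML logistic Regression Model amounts to: a linear combination of the intercept and the numeric-predictor coefficients, followed by the logit normalization the standard defines.

```python
import math

# Illustrative values only, as they might appear in a PMML
# <RegressionTable intercept="..."> with <NumericPredictor> children.
intercept = -1.2
coefficients = {"tenure": -0.05, "num_complaints": 0.8}

def score(record):
    # Linear combination of intercept and coefficient terms,
    # then normalizationMethod="logit".
    z = intercept + sum(c * record[f] for f, c in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))

p = score({"tenure": 24, "num_complaints": 3})
print(round(p, 3))  # z = -1.2 - 1.2 + 2.4 = 0, so p = 0.5
```

A scoring engine does essentially this, except the coefficients, transformations, and normalization method are all read from the PMML file rather than hard-coded.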

Model interfaces: pre- and post-processing

Additionally, ADAPA supports a myriad of functions for implementing data pre- and post-processing. These include:
  • Text Mining
  • Value Mapping
  • Discretization
  • Normalization
  • Scaling
  • Logical and Arithmetic Operators
  • Business Rules
  • Lookup Tables
  • Regular Expressions
  • Custom Functions
and much, much more.

If you think of anything ADAPA cannot do or something else you need to do in terms of data manipulation, let us know.

Automatic conversion (and correction) for older versions of PMML

ADAPA consumes model files that conform to PMML, version 2.0 through 4.2. If your model development environment exports an older version, ADAPA will automatically convert your file into a 4.2 compliant format. It will also correct a number of common problems found in PMML generated by some popular modeling tools, allowing the models to work as intended.
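The first step of such a conversion is simply detecting the schema version a file declares. A sketch of that check, using a hand-written PMML 3.2 document for illustration:

```python
import xml.etree.ElementTree as ET

# Sketch of the version check a converter would perform before
# upgrading an older file; the 3.2 document below is illustrative.
old_pmml = '<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2"/>'

version = ET.fromstring(old_pmml).get("version")
needs_upgrade = tuple(map(int, version.split("."))) < (4, 2)
print(version, needs_upgrade)
```

The actual upgrade then rewrites the namespace and any renamed or restructured elements so the model behaves as it did under the older schema.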

Web-based management and interactive execution of predictive models and business rules

Model management: Models and rule sets are deployed and managed through an intuitive, Web-based management console, the ADAPA Console.
  • Model verification: The ADAPA Console includes a model validation test, allowing models to be verified for correctness. By providing ADAPA with a test file containing input data and expected results for a model, the engine will report any deviations from the expected results, greatly enhancing traceability of errors and debugging of model deployment issues. The console also provides easy access to our rules testing framework, in which business rules are subjected to regression and acceptance testing.
  • Batch-scoring: The console also provides functionality to upload a (compressed) CSV data file and batch-score it against any of the deployed models. Results are returned in the same format and may be downloaded for further processing and visualization.
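The verification step above boils down to comparing the engine's scores against the expected values record by record. A minimal sketch, with hypothetical data in the spirit of such a test file:

```python
import csv, io

# Hypothetical verification data: record identifiers plus the
# expected score, as a model test file might supply them.
expected_csv = io.StringIO(
    "customer_id,expected_score\n1001,0.82\n1002,0.17\n"
)
# Scores as if returned by the engine for the same records.
actual = {"1001": 0.82, "1002": 0.18}

# Collect the records whose scores deviate beyond a tolerance.
deviations = [
    row["customer_id"]
    for row in csv.DictReader(expected_csv)
    if abs(actual[row["customer_id"]] - float(row["expected_score"])) > 1e-6
]
print(deviations)
```

Reporting only the deviating records, as the console does, makes it easy to trace a deployment problem back to the specific inputs that expose it.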

Simplified integration via SOA

Service Oriented Architecture (SOA) principles simplify integration with existing IT infrastructure. Since ADAPA publishes all deployed models as a Web-Service, you can score data records from within your own environment. With the simple execution of a web service call (SOAP or REST), you are able to leverage the power of predictive models and business rules on-demand or in real-time.
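A client-side scoring call can then be as small as one HTTP request. The sketch below builds such a request with Python's standard library; the endpoint URL and payload shape are illustrative, not ADAPA's actual web-service contract.

```python
import json
import urllib.request

# Hypothetical REST scoring call; the endpoint URL and payload shape
# are illustrative, not ADAPA's actual web-service contract.
record = {"tenure": 24, "monthly_charges": 70.5, "num_complaints": 3}
req = urllib.request.Request(
    "https://adapa.example.com/model/ChurnModel/score",
    data=json.dumps(record).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would return the prediction; the call
# is omitted here because the endpoint is fictional.
print(req.get_method(), req.full_url)
```

Because the model is exposed as a plain web service, any environment that can issue an HTTP request can score records, which is what makes the SOA-style integration so lightweight.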

Data scoring from inside Excel

The ADAPA Add-in for Microsoft Office Excel 2007, 2010, and 2013 allows you to easily score data using ADAPA on the Cloud. Once the Add-in is installed, all you need to do is select your data in Excel, connect to ADAPA, and start scoring right away. Your predictions will be made available as new columns.

On-demand predictive analytics solution

ADAPA in the Cloud is a fully hosted Software-as-a-Service (SaaS) solution. You only pay for the service and the capacity that is used, eliminating the necessity for expensive software licenses and in-house hardware resources. As the business grows, ADAPA in the Cloud provides a cost-effective expansion path, for example, by adding multiple ADAPA instances for scalability or failover. The SaaS model removes the burden for you to manage a scalable, on-demand computing infrastructure.

Private instance for all your decisioning needs

We provide you with a single-tenant architecture. The service is implemented as a private, dedicated instance of ADAPA that encapsulates your predictive models and business rules. Only you have access to your private ADAPA instance(s) via HTTPS. Your decisioning files and data never share the same engine with other clients. 

Trusted, secure, scalable cloud infrastructure

Zementis leverages FICO and Amazon EC2 for providing on-demand infrastructure for ADAPA in the Cloud. Cloud computing offers utility computing with virtually unlimited scalability. 


© Predictive Analytics by Zementis, Inc. - All Rights Reserved.




