Predictive Analytics, Big Data, Hadoop, PMML: 2014

Wednesday, May 28, 2014

Online PMML Course @ UCSD Extension: Register today!

The Predictive Model Markup Language (PMML) is the de facto standard for representing predictive analytics and data mining models. It allows predictive models built in one application to be moved to another without any re-coding. PMML has become essential for companies wanting to extract value and insight from Big Data, where the agile deployment of predictive models is imperative. Given the volume and velocity associated with Big Data, one cannot spend weeks or months re-coding a predictive model into the IT operational environment where it actually produces value (the fourth V in Big Data).

Also, as predictive models become more complex through the use of random forest models, model ensembles, and deep learning neural networks, PMML becomes even more relevant since model recoding is simply not an option.

Zementis has paired up with UCSD Extension to offer the first online PMML course. This is a great opportunity for individuals and companies alike to master PMML so that they can muster their predictive analytics resources around a single standard and in doing so, benefit from all it can offer.

http://extension.ucsd.edu/studyarea/index.cfm?vAction=singleCourse&vCourse=CSE-41184

Course Benefits

Learn how to represent an entire data mining solution using open-standards
Understand how to use PMML effectively as a vehicle for model logging, versioning and deployment
Identify and correct issues with PMML code as well as add missing computations to auto-generated PMML code

Course Dates

07/14/14 - 08/25/14

PMML is supported by most commercial and open-source data mining tools. Companies and tools that support PMML include IBM SPSS, SAS, R, SAP KXEN, Zementis, KNIME, RapidMiner, FICO, StatSoft, Angoss, Microstrategy ... The standard itself is very mature and its latest release is version 4.2.

For more details about PMML, please visit the Zementis PMML Resources page.

Scoring Data from MySQL or SQL Server using KNIME and ADAPA

The video below shows the use of KNIME for handling data (reading data from a flat file and/or a database) as well as model building (data massaging and training a neural network). It also highlights how easy and straightforward it is to move a predictive model represented in PMML, the Predictive Model Markup Language, into the Zementis ADAPA Scoring Engine. ADAPA is then used for model deployment and scoring. PMML is the de facto standard to represent data mining models. It allows for predictive models to be moved between applications and systems without the need for model re-coding.

When training a model, scientists rely on historical data, but when the model is used on a regular basis, it is moved or deployed into production, where it is presented with new data. ADAPA provides a scalable and blazing fast scoring engine for models in production. And, although the KNIME data mining nodes are typically used by scientists to build models, the KNIME database and REST nodes can simply be used to create a flow for reading data from a database (MySQL, SQL Server, Oracle, ...) and passing it for scoring in ADAPA via the ADAPA REST API.

Use-cases are:

Read data from a flat file, use KNIME for data pre-processing and building of a neural network model. Export the entire predictive workflow as a PMML file and then take this PMML file and upload and score it in ADAPA via its Admin Web Console.
Read data from a database (MySQL, SQL Server, Oracle, ...), build a model in KNIME, export the model as a PMML file and deploy it in ADAPA using its REST API. This use-case also shows new or testing data flowing from the database into ADAPA for scoring via a sequence of KNIME nodes. The video also shows a case in which one can use KNIME nodes to simply read a PMML file produced in any PMML-compliant data mining tool (R, SAS EM, SPSS, ...), upload it in ADAPA using the REST API and score new data from MySQL in ADAPA, also through the REST interface. Note that in this case, the model has already been trained and we are just using KNIME to deploy the existing PMML file in ADAPA for scoring.

Zementis and SAP HANA: Real-time Scoring for Big Data

The Zementis partnership with SAP is manifesting itself in a number of ways. Two weeks ago we were part of the SAP Big Data Bus parked outside Wells Fargo in San Francisco. This week, we would like to share with you three new developments.

1) ADAPA is now being offered at the SAP HANA Marketplace.

2) An interview with our CEO, Mike Zeller, was just featured by SAP on the SAP Blogs.

3) Zementis was again part of the SAP Big Data Bus and the "Big Data Theatre". This time, the bus was parked outside US Bank in Englewood, Colorado. We were engaged in a myriad of conversations with the many people that came through the bus about how ADAPA and SAP HANA work together to bring predictive analytics and real-time scoring to transactional data and millions of accounts, in any industry.

Visit the Zementis ADAPA for SAP HANA page for more details on the Zementis and SAP real-time solution for predictive analytics.

Friday, April 18, 2014

Real-time scoring of transactional data with ADAPA for SAP HANA

At the recent DEMO Enterprise 2014 conference, Zementis announced its participation in the SAP® Startup Focus program and launched ADAPA for SAP HANA, a standards-based predictive analytics scoring engine.

ADAPA for SAP HANA provides a simple plug-and-play platform to deploy the most complex predictive models and execute them in real-time, even in the context of Big Data.

In joining the SAP HANA Startup Focus program, Zementis set out to address two key challenges related to the operational deployment of predictive analytics: Agile deployment and scalable execution.

Transactional data has for years pushed the boundaries of predictive analytics. The financial industry, for example, has been using transactional data to detect fraud and abuse for decades with complex custom solutions. Real-time scoring is paramount for companies to be able to predict and prevent fraudulent activity before it actually happens. Likewise, the Internet of Things (IoT) demands effective processing of sensor data to employ predictive maintenance for detecting issues before they turn into device failures.

To solve these challenges, Zementis combined its ADAPA predictive analytics scoring engine with SAP HANA in a true plug-and-play platform which is universally applicable across all industries. ADAPA serves scoring requests and executes predictive models, while HANA offloads complex model preprocessing and the computation of aggregates.

In this scenario, real-time execution critically depends on HANA serving complex data lookups and aggregate profile computation in a few milliseconds. In a high-volume environment, such aggregates or lookups may have to be computed over millions of transactions.

ADAPA provides scalable real-time scoring of the core model, plus agility for model deployment through the Predictive Model Markup Language (PMML) industry standard. Clients are able to instantly deploy existing predictive models from various data mining tools. For example, you can take a complex predictive model from SAS Enterprise Miner, export it in PMML format and simply make it available for real-time scoring in ADAPA for SAP HANA. The same process, of course, applies to most commercial tools, e.g. SAP Predictive Analysis, KXEN, IBM SPSS, as well as open source tools like R and KNIME.

The unique aspect of the Zementis / SAP platform is that it combines the benefits of an open standard for predictive analytics with the power of in-memory computing.

For more product details, please see http://zementis.com/saphana.htm

Thursday, February 27, 2014

PMML 4.2 is here! What changed? What is new?

PMML 4.2 is out! That's really great. The DMG (Data Mining Group) has been working on this new version of PMML for over two years now. And, I can truly say, it is the best PMML ever! If you haven't seen the press release for the new version, please see the posting below:

http://www.kdnuggets.com/2014/02/data-mining-group-pmml-v42-predictive-modeling-standard.html

What changed?

PMML is a very mature language. And so, there aren't really any dramatic changes in the language at this point. One noteworthy change is that older versions of PMML used to call the target field of a predictive model "predicted". This was confusing, since a predicted field is usually the result of scoring or executing a model: the score, so to speak. Well, PMML 4.2 clears things up a bit. The target field is now simply "target". A small change, but a huge step towards making it clear that the Output element is where the predicted outputs should be defined.
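As a minimal sketch of what this means for code that reads PMML, the Python snippet below looks up the target field in a mining schema and accepts both the older "predicted" and the newer "target" usage type. The MiningSchema fragment and its field names are made up for illustration, and the XML namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical MiningSchema fragment (namespace omitted for brevity).
MINING_SCHEMA = """
<MiningSchema>
  <MiningField name="age" usageType="active"/>
  <MiningField name="income" usageType="active"/>
  <MiningField name="churn" usageType="target"/>
</MiningSchema>
"""

def find_target_fields(schema_xml):
    """Return the names of fields marked "target" (or the older "predicted")."""
    root = ET.fromstring(schema_xml.strip())
    return [
        field.get("name")
        for field in root.iter("MiningField")
        if field.get("usageType") in ("target", "predicted")
    ]

target_fields = find_target_fields(MINING_SCHEMA)
print(target_fields)
```

Accepting both values is one way a consumer could stay compatible with files written before and after PMML 4.2.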

Continuous Inputs for Naive Bayes Models

This is a great new enhancement to the NaiveBayes model element. We wrote an entire paper about this new feature and presented it at the KDD 2013 PMML Workshop. If you use Naive Bayes models, you should definitely take a look at our article.

http://kdd13pmml.files.wordpress.com/2013/07/guazzelli_et_al.pdf

And, now you can benefit from actually having our proposed changes in PMML itself! This is really remarkable and we are all already benefiting from it. The Zementis Py2PMML (Python to PMML) Converter uses the proposed changes to convert Gaussian Naive Bayes models from scikit-learn to PMML.

https://zementis.zendesk.com/entries/37045093-Exporting-PMML-for-Class-GaussianNB
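To give a feel for what a continuous input means in a Naive Bayes model, here is a small Python sketch under the usual Gaussian assumption: each class keeps a mean and a variance for the continuous field, and the likelihood of a value is the normal density. The class labels, priors and statistics are invented for illustration, and this is plain Python rather than a PMML evaluation.

```python
import math

# Made-up per-class statistics for one continuous input, "age".
# Each entry is (prior probability, mean, variance).
CLASS_STATS = {
    "churn": (0.3, 30.0, 25.0),
    "stay": (0.7, 45.0, 100.0),
}

def gaussian_density(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

def posterior(x):
    """Posterior class probabilities for a single continuous input value."""
    joint = {
        label: prior * gaussian_density(x, mean, variance)
        for label, (prior, mean, variance) in CLASS_STATS.items()
    }
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}

probabilities = posterior(32.0)
print(probabilities)
```

With more than one input, the per-field likelihoods would simply be multiplied together for each class before normalizing.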

Complex Point Allocation for Scorecards

The Scorecard model element was introduced to PMML in version 4.1. It was a good element then, but it is really great now in PMML 4.2. We added a way to compute complex values for the allocation of points for an attribute (under a certain characteristic) through the use of expressions. That means you can use input or derived values to derive the actual value for the points. Very cool!

Andy Flint (FICO) and I wrote a paper about the Scorecard element for the KDD 2011 PMML Workshop. So, if you haven't seen it yet, it will get you started on how to use PMML to represent scorecards and reason codes.

http://kdd11pmml.files.wordpress.com/2011/09/p2_flint_guazzelli_kdd_20112.pdf
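The idea of complex point allocation can be sketched in a few lines of Python: an attribute either awards a fixed number of points or computes its points from the input record. The characteristics, thresholds and point values below are invented, and the code only mimics the concept; it does not parse or execute PMML.

```python
# Each characteristic has a list of attributes. An attribute pairs a predicate
# with either fixed points or a function that computes points from the record.
# All field names, thresholds, and point values are invented for illustration.
SCORECARD = {
    "age": [
        (lambda r: r["age"] < 25, 10.0),                       # fixed points
        (lambda r: r["age"] >= 25, lambda r: 0.5 * r["age"]),  # computed points
    ],
    "income": [
        (lambda r: r["income"] < 30000, 5.0),
        (lambda r: r["income"] >= 30000, lambda r: 5.0 + r["income"] / 10000.0),
    ],
}

def score(record, initial_score=0.0):
    """Add up the points of the first matching attribute per characteristic."""
    total = initial_score
    for attributes in SCORECARD.values():
        for predicate, points in attributes:
            if predicate(record):
                total += points(record) if callable(points) else points
                break
    return total

example_score = score({"age": 40, "income": 50000})
print(example_score)
```

Here the second attribute of each characteristic plays the role of an expression-based partial score, while the first one is a plain constant.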

Revised Output Element

The Output element was completely revised. It is much simpler to use. With PMML 4.2, you have direct access to all the model outputs and all post-processing directly through the attribute "feature".

The attribute segmentId also allows users to output particular fields from segments in a multiple model scenario.

The newly revised output element spells flexibility. It allows you to get what you need out of your predictive solutions.
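As a rough illustration, the Python sketch below reads an Output element and lists each output field together with its "feature" and, where present, its segmentId. The fragment is hypothetical (field names and the segment id are made up, and the XML namespace is omitted), so treat it as a reading aid rather than a complete PMML example.

```python
import xml.etree.ElementTree as ET

# Hypothetical Output fragment (namespace omitted for brevity).
OUTPUT = """
<Output>
  <OutputField name="predicted_churn" feature="predictedValue"/>
  <OutputField name="prob_churn" feature="probability" value="yes"/>
  <OutputField name="segment1_score" feature="predictedValue" segmentId="1"/>
</Output>
"""

def describe_outputs(output_xml):
    """Return (name, feature, segmentId) for every OutputField."""
    root = ET.fromstring(output_xml.strip())
    return [
        (field.get("name"), field.get("feature"), field.get("segmentId"))
        for field in root.iter("OutputField")
    ]

outputs = describe_outputs(OUTPUT)
for name, feature, segment_id in outputs:
    print(name, feature, segment_id)
```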

For a complete list of all the changes in PMML 4.2 (small and large), see:

http://www.dmg.org/v4-2/Changes.html

What is new?

PMML 4.2 introduces the use of regular expressions to PMML. This is solely so that users can process text more efficiently. The most straightforward additions are simple: 3 new built-in functions for concatenating, replacing and matching strings using regular expressions.
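For readers who think in code, here are rough Python analogues of the three operations (concatenating, replacing and matching). The function names follow the description above, but the bodies rely on Python's own re module, so details of the regular expression dialect may differ from what a PMML engine implements.

```python
import re

def concat(*values):
    """Join the string form of every argument, in order."""
    return "".join(str(value) for value in values)

def replace(text, pattern, replacement):
    """Replace every match of a regular expression in text."""
    return re.sub(pattern, replacement, text)

def matches(text, pattern):
    """Return True if the regular expression is found anywhere in text."""
    return re.search(pattern, text) is not None

label = concat("code-", 42)
cleaned = replace("BRRR!!!", r"R{2,}", "R")
has_order_id = matches("order 12345 shipped", r"\d{5}")
print(label, cleaned, has_order_id)
```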

The more elaborate addition is the incorporation of a brand new transformation element in PMML to extract term frequencies from text. The ideas for this element were presented at the KDD 2013 PMML Workshop by Benjamin De Boe, Misha Bouzinier, Dirk Van Hyfte (InterSystems). Their paper is a great resource for finding out the details behind the ideas that led to the new text mining element in PMML.

http://kdd13pmml.files.wordpress.com/2013/07/extending_the_pmml_text_model_for_text_categorization.pdf
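In its simplest form, extracting a term frequency boils down to counting how often a term shows up in a piece of text. The Python sketch below does just that for single-word terms; the tokenization rule and the case_sensitive switch are assumptions made for this example and do not cover the options of the actual PMML element.

```python
import re
from collections import Counter

def term_frequency(text, term, case_sensitive=False):
    """Count how often a single-word term occurs in text."""
    if not case_sensitive:
        text, term = text.lower(), term.lower()
    tokens = re.findall(r"\w+", text)  # naive word tokenizer
    return Counter(tokens)[term]

review = "Great product. Great price, great service!"
great_count = term_frequency(review, "great")
print(great_count)
```

The resulting count can then be used like any other derived numeric field, for example as an input to a classification model.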

Obviously, the changes described above are also new, but it was nice to break the news into two pieces. For the grand-finale though, nothing better than taking a look at PMML 4.2 itself.

http://www.dmg.org/v4-2/GeneralStructure.html

Enjoy!

Wednesday, January 29, 2014

Standards in Predictive Analytics: R, Hadoop and PMML (a white paper by James Taylor)

James Taylor (@jamet123) is remarkable in capturing the nuances and mood of the data analytics and decision management industry and community. As a celebrated author and an avid writer, James has been writing more and more about the technologies that transform Big Data into real value and insights that can then drive smart business decisions. It is not a surprise then that James has just made available a white paper entitled "Standards in Predictive Analytics" focusing on PMML, the Predictive Model Markup Language, R, and Hadoop.

DOWNLOAD WHITE PAPER

Why R?

Well, you can use R for pretty much anything in analytics these days. Besides allowing users to do data discovery, it also provides a myriad of packages for model building and predictive analytics.

Why Hadoop?

It almost goes without saying. Hadoop is an amazing platform for processing predictive analytic models on top of Big Data.

Why PMML?

PMML is really the glue between model building (say, R, SAS EM, IBM SPSS, KXEN, KNIME, Python scikit-learn, .... ) and the production system. With PMML, moving a model from the scientist's desktop to production (say, Hadoop, Cloud, in-database, ...) is straightforward. It boils down to this:

R -> PMML -> Hadoop

But, I should stop here and let you read James' wise words yourself. The white paper is available through the Zementis website. To download it, simply click below.

DOWNLOAD WHITE PAPER

And, if you would like to check James' latest writings, make sure to check his website: JTonEDM.com

Wednesday, January 22, 2014

Zementis/Datameer Webinar - Best Practices for Big Data Analytics with Machine Learning (View Recording)

Please watch the Zementis and Datameer webinar entitled "Best Practices for Big Data Analytics with Machine Learning."

VIEW RECORDING

In this webinar, we demonstrate through an industry specific use case how to identify patterns and relationships to make sound predictions using smart data analytics. You will learn best practices on:

Selecting the right machine learning approach for business and IT
Visualizing machine learning on Hadoop
Leveraging existing predictive algorithms on Hadoop

Wednesday, January 8, 2014

Zementis and Teradata Announce In-database Scoring for Big Data

As a result of its partnership with Teradata, Zementis is excited to announce the availability of the Universal PMML Plug-in (UPPI) for Teradata analytic platforms. It does not get easier than this! Simply deploy your predictive models built in R, IBM SPSS, SAS EM, ... and score your big data, directly in-database, where it resides.

The Zementis Universal PMML Plug-in (UPPI) enables the execution of standards-based predictive analytics directly within the Teradata Unified Data Architecture™. Users can now easily deploy predictive models built in R, IBM SPSS, SAS EM and other popular analytic tools on Aster and/or Teradata to achieve scale. The bridge between these systems is PMML, the Predictive Model Markup Language standard. It allows for models to be instantly moved from the scientist's desktop to the database where they will be executed.

As described by Teradata's Chris Twogood, VP for Product and Services Marketing, "by partnering with Zementis, we are able to offer high performance, enterprise-level predictive analytics scoring for the major analytics tools that support PMML. With Zementis and PMML, we are eliminating the need for customers to recode predictive analytic models in order to deploy them within our database. In turn, this enables an analyst to reduce the time to insight required in most businesses today."

Available for Teradata and Teradata Aster databases, UPPI leverages the massively parallel databases as a scalable, high-performance, scoring engine that easily processes through petabyte-scale data volumes. UPPI takes full advantage of the high-performance data warehouse with its massively parallel processing capabilities for rapid execution of standards-based predictive analytics based on the PMML standard.

Models built in most commercial and open source data mining tools can now instantly be deployed in Teradata or Aster. The net result is the ability to leverage the power of standards-based predictive analytics on a massive scale, right where the data resides.

Read the full press release!