Predictive Analytics, Big Data, Hadoop, PMML

Thursday, February 27, 2014

PMML 4.2 is here! What changed? What is new?

PMML 4.2 is out! That's really great. The DMG (Data Mining Group) has been working on this new version of PMML for over two years now. And, I can truly say, it is the best PMML ever! If you haven't seen the press release for the new version, please see the posting below:

http://www.kdnuggets.com/2014/02/data-mining-group-pmml-v42-predictive-modeling-standard.html

What changed?

PMML is a very mature language, so there aren't any dramatic changes at this point. One noteworthy change is that older versions of PMML called the target field of a predictive model "predicted". This was confusing, since a predicted field is usually the result of scoring or executing a model: the score, so to speak. PMML 4.2 clears things up a bit. The target field is now simply "target". A small change, but a huge step towards making it clear that the Output element is where the predicted outputs should be defined.
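To make the rename concrete, here is a minimal Python sketch that rewrites the old value to the new one in a mining schema. It assumes the change lives in the usageType attribute of MiningField; the field names are made up, and the fragment leaves out the XML namespace that a real PMML file carries.

```python
import xml.etree.ElementTree as ET

# Hypothetical pre-4.2 fragment: field names are invented for illustration,
# and the PMML namespace is omitted to keep the sketch short.
OLD_SCHEMA = """
<MiningSchema>
  <MiningField name="age" usageType="active"/>
  <MiningField name="income" usageType="active"/>
  <MiningField name="churn" usageType="predicted"/>
</MiningSchema>
"""


def upgrade_usage_type(schema_xml: str) -> str:
    """Rename usageType="predicted" to "target" on every MiningField."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("MiningField"):
        if field.get("usageType") == "predicted":
            field.set("usageType", "target")
    return ET.tostring(root, encoding="unicode")


upgraded = upgrade_usage_type(OLD_SCHEMA)
print(upgraded)
```

Input fields keep their "active" usage type; only the target field is touched.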

Continuous Inputs for Naive Bayes Models

This is a great new enhancement to the NaiveBayes model element. We wrote an entire paper about this new feature and presented it at the KDD 2013 PMML Workshop. If you use Naive Bayes models, you should definitely take a look at our article.

http://kdd13pmml.files.wordpress.com/2013/07/guazzelli_et_al.pdf

And, now you can benefit from actually having our proposed changes in PMML itself! This is really remarkable and we are all already benefiting from it. The Zementis Py2PMML (Python to PMML) Converter uses the proposed changes to convert Gaussian Naive Bayes models from scikit-learn to PMML.

https://zementis.zendesk.com/entries/37045093-Exporting-PMML-for-Class-GaussianNB
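For readers who want to see what a continuous input means for Naive Bayes, here is a minimal Python sketch of the Gaussian part of the computation. The priors, means and variances are made-up numbers and the function names are mine, not part of PMML or Py2PMML. It only illustrates the idea of replacing counts with a Gaussian density; it is not the full scoring procedure defined in the specification.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


# Hypothetical model: class priors plus (mean, variance) per class and input.
PRIORS = {"yes": 0.5, "no": 0.5}
STATS = {
    "yes": {"age": (30.0, 25.0)},
    "no": {"age": (50.0, 25.0)},
}


def score(record):
    """Return normalized class probabilities for one input record."""
    joint = {}
    for label, prior in PRIORS.items():
        likelihood = prior
        for name, (mean, variance) in STATS[label].items():
            likelihood *= gaussian_pdf(record[name], mean, variance)
        joint[label] = likelihood
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}


probs = score({"age": 32.0})
print(probs)
```

An age of 32 sits much closer to the "yes" mean than to the "no" mean, so "yes" gets the higher probability.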

Complex Point Allocation for Scorecards

The Scorecard model element was introduced to PMML in version 4.1. It was a good element then, but it is really great now in PMML 4.2. We added a way to compute complex values for the points allocated to an attribute (under a certain characteristic) through the use of expressions. That means you can use input or derived values to derive the actual value for the points. Very cool!
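Here is a minimal Python sketch of the idea, under my own assumptions: one characteristic with two attributes, where the first awards a constant number of points and the second computes its points from the input value. The names, thresholds and formula are all hypothetical; in PMML the expression would be written inside the Scorecard element itself.

```python
# One characteristic ("age") with two attributes. Each attribute is a
# (predicate, points) pair; points is either a constant or a function of
# the input record, which stands in for a PMML expression.
CHARACTERISTIC_AGE = [
    (lambda r: r["age"] < 25, 10.0),
    (lambda r: r["age"] >= 25, lambda r: 0.5 * r["age"] + 5.0),
]

INITIAL_SCORE = 100.0  # hypothetical starting score


def partial_score(attributes, record):
    """Return the points of the first attribute whose predicate holds."""
    for predicate, points in attributes:
        if predicate(record):
            return points(record) if callable(points) else points
    return 0.0


def scorecard(record):
    """Initial score plus the partial score of each characteristic."""
    return INITIAL_SCORE + partial_score(CHARACTERISTIC_AGE, record)


print(scorecard({"age": 20}))
print(scorecard({"age": 40}))
```

The second attribute is the interesting one: its points depend on the actual age rather than on a fixed value.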

Andy Flint (FICO) and I wrote a paper about the Scorecard element for the KDD 2011 PMML Workshop. So, if you haven't seen it yet, it will get you started on how to use PMML to represent scorecards and reason codes.

http://kdd11pmml.files.wordpress.com/2011/09/p2_flint_guazzelli_kdd_20112.pdf

Revised Output Element

The Output element was completely revised and is now much simpler to use. With PMML 4.2, you have direct access to all the model outputs and all post-processing through the attribute "feature".

The attribute segmentId also allows users to output particular fields from segments in a multiple model scenario.

The newly revised output element spells flexibility. It allows you to get what you need out of your predictive solutions.
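As a rough illustration, the Python sketch below reads a simplified Output fragment and lists what each output field asks for. The field names are made up, the attributes that a real OutputField normally carries (such as its data type) are left out, and the namespace is omitted, so treat it as a sketch rather than a valid PMML document.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical Output fragment. The third field asks for the
# prediction of one particular segment of a multiple model.
OUTPUT_XML = """
<Output>
  <OutputField name="predicted_class" feature="predictedValue"/>
  <OutputField name="prob_yes" feature="probability" value="yes"/>
  <OutputField name="segment_2_score" feature="predictedValue" segmentId="2"/>
</Output>
"""


def describe_outputs(output_xml):
    """Return (name, feature, segmentId) for every OutputField."""
    root = ET.fromstring(output_xml)
    return [
        (field.get("name"), field.get("feature"), field.get("segmentId"))
        for field in root.iter("OutputField")
    ]


outputs = describe_outputs(OUTPUT_XML)
for name, feature, segment_id in outputs:
    print(name, feature, segment_id)
```

Fields without a segmentId come back with None in the last position.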

For a complete list of all the changes in PMML 4.2 (small and large), see:

http://www.dmg.org/v4-2/Changes.html

What is new?

PMML 4.2 introduces regular expressions to PMML, so that users can process text more efficiently. The most straightforward additions are three new built-in functions: one for concatenating strings, and two for replacing and matching strings using regular expressions.
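To give a feel for what these three functions do, here is a minimal Python sketch of rough equivalents. The Python names and signatures are my own, I am assuming that matching means "the pattern is found somewhere in the string", and Python's regular expression dialect may differ in details from the one PMML uses.

```python
import re


def concat(*values):
    """Join all arguments into one string."""
    return "".join(str(value) for value in values)


def replace(text, pattern, replacement):
    """Replace every match of the regular expression with the replacement."""
    return re.sub(pattern, replacement, text)


def matches(text, pattern):
    """True if the regular expression is found anywhere in the text."""
    return re.search(pattern, text) is not None


print(concat("PMML", " ", 4.2))
print(replace("B2B", r"\d", "-"))
print(matches("January", r"ar?y"))
```

In a PMML file these would be applied to fields and constants inside a transformation, rather than called directly as here.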

The more elaborate addition is a brand new transformation element in PMML to extract term frequencies from text. The ideas for this element were presented at the KDD 2013 PMML Workshop by Benjamin De Boe, Misha Bouzinier and Dirk Van Hyfte (InterSystems). Their paper is a great resource for finding out the details behind the ideas that led to the new text mining element in PMML.

http://kdd13pmml.files.wordpress.com/2013/07/extending_the_pmml_text_model_for_text_categorization.pdf
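In its simplest form, extracting a term frequency means counting how often a term shows up in a piece of text. The Python sketch below does just that. The function name, the case_sensitive flag and the whole-word matching are my own assumptions, and the actual PMML element offers more options than this sketch models.

```python
import re


def term_frequency(text, term, case_sensitive=False):
    """Count whole-word occurrences of a term in the text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # re.escape keeps any special characters in the term literal.
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags))


review = "The sun rises. the Sun sets."
print(term_frequency(review, "sun"))
print(term_frequency(review, "sun", case_sensitive=True))
```

The resulting count can then be used like any other derived field, for example as an input to a predictive model.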

Obviously, the changes described above are also new, but it was nice to break the news into two pieces. For the grand finale, though, there is nothing better than taking a look at PMML 4.2 itself.

http://www.dmg.org/v4-2/GeneralStructure.html

Enjoy!

Wednesday, January 29, 2014

Standards in Predictive Analytics: R, Hadoop and PMML (a white paper by James Taylor)

James Taylor (@jamet123) is remarkable in capturing the nuances and mood of the data analytics and decision management industry and community. As a celebrated author and an avid writer, James has been writing more and more about the technologies that transform Big Data into real value and insights that can then drive smart business decisions. It is not a surprise then that James has just made available a white paper entitled "Standards in Predictive Analytics" focusing on PMML, the Predictive Model Markup Language, R, and Hadoop.

DOWNLOAD WHITE PAPER

Why R?

Well, you can use R for pretty much anything in analytics these days. Besides allowing users to do data discovery, it also provides a myriad of packages for model building and predictive analytics.

Why Hadoop?

It almost goes without saying. Hadoop is an amazing platform for processing predictive analytic models on top of Big Data.

Why PMML?

PMML is really the glue between model building (say, R, SAS EM, IBM SPSS, KXEN, KNIME, Python scikit-learn, ...) and the production system. With PMML, moving a model from the scientist's desktop to production (say, Hadoop, Cloud, in-database, ...) is straightforward. It boils down to this:

R -> PMML -> Hadoop

But, I should stop here and let you read James' wise words yourself. The white paper is available through the Zementis website. To download it, simply click below.

DOWNLOAD WHITE PAPER

And, if you would like to read James' latest writings, make sure to visit his website: JTonEDM.com

Wednesday, January 22, 2014

Zementis/Datameer Webinar - Best Practices for Big Data Analytics with Machine Learning (View Recording)

Please watch the Zementis and Datameer webinar entitled "Best Practices for Big Data Analytics with Machine Learning."

VIEW RECORDING

In this webinar, we demonstrate through an industry specific use case how to identify patterns and relationships to make sound predictions using smart data analytics. You will learn best practices on:

Selecting the right machine learning approach for business and IT
Visualizing machine learning on Hadoop
Leveraging existing predictive algorithms on Hadoop

Wednesday, January 8, 2014

Zementis and Teradata Announce In-database Scoring for Big Data

As a result of its partnership with Teradata, Zementis is excited to announce the availability of the Universal PMML Plug-in (UPPI) for Teradata analytic platforms. It does not get easier than this! Simply deploy your predictive models built in R, IBM SPSS, SAS EM, ... and score your big data, directly in-database, where it resides.

The Zementis Universal PMML Plug-in (UPPI) enables the execution of standards-based predictive analytics directly within the Teradata Unified Data Architecture™. Users can now easily deploy predictive models built in R, IBM SPSS, SAS EM and other popular analytic tools on Aster and/or Teradata to achieve scale. The bridge between these systems is PMML, the Predictive Model Markup Language standard. It allows for models to be instantly moved from the scientist's desktop to the database where they will be executed.

As described by Teradata's Chris Twogood, VP for Product and Services Marketing, "by partnering with Zementis, we are able to offer high performance, enterprise-level predictive analytics scoring for the major analytics tools that support PMML. With Zementis and PMML, we are eliminating the need for customers to recode predictive analytic models in order to deploy them within our database. In turn, this enables an analyst to reduce the time to insight required in most businesses today."

Available for Teradata and Teradata Aster databases, UPPI leverages the massively parallel databases as a scalable, high-performance scoring engine that easily processes petabyte-scale data volumes. UPPI takes full advantage of the high-performance data warehouse with its massively parallel processing capabilities for rapid execution of predictive analytics based on the PMML standard.

Models built in most commercial and open source data mining tools can now instantly be deployed in Teradata or Aster. The net result is the ability to leverage the power of standards-based predictive analytics on a massive scale, right where the data resides.

Read the full press release!

Friday, November 8, 2013

Big Data Scoring with UPPI for IBM Pure Data (for Analytics and Hadoop)

In-database scoring is one of the most straightforward ways to gain insights from Big Data. It is no surprise then that the Zementis Universal PMML Plug-in (UPPI) is now being offered for a variety of database platforms. These include IBM Pure Data for Analytics (Netezza), Pivotal/Greenplum, SAP Sybase IQ, Teradata and Teradata Aster. Zementis also offers UPPI for Hadoop/Hive, including IBM Pure Data for Hadoop as well as InfoSphere BigInsights. It is in this context that we travelled to Vegas to attend the IBM Information on Demand (IOD) Conference.

I must say, I am always impressed by the IBM universe of products and tools that are being offered for analytics (descriptive and predictive) as well as Big Data in general. Zementis had a booth inside the Pure Data exhibit area and next to all the Pure Data appliances. As you can imagine, traffic was solid not just because of all the blinking lights but also because the conference itself attracts a lot of people. I believe there were 14 thousand attendees this year.

Why in-database scoring? Well, simple. Not all analytic tasks are born the same. If one is confronted with massive volumes of data that need to be scored on a regular basis, in-database scoring sounds like the logical thing to do. In all likelihood, the data in this case is already stored in a database and, with in-database scoring, there is no data movement. Data and models reside together; hence, scores and predictions flow at an accelerated pace.

Why scoring in Hadoop? Big Data and Hadoop are somewhat synonymous terms these days, since the latter offers an important technological platform to tackle the challenge of analyzing large volumes of data. In fact, predictive analytics is paramount for companies to extract value and insight from such data. By offering the Universal PMML Plug-in (UPPI) for Hadoop, Zementis takes a big step in making its technology available for companies around the globe to easily deploy, execute, and integrate scalable standards-based predictive analytics on a massively parallel scale through the use of Hive, a data warehouse system for Hadoop.

UPPI brings together essential technologies, offering the best combination of open standards and scalability for the application of predictive analytics. It fully supports the Predictive Model Markup Language (PMML), the de facto standard for data mining applications, which enables the integration of predictive models from IBM/SPSS, SAS, R, and many more.

Saturday, November 2, 2013

1-Click Launch for Big Data Scoring: ADAPA on AWS Marketplace

Clients benefit from our solutions by being able to use PMML, the Predictive Model Markup Language, to move their predictive models from IBM SPSS, R, SAS EM, ... and deploy them instantly in a variety of platforms, including the Amazon Elastic Compute Cloud (Amazon EC2).

ADAPA on the Amazon Cloud offers the power of our real-time PMML-based scoring engine, pre-installed on a virtual server in the cloud. We call that an "ADAPA Instance".

The AWS (Amazon Web Services) Marketplace gives you the power of having ADAPA at your fingertips on three different types of virtual machines. Once you select the machine type and the cloud region in which you want it to run (US, Europe, Latin America or Asia-Pacific), all you need to select is 1-Click Launch and moments later your ADAPA instance is up and running, ready for deployment and execution.

Visit us at the AWS Marketplace!

Big Data Scoring through ADAPA with S3 Processing

Zementis makes it super easy to score your big data by connecting your Amazon S3 (Simple Storage Service) bucket to your predictive models deployed in ADAPA on the Amazon Cloud. ADAPA with S3 Processing is intended for mission critical applications that require very high throughput of predictive analytics. While ADAPA provides real-time scoring via a Web-services API, S3 Processing addresses use cases with scoring requirements that involve tens or hundreds of millions of rows at a time.