Wednesday, November 17, 2010

Examining PMML 4.0: Pre-Processing and Data Manipulation

You may be wondering what is all the fuss around PMML and its 4.0 version. So, we decided to explore all that PMML 4.0 has to offer in a series of blogs. In part I, we will be exploring its improved pre-processing capabilities.

All data mining models manipulate the raw data in a way or another before passing it through a neural network, support vector machine, or regression model. Therefore, a language that wants to represent all the computations that go into a model needs also to be able to represent the data transformations that were applied to the raw data before scoring takes place. PMML is this language! It is the Yin and Yang of data mining.

Let's first re-cap on the pre-processing capabilities available in PMML 3.2. This version of PMML allows for the following out of the box data transformations:
  • Normalization of continuous variables: this is accomplished via the NormContinuous element of PMML. It is mostly used to normalized a variable between 0 and 1. See example below (real PMML code) in which two variables are normalized. The first between 0 and 1 and the second between 0 and 4.
  • Normalizing Categorical Inputs: normally used to transform strings into numerical variables. This is accomplished by the element NormDiscrete. In the PMML example below, a categorical variable creates dummy variables that will be assigned values 1 or 0 depending on the category assumed by the input variable.
  • Discretization: this is used to transform continuous variables into strings. This is accomplished by the Discretize element. In the PMML example below, if the input variable is equal to 500, it is transformed to low; if equal to 5000, it is transformed to medium; and if 50,000, it is high.
  • Value Mapping: this is accomplished in PMML by the use of a mapping table and the element MapValues. To make things more interesting, in the PMML example below, we combine elements MapValues and NormDiscrete to group small sets of categorical values. In specific, we want to find out if the input variable belongs to a specific group of colors. We do that by using MapValues to map different colors to the same number. We then use the element NormDiscrete to create dummy variables which are used to indicate group membership.
  • Arithmetic Expressions: PMML offers a range of arithmetic functions (as well as string and date/time maniputation functions) that can be arranged in different ways to express complex arithmetic expressions. The example below solves the following operation:
ResultVar=maximum(round(InputVar1/3.3),2^(1+log(1.3*InputVar2+1)))

  • PMML 4.0 - Boolean Operations: Not only PMML 4.0 allows for Boolean operations to be fully expressed, but it also allows these to be nested into IF-THEN-ELSE logic. These new buit-in functions offer a vast new array of possibilites for representing data transformations in PMML. So, we devote the rest of this review by looking at transformations that can now be easily expressed in PMML 4.0.
We start with the PMML code below which implements the following logical and arithmetic operations:
IF InputVar1 == "Partner" THEN DerivedVar1 = "P" ELSE DerivedVar2 = 2 * InputVar2



Note that it uses the newly defined 4.0 functions: "if", "equal", and "not" as well as function "*".

The PMML code below assumes that both "then" and "else" parts of the "if" use the same derived variable to implement the following operations:
IF InputVar1 == "Partner" THEN DerivedVar1 = "5.1 * InputVar2" ELSE DerivedVar1 = "InputVar2 / 3.3"

Finally, we end our list of PMML pre-processing examples by showing the use of 4.0 functions "isMissing" and "isIn" combined with function "if". The PMML example below implements the following operations:
IF InputVar is missing THEN DerivedVar = 1 ELSE (IF InputVar is in ("Partner", "Associate", "Colleague") THEN DerivedVar = 2 ELSE DerivedVar = 3)


We finish part I of our PMML tour hoping that this short description of its pre-processing capabilities can help you to easily navigate through all the data transformations available in PMML 4.0.

11 comments:

  1. Understanding the latest concepts is possible only through contents like this. Thanks for sharing this page in here. It will be useful for my future projects as well. Keep blogging articles like this.


    Hadoop Training Chennai | Big Data Hadoop Training in Chennai | Manual testing training in Chennai

    ReplyDelete
  2. I have finally found a Worth able content to read. The way you have presented information here is quite impressive. I have bookmarked this page for future use. Thanks for sharing content like this once again. Keep sharing content like this.

    Software testing training in chennai | Software testing training | Automation testing courses in chennai

    ReplyDelete
  3. To keep ourselves up to date with the current trend is not an easy task in IT. But we can, through quality and worth able content like this. Thanks for sharing this web page. Please write more articles like this in future.


    Hadoop Training Chennai | Best hadoop training institute in chennai | Big Data Hadoop Training in Chennai

    ReplyDelete
  4. Great content. I really enjoyed while reading this content with useful information, keep sharing.
    Hadoop Training in Chennai | Hadoop Training Chennai | FITA Velachery | FITA Academy Chennai.

    ReplyDelete
  5. Processing superannuation payments using auto super. To process super payments you need to submit a payment batch for the nominated authoriser to approve.This blog is very nice.

    ReplyDelete
  6. I am following your blog from the beginning, it was so distinct & I had a chance to collect conglomeration of information that helps me a lot to improvise myself. I hope this will help many readers who are in need of this vital piece of information. Thanks for sharing & keep your blog updated.
    Regards,
    CCNA Training in Chennai | CCNA Training institute in Chennai | CCNA Training

    ReplyDelete
  7. Your explanation is awesome..Clearly explained about it..Keep on blogging..
    Hadoop training in chennai

    ReplyDelete
  8. Excellent post!!! Your article helped to under the future of java development. Being an open source platform, java is integrated in most of the software development industries to create rich featured applications. Java Course in Chennai | Best JAVA Training in Chennai

    ReplyDelete
  9. Your article is trustworthy.This content creates a new hope and inspiration with in me. Thanks for sharing article like this. The way you have started everything above is quite awesome. Keep blogging like this.
    Thanks,
    Hadoop Training in Chennai | Hadoop Training institutes in Chennai

    ReplyDelete
  10. Excellant content. To know the details and importance of python course visit below. Python is an object oriented high level programming language which is built in data structures combined with dynamic typing and dynamic binding making it very attractive for rapid application development.
    Python Training in Chennai | Python Course in Chennai

    ReplyDelete

Note: Only a member of this blog may post a comment.

Welcome to the World of Predictive Analytics!

© Predictive Analytics by Zementis, Inc. - All Rights Reserved.





Copyright © 2009 Zementis Incorporated. All rights reserved.

Privacy - Terms Of Use - Contact Us