Predictive Analytics, Big Data, Hadoop, PMML: July 2016

PMML Post-processing: Output Helper Function

Introduction

This is the fourth in a series of posts describing some PMML editing capabilities of the latest R ‘pmml’ package. For practical use, creating predictive solutions is just the beginning. Once built, they need to be deployed to the operational environment where they are actually put to use. The Predictive Model Markup Language (PMML) delivers the necessary representational power and agility for solutions to be quickly and easily exchanged between systems, allowing for predictions to move at the speed of business.

The Output Element

The Output element in a PMML model performs two functions. First, it lists the variables that are to be output as scored values after the model is applied to a dataset. Through its attributes, a model can output multiple variables, each defined as one of the standard features of the prediction. These features include the predicted value, the predicted category, the probability of the winning category, the probabilities of the other categories, and others.

The second responsibility of the Output element is to post-process data. The model scores the data and passes the predictions on to the Output element; one can then define operations to be applied to those values to process them further and output the post-processed data.

It is not uncommon for a model to be fit and then converted to a PMML representation that lacks the post-processing the modeler would have wanted. Practical applications of a model may well require extra operations that are designed by the modeler rather than generated automatically. We will look at just such a scenario: after a model has been built, one wishes to add new output nodes to the existing model in order to define new features or new processing information in the PMML representation.

The values to be output are listed as OutputField child elements of the Output element. All the information in each of those elements is contained in its attributes. The most commonly used attributes are name, optype, dataType, feature and value. The name is simply the name of the field being defined. The optype and dataType, as usual, define the kind of variable being defined: is it a string or an integer, is it a numeric value or a categorical value? The feature attribute defines what the output actually is and how to calculate it. Some possible outputs are predefined as attribute values. For example, if the feature is predictedValue, then the output field is the predicted value of the model. If the feature is probability, the output is the probability of the winning category. This is defined automatically, so the method used to calculate the probability does not have to be specified. If the feature is probability and the value attribute is one of the allowed values of the categorical variable being predicted, the field is the probability of that particular value.

Often there are calculations desired on the model predictions which are not simply predefined as a possible value of the feature attribute. In such a case, feature is set to transformedValue and an expression is given as a child element. That expression is used to make the calculation desired.

We start by first making a simple regression model.

# we will use the iris dataset
dat <- iris

# the model will have both continuous and categorical input fields
mod <- lm(Sepal.Length ~ ., data=dat)

Let's look at the Output element of the created model:

  <Output>
   <OutputField name="Predicted_Sepal.Length" feature="predictedValue"/>
  </Output>
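The post does not show the call that produced this fragment. The following is a minimal sketch of one way it might be printed, assuming the pmml package is installed, that pmml() returns an XML node for the lm fit, and that the model element is named RegressionModel. The name pmod0 is introduced here only for illustration.

```r
# Refit the model from the post so this block is self-contained
dat <- iris
mod <- lm(Sepal.Length ~ ., data = dat)

# Assumption: pmml is installed; skip quietly if it is not
if (requireNamespace("pmml", quietly = TRUE)) {
  pmod0 <- pmml::pmml(mod)  # convert the fitted model to PMML
  # Assumption: the lm fit is stored under a RegressionModel element
  print(pmod0[["RegressionModel"]][["Output"]])
}
```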

The model currently outputs just one variable, the predicted length. We now wish to output not just the predicted length but also several post-processed versions of it. The transformations used below are for illustration only; they do not necessarily make sense for a simple iris dataset.

The first step is to create the OutputField elements to add inside the Output element. The pmml package provides a function, makeOutputNodes, which makes creating such elements easy. It can also be used to create multiple OutputField nodes in a single call. As an example, we will create 2 new OutputField nodes; the first one computes the log of the predicted value and the second one applies the function ln(x/(1-x)) to the predicted field.

Create additional output nodes

onodes0<-makeOutputNodes(name=list("OutputField","OutputField"),
                  attributes=list(list(name="name1",optype="continuous"),list(name="name2")),
                  expression=list("ln(Predicted_Sepal.Length)",
                             "ln(Predicted_Sepal.Length/(1-Predicted_Sepal.Length))"))

Note that the values of the parameters are given one-by-one in a list format. The OutputField nodes are created:

> onodes0
[[1]]
<OutputField name="name1" optype="continuous">
  <Apply function="ln">
    <FieldRef field="Predicted_Sepal.Length"/>
  </Apply>
</OutputField> 

[[2]]
<OutputField name="name2">
  <Apply function="ln">
    <Apply function="/">
      <FieldRef field="Predicted_Sepal.Length"/>
      <Apply function="-">
        <Constant dataType="double">1</Constant>
        <FieldRef field="Predicted_Sepal.Length"/>
      </Apply>
    </Apply>
  </Apply>
</OutputField>
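To see why a text expression turns into nested Apply elements, it can help to look at the expression as a parse tree. The sketch below uses only base R; to_apply is a hypothetical helper written for this illustration and is not part of the pmml package, whose internal conversion is not shown in this post.

```r
# Hypothetical helper (not from the pmml package): walk a parsed R
# expression and emit PMML-style Apply / FieldRef / Constant text.
to_apply <- function(e, indent = 0) {
  pad <- strrep("  ", indent)
  if (is.name(e)) {
    # A bare symbol is treated as a reference to a field
    return(paste0(pad, '<FieldRef field="', as.character(e), '"/>'))
  }
  if (is.numeric(e)) {
    # A numeric literal becomes a Constant
    return(paste0(pad, '<Constant dataType="double">', e, '</Constant>'))
  }
  if (identical(e[[1]], as.name("("))) {
    # Parentheses only group; descend into their content
    return(to_apply(e[[2]], indent))
  }
  fn <- as.character(e[[1]])
  # Map R operators to the function names seen in the PMML output
  fn <- switch(fn, "&&" = "and", "||" = "or", "!" = "not", fn)
  args <- vapply(as.list(e)[-1], to_apply, character(1), indent = indent + 1)
  paste(c(paste0(pad, '<Apply function="', fn, '">'),
          args,
          paste0(pad, "</Apply>")),
        collapse = "\n")
}

# quote() parses without evaluating, so ln() need not exist in R
cat(to_apply(quote(ln(Predicted_Sepal.Length / (1 - Predicted_Sepal.Length)))), "\n")
```

The innermost operation, 1 - Predicted_Sepal.Length, ends up as the most deeply nested Apply, which mirrors the structure of the second node printed above.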

Next we have to insert these new nodes inside the Output node. The pmml package provides a helper function to do just that: addOutputField. This function takes as its input the PMML model, the OutputField nodes and a parameter ‘at’. Of the possibly many OutputField elements under the Output element, it finds the one at index ‘at’ and inserts the given OutputField elements after it.

pmod<-addOutputField(xmlmodel=pmml(mod), outputNodes=onodes0)

By default, the value of ‘at’ is “End”, which means the new nodes are inserted at the end:

    <Output>
      <OutputField name="Predicted_Sepal.Length" feature="predictedValue"/>
      <OutputField name="name1" optype="continuous">
        <Apply function="ln">
          <FieldRef field="Predicted_Sepal.Length"/>
        </Apply>
      </OutputField>
      <OutputField name="name2">
        <Apply function="ln">
          <Apply function="/">
            <FieldRef field="Predicted_Sepal.Length"/>
            <Apply function="-">
              <Constant dataType="double">1</Constant>
              <FieldRef field="Predicted_Sepal.Length"/>
            </Apply>
          </Apply>
        </Apply>
      </OutputField>
    </Output>
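As a rough check of what these two fields would evaluate to, the same transformations can be applied to the model's predictions directly in base R. This is only a sketch: it assumes PMML's ln corresponds to R's natural log, and the variable names simply reuse the field names from the example.

```r
# Refit the model from the post so this block is self-contained
dat <- iris
mod <- lm(Sepal.Length ~ ., data = dat)

# Predicted sepal lengths, the value exposed as Predicted_Sepal.Length
pred <- predict(mod, newdata = dat)

# name1: natural log of the prediction
name1 <- log(pred)

# name2: ln(x / (1 - x)); the predictions are all greater than 1, so the
# argument of the log is negative and R returns NaN (with a warning,
# suppressed here)
name2 <- suppressWarnings(log(pred / (1 - pred)))

summary(name1)
all(is.nan(name2))
```

The second field is undefined for every row of iris, which is a concrete instance of the transformations being purely illustrative for this dataset.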

One can also use the function to add one set of OutputField elements at a time. Let us imagine there are yet more OutputField nodes, named ‘name3’ and ‘name4’, to work with. To show how the function works, we first create these hypothetical nodes and then insert them in the middle of the existing fields rather than at the end. With many such fields it might be better to predefine lists with the names and attributes of the new elements, but here we just do it in one call.

onodes1<-makeOutputNodes(name=list("OutputField","OutputField"),
          attributes=list(list(name="name3",dataType="double",optype="continuous"),
                      list(name="name4")))

which results in:

> onodes1
[[1]]
<OutputField name="name3" dataType="double" optype="continuous"/> 

[[2]]
<OutputField name="name4"/>

As an example, let us insert them after the second OutputField node:

pmod2 <- addOutputField(xmlmodel=pmod, outputNodes=onodes1,at=2)

so that the output now looks like:

    <Output>
      <OutputField name="Predicted_Sepal.Length" feature="predictedValue"/>
      <OutputField name="name1" optype="continuous">
        <Apply function="ln">
          <FieldRef field="Predicted_Sepal.Length"/>
        </Apply>
      </OutputField>
      <OutputField name="name3" dataType="double" optype="continuous"/>
      <OutputField name="name4"/>
      <OutputField name="name2">
        <Apply function="ln">
          <Apply function="/">
            <FieldRef field="Predicted_Sepal.Length"/>
            <Apply function="-">
              <Constant dataType="double">1</Constant>
              <FieldRef field="Predicted_Sepal.Length"/>
            </Apply>
          </Apply>
        </Apply>
      </OutputField>
    </Output>
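The insert-after-index behaviour of ‘at’ can be mimicked on a plain vector of field names with base R's append(). This is only an analogy for the ordering; it does not touch the PMML itself, and the variable names are made up for the illustration.

```r
# Field names in the Output element before the second insertion
fields <- c("Predicted_Sepal.Length", "name1", "name2")
new_fields <- c("name3", "name4")

# Analogue of at=2: place the new fields after the second existing one
reordered <- append(fields, new_fields, after = 2)
print(reordered)

# Analogue of the default at="End": place them after the last field
at_end <- append(fields, new_fields, after = length(fields))
print(at_end)
```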

Now we can add some post-processing to an OutputField with a given name, say ‘name3’:

addOutputField(xmlmodel=pmod2, xformText=list("exp(name1) && !name1"), nodeName="name3")

which results in:

    <Output>
      <OutputField name="Predicted_Sepal.Length" feature="predictedValue"/>
      <OutputField name="name1" optype="continuous">
        <Apply function="ln">
          <FieldRef field="Predicted_Sepal.Length"/>
        </Apply>
      </OutputField>
      <OutputField name="name3" dataType="double" optype="continuous">
        <Apply function="and">
          <Apply function="exp">
            <FieldRef field="name1"/>
          </Apply>
          <Apply function="not">
            <FieldRef field="name1"/>
          </Apply>
        </Apply>
      </OutputField>
      <OutputField name="name4"/>
      <OutputField name="name2">
        <Apply function="ln">
          <Apply function="/">
            <FieldRef field="Predicted_Sepal.Length"/>
            <Apply function="-">
              <Constant dataType="double">1</Constant>
              <FieldRef field="Predicted_Sepal.Length"/>
            </Apply>
          </Apply>
        </Apply>
      </OutputField>
    </Output>

The addOutputField function also takes an attributes parameter, for cases where one only wants to add attributes. It takes one more parameter, ‘whichOutput’, which matters only when the PMML contains multiple models and therefore multiple Output elements; it selects which of those Output elements the operations are performed on. The function can thus do 3 things, one per call: add OutputField elements created elsewhere, add transformation expressions to OutputField elements that already exist, and add attributes to OutputField elements that already exist.

The purpose of this function, then, is to let us take a given PMML model and add new variables to be output, together with post-processing operations on the new output fields.

Friday, July 22, 2016
