Predictive Analytics, Big Data, Hadoop, PMML: Complex Outputs in PMML

Complex Outputs in PMML

Introduction

Despite a seemingly strict schema to follow, PMML allows one to specify complex outputs and output formats which may not be obvious from looking at the PMML element definitions. As an example, we show how one would define a PMML model to output model predictions in a non-standard format decided by the user. This illustrates that PMML, the industry standard, already has considerable flexibility built into it.

Consider a model with multiple outputs of a similar kind. One example is a model which predicts the category of a variable with multiple categories. It is not unusual for the number of categories to be large in practical applications. If this number is on the order of 100, going through 100 categories and looking at their probabilities becomes impractical. As another concrete example, consider the MNIST data set. It has only 10 possible categories for the predicted value. Even so, we may wish to see only the most likely predictions instead of all possible ones. We might want to see only the prediction whose probability is more than 0.5, or several predictions, each with a probability above a critical value.

Model Creation

With these examples in mind, let us consider a classification model which predicts the probabilities of, say, 10 categories. We wish to see an output in which only the categories with probabilities greater than 0.3 are reported. Since there could be more than one, and we wish to minimize the number of variables we look at, we will further join all the chosen categories into one comma-separated string, with their probabilities in another comma-separated string in the same order. Let us choose an example dataset with several categories to showcase this problem. We use the ‘audit’ data set which is attached to the ‘pmml’ package. One of the variables in that data set is ‘Marital’, which has 6 levels; we choose this as the predicted variable. Further suppose that we consider a multinomial regression model good enough for our purposes. A straightforward model fit is simple enough:

library(pmml)
library(nnet)
library(knitr)
library(XML)   # xmlTreeParse, getNodeSet and addAttributes used below come from here
data(audit)

model <- multinom(Marital~., data=audit[,-1])
pmodel <- pmml(model)

The PMML representation specifies that all the category probabilities be output along with the single predicted category.

pmodel[[3]][[2]]

## <Output>
##  <OutputField name="Predicted_Marital" feature="predictedValue"/>
##  <OutputField name="Probability_Absent" optype="continuous" dataType="double" feature="probability" value="Absent"/>
##  <OutputField name="Probability_Divorced" optype="continuous" dataType="double" feature="probability" value="Divorced"/>
##  <OutputField name="Probability_Married" optype="continuous" dataType="double" feature="probability" value="Married"/>
##  <OutputField name="Probability_Married-spouse-absent" optype="continuous" dataType="double" feature="probability" value="Married-spouse-absent"/>
##  <OutputField name="Probability_Unmarried" optype="continuous" dataType="double" feature="probability" value="Unmarried"/>
##  <OutputField name="Probability_Widowed" optype="continuous" dataType="double" feature="probability" value="Widowed"/>
## </Output>

A partial tabular representation of the output of this model would be:

Predicted_Marital	Probability_Absent	Probability_Divorced	Probability_Married
Absent	0.40	0.1	0.05
Divorced	0.35	0.5	0.10
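
To make the target format concrete, here is a minimal base-R sketch of the transformation we are after, applied to the two sample rows above. The data frame `scores`, the helper `collapse_above` and the variable `cutoff` are hypothetical names used only for illustration; in the rest of this post the work is done inside the PMML itself, not in R.

```r
# Hypothetical sample scores, copied from the partial table above
scores <- data.frame(
  Probability_Absent   = c(0.40, 0.35),
  Probability_Divorced = c(0.10, 0.50),
  Probability_Married  = c(0.05, 0.10)
)
cutoff <- 0.3

# For one row of probabilities, keep the entries above the cutoff and join
# the category names and the probabilities as comma-separated strings
collapse_above <- function(p, cutoff) {
  keep <- p > cutoff
  c(FinalValues = paste(sub("^Probability_", "", names(p)[keep]), collapse = ","),
    FinalProbs  = paste(p[keep], collapse = ","))
}

# apply() hands each row to the helper as a named numeric vector
result <- t(apply(as.matrix(scores), 1, collapse_above, cutoff = cutoff))
print(result)
```

Each row is reduced to two strings, however many categories the model has.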

Output Modification

Before we can modify the PMML, we must parse it so that we can access various parts of it easily using XPath.

pmodelstring <- toString.XMLNode(pmodel)
pmodelTree <- xmlTreeParse(pmodelstring,asText=TRUE,useInternalNodes=TRUE)

Since we eventually want to see just the final output we are going to derive, the outputs defined at present must not be allowed to print their values. Fortunately we have all we need. PMML defines an attribute ‘isFinalResult’ on the ‘OutputField’ element; if it is set to ‘false’, the field is still defined but its value is not returned. This was introduced to allow intermediate derived fields which are not useful to output. We also have functions in the ‘pmml’ package to help us add ‘OutputField’ elements and attributes to a preexisting PMML, which is just what we need. We assume the model we just created is in version 4.3 of PMML, look up the corresponding namespace, and add ‘isFinalResult’ to all existing ‘OutputField’ elements.

namespace <- pmml:::.getNamespace("4_3")
outputs <- getNodeSet(pmodelTree,"/p:PMML//p:OutputField",c(p=namespace))
for(i in seq_along(outputs)) {
  addAttributes(outputs[[i]],isFinalResult="false")
}
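
The same attribute-adding pattern can be tried on a small stand-alone document. The sketch below is only an illustration: the toy XML string is hypothetical and much shorter than a real model, and the calls used (`xmlParse`, `getNodeSet`, `addAttributes`, `xmlGetAttr`) come from the XML package.

```r
library(XML)

# A toy document standing in for the real PMML (hypothetical content)
ns  <- "http://www.dmg.org/PMML-4_3"
txt <- paste0(
  '<PMML xmlns="', ns, '" version="4.3"><Output>',
  '<OutputField name="Predicted_Marital" feature="predictedValue"/>',
  '<OutputField name="Probability_Absent" feature="probability" value="Absent"/>',
  '</Output></PMML>')
doc <- xmlParse(txt, asText = TRUE)

# The elements live in the default PMML namespace, so the XPath needs a prefix
fields <- getNodeSet(doc, "//p:OutputField", c(p = ns))
for (i in seq_along(fields)) {
  # Internal nodes are references, so the document is modified in place
  addAttributes(fields[[i]], isFinalResult = "false")
}

# Read the attribute back to confirm that every field was updated
flags <- sapply(fields, xmlGetAttr, "isFinalResult")
print(flags)
```

Because the nodes are references into the parsed tree, no reassignment of `doc` is needed after the loop.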

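Now the main modifications start. Our first step is to find all the category values involved. These are the same as the ‘value’ attributes of the ‘OutputField’ elements; here we read them from the ‘Value’ elements of the first ‘DataField’ in the ‘DataDictionary’, which is the predicted variable.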

val <- getNodeSet(pmodelTree,"/p:PMML/p:DataDictionary/p:DataField[1]/p:Value/@value",c(p=namespace))
# Extract the attribute string from each entry of the node set
Vals <- sapply(seq_along(val), function(x) val[[x]][[1]])
# Vals is now all the categories used in the PMML
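
As a possible alternative, the category names can be pulled out in one call by applying `xmlGetAttr` to the ‘Value’ elements with `xpathSApply`, both from the XML package. The toy document below is a hypothetical three-category stand-in for the real ‘DataDictionary’, and `cats` is a name chosen for this sketch.

```r
library(XML)

# Hypothetical, shortened DataDictionary with three of the categories
ns  <- "http://www.dmg.org/PMML-4_3"
txt <- paste0(
  '<PMML xmlns="', ns, '" version="4.3"><DataDictionary>',
  '<DataField name="Marital" optype="categorical" dataType="string">',
  '<Value value="Absent"/><Value value="Divorced"/><Value value="Married"/>',
  '</DataField></DataDictionary></PMML>')
doc <- xmlParse(txt, asText = TRUE)

# Select the Value elements and read their 'value' attribute directly
cats <- xpathSApply(doc, "/p:PMML/p:DataDictionary/p:DataField[1]/p:Value",
                    xmlGetAttr, "value", namespaces = c(p = ns))
print(cats)
```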

We create 2 output fields per category. One appends the name of that category to the final output if its probability is greater than a critical value; the other appends the probability of that category, again only if it is greater than the critical value. To do so in one go, we create lists of the element names, the ‘OutputField’ attributes and the transformations for each ‘OutputField’ element. The idea is to put the information in a format which can be used by the ‘makeOutputNodes’ function in the pmml package, whose result is later passed to ‘addOutputField’.

Let's start the process. For the value list (suffix ‘V’) and the probability list (suffix ‘P’), ‘out’ is the list of element names, ‘fattr’ is the list of attributes each ‘OutputField’ element is to have and ‘fexpr’ is the expression, or post-processing calculation, done in each ‘OutputField’. The processing is done by calling 2 functions, one dealing with the category and one with its probability. These functions were added by hand to the PMML in a ‘TransformationDictionary’ element. Although this could be done programmatically by creating the XML nodes and adding them with the XML package, this part was simple enough that adding by hand is justified.

count <- 0
fexprV <- list()
fattrV <- list()
outV <- list()
fexprP <- list()
fattrP <- list()
outP <- list()

for(v in Vals) {
  count <- count + 1
  # The name of every node is 'OutputField'!
  outV[[count]] <- "OutputField"
  # Create the 3 parameters to pass to the function for the 'value' list
  nm <- paste0("FinalValues",count)
  p1 <- paste0('Probability_',v)
  # The second argument is the string accumulated by the previous step
  p2 <- paste0('FinalValues',count-1)
  # Store the attributes as a list. Each member is another list of attributes for that element
  fattrV[[count]] <- list(name=nm,feature="transformedValue",dataType="string",isFinalResult="false")
  # Store the post-processing information as a list. The information is in text format, converted to XML
  # format automatically by the node creating function used later ('makeOutputNodes')
  fexprV[[count]] <- paste0("addValues(", p1, ",", p2, ",'",v,"')")
  # Repeat for the 'probability' list
  outP[[count]] <- "OutputField"
  nm <- paste0("FinalProbs",count)
  p1 <- paste0('Probability_',v)
  p2 <- paste0("FinalProbs",count-1)
  fattrP[[count]] <- list(name=nm,feature="transformedValue",dataType="string",isFinalResult="false")
  fexprP[[count]] <- paste0("addProbs(",p1,",",p2,")")
}
# 2 elements are needed to initialize the output series
first2nodes <- makeOutputNodes(
  name=list("OutputField","OutputField"),
  attributes=list(
    list(name="FinalValues0",feature="transformedValue",dataType="string",isFinalResult="false"),
    list(name="FinalProbs0",feature="transformedValue",dataType="string",isFinalResult="false")),
  expression=list("''","''"))
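
As a rough illustration of what the loop builds, the following base-R sketch prints the text expressions for a hypothetical subset of three categories. It assumes that each step should read the field produced by the previous step (index i - 1), as the ‘FinalProbs’ chain does; the names `cats`, `value_expr` and `prob_expr` are made up for this example.

```r
cats <- c("Absent", "Divorced", "Married")   # hypothetical subset of the categories
idx  <- seq_along(cats)

# Each step reads the probability of its own category and the string
# accumulated by the previous step, starting from FinalValues0 / FinalProbs0
value_expr <- paste0("addValues(Probability_", cats, ",FinalValues", idx - 1, ",'", cats, "')")
prob_expr  <- paste0("addProbs(Probability_", cats, ",FinalProbs", idx - 1, ")")

print(value_expr)
print(prob_expr)
```

Seen this way, the two initialization nodes exist only to give the first step something to read.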

Instead of explaining the above R fragment line by line, it is probably easier if we just look at the result of the code. Let us just finish adding the ‘OutputField’ elements we created above to the PMML:

# Since we collected information about the kind of nodes needed in a careful way, we can now
# create the 6*2 outputfield nodes in 1 line each
mV <- makeOutputNodes(name=outV,attributes=fattrV,expression=fexprV)
mP <- makeOutputNodes(name=outP,attributes=fattrP,expression=fexprP)

Let's take a look. Initially the outputs look as follows:

    <Output>
      <OutputField name="Predicted_Marital" feature="predictedValue" isFinalResult="false"/>
      <OutputField name="Probability_Absent" optype="continuous" dataType="double" feature="probability" value="Absent" isFinalResult="false"/>
      <OutputField name="Probability_Divorced" optype="continuous" dataType="double" feature="probability" value="Divorced" isFinalResult="false"/>
      <OutputField name="Probability_Married" optype="continuous" dataType="double" feature="probability" value="Married" isFinalResult="false"/>
      <OutputField name="Probability_Married-spouse-absent" optype="continuous" dataType="double" feature="probability" value="Married-spouse-absent" isFinalResult="false"/>
      <OutputField name="Probability_Unmarried" optype="continuous" dataType="double" feature="probability" value="Unmarried" isFinalResult="false"/>
      <OutputField name="Probability_Widowed" optype="continuous" dataType="double" feature="probability" value="Widowed" isFinalResult="false"/>
    </Output>

For each of the ‘OutputField’ elements which gives us a probability, we add 2 new ‘OutputField’ elements. So, for example, for the first one, which computes the probability for the value “Absent”, we added

mV[[1]]

## <OutputField name="FinalValues1" feature="transformedValue" dataType="string" isFinalResult="false">
##   <Apply function="addValues">
##     <FieldRef field="Probability_Absent"/>
##     <FieldRef field="FinalValues0"/>
##     <Constant dataType="string">Absent</Constant>
##   </Apply>
## </OutputField>

mP[[1]]

## <OutputField name="FinalProbs1" feature="transformedValue" dataType="string" isFinalResult="false">
##   <Apply function="addProbs">
##     <FieldRef field="Probability_Absent"/>
##     <FieldRef field="FinalProbs0"/>
##   </Apply>
## </OutputField>

So each new ‘OutputField’ element takes the probability of the corresponding value v and applies the function addValues or addProbs to it. These are the functions mentioned earlier, which were added by hand in a ‘TransformationDictionary’ element placed right after the ‘DataDictionary’ element.

  <TransformationDictionary>
    <DefineFunction name="addValues" dataType="string" optype="categorical">
      <ParameterField name="Prob" dataType="double"/>
      <ParameterField name="PrevVal" dataType="string"/>
      <ParameterField name="Val" dataType="string"/>
      <Apply function="if">
        <Apply function="greaterThan">
          <FieldRef field="Prob"/>
          <Constant dataType="double">0.3</Constant>
        </Apply>
        <Apply function="concat">
          <FieldRef field="PrevVal"/>
          <Constant dataType="string">,</Constant>
          <FieldRef field="Val"/>
        </Apply>
        <FieldRef field="PrevVal"/>
      </Apply>
    </DefineFunction>
    <DefineFunction name="addProbs" dataType="string" optype="categorical">
      <ParameterField name="Prob" dataType="double"/>
      <ParameterField name="PrevProb" dataType="string"/>
      <Apply function="if">
        <Apply function="greaterThan">
          <FieldRef field="Prob"/>
          <Constant dataType="double">0.3</Constant>
        </Apply>
        <Apply function="concat">
          <FieldRef field="PrevProb"/>
          <Constant dataType="string">,</Constant>
          <FieldRef field="Prob"/>
        </Apply>
        <FieldRef field="PrevProb"/>
      </Apply>
    </DefineFunction>
  </TransformationDictionary>

Let's look at what these functions do. The addValues function takes as its inputs the probability of the corresponding value, a string which is the concatenation of all previously chosen values, and the value itself. If the probability is greater than a chosen critical value, here 0.3, the concatenated string is extended by a comma and the present value; otherwise it is passed on unchanged. Similarly, addProbs appends a comma and the present probability to the string of previously chosen probabilities if the probability is higher than the critical value set in the function.
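
For readers more at home in R than in PMML, the two functions can be mimicked roughly as follows. This is only an analogy under stated assumptions: `add_values`, `add_probs` and `cutoff` are hypothetical R names, and a PMML scoring engine may format a double inside ‘concat’ differently from R's `paste0`.

```r
cutoff <- 0.3  # the critical probability hard-coded in the PMML functions

# R analogue of addValues: append ",<val>" when the probability passes the cutoff
add_values <- function(prob, prev_val, val) {
  if (prob > cutoff) paste0(prev_val, ",", val) else prev_val
}

# R analogue of addProbs: append ",<prob>" under the same condition
add_probs <- function(prob, prev_prob) {
  if (prob > cutoff) paste0(prev_prob, ",", prob) else prev_prob
}

step1 <- add_values(0.35, "", "Absent")        # ",Absent"
step2 <- add_values(0.50, step1, "Divorced")   # ",Absent,Divorced"
step3 <- add_values(0.10, step2, "Married")    # unchanged, 0.10 is below the cutoff
prob1 <- add_probs(0.35, "")                   # ",0.35"
```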

Since the string addition is always a comma and a value, the final concatenated string will have a comma at the very beginning. We use regular expressions (supported in PMML) to delete that.
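
The effect of that pattern can be checked quickly in R. The helper name `strip_leading_comma` is hypothetical, and R's `sub` is used here only as a stand-in for the PMML ‘replace’ function; since the pattern is anchored with ‘^’, it can match at most once.

```r
# Remove a single comma at the very start of the string, if there is one
strip_leading_comma <- function(x) sub("^,", "", x)

cleaned <- strip_leading_comma(",Absent,Divorced")   # "Absent,Divorced"
empty   <- strip_leading_comma("")                   # "" when no category was chosen
```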

last2nodes <- makeOutputNodes(
  name=list("OutputField","OutputField"),
  attributes=list(
    list(name="FinalValues",feature="transformedValue",dataType="string"),
    list(name="FinalProbs",feature="transformedValue",dataType="string")),
  expression=list(
    paste0("replace(FinalValues", count, ",'^,','')"),
    paste0("replace(FinalProbs", count, ",'^,','')")))
# The new OutputFields are all created, now add them to the initial XML tree
pmodelTree <- addOutputField(pmodelTree,outputNodes = c(first2nodes,mV,mP,last2nodes),namespace="4_3")
# The list of nodes passed as outputNodes is c(A, B, C, D), where
# A = initialization nodes
# B = value nodes, one per category
# C = probability nodes, one per category
# D = last nodes to remove the leading "," character

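The final ‘Output’ element is shown below. Note that in ‘FinalValues4’ and ‘FinalProbs4’ the hyphens in the field name ‘Probability_Married-spouse-absent’ were read by the expression parser as subtractions, so these two elements refer to the fields ‘Probability_Married’, ‘spouse’ and ‘absent’ rather than to the single probability field; they would need to be corrected, for example by hand, before the model is used.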

getNodeSet(pmodelTree,"//p:Output",c(p=namespace))[[1]]

## <Output>
##   <OutputField name="Predicted_Marital" feature="predictedValue" isFinalResult="false"/>
##   <OutputField name="Probability_Absent" optype="continuous" dataType="double" feature="probability" value="Absent" isFinalResult="false"/>
##   <OutputField name="Probability_Divorced" optype="continuous" dataType="double" feature="probability" value="Divorced" isFinalResult="false"/>
##   <OutputField name="Probability_Married" optype="continuous" dataType="double" feature="probability" value="Married" isFinalResult="false"/>
##   <OutputField name="Probability_Married-spouse-absent" optype="continuous" dataType="double" feature="probability" value="Married-spouse-absent" isFinalResult="false"/>
##   <OutputField name="Probability_Unmarried" optype="continuous" dataType="double" feature="probability" value="Unmarried" isFinalResult="false"/>
##   <OutputField name="Probability_Widowed" optype="continuous" dataType="double" feature="probability" value="Widowed" isFinalResult="false"/>
##   <OutputField name="FinalValues0" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Constant dataType="string"></Constant>
##   </OutputField>
##   <OutputField name="FinalProbs0" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Constant dataType="string"></Constant>
##   </OutputField>
##   <OutputField name="FinalValues1" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addValues">
##       <FieldRef field="Probability_Absent"/>
##       <FieldRef field="FinalValues0"/>
##       <Constant dataType="string">Absent</Constant>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalValues2" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addValues">
##       <FieldRef field="Probability_Divorced"/>
##       <FieldRef field="FinalValues1"/>
##       <Constant dataType="string">Divorced</Constant>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalValues3" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addValues">
##       <FieldRef field="Probability_Married"/>
##       <FieldRef field="FinalValues2"/>
##       <Constant dataType="string">Married</Constant>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalValues4" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addValues">
##       <Apply function="-">
##         <Apply function="-">
##           <FieldRef field="Probability_Married"/>
##           <FieldRef field="spouse"/>
##         </Apply>
##         <FieldRef field="absent"/>
##       </Apply>
##       <FieldRef field="FinalValues3"/>
##       <Constant dataType="string">Married-spouse-absent</Constant>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalValues5" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addValues">
##       <FieldRef field="Probability_Unmarried"/>
##       <FieldRef field="FinalValues4"/>
##       <Constant dataType="string">Unmarried</Constant>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalValues6" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addValues">
##       <FieldRef field="Probability_Widowed"/>
##       <FieldRef field="FinalValues5"/>
##       <Constant dataType="string">Widowed</Constant>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalProbs1" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addProbs">
##       <FieldRef field="Probability_Absent"/>
##       <FieldRef field="FinalProbs0"/>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalProbs2" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addProbs">
##       <FieldRef field="Probability_Divorced"/>
##       <FieldRef field="FinalProbs1"/>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalProbs3" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addProbs">
##       <FieldRef field="Probability_Married"/>
##       <FieldRef field="FinalProbs2"/>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalProbs4" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addProbs">
##       <Apply function="-">
##         <Apply function="-">
##           <FieldRef field="Probability_Married"/>
##           <FieldRef field="spouse"/>
##         </Apply>
##         <FieldRef field="absent"/>
##       </Apply>
##       <FieldRef field="FinalProbs3"/>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalProbs5" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addProbs">
##       <FieldRef field="Probability_Unmarried"/>
##       <FieldRef field="FinalProbs4"/>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalProbs6" feature="transformedValue" dataType="string" isFinalResult="false">
##     <Apply function="addProbs">
##       <FieldRef field="Probability_Widowed"/>
##       <FieldRef field="FinalProbs5"/>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalValues" feature="transformedValue" dataType="string">
##     <Apply function="replace">
##       <FieldRef field="FinalValues6"/>
##       <Constant dataType="string">^,</Constant>
##       <Constant dataType="string"></Constant>
##     </Apply>
##   </OutputField>
##   <OutputField name="FinalProbs" feature="transformedValue" dataType="string">
##     <Apply function="replace">
##       <FieldRef field="FinalProbs6"/>
##       <Constant dataType="string">^,</Constant>
##       <Constant dataType="string"></Constant>
##     </Apply>
##   </OutputField>
## </Output>

We can summarize the steps we took as:

Add 2 functions by hand to the ‘TransformationDictionary’ element.
Given N ‘OutputField’ elements which output a probability, create 2 new ‘OutputField’ elements for each of them, 2N in all, using makeOutputNodes.
Create the ‘OutputField’ elements which initialize the series and those which remove the unwanted leading comma from the final outputs, and add all these ‘OutputField’ elements to the ‘Output’ element using addOutputField.

The functions used in the outputs can be represented as the following workflow. We are given as inputs an ‘OutputField’ element which gives us the category (the ‘value’ attribute) and its probability (the field identified by the ‘name’ attribute).

Initialize FinalValues=“” and FinalProbs=“” (empty strings). Initialize the critical probability as 0.3
For each value
Get the probability from the field with the corresponding name
If probability > 0.3
- FinalValues = FinalValues + “,” + value
- FinalProbs = FinalProbs + “,” + probability
else skip
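
The workflow above can be sketched in base R as a fold over the categories. Keeping every intermediate string mirrors the chain of intermediate ‘OutputField’ elements in the PMML. The names `probs`, `chain`, `values_chain` and `probs_chain` are hypothetical, and the scores are three of the values from the second sample row given earlier.

```r
cutoff <- 0.3
# Hypothetical scores: the second sample row from the earlier table
probs <- c(Absent = 0.35, Divorced = 0.50, Married = 0.10)

# Fold over the categories, appending ",<item>" only when the probability
# passes the cutoff; accumulate = TRUE keeps every intermediate string
chain <- function(items) {
  Reduce(function(acc, i) {
    if (probs[[i]] > cutoff) paste0(acc, ",", items[[i]]) else acc
  }, seq_along(probs), "", accumulate = TRUE)
}

values_chain <- chain(names(probs))   # plays the role of FinalValues0 .. FinalValues3
probs_chain  <- chain(probs)          # plays the role of FinalProbs0 .. FinalProbs3
print(values_chain)

# The last output field strips the leading comma
FinalValues <- sub("^,", "", values_chain[length(values_chain)])
FinalProbs  <- sub("^,", "", probs_chain[length(probs_chain)])
print(c(FinalValues = FinalValues, FinalProbs = FinalProbs))
```

PMML has no loop construct for output fields, which is why the PMML version spells out one element per step instead of folding.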

So if values “Absent” and “Divorced” are picked, FinalValues=“,Absent,Divorced”. The last output therefore replaces the “,” character at the beginning with “”. If the sample outputs given earlier are assumed, the new output would look like:

FinalValues	FinalProbs
Absent	0.4
Absent,Divorced	0.35,0.5

Perhaps not too big an improvement for 6 categories, but it is definitely worth it for 200 categories.

Thursday, April 20, 2017
