## Friday, June 17, 2016

### Adding arbitrary attributes to the PMML MiningField element


## Introduction

This is the first in a series of posts describing some PMML editing capabilities of the latest R ‘pmml’ package. In practice, creating a predictive solution is only part of the equation: once built, it must be moved to the operational environment where it is actually put to use. In the agile world we live in today, the Predictive Model Markup Language (PMML) delivers the representational power needed for solutions to be exchanged quickly and easily between systems, allowing predictions to move at the speed of business.

## The MiningSchema Element

The MiningSchema element in a PMML model acts as a sieve through which all input data must pass. It lists the input variables the model needs and lets the model assign attributes to those variables that describe how the fields are used.
Among these attributes, missingValueReplacement and invalidValueTreatment are of particular interest and serve as the example in this post. A variable is accepted if it is in the mining schema, but what should the model do if the input value is missing or invalid, an all too frequent possibility? These two attributes answer that question: a missing value is replaced by the value of missingValueReplacement, and invalidValueTreatment determines whether an invalid value is treated as a missing value or kept as it is.
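
To make these semantics concrete, here is a minimal sketch in base R of how a scoring engine might pre-process one numeric input according to these two attributes. The helper `apply_mining_field` and its `valid_range` argument are hypothetical names invented for this illustration; they are not part of the pmml package, and a real PMML consumer decides validity from the DataDictionary rather than from a range passed in by hand. The treatment names `asIs` and `asMissing` are taken from the PMML specification, which also defines `returnInvalid`.

```r
# Hypothetical helper (not part of the pmml package): mimics how a PMML
# consumer might pre-process one numeric input value.
apply_mining_field <- function(value, valid_range,
                               missing_replacement = NA,
                               invalid_treatment = "asIs") {
  # In this sketch, a non-missing value outside valid_range is "invalid".
  is_invalid <- !is.na(value) &&
    (value < valid_range[1] || value > valid_range[2])

  # invalidValueTreatment = "asMissing": handle the value as if it were missing.
  if (is_invalid && invalid_treatment == "asMissing") {
    value <- NA
  }

  # missingValueReplacement: substitute the configured value, if one is set.
  if (is.na(value) && !is.na(missing_replacement)) {
    value <- missing_replacement
  }
  value
}

# A missing input is replaced by 5.
r_missing <- apply_mining_field(NA, c(0, 10), missing_replacement = 5,
                                invalid_treatment = "asMissing")
# An invalid input is first treated as missing, then replaced by 5.
r_invalid <- apply_mining_field(99, c(0, 10), missing_replacement = 5,
                                invalid_treatment = "asMissing")
# With "asIs", the invalid input is passed through unchanged.
r_kept <- apply_mining_field(99, c(0, 10), invalid_treatment = "asIs")
```

The order of the two steps matters: because the invalid-value treatment runs first, an invalid value mapped to missing can still pick up the replacement value.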
Imagine a model built from training data. Frequently the model is simply the one fit to the training data and carries no information on how to handle missing values. Such handling is typically done in a pre-processing stage and is thus not directly part of the model. The data scientist may build the model and then want to insert this information into the model representation after the fact, so that test data is handled properly. To illustrate, we will use a simple linear regression model.
We will use the iris dataset, after loading the pmml package:
library(pmml)
dat <- iris
The model will have both continuous and categorical input fields:
mod <- lm(Sepal.Length ~ ., data = dat)
Let's see the MiningSchema element of the created model:
pmml(mod)[[3]][[1]]

<MiningSchema>
<MiningField name="Sepal.Length" usageType="predicted"/>
<MiningField name="Sepal.Width" usageType="active"/>
<MiningField name="Petal.Length" usageType="active"/>
<MiningField name="Petal.Width" usageType="active"/>
<MiningField name="Species" usageType="active"/>
</MiningSchema>
We now wish to make it clear that Species is a categorical variable and that missing values of all numeric variables are to be replaced by 5.0. Further, we wish to handle invalid values of the numeric variables the same way, by treating them as if they were missing. An invalid value of the categorical variable is to be kept as it is, an invalid value. All of this can be done in one step by defining the attributes, their values, and the fields they apply to in a data frame. For this example, this would be:
df <- data.frame(Sepal.Width = c(NA,5.0,"asMissing"), Petal.Length = c(NA,5.0,"asMissing"), Petal.Width = c(NA,5.0,"asMissing"), Species = c("categorical",NA,"asInvalid"), row.names=c("optype","missingValueReplacement","invalidValueTreatment")) 
The data frame above looks like:

| | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| optype | | | | categorical |
| missingValueReplacement | 5 | 5 | 5 | |
| invalidValueTreatment | asMissing | asMissing | asMissing | asInvalid |

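
For a model with many inputs, typing one column per field gets tedious. The following base R sketch builds an equivalent attribute table programmatically; the variable names `predictors`, `numeric_fields` and `attrs` are chosen for this illustration only, and the cells are stored as character strings, which is also what `c(NA, 5.0, "asMissing")` is coerced to.

```r
dat <- iris

# Every column except the response is an input field.
predictors <- setdiff(names(dat), "Sepal.Length")
numeric_fields <- predictors[sapply(dat[predictors], is.numeric)]

# Start from an all-NA table: one row per attribute, one column per field.
attr_names <- c("optype", "missingValueReplacement", "invalidValueTreatment")
attrs <- as.data.frame(
  matrix(NA_character_, nrow = length(attr_names), ncol = length(predictors),
         dimnames = list(attr_names, predictors)),
  stringsAsFactors = FALSE
)

# Numeric inputs: replace missing values by 5, treat invalid values as missing.
attrs["missingValueReplacement", numeric_fields] <- "5"
attrs["invalidValueTreatment", numeric_fields] <- "asMissing"

# The categorical input: state its optype and the invalid-value setting
# used in this post.
attrs["optype", "Species"] <- "categorical"
attrs["invalidValueTreatment", "Species"] <- "asInvalid"

print(attrs)
```

Cells left as NA correspond to attributes that are not to be set for that field.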
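The layout is straightforward: each column names a field, each row names an attribute, and NA marks an attribute that is not to be set for that field. The command to modify the PMML is now: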
addMSAttributes(pmml(mod),df)
The MiningSchema now looks like:
<MiningSchema>
<MiningField name="Sepal.Length" usageType="predicted"/>
<MiningField name="Sepal.Width" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
<MiningField name="Petal.Length" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
<MiningField name="Petal.Width" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
<MiningField name="Species" usageType="active" optype="categorical" invalidValueTreatment="asInvalid"/>
</MiningSchema>
Note that the attribute values are written into the PMML exactly as given in the data frame. The PMML specification lists returnInvalid, asIs and asMissing as the values of invalidValueTreatment, with asIs being the one that keeps an invalid value unchanged, so a consuming system that validates against the schema may expect asIs rather than asInvalid.
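
Since the goal is to hand the model over to an operational system, the modified PMML will usually be kept and written to a file rather than only printed. A minimal sketch, assuming a version of the pmml package that exports `addMSAttributes()` (as used in this post) and using `saveXML()` from the XML package; the variable names `new_pmml` and `out_file` and the file name are arbitrary choices for this illustration.

```r
library(pmml)  # assumed: a version that exports addMSAttributes()
library(XML)   # provides saveXML()

mod <- lm(Sepal.Length ~ ., data = iris)

# Same attribute table as defined earlier in this post.
df <- data.frame(Sepal.Width  = c(NA, 5.0, "asMissing"),
                 Petal.Length = c(NA, 5.0, "asMissing"),
                 Petal.Width  = c(NA, 5.0, "asMissing"),
                 Species      = c("categorical", NA, "asInvalid"),
                 row.names = c("optype", "missingValueReplacement",
                               "invalidValueTreatment"))

# Keep the modified document instead of only printing it.
new_pmml <- addMSAttributes(pmml(mod), df)

# Write it to a temporary location; any path would do.
out_file <- file.path(tempdir(), "iris_lm.pmml")
saveXML(new_pmml, file = out_file)
```

The resulting file can then be passed to whichever PMML consumer is used in production.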