Introduction
This is the first in a series of posts describing some of the PMML editing capabilities of the latest R ‘pmml’ package. In practice, building a predictive solution is only part of the equation. Once built, it has to be moved to the operational environment where it is actually put to use. In the agile world we live in today, the Predictive Model Markup Language (PMML) provides the representational power needed for solutions to be exchanged quickly and easily between systems, allowing predictions to move at the speed of business.

The MiningSchema Element
The MiningSchema element in a PMML model acts as a sieve through which all input data has to pass. It lists the input variables needed to use the model and lets the model assign attributes to those variables that describe how the fields are used.

Among these attributes, missingValueReplacement and invalidValueTreatment are of particular interest and serve as the example in this post. A variable may be accepted because it is in the mining schema, but what should the model do if the input value is missing or invalid, an all too frequent possibility? These attributes specify that a missing value is to be replaced by the value of the missingValueReplacement attribute, and whether an invalid value is to be treated as a missing value or simply kept as it is.

Imagine a scenario in which a model is built from training data. Frequently the model is simply the one fit to the training data and carries no information on how to handle missing values. Handling such values is typically done in a pre-processing stage and is thus not directly part of the model. The data scientist may build the model and then want to insert this information into the model representation after the fact, so that test data is handled properly. To illustrate, we will use a simple linear regression model.
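Before turning to that model, the behaviour described above can be sketched in plain R. This is only an illustration of the idea, not how any particular scoring engine is implemented: `treat_input` and `is_positive` are hypothetical helpers that are not part of the 'pmml' package, and the sketch assumes that a value turned into a missing value is then subject to missingValueReplacement.

```r
# Hypothetical helper, not part of the 'pmml' package: mimics what a scoring
# engine might do with a single input value before it reaches the model.
treat_input <- function(value, is_valid, missing_replacement = NA,
                        invalid_treatment = c("asIs", "asMissing")) {
  invalid_treatment <- match.arg(invalid_treatment)
  # An invalid value is either kept as it is or turned into a missing value.
  if (!is.na(value) && !is_valid(value) && invalid_treatment == "asMissing") {
    value <- NA
  }
  # A missing value is replaced when a replacement was supplied.
  if (is.na(value) && !is.na(missing_replacement)) {
    value <- missing_replacement
  }
  value
}

# Assumed validity rule, for illustration only: measurements must be positive.
is_positive <- function(x) x > 0

treat_input(3.1, is_positive, 5, "asMissing")  # valid input: unchanged
treat_input(NA,  is_positive, 5, "asMissing")  # missing input: replaced by 5
treat_input(-1,  is_positive, 5, "asMissing")  # invalid, then missing, then 5
treat_input(-1,  is_positive, 5, "asIs")       # invalid input: kept as -1
```

The point of recording these rules in the MiningSchema is that the consuming system applies them, so the same treatment travels with the model instead of living in a separate pre-processing script.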
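We will use the iris dataset:

library(pmml)
dat <- iris
mod <- lm(Sepal.Length ~ ., data=dat)
# The third child of the PMML document is the model element; its first child is the MiningSchema
pmml(mod)[[3]][[1]]

 <MiningSchema>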
  <MiningField name="Sepal.Length" usageType="predicted"/>
  <MiningField name="Sepal.Width" usageType="active"/>
  <MiningField name="Petal.Length" usageType="active"/>
  <MiningField name="Petal.Width" usageType="active"/>
  <MiningField name="Species" usageType="active"/>
 </MiningSchema>

df <- data.frame(Sepal.Width = c(NA,5.0,"asMissing"),
                 Petal.Length = c(NA,5.0,"asMissing"),
                 Petal.Width = c(NA,5.0,"asMissing"),
                 Species = c("categorical",NA,"asInvalid"),
                 row.names=c("optype","missingValueReplacement","invalidValueTreatment"))

| | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| optype | | | | categorical |
| missingValueReplacement | 5 | 5 | 5 | |
| invalidValueTreatment | asMissing | asMissing | asMissing | asInvalid |
addMSAttributes(pmml(mod),df)

 <MiningSchema>
   <MiningField name="Sepal.Length" usageType="predicted"/>
   <MiningField name="Sepal.Width" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
   <MiningField name="Petal.Length" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
   <MiningField name="Petal.Width" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
   <MiningField name="Species" usageType="active" optype="categorical" invalidValueTreatment="asInvalid"/>
 </MiningSchema> 
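To check the edited schema programmatically rather than by eye, one option is to read the attributes back with the XML package, on which 'pmml' is built. The sketch below is an assumption-laden illustration: it parses a copy of the output shown above as text instead of calling the 'pmml' functions, and `schema_txt`, `get_attr` and `schema_attrs` are names made up for this example.

```r
library(XML)

# The edited MiningSchema, copied from the output above.
schema_txt <- '<MiningSchema>
  <MiningField name="Sepal.Length" usageType="predicted"/>
  <MiningField name="Sepal.Width" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
  <MiningField name="Petal.Length" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
  <MiningField name="Petal.Width" usageType="active" missingValueReplacement="5" invalidValueTreatment="asMissing"/>
  <MiningField name="Species" usageType="active" optype="categorical" invalidValueTreatment="asInvalid"/>
</MiningSchema>'

doc <- xmlParse(schema_txt, asText = TRUE)
fields <- getNodeSet(doc, "//MiningField")

# Read one attribute from every MiningField; absent attributes come back as NA.
get_attr <- function(attr) {
  sapply(fields, xmlGetAttr, attr, NA_character_)
}

# One row per field, one column per attribute of interest.
schema_attrs <- data.frame(
  name = get_attr("name"),
  missingValueReplacement = get_attr("missingValueReplacement"),
  invalidValueTreatment = get_attr("invalidValueTreatment"),
  stringsAsFactors = FALSE
)
print(schema_attrs)
```

A table like this makes it easy to confirm that every active field received the attributes set in df and that the predicted field was left untouched.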