Introduction
The latest release of the R ‘pmml’ package adds further support for gradient boosted algorithms, specifically the gbm and xgboost functions. The xgboost conversion will be discussed in a future post; this post concentrates on converting gbm models to the PMML format.
Gradient boosted models gained popularity because, relative to random forests, they are often better at correctly classifying categorical values. Their probability estimates for the predicted category are not as accurate as those of random forests, but if the goal is simply to get a predicted level, gradient boosting has become the algorithm of choice. Broadly speaking, the algorithm works by first training a model of choice to predict the correct category. Once this step is finished, another model of the same type is trained with emphasis on the incorrectly predicted cases, and the process repeats, making the overall prediction more and more accurate. The ‘gbm’ package does just this, and its model of choice is a tree model; a popular choice. A rough sketch of this loop is shown below.
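To make the loop concrete, here is a minimal sketch of the boosting idea, assuming squared-error loss and trees from the ‘rpart’ package; boost_sketch is a hypothetical helper for illustration, not the ‘gbm’ internals.

library(rpart)
# Each new tree is fit to the residuals of the current ensemble,
# so the combined prediction improves step by step.
boost_sketch <- function(x, y, n_trees = 2, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))        # start from the mean prediction
  trees <- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    d <- x
    d$resid <- y - pred                  # what the ensemble still gets wrong
    trees[[i]] <- rpart(resid ~ ., data = d)
    pred <- pred + shrinkage * predict(trees[[i]], newdata = x)
  }
  list(trees = trees, fitted = pred)
}
fit <- boost_sketch(iris[, 1:3], iris$Petal.Width)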
The gbm model converter
In principle, the gbm model is simply a collection of trees, and trees are easily implemented in the PMML language. The conversion is complicated somewhat by the method used to make predictions. Each tree calculates the probability of one of the possible categories; for each category, these probabilities are summed over all the trees, and the final prediction is the category with the highest summed probability. The trees are therefore not independent and, moreover, each tree is a regression tree: the outputs of all the trees must be combined to convert a regression output into a classification output.
A list of trees, each linked to the earlier one, can be represented in PMML as a model chain. PMML requires each model to explicitly specify its type, regression or classification. Since each tree outputs a numeric value, which is interpreted as a probability, each tree would be a regression model. However, since that output is understood to be a probability from which a category is inferred, the model as a whole should be a classification model. One could define a regression model chain and apply the appropriate functions to its output to predict a category, but this is not a very satisfactory solution: we would prefer that a classification model be represented by a classification PMML model. This also makes the PMML model simpler when one wants to predict the probabilities of categories other than the predicted category.
The only requirement of a classification model chain is that the last model in the chain be a classification model, regardless of the types of the intermediate models. Our solution was to add a final multinomial regression model to the chain. Although this adds an extra model, it is a very simple one which automatically enables efficient extraction of all category probabilities, especially for a large number of categories. Each tree segment adds its probability for a category to the running sum from the previous segment. This way, the input to the last segment is the final sum for each category, and that segment simply normalizes those sums into probabilities, as sketched below.
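For a multinomial gbm, this final normalization corresponds to the softmax function applied to the accumulated per-category values; a minimal sketch, with hypothetical accumulated sums:

softmax <- function(s) exp(s) / sum(exp(s))

# Hypothetical accumulated per-category sums from the tree segments
s <- c(setosa = 0.8, versicolor = -0.2, virginica = -0.6)
softmax(s)                    # normalized category probabilities
names(which.max(softmax(s)))  # predicted category: highest probability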
These are the details of the model representation; the actual conversion, however, like other functions in the ‘pmml’ package, is very simple: just apply the pmml function. As an example, let us make a GBM model using the ‘gbm’ package and the ‘iris’ data set.
library(pmml)
library(gbm)

# Fit a small multinomial GBM on the iris data
model <- gbm(Species~., data=iris, n.trees=2, interaction.depth=3, distribution="multinomial")
The gbm function call specifies that 2 trees be fit for each category and that each tree have a maximum depth of 3. Note that these are not default values, just deliberately small sample values. The function also requires the distribution type of the response variable to be given; for a classification model, we picked multinomial.
pmodel <- pmml(model)
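The resulting document can be saved to disk with saveXML from the ‘XML’ package (on which ‘pmml’ depends); the file name here is just an example.

library(XML)
saveXML(pmodel, file = "gbm_iris.pmml")  # write the PMML document to a file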
Let us look at the input and output of the first model in the chain.
## <MiningSchema>
## <MiningField name="Species" usageType="predicted"/>
## <MiningField name="Sepal.Length" usageType="active"/>
## <MiningField name="Sepal.Width" usageType="active"/>
## <MiningField name="Petal.Length" usageType="active"/>
## <MiningField name="Petal.Width" usageType="active"/>
## </MiningSchema>
## <Output>
## <OutputField name="UpdatedPredictedValue11" optype="continuous" dataType="double" feature="predictedValue"/>
## </Output>
The inputs are as expected and the output is the predicted probability of the first category by the first tree. Now consider the input and output of the second tree.
## <MiningSchema>
## <MiningField name="Species" usageType="predicted"/>
## <MiningField name="Sepal.Length" usageType="active"/>
## <MiningField name="Sepal.Width" usageType="active"/>
## <MiningField name="Petal.Length" usageType="active"/>
## <MiningField name="Petal.Width" usageType="active"/>
## <MiningField name="UpdatedPredictedValue11" optype="continuous"/>
## </MiningSchema>
## <Output>
## <OutputField name="TreePredictedValue12" optype="continuous" dataType="double" feature="predictedValue"/>
## <OutputField name="UpdatedPredictedValue12" optype="continuous" dataType="double" feature="transformedValue">
## <Apply function="+">
## <FieldRef field="UpdatedPredictedValue11"/>
## <FieldRef field="TreePredictedValue12"/>
## </Apply>
## </OutputField>
## </Output>
The inputs now include the previously calculated probability, and the output contains the running sum of the probability of category 1 over the trees so far.
Next let us look at the input and output of the \(3^{rd}\) tree.
## <MiningSchema>
## <MiningField name="Species" usageType="predicted"/>
## <MiningField name="Sepal.Length" usageType="active"/>
## <MiningField name="Sepal.Width" usageType="active"/>
## <MiningField name="Petal.Length" usageType="active"/>
## <MiningField name="Petal.Width" usageType="active"/>
## </MiningSchema>
## <Output>
## <OutputField name="UpdatedPredictedValue21" optype="continuous" dataType="double" feature="predictedValue"/>
## </Output>
Since 2 trees were specified for each category, the third tree turns its attention to the second category. Its inputs are the original inputs, and it outputs the predicted probability of the second category from its \(1^{st}\) tree.
In this way, after 2 trees each for 3 categories, that is, 6 trees, the \(7^{th}\) segment is the multinomial regression model which normalizes the accumulated sums into the final probabilities. One can see this from the inputs and outputs defined in the last model.
## <MiningSchema>
## <MiningField name="UpdatedPredictedValue12"/>
## <MiningField name="UpdatedPredictedValue22"/>
## <MiningField name="UpdatedPredictedValue32"/>
## </MiningSchema>
## <Output>
## <OutputField name="Predicted_Species" feature="predictedValue"/>
## <OutputField name="Probability_setosa" optype="continuous" dataType="double" feature="probability" value="setosa"/>
## <OutputField name="Probability_versicolor" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
## <OutputField name="Probability_virginica" optype="continuous" dataType="double" feature="probability" value="virginica"/>
## </Output>
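As a quick cross-check, these probabilities should match the ones gbm itself reports; a sketch using gbm's own predict method:

# For a multinomial model, type = "response" returns an
# n x categories x 1 array of class probabilities
probs <- predict(model, newdata = iris, n.trees = 2, type = "response")
head(probs[, , 1])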
We see that although the actual PMML representation of a gbm object is not straightforward, the conversion itself is the usual, simple application of the pmml function: a one-line command.