Introduction
This is the second in a series of posts which describe some new functions implemented in the latest release of the R ‘pmml’ package. These functions, intended to help define any pre-processing to be done on the input data, give the user PMML editing capabilities so that they can insert such new information even after a PMML file is created.
The DataDictionary element
The DataDictionary element in a PMML model defines the allowed inputs to the model via the DataField sub-element. Each field can have several attributes which further define its properties. In this way the model specification describes the model input and makes it possible to check whether the data to be scored satisfy those criteria. Models are frequently built with certain assumptions about the training data, and the DataDictionary restricts the new data to be scored so that the model is not applied to data of a different variety; such cases should be detected and, if necessary, the model rebuilt.
Besides its name, each DataField is given the attributes optype and dataType. The optype attribute indicates whether the field is continuous or categorical. The dataType attribute specifies whether the field is a string, an integer, a double or another type. For categorical variables the model can accept any categorical value as input; however, one can restrict the input to a given list of allowed categories. These are listed in Value subelements. For continuous variables, one may wish to restrict the values to lie in a given range. This is done via the Interval subelement, which defines the allowed range of values a continuous variable can have. These two subelements are very useful for detecting invalid values and outliers: an invalid value is a value not listed among the Value elements, and an outlier is a value outside the range defined by the Interval element.
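To make the two checks concrete, here is a minimal base-R sketch of the idea. The helper names is_invalid_category and is_outlier, and the example values, are hypothetical and not part of the pmml package; how a real scoring engine treats such values depends on the engine.

```r
# Hypothetical helper: TRUE where a categorical value is NOT among the
# allowed categories (the role played by the Value subelements).
is_invalid_category <- function(x, allowed) {
  !(x %in% allowed)
}

# Hypothetical helper: TRUE where a continuous value lies outside the
# closed range [lower, upper] (the role played by an Interval subelement).
is_outlier <- function(x, lower, upper) {
  x < lower | x > upper
}

# Illustrative values only
allowed <- c("a", "b", "c")
invalid_flags <- is_invalid_category(c("a", "z"), allowed)
outlier_flags <- is_outlier(c(5.1, 12.0), lower = 3.5, upper = 8.7)
```

Here invalid_flags marks "z" as invalid and outlier_flags marks 12.0 as an outlier.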
Let us imagine a scenario in which a model is built using training data, and the modeler then decides to specify the properties of that data so that test data not conforming to the training data properties can be detected. The pmml function of the pmml package exports the model in PMML format. As the aim of this post is to show how the DataDictionary can be modified to include information that is not directly included during the modeling process, we will use a simple linear regression model.
# load the pmml package (assumed to be installed)
library(pmml)
# we will use the iris dataset
dat <- iris
# the model will have both continuous and categorical input fields
mod <- lm(Sepal.Length ~ ., data=dat)
# let's see the DataDictionary element of the created model
pmml(mod)[[2]]
<DataDictionary numberOfFields="5">
<DataField name="Sepal.Length" optype="continuous" dataType="double"/>
<DataField name="Sepal.Width" optype="continuous" dataType="double"/>
<DataField name="Petal.Length" optype="continuous" dataType="double"/>
<DataField name="Petal.Width" optype="continuous" dataType="double"/>
<DataField name="Species" optype="categorical" dataType="string">
<Value value="setosa"/>
<Value value="versicolor"/>
<Value value="virginica"/>
</DataField>
</DataDictionary>
The pmml function is smart enough to extract the categories present in the training data and include them as Value elements in the DataField element. A future post will discuss cases in which such Value elements are not given and how to add them in those cases.
At this point one may decide to impose limits on the possible values the variable Sepal.Length may have. The range of this variable in the iris dataset is 4.3 to 7.9 and the standard deviation is 0.83, so we decide to let the range be 3.5 to 8.7. Just for fun and to showcase the capabilities of R, we then decide the allowed range to be (-8.7,-3.5] and [3.5,8.7]. Perhaps the test data in a realistic case were corrupted. Here a round bracket, ‘(’ or ‘)’, means the end point is not included, while a square bracket, ‘[’ or ‘]’, means the end point is included in the range.
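The figures quoted above can be reproduced in base R. This sketch assumes the limits were obtained by widening the observed range by about one standard deviation on each side and rounding to one decimal place; the text does not state the rule explicitly, so treat that as an assumption.

```r
x <- iris$Sepal.Length

obs_range <- range(x)  # observed minimum and maximum: 4.3 and 7.9
obs_sd    <- sd(x)     # sample standard deviation, about 0.83

# Assumed rule: extend the observed range by one standard deviation
# on each side and round to one decimal place.
lower <- round(obs_range[1] - obs_sd, 1)
upper <- round(obs_range[2] + obs_sd, 1)

c(lower = lower, upper = upper)  # 3.5 and 8.7
```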
We first make the Interval elements:
mi <- makeIntervals(list("openClosed","closedClosed"), list(-8.7,3.5), list(-3.5,8.7))
Each interval is read off by taking the element at the same position in each list. Hence the first interval is -8.7 < x <= -3.5 and the second is 3.5 <= x <= 8.7; a value is allowed if it falls in either one.
As an example, the 1st PMML element looks like:
mi[[1]]
<Interval closure="openClosed" leftMargin="-8.7" rightMargin="-3.5"/>
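As an illustration of what the closure attribute means for a value being scored, the following base-R sketch tests membership in a set of intervals. The helpers in_interval and in_any_interval are hypothetical and not part of the pmml package; they only mirror the four PMML closure names (openOpen, openClosed, closedOpen, closedClosed).

```r
# Hypothetical helper: is x inside one interval with the given closure?
in_interval <- function(x, closure, left, right) {
  left_ok <- if (closure %in% c("closedOpen", "closedClosed")) {
    x >= left   # left end point included
  } else {
    x > left    # left end point excluded
  }
  right_ok <- if (closure %in% c("openClosed", "closedClosed")) {
    x <= right  # right end point included
  } else {
    x < right   # right end point excluded
  }
  left_ok & right_ok
}

# Hypothetical helper: is x inside at least one of several intervals?
# The three lists are read position by position, as in makeIntervals.
in_any_interval <- function(x, closures, lefts, rights) {
  hits <- Map(function(cl, l, r) in_interval(x, cl, l, r),
              closures, lefts, rights)
  Reduce(`|`, hits)
}

closures <- list("openClosed", "closedClosed")
lefts    <- list(-8.7, 3.5)
rights   <- list(-3.5, 8.7)

res <- in_any_interval(c(-8.7, -3.5, 0, 3.5, 8.7, 9), closures, lefts, rights)
res  # FALSE TRUE FALSE TRUE TRUE FALSE
```

The value -8.7 is rejected because the first interval is open on the left, while 3.5 and 8.7 are accepted because the second interval is closed at both ends.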
Finally one can add these elements to the DataDictionary element as originally desired:
addDFChildren(pmml(mod),field="Sepal.Length",intervals=mi)
Note that the model to which the new elements are added is already in PMML format. The DataDictionary is now:
<DataDictionary numberOfFields="5">
<DataField name="Sepal.Length" optype="continuous" dataType="double">
<Interval closure="openClosed" leftMargin="-8.7" rightMargin="-3.5"/>
<Interval closure="closedClosed" leftMargin="3.5" rightMargin="8.7"/>
</DataField>
<DataField name="Sepal.Width" optype="continuous" dataType="double"/>
<DataField name="Petal.Length" optype="continuous" dataType="double"/>
<DataField name="Petal.Width" optype="continuous" dataType="double"/>
<DataField name="Species" optype="categorical" dataType="string">
<Value value="setosa"/>
<Value value="versicolor"/>
<Value value="virginica"/>
</DataField>
</DataDictionary>