Introduction
This is the third in a series of posts describing new functions implemented in the latest release of the R ‘pmml’ package. These functions help define any pre-processing to be done on the input data and give the user PMML editing capabilities, so that such information can be inserted even after a PMML file has been created.
PMML (Predictive Model Markup Language) is an XML-based standard for the vendor-independent exchange of predictive analytics, data mining and machine learning models (call them what you like) between originating design tools and the production execution platform. Developed by the Data Mining Group in the late 1990s, PMML has matured quietly to the point where it now has extensive vendor support and has become the backbone of big data predictive analytics. When model deployment needs to be near instantaneous and error-free, PMML is the standard to use.
The DataDictionary element
The DataField element, a child node of the DataDictionary element, defines the allowed inputs to a PMML model. Each field, a variable used in the model, can have several attributes which further define the properties of the input field. It is by using these attributes that the model specification describes the model input and checks whether the input data to be scored satisfies the desired criteria. Models are frequently built with certain assumptions about the training dataset, and the DataDictionary restricts the new data to be scored so that the model is not applied to data of a different variety; such cases should be detected and the model rebuilt if necessary.
Each DataField is defined by its name and is also given the attributes optype and dataType. The optype attribute indicates whether the variable is a continuous or a categorical field. The dataType attribute defines the field type: string, integer, double, or others. For categorical variables, the model by default accepts any categorical value as input; however, one can restrict the input to a given list of allowed categories. These are given in Value subelements, which list the allowed values. For continuous variables, one may wish to restrict the values to lie in a given range. This is done via the Interval subelement, which defines the allowed range of values a continuous variable can have. These two subelements are very useful for detecting invalid values and outliers: an invalid value is a value not given in the list of Value elements, and an outlier is a value outside the range defined in the Interval element.
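The two checks described above can be sketched in a few lines of base R. This is only an illustration of the idea, not how the pmml package or any scoring engine implements it; the helper names is_invalid and is_outlier are hypothetical, and the interval is assumed to be closed on both sides.

```r
# Hypothetical helpers for illustration; they are not part of the pmml package.
allowed_species <- c("setosa", "versicolor", "virginica")

# A categorical value is invalid when it is not in the list of allowed values.
is_invalid <- function(x, allowed) {
  !(x %in% allowed)
}

# A continuous value is an outlier when it lies outside [left, right]
# (a closed interval is assumed here).
is_outlier <- function(x, left, right) {
  x < left | x > right
}

# Use the observed range of Sepal.Length in iris as the allowed interval.
rng <- range(iris$Sepal.Length)

species_check <- is_invalid(c("setosa", "Setosa"), allowed_species)
length_check  <- is_outlier(c(5.1, 9.3), left = rng[1], right = rng[2])
```

Here the capitalized "Setosa" is flagged as invalid because the comparison is case-sensitive, and 9.3 is flagged as an outlier because it exceeds the largest sepal length in the training data.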
The Value element
Let us imagine a scenario in which a model is built using training data and, later, the modeler decides to specify the properties of the data so that test data not conforming to the training data can be detected. In this scenario, the pmml function of the pmml package has already been used to convert the model into PMML format. Since the aim of this post is to show how the DataDictionary can be modified to include information that is not directly included during the modeling process, we will use a simple linear regression model.
library(pmml)
# we will use the iris dataset
dat <- iris
# the model will have both continuous and categorical input fields
mod <- lm(Sepal.Length ~ ., data = dat)
# let's see the DataDictionary element of the created model
pmml(mod)[[2]]
<DataDictionary numberOfFields="5">
<DataField name="Sepal.Length" optype="continuous" dataType="double"/>
<DataField name="Sepal.Width" optype="continuous" dataType="double"/>
<DataField name="Petal.Length" optype="continuous" dataType="double"/>
<DataField name="Petal.Width" optype="continuous" dataType="double"/>
<DataField name="Species" optype="categorical" dataType="string">
<Value value="setosa"/>
<Value value="versicolor"/>
<Value value="virginica"/>
</DataField>
</DataDictionary>
The pmml function is smart enough to extract the categories present in the training data and include them in the DataField element.
One can imagine a scenario where one would wish to control the allowed values a categorical variable can have. Perhaps the test data is slightly different, so that some of the values are capitalized. We wish to ignore those records, as we do not yet know whether they are corrupted data or genuinely different values. As a first step, one might want to add those possible values to the DataDictionary as invalid values. This way, the scoring process does not grind to a halt and the new values can be dealt with later in the model, perhaps via some transformations.
We first make the Value elements:
mv <- makeValues(list("Setosa","Versicolor","Virginica"), list("setosa","versicolor","virginica"), list("invalid","invalid","invalid"))
The result can be read down as the 1st element of each list, then the 2nd, and so on. Hence the first element says: add “Setosa” as a value of the field; to remind us that it may be the same as setosa, we add setosa as a displayValue attribute; and finally we indicate that if this value does occur, it is an invalid value.
As an example, the 1st PMML element looks like:
mv[[1]]
<Value value="Setosa" displayValue="setosa" property="invalid"/>
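To picture how the three lists are read down in parallel, the same pairing can be reproduced with base R alone. This is only a sketch of the positional matching, not the internals of makeValues; the variable names below are made up for the illustration.

```r
# The same three lists that were passed to makeValues above.
values   <- list("Setosa", "Versicolor", "Virginica")
displays <- list("setosa", "versicolor", "virginica")
props    <- list("invalid", "invalid", "invalid")

# Map() walks the lists in parallel: the i-th result combines the
# i-th entry of each list, mirroring the attributes of one Value element.
triples <- Map(
  function(v, d, p) c(value = v, displayValue = d, property = p),
  values, displays, props
)

triples[[1]]
```

The first triple holds "Setosa", "setosa" and "invalid", which are exactly the three attributes of the first Value element.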
Finally one can add these elements to the DataDictionary element as originally desired:
addDFChildren(pmml(mod),field="Species",values=mv)
The DataDictionary now looks like:
<DataDictionary numberOfFields="5">
<DataField name="Sepal.Length" optype="continuous" dataType="double"/>
<DataField name="Sepal.Width" optype="continuous" dataType="double"/>
<DataField name="Petal.Length" optype="continuous" dataType="double"/>
<DataField name="Petal.Width" optype="continuous" dataType="double"/>
<DataField name="Species" optype="categorical" dataType="string">
<Value value="setosa"/>
<Value value="versicolor"/>
<Value value="virginica"/>
<Value value="Setosa" displayValue="setosa" property="invalid"/>
<Value value="Versicolor" displayValue="versicolor" property="invalid"/>
<Value value="Virginica" displayValue="virginica" property="invalid"/>
</DataField>
</DataDictionary>
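As a quick sanity check, one could list the values that are now marked invalid. The sketch below assumes the XML package is installed and works on a trimmed, hand-written copy of the DataDictionary rather than on the object returned by addDFChildren. A complete PMML document declares a default namespace on its root element, so the XPath expression would then need a namespace prefix.

```r
library(XML)  # assumed to be installed

# Trimmed, hand-written copy of the DataDictionary shown above (no namespace).
dd <- '<DataDictionary numberOfFields="1">
  <DataField name="Species" optype="categorical" dataType="string">
    <Value value="setosa"/>
    <Value value="Setosa" displayValue="setosa" property="invalid"/>
  </DataField>
</DataDictionary>'

doc <- xmlParse(dd, asText = TRUE)

# Collect the value attribute of every Value node flagged as invalid.
invalid <- xpathSApply(
  doc,
  "//DataField[@name='Species']/Value[@property='invalid']",
  xmlGetAttr, "value"
)
invalid
```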