## Introduction

This is the third in a series of posts describing new functions implemented in the latest release of the R ‘pmml’ package. These functions help define any pre-processing to be done on the input data and give the user PMML editing capabilities, so that such new information can be inserted even after a PMML file has been created.

PMML (Predictive Model Markup Language) is an XML-based standard for the vendor-independent exchange of predictive analytics, data mining and machine learning models (call them what you like) between originating design tools and the production execution platform. Developed by the Data Mining Group in the late 1990s, PMML has matured quietly to the point where it now has extensive vendor support and has become the backbone of big data predictive analytics. When model deployment needs to be near instantaneous and error-free, PMML is the standard to use.

The DataField element, a child node of the DataDictionary element, defines the allowed inputs to a PMML model. Each field, a variable used in the model, can have several attributes which further define the properties of the input field. It is by using these attributes that the model specification describes the model input and checks whether the input data to be scored satisfies the desired criteria. Models are frequently built with certain assumptions about the training dataset, and the DataDictionary restricts the new data to be scored so that the model is not applied to data of a different kind; such cases should be detected and the model rebuilt if necessary.

The DataFields are defined via their names; they are also given the attributes optype and dataType. The optype attribute indicates whether the variable is a continuous or a categorical field. The dataType attribute defines the field type: string, integer, double or others. For categorical variables, the model by default accepts any categorical value as an input; however, one can restrict the input to a given list of allowed categories. These are listed in Value subelements. For continuous variables, one may wish to restrict the values to lie in a given range. This is done via the Interval subelement, which defines the allowed range of values a continuous variable can have. These two subelements are very useful for detecting invalid values and outliers: an invalid value is a value not given in the list of Value elements, and an outlier is a value outside the range defined in the Interval element.
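To make the distinction between invalid values and outliers concrete, here is a minimal base R sketch of the two checks. It is not how a PMML scoring engine is implemented; the helper names `is_invalid_category` and `is_outlier`, the interval bounds and the sample data are all assumptions made up for illustration.

```r
# Hypothetical helpers in base R; not part of the pmml package.
# The vector below plays the role of the Value subelements of a DataField.
allowed_species <- c("setosa", "versicolor", "virginica")
# This plays the role of an Interval subelement (bounds are illustrative,
# and the interval is assumed to be closed on both ends).
width_interval <- c(lower = 2.0, upper = 4.4)

# A categorical value is invalid if it is not in the allowed list.
is_invalid_category <- function(x, allowed) {
  !(x %in% allowed)
}

# A continuous value is an outlier if it falls outside the interval.
is_outlier <- function(x, interval) {
  x < interval[["lower"]] | x > interval[["upper"]]
}

new_species <- c("setosa", "Setosa", "virginica")
new_width   <- c(3.1, 5.0, 2.0)

invalid_flags <- is_invalid_category(new_species, allowed_species)
outlier_flags <- is_outlier(new_width, width_interval)

print(data.frame(new_species, invalid_flags, new_width, outlier_flags))
```

Only the second record is flagged in both columns: “Setosa” is not in the allowed list, and 5.0 lies above the assumed upper bound.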

## The Value element

Let us imagine a scenario in which a model is built using training data and then, later, the modeler decides to specify the properties of the data so that test data not conforming to the training data can be detected. In this scenario, the pmml function of the pmml package has already been used to convert the model into the PMML format. Since the aim of this post is to show how the DataDictionary can be modified to include information which is not directly included during the modeling process, we will use a simple linear regression model.

We will use the iris dataset:

library(pmml)
dat <- iris

The model will have both continuous and categorical input fields:

mod <- lm(Sepal.Length ~ ., data=dat)

Let's see the DataDictionary element of the created model:

pmml(mod)[[2]]

<DataDictionary numberOfFields="5">
<DataField name="Sepal.Length" optype="continuous" dataType="double"/>
<DataField name="Sepal.Width" optype="continuous" dataType="double"/>
<DataField name="Petal.Length" optype="continuous" dataType="double"/>
<DataField name="Petal.Width" optype="continuous" dataType="double"/>
<DataField name="Species" optype="categorical" dataType="string">
<Value value="setosa"/>
<Value value="versicolor"/>
<Value value="virginica"/>
</DataField>
</DataDictionary>

The pmml function is smart enough to extract the categories present in the training data and include them as Value elements in the DataField element.
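One way to see where such categories can come from is to inspect the fitted model itself. The sketch below uses base R only; that pmml reads the levels from this particular component is an assumption on our part, not something confirmed here.

```r
# Base R only; the link to pmml's internals is an assumption.
dat <- iris
mod <- lm(Sepal.Length ~ ., data = dat)

# lm() records the factor levels of each categorical predictor
# seen during fitting in the xlevels component.
species_levels <- mod$xlevels$Species
print(species_levels)
```

The levels stored in the model match the three Value elements shown in the DataDictionary above.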

One can imagine a scenario where one would wish to restrict the allowed values a categorical variable can have. Perhaps the test data is slightly different, so that some of the values are capitalized. We wish to ignore those records, as we do not yet know whether they are corrupted data or genuinely different values. As a first step, one might want to add those possible values to the DataDictionary as invalid values. This way, the scoring process does not grind to a halt, and the new values are dealt with later in the model, perhaps via some transformations.
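Before deciding which values to mark as invalid, it can help to find the ones that differ from a known category only by capitalization. The following is a small base R sketch of that idea; the `incoming` vector is made-up sample data, not output from any real test set.

```r
# Categories the model was trained on.
known_levels <- levels(iris$Species)

# Hypothetical incoming values, for illustration only.
incoming <- c("setosa", "Setosa", "Virginica", "versicolor", "unknown")

# A case variant is not an exact match for a known level,
# but becomes one once it is lower-cased.
case_variant <- !(incoming %in% known_levels) &
  (tolower(incoming) %in% known_levels)

suspect_values <- unique(incoming[case_variant])
print(suspect_values)
```

Here “Setosa” and “Virginica” are reported as suspects, while “unknown” is left out because it does not correspond to any known category even after lower-casing.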

We first make the Value elements:

 mv <- makeValues(list("Setosa","Versicolor","Virginica"), list("setosa","versicolor","virginica"), list("invalid","invalid","invalid"))

The three lists are read element-wise: the first element of each list together defines the first Value element. Hence the first element says: add “Setosa” as a value of the field; to remind us that it may be the same as setosa, add setosa as its displayValue attribute; and finally indicate that if this value does occur, it is an invalid value.

As an example, the first PMML element looks like:

 mv[[1]]
<Value value="Setosa" displayValue="setosa" property="invalid"/> 
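The element-wise pairing of the three lists can be mimicked in base R. This is only an illustration of how the arguments line up, assuming plain strings as output; it is not the implementation of makeValues, which returns XML nodes rather than text.

```r
# Same three inputs as in the makeValues call, as character vectors.
values         <- c("Setosa", "Versicolor", "Virginica")
display_values <- c("setosa", "versicolor", "virginica")
properties     <- c("invalid", "invalid", "invalid")

# sprintf() is vectorized, so the i-th entries of the three vectors
# are combined into the i-th string.
value_xml <- sprintf(
  '<Value value="%s" displayValue="%s" property="%s"/>',
  values, display_values, properties
)

cat(value_xml, sep = "\n")
```

Each printed line corresponds to one position across the three vectors, which is the same reading order used for the lists passed to makeValues.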

Finally one can add these elements to the DataDictionary element as originally desired:

 addDFChildren(pmml(mod),field="Species",values=mv)

<DataDictionary numberOfFields="5">
<DataField name="Sepal.Length" optype="continuous" dataType="double"/>
<DataField name="Sepal.Width" optype="continuous" dataType="double"/>
<DataField name="Petal.Length" optype="continuous" dataType="double"/>
<DataField name="Petal.Width" optype="continuous" dataType="double"/>
<DataField name="Species" optype="categorical" dataType="string">
<Value value="setosa"/>
<Value value="versicolor"/>
<Value value="virginica"/>
<Value value="Setosa" displayValue="setosa" property="invalid"/>
<Value value="Versicolor" displayValue="versicolor" property="invalid"/>
<Value value="Virginica" displayValue="virginica" property="invalid"/>
</DataField>
</DataDictionary>