Predictive Analytics, Big Data, Hadoop, PMML: June 2016

DataDictionary Helper Functions I

Introduction

This is the second in a series of posts describing new functions implemented in the latest release of the R ‘pmml’ package. These functions help define any pre-processing to be applied to the input data, and they give the user PMML editing capabilities so that such information can be inserted even after a PMML file has been created.

The DataDictionary element

The DataDictionary element in a PMML model defines the allowed inputs to the model via the DataField sub-element. Each field can have several attributes which further define its properties. In this way the model specification describes the model input and allows a check of whether the data to be scored satisfies those criteria. Models are frequently built with certain assumptions about the training data, and the DataDictionary restricts the new data to be scored so that the model is not applied to data of a different kind; such cases should be detected and, if necessary, the model rebuilt.

The DataFields are identified by their names and are also given the attributes optype and dataType. The optype attribute indicates whether the field is continuous or categorical. The dataType attribute defines whether the field is a string, an integer, a double or another type. For categorical variables, the model can accept any categorical value as an input; however, one can restrict the inputs to a given list of allowed categories. These are given in Value subelements, one per allowed value. For continuous variables, one may wish to restrict the values to lie in a given range. This is done via the Interval subelement, which defines the allowed range of values a continuous variable can have. These two subelements are very useful for detecting invalid values and outliers: an invalid value is a value not given in the list of Value elements, and an outlier is a value outside the range defined in the Interval element.
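As a rough illustration of what these two subelements express, the following base R sketch flags values that would count as invalid or as outliers in the sense described above. The helper names `is_invalid` and `is_outlier` and the sample values are hypothetical, chosen only for illustration; they are not part of the ‘pmml’ package, and an actual scoring engine may treat such values differently depending on the model's settings.

```r
# Hypothetical helpers, for illustration only (not part of the 'pmml' package).

# A categorical value is invalid if it is not among the allowed Value entries.
is_invalid <- function(x, allowed) !(x %in% allowed)

# A continuous value is an outlier if it lies outside the closed range
# [left, right] described by an Interval element.
is_outlier <- function(x, left, right) x < left | x > right

allowed_species <- c("setosa", "versicolor", "virginica")

# "tulip" is a made-up category and 12.0 a made-up measurement.
invalid_flags <- is_invalid(c("setosa", "tulip"), allowed_species)
outlier_flags <- is_outlier(c(5.1, 12.0), left = 3.5, right = 8.7)

print(invalid_flags)
print(outlier_flags)
```

Only the second entry of each test vector should be flagged, since "tulip" is not an allowed category and 12.0 lies outside the range.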

Let us imagine a scenario in which a model is built using training data, and the modeler then decides to specify the properties of that data so that test data not conforming to the training data properties can be detected. The pmml function of the pmml package exports the model in PMML format. As the aim of this post is to show how the DataDictionary can be modified to include such information, which is not directly included during the modeling process, we will use a simple linear regression model.

We will use the iris dataset:

library(pmml)
dat <- iris

The model will have both continuous and categorical input fields:

mod <- lm(Sepal.Length ~ ., data=dat)

Let's see the DataDictionary element of the created model:

pmml(mod)[[2]] 

 <DataDictionary numberOfFields="5">
  <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
  <DataField name="Sepal.Width" optype="continuous" dataType="double"/>
  <DataField name="Petal.Length" optype="continuous" dataType="double"/>
  <DataField name="Petal.Width" optype="continuous" dataType="double"/>
  <DataField name="Species" optype="categorical" dataType="string">
   <Value value="setosa"/>
   <Value value="versicolor"/>
   <Value value="virginica"/>
  </DataField>
 </DataDictionary>

The pmml function is smart enough to extract the categories present in the training data and include them in the DataField element. A future post will discuss cases in which such Value elements are not given and how to add them.
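The three Value entries match the factor levels of Species in the data used to fit the model. A quick way to see the categories one would expect to find listed, using base R only:

```r
# Factor levels of the categorical predictor in the built-in iris dataset.
species_levels <- levels(iris$Species)
print(species_levels)
```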

At this point one may decide to impose limits on the possible values the variable Sepal.Length may have. The range of this variable in the iris dataset is 4.3 to 7.9 and the standard deviation is about 0.83, so we decide to let the range be 3.5 to 8.7. Just for fun, and to showcase the capabilities of R, we then decide the allowed range to be (-8.7,-3.5] and [3.5,8.7]; perhaps the test data in a realistic case was corrupted. Here a round bracket, ‘(’ or ‘)’, means the end point is not included, while a square bracket, ‘[’ or ‘]’, means the end point is included in the range.
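The numbers quoted above can be reproduced in base R. The sketch below assumes the limits were obtained by padding the observed range by 0.8 at each end, roughly one standard deviation; the variable names are chosen here for illustration.

```r
# Summary statistics of Sepal.Length in the built-in iris dataset.
sl <- iris$Sepal.Length
sl_range <- range(sl)   # observed minimum and maximum
sl_sd <- sd(sl)         # sample standard deviation

# Assumption: pad each end of the observed range by 0.8,
# roughly one standard deviation.
pad <- 0.8
limits <- c(sl_range[1] - pad, sl_range[2] + pad)

print(sl_range)
print(round(sl_sd, 2))
print(limits)
```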

We first make the Interval elements:

 mi <- makeIntervals(list("openClosed","closedClosed"), list(-8.7,3.5), list(-3.5,8.7))

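Each interval is read off by taking the corresponding element of each list: the first interval uses the 1st element of each list, the second uses the 2nd. Hence the two intervals are -8.7 < x <= -3.5 and 3.5 <= x <= 8.7, and a value is allowed if it lies in either one.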
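To make the closure semantics concrete, here is a minimal base R sketch that tests whether a value falls in at least one of the intervals defined by three parallel lists like those passed to makeIntervals. The helpers `in_interval` and `in_any_interval` are hypothetical and not part of the ‘pmml’ package; they only mimic how the four PMML closure types treat the end points.

```r
# Hypothetical helper (not part of the 'pmml' package): test whether x lies in
# an interval with the given PMML-style closure.
in_interval <- function(x, closure, left, right) {
  switch(closure,
    openOpen     = x >  left & x <  right,
    openClosed   = x >  left & x <= right,
    closedOpen   = x >= left & x <  right,
    closedClosed = x >= left & x <= right,
    stop("unknown closure: ", closure)
  )
}

# The i-th interval is built from the i-th entry of each list.
closures <- list("openClosed", "closedClosed")
lefts    <- list(-8.7, 3.5)
rights   <- list(-3.5, 8.7)

# A value is acceptable if it falls in at least one of the intervals.
in_any_interval <- function(x) {
  hits <- mapply(function(cl, l, r) in_interval(x, cl, l, r),
                 closures, lefts, rights)
  any(hits)
}

# Probe the end points and a few values in between and outside.
checks <- sapply(c(-8.7, -3.5, 0, 3.5, 8.7, 9), in_any_interval)
print(checks)
```

The end point -8.7 is rejected because the first interval is open on the left, while -3.5, 3.5 and 8.7 are accepted because those end points are closed.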

As an example, the 1st PMML element looks like:

 mi[[1]]
 <Interval closure="openClosed" leftMargin="-8.7" rightMargin="-3.5"/>

Finally one can add these elements to the DataDictionary element as originally desired:

 addDFChildren(pmml(mod),field="Sepal.Length",intervals=mi)

Note that the model to which the new elements are added must already be in PMML format. The DataDictionary is now:

  <DataDictionary numberOfFields="5">
    <DataField name="Sepal.Length" optype="continuous" dataType="double">
      <Interval closure="openClosed" leftMargin="-8.7" rightMargin="-3.5"/>
      <Interval closure="closedClosed" leftMargin="3.5" rightMargin="8.7"/>
    </DataField>
    <DataField name="Sepal.Width" optype="continuous" dataType="double"/>
    <DataField name="Petal.Length" optype="continuous" dataType="double"/>
    <DataField name="Petal.Width" optype="continuous" dataType="double"/>
    <DataField name="Species" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
  </DataDictionary>

Thursday, June 30, 2016
