Predictive Analytics, Big Data, Hadoop, PMML: PMML and Association Rules

An association rule describes a relationship between one group of items and another. Put another way: "If condition A is satisfied, then so is condition B". As an example, consider the items people purchase in a grocery store. Suppose most people who buy milk also buy juice, and most people who buy chicken and beef also buy bread. Then two association rules exist:

[Milk] --> [Juice]
If you buy milk, then you will also buy juice

[Chicken,Beef] --> [Bread]
If you buy chicken and beef, then you will also buy bread
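
To make the "if A, then B" reading concrete, here is a minimal Python sketch of the two rules above. The names Rule, RULES, fired_rules and predicted_items are made up for this illustration and are not part of PMML; the matching logic is also a simplification, since a real scoring engine would typically rank the matching rules by measures such as support and confidence.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set


class Rule(NamedTuple):
    """Hypothetical container for one rule: antecedent --> consequent."""
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]


# The two rules from the grocery example.
RULES = [
    Rule(frozenset({"Milk"}), frozenset({"Juice"})),
    Rule(frozenset({"Chicken", "Beef"}), frozenset({"Bread"})),
]


def fired_rules(basket: Iterable[str], rules: List[Rule] = RULES) -> List[Rule]:
    """Return the rules whose whole antecedent is contained in the basket."""
    items = set(basket)
    return [rule for rule in rules if rule.antecedent <= items]


def predicted_items(basket: Iterable[str], rules: List[Rule] = RULES) -> Set[str]:
    """Collect the consequents of every rule that fires for the basket."""
    predicted: Set[str] = set()
    for rule in fired_rules(basket, rules):
        predicted |= rule.consequent
    return predicted


print(sorted(predicted_items(["Chicken", "Beef"])))  # ['Bread']
print(sorted(predicted_items(["Chicken"])))          # [] (Beef is missing)
```

Note that the second rule only fires when both Chicken and Beef are present, which is why all the items of a transaction have to be known before anything can be predicted.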

Data Processing

Normally, as in a typical regression model, one data row (or record) is read at a time and one output is returned. In particular, one input value is read for each of the input variables required by the model; these values sit in different columns of the same row. Once read in, the input record is processed through the model. The result, or output, is then appended to the data as an extra column containing the predicted value or score.

For association rules, on the other hand, multiple items of a single transaction need to be read in and processed before an output can be returned. As suggested by the example above, both "Chicken" and "Beef" need to be read in before "Bread" is produced as an output. In the usual data format, the entire transaction would be a single value in one column. For association rules, two different data processing methods can be used instead to read all the items of a single transaction.

These two methods allow the data to be expressed in either a "rectangular" or a "transactional" format.

Rectangular Format

The rectangular format gives every possible item its own column, with one row per transaction. For the above example, if customers purchase from a list of five possible items (Milk, Juice, Chicken, Beef, and Bread), the input data might be represented as:

Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1

Note that the first row is the header, while the third row, for example, specifies that Chicken, Beef and Bread were purchased together (a 1 marks an item that was purchased, a 0 an item that was not). Of course, it is not clear from this row alone whether chicken and beef imply bread, or whether chicken implies bread and beef; but together with the PMML file, the scoring engine is able to deduce the correct relationships. For a "rectangular" data file, the output is added to the same row as an additional column.
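
As a rough illustration of how such a file could be turned into one set of items per transaction, consider the Python sketch below. The function name read_rectangular is hypothetical, and the sketch assumes that a purchased item is always flagged with the literal value 1, as in the sample data.

```python
import csv
import io

# The sample rectangular data file from above.
RECTANGULAR = """Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1
"""


def read_rectangular(text):
    """Return one set of purchased items per data row.

    The header supplies the item names; a cell equal to "1" means the
    item in that column was part of the transaction.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        {item for item, flag in row.items() if (flag or "").strip() == "1"}
        for row in reader
    ]


baskets = read_rectangular(RECTANGULAR)
for basket in baskets:
    print(sorted(basket))
# ['Juice', 'Milk']
# ['Beef', 'Bread', 'Chicken']
```

Each row is self-contained here, so a transaction can be processed as soon as its row has been read.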

The PMML file for each format is different as well. In a "rectangular" PMML file, every possible item is defined as its own field, and so each one appears as a separate "MiningField" under the "MiningSchema" element. For the example above, instead of a single "MiningField" for the entire purchase, one would have five "MiningFields": Milk, Juice, Chicken, Beef, and Bread, as follows:

 <MiningSchema>
    <MiningField name="Milk" usageType="active"/>
    <MiningField name="Juice" usageType="active"/>
    <MiningField name="Chicken" usageType="active"/>
    <MiningField name="Beef" usageType="active"/>
    <MiningField name="Bread" usageType="active"/>
 </MiningSchema>
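
Since the rectangular schema is just one "MiningField" per item, it can be generated from the list of items. The Python sketch below does this with the standard xml.etree.ElementTree module; the helper name rectangular_mining_schema is invented for this example, and the output is only the "MiningSchema" fragment, not a complete PMML file.

```python
import xml.etree.ElementTree as ET


def rectangular_mining_schema(items):
    """Build a MiningSchema element with one active MiningField per item."""
    schema = ET.Element("MiningSchema")
    for name in items:
        ET.SubElement(schema, "MiningField", name=name, usageType="active")
    return schema


schema = rectangular_mining_schema(["Milk", "Juice", "Chicken", "Beef", "Bread"])

# Serialize the fragment as text (one line, no pretty-printing).
print(ET.tostring(schema, encoding="unicode"))
```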

PMML Example - Association Rules in Rectangular Format

For an example of a PMML file and its corresponding data file in rectangular format, click HERE.

Transactional Format

The "transactional" format, on the other hand, allows the input data to be specified in two columns: the first is the transaction identifier and the second holds one item per row. For the example above, the data file might be represented as:

ID,item
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread

The identifier (column "ID") indicates which items belong together. And so, in this example, ID = 1 specifies that the first two items (Milk and Juice) belong to the same input group or transaction, while ID = 2 indicates that Chicken, Beef and Bread belong to a different group. In this case, for the "transactional" data file, the predicted value is added as an extra column in the first row of each group only.
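
The grouping step can be sketched in a few lines of Python. The function name read_transactional is hypothetical, and the columns are read by position (identifier first, item second), so the sketch does not depend on how the header labels them.

```python
import csv
import io

# The sample transactional data: one item per row, tagged with its group ID.
TRANSACTIONAL = """ID,item
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread
"""


def read_transactional(text):
    """Group the items by identifier, keeping groups in order of appearance."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # skip the header row
    groups = {}
    for row in rows:
        if not row:
            continue  # ignore blank lines
        group_id, item = row[0], row[1]
        groups.setdefault(group_id, set()).add(item)
    return groups


groups = read_transactional(TRANSACTIONAL)
for group_id, items in groups.items():
    print(group_id, sorted(items))
# 1 ['Juice', 'Milk']
# 2 ['Beef', 'Bread', 'Chicken']
```

Unlike the rectangular case, a group is only complete once all the rows carrying its identifier have been read.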

A "transactional" PMML file defines two "MiningFields". One has usage type "group" and indicates which group each item belongs to. The second has usage type "active" and holds, as in our example, all the items that were purchased. Note that it is not necessary to list all items one by one. And so, the "MiningSchema" in a "transactional" PMML file might look like:

 <MiningSchema>
    <MiningField name="ID" usageType="group"/>
    <MiningField name="item" usageType="active"/>
 </MiningSchema>

In this case, the rows with the same "ID" belong together: since Milk and Juice in our example both have ID = 1, they are in the same group. The second column, titled "item" in the data file, lists the items for that group: Milk and Juice. One can thus read the first group as: “Milk and juice are purchased together”.
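
Because only the transactional schema contains a field with usage type "group", the two formats can be told apart by inspecting the "MiningSchema". The Python sketch below shows one way to do so; the function name data_format is made up for this example, and it works on a bare "MiningSchema" fragment. In a complete PMML file the elements carry an XML namespace, so the tag lookup would need to be adapted.

```python
import xml.etree.ElementTree as ET

TRANSACTIONAL_SCHEMA = """
<MiningSchema>
    <MiningField name="ID" usageType="group"/>
    <MiningField name="item" usageType="active"/>
</MiningSchema>
"""

RECTANGULAR_SCHEMA = """
<MiningSchema>
    <MiningField name="Milk" usageType="active"/>
    <MiningField name="Juice" usageType="active"/>
</MiningSchema>
"""


def data_format(mining_schema_xml):
    """Guess the expected data format from a MiningSchema fragment.

    Assumption: a field with usageType="group" signals transactional data;
    a missing usageType attribute is treated as "active".
    """
    schema = ET.fromstring(mining_schema_xml)
    usage_types = [
        field.get("usageType", "active") for field in schema.iter("MiningField")
    ]
    return "transactional" if "group" in usage_types else "rectangular"


print(data_format(TRANSACTIONAL_SCHEMA))  # transactional
print(data_format(RECTANGULAR_SCHEMA))    # rectangular
```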

PMML Example - Association Rules in Transactional Format

For an example of a PMML file and its corresponding data file in transactional format, click HERE.

Friday, December 9, 2011
