[Milk] --> [Juice]
If you buy milk, then you will also buy juice
[Chicken,Beef] --> [Bread]
If you buy chicken and beef, then you will also buy bread
Data Processing
Normally, as in a typical regression model, one data row (or record) is read at a time and one output is returned. One input value is read for each input variable required by the model; these values sit in different columns of the same row. The record is then processed through the model, and the result is appended to the data as an extra column holding the predicted value or score. For association rules, on the other hand, multiple items of a single transaction need to be read in and processed before an output can be returned. As the example above suggests, both "Chicken" and "Beef" need to be read in before "Bread" is produced as an output. In the usual data format, each row supplies exactly one value per column. For association rules, two different data processing methods can be used to read all the items of a single transaction.
These two methods allow the data to be expressed either in a "rectangular" format or in a "transactional" format.
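To make the contrast with row-at-a-time scoring concrete, here is a minimal Python sketch of why the whole transaction must be read before a rule can produce an output. The rule table and the function name fire_rules are hypothetical and only mirror the two example rules above; an actual PMML scoring engine applies further selection criteria that are not modeled here.

```python
# Hypothetical rule table mirroring the two example rules:
# antecedent item set -> consequent item set.
RULES = [
    (frozenset({"Milk"}), frozenset({"Juice"})),
    (frozenset({"Chicken", "Beef"}), frozenset({"Bread"})),
]


def fire_rules(items):
    """Return the consequents of every rule whose antecedent is fully
    contained in the items read so far for one transaction."""
    basket = set(items)
    fired = set()
    for antecedent, consequent in RULES:
        if antecedent <= basket:  # all antecedent items are present
            fired |= consequent
    return fired


print(fire_rules(["Chicken"]))          # set()  (Beef has not been read yet)
print(fire_rules(["Chicken", "Beef"]))  # {'Bread'}
```

Reading only "Chicken" yields nothing, which is why the items of a transaction have to be collected first, whichever of the two formats delivers them.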
Rectangular Format
The rectangular format gives every possible item its own column, and each row represents one transaction: a 1 marks an item that was purchased and a 0 marks one that was not. For the above example, if customers purchase from a list of five possible items (Milk, Juice, Chicken, Beef, and Bread), the input data might be represented as:
Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1
Note that the first row is the header, while the third row, for example, specifies that Chicken, Beef and Bread were purchased together. Of course, it is not clear from the data alone whether chicken and beef imply bread, or whether chicken implies bread and beef; together with the PMML file, however, the scoring machine is able to deduce the correct relationships. For a "rectangular" data file, the output is appended to the same row as an additional column.
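As an illustration of how such a file can be read, the sketch below turns each 0/1 row into the set of purchased items. It uses only Python's standard csv module; the function name read_rectangular is hypothetical, and the data is the sample shown above.

```python
import csv
import io

# The rectangular sample from the text: one column per item, one row per transaction.
RECTANGULAR = """Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1
"""


def read_rectangular(text):
    """Return one set of purchased items per data row (columns flagged with 1)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {item for item, flag in row.items() if flag.strip() == "1"}
        for row in reader
    ]


baskets = read_rectangular(RECTANGULAR)
for basket in baskets:
    print(sorted(basket))
# ['Juice', 'Milk']
# ['Beef', 'Bread', 'Chicken']
```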
The PMML file for each format is different as well. For a "rectangular" PMML file, all the possible values or items are defined as different fields. And so, these are defined as different "MiningFields" under the "MiningSchema" element. For the example above, instead of a single "MiningField" for the entire purchase, one would have five "MiningFields": Milk, Juice, Chicken, Beef, and Bread, as follows:
<MiningSchema> |
PMML Example - Association Rules in Rectangular Format
For an example of a PMML file and its corresponding data file in rectangular format, click HERE.
Transactional Format
The "transactional" format, on the other hand, allows for the input data to be specified in two columns: the first one is the identifier and the second one contains the possible items. For the example above, the data file might be represented as:
ID,value
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread
The identifier (column "ID") indicates which items belong together. And so, in this example, ID = 1 specifies that the first two items (Milk and Juice) belong to the same input group or transaction, while ID = 2 indicates that Chicken, Beef and Bread belong to a different group. In this case, for the "transactional" data file, the predicted value is added as an extra column in the first row of each group only.
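The grouping step described here can be sketched in a few lines of Python. The column names "ID" and "value" come from the sample file above; the function name read_transactional and its parameters are hypothetical.

```python
import csv
import io

# The transactional sample from the text: one row per (transaction, item) pair.
TRANSACTIONAL = """ID,value
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread
"""


def read_transactional(text, group_col="ID", item_col="value"):
    """Collect the items of each transaction, keyed by the group identifier."""
    groups = {}
    for row in csv.DictReader(io.StringIO(text)):
        groups.setdefault(row[group_col], []).append(row[item_col])
    return groups


groups = read_transactional(TRANSACTIONAL)
print(groups)
# {'1': ['Milk', 'Juice'], '2': ['Chicken', 'Beef', 'Bread']}
```

Only after all rows sharing an ID have been collected can the transaction be scored, which is consistent with the predicted value appearing once per group.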
A "transactional" PMML file defines two "MiningFields". One is of type "group", which indicates which group each item belongs to. The second is of type "active" and holds, as in our example, the items that were purchased. Note that it is not necessary to list all items one by one. The "MiningSchema" in a "transactional" PMML file might therefore look like:
<MiningSchema> |
In this case, the rows with the same "ID" belong together: since Milk and Juice in our example both have ID = 1, they are in the same group. The second column, titled "value" in the data file above, lists all the items for that group: Milk and Juice. One can thus read the first group as: "Milk and juice are purchased together".
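For comparison, the following Python sketch builds the two kinds of "MiningSchema" described in this post with the standard xml.etree.ElementTree module. It is an illustration, not the original listing: the helper mining_schema is hypothetical, the field names are assumed from the sample data files, and the usageType attribute values "group" and "active" follow the PMML specification.

```python
import xml.etree.ElementTree as ET


def mining_schema(fields):
    """Build a MiningSchema element from (field name, usage type) pairs."""
    schema = ET.Element("MiningSchema")
    for name, usage in fields:
        ET.SubElement(schema, "MiningField", name=name, usageType=usage)
    return schema


# Transactional: one grouping field plus one field holding the items.
# The names "ID" and "value" are assumed from the sample data file.
transactional = mining_schema([("ID", "group"), ("value", "active")])

# Rectangular: one active field per possible item.
items = ["Milk", "Juice", "Chicken", "Beef", "Bread"]
rectangular = mining_schema([(item, "active") for item in items])

print(ET.tostring(transactional, encoding="unicode"))
print(ET.tostring(rectangular, encoding="unicode"))
```

The transactional schema stays at two fields no matter how many distinct items exist, whereas the rectangular schema grows by one field per item.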
PMML Example - Association Rules in Transactional Format
For an example of a PMML file and its corresponding data file in transactional format, click HERE.