## Introduction

We recently released Neighbr, a package for performing k-nearest neighbor classification and regression. Highlights of version 1.0 include:

- comparison measures that support continuous or logical features
- support for categorical and continuous targets
- neighbor ranking

Neighbr models can also be converted to the PMML (Predictive Model Markup Language) standard using the pmml R package.

In this blog post, we will provide some examples of how to use `neighbr` to create knn models.

## Examples

First, load necessary libraries and set the seed and number display options. `knitr::kable` is used to display data frames.

```
library(neighbr)
library(knitr)
set.seed(123)
options(digits=3)
```

## Continuous features and categorical target

This example uses squared euclidean distance with 3 neighbors to classify the Species of flowers in the `iris` dataset. Each training instance consists of 4 features and 1 class variable. The categorical target is predicted by a majority vote among the closest `k` neighbors.

The `knn()` function requires that all columns in `test_set` are feature columns, with the same names and in the same order as the features in `train_set`. The `train_set` is assumed to contain only features and targets (one categorical, one continuous, and/or an ID for neighbor ranking); i.e., if a column name is not specified as a target, it is assumed to be a feature. The `fit` object contains predictions for `test_set` in `fit$test_set_scores` (there is no `predict` method for `knn`).

```
data(iris)
train_set <- iris[1:147,] # train set contains all targets and features
test_set <- iris[148:150, !names(iris) %in% c("Species")] # test set does not contain any targets
# run knn function
fit <- knn(train_set=train_set, test_set=test_set,
           k=3,
           categorical_target="Species",
           comparison_measure="squared_euclidean")
#show predictions
kable(fit$test_set_scores)
```

| | categorical_target |
|---|---|
| 148 | virginica |
| 149 | virginica |
| 150 | virginica |

The returned data frame contains predictions for the categorical target (Species).
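To make the majority vote concrete, here is a minimal base-R sketch of the same idea: compute the squared euclidean distance from each test row to every training row, take the `k` closest, and pick the most frequent class. This is an illustration only, not how `neighbr` is implemented; the helper `predict_one` and the variable names are hypothetical, and ties may be broken differently than in the package.

```r
# Base-R sketch of k-nearest-neighbor classification (not neighbr's internals).
data(iris)
feature_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
train_x <- as.matrix(iris[1:147, feature_cols])
train_y <- iris$Species[1:147]
test_x  <- as.matrix(iris[148:150, feature_cols])
k <- 3

# Hypothetical helper: predict one class label by majority vote.
predict_one <- function(x) {
  # squared euclidean distance from x to every training row
  d <- rowSums(sweep(train_x, 2, x)^2)
  # indices of the k closest training rows
  nearest <- order(d)[seq_len(k)]
  # count the classes among the neighbors; which.max takes the first on ties
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

sketch_pred <- apply(test_x, 1, predict_one)
sketch_pred
```

The result is a character vector named by the test row names, which can be compared with `fit$test_set_scores`.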

## Mixed targets and neighbor ranking

It is possible to predict categorical and continuous targets simultaneously, as well as to return the IDs of the closest neighbors of a given instance. In the next example, an ID column is added to the data for ranking, and `Petal.Width` is used as a continuous target. By default, the prediction for the continuous target is calculated by averaging the closest `k` neighbors.

```
data(iris)
iris$ID <- c(1:150) # an ID column is necessary if ranks are to be calculated
train_set <- iris[1:147,] # train set contains all predicted variables, features, and ID column
test_set <- iris[148:150, !names(iris) %in% c("Petal.Width","Species","ID")] # test set does not contain predicted variables or ID column
fit <- knn(train_set=train_set, test_set=test_set,
           k=3,
           categorical_target="Species",
           continuous_target="Petal.Width",
           comparison_measure="squared_euclidean",
           return_ranked_neighbors=3,
           id="ID")
kable(fit$test_set_scores)
```

| | categorical_target | continuous_target | neighbor1 | neighbor2 | neighbor3 |
|---|---|---|---|---|---|
| 148 | virginica | 2.20 | 146 | 111 | 116 |
| 149 | virginica | 2.17 | 137 | 116 | 138 |
| 150 | virginica | 1.93 | 115 | 128 | 84 |

The ranked neighbor IDs are returned along with the categorical and continuous targets, with `neighbor1` being the closest in terms of distance. If a similarity measure were being used, `neighbor1` would be the most similar. Any number of neighbors can be returned, as long as `return_ranked_neighbors <= k`.
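The averaging and ranking steps can be sketched in base R for a single test row. This is a hand-rolled illustration under the assumption that the features are the three columns left in `test_set` (`Sepal.Length`, `Sepal.Width`, `Petal.Length`); it does not call `neighbr`, and names such as `ranked_ids` and `continuous_pred` are made up for the example.

```r
# Base-R sketch: ranked neighbor IDs and averaged continuous target for row 148.
data(iris)
feature_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
train <- iris[1:147, ]
train$ID <- 1:147                      # ID column used to report neighbors
query <- unlist(iris[148, feature_cols])
k <- 3

# squared euclidean distance from the query row to every training row
d <- rowSums(sweep(as.matrix(train[, feature_cols]), 2, query)^2)

# positions of the k closest training rows, closest first
nearest <- order(d)[seq_len(k)]

ranked_ids <- train$ID[nearest]                      # neighbor1, neighbor2, neighbor3
continuous_pred <- mean(train$Petal.Width[nearest])  # average of the k neighbors

ranked_ids
continuous_pred
```

For test row 148, the table above lists neighbors 146, 111 and 116 and a continuous prediction of 2.20, so the sketch can be checked against those values.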

## Neighbor ranking without targets

It is possible to get neighbor ranks without a target variable. In this unsupervised learning case, `continuous_target` and `categorical_target` are left as `NULL` by default.

```
data(iris)
iris$ID <- c(1:150) # an ID column is necessary if ranks are to be calculated
train_set <- iris[1:147, -c(5)] # remove `Species` categorical variable
test_set <- iris[148:150, !names(iris) %in% c("Species","ID")] # test set does not contain predicted variables or ID column
fit <- knn(train_set=train_set, test_set=test_set,
           k=5,
           comparison_measure="squared_euclidean",
           return_ranked_neighbors=4,
           id="ID")
kable(fit$test_set_scores)
```

| | neighbor1 | neighbor2 | neighbor3 | neighbor4 |
|---|---|---|---|---|
| 148 | 111 | 112 | 117 | 146 |
| 149 | 137 | 116 | 111 | 141 |
| 150 | 128 | 139 | 102 | 143 |

## Logical features

The package supports logical features, to be used with an appropriate similarity measure. This example demonstrates predicting a categorical target and ranking neighbors for the `HouseVotes84` dataset (from the `mlbench` package). The features may be logical vectors consisting of `{TRUE, FALSE}` or numeric vectors consisting of `{0,1}`, but not factors. In this example, the factor features are converted to numeric vectors.

```
library(mlbench)
data(HouseVotes84)
dat <- HouseVotes84[complete.cases(HouseVotes84),] # remove any rows with NA elements
# change all {y,n} factor levels to {1,0}
feature_names <- names(dat)[!names(dat) %in% c("Class","ID")]
for (n in feature_names) {
  levels(dat[,n])[levels(dat[,n])=="n"] <- 0
  levels(dat[,n])[levels(dat[,n])=="y"] <- 1
}
# change factors to numeric
for (n in feature_names) {dat[,n] <- as.numeric(levels(dat[,n]))[dat[,n]]}
dat$ID <- c(1:nrow(dat)) # an ID column is necessary if ranks are to be calculated
train_set <- dat[1:227,]
test_set <- dat[228:232, !names(dat) %in% c("Class","ID")] # test set does not contain predicted variables or ID column
house_fit <- knn(train_set=train_set, test_set=test_set,
                 k=7,
                 categorical_target="Class",
                 comparison_measure="jaccard",
                 return_ranked_neighbors=3,
                 id="ID")
kable(house_fit$test_set_scores)
```

| | categorical_target | neighbor1 | neighbor2 | neighbor3 |
|---|---|---|---|---|
| 424 | democrat | 114 | 96 | 112 |
| 427 | democrat | 5 | 47 | 91 |
| 428 | republican | 70 | 156 | 155 |
| 431 | republican | 115 | 117 | 152 |
| 432 | democrat | 57 | 130 | 135 |
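The factor-to-numeric conversion used above can also be written more compactly. The following sketch uses a small made-up data frame (`votes`, with hypothetical columns `V1` and `V2`) instead of `HouseVotes84`, and assumes the factors have no missing values, as is the case after `complete.cases()`.

```r
# Toy data frame standing in for HouseVotes84 (column names are made up).
votes <- data.frame(
  Class = factor(c("democrat", "republican", "democrat")),
  V1 = factor(c("y", "n", "y"), levels = c("n", "y")),
  V2 = factor(c("n", "n", "y"), levels = c("n", "y"))
)

# every column except the target (and an ID, if present) is a feature
feature_names <- setdiff(names(votes), c("Class", "ID"))

# map "y" to 1 and "n" to 0 in a single pass over the feature columns
votes[feature_names] <- lapply(votes[feature_names],
                               function(f) as.numeric(f == "y"))
str(votes)
```

Comparing against `"y"` avoids relabeling the factor levels first, so the two loops collapse into one `lapply()` call.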

## Comparison measures

Distance measures are used for vectors with continuous elements. Similarity measures are used for logical vectors. The comparison measures used in `neighbr` are based on those defined in the PMML standard.

Functions in `neighbr` can be used to calculate distances or similarities between vectors directly:

```
distance(c(1,2.3,2.9,0.4),c(-0.3,5.3,2.9,3.3),"euclidean")
#> [1] 4.37
similarity(c(0,1,0,1,1,1),c(1,1,0,1,1,0),"tanimoto")
#> [1] 0.5
similarity(c(0,1,0,1,1,1),c(1,1,0,1,1,0),"jaccard")
#> [1] 0.6
```
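The three values above can be reproduced by hand. The sketch below assumes the usual binary-similarity definitions from the PMML standard, where `a11` counts positions in which both vectors are 1, `a00` positions in which both are 0, and `a10`/`a01` the mismatches; the variable names are chosen for this example and are not part of `neighbr`.

```r
# Hand computation of the measures shown above (base R only).
x <- c(0, 1, 0, 1, 1, 1)
y <- c(1, 1, 0, 1, 1, 0)

a11 <- sum(x == 1 & y == 1)  # both 1
a10 <- sum(x == 1 & y == 0)  # 1 in x only
a01 <- sum(x == 0 & y == 1)  # 1 in y only
a00 <- sum(x == 0 & y == 0)  # both 0

# jaccard ignores joint absences (a00)
jaccard_by_hand <- a11 / (a11 + a10 + a01)
# tanimoto counts joint absences as matches and weights mismatches twice
tanimoto_by_hand <- (a11 + a00) / (a11 + 2 * (a10 + a01) + a00)

# euclidean distance between two continuous vectors
u <- c(1, 2.3, 2.9, 0.4)
v <- c(-0.3, 5.3, 2.9, 3.3)
euclidean_by_hand <- sqrt(sum((u - v)^2))

c(jaccard = jaccard_by_hand, tanimoto = tanimoto_by_hand,
  euclidean = euclidean_by_hand)
```

Here `a11 = 3`, `a10 = 1`, `a01 = 1` and `a00 = 1`, which gives 3/5 for jaccard and 4/8 for tanimoto.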

To check which measures are available, run `?distance` and `?similarity` in your R session.

## Neighbr and PMML

This package was developed following the KNN specification in the PMML (Predictive Model Markup Language) standard. The models produced by `neighbr` can be converted to PMML (using the `pmml` R package).

For example, to convert the model for the `HouseVotes84` data above:

```
library(pmml)
#> Loading required package: XML
house_fit_pmml <- pmml(house_fit)
```

## More information

Additional examples and details are available in the neighbr vignette, which can also be accessed from an R session by running `vignette("neighbr-help")`.

For additional examples on converting `neighbr` models to PMML, run `?pmml.neighbr` after loading the `pmml` package in R.
