Predictive Analytics, Big Data, Hadoop, PMML: Introduction to the R Neighbr package

Introduction

We recently released Neighbr, a package for performing k-nearest neighbor classification and regression. Highlights of version 1.0 include:

comparison measures that support continuous or logical features
support for categorical and continuous targets
neighbor ranking

Neighbr models can also be converted to the PMML (Predictive Model Markup Language) standard using the pmml R package.

In this blog post, we provide some examples of how to use neighbr to create k-nearest neighbor (knn) models.

Examples

First, load the necessary libraries, then set the seed and the number display options. knitr::kable is used to display data frames.

library(neighbr)
library(knitr)
set.seed(123)
options(digits=3)

Continuous features and categorical target

This example uses squared Euclidean distance with 3 neighbors to classify the Species of flowers in the iris dataset. Each training instance consists of 4 features and 1 class variable. The categorical target is predicted by a majority vote among the k closest neighbors. The knn() function requires that all columns in test_set are feature columns, with the same names and in the same order as the features in train_set. The train_set is assumed to contain only features and targets (one categorical, one continuous, and/or an ID column for neighbor ranking); that is, any column not specified as a target or ID is assumed to be a feature. The fit object contains predictions for test_set in fit$test_set_scores (there is no predict method for knn).

data(iris)
train_set <- iris[1:147,] #train set contains all targets and features
test_set <- iris[148:150,!names(iris) %in% c("Species")] #test set does not contain any targets

#run knn function
fit <- knn(train_set=train_set,test_set=test_set,
            k=3,
            categorical_target="Species",
            comparison_measure="squared_euclidean")

#show predictions
kable(fit$test_set_scores)

	categorical_target
148	virginica
149	virginica
150	virginica

The returned data frame contains predictions for the categorical target (Species).
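To make the majority-vote idea concrete, here is a minimal base-R sketch of the same kind of computation. It is not how neighbr is implemented; knn_vote_sketch is a hypothetical helper written only for illustration, and its tie-breaking rule is our own assumption.

```r
# Hypothetical helper, not part of neighbr: k-NN classification by majority vote.
knn_vote_sketch <- function(train_x, train_y, test_x, k = 3) {
  train_m <- as.matrix(train_x)
  test_m <- as.matrix(test_x)
  # features must have the same names, in the same order, in both sets
  stopifnot(identical(colnames(train_m), colnames(test_m)))
  apply(test_m, 1, function(row) {
    # squared Euclidean distance from this test row to every training row
    d <- rowSums(sweep(train_m, 2, row)^2)
    nearest <- order(d)[seq_len(k)]
    # majority vote; which.max returns the first maximum,
    # so ties go to the earliest factor level (an assumption of this sketch)
    names(which.max(table(train_y[nearest])))
  })
}

data(iris)
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
pred <- knn_vote_sketch(iris[1:147, features], iris$Species[1:147],
                        iris[148:150, features], k = 3)
print(pred)
```

The stopifnot() line mirrors the requirement that test_set columns match the feature columns of train_set in both name and order.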

Mixed targets and neighbor ranking

It is possible to predict categorical and continuous targets simultaneously, as well as to return the IDs of the closest neighbors of a given instance. In the next example, an ID column is added to the data for ranking, and Petal.Width is used as a continuous target. By default, the prediction for the continuous target is calculated by averaging the target values of the k closest neighbors.

data(iris)
iris$ID <- c(1:150) #an ID column is necessary if ranks are to be calculated
train_set <- iris[1:147,] #train set contains all predicted variables, features, and ID column
test_set <- iris[148:150,!names(iris) %in% c("Petal.Width","Species","ID")] #test set does not contain predicted variables or ID column

fit <- knn(train_set=train_set,test_set=test_set,
            k=3,
            categorical_target="Species",
            continuous_target= "Petal.Width",
            comparison_measure="squared_euclidean",
            return_ranked_neighbors=3,
            id="ID")

kable(fit$test_set_scores)

	categorical_target	continuous_target	neighbor1	neighbor2	neighbor3
148	virginica	2.20	146	111	116
149	virginica	2.17	137	116	138
150	virginica	1.93	115	128	84

The ranked neighbor IDs are returned along with the categorical and continuous targets, with neighbor1 being the closest in terms of distance. If a similarity measure were used instead, neighbor1 would be the most similar. Any number of neighbors can be returned, as long as return_ranked_neighbors <= k.
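The averaging and ranking steps can be sketched in a few lines of base R. This is only an illustration under our own assumptions, not the package's code; knn_mean_sketch is a hypothetical helper, and the three remaining iris measurements are used as features because Petal.Width is the target here.

```r
# Hypothetical helper, not part of neighbr: predict a continuous target as the
# mean over the k nearest neighbors and report their IDs, closest first.
knn_mean_sketch <- function(train_x, train_y, train_id, test_row, k = 3) {
  # squared Euclidean distance from the test row to every training row
  d <- rowSums(sweep(as.matrix(train_x), 2, unlist(test_row))^2)
  nearest <- order(d)[seq_len(k)]
  list(prediction = mean(train_y[nearest]),
       ranked_ids = train_id[nearest])
}

data(iris)
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
res <- knn_mean_sketch(iris[1:147, features], iris$Petal.Width[1:147],
                       1:147, iris[148, features], k = 3)
print(res)
```

Because order() sorts distances in ascending order, the first ID returned corresponds to neighbor1 in the table above.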

Neighbor ranking without targets

It is possible to get neighbor ranks without a target variable. In this unsupervised learning case, continuous_target and categorical_target are left as NULL by default.

data(iris)
iris$ID <- c(1:150) #an ID column is necessary if ranks are to be calculated
train_set <- iris[1:147,-c(5)] #remove `Species` categorical variable
test_set <- iris[148:150,!names(iris) %in% c("Species","ID")] #test set does not contain predicted variables or ID column

fit <- knn(train_set=train_set,test_set=test_set,
            k=5,
            comparison_measure="squared_euclidean",
            return_ranked_neighbors=4,
            id="ID")

kable(fit$test_set_scores)

	neighbor1	neighbor2	neighbor3	neighbor4
148	111	112	117	146
149	137	116	111	141
150	128	139	102	143

Logical features

The package supports logical features, which should be used with an appropriate similarity measure. This example demonstrates predicting a categorical target and ranking neighbors for the HouseVotes84 dataset (from the mlbench package). The features may be logical vectors of {TRUE, FALSE} or numeric vectors of {0, 1}, but not factors. In this example, the factor features are converted to numeric vectors.

library(mlbench)
data(HouseVotes84)
dat <- HouseVotes84[complete.cases(HouseVotes84),] # remove any rows with NA elements

# change all {yes,no} factors to {0,1}
feature_names <- names(dat)[!names(dat) %in% c("Class","ID")]
for (n in feature_names) {
  levels(dat[,n])[levels(dat[,n])=="n"] <- 0
  levels(dat[,n])[levels(dat[,n])=="y"] <- 1
}

# change factors to numeric
for (n in feature_names) {dat[,n] <- as.numeric(levels(dat[,n]))[dat[,n]]}

dat$ID <- c(1:nrow(dat)) #an ID column is necessary if ranks are to be calculated

train_set <- dat[1:227,]
test_set <- dat[228:232,!names(dat) %in% c("Class","ID")] #test set does not contain predicted variables or ID column

house_fit <- knn(train_set=train_set,test_set=test_set,
            k=7,
            categorical_target = "Class",
            comparison_measure="jaccard",
            return_ranked_neighbors=3,
            id="ID")

kable(house_fit$test_set_scores)

	categorical_target	neighbor1	neighbor2	neighbor3
424	democrat	114	96	112
427	democrat	5	47	91
428	republican	70	156	155
431	republican	115	117	152
432	democrat	57	130	135
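The factor-to-numeric conversion used in this example can also be written as a direct comparison against the "y" level. The sketch below shows the idea on a small toy data frame; the votes data and its values are hypothetical stand-ins that only mimic the y/n factor coding of HouseVotes84.

```r
# Toy stand-in for HouseVotes84 (hypothetical values, same y/n factor coding)
votes <- data.frame(Class = factor(c("democrat", "republican", "democrat")),
                    V1 = factor(c("y", "n", "y"), levels = c("n", "y")),
                    V2 = factor(c("n", "n", "y"), levels = c("n", "y")))

# every column except the target (and the ID, if present) is a feature
feature_names <- setdiff(names(votes), c("Class", "ID"))

# comparing against "y" yields TRUE/FALSE; as.numeric() maps these to 1/0
votes[feature_names] <- lapply(votes[feature_names],
                               function(col) as.numeric(col == "y"))
str(votes)
```

Dropping the as.numeric() call would leave logical {TRUE, FALSE} columns, which the text above says are also accepted.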

Comparison measures

Distance measures are used for vectors with continuous elements. Similarity measures are used for logical vectors. The comparison measures used in neighbr are based on those defined in the PMML standard.

Functions in neighbr can be used to calculate distances or similarities between vectors directly:

distance(c(1,2.3,2.9,0.4),c(-0.3,5.3,2.9,3.3),"euclidean")
#> [1] 4.37
similarity(c(0,1,0,1,1,1),c(1,1,0,1,1,0),"tanimoto")
#> [1] 0.5
similarity(c(0,1,0,1,1,1),c(1,1,0,1,1,0),"jaccard")
#> [1] 0.6

To check which measures are available, run ?distance and ?similarity in your R session.
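For readers who want to see what these numbers mean, the sketch below recomputes them in base R from the match counts used in the PMML definitions of binary similarity (a11 counts positions where both vectors are 1, a00 where both are 0, and a10 and a01 the two kinds of mismatch). binary_similarity_sketch and its measure names are hypothetical and are not part of neighbr.

```r
# Hypothetical helper, not part of neighbr: binary similarity from match counts.
binary_similarity_sketch <- function(x, y,
                                     measure = c("jaccard", "tanimoto",
                                                 "simple_matching")) {
  measure <- match.arg(measure)
  a11 <- sum(x == 1 & y == 1)  # both 1
  a10 <- sum(x == 1 & y == 0)  # 1 in x only
  a01 <- sum(x == 0 & y == 1)  # 1 in y only
  a00 <- sum(x == 0 & y == 0)  # both 0
  switch(measure,
         jaccard = a11 / (a11 + a10 + a01),
         tanimoto = (a11 + a00) / (a11 + 2 * (a10 + a01) + a00),
         simple_matching = (a11 + a00) / (a11 + a10 + a01 + a00))
}

x <- c(0, 1, 0, 1, 1, 1)
y <- c(1, 1, 0, 1, 1, 0)
jac <- binary_similarity_sketch(x, y, "jaccard")   # 3 / 5
tan <- binary_similarity_sketch(x, y, "tanimoto")  # 4 / 8

# Euclidean distance for the continuous example
a <- c(1, 2.3, 2.9, 0.4)
b <- c(-0.3, 5.3, 2.9, 3.3)
euc <- sqrt(sum((a - b)^2))                        # sqrt(19.1)
print(c(jaccard = jac, tanimoto = tan, euclidean = euc))
```

Jaccard ignores positions where both vectors are 0, while Tanimoto counts them as matches and weights mismatches twice, which is why the two measures differ on the same pair of vectors.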

Neighbr and PMML

This package was developed following the KNN specification in the PMML (Predictive Model Markup Language) standard. The models produced by neighbr can be converted to PMML (using the pmml R package).

For example, to convert the model for HouseVotes84 data above:

library(pmml)
#> Loading required package: XML
house_fit_pmml <- pmml(house_fit)

More information

Additional examples and details are available in the neighbr vignette, which can also be accessed from an R session by running vignette("neighbr-help").

For additional examples on converting neighbr models to PMML, run ?pmml.neighbr after loading the pmml package in R.


Wednesday, March 15, 2017
