How to Compute the Statistical Significance of Two Classifiers’ Performance Difference

A question I see quite often in scientific forums is how to determine whether the observed performance difference between two classifiers is statistically significant. Let us work through an example of how this statistical significance can be computed (click on the link below):

How to Compute the Statistical Significance of two Classifiers’ Performance Difference
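
As a quick illustration of the general idea, here is a minimal sketch of one common approach, McNemar's test on the two classifiers' paired correct/incorrect outcomes over the same test set. This is not necessarily the method used in the linked post, and the labels and predictions below are simulated purely for illustration.

# A minimal sketch, not necessarily the method in the linked post:
# McNemar's test on paired correct/incorrect outcomes (simulated data).
set.seed(1)
truth  <- sample(c("yes", "no"), 200, replace = TRUE)
pred_a <- ifelse(runif(200) < 0.85, truth, "no")   # hypothetical classifier A
pred_b <- ifelse(runif(200) < 0.70, truth, "no")   # hypothetical classifier B

correct_a <- pred_a == truth
correct_b <- pred_b == truth

# 2x2 table counting cases where each classifier is right/wrong
tab <- table(A = correct_a, B = correct_b)

# McNemar's test uses the discordant cells; a small p-value suggests
# the accuracy difference is statistically significant
mcnemar.test(tab)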

Principal Components-Based Modeling for the Prediction of a Brand’s Image Rating

All brands strive to have an excellent Image in their respective markets, where Image comprises such components as reputation and trustworthiness. For this purpose, companies often conduct surveys among their customers to inquire about their satisfaction, their complaints, and whether their expectations are being met. An interesting question that arises is whether, given such survey results, we can predict the brand's Image.

Package plspm in R contains a dataset called mobile, which contains variables that pertain to the above issue. Specifically, it contains 24 variables that encapsulate the following latent (underlying) concepts: Image, Expectations, Quality, Value, Satisfaction, Complaints, and Loyalty. The variables are measured on a 0-100 scale. For example, the Image latent concept has five corresponding variables in the dataset, and the Expectations latent concept has three. There are 250 observations in total.

A methodology often used for data with latent concepts is Principal Components Analysis (PCA), which is based on the eigenvectors and eigenvalues of the data covariance matrix. In this post, we employ Principal Components Regression (PCR), Partial Least Squares Regression (PLSR), and Ridge Regression (all related to PCA) to build models that predict the average Image rating of the products in the dataset. The response variable is therefore the average of the five Image variables, and the predictors are the variables related to the other concepts (Value, Satisfaction, etc.). Although all of these regressions are based on the principal components idea, they differ in how they treat the low- and high-variance directions. PCR keeps a certain number of high-variance directions and discards the rest. PLSR inflates some of the high-variance directions while shrinking the low-variance directions [Has13]. Ridge Regression, on the other hand, shrinks all directions, with a preference for the high-variance ones (i.e., it shrinks the low-variance directions more). Finally, for comparison purposes, we also fit an ordinary linear regression on the predictors. As shown in the code below, PCR and PLSR do best in predicting the average Image rating of each product. This could indicate that, for this dataset, keeping or inflating the high-variance directions produces better predictions than shrinking all directions (low-variance ones more than high-variance ones), as Ridge Regression does.

[Has13] Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2013.
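
As a brief illustration of how such models could be fit, here is a minimal sketch using the pls and MASS packages. The data frame name dat and the response name avg_image are hypothetical placeholders (the average of the five Image variables would have to be computed first); the full code is in the link below.

# A minimal sketch; 'dat' and 'avg_image' are hypothetical placeholders for the
# prepared data frame (response = average of the five Image variables,
# predictors = the variables of the other latent concepts).
library(pls)     # pcr(), plsr()
library(MASS)    # lm.ridge()

set.seed(1)
pcr_fit  <- pcr(avg_image ~ ., data = dat, scale = TRUE, validation = "CV")
plsr_fit <- plsr(avg_image ~ ., data = dat, scale = TRUE, validation = "CV")

# Cross-validated prediction error versus number of components
validationplot(pcr_fit,  val.type = "RMSEP")
validationplot(plsr_fit, val.type = "RMSEP")

# Ridge regression over a grid of penalties; GCV suggests a lambda value
ridge_fit <- lm.ridge(avg_image ~ ., data = dat, lambda = seq(0, 10, by = 0.1))
select(ridge_fit)

# Ordinary linear regression for comparison
ols_fit <- lm(avg_image ~ ., data = dat)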

Below is the link for the full R code:

Code for Regressions based on Principal Components

An Example of R Logistic Regression for Weather Prediction (updated 10/17)

In this post, we will see how R logistic regression, invoked via the generalized linear model function glm(), can be used to predict the value of a dichotomous variable that indicates whether or not it will rain in Australia. We use the weather dataset in the rattle package.
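
Here is a minimal sketch of the general approach. It assumes the weather data's RainTomorrow column holds the Yes/No outcome; the two predictors are chosen only for illustration, and the full code is in the link below.

# A minimal sketch; Humidity3pm and Pressure3pm are example predictors,
# not necessarily the ones used in the full code linked below.
library(rattle)   # weather dataset

data(weather)
d <- na.omit(weather[, c("RainTomorrow", "Humidity3pm", "Pressure3pm")])

# Logistic regression via glm() with a binomial family
fit <- glm(RainTomorrow ~ Humidity3pm + Pressure3pm, data = d, family = binomial)
summary(fit)

# Predicted probability of rain; classify as "Yes" above 0.5
p <- predict(fit, type = "response")
table(Predicted = ifelse(p > 0.5, "Yes", "No"), Actual = d$RainTomorrow)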

Below is the link for the full code:

https://drive.google.com/file/d/0B5vwqG-vGN-ZUE03QkJNYU1jMUk/view?usp=sharing

Code for R Logistic Regression

Prediction using the R SuperLearner package

In this post, the R SuperLearner package is used to predict the values of the test-set part of the prostate dataset.
In the SuperLearner approach, prediction is performed by a weighted combination of the predictions of several different learners. As shown in the code below, the mean square prediction error is 0.319. For comparison, in an earlier post (3-way variable selection in R regression), the mean square errors of linear regression and lasso regression were 0.516 and 0.493, respectively.

Here is the SuperLearner code:
> library(SuperLearner)
> library(ElemStatLearn)
> data(prostate)
#Separate the training and test sets.
> head(prostate)
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829 TRUE
2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189 TRUE
3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189 TRUE
4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189 TRUE
5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636 TRUE
6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678 TRUE
#The variable to be predicted is lpsa. The train variable is a dummy variable that indicates whether a case belongs to the trainset or the testset.
> trainset<-prostate[prostate$train==TRUE,]
> testset<-prostate[prostate$train==FALSE,]
> testset1<-testset[,-10]
> testset2<-testset1[,-9]
> trainset1<-trainset[,-10]
> trainset2<-trainset1[,-9]
#Specify the learners that will be used by the SuperLearner.
> mylibrary<-c("SL.glm","SL.randomForest","SL.svm","SL.glmnet")
#Specify the training set input/output (X, ay below) and the test set input (newX).
> X<-trainset2
> newX<-testset2
> ay<-trainset[,9]
#Call the SuperLearner.
> out<-SuperLearner(ay,X,newX,SL.library=mylibrary)
#Below are the values predicted by the SuperLearner.
> out$SL.predict
[,1]
7 1.814375
9 1.110389
10 1.237715
15 1.871654
22 2.699901
25 1.943528
26 1.977818
28 1.965107
32 1.988913
34 1.227339
36 2.875185
42 2.231019
44 2.346486
48 2.783155
49 2.419382
50 2.120854
53 2.388720
54 3.046706
55 3.001810
57 1.612384
62 3.444458
64 3.635944
65 2.350286
66 2.748382
73 2.853494
74 3.341276
80 3.117150
84 3.186377
95 3.241076
97 3.657895
#Let’s now compute the mean square error between the predicted values and the actual test set values.
> sum=0
> tt<-length(testset)
> for(i in 1:tt) {
+ sum<-sum+(testset[i,9]-out$SL.predict[i])^2
+ }
> sumg<-sum/tt
> sumg
[1] 0.3191994

Two-way ANOVA in R

In this post we look at how to compute the two-way ANOVA of a balanced design. The dataset is weightgain in the HSAUR package; it records the weight gains of rats put on four different diets, defined by two factors: the source of protein, which can be beef or cereal, and the type, which can be high or low.
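
Here is a minimal sketch of the model. It assumes the factor columns are named source and type and the response column weightgain, as in the HSAUR documentation; the full code and graphs are in the link below.

# A minimal sketch of the two-way ANOVA with interaction.
library(HSAUR)

data("weightgain", package = "HSAUR")
fit <- aov(weightgain ~ source * type, data = weightgain)
summary(fit)    # main effects of source and type, plus the source:type interaction

# Interaction plot: roughly parallel lines suggest little interaction
with(weightgain, interaction.plot(source, type, weightgain))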

The R Code and graphs (main effects, interactions) are shown in the link below.

Two-way ANOVA Code Link

Reduction of Regression Prediction Error by Incorporating Var Interactions and Factorization

In this post, we work with the mtcars dataset in R, which has 32 observations and 11 variables. Various regression models were tried on the data, each optimized with respect to AIC using stepwise regression. The prediction error was computed using leave-one-out cross-validation.

The smallest prediction error, and also the smallest regression standard error, was achieved when we incorporated as much knowledge as possible about our independent variables. Specifically, the correlation matrix of the data shows that some of the variables are correlated, and to account for this an interaction term was included in the model. In addition, some of the variables are discrete in nature, taking only a few unique values; this knowledge was incorporated by entering these variables as factors in the regression. The complete code for the development and testing of the models is in the link below.
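
As an illustration of the idea (not the exact model in the linked code), here is a minimal sketch assuming mpg is the response, with one interaction term and a few of the discrete variables entered as factors.

# A minimal sketch; the interaction and factor choices below are illustrative,
# not necessarily those of the final model in the linked code.
data(mtcars)

# Interaction between weight and horsepower; discrete variables entered as factors
fit <- lm(mpg ~ wt * hp + factor(cyl) + factor(am) + factor(gear), data = mtcars)

# AIC-guided stepwise search starting from this model
fit_step <- step(fit, trace = 0)

# Leave-one-out cross-validation of the selected model
loo <- sapply(1:nrow(mtcars), function(i) {
  m <- update(fit_step, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(m, newdata = mtcars[i, ]))^2
})
mean(loo)   # estimated prediction error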

Regression Code Link

Below is a version that takes into account that some categorical variables are ordered. However, the prediction error and the regression standard error remain the same as above:

Regression Code Link

Partial Correlation in R

When computing the correlation between two variables, an interesting question is how it is affected if we control for one or more other variables, where controlling for a variable means holding it constant.

R offers the function pcor.test() in the ppcor package. This function has the form pcor.test(X, Y, Z), where X and Y are the variables for which we want to compute the correlation, and Z consists of column vectors containing the controlling variable(s).
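
For instance, here is a minimal sketch using the mtcars data, chosen only for illustration; the linked code applies pcor.test() to a different dataset.

# A minimal sketch of pcor.test(); mtcars is used purely for illustration.
library(ppcor)

# Ordinary correlation between fuel consumption (mpg) and weight (wt)
cor(mtcars$mpg, mtcars$wt)

# Partial correlation between mpg and wt, controlling for horsepower (hp)
pcor.test(mtcars$mpg, mtcars$wt, mtcars$hp)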

In the link below, we apply pcor.test() to a dataset and compare it with computing correlations by group. The results show that computing correlations by group provides useful insights, while pcor.test() does not.


Code for partial correlation and correlation by group

3-way Variable Selection in R Regression (lasso, stepwise, and best subset)

In this post, you can find code (link below) for doing variable selection in R regression in three different ways. The variable selection was done on the well-known R prostate dataset. The data comes already separated into training and test cases. The regressions were fit on the training data, and the prediction mean square error was then computed on the test data.

  • Stepwise regression: Here we use the R function step(), where the AIC criterion serves as a guide for adding and deleting variables. The model returned by step() is the one that achieves the lowest AIC.
  • Lasso regression: This is a form of penalized regression that performs feature selection inherently. Penalized regression adds bias to the regression equation in order to reduce variance and thereby reduce prediction error and avoid overfitting. Lasso regression sets some coefficients exactly to zero, and therefore performs implicit feature selection.
  • Best subset regression: Here we use the R package leaps, and specifically the function regsubsets(), which returns the best model of each size m = 1, …, n, where n is the number of input variables. A brief sketch of all three approaches follows this list.
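
Here is a minimal sketch of the three approaches, assuming the prostate data from the ElemStatLearn package with lpsa as the response and the train column marking the training cases; the complete code is in the link at the end of the post.

# A minimal sketch of the three approaches; not the post's exact code.
library(ElemStatLearn)   # prostate data
library(glmnet)          # lasso
library(leaps)           # best subset

data(prostate)
train <- prostate[prostate$train == TRUE,  -10]   # drop the train indicator
test  <- prostate[prostate$train == FALSE, -10]

# 1. Stepwise selection guided by AIC
step_fit <- step(lm(lpsa ~ ., data = train), trace = 0)

# 2. Lasso with the penalty chosen by cross-validation
set.seed(1)
x <- as.matrix(train[, -9]); y <- train$lpsa     # lpsa is column 9
cv <- cv.glmnet(x, y, alpha = 1)
lasso_pred <- predict(cv, newx = as.matrix(test[, -9]), s = "lambda.min")

# 3. Best subset selection; inspect Cp and BIC to pick the size
subs <- summary(regsubsets(lpsa ~ ., data = train, nvmax = 8))
subs$cp
subs$bic

# Test mean square error, e.g. for the lasso fit
mean((test$lpsa - lasso_pred)^2)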

Regarding which variables are removed, it is interesting to note that:

  • Lasso regression and stepwise regression result in the removal of the same variable (gleason).
  • In best subset selection, when we select the regression with the smallest Cp (Mallows’ Cp), the best subset is the one of size 7, with one variable removed (gleason again). When we select the subset with the smallest BIC (Bayesian Information Criterion), the best subset is the one of size 2 (the two variables that remain are lcavol and lweight).

Regarding the test error, the smallest values are achieved with lasso regression and with best subset selection using the model of size 2.

Code for regression variable selection


Prediction in R using Ridge Regression

Ridge regression is a regularization method in which the coefficients are shrunk in order to reduce the variance of the solution and thereby improve prediction accuracy. Below we implement ridge regression on the longley and prostate datasets using two functions: lm.ridge() from the MASS package and linearRidge() from the ridge package. Pay special attention to the scaling of the coefficients and the offsetting of the predicted values for the lm.ridge() function.
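
Here is a minimal sketch of the lm.ridge() route on the longley data; the penalty value is fixed purely for illustration, and the full examples are in the link below.

# A minimal sketch using lm.ridge(); the fixed lambda below is illustrative only.
library(MASS)    # lm.ridge()

data(longley)
fit <- lm.ridge(Employed ~ ., data = longley, lambda = 0.05)

# lm.ridge() centers and scales the inputs internally; coef() returns the
# coefficients transformed back to the original scale (intercept first),
# so predictions are assembled by hand:
X <- as.matrix(longley[, names(longley) != "Employed"])
pred <- drop(cbind(1, X) %*% coef(fit))

mean((longley$Employed - pred)^2)   # training-set mean square error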

Ridge regression in R examples