# How to Compute the Statistical Significance of Two Classifiers’ Performance Difference

A question I see quite often in scientific forums is how to determine whether the observed performance difference between two classifiers is statistically significant. Let us work through an example of how to compute this statistical significance (click on the link below):

**How to Compute the Statistical Significance of Two Classifiers’ Performance Difference**

# Principal Components-Based Modeling for the Prediction of a Brand’s Image Rating

All brands strive to have an excellent **Image** in their respective markets, where **Image** comprises such components as reputation and trustworthiness. To track this, companies often conduct surveys among their customers to inquire about their satisfaction, their complaints, and whether their expectations are being met. An interesting question that arises is whether, given such survey results, we can predict the brand’s Image.

Package **plspm** in R contains a dataset called **mobile**, whose variables pertain to the above issue. Specifically, it contains 24 variables that encapsulate the following latent (underlying) concepts: **Image, Expectations, Quality, Value, Satisfaction, Complaints, and Loyalty.** The variables are on a 0–100 scale. For example, the **Image** latent concept has five corresponding variables in the dataset, and the **Expectations** latent concept has three. There are a total of 250 observations in the dataset.

A methodology often used for data with latent concepts is **Principal Components Analysis (PCA)**, which is based on the computation of the eigenvectors and eigenvalues of the data covariance matrix. In this post, we will employ **Principal Components Regression (PCR), Partial Least Squares Regression (PLSR),** and **Ridge Regression** (all related to PCA) to build a model and predict the average **Image** rating of the products in the dataset. The response variable will therefore be the average of the 5 variables related to **Image**, and the predictors will be the variables related to the other concepts (Value, Satisfaction, etc.).

Although all of the aforementioned regressions are based on the principal-components idea, they differ in how they treat the low- and high-variance directions. PCR keeps a certain number of high-variance directions and throws away the rest. PLSR inflates some of the high-variance directions while shrinking the low-variance directions [Has13]. Ridge Regression, on the other hand, shrinks all directions, but with a preference for the high-variance ones (i.e., it shrinks the low-variance directions more). Finally, for comparison purposes, we also compute the ordinary linear regression on the predictors. As shown in the code below, **PCR** and **PLSR** do best in predicting the average **Image** rating of each product. This could indicate that, for this dataset, keeping/inflating the high-variance directions produces better predictions than shrinking them less than the low-variance directions, as Ridge Regression does.

[Has13] Hastie, T., Tibshirani, R. and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2013.
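The PCR idea above can be illustrated in a minimal, self-contained base-R sketch: compute the principal components, keep only the leading high-variance directions, and regress on those scores. (The post itself uses the **mobile** data; the simulated data and the choice of k here are purely illustrative.)

```r
set.seed(3)
n <- 250
X <- matrix(rnorm(n * 6), n, 6)
colnames(X) <- paste0("x", 1:6)
# Simulated response driven by the first two columns only
y <- drop(X %*% c(3, 2, 0, 0, 0, 0) + rnorm(n))

pc <- prcomp(X, scale. = TRUE)   # eigen-decomposition of the correlation matrix
k  <- 3                          # keep the 3 highest-variance directions
fit <- lm(y ~ pc$x[, 1:k])       # regress on the leading principal-component scores
summary(fit)$r.squared
```

In a real application, k would be chosen by cross-validation, which is what the `pcr()` and `plsr()` functions of the **pls** package automate.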

Below is the link for the full R code:

# An Example of R Logistic Regression for Weather Prediction (updated 10/17)

In this post, we will see how we can use logistic regression in R, invoked via the generalized linear model function glm(), to predict the value of a dichotomous variable that indicates whether or not it will rain in Australia. We use the weather dataset in the rattle package.
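The shape of the glm() call can be sketched on simulated rain data (the actual post uses the rattle weather dataset; the predictor names and coefficients below are illustrative, not from that data):

```r
set.seed(1)
n <- 200
humidity <- runif(n, 20, 100)          # simulated afternoon humidity (%)
pressure <- rnorm(n, 1015, 7)          # simulated pressure (hPa)
# Simulated probability of rain: rises with humidity
p <- plogis(-40 + 0.05 * humidity + 0.035 * pressure)
rain <- rbinom(n, 1, p)                # dichotomous response: 1 = rain, 0 = no rain

# family = binomial is what makes glm() fit a logistic regression
fit <- glm(rain ~ humidity + pressure, family = binomial)
# Predicted probabilities, thresholded at 0.5 into a rain/no-rain call
pred <- ifelse(predict(fit, type = "response") > 0.5, "Rain", "NoRain")
```

The same pattern applies to the real data: a Yes/No response on the left of the formula, weather measurements on the right.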

Below is the link for the full code:

https://drive.google.com/file/d/0B5vwqG-vGN-ZUE03QkJNYU1jMUk/view?usp=sharing

# Prediction using the R SuperLearner package

In this post, the R SuperLearner package is used to predict the values of the test-set part of the prostate dataset.

In the SuperLearner approach, prediction is performed by combining weighted versions of different learners. As shown in the code below, the mean squared prediction error is 0.319. For comparison, in an earlier post (3-way variable selection in R regression), the mean squared errors of linear regression and lasso regression were 0.516 and 0.493, respectively.

Here is the SuperLearner code:

```r
> library(SuperLearner)
> library(ElemStatLearn)
> data(prostate)
# Separate the training and test sets.
> head(prostate)
      lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa train
1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829  TRUE
2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189  TRUE
3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189  TRUE
4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189  TRUE
5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636  TRUE
6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678  TRUE
# The variable to be predicted is lpsa. The train variable is a dummy variable
# that indicates whether a case belongs to the training set or the test set.
> trainset <- prostate[prostate$train==TRUE,]
> testset <- prostate[prostate$train==FALSE,]
> testset1 <- testset[,-10]
> testset2 <- testset1[,-9]
> trainset1 <- trainset[,-10]
> trainset2 <- trainset1[,-9]
# Specify the learners that will be used by the SuperLearner.
> mylibrary <- c("SL.glm","SL.randomForest","SL.svm","SL.glmnet")
# Specify the training-set input/output (X, ay below) and the test-set input (newX).
> X <- trainset2
> newX <- testset2
> ay <- trainset[,9]
# Call the SuperLearner.
> out <- SuperLearner(ay, X, newX, SL.library=mylibrary)
# These are the values predicted by the SuperLearner.
> out$SL.predict
       [,1]
7  1.814375
9  1.110389
10 1.237715
15 1.871654
22 2.699901
25 1.943528
26 1.977818
28 1.965107
32 1.988913
34 1.227339
36 2.875185
42 2.231019
44 2.346486
48 2.783155
49 2.419382
50 2.120854
53 2.388720
54 3.046706
55 3.001810
57 1.612384
62 3.444458
64 3.635944
65 2.350286
66 2.748382
73 2.853494
74 3.341276
80 3.117150
84 3.186377
95 3.241076
97 3.657895
# Now compute the mean squared error between the predicted values and the
# actual test-set values.
> sum <- 0
> tt <- length(testset)  # caution: length() on a data frame counts its columns
+                        # (10 here); nrow(testset) would cover all 30 test cases
> for(i in 1:tt) {
+   sum <- sum + (testset[i,9] - out$SL.predict[i])^2
+ }
> sumg <- sum/tt
> sumg
[1] 0.3191994
```

# Two-way ANOVA in R

In this post we look at how to compute the two-way ANOVA of a balanced design. The dataset is weightgain in package HSAUR, and it shows the weight gains of rats put on four different diets, with two varying factors (the source of protein, which can be beef or cereal, and the type, which can be high or low).
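The structure of the analysis can be sketched in base R on simulated balanced data (the post itself uses the HSAUR weightgain data; the effect sizes below are invented for illustration):

```r
set.seed(42)
# Balanced 2x2 design: 10 replicates per cell, as in a balanced two-way ANOVA
dat <- expand.grid(source = c("Beef", "Cereal"),
                   type   = c("High", "Low"),
                   rep    = 1:10)
# Simulated weight gain with a type effect, a small source effect, and noise
dat$gain <- 80 + 10 * (dat$type == "High") +
            3 * (dat$source == "Beef") + rnorm(nrow(dat), 0, 5)

# source * type expands to both main effects plus their interaction
fit <- aov(gain ~ source * type, data = dat)
summary(fit)
```

The same `aov(weightgain ~ source * type, ...)` formula is what the linked code applies to the real data, along with main-effect and interaction plots.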

The R Code and graphs (main effects, interactions) are shown in the link below.

# Reduction of Regression Prediction Error by Incorporating Var Interactions and Factorization

In this post, we work with the dataset mtcars in R. The dataset has 32 observations and 11 variables. Various regression models were tried on this dataset. Each of these models was optimized with respect to AIC, using stepwise regression. The prediction error was computed using leave-one-out cross-validation.

**The smallest prediction error, and also the smallest regression standard error, was achieved when we incorporated as much knowledge as possible about our independent variables.** Specifically, looking at the correlation matrix of the data, one can see that some of the variables are correlated; to account for that, an interaction term was included in the model. In addition, some of the variables are of a discrete nature, taking only a few unique values. This knowledge was incorporated into the regression by entering these variables as factors in the model. The complete code for the development and testing of the models is in the link below.

Below is a version that takes into account that some categorical variables are ordered. However, the prediction error and the regression standard error remain the same as above:
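The modeling idea described above can be sketched as follows; the particular factors and the interaction term here are illustrative choices, not necessarily the model selected in the linked code:

```r
data(mtcars)
cars <- mtcars
# Discrete-valued variables entered as factors rather than as numeric
cars$cyl  <- factor(cars$cyl)
cars$gear <- factor(cars$gear)
# hp * wt expands to hp + wt + hp:wt, capturing a correlated pair's interaction
full <- lm(mpg ~ cyl + gear + hp * wt, data = cars)
# AIC-guided stepwise selection over the candidate terms
best <- step(full, trace = 0)
summary(best)
```

Leave-one-out cross-validation would then be wrapped around this fit to estimate the prediction error.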

# Partial Correlation in R

When computing the correlation between two variables, an interesting question is how it is affected if we control for one or more other variables, where controlling for a variable means holding it constant.

R offers the function *pcor.test()* in the *ppcor* package. This function has the format pcor.test(X, Y, Z), where X and Y are the variables whose correlation we want to compute, while Z consists of column vectors containing the controlling variable(s).
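What *pcor.test()* computes can be reproduced in base R from the definition of partial correlation: regress each variable on the controls and correlate the residuals. A minimal sketch on simulated data (the variables here are invented; *pcor.test()* additionally returns a p-value):

```r
set.seed(1)
z <- rnorm(200)                 # common driver, to be controlled for
x <- z + rnorm(200, sd = 0.5)
y <- z + rnorm(200, sd = 0.5)

raw <- cor(x, y)                # inflated because z drives both x and y
# Partial correlation of x and y controlling for z:
# correlate what is left of each after removing z's linear effect
pc <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
```

Here the raw correlation is large, while the partial correlation is near zero once z is held constant.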

In the link below, we apply *pcor.test()* to a dataset and compare it with computing correlations by group. The results show that computing correlations by group provides useful insights, while pcor.test() does not.

# 3-way Variable Selection in R Regression (lasso, stepwise, and best subset)

In this post, you can find code (link below) for doing variable selection in R regression in three different ways. The variable selection was done on the well-known R dataset *prostate*, which comes pre-separated into train and test cases. The regressions were applied to the training data, and then the prediction mean squared error was computed on the test data.

- **Stepwise regression**: Here we use the R function *step()*, where the *AIC* criterion serves as a guide to add/delete variables. The regression returned by *step()* has achieved the lowest *AIC*.
- **Lasso regression**: This is a form of penalized regression that does feature selection inherently. Penalized regression adds bias to the regression equation in order to reduce variance and, therefore, reduce prediction error and avoid overfitting. Lasso regression sets some coefficients to zero, thereby performing implicit feature selection.
- **Best subset regression**: Here we use the R package *leaps*, and specifically the function *regsubsets()*, which returns the best model of each size m = 1, …, n, where n is the number of input variables.

Regarding which variables are removed, it is interesting to note that:

- Lasso regression and stepwise regression result in the removal of the same variable (*gleason*).
- In best subset selection, when we select the regression with the smallest Cp (Mallows’ Cp), the best subset is the one of size 7, with one variable removed (*gleason* again). When we select the subset with the smallest *BIC* (Bayesian Information Criterion), the best subset is the one of size 2 (the two variables that remain are *lcavol* and *lweight*).

Regarding the test error, the smallest values are achieved with lasso regression and best subset selection with regression of size 2.
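The stepwise leg of the comparison can be sketched in base R on simulated data (the lasso leg needs the *glmnet* package and best subset needs *leaps*, so they are omitted here; the data and coefficients below are invented for illustration):

```r
set.seed(7)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)
colnames(X) <- paste0("x", 1:5)
# Only x1 and x2 actually drive the response
y <- drop(2 * X[, 1] - X[, 2] + rnorm(n))
dat <- data.frame(y = y, X)

full <- lm(y ~ ., data = dat)
# AIC-guided elimination: step() drops terms while AIC keeps improving
sel <- step(full, trace = 0)
names(coef(sel))   # the informative predictors survive the selection
```

With *glmnet* and *leaps*, the analogous calls are `cv.glmnet()` for the lasso and `regsubsets()` for best subsets, as used in the linked code.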

Code for regression variable selection

# Prediction in R using Ridge Regression

Ridge regression is a regularization method in which the coefficients are shrunk, with the purpose of reducing the variance of the solution and thereby improving prediction accuracy. Below we implement ridge regression on the *longley* and *prostate* data sets using two methods: the *lm.ridge()* function and the *linearRidge()* function. **Pay special attention to the scaling of the coefficients and the offsetting of the predicted values for lm.ridge().**
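The scaling point can be sketched with *lm.ridge()* on the longley data (MASS ships with R; the predictor subset and the lambda value here are illustrative). Internally, *lm.ridge()* fits on centered and scaled predictors, so there is no usable fitted-values component; `coef()` rescales the coefficients back to the original units, including an intercept, and predictions must be assembled by hand:

```r
library(MASS)   # provides lm.ridge()
data(longley)

fit <- lm.ridge(Employed ~ GNP + Unemployed + Armed.Forces,
                data = longley, lambda = 0.1)

# coef() undoes the internal scaling: original-scale coefficients plus intercept
b <- coef(fit)

# Predictions offset by the intercept: prepend a column of 1s to the predictors
Xmat <- as.matrix(cbind(1, longley[, c("GNP", "Unemployed", "Armed.Forces")]))
pred <- drop(Xmat %*% b)
```

Forgetting either step (using the scaled coefficients directly, or omitting the intercept column) is the usual source of wildly wrong ridge predictions.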