Classification rule based on Bayesian naive Bayes models with feature selection bias corrected

Longhai Li, Department of Mathematics and Statistics, University of Saskatchewan


Permission is granted for anyone to copy, use, modify, or distribute these programs and accompanying documents for any purpose, provided this copyright notice is retained and prominently displayed, and note is made of any changes made to these programs. These programs and documents are distributed without any warranty, express or implied. As the programs were written for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these programs is entirely at the user's own risk.


This R package predicts a binary response from high-dimensional binary features using Bayesian naive Bayes models. The software also accepts real-valued features, which are converted to binary by thresholding at medians estimated from the data. A small number of features can be selected based on their correlations with the response, and the bias due to this selection can be corrected.
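The two preprocessing steps described above, median thresholding and correlation-based feature selection, can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper names `binarize_at_median` and `select_top_features` are hypothetical and are not functions of this package, the strict `>` comparison at the median is an assumed convention, and the bias correction itself is not shown.

```r
# Hypothetical helper (not part of the package): threshold every column of a
# real-valued matrix at its own sample median. Using a strict ">" is an
# assumption; the package may treat ties differently.
binarize_at_median <- function(x) {
  med <- apply(x, 2, median)
  # sweep() compares each column with its median; multiplying by 1L turns
  # the logical result into a 0/1 integer matrix
  sweep(x, 2, med, ">") * 1L
}

# Hypothetical helper: indices of the k features whose absolute sample
# correlation with the response is largest.
select_top_features <- function(xb, y, k) {
  r <- abs(suppressWarnings(cor(xb, y)))  # p x 1 matrix of |correlations|
  r[is.na(r)] <- 0                        # constant columns carry no signal
  order(r, decreasing = TRUE)[seq_len(k)]
}

# Toy data (made up for illustration): 20 cases, 50 features, one of which
# is strongly shifted in the second class.
set.seed(1)
n <- 20
p <- 50
y <- rep(c(0, 1), each = n / 2)
x <- matrix(rnorm(n * p), n, p)
x[, 3] <- x[, 3] + 6 * y

xb <- binarize_at_median(x)
top <- select_top_features(xb, y, k = 4)
print(top)
```

Because the features are chosen for their correlation with the response, a model fitted only to the selected features sees an optimistically strong signal; that is the selection bias the package is designed to correct.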

A shortcut function for running cross-validation with the classifier is also provided.

The software is best suited to very high-dimensional data, for example cancer diagnosis based on gene expression data.

Source Packages and Documentation


  • Li, L., Zhang, J., and Neal, R. M. (2008), A Method for Avoiding Bias from Feature Selection with Application to Naive Bayes Classification Models, Bayesian Analysis, volume 3, number 1, pp. 171-196: abstract

  • Li, L. (2007), Bayesian Classification and Regression with High Dimensional Features, Ph.D. thesis, University of Toronto: abstract

    Instructions for Installing an R Package and Using R

    Click here.

    Example of classification with the colon gene expression data

    The original real-valued colon data in R format: colon.rda. The binary colon data in R format: colon.bin.rda. The data cover 62 patients (split 40 vs 22 between the two classes) and 2000 genes. Either file can be loaded into the R workspace with the "load" function:

    > load("colon.bin.rda")

    Test how well the above method performs with leave-one-out cross-validation:


    The result of the above R command is shown in cv-colon-result. The error rate of the above analysis is 0.0967742, i.e. 6 out of 62 cases were misclassified. This is the lowest error rate for the colon data when compared with the results collected by Prof. Tibshirani. Only 4 of the 2000 features were selected in each cross-validation iteration. The method is also fast, taking 103 seconds in total for the 62-fold cross-validation, including the time for feature selection, and it is conceptually simple.
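    The general shape of such an analysis, leave-one-out cross-validation with feature selection repeated inside every fold, can be sketched in base R as follows. This is not the package's command or its model: `nb_predict` and `loo_error` are hypothetical names, the classifier is a plain naive Bayes with add-one smoothing used as a stand-in (no Bayesian averaging and no bias correction), and the data are simulated rather than the colon data.

```r
# Stand-in classifier (an assumption, not the package's model): plain naive
# Bayes on binary features with add-one smoothing.
nb_predict <- function(x_train, y_train, x_new) {
  log_post <- sapply(c(0, 1), function(cl) {
    rows <- y_train == cl
    # smoothed estimate of P(feature = 1 | class = cl)
    theta <- (colSums(x_train[rows, , drop = FALSE]) + 1) / (sum(rows) + 2)
    # log prior plus log likelihood of the new case under class cl
    log(mean(rows)) + sum(x_new * log(theta) + (1 - x_new) * log(1 - theta))
  })
  which.max(log_post) - 1  # predicted class, 0 or 1
}

# Leave-one-out error rate with the k features re-selected in every fold,
# using only the training cases of that fold.
loo_error <- function(x, y, k) {
  wrong <- 0
  for (i in seq_along(y)) {
    x_tr <- x[-i, , drop = FALSE]
    y_tr <- y[-i]
    r <- abs(suppressWarnings(cor(x_tr, y_tr)))
    r[is.na(r)] <- 0
    keep <- order(r, decreasing = TRUE)[seq_len(k)]
    pred <- nb_predict(x_tr[, keep, drop = FALSE], y_tr, x[i, keep])
    wrong <- wrong + (pred != y[i])
  }
  wrong / length(y)
}

# Simulated binary data (made up for illustration): 30 cases, 40 features,
# the first of which matches the response exactly.
set.seed(2)
n <- 30
p <- 40
y <- rep(c(0, 1), length.out = n)
x <- matrix(rbinom(n * p, 1, 0.5), n, p)
x[, 1] <- y

err <- loo_error(x, y, k = 4)
print(err)
```

    The error rate is the number of misclassified held-out cases divided by the number of cases, the same ratio as the 6/62 reported above. Selecting features inside each fold, rather than once on the full data, keeps the held-out case from influencing which features are used to predict it.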