[Frontiers in Bioscience E2, 849-856, June 1, 2010]

Comparison of the predictive qualities of three prognostic models of colorectal cancer

Billie Anderson1, J. Michael Hardin2, Dominik D. Alexander3, Sreelatha Meleth4, William E. Grizzle5, Upender Manne5,6

1SAS Institute, Cary, NC, 2Department of Information Systems, Statistics, and Management, University of Alabama at Tuscaloosa, AL, 3Exponent Health Sciences, Wood Dale, IL, 4Division of Preventive Medicine, 5Department of Pathology, 6Comprehensive Cancer Center, University of Alabama at Birmingham, Birmingham, AL 35294

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Patient populations, materials and methods
3.1. Patient populations
3.2. Pathological features
3.3. Follow-up
3.4. Immunohistochemistry
3.5. Statistical analyses
3.6. Variable selection
3.7. Predictive models
3.8. Logistic regression
3.9. Artificial neural networks (ANN)
3.10. Decision trees
3.11. Measures of performance evaluation
4. Results
5. Discussion
6. Acknowledgements
7. References

1. ABSTRACT

Most discoveries of cancer biomarkers involve construction of a single model to determine predictions of survival. 'Data-mining' techniques, such as artificial neural networks (ANNs), perform better than traditional methods, such as logistic regression. In this study, the quality of multiple predictive models built on a molecular data set for colorectal cancer (CRC) was evaluated. Predictive models (logistic regressions, ANNs, and decision trees) were compared, and the effect of techniques for variable selection on the predictive quality of these models was investigated. The Kolmogorov-Smirnov (KS) statistic was used to compare the models. Overall, the logistic regression and ANN methods outperformed the decision tree. In some instances (e.g., for a model that included 'all variables without tumor stage' and used a decision tree for variable selection), the ANN marginally outperformed logistic regression, although the difference in the KS statistic was minimal (0.80 versus 0.82). Regardless of the variable(s) and the methods for variable selection, all three predictive models identified survivors and non-survivors with the same level of statistical accuracy.
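
To illustrate how a KS statistic can rank predictive models, the following is a minimal sketch in Python. It assumes the common two-sample form, in which KS is the largest gap between the empirical cumulative distributions of predicted scores for the two outcome groups; the abstract does not state the exact computation used in the study. The function name ks_statistic and the toy scores and labels are hypothetical and are not taken from the study data.

```python
from bisect import bisect_right


def ks_statistic(scores, labels):
    """Two-sample KS statistic for a binary classifier (illustrative sketch).

    scores: predicted probabilities or scores, one per patient.
    labels: 1 for one outcome group (e.g., survivor), 0 for the other.
    Returns the maximum absolute gap between the empirical CDFs of the
    scores in the two groups (0 = no separation, 1 = perfect separation).
    """
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not pos or not neg:
        raise ValueError("both outcome groups must be present")

    best = 0.0
    # Evaluate both empirical CDFs at every observed score value.
    for t in sorted(set(scores)):
        cdf_pos = bisect_right(pos, t) / len(pos)
        cdf_neg = bisect_right(neg, t) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Hypothetical toy data: eight patients scored by one model.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_ks = ks_statistic(toy_scores, toy_labels)
print(f"KS = {toy_ks:.2f}")
```

Under this definition, a model whose scores separate survivors from non-survivors more cleanly yields a larger KS value, so values such as 0.80 and 0.82 would indicate nearly identical separation.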