[Frontiers in Bioscience E2, 849-856, June 1, 2010]
Comparison of the predictive qualities of three prognostic models of colorectal cancer

Billie Anderson1, J. Michael Hardin2, Dominik D. Alexander3, Sreelatha Meleth4, William E. Grizzle5, Upender Manne5,6
1. ABSTRACT

Most discoveries of cancer biomarkers involve construction of a single model to predict survival. 'Data-mining' techniques, such as artificial neural networks (ANNs), have been reported to perform better than traditional methods, such as logistic regression. In this study, the quality of multiple predictive models built on a molecular data set for colorectal cancer (CRC) was evaluated. Predictive models (logistic regressions, ANNs, and decision trees) were compared, and the effect of techniques for variable selection on the predictive quality of these models was investigated. The Kolmogorov-Smirnov (KS) statistic was used to compare the models. Overall, the logistic regression and ANN methods outperformed use of a decision tree. In some instances (e.g., for a model that included 'all variables without tumor stage' and use of a decision tree for variable selection), the ANN marginally outperformed logistic regression, although the difference in the KS statistic was minimal (0.80 versus 0.82). Regardless of the variable(s) and the methods for variable selection, all three predictive models identified survivors and non-survivors with the same level of statistical accuracy.

2. INTRODUCTION

Researchers are now examining methods for predicting survival or disease recurrence in cancer patients by use of data-mining techniques, such as artificial neural networks (ANNs), decision trees, and k-nearest neighbor (k-NN). In particular, this is being done to predict the clinical outcome for patients with colorectal cancer (CRC) (1-4). Most of these studies have compared the predictions of ANNs to those of other methods, such as survival analysis or logistic regression. For example, Burke et al. (5) demonstrated that use of the TNM components of tumor staging by themselves in an ANN significantly increased predictive accuracy when compared to a survival-analysis model with the same variables. The predictive accuracy of the variables used in the models was measured by the area under the ROC curve. The ANN increased the predictive accuracy of the model by 44-74%. Also, ANNs were used by this group to build predictive models for breast cancer survival; they found that, compared to the TNM staging, the ANN provided better predictive accuracy (5, 6). However, because additional variables such as the number of positive lymph nodes and p53 status were included in those models, it is not clear whether the improved predictive capacity was a reflection of the ANN method or whether those variables contributed to improving the predictability of the model. Furthermore, the predictive accuracy of a model is a function of the variables in the model, and the technique for variable selection determines the quality of a model (7). ANNs and other data-mining tools are called 'black-box' techniques, since the logic used to determine the final model is not transparent. Although 'black-box' techniques such as ANNs can perform better than traditional methods, the results are not uniform. In some cases, within the same study, the superiority of a classifier seems to depend on the variables used (8). As described above, data-mining techniques such as ANNs have been reported to provide higher predictive accuracy than familiar, traditional models, such as logistic regression (9-13). If the same predictive accuracy can be obtained from logistic regression, which is generally understood by statisticians, basic researchers, and clinicians, it is difficult to justify the expense of the 'black-box' methods in terms of computing time and loss of transparency.
As suggested by some of the studies above, if the accuracy of prediction is variable-dependent, which in turn makes it dependent on the technique for variable selection (8), it is important to determine the effect of these techniques on the accuracy of the predictive model. Therefore, this study aimed to answer two questions: 1) Do data-mining techniques, such as ANNs and decision trees, provide models with higher predictive accuracy than logistic regression? and 2) How does the technique of variable selection affect the predictive accuracy of these models?

3. PATIENT POPULATIONS, MATERIALS, AND METHODS

3.1. Patient populations

As described in previous publications (14, 15), 491 patients who had undergone surgical resection for 'first primary' colorectal carcinoma between 1981 and 1993 at the University of Alabama at Birmingham (UAB) hospital were identified for this study. These patients were identified from the UAB Tumor Registry, following the selection criteria described below. During the initial selection process, patients who died within a week of surgery, whose archival tissues were not available, who had surgical margin involvement, who had an unspecified tumor location, who had multiple primaries within the colorectum, who had multiple malignancies (except non-melanotic lesions of the skin), or who had a family or personal history of CRC were excluded from the study population. To control for treatment bias, only patients who underwent surgery as a therapeutic intervention were included, and patients who received any pre- or post-surgical therapy were excluded. Since adjuvant chemotherapy was not in widespread use during this study's time frame (1981-1993), a large number of Stage III and IV patients (n=212) who had not received adjuvant therapy were included.

3.2. Pathological features

Slides stained with hematoxylin and eosin were reviewed to determine the degree of histologic differentiation, categorized as well differentiated, moderately differentiated, poorly differentiated, or undifferentiated. Tumors that were either well or moderately differentiated were designated as low-grade, and those classified as either poorly differentiated or undifferentiated as high-grade (16). The pathologic staging was determined according to the criteria of the American Joint Committee on Cancer (17). The codes of the International Classification of Diseases for Oncology (ICD-O) were used to specify the anatomic location (colon versus rectum) of the tumor (18).

3.3. Follow-up

Patients were followed by the UAB Tumor Registries until their death or the date of the last documented contact within the study time frame. The tumor registries ascertain outcome information directly from patients (or living relatives) and from the patients' physicians through telephone and mail contacts. This information was further validated against State Death Lists. The tumor registries update information every six months. Follow-up of our cohort ended in August 2008. The median follow-up period of the complete study population was 8.91 years (range, <1 to >20 years).

3.4. Immunohistochemistry

Archival formalin-fixed, paraffin-embedded tissues were collected from the Surgical Pathology Division of the UAB Hospital. Earlier publications describe the immunostaining and evaluation of nuclear accumulation of p53 and of Bcl-2 expression (14, 19, 20). The staining was evaluated semi-quantitatively; the investigators involved were blinded to the patients' clinicopathologic data and their treatment status.
Bcl-2 was expressed in the cell cytoplasm, whereas p53 accumulated in the nucleus (p53nac). Both the percentage of positive cells and the staining intensity were taken into consideration in determining the final immunostaining score (ISS), as described in prior reports (14, 19-21). Molecular marker expression was dichotomized into high and low expressers, based on the cut-off values described below. Consistent with findings from our prior studies (19, 20), an ISS of 0.5 was chosen as the cut-off value for Bcl-2 expression. Only tumor cells with distinct nuclear immunostaining for p53 were considered positive; the tumor was considered positive only if ≥10% of all malignant cells in a tissue section were positive, as described in earlier publications (14, 19-21).

3.5. Statistical analyses

The outcome variable for the predictive models was a binary variable indicating survival (or death) five years post-surgery for patients with CRC (disease-specific five-year survival). The training data set consisted of 234 Caucasians (80% of the initial 292) and 159 African-Americans (80% of 199); the remaining 98 patients formed the validation (test) set. The variables were age (≥65 vs. <65 years), race (Caucasians vs. African-Americans), tumor stage (I & II vs. III & IV), tumor differentiation (low grade vs. high grade), location of tumor (proximal colon, distal colon, and rectum), and p53nac and Bcl-2 expression. All variables were dichotomized (22-24).

3.6. Variable selection

No variable selection was used for the first type of analyses performed on the data set, as all available variables were used to build the predictive models. In subsequent analyses, the two procedures for variable selection were decision-tree selection (25) and stepwise regression. These techniques were performed with SAS® Enterprise Miner, version 5.3, using the software default settings for each selection technique. The two techniques were applied to the full data set and then to data sets stratified by race. The race-based analysis was conducted because our prior studies indicated that p53nac was a strong predictor of poor survival, but only among non-Hispanic Caucasians who had tumors located in the proximal colon. In contrast, p53nac was not a useful prognostic marker for Caucasians with distal tumors or for African-Americans with tumors at any anatomic location in the colorectum (14). For each data set, the techniques for variable selection were conducted by including tumor stage in the list of variables, and again after omitting tumor stage from the list of variables available for selection. Because there was interest in judging the predictive value of the biomarker Bcl-2 in the absence of the tumor-stage variable, the analysis was conducted twice: for the first analysis, the list of all possible variables that could be included in the model contained the tumor-stage variable; for the second, the stage variable was excluded from the variable list.
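To make the data setup concrete, the following is a minimal sketch of the stratified 80/20 training/validation split and a decision-tree-based variable ranking analogous in spirit to the selection step described above. It is not the study's workflow: it uses scikit-learn rather than SAS® Enterprise Miner, and the file name, column names, and tree settings are hypothetical placeholders.

```python
# Illustrative sketch only (not the authors' SAS Enterprise Miner workflow).
# Assumes a hypothetical file with one row per patient and all predictors
# already coded as 0/1 indicators, as in Section 3.5.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("crc_cohort.csv")  # hypothetical data set of 491 patients
predictors = ["age_ge_65", "caucasian", "stage_III_IV",
              "high_grade", "rectal_tumor", "p53nac", "bcl2_high"]
outcome = "died_within_5_years"     # binary outcome (1 = non-survivor)

# 80% training / 20% validation, stratified by race as in Section 3.5
train, test = train_test_split(df, test_size=0.20,
                               stratify=df["caucasian"], random_state=0)

# Decision-tree-based variable selection (comparable in spirit, not identical,
# to the Enterprise Miner default): rank predictors by importance in a shallow tree.
selector_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
selector_tree.fit(train[predictors], train[outcome])
importance = pd.Series(selector_tree.feature_importances_, index=predictors)
print(importance[importance > 0].sort_values(ascending=False))
```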
3.7. Predictive models

By use of all variables, and of the subsets of variables identified by the decision-tree and stepwise-regression methods, three predictive models were built: logistic regression, decision trees, and ANNs. These models were chosen because of their value, as described in recent reports (1-4). The models were built with SAS® Enterprise Miner software. Following is a brief description of each model type.

3.8. Logistic regression

A logistic regression was built to model the probability of survival (or lack of survival) five years post-surgery for CRC. Logistic regression is used to model data when the outcome variable is binary (e.g., survival: yes/no; recurrence: yes/no). The probability of an outcome, p, is related to a set of predictor variables X1, ..., Xk by the equation logit(p) = ln[p/(1 - p)] = b0 + b1X1 + ... + bkXk, where the coefficients b0, ..., bk are estimated from the training data.

3.9. Artificial neural networks (ANN)

Development of the ANN was inspired by the mechanism through which the brain recognizes patterns (26). The goal of an ANN is the same as that of logistic regression: predicting an outcome based on the values of predictor variables. The approach used in developing the ANN model, however, is different. ANNs have the capacity to "learn" mathematical relationships between a series of input (predictor) variables and the corresponding output (outcome) variables. This is achieved by "training" the network with a data set that consists of the predictor variables and a known outcome variable. Once the ANN has been "trained," the model can be used for classification of a validation data set. Figure 1 is a diagram illustrating an ANN that has been trained to predict the probability of a patient dying of CRC five years post-surgery based on only two predictor variables, age and race. ANNs are often represented in diagrams such as this. The circles are known as nodes. A typical ANN consists of three layers of nodes: input, hidden, and output. The values of the predictor variables reside in the input nodes. The output node contains the predicted output of the network. The hidden nodes contain an activation function that allows the network to model complex non-linear associations between the predictor variables and the outcome. Of several activation functions examined, the hyperbolic tangent gave the best results; it was chosen for use in this study. Each input node is connected to each hidden node, and each hidden node is connected to the output node. In this example, there are two input nodes where the values of age (X1) and race (X2) are input into the network, along with a bias weight, which is equivalent to an intercept term in a regression model. The input nodes are connected to the hidden nodes by connection weights (the lines in Figure 1 connecting the input and hidden nodes). The connection weights can be thought of as the ANN equivalent of the beta coefficients in a logistic regression model. At each hidden node, the weighted sum of the inputs is passed through an activation function, most commonly a sigmoid-type function (here, the hyperbolic tangent), which allows the network to model non-linear relationships among the predictor variables and the outcome variable. A second set of connection weights is then applied between the hidden nodes and the output node to obtain the output of the network, which corresponds to the predicted probability of the outcome variable. In the ANN analysis, there were as many input nodes as predictor variables. (The number of input nodes varies depending on the method used for variable selection.) There were three hidden nodes (the default setting in SAS® Enterprise Miner) and one output node (the probability of survival five years after surgery for patients with CRC).
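As a concrete illustration of the two model types just described, the following minimal sketch continues the hypothetical data frame and split from the previous sketch; it fits a logistic regression and a three-hidden-node, hyperbolic-tangent ANN with scikit-learn. The study itself used SAS® Enterprise Miner, so these settings mirror, but do not reproduce, that configuration.

```python
# Illustrative sketch of the classifiers in Sections 3.8-3.9, using scikit-learn
# as a stand-in for SAS Enterprise Miner; reuses train/test from the sketch above.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X_train, y_train = train[predictors], train[outcome]
X_test = test[predictors]

# Logistic regression: models logit(p) = b0 + b1*X1 + ... + bk*Xk
logit_model = LogisticRegression(max_iter=1000)
logit_model.fit(X_train, y_train)
p_logit = logit_model.predict_proba(X_test)[:, 1]

# ANN: one hidden layer with three nodes and a hyperbolic-tangent activation,
# mirroring the defaults described in the text (other settings are assumptions).
ann_model = MLPClassifier(hidden_layer_sizes=(3,), activation="tanh",
                          max_iter=2000, random_state=0)
ann_model.fit(X_train, y_train)
p_ann = ann_model.predict_proba(X_test)[:, 1]
```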
3.10. Decision trees

The type of decision tree used was CART (Classification and Regression Tree) (27); the settings in SAS® Enterprise Miner were chosen to create such a tree. CART is an algorithm that splits the data into smaller segments, called nodes, that are homogeneous with respect to the outcome variable. At each node, the algorithm examines all predictor variables and all values of those predictors to determine the variable and the split value that "best" separate the data into more homogeneous subgroups with respect to the outcome variable. In other words, each node is a classification question, and the branches of the tree are partitions of the data set into different classes (those patients who will or will not survive five years after surgery). This process repeats recursively until no further separation of the data is feasible; the nodes at the ends of the branches of the decision tree therefore represent the different classes. The second part of the algorithm is pruning, which is applied to the decision tree to ensure that the algorithm does not over-fit the training data. At each subsequent split, fewer observations are available; towards the end of the splitting process, the training observations at a particular node can display a pattern that is specific only to those observations and that becomes meaningless, and even detrimental, for prediction when applied to larger populations. Pruning removes the smaller branches that fail to generalize to the validation data set.

3.11. Measures of performance evaluation

The Kolmogorov-Smirnov (KS) statistic, which measures the difference between two distributions, was used to evaluate model performance. The KS statistic is the maximum difference between two empirical cumulative distributions. In this case, the two distributions of interest are the distributions of the model-estimated probabilities for the survival and non-survival groups. If the two distributions are the same, the model does not effectively separate survivors from non-survivors (implying a small KS statistic); significantly different distributions indicate good separation between the two groups (implying a larger KS statistic). Because the KS statistic has a known sampling distribution, a p-value can be computed to determine whether the two distributions are significantly different. The predictive models were built on the training data, and the validation data were used to obtain the KS statistics and the corresponding p-values.
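As a sketch of these two steps, again using scikit-learn and SciPy rather than Enterprise Miner and reusing the hypothetical objects from the earlier sketches, a pruned CART-style tree can be fitted and any model's predicted probabilities evaluated with the two-sample KS test; the pruning parameter is an illustrative assumption, not the study's setting.

```python
# Illustrative sketch of Sections 3.10-3.11: a pruned CART-style tree plus a
# KS-based evaluation of predicted probabilities on the validation set.
from scipy.stats import ks_2samp
from sklearn.tree import DecisionTreeClassifier

# CART-style tree; cost-complexity pruning (ccp_alpha) stands in for the
# pruning step described in the text (exact settings are assumptions).
cart = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
cart.fit(X_train, y_train)
p_tree = cart.predict_proba(X_test)[:, 1]

def ks_evaluation(p_hat, y_true):
    """Maximum distance between the distributions of predicted probabilities
    for observed non-survivors (y = 1) and survivors (y = 0)."""
    stat, p_value = ks_2samp(p_hat[y_true == 1], p_hat[y_true == 0])
    return stat, p_value

y_valid = test[outcome].to_numpy()
for name, p_hat in [("logistic", p_logit), ("ANN", p_ann), ("tree", p_tree)]:
    ks, p = ks_evaluation(p_hat, y_valid)
    print(f"{name}: KS = {ks:.2f} (p = {p:.3g})")
```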
4. RESULTS

Table 1 displays the KS statistics for each of the predictive models when all variables were used. For all groups of interest (model categories), the logistic-regression and ANN methods outperformed the decision-tree method, except for the group of African-Americans without tumor stage. The ANN outperformed the logistic-regression method in the groups 'Caucasians without tumor stage' and 'African-Americans with tumor stage.' Table 2 displays the KS statistics for the predictive models when a decision tree was used for variable selection. For the three groups that included the tumor-stage variable, the decision tree selected only tumor stage as a predictive variable. For the four groups that were stratified by race, all three predictive models had the same KS statistic, regardless of the variable(s) used in building the model. The group 'all variables with stage' also had the same KS statistic for all three predictive models. The only group in which there was a difference among the KS statistics was 'all variables without stage,' which had a slightly higher KS statistic for the ANN than for the logistic-regression method, with both of these models outperforming the decision tree. Table 3 presents the KS statistics for the models when stepwise regression was used as the variable selection technique. Once again, the groups that were stratified by race had the same KS statistic for all three predictive models. The logistic regression and the decision tree had the same KS statistic for the group 'all variables with stage,' in which these models outperformed the ANN. The ANN model was best for the group 'all variables without stage.'

5. DISCUSSION

Although the focus of this manuscript is a comparison of the predictive quality of three statistical models, it is noteworthy that increased phenotypic expression of Bcl-2 in colorectal cancer tissues emerged as a strong predictor of five-year post-surgery survival, especially for non-Hispanic Caucasian patients, when the tumor-stage variables were not included (Tables 2 and 3). Another clinically relevant finding is that the pathologic feature tumor differentiation is an important predictor of survival for both African-American and Caucasian patients when information on tumor stage is not available. These findings are relevant to the treatment of colorectal cancer, particularly in predicting the outcome of patients who undergo excisional biopsies. Since information on all components of TNM staging cannot be obtained from histologic assessment of biopsy specimens, these findings may be useful in assessing the aggressiveness of the tumor at the time of biopsy. In the analyses, regardless of the variable(s) and the methods for variable selection used in the models, all three prediction approaches (logistic regression, neural networks, and decision-tree-based models) identified survivors and non-survivors with the same level of statistical accuracy. In a study in which the predictor variables have been dichotomized, there is the potential for residual confounding in the prediction analysis. The value of categorizing continuous predictor variables in medical studies has been discussed extensively; most of that discussion has focused on issues such as power, correct specification of the association between the outcome and the predictors, or the selection of cut-points for the continuous predictors (28, 29).
Placing a continuous predictor variable into two categories reduces the bias in the model by up to 64% (30, 31). Most of the confounding from a predictor variable is removed by placing the variable into two categories; it is rarely necessary to have more than five categories (32). Overall, the logistic-regression and ANN techniques outperformed the decision-tree method. In the instances in which the ANN outperformed logistic regression (such as the group 'all variables without stage' with a decision tree as the variable selection technique), the difference in the KS statistic was minimal (0.80 versus 0.82). In cases such as these, it is difficult to justify use of an ANN, which is a much more complex model and is not as well understood as logistic regression. However, logistic regression analysis does not take censoring into account. The recommendation from this study is that, unless the ANN outperforms a simpler model (such as a logistic regression or a decision tree), the simpler model should be used; relative to interpretation and understandability, more will be gained by use of the less complex model, although it may not be quite as predictive. The techniques for variable selection also had a significant effect on the predictive accuracy of the models. For example, the highest KS statistics were obtained when all variables were used; the models involving all of the predictor variables outperformed the models built by use of the two techniques for variable selection across all model categories. Yet, when the models were compared across the different variable selection techniques, there was less consistency. For example, Table 2 shows that the logistic regression and ANN developed with a decision-tree approach for variable selection outperform the logistic regression and ANN developed with stepwise regression for the model category 'Caucasians excluding tumor stage.' Table 3 shows that the logistic regression and decision tree outperform the corresponding models in Table 2 for the model category that included all variables, including tumor stage. Tables 2 and 3 demonstrate that, when variable selection procedures are used to develop predictive models, the results are not consistent (i.e., no one model consistently outperforms the others across the different model categories). In conclusion, various models can be used to predict the survival of CRC patients five years post-surgery. This study used stepwise regression and decision trees to select the variables to be entered into the models. The KS statistic was then used on a validation data set to determine how well the models separated the survival and non-survival groups by use of the predictor variables chosen by the variable selection techniques. This proof-of-principle study demonstrates the potential of predictive models to assess the survival probabilities of patients with CRC. These findings may be useful in cancer biomarker data analyses, specifically for addressing a binary outcome with all binary predictor variables. We are now examining the capacity of these models to discover additional associations and patterns in the prognostic markers of CRC.

6. ACKNOWLEDGEMENTS

This work was supported in part by grants from the National Institutes of Health/National Cancer Institute to Dr. U. Manne (R01-CA98932, U54-CA118948, and R03-CA139629) and to Dr. W.E. Grizzle (the Early Detection Research Network, U24-CA086359).
We thank Dr. Donald L. Hill, Division of Preventive Medicine, University of Alabama at Birmingham, for his critical review of this manuscript.

7. REFERENCES

1. Anand, S. S., A. E. Smith, P. W. Hamilton, J. S. Anand, J. G. Hughes & P. H. Bartels: An evaluation of intelligent prognostic systems for colorectal cancer. Artif Intell Med, 15, 193-214 (1999)

9. Finne, P., R. Finne, A. Auvinen, H. Juusela, J. Aro, L. Maattanen, M. Hakama, S. Rannikko, T. L. Tammela & U. Stenman: Predicting the outcome of prostate biopsy in screen-positive men by a multilayer perceptron network. Urology, 56, 418-422 (2000)
Key Words: Artificial neural networks, Colorectal cancer, Decision trees, Kolmogorov-Smirnov statistic, Logistic regression, Predictive models

Send correspondence to: Upender Manne, Department of Pathology, University of Alabama at Birmingham, 515B1 Kracke Building, 619 19th Street South, Birmingham, AL 35294-7331, Tel: 205-934-4276, Fax: 205-934-4418, E-mail: manne@uab.edu