[Frontiers in Bioscience E2, 171-186, January 1, 2010] |
|
|
Cross phenotype normalization of microarray data Jianhua Xuan1, Yue Wang1, Eric Hoffman2, Robert Clarke3
1 TABLE OF CONTENTS
1. ABSTRACT Normalization is a prerequisite for almost all follow-up steps in microarray data analysis. Accurate normalization across different experiments and phenotypes assures a common base for comparative yet quantitative studies using gene expression data. In this paper, we report a comparison study of four normalization approaches, namely, linear regression (LR), Loess regression, invariant ranking (IR) and iterative nonlinear regression (INR) method, for gene expression. Among these four methods, LR and Loess regression methods use all available genes to estimate either a linear or nonlinear normalization function; while IR and INR methods feature some iterative processes to identify invariantly expressed genes (IEGs) for nonlinear normalization. We tested these normalization approaches on three real microarray data sets and evaluated their performance in terms of variance reduction and fold-change preservation. By comparison, we found that (1) LR method exhibits the worst performance in both variance reduction and fold-change preservation, and (2) INR method shows an improved performance in achieving low expression variance across replicates and excellent fold-change preservation for differently expressed genes. 2. INTRODUCTION DNA microarray technology has enabled high-throughput measurements of tens of thousands of mRNA levels, providing us a powerful tool to investigate biochemical pathways and gene regulatory networks, to identify phenotype-specific biomarkers, to assess cellular response to drug compounds, and to classify disease states at molecular level. For example, recent studies in cancer research demonstrate that gene expression profiling can reveal distinct tumor subtypes not evident by traditional histopathological methods (1). Although it is optimistic to assume that gene expression data alone will be sufficient for the reconstruction of complete regulatory pathways, several recent studies successfully demonstrate the potential for inferring regulatory networks from gene expression data (3). While high-throughput measurements of gene expression levels are likely to provide important information about cellular processes (e.g., revealing previously unrecognized patterns of gene regulation) and generate new hypotheses warranting further study, widespread use of microarray profiling methods is limited by the need for further technology developments, particularly computational bioinformatics tools not previously included by the instruments. Recently, much effort has been devoted to the development of high-level data analysis tools such as clustering (4-6), classification (2,7,8) and Bayesian network methods (3). As more and more computational tools are made available to researchers, it has become increasingly clear that the key issue in microarray data analysis is how to extract quality information about the biological system being studied. As a first step in accurately exacting biological information, it is necessary to filter out experimental noise and correct for systematic errors confounding the raw data obtained by this complex technology. Potential sources of systematic errors include array surface chemistry, microarray printing, labeling methods, hybridization parameters, image analysis and RNA isolation (9-11). The process to correct for systematic error, generally termed normalization, is introduced to correct the differences across different arrays in probe labeling, probe concentration, hybridization efficiency and potentially other factors. Normalizing multiple arrays to allow quantitative follow-up analyses presents one of the great challenges in microarray data analysis. Many normalization methods have been proposed in literature, the popular ones include global normalization or linear regression (LR) (12), Loess normalization (13), invariant ranking (IR) method (14), quantile normalization (15), and iterative nonlinear regression (INR) method (16,17). Regardless of their large technical differences, two basic steps in these methods involve: (1) selection of reference genes for normalization and (2) choice of a linear or nonlinear regression function for normalization (11). For instances, Affymetrix's global normalization method uses all the genes for normalization with a linear regression function; Loess normalization method also uses all the genes for normalization but with a nonlinear regression function derived from M-A plots (18). In contrast, IR and INR methods use a subset of genes (i.e., rank invariant genes) for deriving a nonlinear regression function for normalization (14,16), while quantile normalization uses all the genes but the transformation function is derived in such a way that makes the distribution for each array in a set of arrays the same (15). In addition, housekeeping genes were used in the past for normalization under the assumption that they are constantly expressed genes (19), while in fact the expression levels of housekeeping genes can vary significantly (20). Exogenous control genes can also be used for normalization, and many reports have supported that it is an excellent and universally applicable normalization strategy (21). Profound effect of normalization has been found on detection of differentially expressed genes and classification of phenotypes (22,23). Hoffmann et al. employed four different normalization methods and all possible combinations with three statistical algorithms for detection of differentially expressed genes (22). They found that the influence of normalization is significantly higher than that of three subsequent statistical analysis procedures examined. Hua et al. used a model-based approach to generate synthetic gene expression values for studying normalization procedures' impact on classification performance. Their experiment results demonstrated that normalization could have a significant benefit for classification under difficult experimental conditions (23). In this paper, we report the experimental results from a comparison study of four normalization methods, namely, LR, Loess, IR and INR. We tested the normalization methods on three real and representative microarray data sets and evaluated their performance in terms of variance reduction and fold-change preservation. The performance in variance reduction mainly reflects the ability of a normalization method in correcting system error to make consistent gene expression measurements across multiple arrays. The performance of fold-change preservation, on the other end, shows the ability of a cross phenotype normalization method to reveal true phenotypic changes in gene expression measurements, which is critical to follow-up analyses to detect differentially expressed genes and classify different phenotypes. Note that we use the term "cross phenotype normalization" here to emphasize the importance of fold-change preservation. Between different phenotypes, the gene expressions are more diverse than within a phenotype. Hence, the normalization based on whole gene population (like LR method and Loess method) will introduce even large variance, impacting the fold-changes of the differentially expressed genes particularly. INR and IR are two methods that are based on invariantly expressed genes (IEGs) for normalization, although different ways are used to select the IEGs. We believe that for cross phenotype normalization, this is an important strategy to combat the large variance between different phenotypes. 3. Normalization Methods There have been many approaches proposed to normalize microarray gene expression data. Even though a variety of normalization strategies exit, we can categorize different normalization methods into three practical approaches: (1) global approach, (2) invariant gene approach and (3) exogenous control gene approach. The global approach is based on the assumption that the total mass of mRNA per cell is constant. Consequently, the total integrated intensity across all the genes should be roughly same in any two arrays. Rather than using all the genes, a subset of non-differentially expressed genes, i.e., invariantly expressed genes (IEGs), would be a good choice for normalizing microarray data across conditions. In contrast, using exogenous control genes for normalization is a universally applicable strategy since it does not depend on the assumptions like the ones described above. The use of exogenous control genes to normalize microarray data, while technically the most complex to set up and calibrate, may provide the best strategy for refining normalization methods. In this section, we will focus on global and invariant gene approaches and give a detailed description of each method. Specifically, we will describe the algorithms of three global approaches - linear regression (LR), Loess regression and quantile normalization; and two invariant gene approaches - invariant ranking (IR) and iterative nonlinear regression (INR). 3.1. Linear regression Linear regression (LR) method is the most commonly used normalization approach in large-scale gene expression analysis. This LR method is also referred to as "global scaling" in Affymetrix's analysis tools (12). The LR method can be described as follows for performing normalization in the probe level. A baseline array is first chosen; in practice, the array with median intensity is a reasonable choice for normalization. All other arrays are then normalized to this baseline array by some scaling factors estimated by linear regression. If
As pointed out in (24), the linear regression method implicitly rests on the assumption that the amount of mRNA per cell is constant. However, this assumption is theoretically and practically questionable for several reasons like due to gene-dependent multiplicative errors and not the whole genome covered by an array (24). 3.2. Loess regression Microarray expression levels may have large dynamic range that will cause scanner systematic deviations such as nonlinear response at lower intensity range and saturation at higher intensity. Although data falling into these ranges are commonly discarded for further analysis, the transition range, without proper handling, may still cause some significant error in differential expression gene detection. To account for this deviation, locally weighted linear regression (Loess) is regularly employed as a normalization method for such intensity-dependent effects (18). This approach is based upon the idea of the M versus A plot (i.e., M-A plot), where M is the difference in log expression values and A the average of log expression values (25). For any two arrays (denote as i-th array and j-th array) with probe intensities
A normalization function can be obtained by fitting this M-A plot using Loess regression. With the fitted normalization function
To deal with more than two arrays, the method can be further extended to look at all distinct pair-wise combinations. For each pair of arrays, we perform the Loess regression fitting in M-A plot and record the adjustment accordingly. After having performed on all distinct pairs, we have adjustments
This process can be performed iteratively, which was properly termed as cyclic Loess regression method in (15). Although it seems to be a time consuming process, as reported in (15), the process will be stopped in a few iterations when the adjustments to be applied become small enough. 3.3. Invariant ranking Contrast to the three normalization methods described in previous subsections, invariant ranking (IR) method differs in selecting a subset of genes for normalization (14). In particular, a subset of non-differentially expressed genes, i.e., invariantly expressed genes (IEGs), is identified by an iterative ranking method to estimate the normalization function. The method can be described as follows. For any two arrays, as in previous subsection, denoted as i-th array and j-th array with probe intensities
The invariant set of genes is selected by comparing
where, L and H are the rank difference thresholds for the low and high ends of the difference intensity range. As we can see, 3.4. Iterative nonlinear regression In this subsection, we describe INR normalization method in details. Figure 1 illustrates the block diagram of INR method consisting of two basic steps: (1) iterative IEG selection and (2) nonlinear regression normalization. As we can see, IEG selection is based on an iterative procedure that alternatively selects control genes (IEGs) and estimates nonlinear regression function for normalization. The final set of IEGs will be obtained when the iterative IEG selection procedure converges and subsequently, a nonlinear regression function will be estimated based on these IEGs. Next, we will describe the iterative IEG selection procedure and the outline of INR algorithm. Different from most existing methods, INR normalization method relies on IEGs that can be selected iteratively by sector-shaped nonlinear regression (16). Specifically, we have developed an INR algorithm that alternatively selects IEGs and estimates normalization regression function. In an ideal case, i.e., without systematic errors, IEGs are the genes whose expression ratios are close to 1 between two microarray experiments, defined by the following equation mathematically:
where Microarray data normalization aims to find a mapping function between the gene expression levels obtained from two samples or experiments. Mathematically, the gene expression levels in a floating array (
where NIEG is the number of IEGs and In this paper, we describe an iterative procedure to find IEGs for nonlinear normalization as follows (15,16). The procedure repeats the following two steps until it converges: (1) selecting IEGs from a sector-shaped region in scatter plot of the floating and reference arrays; and (2) normalizing the floating array using the estimated nonlinear regression function based on selected IEGs (see Figure 1). Initially, we use a relatively large sector for selecting potential IEGs. For instance, we can start with using all the genes as IEGs (i.e., using a 90-degree sector angle), and perform an initial normalization accordingly. We then gradually decrease the angle of the sector-shaped region and select a new set of IEGs for normalization. The iterative procedure continues until there is no significant change in the content of IEGs and the estimated regression function converges to a 45-degree straight line (i.e., f(s) = s). Figure 3 illustrates the iterative process of IEG selection as the size of the sector decreases. The rationale of this approach lies in that after each normalization iteration the true IEGs shall move closer to a narrow sector around the 45-degree line as shown in Figure 2. Our numerical experiments have provided compelling evidence in support of such an iterative IEG selection scheme. 4. EXPERIMENTAL RESULTS We have implemented several normalization methods including LR, Loess, IR and INR algorithms in C/C++ and integrated the modules into dChip software (27). In addition, all four above-mentioned methods have been implemented in a way that normalization can be carried out either at probe level for oligonucleotide array data or at gene level for cDNA array data. When carried out at probe level, we only use perfect match (PM) probes to select IEGs for normalization. Note that this is consistent with the implementation of iterative ranking (IR) method (14), but different from Bolstad's implementation where both PM and mismatch (MM) probes are used for invariant probe selection (15). 4.1. Data Sets and expression measurement We used three data sets in our experimental tests - a dilution study from GeneLogic, a muscular dystrophy (MD) profiling study from Children's National Medical Center (CNMC), and a non-biological variability study from the Consortium for Functional Glycomics (CFG). The dilution data set was made available to the public specifically for comparison between different normalization methods (28). A total number of 60 arrays were acquired by Affymetrix's 75 HG-U95A microarrays to study the dilution/mixture effect of two sources of RNA from human liver tissue and central nervous system (CNS) cell line. The CNMC's MD data set with 125 arrays was acquired by Affymetrix's GeneChip (U133A) microarrays to study different types of muscular dystrophy (29). The Consortium for Functional Glycomics (CFG) has acquired 32 microarrays using custom-designed Glyco-gene Chips for assessment of sources of non-biological (technical) variability (30). Variables examined in the processing of RNA samples and gene chips include: (a) technician extracting RNA, (b) RNA isolation, (c) DNAse treatment of RNA, (d) biotin labeling of cRNA, and (e) day of hybridization to GLYCOv1. All samples in this study were C57/Bl mouse brain RNA. For all three data sets, the gene expression measurements were obtained using Affymetrix's Microarray Suite 5.0 (MAS 5.0) probe set interpretation algorithm (12), although other algorithms like robust multiple-array average (RMA) (31) and model-based expression index (MBEI) (32) can also be used. 4.2. Normalization plots Figure 4 - Figure 7 show some typical results of the normalization methods when applied to GeneLogic's data set on dilution study. In the experiment, we chose an array (94407hgu95a11) as the baseline array since it is of median intensity among all arrays. The upper row in Figure 4 shows a second array (94394hgu95a11) normalized to the baseline array, while the lower row showing a third array (94420hgu95a11) before and after normalization. In a similar fashion, Figure 5 shows the normalization results using Loess regression method. As we can see, after normalization, the M-A plot of an array and baseline array is centered on Figure 8 shows examples of the iterative IEG selection process when applied to CNMC's MD data set. A large sector was initially used for IEG selection and regression function estimation. As iteration goes on, the sector was gradually narrowed down since IEGs were expected to move closer to the 45-degree line after each interim normalization step. The final set of IEGs was obtained when the following two conditions met: (1) the selected IEGs differ little from those selected in the previous step, and (2) the estimated regression function is close to the 45-degree line in scatter plot. Figure 9 shows the normalization result of INR, showing scatter plots of two MD arrays prior to normalization, final selected IEGs, and normalized MD arrays, respectively. 4.3. Performance comparison To compare the performance of normalization methods such as LR, Loess, IR and INR methods, we used the following two criteria to quantitatively assess whether one method outperforms the other (14): (1) lower variance of expression level across replicated arrays, and (2) preservation of true fold-change in controlled realistic simulations. As discussed in (14), the first criterion ensures that genes known to have identical expression levels shall remain or incline to being identically expressed after normalization. The second criterion ensures that the first criterion is not achieved at the expense of destroying the very biological variations the technology aims to detect. Note that other criteria such as bias comparison based on spike-ins are also valuable to assess the performance of a normalization method under consideration (15).
4.3.1. Variance comparison In GeneLogic's dilution study, there are 30 arrays for each RNA source (Liver or CNS) with 6 different masses of cRNA (1.25, 2.5, 5.0, 7.5, 10.0, and 20.0 μg). Each dilution level was hybridized on HG-U95A chips and then scanned by 5 different scanners as replicate measurements. This data set is ideal for performance comparison of different normalization methods, since non-biological variability (or systematic errors) was purposely introduced through replicates and dilutions, while the goal of normalization is to correct these system errors so that multiple arrays can be further analyzed for the problem being studied. We used two sets of the 60 arrays of dilution study and one set of non-biological variability study from CFG for our variance comparison; the first set consisting of 30 arrays of liver, the second set consisting of 30 arrays of CNS, and the third set consisting of 32 arrays of C57/Bl mouse brain RNA. The following normalization methods were applied to the data sets: (1) LR method, (2) Loess method, (3) IR method and (4) INR method. After having normalized the arrays by these normalization methods respectively, we calculated expression measurements for each probe set on each array using MAS 5.0. We then computed the mean and variance of the expression measurements across all 30 arrays in each set. For variance comparison, we performed a pair-wise comparison between all four normalization methods. For any two methods (e.g., INR against IR), we counted the number of probe sets that have a larger variance of expression measurements using INR than that using IR. The percentage of the probe sets with larger variance was then calculated and used to assess the method's performance according to Criterion 1 (14). Figure 10 shows the results using the liver data set from GeneLogic's dilution study. As we can see, all four normalization methods significantly reduced the expression variance when compared to the raw data (denoted as "UN" in Figure 7). All these normalization methods, in overall, produce more consistent expression measurements across these 30 arrays. In particular, IR and INR methods outperformed LR method in reducing the variance of expression measure (only about 30% and 28% of probe sets having larger variance than that using LR method, respectively). Furthermore, INR method showed 68% of probe sets having less variance than that from IR method, i.e., only 32% of probe sets having larger variance than that of IR method. Figure 11 and Figure 12 shows the variance comparison results on the CNS data set and mouse brain data set, which again confirmed similar observations: (1) INR method exhibited a much better performance than LR and Loess methods in keeping the expression measurements consistent; (2) INR method further reduced the expression variance compared to IR method.
4.3.2. Fold-change comparison In order to conduct fold-change comparison, we have constructed two sets of controlled realistically simulated microarray data based on GeneLogic's dilution data set. We chose ten replicates and dilution arrays to begin with - five of them were the replicate arrays at 5μg mass of cRNA from liver tissue and the other five were at 10μg mass of liver cRNA. The simulated microarray data sets were constructed using the same procedure as originally designed by Schadt et al. (14). Below, we give a brief description of the procedure. In the first set, 300 genes that were consistently detected as present across five low-intensity replicate arrays (5μg Liver cRNA) and 600 from high-intensity replicate arrays (10μg Liver cRNA) were randomly selected. Six sets containing 50 genes each for the low-intensity arrays and 100 genes each for the high-intensity arrays were then generated by a random selection process from the sets of 300 and 600 genes selected. The expression measurements of the selected genes in each of the six sets were then multiplied by 2.0, 0.5, 4.0, 0.25, 6.0, and 0.17, respectively, to simulate fold-changes between the samples. The ten original arrays (without modification) and ten modified arrays were used to compare the performance of normalization methods in preserving the controlled fold-changes. The same procedure was used to construct the second simulated data set consisting of ten replicates (5μg and 10μg of CNS cRNA) from dilution study of CNS. Similarly, the ten original arrays and ten modified arrays were used in the comparative experiments. Finally, a third data set was constructed in the same way using 10 mouse brain arrays from CFG as described in Section 3.1. We tested the four different normalization methods (LR, Loess, IR and INR) on the same simulated data sets. After normalization, we calculated the fold-changes of the altered genes and computed the mean square errors (MSEs) between the observed and true fold-changes across replicates as follows: where N is the number of arrays being modified (in this case, N = 10); Figure 13 shows the comparison results of fold-change preservation on the first testing data set (Liver). The performance can be observed as follows. First, LR method was the worst one among all four normalization methods in preserving the authentic fold-changes. Second, Loess method was the second worst method in that it exhibited 96% of genes having larger Figure 14 and Figure 15 show the comparison results on the second and third testing data sets (CNS and mouse brain, respectively). Among all four normalization methods, LR method was again the worst one in terms of fold-change preservation. As expected, INR method continued to show the best performance in preserving fold-changes, specifically, only 30% (or 25%) of the genes having larger 5. CONCLUSION In this paper, we have reported a comparison study of four normalization methods, namely LR, Loess, IR and INR methods, for normalizing gene expression data. We tested the normalization methods on three real microarray data sets - GeneLogic's array data set for dilution study, CNMC's microarray data set for muscular dystrophy study and CFG's data set for non-biological variability study - to evaluate their performance. The experimental results have demonstrated that an improved performance can be obtained using normalization methods for correcting systematic errors. It becomes evident to us that correct selection of IEGs is the key to assure the success of any normalization method. Not like other methods (e.g., LR method, Loess method and quantile method), INR and IR methods are the only ones that perform the normalization based on IEGs selected via carefully designed procedures. Specifically, we compared the performance of the above-mentioned normalization methods based on the following two criteria: (1) expression variance reduction and (2) fold-change preservation. From the experimental results, we have come to the conclusions that (1) LR method was the worst one among the four normalization methods tested on the data sets used in the experiments; and (2) INR method outperformed all other three methods (LR, Loess and IR methods) in reducing expression variance across replicates and preserving the fold-changes of targeted differentially expressed genes. It is worth noting that there have been several other important comparison studies on microarray data normalization (14,15,32). Schadt et al. have compared the IR method with LR and GCVSS methods (14). They have demonstrated that the selection of invariantly expressed genes (IEGs) is very important for reducing the array variance without compromising the fold-change preservation. Since they used the same criteria (variance reduction and fold-change preservation) as those in this paper to evaluate the performance, the results from their study were consistent with our results. Furthermore, our method (i.e., INR) improves the performance with a novel IEG identification method (by iteratively normalizing the floating array so that the IEGs are moved to the 45-degree line in the scatter plot). Bolstad et al. also conducted a comparison study to evaluate several normalization methods like Loess, LR, IR and quantile methods (15). In their study, they used variance and bias as the criteria to evaluate the performance of each method; they also used GeneLogic's dilution data set for variance comparison, but used GeneLogic's Spike-in data set for bias comparison. They found that the quantile method gave a slightly better performance than the other methods; the LR method was the worst among the methods, which is consistent with the comparison result in this paper; however, the nonlinear method, IR, also did poorly for the spike-in regression in their study. Recently, Fujita et al. have also compared different methods, including Loess, splines, wavelets, kernel smoothing and support vector regression (SVR) to evaluate their performances using simulated microarray data (32). They have shown that the SVR method was favored for microarray normalization due to its robustness in estimating the normalization curve. The SVR method is essentially a nonlinear method using all the genes, but optimized to limit the fitting error while finding a linear mapping function as flat as possible (32). In the future, we will further include those new approaches like SVR and quantile normalization to complete the performance evaluation of different normalization methods for microarray data normalization. In addition, we will use the simulated microarray data as generated in (32) and bench microarray data together to further assess the robustness of each method in terms of variance reduction and fold-change preservation. Finally, we also plan to further improve the INR method with the idea of SVR method to optimize the nonlinear regression function for better identifying the ISGs. 6. ACKNOWLEDGMENTS This work was supported in part by the NIH under Grants CA109872, EB000830 and NS29525-13A; DOD/CDMRP under Grant BC030280. We thank C. Li and W. H. Wong at Harvard University for making dChip software, especially the implementation of their iterative ranking (IR) normalization method, available to us for comparison. We also thank the anonymous reviewers for their invaluable inputs that lead to several important improvements. 7. REFERENCES 1. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R., Downing, M. A. Caligiuri, C. D. Bloomfield and E.S. Lander: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537 (1999) Abbreviations: LR: linear regression, IR: invariant ranking, INR: iterative nonlinear regression, GCVSS: smoothing splines with generalized cross validation, IEG: invariantly expressed genes, MD: muscular dystrophy, MSE: mean square error Key Words: Normalization; nonlinear regression; gene expression profiling; microarray data analysis; computational bioinformatics Send correspondence to: Jianhua Xuan, Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 4300 Wilson Blvd., Arlington, VA 22203, USA, Tel: 703-387-6057, Fax: 703-528- 5543, E-mail:xuan@vt.edu |