[Frontiers in Bioscience E4, 2607-2617, June 1, 2012] |
|
|
A U-Statistic-based random forest approach for genetic interaction study Ming Li1, Ruo-Sin Peng1, Changshuai Wei1, Qing Lu1
1 TABLE of CONTENTS
1. ABSTRACT Variations in complex traits are influenced by multiple genetic variants, environmental risk factors, and their interactions. Though substantial progress has been made in identifying single genetic variants associated with complex traits, detecting the gene-gene and gene-environment interactions remains a great challenge. When a large number of genetic variants and environmental risk factors are involved, searching for interactions is limited to pair-wise interactions due to the exponentially increased feature space and computational intensity. Alternatively, recursive partitioning approaches, such as random forests, have gained popularity in high-dimensional genetic association studies. In this article, we propose a U-Statistic-based random forest approach, referred to as Forest U-Test, for genetic association studies with quantitative traits. Through simulation studies, we showed that the Forest U-Test outperformed exiting methods. The proposed method was also applied to study Cannabis Dependence (CD), using three independent datasets from the Study of Addiction: Genetics and Environment. A significant joint association was detected with an empirical p-value <0.001. The finding was also replicated in two independent datasets with p-values of 5.93e-19 and 4.70e-17, respectively. 2. INTRODUCTION The past decade has witnessed an evolutional change of genetic association studies from the research of a limited number of candidate genes to the investigation of entire human genomes. This is made possible by the knowledge of comprehensive high-density maps of single nucleotide polymorphisms (SNPs) and the advancement of genotyping technologies (1-4). These genome-wide association studies (GWASs) have advanced the field of human genetics, allowing us to explore unknown regions for potential risk variants associated with diseases. So far, a large number of GWASs have been conducted and hundreds of novel disease-susceptibility loci have been reported (5, 6). Despite these successes, the current identified loci only explain a small fraction of the diseases' heritability (7). Moreover, the identified genetic variants have a low replication rate in follow-up studies, and have shown a limited contribution to disease prediction (8, 9). Following the initial association scan of the GWASs, it is a natural step for future genetic association studies to explore the potential complex interactions among genetic variants and environmental risk factors (10). Such studies can increase the power to detect unknown genetic variants that are associated with diseases, and provide novel insights into biological pathways underlying the development of diseases (9). Detecting the gene-gene and gene-environmental interactions has been a longstanding goal of genetic association studies. One strategy is to explore all possible combinations of genetic and environment risk factors and select the best combination predisposing to the risk of disease. Ritchie et al. proposed a Multifactor Dimensionality Reduction (MDR) method, which is based on such a strategy (11). For each combination of multiple SNPs, it partitions all possible multi-SNP genotypes into two risk groups, and calculates their classification error for the disease outcome. The combination of SNPs with the lowest classification error is then selected and assessed for joint association with the disease. The limitations of the MDR method are that it requires balanced data and does not allow for covariate adjustments. To address these limitations, Lou et al. extended it to a generalized MDR (GMDR) framework (12). Similar as the MDR, the GMDR searches all genetic variants for the 'best' multi-SNP combination. However, it uses score statistics instead of a case control ratio, when partitioning the genotypes into two risk groups. Both the MDR and GMDR methods are non-parametric and model free, and have been successfully applied for detecting the gene-gene and gen-environment interactions associated with complex diseases, such as autism and nicotine dependence (13, 14). However, they both search exhaustively for combinations of SNPs, which are computationally impractical for high-dimensional data (15). Alternatively, recursive partitioning approaches- such as classification trees and random forests-have been widely adopted for the identification of the gene-gene and gene-environment interactions (16-18). A classification tree is constructed by sequentially adding into the tree model the best predictor among a large number of features. Based on this idea, Li et al. proposed a Forward U-Test that examines the combined effect of genetic variants for quantitative traits, with the consideration of gene-gene interactions (19). It first uses a U-Statistic-based forward algorithm to select potential disease-susceptibility loci and then assesses the joint association of the selected loci using U-Statistics. It has been reported that the classification accuracy of a single classification tree can be improved substantially by averaging an ensemble of trees, referred to as random forests (20). A random forest is comprised of multiple decision trees, each of which is grown by using bootstrap samples of the same size. For each decision tree, the approach sequentially selects the best predictor among a subset of randomly selected features, and adds it into the tree model. The random-forest-based methods have been applied successfully to detect gene-gene interactions underlying various complex diseases, such as asthma and age-related macular degeneration (21, 22). In addition to its improved performance, random-forest-based methods have the advantage of detecting high-order interactions on high-dimensional data (23, 24). So far, random forests have mostly been used in genetic association studies with binary outcomes. In addition, a few linkage analyses have used random forests for the analysis of quantitative traits (25, 26). However, the application of random forests for genetic association studies of quantitative traits is still lacking. Furthermore, the current methods commonly assess the significance level of each SNP using the permutation-based importance scores. Such a procedure is computationally intensive and time demanding. In this article, we propose a Forest U-Test approach for the identification of gene-gene and gene-environmental interactions associated with quantitative traits. The Forest U-Test approach can be looked upon as an extension of the previously developed Forward U-Test with an implementation of random forests. We also derive an asymptotical U-Statistic test to assess the joint association of multiple genetic variants with quantitative traits. Through simulations and a real data application, we compare the proposed Forest U-Test with the existing methods, such as Forward U-Test and GMDR. 3. METHODS Suppose we have a study population of 3.1. U-Statistics We have previously introduced a U-Statistic to measure the joint association of multiple SNPs with quantitative traits with the consideration of possible gene-gene interactions (19). Following the same notation, we assume To measure the overall trait difference among a total of Here, the weight parameter 3.2. U-Statistic-based decision tree For common complex traits, the disease susceptibility loci are commonly unknown. In order to determine the 3.3. U-Statistic-based random forest The performance of the U-statistic-based decision tree can be further improved by constructing a random forest, built on an ensemble of In addition, while considering an ensemble of where 3.4. Significance level Hypothesis testing can then be conducted to evaluate the joint association of where This U-Statistic can be used to test the joint association of the SNPs with the traits. A null distribution of U-Statistics can be obtained by permuting the trait and applying the same random forest procedure to calculate the U-Statistics. Based on the null distribution, an empirical p-value can be obtained. The empirical p-value accounts for the inflated type I error due to model selection and ordering of subjects. We also derive an asymptotic test for replicating the initial association in an independent study. We refer to the data from the original study as training data and the data from the follow-up study as testing data. We denote multi-SNP genotypes of each decision tree Meanwhile, a testing predicted trait value can be calculated by averaging the observed traits in testing data across all subjects in We further average trait values for each subject Again, we treat each subject in the testing set as a separated genotype group. Assuming they are sorted by their training average trait values, we can calculate a global U-Statistic by using the testing average trait value (Equation (6)). Since the sequence of subjects is pre-determined by the training set, we expect the global U-Statistic follows a normal distribution asymptotically with a zero mean under the null distribution. Assuming all subjects in the testing set are independent and have the same variance of It should also be noted that, although the above method is illustrated with genotype data, it applies to environmental risk factors with categorical levels as well. For continuous environmental risk factors, we need to first categorize the continuous variables before applying the proposed method. 4. RESULTS 4.1. Simulation I In the first simulation, we compared the performance of the proposed method with that of two existing methods, the Forward U-Test and GMDR. The comparison was conducted under various underlying diseases models with different levels of disease complexity. We also incorporated two types of genetic interactions into the disease models, a multiplicative-effect model and a threshold-effect model (Table 1) (27). We started with a simple disease model comprised of a two-locus multiplicative effect interaction. The second disease model included a two-locus multiplicative effect interaction and two independent SNPs with additive effects. The third disease model included a two-locus multiplicative effect interaction, a two-locus threshold effect interaction and two independent SNPs. The fourth disease model included a two-locus multiplicative effect interaction, a two-locus threshold effect interaction and four independent SNPs. The SNP genotypes were simulated under the assumption of Hardy-Weinberg Equilibrium (HWE). For all SNPs, the minor alleles corresponded to higher traits and the minor allele frequencies were set at 0.3. For each disease model, a number of non-disease related SNPs were also introduced to bring the total number of SNPs to ten. The minor allele frequencies of these noise SNPs were sampled from a uniform distribution ranging from 0.1 to 0.9. For each underlying disease model, we first simulated a reference population with one million subjects. An expected trait The simulation results were summarized in Table 2. From the results, we observed power improvement for all methods as the number of causal loci increased. For the simplest disease model with only two causal SNPs, the power of Forward U-Test is close to that of Forest U-Test, and is higher than the power of the GMDR. With the increase of model complexity, the Forest U-Test had the most significant power increase, and outperformed the other two approaches. In all scenarios, the type I errors were properly controlled for all methods. In the simulation study, we fixed the minor alleles at 0.3 for all causal SNPs. We expect power increase for all methods if the minor allele frequencies were increased. 4.2. Simulation II In the second simulation, we evaluated the performance of Forest U-Test with respect to two pre-determined parameters, the number of random features and the tree depth The simulation results were summarized in Table 3. We reported on the power, type I error and average time cost to construct a random forest with 500 trees. The computational time was based on the time of running R programs on a high-performance computer with a dual-core 1.6GHz processor and 4 GB memory. We found the computational time increased with the increase of either 4.3. Simulation III In the simulation, we compared the proposed approach with the Random Forest (RF) approach developed by Breiman L. et al. (20). RF does not provide a testing statistic for association testing. Instead, it focuses on variable selection and ranks all SNPs by importance score. Therefore, we compared the performance of two approaches by their importance ranking of all SNPs. The performance of two approaches was evaluated based on the fourth disease model in simulation I. For both RF and Forest U-Test, 500 trees were constructed. The parameters were fixed at In Figure 1, we found that both methods selected causal SNPs (SNP 1 - SNP 8) with relatively higher importance. We also found that RF ranked the causal SNPs with threshold effect as less important than the causal SNPs with multiplicative effect or independent effect. On the other hand, the ranking of the causal SNPs by Forest U-Test was consistent across different modes of inheritance. This may be partially due to the splitting strategy of RF. During the tree constructing, RF can choose different SNPs to split nodes at the same level, which might be a disadvantage for capturing interaction models, in particular the threshold effect model. Suppose the first SNP can split the root node into two offspring nodes as a 'risk' allele node and a 'non-risk' allele node. The second SNP is less likely to be selected by RF to further split the 'non-risk' allele node, which lead to the low chance of capturing threshold effect interactions. On the other hand, Forest U-Test uses the same SNP to split nodes at the same level, and hence, increase the power to capture interactions. 4.4. Application to cannabis dependence We applied the proposed method to study Cannabis Dependence by using the Study of Addiction: Genetics and Environment (SAGE) GWAS dataset (29). As part of the Gene Environment Association (GENEVA) consortium (30), SAGE was designed by selecting unrelated participants from three independent studies: the Family Study of Cocaine Dependence (FSCD), the Collaborative Study on the Genetics of Alcoholism (COGA), and the Collaborative Genetic Study of Nicotine Dependence (COGEND). In our analysis, the trait of interest is Cannabis Dependence, measured by the number of marijuana symptoms endorsed (mj_sx_tot). The trait has eight numerical values, ranging from 0 to 7. The distribution of traits is given in Figure 2. We also collected 25 SNPs that had been reported in the previous literatures as having potential association with Cannabis Dependence. The genotypes of 13 SNPs were available in SAGE GWAS dataset and the genotypes of the remaining 12 SNPs were imputed by software package IMPUTE2 (31, 32). The CEU and YRI populations from the HapMap phase III and 1000 Genome project were used as the reference panels to impute the 12 SNPs for white and black subjects (33, 34). In addition to the 25 SNPs, gender was also included in the analysis as a covariate. We used FSCD as an initial association dataset, and COGA and COGEND as replicate datasets. While applying the Forest U-Test, we set the parameters as We also applied the Forward U-Test and GMDR to the same datasets. Similar to the above analysis, FSCD was used as initial dataset for model selection, and COGA and COGEND were used for replication. The results were listed in Table 5 and Table 6. The analysis result of FSCD showed that both the Forward U-Test and GMDR selected the most parsimonious models with gender only. Therefore, the analyses of COGA and COGEND only examined the association between the trait and gender. Though the association remained significant in all studies, neither method detected any genetic effects. To compare the Forest U-Test with the conventional RF approach, we also applied RF to the same datasets. When RF was applied to FSCD, gender was also ranked as the most important covariate. The three top SNPs were rs2501432, rs1019238 and rs324420. These three SNPs were ranked as 1st, 6th and 3rd important SNPs by Forest U-Test. While the findings of both methods were highly consistent, it was not straightforward for RF to replicate its initial finding in COGA and COGEND. When RF was applied to COGA and COGA, gender remained to be the most important covariates. However, the top SNPs changed across studies (Table 7). 5. DISCUSSION Evidence has shown that multiple genes are interacting in biological pathways to influence the development of diseases (9, 35). It is also common for genetic effects to be modified by environmental risk factors (36). Ignoring the complex interactions between genes and environmental factors will likely reduce the power of detecting novel risk factors underlying complex traits (37). Though there is an increasing awareness that the gene-gene and gene-environment interactions are crucial for understanding the etiology of complex diseases, detecting the complex interactions remains a major challenge in genetic association studies (15, 38). Recently, random forest approaches have been adopted to detect the association of multiple risk factors while allowing for high order interactions. However, most of the current studies have applied random forests to binary disease outcomes, and few studies have considered their applications for quantitative traits. In this article, we propose a U-Statistic-based random forest approach for genetic association studies with quantitative trait. The proposed method was found to have a greater power than two existing methods, the Forward U-Test and GMDR. The simulation results showed that the Forest U-Test had the largest advantage over the existing methods when the underlying disease model is highly complex. This improvement can be explained by the following reasons: 1) By constructing an ensemble of decision trees based on random features and bootstrap samples, the method not only considers the risk factors with large effects, but also incorporate those with only small or moderate effects. Though many risk factors may only play a limited role in the disease development, they can collectively contribute to a significant portion of the variation of traits. On the other hand, the Forward U-Test and GMDR only search the risk factors for the best combination and may overlook those with small or moderate effects; 2) Compared to a single decision tree, the random forests provide a more robust performance, making the result replicable in the follow-up studies; 3) By averaging the predicted trait values of multiple decision trees, Forest U-Test allows for a large number of risk groups (i.e. every subject may form a risk group). On the contrary, the Forward U-Test only allows for a limited number of risk groups formed by a few selected risk factors, while GMDR always assumes two risk groups for each combination of risk factors. Such assumptions may be questionable in real disease scenarios. We also note that the Forest U-Test evaluates the joint association of multiple risk factors without directly selecting the most parsimonious combination of risk factors. Therefore, compared to the Forward U-Test and GMDR, the results of Forest U-Test are less easy for interpretation, which is a common limitation for most random-forest-based methods. Nevertheless, the asymptotic result of Forest U-Test can easily be used to evaluate the association in independent follow-up studies, which is an advantage over the conventional random-forest-based methods. Cannabis Dependence is a disorder that may involve complex interactions among multiple risk factors. In our analysis, the three top SNPs come from three genes CNR2, FAAH, and ANKFN1 respectively. CNR2, also known as Cannabinoid receptor type 2, belongs to the cannabinoid receptor family. The encoded protein functions as a receptor for cannabinoids, which are the principal psychoactive ingredients of marijuana (39). This gene has been verified to occur in the central nervous system, and is expressed in the brain (40, 41). SNP rs2501432 is a non-synonymous mutation that locates in CNR2. Experimental evidence has demonstrated that this mutation can change the function of CNR2 protein (42). Previous studies have also reported the association of the SNP with substance use disorders (43, 44). SNP rs324420, also known as Pro129Thr, is another non-synonymous mutation that locates in exon 3 of the fatty acid amide hydrolase gene (FAAH), which has shown associations with many substance use disorders (45-47). It was estimated that the minor allele homozygote leaded to a reduced risk of 0.25 for the development of Cannabis Dependence (48). The third SNP, rs1431318, lies in gene ANKFN1, which was previously identified as being involved in substance use disorders (49, 50). Interestingly, this SNP was reported as the most significant SNP (p-value <1e-7) by a recent GWAS for Cannabis Dependence (51). Whereas it is biologically plausible that all these genes may play an important role in developing Cannabis Dependence, our method does not provide any inference for the underlying biological mechanism. The identified association may result from either additive or interactive effects between the genes. Future study would be necessary to further replicate the association and investigate potential joint actions among these genes. 6. ACKNOWLEDGEMENT This work is supported by start-up funds from Michigan State University. The authors want to thank four referees for helpful comments that improved this manuscript. 7. REFERENCES 1. The_International_HapMap_Consortium.: A haplotype map of the human genome. Nature, 437(7063), 1299-320 (2005) 51. A. Agrawal, M. T. Lynskey, A. Hinrichs, R. Grucza, S. F. Saccone, R. Krueger, R. Neuman, W. Howells, S. Fisher, L. Fox, R. Cloninger, D. M. Dick, K. F. Doheny, H. J. Edenberg, A. M. Goate, V. Hesselbrock, E. Johnson, J. Kramer, S. Kuperman, J. I. Nurnberger Jr, E. Pugh, M. Schuckit, J. Tischfield, J. P. Rice, K. K. Bucholz and L. J. Bierut: A genome-wide association study of DSM-IV cannabis dependence. Addiction Biology, 514-518 (2011) doi:10.1111/j.1369-1600.2010.00255.x Appendix: Derivation of the variation of U-Statistics in Equation (7). Note that when each subject is treated as a separated group, the global U-Statistic in Equation (2) is equivalent to
Assuming
Key Words: Forest U-Test, Random Forest, quantitative trait, U-Statistics Send correspondence to: Qing Lu, Address: Department of Epidemiology, Michigan State University, East Lansing, MI 48824 USA, Tel: 517-353-8623, Fax: 517-432-1130, E-mail:qlu@epi.msu.edu |