[Frontiers in Bioscience E4, 2607-2617, June 1, 2012]

A U-Statistic-based random forest approach for genetic interaction study

Ming Li1, Ruo-Sin Peng1, Changshuai Wei1, Qing Lu1

1Department of Epidemiology, Michigan State University, East Lansing, MI 48824

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Methods
3.1. U-Statistics
3.2. U-Statistic-based decision tree
3.3. U-Statistic-based random forest
3.4. Significance level
4. Results
4.1. Simulation I
4.2. Simulation II
4.3. Simulation III
4.4. Application to Cannabis Dependence
5. Discussion
6. Acknowledgement
7. References

1. ABSTRACT

Variations in complex traits are influenced by multiple genetic variants, environmental risk factors, and their interactions. Though substantial progress has been made in identifying single genetic variants associated with complex traits, detecting gene-gene and gene-environment interactions remains a great challenge. When a large number of genetic variants and environmental risk factors are involved, the search for interactions is typically limited to pair-wise interactions because the feature space and the computational burden grow exponentially with the interaction order. Alternatively, recursive partitioning approaches, such as random forests, have gained popularity in high-dimensional genetic association studies. In this article, we propose a U-Statistic-based random forest approach, referred to as Forest U-Test, for genetic association studies with quantitative traits. Through simulation studies, we showed that the Forest U-Test outperformed existing methods. The proposed method was also applied to study Cannabis Dependence (CD), using three independent datasets from the Study of Addiction: Genetics and Environment. A significant joint association was detected with an empirical p-value < 0.001. The finding was also replicated in two independent datasets with p-values of 5.93e-19 and 4.70e-17, respectively.