[Frontiers in Bioscience E4, 2150-2161, January 1, 2012]

Comparison and evaluation of network clustering algorithms applied to genetic interaction networks

Lin Hou1,2,3, Lin Wang2, Arthur Berg4, Minping Qian1,2, Yunping Zhu3, Fangting Li5, Minghua Deng1,2,6

1LMAM, School of Mathematical Sciences, Peking University, Beijing 100871, China, 2Center for Theoretical Biology, Peking University, Beijing 100871, China, 3State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, China, 4Center for Statistical Genetics, Pennsylvania State University, Hershey, Pennsylvania, USA, 5School of Physics, Peking University, Beijing 100871, China, 6Center for Statistical Science, Peking University, Beijing 100871, China

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Materials and methods
3.1. Notation
3.2. Experimental genetic interaction networks
3.3. Synthetic data
3.4. Benchmark functional gene sets
3.5. Network clustering algorithms
3.6. Jaccard index: evaluation measure of the predicted modules
4. Results
4.1. Comparisons with the experimental data
4.2. Comparisons with the synthetic data
5. Discussion
6. Acknowledgements
7. References

1. ABSTRACT

The goal of network clustering algorithms is to detect dense clusters in a network, providing a first step towards understanding large-scale biological networks. With numerous recent advances in biotechnology, large-scale genetic interaction data are widely available, but there is limited understanding of which clustering algorithms are most effective. To address this problem, we conducted a systematic study comparing and evaluating six clustering algorithms for the analysis of genetic interaction networks, and investigated the factors that influence the choice of algorithm. The algorithms considered are hierarchical clustering, topological overlap matrix, bi-clustering, Markov clustering, Bayesian discriminant analysis based community detection, and the variational Bayes approach to modularity. Both experimentally identified and synthetically constructed networks were used in the comparison. The accuracy of each algorithm was measured by the Jaccard index, comparing predicted gene modules with benchmark gene sets. The results suggest that the best choice depends on the network topology and the evaluation criteria. Hierarchical clustering proved best at predicting protein complexes; Bayesian discriminant analysis based community detection proved best on epistatic miniarray profile (EMAP) datasets; and the variational Bayes approach to modularity was noticeably better than the other algorithms on genome-scale networks.
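
As an illustration of the evaluation measure named above, the following minimal Python sketch computes the standard set-based Jaccard index between a predicted module and a benchmark gene set. It assumes the textbook definition (size of the intersection divided by size of the union); the exact variant used in this study is the one defined in Section 3.6 and may differ. The function names, the best-match scoring helper, and the gene labels are hypothetical and chosen only for illustration.

```python
def jaccard_index(predicted, benchmark):
    """Standard Jaccard index |A & B| / |A | B| between two gene sets.

    Assumption: the textbook set-based definition, not necessarily the
    exact variant used in the study.
    """
    a, b = set(predicted), set(benchmark)
    union = a | b
    if not union:
        # Both sets empty: define the score as 0.0 to avoid division by zero.
        return 0.0
    return len(a & b) / len(union)


def best_match_scores(modules, benchmarks):
    """For each predicted module, return its highest Jaccard index
    against any benchmark set (hypothetical scoring scheme)."""
    return [
        max((jaccard_index(m, b) for b in benchmarks), default=0.0)
        for m in modules
    ]


# Hypothetical gene labels, for illustration only.
predicted_modules = [{"g1", "g2", "g3"}, {"g4", "g5"}]
benchmark_sets = [{"g2", "g3", "g6"}, {"g4", "g5"}]
scores = best_match_scores(predicted_modules, benchmark_sets)
```

Under this definition a score of 1 means a predicted module coincides exactly with a benchmark set, and a score of 0 means the two share no genes.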