[Frontiers in Bioscience 15, 801-825, June 1, 2010]

Computational identification and analysis of protein short linear motifs

Norman E. Davey1,2,3,4, Richard J. Edwards5, Denis C. Shields1,2,3

1UCD Complex and Adaptive Systems Laboratory, University College Dublin, Dublin, Ireland, 2UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland, 3UCD School of Medicine and Medical Sciences, University College Dublin, Dublin, Ireland, 4EMBL Structural and Computational Biology Unit, Meyerhofstrasse 1, 69117 Heidelberg, Germany, 5School of Biological Sciences, University of Southampton, Southampton, United Kingdom

TABLE OF CONTENTS

1. Abstract
2. Introduction
2.1. Biological attributes of SLiMs
2.1.1. Structural disorder
2.1.2. Sequence conservation
2.1.3. Specificity
2.1.4. Affinity
2.1.5. Structure
2.1.6. Amino acid preference
2.2. Potential for novel SLiM discovery
2.3. Sources of SLiM information
2.3.1. Classical motifs
2.3.2. Modification motifs
3. SLiM discovery
3.1. A priori motif discovery
3.1.1. Primary sequence
3.1.2. Structural information
3.1.3. Keyword searches
3.2. Post-translational modification prediction
3.3. De novo motif discovery
3.3.1. Algorithmic motif discovery
3.3.2. Biological models
3.3.3. Structural models
4. Dataset design for SLiM discovery
4.1. Data sources
4.1.1. Gene ontology
4.1.2. Localization
4.1.3. Protein-protein interaction data
4.2. Working with PPI data
4.2.1. Binary interaction
4.2.2. Protein complex interaction
4.2.3. Atomic interaction
4.2.4. Topology specific interaction
4.3. Issues with PPI data
4.3.1. Comparability of sources
4.3.2. High affinity bias
4.3.3. Ascertainment bias
4.3.4. Incomplete data
4.4. Reducing noise in datasets
4.4.1. Network pruning
4.4.1.1. Domain-domain interactions
4.4.1.2. Multidomain proteins
4.4.1.3. Physical contact
4.4.1.4. Topology
4.4.2. Motif enrichment
4.4.2.1. Domains/globular regions
4.4.2.2. Evolutionarily under-constrained residues
4.4.2.3. Topology
4.4.2.4. Surface accessibility
5. Motif statistics
5.1. Motif-based metrics
5.2. Protein-based metrics
5.2.1. Probabilistic calculation
5.2.2. Empirical calculations
5.2.3. Background sampling
5.3. Dataset-based motif probability
5.3.1. Achieving independence
5.4. Dataset-based motif significance
5.5. Outstanding issues for motif statistics
5.5.1. Selection against motif occurrences
5.5.2. Classification of motifs
5.5.3. Significance of ambiguous motifs
5.5.4. Non-independence of datasets
6. Motif analysis
6.1. Matching known motifs
6.2. Conservation
6.3. Confidence through context
6.3.1. Structural information
6.4. Off-target motifs
6.4.1. Modification
6.4.2. Localization
6.4.3. Indirect binding
6.4.4. Multi-functionality
7. Conclusion
8. Acknowledgements
9. References

1. ABSTRACT

Short linear motifs (SLiMs) in proteins can act as targets for proteolytic cleavage, sites of post-translational modification, determinants of sub-cellular localization, and mediators of protein-protein interactions. Computational discovery of SLiMs involves assembling a group of proteins postulated to share a potential motif, masking out residues less likely to contain such a motif, down-weighting shared motifs arising through common evolutionary descent, and calculating statistical probabilities that allow for the multiple testing of all possible motifs. Much of the challenge for motif discovery lies in the assembly and masking of datasets of proteins likely to share motifs: because the motifs are typically short (between 3 and 10 amino acids in length), potential signals can easily be swamped by the noise of stochastically recurring motifs. Focusing on disordered regions of proteins, where SLiMs are predominantly found, and masking out non-conserved residues can reduce the level of noise, but more work is required to improve the quality of high-throughput experimental datasets (e.g. of physical protein interactions) as input for computational discovery.
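The discovery steps outlined in the abstract (assemble, mask, count, assess) can be illustrated with a minimal Python sketch. It is a toy illustration under stated assumptions, not the method of any particular tool: the sequences, the keep/mask flags, the regular-expression motif `P.LP`, the helper names, and the per-protein chance probability `p_chance` are all hypothetical. It treats the proteins as independent, so it omits the down-weighting of evolutionarily related proteins and the correction for multiple testing mentioned above.

```python
import re
from math import comb

def mask_sequence(seq, keep):
    """Replace residues flagged '0' in `keep` with 'X' so that defined
    motif positions cannot match them (a wildcard still can)."""
    return "".join(res if flag == "1" else "X" for res, flag in zip(seq, keep))

def support(motif, masked_seqs):
    """Count the sequences that contain at least one match to the motif."""
    pattern = re.compile(motif)
    return sum(1 for seq in masked_seqs if pattern.search(seq))

def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical dataset: three proteins postulated to share a motif.
seqs = ["MKPPLPARAA", "GGPALPPKSS", "AAAAPSLPTK"]
# Hypothetical mask flags ('1' = keep, '0' = mask), e.g. from a disorder
# or conservation predictor; most of the third protein is masked out.
keep = ["1111111111", "1111111111", "0000000011"]

masked = [mask_sequence(s, m) for s, m in zip(seqs, keep)]
k = support("P.LP", masked)

# Assumed probability that one protein contains the motif by chance.
p_chance = 0.1
print(k, binomial_tail(k, len(masked), p_chance))
```

Masking changes the result here: the third protein contains a match in its raw sequence, but that match lies in the masked region and is therefore not counted. In practice `p_chance` would itself be estimated from the amino acid composition and length of the unmasked sequence rather than fixed by hand.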