[Frontiers in Bioscience 6, a1-12, April 1, 2001]

CLUSTERING AMINO ACID CONTENTS OF PROTEIN DOMAINS: BIOCHEMICAL FUNCTIONS OF PROTEINS AND IMPLICATIONS FOR ORIGIN OF BIOLOGICAL MACROMOLECULES

Ivan Y. Torshin

Laboratory of Chemical Kinetics and Catalysis, Chair of Physical Chemistry, Chem. Dept. of Moscow State University, Moscow, 119899, Russia and Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA 19107, USA.

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Methods
3.1. Databases
3.2. Clustering procedure
4. Results
4.1. Clustering procedure: central cluster and the core of it
4.2. Amino acid contents of the central cluster and of the core
4.3. Structural classes, cellular locations and ligand-binding properties of the proteins of the core
4.4. Proteins of the core that contain Fe-S clusters
5. Discussion
5.1. Protein biochemistry and clustering of amino acid contents
5.2. Amino acid composition, structural classes and functions of proteins
5.3. Origin of cellular life in hydrothermal vents and Fe-S proteins
5.4. Formation of protein-DNA/nucleotide interface
5.5. En bloc polymerization of the amino acids in amino acid and nucleotide mixtures co-adsorbed on clay and formation of non-random amino acid sequences
6. Perspective
7. Acknowledgement
8. References

1. ABSTRACT

Structural classes of protein domains correlate with their amino acid compositions. Several successful algorithms that use only amino acid composition have been developed for predicting structural class or potential biochemical significance. This work deals with dynamic classification (clustering) of domains on the basis of their amino acid composition. Domains from a non-redundant PDB set were clustered in the 20-dimensional space of amino acid contents. Regardless of the value of an empirical parameter, and despite the non-redundancy of the set, only one large cluster (tens to hundreds of proteins), surrounded by hundreds of small clusters (1-5 proteins), was identified. At least 64% of the domains in the core of the largest cluster are DNA (nucleotide)-interacting protein domains from various sources. About 90% of the proteins of the core are intracellular. Of the DNA/nucleotide-interacting domains in the core, 83% belong to the mixed alpha-beta folds (a+b, a/b) and 14% are all-alpha (mostly helices) or all-beta (mostly beta-strands) proteins. When only the core domains that belong to one organism (E. coli) are considered, over 80% of them prove to be DNA/nucleotide-interacting proteins. The core is compact: amino acid contents of domains from the core lie in relatively narrow and specific ranges. The core also contains several Fe-S cluster-binding domains, and the amino acid contents of the core overlap with the ferredoxin and CO-dehydrogenase clusters, the oldest known proteins. As Fe-S clusters are thought to be the first biocatalysts, the results are discussed in relation to contemporary experiments and models dealing with the origin of biological macromolecules. Most primordial proteins are proposed here to have originated through co-adsorption of nucleotides and amino acids on specific clays, followed by en bloc polymerization of the adsorbed amino acid mixtures.
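To make the idea of clustering in a 20-dimensional space of amino acid contents concrete, the following is a minimal illustrative sketch and not the procedure used in this work (which is described in Section 3.2). It assumes that each domain is represented by the percentage of each of the 20 standard amino acids, that distances are Euclidean, and that domains are grouped by a simple distance cutoff. The names `composition`, `threshold_clusters` and `cutoff` are hypothetical; `cutoff` only stands in for the kind of empirical parameter mentioned above.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (fixed axis order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return a 20-element list of amino acid contents in percent."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


def distance(u, v):
    """Euclidean distance between two composition vectors (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def threshold_clusters(vectors, cutoff):
    """Group points linked by chains of pairwise distances <= cutoff.

    This is plain single-linkage grouping via union-find, chosen only
    for illustration. Returns lists of indices, largest cluster first.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if distance(vectors[i], vectors[j]) <= cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)
```

With toy sequences, `threshold_clusters([composition(s) for s in ("AAAC", "AACA", "WWWW")], cutoff=5.0)` places the first two sequences, which have identical compositions, in one cluster and leaves the third on its own. Varying `cutoff` changes how many small clusters merge into the large one.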