[Frontiers in Bioscience S4, 1333-1343, June 1, 2012]
Computational methods for the analysis of tag sequences in metagenomics studies
Qin Chang1, Yihui Luan1, Ting Chen2,3, Jed A. Fuhrman4, Fengzhu Sun2,3
1. ABSTRACT Metagenomics commonly refers to the study of genetic materials directly derived from environments without culturing. Several ongoing large-scale metagenomics projects related to human and marine life, as well as pedology studies, have generated enormous amounts of data, posing a key challenge for efficient analysis, as we try to 1) understand microbial organism assemblage under different conditions, 2) compare different communities, and 3) understand how microbial organisms associate with each other and the environment. To address such questions, investigators are using new sequencing technologies, including Sanger, Illumina Solexa, and Roche 454, to sequence either particular genes, called tag sequences, mostly 16S or 18S ribosomal RNA sequences or other conserved genes, or whole metagenome shotgun sequences of all the genetic materials in a given community. In this paper, we review computational methods used for the analysis of tag sequences. 2. INTRODUCTION Metagenomics is the study of genetic materials derived directly from a microbial community without culturing. The term "metagenome" was first used by Handelsman et al. (1) in 1998. Since then, many large-scale metagenomics projects have been undertaken, including, for example, the human microbial project (HMP, http://commonfund.nih.gov/hmp/), the MetaHIT project (http://www.metahit.eu/), the Global Ocean Survey (http://www.jcvi.org/cms/research/projects/gos/), ICOMM (http://icomm.mbl.edu/) and the Earth Microbiome Project (http://www.earthmicrobiome.org/). Enormous amounts of metagenomics data have resulted using sequencing technologies such as Sanger, Illumina Solexa, Roche 454 and others. The challenges for metagenomics studies include the direct sampling of genetic materials from the microbial communities, data storage and data analysis (2, 3). Here we review computational methods for analyzing particular genes, called tag sequences, which are mostly 16S or 18S rRNA sequences. 
It was found that 16S/18S rRNA genes can be horizontally transferred between different species and one species can contain multiple copies of 16S/18S genes. Thus, the use of 16S/18S rRNA genes may not be optimal for community comparison. Other single copy house-keeping genes such as rpoB (4) and other conserved genes (5) were also used for comparing microbial communities. The methods described in this review can equally be applicable to such sequence data. Microbes are key players and ubiquitous in almost all natural and man-made habitats. It is estimated that the population of bacteria may be as high as Traditional microbiological studies heavily depend on in vitro studies. However, only a very small fraction of microbial organisms can be cultured, and the use of culturing techniques alone significantly limits our understanding of the microbial world. Various culture-independent methodologies able to retrieve genetic materials directly from natural environments have been developed. These culture-independent techniques have revealed high microbial diversity present in many different environments. Among these techniques, profiling, or fingerprinting, methods provide information on the whole community at once, usually in the form of a list of gene fragments representing different operational taxonomic units (OTUs), and the OTUs are supposed to represent closely related organisms. These methods include T-RFLP (6, 7), DGGE (8, 9), and ARISA (10, 11). By allowing relatively easy analysis and simultaneous comparison of many samples, fingerprinting studies have revealed spatial and temporal patterns among the OTUs and environmental factors (12-14). However, profiling-based methods do not yield detailed information about the microbial organism composition within communities. Fortunately, with the rapid development of sequencing technologies and significant drop of sequencing cost, it is now possible to use new sequencing technologies to study community diversity. 
These studies have shown that changes of the microbial community structure within the human body are associated with human health, such as obesity (15-17) and clinically defined bacterial vaginosis (18). The human gut microbial community can significantly change after treatment with antibiotics (19, 20). Modern molecular techniques have revealed high microbial diversity in various human tissues. In the first phase of the HMP, investigators studied the composition of the microbial communities in various human tissues. In the second phase, the microbial communities associated with human diseases will be identified. Several studies have been carried out, including the study of microbial communities in the gut (21-23), saliva (24), skin (25, 26), vagina (18) and stomach (27). In addition to HMP, many other metagenomics projects, including marine and pedology studies (28-36), are underway or are in the planning stages. For instance, Dinsdale et al. (37) compared the metagenomic communities of 45 distinct microbiomes and 42 distinct viromes and found that they have distinct metabolic profiles. To date, most studies of microbial diversity have used ribosomal RNA (rRNA) sequences, in particular 16S and 18S, because they are ubiquitous and largely well conserved during evolution (38). Other types of gene sequences have also been used (4, 5). These sequences are sometimes called tag sequences. Because tag sequences are generally short, very deep sequencing is possible. Tag sequences are generally highly conserved and they can be used to study the microbial organism compositions in communities. However, it is impossible to study the functions of individual genes based on tag sequences. To study functions of genes and pathways, whole genome shotgun sequences are needed. 
With the accumulation of enormous amounts of sequence data, there is an urgent need for novel computational tools able to analyze them and link the results to knowledge databases, such as Greengenes (39) and SILVA (40), to learn how different organisms interact with each other and with the environment. In this paper, we review computational methods for the analysis of tag sequences, including how to 1) classify the tag sequences into OTUs, 2) compare different communities, and 3) study the association of OTUs and environmental factors. 3. OPERATIONAL TAXONOMIC UNIT (OTU)-BASED ANALYSIS OF METAGENOMICS COMMUNITIES The comparison of different communities is an important problem in many different fields, including ecology and microbiology. Many different measures, termed beta diversity, have been proposed to compare communities, and many of the methods were reviewed in (41). Some studies comparing communities based on gene contents and their metabolic functions (32, 42) depend heavily on the accuracy of the functional annotation process. Here, we concentrate on operational taxonomic unit (OTU)-based methods using tag sequences. In this section, we review computational methods for defining OTUs and for comparing communities based on OTUs. 3.1. Computational methods for the identification of OTUs Tag sequences can be grouped into different clusters such that the sequences in each cluster are similar, but sequences in different clusters have relatively large differences. The sequences in each cluster form an operational taxonomic unit (OTU). The motivation for using OTUs is as follows. Microbial communities are usually highly diverse, containing hundreds to thousands of microbial organisms. However, tag sequences of only a small fraction of these organisms are known and well studied. Thus, studies based on the relationships of known tag sequences may be biased and may not give a full understanding of the microbial diversity in communities.
On the other hand, OTUs do not depend on the available information about the known tag sequences and thus present an unbiased view of the microbial diversity. However, the OTUs do depend on the algorithms used and it is still an active area of research for the optimal definition of OTUs. Although the definition of OTUs is conceptually simple, the computational implementation for the identification of biologically meaningful OTUs has turned out to be a very challenging problem, and optimal methods are still being debated and developed. The difficulties in defining biologically meaningful OTUs can be attributed to several factors. First, there is a large quantity of tag sequence data from metagenomics projects, which mandates that the computational algorithm be both storage and computationally efficient. Second, errors are present in sequence reads which can make the number of predicted OTUs much higher than the true number of OTUs present in microbial communities. Third, many different clustering approaches are available, and it is not clear which clustering methods give the most biologically meaningful results. Recent active studies on this topic have begun to answer some of these questions. Two steps are frequently used in algorithms for defining OTUs. The first step is the calculation of distances between any pair of sequences. Some programs used multiple sequence alignment (MSA) to first align the sequences. Afterwards, the distance between any pair of sequences is calculated on the basis of this alignment (43, 44). Schloss (45) and White et al. (46) studied factors that can significantly affect estimating the diversity of individual communities, termed alpha diversity, and comparing multiple communities, termed beta diversity. These factors include the quality of the MSA, the inclusion/exclusion of variable regions along the tag sequences, and the distance calculation methods between groups of tag sequences. 
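As a rough illustration of these two steps, the sketch below computes pairwise distances from sequences that are assumed to be already aligned, and then merges clusters agglomeratively until the closest pair of clusters is farther apart than a threshold. The function names, the gap handling, and the toy sequences are assumptions made for this example, not the implementation of any program cited here. The choice between complete and average linkage corresponds to the second step discussed in the following paragraph.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatching columns between two aligned sequences.

    Assumption for this sketch: columns where both sequences carry a gap
    ('-') are skipped; a gap against a base counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def cluster_otus(seqs, threshold, linkage="average"):
    """Agglomerative clustering of sequence indices into OTUs.

    linkage="complete": group distance is the maximum pairwise distance.
    linkage="average":  group distance is the mean pairwise distance.
    Merging stops when the closest pair of groups exceeds `threshold`.
    """
    dist = {
        (i, j): p_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def d(i, j):
        return dist[(i, j)] if i < j else dist[(j, i)]

    def group_distance(g, h):
        pairs = [d(i, j) for i in g for j in h]
        return max(pairs) if linkage == "complete" else sum(pairs) / len(pairs)

    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > 1:
        best_d, gi, hi = min(
            (group_distance(g, h), gi, hi)
            for (gi, g), (hi, h) in combinations(enumerate(clusters), 2)
        )
        if best_d > threshold:
            break
        clusters[gi] = clusters[gi] + clusters[hi]
        del clusters[hi]  # hi > gi, so index gi is unaffected
    return clusters


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTTT", "TTTTACGTAC"]
    for method in ("average", "complete"):
        print(method, cluster_otus(toy, 0.16, method))
```

On this toy input, average linkage groups the first three sequences into one OTU, whereas complete linkage keeps the third sequence separate, which shows how the linkage rule alone can change the number of OTUs.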
It is well known that MSA for a very large number of sequences, e.g., on the order of The second step in defining OTUs is to cluster the sequences based on the pairwise distances (43, 44, 47, 48). Some programs used complete linkage in hierarchical clustering (43, 44), where the distance between two groups of sequences is defined as the maximum distance between sequence pairs from the two groups. However, recent studies showed that average linkage, where the distance between two groups of sequences is defined as the average distance between sequence pairs from the two groups, may give more biologically meaningful OTUs than complete linkage (47, 49, 50). Previously, defining OTUs has required a threshold value so that the distances between any two clusters would be above the threshold. However, if OTUs correspond to actual species, then no such threshold value exists, as confirmed by recent studies (51). In addition, experimental errors such as PCR errors and sequencing errors are unavoidable, and as a result, hierarchical clustering overestimates the number of OTUs. To avoid using a threshold value in clustering, a probabilistic Bayesian clustering method, termed Clustering 16S rRNA for OTU Prediction (CROP), was recently proposed to cluster sequencing data and define OTUs (48). CROP models the sequencing data with a Gaussian mixture model, and uses a soft threshold for clustering. It was shown to accurately estimate the number of OTUs when applied to a sequence dataset of mixtures of cultures (48). Ye (52) proposed AbundantOTU to group sequences from closely related abundant species. The algorithm does not depend on pairwise or multiple sequence alignments, but is based on a consensus alignment algorithm that defines abundant OTUs. This algorithm can avoid the problem of the relatively high error rates of next-generation sequencing technologies.
However, it cannot align sequences belonging to rare OTUs, and these sequences can be analyzed using the other approaches described above. When comparing different methods for defining OTUs, investigators designed benchmark data by either experimentally sequencing a community with known microbial species (49, 53) or computationally selecting a set of species and introducing errors in these sequences according to the sequence error models of sequencing technologies (46, 50, 51). One criterion used to evaluate algorithms for defining OTUs is comparing the number of OTUs given by the algorithms with the known number of species in the simulated community. Recently, Sun et al. (51) proposed using normalized mutual information (NMI) (54) and F-score (55) to evaluate algorithms for defining OTUs. The NMI score evaluates overall clustering by penalizing two types of errors: assignment of sequences from different species into the same OTU and assignment of sequences from the same species into different OTUs. The NMI score is 1 if the clustering completely agrees with the species, and it is close to 0 if the clustering of sequences is not related to the species where the tag sequences come from. As a complementary evaluative method, F-score was proposed to compare clustering of sequences from an algorithm with true underlying species classification of sequences (51). Given N sequences from m species (S1, S2, ..., Sm), we suppose that an algorithm clusters the sequences into n clusters (C1, C2, ..., Cn). Let aij be the number of sequences from species Si that are clustered into cluster Cj, i = 1, 2, ..., m; j = 1, 2, ..., n. For species i and cluster j, the precision and recall are defined as
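$$P_{ij} = \frac{a_{ij}}{\sum_{i'=1}^{m} a_{i'j}}, \qquad R_{ij} = \frac{a_{ij}}{\sum_{j'=1}^{n} a_{ij'}},$$
and the F-score is defined by
$$F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}}.$$
Finally, the F-score for the clustering from the algorithm is defined as
$$F = \sum_{i=1}^{m} \frac{\sum_{j=1}^{n} a_{ij}}{N} \, \max_{1 \le j \le n} F_{ij}.$$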
Thus, the F-score will be one if the clusters from the algorithm are the same as the species classification of the sequences. Otherwise, the F-score is small. Both NMI and F-score have been used to evaluate the quality of OTU classifications using known sequences from a mixture of species. The number of tag sequences in metagenomics studies is generally huge, usually on the order of 3.2. Comparison of communities based on OTUs Suppose that there are multiple microbial communities and that tag sequences from each of the communities are obtained. How can we compare the communities based on the tag sequences? One commonly used approach is to cluster all the tag sequences from all the communities into OTUs and then measure the differences among the communities using some distance measures, termed beta diversity, based on the distribution of OTUs in the different communities. Various beta diversity measures (41, 59) can be used to compare the communities. Specifically, beta diversity measures can be grouped into qualitative or quantitative measures. Qualitative measures, such as classic versions of Jaccard, Lennon, and Dice, consider the presence/absence of OTUs within communities without considering their abundance. On the other hand, quantitative measures, such as classic versions of Bray-Curtis, Canberra, Euclidean, and Chao's statistic (60), take the abundance of OTUs into consideration. Recently, Kuczynski et al. (61) studied 14 quantitative and 9 qualitative beta diversity measures based on OTUs. They showed that these measures have varied abilities to identify the relationships between community microbial composition and 1) environmental changes, or 2) community clusters. For example, Chi-square and Pearson correlation distances perform extremely well in identifying environmental gradients of the communities, while Gower and Canberra distances perform well in identifying community clusters. 
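To make the distinction between qualitative and quantitative measures concrete, the following minimal sketch computes one measure of each kind, Jaccard and Bray-Curtis, from per-community OTU counts. The dictionary layout, the function names, and the community and OTU names are assumptions made for this example; pipelines such as QIIME or MOTHUR provide their own implementations.

```python
def jaccard_distance(a, b):
    """Qualitative measure: uses only presence/absence of each OTU.

    a, b: dicts mapping OTU name -> sequence count in one community.
    Returns 1 - (shared OTUs) / (OTUs present in at least one community).
    """
    present_a = {otu for otu, n in a.items() if n > 0}
    present_b = {otu for otu, n in b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Quantitative measure: uses the abundance of each OTU.

    Returns sum |a_k - b_k| / sum (a_k + b_k) over all OTUs k.
    """
    otus = set(a) | set(b)
    num = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    den = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    return num / den if den else 0.0


if __name__ == "__main__":
    # Hypothetical OTU tables for three communities.
    c1 = {"OTU1": 10, "OTU2": 0, "OTU3": 5}
    c2 = {"OTU1": 1, "OTU2": 4, "OTU3": 5}
    c3 = {"OTU1": 1, "OTU3": 14}
    print(jaccard_distance(c1, c2), bray_curtis(c1, c2))
    print(jaccard_distance(c1, c3), bray_curtis(c1, c3))
```

Communities `c1` and `c3` contain the same OTUs in very different proportions, so their Jaccard distance is 0 while their Bray-Curtis distance is 0.6; this is the kind of difference that only a quantitative measure can detect.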
These beta diversity measures have been incorporated into several metagenomics analysis pipelines, including QIIME (62) and SONS (63), which is currently incorporated into MOTHUR (44). Another novel network-based community comparison method was reported in (22), where OTUs and communities were abstracted to nodes in a bipartite graph. In this scheme, an OTU is connected to a community if the OTU is present in the community. The weight of the edge is the number of sequences in the OTU belonging to the community. Network analysis tools such as Cytoscape (http://www.cytoscape.org/) can then be used to analyze the network. 4. PHYLOGENY-BASED METHODS FOR COMPARING METAGENOMICS COMMUNITIES Phylogenetic methods are those that take evolutionary relationships of the sequences into consideration in the comparison of communities. Here, we briefly review some of the approaches, while a more complete comparison of such methods is given in (64). 4.1. The OTU-based beta diversity measures have two main disadvantages. The first, as we have seen in subsection 3.1, involves the difficult problem of accurately defining OTUs. Mistakes in defining the OTUs may lead to misleading results about community relationships. Secondly, OTUs are treated as equal in terms of presence/absence or abundance levels, even though some of them may be closely related and some may not. To overcome these shortcomings, Martin (65) introduced two statistics borrowed from population genetics and systematics for comparing samples, The
where There are various statistics for estimating genetic diversity in a sample. One that is commonly used takes the average nucleotide differences between two randomly chosen sequences from the sample, as calculated by
where k is the number of distinct sequences, The phylogenetic (P) test, also known as the parsimony test (66), can be described as follows. First, a phylogenetic tree, including all the sequences in the samples, is generated using a phylogenetic analysis tool such as PHYLIP (http://evolution.gs.washington.edu/phylip.html). Each sequence is labeled according to the community the sequence comes from. Based on this observed tree, the minimum number of changes needed to explain the labels, termed parsimony score, is calculated. If the two communities are the same, the labels of the sequences and the phylogenetic tree should be unrelated. In the literature, two randomization methods were proposed to test the hypothesis that the two communities are the same. The first approach is to randomize the tree for the sequences and keep their labels. The second approach is to randomize the labels of the sequences without changing the phylogenetic tree. For each approach, the p-value is approximated by the fraction of times that the resulting parsimony score for the randomized sample is equal to, or smaller than, the parsimony score for the original data. The p-values obtained from the two randomization approaches can be different because of the different randomization processes. Actually, the two randomization approaches test for different specific hypotheses. By randomizing the tree, the P-test tests the hypothesis that sequences from the two communities associate with each other through a random phylogenetic tree. By randomizing the labels of the sequences, the P-test tests the hypothesis that the sequences from the two communities are randomly distributed along the leaves of the observed phylogenetic tree. Both approaches have been used to compare communities. 
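The label-randomization variant of the P-test can be sketched as follows. The tree encoding (nested 2-tuples with leaf names as strings), the use of Fitch's counting rule for the parsimony score, and all function names are assumptions made for this illustration; this is not the TreeClimber or MOTHUR code, and a tree produced by PHYLIP would first have to be parsed into this form.

```python
import random


def parsimony_score(tree, labels):
    """Minimum number of label changes on a rooted binary tree (Fitch count).

    tree:   nested 2-tuples; a leaf is a string (sequence name).
    labels: dict mapping sequence name -> community label.
    """
    changes = 0

    def states(node):
        nonlocal changes
        if isinstance(node, str):
            return {labels[node]}
        left, right = (states(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # the two subtrees disagree: one change is required
        return left | right

    states(tree)
    return changes


def leaf_names(tree):
    """List the leaf names of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaf_names(child)]


def p_test(tree, labels, n_perm=999, seed=0):
    """Shuffle community labels over the leaves of the fixed tree.

    The p-value is the fraction of shuffles whose parsimony score is
    equal to or smaller than the observed score, as described above.
    """
    rng = random.Random(seed)
    names = leaf_names(tree)
    observed = parsimony_score(tree, labels)
    values = [labels[name] for name in names]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        if parsimony_score(tree, dict(zip(names, values))) <= observed:
            hits += 1
    return observed, hits / n_perm


if __name__ == "__main__":
    # Hypothetical tree in which the two communities form separate clades.
    tree = ((("a1", "a2"), ("a3", "a4")), (("b1", "b2"), ("b3", "b4")))
    labels = {name: name[0].upper() for name in leaf_names(tree)}
    print(p_test(tree, labels))
```

Because the two communities occupy separate clades in this toy tree, the observed score is 1 and few random labelings do as well, so the estimated p-value is small. Randomizing the tree instead of the labels would require a tree generator and is not shown.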
As a test strategy, the phylogenetic (P) test cannot be used as a measure of beta diversity because the p-value depends on the number of sequences in each individual community in addition to differences among all communities. The phylogenetic (P) test has been implemented in TreeClimber (63), which is now included in MOTHUR (44). 4.2. UniFrac, weighted UniFrac and variance adjusted weighted UniFrac Two other widely used phylogenetic methods for comparing communities are UniFrac and weighted UniFrac (W-UniFrac), both proposed by Lozupone et al. (67, 68), and they have been widely used in many studies, e.g. (22, 69, 70). Similar to the phylogenetic (P) test, a phylogenetic tree composed of sequences from all the communities is needed, and each sequence is labeled according to the sample it comes from. UniFrac measures the distance between communities by the fraction of branch length of the tree that leads to descendants from each of two single communities, but not from both communities (67), whereas weighted UniFrac takes abundance information into consideration and weights each branch length by the difference of the fractions of sequences from the two communities belonging to the branch (68). UniFrac and W-UniFrac are calculated using the following equations: UniFrac W-UniFrac where n is the number of branches in the tree, and Based on UniFrac and weighted UniFrac, we recently proposed a new quantitative measure (72), termed variance adjusted weighted UniFrac (VAW-UniFrac). Compared to weighted UniFrac, this new statistic adjusts the weights of branch lengths according to the variance of VAW-UniFrac where Despite the wide applications of UniFrac and W-UniFrac, some potential problems have been observed (64) when they are used to cluster communities based on the observation that their mean values decrease with the number of sequences from the two communities. 
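Following the verbal definitions of UniFrac and weighted UniFrac given above, a minimal sketch on a toy tree might look like this. The tree encoding, the function names, and the choice of the unnormalized ("raw") form of weighted UniFrac are assumptions made for this example; the variance adjustment of VAW-UniFrac is not reproduced here.

```python
def branch_counts(node, counts_a, counts_b, out):
    """Collect (length, A_i, B_i) for every branch of a rooted tree.

    node: (branch_length, payload), where payload is either a leaf name
          (str) or a list of child nodes.
    counts_a, counts_b: dicts mapping leaf name -> number of sequences
          observed in community A or B.
    Returns the (A_i, B_i) counts below the branch leading to `node`.
    """
    length, payload = node
    if isinstance(payload, str):
        a, b = counts_a.get(payload, 0), counts_b.get(payload, 0)
    else:
        a = b = 0
        for child in payload:
            ca, cb = branch_counts(child, counts_a, counts_b, out)
            a += ca
            b += cb
    out.append((length, a, b))
    return a, b


def unifrac(tree, counts_a, counts_b):
    """Return (UniFrac, weighted UniFrac) for two communities.

    UniFrac: branch length leading to only one community, divided by
             the branch length leading to either community.
    Weighted UniFrac (raw form, an assumption of this sketch):
             sum of b_i * |A_i / A_T - B_i / B_T| over all branches.
    """
    branches = []
    total_a, total_b = branch_counts(tree, counts_a, counts_b, branches)
    if total_a == 0 or total_b == 0:
        raise ValueError("both communities need at least one sequence")
    unique = sum(l for l, a, b in branches if (a > 0) != (b > 0))
    observed = sum(l for l, a, b in branches if a > 0 or b > 0)
    weighted = sum(l * abs(a / total_a - b / total_b) for l, a, b in branches)
    return unique / observed, weighted


if __name__ == "__main__":
    # Hypothetical four-leaf tree; the root branch has length 0.
    tree = (0.0, [
        (1.0, [(0.5, "s1"), (0.5, "s2")]),
        (1.0, [(0.5, "s3"), (0.5, "s4")]),
    ])
    community_a = {"s1": 3, "s2": 1}
    community_b = {"s2": 1, "s3": 2, "s4": 1}
    print(unifrac(tree, community_a, community_b))
```

For this toy input, 2.5 of the 4.0 units of branch length lead to a single community, giving a UniFrac value of 0.625, and the raw weighted UniFrac value is 2.25.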
The simulations used by Lozupone and colleagues agreed with this observation; that is, when the number of sequences is relatively small, e.g., less than 1000, the mean values of UniFrac and weighted UniFrac decrease with the number of sequences from the communities, but their mean values become stable when the number of sequences is greater than 1000 (73). Thus, UniFrac and W-UniFrac depend on the number of sequences from the communities. To overcome this potential problem, Lozupone et al. (73) suggested using bootstrap to sample the same number of sequences from the communities, thus providing a method of comparison when the number of sequences from some communities is relatively small. As an extension of W-UniFrac, VAW-UniFrac experiences this same problem; hence, the bootstrap strategy should be employed when the concern warrants it. Another more philosophical issue about UniFrac is that it assumes that "differences" between communities are proportional to the phylogenetic distances of their constituent members. This may be true for some questions, but not all. It depends on how the distance is interpreted, as factors like ecological roles do not uniformly follow phylogeny. So the "ecological scale" of phylogenetically close and far distances is inherently not predictable. 5. ASSOCIATION NETWORKS OF OTUS AND ENVIRONMENTAL FACTORS Microbial organisms do not function independently in communities. Instead, they interact with each other and with environmental factors (ENV). Without precise knowledge about organisms within communities, we can study the association of OTUs and, as a consequence, form OTU networks. Given the distribution of OTUs in a community across multiple time points, locations, or environmental conditions, Pearson correlation or Spearman correlation can be used to study the association of OTUs and ENVs. An OTU/ENV network can then be constructed, assuming that two OTUs are connected if their abundance levels are significantly associated.
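One possible way to build such a network is sketched below: Spearman correlation is computed for every pair of OTU or ENV profiles, a permutation p-value is attached, and pairs passing a significance level become edges. The profile names, the permutation scheme, and the 0.05 cutoff are assumptions of this example, and no correction for multiple testing is applied.

```python
import random
from itertools import combinations
from math import sqrt


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def association_network(profiles, alpha=0.05, n_perm=999, seed=0):
    """Return edges (u, v, rho) for significantly associated profiles.

    profiles: dict mapping an OTU or ENV name -> list of values measured
              at the same time points, locations, or conditions.
    """
    rng = random.Random(seed)
    edges = []
    for u, v in combinations(sorted(profiles), 2):
        x, y = profiles[u], list(profiles[v])
        rho = spearman(x, y)
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(y)
            if abs(spearman(x, y)) >= abs(rho):
                hits += 1
        if (hits + 1) / (n_perm + 1) <= alpha:
            edges.append((u, v, rho))
    return edges


if __name__ == "__main__":
    # Hypothetical abundance and temperature profiles at 8 time points.
    profiles = {
        "OTU1": [1, 2, 3, 4, 5, 6, 7, 8],
        "OTU2": [2, 4, 5, 9, 11, 12, 20, 30],
        "OTU3": [5, 1, 7, 2, 8, 3, 6, 4],
        "temperature": [30, 28, 25, 20, 18, 15, 11, 9],
    }
    for edge in association_network(profiles):
        print(edge)
```

The resulting edge list could be exported to a network analysis tool for visualization. Negative values of `rho` are kept because negative associations may also be of ecological interest.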
For presence/absence of OTUs across many different time points, locations, or environmental conditions, an OTU co-occurrence network can also be obtained as in (74). Two OTUs are connected if they are more likely to co-occur than expected, for example if they prefer similar environmental conditions or if they facilitate each other's survival, as in cooperative relationships like symbioses. Network analysis tools, such as Pajek (75) and Cytoscape (76), can be used to analyze microbial OTU/ENV association networks. Note there are also significant negative associations that may imply interactions like competition or predation, or preference for opposite seasons. With metagenomics data from a series of time points, i.e., time series data, it is possible to define time-delayed-local association between OTU/ENVs, as defined in (77). Standard statistical approaches, such as Pearson or Spearman correlation, may not be able to capture such complex interactions in reality. For example, it was found that two OTUs may only associate within a subset of the time interval of interest. Moreover, it is possible that one OTU, OTU1, may have a time-delayed response to the abundance changes of another OTU, OTU2, thus creating a time-delayed association, as might, for example, be the case in the administration of antibiotics or host immune response to pathologic overload. As suggested, linear regression and Pearson or Spearman correlation will most likely fail to detect the relationship between OTU1 and OTU2 in such situations in that these statistics can only detect global linear relationships between OTU/OTU and OTU/ENV pairs. Obviously, these problems call for the exploration of alternate analytical methods, and in order to identify such complicated relationships between OTU/OTU and OTU/ENV pairs, we developed local similarity analysis (LSA) with time delays to study the relationship between OTU/OTU and OTU/ENV pairs (77). 
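The core idea, a score that rewards a strong association over a sub-interval and allows a bounded time delay, can be sketched with a small dynamic program. This is a simplified illustration written for this text, not the LSA software: the function names, the rank-based normal-score transformation, the handling of ties, and the permutation scheme are assumptions, and the sign of the association and the q-value step are omitted.

```python
import random
from statistics import NormalDist


def normal_scores(values):
    """Replace values by normal quantiles of their ranks.

    Assumption of this sketch: ties are broken by position.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def local_similarity(x, y, max_delay):
    """Largest absolute running sum of products x[i] * y[j] along any
    diagonal with |i - j| <= max_delay, divided by the series length.

    pos tracks positively associated stretches, neg negatively associated
    ones; a running sum is reset to 0 whenever it would drop below 0.
    """
    n = len(x)
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if abs(i - j) > max_delay:
                continue
            prod = x[i - 1] * y[j - 1]
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + prod)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - prod)
            best = max(best, pos[i][j], neg[i][j])
    return best / n


def lsa_p_value(x, y, max_delay, n_perm=499, seed=0):
    """Permutation p-value of the score after normal-score transformation."""
    rng = random.Random(seed)
    xs, ys = normal_scores(x), normal_scores(y)
    observed = local_similarity(xs, ys, max_delay)
    shuffled = list(xs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if local_similarity(shuffled, ys, max_delay) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical series: otu2 repeats otu1 with a delay of two time points.
    otu1 = [3, 9, 4, 1, 7, 12, 2, 8, 15, 6, 5, 11, 14, 10, 13, 0]
    otu2 = [6.5, 2.5] + otu1[:-2]
    print(lsa_p_value(otu1, otu2, max_delay=0))
    print(lsa_p_value(otu1, otu2, max_delay=2))
```

Allowing a larger `max_delay` can only increase the score, which is why significance has to be assessed by randomization with the same delay setting rather than by comparing raw scores.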
The following procedures were used to identify potentially time-delayed-local associations. First, the abundance levels of each OTU across the time series are normalized so that they can be considered samples from the standard normal distribution. Second, a dynamic programming algorithm is used to find potentially time-delayed-local intervals with the highest absolute correlation. Third, a p-value is calculated by randomization of the normalized abundance levels of the OTUs. Fourth, the p-values are transformed to q-values for each pair of sequences, and an OTU network can be constructed by thresholding on the p-values or q-values. In most biological studies, both technical and biological noise are unavoidable. Here technical noise indicates errors introduced by the experiments and biological noise indicates randomness introduced during the sampling process. To study the effects of such noise on the local similarity score, biological/technical replicate experiments are usually carried out. We recently extended the original LSA to the situation with replicates, termed extended LSA (eLSA) (78). With replicates, we are able to obtain the bootstrap confidence interval for the LS score. The LSA software can be downloaded from http://meta.usc.edu/softs/lsa. The local association network approach has been applied to several environmental biological studies, and interesting results about the association of OTUs and environmental factors were obtained and discussed, e.g., findings reported in (79-83). For example, Steele et al. (81) used LSA to build the largest and most comprehensive ecological network in the ocean. We expect that network-based analysis of OTU/ENVs will play more important roles as more time series data become available. 6. DISCUSSION Metagenomics is a rapidly developing field, and both tag and whole-genome shotgun sequence data are available.
However, because of the large amounts of data, there is an urgent need for efficient computational tools to analyze these large datasets in order to understand microbial organism assemblage under different conditions, compare different communities, and understand how microbial organisms associate with each other and the environment. In this paper, we reviewed computational approaches for tag sequence analysis, including the definition of OTUs, the use of OTU- and phylogeny-based methods to compare metagenomics communities, and the construction of OTU/ENV networks to study how OTUs associate with each other and with the environments. We have seen that classifying sequences into different OTUs is an extremely difficult problem. Many shortcomings of the available methods for defining OTUs have been identified, but problems associated with new algorithms have not yet come to light. Clustering itself is an exploratory tool and can give deep insight into the microbial diversity of communities at various levels of phylogenetic resolution. Due to the highly complex nature of the evolution of genomes, we recognize that OTUs based on one or a few tag sequences cannot perfectly correspond to microbial species (the characterization or even formal existence of which is still frequently debated); however, the distribution of OTUs defined by clustering still has interesting and valuable ecological interpretations, e.g. (84). Although average linkage in hierarchical clustering tends to yield more stable and biologically meaningful OTUs than complete linkage, we doubt that hierarchical clustering is the optimal strategy for defining OTUs. Importantly, for short sequences in particular (as currently determined by the next-generation sequencing approaches like Illumina), the information in the distance matrix between the sequences may not be enough to cluster the sequences into clusters with certainty.
Instead, probabilistic clustering may be a more reasonable alternative to hierarchical clustering of tag sequences. Specifically, it is not possible to determine if two specific sequences are definitely in the same cluster. Instead, we only know the probability that they are in the same cluster. To accommodate this idea, we developed a new method, termed CROP (48), which does not force a sequence into one cluster, but rather into different clusters based on probabilities for them to be in each cluster. At the same time, however, it has to be acknowledged that probabilistic clustering is computationally demanding and difficult to explain to non-statisticians. Despite the shortcomings of probabilistic clustering, we expect that further improvement in the computational speed of CROP will, in turn, improve OTU definition. Once the OTUs are defined, many beta diversity measures can be used to compare communities. For instance, the study of Kuczynski et al. (61) highlighted the differences among a variety of beta diversity measures in recovering environmental gradients and clustering communities. However, it is not clear how the mis-specification of OTUs affects the results from different beta diversity measures. We also reviewed phylogeny-based methods for comparing communities, including the parsimony test, UniFrac, weighted UniFrac, and our newly developed variance adjusted weighted UniFrac. UniFrac has been used in over 150 metagenomics studies, and important biological insights have been gained. On the other hand, all the methods we reviewed assume that the tree is given and is correct, that the tag sequences correctly place the sequence in a single place on the tree, and that the distances of interest between communities are reflected by phylogenetic distances. 
Placement on robust trees is most accurate when we match longer tag sequences unambiguously to existing RNA classification schemes, such as RDP of 16S RNA sequences, since the 16S RNA sequences are well studied, and detailed phylogenetic relationships among them are known. On the other hand, for short sequences that match multiple sequences in different parts of the tree nearly equally well, and for other types of tag sequences (non-16S) where the phylogenetic relationships are not clear, problems may emerge from errors in placing the sequences on the phylogenetic tree, which suggests the need to further study the effects of such errors on phylogeny-based beta diversity measures. Understanding how OTUs associate with each other and with the environment is another very important problem. Initial efforts to establish OTU co-occurrence networks have highlighted the importance of such an approach (74). Our previous analysis of marine time series data using ARISA showed interesting association patterns among OTUs and environmental factors (77, 81). Time series tag sequence data are now available (84), and local similarity analysis of such data, as described above, is giving us more detailed information on the association of microbial organisms. Nonetheless, more sophisticated local similarity analysis approaches are needed to identify other association patterns that cannot be discovered by the current version of LSA. 7. ACKNOWLEDGMENTS This research was partially supported by NSFC grants 11071146, 60928007, and 60805010, and the National Basic Research Program of China (973 Program, No. 2007CB814901). QC is supported by Graduate Independent Innovation Foundation of Shandong University (GIIFSDU). TC, JF and FS are partially supported by US NSF DMS1043075 and OCE 1136818. 8. REFERENCES 1. J. Handelsman, M. R. Rondon, S. F. Brady, J. Clardy and R. M.
Goodman: Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 5(10), R245-9 (1998)
7. W. T. Liu, T. L. Marsh, H. Cheng and L. J. Forney: Characterization of microbial diversity by determining terminal restriction fragment length polymorphisms of genes encoding 16S rRNA. Appl Environ Microbiol 63(11), 4516-22 (1997)
41. A. E. Magurran: Measuring biological diversity. Blackwell Pub., Malden, Ma. (2004)
59. P. Legendre and L. Legendre: Numerical ecology. Elsevier, Amsterdam; New York (1998)
Abbreviations: HMP: human microbial project; OTUs: operational taxonomic units; rRNA: ribosomal RNA; NMI: normalized mutual information; MSA: multiple sequence alignment; PSA: pairwise sequence alignment; P-test: phylogenetic test; W-UniFrac: weighted UniFrac; VAW-UniFrac: variance adjusted weighted UniFrac; ENV: environment factors
Key Words: Tag sequences, Metagenomics, Operational taxonomic units (OTUs), Community comparison, 16S rRNA, Phylogeny, Environment factors, Review
Send correspondence to: Fengzhu Sun, Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089-2910, USA, Tel: 1-213-740-2413, Fax: 1-213-740-8631, E-mail: fsun@usc.edu