[Frontiers in Bioscience E4, 2670-2682, June 1, 2012]

A score-statistic approach for determining threshold values in QTL mapping

Chen-Hung Kao1, Hsiang-An Ho1

1Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, R.O.C.

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Experimental populations
3.1. Advanced populations
3.2. Genotypic distributions
4. Statistical model of interval mapping for QTL
4.1. Statistical model
4.2. Conditional probabilities
5. Likelihood of the statistical model
6. Score test statistics
6.1. Score functions
6.2. Score test statistics
6.3. Asymptotic forms of score test statistics
7. Gaussian stochastic process
7.1. Covariance between test statistics
7.2. Covariance between trait means
8. Simulating the null distribution
9. Real example and simulation studies
10. Discussion
11. Acknowledgements
12. Appendix
13. References

1. ABSTRACT

Issues in determining the threshold values of QTL mapping are often investigated for the backcross and F2 populations with relatively simple genome structures so far. The investigations of these issues in the progeny populations after F2 (advanced populations) with relatively more complicated genomes are generally inadequate. As these advanced populations have been well implemented in QTL mapping, it is important to address these issues for them in more details. Due to an increasing number of meiosis cycle, the genomes of the advanced populations can be very different from the backcross and F2 genomes. Therefore, special devices that consider the specific genome structures present in the advanced populations are required to resolve these issues.  By considering the differences in genome structure between populations, we formulate more general score test statistics and Gaussian processes to evaluate their threshold values. In general, we found that, given a significance level and a genome size, threshold values for QTL detection are higher in the denser marker maps and in the more advanced populations. Simulations were performed to validate our approach.

2. INTRODUCTION

The statistical model of interval mapping (IM) proposed by Lander and Botstein (1) is generally a normal mixture model, as the genotypes of the quantitative trait locus (QTL) are not observable and needed to be inferred from its flanking markers. In the parameter estimation of the normal mixture model, the maximum likelihood estimation is commonly implemented to obtain the maximum likelihood estimates (MLE) through the EM algorithm (2) by treating the model as an incomplete-data problem. Typically, the presence of a QTL, i.e. the null hypothesis of no QTL, is tested over the all possible positions in the whole genome by using likelihood ratio test (LRT) statistics, and the position with the significantly maximum LRT statistic is regarded as the estimated QTL location. Under this framework, it has been recognized that the determination of the threshold values for claiming significant QTL detection (rejecting the null hypothesis) along the genomes is one of the complicated and important issues in QTL mapping (3-4) for the following reasons. One is that the QTL position is unidentified under the null hypothesis, and the maximum LRT statistic does not follow the standard χ2 distribution asymptotically (4). Further, various factors, such as the number and size of intervals, population genome structures, and informativeness of markers, will involve in and should be considered in determining the threshold value for claiming QTL detection (5-8). Besides, because multiple correlated and uncorrelated tests are performed in searching for QTLs on the whole genome, the common pointwise significance level is not appropriate and genomewise significance level should be considered in QTL mapping (1,4,6,8).

Several theoretical and simulation approximations have been proposed to determine the threshold values of QTL detection. Lander and Botstein (1) suggested using Bonferroni argument for sparse-map case and ORENSTEIN-UHLENBECK diffusion for dense-map case to determine the threshold value. For intermediate situations, extensive numerical simulations have been used to determine the thresholds (1,9). Churchill and Doerge (10) suggested using a permutation test for determining an appropriate threshold values for specific data sets. Rebai, Goffinet and Mangin (11) used Davies's bound (12) to derive a conservative formulas for calculating the approximate thresholds for backcross and F2 populations. They demonstrated good performance of their formulas using simulation. Dupuis and Siegmund (13) provided approximate formulas to calculate the threshold for the case of very dense markers in the population, but they did not take interval mapping into account in their approximation. Piepho (14) also used Davies's bound to propose a quick method for computing the approximate thresholds for general designs. The quick method is computationally inexpensive and claimed to be an alternative to permutation procedure. Zou et al. (8) and Chang et al. (4) proposed a score-statistic framework to assess the threshold values. As the maximum of the square of score test statistics is asymptotically equivalent to the maximum LRT statistics, the threshold values derived from the score test statistics can be used as those for the LRT statistics in the population (4, 8, 15-16).

On the basis of score test statistics, Zou et al. (8) proposed a resampling approach to obtain the threshold values mainly for the F2 population by simulating the F2 genome structure. Chang et al. (4) also devised a score-statistic method for computing the threshold values in a backcross population by analytically analyzing the backcross genome structure. Chang et al. (4) showed that score test statistics along the genomes is a Gaussian stochastic process with mean zero and well-structured covariance, and they used them to compute the threshold values of QTL detection in the backcross population. The score-statistic method not only can be less computationally demanding than the permutation test and numerical simulation, but also can be more accurate than previous approximate formulas in the computation of the threshold values (4,8). So far, most of the studies of assessing the threshold values for QTL detection are investigated in the backcross and F2 populations (8, 11 and 14 discussed the threshold values for F3 populations), still they are generally lacking or inadequate for the progeny populations after F2 (advanced populations). These advanced populations, such as recombinant inbred (RI) and advanced intercrossed (AI) populations, have been well devised and implemented in genetic studies. For example, Bai et al. (17) used RI populations in rice and Kelly et al. (18) implemented AI populations in mice for investigating the genetic architectures of quantitative traits in their studies. For specific populations where time is not an issue, the advanced populations can have some very useful properties in that their genomic structures allow researchers to yield better results in their investigations. Therefore, it is of importance to address the issues of assessing the threshold values for these populations in more details in QTL mapping study. Due to the fact that these advanced populations undergo multiple meiosis cycles, their genomic structures, such as homozygosity, genotypic frequency and variance components, are differing and can be very different from the backcross and F2 genomes. In this article, by distinguishing between different population genome structures, we formulate more general score test statistics and Gaussian processes under the framework of interval mapping to compute the threshold values and to study their behaviors for various populations. One of the keys to our approach is to devise the genotypic distributions of two, three and four genes of the populations into the formulations, so that their specific genome structures can be well described to address these issues across various populations. Simulation studies are performed to evaluate our approach to assessing the threshold values. The R program of our approach is available on http://www.stat.sinica.edu.tw/~chkao/.

3. EXPERIMENAL POPULATIONS

3.1. Advanced populations

Various experimental populations have been designed for the study of QTL mapping. Among these populations, the backcross and F2 populations have been the most widely used designs in the studies. Besides, the progeny populations from the F2, which are called advanced populations, are also very common. These experimental populations are produced as follows. A cross between two parental inbred lines, P1 and P2, is performed to produce an F1 population. If the F1 individuals are backcrossed to P1 or P2, it produces the backcross population. If the F1 individuals are selfed or randomly mated, it produces an F2 population. If the F2 population is further selfed and/or randomly mated for generations, the produced progeny populations will be called advanced populations. These advanced populations include recombinant inbred (RI) and advanced intercrossed (AI) populations. As the RI populations or AI populations is obtained by repeatedly selfing (inbreeding) or randomly intermating the F2 individuals for several generations, their genomes will be subject to more meiosis cycles as compared to the backcross and F2 populations. Therefore, the genomic constitutions, such as homozygosity, genotypic distribution and linkage disequilibrium, of the advanced populations will be different from each other (19) and no longer similar to those of the backcross and F2 populations. In general, selfing will increase the homozygosity at the expense of heterozygotes. Also, further meiosis cycles tend to accumulate crossovers so that the proportion of recombinants will increase and linkage disequilibrium between genes will reduce in the populations. These features in the advanced populations can be very useful and may allow for more productive investigations. For example, the RI populations can increase the homozygosity to assist the detection of additive effects in QTL mapping (20). Further, the AI populations can harbor more recombination events in a short chromosome segment for genetic fine mapping and may provide better power in the separation of closely linked QTL (20-21).

3.2. Genotypic distributions

As will be shown and explained later, an important key to successfully compute the threshold values for these populations is to well devise the genotypic distributions of one, two, three and four genes of the populations into the formulations, so that the specific genome structures of the populations can be well considered under our proposed framework. We now explain briefly how these genotypic distributions are obtained in different populations. Consider an F2, AI or RI population used for the QTL mapping studies. In general, for m genes, there are 2m different gametic genotypes and 2(2m-1)+2m/2 zygotic genotypes. Therefore, for m=1, 2, 3 and 4, there are 2, 4, 8 and 16 gametic genotypes and 3, 10, 36 and 136 zygotic genotypes. As the different populations undergo different numbers of meiosis cycle, they will have different distributions of gametic and zygotic genotypes in the genomic constitutions. For one gene, there are three possible genotypes, P1 homozygote, heterozygote and P2 homozygote. The frequencies of these three genotypes are expected to be 1/4, 1/2 and 1/4, respectively, in the AI populations. In the RI populations, the frequency of heterozygote is halved in each selfing cycle. For m=2, 3 or 4, the genotypic frequencies in different AI and RI populations can be obtained by using the transition equations of Haldane and Waddington (22), Geiringer (23), and Kao and Zeng (20, 24). These transition equations, which are derived under the assumptions of Haldane map function and equal crossover value in two sexes, aim to obatin the genotypic distributions of the populations in different generations when individuals are subject to random mating or selfing process. In RI populations, the 5, 20 and 72 transition equations in Haldane and Waddington (22) and Kao and Zeng (20, 24) can be used to compute the frequencies of the 10, 36 and 136 genotypes of two, three and four genes. In AI populations, the gametic frequencies are first computed, and then the genotypic frequencies can be obtained from the product of gametic frequencies. To obtain the frequencies of the 4, 8 and 16 gametic frequencies for m=2, 3 and 4, Geiringer's approach (23) can be used to formulate the transition equations of gametic frequencies or the sets of transition equations provided by Kao and Zeng (20, 24) can be directly used. Conceptually, it is also possible to obtain the genotypic distributions for any number of genes in any AI and RI populations by extending their approaches. However, there may be too many equations, and each equation contains numerous terms.

4. STATISTCAL MOEL OF INTERVAL MAPPING FO QTL

4.1. Statistical model

The data structure of QTL mapping generally consists of two parts, ( for the quantitative trait value and ( for the genetic markers. An interval mapping statistical model for testing a QTL, Q, at the position in an interval, , flanked by markers, M (with alleles M and m) and N (with alleles N and n), is proposed as

  (1)

where is the quantitative trait value of the th individual, and are the additive and dominance effects of Q, and defined as

are the coded variables of genotypes of Q, and is a random error. We assume follows N (0,). In general, as Q may not be coincident with a marker, its genotype is not observable and can be ( and ), ( and ) or ( and ) for an individual .

4.2. Conditional probabilities

Although Q is not observable, its genotypic distribution can be inferred from its flanking marker genotype according to the principle of conditional probability as

. (2)

Therefore, obtaining the above probability involves in the use of the genotypic distributions of two and three genes in the experimental populations. For any advanced population under consideration, the two flanking markers can have nine different genotypes, MN/MN, MN/Mn, Mn/Mn, MN/mN, MNmn (MN/mn or Mn/mN), Mn/mn, mN/mN, mN/mn and mn/mn. For each one of the nine marker genotypes, the genotype of Q can be , or . When M, N and Q are considered together, there are 27 different genotypes and 27 corresponding conditional probabilities. In the following, we denote these 27 conditional probabilities by 's, indexing the marker genotypes and indexing the QTL genotype. It should be pointed out that, in the population, it is straightforward to obtain 's, as the genotypic distributions of two and three genes have a simple relationship with the recombination fractions between M, Q and N. For example, the frequencies of digenic gametes Mq and qN are and , where and are the recombination fractions between Q and M and between Q and N. The frequency of trigenic gamete can be easily obtained by the product of the two digenic frequencies as , and the conditional probability of genotype given the flanking marker genotype MN/MN is simply as , where is the recombination fraction between M and N. Nevertheless, such a simple relationship for straightforwardly computing the conditional probabilities does not hold in the more advanced populations. Here, we implement the sets of transition equations proposed by Haldane and Waddington (22), Geiringer (23) and Kao and Zeng (20, 24) to obtain the genotypic frequencies of two and three genes as have been mentioned in the previous section. With these frequencies, the 27 conditional probabilities of QTL genotypes given the flanking marker genotypes in any populations can be obtained (see Table 1). These conditional probabilities play very important roles in constructing the statistical QTL mapping model for precise QTL mapping.

5. LIKLIHOOD OF THE STATISTICAL MODEL

For an individual , the unobservable Q at position can be , or with certain probabilities depending on the flanking marker genotypes. Now let , and denote the three probabilities of Q being , or for the individual , respectively, and these probabilities at position can be obtained from the 27 conditional probabilities in Table 1. That is 's, , can be found from 's, . If Q is (with probability ), the distribution follows , where . Similarly, Q can be or (with probability or ), and the distribution follows or , respectively, where and . Therefore, the likelihood of an individual is a mixture of three normals with different means and mixing proportions, ' s and ' s, . For a sample with individuals, the log likelihood function for = at position is the sum of the log likelihood of the individuals as

 (3)

Note that 's can be determined by the position and need not to be estimated here. To assist the following derivations of score statistics, we classify the individuals into nine categories according to their marker genotypes and reformulate equation (3) as

 (4)

where , , , , , , , and are the numbers of individuals with the nine marker genotypes MN/MN, MN/Mn, Mn/Mn, MN/mN, MNmn, Mn/mn, mN/mN, mN/mn and mn/mn, respectively. Note that the log likelihoods for the individuals with the same marker genotype have the same mixing proportions, ' s, in the reformulated equation.

6. SCORE TEST STATISTICS

6.1. Score functions

The score functions for the additive and dominance effects are the derivatives of the log likelihood (Equation (4)) with respective to the parameters, and , and using the MLEs of and , and , evaluated under and at position . Let and denote the score functions of and , respectively. The two score functions can be found as

If the individuals are classified into nine categories according to their marker genotypes, the score functions become

 (5)

and

  (6)

where ' s () and ' s, , are the frequencies of individuals and trait means in the nine flanking marker categories. To simplify the following derivations, now let 's and 's be

 (7)

and

 (8)

. Note that 's and ' s are closely related to the genotypic distributions of two and three genes, as 's and ' s are functions of the genotypic frequencies of two and three genes in the population. Then, under the null, the variances of and are

 (9)

and the covariance between and is

  (10)

6.2. Score test statistics

If only additive or dominance effect under the null, or , is considered, the score test statistic, or , is or divided by its standard deviation as

If both additive and dominance effects are considered at a time, we may define the score function as . The score test statistic, , for and against and at location takes the form

 (11)

where is the variance-covariance matrix of . The elements in are the variances and covariance of and in equations (9) and (10). The above derivations for score test statistics are relatively simpler as compared to the derivations for the MLE, as it avoids the parameter estimation of the normal mixture likelihood, which usually involves in the use of the iterated EM algorithm for obtaining the MLE (25). Also, as given by Cox and Hinkley (26) and Chang et al. (4), the maximum of the is asymptotically equivalent to the maximum of LRT. Therefore, the maximum of under the null hypothesis can be used to assess the threshold value of the maximum likelihood approach in QTL mapping.

6.3. Asymptotic forms of score test statistics

To understand the null distributions of the score test statistics, the asymptotically equivalent forms of , and are derived below. In derivations, we follow Haldane and Waddington (22) to use the oblique letters, and , to denote the genotypic frequencies. Let be the frequency of MN/MN or mn/mn genotype, be the frequency of Mn/Mn or mN/mN genotype, be the frequency of MN/Mn, MN/mN, Mn/mn or mN/mn genotype, be the frequency of MN/mn genotype, and be the frequency of Mn/nM genotype, respectively. Note that the exact values of and in the different advanced populations can be computed using sets of transition equations as mentioned in the previous sections. Therefore, for large in a population, we have , , and asymptotically and can formulate the asymptotic forms of , , , and in terms of and in any populations. We then have

 (12)

where

Also, due to symmetry, the term in (equation (7)) will be zero and in a large population. Further, it can be shown that , , and all follow asymptotically. Consequently, follows asymptotically . Similarly, we have the score test statistic of the dominance effect as

 (13)

where

Note that the term in (equation (8)) will be zero in the and AI populations, and it is in the RI populations. As , , , and follow asymptotically, it is also straightforward to show that follows asymptotically. If both effects are considered at a time, the asymptotic forms of , denoted by , can be obtained from equations (11) by using the asymptotic forms of , , , and . As , it implies in the populations.

7. GAUSSIAN STOCHASTIC PROCESS

7.1. Covariance between test statistics

The previous discussions mainly focus on deriving the score test statistics at a fixed position . As presented in Chang et al. (4), the score test statistics along the genomes can be described by Gaussian stochastic process asymptotically. To obtain Gaussian processes for the different populations, we now derive the relationship between the test statistics at two different positions by considering the changes in the genomic structure between populations. Now consider and at two different positions and in two distinct intervals, (A,B) and (C,D), respectively (note that A, B, C and D denote markers with alleles (A,a), (B,b), (C,c) and (D,d), and oblique letters , , , and denote the genotypic frequencies). To compute their covariances, we need to first obtain the covariances between their components at the two different positions. To obtain these component covariances, we reformulate and (Equations (12) and (13)) in much simpler expressions as

where 's and 's are the associated coefficients of and in and . For example,

and

Similarly, the remaining 's and 's are well defined and can be found in Equations (12) and (13). Then, the covariance between and can be expressed more succinctly as

 (14)

and the covariance between and is

 (15)

As 's and 's are constants in a population, the covariances in Equations (14) and (15) will depend on the covariances between and and between and in different positions, i.e. 's and 's. To obtain 's and 's, note that the two different positions and can be in the two neighboring or non-neighboring intervals. If they are in the neighboring intervals (flanking markers B and C are the same), evaluating 's and 's needs to consider the genotypic distributions of three loci. If they are in the non-neighboring intervals, the evaluation needs to take the genotypic distributions of four loci into account. For example,

 (16)

where 's and 's (' s and ' s) denote the trait means of marker genotypes, AB/AB and ab/ab (CD/CD and cd/cd), and () is the frequency of individuals with AB/AB or ab/ab genotype (CD/CD and cd/cd). This covariance needs to evaluate the four covariances, , , and , between trait means in different intervals.

7.2. Covariance between trait means

For large, we have

where () is the number of individuals with AB/AB or ab/ab (CD/CD or cd/cd) marker genotype. Note that and in a population. Therefore, we have

 (17)

Similarly, we can obtain the other components of covariances as

Since and . The four covariances between trait means in different intervals depend on the genotypic frequencies of two and four loci. Consequently, the covariance between and for non-neighboring intervals in Equation (16) can be obtained, and it is

which depends on the genotypic frequencies of four loci. Similarly, for neighboring intervals, the covariance between and is a function of the genotypic frequencies of three loci, and it is

The other covariances, such as 's, 's, 's, 's, ,'s and ,'s, and , can be obtained in a similar way by first deriving their component covariances, i.e. 's, . In general, obtaining these covariances between different and needs to first evaluate the covariances between different and in the different intervals (see Appendix), which will involve in evaluating 36 genotypic frequencies of three genes for neighboring intervals and 136 genotypic frequencies of for genes for non-neighboring intervals (20, 24). With these covariances, the covariances between the asymptotic score test statistics, , , and , in different intervals can be computed. It is found that the covariances between and either in the same or different intervals are zeros, i.e. (), and , asymptotically. Essentially, by incorporating these covariances into the variance-covariance matrix of Gaussian process, it can consider the differences in genome structure between different populations in computing the threshold values. The above derivations allow us to explore the behaviors of threshold values across populations.

8. SIMULATING THE NULL DISTRIBUTION

The scheme of simulating the null distributions of , and is outlined in this section. If only additive (dominance) effect is considered, we may simulate the components of ' s ( 's), i.e. ' s (' s) in our case, and then to compute 's ( 's) and () throughout the genomes as suggested by Chang et al. (4). If both effects are considered at a time, we suggest to simulate their subcomponents, 's, , (components of 's and 's) to obtain the asymptotic forms of , , , and and then to compute along the genomes using Equation (11), and finally to obtain . Note that it is also feasible to obtain () by simulating 's.

When simulating 's for every intervals, note that there are constraints on 's between the current th interval and the next th interval due to sharing a common flanking marker. These constraints are

(18)

 

  (19)

(20)

To simulate the null distribution of for a genome with intervals, we suggest the following steps:

1.Generate (, , , ) from , where is a vector containg the nine trait means in the first interval, and , , is a vector containing six of the nine trait means, e.g., , , , , and , in the th intervals, and is the variance-covariance matrix of the trait means. The construction of for the normal distribution needs to evaluate the covariances between different 's by using the genotypic distributions of three and four genes (see APPENDIX).

2.Compute the remaining three trait means, e.g., , and , for the 2nd to th intervals using the three constraints (equations (18), (19) and (20)). The dimension of is usually large and is .

3. Compute the score test statistics, 's, at all positions in the intervals along the genome, and find and record their maximum.

The above steps are repeated many times, say 10,000 times, to obtain the approximate distribution of . The threshold value at significant level can be determined accordingly. The R program of our approach is available on http://www.stat.sinica.edu.tw/~chkao/.

9. REAL EXAMPLE AND SIMULATION STUDIES

9.1. Real example

As a real data application, we considered a backcross model to compute the threshold value of QTL mapping in a pseudo-testcross population of Radiata pine. One hundred and twenty markers contributed genotypic information across twelve linkage groups and covered ~1679.3 cM. The traits considered are tree diameter, branch quality scores and brown cone number. The average spacing of the 107 marker intervals was 13.5 cM. The maximum distance between two adjacent markers is 74.8 cM, and the minimum is 1.5 cM. For this practice of QTL detection, Kao et al. (7) used Bonferroni argument to choose a value of 12.12 (χ21,0.05/107) as a threshold in the analysis. By using our approach, a larger threshold value 12.43 is obtained. By using 12.43 as a threshold of QTL mapping, the numbers of detected QTL for the three traits are the same as those by using 12.12 as a threshold in this case.

9.2. Simulation studies

Simulations were performed to evaluate the performance of the proposed method and study the behaviors of the threshold values under various marker densities in several experimental populations. For each population, one chromosome with a total length of 100 cM was simulated. On the chromosome, we assume that there are 101, 51, 21, 11, 6, 3 and 2 evenly spaced markers, i.e. the marker distances are 1, 2, 5, 10, 20, 50 and 100 cM, respectively. The experimental populations considered include F2, AI (RI) F3, AI (RI) F4, AI (RI) F5, AI (RI) F6 and AI (RI) F10 populations. Except for the RI F10 population, both additive and dominance effects are considered. For the RI F10 population, only the additive effect is considered, as there are very few heterozygotes due to continuous self. For each case considered, score test statistics are computed every 1 cM, and the number of simulated replicate is 10,000. The 10,000 maximums of the score test statistics along the chromosome are recorded. To validate our method, we also simulate 10,000 sets of the traits and markers from 200 individuals for each marker density and experimental population under the null hypothesis. Each data set is then analyzed by the interval mapping approach, and the LRT statistics were computed every 1 cM and the 10,000 maximums were obtained for comparison. Results of the threshold values at α=0.05 from the maximums of LRT statistics and score test statistics are given in Table 2. It is worth pointing out that the score-statistic approach is several hundred times faster than the LRT-statistic approach to complete the results in Table 2. This advantage of greatly saving computation time has been identified by Chang et al. (4).

Table 2 shows that the threshold values obtained from the two different approach are generally very close to each other, especially, in the F2 population. For example, the values by the score-statistic approach are 8.103, 8.966, 9.853, 10.574, 11.145, 11.732 and 11.959, respectively, for the sever marker densities in the F2 population, and those from the LRT statistic are 8.128, 8.978, 9.998, 10.481, 11.120, 11.750 and 12.318, respectively. For more advanced AI or RI populations, the differences between the threshold values seem to become relatively larger, especially for the sparse marker density in the more advanced AI populations. For example, in the AI F10 population, the threshold values from the score test statistic are 9.410 and 8.194 in the 50- and 100 cM-marker spacing, and those from the LRT statistic are 10.401 and 9.879. Also, the threshold values are increasing for denser marker maps, which is consistent with the findings in previous studies (8, 11). Such an increasing trend becomes more obvious in the more advanced populations as compared with that in the earlier populations. For example, the values for 100- and 1-cM marker spacing are 8.200 and 13.518 (8.273 and 13.257), respectively, in the AI (RI) F6 population, and they are 8.183 and 12.580 (8.076 and 12.447) in the AI (RI) F3 population. Besides, the threshold values are higher in the more advanced populations. For example, the values for 10-cM marker spacing are 10.574, 10.887 (10.859), 11.132 (11.186), 11.533 (11.315), 11.871 in the F2, AI (RI) F3, AI (RI) F4, AI (RI) F5, AI (RI) F6 and AI F10 populations. The threshold values in the AI populations are generally larger as compared to those in the RI populations (except for the case of 100-cM spacing in the F6 population). For example, the threshold values are 8.075, 9.080, 10.090, 10.859, 11.487, 12.212 and 12.447 for the seven marker densities in the RI F3 population, and they are 8.183, 9.100, 10.130, 10.887, 11.638, 12.325 and 12.580 in the AI F3 population. The similar trends can be also observed in the other AI and RI Ft populations.

10. DISCUSSION

When applying the interval mapping procedure to search the whole genomes for QTL, typically, the LRT statistics are constructed over the all possible positions to test for the null hypothesis of no QTL, and the position with the significant maximum of LRT statistic is regarded as the estimated QTL position. Under such a procedure, the determination of the threshold values for the test statistics to declare significance has been a central issue in QTL mapping. As the maximums of the score test and LRT statistics are asymptotically equivalent (4, 8, 26) and the computation cost of the score test statistics is much cheaper (4, 8), we propose a general score-statistic approach to computing the threshold values of QTL detection for various experimental populations. These experimental populations include F2, AI and RI populations. In our approach, the score test statistics are formulated in terms of the trait means of marker classes, mixing proportions, and the genotypic distributions of two and three genes of the populations. The asymptotic distribution of the score test statistics along the genomes is characterized by a Gaussian process with mean zero and well-structured variance-covariance matrix. We devise the genotypic distributions of three and four genes into the variance-covariance matrix to take care of the changes in the genomic structure between different populations, so that the threshold values for the different populations can be computed and their behaviors can be explored in various marker densities and genome sizes. The validity of our approach is compared with the LRT statistics by Monte Carlo simulations. In general, the threshold values obtained by our approach are very close to those by the LRT statistics in the various advanced populations as shown in Table 2. Given a significance level and a genome size, the threshold values should be enhanced in denser marker maps and in more advanced populations.

The different advanced populations are subject to different numbers of meiosis cycle either by inbreeding and/or random mating. They will produce different genomic structures, and their genotypic distributions will be different from each other. By recognizing such differences between populations, we incorporate the distributions of two, three and four genes of these populations into the score test statistics and Gaussian processes (equations~(12), (13), (14) and (15)). Therefore, our approach can consider their specific genome structures to well compute their threshold values and investigate their behaviors. More importantly, it has to be pointed out that the genotypic frequencies of three and four genes can be directly obtained by the genotypic frequencies of pairwise genes in the F2 populations, as the F2 genomes have a Markovian structure under Haldane map function (27). However, for advanced populations, this Markovian property disappears, and obtaining the genotypic frequencies of three and four genes is not straightforward. Here, we use the sets of transition equations proposed by Kao and Zeng (20, 24) to obtain these frequencies. If these frequencies are approximated by using Jiang and Zeng's method (28), which implicitly assumes a Markovian property, and the approximate frequencies are used in the construction of the test statistics and Gaussian processes in the computation of the threshold values. We found that the threshold values obtained by using the approximate genotypic frequencies are generally larger, especially in the denser marker maps and in the more advanced populations (results not shown), as compared to those by using exact frequencies.

The proposed score-statistic approach for computing the threshold value is mainly devised for the interval mapping model, i.e. a single QTL model. Under a single QTL model, the score test statistics for the additive and dominance effects in equations (12) and (13) are not mixtures and are functions of the nine trait means. The nine trait means correspond to the nine flanking marker genotypes whose proportions are adjusted to different population structures. In the setup of the variance-covariance matrices of Gaussian processes, the elements are obtained from the covariances between the test statistics (trait means). Therefore, it needs to consider the distributions of three (for neighboring intervals) and four genes (for non-neighboring intervals) at a time, and the dimension of the variance-covariance matrix is usually large. For example, the dimensions of the matrices are 63�63 (9+(10-1)�6 = 63) and 123�123 (9+(20-1)x6 = 123) for a chromosome with 10 and 20 intervals, respectively. If composite interval mapping (CIM; 6) or multiple interval mapping (MIM; 7) is considered, it will need to evaluate many more genotypic frequencies of more genes to construct a much higher-dimensional variance-covariance matrix in the processes. Taking a two-QTL MIM model, yi = μ+a1x1*+ a2x2*i, as an example, if searching a chromosome segment with 10 marker intervals for the first QTL (testing H0: a1=0 and a2 ≠ 0) by conditioning on the second QTL in the other region, the score test statistic for a1 can be classified into 81 different categories according to the 81 marker genotypes of the two flanking intervals (92=81), and each category contains mixture components. As the score statistic is a mixture, the derivations of its variance and asymptotic form are not simple. Furthermore, the dimension of the variance-covariance matrix in the Gaussian process will increase to 567x567 (81+(10-1)x54=567), and obtaining the covariance elements needs to evaluate the 2080 genotypic frequencies of six genes for this MIM model. If the second QTL is coincident with a marker, the two-QTL MIM model reduces to a CIM model. The corresponding score test statistic can be classified into 27 categories corresponding to 27 marker genotypes. The variance-covariance matrix of the Gaussian process has a 189x189 (27+(10-1)x18=189) dimension, and obtaining the covariance elements needs to evaluate the 528 genotypic frequencies of five genes for this CIM model. Certainly, if more QTL are considered at a time in the model, the obtaining of the score test statistics and Gaussian processes will be even more complicated as they will involve using the distributions of a large number of genes at a time in the populations. Consequently, the issues in determining the threshold values for the multiple-QTL models remain challenging. The approaches to unraveling these issues will not be in a straightforward manner and are worth pursuing in the future.

11. ACKNOWLEDGEMENS

This work was supported by grants NSC99-2118-M-001-011 from the National Science Council, Taiwan, Republic of China.

12. APPENDIX

Evaluation of the covariances between 's and 's, , at the two different positions, and , in two distinct marker intervals, and , respectively, are presented below. In general, the evaluation involves using the genotypic distributions of two, three and four genes in the populations. There are 81 covariances in total. If the two intervals are nonneighboring, the covariances are functions of the genotypic frequencies of two and four genes. They are listed below.

where , , and denote the alleles of markers , , and , respectively, define is the frequency of or genotype, is the frequency of or genotype, is the frequency of , , or genotype, is the frequency of genotype, and is the frequency of genotype, respectively.

If the two intervals are neighboring, the covariances are functions of the genotypic frequencies of two and three genes, and they can be obtained in a similar way. The frequencies of the 10, 36 and 136 genotypes for two, three and four genes in the different advanced populations can be obtained by using the transition equations of Haldane and Waddington (1931) and Kao and Zeng (2009, 2010).

13. REFERENCES

1. Lander E, D Botstein: Mapping Mendelian factors underlying quantitative traits using RELP linkage maps. Genetics 121, 185-199 (1989).

2. Dempster A, N Laird, D Rubin: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B 39, 1-38 (1977).

3. Lander E, N Schork. Genetic dissection of complex traits: Science 265, 2037-2048 (1994). doi:10.1126/science.8091226

4. Chang M, RL Wu, S Wu, G Casella: Score statistics for mapping quantitative trait loci. Statistical Applications in Genetics and Molecular Biology 8 Article 16 (2009). doi:10.2202/1544-6115.1386

5. Jensen R: Interval mapping of multiple quantitative trait loci. Genetics 135, 205-211 (1113).

6. Zeng ZB: Precision mapping of quantitative trait loci. Genetics 136, 1457-1468 (1994).

7. Kao CH, ZB Zeng, R Teasdale: Multiple interval mapping for Quantitative Trait Loci. Genetics 152, 1203-1216 (1999). doi:10.1017/S0016672399004255

8. Zou F, J Fine, J Hu, D Lin: An Efficient Resampling Method for Assessing Genome-Wide Statistical Significance in Mapping Quantitative Trait Loci Genetics 168, 2307-2316 (2004). doi:10.1534/genetics.104.031427

9. Van Ooijen J: Accuracy of mapping quantitative trait loci in autogamous species. Theor Appl Genet 84, 803-811 (1992). doi:10.1007/BF00227388

10. Churchill G, R Doerge: Empirical threshold values for quantitative trait mapping. Genetics 138, 963-971 (1994).

11. Rebai A, B Goffinet, B Mangin: Approximate thresholds of interval mapping tests for QTL detection. Genetics 138, 235-240 (1994).

12. Davies R. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 64, 247-254 (1977). doi:10.2307/2335690

13. Dupuis J, D Siegmund: Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151, 373-386 (1999).

14. Piepho HP: A quick method for computing approximate thresholds for quantitative trait loci detection. Genetics 157, 425-432 (2001). doi:10.1007/s001220100606

15. Schaid D, C Rowland, D Tines, R Jacobson, G Poland: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 70, 425-434 (2002). doi:10.1086/338688

16. Wang K & J Huang: A score-statistic approach for mapping quantitative trait loci with sibships of arbitrary size. Am J Hum Genet 70, 412-424 (2002). doi:10.1086/338659

17. Bai X, L Luo, W Yan, M Rao Kovi, W Zhan, Y Xing: Genetic dissection of rice grain shape using a recombinant inbred line population derived from two contrasting parents and fine mapping a pleiotropic quantitative trait locus qGL7. BMC Genetics, 11:16 (2010). doi:10.1186/1471-2156-11-16

18. Kelly S, D Nehrenberg, J Peirce, K Hua, B Steffy, T Wiltshire, F Villena, T Garland Jr., Daniel Pomp: Genetic architecture of voluntary exercise in an advanced intercross line of mice. Physiol Genomics, 42:190-200 (2010). doi:10.1152/physiolgenomics.00028.2010

19. Weir B: Genetic Data Analysis II. Sinauer Associates, Inc. Sunderland, Massachusetts (1996).

20. Kao CH, MH Zeng: An Investigation of the Power for Separating Closely Linked QTL in Experimental Populations. Genet Res 92, 283-294 (2010). doi:10.1017/S0016672310000273

21. Darvasi A: Experimental strategies for the genetic dissection of complex traits in animal models. Nat Genet 18, 19 24 (1998). doi:10.1038/ng0198-19

22. Haldane JBS, C Waddington: Inbreeding and linkage. Genetics 16, 357- 374 (1931).

23. Geiringer H: On the probability theory of linkage in Mendelian heredity. The Annals of Mathematical Statistics 15, 25 57 (1944). doi:10.1214/aoms/1177731313

24. Kao CH, MH Zeng: A study on the Mapping of Quantitative Trait Loci in Advanced Populations Derived from Two Inbred Lines. Genet Res 91, 85-99 (2009). doi:10.1017/S0016672309000081

25. Kao CH, ZB Zeng: General formulae for obtaining the maximum likelihood estimates and the asymptotic variance-covariance matrix in QTL mapping when using the EM algorithm. Biometrics 53, 653-665 (1997). doi:10.2307/2533965

26. Cox D, D Hinkley. Theoretical Statistics. Chapman & Hall: London (1974). doi:10.2307/2347138

27. Haldane JBS: The combination of linkage values and the calculation of distances between the loci of linked factors. J Genet 8, 299- 309 (1919).

28. Jiang C, ZB Zeng: Mapping quantitative trait loci with dominant and missing markers in various populations from inbred lines. Genetica 101, 47- 85 (1997). doi:10.1023/A:1018394410659

Abbreviations: QTL: quantitative trait loci, AI: advanced intercrossed, RI: recombinant inbred, MLE: maximum likelihood estimate, LRT: Likelihood ratio test

Key Words Advanced populations, Gaussian process, QTL mapping, Score statistics, Threshold values, Interval mapping, genotypic distribution, Review

Send correspondence to: Chen-Hung Kao, Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, ROC, Tel: 886-2-27835611x418, Fax: 886-2-27831523, E-mail:chkao@stat.sinica.edu.tw