Journal Publication
Sample Size Requirements in Case-only Designs to
With advances in molecular genetic technology, more studies will examine gene-environment interaction in disease etiology. If the primary purpose of the study is to estimate the effect of gene-environment interaction in disease etiology, one can do so without employing controls. The case-only design has been promoted as an efficient and valid method for screening for gene-environment interaction. The authors derive a method for estimating sample size requirements, present sample size estimates, and compare minimum sample size requirements to detect gene-environment interaction in case-only studies with case-control studies. Assuming independence between exposure and genotype in the population, the case-only design is more efficient than a case-control design in detecting gene-environment interaction. In addition, the authors illustrate a method to estimate sample size when information on marginal effects (relative risk) of exposure and genotype is available from previous studies.
With advances in molecular genetic technology, genetic markers have been
used increasingly in case-control studies to search for gene-environment
interaction (1-8). Concerns about selecting appropriate control groups for case-control
studies have led to the development of several nontraditional approaches to the study of
genetic factors (9). If one's primary interest is to assess possible interaction between genetic
and environmental factors in the etiology of a disease, one can use the case-only design,
which does not require controls. This design has been promoted as an efficient and valid approach
for screening for gene-environment interaction under the assumption of independence
between exposure and genotype in the population (10-11). To help identify situations in
which a case-only design may be preferable to the case-control design, we present a method
for estimating the sample size required to detect gene-environment interaction with a
case-only study and compare it with the sample size required for a case-control study.
For simplicity, we assume the exposure and susceptibility genotype are dichotomous variables. A key assumption that underlies use of the case-only design to study interaction effects is independence of exposure and genotype in the population (10-13). We also assume that background risk, unrelated to either exposure or genotype, exists and that the disease is rare so that the odds ratio estimates the risk ratio. Table 1 shows the expected frequency distribution among members of a population according to the presence or absence of exposure and genotype.
To calculate sample size, one needs to specify the prevalence of exposure (e), the prevalence of genotype (g), the relative risk for exposure alone (Re), the relative risk for genotype alone (Rg), the effect of the gene-environment interaction (Ri), the case-control ratio, the type I error (α), and the type II error (β) (12). As shown by Smith and Day (14), one may calculate the required sample size for specified values of the odds ratio of interaction (Ri), type I error, and type II error from equation (1), where VN is the variance of the logarithm of Ri under the null hypothesis, VA is the corresponding variance under the alternative hypothesis, and Zα/2 and Zβ are the normal deviates that cut off the corresponding areas in the tails of the standard normal distribution. For a case-control study, using the notation of Table 1, the results of Smith and Day (14) gave:
where Ai is the solution of:
Because there is no closed-form expression for the expected cell counts under the null hypothesis of no interaction, we used the Mantel-Haenszel approximation (RMH) to estimate VN, as suggested by Smith and Day (14). The number of cases required with a case-control design is derived by solving equation (1) for n:
where vN = nVN and vA = nVA.
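As a minimal sketch of this step, the code below assumes that equation (1) has the usual Smith and Day form Zα/2·√VN + Zβ·√VA = ln(Ri), so that solving for n gives n = (Zα/2·√vN + Zβ·√vA)² / (ln Ri)², with the per-case variances v = nV. The equation itself is not reproduced in this text, so that form is an assumption, and the function name `required_cases` and its parameters are illustrative only.

```python
import math
from statistics import NormalDist


def required_cases(v_null, v_alt, r_i, alpha=0.05, beta=0.20):
    """Number of cases needed to detect an interaction odds ratio r_i.

    v_null and v_alt are the per-case variances of ln(Ri) under the null
    and alternative hypotheses (v = n * V). The formula is the assumed
    Smith-and-Day-type expression described in the lead-in, not a
    transcription of the paper's equation.
    """
    if r_i <= 0 or r_i == 1:
        raise ValueError("r_i must be positive and different from 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    numerator = (z_alpha * math.sqrt(v_null) + z_beta * math.sqrt(v_alt)) ** 2
    return math.ceil(numerator / math.log(r_i) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: four equally likely cells give v = 4 / 0.25 = 16.
    print(required_cases(v_null=16.0, v_alt=16.0, r_i=2.0))
```

Rounding up with `math.ceil` is a design choice here, so that the returned number of cases never falls short of the computed requirement.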
We followed a similar approach to derive a formula for the number of cases required with a
case-only design to detect gene-environment interaction (Ri). The expected distribution of the
cases according to exposure and susceptibility genotype is summarized in Table 2, and the cross
product of the data in Table 2 gives Ri. Under the assumption of independence between exposure
and genotype in the population, Ri obtained from cases only measures departure from a
multiplicative joint effect of exposure and genotype (9,10). The variance of the logarithm of Ri
based on expected values under alternative
We test the null hypothesis Ri = 1 with the test statistic ln(Ri)/√VN, where VN is the variance
of ln(Ri) under the null hypothesis. Based on expected values, VN is calculated from the
marginal totals as:
We used the values of VN and VA to calculate, using equation (5), the required number of cases
in a case-only design to detect interaction given various prevalences of exposure (e) and
genotype (g) in the population. For comparison, we also calculated the sample size required in
a case-control design.
Estimating sample size when only marginal effects of exposure and genotype are known
By solving equations (8) and (9) for Re and Rg, we can express Re and Rg as functions of R'e,
R'g, and Ri and then use those functions in equation (5) to calculate sample size. Expressions
and proof of unique positive solutions for Re and Rg are in Appendix 1. We also calculated the
number of cases required in case-control and case-only designs to detect
Sample sizes for various levels of Re, Rg and Ri of gene-environment interaction
As seen in Table 3, the case-only design requires fewer cases than the case-control design to
detect interaction. As one would expect, greater interaction (Ri) is associated with increased
power to detect interaction, and the required sample size is smallest if the prevalences of
exposure and genotype are within the range of 30% to 50%.
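The case-only calculation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The cell proportions among cases are taken to be proportional to (1-e)(1-g), e(1-g)Re, (1-e)g·Rg, and e·g·Re·Rg·Ri, which follows from independence and the rare-disease model. The alternative variance is taken as the sum of reciprocal cell proportions, and the null variance is built from the products of the marginal proportions. The sample size formula is the assumed Smith-and-Day-type expression. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def case_proportions(e, g, r_e, r_g, r_i):
    """Expected share of cases in each (exposure, genotype) cell.

    Assumes exposure and genotype are independent in the population and
    that relative risks are 1, r_e, r_g and r_e * r_g * r_i.
    """
    weights = {
        (0, 0): (1 - e) * (1 - g),
        (1, 0): e * (1 - g) * r_e,
        (0, 1): (1 - e) * g * r_g,
        (1, 1): e * g * r_e * r_g * r_i,
    }
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}


def case_only_variances(e, g, r_e, r_g, r_i):
    """Per-case null and alternative variances of ln(Ri) (assumed forms)."""
    p = case_proportions(e, g, r_e, r_g, r_i)
    # Alternative: sum of reciprocals of the four expected cell proportions.
    v_alt = sum(1 / share for share in p.values())
    # Null: cells replaced by products of the case marginal proportions.
    p_exposed = p[(1, 0)] + p[(1, 1)]
    p_carrier = p[(0, 1)] + p[(1, 1)]
    v_null = sum(
        1 / (row * col)
        for row in (p_exposed, 1 - p_exposed)
        for col in (p_carrier, 1 - p_carrier)
    )
    return v_null, v_alt


def case_only_cases(e, g, r_e, r_g, r_i, alpha=0.05, beta=0.20):
    """Cases needed in a case-only design under the assumptions above."""
    v_null, v_alt = case_only_variances(e, g, r_e, r_g, r_i)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    n = (z_alpha * math.sqrt(v_null) + z_beta * math.sqrt(v_alt)) ** 2
    return math.ceil(n / math.log(r_i) ** 2)


if __name__ == "__main__":
    # Hypothetical grid: equal exposure and genotype prevalence, Re = Rg = 1.
    for prevalence in (0.1, 0.3, 0.5):
        print(prevalence, case_only_cases(prevalence, prevalence, 1.0, 1.0, 2.0))
```

The cross product of the four proportions returned by `case_proportions` equals Ri, which is the property the case-only design relies on.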
Sample size calculation using marginal effects of exposure and genotype
Example
We next attempted to determine the number of cases required for different values of Ri assuming we know the marginal effects of exposure (R'e) and genotype (R'g) from previous studies. From the above example, it can be calculated that R'e = 1.5 and R'g = 2. We assumed that the prevalence of the genotype (g) is 15 percent and that the prevalence of the exposure (e) is 25 percent. Because the prevalence of the genotype (g) is better documented than the prevalence of the exposure (e), we varied the values of Ri and e and calculated the number of cases required for a case-control and for a case-only design. As shown in Figure 1, a case-control study with 100 cases and 200 controls would have low power to detect possible interaction effects with Ri < 5, whereas a case-only study with 100 cases would have moderate power if the exposure prevalence were greater than 15 percent.
Our results show that the case-only design is more efficient than the case-control design for detecting gene-environment interaction under the assumption of independence between exposure and genotype in the population. Our findings are consistent with other studies, which showed that when the exposure and genotype are independent in the population, case-only studies produce more precise estimates of the interaction between exposure and genotype than do case-control studies (10-11). The power to detect interaction increases with the magnitude of the interaction (Ri).
The approach we used to calculate sample size is based on large sample variances. In some extreme situations, for example, a large interaction coupled with a common exposure and genotype, some of the expected cell sizes become very small for the calculated sample size. If any expected cell size is less than five for a given sample size, we suggest recalculating the required sample size for a less extreme situation. For example, one may recalculate sample size assuming a smaller degree of interaction.
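The two steps discussed in this section, recovering Re and Rg from marginal effects and checking expected cell sizes, can be sketched as below. Equations (8) and (9) are not reproduced in this text, so the sketch assumes the marginal relative risks follow from averaging over the other factor under independence: R'e = Re[(1-g) + g·Rg·Ri] / [(1-g) + g·Rg], and symmetrically for R'g. Eliminating Rg then leaves a quadratic in Re with exactly one positive root. The function names and the variable names `m_e` and `m_g` (standing for R'e and R'g) are hypothetical.

```python
import math


def marginal_effects(e, g, r_e, r_g, r_i):
    """Marginal relative risks (R'e, R'g) implied by Re, Rg, Ri (assumed form)."""
    m_e = r_e * ((1 - g) + g * r_g * r_i) / ((1 - g) + g * r_g)
    m_g = r_g * ((1 - e) + e * r_e * r_i) / ((1 - e) + e * r_e)
    return m_e, m_g


def solve_stratum_effects(e, g, m_e, m_g, r_i):
    """Recover (Re, Rg) from marginal effects by solving a quadratic in Re."""
    k = (1 - g) + g * m_g
    a = e * r_i * k
    b = (1 - e) * ((1 - g) + g * m_g * r_i) - m_e * e * ((1 - g) * r_i + g * m_g)
    c = -m_e * (1 - e) * k
    # a > 0 and c < 0, so the discriminant is positive and one root is positive.
    r_e = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    r_g = m_g * ((1 - e) + e * r_e) / ((1 - e) + e * r_e * r_i)
    return r_e, r_g


def expected_case_counts(n, e, g, r_e, r_g, r_i):
    """Expected number of cases per (exposure, genotype) cell among n cases."""
    weights = {
        (0, 0): (1 - e) * (1 - g),
        (1, 0): e * (1 - g) * r_e,
        (0, 1): (1 - e) * g * r_g,
        (1, 1): e * g * r_e * r_g * r_i,
    }
    total = sum(weights.values())
    return {cell: n * w / total for cell, w in weights.items()}


if __name__ == "__main__":
    # Values from the example in the text: R'e = 1.5, R'g = 2, g = 0.15, e = 0.25.
    # Ri = 5 and n = 100 are illustrative choices.
    r_e, r_g = solve_stratum_effects(0.25, 0.15, 1.5, 2.0, 5.0)
    counts = expected_case_counts(100, 0.25, 0.15, r_e, r_g, 5.0)
    print(r_e, r_g)
    print(counts)
    if min(counts.values()) < 5:
        print("An expected cell is below five; consider a less extreme scenario.")
```

With Ri = 1 the marginal and stratum-specific effects coincide under these formulas, which gives a quick sanity check on the solver.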
The case-only design cannot evaluate an individual's relative risk associated with exposure alone (Re) or genotype alone (Rg). If the marginal effects of exposure (R'e) and genotype (R'g) are available from previous studies, our approach allows one to calculate the sample size required to study interaction.
Although exposure and genotype should typically be independently distributed in the population (9), the independence assumption may be violated in some instances. For example, individuals with delayed alcohol metabolism as a
The gene-environment interaction (Ri) derived from a case-only design measures a departure from multiplicative effects. The appropriateness of using such interaction in epidemiologic studies has been discussed elsewhere (20-22). Studies have shown that many biologically plausible modes of gene-environment interaction involve a departure from multiplicative effects (23). If the true underlying model of joint effect is additive, the odds ratio of interaction (Ri) derived from a case-only design may not be an appropriate description of the risk in relation to exposure and genotype (9).
In conducting a case-only study, one should follow the same epidemiologic principles of case selection as one would in conducting a case-control study. A population-based consecutive series of incident cases is ideal. Selecting cases from the general population would be one way to help make the findings of such a study more generalizable.
Researchers are increasingly searching for gene-environment interactions in disease. Examples of such studies include: smoking, TaqI polymorphism, and cleft palate (6-7); lung cancer in relation to debrisoquine metabolic phenotypes (2); glutathione S-transferase class mu, smoking, and sister chromatid exchange (SCE) levels in lung cancer (23); polymorphism at cytochrome p4502E1 with gastric and esophageal cancer due to cigarette smoking and other dietary factors (3); N-acetylation phenotype and bladder cancer (1,5); and cigarette smoking, N-acetylation phenotype, and breast cancer (8). With the rapid advances in molecular technology, one may expect that interest in finding the effects of gene-environment interaction in disease etiology will increase. We believe that, in many instances, the case-only design can be a useful tool with which to rapidly screen for gene-environment interaction.
Appendix
Rearranging equation (A2) for Rg, we have:
We now substitute equation (A3) into equation (A1), and solve equation (A1) for Re, to obtain:
Re is a quadratic function of e, g, R'e, R'g, and Ri. If we define:
Acknowledgements