Centers for Disease Control and Prevention
Office of Genomics and Disease Prevention  

 

 Journal Publication

This paper was published with modifications in: Am J Epidemiol 1997;146 (9): 713-720


Sample Size Requirements in Case-only Designs to Detect Gene-environment Interaction

by Quanhe Yang, Muin J. Khoury, and W. Dana Flanders



Abstract

With advances in molecular genetic technology, more studies will examine gene-environment interaction in disease etiology. If the primary purpose of the study is to estimate the effect of gene-environment interaction in disease etiology, one can do so without employing controls. The case-only design has been promoted as an efficient and valid method for screening for gene-environment interaction. The authors derive a method for estimating sample size requirements, present sample size estimates, and compare minimum sample size requirements to detect gene-environment interaction in case-only studies with case-control studies. Assuming independence between exposure and genotype in the population, the case-only design is more efficient than a case-control design in detecting gene-environment interaction. In addition, the authors illustrate a method to estimate sample size when information on marginal effects (relative risk) of exposure and genotype is available from previous studies.

With advances in molecular genetic technology, genetic markers have been used increasingly in case-control studies to search for gene-environment interaction (1-8). Concerns about selecting appropriate control groups for case-control studies have led to the development of several nontraditional approaches to the study of genetic factors (9). If one's primary interest is to assess possible interaction between genetic and environmental factors in the etiology of a disease, one can use the case-only design, which does not require controls. This design has been promoted as an efficient and valid approach for screening for gene-environment interaction under the assumption of independence between exposure and genotype in the population (10-11). To help identify situations in which a case-only design may be preferable to the case-control design, we present a method for estimating the sample size required to detect gene-environment interaction with a case-only study and present sample size estimates for several design scenarios. We also discuss situations in which information on marginal effects of exposure and genotype is available from previous studies.

Methods

For simplicity, we assume the exposure and susceptibility genotype are dichotomous variables. A key assumption that underlies use of the case-only design to study interaction effects is independence of exposure and genotype in the population (10-13). We also assume that background risk, unrelated to either exposure or genotype, exists and that the disease is rare so that the odds ratio estimates the risk ratio.

Table 1 shows the expected frequency distribution among members of a population according to the presence and absence of exposure and genotype. To calculate sample size, one needs to specify the prevalence of exposure (e), the prevalence of genotype (g), the relative risk for exposure alone (Re), the relative risk for genotype alone (Rg), the effect of the gene-environment interaction (Ri), the case-control ratio, the type I error (α), and the type II error (β) (12). As shown by Smith and Day, one may calculate the required sample size for specified values of the odds ratio of interaction (Ri), type I error, and type II error (14) from:


[Equation (1)]

where VN is the variance of the logarithm of Ri under the null hypothesis, VA is the corresponding variance under an alternative hypothesis, and Zα/2 and Zβ are normal deviates that cut off the appropriate areas in the tails of the standard normal distribution. For a case-control study, using the notation of Table 1, the results of Smith and Day (14) give:


[Equation (2)]

[Equation (3)]

where Ai is the solution of:

[Equation (4)]

 

Because there is no closed-form formula for the expected cell counts under the null hypothesis of no interaction, we used the Mantel-Haenszel approximation (RMH) to estimate VN, as suggested by Smith and Day (14). The number of cases required with a case-control design is derived by solving equation (1) for n:


[Equation (5)]

 

where V'N = nVN and V'A = nVA.
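The algebra behind this step can be sketched as follows. This is a reconstruction, not a copy of the published equations: it assumes that equation (1) has the usual Smith and Day form, that VN and VA are the variances for a study with n cases, and that the primed symbols V'N and V'A (notation assumed here) denote the corresponding variances per case.

```latex
% Assumed form of equation (1), with variances that shrink as 1/n
Z_{\alpha/2}\sqrt{V_N} + Z_{\beta}\sqrt{V_A} = \lvert \ln R_i \rvert ,
\qquad V_N = \frac{V'_N}{n}, \quad V_A = \frac{V'_A}{n}

% Substituting and solving for n
n = \frac{\left( Z_{\alpha/2}\sqrt{V'_N} + Z_{\beta}\sqrt{V'_A} \right)^{2}}{\left( \ln R_i \right)^{2}}
```

Written this way, the required number of cases grows with either variance and falls with the square of ln(Ri).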

We followed a similar approach to derive a formula for the number of cases required with a case-only design to detect gene-environment interaction (Ri). The expected distribution of the cases according to exposure and susceptibility genotype is summarized in Table 2, and the cross product of the data in Table 2 gives Ri. Under the assumption of independence between exposure and genotype in the population, Ri obtained from cases only measures departure from a multiplicative joint effect of exposure and genotype (9,10).
 

The variance of the logarithm of Ri based on expected values under the alternative hypothesis is:

[Equation (6)]

 

We test the null hypothesis Ri = 1 with the test statistic ln(Ri)/√V, where V is the null variance VN. Based on expected values, V is calculated from the marginal totals as:

[Equation (7)]

 

We used the values of V and VA to calculate, using equation (5), the number of cases required in a case-only design to detect interaction for various prevalences of exposure (e) and genotype (g) in the population. For comparison, we also calculated the sample size required in a case-control design.
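The case-only calculation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' program: it assumes the four case cells are proportional to (1-e)(1-g), e(1-g)Re, (1-e)gRg, and egReRgRi; that the alternative variance is the sum of reciprocal expected cell proportions; that the null variance uses cells rebuilt from the marginal totals; and that the test is two-sided. The function names are hypothetical, and results need not match the published tables exactly.

```python
from math import ceil, log, sqrt
from statistics import NormalDist


def case_only_cells(e, g, re, rg, ri):
    """Expected proportions of cases in the four exposure-by-genotype cells.

    Assumes exposure and genotype are independent in the population and
    that the joint relative risk is re * rg * ri (assumption of this sketch).
    Returns (neither, exposure only, genotype only, both).
    """
    weights = (
        (1 - e) * (1 - g),       # unexposed, non-susceptible (reference risk 1)
        e * (1 - g) * re,        # exposed, non-susceptible
        (1 - e) * g * rg,        # unexposed, susceptible
        e * g * re * rg * ri,    # exposed, susceptible
    )
    total = sum(weights)
    return tuple(w / total for w in weights)


def case_only_sample_size(e, g, re, rg, ri, alpha=0.05, power=0.80):
    """Approximate (unrounded) number of cases for a case-only design."""
    p00, p10, p01, p11 = case_only_cells(e, g, re, rg, ri)
    # Per-case variance of ln(Ri) under the alternative hypothesis.
    v_alt = sum(1 / p for p in (p00, p10, p01, p11))
    # Per-case null variance: cells are products of the marginal proportions,
    # so the sum of reciprocals factors into two terms.
    p_exp = p10 + p11   # proportion of cases exposed
    p_gen = p01 + p11   # proportion of cases with the genotype
    v_null = (1 / p_exp + 1 / (1 - p_exp)) * (1 / p_gen + 1 / (1 - p_gen))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test assumed
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha * sqrt(v_null) + z_beta * sqrt(v_alt)) ** 2 / log(ri) ** 2


if __name__ == "__main__":
    # Illustrative inputs only; round up to a whole number of cases.
    print(ceil(case_only_sample_size(e=0.3, g=0.3, re=1.0, rg=1.0, ri=5.0)))
```

Under these assumptions the result is unchanged when e is swapped with g and Re with Rg, which mirrors the symmetry noted in the Results.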
 

Estimating sample size when only marginal effects of exposure and genotype are known  
In planning a study of gene-environment interaction, one may know the marginal effects of exposure (R'e ) and genotype (R'g ) from previous studies, but not know either the effects of exposure among members of the population who are not susceptible (Re) or the effects of the susceptibility genotype among the unexposed (Rg). In such a study, one may wish to calculate the number of cases required to detect gene-environment interaction given only e, g, R'e , R'g , and Ri . Under the assumption of independence between exposure and genotype in the population, we can do this sample-size calculation because the marginal effects of exposure (R'e) and genotype (R'g) are functions of Re, Rg, Ri, the prevalence of exposure (e), and genotype (g) (13):

[Equation (8)]

and

[Equation (9)]

 

 

By solving equations (8) and (9) for Re and Rg, we can express Re and Rg as functions of R'e, R'g, and Ri and then use those functions in equation (5) to calculate sample size. Expressions and proof of unique positive solutions for Re and Rg are in Appendix 1. We also calculated the number of cases required in case-control and case-only designs to detect gene-environment interaction assuming that R'e, R'g, and Ri are known.
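For orientation, the relationship between marginal and stratum-specific effects can be written out explicitly. The expressions below are a reconstruction from the definitions in this section, not a copy of equations (8) and (9): they assume independence of exposure and genotype, and relative risks of 1, Re, Rg, and ReRgRi in the four exposure-by-genotype categories. Each marginal effect is then a ratio of prevalence-weighted average risks. The reconstruction is consistent with the coefficients a, b, and c given in the Appendix.

```latex
% Marginal effect of exposure: average risk in the exposed / average risk in the unexposed
R'_e = \frac{(1-g)\,R_e + g\,R_e R_g R_i}{(1-g) + g\,R_g}

% Marginal effect of genotype: average risk in carriers / average risk in non-carriers
R'_g = \frac{(1-e)\,R_g + e\,R_e R_g R_i}{(1-e) + e\,R_e}
```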


Results

Sample sizes for various levels of Re, Rg and Ri of gene-environment interaction
We calculated the required sample size for a range of Re, Rg, Ri, g and e (Table 3). We present sample sizes for the case-only design and, for comparison, for the case-control design. We present only sample sizes for which g is greater than or equal to e, since the sample size is symmetric with respect to the prevalence of exposure and genotype. Because we used the Mantel-Haenszel approximation to estimate the variance under the null hypothesis for the case-control design (14), the calculated sample sizes for the case-control design are slightly asymmetric, especially with high values of Ri. For example, for Ri = 10, the calculated sample sizes with e = 0.3 and g = 0.7 and with g = 0.3 and e = 0.7 are 93 and 90, respectively. Therefore, for the case-control design, we present in Table 3 the average of the two calculated sample sizes.

As seen in Table 3, the case-only design requires fewer cases than the case-control design to detect interaction. As one would expect, greater interaction (Ri) is associated with increased power to detect interaction, and the required sample size is smallest when the prevalences of exposure and genotype are within the range of 30% to 50%.
 

Sample size calculation using marginal effects of exposure and genotype
We have also calculated sample sizes based on the marginal effects of exposure and genotype. As seen in Table 4, sample sizes calculated from known or assumed values for the marginal effects of exposure, genotype, and gene-environment interaction also yield fewer required cases for a case-only design than for a case-control design. For R'e = 5 and R'g = 2 in Table 4, changes in R'e and R'g have effects on sample size requirements similar to those observed in Table 3 for Re = 2 and Rg = 1.

Example
Hwang et al. (11) investigated the interaction between maternal cigarette smoking and the transforming growth factor alpha polymorphism on the risk for cleft palate in a population-based sample of infants with birth defects. The distribution of these two risk factors in the study is presented in Table 5. Other studies indicated that about 25 percent of women smoke during pregnancy (15-17). We used values of e = 0.25, g = 0.16 (calculated from Table 5), Re = 1, Rg = 0.9, and Ri = 6.1 and a control-to-case ratio of 4 to calculate the number of cases needed to detect the interaction between TaqI polymorphism and maternal smoking on the risk for cleft palate. We found that 75 cases (375 total subjects) would be needed for a case-control study and that 55 cases would be needed for a case-only study with power of 0.80.

We next attempted to determine the number of cases required for different values of Ri assuming we know the marginal effects of exposure (R'e ) and genotype (R'g ) from previous studies. From the above example, it can be calculated that R'e = 1.5 and R'g = 2. We assumed that the prevalence of the genotype (g) = 15 percent and that the prevalence of the exposure = 25 percent. Because the prevalence of the genotype (g) is better documented than the value of the exposure (e), we varied the values of Ri and e and calculated the number of cases required for a case-control and for a case-only design. As shown in Figure 1, a case-control study with 100 cases and 200 controls would have low power to detect possible interaction effects with Ri < 5, whereas a case-only study with 100 cases would have moderate power if the exposure prevalence were greater than 15 percent.

Discussion

Our results show that the case-only design is more efficient than the case-control design for detecting gene-environment interaction under the assumption of independence between exposure and genotype in the population. Our findings are consistent with other studies, which showed that when exposure and genotype are independent in the population, case-only studies produce more precise estimates of the interaction between exposure and genotype than do case-control designs (10-11). The power to detect interaction increases with the magnitude of the interaction (Ri).

The approach we used to calculate sample size is based on large sample variances.  In some extreme situations, for example, a large interaction coupled with a common exposure and genotype, some of the expected cell sizes become very small for the calculated sample size. If any expected cell size is less than five for a given sample size, we suggest recalculating the required sample size for a less extreme situation. For example, one may recalculate sample size assuming a smaller degree of interaction.
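The check on expected cell sizes suggested above can be sketched as follows. This is an illustrative helper, not part of the published method: it assumes the same case-only cell structure as the rest of the paper (cells proportional to (1-e)(1-g), e(1-g)Re, (1-e)gRg, and egReRgRi), and the function names and the cell labels are hypothetical.

```python
def expected_case_counts(n_cases, e, g, re, rg, ri):
    """Expected number of cases in each exposure-by-genotype cell.

    Assumes independence of exposure and genotype in the population and a
    joint relative risk of re * rg * ri (assumption of this sketch).
    """
    weights = {
        "neither": (1 - e) * (1 - g),
        "exposure only": e * (1 - g) * re,
        "genotype only": (1 - e) * g * rg,
        "both": e * g * re * rg * ri,
    }
    total = sum(weights.values())
    return {label: n_cases * w / total for label, w in weights.items()}


def cells_below_minimum(n_cases, e, g, re, rg, ri, minimum=5.0):
    """Labels of cells whose expected count falls below the chosen minimum."""
    counts = expected_case_counts(n_cases, e, g, re, rg, ri)
    return [label for label, count in counts.items() if count < minimum]


if __name__ == "__main__":
    # Illustrative inputs only: a small study with a strong interaction.
    print(cells_below_minimum(n_cases=20, e=0.25, g=0.16, re=1.0, rg=0.9, ri=6.1))
```

A non-empty result indicates that the large-sample variance approximation may be unreliable for that sample size, in which case the calculation can be repeated for a less extreme scenario as described above.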

The case-only design cannot evaluate an individual's relative risk associated with exposure alone (Re ) or genotype alone (Rg ). If the marginal effects of exposure (R'e ) and genotype (R'g ) are available from previous studies, our approach allows one to calculate the sample size required to study interaction.

Although exposure and genotype should typically be independently distributed in the population (9), the independence assumption may be violated in some instances. For example, individuals with delayed alcohol metabolism as a result of genetic variation in alcohol aldehyde dehydrogenase may have an increased flushing response after alcohol ingestion (18-19) and thus be more likely to avoid alcohol exposure. In addition, the independence assumption could be violated in any population where both the exposure and the genotype co-vary with other factors, such as ethnicity. Such correlations could also invalidate a case-only design for detecting gene-environment interaction.

The gene-environment interaction (Ri ) derived from a case-only design assumes a departure from multiplicative effects. The appropriateness of using such interaction in epidemiologic studies has been discussed elsewhere (20-22). Studies have shown that many biologically plausible modes of gene-environment interaction involve a departure from multiplicative effects (23). If the true underlying model of joint effect is additive, the odds ratio of interaction (Ri ) derived from a case-only design may not be an appropriate description of the risk in relation to exposure and genotype (9).

In conducting a case-only study, one should follow the same epidemiologic principles of case selection as one would in conducting a case-control study. A population-based consecutive series of incident cases is ideal. Selecting cases from the general population would be one way to help make the findings of such a study more generalizable.

Researchers are increasingly searching for gene-environment interactions in disease.  Examples of such studies include: smoking, TaqI polymorphism, and cleft palate (6-7);  lung cancer in relation to debrisoquine metabolic phenotypes (2); glutathione S-transferase class mu, smoking, and sister chromatid exchange (SCE) levels in lung cancer (23); polymorphism at cytochrome p4502E1 with gastric and esophageal cancer due to cigarette smoking and other dietary factors (3); N-acetylation phenotype and bladder cancer (1,5); and cigarette smoking, N-acetylation phenotype, and breast cancer (8). With the rapid advances in molecular technology, one may expect that interest in finding the effects of  gene-environment interaction in disease etiology will increase. We believe that, in many instances, the case-only design can be a useful tool with which to rapidly screen for gene-environment interaction.

Appendix

Calculation of Re and Rg as a function of R'e, R'g and Ri
The marginal effects of exposure (R'e) and genotype (R'g) can be expressed as functions of e, g, Re, Rg and Ri:

[Equation (A1)]

[Equation (A2)]
 

Rearranging equation (A2) for Rg, we have:


[Equation (A3)]

 

We now substitute equation (A3) into equation (A1), and solve equation (A1) for Re, to obtain:

[Equation (A4)]
 

Re is the solution of a quadratic equation whose coefficients are functions of e, g, R'e, R'g, and Ri. If we define:
a = (1-g)eRi + geRiR'g
b = -[(1-g)eRiR'e + geR'eR'g - (1-g)(1-e) - (1-e)gRiR'g]
c = -R'e(1-e)[(1-g) + gR'g]
it can be shown that b² - 4ac > b² ≥ 0, since a > 0 and c < 0; hence |b| < √(b² - 4ac), and the two roots have opposite signs. Therefore there is one and only one positive solution for Re. We used the positive value of Re obtained from equation (A4) to calculate Rg using equation (A3). We then used Re and Rg, derived as functions of e, g, R'e, R'g and Ri, to calculate sample size requirements.
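The procedure in this Appendix can be sketched numerically. The coefficients a, b, and c below are taken from the text; everything else is an assumption of this sketch: the forms of the marginal effects (ratios of prevalence-weighted risks under independence, with joint relative risk ReRgRi), the rearrangement used for Rg, and the function names, which are hypothetical.

```python
from math import sqrt


def marginal_effects(e, g, re, rg, ri):
    """Marginal effects (R'e, R'g) implied by Re, Rg, Ri under independence.

    Assumed forms, reconstructed from the definitions in the Methods.
    """
    rpe = ((1 - g) * re + g * re * rg * ri) / ((1 - g) + g * rg)
    rpg = ((1 - e) * rg + e * re * rg * ri) / ((1 - e) + e * re)
    return rpe, rpg


def stratum_effects(e, g, rpe, rpg, ri):
    """Recover (Re, Rg) from the marginal effects R'e, R'g and Ri."""
    # Quadratic coefficients as defined in the Appendix.
    a = (1 - g) * e * ri + g * e * ri * rpg
    b = -((1 - g) * e * ri * rpe + g * e * rpe * rpg
          - (1 - g) * (1 - e) - (1 - e) * g * ri * rpg)
    c = -rpe * (1 - e) * ((1 - g) + g * rpg)
    # a > 0 and c < 0, so the discriminant exceeds b**2 and this root is
    # the single positive one.
    re = (-b + sqrt(b * b - 4 * a * c)) / (2 * a)
    # Assumed rearrangement of the genotype marginal effect for Rg.
    rg = rpg * ((1 - e) + e * re) / ((1 - e) + e * re * ri)
    return re, rg


if __name__ == "__main__":
    # Round trip with illustrative inputs: stratum effects -> marginal -> back.
    rpe, rpg = marginal_effects(e=0.3, g=0.2, re=2.0, rg=1.5, ri=3.0)
    print(stratum_effects(e=0.3, g=0.2, rpe=rpe, rpg=rpg, ri=3.0))
```

The round trip is a convenient self-check: starting from chosen Re and Rg, computing the marginal effects, and solving back should return the starting values up to rounding error.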

Acknowledgements
The authors wish to thank Jim Buehler for his helpful suggestions. We also wish to thank Michael Atkinson and Shih-Jen Hwang for their technical assistance; and also two anonymous reviewers for their helpful comments and suggestions on an early draft of this paper.

Tables

Figures

References

  1. Cartwright RA, Glashan RW, Rogers HJ, et al. Role of N-Acetyl transferase phenotypes in
    bladder carcinogenesis: A pharmacogenetic-epidemiological approach to bladder cancer. Lancet
    1982; 2: 842-6.
  2. Caporaso N, Hayes RB, Dosemeci M, et al. Lung cancer risk, occupational exposure and the
    debrisoquine metabolic phenotype. Cancer Res 1989; 49: 3675-79.
  3. Caporaso N, Landi MT, Vineis P. Relevance of metabolic polymorphism to human
    carcinogenesis: evaluation of epidemiologic evidence. Pharmacogenetics 1991; 1: 4-19.
  4. Shields PG. Inherited factors and environmental exposure in cancer risk. J Occup Med
    1993; 35: 34-41.
  5. Hayes RB, Bi W, Rothman N, et al. N-acetylation phenotype and genotype and risk of bladder
    cancer in benzidine exposed workers. Carcinogenesis 1993; 14: 675-78.
  6. Hwang SJ, Beaty TH, Panny S, et al. Association of transforming growth factor alpha (TGFa)
    TaqI polymorphism and oral clefts: indication of gene-environment interaction in a population-based
    sample of infants with birth defects. Am J Epidemiol 1994; 141: 629-36.
  7. Shaw GM, Wasserman CR, Lammer EJ, et al. Orofacial clefts, parental cigarette smoking, and
    transforming growth factor-alpha gene variants. Am J Hum Genet 1996; 58: 551-61.
  8. Ambrosone CB, Freudenheim JL, Graham S, et al. Cigarette smoking, N-acetyltransferase 2
    genetic polymorphisms, and breast cancer risk. JAMA 1996; 276: 1419-1521.
  9. Khoury MJ, Flanders WD. Non-traditional epidemiologic approaches in the analysis of gene-environment
    interaction: case-control studies with no controls! Am J Epidemiol 1996; 144: 207-13.
  10. Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only
    designs for assessing susceptibility in population-based case-control studies. Stat Med 1994; 13: 153-62.
  11. Begg CB, Zhang ZF. Statistical analysis of molecular epidemiology studies employing case-series.
    Cancer Epidemiol Biomarkers Prev 1994; 3: 173-5.
  12. Hwang SJ, Beaty TH, Liang KY, et al. Minimum sample size estimation to detect gene-environment
    interaction in case-control designs. Am J Epidemiol 1994; 140: 1029-37.
  13. Khoury MJ, Beaty TH, Hwang SJ. Detection of genotype-environment interaction in case-control
    studies of birth defects: how big a sample size? Teratology 1995; 51: 336-43.
  14. Smith PG, Day NE. The design of case-control studies: the influence of confounding and
    interaction effects. Int J Epidemiol 1984; 13: 356-65.
  15. Windham GC, Swan SH, Fenster L. Parental cigarette smoking and the risk of spontaneous
    abortion. Am J Epidemiol 1992; 135: 1394-1403.
  16. Fox SH, Koepsell TD, Daling JR. Birth weight and smoking during pregnancy: effect
    modification by mother's age. Am J Epidemiol 1994; 139: 1008-15.
  17. Zhang J, Savitz DA, Schwingl PJ, et al. A case-control study of paternal smoking and birth
    defects. Int J Epidemiol 1992; 21: 273-78.
  18. Sherman DI, Ward RJ, Yoshida A, et al. Alcohol and aldehyde dehydrogenase gene
    polymorphism and alcoholism. EXS 1994; 71: 291-300.
  19. Chen CC, Hwu HG, Yeh EK, et al. Aldehyde dehydrogenase deficiency, flush patterns and
    prevalence of alcoholism: an interethnic comparison. Acta Med Okayama 1991; 45: 409-16.
  20. Greenland S. Basic problems in interaction assessment. Environ Health Perspect
    1993; 101(suppl 4): 59-66.
  21. Thompson WD. Statistical analysis of case-control studies. Epidemiol Rev 1994; 16: 33-50.
  22. Rothman KJ. Modern Epidemiology. Boston, MA: Little, Brown and Company, 1986: 311-26.
  23. Cheng TJ, Christiani DC, Xu X, Wain JC, et al. Glutathione S-transferase mu genotype, diet,
    and smoking as determinants of sister chromatid exchange frequency in lymphocytes. Cancer
    Epidemiol Biomarkers Prev 1995; 4(5): 535-42.