Centers for Disease Control and Prevention
Office of Genomics and Disease Prevention  

 Journal Publication

This article was published with modifications in Epidemiology 1998;9:350-354


The Future of Genetic Studies of Complex Human Diseases: An Epidemiologic Perspective 

by Muin J. Khoury and Quanhe Yang


Introduction
The Case-Control Method
Control Selection in Case-Control Studies of Genetic Factors
Residual confounding from hidden population stratification
Statistical Power
Limitations
Concluding remarks
Acknowledgement
Tables

Introduction

With advances in the Human Genome Project and the increasing availability of DNA markers scattered throughout the genome, such as simple sequence polymorphisms, variable number tandem repeats, and short sequence repeat polymorphisms, it has become increasingly possible to search for the genetic basis of complex human diseases using genome-wide screening methods. Linkage analysis using LOD scores in large pedigrees has been the traditional tool for identifying gene loci for human disorders, both single gene disorders (e.g., Huntington's disease) and complex chronic diseases (e.g., bipolar affective disorder). Recently, Risch and Merikangas have argued that the future of genetic studies of complex human disease may depend, to a large extent, on applications of new "association" type methods to family-based data. The main method of interest is the transmission disequilibrium test (TDT), in which the alleles at a given locus that are transmitted to a person with a specific disease are compared with the parental nontransmitted alleles, to look for deviation from the transmission expected in the absence of linkage. The TDT has been shown to be a valid test of linkage in the presence of linkage disequilibrium (which creates associations with specific alleles). Risch and Merikangas showed that the TDT has more power than traditional linkage analysis for disease genes with weak to moderate effects on disease risk.

In this paper, we argue that the future of the genetic study of complex disorders will rely increasingly on the classical epidemiologic "association" paradigm. We show that, in the long run, improvements in study design and in adjustment for population stratification using interviews and genetic markers will lead to a new era of population-based incident case-control studies. Such studies could have more power and could yield more detailed information, not only establishing the presence or absence of a disease susceptibility gene but also defining the magnitude of risks and gene-environment interaction, a crucial first step to disease prevention and health promotion.

The Case-Control Method

The last two decades have witnessed the growth and development of case-control studies in epidemiology. These studies are easier and less costly to conduct than prospective cohort studies, have more power for less frequent outcomes (most human diseases), and can give estimates of association (odds ratios) that are equivalent to the relative risk measures obtained from cohort studies. In spite of the methodologic discussions and potential limitations of such studies, the case-control method has become standard practice for conducting valid epidemiologic studies of human disease etiology. The case-control method is particularly useful for genetic studies since genetic risk factors do not change with time, are not affected by disease status, and are easier to measure retrospectively than environmental risk factors.

The emergence of the population-based incident case-control study and of nested case-control methods has contributed to further methodologic improvements in study design and inference. In these types of study, all new cases of disease diagnosed in a certain time interval in a well-defined population (or cohort of individuals) are ascertained for the study. The ideal control subjects are a random sample of the underlying population from which case subjects are derived (or of the cohort at the beginning of observation for the case-base design). The advantages of these types of study are twofold: 1) risk factors for new cases of disease are indicators of causes of disease, while risk factors obtained from a mixture of new and prevalent cases may be risk factors not only for disease causation but also for disease duration or survival; 2) these studies are able to quantify the magnitude of disease risks in the underlying population. Special cohorts from which nested case-control studies can be derived, e.g., occupational groups or managed care cohorts, have also been used (e.g., the Nurses' Health Study). Three measures of disease occurrence can be estimated: a) the relative risk of disease (estimated by the odds ratio), which is the ratio of the risk (or incidence) of disease among individuals with the risk factor to the risk of disease among individuals without the risk factor; b) the population attributable risk of disease (i.e., the proportion of cases in the population that can be attributed to the presence of the risk factor); and c) the absolute risk of disease (the actual disease risks, yearly and by a certain age, given the presence or absence of the risk factor, or penetrance in the case of a measured genotype).
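As an illustration of how the three measures relate, the following minimal sketch computes them from a 2x2 table of a population-based case-control study plus an externally known incidence rate. The function name, the counts and the incidence value are hypothetical. The sketch assumes that the odds ratio approximates the relative risk (rare disease), that controls represent the exposure distribution of the source population, and it uses the standard case-based formula for the attributable fraction.

```python
def risk_measures(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls,
                  population_incidence):
    """Relative, attributable and absolute risk from case-control counts.

    population_incidence is the overall disease incidence in the source
    population (e.g., from a registry); it is an external input.
    """
    # Relative risk, estimated by the odds ratio (rare-disease assumption).
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)

    # Attributable fraction: share of cases exposed, times (RR - 1) / RR.
    case_exposure = exposed_cases / (exposed_cases + unexposed_cases)
    attributable_fraction = case_exposure * (odds_ratio - 1) / odds_ratio

    # Absolute risks: overall incidence is a weighted mix of the two groups,
    # I = I0 * (1 - p) + I0 * RR * p, with p estimated from the controls.
    p = exposed_controls / (exposed_controls + unexposed_controls)
    risk_unexposed = population_incidence / (1 + p * (odds_ratio - 1))
    risk_exposed = odds_ratio * risk_unexposed

    return {
        "odds_ratio": odds_ratio,
        "attributable_fraction": attributable_fraction,
        "risk_unexposed": risk_unexposed,
        "risk_exposed": risk_exposed,
    }


# Hypothetical study: 100 cases, 100 controls, incidence 1 per 1,000 per year.
measures = risk_measures(40, 60, 20, 80, population_incidence=0.001)
for name, value in measures.items():
    print(name, round(value, 5))
```

The absolute risks are only as good as the external incidence figure and the assumption that controls are a random sample of the source population.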

An example of a population-based incident case-control study is the Cancer and Steroid Hormone (CASH) study of reproductive cancers, based on the SEER population-based registries. These registries ascertain new cases of cancer in well-defined populations and have been used to conduct numerous case-control studies of various reproductive cancers to look for risk factors such as steroid hormones, diet and family history. Because the incidence of various cancers is known in the underlying population, these studies have yielded valuable estimates of the contribution of different risk factors to etiology (e.g., family history and breast cancer) in terms of relative, attributable and absolute risks.

Control Selection in Case-Control Studies of Genetic Factors

A challenge facing researchers conducting case-control studies is the choice of the control group. The concern centers on the presence of hidden confounding. For example, studies have shown that offspring of consanguineous marriages have a higher risk of infant and child mortality than offspring of nonconsanguineous marriages, implicating genetic factors. The problem with many of these studies is that socioeconomic, religious and cultural factors have not always been accounted for (18). These factors are associated with both the frequency of consanguinity and mortality, and can act as confounding factors in case-control or cohort studies.

A noted example of confounding in genetic studies due to population stratification is the reported association between the genetic marker Gm3;5;13;14 and non-insulin dependent diabetes mellitus among the Pima Indians. In this cross-sectional study, individuals with the genetic marker had a lower prevalence of diabetes than those without the marker (8% vs. 29%). This marker, however, is an index of white admixture. When the analysis was stratified by degree of admixture (measured by the reported number of white ancestors in the grandparental generation), the association all but disappeared. Currently, case-control studies are considered appropriate when genetic variants associated with specific phenotypic expressions can be measured. An example is the N-acetyltransferase single gene polymorphism (slow versus fast acetylators), which has been associated in case-control studies with the risk of various cancers. While the use of phenotypic information cannot solve the problem of population stratification, it can increase the biologic plausibility of epidemiologic findings much more than the use of anonymous markers scattered throughout the genome.
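The mechanism can be made concrete with a small sketch. The counts below are hypothetical and are not the Pima data: within each admixture stratum the marker is unrelated to disease, yet pooling the strata produces an apparent association, because both marker frequency and disease prevalence differ between strata.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: marker carriers with disease,  b: marker carriers without disease,
    c: non-carriers with disease,     d: non-carriers without disease.
    """
    return (a * d) / (b * c)


# Hypothetical strata. Disease prevalence is 30% in the first stratum and
# 10% in the second, regardless of marker status; the marker is rare in the
# first stratum (2%) and common in the second (50%).
STRATA = {
    "little admixture": (12, 28, 588, 1372),
    "substantial admixture": (20, 180, 20, 180),
}


def crude_table(strata):
    """Collapse all strata into a single 2x2 table (ignores stratification)."""
    return tuple(sum(table[i] for table in strata.values()) for i in range(4))


for name, table in STRATA.items():
    print(name, round(odds_ratio(*table), 2))   # 1.0 in each stratum
print("crude", round(odds_ratio(*crude_table(STRATA)), 2))  # 0.39
```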

It can be argued that hidden population stratification cannot be entirely controlled by ensuring that cases and controls belong to the same major racial and ethnic groups, because further stratification occurs within these groups. This concern has prompted suggestions to use family members as controls, or even to use no controls at all. The fact remains that many case-control studies of genetic markers (including some in which the authors were involved) have used convenience control groups, for several reasons: 1) it is not always easy to collect the appropriate controls, and researchers find it easier to use convenient groups of unaffected individuals; 2) often, what constitutes an appropriate control group is not readily apparent. For example, many studies use cases ascertained at one hospital or a teaching university clinic. Under these circumstances, it is not clear who should be the controls for such highly selected case groups; 3) investigators often use a mixture of prevalent and incident cases, a practice that not only complicates the selection of appropriate controls but also confuses inferences about the role of a genetic factor in disease occurrence with its role in disease duration.

What will investigators be able to do for case-control studies of genetic factors? First and foremost, the appropriate selection of subjects from the major racial and ethnic subgroups should always be an initial target for control selection. One can ensure this comparability by matching.

Second, beyond the major population grouping obtained from records or initial self-classification, interviews with cases and controls should collect information on cultural, demographic, religious, and anthropologic factors to define groups beyond major categories such as white, Hispanic, African American, or Native American. Historical family information on country of origin and patterns of family migration over more than one generation will be extremely useful in further characterizing population subgroups. For example, Hispanics might be further classified by country of origin and time of migration to the United States. The accuracy of this information may be difficult to evaluate, but there is no reason to suspect that it will be biased by case-control status. As in any study, imprecise measurement of a confounder could blunt the impact of adjustment, leaving substantial residual confounding. The use of interview information, however, can be more elaborate depending on the population studied. For example, in the study of the Pima Indians, investigators were able to create an index of white admixture ranging from 0 to 8, depending on the number of ancestors in the grandparental generation who were white. Information obtained from interviews can be used to adjust further the estimates obtained from allele-disease associations. For example, even though all study subjects (cases and controls) were Pima Indians, the analysis of the association between an allele and a disease can be further stratified according to indices of admixture with other racial/ethnic groups.
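One common way to carry out such an adjustment is the Mantel-Haenszel summary odds ratio, which pools stratum-specific 2x2 tables into a single adjusted estimate. The sketch below is an illustration only: the records, the helper names and the three index values are hypothetical, and the source does not state which adjustment method was used in the Pima analysis.

```python
from collections import defaultdict


def mantel_haenszel_or(records):
    """Mantel-Haenszel odds ratio from (stratum, has_marker, has_disease).

    Raises ZeroDivisionError if no stratum contains both a diseased
    non-carrier and a disease-free carrier.
    """
    # Per stratum: [a, b, c, d] = marker+disease, marker+no disease,
    #                             no marker+disease, no marker+no disease.
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for stratum, has_marker, has_disease in records:
        index = (0 if has_marker else 2) + (0 if has_disease else 1)
        tables[stratum][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def expand(stratum, a, b, c, d):
    """Turn 2x2 counts into individual records (hypothetical helper)."""
    return ([(stratum, True, True)] * a + [(stratum, True, False)] * b +
            [(stratum, False, True)] * c + [(stratum, False, False)] * d)


# Hypothetical subjects grouped by an admixture index (0, 4 or 8).
records = (expand(0, 12, 28, 588, 1372) +
           expand(4, 15, 85, 30, 170) +
           expand(8, 20, 180, 20, 180))

adjusted = mantel_haenszel_or(records)
# Ignoring the index (one big stratum) gives the crude odds ratio.
crude = mantel_haenszel_or([("all", m, d) for _, m, d in records])
print("adjusted:", round(adjusted, 2), "crude:", round(crude, 2))
```

With a finely graded index, some strata may be too sparse to contribute; coarser grouping of the index is then a practical compromise.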

Third and finally, progress in the Human Genome Project will lead to the identification of an increasing number of DNA markers that will be used to characterize genetic differences among subgroups of the population, and that can in turn be used as biologic markers of admixture among and within the different groups. Such work has already been done in a variety of populations and should become increasingly incorporated into epidemiologic studies of human disease. Admittedly, it is unlikely that single genetic markers will account for population admixture groups (as in the example of the Gm marker mentioned above). Rather, such groups are likely to differ in the distribution of alleles at multiple loci. Therefore, it may be difficult to assign individuals to one population group or another on the basis of one or more markers. Nevertheless, such markers could still be used, singly or in combination, to adjust for potential confounding in case-control analyses.

While the use of new genetic markers for population classification will continue to evolve in the coming decades, we postulate that genetic markers of admixture will lead to further refinement and stratification of case-control analyses into more homogeneous population subgroups and, in the long run, will supplement, if not replace, less accurate interview- and record-based methods for classifying population groups.

Residual confounding from hidden population stratification

Even after appropriate controls are selected, at least from the same major racial/ethnic groups, and appropriate adjustment is made using interview data and new genetic markers, skeptics will argue that residual confounding from unmeasured stratification can lead to spurious associations. The problem of confounding in epidemiology has been discussed extensively. While no observational study can ever rule out the possibility of confounding, the impact of potential hidden confounding on creating spurious associations can be quantified and its limits defined, as shown by Flanders and Khoury. These and other authors have shown that strong associations (say, odds ratios of 5 or more) between a factor and a disease are highly unlikely to be due entirely to a hidden confounder unless that confounder is strongly associated with both the disease and the risk factor.
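The size of a purely spurious association can be illustrated with a classical external-adjustment formula for a single binary confounder. This is a standard textbook expression used here as an assumption; it is not necessarily the approach of Flanders and Khoury, and the function and parameter names are hypothetical. If the factor under study has no effect at all, the apparent relative risk equals the ratio computed below.

```python
def confounding_ratio(p_conf_exposed, p_conf_unexposed, rr_conf_disease):
    """Apparent relative risk produced by a binary confounder alone.

    p_conf_exposed:   prevalence of the confounder among those with the factor
    p_conf_unexposed: prevalence of the confounder among those without it
    rr_conf_disease:  relative risk of disease associated with the confounder
    """
    excess = rr_conf_disease - 1
    return (p_conf_exposed * excess + 1) / (p_conf_unexposed * excess + 1)


# A moderately strong, moderately imbalanced confounder yields a weak
# spurious association.
print(round(confounding_ratio(0.6, 0.2, 2.0), 2))

# Even a tenfold risk factor present in 80% vs. 10% of the two groups
# does not reach an apparent relative risk of 5.
print(round(confounding_ratio(0.8, 0.1, 10.0), 2))

# The ratio reaches the confounder's own relative risk only in the extreme
# case where the confounder and the factor coincide.
print(confounding_ratio(1.0, 0.0, 5.0))
```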

We have every reason for an optimistic outlook about the increasing ability of investigators not only to choose more appropriate control subjects in population-based epidemiologic studies of complex human diseases but also to stratify and adjust their analyses of allele-disease associations using interview-based, record-based and biologic markers of population stratification. Residual confounding from population stratification is not likely to explain strong associations found between marker alleles and disease.

Statistical Power

Risch and Merikangas have recently shown that, compared with traditional linkage analysis, association studies based on the TDT have greater statistical power and thus will be able to detect disease susceptibility genes that have relatively low relative risks for complex diseases. Using the same assumptions as the Risch and Merikangas paper, we compare the power of the traditional case-control study with that of the TDT (Table 1). Essentially, for a disease susceptibility locus with a DNA marker (tightly linked to disease, i.e., recombination fraction = 0), we assume two alleles, A and a, where one copy of allele A confers an increased risk of disease (with a relative risk of R) and two copies (genotype AA) have a multiplicative effect (R*R). We vary the frequency of allele A (p) and derive the minimal sample sizes needed for a case-control study (1:1 ratio) that attempts to evaluate whether the presence of at least one A allele is a risk factor for the disease. We used a standard statistical package to calculate sample size estimates. The numbers of cases and controls are compared with the number of affected individuals (along with their parents) needed to conduct a TDT analysis. We used an alpha level of 0.001 (2-sided) for the case-control analysis as well as for the TDT analysis. The comparative results, however, are essentially the same at any alpha level, except that sample sizes increase with lower alpha levels (for both methods).
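The case-control side of this comparison can be sketched as follows. This is an illustration under stated assumptions, not the calculation performed by the statistical package used for Table 1: it assumes Hardy-Weinberg proportions, a rare disease (so that controls have population genotype frequencies), the usual normal-approximation formula for comparing two proportions, and 80% power, which the text above does not specify. Function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def carrier_frequencies(p, R):
    """Frequency of carrying at least one A allele in cases and controls.

    p: frequency of allele A; R: relative risk per copy (aa=1, Aa=R, AA=R*R).
    """
    q = 1 - p
    controls = 1 - q * q                      # Aa or AA in the population
    carriers_weighted = 2 * p * q * R + p * p * R * R
    cases = carriers_weighted / (q * q + carriers_weighted)
    return cases, controls


def cases_needed(p, R, alpha=0.001, power=0.80):
    """Cases (with an equal number of controls) for a 2-sided test."""
    p1, p0 = carrier_frequencies(p, R)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p0) / 2
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_power * sqrt(p1 * (1 - p1) + p0 * (1 - p0))) ** 2 / (p1 - p0) ** 2
    return ceil(n)


# Sample sizes for a few allele frequencies and relative risks.
for R in (2.0, 4.0):
    for p in (0.01, 0.1, 0.5):
        print(f"R={R} p={p} cases={cases_needed(p, R)}")
```

Because the power level and the exact formula are assumptions, the printed values should not be expected to reproduce Table 1.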

As shown in Table 1, in most instances (except for unrealistically high allele frequencies), the number of cases needed in a case-control study is less than the number of cases needed to conduct a TDT analysis. Consider that for a TDT analysis, one needs to recruit not only cases but also both parents (nontransmitted allele information could also be derived from only one parent and indirectly by using siblings), and that such an endeavor may not be practical or even feasible for adult late-onset chronic diseases such as Alzheimer’s disease. The statistical power of traditional case-control studies, together with their practicality, makes them an attractive alternative to family-based association studies.
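For the TDT side of the comparison, the sketch below restates, as an assumption, the normal approximation attributed to Risch and Merikangas (reference 6) for families consisting of one affected child and both parents, under the multiplicative model described above. The function name and the choice of 80% power are hypothetical, and the formula is reproduced from memory of that paper rather than from the present text, so it should be checked against the original before use.

```python
from math import ceil, sqrt
from statistics import NormalDist


def tdt_families_needed(p, R, alpha_one_sided, power=0.80):
    """Approximate number of case-parent trios needed for a TDT.

    p: frequency of allele A; R: relative risk per copy of A.
    """
    q = 1 - p
    # Probability that a parent of an affected child is heterozygous.
    h = p * q * (R + 1) / (p * R + q)
    # A heterozygous parent transmits A with probability R / (1 + R);
    # mean and standard deviation of the per-parent score follow from that.
    shift = (R - 1) / (R + 1)
    mu = sqrt(h) * shift
    sigma = sqrt(1 - h * shift * shift)
    z_alpha = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_power = NormalDist().inv_cdf(power)
    # Each family contributes two parents, hence the factor of 2.
    return ceil((z_alpha + sigma * z_power) ** 2 / (2 * mu * mu))


# A 2-sided alpha of 0.001 corresponds to 0.0005 in each tail.
for R in (2.0, 4.0):
    for p in (0.01, 0.1, 0.5):
        print(f"R={R} p={p} trios={tdt_families_needed(p, R, 0.0005)}")
```

Each trio requires genotyping three people, whereas each case in a 1:1 case-control study requires two, which is part of the practical argument made above.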

Limitations

The two major limitations of the case-control approach are essentially similar to the ones encountered in family-based association studies (e.g., the TDT). The first consideration is linkage disequilibrium. Ideally, if the gene of interest has been sequenced, the presence of one or more mutations within the gene could be correlated with an altered gene product and with case-control status. Many markers, however, reflect DNA variation in the general region of the gene. Investigators thus measure these markers instead of the disease susceptibility mutation itself. Marker alleles could be in linkage disequilibrium with disease alleles if the mutation has arisen relatively recently or if there are selective advantages of specific haplotypes. For many populations, it is unlikely that genetic recombination over several generations can lead to complete independence between a marker allele and a disease allele in the same region. For example, North American white populations have a history of only approximately 10 generations, marked by striking expansion, making disequilibrium likely. Thus, the use of a marker allele as a proxy for the disease susceptibility allele in a case-control study presumably leads to nondifferential misclassification and a dilution of the odds ratio toward unity. The finding of an association of a certain magnitude (odds ratio more than one) between a DNA marker and disease may thus reflect an important etiologic role of the gene locus of interest but not of the marker itself.
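The dilution toward unity can be illustrated numerically. The sketch below treats the marker as an imperfect test for carrying the susceptibility allele, with the same sensitivity and specificity in cases and controls (nondifferential misclassification). This is a simplification of linkage disequilibrium, and all names and values are hypothetical.

```python
def observed_odds_ratio(p_cases, p_controls, sensitivity, specificity):
    """Odds ratio seen with a proxy marker instead of the true allele.

    p_cases, p_controls: true carrier frequencies of the disease allele.
    sensitivity: P(marker present | disease allele carried)
    specificity: P(marker absent  | disease allele not carried)
    """
    def marker_frequency(p_true):
        return sensitivity * p_true + (1 - specificity) * (1 - p_true)

    m_cases = marker_frequency(p_cases)
    m_controls = marker_frequency(p_controls)
    return (m_cases / (1 - m_cases)) / (m_controls / (1 - m_controls))


# True carrier frequencies of 50% in cases and 20% in controls (true OR = 4).
print(round(observed_odds_ratio(0.5, 0.2, 1.0, 1.0), 2))   # perfect proxy
print(round(observed_odds_ratio(0.5, 0.2, 0.8, 0.9), 2))   # imperfect proxy
print(round(observed_odds_ratio(0.5, 0.2, 0.6, 0.7), 2))   # weak proxy
```

The weaker the correspondence between marker and disease allele, the closer the observed odds ratio moves to 1, even though the underlying effect is unchanged.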

It is noteworthy that these limitations apply to genetic markers when examined one at a time. Increasingly, investigators have begun examining shared extended DNA segments around a candidate gene (identity by descent or IBD mapping). In this approach, the finding of single marker associations is usually followed by reconstruction of shared haplotypes in a patient series. The region of greatest overlap would be the place to look for a gene, as in a standard linkage analysis. While outside the scope of this paper, it would be interesting to compare the efficiency of IBD mapping with that of the case-control approach applied to haplotype distributions using several markers instead of single markers. With further progress in the Human Genome Project, tens of thousands of DNA markers will become available and can potentially lead to better coverage of the human genome for case-control studies.

Secondly, chance findings become increasingly important in case-control studies involving multiple markers at multiple loci. As in other areas of epidemiology, disentangling spurious from causal associations depends on the consistency of the association across studies and on the presence of a biologically meaningful model underlying such associations. To reduce the impact of random errors, empirical-Bayes methods have been used.
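The general idea behind empirical-Bayes adjustment can be sketched as shrinkage of many marker-specific log odds ratios toward their common mean, with the amount of shrinkage estimated from the data themselves. The following is a generic method-of-moments illustration with hypothetical names and numbers; it is not the specific procedure of the papers cited in the reference list.

```python
from statistics import fmean, variance


def shrink_log_odds_ratios(estimates, variances):
    """Shrink log odds ratios toward their mean (normal-normal model).

    estimates: log odds ratios, one per marker (at least two).
    variances: their sampling variances (all positive).
    """
    prior_mean = fmean(estimates)
    # Spread among true effects = observed spread minus average sampling
    # noise, floored at zero.
    prior_variance = max(0.0, variance(estimates) - fmean(variances))
    shrunk = []
    for estimate, sampling_variance in zip(estimates, variances):
        weight = prior_variance / (prior_variance + sampling_variance)
        shrunk.append(prior_mean + weight * (estimate - prior_mean))
    return shrunk


# Four hypothetical markers; the last one stands out.
log_ors = [0.0, 0.1, -0.1, 1.5]
sampling_variances = [0.04, 0.04, 0.04, 0.04]
for raw, adjusted in zip(log_ors,
                         shrink_log_odds_ratios(log_ors, sampling_variances)):
    print(round(raw, 3), "->", round(adjusted, 3))
```

Imprecisely estimated outliers are pulled in most strongly, which reduces the chance that an extreme estimate among many markers is taken at face value.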

Concluding remarks

In summary, we show in Table 2 the general advantages of the population-based case-control paradigm in genetic studies of human diseases. Although family-based linkage and association studies will always be useful in locating disease susceptibility genes, we believe that traditional epidemiologic association studies will assume increasing importance in the study of genetic factors in disease. At present, population-based epidemiologic studies are still needed once a disease locus is found using family-based methods, to quantify the magnitude of disease risks associated with specific alleles in terms of relative, absolute and attributable risks in different population groups. In the not too distant future, we predict that population-based case-control studies will also have an increasing role in the genome-wide search for susceptibility genes for complex adult-onset human diseases.

Acknowledgement

The authors thank Drs. Dana Flanders, Eleanor Feingold, Stephanie Sherman, Irwin Waldman, Arthur Falek, and Feng-Zhu Sun for commenting on an earlier version of the manuscript.

Tables

References

  1. Risch N. Genetic linkage from an epidemiologic perspective. Epidemiol Rev 1997;19:24-32.
  2. Ebers GC, Kukay K, Bulman DE, Sadovnick AD, Rice G, Anderson C, Armstrong H, Cousin K, Bell RB, Hader W, Paty DW, Hashimoto S, Oger J, Duquette P, Warren S, Gray T, O'Connor P, Nath A, Auty A, Metz L, Francis G, Paulseth JE, Murray TJ, Pryse-Phillips W, Risch N. A full genome search in multiple sclerosis. Nat Genet 1996;13:472-476.
  3. Davies JL, Kawaguchi Y, Bennett ST, Copeman JB, Cordell HJ, Pritchard LE, Reed PW, Gough SC, Jenkins SC, Palmer SM. A genome wide search for human type 1 diabetes susceptibility genes. Nature 1994;371:130-136.
  4. Gusella JF, Wexler NS, Conneally PM, Naylor SL, Anderson MA, Tanzi RE, Watkins PC, Ottina K, Wallace MR, Sakaguchi AY. A polymorphic DNA marker genetically linked to Huntington's disease. Nature 1983;306:234-238.
  5. Egeland JA, Gerhard DS, Pauls DL, Sussex JN, Kidd KK, Allen CR, Hostetter AM, Housman DE. Bipolar affective disorders linked to DNA markers on chromosome 11. Nature 1987;325:783-787.
  6. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996;273:1516-1517.
  7. Risch N. Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet 1990;46:229-241.
  8. Spielman RS, Ewens WJ. Invited editorial: The TDT and other family-based tests for linkage disequilibrium. Am J Hum Genet 1996;59:983-989.
  9. Khoury MJ, Beaty TH, Cohen BH. Fundamentals of Genetic Epidemiology. Oxford University Press, New York, New York.
  10. Armenian H (ed). Applications of the case-control method. Epidemiol Rev 1994;16:1-164.
  11. Austin H, Hill HA, Flanders WD, Greenberg RS. Limitations in the application of case-control methodology. Epidemiol Rev 1994;16:65-76.
  12. Khoury MJ, Beaty TH. Applications of the case-control method in genetic epidemiology. Epidemiol Rev 1994;16:134-150.
  13. Colditz GA, Manson JE, Hankinson SE. The Nurses' Health Study: 20-year contribution to the understanding of health among women. J Women's Health 1997;6:49-62.
  14. Layde PM, Webster LA, Baughman AL, Wingo PA, Rubin GL, Ory HW. The independent associations of parity, age at first full term pregnancy, and duration of breast feeding with the risk of breast cancer. Cancer and Steroid Hormone Study Group. J Clin Epidemiol 1989;42:963-973.
  15. Sattin RW, Rubin GL, Webster LA, Huezo CM, Wingo PA, Ory HW, Layde PM. Family history and the risk of breast cancer. JAMA 1985;253:1908-1913.
  16. Claus EB, Risch N, Thompson WD. Age at onset as an indicator of familial risk of breast cancer. Am J Epidemiol 1990;131:961-971.
  17. Lasky T, Stolley PD. Selection of cases and controls. Epidemiol Rev 1994;16:6-17.
  18. Khoury MJ, Cohen BH, Chase GA, Diamond EL. An epidemiologic approach to the evaluation of the effect of inbreeding on prereproductive mortality. Am J Epidemiol 1987;125:251-262.
  19. Knowler WC, Williams RC, Pettit DJ, Steinberg AG. Gm3,5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet 1988;43:520-526.
  20. Cartwright RA, Glashan RW, Rogers HJ, Ahmad RA, Barham-Hall D, Higgins E, Kahn MA. Role of N-acetyl-transferase phenotypes in bladder carcinogenesis: a pharmacogenetic epidemiologic approach to bladder cancer. Lancet 1982;2:842-845.
  21. Ambrosone CB, Freudenheim JL, Graham S, Marshall JR, Vena JE, Brasure JR, Michalek AM, Laughlin R, Nemoto T, Gillenwater KA, Shields PG. Cigarette smoking, N-acetyl transferase 2 genetic polymorphisms and breast cancer risk. JAMA 1996;276:1494-1501.
  22. Lander ES, Schork NJ. Genetic dissection of complex traits. Science 1994;264:2037-2048.
  23. Khoury MJ, Flanders WD. Nontraditional epidemiologic approaches to the analysis of gene-environment interaction: case-control studies with no controls. Am J Epidemiol 1996;144:207-213.
  24. Posey D, Khoury MJ, Mulinare J, Adams MJ, Ou CY. Is mutated methylene tetrahydrofolate reductase a risk factor for neural tube defects? A pooled analysis. Lancet 1996;347:686.
  25. Adams MJ Jr, Khoury MJ, Scanlon KS, Stevenson RE, Knight GJ, Haddow JE, Sylvester GC, Cheek JE, Henry JP, Stabler SP. Elevated midtrimester methylmalonic acid levels as a risk factor for neural tube defects. Teratology 1995;51:311-317.
  26. Jackson FL. Race and ethnicity as biologic constructs. Ethnicity and Disease 1992;2:120-125.
  27. Robinson SL, Gutowski SJ, van Oorschot RA, Fripp Y, Mitchell J. Genetic diversity among selected ethnic subpopulations of Australia: evidence from three highly polymorphic DNA loci. Hum Biol 1996;68:489-508.
  28. Zago MA, Silva Junior WA, Tavella MH, Santos SE, Guerreiro JF, Figueiredo MS. Interpopulation and intrapopulational genetic diversity of Amerindians as revealed by six variable number of tandem repeats. Hum Heredity 1996;46:274-289.
  29. Cavalli-Sforza LL, Piazza A. Human genomic diversity in Europe: a summary of recent research and prospects for the future. Europ J Human Genet 1993;1:3-18.
  30. Lin AA, Hebert JM, Mountain JL, Cavalli-Sforza LL. Comparison of 79 DNA polymorphisms tested in Australians, Japanese, and Papua New Guineans with those of five other human populations. Gene Geography 1994;8:191-214.
  31. Flanders WD, Khoury MJ. Indirect assessment of confounding: graphical description and limits on effect of adjusting for covariates. Epidemiology 1990;1:239-246.
  32. Thompson EA, Neel JV. Allelic disequilibrium and allele frequency distribution as a function of social and demographic history. Am J Hum Genet 1997;60:197-204.
  33. Te Meerman GJ, Van der Meulen MA, Sandkuijl LA. Perspectives of identity by descent (IBD) mapping in founder populations. Clin Exp Allergy 1995;Suppl 2:97-102.
  34. Houwen RH, Baharloo S, Blankenship K, Raeymaekers P, Juyn J, Sandkuijl LA, Freimer NB. Genome screening by searching for shared segments: mapping a gene for benign recurrent intrahepatic cholestasis. Nat Genet 1994;8:380-386.
  35. James LM. Statistical Analysis Battery for Epidemiologic Research, Centers for Disease Control and Prevention, version 1.91, 1996.
  36. Greenland S, Robins JM. Empirical-Bayes adjustments for multiple comparisons are sometimes useful. Epidemiology 1991;2:244-251.
  37. Greenland S, Poole C. Empirical Bayes and semi-Bayes approaches to occupational and environmental hazard surveillance. Arch Environ Health 1994;49:9-16.