Meeting Summary
This is the computer aided diagnosis workshop, and we are basically here today
to hear from you. Today marks the beginning of FDA's effort to write guidance
in this area, and the first step is that we want to hear from you - particularly
your ideas on the relevant characteristics of these devices and how they should
be evaluated. We'll have a number of speakers this morning and then plenty of
time for discussion this afternoon. We're going to start off with a couple of
introductory talks about device regulation and general software policy,
followed by several speakers from the Center who'll be talking on CADx, and
then we'll get to the main part of the program. Obviously we can't cover
everything today, so later on as ideas occur to you, please submit them to us
in writing.
Medical devices are regulated under the authority of the medical device
amendments to the Federal Food, Drug, and Cosmetic Act of 1938 [21 Code of
Federal Regulations (CFR)]. The 1938 Act required devices to be safe but
placed the burden of proof to remove unsafe products on the government. The
1976 Medical Device Amendments and 1990 Safe Medical Devices Act established a
comprehensive scheme of regulation. They defined the term "device", provided
for classification of all medical devices into three classes, required device
manufacturer registration and listing of products, and set up procedures for
clinical investigations (Investigational Device Exemption -- IDE), premarket
notification (510(k)), and premarket approval (PMA). In addition they included
the basic prohibition on misbranding and adulteration, required adherence to
good manufacturing practices (GMP), and provided for post market surveillance
(of selected devices). The device definition was written with such generality
as to include a wide range of products--including computer software--within its
scope: "...an instrument, apparatus, implement, machine, contrivance, implant,
in vitro reagent, or other similar or related article, including any component,
part, or accessory, which is - (1) recognized in the official National
Formulary, or the United States Pharmacopeia, or any supplement to them, (2)
intended for use in the diagnosis of disease or other conditions, or in the
cure, mitigation, treatment, or prevention of disease, in man or other animals,
or (3) intended to affect the structure or any function of the body of man or
other animals, and [which is not a drug]..." [Section 201(h)]. Further
information concerning these regulations can be obtained through the Center's
Division of Small Manufacturers Assistance (DSMA) at 800-638-2041.
An FDA software policy is under development at this time. No such policy
currently exists, although there has been a "draft" policy for several years.
With regard to this policy, the first question normally asked is how software
can be a medical device anyway. The previous speaker gave the answer to that
question, indicating that much software is a device since it is a component or
accessory to a device, and other software would be covered (probably as a
"contrivance") under the very broad device definition. The second question
with regard to software policy is why such a policy is needed. A policy is
needed because any device is subject to all of the requirements of the Federal
Food, Drug, and Cosmetic Act as amended, including registration and listing, GMPs,
and premarket review, unless specifically exempted by regulation. If we
rigorously applied these provisions to all software medical devices, it would
represent a tremendous burden on both the Agency and on the medical community.
The software policy is risk/exemption based. We are trying to assess the risks
of medical software devices, decide on appropriate exemptions, and write
classification regulations to implement the exemptions. Thus, our first task
is to define criteria for assessing the impact of product failure on the
patient and apply them rationally to known products. Here are some of the
criteria which seem reasonable: (1) Seriousness of the disease to be diagnosed
or treated, (2) Time frame for use of the information, (3) Concordance with
accepted medical practice, (4) Format of data and its presentation, (5)
Individualized vs. Aggregate patient care recommendations, and (6) Clarity of
the algorithm. We plan to hold a public workshop to discuss the
policy. A Federal Register notice announcing the workshop and laying out some
of the details of such a policy is under preparation and should be published in
a few months [Registrants at this workshop will be informed of details of the
Software Policy meeting]. Finally, how do CADx devices fit into this picture?
CADx systems are either accessories or have been determined to have significant
impact on patient care and thus need to be regulated via premarket review.
Thus, they are at the high impact/risk end of the medical device software
spectrum. The question is not how they should be regulated, but what sort of
information is necessary to make good decisions on clearing these products.
The Center has begun receiving premarket approval submissions for devices with
CADx features. These are devices which use modern data analysis techniques to
carry out some portion of the decision making process previously provided by
the physician or other health care professionals. Thus, CADx refers to
computer-aided-diagnosis devices (decision support products), and not to
computer-aided diagnostic-devices (computerized devices that provide basic
input to the diagnostic process, e.g., CT or MRI systems). Examples of CADx
devices include image analysis products used to identify potential
abnormalities, ECG analysis programs, and in-vitro diagnostic test devices
which flag "out-of-bounds" test results and/or provide some more sophisticated
data synthesis. The Center believes that in order to perform intelligent and
timely reviews of these devices with consistency across product lines, it is
appropriate at this time to develop reviewer guidance for them. To that end
the Center has established a CADx Working Group, composed of reviewers and
other technical professionals from all CDRH components. Today's public
workshop is an effort to obtain input from the public at the initial stage of
this project. In particular, we pose two questions to you: (1) How should CADx
devices be categorized?--i.e., What device attributes are relevant to the
degree of regulatory oversight exercised by the Center over a particular CADx
device? And (2) What evaluation methodologies are appropriate to the assessment
of the performance of these devices? We look forward to your comments today
and to any written comments which you are warmly invited to contribute to us in
the future.
In 1995, the FDA approved the first two computer-assisted devices for
evaluation of Papanicolaou (Pap) smear slides. These devices are limited to
rescreening of Pap smear slides that have been previously screened by manual
microscopy and diagnosed as negative (within normal limits, WNL). As
yet, no computer-assisted devices have been approved for primary screening of
Pap smear slides. FDA regulates computer-assisted Pap smear readers as
in-vitro diagnostic medical devices. FDA's premarket approvals for the
NeoPath, Inc. AutoPap 300 QC Automatic Pap Rescreener System and the
Neuromedical Systems, Inc. PAPNET Testing System were based on the evaluation,
by FDA staff and a panel of consultants, of the design of each device and of the
manufacturers' pre-clinical and clinical testing data that demonstrated the
effectiveness and safety of the devices for their intended use and intended
populations for use. The approved intended uses and indications for use are
published in the FDA-approved package insert for each device.
The purpose of this talk is to briefly discuss evaluations of diagnostic
cardiovascular devices, and to point out areas of concern. Electrocardiographs
with computer interpretation are devices that acquire the diagnostic-quality
electrocardiogram (ECG), extract various measurements or features from the
signal, and apply those features to some deterministic or probabilistic
decision-making algorithm to arrive at an interpretation. If the algorithm
uses patient information and traditional ECG measurements, and the output is
overread by an appropriate physician, then we would be less concerned with
algorithm performance. Since the devices are used by the general clinical
community, however, we routinely request performance statistics for each
possible interpretation. In rare cases, when we determined that stand-alone
software packages are not accessories to other classified devices, we also
exempted the devices from premarket notification. In addition to devices that
mimic clinical decision making, the Division also conducts reviews of devices
with advanced digital signal processing of the ECG. Heart rate variability
analysis is illustrative of how further processing of the data elicits further
consideration of these submissions. These issues involve not only clinical
application of the information but also providing the user with an
understanding of how the data were generated. If the measurements have some
basis in traditional electrocardiography, and a reasonable approach to
validation is taken to document the reliability of the information, then we are
likely to continue to clear devices for market, with restricted labeling, until
such time when a manufacturer is able to provide clinical data to support
specific diagnostic indications. This strategy may not completely alleviate
our concerns about potentially unreliable data that cannot be verified by the
user, or about the impact of the data on clinical decision making and, ultimately,
on patient safety.
Computer-aided Diagnostic (CADx) devices will be an indispensable part of the
future practice of clinical medicine. Computer-aided diagnostic devices must
be distinguished from computer-based diagnostic devices. A diagnosis is a
prediction. An important implication of treating CADx devices as prediction devices is that
CADx software is assessed in terms of its accuracy, not its efficacy. There
are at least three prediction methods: statistical, expert systems, and
empirical formulae. Each method requires a somewhat different evaluation
approach. Prediction methods can be either general methods applied to medical
problems, or unique-to-the-medical-problem methods. All three methods can be
applied to: (1) the generation and analysis of diagnostic test information
(including laboratory tests such as a SMAC or genetic screening, functional
tests such as an ECG, and radiographic tests such as CT, MRI, mammogram) and
(2) the integration of diagnostic information. Three device categories can be
defined: devices marketed to the public, devices marketed to physicians
involving peer-reviewed statistical methods, and devices marketed to physicians
that have not been peer-reviewed or devices that involve expert system methods.
The creation of CADx guidelines is currently being performed by an internal
FDA CADx Working Group. In order to obtain (i) a non-regulatory perspective
and (ii) additional CADx expertise, non-FDA CADx experts (who are not currently
associated with a CADx device) should be invited to join the CADx Working
Group. The view that the FDA is an obstacle to innovation in medicine may no
longer be correct. In the CADx domain it may be that rather than trying to
protect everyone from everything, the FDA is adopting the view that its job is
to make sure that companies that wish to market CADx devices to physicians
provide: (i) the FDA with sufficient information so that it can determine that
the device meets its functional and accuracy claims and (ii) physicians with
sufficient information so that they can determine whether the device will be
medically useful in their specific clinical situations.
I propose that experts with potential conflicts be included in the review
process. Their potential conflict status should be considered when the panel
makes the final advisory decision; however, their educated and experienced
opinions should be utilized to the fullest. Too much is at stake to lose the
expertise of highly qualified individuals merely because they have perhaps in
the past represented industrial developers with a financial interest. This
applies to all developers and any consultants who are working toward a common
goal. The American public will have confidence in FDA scientists and their
consultants if they adhere to principles of scientific integrity including full
disclosure, and all will benefit from these devices, especially the patients
for whom they are designed.
Quantitative analyses of the electroencephalogram (EEG) have been available
for over 50 years. During the past several decades, researchers at New York
University Medical Center's Brain Research Laboratories have developed the
neurometric method of analysis of the EEG. Digitized EEG recordings are
subjected to a Fast Fourier Transform to extract information on power,
frequency, and phase. These measures are log-transformed to approximate
gaussianity, age-regressed to account for variations in EEG variable
distribution as a function of age, and compared to an extensive normative
database to derive Z-score estimates of deviation from normal. The Z-score
matrix provided by the 1200+ extracted variables is subjected to a multivariate
analysis that corrects for intercorrelations between and within measures to
provide accurate estimates of the difference between patient Z-score values and
those of the normal population. Discriminant analyses are used to identify
variables that contribute to differentiating the patient from normal (normal
vs. abnormal comparison), and correlate the profile with that of various
empirically defined clinical groups. The likelihood that the profile matches
profiles of groups consisting of individuals with known disorders is stated in
probabilistic terms. Test sensitivity and specificity are evaluated using
ROC curves. Statistical tables summarize the results of the analysis. Data are
further transformed into topographic maps that visually depict the extent of
the deviation of the patient from the normal reference group. The neurometric
method is based on widely accepted statistical procedures, and has been
replicated in a variety of laboratories around the world. Neurometrics
provides an empirical test of brain function and structure, and is useful as a
diagnostic aid for patient evaluation, treatment planning, and treatment
monitoring, enhancing the quality of patient care in psychiatry, neurology, and
related disciplines.
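The first steps of the pipeline described above (spectral power, log transform, age regression, Z-score against norms) can be sketched as follows. This is a minimal illustration under stated assumptions, not the laboratory's implementation: a naive discrete Fourier transform stands in for the FFT, the age regression is assumed to be linear in log age, and the normative coefficients in `HYPOTHETICAL_NORM` are invented for the example. The multivariate correction and discriminant steps are not shown.

```python
import cmath
import math


def band_power(signal, fs, f_lo, f_hi):
    """Spectral power in [f_lo, f_hi) Hz, using a naive DFT as a stand-in for an FFT."""
    n = len(signal)
    power = 0.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if f_lo <= freq < f_hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(signal))
            power += abs(coeff) ** 2 / n ** 2
    return power


def z_score(log_power, age, norm):
    """Deviation from the age-expected value, in normative standard deviations.

    Assumes (for illustration only) that the expected log power is linear in log10(age).
    """
    expected = norm["intercept"] + norm["slope"] * math.log10(age)
    return (log_power - expected) / norm["sd"]


# Synthetic one-second "recording": a 10 Hz sine of amplitude 2 sampled at 128 Hz.
fs = 128
signal = [2.0 * math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]

# Invented normative coefficients; a real system would fit these to a normative database.
HYPOTHETICAL_NORM = {"intercept": 0.5, "slope": -0.1, "sd": 0.25}

alpha_power = band_power(signal, fs, 8.0, 13.0)
z = z_score(math.log10(alpha_power), age=10, norm=HYPOTHETICAL_NORM)
print(f"alpha-band power = {alpha_power:.3f}, Z = {z:.2f}")
```

The log transform is applied before the Z-score because, as the abstract notes, raw power values are not approximately Gaussian.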
With the rapid expansion in the speed and capabilities of computer software and
hardware, we are now beginning to have the capabilities of simulating some of
the human decision processes involved in differential diagnosis and image
pattern recognition. While certainly issues of validation and hazard analysis
of software systems intended for human diagnosis are a significant part of the
evaluation process, the fundamental part of assessing effectiveness of such
devices will remain as the clinical evaluation of the accuracy of the
classification process the device claims to perform. The behavior of the
binomial distribution, and the associated issues of statistical rigor in
experimental design, will be absolutely critical in understanding the
procedures for performance evaluation of computer-assisted diagnosis devices.
The binomial distribution dictates how devices which classify patients (images,
samples, assays etc.) into one of two categories (normal/abnormal) will behave,
and this behavior can often be counter-intuitive. Evaluating such a device
requires particularly careful attention to the standard clinical design issues
of poolability, cross-over evaluations on the same samples/patients with and
without the device in place in the diagnostic process, statement of all
assumptions involved in testing, statement of the correct hypotheses,
collection of correctly random and unbiased samples from the specified target
population, and separation of performance measures into those independent of
prevalence assumptions and those which explicitly or implicitly depend on
prevalence (such as predictive value). Most important of all, perhaps, is
the need to include the entire range of difficulties of classification into the
evaluation process. None of these factors can be ignored in designing or
reviewing submissions on such classification devices.
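One way to see the counter-intuitive behavior of the binomial distribution is to compute an exact confidence interval for an observed sensitivity. The sketch below uses the Clopper-Pearson construction, solved by bisection with the standard library only; the choice of this interval and the 20-case example are assumptions made for illustration and are not taken from the talk.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_interval(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes in n trials."""

    def solve(cdf_at, target):
        # The binomial CDF decreases as p increases, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# A device that flags all 20 of 20 known abnormal cases in a small study.
lower, upper = exact_interval(20, 20)
print(f"observed sensitivity 100%, 95% interval {lower:.3f} to {upper:.3f}")
```

With 20 of 20 detections the lower bound is about 0.83, so a perfect observed result in a small sample remains compatible with a device that misses roughly one abnormal case in six.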
NeoPath, Inc. has been engaged for six years in the development of an automated
cytological screener for the analysis of Pap smears. Last year, the NeoPath
AutoPap 300 QC System was granted PMA approval following the methods presented
below. It is NeoPath's position that the basic clinical testing of a
cytological screening device must include well-controlled, scientifically valid
studies to establish a quantitative baseline of device performance. This
baseline provides a means for FDA reviewers to assess the initial safety and
efficacy of a device as well as evaluate device enhancements and future
devices. For either primary screening or QC rescreening of Pap smears in
accord with the Bethesda system of cervical cytology, four interlocking studies
are needed: prospective intended use, historical and current sensitivity,
multi-run precision-reproducibility, and historical consistency.
The incidence of and mortality from invasive cervical cancer have been
increasing in the United States since 1986 in the purportedly well screened
population of white women under 50 years of age. This disturbing trend is
thought to be attributable, at least in part, to the spread of the human
papillomavirus (HPV), which has now reached near epidemic proportions in young
women throughout the world. Thus, the factors contributing to the development
of cervical cancer are apparently so widespread that more women are developing
this preventable cancer despite screening. In addition, there are estimated to
be over 50 million cervical smear tests performed each year in the United
States. Therefore, it is paramount that FDA assure that any new automated
device to be used as a substitute for conventional microscopic screening be
very rigorously tested to assure that even rare cytopathology or unusual
presentations of abnormalities are detected, as even "rare" cases can affect
tens of thousands of American women at a national level. There are three
primary degrees of freedom that must be considered when assuring that all
presentations have been sampled and included in the clinical trial: (1)
Diagnostic variations (all categories of The Bethesda System, including various
types of adenocarcinomas); (2) Patient variations (prevalence of abnormal cells,
size of abnormal cells, smear patterns - must include various patient
demographics); (3) Laboratory variations (staining color and intensity,
coverslip bubbles, artifacts - must include a wide variety of laboratories).
The clinical trial should simulate the device's intended use as closely as
possible. In addition, in terms of establishing standards for comparison with
conventional screening, bias should be minimized by utilizing historic
screening records and applying exhaustive microscopic searching and automated
rescreeners to ensure that no significant abnormality is missed by the
substitutive test. In conclusion, given the public health threat represented
by the HPV epidemic, the rise in incidence in some populations and the
potential for prevention of cervical cancer, the objective of automated
cervical smear screening should be increasing the accuracy of the test, and not
serving as a labor substitute at the expense of sensitivity.
Reasonable standards have already been established for clinical trials for
medical devices within the FDA and the medical community. CADx Pap smear
medical devices are essentially no different from other devices, and careful
trial design should be followed. In general, there are four issues that need
to be addressed. First, the device must be tested in its intended use. This
means that levels of disease prevalence used in the trial should reflect
prevalence in routine use. Too high a level of disease in a trial can affect
vigilance of the participants. Second, a reference standard must be
established. Essentially, we want to compare the discriminatory level of a Pap
smear screening device to that of humans. Using standards, one can evaluate
sensitivity and specificity, preferably using an analytical method such as
receiver operating characteristic (ROC) curves. Such performance standards and
comparisons are necessary for educating potential users in how a Pap smear
device may work in their laboratories. It will also help users to compare the
device to other alternatives; for example it may be desirable to compare the
accuracy and cost effectiveness of a double screening by humans to the combined
use of humans and a machine. Third, given the subjective nature of cytology,
and the difficulty with borderline diagnoses, a method of adjudicating the
difference between the reference and the CADx result must be developed. There
are many methods that can be applied to the Pap smear, and any one of these
should prove acceptable. They include the use of an independent pathologist, a
panel review, biopsy, human papillomavirus testing, and patient follow-up.
Finally, in a trial, vigilance must be controlled, since taking part in any
trial elevates one's attention and performance. This demands the use of a
two-armed clinical trial so that both the CADx arm and the standard (e.g., human)
arm have elevated levels of vigilance. One should also consider how vigilance
might be raised, or even lowered, in actual use of a CADx
system. In summary, careful clinical trial design is critical for evaluating
Pap smear methodology and the results of trials should be presented in such a
way that the potential users of a system will be able to comprehend potential
performance in their own laboratories.
The main points I would like to leave with you are: (1) Discriminating power is
the underlying measure of performance of a CADx device. (2) As a minimum, a
single measurement of both sensitivity AND specificity is necessary to
establish discriminating power. This would also allow an ROC analysis to be
conducted. A well-designed study should also generate sensitivity AND
specificity results for a human reviewing the same material without the CADx
device. This would yield two-armed results and should form the basis for the
product's evaluation. (3) If a CADx device, by itself, has greater
discriminating power than a human, then approval should be forthcoming. (4) If
a CADx device does not, by itself, have greater discriminating power than a
human, then this information should be made clear in the labeling. Without
this caution clearly in the labeling, users WILL mistakenly assume such a
device is better than a human--after all the FDA approved it. (5) A device
with lower discriminating power may still provide benefit if it is cheaper than
a human, and is used for back-up purposes only. The FDA should make
information available to allow a user to calculate cost-effectiveness. This
information is either the underlying discriminating power or the
sensitivity/specificity results.
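The two-armed comparison in points (2) through (4) can be sketched as follows. A single sensitivity/specificity pair determines an ROC curve only under a model assumption; here an equal-variance Gaussian model is assumed, so that discriminating power reduces to one number (the separation of the two populations in standard deviations). The counts are invented for illustration.

```python
from statistics import NormalDist


def sens_spec(tp, fn, tn, fp):
    """Sensitivity and specificity from the four cells of a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


def discriminating_power(sensitivity, specificity):
    """Population separation in SD units, assuming an equal-variance Gaussian model."""
    z = NormalDist().inv_cdf
    return z(sensitivity) + z(specificity)


# Hypothetical results on the same 100 abnormal and 1000 normal slides.
device = sens_spec(tp=90, fn=10, tn=800, fp=200)  # sensitivity 0.90, specificity 0.80
human = sens_spec(tp=80, fn=20, tn=950, fp=50)    # sensitivity 0.80, specificity 0.95

device_power = discriminating_power(*device)
human_power = discriminating_power(*human)
print(f"device: {device_power:.2f} SD, human: {human_power:.2f} SD")
```

In this made-up example the device has the higher sensitivity but the lower discriminating power, which is why sensitivity alone cannot establish that a device outperforms a human reader.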
As one of the nation's leading providers of cervical cytology testing services,
Corning Clinical Laboratories has been working with each of the developers of
new technologies that promise to improve Pap testing accuracy. There is a
risk, however, that vigorous marketing by these developers and media exposure
will create pressure to adopt these new technologies before problems inherent
in their use are fully resolved and before complete data regarding their
efficacy is available. In particular, our concerns include (1) likely loss of
positive predictive value of abnormal results during a months- or years-long
pathologist and cytotechnologist "learning curve" (the two recently
FDA-approved devices, PapNet and AutoPap, achieve higher sensitivity by
"flagging" cells or slides for manual re-review, leading possibly to negative
cases being misinterpreted as abnormal because of device created biases), (2)
possible loss of situation awareness among those who read Pap smear slides and
who may develop complacency and decreased detection rates, (3) the risk that
new standards of care will be created "by default" in the face of a dearth of
clinical outcome studies, and (4) a difficult post-FDA-approval period marked
by unclear regulations, a lack of reporting format standards, and inconsistent
reimbursement policies. Perhaps public health interests, plus the very near
approach by some of these new technologies to actual diagnostic processes,
warrant an FDA paradigm shift, namely to require technology assessment that
goes beyond the normal purview. For example, the FDA could require
post-approval market surveillance that includes rigorous training requirements
and measurement of training outcomes, and post-approval clinical specificity
studies. Other agencies and professional organizations could play a more
active role than they have been, as well. These concerns notwithstanding, our
company believes the FDA-approved technologies plus others still in development
promise to significantly improve the accuracy of cervical cytology testing.
The primary goal of the Pap screening test is to eliminate death and suffering
that result from invasive cancer of the cervix, at an acceptable cost. This is
accomplished by identifying and treating pre-invasive cancerous lesions. Other
uses of the Pap test, such as the detection of ovarian cancer, sexually
transmitted diseases, etc., are, at best, secondary goals. The sensitivity of
the Pap test to the detection of STDs and other conditions is poor. The
performance of the existing Pap test should be well understood by those who
assess a computer assisted Pap test. The conventional Pap test leads to
treatment of very many women for conditions which, if left untreated, would
never develop into invasive cancer. For every woman with a truly pre-cancerous
lesion, at least 30 and possibly more than 50 receive treatment. Computer
assisted diagnostic Pap screeners should be assessed in the context of an
objective assessment of the current conventional system. For an imperfect
test, accuracy is most completely characterized by the ROC (receiver operating
characteristic) curve. The test accuracy describes the ability of the test to
separate overlapping true positive and true negative populations. The notion
of positive predictive value combines test accuracy with disease incidence.
For the Pap test process to achieve a positive predictive value of only 10%
would require test accuracy corresponding to a separation of the true positive
and true negative populations by 5-6 standard deviations. This problem is
fundamental to the current Pap test. It does not result from the occasional
failure of a screener to detect a "needle in a haystack"; it results from the
fact that few "haystacks with needles" have the potential to become invasive
cervical cancer, and we can't tell which ones they are with the current
approach.
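The dependence of positive predictive value on how common the target condition is can be sketched with Bayes' rule. The sensitivity, specificity, and frequency values below are illustrative assumptions, not figures from the talk, and the sketch does not attempt to reproduce the 5-6 standard deviation estimate.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Fraction of positive test results that are true positives (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same hypothetical test accuracy, applied to conditions of decreasing frequency.
for prevalence in (0.1, 0.01, 0.001):
    ppv = positive_predictive_value(0.9, 0.9, prevalence)
    print(f"prevalence {prevalence:>5}: PPV = {ppv:.3f}")
```

Holding accuracy fixed, the predictive value falls roughly in proportion to the frequency of the condition, so a test aimed at a rare outcome needs very high specificity before most of its positive results are true positives.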
My presentation will focus on evaluating the quality and utility of digital
images. I will summarize some of the principles developed in ongoing
collaborations with Stanford colleagues Robert M. Gray, PhD, Professor of
Electrical Engineering; Debra Ikeda, MD, Section Chief of Breast Imaging; and
many others. Our research involves compression and enhancement of digital
medical images and the applications of these technologies to computer-aided
diagnosis. We study CT images of the lung and mediastinum, MR chest images
taken for the purpose of measuring major vessels in the chest, and many aspects
of mammography. However, when computational interventions affect what
radiologists see, it is imperative that these interventions be evaluated by
carefully designed clinical experiments. Experimental protocols should
simulate ordinary clinical practice to the extent possible. A nearly full
range of examples should be included. Findings should be reportable using the
American College of Radiology Standardized Lexicon. Statistical analyses
should be based upon assumptions that are faithful to the clinical scenario and
tasks. The numbers of studies and radiologists should be sufficient to ensure
satisfactory size and power for the principal statistical tests of interest.
"Gold standards" must be defined clearly and be consistent with experimental
hypotheses. Sources of bias should be recognized and minimized. To the extent
possible, I will deal with all these issues in my 10 minutes, not least some
statistical techniques that we feel are particularly relevant here.
Mammography has become the standard for detection of early, more curable breast
cancer. Increasing numbers of mammograms are being performed, and reading
screening mammograms is a repetitive task that requires high attention to
minute detail. While mammography is the best method for early breast cancer
detection, radiologists interpreting the mammograms are fallible, and an
estimated 30% of breast cancers are present but missed on mammograms. A second
human observer can detect up to 15% more cancers, but having a second reader is
time-consuming and probably impractical, being done in only about 5% of
practices in the US. This type of problem is one that lends itself to
automation, through computer-aided diagnosis. The detection process of
flagging potential abnormalities for the radiologist can be accomplished by
CAD, using digitized mammograms. CAD can be defined as a diagnosis made by a
radiologist using computer output to improve his or her decision, with a goal
of making radiographic interpretation easier and more accurate. Work over the
last decade has developed mammographic CAD programs to a level where about 85%
of breast cancers are detected by the computer, at a reasonably low false
positive rate of 1 or 2 per image. The detection programs work for both
calcifications and masses, the two prime signs of breast cancer on mammogram.
At the University of Chicago, we have been running CAD in our clinical
mammography area for over a year, on more than 5,000 mammograms. Analysis of
the first 1,149 patients shows that CAD performed as expected, identifying 86%
of the screening-detected cancers. We have also been greatly encouraged by our
studies showing that retrospective CAD can correctly identify approximately 50%
of lesions clinically missed by radiologists (observation errors). The current
level of development is appropriate for clinical introduction, acting as a
second reader to aid the radiologist, who retains the final decision on whether
or not potential areas on the mammogram are suspicious enough to warrant
further work up. I believe that introduction of technology of this type is
inevitable, as the results to date have been very promising. Radiologist access
to CAD in the clinical setting should act to significantly improve patient
care.
Many different types of systems are categorized as "computer aided diagnosis
(CADx)." The essential characteristics of these systems that have relevance
for assessments of safety and effectiveness can be analyzed into three groups:
(1) type of system design, of which three are identified here; (2) type of
information base incorporated in the system, of which three are also
identified, these three not having a one-to-one correspondence with the three
types of system design; and (3) type, or level of certainty, of system output,
of which four are identified, again without any direct correspondence to the
preceding six classes. Within each of the above three groups, a hierarchy of
the types was identified, each level having more serious implications for
evaluation of safety and effectiveness than the preceding. Thus, there is,
hypothetically at least, the possibility of 36 distinct combinations of the
levels of the three groups of essential characteristics within the set of CADx
systems, each having a different convolution of the challenges and
opportunities for assessment of safety and effectiveness represented within the
groups. This suggests that CADx systems cannot be thought of as a single type
of entity and that a single regulatory policy cannot be successful for all.
Rather, a regulatory scheme that recognizes each of the identified subgroups
and establishes policies for them which account for the various ways in which
they may be combined is required. Some systems will require extensive clinical
testing. Others may be fully evaluated through engineering testing alone.
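The count of 36 follows from multiplying the sizes of the three groups
(3 x 3 x 4). A minimal sketch of that enumeration is given below. The summary
gives only the number of types in each group, not their names, so the labels
used here are hypothetical placeholders.

```python
from itertools import product

# Placeholder labels only: the summary states how many types each group
# contains (3, 3 and 4) but does not name them.
DESIGN_TYPES = ["design-1", "design-2", "design-3"]
INFO_BASE_TYPES = ["info-base-1", "info-base-2", "info-base-3"]
OUTPUT_TYPES = ["output-1", "output-2", "output-3", "output-4"]


def enumerate_combinations():
    """Return every (design, information base, output) combination."""
    return list(product(DESIGN_TYPES, INFO_BASE_TYPES, OUTPUT_TYPES))


combos = enumerate_combinations()
print(len(combos))   # number of distinct combinations
print(combos[0])     # first combination in the enumeration
```

Listing the combinations explicitly makes it easier to see which ones a given
policy would cover and which it would leave out.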
1. Richard Eaton, NEMA presented a list of questions from his organization:
(i) General Questions/Issues: How do computer-aided devices differ from
computer-controlled devices? Are there additional requirements for 510(k)
applications? What are the 510(k) and postmarket surveillance requirements
which will be associated with these types of devices? Are there different
levels of concerns for each of these types of devices? Will recalls be
required if there is a "glitch" in the software? How do user errors influence
regulation of these devices?
(ii) Issues pertaining to transmission of data over line: We have concerns over
what happens when data is sent over a line: How is validation handled when data
is sent over a line, as opposed to on-site validation? What about patient
confidentiality issues?
(iii) Issues relating to a device "acting as a physician:" Will there need to
be a duplicative diagnosis done by a physician if the device itself "is acting
as a physician" and thus renders a diagnosis?
(iv) Sufficiency of electronic signatures: Is there an "equivalent" for the
doctor's signature, an "electronic OK" which is needed before the diagnosis
data can be transmitted across the lines?
(v) Effect of favorable FDA approval of class III device upon manufacturer's
product liability exposure, and use of favorable decision as a defense: If FDA
determines that a computer-aided device is a class III device, and thus a PMA
would be required, would an FDA approval of an application serve to reduce the
manufacturers' product liability exposure, such that FDA's approval could be
used at least as a partial defense to an action against a manufacturer?
2. Several participants suggested that they would like an explanation of how a
CADx decision was reached. This would better allow the physician to judge the
reliability of the CADx output, which would otherwise be just emerging from a
"black box." Others, however, indicated that this is overly simplistic. For
any difficult problem, it is very hard to provide a simple explanation of the
CADx system "reasoning." It was further suggested that what the user of a CADx
system really needed was an indication of the consistency and accuracy of the
diagnostic information, not a description of how the decision was reached.
3. Numerous speakers cited the usefulness and validity of receiver operating
characteristic (ROC) analysis. The ROC curve is a plot of the variation of the
true positive fraction as a function of the false positive fraction
(sensitivity vs. one minus specificity). The ROC curve is obtained by varying
the threshold criterion for deciding between positive and negative diagnoses
from more conservative to less conservative. It therefore includes information
on all system operating points (sensitivity/specificity pairs) and is
independent of disease prevalence. A particular benefit of the method is that
it allows the separation of technology assessment from practice-of-medicine
issues. One participant was concerned with Gaussian assumptions (not
fundamental to ROC theory but made by many ROC analysis programs). Discussion
also ensued concerning the variability of human observer performance and the
difficulty this causes for the evaluation of a machine "observer." The
complication that diagnostic tasks are not typically binary (as required for
conventional ROC analysis) but have multiple possible outcomes (diagnoses) was
also raised.
4. The suggestion was made that the evaluation of commercial CADx devices
should resemble the scientific peer review process. The machine algorithm and
representative data should be available for outside professionals to carry out
disinterested confirmation of the manufacturer's results. The fear of
compromising trade secret or other proprietary information seemed to temper
the enthusiasm of commercial participants for this suggestion.
5. Questions were raised concerning the availability of guidance on other
software matters. It was noted that in addition to the overall software
policy, there are Center groups considering policy with regard to commercial
off-the-shelf (COTS) software and developing design control guidance as part of
the GMP revision efforts. All of these efforts will be soliciting public
comment.
6. It was noted that CADx algorithms may be very sensitive to the particular
sensors used in obtaining training data. Great care must be exercised in
determining the range of input sensors for which the device functions
accurately. Furthermore, it was noted that often in the evaluation of CADx
devices there is a commingling of the training and testing sets. This must be
avoided in order to obtain an unbiased performance estimate.
7. Compression was mentioned as a source of performance degradation for CADx
devices. When large data sets are needed, (lossy) compression may be required.
Its effect must be examined carefully.
8. One participant noted that a liberal interpretation of the medical device
definition would result in clinical guidelines being considered as medical
devices. Despite their ubiquitous presence in the field, very few have been
properly validated.
9. The presentation of CADx results and the labeling of CADx devices in terms
of probabilities was discussed. This was felt by many participants to be
desirable; however, it was suggested that the clinician "user" population was
not sufficiently sophisticated to understand data presented in that way.
10. The problem of CADx false positives was raised. CADx "attention getting"
systems typically point to many areas where no abnormality exists. This was
felt to be a natural attribute of these systems and an aspect which should be
addressed through user training and experience. As long as these systems are
only "aiding" in the diagnosis, they should not be held to the same standards
as a device actually making the diagnosis.
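The ROC construction described in item 3 above can be illustrated with a short
sketch. It sweeps the decision threshold over a set of scores and records the
resulting (false positive fraction, true positive fraction) pairs, then
computes the area under the empirical curve by the trapezoidal rule. The
function names, the scores, and the truth labels are all hypothetical and were
chosen only for illustration. No Gaussian assumption is made here; this is the
plain empirical curve, not the fitted curve produced by many ROC programs.

```python
def roc_points(scores, labels):
    """Return (FPF, TPF) pairs obtained by sweeping the decision threshold.

    scores: system output per case (higher means more suspicious).
    labels: truth per case, 1 = disease present, 0 = disease absent.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")

    points = [(0.0, 0.0)]  # strictest threshold: nothing is called positive
    # Use each distinct score as a threshold, from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the empirical ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores and truth labels for eight cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)
```

Each point is one sensitivity/specificity operating pair. Because both
fractions are computed within the diseased and non-diseased groups separately,
repeating the non-diseased cases in the data leaves the points unchanged, which
is one way to see the prevalence independence mentioned in item 3.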
Thank you for participating in the computer-aided diagnosis device workshop.
We have heard today a few ideas on the categories of CADx devices and even more
on evaluation methods, especially the use of the receiver operating
characteristic curve. Further written comments are solicited and may be faxed
to us at (301) 443-9101. This input will aid us in preparing reviewer guidance
for the premarket clearance of these devices. As a reminder to the speakers,
please give me a copy of your overheads and, if possible, your talk. I will
submit these to Dockets Management for docket number 95N-0363, where they will
be available to the public. If the speakers will also provide me with a brief
summary of their talks, we will compile a meeting summary and mail it to all
persons who have registered for this workshop. The summary will also be
available on the World Wide Web at the URL http://www.fda.gov.
(March 6, 1996)
Center for Devices and Radiological Health / CDRH