April 30, 2004

The Honorable Trent Lott

Dear Senator Lott:

I am pleased to submit the enclosed report to Congress on Improving the Usability and Accessibility of Voting Systems and Products. This document, also known as the Human Factors report, was produced in consultation with the National Institute of Standards and Technology (NIST) to meet the requirements of Section 243 of the Help America Vote Act of 2002 (HAVA). Although the U.S. Election Assistance Commission (EAC) was not established until mid-December 2003, and therefore missed the October 2003 statutory deadline for this report, we were able to focus early on this task and on the research that NIST had completed. The resulting report describes the findings of that study. It also presents a set of recommended actions that, if implemented, should improve the usability and accessibility of voting products and systems.

We would be pleased to meet with you at your convenience to discuss our work and the recommendations contained in this report. I can be reached at (202) 566-3100.
Enclosure

Improving the Usability and Accessibility of Voting Systems and Products
Disclaimer

Any mention of commercial products or reference to commercial organizations is for information only; it does not imply recommendation or endorsement by NIST, nor does it imply that the products mentioned are necessarily the best available for the purpose.

Executive Summary

In the Help America Vote Act (HAVA) of 2002, Public Law 107-252, the Election Assistance Commission is mandated to submit a report to Congress on human factors, usability, and accessibility. Specifically, “…the Commission, in consultation with the Director of the National Institute of Standards and Technology, shall submit a report to Congress which assesses the areas of human factor research, including usability engineering and human-computer and human-machine interaction, which feasibly could be applied to voting products and systems design to ensure the usability and accuracy of voting products and systems, including methods to improve access for individuals with disabilities (including blindness) and individuals with limited proficiency in the English language and to reduce voter error and the number of spoiled ballots in elections.” This report was written to address this mandate. It describes the results of our review and analysis of related research, standards, guidelines, and evaluation methodologies. It also presents our assessment of their applicability to voting systems and products and to the process of qualification and certification testing. As a result of this investigation, we have compiled a set of recommendations that, if followed, should measurably improve the usability and accessibility of voting products and systems. These recommendations are:
1. Introduction

The goal of this report is to describe how research and best practices from the human factors, human-machine and human-computer interaction, and usability engineering disciplines can be brought to bear to improve the usability and accessibility of voting products and systems. A major contribution of the report is a set of ten recommendations for developing standards, accompanying test methods, and guidelines that can measurably improve levels of usability and accessibility. After the introduction, we discuss our assumptions and the information we used to generate the recommendations. We describe the current status of and testing process for voting systems, present an overview of the concepts of usability and accessibility, and discuss some related standards. We then present a detailed discussion of approaches that can be applied to improve usability and accessibility based on our review of relevant standards, guidelines, and testing and evaluation methodologies. We conclude with the set of recommendations and a discussion of short- and long-term activities that can help achieve the recommendations.[1]

1.1 Scope

The scope of this report is limited to human factors issues; that is, we are concerned with the process of the voter casting a ballot as intended and, to a lesser extent, the interaction of the poll worker with the voting system. This primarily involves the “user interface” the voter is presented by the system and the environment at the polling place. We have NOT examined issues concerning what happens after the voter casts a ballot, such as the accuracy of counting the votes, the quality of the hardware and software, or the security of voting systems, as these, in general, do not involve user interaction. Any approaches addressing these issues that do involve voter or election official interaction would require some analysis of the human factors, and these should be addressed in future work.

Our analysis addresses issues pertaining to both voting products and voting systems. A voting product, as defined here, refers to a product procured from a vendor, such as a Direct Recording Electronic (DRE) terminal.[2] By voting system we mean the combination of physical environment, voting product, ballot, voter, and other persons involved in the voting process (e.g., poll workers and other election officials). The bulk of the discussion focuses on the usability and accessibility of voting products for the voter. However, we also include usability issues pertaining to ballot design, the influence of the environment on accessibility as well as usability, and the setup and operation of voting systems by poll workers and election administrators. Further, we have constructed our recommendations for improvements so that they will fit into the existing and future qualification and certification testing frameworks for voting systems. Note that we expect that these recommendations will be taken into consideration by the Technical Guidelines Development Committee (TGDC) when it becomes operational under the Election Assistance Commission (EAC) as described in HAVA.

1.2 Background

There are many examples, some highly publicized, of voter confusion possibly caused by usability and accessibility problems (McIntire, 2003; Caltech–MIT, 2001). Bederson et al. identified a number of potential usability problems with DREs to be used in Maryland (Bederson & Herrnson, 2002).
Susan King Roth’s 1998 article pointed out problems with readability, legibility, organization, and height (Roth, 1998). A 2002 report (Burton & Uslan, 2002) from the American Foundation for the Blind’s AccessWorld, describing informal testing by 15 blind and low-vision users, reported “tremendous improvements over the way in which people who are blind and visually impaired currently vote” but also stated there was “certainly room for improvement.” The report even cited one machine as preferable because it had “a lesser tendency to cause confusion.” Additional informal testing at the National Federation of the Blind (NFB) has shown a number of accessibility or usability issues associated with nearly all of the six modern DRE devices they tested. It should be noted that both the AccessWorld and NFB studies were performed on voting products with features specifically designed for voters with disabilities.

It also appears that the problems of voting product usability and accessibility are not felt equally across the voter population. The U.S. Commission on Civil Rights reported (Voting Irregularities, 2001) that “Poorer counties, particularly those with large minority populations, were more likely to use voting systems with higher spoilage rates than more affluent counties with significant white populations.” Further, “Even in counties where the same voting technology was used, blacks were far more likely to have their votes rejected than whites.” As a result of these and other reported voting irregularities, the U.S. Congress enacted the Help America Vote Act (HAVA) of 2002, Public Law 107-252. In the areas related to human factors, usability, and accessibility, the Election Assistance Commission is mandated to submit a report to Congress. Specifically, “…the Commission, in consultation with the Director of the National Institute of Standards and Technology, shall submit a report to Congress which assesses the areas of human factor research, including usability engineering and human-computer and human-machine interaction, which feasibly could be applied to voting products and systems design to ensure the usability and accuracy of voting products and systems, including methods to improve access for individuals with disabilities (including blindness) and individuals with limited proficiency in the English language and to reduce voter error and the number of spoiled ballots in elections.” This report was written to address this mandate.

1.3 Brief History of Standards and Testing for Voting Systems [3]

During the 1970s, few states had any guidelines for testing or evaluating voting machines. Stories about voting equipment problems and failures circulated among election officials, triggering concerns about the integrity of the voting process. In 1975, NIST (known then as the National Bureau of Standards, or NBS) prepared a report entitled Effective Use of Computing Technology in Vote Tallying (NBS Special Publication 500-30). The report concluded that one cause of computer-related election problems was the lack of technical skills at the state and local level for developing or implementing complex written standards against which voting system hardware and software could be tested.

This report, along with comments from state and local election officials, led the U.S. Congress to direct the Federal Election Commission (FEC) to work with NIST to conduct a study of the feasibility of developing national standards for voting systems.
Following release of the 1982 report, limited funds were appropriated to begin the multi-year effort. Thirteen meetings and five years later, with the help of about 130 different policy and technical officials, the FEC issued the 1990 Voluntary Voting System Standards (VSS).

No Federal agency at that point had been assigned responsibility for testing voting equipment against the VSS. The National Association of State Election Directors (NASED) subsequently established a “certification” program through which equipment could be submitted by the vendors to an Independent Testing Authority (ITA) for system qualification. The ITAs are accredited by NASED to determine whether voting products are in compliance with the VSS. The results of the qualification tests can be used by States and local jurisdictions to help them assess system integrity, accuracy, and reliability as part of their own certification testing. The VSS themselves were substantially updated and issued again in 2002, following a three-year development and public review process. This most recent update was accorded favorable review by the General Accounting Office in its preliminary audit (GAO, 2001). This release included functional requirements to improve accessibility for individuals with disabilities. An advisory section was included as guidance to improve user interface and ballot design. There were no specific qualification test criteria developed for this section; hence, no formal conformance tests are associated with the guidance.[4]

2. Basic Terminology and Concepts

The purpose of this section is to explain the terminology and concepts of usability, accessibility, standards, and conformance testing as used throughout this report.

2.1 Definition of a “System”

The term “system” is used in industry in a number of different ways. Software is often referred to as a system, particularly by those developing software products (e.g., the operating system). A computer is often referred to as a system, though it contains both hardware and software. The hardware, software, and wiring used to interconnect a set of computers are often referred to together as a system (e.g., the networking system). These definitions are problematic in a discussion of usability. In the usability field, the definition of system encompasses the users and all the elements required to accomplish some goal. A specific system is viewed as one (or more) users attempting to accomplish some activities toward a goal or set of goals within a specific environment. The activities include all interaction between the users and other parts of the system (the products, the environment, etc.) as well as activities they might perform internally, such as decision-making. Elements of the environment include: (1) the physical environment (lighting, temperature, and ambient noise), (2) the psychological environment (time or social pressure present in the environment), (3) all of the equipment used, and (4) any other users or support personnel involved. For voting, this means that, from a usability perspective, the voting system is defined by:
2.2 Definitions of Accessibility and Usability

In this report, we have tried to adhere as much as possible to the International Organization for Standardization (ISO) definitions of accessibility and usability. These definitions support development of standards that will lend themselves to suitable test methods for conformance. It is critical to be able to measure accessibility and usability in order to say with authority that a voting product or system has achieved a specified level of accessibility and usability.

2.2.1 Disability

A disability is defined by the Americans with Disabilities Act of 1990 as “a mental or physical impairment which substantially limits one or more of a person's major life activities.” This includes, but is not limited to, four major types of impairments: (1) physical impairments such as limited or total loss of use of one or more limbs, limited strength or dexterity, speech impediments, and difficulties in motor control (including tremors); (2) visual impairments ranging from partial vision loss to legal blindness to total loss of vision, as well as other visual deficiencies including color blindness, macular degeneration, and tunnel vision; (3) auditory impairments including partial hearing loss in segments of the auditory spectrum or across the entire auditory spectrum, and deafness; and (4) cognitive impairments including learning and reading disabilities and, under some definitions, users with limited English proficiency (LEP). There are also common forms of multiple disabilities (e.g., deaf-blindness). Designing a product that could be used unaided by the total range of users with disabilities (including multiple disabilities) is most likely infeasible, but significant portions of the populations with disabilities can be accommodated with the proper application of modern technology and good, universal design. This is one of the key areas where computer-based solutions hold significant promise. For example, alternate media such as text-equivalent speech are possible, since audio output is considered to be a nearly universal solution for those with visual impairments.[5]

2.2.2 Accessibility

Accessibility standards are typically intended to specify designs that will maximize access for the majority of persons with these types of disabilities, but they do not necessarily guarantee access for a specific individual’s disability or combination of disabilities. An example of this approach to accessibility standards is the set of Section 508 Standards (Section 508 Standards, 2000) developed by the U.S. Access Board for Section 508 of the Rehabilitation Act of 1973, as amended in 1998. The U.S. Access Board is an independent Federal agency devoted to accessible design for people with disabilities. Section 508 is a set of accessibility requirements for Federal electronic and information technology. It applies to all Federal agencies when they procure, develop, use, or maintain such technology. Accessibility, as defined by the Access Board, “is a term that describes products or services that meet the Access Board guidelines (in the case of the ADA) or the standards (in the case of 508). Something that is accessible – i.e., meets the guidelines or standards – is not always usable.” The Access Board also recognizes that products are accessible to individuals or groups of individuals. They are never “accessible” as an absolute unless every single person with any type, degree, or combination of disabilities would be able to use the products. Products can meet accessibility standards or guidelines.
These products are sometimes referred to as “accessible” in this more limited sense. However, these products may still be inaccessible to some people. The ISO standard TS 16071 defines accessibility as the usability [italics added] of a product, service, environment, or facility by people with the widest range of capabilities (ISO/TS 16071, 2003). For the purposes of this report, however, we intentionally make a distinction between accessibility and usability since, in general, meeting accessibility standards does not necessarily imply that a system is usable by a particular individual or even a group of individuals, but only that barriers to access have been removed. This distinction is particularly pertinent to this report, and the recommendations for addressing these issues (including the means of testing and certification) are discussed in more detail in Section 6.

2.2.3 Usability and Usability Testing

2.2.5 Accessibility versus Usability

2.2.6 Usability in Practice (an Example)

Even assuming that customers can access the scanner, there are some obvious usability issues, especially with first- and second-generation scanner designs. The first difficulty is determining how to insert the credit card. Often the diagram next to the slot is horizontal, but the customer must insert the card vertically and must figure out how to match the diagram. To preserve both efficiency and satisfaction, the cashier may take the card and insert it for the customer if the customer tries a few times and still does not orient the card properly. Next, there are several options to choose from, following instructions on a small screen. The wording and orientation of the instructions on the screen and buttons can be confusing. The dollar total is displayed along with a yes-or-no question. But the displayed “yes” and “no” are often located at the top of the screen, quite near a set of buttons just above the screen for credit or debit. As a result, even with the color coding used on the “yes” and “no” buttons, it is not uncommon to see users hesitate or even hit the credit or debit buttons instead of the “yes” and “no” buttons at the bottom, below the screen.

Effectiveness for this product could be measured by different criteria, including counting the number of errors (e.g., the number of incorrect insertions of the credit card or the number of incorrect button presses) or counting the number of help incidents (e.g., the number of times the cashier intervenes). Efficiency can also be measured several ways, including timing the transaction or counting the number of discrete steps. It can be measured either only on the “happy path” (i.e., with no errors affecting effectiveness) or as an average over actual use (i.e., including errors affecting effectiveness). Finally, satisfaction can be measured by various means, such as through a questionnaire that asks how the customers perceived the process and the outcome (e.g., did they feel frustrated or embarrassed by needing help or by holding up the line). Once suitable criteria are established, usability testing can be conducted to produce measures of the usability of such a product. The resulting measurements can be used as benchmarks for testing new designs or comparing across products of equivalent capability.
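To make these measures concrete, the sketch below shows one way the effectiveness, efficiency, and satisfaction of a product like the card scanner might be summarized from usability-test observations. This is an illustrative sketch only: the session records and field names are invented, and the use of the System Usability Scale (SUS), one of the validated questionnaires noted later in Section 2.3, as the satisfaction instrument is simply one possible choice.

```python
# Illustrative sketch only: summarizing effectiveness, efficiency, and
# satisfaction from hypothetical usability-test session data.
# The field names and sample observations are invented for this example.

from statistics import mean

# One record per test participant (hypothetical observations).
sessions = [
    {"completed": True,  "errors": 0, "assists": 0, "seconds": 41,
     "sus": [4, 2, 5, 1, 4, 2, 5, 1, 4, 2]},
    {"completed": True,  "errors": 2, "assists": 1, "seconds": 78,
     "sus": [3, 3, 4, 2, 3, 3, 4, 2, 3, 3]},
    {"completed": False, "errors": 3, "assists": 2, "seconds": 120,
     "sus": [2, 4, 2, 4, 2, 4, 3, 4, 2, 4]},
]

def sus_score(responses):
    """Score one 10-item SUS questionnaire (each response 1..5) on a 0..100 scale.
    Odd-numbered items contribute (response - 1); even-numbered items contribute
    (5 - response); the sum is multiplied by 2.5."""
    total = sum((r - 1) if i % 2 == 0 else (5 - r) for i, r in enumerate(responses))
    return total * 2.5

# Effectiveness: task completion rate plus error and assist (help-incident) counts.
completion_rate = mean(1.0 if s["completed"] else 0.0 for s in sessions)
mean_errors = mean(s["errors"] for s in sessions)
mean_assists = mean(s["assists"] for s in sessions)

# Efficiency: mean time on task, averaged over all attempts rather than only error-free ones.
mean_time = mean(s["seconds"] for s in sessions)

# Satisfaction: mean SUS score across participants.
mean_sus = mean(sus_score(s["sus"]) for s in sessions)

print(f"Completion rate: {completion_rate:.0%}, errors/session: {mean_errors:.1f}, "
      f"assists/session: {mean_assists:.1f}")
print(f"Mean time on task: {mean_time:.0f} s, mean SUS: {mean_sus:.1f}")
```

Such summary measures become meaningful benchmarks only once acceptable values (for example, a minimum completion rate) are agreed upon, which is the role of the requirements discussed in the next section.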
2.3 Product Requirements, Usability, and Testing Methods

Product requirements are used to specify in advance what is expected of a product. Requirements may vary greatly in their level of detail and formality: a company may compile a simple list of desired features for its own use, or an authorized committee may compose a lengthy formal standard for use by the general public. In this section, we discuss four independent properties of requirements: their type, their pertinence to human interaction, their level, and their specificity. We define each of these properties and then discuss the implications for usability and for appropriate test methodology.

2.3.1 Type

Performance requirements can be further subdivided into purely functional requirements and those specifying the degree of performance. As an example, suppose we wish to require that an automobile allow the driver to open the trunk from within the passenger compartment. A design requirement might specify that there be a trunk-release handle at least 4 inches long, located no farther than 7 inches to the left of the driver’s left knee, and that the driver must pull the handle towards the back of the car in order to operate it. A functional (performance) requirement might simply state that there must be a way for the driver to open the trunk without leaving his/her seat. A degree-of-performance requirement might add that the operation must be able to be performed in four seconds or less (on average) once a driver is shown how to operate the mechanism. Note that although this particular example involves human use, the performance/design distinction applies just as well to requirements for autonomous products. See the next section (2.3.2) for more discussion of human interaction.

Generally speaking, design requirements are appropriate when the purpose of the requirement is interoperability, since this often requires an “exact fit” between system components. While there may be some variation in the concrete implementation of a design requirement (e.g., the color and exact shape of the trunk-release handle), the thrust is to constrain the product. Performance requirements are usually preferable when quality is the goal, since they directly describe the behavior of the system and not the supporting mechanism. Thus, performance requirements allow for innovative solutions, and they enable comparison of multiple competing designs since they are “technology agnostic.” Design requirements usually invite direct examination as the most suitable mode of testing. Examination may be as simple as observing the presence of some required part, or it may involve very precise measurement and analysis (e.g., analyzing a file to see if it conforms to a format requirement; checking “legal HTML” for a web page is a form of examination). Performance requirements, on the other hand, are better suited to testing by operation. Since the requirement describes how a system should behave, we operate the system and see whether it works as expected.
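The contrast between examination and operation can be illustrated with a small sketch based on the hypothetical trunk-release example above. Nothing here comes from an actual standard; the function names, thresholds, and sample measurements simply restate the invented example for illustration.

```python
# Illustrative sketch only: a design requirement tested by examination (static
# measurement of the artifact) versus a degree-of-performance requirement
# tested by operation (timed trials with users shown the mechanism).

from statistics import mean

def meets_design_requirement(handle_length_in: float, distance_from_knee_in: float) -> bool:
    """Examination: measure the handle and compare against the design specification
    (handle at least 4 inches long, no farther than 7 inches from the driver's left knee)."""
    return handle_length_in >= 4.0 and distance_from_knee_in <= 7.0

def meets_performance_requirement(trial_times_s: list[float]) -> bool:
    """Operation: time a sample of drivers who have been shown the mechanism and
    check that the average operation time is four seconds or less."""
    return mean(trial_times_s) <= 4.0

# Hypothetical measurements for one candidate design.
print(meets_design_requirement(handle_length_in=4.5, distance_from_knee_in=6.0))  # True
print(meets_performance_requirement(trial_times_s=[2.8, 3.5, 4.1, 3.9]))          # True
```

The first check is a one-time physical measurement of the artifact; the second requires operating the system with a sample of users and summarizing their observed performance.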
2.3.2 Human Interaction

Conversely, requirements limiting automobile emissions or describing the format of a DVD do not have direct implications for the way in which humans use automobiles or DVD players. Interaction requirements can be physical or psychological. Physical interaction requirements relate to such things as product weight and size; the spacing of knobs, controls, and buttons; the force required to interact with knobs, controls, and buttons; the user’s reach envelope (what parts of the product the user can physically reach); and the user’s field of view (what parts of the product the user can visually “reach”). Psychological interaction requirements include the user’s understanding of the system’s displays, labels, and messages; the user’s understanding of the processes and procedures required to use the product; and the user’s ability to understand the outcome of the interaction (including awareness that goals were met). Interaction requirements can be formulated in functional terms (e.g., users must be able to carry the product) or as design specifications (e.g., the unit’s maximum weight and size). It is presumed that meeting the design specification will ensure the usability of the resultant product. However, design requirements are nearly always based on an “average” or “typical” range of users and do not necessarily apply to all individual users. Conformance tests for requirements that do not involve human interaction are usually susceptible to automation, since only the intrinsic structure or behavior of some artifact is being tested. When a requirement does involve human interaction, the way in which it is to be tested depends on its type, as defined in Section 2.3.1.

2.3.2.1 Testing Design Requirements for Interactive Products

2.3.2.2 Testing Functional Requirements for Interactive Products

2.3.2.3 Testing Performance Requirements for Interactive Products

A special case of a performance specification would be to mandate a certain degree of user satisfaction with the use of the product. Within the usability and human factors engineering community, user satisfaction is often considered a requirement for a product, but it is rarely stated explicitly. However, this trend is changing. Because satisfaction is a purely subjective dimension of the system's usability, requirements for satisfaction are difficult, but not impossible, to include in product testing. Testing might involve questionnaires or interviews with users to determine their subjective reaction to the use of the product. Note that a number of validated subjective satisfaction questionnaires exist, such as SUMI (http://sumi.ucc.ie/), SUS (http://www.cee.hw.ac.uk/~ph/sus.html), and QUIS (http://www.cs.umd.edu/hcil/quis/).

2.3.3 Levels

The testing implications are straightforward: numerous low-level specifications typically require numerous tests (although each test is likely to be small and simple). Higher-level specifications might require somewhat more complex and holistic tests, but there are likely to be fewer of them.

2.3.4 Specificity

Specific requirements are not the only mechanism by which quality can be encouraged. In cases where it is impossible to state precise requirements, general specifications can provide useful guidance. Indeed, some usability “requirements” (e.g., use legible fonts) are in fact simply checklists of general design considerations to be taken into account by developers. Specificity is important when it is necessary to test objectively whether a given system conforms to a requirement. General specifications lead to tests that are either subjective (e.g., an expert uses his/her judgment to decide whether the visibility is “good”) or somewhat arbitrary (the test procedure, rather than the requirement, adopts a more precise definition of what constitutes good visibility). Specific requirements support test procedures that are both more objective and more directly justified by the text of the requirement.
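The practical difference can be seen in a short sketch. Assume, purely for illustration, that a test procedure chooses to interpret “use legible fonts” as a minimum character height of 3.0 mm; the threshold is invented here and is not drawn from any voting standard.

```python
# Illustrative sketch only: a general requirement forces the test procedure to
# supply its own (somewhat arbitrary) interpretation, while a specific
# requirement can be checked objectively and reproducibly.

MIN_CHARACTER_HEIGHT_MM = 3.0  # hypothetical interpretation adopted by the test procedure

def passes_general_legibility(expert_judges_legible: bool) -> bool:
    """General requirement ("use legible fonts"): the outcome rests on an
    expert's subjective judgment recorded during inspection."""
    return expert_judges_legible

def passes_specific_legibility(measured_character_height_mm: float) -> bool:
    """Specific requirement ("character height shall be at least 3.0 mm"):
    the outcome follows directly and objectively from a measurement."""
    return measured_character_height_mm >= MIN_CHARACTER_HEIGHT_MM

print(passes_general_legibility(expert_judges_legible=True))    # depends on the reviewer
print(passes_specific_legibility(measured_character_height_mm=3.2))  # True, reproducibly
```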
2.4 Standards and Conformance Testing

2.4.1 Terminology of Standards

2.4.2 Pragmatic Issues for the Application of Standards
2.4.2.1 Interpretation

2.4.2.2 Conformance Testing

Note that there are many types of testing that do not deal directly with conformance – examples include:
Finally, there is the issue of test operation. Conformance test suites are sometimes executed by the vendor (self-testing) or by a potential purchaser. It has also become common practice for a third party, such as an accredited laboratory, to perform the testing. As mentioned earlier, such third-party testers are referred to as Independent Testing Authorities (ITAs).

2.4.2.3 Enforcement

In the first case, the standard itself may make distinctions between its binding specifications (often denoted by saying that something “shall” be the case) and non-binding specifications (denoted by “should”). Second, the standard as a whole may be enforced in several ways:
3. Usability and Accessibility Requirements of Voting Systems

As one would expect, the various kinds of usability-related requirements are well represented for voting products, though most are functional and performance based. For example, there is a functional requirement that the voter be able to cast a single vote in a winner-take-all election or to cast multiple votes in a multi-member election. There is a functional requirement to allow voters to modify their votes before casting them. Functional requirements also place constraints on the interaction, designed to protect the voter from inadvertent errors, such as the provisions to prevent overvotes and to notify voters of undervotes. There are also general functional requirements, such as the ability of voters with disabilities to interact with the product.

Interaction requirements can also be identified for voting products, some of which exist in current or draft standards. These include specification of the typical reach envelope, minimum font size, and other specific design details. As with all design requirements, there is some question as to the effect these requirements actually have on the usability of the product. User performance requirements for voting products also exist. These are generally not enforced and appear to be provided only as guidelines. In one state, for example, there is a requirement that the act of voting by an individual voter take no more than five minutes. There do not appear to be any currently defined requirements related to user error rates, though there appears to be an implied requirement for DRE systems that sets the error rate for overvoting to zero. Likewise, there appears to be no standard for the number of anticipated user errors, or the number of calls for assistance, that would be considered acceptable when dealing with a large user population such as the voting population. Such errors might include the number of times a voter inadvertently attempts to overvote, unintentionally undervotes a ballot, or is unsure of the next step in a process, whether or not these conditions are corrected before the vote is cast. User satisfaction requirements do not appear to be defined for voting products, though satisfaction has been the subject of many articles on voting. The question that remains to be answered is whether or not existing standards are necessary and/or sufficient to ensure a high degree of usability for voting products. Finally, it is important to note that there is considerable variation in the implementation and design of voting products, which makes it a challenging task to create standards that are testable, span this range of design, and ensure some level of good usability and accessibility.

3.1 Implementation Examples of Functional Requirements for Voting

3.1.1 Implementation Variations

Since multiple designs may be in use across the country, across a state, or even across a district, the usability of the products will vary. Further, within the U.S., vendors are free to create unique voting products. This is due, in part, to our culture of both voluntary standards and free competition. Contrast this with the approach taken by the Brazilian government, which contracted with two research companies to design a single product for voting (Caltech–MIT, 2001). Separate contracts were made with a number of companies to manufacture the product to the design specifications.
In this situation, provided the product is both accessible and usable, all voters will experience the same system, and therefore the same level of usability and accessibility will be seen across the entire country. Although this single-design approach is a possibility for the U.S., the variations in State requirements and the nature of the relationship between the Federal government and State governments make it a highly unlikely solution. Nevertheless, it is still necessary to ensure that all designs from all vendors achieve a minimum level of usability and accessibility. The current VSS approach assumes that appropriate standards can be put in place to ensure the usability and accessibility of voting products. However, design standards can ensure a specific level of usability and accessibility only if they completely specify the interface design. This can restrict both the incorporation of new advances in technology and the creativity of designers to develop novel solutions. Alternatively, the usability and accessibility of each product can be independently determined and compared to a fixed standard for these aspects of the product design. It is for this reason that this report focuses on performance-based standards for both usability and accessibility and minimizes the dependence on design standards.

3.1.2 Example of Voting Product Design Variations

In this section, we describe these designs in terms of their interaction characteristics[8] in a winner-take-all election and discuss the resulting usability issues.

3.1.2.1 Product A – No Change Feedback

3.1.2.2 Product B – Yes/No on Change

3.1.2.3 Product C – Deselect/Select to Change

3.1.2.4 Interpretation from a Usability Perspective

Product A appears to be the simplest to use since it involves the fewest steps. User action is responded to directly by the system. However, this design includes the possibility of the voter inadvertently changing his vote and not detecting the change. If, while moving his hand across the screen, the voter accidentally touches the name of an alternate candidate, that candidate will be selected. If the voter fails to notice this change and continues the voting process, he will not necessarily notice this error during the review. If he does notice the error, he must return to the selection and change his vote. Even if the voter is able to perform this correction without difficulty (i.e., he is able to determine how to return to the contest and make the correction), it will increase the time on task for this voter. The potential also exists for the voter to fail to notice the change, even during the review process, and cast his votes with the inadvertent error. Product B has a specific feature apparently designed to prevent this very error: the voter must specifically acknowledge the change in vote before it takes effect. Product C also appears to prevent inadvertent selection of an alternate vote, but does so in a fashion that requires the voter to determine why the system failed to respond. He must determine that the existing vote has to be removed before the intended vote can be cast. There is some question as to whether the voter would realize this by himself. Some voters might have no difficulty in making this determination. Others might ask for help to understand how to make this change. Some voters might perceive (incorrectly) that the system does not allow them to make a change once a candidate selection is shown on the screen.
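To summarize the three behaviors before comparing them, the following sketch models each product’s vote-change logic for a single-seat contest as a minimal state machine. These are abstractions of the behaviors described above, written purely for illustration; they are not vendor code, and the class and method names are invented.

```python
# Illustrative sketch only: minimal models of the three vote-change behaviors
# described above for a single-seat (winner-take-all) contest.

class ProductA:
    """No change feedback: touching a different name silently moves the selection."""
    def __init__(self):
        self.selection = None

    def touch(self, candidate: str) -> str:
        self.selection = candidate            # an inadvertent touch changes the vote
        return f"selected {candidate}"

class ProductB:
    """Yes/No on change: a modal confirmation must be answered before the vote moves."""
    def __init__(self):
        self.selection = None
        self.pending = None

    def touch(self, candidate: str) -> str:
        if self.pending is not None:          # modal: other touches are ignored until answered
            return "please answer the pending prompt"
        if self.selection is None or candidate == self.selection:
            self.selection = candidate
            return f"selected {candidate}"
        self.pending = candidate
        return f"change vote to {candidate}? (yes/no)"

    def answer(self, yes: bool) -> str:
        if yes and self.pending is not None:
            self.selection = self.pending
        self.pending = None
        return f"selected {self.selection}"

class ProductC:
    """Deselect/select to change: the current vote must be removed before a new one is accepted."""
    def __init__(self):
        self.selection = None

    def touch(self, candidate: str) -> str:
        if candidate == self.selection:
            self.selection = None             # touching the marked name removes the vote
            return "selection removed"
        if self.selection is not None:
            return "no response"              # silently ignored; the voter gets no guidance
        self.selection = candidate
        return f"selected {candidate}"

# A stray touch on "Jones" after voting for "Smith" plays out differently on each design:
for product in (ProductA(), ProductB(), ProductC()):
    product.touch("Smith")
    print(type(product).__name__, "->", product.touch("Jones"))
```

Running the final loop shows the same stray touch producing three different outcomes: a silent change (Product A), a confirmation prompt (Product B), and no visible response (Product C).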
All three designs support the functional requirement but pose distinct usability challenges for voters in terms of what they must physically do and mentally understand. As a result, the outcomes for each of these designs (in terms of voter confusion, calls for help, error rate, time on task, or voter acceptance) would likely vary, though all three error rates may well be within acceptable limits. The actual error rates of these designs cannot be determined without adequate testing with actual voters and a determination of what the “acceptable” limits are.

3.2 Potential Usability Problems in Voting Products

In this section, we discuss the types of usability issues associated with voting products. We assume here that there are no accessibility issues, so these discussions apply to all voters. There are a number of factors that determine the nature and frequency of usability problems encountered with any product. Users must be able to (1) deduce the interaction required (or be trained and able to accurately recall the interaction) and (2) be physically and mentally able to perform the interaction. If users cannot achieve (1) or (2), they will not be able to use the system. However, since we are talking about human involvement, perfect usability is rarely if ever realized. Instead, usability varies between perfect success (a perfect match between designer expectation and user interaction, and accomplishment of the goal within an allotted or acceptable period of time) and total failure (the inability to reach the goal, or to reach it accurately, within an allotted or acceptable period of time). Hence, there are three classes of usability problems:
3.2.1 Usability Problems Prior to Success

In a voting product, these usability problems might manifest themselves as increases in the time it takes a user to vote. These changes can be detected only by measuring the actual time required to vote or by directly observing voter behaviors (e.g., physical hesitation while voting). From an individual perspective, the task is still completed, so the issues might not be serious enough to address, except where problems affect the user’s subjective ratings, including confidence in the final result. Although individual performance might not be sufficiently affected to warrant concern over the design, additional delays in lines and added frustration on the part of those waiting can be severe enough to affect the overall system performance (i.e., the collection of all voters in a given polling location). Voters waiting longer in line may perform worse than those who wait less (potentially leading to more frequent or more severe usability problems) or might even leave before voting.

3.2.2 Usability Problems Leading to Partial Failure

More disturbing than the presence of problems leading to partial failure is the fact that voters might not even be aware of the existence of the problems. For example, a voter may unintentionally cast a ballot that shows signs of rolloff voting behavior, believing that they voted all levels of an election when, in fact, they did not. Note also that usability problems may result in added pressure to complete the ballot, resulting in a conscious decision not to vote in some races even though this was not the original intention. This would also be classified as a usability problem leading to partial failure, but the voter would not necessarily view it as a failure. (This is one reason why exit polling and post-test surveys can lead to false impressions about the nature and extent of usability problems in a product.)

3.2.3 Usability Problems Leading to Total Failure

3.2.4 Examples of Potential Usability Problems

This analysis is based on the physical and psychological interaction required. However, other factors can also affect the nature and probability of error. Variations in the visual display or spacing could change how errors occur for any of these designs. The sensitivity or internal technology of the different touch screen devices could also change the error profile. Without actual usability testing, we cannot know whether any of these designs or potential errors described would cause usability problems prior to success, usability problems leading to partial failure or a false sense of success, or usability problems leading to total failure.

3.2.4.1 Example 1 – Changing a Vote

There is also a possibility that this event would not be detected by the voter at the time it occurs and, therefore, the error would not be corrected at that point.[10] If the event is not detected at the time it occurred, she would have another opportunity to detect it during the ballot review (as mandated by HAVA) before casting her vote. However, there is still a possibility that she might not detect the event and would cast her ballot with the unintended error. This illustrates a usability problem leading to a false sense of success.
Product B has an additional design feature that appears to specifically preclude the inadvertent selection error and thus would likely have fewer incidents of this error going undetected (since the inadvertent contact with the screen results in a message). However, it introduces something new that the voter must understand and interact with correctly. There is the possibility that the voter might select “yes” instead of “no” or vice versa, which could be influenced by the arrangement of the buttons, the voter’s prior experience with similar messages on computer systems, or the wording of the message.[11] [12] Further, such an error message should be “modal”; that is, it should not allow the voter to interact with any other part of the system until he completes the interaction with the message. If the error message is not modal, there is a possibility that the message might accidentally become hidden from view and leave the system in an indeterminate state without actually casting a vote. It is unclear in this specific case whether the product would allow the user to cast a ballot with a non-modal message open. If it did, this would illustrate a usability problem leading to total failure. Product C, where the voter must reselect the first candidate’s name to remove the selection mark before selecting a new name, represents a design with a lower or even zero probability that an inadvertent vote change will occur, since it would require inadvertently touching the screen twice – once to remove the existing vote and once to inadvertently select the new vote. However, this same design must support the voter’s deliberate attempt to change her vote. This design appears to be the most difficult for voters to understand since it lacks any feedback or guidance telling a voter how to change her vote. Thus, if she selects a new candidate without first de-selecting the first candidate’s name (assuming that this is the correct action required), there is no feedback during the event to alert her to a problem. Rather than assuming her action was inappropriate for accomplishing the vote change, she might assume that this design does not allow her to make a new selection once one is made. This would make it a usability problem resulting in partial failure. Even if voters assume the system should allow this type of change (or were told that it did), they might struggle with this design or seek help from a poll worker.

3.2.4.2 Example 2 – Voting a Multi-Seat Contest

In other words, two different types of contests are represented by the same visual design. Readers familiar with computer programs with graphical user interfaces will note that these different types of behaviors (single selection from a group and multiple selections from a group) are generally represented by two different visual elements – a round circle (called a radio button) for single selections from a group and a square (called a check box) for multiple selections from a group. These different visual designs are intended to aid the user in visually identifying the capabilities (and the inherent interaction requirements) associated with each type of group. The design of the example products can appear to voters familiar with computers as internally inconsistent (or even as “coded wrong”) and thus might represent a usability problem prior to success. There is also a chance that a voter could mistake the box shape for a computer-style check box and attempt to overvote a single-seat election.
With the new DRE products, this would result in a usability problem prior to success, since nearly all of them preclude overvoting. Finally, some users may mistake a multi-seat contest for a single-seat contest and inadvertently undervote the contest – a usability problem leading to partial failure.[13]

3.3 Potential Accessibility Problems in Voting Products

Since we have elected to restrict our definition of accessibility to access to, but not usability of, the product, and to cover usability by people with disabilities as a subcategory of usability, this section primarily discusses barriers to the accessibility of the product, with only a limited discussion of usability issues. Accessibility represents a wide range of issues and design challenges. Not only must access be provided to people with many types of disabilities, but access must also be provided for U.S. citizens who are not proficient in English and who have different cultural backgrounds (including Native Americans). For a voting system to be accessible, one must first remove barriers to access. Then interaction requirements can be addressed as part of a usability analysis to ensure that the system is actually usable by these diverse populations. To satisfy the goal of accessibility, barriers to access by people with disabilities and language difficulties must be removed or an alternate means of access provided. These barriers are often physical, such as the inability to enter a building, to reach controls or read displays from a seated position, to interact with controls that require visual feedback (e.g., touch screens), or to use a mouse or a touch screen due to lack of fine motor control. Difficulty in communication can also be a barrier to access. Products that provide information exclusively via audible feedback may be difficult or impossible to use by persons who are deaf or have hearing loss. The issue is present not only in the primary display but also in the feedback used to indicate progress or selection. Many products provide auditory feedback to indicate the end of a page or the last page in a multi-page form, which is fine as long as the feedback is redundant and also available as visual information. Touch screen products often use auditory feedback to aid the user in knowing that a selection has been made (though this is nearly always redundant with visual feedback). Visual displays cannot be accessed by many users who have visual impairments. Again, a touch screen product, even if not used for data display, relies heavily on visual feedback for proper operation. Once the barriers to access are removed by adding redundancy, a second condition must be satisfied – the product must be usable by these populations. There is some interaction between usability and accessibility, since the means of providing access presents interaction challenges for the user. Some environmental factors may differ from those experienced by non-disabled users. The product may have a different keyboard or entry device for disabled users. In the case of touch screen-based DRE products, for example, an alternate set of keys is typically provided for movement and selection. There might also be differences in the medium used for data display and feedback (e.g., audio instead of video, text instead of audio, an alternate entry device instead of a touch screen). Interestingly, specific accessibility features, if used by non-disabled users, may reduce some usability problems.
Consider the usability problem noted earlier with the DRE interface design that allowed users to change votes with only a visual indication of the event: the design risks inadvertent activation going undetected. However, a user who is blind, using an audio interface, is provided with the name of the new selection even if the selection was inadvertent. This would increase the probability of detection of the event. However, accessible interfaces are often provided as an alternative and are not integrated. Sighted users might not have access to the audio when they are using the touch screen interface. (Note that voters with vision or cognitive problems can benefit from the audio together with a touch screen to confirm that they are reading and interpreting the screens correctly.) In addition, there are special issues of usability for users with disabilities. The design of the ballot, the length of the ballot, the number of candidates, and the number of races might be obvious to a sighted user, but not to a voter who is blind or visually impaired unless a feature is included that provides this information. Thus, even though access may be provided, additional requirements for product usability by people with disabilities exist. For both the touch screen and non-touch screen DRE products reviewed for this report, audio is the primary alternate medium provided for users who are blind or visually impaired. There are many good reasons for this decision on the part of vendors, but problems remain. Audio may be provided as recorded speech or synthetic speech, each with its own benefits and disadvantages. Listening to audio takes longer than reading unless the user is able to understand audio playback at high speed. In any case, audio data is transient, so users who are blind or visually impaired rely on short-term memory to a larger extent and for more data than non-disabled users. Browsing an audio display is significantly harder and more time consuming than browsing a visual display. Some voters who are deaf might also take a longer time to vote: deaf individuals, particularly those who are congenitally deaf, read at a lower reading level than non-disabled users.[14] Data entry via an alternate input device may be more difficult, take more steps, or differ in other ways from the primary input. At a minimum, a fully accessible user interface is anticipated to have a longer average time on task for a person interacting with an audio-based interface than for one using a visual display. For standardized tests, it is presumed that there is a 50% increase in task time. However, this estimate is based on completing a standardized test, at a desk, using a familiar alternative interface. Personal correspondence between the authors of this report and individuals with visual disabilities places the estimate on the order of 3 to 4 times longer for some users who are blind. Furthermore, the nature and frequency of the usability problems encountered are almost certain to be different.

4. Current Usability and Accessibility Related Standards

Generic usability and accessibility standards are available from sources such as standards and professional organizations, as well as military and corporate institutions. In addition, some portions of the existing VSS and proposed IEEE standards for voting systems address some aspects of usability and accessibility.
This section reviews these sources.

4.1 Current (and Proposed) Voting Systems Standards Related to Human Factors, Usability, and Accessibility

Only recently, in the wake of problems revealed in the 2000 elections, has significant attention turned towards the issues of human factors, usability, and accessibility. There are a number of references to usability and accessibility in the existing VSS and proposed IEEE standards for voting systems. A brief overview of the current (as of October 2003) standards environment follows. The information presented has been gleaned from the following sources:
4.1.1 Requirements of HAVA
These represent both high-level and mid-level functional requirements.

4.1.2 Current FEC Process: the VSS

Significantly, the VSS define a voting system as the devices that allow users to vote; voters are not considered part of the system. This is in contrast to the definition provided in this report. There is a consequent emphasis on the mechanical and electronic performance of the device. Usability is presently covered only in an advisory appendix, although there are plans to add it as an official specification. Also, no coverage is included for mail-in or absentee balloting or for Internet voting. Both of these areas may present significant issues in the area of security and the potential for fraud or misuse, but the usability aspects of these areas could be addressed independently. Conversely, there is considerable attention given to telecommunications, again demonstrating the present emphasis on the technical aspects of voting. In general, the VSS document does not clearly define human factors, usability, accessibility, or the associated conformance testing that should be applied to these areas. In the following sections, we describe the VSS sections on Human Factors, Usability, and Accessibility in more detail.

4.1.2.1 VSS: Human Factors and Usability
In the following subsections, we identify features of the VSS that are pertinent to our discussion of how to develop and test usability and accessibility standards. Note that this discussion is by no means a thorough analysis of the VSS and is not intended to diminish the tremendous and valuable efforts of the FEC and NASED over the past 25 years to develop these standards.

4.1.2.2 Pertinent Features of VSS: Volume I, Performance Standards

VSS Section 1.1 emphasizes the non-process-oriented approach taken within the standards:
This is broadly true since most of the specifications are functional requirements in the form, "The system must be able to do X". VSS Section 1.5.1 contains the definition of a voting system:
“...voting systems are subject to the following three testing phases prior to being purchased or leased”
VSS Section 2 is central to understanding the Voting System Standards. Here the general functionality expected of a voting system is defined. Even though most of it does not directly address usability, it does convey the general approach of the VSS to standardization. VSS Section 2.1 commits to functional-style standards: "This section sets out precisely what it is that a voting system is required to do." Indeed, most of the requirements are functional, but there are a few low-level design specifications as well. Accessibility is covered in VSS Section 2.2.7. Here some very specific design standards are given. For example: "Where any operable control is 10 inches or less behind the reference plane, [the system shall] have a height that is between 15 inches and 54 inches above the floor." This section has a wide variety of types of standards, ranging from broad functional statements to narrow and technology-specific design requirements, many of which can be subject to broad interpretation. For example, in the VSS:
Beyond VSS Section 2, a few more usability-related issues are mentioned in various places. Section 3.2.4.1 states that “all systems” shall provide “...privacy for the voter, and be designed in such a way as to prevent observation of the ballot by any person other than the voter.” Section 3.2.4.2.2 states that “punching devices” shall “...facilitate the clear and accurate recording of each vote intended by the voter.” In contrast to the surrounding material, there is a short subsection, VSS Section 3.4.9, on Human Engineering – Controls and Display, that contains explicit functional and design requirements for usability. It begins:
This section goes on to state that: “Appendix C provides additional advisory guidance on the application of human engineering principles to the interface between the voter and the voting system.” VSS Section 9 on Qualification Testing (a term equivalent to conformance testing as it is described in Section 2.4 of this report) describes a general approach; however, no criteria or procedures for usability and accessibility testing are specified:
VSS Section 9.4.1.4 notes that:
Appendix C is the VSS’s preliminary statement on requirements for usability. The requirements are a mixture of functional and design specifications, with the latter being somewhat predominant. The level and specificity of the requirements vary greatly. For example:
Section C.1 emphasizes formative rather than quantitative/summative testing. For example, this section states:
4.1.2.3 Pertinent Features of VSS: Volume II, Testing Standards
This latitude in designing and conducting tests across voting products may be appropriate to allow the ITAs to develop specific tests based on the nature of the technology used, but would not ensure uniform testing of the independent quality of usability and accessibility across all voting products. VSS Section A.4.3.5 (System-level test case design) tells the ITAs to simulate typical voter errors and gauge the robustness of the system, but this is not the equivalent of actual user interaction:
VSS Section B.5 authorizes ITAs to use their own judgment to decide, in some cases, whether a system is accepted or rejected:
4.1.2.4 Current Conformance Testing

4.1.2.5 Recent FEC Efforts in Support of Usability

4.1.3 IEEE Effort
Details can be found at http://grouper.ieee.org/groups/scc38/1583/index.htm. It appears that paper-based systems, such as optical scan systems, are not covered by this effort. The standard is currently (as of October 2003) in draft form and is undergoing review and editing. Within the standard, IEEE Section 5.3 addresses usability and accessibility issues. The specifications are a mix of high- and low-level, performance and design requirements; low-level design specifications predominate in the standard. Only voting equipment is covered in the body of the standard, but there is an informative annex offering guidance for ballot design. IEEE Section 6.3, which covers testing, classifies testing methodologies into two categories: Standards Compliance and Usability Testing. In the Standards Compliance section, four different methods are used to determine whether or not a voting system conforms to each applicable usability/accessibility standard: inspection, expert-based evaluation, tests, and usability testing.

Inspection (I): the design is inspected to determine whether it possesses a feature or function specified in the standards...

Expert-based analytical evaluation (E): a human factors or usability subject-matter expert performs a comprehensive review ... to determine whether the applicable standards are being met.

Test (T): tests are specified to determine whether the applicable standards are met, e.g., a measure of letter height or sound intensity.

Usability testing (UT): evaluates a voting system by having a representative sample of voters perform voting tasks under realistic but simulated conditions. User performance and user opinion regarding their interactions with the system are measured and compared against usability and accessibility goals and requirements.

The various requirements in IEEE Section 5.3 are each mapped onto one of these methodologies as the preferred way to verify that the system under test conforms. Usability testing is mainly reserved for general usability requirements that cannot be tested by the other compliance testing methods.

4.2 Generic Usability and Accessibility Standards

There are a number of standards for usability and accessibility. These standards are typically written to apply across large domains such as military systems, computer applications, or web site designs. As generic usability standards, they do not address functional issues, since they cannot account for the intended users, activities, and goals of a product being developed under these standards. In addition, as generic standards they do not include specific performance requirements, since such requirements also depend on the application domain. These generic standards contain examples of the various kinds of requirements described above in Section 2.3 (performance vs. design, specific vs. general, etc.). One further distinction is worth making: some of these standards apply to products in the conventional sense, while for others it is the development process itself that is specified. We refer to the latter as “process-oriented” standards. Several of the standards are ISO standards for usability. The U.S. does not have an equivalent national standard except for ANSI/HFES 100-1988 (HFES-100, 1988), which covers only the ergonomic (physical) requirements of workstations in an office setting; it does not cover software interface design issues or processes.
Several military standards exist that address human factors engineering concerns for equipment (including software) and facilities, for specific military applications (such as helicopter cockpit design), for specific items (such as labeling), and for planning and process requirements. Many of these documents are no longer supported. One notable exception for process standards development is the Department of Defense, which has standards such as MIL-STD-1472 (MIL-STD-1472) and MIL-H-46855 (MIL-H-46855) covering human factors engineering. However, process documents like MIL-H-46855 and others are no longer being maintained and have been rescinded. Companies in the U.S. tend to rely on industry standards and best practices for guidance, using such documents as the Windows Style Guide (which changes with each release) and other commercially available books on interface design. Another exception is ANSI/INCITS 354-2001 (ANSI/INCITS 354, 2001), a standard developed by NIST for documenting summative usability test results; it is currently in the process of internationalization. Some of the generic standards that are applicable to voting are described below.[15]
4.2.1 Section 508 of the Rehabilitation Act of 1973, as Amended in 1998
As stated on the official Section 508 website (see: http://www.section508.gov):
Subpart B contains the technical specifications that apply to a wide variety of IT products including software, web-based applications, multimedia and PCs. From the 508 summary page:
Thus, we see that the Standards encompass both design and performance specifications, and both quality-oriented and interoperability requirements. As with other standards we have examined, some of the requirements are very low-level and specific, while others are very general.
4.2.2 ISO 9241: Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs)
As an example of a technical specification, consider Part 7: Requirements for Display with Reflections. Section 4 carefully defines technical concepts and metrics and how they are related mathematically. This allows precise characterization of hardware performance (e.g., luminosity at various angles). Section 5 states that the purpose is to assure that VDTs are "legible and comfortable in use". Section 6 lays out the precise requirements to be met. Section 7 then describes the approved test method (lighting, optical instrumentation, etc.) for ascertaining conformance. It is notable that the standard itself defines the test method as well as the requirements. Also notable is that the standard anticipates (as a future technique) a completely different test method based on the performance of human subjects, namely their ability to read text from the screen under various lighting conditions. This test method more directly addresses the purpose of the standard as set out in Section 5, but is less closely tied to the technical requirements of Section 6. Thus, there are really two approaches described within the standard:
Later in the same document, in contrast with the highly technical nature of Part 7, Part 10: Dialogue Principles is a very general, non-binding standard. As its name implies, it simply describes some design principles that should be taken into consideration when developing a system that interacts with human users via a VDT. It is a good overview and tutorial, but is not a standard in the strict sense. Likewise, Part 11: Guidance on Usability defines and discusses basic concepts of usability (effectiveness, users, tasks, etc.). The discussion is very thorough, comparable to a long introductory chapter in a textbook.
4.2.3 ISO 13407: Human-centered Design Processes for Interactive Systems
In the conformance section, the standard indicates that the project manager must generate documentation showing that the procedures of 13407 were followed. The "level of detail" of the documentation is to be negotiated by the "involved parties". Annex C provides templates for such documentation, which is to be evaluated by an "assessor". Thus, the test procedures fall into the category of subjective inspection. ISO 13407 contains a good deal of useful information -- it provides a checklist of possible techniques to be used by the conscientious project manager who wishes to improve the usability of a product.
4.2.4 ISO 16982: Ergonomics of Human-System Interaction -- Usability Methods Supporting Human-Centered Design
4.2.5 ISO 10075: Ergonomic Principles Related to Mental Workload
4.2.6 ANSI/INCITS 354-2001: Common Industry Format (CIF) for Usability Test Reports
5. Current Human Factors Engineering, Usability, and Accessibility Research
This section summarizes research that can be applied to voting systems for both design and testing. Research can be divided broadly into basic and applied. First, as background, we discuss basic versus applied research. We then describe the types of results that can be applied to voting.
5.1 Background: Basic Research
Basic research is general in nature and therefore can be applied across a broad range of design domains (including voting). As a result, there is a great deal of data available from human factors, ergonomics, cognitive psychology, human-computer interaction, usability engineering, and other related fields that is applicable to voting systems. This includes data on basic human perception, memory, cognition, higher-level thought processes, decision-making, biases, psychomotor capabilities, etc. But basic research investigates one or just a few variables in isolation. As a result, basic research results can be difficult to apply within a given context such as voting system design. For example, basic research on human memory has been conducted to determine the number of unique elements that can be stored in human short-term memory. Such studies have looked at colors, sounds, angles, or other simple data elements to determine the actual limit.[16] Whether the results of this research apply to a specific voting system product design depends on the design's use of short-term memory. Finally, it should always be kept in mind that basic research results apply to the specific participants used in the study and might not be applicable across a broad range of users, such as the users of voting products. In addition to issues of applicability based on the research design parameters used, there is the issue of the interaction effect of variables.
By isolating a single variable of interest, research can draw conclusions about this variable. However, in a real-world situation, this variable may interact with other variables. For example, the current VSS standards include information on the minimum size for text to ensure the text is readable. This value is based on research with font size as an isolated variable (i.e., with lighting, contrast ratio, color, font style, and other factors held constant). In a real-world situation, differences in contrast ratio, font style, lighting condition, display density, or even user fatigue could change the minimum font size requirements. Under some conditions, the minimum requirements for font size might need to be higher than stated; under other combinations, lower font sizes might suffice.
Usability Research Related to the Design and Testing Process
One of the most promising research areas in the human factors engineering and usability fields that can be applied to the development of highly usable voting systems is the research on the product design and testing process. Much of the most recently published literature has focused on the "user-centered design" process as a design approach that directly enhances product usability (Bittner, 2000; Constantine, 2003; Desurvire, Kondziela, & Atwood, 1992; ISO 13407, 1999; Jacobsen & Jørgenson, 2000; Meister, 2000; Mercuri, 2002; Redish, Bias, Bailey, Molich, Dumas, & Spool, 2002). This process is described under a number of names and has been the focus of many consulting practices that specialize in business process reengineering, as well as of usability labs in large software development companies. The user-centered design (UCD) process, and its derivative forms, is an approach that includes interaction with users throughout the product's design and development cycle to gather data and test design assumptions. The basic concept is to ensure that usability is incorporated into a product's design from the beginning of the design process and evaluated throughout the development process. Methods of incorporating usability include the use of user profiles (or personas), the development of use case models, usability walkthroughs, heuristic reviews, expert reviews, and user-based testing. Storyboards, mock-ups, and prototypes can each be evaluated to test design assumptions and interaction effects. These activities serve to provide formative or diagnostic data on a product from conception through deployment. The studies that we were able to find on the specific issue of the usability of voting products would best be classified as formative. They provide findings and make specific recommendations or observations about specific products under evaluation. Some other studies have been based on data gathered from after-event reports or user opinion data, both of which may lead to false conclusions about both the nature and frequency of usability and accessibility problems, as we have previously discussed. These include reports such as the Caltech/MIT report (CalTech-MIT, 2001), the National Center for Voting Technology report (ECRI, 1998), the New York Times story on voting results (McIntire, 2003), the University of Maryland report on the Diebold system (Bederson & Herrnson, 2002); the AccessWorld report on accessibility (Burton & Uslan, 2002); product reviews by state officials as part of their product selection or evaluation process; and product reviews by end users and end-user advocates such as the National Federation of the Blind and the American Foundation for the Blind.
Our review of these reports also suggests that many of the “results” of these studies were speculations about usability problems that could occur or might have occurred with the use of these products. For example, the University of Maryland expert evaluation of an early version of the Diebold DRE system in the State of Maryland identified multiple usability issues:
The Caltech/MIT study speculated on usability issues by reporting on "spoiled ballot" data and estimating the number of usability errors likely represented by this number. It stated that 1.5 million presidential votes are lost each election and 3.5 million votes for governor and senator are lost each cycle. However, a null ballot may be the result of a machine failure to record the voter's preferences (a hardware or software problem), a voter error resulting from a usability problem, or an accurate record that the voter did not wish to vote for that office. All of these would be classified as "spoiled ballots" by the study's definition. Spoiled ballots strongly suggest usability problems that result in total failure; however, since the actual users were not interviewed or observed, there is no way to tell whether usability issues were responsible for all of the spoiled ballots or whether there is another cause (such as purposive voter action). It is not clear that there is even a consistent definition of spoiled ballots across the studies or that any specific definition used is valid. For example, the New York Times article reported official estimates that 60,000 votes were not cast in a 2000 election because of the lack of an interlock device on the voting product. It is not clear that the Caltech/MIT study, or others, would consider these spoiled ballots. In summary, to the best of our knowledge, there has been only one research effort (Roth, 1998) performing a controlled experiment in which error rates for voting were directly measured by comparing the intended vote with the recorded vote. However, even this study did not include a fully representative sample of the voting population. Though the data from the Roth study are valuable, the study was performed with 32 subjects, none of whom were users with disabilities, and it did not provide specific tasks designed to uncover the range of potential usability problems that may be present in the product tested. Only a single ballot was used instead of a range of ballots. As a result, the data cannot be generalized across voters or ballot types, even for the specific machine tested. Informal reviews of voting products by officials and other parties interested in usability raise similar issues about the data they produce (Dutt et al., 1994; Etgen & Cantor, 2000; Gray & Salzman, 1998; Gray, 2003; Hertzum et al., 2002; John & Marks, 1997). These studies have not been uniformly conducted in a realistic environment, with realistic ballots, or even with representative users. Some are based on an evaluation process known to mask some usability problems and generate others as artifacts of the process. Some were "expert reviews" that report results based on the opinions of experts in the design of products. The results of these studies have been very important in raising awareness of usability issues and generating thought-provoking examples. However, it is likely that many of the identified usability problems would not change the election outcome (usability problems prior to success), and others might reflect problems that would not exist in actual use. The extent of these problems (how often they would show up in an actual election) and their actual effects are not known. Additional research is certainly required in this area; the recommendations presented in Section 6 address some of the limitations of existing research.
6. Recommendations
In this section we summarize our findings into ten recommendations based on the analysis discussed in this report. The recommendations focus on the need for an updated VSS that contains clear and unambiguous standards for usability and accessibility, accompanied by conformance tests. These standards should not only reflect current research in human factors engineering, usability, and accessibility but also make use of the best practices available for user interface design and for standards specification, testing, and certification. We must also emphasize that the development of good standards is iterative, and we expect that it will take several years of development and supporting research to achieve these goals in their entirety. We expect that these recommendations will be taken into consideration by the Technical Guidelines Development Committee (TGDC) when it becomes operational under the Election Assistance Commission (EAC) as described in the HAVA. Any implementation of the recommendations will be at the behest of the EAC through the TGDC. Please note that although a rationale is included with each of the recommendations in this section, we strongly suggest that the reader refer to earlier sections and to the glossary in Appendix A for the more comprehensive assessment and analysis. In particular, Section 2 defines and discusses many of the concepts used in the recommendations, e.g., performance vs. design requirements.
6.1 Overall Goal: Develop Measurable, Performance-Based Standards
6.1.1 Recommendation
6.1.2 Rationale
For voting systems, this can be summed up as follows: A voting system is usable if voters can cast valid votes as they intended, easily and efficiently, and feel confident about the experience. Because we have defined “usability” to include usability by people with disabilities (see Section 2.2.2), the same measures apply for accessibility, once the barriers to accessibility are removed via a separate set of design standards to make a system available to those individuals. These goals are, of course, those of quality-oriented standards, not interoperability or metric-based standards. Other things being equal, such standards are best formulated with these properties:
Such standards, and the conformance tests based on them, directly address the bottom-line performance of existing products. They do not attempt to guide product development or diagnose problems. Further, this approach is supported by the ITA structure currently in place. The process the ITAs use to certify a voting system is based on testing against a standard. As such, it is critical to have standards that lend themselves to objective, repeatable, and reproducible test procedures. By these criteria, many of the usability specifications in the VSS (VSS Section 3.4.9 and Appendix C) and in the IEEE draft standard suffer from one of two problems: they are either too design-oriented and low-level, or too general (even when they are performance-oriented). So, while the existing and draft standards are a good base, they need some reformulation.
6.2 Specify Functional Requirements
6.2.1 Recommendation
6.2.2 Rationale
The functional requirements should be stated at the user interface (not as internal software requirements), should be independent of the implementation (make no references to "how", just "what"), and should not include imprecise references to "how well" (including metrics). The requirements should include the identification of the system-level capabilities (e.g., voting for only one person in a single-seat contest, voting for multiple people in a multi-seat contest, etc.) as well as the sequence control functionality of the interface (e.g., provide a means of moving between pages in a multi-page ballot, provide a means of moving between contests and/or referendums, etc.). This is not a simple effort, and indeed may be tedious in some respects, but it is absolutely necessary to address the specific functional requirements if we are to have appropriate usability guidelines. The requirements should also address ballot design software capabilities. The data necessary to specify these functional requirements could be derived from a complete human factors task analysis and could be validated as part of usability test development, as described later in this section.
6.3 Avoid Detailed Product Design Specifications for Usability
6.3.1 Recommendation
6.3.2 Rationale
Since the inclusion of specific design requirements appears to be part of the current approach for both the VSS and the proposed IEEE standard, a more detailed discussion of the problem of providing detailed product design specifications is included below. This should not be viewed as an attack on the existing standards or the existing approach, but as an assessment of the difficulties and limitations inherent in this approach. And we believe there is a viable alternative in the development of conformance tests for usability and accessibility, as discussed below in Section 6.10.
6.3.2.1 Low-Level Design-Oriented Specification
This might be a good design guideline, but a standard should not so closely mandate the design of a system. If cursor control is indeed a problem, it will be reflected in one or more of the basic usability measures (effectiveness, efficiency, or satisfaction). A long list of design guidelines, however valid they are individually, does not constitute a good standard. Note that we are not questioning the value of design guidelines as such. These may be very helpful during the design and development of a voting product, but they are not essential metrics by which potential purchasers should judge the system. In addition, this specification presumes that a computer-based DRE has a cursor and an "enter/return" key.
Depending on the design, a touch screen, for example, might not have these artifacts. Also, a list of design guidelines, whether provided as a standard or not, raises questions about the validity and completeness of the list of requirements. Has it been shown that automatic cursor positioning actually does improve speed or accuracy in the voting process? Are there other equally valuable guidelines that have been omitted (e.g., adequate spacing between buttons)? Is there an interaction effect between guidelines that could affect the specifics (e.g., contrast ratio and font height)? Are these interaction effects taken into account? Problems with such a standard are also reflected in the testing process. A long list of low-level guidelines invites a long "checklist" or even "decision tree" style test to see whether the requirements are met. This is a tedious process and does not ensure usability. Further, in addition to the problems already noted, many of the current requirements are stated as ones that "should" be applied. This is non-binding, so the vendor is not required to conform, and it cannot be included in conformance testing. Because "guidance" is not enforceable, it is unclear that any product design guidance provided would ensure usability of the product. It would seem more appropriate to provide a discussion about the need to locate and apply the most current design guidance available; the standard could also identify some of the more likely sources of such information. Finally, as mentioned earlier, the IEEE draft standard addresses only DRE equipment and not paper-based systems, such as optical scan, and so could not serve as the basis for a general standard on voting and usability. Section 301(c)(2) of HAVA explicitly states that paper-based voting systems are not excluded. The standards provided to vendors should be applicable to any product they develop.
6.3.2.2 Imprecise Specifications
As a general design guideline to developers, this is unobjectionable, but as a specification it fails to provide a clear criterion against which conformance can be measured. Generally worded specifications have to be tested either by invoking an expert's judgment, which can be subjective, or by "creatively" interpreting the specification so as to generate a more precise test.
6.4 Address the Lack of Specific Research on Usability and Accessibility for Voting Systems on Which to Base Requirements
6.4.1 Recommendation
6.4.2 Rationale
Until very recently there has been little applied research from the human factors and usability fields specifically on voting systems. Accessibility has been addressed by generic design standards that are intended to remove barriers to access, but usability by persons with disabilities has not been addressed by research. In fact, we know very little about users' experiences with voting systems, including the experiences of people with disabilities. This suggests a need to focus efforts on building a foundation of applied research for voting systems and voting products to support the development of usability standards. Until this is done, there is little basis upon which to include many detailed specifications.
6.5 Develop Design Specifications for Accessibility
6.5.1 Recommendation
6.5.2 Rationale
In contrast to our recommendation for performance-based standards for usability, we believe that for accessibility, design standards are currently the only practical approach. This is because the population addressed by accessibility standards is so much more heterogeneous than that addressed by usability standards. As a consequence, it is not practical to formulate performance criteria and test methods that could be applied broadly and uniformly to the disabled population. The Access Board has provided information to the FEC for incorporation in the standards; this information is the basis for the requirements currently in the VSS. However, the guidelines provided by the Access Board are for self-contained, closed products. These are products that are expected to contain all the accessibility features necessary for use by persons with disabilities. This is contrasted with "open architecture" products, for which the end user is expected to provide some form of adaptive technology (e.g., a screen reader or external braille display). In addition, some of the requirements provided by the Access Board are general in nature and have not been tailored to the specific domain of voting system products. Also, they do not address all of the associated aspects that need to be specified (e.g., determining the quality of audio and how to test it) because some of this is considered usability rather than accessible design. The IEEE has made some progress in this area, and any new VSS should take advantage of that work.
6.6 Develop Ballot Design Guidance
6.6.1 Recommendation
6.6.2 Rationale
It is recommended that this be a separate section of the new VSS: a section that would include recommendations for ballot instructions, for visual design and layout, and for randomizing. We believe that research is needed to find an "optimal" set of guidelines, but significant improvement in usability could be made through standardization, particularly of instructions. This research would need to include developing instructions in alternate languages, since direct translation is not always possible and improper translations could induce new usability issues.
In addition, we recommend that guidance be provided on testing methodologies to be used to ensure that the ballots do not inadvertently induce usability problems. Finally, it is recommended that a set of requirements to support these guidelines be developed and included as the specifications for vendor-developed ballot design software.
6.7 Develop Facility and Equipment Layout Guidance
6.7.1 Recommendation
6.7.2 Rationale
It is recommended that existing data be gathered and analyzed and that a set of guidelines be developed for facility and equipment layout. This portion of the new standards would be for users other than vendors (i.e., election officials responsible for voting locations and poll workers). It would also provide information on which vendors could base their designs. Information relevant to facilities and equipment layout is very likely to be available in the research literature and can be generated almost entirely from a literature search. We should not overlook the importance of poll workers and election officials being able to set up the polls properly and run the election with the equipment. The usability of the documentation and training materials supplied by both the vendor and the state is critical. We recommend that these materials undergo usability testing and that guidance be developed for how to do this testing at the state level.
6.8 Encourage Vendors to Use a User-Centered Design Process
6.8.1 Recommendation
6.8.2 Rationale
We recommend that vendors be encouraged to incorporate a UCD approach into their product design and development cycle, including formative (diagnostic) usability testing as part of product development. As the Federal standards are revised to incorporate more usability requirements, this will help vendors prepare for usability qualification testing. Further, we recommend that vendors be encouraged to perform their own summative usability testing on their products prior to release and to report the results using the Common Industry Format (ANSI/INCITS 354-2001).
6.9 Create Test Procedures for Accessibility
6.9.1 Recommendation
6.9.2 Rationale
We recommend that a uniform set of test procedures be developed for testing the conformance of voting products against the applicable accessibility requirements (for self-contained, closed products or open-architecture products). Further, we believe that these test procedures could be added to the test battery currently conducted by the ITAs.
6.10 Create Test Procedures for Usability
6.10.1 Recommendation
6.10.2 Rationale
As described in this report, we have separated accessibility into two categories: removing the barriers to access, and usability for users with disabilities. Removing barriers to accessibility has already been addressed in the recommendations above and identified as a likely extension to the current ITA process; therefore, we will restrict our discussion in this section to usability. Note, however, that when we refer to usability we include all users, with and without disabilities, at different levels of reading proficiency, and from different cultural and economic backgrounds. As we have discussed previously, a set of design requirements cannot properly address the issues of usability for voting system products. Also, no document can contain a sufficient set of design requirements to ensure voting product usability unless the document completely specifies a design already shown to be usable. Finally, traditional ITA test approaches, such as testing by demonstration or inspection, will fail to uncover usability problems.
The FEC brochures, the IEEE discussions on usability evaluation, and numerous reports and texts discuss testing that should be done as part of the design process. Though these formative or diagnostic tests are valuable tools in the design process, they do not guarantee that the final product is usable as measured by the metrics described earlier (efficiency, effectiveness, and satisfaction), since they are applied during the design process, not to a final product. Even tests that are conducted on a final product design are generally not conducted in a way that would allow the results to be generalized to the intended populations (i.e., results from the study participants may not extrapolate to the majority of actual users). This is particularly true for voting system products, since the range of users required for such a test would make this type of testing cost-prohibitive for most vendors. In addition, there are currently no defined standards for usability metrics that vendors could use as benchmarks for their testing. For these reasons, we believe that vendor testing of the product, while valuable, is a separate issue from certifying that the end product is usable. We believe that usability qualification testing is necessary, but it will require the establishment of both objective usability test procedures and pass/fail criteria. To ensure usability of a voting product, it is imperative that the product be tested with actual users performing realistic tasks in a realistic environment, in sufficient numbers, and with a broad enough cross-section of users to be truly representative of the voting population. Further, to ensure good usability of the system, we must test not only the interaction of the voter with the product but also the interaction of the voters, election administrators, and poll workers with the entire voting system. We recommend the development of a valid, reliable, repeatable, and reproducible process for usability testing of voting products against agreed-upon usability pass/fail requirements. In particular, there must be a careful definition of the metrics by which systems are to be measured, such as what counts as an error, how to measure error rate, time on task, etc. The pass/fail criteria should be restricted to usability problems leading to partial failure and usability problems leading to total failure. Since we are dealing with outcomes, usability problems prior to success need not be specifically included, but they would be represented in the time-on-task measure from testing. Note that while excessive time required does not lead to failure, it is still unacceptable. Since human users are involved in the process, it is unlikely that the error rates will be zero for any criteria established, so a specific acceptable error rate and margin of error will likely be required. For example, it may be possible to enforce a requirement that no voter be able to knowingly cast a ballot with an overvote in one or more contests, since this error is evident in the voter's actions. However, a voter still might inadvertently cast a vote for an unintended candidate on any product, and this error cannot be detected without knowing the intent of the voter. Yet both of these conditions must be tested. This test process must be defined at a high enough level of generality that the same procedure could be applied to any product (i.e., we do not want to define product-specific tests); otherwise, the results for various products would not be comparable.
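To make the idea of a pass/fail criterion with an acceptable error rate and margin of error concrete, the following is a hypothetical sketch only; it is not a procedure defined in the VSS, by the ITAs, or in this report. In this sketch a product passes only if the one-sided upper confidence bound on its observed error rate falls below an agreed-upon ceiling. The error definition, the 2% ceiling, the confidence level, and the participant counts are assumptions chosen purely for illustration; Appendix C describes a sequential approach that can reduce the number of participants required.

```python
import math

# Hypothetical sketch of a fixed-sample pass/fail criterion for a summative
# usability test: the product passes only if the one-sided upper confidence
# bound on its observed error rate is below an agreed-upon acceptable rate.
# The acceptable rate and confidence level below are illustrative assumptions.

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(errors, subjects, confidence=0.95):
    """Exact (Clopper-Pearson) one-sided upper confidence bound on the error
    rate, located by bisection."""
    lo, hi = errors / subjects, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(errors, subjects, mid) > 1 - confidence:
            lo = mid      # the bound lies above mid
        else:
            hi = mid      # the bound lies at or below mid
    return hi

def passes(errors, subjects, acceptable_rate, confidence=0.95):
    return upper_bound(errors, subjects, confidence) < acceptable_rate

# Made-up example: errors observed among 400 test participants, judged
# against a hypothetical 2% ceiling.  With 2 errors the upper bound is
# roughly 1.6% and the product passes; with 6 errors it exceeds 2%.
print(passes(2, 400, 0.02))   # True
print(passes(6, 400, 0.02))   # False
```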
Fortunately, the task requirements for voting are specific enough that defining such a generic test process should not be difficult. It might be necessary, however, to have technology-specific variants of the test procedure and protocol (e.g., DRE vs. paper-based), although we believe the differences can and should be kept minimal. Research would need to be conducted to determine: (1) the nature of the errors possible during a voting process (this includes voter errors and poll worker errors), and (2) the level (rate) of these errors (both the current levels for existing products and recommendations for "acceptable" levels of each error type). Once this information is available, we recommend that a set of repeatable and reproducible processes be defined and that each voting product be tested using these test processes and usability test pass/fail criteria. This would include the definition of all test procedures, the data collection required, the data analysis approach, participant screening and selection procedures, and reporting requirements. We also believe that, though the ITAs would likely have the responsibility to conduct these tests, the nature and format of the testing would require additional personnel qualified to conduct this type of testing. As part of the development of this report we have explored the feasibility of this recommendation and have provided some suggestions as to how to develop the test procedures and protocols. This information is included in Appendix B of this report. The details of the statistical data analysis are described in Appendix C.
7. Roadmap for Implementing the Recommendations
7.1 Proposed Timeline
In this section we outline the initial steps needed to implement the recommendations we have suggested. In particular, developing a set of performance-based usability standards and associated test procedures is a complex endeavor. Also, even gathering together the existing standards, checking their validity, and ensuring that the ITAs have proper test procedures will require some effort. We also recognize that vendors are developing their products and that state and local officials must make procurement decisions in the short term. Therefore, we also describe a preliminary roadmap for implementing these recommendations that includes suggestions for short-term activities that will help to address usability and accessibility issues while the longer-term research and development proceeds.
7.2 Short-Term
In the short term, we recommend a push to obtain initial user testing data as soon as possible. We anticipate this being a "pilot test" with both disabled and non-disabled users, simply to find the usability issues and to determine possible procedures for testing. We would simplify by using only 1-3 different ballots. These initial data will be used to develop robust testing protocols, including appropriate statistical analyses. (We discuss the statistical analyses in Appendix C.) The initial data we gather from testing with real equipment would be forwarded to vendors so they can improve their products as they see fit. We recognize, however, that some states are facing purchasing decision deadlines for products for the 2004 election and that they want to make wise choices that take usability and accessibility factors into account. We recommend the following for the state election directors:
Because of financial concerns and for convenience, two or more states may wish to join together for these evaluations. The FEC has prepared brochures about the usability and procurement of voting systems that are helpful in indicating the issues that should be considered before making a decision. Additional information on accessibility can often be gathered from various advocacy groups for the disabled, such as the National Federation of the Blind, United Cerebral Palsy, etc., which are often willing to do reviews of products, typically focused on one type of disability. Note that many of the results from this type of testing are subjective and often include usability problems prior to success and, therefore, should be used with caution.
7.3 Long-Term Plans – 1-4 Years
In the longer term, as part of a major effort, work should begin on the formulation of standards for usability, as discussed in recommendation 6.1, and, in parallel, on the development of standardized test procedures, which should be checked for validity, reliability, etc. The goal is to develop a set of validated procedures for the task-based testing within 1-2 years. This set of procedures could be used by vendors, procuring offices, etc. Baseline performance levels could then be determined by applying these procedures to existing voting products in the following 2 years. Obviously, the procedures themselves would not be sufficient for an ITA process until error limits and confidence rates were confirmed.
7.4 Coordination with the TGDC
We expect that the recommendations in this report will be taken into consideration by the EAC and the TGDC. NIST will work with the TGDC to develop a plan for implementing the recommendations. In the next section, we suggest some basic work that we believe is needed to support implementation of the performance-based standards described in this report, as we view this as critical to any VSS updates. We leave the details of a work plan and timeline for the other aspects of the recommendations to the TGDC. In general, work should also begin on the issues of functionality, ballot design, and facility layout, part of which will be to determine what research materials are already available.
7.5 Proposed Next Steps for Testing and Standards Development
To move forward to meet the goals of improving the usability and accessibility of voting systems and products, the next steps have two major emphases: (1) initial baseline testing of voting products that are currently in use, as a means to determine errors, procedures, and statistical criteria (based in part on the preliminary short-term research project); and (2) development of the usability and accessibility standards described in this report.
7.5.1 Proposed Testing
7.5.2 Proposed Standards Development
References
ADA Accessibility Guidelines for Buildings and Facilities (ADAAG) (2002). http://www.access-board.gov/adaag/html/adaag.htm
Adelman, L. (1991). Experiments, Quasi-Experiments, and Case Studies: A Review of Empirical Methods for Evaluating Decision Support Systems, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 2, p. 293-301
Alvarez, R.M. (2002). Ballot Design Options, Manuscript prepared for Human Factors Research on Voting Machines and Ballot Design: An Exploratory Study
Andre, T.S.; Belz, S.M.; McCreary, F.A.; & Hartson, H.R. (2000). Testing a Framework for Reliable Classification of Usability Problems, Proceedings of the IEA 2000/HFES 2000 Congress, p. (6)573-576
ANSI/HFES 100 (1988). Human Factors Engineering of Visual Display Terminal Workstations
Bailey, R.W. (2000). The Usability of Punched Ballots: Improving Usability in America's Voting Systems, Human Factors International
Bederson, B.B. & Herrnson, P.S. (2002). Usability Review of the Diebold DRE System for Four Counties in the State of Maryland, Report from the University of Maryland's Center for American Politics and Citizenship, http://www.capc.umd.edu/rpts/MD_EVoteMach.pdf
Bederson, B.B. & Herrnson, P.S. (2002). An Evaluation of Maryland's New Voting Machines - CAPC and HCIL Exit Poll Research on Voter Comfort and Trust in New Electronic Voting Machines, A Report from the University of Maryland's Center for American Politics and Citizenship web site
Bederson, B.B.; Herrnson, P.S.; & Niemi, R.G. (2002). Electronic Voting System Usability Issues, A Report from the University of Maryland's Center for American Politics and Citizenship web site
Bevan, N. (2000). ISO and Industry Standards for Usability Measurement, Tutorial Notes, SERCO
Bittner, A.C. Jr. (2000). Building Performance Measurement into Today's Testing and Evaluation (T&E), Proceedings of the IEA 2000/HFES 2000 Congress, (6) p. 557-560
Bremer, J. (undated). Ballot Design in an Electronic Environment – Lessons from the Online Market Research Industry
Burton, D. & Uslan, M. (2002, November). Cast a Vote by Yourself: A Review of Accessible Voting Machines, AccessWorld, 3(6), http://www.afb.org/afbpress/pub.asp?DocID=%20aw030603
CalTech-MIT (2001, July). Voting: What Is, What Could Be, http://web.mit.edu/voting/
Castillo, J.C. & Hartson, H.R. (2000). Critical Incident Data and Their Importance in Remote Usability Evaluation, Proceedings of the IEA 2000/HFES 2000 Congress, p. (6)590-(6)601
Cherlunick, P.D. (2001). Methods for Behavioral Research: A Systematic Approach, Sage Publications, Inc.
Conrad, F.G. (undated). Usability and Voting Technology, White paper for Voting Technology Workshop
Constantine, L. (2003). Testing… 1… 2… 3… Testing… (unpublished)
Darcy, R. & McAllister, I. (1990). Ballot Position Effects, Electoral Studies, 9(1), p. 5-17
Design for Democracy Case Studies (undated). From the Design for Democracy Web Site, http://www.electiondesign.org/case.html
deSouza, F. & Bevan, N. (1990). The Use of Guidelines in Menu Interface Design: Evaluation of a Draft Standard, Proceedings of IFIP INTERACT '90: Human-Computer Interaction, p. 435-440
Desurvire, H.W.; Kondziela, J.M.; & Atwood, M.E. (1992). What is Gained and Lost when Using Evaluation Methods Other than Empirical Testing, Practical Evaluation Methods for Improving a Prototype, Proceedings of the HCI'92 Conference on People and Computers VII, p. 89-102
Dutt, A.; Johnson, H.; & Johnson, P. (1994). Evaluating Evaluation Methods, Methodology of Interactive Systems Development, Proceedings of the HCI'94 Conference on People and Computers IX, p. 109-121
Englehardt, J. & McCabe, S. (2001, March 11). Over-votes Cost Gore the Election in FL, Palm Beach Post, http://65.40.245.240/voxpop/palmpost.htm
Etgen, M. & Cantor, J.A. (2000). Comparison of Two Usability Testing Methods: Formal Usability Testing and Automated Usability Logging, Proceedings of the 2000 UPA Conference
Federal Election Commission (2003). Developing a User-Centered Voting System, http://www.fec.gov/pdf/usability_guides/developing.pdf
Federal Election Commission (2003). Procuring a User-Centered Voting System, http://www.fec.gov/pdf/usability_guides/procuring.pdf
Federal Election Commission (2003). Usability Testing of Voting Machines, http://www.fec.gov/pdf/usability_guides/usability.pdf
GAO (2001). Elections: Status and Use of Federal Voting Equipment Standards, GAO-02-52, October 2001, http://www.gao.gov/new.items/d0252.pdf
Gray, W.D. & Salzman, M.C. (1998). Damaged Merchandise? A Review of Experiments That Compare Usability Evaluation Methods, Human-Computer Interaction, Vol. 13, p. 203-261
Gray, W.D. & Salzman, M.C. (1998). Repairing Damaged Merchandise: A Rejoinder, Human-Computer Interaction, Vol. 13, p. 325-335
Gray, W.D. (2003). Returning Human Factors to an Engineering Discipline: Expanding the Science Base through a New Generation of Quantitative Methods – Preface to the Special Edition, Human Factors, Vol. 45, No. 1, p. 1-3
Hemenway, D. (1980, October). Performance vs. Design Standards, NBS/GCR 80-297
Henninger, S.; Haynes, K.; & Reith, M.W. (1995). A Framework for Developing Experience-Based Usability Guidelines, Proceedings of DIS'95: Designing Interactive Systems: Processes, Practices, Methods, & Techniques, p. 43-53
Herrnson, P.S.; Niemi, R.G.; & Richman (undated). Characteristics of Optical Scan and DRE Voting Equipment: What Features Should be Tested? http://www.capc.umd.edu/rpts/MD_EVote_HerrnsonNiemi.pdf
Hertzum, M.; Jakobson, N.E.; & Molich, R. (2002). Usability Inspection Methods by Groups of Specialists: Perceived Agreement in Spite of Disparate Observation, CHI 2002
ISO/TS 16071 (2003). Ergonomics of Human-System Interaction -- Guidance on Accessibility for Human-Computer Interfaces
ISO 9241-11 (1998). Ergonomic Requirements for Office Work with Visual Display Terminals (VDT) – Part 11: Guidelines on Usability
ISO 13407 (1999). Human Centred Design Processes for Interactive Systems
Ivory, M.Y. & Hearst, M.A. (2001). The State of the Art in Automated Usability Evaluation of User Interfaces, ACM Computing Surveys, Vol. 33, No. 4, p. 470-516
Jacobsen, N.E. & Jørgenson, A.H. (2000). The State of the Art in the Science of Usability Evaluation Methods: A Kuhnian Method, Proceedings of the IEA 2000/HFES 2000 Congress, p. (6)577-(6)580
John, B.E. & Marks, S.J. (1997). Tracking the Effectiveness of Usability Evaluation Methods, Behaviour and Information Technology, v. 16, n. 4/5, p. 188-202
Jones, D. (2002). Handicapped Accessible Voting, Voting and Elections Web Pages, University of Iowa
Jones, D. (2002). Voting Systems Standards: Work that Remains to be Done, Testimony before the Federal Election Commission, Washington D.C., April 17, 2002
Jones, D.W. (2001). Problems with Voting Systems and the Applicable Standards, Testimony before the U.S. House of Representatives
Kanis, H. & Arisz, H.J. (2000). How Many Participants: A Simple Means for Concurrent Monitoring, Proceedings of the IEA 2000/HFES 2000 Congress, p. (6)637-(6)572
Leahy, M. & Hix, D. (1990). Effect of Touch Screen Target Location on User Accuracy, Proceedings of the Human Factors Society 34th Annual Meeting, p. 370-374
Lowgren, J. & Nordqvist, T. (1992). Knowledge-Based Evaluation as Design Support for Graphical User Interfaces, Tools and Techniques, Proceedings of ACM CHI'92 Conference on Human Factors in Computing Systems, p. 181-188
McCormack, C.B. (2003). Ballot Design: Has It Impacted Voting Behavior in Los Angeles County, California? Presentation at the 2003 CHI Conference
McIntire, M. (2003, October 9). To Make Sure Votes Count, Sensor Device Goes Back On, New York Times, http://www.nytimes.com/2003/10/09/nyregion/09VOTE.html
Meister, D. (2000). Changing Concepts of Test and Evaluation, Proceedings of the IEA 2000/HFES 2000 Congress, (6) p. 554-556
Mercuri, R. (2002). Humanizing Voting Interfaces, Presentation to the UAPA Conference, Orlando, FL
Mercuri, R. (2000). Electronic Vote Tabulation Checks & Balances, Doctoral dissertation, University of Pennsylvania, Philadelphia, PA
MIL-H-46855, Human Engineering Requirements for Military Systems, Equipment and Facilities
MIL-HDBK-761, Human Engineering Guidelines for Management Information Systems
MIL-STD-1472, Design Criteria Standard: Human Engineering
Miller, G.A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information, Psychological Review, 63, p. 81-97
Molich, R. & Jeffries, R. (2003). Comparative Expert Reviews, Proceedings of the CHI 2003 Conference, p. 1060-1061
Mosier, J.N. & Smith, S.L. (1986). Application of Guidelines for Designing User Interface Software, Behaviour and Information Technology, Vol. 5, No. 1, p. 39-46
ECRI, National Center for Voting Technology (1988). "An Election Administrator's Guide to Computerized Voting Systems," Plymouth Meeting, PA
Neale, D.C. & Kies, J.K. (2000). Symposium on Recent Advances in Critical Incidence Techniques, Proceedings of the IEA 2000/HFES 2000 Congress, p. 6-589
Nielsen, J. & Molich, R. (1990). Heuristic Evaluation of User Interfaces, Proceedings of ACM CHI'90 Conference on Human Factors in Computing Systems, p. 249-256
Niemi, R.G. & Herrnson, P.S. (2003). Beyond the Butterfly: The Complexity of U.S. Ballots, Perspectives on Politics, Vol. 1, p. 317-326
Olson, G.M. & Moran, T.P. (1998). Commentary on "Damaged Merchandise?", Human-Computer Interaction, Vol. 13, p. 263-323
Quesenbery, W. (2001). Voting for Usability: A Backgrounder on the Issues, Talk presented at TECH*COMM 2001 in Washington DC
Redish, J. (moderator); Bias, R.G. (moderator); Bailey, R.; Molich, R.; Dumas, J.; & Spool, J.M. (2002). Usability in Practice: Formative Usability Evaluations – Evolution and Revolution, Proceedings of the CHI 2002 Conference, p. 885-890
Roth, S.K. (undated). Human Factors Research on Voting Machines and Ballot Designs: An Exploratory Study
Roth, S.K. (1998). Disenfranchised by Design, Information Design Journal, Vol. 9, No. 1, p. 1-8
Rubin, J.; Miller, J.R.; Wharton, C.; & Uyeda, K.M. (1991). User Interface Evaluation in the Real World: A Comparison of Four Techniques, Proceedings of ACM CHI'91 Conference on Human Factors in Computing Systems, p. 119-124
Savage, P. (1996). User Interface Evaluation in an Iterative Design Process: A Comparison of Three Techniques, Proceedings of ACM CHI 96 Conference on Human Factors in Computing Systems, v. 2, p. 307-308
Section 508 Electronic and Information Technology Accessibility Standards (2000). Architectural and Transportation Barriers Compliance Board, 36 CFR Part 1194, http://www.access-board.gov/sec508/508standards.htm
Smith, S. (1986). Standards versus Guidelines for Designing User Interface Software, Behaviour and Information Technology, Vol. 5, No. 1, p. 47-61
Spool, J. (2003). Evolution Trumps Usability Guidelines (unpublished)
Tadayoshi, K.; Stubblefield, A.; Rubin, A.D.; & Wallach, D.S. (2003). Analysis of an Electronic Voting System
Traugott, M.W. (2002). Testing Alternative Hardware and Ballot Forms, Prepared for the meeting of the Working Group on Voting Technologies and Balloting
Voting Irregularities in Florida during the 2000 Presidential Election (2001, June). U.S. Commission on Civil Rights Report, http://www.usccr.gov/pubs/vote2000/report/main.htm
Wald, A. (1947). Sequential Analysis, John Wiley
Wilson, S.V. (2003). Opinion on "Southwest Voter Registration Education, et al. vs. Kevin Shelley, in his official capacity as California Secretary of State", U.S. 9th District Court of Appeal
Appendix A – Glossary
The purpose of this glossary is to clarify the terminology used in this report; the definitions are not to be taken as an officially approved general-purpose standard. Moreover, the scope of this glossary is limited to those terms needed in a discussion of voting and usability, and the definitions given are to be understood within that context. The glossary does not cover other voting areas, such as registration or security.
Appendix B – Developing and Conducting Usability Conformance Testing Procedures
We have done some preliminary work on the development of the usability test procedures we believe would be necessary to ensure usability. An outline of this set of test processes is provided below. Additional research is necessary to validate our assumptions and initial conclusions and to make specific detailed recommendations for the tests.
B.1 Test Environment
B.2 Voter Subsystem Testing
B.3 Poll Worker Subsystem Testing
B.4 Full System Testing
B.5 Standard Test Materials
B.6 Feasibility and Limitations
Appendix C – Statistical Data Analysis
This Appendix addresses the question of how many participants would be needed in a usability test of voting products in order to make reliable estimates of presumably low error rates. As we have argued in the report, a controlled experiment with a valid sample of users is the only reliable way to directly measure bottom-line metrics of system performance, such as error rates and time on task. In contrast to some previous studies, we will not aim to estimate the mean error rates for specific population groups. There has also been work (suggested specifically in the context of ballot voting problems) to find a "reasonable means of estimating the number of subjects required" for testing. Bailey (2000) uses binomial probability models in a diagnostic testing scheme that seeks to bring in enough subjects to trigger all the existing errors; each system error is presumed to have a fixed probability of being triggered by any individual test subject. Bailey estimated that if, in the 2000 presidential election, the infamous butterfly ballot caused 1% of votes to be inadvertently cast incorrectly, one could assume that usability testers would each have a 1% chance of uncovering that error during testing; conversely, 99% of subjects would not be affected by that problem. Given that there are n subjects, each with a probability p of encountering the problem, the probability of that problem being triggered by at least one of the n subjects is q(p, n) = 1 - (1 - p)^n. Bailey shows how large a sample would be needed to uncover the problem with a certain probability by putting p = .01 and varying n. For example, if n = 289 subjects, then the probability of at least one of them uncovering the problem is q(.01, 289) = .95. Similarly, setting q(.01, n) = .99 requires n to be at least 423. Bailey's exposition states that the above numbers show that 289 testing subjects would be needed to find 95% of such problems, and 423 subjects are needed to find 99% of the problems. We presume his reasoning is that there are n testing subjects and each problem has an independent discovery rate of 1% by each subject; then each problem is discovered by at least one subject with probability q(p, n). In that case, the average number of problems with that discovery rate that would be found is q(p, n) of those problems. Of course, the number of problems present in a voting system and their respective discovery rates will not be known before the testing occurs. We propose instead to test a voting system by using a test to determine whether the system's failure rate is acceptably low, where the failure rate is the proportion of the population that fails to use the system successfully for any reason. The description of the Wald test that follows shows that if the acceptable error rates can be pre-determined, it is possible to do sequential testing that can limit the number of subjects that must participate.
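Before turning to the sequential approach, the q(p, n) formula above can be evaluated directly. The following minimal sketch is for illustration only; its two printed values correspond, after rounding to two decimal places, to the 289- and 423-subject figures cited above.

```python
# Minimal sketch (illustration only) of the binomial discovery model described
# above: each of n independent test subjects triggers a given problem with
# probability p, so the chance that at least one subject triggers it is
# q(p, n) = 1 - (1 - p)^n.

def q(p, n):
    """Probability that at least one of n subjects encounters a problem
    that affects each subject independently with probability p."""
    return 1.0 - (1.0 - p) ** n

# Figures cited above for a problem with a 1% discovery rate:
print(round(q(0.01, 289), 2))   # 0.95
print(round(q(0.01, 423), 2))   # 0.99
```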
Sequential testing, pioneered by Abraham Wald (Wald, 1947), was considered important enough to be classified during World War II, when it was used for sampling inspection of manufactured goods. In certain situations, Wald's sequential probability ratio test can save time and money by limiting the number of subjects needed for testing. The testing of a system or a manufactured lot of products can be modeled by sampling from a binomial population with failure rate p (with p between 0 and 1); that is, independent subjects tested have a probability p of failing the test and a probability (1-p) of passing the test. The goal of the testing is to determine whether the failure rate is above or below acceptable limits. In contrast, conventional tests would test a fixed sample of subjects, and the lot or system would pass or fail depending on the results of the entire sample. In certain cases, when the samples are tested in sequence, the results can be such that a firm conclusion can be reached without the need to run the rest of the subjects. For instance, suppose we test a system to see whether its failure rate is below 0.01 and schedule 25 subjects. If the subjects are tested sequentially, and 5 of the first 6 subjects fail the system, then the system will fail regardless of the results of the next 19 trials, which thus become unnecessary. Sequential testing has been incorporated, though not without controversy, in some clinical tests of new medical procedures, where reducing the number of subjects may well save lives. We suggest that sequential testing may also be applied to testing voting products for usability (and, in fact, this technique is part of the ITA testing for hardware and software compliance). In Wald's sequential tests, the procedures for reaching a conclusion and stopping the test are not haphazard but are spelled out in advance, given what failure and error rates are acceptable. At each stage of the test, the number of failures up to that point is tracked and compared to a pre-specified threshold for that stage. If the number of failures is greater than the rejection threshold, then the system is considered to have failed. If the number of failures is smaller than the acceptance threshold, then the system is accepted. If the number of failures is between the thresholds, then the test continues to the next stage, with new thresholds applying to the new stage. The form and thresholds for the sequential probability test depend on several predetermined parameters, which are listed here and discussed below:
- p0: the maximum acceptable failure rate; a system whose true failure rate is at or below p0 should be accepted.
- p1: the minimum unacceptable failure rate; a system whose true failure rate is at or above p1 should be rejected.
- alpha: the false rejection rate, i.e., the probability of rejecting a system whose true failure rate is actually acceptable.
- beta: the false acceptance rate, i.e., the probability of accepting a system whose true failure rate is actually unacceptable.
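To show how the thresholds and sample sizes discussed below follow from these parameters, here is a minimal computational sketch. It is for illustration only and is not part of the VSS, the ITA process, or the procedures proposed in this report; it uses Wald's textbook approximations for a binomial failure rate and, rounded up to whole subjects, reproduces the figures quoted in the discussion that follows.

```python
import math

# Illustrative sketch of Wald's sequential probability ratio test for a
# binomial failure rate.  Parameter names (p0, p1, alpha, beta) follow the
# list above.

def thresholds(p0, p1, alpha, beta):
    """Boundaries for the cumulative log-likelihood ratio: the system is
    rejected once the ratio reaches ln_a, accepted once it falls to ln_b."""
    ln_a = math.log((1 - beta) / alpha)
    ln_b = math.log(beta / (1 - alpha))
    return ln_a, ln_b

def llr_step(failed, p0, p1):
    """Each subject's contribution to the cumulative log-likelihood ratio."""
    return math.log(p1 / p0) if failed else math.log((1 - p1) / (1 - p0))

def avg_subjects_at_p1(p0, p1, alpha, beta):
    """Approximate average number of subjects when the true failure rate is p1."""
    ln_a, ln_b = thresholds(p0, p1, alpha, beta)
    expected_step = p1 * math.log(p1 / p0) + (1 - p1) * math.log((1 - p1) / (1 - p0))
    return (beta * ln_b + (1 - beta) * ln_a) / expected_step

def subjects_if_no_failures(p0, p1, alpha, beta):
    """If no subject ever fails, the ratio falls by a fixed step per subject,
    so the number of subjects needed to accept the system is deterministic."""
    _, ln_b = thresholds(p0, p1, alpha, beta)
    return ln_b / llr_step(False, p0, p1)

# These reproduce the figures quoted in the discussion below (rounded up,
# since a fractional subject cannot be tested):
print(math.ceil(avg_subjects_at_p1(0.001, 0.01, 0.05, 0.05)))       # 189
print(math.ceil(avg_subjects_at_p1(0.001, 0.01, 0.01, 0.01)))       # 321
print(math.ceil(avg_subjects_at_p1(0.001, 0.01, 0.10, 0.10)))       # 125
print(math.ceil(avg_subjects_at_p1(0.001, 0.01, 0.25, 0.25)))       # 40
print(math.ceil(avg_subjects_at_p1(0.0001, 0.01, 0.05, 0.05)))      # 74
print(math.ceil(subjects_if_no_failures(0.0001, 0.01, 0.05, 0.05))) # 296
```

Applying llr_step subject by subject and comparing the running total against the two thresholds gives the staged accept/reject/continue behavior described above.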
The actual results of the test depend on the real failure rate p, which is the proportion of subjects in the tested population that would fail the test. The subjects should be independent of each other and serve as random samples from the population. We can determine what values of alpha and beta are acceptable; that determination will include our thinking about the harm created by making each kind of mistake. If many of these kinds of tests are run, both kinds of mistakes are likely to occur occasionally (just as when flipping a fair coin repeatedly, you will sometimes get 5 heads in a row). The chosen values of alpha, beta, p0, and p1 determine the thresholds and stopping points of the tests. Smaller values of alpha and beta require longer runs and indicate less willingness to risk reaching the wrong conclusion. The formulas for the parameter-dependent thresholds can be complicated but are easily computed. For example, suppose that the maximum acceptable failure rate of a voting system is p0=0.001, but that its real failure rate p happens to be the minimum unacceptable failure rate p1=.01. Suppose also that we want both the false rejection rate alpha and the false acceptance rate beta to be bounded by 0.05. In that case, the average number of subjects needed would be 189; the actual number of subjects needed would vary randomly according to the results. If we wanted both alpha and beta to be 0.01, then the average number of subjects needed would rise to 321. If instead we relaxed both alpha and beta to 0.1, then the average number of trials would be 125. Relaxing alpha and beta even further to .25 reduces the average number of subjects needed to 40. In addition to alpha and beta, the choices of p0 and p1, and how they relate to the real error rate p, also affect how many subjects will be needed. If one of p0 or p1 is obviously wrong, the test can terminate speedily. For instance, if p0 is .001 and half the runs are failures, then the test can terminate quickly. However, for very low p0, it can take many trials without error to convince the test that the real p is less than or equal to p0, especially if p1 is relatively close to p0. In general, increasing the ratio of p1 to p0 will reduce the average needed number of trials. As an example, suppose again that alpha=beta=.05, p1=.01, and p0=.0001 rather than .001. If the real p=p1=.01, the average needed number of subjects is only 74. However, suppose the real failure rate p=0. Then the test takes 296 subjects, because when p0 is tiny, it takes many trials to convince the test that the failure rate is really that small, unless the alternative p1 is so large as to be obviously untenable. The specific implementation for voting product testing with users (i.e., choosing the "acceptable" values for p0, p1, alpha, and beta to be used in the Wald process) will need to be determined. One of the purposes of the research proposed above (see Section 6.4) would be to gather data about the actual error rates that could then be used as a guide for determining meaningful and realistic thresholds for conformance tests (see Section 6.10).
Appendix D – Report Methodology
Writing this report required expertise in human factors and ergonomics, usability and accessibility of information technology, voting systems, standards development, conformance testing, and statistics.
It also required talking to representative stakeholders from across the election and voting communities in order to identify relevant issues and map them to research and best practices that could be applied to voting systems. NIST assembled a team with the necessary expertise and analysis skills in June 2003. In this Appendix we describe the methodology we used to perform the analysis and write this report. Appendix E contains the biographies of the authors.
It was critical to understand the human factors, usability, and accessibility issues from the perspectives of the many different stakeholders in the elections and voting process. The challenge was then to understand the current situation for voting systems and to identify which approaches from general research and best practices could be brought to bear to improve the usability and accessibility of voting systems. The voting team spent several months reading the relevant literature and talking to numerous individuals knowledgeable about elections and voting systems. We reviewed the research and best practices literature [17] in the following general areas:
We also reviewed the literature specifically addressing the usability and accessibility of voting systems, much of which exists as research papers, news articles, websites, proceedings of workshops held since the 2000 elections, vendor demonstrations and literature, and email reflectors and discussion boards on electronic voting (e.g., upa-evoting@yahoogroups.com and verifiedvoting.org). The topics covered can be categorized as:
We talked to representative stakeholders and researchers associated with the election and voting communities. This included visits, discussions, and phone calls, at NIST and elsewhere, involving election officials, vendors, voter advocacy groups, and researchers, as well as attendance at various technical meetings such as the 2003 IACREOT Conference and Trade Show, the 2003 ACM Computer Human Interaction Conference, the August 2003 ACM Voter Verification Workshop, and IEEE standards meetings and teleconferences. We tried to speak with everyone who had examined aspects of usability and accessibility for voting systems or, at a minimum, to read their writings. For example, we participated in the following activities:
We also reviewed the current ITA testing process for certification of voting products and current vendor system engineering processes for user-centered design and for usability and accessibility testing. We then identified the gaps between industry best practices and research (for both standards development and usability and accessibility design and testing) and the current situation for voting products and systems. By analyzing these gaps, we were able to define a set of recommendations for improving the usability and accessibility of voting systems.
Appendix E – Author Biographies
Dr. Sharon Laskowski
Dr. Laskowski’s work on investigating standards and conformance testing issues for the usability and accessibility of voting systems included participation on NIST’s pre-HAVA ad hoc voting issues team in 2002, the FEC Advisory Board on Usability and Human Interface Standards, and the IEEE P1583 Usability and Accessibility Task Group. She also organized and moderated the panel on usability and accessibility for the December 2003 NIST Conference on Building Trust and Confidence in Voting Systems. Other recent work has focused on usability evaluation methods and standards, such as the development of ANSI/INCITS Standard 345-2001, the Common Industry Format for Usability Test Reports, which NIST developed with human factors and usability engineering industry leaders as part of the Industry Usability Reporting Project. She has provided advice on a number of accessibility activities related to the Section 508 IT accessibility requirements and the development of the INCITS V2 standard protocol for more transparent accessibility. She created the NIST Web Metrics project for experimenting with rapid, remote, and automated web usability evaluation, which includes tools for user logging and category analysis. She has contributed to information visualization research, in particular for large document collections. Over the years, she has been an active researcher in a number of other areas of computer science, including expert systems, plan recognition, analysis of algorithms, and computational complexity. She is a member of the Usability Professionals’ Association (UPA), the Institute of Electrical and Electronics Engineers (IEEE), and the Association for Computing Machinery’s Special Interest Group on Human Computer Interaction (ACM SIGCHI), and a founding member of the local chapter of SIGCHI, DCCHI. Prior to joining NIST in 1994, Dr. Laskowski was a lead scientist at the MITRE Corporation. She has also been an assistant professor in the Computer Science Department at the Pennsylvania State University. Dr. Laskowski received her BS degree in Mathematics from Trinity College, Hartford, CT, and her PhD in Computer Science from Yale University.
Dr. Marguerite Autry
John Cugini
From 2000 until 2003, his work focused on the interaction between visualization and usability. In particular, he was a major contributor to the NIST Web Metrics project (http://www.nist.gov/webmetrics). This work included the design and implementation of software that analyzes how users interact with a given website. From 1994 until 2000, he worked on the development and evaluation of prototypes for information visualization, with particular application to document browsing and searching. From 1988 to 1994, his major effort was the construction of conformance tests for the PHIGS standard, a complex standard describing an application programming interface for 3D graphics; measuring conformance involves an interactive feedback loop in which a human operator must recognize visual features of the 3D display. Starting in 1979, his work at NIST was in the area of programming language standards. This included the development of test sets for implementations of BASIC and FORTRAN, standardization of numeric accuracy, impact analysis of the revision to COBOL, and a survey publication evaluating several major programming languages. He has participated actively in national and international standards organizations, including those for Ada, BASIC, C, and Common Lisp. From 1984 to 1988, his primary work was research on expert systems. This included evaluation of the KBS-oriented languages Lisp, Prolog, and OPS5, as well as a research project that provided conceptual navigation through a knowledge base by means of graphics, using Prolog and GKS. Mr. Cugini received his AB from Columbia in 1970 with a major in philosophy. He worked for the U.S. Army from 1971 until 1978 as a programmer and instructor. During that time he earned an MS in computer science from the University of Iowa in 1977.
Bill Killam
He has authored a number of publications, has served as a reviewer for several books on human factors, and was a member of the Special Editorial Board for Human Sciences for the British publication Interacting with Computers. He was a contributing author of the DoD HCI Style Guide and the DoD’s DII Interface Specification, and the author of the DoD AGCCS Style Guide. One of his projects was recently highlighted as a case study in Interaction Design: Beyond Human-Computer Interaction (Preece, Rogers, & Sharp, 2002), a textbook published by John Wiley & Sons.
Dr. James Yen
Footnotes
[1] Note to the Reader: If you wish, you may skim the recommendations in Section 6 before reading through the technical details of the report. This will put the detailed definitions and technical explanations of the standards, testing approaches, research, and best practices that lead up to the recommendations into perspective. However, to understand the rationale for the recommendations it is necessary to read through the technical details.
[2] The reader should note that the Glossary in Appendix A, provided at the end of the report, contains the definitions of the voting and usability terminology used herein.
[3] Thanks to Penelope Bonsall for her help in accurately summarizing the history of the VSS.
[4] See Section 2.4 for a more detailed discussion of conformance testing.
[5] Reading speed with braille and other tactile displays is significantly slower than even audio output at normal speed. In addition, only a limited number of blind and visually impaired users are proficient in reading braille or other tactile displays.
[6] The Opticon is a device that converts visual data into tactile data and can be used to read data from a computer screen. Only a small percentage of blind users use Opticons for this purpose, and the reading speed is significantly lower than with other forms of alternate output.
[7] This is the term used by the U.S. Access Board. It should not be confused with the concept of an open or closed architecture as used when referring to computer systems.
[8] Note that in describing the voter interaction in these examples we have chosen to alternate the use of the pronouns “he” and “she” to give some indication of the diversity of voters and to paint a more vivid picture of the user interaction.
[9] It is common for users to blame themselves for their inability to accomplish a task with a given system, even though the difficulties experienced may be common across a range of users and the result of correctable usability problems.
[10] There is also the possibility that the change is detected, but the user does not know why and assumes it to be a system error. In this case, users may correct the error but remain suspicious of the ability of the system to accurately capture their vote, or they might assume that the presumed error needs to be reported to a poll worker. Alternatively, they might be startled by the error and lose confidence in their ability to operate the system.
[11] Failure to adhere to the standard button arrangement for Yes/No messages is one of a number of common design errors that result in inadvertent activation of the incorrect choice.
[12] One blatant example of this type of error is an application message from a commercial software package that reads “OK to not save changes?”
[13] Alternate visual designs for single- and multi-seat contests could be used to reduce the probability of inadvertent undervoting by providing feedback, in a multi-seat contest, on the current number of selected candidates and the maximum number of candidates allowed.
[14] This is often the case for congenitally deaf users (those deaf from birth), since reading ability is learned to a large extent as an auditory process.
[15] It should be noted that many of the standards discussed in this section are long and complex and only a highly simplified overview is presented here. For greater detail, readers are urged to consult the standards themselves.
[16] This is a reference to the data used in a study on human short-term memory by Miller (Miller, 1956) that led to the now famous 7 +/- 2 rule. The results were widely applied and led to the creation of 7-digit telephone numbers.
[17] Note that the references cited in this report are only those that are directly pertinent to the report; they are a subset of the literature that was actually examined.