Introduction

Consider this scenario: You are managing the Intranet applications for a large company. You've spent the last year championing data-driven (re-)design approaches with some success. Now there is an opportunity to revamp a widely used application with significant room for improvement. You need to do the whole project on a limited dollar and time budget. It's critical that the method you choose models a user-centered approach that prioritizes the fixes in a systematic and repeatable way. It is also critical that the approach you choose be cost-effective and convincing. What do you do?

Independent of the method you pick, your tasks are essentially to:

Identify the problems
Prioritize them based on impact to use
Prioritize them based on time/cost benefits of fixing the problems
Design and implement the fixes

In this situation, most people think of usability testing and heuristic (or expert) review.

In usability testing, a set of representative users are asked (individually) to complete a series of critical and/or typical tasks using the interface. During the testing, usability specialists observe the self-evidency of the task process. That is, they watch as participants work through preselected scenarios and note when participants stumble, make mistakes, and give up. The usability impact is prioritized objectively according to how frequently a given difficulty is observed across the participants. Prioritizing cost-benefits of fixes and suggesting specific design improvements for that specific interface are typically based on the practitioner's broader experience with usability inspection/testing experience.

As the name implies, heuristic review offers a shortcut method for usability evaluations. Here, one or more specialists examine the application to identify potential usability problems. Problems are noted when the system characteristics violate constraints known to influence usability. The constraints, which vary from practitioner to practitioner, are typically based on principles reviewed in usability tomes (e.g., Moglich and Nielsen, 1990) and/or criteria derived from and validated by human factors research (e.g., Gerhard-Powals, 1996). The critical steps of prioritizing the severity of the problems identified and suggesting specific remediation approaches for that interface are based on the practitioner's broader experience with usability inspection/testing experience.

Empirical evaluations of the relative merit of these approaches outline both strengths and drawbacks for each. Usability testing is touted as optimal methodology because the results are derived directly from the experiences of representative users. For example, Nielsen and Phillips (1993) report that despite its greater absolute cost, user testing 'provided better performance estimates' of interface effectiveness. The tradeoff is that coordination, testing, and data reduction adds time to the process and increases the overall man- and time-cost of usability testing. As such, proponents of heuristic review plug its speed of turnaround and cost-effectiveness. For example, Jeffries, Miller, Wharton, and Uyeda, (1991) report a 12:1 superiority for an expert inspection method over usability testing based on a strict cost-benefit analysis. On the downside, there is broad concern that the heuristic criteria do not focus the evaluators on the right problems (Bailey, Allan and Raiello, 1992). That is, simply evaluating an interface against a set of heuristics generates a long list of false alarm problems. But it doesn't effectively highlight the real problems that undermine the user experience.

There are many, many more studies that have explored this question. Overall, the findings of studies pitting usability testing against expert review, lead to the same ambivalent (lack of) conclusions.

Hear what I say, or watch what I do?

In an attempt to find alternative approaches to compare usability testing and expert review, Muller, Dayton and Root (1993) reanalyzed the findings from four studies (Desurvire, Condziela, and Atwood, 1992; Jeffries and Desurvire, 1992; Jeffries, Miller Wharton and Uyeda, 1991; and Karat, Campbell and Fiegel, 1992). Rather than looking at the raw number of problems identified by each technique, their re-analysis categorized the findings of the previous studies on parameters such as:

# problem classes per hour invested,
# of classes of usability problems identified,
likelihood of identifying severe problems,
uniqueness of results, and
average cost/problem identified for each technique.

Again, their re-analysis demonstrated no stable difference indicating that either usability testing or heuristic review (conducted by human factors professionals) is a superior technique.

An array of methodological inconsistencies makes interpreting the findings in toto even more challenging. The specific types of interfaces or tasks used in comparison vary widely from study to study. Many studies do not clearly articulate what the "experts" are expected to do to come up with their findings (much less, what they really DO). The specific heuristics applied are rarely specified clearly. The level of expertise of the evaluators is rarely described or clearly equated, although it is often offered informally as a factor in the diversity of outcomes. As such, it is possible that, among other things, the conclusions of specific studies falsely favor a method, when relative benefits really result from the broader and deeper experience of the individual implementing the method (Johns, 1994).

So, what's an Intranet manager to do?

Equivocal research findings aren't really helpful when your task is to select the most cost-effective means of identifying, prioritizing and fixing problems in your interface. To make it worse, it appears that the problems identified by the two approaches are largely non-overlapping. According to Law and Hvannberg (2002), the problems identified by usability testing tend to reflect flaws in the design of the task scenarios, such as task flows that do not reflect the steps/order that users expect. In contrast, expert reviews highlight problems intrinsic to the features of the system itself, such as design consistency. Actually that's the good news.

Findings from recent studies extend those findings and may potentially help identify the parameters for identifying the right solution. Instead of pitting one strategy against the other, this study focuses on identifying the qualitative differences between the findings of usability testing and expert review.

Levels of Understanding

Rasmussen (1986) identified three levels of behavior that could lead to interface challenges: skill-based, rule-based, and knowledge-based behavior.

Success at the skill-based level depends on the users' ability to recognize and pay attention to the right signals presented by the interface. Skill-based accuracy can be undermined, for example, when non-critical elements of an interface flash or move. These attention grabbing elements may pull a user's focus from the task at hand, causing errors.

Success at the rule-based level depends on the users' ability to perceive and respond to the signs that are associated with the ongoing procedure. Users stumble when they fail to complete a step or steps because the system (waiting) state or next-step information was not noticeable or clearly presented.

Success at the knowledge-based level depends on the users' ability to develop the right mental model for the system. Users acting based on an incorrect or incomplete mental model may initiate the wrong action resulting in an error.

Fu, Salvendy and Turley (2002) contrasted usability testing and heuristic review by evaluating their effectiveness at identifying interface problems at Rasmussen's three levels of processing. In their study, evaluators were assigned to evaluate an interface via either observing usability testing or heuristic review starting with an identical set of scenario-based tasks. Across the study, 39 distinct usability problems were identified. Consistent with previous research, heuristic evaluation techniques identified slightly more of the problems (87%) and usability testing slightly fewer (54%). There was a 41% overlap in the problems identified.

More interestingly, when the problems/errors were categorized on Rasmussen's behavior-levels, heuristic review and usability testing identified complimentary sets of problems: heuristic review identified significantly more skill-based and rule-based problems, whereas usability testing identified challenges that occur at the knowledge-based level of behavior.

Upon consideration, this distribution is not terribly surprising. Usability Testing identifies significantly more problems at the knowledge-level. Knowledge-based challenges arise when users are learning – creating or modifying their mental models – during the course of the task itself. Often the problems that surface here are the result of a mismatch between the expected and actual task flow. Since the mental interaction models for experts are usually fairly articulated, experts are not good at conjuring the experiences or speculating about expectations of novice users.

In contrast, skill- and rule-based levels of behavior are well studied and documented in attention and perception literature (e.g., Proctor and Van Zandt, 1994). Human Factors courses often focus on these theories. The criteria for heuristic review are essentially derived from them. It is not surprising that usability specialists focusing on heuristic criterion for evaluation would identify relatively more problems at this level. Not incidentally, parameters of interface design affecting skill- and rule-based levels of behavior reflect characteristics that are intrinsic to the interface itself. Intrinsic problems are more likely challenge advanced users because they present a fundamental challenge to their experience-based "standard model" of interface interactions.

Fu, Salvendy and Turley (2002) conclude that the most effective approach is to integrate both techniques at different times in the design process. Based on their analysis, heuristic review should be applied first or early in the (re-)design process to identify problems associated with the lower levels of perception and performance. Once those problems are resolved, usability testing can be used to effectively focus on higher level interaction issues without the distraction of problems associated with skill- and rule-based levels of performance.

Mindful of budget and time constraints, Fu and colleagues also note that if redesign is concerned only with icon design or layout, heuristic review may be sufficient to the task. However, if modification affects software structure, interaction mapping or complex task flows, usability testing is the better choice. To that end, the more complex the to-be-performed tasks are, the more critical representative usability testing becomes. The more complex the proposed redesign, the more critical that both methods be employed.

It depends...

So what should you do with your evaluation project? Like most other projects, it depends on the specific case. Despite the chaotic nature of the field, it is still possible to draw a few conclusions from these studies:

heuristic (expert) reviews and usability testing identify different types of usability problems
expert reviews really only work well when experts do them
combining techniques optimizes the return

Taken together these suggest that to select the most appropriate method you will need to consider more than your budget and the possible methodologies. To make the best decision, you will also need to weigh the expertise of your evaluators, the maturity of your application, the complexity of the tasks, and possibly even the current status of your usability program.

References

Bailey, R. W., Allan, R. W., and Raiello, P. (1992). Usability testing vs. heuristic evaluation: A head-to-head comparison. Proceedings of the 36th Annual Human Factors Society Meeting. pp.409-413.

Desurvire, H. Kondziela, J. and Atwood, M. (1992). What is gained and what is lost when using evaluation methods other than usability testing. Paper Presented at Human Computer Interaction Consortium.

Gerhardt-Powals, J. (1996). Cognitive engineering principles for enhancing human-computer performance. International Journal of Human-Computer Interaction, 8(2), 189-211.

Moglich, R. and Nielsen, J. (1990), Improving a human-computer dialogue. CACM, 33(3), 338-348.

Nielsen, J. (1994). Enhancing the explanatory power of usability heuristics. Proceedings of CHI'94. ACM, New York. pp 153-158.

Jeffries, R. and Desurvire, H. (1992). Usability Testing vs. Heuristic Evaluation: Was there a Contest? SIGCHI Bulletin, 24(4) 39-41.

Jeffries, R. Miller, J. Wharton, C. and Uyeda, K. (1991). User interface evaluation in the real world: A comparison of four techniques. In Proceedings of CHI'91. (New Orleans, LA) ACM, New York. Pp. 119-124.

Karat, C. M., Campbell, R. and Fiegel, T. (1992). Comparison of empirical testing and walkthrough methods in user interface evaluation. In Proceedings of CHI '92 (Monterey, CA) ACM, New York. Pp. 397-404.

John, B. (1994). Toward a deeper comparison of Methods: A reaction to Nielson& Phillips and New Data. CHI'94 Companion. (Boston, Massachusetts), ACM, New York, pp. 285-286.

Law, L. and Hvannberg, E. T. (2002). Complementarity and Convergence of Heuristic Evaluation and Usability Test: A case Study of UNIVERSAL Brokerage Platform. NordiCHI (October 19-23, 2002), pp. 71-80.

Muller, M.J., Dayton, T., and Root, R. W. (1993). Comparing studies that compare usability methods: An unsuccessful search for stable criteria. INTERCHI '93 Adjunct Proceedings, pp. 185-186.

Rasmussen, J. (1986). Information Processing and Human Machine Interaction: An approach to Cognitive Engineering. (New York: Elsevier.)

Savage, P. (1996). User Interface Evaluation in an Iterative Design Process: A comparison of three techniques. CHI'96 Companion. (Vancouver, BC, Canada April 13-18)

Message from the CEO, Dr. Eric Schaffer — The Pragmatic Ergonomist

Reader comments

Alan Wexelblat

I have to say that I find it totally unsurprising that the quality of the evaluator is the most important factor in determining the effectiveness of usability evaluation, regardless of the method.

We've known for thirty years in software development that quality of personnel was (by at least a factor of 4) the most important thing affecting the quality of the resulting product. Things like software development methods and tools are known to have roughly a 10-20% impact on metrics such as speed of delivery or errors in final product. It therefore would be unsurprising to see a similar result in the HCI field.

Cool stuff and UX resources

Introduction

Hear what I say, or watch what I do?

So, what's an Intranet manager to do?

Levels of Understanding

It depends...

References

Message from the CEO, Dr. Eric Schaffer — The Pragmatic Ergonomist

Leave a comment here

Reader comments

Alan Wexelblat

Subscribe

Follow us

Cool stuff and UX resources

Pitting usability testing against heuristic review

Introduction

Hear what I say, or watch what I do?

So, what's an Intranet manager to do?

Levels of Understanding

It depends...

References

Message from the CEO, Dr. Eric Schaffer — The Pragmatic Ergonomist

Leave a comment here

Reader comments

Alan Wexelblat

Subscribe

Follow us