Insights from Human Factors International
|
 |
|
In This Issue :
|
| |
|
|
Enough is enough... but five probably
isn't. Evaluating the "test-five-users" guideline.
|
|
Kath Straub, Ph.D., CUA, Chief Scientist of HFI, revisits the topic
of the optimum number of usability testing subjects.
|

|
|
| |
|
|
The Pragmatic Ergonomist
|
Dr. Eric Schaffer, Ph.D., CPE, founder and CEO of HFI, offers practical
advice.
|
| |
|
| |
|
|
Death. Taxes. And the "how many users?" debate.
|
There are a few things that seem inevitable. Death. Taxes. The "Are-Five-Users-Enough?"
panel discussion that occurs at every usability conference.
Every conference.
Every year.
These panels are legend. People get excited. Speakers get hyperbolic.
Listeners get frustrated.
Repeat.
Listeners get frustrated because the debate rages with the same opinions
and no new, compelling data. The answer to the "how-many-users"
question is important. However entertaining the debate, its lack of
resolution frustrates practitioners who need to justify to their management
the choice to test five (6? 10? 90? 150?) users. Understanding
the "right" answer (and why it is right) is particularly important
for individuals institutionalizing their usability practice. They need
to make critical decisions about how to prioritize activities with limited
staff time, a limited budget, and a short window in which to build credibility.
So, really... This year they will tell us, right? How many users?
|
 |
|
Is so...
|
For years we have heard that, by the law of diminishing returns, five
users will uncover approximately 80% of the usability problems in a product
(Virzi, 1992).
In support of this claim, Nielsen (Landauer and Nielsen, 1993; Nielsen,
1993) presents a meta-analysis of 13 studies, calculating confidence
intervals to derive the now famous formula:
Problems found = N(1 - (1 - L)^n)
N = number of known problems
L = the probability of any given user finding any given problem
n = number of participants
Since this function levels off rapidly at around five participants, practitioners
typically interpret the formula as advising that five is enough.
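To see why the curve invites that reading, the formula can be evaluated directly. The following is a minimal sketch, not code from any of the cited papers. The function name is hypothetical, and the value L = 0.31 is an assumed, illustrative discovery rate rather than a figure given in this article.

```python
def expected_problems_found(known_problems, discovery_rate, participants):
    """Evaluate: Problems found = N * (1 - (1 - L)^n)."""
    return known_problems * (1 - (1 - discovery_rate) ** participants)


# Assumption for illustration only: L = 0.31 is not taken from this article.
ASSUMED_L = 0.31

for n in (1, 3, 5, 10, 15):
    found = expected_problems_found(100, ASSUMED_L, n)
    print(f"{n:>2} participants: about {found:.0f} of 100 known problems")
```

Under this assumed L, most of the gain arrives within the first handful of participants. With a smaller L the same function flattens much later, so the "five" in the guideline depends heavily on the discovery rate plugged in.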
|
 |
|
Is not...
|
There are two broad approaches to arguing against the five-user guideline.
One approach is to deconstruct the claim on statistical grounds. Researchers
who take this approach argue that inappropriate calculations were used
or that the underlying assumptions are faulty or not met (Grosvenor, 1999;
Woolrych and Cockton, 2001).
Others take a more empirical approach. Spool and Schroeder (2001) report
that testing the first five users revealed only 35% of the problems identified by
the larger test set. Perfetti and Landesman (2002) show that participants
6-18 (of 18) each identified five or more problems that were not uncovered
within the first five user tests.
|
 |
|
Do you read the fine print?
|
In fairness, both Virzi and Nielsen place qualifications on the five-user
guideline. Nielsen carefully describes the confidence part of confidence
intervals. Virzi warns that "[s]ubjects should be run until the number
of new problems uncovered drops to an acceptable level" (p. 467).
This leaves unsuspecting readers either to wade through the philosophy
of confidence intervals or test until they've tested to an (unspecified
but) "acceptable" level. It's no wonder that practitioners blink
at the caveats and remember number five.
But is that the right thing to do?
|
 |
|
A new way to decide
|
Faulkner (2003) buttresses the old empirical evaluation with a statistical
sampling approach to arrive at a novel way to determine whether five is
really enough. She evaluated the five-user guideline in a two-phase experiment.
First, she evaluated the usability of a Web-based time sheet application
by observing deviations from the optimal path across 60 participants. Then,
she used a sampling algorithm to randomly draw smaller sets of individual
users' results from the full dataset for independent analysis. Set sizes
corresponded to the number of users 'tested' in that simulation. In the
course of her experiment she ran 100 simulations for each user group
size: 5, 10, 15, 20, 30, 40, 50, and 60 users.
She found that, on average, Nielsen's prediction is right. Over 100 simulated
tests, testing five users revealed an average of 85% of the usability
problems identified with the larger group.
Averages are good, but for day-to-day practitioners, the range of problems
identified is a more critical figure. The range was not so promising.
Over the 100 simulated tests, the percentage of usability problems found
when testing five participants ranged from nearly 100% down to only 55%.
As any good freshman statistics student could predict, there is a large
variation in outcomes between trials with small samples. Extrapolating
from Faulkner's findings, usability test designers relying on any single
set of five users run the risk that nearly half the problems could be
missed.
Increasing the number of participants, however, improves the reliability
of the findings quickly. Drawing 10 participants instead of five, the
simulation uncovered 95% of the problems on average with a lower bound
of 82% of problems identified over 100 simulations. With 15 participants,
97% of the identified problems were uncovered on average, with a lower
bound of 90% found.
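The resampling idea described above can be sketched in a few lines. This is an illustrative approximation, not Faulkner's program or data. It assumes each user's result can be reduced to a set of problem IDs, and it generates synthetic users with made-up discovery rates. All function names and parameters are hypothetical.

```python
import random


def make_synthetic_users(num_users=60, num_problems=45, seed=1):
    """Build fake test results: one set of problem IDs per user (assumed data)."""
    rng = random.Random(seed)
    # Assumption: each problem has its own chance of being hit by any user.
    rates = [rng.uniform(0.05, 0.6) for _ in range(num_problems)]
    return [
        {pid for pid, rate in enumerate(rates) if rng.random() < rate}
        for _ in range(num_users)
    ]


def simulate_coverage(problems_by_user, group_size, trials=100, seed=0):
    """Draw random groups of users and report (minimum, mean) share of problems found."""
    rng = random.Random(seed)
    all_problems = set().union(*problems_by_user)
    shares = []
    for _ in range(trials):
        group = rng.sample(problems_by_user, group_size)  # draw without replacement
        found = set().union(*group)
        shares.append(len(found) / len(all_problems))
    return min(shares), sum(shares) / trials


users = make_synthetic_users()
for size in (5, 10, 15, 20, 30, 40, 50, 60):
    lowest, mean = simulate_coverage(users, size)
    print(f"{size:>2} users: mean {mean:.0%}, worst draw {lowest:.0%}")
```

The numbers printed depend entirely on the synthetic rates and will not match Faulkner's. The point of the sketch is the shape of the result: the worst draw for small groups can sit well below the average, which is the risk a practitioner running a single test actually faces.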
|
 |
|
So? How many then?
|
So what's the answer? As always in usability, the answer is "It
depends." The key to effective usability testing is recruiting a
truly representative sample of the target population. Often the test population
will need to represent more than one user group.
That aside, Faulkner's work strongly indicates that a single usability
test with five participants is not enough.
|
The Pragmatic Ergonomist
|
So for a routine usability test, run 12 people for each segment. For an
important one where the stakes are high, run 30. If resources are really
tight, you can drop to five or six per segment, but this is bad.
Remember I said "FOR EACH SEGMENT." If you are designing a
time reporting system for health care workers, government employees, lawyers,
and forestry workers, you are making a big mistake if you test just three
in each group. That would be 12 people tested, but the groups are quite
diverse and you need more people from each segment to be confident.
|
 |
|
References
|
Faulkner, L. (2003). Beyond the five-user assumption: Benefits of increased
sample sizes in usability testing. Behavior Research
Methods, Instruments and Computers, 35(3), 379-383.
Grosvenor, L. (1999). Software usability: Challenging the myths and assumptions
in an emerging field. Unpublished master's thesis, University of
Texas, Austin.
Landauer, T. K., & Nielsen, J. (1993). A mathematical model of the
finding of usability problems. INTERCHI '93,
ACM Computer-Human Interface Special Interest Group.
Nielsen, J. (1993). Usability engineering. Boston: AP Professional.
Perfetti, C., & Landesman, L. (2002). Eight
is not enough. Retrieved April 14, 2003.
Spool, J., & Schroeder, W. (2001). Testing web sites: Five users
is nowhere near enough. In CHI 2001 Extended Abstracts
(pp. 285-286). New York: ACM Press.
Virzi, R. A. (1992). Refining the test phase of usability evaluation:
How many subjects is enough? Human Factors,
34, 457-468.
Woolrych, A., & Cockton, G. (2001). Why and when five test users
aren't enough. In J. Vanderdonckt, A. Blandford, & A. Derycke (Eds.),
Proceedings of IHM-HCI 2001 Conference: Vol.
2 (pp. 105-108). Toulouse, France: Cépaduès.
|
| |
|
|
Martha Roden,
Usability Engineer
CoCreate Software Inc.
|
Regarding the "Is 5 Enough?" debate in your newsletter.
If you are the only interaction designer and usability professional in
a company of 400 people, AND no one really thinks usability tests are
important or wants to free up additional resources to conduct the tests...
then believe me ... 5 is definitely better than none!
Five people still find more usability problems than zero people.
|
|
|
James R. (Jim) Lewis, Ph.D., CHFP
Senior Human Factors Engineer
IBM Pervasive Computing Division
As always, I enjoyed the newsletter. I guess I can understand the frustration
with the recurring question of sample sizes for usability studies, but
it can be an important issue, and the various explorations of it have
enhanced our understanding of some of our practices (or at least led to
some interesting discussions).
I've ordered the Faulkner paper, and am looking forward to reading it.
It sounds similar in method and results to a paper that I published in
2001:
Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery
rates estimated from small samples. International Journal of Human-Computer
Interaction, 13, 445-479.
I was a little surprised that you didn't mention the paper that I published
in Human Factors in 1994:
Lewis, J. R. (1994). Sample sizes for usability studies: Additional
considerations. Human Factors, 36, 368-378.
Then again, I shouldn't be surprised because it doesn't get mentioned
that much. I don't promote my work as well as Jakob Nielsen, and Bob Virzi
was the first to publish the results of a Monte Carlo study on the topic
(HFES Proceedings in 1990, then the HF article in 1992). I also think
that I might not have received as much mention because my message wasn't
as optimistic or simple. Basically, the message was that the sample size
you need depends on the problem discovery rate (p). If p is large, then
you don't need very many people to discover the problems available for
discovery. If p is small, then you need a larger sample. An ROI simulation
indicated that the appropriate target for the proportion of problems to
discover also depended on the value of p, with higher values having a
break-even point at around 98% and lower values of p having a break-even
point at around 86% (it's important to keep in mind that these values
might depend on the assumptions made for the simulation -- but I think
they are still informative). The data I reported in Lewis (1994) had p=.16,
so to get to 86% problem discovery, I needed to run 12 participants (I
actually ran 15). It's interesting (perhaps coincidental) that this is
the number that Dr. Schaffer suggested in his part of the newsletter.
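The arithmetic behind that figure can be reproduced by solving 1 - (1 - p)^n >= target for n. The sketch below assumes this simple cumulative-discovery model; the function name is hypothetical and is not taken from the letter or the cited papers.

```python
import math


def participants_needed(discovery_rate, target_share):
    """Smallest n with 1 - (1 - p)^n >= target, assuming a constant discovery rate p."""
    if not (0 < discovery_rate < 1 and 0 < target_share < 1):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - target_share) / math.log(1 - discovery_rate))


# Values quoted in the letter: p = .16 and an 86% discovery target.
print(participants_needed(0.16, 0.86))
```

With p = .16 and a target of 86%, the function returns 12, matching the number stated in the letter. Raising p shrinks the required sample quickly, which is the dependence on p that the letter describes.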
Another finding from Lewis (1994) that might have been discouraging to
practitioners was that I didn't find any relationship between problem
impact (or severity, using a behavioral scale rather than observer judgments)
and frequency. In other words, my data showed severe problems being discovered
at the same rate as non-severe problems.
Something that people always seem to forget (and this is something that
I heard Jakob Nielsen try to drive home at a UPA panel session) is that
the use of small-sample usability tests is bound up with iterative testing.
If you're not iterating, then small samples don't make much sense. By
the time you finish testing, you should have tested a relatively large
number of participants -- not just five! The reason to run multiple small
samples (especially in the early stages of evaluating a design) is that
certain problems emerge early and there is often general agreement among
human factors engineers and developers regarding the reality of the observed
problems and the need to fix them. Why would you continue running participants
when you know in advance that they have a very high likelihood of encountering
the known problems? Continuing to test under those circumstances is potentially
wasteful of resources and slows down your ability to find out whether the fixes
you have in mind will actually fix the problem without creating any others.
There might even be an ethical consideration about needlessly frustrating
participants.
As you repair problems, the likelihood of discovering additional problems
should fall until the problems are appearing almost at random -- what
I sometimes think of as problem foam. The early problems seem to have
more structure and consistency than the later problems (which will usually,
almost by definition, have a lower likelihood of occurrence than the early
problems). Thus, the more mature a product becomes, the larger your sample
size will need to be to discover the remaining problems (but, hopefully,
there are fewer problems available for discovery -- and I don't think
that any of us believe that it is possible to discover all usability problems).
This suggests a strategy of starting with small samples, then increasing
sample sizes as you continue iterating. Another advantage of shifting
from small to large samples is that as you approach the end of the development
cycle, global task measures (task completion rates,
task completion times, user satisfaction measures) become more important
-- especially if you have set up targets for these measures to use as
a stopping rule for the iterative test-redesign-retest process, as larger
samples lead to more precision in these types of measurements.
It's a shame that practitioners sometimes latch onto simple but misleading
rules of practice. Thanks for tackling these topics in the newsletter.