Sample Sizes for Usability Tests: Mostly Math, Not Magic

WAITS & MEASURES: SPECIAL SECTION
James R. Lewis | IBM Corp. | jimlewis@us.ibm.com
“Really, how many users do you need to test?
Three answers, all different.”
—USER EXPERIENCE, VOL. 4, ISSUE 4, 2005
WHY DO WE KEEP TALKING about appropriate sample sizes for usability tests?
Perhaps the most important factor is the economics of usability testing. For many practitioners, usability tests are fairly expensive events, with much of the expense in the variable cost of the number of participants observed (which includes cost of participants, cost
of observers, cost of lab, and limited time to obtain data to provide to developers in a timely fashion). Excessive sampling is always wasteful of resources [9], but when the cost of an
additional sample (in usability testing, an additional participant) is high, it is very important
that the benefit of additional sampling outweighs the cost.
Another factor is the wide range of test and evaluation situations that fall under the
umbrella of usability testing. Usability testing includes three key components: representative participants, representative tasks, and representative environments, with participants’
activities monitored by one or more observers [2]. Within this framework, however, usability tests have wide variation in method and motivation. They can be formal or informal,
think-aloud or not, use low-fidelity prototypes or working systems. They can have a primary
focus on task-level measurements (summative testing) or problem discovery (formative
testing). This latter distinction is very important, as it determines the appropriate general
approach to sample-size estimation for usability tests.
When the focus is on task-level measurements, sample-size estimation is relatively
straightforward, using mainstream statistical techniques that have been available since
the early 20th century (in some cases, even earlier). Basically, you need an estimate of
the variance of the dependent measure(s) of interest (typically obtained from previous,
similar studies or pilot data) and an idea of how precise the measurement must be
(which is a function of the magnitude of the desired minimum critical difference and
statistical confidence level); once you have that, the rest is arithmetic. There are numerous sources for information on standard sample-size estimation [6, 23]. For this reason,
I’m not going to describe them in any additional detail here (but for a detailed discussion of this type of sample size estimation in the context of usability testing, see Lewis
[14]). The less-well-understood problem is sample-size estimation for problem-discovery
(formative) testing.
A LITTLE HISTORY. I first encountered this problem when I started working for IBM in
1981, fresh from graduate school. The IBM practice at that time, based on papers published by Alphonse Chapanis and colleagues [1, 5], was to observe about five to six participants per iteration for problem discovery. Chapanis had asserted that after you’d observed
six participants, you would have seen about all of the problems you were going to see.
interactions / November + December 2006
Based on graduate statistics classes I’d had with James Bradley [3, 4], I thought that there
must be a way to more precisely estimate sample sizes for these types of tests. Specifically,
it seemed like you should be able to use the binomial probability formula for this purpose,
and I mentioned this briefly in my first publication [10]:
The binomial probability theorem can be used to determine the probability
that a problem of probability p will occur r times during a study with n subjects. For example, if an instruction will be confusing to 50 percent of the user
population, the probability that one subject will be confused is 0.5. If two
subjects are observed, then the probability that either one or both subjects
will be confused is 0.75; and if three subjects are observed, the probability
that at least one of them will be confused is 0.875.

THE GOAL: PROBLEM DISCOVERY
You can’t really talk about discovering 90 percent of all possible usability problems across all possible users, tasks, and environments. You can establish a problem discovery goal given a sampled population of users, a defined set of tasks, and a defined set of environments. Change the population of users, tasks, or environments, and all bets are off. But this is better than nothing. If your problem discovery rate is starting to go down, then change one or all of these elements of usability. Test from a different population of users, using different tasks, in different environments. You’ll discover different problems.

I didn’t mention the now-famous formula $1-(1-p)^n$ in that paper, but that’s the formula
I used for the computations. Bradley taught his students that this was a very useful formula
for many situations, derived from the binomial probability formula as P(At least once) = 1
- P(0) (in other words, the probability of something happening at least once is 1 minus the
probability of its not happening at all). When r = 0 in the binomial probability formula, P(0)
is $(1-p)^n$, so P(At least once) is $1-(1-p)^n$.
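
The "at least once" formula is easy to check numerically. The following is a minimal Python sketch, not code from the 1982 paper; the function name `p_at_least_once` is an assumption chosen for illustration. It reproduces the three probabilities from the confusing-instruction example (p = 0.5 with one, two, and three subjects).

```python
def p_at_least_once(p: float, n: int) -> float:
    """Probability that a problem with per-participant probability p
    is observed at least once among n participants: 1 - (1 - p)**n."""
    return 1.0 - (1.0 - p) ** n


# The confusing-instruction example: p = 0.5, observed with 1, 2, and 3 subjects.
for n in (1, 2, 3):
    print(n, p_at_least_once(0.5, n))  # 0.5, 0.75, 0.875
```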
The years 1990 through 1994 saw a series of publications investigating the use of the
formula to model usability problem discovery, including empirical verification of its accuracy for problem discovery studies, in which sample size refers to the number of participants, and heuristic evaluations, in which sample size refers to the number of independent observers [21, 22, 25, 15, 12]. These studies provided quite a bit of evidence
that $1-(1-p)^n$ is a good model of problem discovery. For problem-discovery tests, this literature contains several large-sample examples that showed p ranging from 0.16 to 0.42
[12]. For several large-sample heuristic evaluations, the reported value of p ranged from
0.22 to 0.60 [16].
So, what does $1-(1-p)^n$ suggest about usability-problem discovery? Note that there are
only two variables—p and n. The most direct interpretation of this is that many other variables that we might assume would affect problem discovery—such as the cost of fixing a
problem or the severity of the problem from the user’s perspective—don’t. For example,
Virzi [22] reported earlier discovery of more-serious problems, but I failed to replicate that
finding [12]. Also, a return-on-investment (ROI) model in the same paper showed that as
the magnitude of the savings associated with early discovery versus late discovery
increased, the ROI of a usability study also increased, but this factor had no appreciable
effect on the sample size at maximum ROI [12].
An additional outcome of the ROI study was that the appropriate problem discovery goal
depended on the value of p. The model indicated that if the expected value of p was small
(say, around 0.10), practitioners should plan to discover about 86 percent of the problems.
If the expected value of p was larger (say, around 0.25 or 0.50), practitioners should plan
to discover about 98 percent of the problems. For expected values of p between 0.10 and
0.25, practitioners should interpolate between 86 and 98 percent to determine an appropriate goal for the percentage of problems to discover. The analysis did not address values
of p smaller than 0.10, but, presumably, the appropriate goal would be something less
than 86 percent.
If you know or can estimate the expected value of p for a study and know the desired
problem discovery goal, you can compute n with the following formula (derived algebraically from $Goal = 1-(1-p)^n$, solving for n):
n = log(1-Goal)/log(1-p)
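
As a rough sketch of how this formula might be applied in practice, the Python below computes n and rounds up to a whole participant. The function name `required_sample_size` and the input checks are assumptions added for illustration; the formula itself is the one given above.

```python
import math


def required_sample_size(goal: float, p: float) -> int:
    """Smallest whole number of participants n for which
    1 - (1 - p)**n reaches the discovery goal."""
    if not (0.0 < goal < 1.0 and 0.0 < p < 1.0):
        raise ValueError("goal and p must lie strictly between 0 and 1")
    # n = log(1 - Goal) / log(1 - p); the base of the logarithm cancels out.
    return math.ceil(math.log(1.0 - goal) / math.log(1.0 - p))


# A 90 percent discovery goal with p = 0.5 works out to 3.3, rounded up to 4.
print(required_sample_size(0.90, 0.5))
```

Rounding up rather than to the nearest integer is a conservative choice: rounding down would leave the expected discovery slightly short of the goal.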
But getting an estimate of p can be tricky if you’re working with small samples. For many
years, I’d assumed that small-sample estimates of p would behave like small-sample estimates of the arithmetic mean: that they would have more variability than large-sample
estimates, but would be unbiased (tending to have the same value as large-sample estimates in the long run). In 2001 I found out that this assumption was completely wrong. I
was editing a special issue of the International Journal of Human-Computer Interaction on
Usability Evaluation (Vol. 13, No. 4), and received a manuscript from Morten Hertzum and
Niels Jacobsen in which they proved that small-sample estimates of p were necessarily
biased to be higher than the actual population problem discovery rate [7]!
In response to this, I investigated a number of methods for adjusting problem-discovery
rates estimated from small samples [13]. The best method for compensating for the bias
was to average two methods—one method based on Good-Turing discounting and a normalization method based on the work of Hertzum and Jacobsen. The resulting adjustment
looks complicated, but it won’t seem quite so bad after going through a worked-out example (in the next section):
$p_{adj} = \frac{1}{2}\left[(p_{est} - 1/n)(1 - 1/n)\right] + \frac{1}{2}\left[p_{est}/(1 + GT_{adj})\right]$
$GT_{adj}$ is the Good-Turing adjustment to probability space to account for unseen events
(the number of problems that occurred exactly once divided by the total
number of different problems). The $p_{est}/(1 + GT_{adj})$ component in the equation produces
the Good-Turing-adjusted estimate of p by dividing the observed, unadjusted estimate of
p ($p_{est}$) by one plus the Good-Turing adjustment to probability space. The $(p_{est} - 1/n)(1 - 1/n)$ component in the equation produces the normalized estimate of p from the observed, unadjusted estimate of p and n (the sample size used to estimate p). The adjustment uses the
average of these two estimates, because the Good-Turing estimator tends to overestimate
the true value of p, but normalization tends to underestimate it [13]. Note that the Good-Turing adjustment is a function of the number of infrequently occurring problems, whereas normalization is a function of the estimate’s sample size. The Monte Carlo experiments
of Lewis [13] demonstrated that this adjustment works very well, even with initial sample
sizes as small as two to four participants.
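
The adjustment translates directly into a few lines of code. This is a minimal Python sketch of the formula as described above, not the procedure's original implementation; the function and parameter names (`adjust_p`, `n_once`, `n_problems`) are assumptions for illustration.

```python
def adjust_p(p_est: float, n: int, n_once: int, n_problems: int) -> float:
    """Average of the normalized and Good-Turing-adjusted estimates of p.

    p_est      -- observed, unadjusted estimate of p
    n          -- number of participants used to estimate p
    n_once     -- number of problems that occurred exactly once
    n_problems -- total number of different problems observed
    """
    gt_adj = n_once / n_problems                       # Good-Turing adjustment
    p_good_turing = p_est / (1.0 + gt_adj)             # tends to overestimate p
    p_normalized = (p_est - 1.0 / n) * (1.0 - 1.0 / n) # tends to underestimate p
    return 0.5 * p_normalized + 0.5 * p_good_turing


# Illustrative inputs: p_est = 0.5 from 4 participants,
# with 1 of 3 observed problems seen exactly once.
print(round(adjust_p(0.5, 4, 1, 3), 3))
```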
A HYPOTHETICAL EXAMPLE. The best way to work with these formulas is to create
a participant-by-problem matrix, as shown in Table 1.
                      Problem Number
Participant     1       2       3       4       Count   p
1               1       0       1       0       2       0.500
2               1       0       1       1       3       0.750
3               1       0       0       0       1       0.250
4               0       0       0       0       0       0.000
5               1       0       1       0       2       0.500
6               1       0       0       0       1       0.250
7               1       1       0       0       2       0.500
8               1       0       0       0       1       0.250
Count           7       1       3       1
p               0.875   0.125   0.375   0.125           0.375

Table 1. Data from a Hypothetical Usability Test with Eight Subjects, $p_{est}$ = 0.375
REFERENCES
1. Al-Awar, J., Chapanis, A., & Ford, R. (1981). Tutorials for the first-time computer user. IEEE Transactions on Professional Communication, 24, 30-37.
2. ANSI. (2001). Common industry format for usability test reports (ANSI-NCITS 354-2001). Washington, DC: American National Standards Institute.
3. Bradley, J. V. (1968). Distribution-free statistical tests. Englewood Cliffs, NJ: Prentice-Hall.
4. Bradley, J. V. (1976). Probability; decision; statistics. Englewood Cliffs, NJ: Prentice-Hall.
5. Chapanis, A. (1981). Evaluating ease of use. Unpublished manuscript prepared for IBM, available on request from J. R. Lewis.
6. Diamond, W. J. (1981). Practical experiment designs for engineers and scientists. Belmont, CA: Lifetime Learning Publications.
7. Hertzum, M., & Jacobsen, N. J. (2003). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 15, 183-204.
8. ISO. (1998). Ergonomic requirements for office work with visual display terminals (VDTs) Part 11: Guidance on usability (ISO 9241-11:1998(E)). Geneva, Switzerland: Author.
9. Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury Park, CA: Sage.
10. Lewis, J. R. (1982). Testing small system customer set-up. In Proceedings of the Human Factors Society 26th Annual Meeting (pp. 718-720). Santa Monica, CA: Human Factors Society.
11. Lewis, J. R. (1993). Problem discovery in usability studies: A model based on the binomial probability formula. In Proceedings of the Fifth International Conference on Human-Computer Interaction (pp. 666-671). Orlando, FL: Elsevier.
12. Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.
13. Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479.
14. Lewis, J. R. (2006). Usability testing. In G. Salvendy (ed.), Handbook of Human Factors and Ergonomics (pp. 1275-1316). New York, NY: John Wiley.
15. Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of ACM INTERCHI’93 Conference (pp. 206-213). Amsterdam, Netherlands: ACM Press.
16. Nielsen, J., & Molich, R. (1990). Heuristic evaluation of user interfaces. In Conference Proceedings on Human Factors in Computing Systems - CHI90 (pp. 249-256). New York, NY: ACM.
17. Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved July 4, 2006 from http://www.uie.com/articles/eight_is_not_enough/
18. Sauro, J. (2006). UI problem discovery sample size. Downloaded from Measuring Usability website, July 20, 2006: http://www.measuringusability.com/samplesize/problem_discovery.php
19. Spool, J., & Schroeder, W. (2001). Testing web sites: Five users is nowhere near enough. In CHI 2001 Extended Abstracts (pp. 285-286). New York: ACM Press.
20. Turner, C. W., Lewis, J. R., & Nielsen, J. (2006). Determining usability test sample size. In W. Karwowski (ed.), International Encyclopedia of Ergonomics and Human Factors (pp. 3084-3088). Boca Raton, FL: CRC Press.
21. Virzi, R. A. (1990). Streamlining the design process: Running fewer subjects. In Proceedings of the Human Factors Society 34th Annual Meeting (pp. 291-294). Santa Monica, CA: Human Factors Society.
22. Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-468.
23. Walpole, R. E. (1976). Elementary statistical concepts. New York, NY: Macmillan.
24. Wixon, D. (2003). Evaluating usability methods: Why the current literature fails the practitioner. interactions, 10(4), 28-34.
25. Wright, P. C., & Monk, A. F. (1991). A cost-effective evaluation method for use by designers. International Journal of Man-Machine Studies, 35, 891-912.

One of several ways to compute p is to divide the number of problem occurrences by
the number of participants times the number of problems. After running eight participants,
the estimate of p is 0.375 (12/(8*4)). But what did things look like after having run the first
four? At that time there was no evidence that Problem 2 existed, so the estimate of p was
6/(3*4), or 0.500 (an example of the bias described by Hertzum and Jacobsen [7]).
Furthermore, suppose you had established a goal of 90 percent problem discovery.
If you were to estimate the sample-size requirement using the unadjusted value of p, you’d
get n = log(1-.90)/log(1-0.5) = log(0.1)/log(.5) = (-1)/(-0.3) = 3.3, which rounds up to 4.
How much would this change using the adjusted value of p? First, let’s do the Good-Turing adjustment. We need to know the total number of discovered problems (three after
having observed four participants), and how many of those had occurred just once (one).
For this example, the adjustment is 0.5/(1 + 1/3), which equals 0.375. Next is the normalization procedure, which is (0.5 - 1/4)(1 - 1/4) = 0.188. The average of these two adjustments is 0.28—almost half the unadjusted value. The correspondingly adjusted estimate of
n is log(1-0.90)/log(1-0.28) = log(0.1)/log(0.72) = (-1)/(-0.143) = 7—almost double the
original estimate (but still not terribly large).
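
The whole calculation can be driven from a participant-by-problem matrix. The Python sketch below is illustrative only: the name `summarize` is an assumption, and the matrix is a transcription of Table 1 that matches the counts stated in the text (7, 1, 3, and 1 occurrences for Problems 1 through 4, with Problem 2 first seen after the fourth participant). It carries the unrounded adjusted p through to the sample-size step, so intermediate values can differ slightly from the hand calculation with rounded numbers.

```python
import math

# Rows are participants 1-8; columns are Problems 1-4 (1 = problem observed).
TABLE_1 = [
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]


def summarize(matrix, goal=0.90):
    """Return (p_est, p_adj, n_needed) for a participant-by-problem matrix."""
    n = len(matrix)
    # Occurrences per problem, keeping only problems seen at least once.
    counts = [c for c in (sum(col) for col in zip(*matrix)) if c > 0]
    n_problems = len(counts)
    n_once = sum(1 for c in counts if c == 1)
    p_est = sum(counts) / (n * n_problems)
    p_good_turing = p_est / (1 + n_once / n_problems)
    p_normalized = (p_est - 1 / n) * (1 - 1 / n)
    p_adj = (p_good_turing + p_normalized) / 2
    n_needed = math.ceil(math.log(1 - goal) / math.log(1 - p_adj))
    return p_est, p_adj, n_needed


# State of knowledge after the first four participants.
print(summarize(TABLE_1[:4]))
```

Passing the full matrix instead of the first four rows gives the values for all eight participants.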
As an exercise for the reader, what are the adjusted values for p and n if you use the data
from all eight participants in Table 1? If you don’t want to drag out the calculator with the
log functions, try the sample-size calculator at the Measuring Usability Web site
(http://www.measuringusability.com/samplesize/problem_discovery.php [18]).
THE “EIGHT IS NOT ENOUGH” EXAMPLE. In 2001, Spool and Schroeder [19] published the results of a large-scale usability evaluation in which they concluded that five
users were “nowhere near enough” to find all (or even 85 percent) of the usability problems in the Web sites they were studying. Perfetti and Landesman [17], discussing related research, stated:
When we tested the site with 18 users, we identified 247 total obstacles-to-purchase. Contrary to our expectations, we saw new usability problems
throughout the testing sessions. In fact, we saw more than five new obstacles for each user we tested. Equally important, we found many serious
problems for the first time with some of our later users. What was even
more surprising to us was that repeat usability problems did not increase as
testing progressed. These findings clearly undermine the belief that five
users will be enough to catch nearly 85 percent of the usability problems on
a Web site. In our tests, we found only 35 percent of all usability problems
after the first five users. We estimated over 600 total problems on this particular online music site. Based on this estimate, it would have taken us 90
tests to discover them all!
The information provided in this paragraph shows that the value of p in this study was
very small. If there were 600 usability problems available for discovery given the study’s
method, then 247 problems are 41 percent of the total available for discovery. Taking $1-(1-p)^{18} = 0.41$ and solving for p gives p = 0.029.
Given p = 0.029, the percentage of discovery expected when n = 5 is 13.7 percent. In
accordance with the data reported by Perfetti and Landesman, 13.7 percent of 600 is 82
problems, which is about 35 percent of the total number of problems they discovered with
18 participants (35 percent of 247 is 86).
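
Solving for p just means inverting the discovery formula. The sketch below is one way to do it in Python; the function name `p_from_discovery` is an assumption for illustration, and it uses the unrounded proportion 247/600 rather than 0.41, so the results should land close to, but not exactly on, the rounded figures in the text.

```python
def p_from_discovery(proportion_found: float, n: int) -> float:
    """Invert proportion_found = 1 - (1 - p)**n to recover p."""
    return 1.0 - (1.0 - proportion_found) ** (1.0 / n)


# 247 of an estimated 600 problems found with 18 participants.
p = p_from_discovery(247 / 600, 18)

# Expected proportion of the 600 problems found after the first five participants.
found_at_5 = 1.0 - (1.0 - p) ** 5

print(round(p, 3), round(found_at_5, 3))
```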
For the conditions present in their study, it is not surprising that they continued to see
more than five new problems with each participant. In fact, you wouldn’t expect the number of new problems per participant to fall below five until around the 45th participant.
This is what you’d generally expect with a low problem discovery rate and a large number
of problems available for discovery.
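
The article does not spell out how the "around the 45th participant" figure was obtained. One plausible reconstruction, shown here as a hedged Python sketch, assumes every problem shares the same p, so that the expected number of problems first seen with participant k is N * p * (1-p)^(k-1). Both function names are hypothetical.

```python
def expected_new_problems(total_problems: int, p: float, k: int) -> float:
    """Expected number of problems first seen with participant k,
    assuming (a simplification) that every problem has the same p."""
    return total_problems * p * (1.0 - p) ** (k - 1)


def first_participant_below(threshold: float, total_problems: int, p: float) -> int:
    """First participant for whom the expected number of new problems
    drops below the threshold (requires p > 0 and threshold > 0)."""
    k = 1
    while expected_new_problems(total_problems, p, k) >= threshold:
        k += 1
    return k


# 600 problems available, p = 0.029, threshold of five new problems.
print(first_participant_below(5, 600, 0.029))
```

Under this simplified model the crossover falls in the mid-40s; the exact participant depends on how p is rounded.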
Their discovery of serious problems with later users is consistent with Lewis [12], which
failed to replicate the early discovery of serious problems reported by Virzi [22].
The low incidence of repeat usability problems is also consistent with low values of p. A
high incidence of repeat usability problems is more likely with evaluations of early designs
than evaluations of more mature designs. Usability testing of designs that have already had
common usability problems removed is likely to uncover problems that are relatively idiosyncratic, which seems to have been the case with this study. Also, as the authors report,
the tasks given to participants were somewhat unstructured, which could have expanded
the space of problems available for discovery.
Their primary conclusion—that five or eight users aren’t enough to discover 85 percent
of the problems available for discovery when p = 0.029—is well founded. On the other
hand, even with this extremely low value of p, the expected percentage discovered with
eight participants is about 21 percent, which is certainly better than not running any participants at all. When p is this small, if the goal is to discover 85 percent of the problems
available for discovery, then the required sample size is about 65. If the goal is to discover 99 percent (“all”) of the 600 problems, then the required sample size is about 157.
What we don’t know from this study is how likely it is to have such a low value of p. The
authors surmised that this might be a characteristic of usability studies of Web sites, but it
could also be a function of the testing method or the level of description of usability problems. Regardless, this example illustrates the importance of computing an early estimate of
p and making an explicit decision about the desired percentage of problem discovery as
integral steps for rationally determining the required sample size.
DISCUSSION. We know a lot more about how to estimate required sample sizes for
usability problem-discovery tests than we did 25 years ago, but I don’t believe that this
knowledge is very prevalent throughout the usability testing community, nor is it widely
taught to graduate students. I hope that recent publications [14, 20] will change the current situation.
There will, of course, continue to be discussions about sample sizes for problem-discovery usability tests, but I hope they will be informed discussions. If a practitioner says
that five participants are all you need to discover most of the problems that will occur in
a usability test, it’s likely that this practitioner is typically working in contexts that have a
fairly high value of p and fairly low problem discovery goals. If another practitioner says
that he’s been running a study for three months, has observed 50 participants, and is
continuing to discover new problems every few participants, then it’s likely that he has a
somewhat lower value of p, a higher problem discovery goal, and lots of cash (or a low-cost audience of participants). Neither practitioner is necessarily wrong; they’re just
working in different usability testing spaces. The formulas developed over the past 25
years provide a principled way to understand the relationship between those spaces, and
a better way for practitioners to routinely estimate sample-size requirements for these
types of tests.
SOLUTION TO THE EXERCISE
The Good-Turing adjustment is 0.375/(1 + 2/4) = 0.25. The normalization adjustment is (0.375 - 1/8)(1 - 1/8) = 0.22. Their average, the adjusted estimate of p, is 0.235, a little smaller than the adjusted value at n = 4. The corresponding adjusted estimate for n is log(1-0.90)/log(1-0.235) = log(0.1)/log(0.765) = (-1)/(-0.116) = 8.6, which rounds up to 9. The hypothetical practitioner might consider running one more participant, given the resources to do so. If not, the practitioner can assess the adequacy of the sample size by using the basic formula $1-(1-p)^n$. The estimated proportion of problems discovered is $1-(1-0.235)^8$, which is 0.88 (88 percent), only a little short of the goal of 90 percent.
ABOUT THE AUTHOR Jim Lewis has been a usability practitioner at IBM since 1981,
working primarily on input methods (especially speech input) and usability evaluation.
He studied engineering psychology and applied statistics at New Mexico State
University (MA, 1982) and psycholinguistics at Florida Atlantic University (PhD, 1996).
He has written several papers on standardized usability questionnaires and sample-size determination
and recently wrote the usability testing chapter for the third edition of the Handbook of Human
Factors and Ergonomics.
Permission to make digital or hard copies of all or part of this
work for personal or classroom use is granted without the fee,
provided that copies are not made or distributed for profit or
commercial advantage, and that copies bear this notice and the
full citation on the first page. To copy otherwise, to republish, to
post on services or to redistribute to lists, requires prior specific
permission and/or a fee. © ACM 1072-5220/06/1100 $5.00