What’s in a Number?
By Carol M. Barnum
Whereas 7 (plus or minus 2) is the mantra for structured writing and other methods for organizing information, 5 (plus or minus 2) is the mantra for the number of participants needed in a usability test.
Recent articles have looked at what Miller, who introduced the research on short-term memory, really meant by the number 7 plus or minus 2 (Doumont 2002; Kalbach 2002), and a similar re-examination is now under way regarding the viability of applying the number 5 to Web usability testing. Two widely publicized usability studies of Web users, one directed by Rolf Molich and the other by Jared Spool, are fueling the discussion. At the most recent meetings of CHI and UPA, panels addressed this specific topic, and the first question directed to Jakob Nielsen at the CHI session entitled "Ask Jakob" was, "How many users does it take?"
Knowing something about the research studies and the issues raised gives you the ammunition to decide where you stand. So, here’s a brief overview of what the controversy is based on, and, if you want to learn more, you can read the whole story in the original sources.
Where Does the "Magic Number 5" Come From?
The "Magic Number 5" (the claim that five participants will yield 80% of the findings from a usability test) comes from research conducted in the 1990s by Nielsen, Virzi, Lewis, and other human factors engineers. The one article that brings all of this work to bear on the bottom line, however, is Nielsen’s "Guerilla HCI: Using Discount Usability Engineering to Penetrate the Intimidation Barrier," which appeared in Cost-Justifying Usability. Both the title of the article and the title of the book say it all. Nielsen’s model showing the number of users required for the maximum cost-benefit ratio in a usability test provides the evidence needed to convince our managers that a cost-effective usability test can be added to the development timeline without slipping the schedule and without costing much money. Such evidence has propelled usability out of the box of "scientific experiment" and into the arena of diagnostic and exploratory research.
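For readers who want to see the arithmetic behind that kind of cost-benefit curve, here is a minimal sketch in Python. It assumes the simple problem-discovery model usually associated with this line of research: each participant independently encounters any given problem with probability p, so n participants are expected to reveal 1 - (1 - p)^n of the problems. The function name and the value p = 0.31 are illustrative assumptions (0.31 is the average detection rate Nielsen reports in his Alertbox column, not a constant that holds for every product).

```python
def proportion_found(n_users: int, p: float) -> float:
    """Expected share of problems seen at least once by n_users.

    Assumes every user independently hits any given problem with the
    same probability p (a simplification, not a law of nature).
    """
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return 1.0 - (1.0 - p) ** n_users


# Assumed average detection rate; real products may sit well above or below it.
ASSUMED_P = 0.31

if __name__ == "__main__":
    for n in range(1, 11):
        share = proportion_found(n, ASSUMED_P)
        print(f"{n:2d} users -> {share:.0%} of problems (under the assumed p)")
```

Under this assumed p, five users land at roughly 84%, in the neighborhood of the 80% figure quoted above, and each additional user adds less than the one before.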
Along Come the Challengers
Of late, however, our confidence in this magic number 5 has eroded when it comes to usability testing of Web sites, especially large, commercial Web sites. The first to challenge assumptions about the applicability of the number 5 to Web testing was Rolf Molich, who organized a comparative evaluation in which nine independent usability teams evaluated Hotmail, with each team presenting its findings in a report. Based on an analysis of the reports, Molich and his colleagues catalogued 300 problems, 75% of which were reported by only one team. Thus, Molich et al. concluded that it would take many more than five users to uncover all the issues with a Web-based product.
At about the same time, Jared Spool and colleagues at User Interface Engineering (UIE) were reporting their findings from testing large e-commerce sites, in which users were asked to purchase music CDs. Having expected to find overlapping results after only a few users, Spool reports the team’s amazement at seeing 247 problems identified by 18 users, with major findings being identified by each new user. On the basis of these observations, Spool projects that it would take 90 users to identify 600 problems.
So, where does that leave us? Are we no longer able to apply the discount model when we’re testing Web sites? If so, will that mean that companies with limited budgets will forgo testing altogether and just hope for the best? Before we jump to that conclusion (which some have done already), let’s look a bit more closely at the research, both old and new.
What Did Nielsen, Virzi, and Lewis Really Say?
A closer examination of the findings from the original researchers, chiefly Nielsen, Virzi, and Lewis, sheds light on the issues that can help us understand the challenges posed by the more recent studies. Here’s the gist of each person’s contribution to the question of how many users it takes:
What Method Did CUE-2 and UIE Use?
The Hotmail study led by Rolf Molich in 1998-99 is called CUE-2 (Comparative Usability Evaluation, number 2). CUE-1, conducted by Molich the previous year, was a comparative evaluation of a calendar program by four independent laboratories. Because of criticism of the unstructured methodology used in CUE-1, CUE-2 was meant to address the lack of a common scenario and the lack of access to the client. However, as I was the sponsor of one of the teams participating in CUE-2, I know that in reality the teams did not have access to the client (rather, a member of the CUE advisory board answered our team’s questions). Also, the mislabeled "scenario" provided by the test sponsors turned out to be a lengthy list of questions and issues, which resulted in the separate labs doing very few similar tasks in their testing. Approaches varied as well, and so did the reports, which were the mechanism used by the CUE advisory board to correlate the findings. I am not surprised to see little commonality among the nine teams’ findings because the tasks and scenarios, user profiles, and testing methods were so dissimilar. For instance, almost half of the 57 tasks tested were used by only one team.
In the study led by Jared Spool of UIE, we don’t know whether there was a specific scenario used or a specific user profile. All we know is that users were selected on the basis of having experience making music purchases online. Users were asked to make purchases from a wish list they brought to the test. Because the users may have been very dissimilar in their overall experience levels as well as any actual experience they may have had with the Web sites being tested, and especially because users may have been visiting completely different areas of the Web sites being tested, it is not surprising that all 18 users kept uncovering new problems.
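One hedged way to picture why the count can climb when users and tasks are so dissimilar: under the simple discovery model often used in this literature, where each user independently hits a given problem with probability p, the number of users needed to reach a target share of problems is ln(1 - target) / ln(1 - p), rounded up. The sketch below is illustrative only; the function name and the p values are hypothetical, not figures taken from CUE-2 or the UIE study.

```python
import math


def users_needed(target: float, p: float) -> int:
    """Smallest number of users expected to reveal `target` share of problems.

    Assumes each user independently encounters any given problem with
    probability p; both arguments must lie strictly between 0 and 1.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # Solve 1 - (1 - p)**n >= target for the smallest whole n.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    # Hypothetical detection rates, from a tightly scoped test to a sprawling one.
    for p in (0.31, 0.10, 0.05):
        print(f"p = {p:.2f}: {users_needed(0.80, p)} users for 80% of problems")
```

With p = 0.31, five users reach the 80% target; if users wander into different corners of a large site and p drops to 0.05, the same target takes 32 users under these assumptions.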
What Conclusions Can We Draw From These Studies?
Is it logical to conclude, as Molich and Spool have done, that the "magic number 5" does not work for Web testing? Or do we need to examine the two studies to see how well they match the approach outlined by the early researchers?
Here are two observations:
Nielsen repeatedly stresses (www.alertbox.com, 2000) that a small subset of the user population is required to correlate findings. So, when testing a large Web site, Nielsen says that you may need to test several subsets of three or four users and then compare the findings. He also stresses that specific scenarios should be used to place users in the same areas of the product. In addition, Lewis observed that mature products with good usability may require more users. Hotmail clearly falls into that category of a mature product and, although we don’t know the four sites Spool and colleagues used, we do know that one was Amazon.com.
Finally, we need to remember that the early researchers viewed the discount model as a diagnostic tool to uncover problems while products are in development. The objective of the "discount" model is to observe a few users, then apply this learning to the ongoing development of the product, then test again to see if you got it right.
At the same time, we should not be lulled into relying on only one method for learning about users. Usability testing, when combined with other methods for gathering data about users, provides a rich pool of information from which to develop user-centered products.
So, the next time you hear someone question or challenge the "magic number 5" when it comes to Web testing, you might consider sharing the research findings with them. A little bit of knowledge can go a long way toward keeping usability testing affordable and effective.
Doumont, J.-L. (2002). Magical numbers: The seven-plus-or-minus-two myth. IEEE Transactions on Professional Communication, 45 (2), 123-127.
Kalbach, J. (2002). The myth of "seven, plus or minus 2." Webreview. Retrieved June 11, 2002, from www.webreview.com/2002/01_14/strategists/index01.shtml.
Lewis, J.R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors 36, 368-378.
Miller, G.A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63 (2), 81-97.
Molich, R., et al. (1999). CUE-2. Comparative usability evaluation-2. Retrieved May 1, 2002, from www.dialogdesign.dk/cue/html.
Nielsen, J. (1989). Usability engineering at a discount. In G. Salvendy & M.J. Smith (Eds.), Using human-computer interfaces and knowledge-based systems. (pp. 394-401). Amsterdam: Elsevier.
Nielsen, J. (1990). Evaluating the think-aloud technique for use by computer scientists. In H. Hartson & D. Hix (Eds.), Advances in human computer interaction, 2 (pp. 69-82). Norwood, NJ: Ablex.
Nielsen, J. (1994). Guerilla HCI: Using discount usability engineering to penetrate the intimidation barrier. In R.G. Bias & D.J. Mayhew (Eds.), Cost-justifying usability. (pp. 242-272). Boston: Academic Press.
Nielsen, J. (2000). Why you only need to test with 5 users. Alertbox. Retrieved May 1, 2002, from www.useit.com/alertbox/2000319.html.
Spool, J. & Schroeder, W. (2001). Testing web sites: Five users is nowhere near enough. Extended abstracts of CHI 2001, 285-286.
Virzi, R. (1990). Streamlining the design process: Running fewer subjects. Proceedings of the Human Factors Society 34th annual meeting, 1 (pp. 291-294). Orlando, FL.
Virzi, R. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors 34, 457-486.