Video Gaming
Refer to the Shute and Ventura reading.

What is the difference between measurement and assessment? How are both of these concepts applied to learning?
Briefly describe the ECD framework and its purpose.
The reading discusses some of the challenges in using games for the purposes of assessment. Choose one of the challenges mentioned and explain how it manifests in the
game you have chosen to study this semester.
Refer to the D’Anastasio reading.

What is the premise of the article?
We have played a number of games in class which asked us to identify with marginalized characters. Based on your own experience, do you agree with the premise of the
article or not? Are there ethical implications to attempting to convey empathy through games? What are some potential consequences of doing so (both positive and
negative)?
Reflect on your experience in the course. (1 paragraph)

What has been your most valuable take-away?
What worked well for you in the course? What would you change to improve your learning experience?

Respond to at least 2 peers.

Please use simple grammar and vocabulary.

https://motherboard.vice.com/en_us/article/mgbwpv/empathy-games-dont-exist

Stealth Assessment: Measuring and Supporting Learning in Video Games
Valerie Shute and Matthew Ventura
The John D. and Catherine T. MacArthur Foundation Reports on Digital Media and Learning
To succeed in today’s interconnected and complex world, workers need to be able
to think systemically, creatively, and critically. Equipping K–16 students with these
twenty-first-century competencies requires new thinking not only about what should
be taught in school but also about how to develop valid assessments to measure and
support these competencies. In Stealth Assessment, Valerie Shute and Matthew Ventura
investigate an approach that embeds performance-based assessments in digital
games. They argue that using well-designed games as vehicles to assess and support
learning will help combat students’ growing disengagement from school; provide
dynamic and ongoing measures of learning processes and outcomes; and offer students
opportunities to apply such complex competencies as creativity, problem solving,
persistence, and collaboration. Embedding assessments within games provides
a way to monitor players’ progress toward targeted competencies and to use that
information to support learning.
Shute and Ventura discuss problems with such traditional assessment methods
as multiple-choice questions, review evidence relating to digital games and learning,
and illustrate the stealth-assessment approach with a set of assessments they are
developing and embedding in the digital game Newton’s Playground. These stealth
assessments are intended to measure levels of creativity, persistence, and conceptual
understanding of Newtonian physics during game play. Finally, they consider future
research directions related to stealth assessment in education.
Valerie Shute is Professor of Educational Psychology and Learning Systems at Florida
State University. Matthew Ventura is a Research Scientist at Florida State University.
This report was made possible by grants from the John D. and Catherine
T. MacArthur Foundation in connection with its grant making initiative
on Digital Media and Learning. For more information on the initiative
visit http://www.macfound.org.
The John D. and Catherine T. MacArthur Foundation Reports on
Digital Media and Learning
Peer Participation and Software: What Mozilla Has to Teach Government, by
David R. Booth
Kids and Credibility: An Empirical Examination of Youth, Digital Media Use,
and Information Credibility, by Andrew J. Flanagin and Miriam Metzger
with Ethan Hartsell, Alex Markov, Ryan Medders, Rebekah Pure, and
Elisia Choi
New Digital Media and Learning as an Emerging Area and “Worked Examples”
as One Way Forward, by James Paul Gee
Digital Media and Technology in Afterschool Programs, Libraries, and Museums,
by Becky Herr-Stephenson, Diana Rhoten, Dan Perkel, and Christo
Sims with contributions from Anne Balsamo, Maura Klosterman, and
Susana Smith Bautista
Quest to Learn: Developing the School for Digital Kids, by Katie Salen, Robert
Torres, Loretta Wolozin, Rebecca Rufo-Tepper, and Arana Shapiro
Measuring What Matters Most: Choice-Based Assessments for the Digital Age,
by Daniel L. Schwartz and Dylan Arena
Learning at Not-School? A Review of Study, Theory, and Advocacy for Education
in Non-Formal Settings, by Julian Sefton-Green
Stealth Assessment: Measuring and Supporting Learning in Video Games, by
Valerie Shute and Matthew Ventura
The Future of the Curriculum: School Knowledge in the Digital Age, by Ben
Williamson
For a complete list of titles in this series, see http://mitpress.mit.edu/
books/series/john-d-and-catherine-t-macarthur-foundation-reports
-digital-media-and-learning.
© 2013 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form
by any electronic or mechanical means (including photocopying, recording,
or information storage and retrieval) without permission in
writing from the publisher.
Library of Congress Cataloging-in-Publication Data
Shute, Valerie J. (Valerie Jean), 1953– , author.
Stealth assessment : measuring and supporting learning in video games
/ Valerie Shute and Matthew Ventura.
pages cm. — (The John D. and Catherine T. MacArthur Foundation
reports on digital media and learning)
Includes bibliographical references.
ISBN 978-0-262-51881-9 (pbk. : alk. paper)
1. Educational tests and measurements. 2. Video games. I. Ventura,
Matthew, author. II. Title.
LB3051.S518 2013
371.26—dc23
2012038217
Contents
Series Foreword
Acknowledgments
Education in the Twenty-First Century
Problems with Current Assessments
Assessment Writ Large
Traditional Classroom Assessments Are Detached Events
Traditional Classroom Assessments Rarely Influence Learning
Traditional Assessment and Validity Issues
Digital Games, Assessment, and Learning
Evidence of Learning from Games
Assessment in Games
Stealth Assessment
Stealth Assessment in Newton’s Playground
Conscientiousness Review and Competency Model
Creativity Review and Competency Model
Conceptual Physics Review and Competency Model
Relation of Physics Indicators to Conscientiousness and Creativity Indicators
Newton’s Playground Study Procedure
Discussion and Future Research
Appendixes
Appendix 1: Full Physics Competency Model
Appendix 2: External Measures to Validate Stealth Assessments
References
Series Foreword
The John D. and Catherine T. MacArthur Foundation Reports
on Digital Media and Learning, published by the MIT Press in
collaboration with the Monterey Institute for Technology and
Education (MITE), present findings from current research on
how young people learn, play, socialize, and participate in civic
life. The reports result from research projects funded by the MacArthur
Foundation as part of its fifty million dollar initiative
in digital media and learning. They are published openly online
(as well as in print) in order to support broad dissemination and
stimulate further research in the field.

Acknowledgments
We would like to sincerely thank the Bill and Melinda Gates
Foundation for its funding for this project, particularly Emily
Dalton-Smith, Robert Torres, and Ed Dieterle. We would also
like to express our appreciation to the other members of the
research grant team—Yoon Jeon Kim, Don Franceschetti, Russell
Almond, Matt Small, and Lubin Wang—for their awesome and
abundant support on the project, and Lance King, who came up
with the “agents of force and motion” idea. Finally, we acknowledge
Diego Zapata-Rivera for ongoing substantive conversations
with us on the topic of stealth assessment.

Education in the Twenty-First Century
You can discover more about a person in an hour of play than in a year
of conversation.
—Plato
In the first half of the twentieth century, a person who acquired
basic reading, writing, and math skills was considered to be sufficiently
literate to enter the work force (Kliebard 1987). The
goal back then was to prepare young people as service workers,
because 90 percent of the students were not expected to seek or
hold professional careers (see Shute 2007). With the emergence
of the Internet, however, the world has become more interconnected,
effectively smaller, and more complex than before (Friedman
2005). Developed countries now rely on their knowledge
workers to deal with an array of complex problems, many with
global ramifications (e.g., climate change or renewable energy
sources). When confronted by such problems, tomorrow’s workers
need to be able to think systemically, creatively, and critically
(see, e.g., Shute and Torres 2012; Walberg and Stariha 1992).
These skills are a few of what many educators are calling twenty-first-century
(or complex) competencies (see Partnership for the
21st Century 2012; Trilling and Fadel 2009).
Preparing K–16 students to succeed in the twenty-first century
requires fresh thinking about what knowledge and skills (i.e.,
what we call competencies) should be taught in our nation’s
schools. In addition, there’s a need to design and develop valid
assessments to measure and support these competencies. Except
in rare instances, our current education system neither teaches
nor assesses these new competencies despite a growing body of
research showing that competencies, such as persistence, creativity,
self-efficacy, openness, and teamwork (to name a few),
can substantially impact student academic achievement (Noftle
and Robins 2007; O’Connor and Paunonen 2007; Poropat
2009; Sternberg 2006; Trapmann et al. 2007). Furthermore, the
methods of assessment are often too simplified, abstract, and
decontextualized to suit current education needs. Our current
assessments in many cases fail to assess what students actually
can do with the knowledge and skills learned in school (Shute
2009). What we need are new performance-based assessments
that assess how students use knowledge and skills that are
directly relevant for use in the real world.
One challenge with developing a performance-based measure
is crafting appropriate situations or problems to elicit a
competency of interest. A way to approach this problem is to
use digital learning environments to simulate problems for performance-based
assessment (Dede 2005; DiCerbo and Behrens
2012; Quellmalz et al. 2012). Digital learning environments can
provide meaningful assessment environments by supplying students
with scenarios that require the application of various competencies.
This report introduces a variant of this assessment
approach by investigating how performance-based assessments
can be used in digital games. Specifically, we are interested in
how assessment in games can be used to enhance learning (i.e.,
formative assessment).
For example, consider role-playing games (e.g., World of Warcraft).
In these games, players must read lengthy and complex
quest logs that tell them the goals. Without comprehending
these quest instructions, the players would not be able to know
how to proceed and succeed in the game. This seemingly simple
task in role-playing games is, in fact, an authentic, situated
assessment of reading comprehension. Without these situated
and meaningful assessments, we cannot determine what students
can actually do with the skills and knowledge obtained.
Thus new, embedded, authentic types of assessment methods
are needed to properly assess valued competencies.
Why use well-designed games as vehicles to assess and support
learning? There are several reasons. First, as our schools
have remained virtually unchanged for many decades while
our world is changing rapidly, we are seeing a growing number
of disengaged students. This disengagement increases the
chances of students dropping out of school. For instance, high
dropout rates, especially among Hispanic, black, and Native
American students, were described as “the silent epidemic” in
a recent research report for the Bill and Melinda Gates Foundation
(Bridgeland, DiIulio, and Morison 2006). According to
this report, nearly one-third of all public high school students
drop out, and the rate is higher for minority students. In the
report, when 467 high school dropouts were asked why they left
school, 47 percent of them simply responded, “The classes were
not interesting.” We need to find ways (e.g., well-designed digital
games and other immersive environments) to get our kids
engaged, support their learning, and allow them to contribute
fruitfully to society.
A second reason for using games as assessments is a pressing
need for dynamic, ongoing measures of learning processes
and outcomes. An interest in alternative forms of assessment is
driven by dissatisfaction with and the limitations of multiple-choice
items. In the 1990s, an interest in alternative forms of
assessment increased with the popularization of what became
known as authentic assessment. A number of researchers found
that multiple-choice and other fixed-response formats substantially
narrowed school curricula by emphasizing basic content
knowledge and skills within subjects, and not assessing higher-order
thinking skills (see, e.g., Kellaghan and Madaus 1991;
Shepard 1991). As George Madaus and Laura O’Dwyer (1999)
argued, though, incorporating performance assessments into
testing programs is hard because they are less efficient, more difficult
and disruptive to administer, and more time consuming
than multiple-choice testing programs. Consequently, multiple
choice has remained the dominant format in most K–12 assessments
in our country. New performance assessments are needed
that are valid, reliable, and automated in terms of scoring.
A third reason for using games as assessment vehicles is that
many of them typically require a player to apply various competencies
(e.g., creativity, problem solving, persistence, and collaboration)
to succeed in the game. The competencies required
to succeed in many games also happen to be the same ones that
companies are looking for in today’s highly competitive economy
(Gee, Hull, and Lankshear 1996). Moreover, games are a
significant and ubiquitous part of young people’s lives. The Pew
Internet and American Life Project, for instance, surveyed 1,102
youths between the ages of twelve and seventeen. They reported
that 97 percent of youths—both boys (99 percent) and girls (94
percent)—play some type of digital game (Lenhart et al. 2008).
Additionally, Mizuko Ito and her colleagues (2010) found that
playing digital games with friends and family is a large as well as
normal part of the daily lives of youths. They further observed
that playing digital games is not solely for entertainment purposes.
In fact, many youths participate in online discussion
forums to share their knowledge and skills about a game with
other players, or seek help on challenges when needed.
In addition to the arguments for using games as assessment
devices, there is growing evidence of games supporting learning
(see, e.g., Tobias and Fletcher 2011; Wilson et al. 2009). Yet we
need to understand more precisely how as well as what kinds
of knowledge and skills are being acquired. Understanding the
relationships between games and learning is complicated by the
fact that we don’t want to disrupt players’ engagement levels
during gameplay. As a result, learning in games has historically
been assessed indirectly and/or in a post hoc manner (Shute and
Ke 2012; Tobias et al. 2011). What’s needed instead is real-time
assessment and support of learning based on the dynamic needs
of players. We need to be able to experimentally ascertain the
degree to which games can support learning, and how and why
they achieve this objective.
This book presents the theoretical foundations of and research
methodologies for designing, developing, and evaluating stealth
assessments in digital games. Generally, stealth assessments are
embedded deeply within games to unobtrusively, accurately,
and dynamically measure how players are progressing relative to
targeted competencies (Shute 2011; Shute, Ventura, et al. 2009).
Embedding assessments within games provides a way to monitor
a player’s current level on valued competencies, and then use
that information as the basis for support, such as adjusting the
difficulty level of challenges or providing timely feedback. The
term and technologies of stealth assessment are not intended to
convey any type of deception but rather to reflect the invisible
capture of gameplay data, and the subsequent formative use of
the information to help learners (and ideally, help learners to
help themselves).
There are four main sections in this report. First, we discuss
problems with existing traditional assessments. We then review
evidence relating to digital games and learning. Third, we define
and then illustrate our stealth assessment approach with a set
of assessments that we are currently developing and embedding
in a digital game (Newton’s Playground). The stealth assessments
are intended to measure the levels of creativity, persistence, and
conceptual understanding of Newtonian physics during gameplay.
Finally, we discuss future research and issues related to
stealth assessment in education.
Problems with Current Assessments
Our country’s current approach to assessing students (K–16) has
a lot of room for improvement at the classroom and high-stakes
levels. This is especially true in terms of the lack of support that
standardized, summative assessments provide for students learning
new knowledge, skills, and dispositions that are important to
succeed in today’s complex world. The current means of assessing
students infrequently (e.g., at the end of a unit or school
year for grading and promotion purposes) can cause various
unintended consequences, such as increasing the dropout rate
given the out-of-context and often irrelevant test-preparation
teaching contexts that the current assessment system frequently
promotes.
The goal of an ideal assessment policy/process should be to
provide valid, reliable, and actionable information about students’
learning and growth that allows stakeholders (e.g., students,
teachers, administrators, and parents) to utilize the
information in meaningful ways. Before describing particular
problems associated with current assessment practices, we first
offer a brief overview of assessment.
Assessment Writ Large
People often confound the concepts of measurement and assessment.
Whenever you need to measure something accurately, you
use an appropriate tool to determine how tall, short, hot, cold,
fast, or slow something is. We measure to obtain information
(data), which may or may not be useful, depending on the accuracy
of the tools we use as well as our skill at using them. Measuring
things like a person’s height, a room’s temperature, or a car’s
speed is technically not an assessment but rather the collection
of information relative to an established standard (Shute 2009).
Educational Measurement
Educational measurement refers to the application of a measuring
tool (or standard scale) to determine the degree to which
important knowledge, skills, and other attributes have been or
are being acquired. It involves the collection and analysis of
learner data. According to the National Council on Measurement
in Education’s Web site, this includes the theory, techniques,
and instrumentation available for the measurement of
educationally relevant human, institutional, and social characteristics.
A test is education’s equivalent of a ruler, thermometer,
or radar gun. But a test does not typically improve learning any
more than a thermometer cures a fever; both are simply tools.
Moreover, as Catherine Snow and Jacqueline Jones (2001) point
out, tests alone cannot enhance educational outcomes. Rather,
tests can guide improvement (given that they are valid and reliable)
if they motivate adjustments to the educational system
(i.e., provide the basis for bolstering curricula, ensure support
for struggling learners, guide professional development opportunities,
and distribute limited resources fairly).
Again, we measure things in order to get information, which
may be quantitative or qualitative. How we choose to use the data
is a different matter. For instance, back in the early 1900s, students’
abilities and intelligence were extensively measured. Yet
this wasn’t done to help them learn better or otherwise progress.
Instead, the main purpose of testing was to track students into
appropriate paths, with the understanding that their aptitudes
were inherently fixed. A dominant belief during that period was
that intelligence was part of a person’s genetic makeup, and thus
testing was aimed specifically at efficiently assigning students
into high, middle, or low educational tracks according to their
supposedly innate mental abilities (Terman 1916). In general,
there was a fundamental shift to practical education going on in
the country during the early 1900s, countering “wasted time” in
schools while abandoning the classics as useless and inefficient
for the masses (Shute 2007). Early educational researchers and
administrators inserted the metaphor of the school as a “factory”
into the national educational discourse (Kliebard 1987).
The metaphor has persisted to this day.
Assessment
Assessment involves more than just measurement. In addition
to systematically collecting and analyzing information (i.e.,
measurement), it also involves interpreting and acting on information
about learners’ understanding and/or performance relative
to educational goals. Measurement can be viewed as a subset
of assessment.
As mentioned earlier, assessment information can be used
by a variety of stakeholders and for an array of purposes (e.g.,
to help improve learning outcomes, programs, and services as
well as to establish accountability). There is also an assortment
of procedures associated with the different purposes. For example,
if your goal was to enhance an individual’s learning, and
you wanted to determine that individual’s progress toward an
educational goal, you could administer a quiz, view a portfolio
of the student’s work, ask the student (or peers) to evaluate progress,
watch the person solve a complex task, review lab reports or
journal entries, and so on.
In addition to having different purposes and procedures for
obtaining information, assessments may be differentially referenced
or interpreted—for instance, in relation to normative data
or a criterion. Norm-referenced interpretation compares learner
data to that of other individuals or a larger group, but can also
involve comparisons to oneself (e.g., asking people how they are
feeling and getting a “better than usual” response is a norm-referenced
interpretation). The purpose of norm-referenced interpretation
is to establish what is typical or reasonable. On the
other hand, criterion-referenced interpretation involves establishing
what a person can or cannot do, or typically does or does
not do—specifically in relation to a criterion. If the purpose of
the assessment is to support personal learning, then criterion-referenced
interpretation is required (for more, see Nitko 1980).
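To make the distinction concrete, here is a minimal sketch in Python; the scores, comparison group, and cut score are invented purely for illustration.

```python
# Minimal sketch: the same raw score interpreted two ways.
# The scores, norm group, and cut score below are invented for illustration.

norm_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]  # scores from a comparison group
cut_score = 24                                          # criterion: "can do X" threshold
student_score = 26

# Norm-referenced: where does the student stand relative to others?
percentile = 100 * sum(s <= student_score for s in norm_group) / len(norm_group)
print(f"Norm-referenced: at or above {percentile:.0f}% of the comparison group")

# Criterion-referenced: has the student met the criterion, regardless of others?
meets_criterion = student_score >= cut_score
print(f"Criterion-referenced: criterion met = {meets_criterion}")
```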
This overview of assessment is intended to provide a foundation
for the next section, where we examine specific problems
surrounding current assessment practices.
Traditional Classroom Assessments Are Detached Events
Current approaches to assessment are usually divorced from
learning. That is, the typical educational cycle is: teach; stop;
administer test; go loop (with new content). But consider the following
metaphor representing an important shift that occurred
in the world of retail outlets (from small businesses to supermarkets
to department stores), suggested by James Pellegrino, Naomi
Chudowsky, and Robert Glaser (2001, 284). No longer do these
businesses have to close down once or twice a year to take inventory
of their stock. Rather, with the advent of automated checkout
and bar codes for all items, these businesses have access to
a continuous stream of information that can be used to monitor
inventory and the flow of items. Not only can a business
continue without interruption; the information obtained is also
far richer than before, enabling stores to monitor trends and
aggregate the data into various kinds of summaries as well as
to support real-time, just-in-time inventory management. Similarly,
with new assessment technologies, schools should no longer
have to interrupt the normal instructional process at various
times during the year to administer external tests to students.
Assessment instead should be continual and invisible to students,
supporting real-time, just-in-time instruction (for more,
see Shute, Levy, et al. 2009).
Traditional Classroom Assessments Rarely Influence Learning
Many of today’s classroom assessments don’t support deep
learning or the acquisition of complex competencies. Current
classroom assessments (referred to as “assessments of learning”)
are typically designed to judge a student (or group of students)
at a single point in time, without providing diagnostic
support to students or diagnostic information to teachers. Alternatively,
assessments (particularly “assessments for learning”)
can be used to: support the learning process for students and
teachers; interpret information about understanding and/or performance
regarding educational goals (local to the curriculum,
and broader to the state or common core standards); provide
formative compared to summative information (e.g., give useful
feedback during the learning process rather than a single judgment
at the end); and be responsive to what’s known about how
people learn—generally and developmentally.
To illustrate how a classroom assessment may be used to support
learning, Valerie Shute, Eric Hansen, and Russell Almond
(2008) conducted a study to evaluate the efficacy of an assessment
for learning system named ACED (for “adaptive content
with evidence-based diagnosis”). They used an evidence-centered
design approach (Mislevy, Steinberg, and Almond 2003) to create
an adaptive, diagnostic assessment system that also included
instructional support in the form of elaborated feedback. The key
issue examined was whether including the feedback in the system
would impair the quality of the assessment (in terms of validity,
reliability, and efficiency) and whether it would in fact enhance student learning.
Results from a controlled evaluation testing 268 high-school
students showed that the quality of the assessment was unimpaired
by the provision of feedback. Moreover, students using the
ACED system showed significantly greater learning of the content
(geometric sequences) compared with a control group (i.e., students
using the system but without elaborated feedback—just correct/incorrect
feedback). These findings suggest that assessments
in other settings (e.g., state-mandated tests) can be augmented
to support student learning with instructional feedback without
jeopardizing the primary purpose of the assessment.
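As a rough illustration of the kind of loop such an assessment-for-learning system runs, here is a minimal sketch in Python. The items, difficulty values, update rule, and feedback text are invented; the actual ACED system rests on an evidence-centered design with far more sophisticated statistical machinery, which this sketch does not reproduce.

```python
# Minimal sketch of an adaptive assessment-for-learning loop in the spirit of ACED.
# Item difficulties, the update rule, and feedback text are invented for illustration.

items = [
    {"id": "geo1", "difficulty": -1.0, "feedback": "Recall: each term is the previous term times r."},
    {"id": "geo2", "difficulty":  0.0, "feedback": "Write the first few terms and look for the ratio."},
    {"id": "geo3", "difficulty":  1.0, "feedback": "Use a_n = a_1 * r**(n - 1)."},
]

def run_assessment(answer_fn, ability=0.0, step=0.5):
    """Administer each item, adapting a scalar ability estimate after each response."""
    remaining = list(items)
    while remaining:
        # Select the item whose difficulty is closest to the current ability estimate.
        item = min(remaining, key=lambda it: abs(it["difficulty"] - ability))
        remaining.remove(item)
        correct = answer_fn(item)
        ability += step if correct else -step
        if not correct:
            print(f"Elaborated feedback for {item['id']}: {item['feedback']}")
    return ability

# Example run with a simulated student who answers the easier items correctly.
final = run_assessment(lambda it: it["difficulty"] <= 0.0)
print("Final ability estimate:", final)
```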
Traditional Assessment and Validity Issues
Assessments are typically evaluated under two broad categories:
reliability and validity. Reliability is the most basic requirement
for an assessment and is concerned with the degree to which a
test can consistently measure some attribute over similar conditions.
In assessment, reliability is seen, for example, when a
person scores really high on an algebra test at one point in time
and then scores similarly on a comparable test the next day. In
order to achieve high reliability, assessment tasks are simplified
to independent pieces of evidence that can be modeled by existing
measurement models.
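As a minimal numerical illustration (with invented scores), test-retest reliability is commonly summarized as the correlation between two administrations of comparable forms:

```python
# Minimal sketch: test-retest reliability as a correlation between two administrations.
# Scores are invented for illustration.
from statistics import correlation  # available in Python 3.10+

day_one = [88, 92, 75, 60, 95, 70]
day_two = [85, 94, 78, 58, 93, 72]  # comparable form, next day

print(f"Test-retest reliability (Pearson r): {correlation(day_one, day_two):.2f}")
```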
An interesting issue is how far this simplification process can
go without negatively influencing the validity of the test. That
is, in order to remove any possible source of construct-irrelevant
variance and dependencies, tasks can end up looking like
decontextualized, discrete pieces of evidence. In the process of
achieving high reliability, which is important for supporting
high-stakes decision making, other aspects of the test may be
sacrificed (e.g., engagement and some types of validity).
Another aspect that traditional, standardized assessments
emphasize is dealing with operational constraints (e.g., the need
for gathering and scoring sufficient pieces of evidence within a
limited administration time and budget). In fact, many of the
simplifications described above could be explained by this issue
along with the current state of certain measurement models that
do not easily handle complex interactions among tasks, the presence
of feedback, and student learning during the test.
Validity, broadly, refers to the extent to which an assessment
actually measures what it is intended to measure. Here are the
specific validity issues related to traditional assessment.
Face Validity
Face validity states that an assessment should intuitively
“appear” to measure what it is intended to measure. For example,
reading some excerpted paragraphs on an uninteresting topic
and answering multiple-choice questions about it may not be
the best measure for reading comprehension (i.e., it lacks good
face validity). As suggested earlier, students need to be assessed
in meaningful environments rather than filling in bubbles on a
prepared form in response to decontextualized questions. Digital
games can provide such meaningful environments by supplying
students with scenarios that require the application of various
competencies, such as reading comprehension and problem-solving
skill.
Predictive Validity
Predictive validity refers to an assessment predicting future
behavior. Today’s large-scale, standardized assessments are generally
lacking in this area. For example, a recent report from the
College Board found that the SAT only marginally predicted college
success beyond high school GPA at around r = 0.10 (Korbin
et al. 2008). This means that the SAT scores contribute around 1
percent of the unique prediction of college success after controlling
for GPA information. Other research studies have shown
greater incremental validity of noncognitive variables (e.g.,
psychosocial) over SAT and traditional academic indicators like
GPA in predicting college success (see, e.g., Robbins et al. 2004).
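For clarity, the 1 percent figure quoted above follows from squaring the reported correlation, since the unique variance explained is approximately the squared (semipartial) correlation:

\[ \Delta R^{2} \approx r^{2} = (0.10)^{2} = 0.01 = 1\% \]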
Consequential Validity
Consequential validity refers to the effects of a particular assessment
on societal and policy decisions. One negative side effect
of the No Child Left Behind (NCLB 2002) initiative, with its
heavy focus on accountability, has been teachers “teaching to
the test.” That is, when teachers instruct content that is relevant
to answering items on a test but not particularly relevant for
solving real-world problems, this reduces student engagement
in school, and in turn, that can lead to increased dropout rates
(Bridgeland, DiIulio, and Morison 2006). Moreover, the low predictive
validity of current assessments can lead to students not
getting into college due to low scores. But the SAT and similar
test scores are still being used as the main basis for college admission
decisions, which can potentially lead to some students, particularly
disadvantaged youths, missing out on opportunities for fulfilling careers and lives.
To illustrate the contrast between traditional and new performance-based
assessments, consider the attribute of conscientiousness.
Conscientiousness can be broadly defined as the
motivation to work hard despite challenging conditions—a disposition
that has consistently been found to predict academic
achievement from preschool to high school to the postsecondary
level and adulthood (see, e.g., Noftle and Robins 2007;
O’Connor and Paunonen 2007; Roberts et al. 2004). Conscientiousness
measures, like most dispositional measures, are primarily
self-report (e.g., “I work hard no matter how difficult the
task”; “I accomplish my work on time”)—a method of assessment
that is riddled with problems. First, self-report measures
are subject to “social desirability effects” that can lead to false
reports about behavior, attitudes, and beliefs (see Paulhaus
1991). Second, test takers may interpret specific self-report items
differently (e.g., what it means “to work hard”), leading to unreliability
and lower validity (Lanyon and Goodstein 1997). Third,
self-report items often require that individuals have explicit
knowledge of their dispositions (see, e.g., Schmitt 1994), which
is not always the case.
Good games, coupled with evidence-based assessment, show
promise as a vehicle to dynamically measure conscientiousness
and other important competencies more accurately than traditional
approaches (see, e.g., Shute, Masduki, and Donmez 2010).
These evidence-based assessments can record and score multiple
behaviors as well as measurable artifacts in the game that pertain
to particular competencies. For example, various actions that a
player takes within a well-designed game can inform conscientiousness—how
long a person spends on a difficult problem
(where longer equals more persistent), the number of failures and
retries before success, returning to a hard problem after skipping
it, and so on. Each instance of these “conscientiousness indicators”
would update the student model of this variable—and thus
would be up to date and available to view at any time. Additionally,
we posit that good games can provide a gameplay environment
that can potentially improve conscientiousness, because
many problems require players to persevere despite failure and
frustration. That is, many good games can be quite difficult, and
pushing one’s limits is an excellent way to improve persistence,
especially when accompanied by the great sense of satisfaction
one gets on successful completion of a thorny problem (see, e.g.,
Eisenberg 1992; Eisenberg and Leonard 1980). Some students,
however, may not feel engaged or comfortable with games, or
cannot access them. Alternative approaches should be available
for these students.
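A minimal sketch of how gameplay indicators like those described above might roll up into a running student-model estimate is shown below. The indicator names, weights, and update rule are invented for illustration; they are not the authors' actual scoring model, and a real evidence model would specify the statistical link between behaviors and the competency variable far more carefully.

```python
# Minimal sketch: rolling gameplay events up into a running "conscientiousness" estimate.
# Event names, weights, and the update rule are invented for illustration only.

INDICATOR_WEIGHTS = {
    "seconds_on_hard_problem": +0.002,    # longer time on a difficult problem -> more persistent
    "retry_after_failure": +0.05,         # retrying after failure -> more persistent
    "returned_to_skipped_problem": +0.10,
    "abandoned_problem": -0.08,
}

def update_student_model(estimate, events):
    """Nudge the running estimate (kept in [0, 1]) for each observed indicator."""
    for name, count in events.items():
        estimate += INDICATOR_WEIGHTS.get(name, 0.0) * count
    return max(0.0, min(1.0, estimate))

# Example: one level's worth of (invented) gameplay events.
events = {"seconds_on_hard_problem": 90, "retry_after_failure": 3, "abandoned_problem": 1}
print("Updated conscientiousness estimate:", update_student_model(0.5, events))
```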
As can be seen, traditional tests may not fully satisfy various
validity and learning requirements. In the next section we
describe how digital games can be effectively used in education—as
assessment vehicles and to support learning.
Digital Games, Assessment, and Learning
Digital games are popular. For instance, revenues for the digital
game industry reached US $7.2 billion in 2007 (Fullerton
2008), and overall, 72 percent of the population in the United
States plays digital games (Entertainment Software Association
2011). The amount of time spent playing games also continues
to increase (Escobar-Chaves and Anderson 2008). Besides being
a popular activity, playing digital games has been shown to be
positively related to a variety of cognitive skills (on visual-spatial
abilities, e.g., see Green and Bavelier 2007; on attention, e.g., see
Shaw, Grayson, and Lewis 2005), openness to experience (Chory
and Goodboy 2011; Ventura, Shute, and Kim 2012; Witt, Massman,
and Jackson 2011), persistence (i.e., a facet of conscientiousness;
Ventura, Shute, and Zhao, forthcoming), academic
performance (e.g., Skoric, Teo, and Neo 2009; Ventura, Shute,
and Kim 2012), and civic engagement (Ferguson and Garza
2011). Digital games can also motivate students to learn valuable
academic content and skills, within and outside the game
(e.g., Barab, Dodge, et al. 2010; Coller and Scott 2009; DeRouin-Jessen
2008). Finally, studies have shown that playing digital
games can promote prosocial and civic behavior (e.g., Ferguson
and Garza 2011).
As mentioned earlier, learning in games has historically been
assessed indirectly and/or in a post hoc manner (see Shute and Ke
2012). What is required instead is real-time assessment and support
of learning based on the dynamic needs of players. Research
examining digital games and learning is usually conducted using
pretest-game-posttest designs, where the pre- and posttests typically
measure content knowledge. Such traditional assessments
don’t capture and analyze the dynamic, complex performances
that inform twenty-first-century competencies. How can we
both measure and enhance learning in real time? Performance-based
assessments with automated scoring are needed. The main
assumptions underlying this new approach are that: learning by
doing (required in gameplay) improves learning processes and
outcomes; different types of learning and learner attributes may
be verified as well as measured during gameplay; strengths and
weaknesses of the learner may be capitalized on and bolstered,
respectively, to improve learning; and ongoing feedback can be
used to further support student learning.
Evidence of Learning from Games
Below are three examples of learning from educational games.
Preliminary evidence suggests that students can learn deeply
from such games and acquire important twenty-first-century
competencies.
Programming Skills in NIU-Torcs
The game NIU-Torcs (Coller and Scott 2009) requires players to
create control algorithms to make virtual cars execute nimble
maneuvers and stay balanced. At the beginning of the game,
players receive their own cars, which sit motionless on a track.
Each student must write a C++ program that controls the steering
wheel, gearshift, accelerator, and brake pedals to get the car
to move (and stop). The program also needs to include specific
maneuverability parameters (e.g., gas pedal, transmission, and
steering wheel). Running their C++ programs permits students
to simulate the car’s performance (e.g., distance from the center
line of the track and wheel rotation rates), and thus students are
able to see the results of their programming efforts by driving
the car in a 3-D environment.
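The real assignment is written in C++ against the game's own interface. Purely to illustrate the shape of such a controller, here is a minimal sketch in Python; every state field and command name below is hypothetical and is not part of the NIU-Torcs API.

```python
# Illustrative sketch only: a tiny car controller of the kind students write for
# NIU-Torcs. All field and function names here are hypothetical, not the game's
# actual C++ interface.

def control_step(state):
    """Map the car's current state to steering/throttle/brake/gear commands."""
    steering = -0.5 * state["distance_from_centerline"]   # steer back toward the center line
    too_fast = state["speed"] > state["target_speed"]
    return {
        "steering": max(-1.0, min(1.0, steering)),
        "throttle": 0.0 if too_fast else 0.6,
        "brake": 0.3 if too_fast else 0.0,
        "gear": 1 if state["speed"] < 20 else 2,
    }

# Example: one simulated control step with made-up sensor readings.
print(control_step({"distance_from_centerline": 0.8, "speed": 35, "target_speed": 30}))
```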
NIU-Torcs was evaluated using mechanical engineering students
in several undergraduate classrooms. Findings showed
that students in the classroom using NIU-Torcs as the instructional
approach (n = 38) scored significantly higher than students
in four control group classrooms (n = 48) on a concept
map assessment. The concept map assessment included questions
spanning four progressively higher levels of understanding:
the number of concepts recalled (i.e., low-level knowledge),
the number of techniques per topic recalled, the depth of the
hierarchy per major topic (i.e., defining features and their connections),
and finally, connections among branches in the hierarchy
(i.e., showing a deep level of understanding). Students
in the NIU-Torcs classroom significantly improved in terms of
the depth of hierarchy and connections among branches (i.e.,
deeper levels of knowledge) relative to the control group. Figure
1 shows a couple of screen shots from the NIU-Torcs game.
Understanding Cancer Cells with Re-Mission
Re-Mission (Kato et al. 2008) is the name of a video game in
which players control a nanobot (named Roxxi) in a 3-D environment
representing the inside of the bodies of young patients
with cancer. The gameplay was designed to address behavioral
issues that were identified in the literature and were seen as critical
for optimal patient participation in cancer treatment. The
video gameplay includes destroying cancer cells and managing
common treatment-related adverse effects, such as bacterial
infections, nausea, and constipation. Neither Roxxi nor any of
the virtual patients die in the game. That is, if players fail at any
point in the game, then the nanobot powers down and players
are given the opportunity to retry the mission. Players need
to complete missions successfully before moving on to the next
level.
A study was conducted to evaluate Re-Mission at thirty-four
medical centers in the United States, Canada, and Australia. A
total of 375 cancer patients, thirteen to twenty-nine years old,
were randomly assigned to the intervention (n = 197) or control
group (n = 178). The intervention group played Re-Mission while
the control group played Indiana Jones and the Emperor’s Tomb
(i.e., both the gameplay and interface were similar to Re-Mission).
After taking a pretest, all participants received a computer either
with Indiana Jones and the Emperor’s Tomb (control group) or the
same control group game plus the Re-Mission game (intervention
group). The participants were asked to play the game(s) for
at least one hour per week during the three-month study, and
outcome assessments were collected at one and three months
after the pretest. Game use was recorded electronically. Outcome
measures included adherence to taking prescribed medications,
self-efficacy, cancer-related knowledge, control, stress, and quality
of life. Adherence, self-efficacy, and cancer-related knowledge
were all significantly greater in the intervention group
compared to the control group. The intervention did not affect
self-reported measures of stress, control, or quality of life. Figure
2 shows an opening screen of Re-Mission.
Taiga Park and Science Content Learning
Our last example illustrates how kids learn science content and
inquiry skills within an online game called Quest Atlantis: Taiga
Park. Taiga Park is an immersive digital game developed by Sasha
Barab and his colleagues at Indiana University (Barab et al. 2007;
Barab, Gresalfi, and Ingram-Goble 2010). Taiga Park is a beautiful
national park where many groups coexist, such as the fly-fishing
company, the Mulu farmers, the lumber company, and
park visitors. In this game, Ranger Bartle calls on the player to
investigate why the fish are dying in the Taiga River. To solve
this problem, players are engaged in scientific inquiry activities.
They interview virtual characters to gather information, and collect
water samples at several locations along the river to measure
water quality. Based on the collected information, players make
a hypothesis and suggest a solution to the park ranger.
To move successfully through the game, players need to
understand how certain science concepts are related to each
other (e.g., sediment in the water from the loggers’ activities
causes an increase to the water temperature, which decreases the
amount of dissolved oxygen in the water, which causes the fish
to die). Also, players need to think systemically about how different
social, ecological, and economic interests are intertwined
in this park. In a controlled experiment, Barab and his colleagues
(2010) found that middle-school students learning with Taiga
Park scored significantly higher on the posttest (i.e., assessing
knowledge of core concepts such as erosion and eutrophication)
compared to the classroom condition (p < 0.01). The Taiga Park
group also scored significantly higher than the control condition
on a delayed posttest, thus demonstrating retention of the
content relating to water quality (p < 0.001) in a novel task (thus
better retention and transfer). The same teacher taught both
treatment and control conditions. For a screen capture from
Taiga Park, see figure 3.
As these examples show, digital games appear to support
learning. But how can we more accurately measure learning,
especially as it happens (rather than after the fact), and beyond
content knowledge?
Assessment in Games
In a typical digital game, as players interact with the environment,
the values of different game-specific variables change. For
instance, getting injured in a battle reduces a player’s health, and
finding a treasure or another object increases a player’s inventory
of goods. In addition, solving major problems in games permits
players to gain rank or “level up.” One could argue that these are
all “assessments” in games—of health, personal goods, and rank.
But now consider monitoring educationally relevant variables at
different levels of granularity in games. In addition to checking
health status, players could check their current levels of systems-thinking
skill, creativity, and teamwork, where each of these
competencies is further broken down into constituent knowledge
and skill elements (e.g., teamwork may be broken down
into cooperating, negotiating, and influencing/leadership skills).
If the estimated values of those competencies got too low, the
player would likely feel compelled to take action to boost them.
One main challenge for educators who want to employ or
design games to support learning is making valid inferences—
about what the student knows, believes, and can do—at any
point in time, at various levels, and without disrupting the flow
of the game (and hence engagement and learning). One way to
increase the quality and utility of an assessment is to use evidence-centered
design (ECD), which informs the design of valid
assessments and yields real-time estimates of students’ competency
levels across a range of knowledge and skills (Mislevy,
Steinberg, and Almond 2003).
ECD is a conceptual framework that can be used to develop
assessment models, which in turn support the design of valid
assessments. The goal is to help assessment designers coherently
align the claims that they want to make about learners as well
as the things that learners say or do in relation to the contexts
and tasks of interest (e.g., Mislevy and Haertel 2006; Mislevy,
Steinberg, and Almond 2003; for a simple overview, see ECD for
Dummies by Shute, Kim, and Razzouk 2010). There are three
main theoretical models in the ECD framework: competency,
evidence, and task models.
Competency Model
What collection of knowledge, skills, and other attributes should be
assessed? Although ECD can work with simple one-dimensional
competency models, its strength comes from treating competency
as multidimensional. Variables in the competency model
describe the set of knowledge and skills on which inferences are
based (see Almond and Mislevy 1999). The term student model
is used to denote an instantiated version of the competency
model—like a profile or report card, only at a more refined grain
size. Values in the student model express the assessor’s current
belief about the level on each variable within the competency
model, for a particular student.
Evidence Model
What behaviors or performances should reveal those competencies?
An evidence model expresses how the student’s interactions with
and responses to a given problem constitute evidence about competency
model variables. The evidence model attempts to answer
two questions: (a) What behaviors or performances reveal targeted
competencies; and (b) What’s the statistical connection between
those behaviors and the variable(s) in the competency model?
Task Model
What tasks or problems should elicit those behaviors that comprise
the evidence? The variables in a task model describe features of situations
that will be used to elicit performance. A task model provides
a framework for characterizing or constructing situations
with which a student will interact to supply evidence about
targeted aspects of competencies. The main purpose of tasks or
problems is to elicit evidence (observable) about competencies
(unobservable). The evidence model serves as the glue between
the two.
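To make the three models more concrete, here is a minimal sketch of how they might be represented in software. The competency variables, observables, and task features are invented examples, not the models used in the Newton's Playground assessments.

```python
# Minimal sketch of the three ECD models as plain data structures.
# The specific variables, observables, and task features are invented examples.
from dataclasses import dataclass, field

@dataclass
class CompetencyModel:
    # What collection of knowledge, skills, and attributes should be assessed?
    variables: list = field(default_factory=lambda: ["systems thinking", "persistence"])

@dataclass
class EvidenceModel:
    # Which behaviors reveal those competencies, and how do they connect statistically?
    rules: dict = field(default_factory=lambda: {
        "retries_after_failure": "persistence",        # observable -> competency variable
        "links_correct_causes": "systems thinking",
    })

@dataclass
class TaskModel:
    # What features of situations will elicit the behaviors that constitute evidence?
    features: dict = field(default_factory=lambda: {
        "difficulty": "high", "requires_multiple_steps": True})

# A "student model" is an instantiated competency model for one learner:
student_model = {var: 0.5 for var in CompetencyModel().variables}  # current belief per variable
print(student_model)

# The evidence model is the "glue": observables map onto competency variables.
print(EvidenceModel().rules)
```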
There are two main reasons why we believe that the ECD
framework fits well with the assessment of learning in digital
games. First, in digital games, people learn in action (Gee 2003;
Salen and Zimmerman 2005). That is, learning involves continuous
interactions between the learner and game, so learning is
inherently situated in context. The interpretation of knowledge
and skills as the products of learning therefore cannot be isolated
from the context, and neither should assessment. The ECD
framework helps us to link what we want to assess and what
learners do in complex contexts. Consequently, an assessment
can be clearly tied to learners’ actions within digital games, and
can operate without interrupting what learners are doing or
thinking (Shute 2011).
The second reason that ECD is believed to work well with digital
games is because the ECD framework is based on the assumption
that assessment is, at its core, an evidentiary argument.
Its strength resides in the development of performance-based
assessments where what is being assessed is latent or not apparent
(Rupp et al. 2010). In many cases, it is not clear what people
learn in digital games. In ECD, however, assessment begins by
figuring out just what we want to assess (i.e., the claims we want
to make about learners), and clarifying the intended goals, processes,
and outcomes of learning.
Accurate information about the student can be used to support
learning. That is, it can serve as the basis for delivering
timely and targeted feedback as well as presenting a new task
or quest that is right at the cusp of the student’s skill level, in
line with flow theory (e.g., Csikszentmihalyi 1990) and Lev
Vygotsky’s (1978) zone of proximal development.
As discussed so far, there are good reasons for using games as
assessment vehicles to support learning. Yet Diego Zapata-Rivera
and Malcolm Bauer (2011) discuss some of the challenges relating
to the implementation of assessment in games, such as the
following:
•  Introduction of construct-irrelevant content and skills: When
designing interactive gaming activities, it is easy to introduce
content and interactions that impose requirements on knowledge,
skill, or other attributes (KSA) that are not part of the construct
(i.e., the KSAs that we are not trying to measure). That is,
authenticity added by the context of a game may also impose
demands on irrelevant KSAs (Messick 1994). Designers need to
explore the implications for the type of information that will be
gathered and used as evidence of students’ performance on the
KSAs that are part of the construct.
•  Interaction issues: The nature of interaction in games may be at
odds with how people are expected to perform on an assessment
task. Making sense of issues such as exploring behavior, pacing,
and trying to game the system is challenging, and has a direct
link to the quality of evidence that is collected about student
behavior. The environment can lend itself to interactions that
may not be logical or expected. Capturing the types of behaviors
that will be used as evidence and limiting other types of behaviors
(e.g., repeatedly exploring visual or sound effects) without
making the game dull or repetitive is a challenging activity.
•  Demands on working memory: Related to both the issues of
construct-irrelevant variance (i.e., when the test contains excess
variance that is irrelevant to the interpreted construct; Messick
1989) and interaction with the game is the issue of demands
that gamelike assessments place on students’ working memory.
By designing assessments with higher levels of interactivity and
engagement, it’s easy to increase cognitive processing demands
in a way that can reduce the quality of the measurement of the
assessment.
•  Accessibility issues: Games that make use of rich, immersive
graphic environments can impose great visual, motor, auditory,
and other demands on the player to just be able to interact in
the environment (e.g., sophisticated navigation controls). Moreover,
creating environments that do not make use of some of
these technological advances (e.g., a 3-D immersive environment)
may negatively affect student engagement, especially for
students who are used to interacting with these types of games.
Parallel environments that do not impose the same visual,
motor, and auditory demands without changing the construct
need to be developed for particular groups of students (e.g., students
with visual disabilities).
•  Tutorials and familiarization: Although the majority of students
have played some sort of video game in their lives, students
will need support to understand how to navigate and
interact with the graphic environment. Lack of familiarity with
navigation controls may negatively influence student performance
and student motivation (e.g., Lim, Nonis, and Hedberg
2006). The use of tutorials and demos can support this familiarization
process. The tutorial can also be used as an engagement
element (see, e.g., Armstrong and Georgas 2006).
•  Type and amount of feedback: Feedback is a key component
of instruction and learning. Research shows that interactive
computer applications that provide immediate, task-level feedback
to students can positively contribute to student learning
(e.g., Hattie and Timperley 2007; Shute 2008; Shute, Hansen,
and Almond 2008). Shute (2008) reviews research on formative
feedback and identifies the characteristics of effective formative
feedback (e.g., feedback should be nonevaluative, supportive,
timely, specific, multidimensional, and credible). Immediate
feedback that results from a direct manipulation of objects in
the game can provide useful information to guide exploration
or refine interaction strategies. The availability of ongoing feedback
may influence motivation and the quality of the evidence
produced by the system. Measurement models need to take into
account the type of feedback that has been provided to students
when interpreting the data gathered during their interaction
with the assessment system.
•  Handling dependencies among actions: Dependencies among
actions/events can be complex to model and interpret. Assumptions
of conditional independence required by some measurement
models may not hold in complex interactive scenarios.
Designing scenarios carefully can help reduce the complexity of
measurement models. Using data-mining techniques to support
evidence identification can also help with this issue.
In addition to these challenges, in order to make scalable
assessments in games, we need to take into account operational
constraints and support the need for assessment information by
different educational stakeholders, including students, teachers,
parents, and administrators. Stealth assessment addresses many
of these challenges. The next section describes stealth assessment
and offers a sample application in the area of Newtonian
physics.
