In April of this year, SRI published a report evaluating the efficacy of adaptive learning technologies. This report represents the culmination of a program, funded by the Bill & Melinda Gates Foundation, in which 14 higher education institutions set out to test nine different adaptive learning tools across 23 courses, involving more than 280 unique instructors and more than 19,500 unique students. The goal was to conduct experiments or quasi-experiments in which the adaptive learning technologies would be evaluated via side-by-side comparison (i.e., against a control group). Pretty impressive, eh?
Well, I’ve read the report, and sadly, I must agree with Carl Straumsheim’s Inside Higher Ed summary: “Major study of adaptive learning finds inconclusive results…”.
Even more sadly, I wasn’t at all surprised.
For someone like me – strong advocate for educational research, lover of complex datasets and intricate statistical tests, and generally optimistic person – it’s a bit odd to admit that I did not have high hopes for this study. But my reason is straightforward: classroom-based research is hard. In particular, classroom-based comparison studies are hard. In real-life educational settings, it’s a nontrivial challenge to conduct a true experiment with random assignment to conditions or, failing that, to design a quasi-experimental study with adequate control.
Indeed, a recurring theme in the SRI report is that the higher education systems under study were not ready to engage in this kind of investigation: “Baseline equivalence is essential for justifying claims about courseware effects, but the common practice in higher education institutions is to simply compare course success rates without any data on student characteristics” (p. ES-iv, 26). “In some cases, we observed that the students in the adaptive learning conditions differed considerably (> 0.25 standard deviation) from comparison group students on the measure of prior learning” (p. 8).
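To make the 0.25 standard deviation threshold concrete, here is a minimal sketch of a baseline-equivalence check. The scores are made up for illustration, and dividing by the pooled standard deviation is my assumption; the SRI report may have computed its differences another way.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treatment, comparison):
    """Difference in group means, in units of the pooled standard deviation."""
    n_t, n_c = len(treatment), len(comparison)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(comparison)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(comparison)) / sqrt(pooled_var)


# Hypothetical prior-learning scores (e.g., a placement test), not SRI data.
adaptive_scores = [62, 70, 75, 68, 80, 73]
comparison_scores = [60, 66, 71, 64, 77, 70]

smd = standardized_mean_difference(adaptive_scores, comparison_scores)
# Flag the comparison if the groups differ by more than 0.25 SD at baseline.
groups_differ_at_baseline = abs(smd) > 0.25
print(f"baseline difference: {smd:.2f} SD, flagged: {groups_differ_at_baseline}")
```

If the flag comes up true, any later difference in outcomes could simply reflect where the two groups started.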
These study design problems certainly contributed to the challenges faced by the SRI researchers, but they are not the sole cause of “inconclusive results” – in this study or elsewhere. There is another set of factors that I would argue are just as important to methodological quality and, hence, to increasing the likelihood of conclusive results. And the good news is: these factors are well within the control of the typical teacher-researcher. However, it just happens that many teacher-researchers haven’t come across them in their own disciplinary training. So common research practice is to do what seems reasonable and unwittingly fall into some avoidable traps.
Below is my list of five (relatively) easy-to-avoid pitfalls in (comparison-based) educational research. The point is: if you want to conduct research comparing instructional conditions (e.g., an efficacy evaluation like the SRI project) and you want to generate meaningful conclusions, you need more than a solid experimental design; you also need to avoid these all-too-common pitfalls. For those interested in educational comparisons that do not involve technology (e.g., teaching with inquiry-based learning practices vs. without, interspersing concept questions during class vs. traditional lecture), the following points apply equally well; arguably, most of them apply to educational research in general.
Before reading on, remember, this is a list of things to avoid. The first paragraph after each heading explains why the pitfall is detrimental, and the second paragraph provides advice on what to do instead. Then the third paragraph describes how the corresponding issues played out in the SRI report.
Pitfall #1: Final course grades are used as a measure of student learning.
Granted, we all want grades to accurately reflect what students learned in our courses, but there are many reasons why they really don’t. First, final course grades generally include multiple components that are not measures of learning: student attendance and/or participation, homework scores (that may not be completed independently), extra credit, and other effort-based points. Second, final grades are not necessarily consistent across instructors or courses; rather, the weighting of different components tends to be rather idiosyncratic. Third, and perhaps most importantly, final course grades don’t adjust for students’ prior knowledge coming into the course, so they cannot serve as a measure of learning – i.e., knowledge gain.
To establish a more valid and reliable measure of students’ learning than final grades, I recommend using direct assessments of students’ performance instead (and ensuring that they are consistently administered across the different groups in a study). Using a common final exam, an off-the-shelf standardized test (that aligns with the course content and objectives), or equivalent assignments (with consistently applied grading criteria) are all reasonable ways to accomplish this. In addition, it is ideal to incorporate a corresponding pre-test at the beginning of the course to enable calculation of learning gain and to verify that the different groups of students began the study with equivalent baseline performance.
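As a rough illustration of the pre-test/post-test idea, the sketch below computes each student's raw gain (post-test minus pre-test) and averages it per group. All scores and group names are hypothetical, and raw gain is only one of several possible gain measures.

```python
from statistics import mean


def learning_gains(pre, post):
    """Per-student raw gain: post-test score minus pre-test score."""
    if len(pre) != len(post):
        raise ValueError("each student needs both a pre-test and a post-test score")
    return [after - before for before, after in zip(pre, post)]


# Hypothetical scores on the same direct assessment, given to both groups.
scores = {
    "treatment": {"pre": [40, 55, 50, 45], "post": [70, 80, 72, 66]},
    "control": {"pre": [42, 53, 48, 47], "post": [65, 74, 70, 63]},
}

mean_gain = {
    group: mean(learning_gains(s["pre"], s["post"])) for group, s in scores.items()
}
mean_pre = {group: mean(s["pre"]) for group, s in scores.items()}

print("mean pre-test:", mean_pre)   # check that the groups start out comparable
print("mean gain:", mean_gain)      # compare learning, not just final scores
```

Comparing gains only makes sense if the same assessment, scored the same way, was given to every group.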
Now, how did the SRI study fare regarding this pitfall? Final course grades were used as one of the two main learning outcomes. (Course completion rates were the other.) So it’s not surprising that almost none of the side-by-side comparisons showed a significant difference. (And even if they had, you would have been suspicious about final grades as a good measure of student learning anyway, right?) Thankfully, the SRI researchers recognized this problem (p. 8), and in their recommendations, they call for the use of more direct assessments of learning and performance (pp. ES-vi, 27). Hear, hear!
Pitfall #2: Too much weight is given to students’ assessments of learning gain.
Given that collecting an adequate measure of students’ learning is more involved than exporting students’ final course grades (see pitfall #1), it’s not surprising that folks have sought alternative, quick-and-dirty options. However, just asking students (e.g., via a survey) how much they think they learned during a course or, even worse, having students compare their learning in the current situation to another, distant-in-memory or hypothetical situation, is simply asking for trouble. Research has shown that students’ self-reports of learning can be quite biased (often in unpredictable ways). Moreover, when direct assessments of learning and students’ self-reports of learning are both collected in the same study, results for the two types of measures often don’t match. That is, results based on direct assessments might indicate that one instructional condition is better than the other, but results based on students’ self-reports indicate the opposite! In other words, if you only use students’ assessments of learning gain (or if you give such data too much weight in your analysis), you may end up drawing the wrong conclusion about educational effectiveness.
Instead of using students’ self-reports of learning gain (yes, I would avoid collecting them at all), one recommended alternative is to leverage students’ perceptions in other ways – where they are more likely to be accurate. For example, if you are interested in students’ perceived enjoyment or engagement with their learning experience, you could survey students about that. Or you could ask them if they would like to learn more (or would recommend to a friend) using the same approach they just experienced. Questions of this sort are still subject to potential biases (e.g., social desirability), but they are safer than self-reports of learning gain. (Note: if you are really interested in measuring student engagement and preference, also consider collecting relevant behavioral data, e.g., number of optional activities completed, time on task, subsequent selection of major, or enrollment in follow-on courses.)
And what about the SRI study? Unfortunately, student self-assessments of learning were included in the set of measures collected and analyzed by SRI. As predicted, the self-assessment results did not match the learning outcome results. Specifically, 77% of 2-year college students and 51% of 4-year college students self-reported that they achieved positive learning gains with an adaptive learning technology, but only 4 of the 15 outcome comparisons showed any positive result for the adaptive technology group (and those few cases only showed “slightly higher course grades”). The SRI authors acknowledge this problem and caution that survey data “are insufficient as the only outcome in a courseware evaluation” (pp. ES-vi, 27). When it comes to self-reports of learning gain, I say: don’t even bother collecting them.
Pitfall #3: Teachers are not given sufficient preparation or professional development before incorporating a new teaching tool – technological or pedagogical – into their practice.
This one seems obvious enough that it should be easy to avoid. And yet… new math. It seems as if this pitfall applies to virtually every teaching innovation that has ever been disseminated. (Ok, maybe that last comment exaggerates a bit, but I think you see my point. And if you’re seriously skeptical, see Larry Cuban’s excellent book.)
So, instead of assuming that teacher preparation will happen naturally or simply take care of itself, let’s make a pact to deliberately plan this phase into the research process. There are many ways to do this: hands-on workshops, mentoring circles, training institutes, etc. The key is to involve teachers early on in the process, not only to help them gear up but also to solicit their insights and perspectives on how to best incorporate the new tool in support of student learning. Relatedly, it’s also worth planning additional phases on the other end of the process – i.e., by continuing to study the intervention beyond its initial implementation. Because novel teaching tools can take some time to be fully incorporated and optimized, it’s advantageous to study each new intervention at multiple stages of adoption – from initial implementation through to steady state.
What did the SRI report have to say on this issue? Although it is not directly referenced in the report, I imagine that instructors in this study could have benefited from greater support and professional development before teaching with their chosen adaptive learning technology. The report mentions instructor training time as one variable collected (by survey and interview) but doesn’t provide much detailed information. (Only results for the “parent variable” overall preparation time are presented, but that variable includes much more than initial training.) Nevertheless, on related issues, the SRI report includes two noteworthy features. First, it recommends: “More research is needed to develop cost effective ways to capture the changes in instructional practice associated with implementations of adaptive courseware” (pp. ES-vi, 27). I completely agree with this and would even go a step further to additionally recommend capturing information on instructors’ training experiences. (Who knows? It may be that how instructors are trained up on a new teaching tool makes a difference for its success.) As for the second noteworthy feature of the SRI report, consistent with my recommendation above, the data in this study were collected across multiple semesters (“terms”) of implementation. The SRI researchers took advantage of this by analyzing some of the data separately for Term 1 versus subsequent terms, allowing them to investigate different patterns in the data across time. I’d like to see more of this in educational research.
Pitfall #4: It is assumed that students in the treatment group will “take” the treatment.
Medical doctors have this problem all the time: Patients don’t always take their medicine, so when the treatment doesn’t work, doctors don’t know if they need to try another option or work on getting better compliance. The analogue of this problem in educational studies gets even worse when we forget that null results (i.e., finding no difference between the treatment and control groups) might stem from an insufficient “dose” being taken by students in the treatment group. It’s natural for researchers to forget about this ambiguity because they are so focused on other aspects of the study, not to mention excited about the new tool’s great potential for success. (“The students are just gonna love this new <insert your favorite teaching innovation here>!”) Unfortunately, this is a case where student behaviors can and will fail to match our expectations. To put it another way, “If you build it, they will come” does not apply to new teaching tools.
But there are several things one can do to avoid these problems. Here are two recommended strategies – one more proactive, the other more reactive. Proactively, one can set up course expectations and assignments so that engaging with the new teaching tool is standard behavior for students. Teachers have a variety of incentivizing strategies they use in general, so it makes sense to employ whichever one(s) best fit the context. A more reactive strategy is to find a reasonable measure of how much students in each treatment group actually engaged with their new tool (e.g., amount of time, number of activities), and then include this as another variable in the analysis of student outcomes. I call this an educational “dose-response” analysis. There are some issues to bear in mind here – e.g., available measures of dose may correlate highly with being a good student, a confounding variable in the analysis – but the basic idea is the same: leverage data on amount/degree of student use to investigate its influence on outcomes.
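Here is one way an educational “dose-response” analysis might be sketched. It estimates how much the outcome changes per unit of tool use, first on its own and then after adjusting for a pre-test score as a stand-in for “being a good student.” The data, the variable names, and the choice of pre-test as the adjustment variable are all assumptions for illustration. The adjustment regresses the pre-test out of both dose and outcome and then relates what is left (the standard Frisch-Waugh-Lovell approach to a two-predictor regression).

```python
from statistics import mean


def slope(x, y):
    """Least-squares slope of y regressed on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def residuals(x, y):
    """What remains of y after removing its linear relationship with x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def adjusted_dose_slope(dose, outcome, covariate):
    """Slope of outcome on dose, holding the covariate constant."""
    return slope(residuals(covariate, dose), residuals(covariate, outcome))


# Hypothetical treatment-group data: hours of tool use, pre-test, post-test.
hours_used = [1, 2, 3, 4, 5, 6]
pretest = [50, 55, 52, 60, 58, 65]
posttest = [60, 66, 68, 75, 76, 84]

raw_slope = slope(hours_used, posttest)
adj_slope = adjusted_dose_slope(hours_used, posttest, pretest)
print(f"post-test points per hour of use: raw {raw_slope:.2f}, adjusted {adj_slope:.2f}")
```

Adjusting for one covariate only partly addresses the confound; stronger students may still differ in ways a pre-test does not capture.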
Did the SRI study fall prey to this issue? It seems so. Among the SRI report’s key findings is this one: “The major concern expressed by instructors was getting students to use the adaptive courseware frequently enough” (p. ES-iii). Later in the report, the authors acknowledge a key consequence of this problem: “Students at Metropolitan [College] reported using Cerego [adaptive software] just ‘a few times or more,’ raising questions about whether the level of usage provided a fair test of the product’s potential” (p. 18).
Pitfall #5: When the focus of research is on outcomes, collecting data on the process of teaching and learning gets lost in the shuffle.
It’s appropriate for the focus of comparison-based educational research to be on outcomes – Did students learn more (or make more/better achievements) after the instructional treatment, compared to control? As such, researchers naturally focus their data-collection efforts on outcome variables: final grades, course completion rates, test scores, etc. Another likely reason for the focus on such variables is that they tend to be quantitative in nature. (I sometimes get the feeling that there is an implicit bias in favor of quantitative data.) However, if we only collect quantitative data on outcome variables, we support the goal of identifying which teaching tool is better at the cost of the related goal of discovering why.
Collecting data (yes, even qualitative data) on the process of teaching and learning is one way to learn why and how the teaching comparisons we care about turned out the way they did. And it’s not that tricky to do. It can simply involve collecting similar outcome data at additional, intermediate points in time – to get a richer picture of the trajectory of student development. It can also include collecting other types of data, such as interviews (e.g., asking teachers about their teaching practices) or think-aloud protocols (i.e., having students speak out loud – mentioning the pieces of knowledge that go through their mind – as they engage in learning with the new tool). In addition, when a study involves a new technological tool, “process data” can often be drawn from log files collected automatically as students interact with the tool (e.g., students’ click sequences, their ongoing interactions with the tool, or the content/accuracy of their responses). With these kinds of data in hand, we can explore more detailed research questions about how and why a given teaching tool worked the way it did. This, in turn, can help us apply the “better” tool to positive effect in new situations (a key point, after all), and it can improve our general understanding of how learning works (a beautiful thing, in and of itself).
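For the log-file case, a small sketch may help show how little is needed to turn raw events into per-student process measures. The log format below (student ID, timestamp in seconds, whether the response was correct) is entirely hypothetical; real tools export their own formats.

```python
from collections import defaultdict

# Hypothetical log events: (student_id, seconds_since_course_start, was_correct)
events = [
    ("s1", 120, True),
    ("s1", 185, False),
    ("s1", 240, True),
    ("s1", 90300, True),
    ("s2", 400, False),
    ("s2", 455, True),
]


def summarize_by_student(log):
    """Count attempts and compute response accuracy for each student."""
    totals = defaultdict(lambda: {"attempts": 0, "correct": 0})
    for student_id, _timestamp, was_correct in log:
        totals[student_id]["attempts"] += 1
        totals[student_id]["correct"] += int(was_correct)
    return {
        student_id: {
            "attempts": t["attempts"],
            "accuracy": t["correct"] / t["attempts"],
        }
        for student_id, t in totals.items()
    }


summary = summarize_by_student(events)
for student_id, stats in summary.items():
    print(student_id, stats)
```

The same pass over the log could also record timing between events or the order of activities, depending on the research question.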
In the SRI study, several process measures were collected, especially from faculty: interviews and surveys on their teaching practices, estimates of their time spent in various preparatory activities, and costs paid to put the adaptive learning technologies in place. One variable that I found interesting – though a bit coarse – is instructors’ self-reported time spent preparing and giving lectures (measured in hours/week). The report compared this quantitative variable between treatment and control groups for three different use cases (e.g., transitioning from traditional lecture to teaching with adaptive learning tools). I would like to see this set of process data expanded in future studies. For example, one could collect more precise data on when (and for how long in each segment) instructors lecture in the different conditions, qualitative data on the content of their lectures (to be compared across adaptive and non-adaptive groups), and personal logs of how instructors spend their time preparing for class in the different conditions. Additionally, process data could be collected from students: the sequence and timing of their (potentially infrequent) uses of the adaptive learning tools, the nature of learning activities they engaged in (including whether these activities actually differed across students in the adaptive groups), the content and accuracy of student responses (and the pattern of subsequent actions after failure vs. success). The possibilities are endless, and benefits include deeper understanding of the process under study – regardless of whether the outcome results were conclusive or not.
To sum up
If you are doing (or reading) a comparison-based educational study, watch out for the five pitfalls mentioned above. Instead of falling prey to them, consider the following recommendations:
- Identify direct measures of student learning and performance that can be consistently administered/collected across groups and used as outcome data; seek to administer corresponding pre-tests as well.
- Do not collect students’ self-reports of learning gains; instead, use survey questions to ask students about their self-reported engagement or related outcomes, where students have a chance of providing an accurate judgment about something you want to know.
- Plan a training phase into the research process – to help instructors learn about the new teaching tool and to learn from their insights on how the tool can best be incorporated.
- Incentivize students to engage with the new teaching tool (lest they fail to utilize it sufficiently), or collect data on how much they actually used it (so you can conduct dose-response analyses after the fact).
- In addition to outcome data, consider collecting data on teaching and learning during the process (and don’t be afraid to include qualitative data in this set).
By making these five moves, you’ll likely improve the quality of your study design, your dataset, and ultimately, your results. After all, it’s in everyone’s best interest to produce high-quality data and results in educational research. Only then will we have the solid information we need to make effective choices that enhance student learning.