Chapter 2: Algorithm-Guided Learning at Scale

What were your biggest takeaways from the second of the Three Genres of Learning at Scale? Do you have any experience with Adaptive Tutors and Computer-Assisted Instruction, and how does that experience align with or differ from the research presented?

First, big compliments to Justin for explaining some complex topics in simple terms. Well done!

I have worked on CAI or CAI-adjacent products at Cisco, Pearson, and now Khan. A couple of things that I would emphasize:

  1. There is a difference between deciding what the next problem will be and what the next topic will be in terms of the roles and responsibilities of technology and the teacher. I have seen little objection to technology choosing which problem to do next for a student. I have seen great difficulty in adjusting to a system where the technology decides what topic to do next. Teachers often have pacing guides and a scope and sequence that they are tied to, and that is foundational to success on the end-of-year assessments. Asking them to let students spend significant time outside of that plan is a difficult request to make and for them to adapt to, particularly if they are less experienced, less trained, and unfamiliar with the school (guess which students are more likely to have teachers that fit that description).

  2. Algorithm-based recommendations require a level of trust in the technology system. Why is it making those recommendations? Teachers want to be able to inspect and verify these, but many of the complex algorithms that are in use for recommendations are not easily inspectable, even by those that build them. On top of that, in my observation, one or two “off” recommendations are sufficient for teachers and students to lose faith in the system.

  3. If there is a Reich rule in the last chapter about students who do more, do well, there should be one in this chapter as well that is something like: schools that implement interventions in the way they were designed see better outcomes. It seems obvious, but so much of our evaluation data comes down to variability in results that is largely due to variation in implementation.

All of this has convinced me over the past decade plus of working on this that we need to stay simple. We need to design technology solutions that are simple to understand and simple to implement. I am not convinced that the vision of letting technology help with the basic procedural skills and knowledge so teachers can do more problem-based instruction focused on higher level thinking is wrong. I don’t think it requires a huge disruption to do that. But we have to have simpler goals and simpler methods.


I concur with Kristen’s comment that the teacher default is to twist their plans to fit scope, sequence and the 1000 other considerations. We will break tools, sometimes making them useless, in order to ensure that they fit into boxes we have ascribed. AND… that often turns out to be a BAD learning experience for our students.

Right now, I am trying to figure out how I would build an “adaptive tutor” to help high school students practice historical thinking skills. This is a modest project, with no intention to disrupt history education. The chapter confirmed that history instruction has to be predicated on a blended learning model. No app will replace history teachers any time soon. Still, I was encouraged to read that there is real value in creating opportunities for students to work at their own pace and to practice the skills they identify as problematic. Getting immediate feedback in a formative assessment can help. Tinkering… not large scale systems… is worthwhile.

I’m not the biggest fan of adaptive learning technologies, but throughout the chapter I was struck with the “RCT conundrum” as I call it, which I find even more frustrating than the emphasis on robot tutors in the sky solving all our problems. Talk about what furthers or silences innovation in a field: let’s look introspectively at our education research field. Nothing shuts out innovation more than declaring one and only one method the “gold standard” and treating all other methods as inferior, and I was pretty disappointed to see this approach furthered in Failure to Disrupt. RCTs are a specific methodology, and constraining every approach to this method constrains what research questions we can and cannot ask.

Case in point, p. 74: “Randomized control trials are a good tool for figuring out if tools work on average. But no school district is average: every context is unique.” Therefore, the next line should be a call to rethink the use of RCTs! If context and implementation matter more than anything else, and an RCT forces fidelity to implementation and controls for context, then we have a problem we need to talk about. This is the core of the conundrum: we must use RCTs, but there’s a whole set of issues we know to be important that RCTs ignore in their methodology.

One emergent finding, called out repeatedly in this chapter, is that there are three teacher-related factors that are at least as important for success as the choice of technology itself: teacher desire to use the technology, teacher interest or excitement to use the technology, and teacher quality without the technology. An RCT explicitly tries to control for and ignore each of these three factors. Rather than treat these factors as an asterisk to add on to RCT results, why don’t we acknowledge that the RCT is not a gold standard and that no one research methodology is above all others (which would mean education research would look more similar to other hard sciences and less similar to medical research, which seems overall a good thing), and encourage multiple research questions to be asked?

For example, the only question we can really ask with an RCT is “Does this technology work?” which turns out not to be the most useful question to ask, because it necessarily avoids context and the situation is far more complex. What if we asked questions like:
Do teachers who self-select a technology due to interest and excitement do better than those who forgo the technology and prefer to teach by their original methodology?
Which kinds of implementations work best? Which teacher practices work best with this technology? Which student practices work best with this technology?
Which technological and social factors need to be in place at a school for this technology to work well?

I’d love to see some introspection in the conversation today, on why there have been failures to disrupt the RCT as a gold standard within the educational research community, and how we can disrupt how research is conducted to better understand how education can be improved.

I agree… the “randomized control trials are what matters” stance is throwing qualitative research under the desk.

Pages 59-60: “The theory of disruptive innovation argues that, periodically, innovations come along that may be low quality in some dimensions but offer low cost and novel features.” The Sony Walkman was cited as an example of something with lousy sound, but “non-consumers” – people who weren’t going to get the good stuff anyway – would go for it.
Whew, does that sound like what happens in education. We’ve got whole categories of people in the “non-consumer” category: they aren’t going to get the good stuff anyway. Let’s give them something with some superficial features so we can all feel okay about that.

I especially agree with your third point, Kristen - when we evaluate new interventions, we need to include measures of fidelity! Of course, the great difficulty in education research (and the social sciences, generally) is the number of variables that can affect an implementation, and no measure will completely capture that.

While RCTs have a role to play, I think this is the perfect example of where rich, qualitative case studies are especially important. We need to know as much as possible about the exact contexts in which these tools do work in order to make more accurate recommendations on how others (outside of the initial testing group) can use them effectively within their own context. I would like to see more producers of CAI products include case studies as part of their early development research.

Hi Kevin, thanks for your thoughts here. You might be interested in some of our commentary on a recent big RCT we ran with ~250K participants:

"The kind of large-scale research that is needed to advance this work is not well-represented in the dominant paradigm of experimental educational research. The NSF/IES Common Guidelines for Education Research define a trajectory for experimental research that proceeds from pilot studies in laboratories, to initial implementations in field sites, to scale-up studies designed to generate “reliable estimates of the ability of a fully-developed intervention or strategy to achieve its intended outcomes” across multiple, diverse, real-world contexts (11). Many large grants available to researchers require that they hold their intervention constant across contexts.

Our present study confirms a principle that is central to social psychology and the learning sciences: Context matters. Alongside large-scale studies that test a single, fully developed intervention across multiple contexts, “scale-up” funding should be available for approaches that assume interventions will need to be constantly refined and modified to support specific groups of people across diverse contexts. These studies would be designed to respond to concerns of temporal validity, the notion that the effectiveness of interventions can vary as contexts and populations change over time (35). Rather than treating large-scale studies as the conclusion of a research trajectory, scale-up studies should support new research into context-level variation that cannot be explored in small field trials. We encourage greater focus on the characteristics of different contexts that induce variation in the effects of interventions to advance the development of a science of context in education. In a new paradigm, the question of “what works?” would be replaced with “what works, for whom, right here?”"

Do we have examples of systems that do this well, or at least are contenders for consideration?
