Computational Modeling and Data Mining
One of the greatest impacts of technology on 21st-century education will be the scientific advances made possible by mining the vast explosion of learning data that is coming from educational technologies. The Computational Modeling and Data Mining (CMDM) Thrust is pursuing the scientific goal of using such data to advance precise, computational theories of how students learn academic content. We will accomplish this by drawing on and expanding the enabling technologies we have already built for collecting, storing, and managing large-scale educational data sets. For example, DataShop will grow to include larger and richer datasets coming not only from our LearnLab courses but also from thousands of schools using the Cognitive Tutor courses and from additional contexts where we can collect student dialogue data, measures of motivation and affect, and layered assessments of both student knowledge and metacognitive competencies. This growth in the amount, scope, and richness of learning data will make DataShop an even more fertile cyber-infrastructure resource for learning science researchers. But to realize the full potential of that resource and make new discoveries about the nature of student learning, researchers need new and powerful knowledge discovery tools. Developing those tools is the work of the CMDM Thrust.
The CMDM Thrust will pursue three related areas: 1) domain-specific models of student knowledge representation and acquisition, 2) domain-general models of metacognitive, motivational, and social processes as they impact student learning, and 3) predictive engineering models and methods that enable the design of large-impact instructional interventions.
Developing Better Cognitive Models of Domain-Specific Content. Understanding and engineering better human learning of complex academic topics is dependent upon accurate and usable models of the domains students are learning. However, domain modeling has been a continual challenge, as student knowledge is not directly observable and its structure is often hidden by our “expert blind spots” (Koedinger & Nathan, 2004; Nathan & Koedinger, 2000). Key research questions are: a) Can the discovery of a domain’s knowledge structure be automated? b) Do knowledge component models provide a precise and predictive theory of transfer of learning? c) Can we integrate separate methods for modeling memory, learning, transfer, and guessing/slipping, to optimize models of student knowledge, and in turn optimize students' effective time on task?
One of the planned projects for Year 5 will build on our promising past results, obtained with the Cen, Koedinger, and Junker (2006) Learning Factor Analysis (LFA) algorithms. Specifically, we will broaden the generalizability of this domain-modeling approach, incorporate new knowledge-discovery methods, and increase the level of automation of knowledge analysis so as to engage more researchers in applying this technique to even more content domains. To more fully automate the discovery of knowledge components, Pavlik will use Partially Ordered Knowledge Structures (POKS) (cf. Desmarais et al., 1995) to build more complete and accurate representations of the given domain and to capture the prerequisite relationships between hypothesized knowledge components and their predictions of performance. The models that this work produces will become the input to algorithms that can optimize, for each student, the amount of practice and the sequencing of instructional events for acquiring each knowledge component. These approaches will be applied to tutors across domains, including math, science, and language (particularly English vocabulary and article learning). A related project will investigate the impact of combining LFA model refinement with improved moment-by-moment knowledge modeling, using a probabilistic model that uses student interaction data to estimate whether a student’s correct answer or error informs us about their knowledge or simply represents a guess or slip (Baker, Corbett & Aleven, 2008). In addition to clear applied benefits, these projects will advance a more precise science of reasoning and learning as it occurs in academic settings.
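As an illustration of the two kinds of models this paragraph refers to, the following is a minimal sketch, not the project's implementation. It assumes the logistic form commonly used with LFA (often called the Additive Factors Model), in which the log-odds of a correct response is a student proficiency plus, for each knowledge component the step requires, an easiness term and a learning-rate term scaled by prior practice opportunities. It also assumes a standard knowledge-tracing update with fixed guess and slip probabilities; the contextual estimation of guess and slip described above is not reproduced here. All function names, parameter names, and numeric values are hypothetical.

```python
import math


def afm_probability(theta, betas, gammas, q_row, opportunities):
    """Probability of a correct response under an additive-factors-style model.

    theta         -- student proficiency (assumed scalar)
    betas         -- easiness of each knowledge component (KC)
    gammas        -- learning rate of each KC
    q_row         -- 0/1 flags: which KCs this step requires
    opportunities -- prior practice opportunities the student had on each KC
    """
    logit = theta
    for q, beta, gamma, t in zip(q_row, betas, gammas, opportunities):
        logit += q * (beta + gamma * t)
    return 1.0 / (1.0 + math.exp(-logit))


def knowledge_tracing_update(p_known, correct, guess, slip, learn):
    """One knowledge-tracing step with fixed guess/slip (illustrative values).

    First condition the probability that the KC is known on the observed
    response, then allow for learning at this opportunity.
    """
    if correct:
        numerator = p_known * (1.0 - slip)
        denominator = numerator + (1.0 - p_known) * guess
    else:
        numerator = p_known * slip
        denominator = numerator + (1.0 - p_known) * (1.0 - guess)
    posterior = numerator / denominator
    return posterior + (1.0 - posterior) * learn


# Hypothetical two-KC domain; the step below requires only the first KC.
p_early = afm_probability(0.0, [-0.5, 0.2], [0.3, 0.1], [1, 0], [0, 0])
p_later = afm_probability(0.0, [-0.5, 0.2], [0.3, 0.1], [1, 0], [5, 0])

# A correct answer raises the knowledge estimate; an error lowers it.
after_correct = knowledge_tracing_update(0.5, True, guess=0.2, slip=0.1, learn=0.0)
after_error = knowledge_tracing_update(0.5, False, guess=0.2, slip=0.1, learn=0.0)
```

In this sketch, comparing alternative knowledge component models amounts to changing the `q_row` assignments and checking which assignment yields better predictions of student performance.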
Developing Models of Domain-General Learning and Motivational Processes. Our work toward developing high-fidelity models of student learning has involved capturing, quantifying, and modeling domain-general mechanisms that impact students’ learning and the robustness of that learning. In the first four years of the PSLC, our models have moved beyond addressing domain-specific cognition (e.g., the cognitive models behind the intelligent tutors for Physics, Algebra, and Geometry) to capture metacognitive aspects of learning (e.g., Aleven et al.’s, 2006, detailed model of help-seeking behavior), general mechanisms of learning (Matsuda et al., 2007), and motivational and affective constructs such as students’ off-task behavior (Baker, 2007) and whether a student is “gaming the system” (Baker et al., 2008; shown to be associated with boredom and confusion in Rodrigo et al., 2007).
A key Year 5 effort will extend the SimStudent project both as a theory-building tool and as an instruction-informing tool. We will use SimStudent to make predictions about the nature of students’ generalization errors and the effects of prior knowledge on students’ learning and transfer, testing these predictions using human-learning data in DataShop. While psychological and neuroscientific models typically produce only reaction time predictions, these models will predict specific errors and forecast the pattern of reduction in those errors. Developing a system that integrates domain-general processes to produce human-like errors in inference, calculation, generalization, and the use of feedback/help/instructions would be both a major theoretical breakthrough and an extremely useful tool for other researchers.
Looking forward to the renewal period, an important project will be to develop machine-learned models of student behaviors at a range of time scales, from momentary affective states like boredom and frustration (cf. Kapoor, Burleson, & Picard, 2007) to longer-term motivational and metacognitive constructs such as performance vs. learning orientation and self-regulated learning (Azevedo & Cromley, 2004; Elliott & Dweck, 1988; Pintrich, 2000; Winne & Hadwin, 1998). We will expand prior PSLC work by Baker and colleagues (Rodrigo et al., 2007, in press; Baker et al., in press) to explore causal connections between these models and existing models of motivation-related behaviors such as gaming the system and off-task behavior. We will pursue models of differences in cognitive, affective, social, and motivational factors as they relate to classroom culture, schools, and teachers. These proposed models would be, to our knowledge, the first systematic investigations of school-level effects on fine-grained states of student learning.
Developing Predictive Models to Inform Instructional Event Design. A fundamental theoretical problem for the sciences of learning and instruction is what we have called “the Assistance Dilemma”: optimizing the amount and timing of instruction so that it is neither too little nor too much, and neither too early nor too late (Koedinger & Aleven, 2007; Koedinger, 2008; Koedinger, Pavlik, McLaren, & Aleven, 2008). Two theoretical advances are necessary before we can resolve these broad questions. First, we need a clear delineation of the multiple possible dimensions of instructional assistance (e.g., worked examples, feedback, on-demand hints, self-explanation prompts, or optimally-spaced practice trials). We broadly define assistance to include not only direct verbal instruction, but also instructional scaffolds that prompt student thinking or action as well as implicit affordances or difficulties in the learning environment. Second, we need precise, predictive models of when increasing assistance (reducing difficulties) or decreasing assistance (increasing difficulties) is best for optimal robust learning. Existing theoretical work on this topic, such as cognitive load theory (van Merrienboer & Sweller, 2005; Sawicka, 2008), desirable difficulties (Bjork, 1994), and cognitive apprenticeship (Collins, Brown, & Newman, 1989), has not reached the stage of precise computational modeling that can be used to make a priori predictions about optimal levels of assistance.
We will use DataShop log data to make progress on the Assistance Dilemma by targeting dimensions of assistance one at a time and creating parameterized mathematical models that predict the optimal level of assistance to enhance robust learning (cf. Koedinger et al., 2008). Such a mathematical model has been achieved for the practice-interval dimension (changing the amount of time between practice trials), and progress is being made on the example-problem dimension (changing the ratio of examples to problems). These models generate the inverted-U-shaped curve characteristic of the Assistance Dilemma as a function of particular parameter values that describe the instructional context. These models are created and refined using student learning data from DataShop. We hypothesize that this approach will work for other dimensions of assistance. These models will address the limitations of current theory indicated above by generating a priori predictions of what forms of assistance or difficulty will enhance learning. Further, these models will provide the basis for on-line algorithms that adapt to individual student differences and changes over time, optimizing the assistance provided to each student for each knowledge component at each time in their learning trajectory.
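To make the inverted-U idea concrete, here is a toy sketch of how a parameterized model can yield an interior optimum along one dimension of assistance. It is not the practice-interval or example-problem model cited above. It simply assumes that the chance of succeeding on an instructional event rises with assistance (a logistic curve) while the learning benefit of each success falls with assistance (an exponential decay), so that their product peaks at an intermediate level. The functional forms, the parameter names `midpoint`, `steepness`, and `decay`, and their default values are all hypothetical.

```python
import math


def expected_gain(assistance, midpoint=0.4, steepness=8.0, decay=2.0):
    """Toy expected learning gain at an assistance level in [0, 1].

    p_success rises with assistance; benefit per success falls with it.
    Both functional forms are illustrative assumptions.
    """
    p_success = 1.0 / (1.0 + math.exp(-steepness * (assistance - midpoint)))
    benefit_if_success = math.exp(-decay * assistance)
    return p_success * benefit_if_success


def best_assistance(levels, **params):
    """Grid search for the assistance level with the highest expected gain."""
    return max(levels, key=lambda a: expected_gain(a, **params))


# Evaluate 101 evenly spaced assistance levels from 0.00 to 1.00.
levels = [i / 100 for i in range(101)]
best = best_assistance(levels)

# The optimum is interior: too little and too much assistance both do worse.
gain_none = expected_gain(0.0)
gain_best = expected_gain(best)
gain_full = expected_gain(1.0)
```

In a model of this kind, parameters such as `midpoint` would be estimated from student learning data for a given instructional context, and the location of the peak would shift as those parameters change for different students or knowledge components.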