What This Page Is
This page exists for one reason. The TheoremPath site has a specific shape: a prerequisite directed acyclic graph, explicitly typed topic pages, FSRS-based review scheduling, BKT-flavoured mastery tracking, IRT-style item difficulty calibration, and a deliberate refusal to claim personalization the underlying calibration cannot support. Every one of those is a design choice with an evidentiary basis, and every one of those choices was made over plausible alternatives. This page states which empirical finding each choice implements, names the alternatives that were not taken, and identifies where the TheoremPath implementation diverges from the literature.
The page is one of five PedagogyPath entries that together make the case for the TheoremPath architecture explicit. The other four are bayesian-knowledge-tracing-for-educators, item-response-theory-for-educators, fsrs-spaced-repetition-for-educators, and intelligent-tutoring-systems. This page is the spine that connects them.
There is no original empirical claim here. The page restates findings established in the cited literature and identifies which ones the TheoremPath production system is built around. The original work is in the combination and the constraints: which findings combine well, which trade off, and which are deferred because the calibration data does not yet support them.
The Four Pillars
TheoremPath's adaptive architecture rests on four design decisions, each of which has a separate empirical foundation and a separate documentation page.
| Pillar | TheoremPath implementation | Empirical foundation |
|---|---|---|
| 1. Prerequisite DAG over topics | Frontmatter prerequisites: field on every topic page; build-time validation that prerequisites resolve; sitemap ordering reflects DAG depth | Cognitive load theory (Sweller 1988, 2019); the working-memory limit on what can be processed simultaneously; element interactivity as a function of prerequisite mastery |
| 2. FSRS-based review scheduling | ReviewCard Prisma model; src/lib/fsrs/scheduler.ts implementation; default target retention 0.9 | The spacing effect (Cepeda et al. 2008); retrieval practice (Roediger and Karpicke 2006); the Dunlosky et al. (2013) "high-utility" classification |
| 3. BKT-flavoured mastery layer | TopicAssessment and AssessmentAttempt models; src/lib/mastery/scoring.ts discrete-band tracking; difficulty-aware updates | The original Corbett-Anderson (1995) BKT formulation; the modern PFA / DKT alternatives (Pavlik 2009; Piech 2015); Pelánek (2017) review |
| 4. IRT-style item difficulty calibration | Per-item difficulty parameter (1-10 band) on every assessment item; src/lib/adaptive/difficulty-calibration.ts audit logic; audit-difficulty.ts script | Item response theory (Lord 1980; Embretson and Reise 2000); the Rasch tradition (Rasch 1960); the practical limits of IRT on small calibration samples |
Each pillar has a dedicated PedagogyPath page that covers it in the depth its empirical literature deserves. The remainder of this page treats the pillars as a system: how they interact, where they support each other, where they trade off, and what the architecture deliberately does not yet do.
Pillar 1: The Prerequisite DAG
What TheoremPath does
Every topic page on TheoremPath has an explicit prerequisites: array in its YAML frontmatter, naming the slugs of topics whose content the current topic builds on. The build-time content lint validates that every prerequisite slug resolves to an existing topic page; broken references fail the build. The site's navigation, the layered tier system (layer 0A foundations through layer 5 advanced topics), and the machine-readable graph snapshot at data/graph/snapshot.json all rest on this DAG.
The DAG is hand-authored. There is no automatic prerequisite inference. The cost is editorial: the graph requires explicit maintenance as topics are added or revised. The benefit is that the structure encodes deliberate pedagogical judgment, not a correlation extracted from co-occurrence in text.
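To make the mechanism concrete, here is a minimal sketch of the two build-time checks, resolution and acyclicity. The slug and prerequisites field names follow the frontmatter convention above; everything else (the Topic type, the function name) is illustrative, not the production lint.

```typescript
// Illustrative sketch of the two build-time checks; `slug` and
// `prerequisites` follow the frontmatter convention described above,
// everything else is hypothetical rather than the production lint.
interface Topic {
  slug: string;
  prerequisites: string[];
}

function validatePrerequisiteDag(topics: Topic[]): void {
  const bySlug = new Map(topics.map((t): [string, Topic] => [t.slug, t]));

  // Check 1: every prerequisite slug must resolve to an existing topic page.
  for (const topic of topics) {
    for (const prereq of topic.prerequisites) {
      if (!bySlug.has(prereq)) {
        throw new Error(`${topic.slug}: unresolved prerequisite "${prereq}"`);
      }
    }
  }

  // Check 2: the graph must be acyclic (depth-first search with marks).
  const state = new Map<string, "visiting" | "done">();
  const visit = (slug: string): void => {
    if (state.get(slug) === "done") return;
    if (state.get(slug) === "visiting") {
      throw new Error(`prerequisite cycle through "${slug}"`);
    }
    state.set(slug, "visiting");
    for (const prereq of bySlug.get(slug)!.prerequisites) visit(prereq);
    state.set(slug, "done");
  };
  for (const topic of topics) visit(topic.slug);
}
```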
What the empirical literature supports
The deepest empirical foundation for the prerequisite DAG is cognitive load theory (Sweller 1988, Cognitive Science 12(2): 257-285; Sweller, van Merriënboer, and Paas 2019, Educational Psychology Review 31: 261-292). The core claim: working memory has limited capacity for processing novel information; learning happens when items can be related to existing schema in long-term memory; presenting material whose prerequisite schema are not yet in place imposes extraneous cognitive load that crowds out the germane load that actually produces learning.
The practical consequence: ordering instruction so that prerequisite material is mastered before dependent material is introduced reduces extraneous load and improves learning rates. This is well-established empirically. The implementation question is how to enforce the ordering.
The alternative implementations TheoremPath did not adopt
Three plausible alternatives:
Linear curriculum. A single fixed sequence. Standard in textbooks; simple to implement; loses the ability to take multiple paths through the same material. TheoremPath rejects this because the same theorems appear naturally in multiple curricula (concentration inequalities arrive from probability, statistics, and learning-theory directions); a linear curriculum forces an artificial main thread.
Learned prerequisite extraction. Use co-occurrence in the text or response data to infer prerequisite relations. This is an active research area (the educational-data-mining literature on prerequisite inference; Liang et al. 2018, AAAI). TheoremPath rejects this because (a) the co-occurrence signal is weak in our corpus, (b) the inferred relations frequently conflict with author judgment, and (c) editorial accountability matters more than scaling for a research-reference site.
Tag-based loose graph. Tags on each page, no explicit DAG. Easier to maintain; loses the build-time validation. Rejected because the validation is the point: a broken prerequisite breaks the build, which forces the maintainer to fix it.
Where the implementation diverges from the literature
Cognitive load theory talks about element interactivity per lesson or worked example, not per topic page. TheoremPath treats a topic as the unit; the assumption is that a topic page is roughly the granularity at which prerequisite mastery matters. This is a simplification. A more granular implementation would track prerequisite relations at the level of individual theorems or definitions within a page. The tradeoff is editorial maintainability: the topic-level DAG has ~700 nodes, which is manageable; a theorem-level DAG would have ~10,000 nodes and would be a substantial maintenance burden.
Pillar 2: FSRS-Based Review Scheduling
What TheoremPath does
Every assessment question a learner has attempted becomes a review card with FSRS state (stability, difficulty, due time, last review, review count, lapse count, FSRS state code). The scheduler in src/lib/fsrs/scheduler.ts implements the FSRS update equations on each review and selects the next due time to target a default retention rate of 0.9. The user-facing review surface (/daily-review) shows due cards in a queue ordered by priority.
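To make the scheduling concrete, here is a sketch using the published FSRS-4.5 forgetting curve (the production scheduler may use a different FSRS revision, and also implements the per-review stability and difficulty updates, omitted here).

```typescript
// Sketch of the FSRS-4.5 forgetting curve and the interval it implies.
// The full scheduler also updates stability and difficulty per review.
const DECAY = -0.5;
const FACTOR = 19 / 81; // chosen so that retrievability(S, S) = 0.9

// Probability of recall t days after review, given memory stability S (days).
function retrievability(t: number, stability: number): number {
  return Math.pow(1 + (FACTOR * t) / stability, DECAY);
}

// Days until retrievability decays to the target retention (default 0.9).
function nextIntervalDays(stability: number, targetRetention = 0.9): number {
  return (stability / FACTOR) * (Math.pow(targetRetention, 1 / DECAY) - 1);
}

// With the default target of 0.9, the next interval equals the stability:
// nextIntervalDays(10) === 10.
```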
The fsrs-spaced-repetition-for-educators page covers FSRS in the depth it deserves; this section restates the design choice in the context of the larger TheoremPath architecture.
What the empirical literature supports
Three findings, each well-replicated:
The spacing effect. Distributed practice produces better long-term retention than massed practice, with optimal absolute spacing gaps that depend on the desired retention horizon (Cepeda, Vul, Rohrer, Wixted, and Pashler 2008, Psychological Science 19(11): 1095-1102). The Dunlosky et al. (2013) review classifies "distributed practice" as one of two "high-utility" study techniques.
The testing effect (retrieval practice). Active retrieval strengthens future retrieval more than re-studying the same material (Roediger and Karpicke 2006, Psychological Science 17(3): 249-255). Karpicke's broader research program has established this finding across content domains and learner populations.
Effective scheduling captures the spacing benefit. Empirical comparisons of spaced-repetition algorithms (particularly the FSRS-vs-SM-2 calibration benchmarks) show that better-calibrated schedulers track the target retention substantially more closely, which translates into either fewer reviews at the same retention or higher retention at the same review burden.
The alternative implementations TheoremPath did not adopt
SM-2 (the SuperMemo-2 algorithm). The standard scheduler for the prior generation of spaced-repetition tools. Rejected because FSRS produces measurably better calibration on the public benchmarks and because FSRS's per-learner parameter fitting handles individual differences in memory dynamics that SM-2 cannot.
Half-life regression (Settles and Meeder 2016, ACL). A logistic-regression-based scheduler used by Duolingo. Rejected because TheoremPath's review volume per learner is too small for regression-based per-feature calibration to be meaningful; FSRS's smaller per-learner parameter set fits TheoremPath's data better.
Fixed-interval review (the Leitner system). Cards advance through discrete boxes on success. Rejected because the calibration to retention is much coarser; the spacing-effect benefit is partially captured but the per-card variance is large.
Where the implementation diverges from the literature
FSRS, as published, calibrates to a target retention rate that the learner sets. TheoremPath defaults to 0.9 for all learners and exposes the parameter only on the settings page. For most learners this under-customizes. The reason is operational: the cost of a wrong out-of-the-box default is low (mild over- or under-review), and most learners do not have a strong prior on what the right rate is. Customization is available; it is not the default surface.
A more substantial divergence: FSRS schedules review of cards, which are typically isolated facts. TheoremPath schedules review of assessment questions, which often require synthesis. The mismatch matters: a question that requires applying a theorem in a new context is not memory-only, and FSRS's model of memory half-life does not fully capture the cognitive work involved. The practical workaround is to pair FSRS-scheduled review with the underlying topic-page material; the page is the explanation that lets a learner produce the answer. The page does not substitute for the practice; the practice does not substitute for the page.
Pillar 3: BKT-Flavoured Mastery Layer
What TheoremPath does
Each user's mastery of each topic is tracked through the TopicAssessment Prisma model, with the underlying evidence in the append-only AssessmentAttempt log. The mastery state is a discrete band (unassessed, learning, working knowledge, fluent), updated on each assessment attempt with weighting by item difficulty.
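As an illustration of the band mapping, a minimal sketch; the thresholds below are hypothetical, and the production values live in src/lib/mastery/scoring.ts.

```typescript
// Illustrative band mapping; thresholds are hypothetical placeholders,
// the production values live in src/lib/mastery/scoring.ts.
type MasteryBand = "unassessed" | "learning" | "working-knowledge" | "fluent";

function toBand(attempts: number, pMastery: number): MasteryBand {
  if (attempts === 0) return "unassessed";
  if (pMastery >= 0.9) return "fluent";
  if (pMastery >= 0.6) return "working-knowledge";
  return "learning";
}
```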
The bayesian-knowledge-tracing-for-educators page covers BKT in the depth it deserves; this section restates the design choice in context.
What the empirical literature supports
BKT and its descendants (PFA, DKT, knowledge-component models) all rest on the same underlying claim: a learner's recent response history on items exercising a skill is informative about their probability of success on future items exercising the same skill, and this inference can be operationalized as a probabilistic state-space model. Corbett and Anderson (1995) established the model; thirty years of educational-data-mining work have refined it.
The headline practical claim: a BKT-style mastery estimate predicts future correctness on items in the same skill domain better than simple running-accuracy estimates do (Pelánek 2017), though by a relatively modest margin and at the cost of parameter-identifiability concerns (Beck and Chang 2007).
The alternative implementations TheoremPath did not adopt
Pure running-accuracy mastery. The fraction of recent correct responses. Rejected because it does not weight items by difficulty (a learner who succeeds on five easy items looks the same as one who succeeds on five hard items) and does not have a principled probabilistic interpretation.
Vanilla BKT. Four parameters per skill, Bayesian updating. Rejected primarily because the parameter identifiability issues are severe at TheoremPath's data volume; BKT works best on millions of responses across hundreds of items per skill, which is more data than the TheoremPath mastery substrate currently has.
Deep Knowledge Tracing (DKT). A neural sequence model. Rejected because of (a) data volume, (b) interpretability, and (c) the operational cost of training and updating a neural model versus the tractable BKT-flavoured update rule. DKT has better predictive accuracy on benchmark datasets; that advantage does not translate at TheoremPath's scale.
Performance Factor Analysis. A logistic-regression alternative that combines item difficulty with success and failure counts. The TheoremPath implementation is closer to PFA than to vanilla BKT in spirit, and the src/lib/learning-model/weighted-pfa.ts module reflects this (see the sketch below).
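A minimal sketch of that PFA-flavoured update with difficulty weighting layered on; the parameter values and the weighting rule are illustrative placeholders, not the production values in weighted-pfa.ts.

```typescript
// PFA-style logistic mastery estimate with difficulty-weighted counts.
// Parameter values are placeholders for illustration only; the production
// parameters live in src/lib/learning-model/weighted-pfa.ts.
interface Attempt {
  correct: boolean;
  difficulty: number; // author-set 1-10 band
}

function pMastery(attempts: Attempt[]): number {
  const beta = -1.0; // skill easiness intercept
  const gamma = 0.4; // credit per weighted success
  const rho = 0.2;   // penalty per weighted failure

  let logit = beta;
  for (const a of attempts) {
    const w = a.difficulty / 5; // harder items carry more evidence
    logit += a.correct ? gamma * w : -rho * w;
  }
  return 1 / (1 + Math.exp(-logit));
}
```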
Where the implementation diverges from the literature
The TheoremPath mastery layer uses discrete bands rather than a continuous mastery probability. This is a deliberate simplification: the bands are calibrated to map to user-facing labels that learners can act on. A continuous probability estimate would be more information-rich but also more brittle, and would invite over-interpretation by learners.
The mastery state is also surfaced under a feature flag (ADAPTIVE_MASTERY_DASHBOARD) rather than as a public claim about personalization. The substrate framing in docs/ADAPTIVE_LEARNING_KERNEL.md makes this explicit: until the underlying calibration is validated against external criteria, the mastery estimate is treated as research evidence, not as a personalization commitment.
Pillar 4: IRT-Style Item Difficulty Calibration
What TheoremPath does
Every assessment question carries a difficulty parameter on a 1-10 scale, with the convention documented in CLAUDE.md's Quiz System section: 1-3 = foundation, 4-6 = intermediate, 7-10 = advanced. The rating is set by the question author, audited by the audit-difficulty.ts script, and used in the mastery-update logic to weight evidence.
The item-response-theory-for-educators page covers IRT in the depth it deserves; this section restates the design choice in context.
What the empirical literature supports
Item response theory is the methodological foundation of high-stakes psychometric assessment. The 1PL Rasch, 2PL, and 3PL models provide a principled framework for placing item difficulty and learner ability on a common scale, with calibration procedures that scale to large item banks.
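For orientation, the 2PL model places learner ability $\theta$ and item difficulty $b$ on a shared logistic scale, with a per-item discrimination parameter $a$; the Rasch (1PL) model fixes $a = 1$:

$$
P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}
$$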
For TheoremPath, the relevant claim is operational: weighting mastery-update evidence by item difficulty produces more informative mastery estimates than treating all items as exchangeable. This is not in dispute in the literature.
The alternative implementations TheoremPath did not adopt
Continuous IRT calibration. Per-item 2PL or 3PL parameters fit by marginal maximum likelihood. Rejected because the data-volume requirements (several hundred respondents per item) exceed TheoremPath's current evidence base. The fits would be unstable.
No difficulty calibration. Treat all items as exchangeable. Rejected because mixing easy and hard items in a mastery update produces estimates that are systematically wrong (a learner who succeeded on three easy items and failed on two hard ones should not look the same as one who succeeded on the hard ones and failed on the easy ones).
Crowdsourced difficulty. Use response data to estimate difficulty without author input. Rejected for the same data-volume reason; the response data is too sparse on most items to drive calibration.
Where the implementation diverges from the literature
The 1-10 author-set scale is much coarser than a continuous IRT parameter. This is deliberate. The audit-difficulty.ts script computes a response-data estimate of difficulty per item and flags items where the author label diverges substantially from the data estimate; these are candidates for re-labelling or for editorial revision. The intent is to use the discipline of IRT (item difficulty as a parameter that can be calibrated and audited) without committing to the data-intensive continuous-parameter machinery.
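A sketch of the audit's comparison step, with a hypothetical mapping from observed success rate to the 1-10 band; the real estimator and flagging thresholds live in audit-difficulty.ts.

```typescript
// Sketch of the audit comparison: map an item's observed success rate onto
// the 1-10 band and flag items whose author label diverges. The mapping and
// thresholds are hypothetical; see audit-difficulty.ts for the real ones.
interface ItemStats {
  id: string;
  authorDifficulty: number; // 1-10, set by the question author
  attempts: number;
  successes: number;
}

function flagDivergentItems(items: ItemStats[], minAttempts = 30): string[] {
  return items
    .filter((item) => item.attempts >= minAttempts)
    .filter((item) => {
      const successRate = item.successes / item.attempts;
      // Higher success rate implies a lower empirical difficulty band.
      const empirical = Math.round(1 + 9 * (1 - successRate));
      return Math.abs(empirical - item.authorDifficulty) >= 3;
    })
    .map((item) => item.id);
}
```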
How the Pillars Combine
The four pillars are not independent. They support each other, they trade off against each other, and they constrain each other in specific ways.
Prerequisite DAG plus mastery layer. A prerequisite DAG without mastery tracking gives the learner an ordering but no feedback on whether the ordering is being honored. A mastery layer without a DAG gives the learner feedback per topic but no guidance on what to study next. The two together give an adaptive learning surface: the DAG names the candidate next topics; the mastery layer ranks them.
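A minimal sketch of that composition, with illustrative types and an illustrative mastery threshold: the DAG supplies the frontier, the mastery layer orders it.

```typescript
// Sketch of DAG-plus-mastery composition: the DAG names candidate next
// topics (all prerequisites mastered, topic itself not yet mastered); the
// mastery layer ranks them. Types and the threshold are illustrative.
interface TopicNode {
  slug: string;
  prerequisites: string[];
}

function nextTopics(
  dag: TopicNode[],
  mastery: Map<string, number>, // slug -> mastery estimate in [0, 1]
  masteredAt = 0.8,
): string[] {
  const isMastered = (slug: string) => (mastery.get(slug) ?? 0) >= masteredAt;
  return dag
    .filter((t) => !isMastered(t.slug))               // not yet mastered
    .filter((t) => t.prerequisites.every(isMastered)) // frontier of the DAG
    .sort((a, b) => (mastery.get(b.slug) ?? 0) - (mastery.get(a.slug) ?? 0))
    .map((t) => t.slug);
}
```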
FSRS plus mastery layer. Mastery tracking measures whether the learner currently knows the material. FSRS schedules review to maintain that knowledge over time. Without FSRS, the mastery estimates decay silently as the learner forgets. Without mastery, FSRS reviews material the learner has not yet mastered. The two together support both the acquisition phase and the retention phase of learning.
IRT plus mastery layer. The mastery update is weighted by item difficulty. Without difficulty calibration, the mastery update treats all items as equivalent, producing the systematic bias described above. With it, the update is informative: a correct answer on a hard item moves mastery up faster than a correct answer on an easy item.
FSRS plus IRT. FSRS schedules review at the level of individual cards or questions; IRT calibrates the items. The combination supports adaptive item selection: at review time, the system can choose a difficulty appropriate to the current mastery state, not just the next due card. TheoremPath does not yet do this in production (the review queue is sorted by due date alone), but the pillars are in place to support it.
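Purely as illustration of what the combination would support (again: the production queue sorts by due date alone, and nothing below exists in the codebase), a sketch of difficulty-matched selection among due cards:

```typescript
// Hypothetical, not production behaviour: among due cards, prefer the item
// whose difficulty best matches the learner's current mastery of its topic.
interface DueCard {
  id: string;
  topicSlug: string;
  difficulty: number; // 1-10
  dueAt: Date;
}

function pickNextCard(
  due: DueCard[],
  mastery: Map<string, number>, // slug -> estimate in [0, 1]
): DueCard | undefined {
  const targetDifficulty = (slug: string) =>
    1 + 9 * (mastery.get(slug) ?? 0.5); // higher mastery -> harder items
  return [...due].sort(
    (a, b) =>
      Math.abs(a.difficulty - targetDifficulty(a.topicSlug)) -
      Math.abs(b.difficulty - targetDifficulty(b.topicSlug)),
  )[0];
}
```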
What TheoremPath Deliberately Does Not Do
Five things the architecture deliberately defers.
No automated tutoring loop. TheoremPath has the substrate of an intelligent tutoring system (domain model, learner model, evidence log) without a tutor-loop policy that selects the next item or hint based on the current state. The intelligent-tutoring-systems page covers the tradition; the substrate framing in docs/ADAPTIVE_LEARNING_KERNEL.md is explicit that the policy problem is treated as contextual bandit / rules-plus-evaluation rather than reinforcement learning, and that durable action / impression logging is identified as a future need.
No reinforcement learning from production behaviour. RL requires a faithful action / impression log so that policy training can compare what was shown to what could have been shown. TheoremPath's evidence log captures attempts but not the recommendations that led to those attempts; closing this gap is identified as future foundational work.
No personalization claim beyond what the calibration supports. Mastery estimates are gated behind feature flags; recommendation surfaces show ranked candidates without claiming personalized routing. This is deliberate restraint, not an oversight. The site does not claim to know more about the learner than the underlying calibration supports.
No paid model calls in CI or default flows. The adaptive substrate runs on deterministic data structures (Prisma rows, JSON artifacts). External model calls are reserved for explicit operator-initiated work. This keeps the substrate auditable and the per-learner cost zero.
No affect-aware tutoring. Modern ITS research is increasingly affect-aware (D'Mello and Graesser 2014). The TheoremPath substrate does not capture engagement signals (dwell time, hint usage, frustration) at sufficient fidelity to support affect-aware policy work. This is a known gap.
What This Page Does Not Claim
This page does not claim TheoremPath is a finished system. The deferred items above are deliberate omissions, not problems solved. The architecture is designed to absorb them as the calibration evidence and the operational machinery mature.
This page does not claim the four-pillar combination is optimal. There are plausible alternative architectures (a single deep-learning learner model unifying mastery and review; an ITS-style tutor loop; an LLM-backed conversational tutor) that would produce different tradeoffs. TheoremPath's choice reflects current calibration data, current interpretability priorities, and current editorial discipline, not a claim that no other architecture could work.
This page does not claim every pedagogical finding cited above is settled. Cognitive load theory has critics (Schnotz and Kürschner 2007, Educational Psychology Review); the spacing effect's optimal-gap function depends on retention horizon in ways that simplifications miss; BKT parameters are non-identifiable in known cases. The page cites the standard references; it does not claim the standard references are the final word.
FAQ
Why use BKT, FSRS, and IRT together rather than a single deep model?
The deep alternatives (Deep Knowledge Tracing, transformer-based knowledge tracing, sequence models for spaced repetition) have better predictive accuracy on benchmark datasets but worse interpretability and worse small-data behaviour. At TheoremPath's data volume, the interpretable component-wise architecture produces more reliable estimates than a single end-to-end neural model would. The data-volume argument may flip if TheoremPath's user base grows substantially.
Is TheoremPath's adaptive system "AI-powered"?
The phrase is doing too much work. The adaptive system uses specific probabilistic models (BKT-flavoured mastery, FSRS scheduling, IRT-style calibration), which are AI in the broad-discipline sense. It does not use a large language model or a general-purpose deep learning system as the adaptive engine. The substrate framing in docs/ADAPTIVE_LEARNING_KERNEL.md is honest about which kinds of AI are in use and which are not.
Why isn't there a personalized recommendation surface?
There is partial recommendation infrastructure (the RecommendationEvent Prisma model logs surface decisions), but the user-facing recommendation surface is currently rules-based rather than learned. The reason is the data-volume argument above: a learned policy needs durable action / impression logging that does not yet exist at the right fidelity. Rules-plus-evaluation is the honest level of commitment.
How does this connect to the broader cognitive-science of learning?
Each pillar has a dedicated PedagogyPath page covering the empirical literature. The bayesian-knowledge-tracing-for-educators, item-response-theory-for-educators, fsrs-spaced-repetition-for-educators, and intelligent-tutoring-systems pages are the four companion entries. PedagogyPath is the site where the cognitive-science findings live in their own right; this thesis page is where the TheoremPath implementation is mapped onto them.
What if a finding cited here is later overturned?
The architecture is component-wise: each pillar has its own empirical foundation and its own implementation. If one finding is overturned (or, more realistically, refined), the corresponding pillar can be updated without rewriting the others. This is one of the operational benefits of the component-wise design over a single end-to-end model: the update path is local.
Why does this page exist as part of PedagogyPath rather than TheoremPath?
PedagogyPath documents the science and practice of teaching and learning. TheoremPath is the technical reference site that uses that science. The thesis page belongs on the pedagogy site because it is a pedagogical-research synthesis applied to a specific implementation; placing it on TheoremPath would mix the technical and pedagogical registers. The cross-network linking strategy in the path-network-domains inventory makes this carve-up binding.
Internal links
- PedagogyPath: bayesian-knowledge-tracing-for-educators; item-response-theory-for-educators; fsrs-spaced-repetition-for-educators; intelligent-tutoring-systems; plato-as-teacher for the older pedagogical tradition that the TheoremPath architecture inherits in spirit if not in form.
- TheoremPath: editorial-principles for the broader site-design context; common-probability-distributions, hypothesis-testing-for-ml, and maximum-likelihood-estimation for the underlying probabilistic and statistical machinery the adaptive substrate uses.
- PhilosophyPath: empiricism-induction-and-llm-limits for the philosophical context of architecture-versus-data questions; the plato-vs-socrates source-discipline framework that the PedagogyPath site inherits.
- ProofsPath: polya-and-how-to-solve-it (when written) for the practice-side pedagogical companion that bridges to the proof-writing material on ProofsPath.
Sources and further reading
Cognitive load theory:
- Sweller, J. "Cognitive Load During Problem Solving: Effects on Learning." Cognitive Science 12(2) (1988): 257-285.
- Sweller, J., van Merriënboer, J. J. G., and Paas, F. "Cognitive Architecture and Instructional Design: 20 Years Later." Educational Psychology Review 31 (2019): 261-292. The standard updated review.
Spacing and retrieval:
- Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., and Pashler, H. "Spacing Effects in Learning: A Temporal Ridgeline of Optimal Retention." Psychological Science 19(11) (2008): 1095-1102.
- Roediger, H. L., and Karpicke, J. D. "Test-Enhanced Learning: Taking Memory Tests Improves Long-Term Retention." Psychological Science 17(3) (2006): 249-255.
- Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., and Willingham, D. T. "Improving Students' Learning With Effective Learning Techniques." Psychological Science in the Public Interest 14(1) (2013): 4-58.
Knowledge tracing and adaptive learning:
- Corbett, A. T., and Anderson, J. R. "Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge." User Modeling and User-Adapted Interaction 4(4) (1995): 253-278.
- Pelánek, R. "Bayesian Knowledge Tracing, Logistic Models, and Beyond." User Modeling and User-Adapted Interaction 27(3) (2017): 313-350.
- Pavlik, P. I., Cen, H., and Koedinger, K. R. "Performance Factors Analysis: A New Alternative to Knowledge Tracing." Artificial Intelligence in Education (2009).
- Piech, C., et al. "Deep Knowledge Tracing." NeurIPS (2015).
Item response theory:
- Lord, F. M. Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum, 1980.
- Embretson, S. E., and Reise, S. P. Item Response Theory for Psychologists. Lawrence Erlbaum, 2000.
Intelligent tutoring systems:
- Anderson, J. R., Corbett, A. T., Koedinger, K. R., and Pelletier, R. "Cognitive Tutors: Lessons Learned." Journal of the Learning Sciences 4(2) (1995): 167-207.
- VanLehn, K. "The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems." Educational Psychologist 46(4) (2011): 197-221.
Critical perspectives on the cited literature:
- Schnotz, W., and Kürschner, C. "A Reconsideration of Cognitive Load Theory." Educational Psychology Review 19(4) (2007): 469-508. Critical engagement with CLT.
- Beck, J. E., and Chang, K.-M. "Identifiability: A Fundamental Problem of Student Modeling." Educational Data Mining (2007). On BKT identifiability.
TheoremPath internal documentation:
- docs/ADAPTIVE_LEARNING_KERNEL.md: substrate-only framing of the adaptive architecture, the feature-flag matrix, and the deferred items.
- docs/ADAPTIVE_DIAGNOSTIC_STATUS.md: ship map of the adaptive components (snapshot 2026-04-23, in need of refresh).
- prisma/schema.prisma: the data layer for TopicAssessment, AssessmentAttempt, LearningEvent, ReviewCard, and RecommendationEvent.
- src/lib/adaptive/, src/lib/learning-model/, src/lib/mastery/, src/lib/fsrs/: the implementation modules.
This page is part of PedagogyPath, sister site to TheoremPath in the path-network family. It is the canonical statement of how the four pillars of the TheoremPath adaptive architecture relate to the empirical cognitive-science-of-learning literature. Subsequent revisions land here as the architecture, the calibration evidence, or the cited literature evolves.