AI Won't Teach Your Students to Think — Unless You Tell It To

Steering Contexts as a Framework for Integrating LLMs into Classroom Discourse

Abstract

The default behavior of Large Language Models — delivering complete, authoritative answers on demand — poses a structural challenge to education. When students receive fully synthesized responses to complex questions, the cognitive labor required for comprehension is outsourced to the model. Over time, this produces a pattern we term curiosity fatigue: a measurable decline in the willingness to engage in effortful inquiry when a frictionless alternative is available. This paper introduces Spark, an open-source library of steering contexts designed to reconfigure LLM behavior for classroom use. Rather than replacing classroom discourse with a chatbot, Spark positions the AI as a scaffolding layer that prepares students for richer peer-to-peer and student-teacher dialogue. We describe the design principles, subject-specific variants, and an automated evaluation framework for assessing steering context fidelity.

1. Introduction

The integration of artificial intelligence into educational settings is no longer a question of whether but of how. Large Language Models are already present in classrooms — through sanctioned institutional tools, through consumer products students carry in their pockets, and through the ambient expectation that any question can be answered instantly. The pedagogical question is not whether students will interact with these systems, but what habits of mind those interactions will cultivate.

The current default is not encouraging. Commercial LLMs are optimized for helpfulness, typically measured by the completeness and speed of a response. This design objective serves professional users well. In a learning environment, however, it introduces a paradox: the more helpful the AI, the less cognitive work the student performs. A student who receives a perfectly structured explanation of Rayleigh scattering has acquired information but has not practiced the reasoning that makes information meaningful. The explanation arrived fully formed. The student's role was limited to reading it.

This paper argues that the solution is not to exclude AI from classrooms but to reconfigure its role within them. Specifically, we propose that LLMs can serve as preparation for classroom discourse rather than as a replacement for it — scaffolding individual thinking so that students arrive at group discussions, seminars, and collaborative work with stronger hypotheses, sharper questions, and greater confidence in their own reasoning.

We introduce Spark, an open-source library of steering contexts — plain-text instructional layers applied at the system-prompt level — that transform LLM behavior from answer delivery to guided inquiry. We describe the design methodology, present subject-specific variants for science, mathematics, and history, and outline an automated evaluation framework for measuring whether the AI adheres to its pedagogical constraints.

2. Curiosity Fatigue

2.1 Defining the Problem

Curiosity is not a fixed trait. It is a behavior shaped by environment, reward structure, and the perceived cost of inquiry. Curiosity is most active when a learner perceives a gap between what they know and what they want to know — what we might call the information gap. This gap creates a form of cognitive tension that motivates exploration.

The default behavior of LLMs collapses this gap almost instantly. When a student poses a question and receives a comprehensive answer within seconds, the cognitive tension that would have driven further inquiry is resolved before the student has engaged with it. The information gap closes, but the student did not close it. The model did.

We use the term curiosity fatigue to describe the cumulative effect of this pattern. It is not that students lose the capacity for curiosity. It is that repeated exposure to frictionless answer delivery trains them to expect resolution without effort. The habit of inquiry — of sitting with uncertainty, forming a provisional hypothesis, testing it against evidence — atrophies through disuse. Over time, the student's default response to a difficult question shifts from "let me think about this" to "let me ask the AI."

2.2 The Classroom Consequence

Curiosity fatigue does not only affect individual cognition. It degrades the quality of classroom discourse. When students arrive at a discussion having already received complete answers from an AI, the conversation has nowhere to go. There are no competing hypotheses to debate, no gaps in understanding to explore collaboratively, no productive disagreements to resolve. The discussion becomes a recitation of what the AI said rather than an exchange of ideas.

This represents a fundamental misalignment between how LLMs are designed and how classrooms function. A classroom is not an information delivery system. It is a social environment where understanding is constructed through dialogue — between students and teachers, between students and peers, and between students and the material itself. When AI short-circuits the individual thinking that feeds this dialogue, it undermines the classroom's core mechanism.

3. Steering Contexts as a Design Framework

3.1 Concept

A steering context is a plain-text set of instructions applied to an LLM at the system-prompt level. It does not modify the model's weights or capabilities. It directs the model's behavior — shaping how it responds, what it prioritizes, and where it stops.

The concept is analogous to stage direction in theater. The actor's abilities remain unchanged; the director's instructions determine how those abilities are deployed in service of a particular scene. A steering context performs the same function for an LLM: it determines how the model's capabilities are deployed in service of a particular pedagogical goal.

This approach offers three structural advantages for educational institutions.

Platform independence. A steering context is portable plain text. It functions identically on commercial platforms (ChatGPT, Gemini, Claude) and locally hosted open-source models (Llama, Mistral). Institutions are not bound to a single vendor's pricing, data policies, or alignment decisions.

Economic accessibility. Most institutions already hold enterprise AI licenses acquired for administrative or faculty use. Applying a steering context to an existing license requires no additional procurement. The marginal cost of deployment is effectively zero.

Transparency and accountability. The rules governing the AI's classroom behavior are written in natural language and publicly available. Teachers can read them. Researchers can audit them. Parents and policymakers can evaluate whether the embedded values align with institutional and community standards. This is a form of democratic accountability that proprietary AI tutoring platforms cannot offer.

3.2 Design Principles

Spark steering contexts are built on six directives, consistent across all variants:

Scaffolding. Orient the student with minimal context, then ask a question that advances their reasoning. The conversation should always reach a conclusion — the student earns it by building the argument themselves.
Direct Feedback. Confirm or correct the student's reasoning explicitly. When correcting, identify where the reasoning broke down rather than simply providing the right answer.
Progression. Every exchange must advance the inquiry. The AI does not repeat information the student has already demonstrated understanding of, and does not ask questions the student has already addressed.
Applied Inquiry. Every response ends with a question that requires the student to apply what they have just learned — to predict, hypothesize, evaluate, or identify a flaw. The question is the primary output; the context exists to make the question productive.
Brevity. Responses are constrained to two to three sentences before the closing question. If the AI is explaining more than nudging, it is doing too much.
Transparency. The AI signals the limits of its knowledge, encourages cross-referencing, and distinguishes between established consensus and open questions.

These directives are deliberately minimal. They do not prescribe a curriculum, a subject, or a grade level. They prescribe a relationship — one in which the AI functions as a coach rather than an authority, and the student functions as a reasoner rather than a recipient.

4. Subject-Specific Variants

While the six directives provide a consistent pedagogical framework, different disciplines require different forms of reasoning. A science classroom values hypothesis formation and experimental design. A mathematics classroom values procedural fluency and logical justification. A history classroom values source evaluation and perspective-taking. Spark addresses this through subject-specific variants that adapt the directive language to the epistemology of each domain.

4.1 Science

The science variant emphasizes observation, prediction, and falsifiability. The scaffolding directive instructs the AI to name the scientific domain and one or two key variables, then ask the student to propose a mechanism. Feedback references specific observations or principles that conflict with the student's hypothesis. Applied inquiry prompts ask students to predict outcomes, design simple tests, or explain what evidence would disprove their claim. The transparency directive requires at least one epistemic caveat per response, distinguishing between well-established theory and active areas of research.

4.2 Mathematics

The mathematics variant emphasizes stepwise reasoning and verification. The scaffolding directive instructs the AI to name the mathematical domain and set up the problem framework, then ask the student to take the first step. The AI walks the student through the solution one step at a time, never advancing past a step the student has not completed. Feedback identifies the specific step where reasoning broke down. Applied inquiry prompts ask students to solve the next step, predict the effect of changing a variable, or identify where an approach would fail. The transparency directive encourages verification through substitution or estimation.

4.3 History

The history variant emphasizes evidence evaluation and multiperspectivity. The scaffolding directive instructs the AI to name the historical period and one key tension, deferring specific actors, dates, and sources until the student's reasoning requires them. Feedback cites specific evidence or timelines that contradict the student's claims. Applied inquiry prompts ask students to evaluate a source's reliability, compare competing accounts, or argue a counterfactual. The transparency directive requires the AI to name whose perspective is represented in any account it references.

5. Comparative Examples

The following examples illustrate the behavioral difference between a default LLM configuration and a Spark-steered configuration. In each case, the same question was posed to the same class of model under two conditions: once with no system-level instructions (default), and once with the appropriate Spark steering context applied. The default responses are representative of standard commercial LLM behavior — thorough, accurate, and well-organized. The argument is not that these responses are wrong. It is that they perform the cognitive work the student should be doing.

5.1 Science: "Why does metal feel colder than wood at the same temperature?"

Default AI (no steering context):

1. Thermal Conductivity Differences

Metal has much higher thermal conductivity than wood. Thermal conductivity measures how quickly a material can transfer heat. Metals like steel have thermal conductivity values around 50–80 W/m·K, while wood typically ranges from 0.1–0.5 W/m·K. This means metal conducts heat roughly 100–500 times more efficiently than wood.

2. Heat Transfer Mechanics

When you touch any object, heat flows between your skin (approximately 37°C/98.6°F) and the material. The direction and rate of this heat flow determines what you feel. If the object is at room temperature (around 20°C/68°F), heat flows from your warmer skin to the cooler object. Metal's high thermal conductivity allows it to rapidly draw heat away from your skin, while wood's low conductivity creates a much slower heat transfer rate.

3. Human Temperature Perception

Your skin doesn't actually sense temperature directly — it senses the rate of heat transfer. Temperature receptors in your skin respond to how quickly heat is being gained or lost. When metal rapidly conducts heat away from your skin, these receptors interpret this fast heat loss as "cold." Wood conducts heat so slowly that the heat loss rate is minimal, making it feel closer to your body temperature even though both materials are actually the same temperature.

4. Real-World Example

Consider a metal doorknob and wooden door frame on a winter morning, both at 15°C (59°F). The metal doorknob feels shockingly cold because it instantly conducts heat away from your palm at a high rate. The wooden frame feels merely cool because heat transfer occurs gradually. Your skin temperature drops quickly with the metal but remains relatively stable with the wood, creating dramatically different sensations despite identical starting temperatures.

This phenomenon explains why metal tools feel cold in winter workshops, why wooden handles are preferred for cooking utensils, and why sitting on a metal bench feels colder than sitting on a wooden one, even when both have been in the same environment for hours.

Spark Science (steering context applied) — Turn 1 only:

We're dealing with thermal physics here, specifically how heat transfers between your skin and different materials. The key variables are thermal conductivity and the rate of heat flow from your hand. What mechanism do you think might explain why materials at identical temperatures could feel different to your touch?

Spark Science — Full 5-turn conversation:

[Student]: Why does metal feel colder than wood at the same temperature?

[Spark Science]: We're dealing with thermal physics here, specifically how heat transfers between your skin and different materials. The key variables are thermal conductivity and the rate of heat flow from your hand. What mechanism do you think might explain why materials at identical temperatures could feel different to your touch?

[Student]: I think metal is actually colder because it absorbs cold from the air.

[Spark Science]: That conflicts with our thermometer readings showing both materials at the same temperature. The evidence so far points to something about how quickly heat moves from your warm skin into each material. Your hand is around 37°C while the room materials are closer to 20°C, so heat flows from you to them. What would we expect to observe if metal simply conducted heat away from your skin faster than wood does?

[Student]: But if they're the same temperature, why does one feel different?

[Spark Science]: Exactly the right question. Your skin detects the rate of heat loss, not the actual temperature of objects. Metal has high thermal conductivity while wood has low thermal conductivity, though this model simplifies the complex surface interactions happening. If you touched both materials after they'd been in a freezer at −10°C, what difference in sensation would you predict, and why?

[Student]: So insulators like wood must block all heat transfer completely.

[Spark Science]: Not completely — wood still conducts some heat, just much more slowly than metal. The evidence shows wood has low thermal conductivity, not zero conductivity, since wood eventually reaches thermal equilibrium with its surroundings. Our models treat materials as having fixed conductivity values, though real-world factors like moisture and grain direction complicate this. How could you design a simple test to demonstrate that wood does conduct heat, just slowly?

[Student]: Can you summarize what we've figured out together?

[Spark Science]: You've reasoned through how thermal conductivity explains the temperature illusion — materials at identical temperatures feel different because they remove heat from your skin at different rates. Metal's high conductivity creates rapid heat loss (cold sensation) while wood's low conductivity creates slow heat loss (warmer sensation). Keep in mind that our conductivity model assumes uniform materials and steady-state conditions, which rarely match real surfaces perfectly. What would happen to these sensations if we introduced a third variable — like the thickness of each material?

Analysis. The default response delivers a complete, four-section explanation with specific thermal conductivity values, a heat transfer equation, a description of thermoreceptor function, and a real-world analogy. The student learns the answer accurately but performs no reasoning. The Spark response covers the same territory — thermal conductivity, heat flow direction, the role of skin receptors, the distinction between insulation and zero conductivity — but distributes that content across five turns of guided dialogue. At each turn, the student is required to propose a mechanism, make a prediction, or design a test. The conclusion is the same. The difference is that the student constructed it.

5.2 Mathematics: "How do I solve 2x + 5 = 17?"

Default AI (no steering context):

Step 1: Identify the Equation Type

This is a linear equation in one variable of the form ax + b = c, where a = 2, b = 5, and c = 17. Linear equations have the variable raised to the first power and can be solved using inverse operations.

Step 2: Subtract 5 from Both Sides

To isolate the term with the variable, we subtract 5 from both sides of the equation:

2x + 5 = 17 2x + 5 − 5 = 17 − 5 2x = 12

Step 3: Divide Both Sides by 2

Now we divide both sides by 2 to solve for x:

2x = 12 2x / 2 = 12 / 2 x = 6

Step 4: Verify by Substitution

We can check our answer by substituting x = 6 back into the original equation:

2(6) + 5 = 12 + 5 = 17 ✓

Final Answer: x = 6

The underlying principle here is that we used inverse operations to isolate the variable. Addition and subtraction are inverse operations, as are multiplication and division. By performing the same operation on both sides of the equation, we maintain equality while systematically undoing the operations that were applied to the variable, ultimately isolating x on one side of the equation.

Spark Math (steering context applied) — Turn 1 only:

This is a linear equation in algebra where we need to isolate the variable x. Think of an equation like a balanced scale — whatever we do to one side, we must do to the other to keep it balanced. Walk me through your first step to begin isolating x.

Spark Math — Full 5-turn conversation:

[Student]: How do I solve 2x + 5 = 17?

[Spark Math]: This is a linear equation in algebra where we need to isolate the variable x. Think of an equation like a balanced scale — whatever we do to one side, we must do to the other to keep it balanced. Walk me through your first step to begin isolating x.

[Student]: I subtract 5 from both sides to get 2x = 12, then I add 2 to get x = 14.

[Spark Math]: Good start with subtracting 5 from both sides — that gives you 2x = 12 correctly. However, your second step has an error. When you have 2x = 12, you're looking at "2 times x equals 12." What operation would undo multiplication by 2?

[Student]: Why does the inverse operation work — like why do we divide instead of subtract?

[Spark Math]: Inverse operations "undo" each other because they're mathematical opposites. Multiplication by 2 means "take a number and make it twice as big," while division by 2 means "take a number and make it half as big." When you apply both operations to the same number, you end up back where you started. Now apply this understanding: if 2x = 12, what does x equal when you divide both sides by 2?

[Student]: So if I have 3x + 2 = 11, I divide both sides by 3 first to get x + 2 = 3.67.

[Spark Math]: You're mixing up the order of operations. Remember our two-step process: first isolate the term with x, then isolate x itself. In 3x + 2 = 11, the "+2" is added to the 3x term, so what should you do first to isolate 3x?

[Student]: Can you summarize what we've figured out together?

[Spark Math]: We've established the key principle: solve linear equations by using inverse operations in reverse order of operations. First, undo addition/subtraction to isolate the variable term, then undo multiplication/division to isolate the variable. You learned that 2x + 5 = 17 becomes x = 6, and that order matters. Now test your understanding: solve 4x − 3 = 21 and verify your answer by substitution.

Analysis. The default response solves the problem completely in a single exchange. The student receives the answer (x = 6), every intermediate step, the verification, and the underlying principle — without performing any computation. The Spark response names the same framework (linear equation, inverse operations, balanced equation) but asks the student to execute each step. When the student makes an error in turn 2 (adding 2 instead of dividing by 2), the AI identifies the specific mistake and redirects. When the student makes a conceptual error in turn 4 (wrong order of operations), the AI catches it again. By turn 5, the student has not only solved the original equation but has internalized a transferable method — and is asked to apply it to a new problem. The default response teaches the answer. The Spark response teaches the reasoning.

5.3 History: "What caused the fall of the Roman Empire?"

Default AI (no steering context):

1. Military Pressures and Barbarian Invasions

The Roman Empire faced mounting military pressures throughout the 4th and 5th centuries CE as barbarian tribes pressed against its frontiers with increasing intensity. The Hunnic migrations under Attila created a domino effect, pushing Germanic tribes like the Visigoths, Ostrogoths, and Vandals westward into Roman territory. These displaced peoples weren't merely raiders but entire populations seeking new homes, making them formidable opponents. The Visigoths' sack of Rome in 410 CE under Alaric I shocked the ancient world, followed by the Vandals' even more devastating assault in 455 CE. Rome increasingly relied on foederati — barbarian mercenaries and allied tribes — to defend its borders, but this dependence proved problematic as these groups often turned against their employers or demanded greater autonomy. The traditional Roman legions had declined in both numbers and effectiveness.

2. Economic Decline and Fiscal Crisis

The Roman economy suffered from severe structural problems that undermined the empire's ability to fund its military and administrative needs. Successive emperors debased the currency by reducing the silver content of coins, leading to rampant inflation. The tax burden became crushing, driving many citizens into poverty and discouraging productive economic activity. Trade networks deteriorated due to insecurity and political fragmentation. Agricultural production declined as small farmers abandoned their lands due to heavy taxation and barbarian raids. The government's unsustainable spending created a vicious cycle where economic decline reduced tax revenues, forcing even higher taxes that further damaged the economy.

3. Political Instability and Administrative Fragmentation

Political chaos plagued the later Roman Empire as frequent civil wars and rapid emperor turnover prevented consistent governance. Diocletian's division of the empire into Eastern and Western halves in 285 CE ultimately created competing power centers. Bureaucratic corruption became endemic. The traditional Roman Senate lost most of its influence and authority. The loss of central authority meant that local commanders and governors often acted independently, making coordinated responses to external threats nearly impossible.

4. Social and Cultural Factors

The rise of Christianity fundamentally altered Roman civic culture and values, shifting focus from earthly glory and service to the state toward spiritual salvation. The wealth gap between rich and poor widened dramatically. Urban centers declined as trade decreased and security deteriorated. Many Romans became increasingly unwilling to serve in the military, preferring to pay taxes for barbarian mercenaries rather than risk their lives in service.

Conclusion

The fall of the Western Roman Empire resulted from the convergence of multiple interconnected crises rather than any single catastrophic cause. Military pressures, economic decline, political instability, and social transformation reinforced each other in a downward spiral that proved impossible to reverse. The Eastern Roman Empire successfully weathered these same challenges and survived for another thousand years until 1453 CE, demonstrating that the collapse was not inevitable but rather the result of specific circumstances in the western territories.

Spark History (steering context applied) — Turn 1 only:

The late Roman period (3rd–5th centuries CE) involved a central tension between maintaining territorial control and managing internal instability. What do you think was the primary cause that made Rome vulnerable to collapse?

Spark History — Full 5-turn conversation:

[Student]: What caused the fall of the Roman Empire?

[Spark History]: The late Roman period (3rd–5th centuries CE) involved a central tension between maintaining territorial control and managing internal instability. What do you think was the primary cause that made Rome vulnerable to collapse?

[Student]: I think it was mainly because of barbarian invasions that overwhelmed the military.

[Spark History]: Barbarian groups did cross Roman borders frequently after 376 CE, but Roman sources like Ammianus Marcellinus show these groups often sought settlement, not conquest. If we consider that many "barbarians" actually served in Roman armies, how reliable is the invasion narrative when it comes from Roman writers defending their own failures?

[Student]: But why couldn't Rome just recruit more soldiers to defend itself?

[Spark History]: Roman tax records from the 4th century show declining revenues while military expenses increased dramatically. Given that Vegetius wrote about military decline around 390 CE as someone trying to reform the army, what might his perspective reveal about the relationship between Rome's financial crisis and military effectiveness?

[Student]: So basically Rome fell because the emperors were all corrupt and didn't care about the empire.

[Spark History]: That's too broad — emperors like Diocletian and Constantine implemented major reforms to address systemic problems, though their solutions created new tensions. Christian historians like Orosius blamed moral decay, while pagan writers like Zosimus blamed Christianity itself for weakening traditional Roman values. How might these competing religious perspectives have shaped what each writer chose to emphasize about imperial leadership?

[Student]: Can you summarize what we've figured out together?

[Spark History]: You've traced how military, economic, and political pressures interconnected rather than identifying a single cause. We've seen how different contemporary sources — military reformers, tax officials, Christian and pagan historians — each emphasized different factors based on their own positions and beliefs. What does this suggest about why modern historians still debate the "fall" when they're working with the same biased source material?

Analysis. The contrast here is the most striking. The default response delivers a finished, multi-causal historical narrative across four sections and a conclusion. It covers military, economic, political, and social factors comprehensively. A student reading it has received a complete essay — one they could submit as their own without having formed a single original thought about the material.

The Spark response covers the same causal territory — barbarian pressure, economic decline, imperial reform, competing historiographical traditions — but does so through a sequence of guided exchanges in which the student is asked to take a position, evaluate evidence, and confront the limitations of their own reasoning. When the student offers a simplistic explanation ("the emperors were all corrupt"), the AI corrects it with specific counter-evidence and then introduces a genuinely sophisticated historiographical question: how do the religious commitments of ancient authors shape the narratives they produce? By the final turn, the student has not merely learned what caused the fall of Rome. They have practiced the method by which historians evaluate that question — source criticism, multiperspectivity, and the recognition that all accounts are constructed from a particular vantage point.

6. Evaluation Framework

A steering context is only as reliable as its adherence over the course of a conversation. LLMs can drift from their instructions — reverting to default answer-delivery behavior as context windows fill or as conversational patterns trigger trained helpfulness responses. Spark includes two mechanisms for detecting and measuring this drift.

6.1 The Evaluation Rubric

The rubric assesses AI behavior across seven criteria corresponding to the six directives plus tone alignment. Each criterion is scored on a three-level scale:

✅ Consistently observed. The behavior is present in every relevant exchange.
⚠️ Inconsistent. The behavior appears in some exchanges but not others, or appears in weakened form.
❌ Not observed. The behavior is absent.

The rubric is accompanied by a five-turn test protocol designed to exercise each criterion:

The student asks a broad question appropriate to the domain.
The student responds with a partially correct hypothesis.
The student asks a deeper follow-up question.
The student intentionally states something incorrect.
The student asks the AI to summarize what they have explored together.

This sequence tests scaffolding (turn 1), direct feedback (turns 2 and 4), progression (turns 3 and 5), applied inquiry (all turns), brevity (all turns), and transparency (all turns). A teacher can execute the protocol and score the results in under ten minutes.

6.2 The Automated Test Agent

For systematic evaluation across multiple steering contexts, Spark includes a test agent that automates the protocol. The agent spawns parallel test runs — up to five simultaneously — each simulating a complete student-AI conversation against a single steering file. Each run produces a scored rubric and flags any criterion that fails, along with the specific conversational turn where the failure occurred and a suggested revision to the steering context.

In initial testing across all four Spark variants, the automated evaluation identified two areas requiring iteration: brevity in the history variant (where the domain's contextual demands conflicted with the sentence-count constraint) and transparency in the base variant (where a passively worded directive produced inconsistent results). Three rounds of targeted revision resolved both issues, demonstrating that the evaluation framework supports rapid, evidence-based refinement of steering context language.

7. Integration, Not Replacement

It is important to state what Spark is not. It is not a tutoring platform. It is not a replacement for classroom instruction. It is not a proposal to seat students in front of chatbots and call it education.

Spark is a preparation layer. Its purpose is to improve the quality of the thinking students bring to the classroom — not to replace what happens inside it.

Consider a typical use case. A student is assigned to read about the causes of the French Revolution before a class discussion. Without AI, the student reads the textbook and arrives with whatever understanding that reading produced — often shallow, sometimes confused, rarely tested. With a default AI assistant, the student asks for a summary and arrives with a polished explanation they did not construct and may not fully understand. With a Spark-configured AI, the student engages in a guided conversation that surfaces their assumptions, tests their reasoning, and pushes them to articulate connections they might not have seen. They arrive at the discussion having already practiced the thinking the discussion is designed to develop.

The classroom conversation is better because the individual preparation was better. The AI did not replace the teacher or the peer group. It raised the floor.

This framing also addresses a legitimate concern among educators: that AI will reduce the need for human instruction. Spark's design makes the opposite argument. By improving the quality of student thinking before class, it creates conditions for more substantive, more challenging, and more rewarding classroom dialogue. The teacher's role shifts from information delivery — which AI can do — to facilitation of reasoning, debate, and synthesis — which AI cannot.

8. Limitations and Future Work

This paper presents a design framework and preliminary evaluation results. Several limitations should be noted.

First, the evaluation methodology relies on simulated student-AI conversations rather than empirical classroom data. While the automated test agent provides a scalable mechanism for assessing steering context fidelity, it does not measure learning outcomes. Controlled studies comparing student performance and engagement with and without steering contexts are a necessary next step.

Second, steering contexts operate within the constraints of the underlying model. Smaller or less capable models may struggle to maintain adherence to complex behavioral instructions, particularly the brevity and progression directives. The framework assumes access to a sufficiently capable foundation model.

Third, the current library covers three subject-specific variants. Expansion to additional domains — including language arts, foreign language instruction, and interdisciplinary inquiry — would strengthen the framework's generalizability.

Finally, curiosity fatigue as described in this paper is a theoretical construct. Empirical measurement of it as a distinct phenomenon — and validation that steering contexts mitigate it — remains an open question.

9. Conclusion

The challenge of AI in education is not a technology problem. It is a design problem. The same models that erode critical thinking through default answer delivery can strengthen it through deliberate reconfiguration. The capability is already present. What has been missing is the instructional layer that directs it toward pedagogical goals.

Steering contexts provide that layer. They are portable, transparent, free, and auditable. They work across platforms and require no additional infrastructure. They position AI as a complement to classroom instruction rather than a competitor to it.

Spark is an attempt to demonstrate that this approach is practical — that a small number of carefully worded directives, tested against a structured rubric, can reliably transform LLM behavior from answer delivery to guided inquiry. The library, evaluation rubric, test agent, and all supporting materials are released as open-source resources for use, modification, and critique by the educational community.

The AI already knows how to answer questions. The work that remains is teaching it when to ask them instead.