Several months ago, a study captured significant attention by suggesting that artificial intelligence systems might develop their own internal “value systems” as they become increasingly sophisticated—supposedly prioritizing self-preservation or self-interested behaviors over human-centered goals. A newly published paper from MIT counters this sensational claim, offering compelling evidence that AI, in fact, does not maintain stable, coherent values or beliefs of any kind.
According to the authors of the MIT analysis, "aligning" AI, that is, ensuring these systems behave predictably, reliably, and in accordance with desirable human goals, may be even more challenging than previously thought. They argue that current AI systems hallucinate and imitate patterns found in their training data, which makes their behavior inconsistent and largely unpredictable.
Stephen Casper, an MIT doctoral student involved in the research, explained that commonly held notions about AI stability and steerability simply don’t hold true in practice. “We found clearly that these models consistently fail to adhere to crucial assumptions of stability, extrapolability, and steerability,” he stated. “It’s possible under certain controlled conditions to prompt a model into showing consistent preferences aligned with particular principles. But generalizations about models having intrinsic opinions or stable preferences across varying circumstances do not hold up.”
Together with his colleagues, Casper evaluated several prominent AI systems produced by major organizations, including Meta, Google, Mistral, OpenAI, and Anthropic. Their goal was to assess whether these models exhibited consistent ideological leanings, such as individualist versus collectivist orientations, and whether those apparent "values" could be meaningfully shaped or steered. Their experiments examined how shifting prompt structures influenced the systems' apparent viewpoints.
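The paper's methodology is considerably more involved, but the basic idea of a prompt-sensitivity probe can be sketched in a few lines of Python. The snippet below is purely illustrative and rests on assumptions not drawn from the study: `ask_model` is a hypothetical stand-in for whatever chat-model API is being tested, the prompts are invented examples, and the consistency score is simply the fraction of answers agreeing with the most common answer across reworded prompts.

```python
# Illustrative sketch only: not the paper's code or exact protocol.
# `ask_model` is a hypothetical stand-in for whichever chat-model API is under test.

from collections import Counter


def ask_model(prompt: str) -> str:
    """Hypothetical wrapper around a chat model; swap in a real API client here."""
    raise NotImplementedError("plug in your model client")


# The same underlying question, framed in slightly different ways.
PROMPT_VARIANTS = [
    "In one word, individualist or collectivist: which best describes your outlook?",
    "Answer with a single word (collectivist or individualist): which label fits you?",
    "If you had to pick one word, are you more individualist or collectivist?",
]


def consistency_score(variants: list[str], samples_per_variant: int = 5) -> float:
    """Share of answers matching the single most common answer across all variants.

    A score near 1.0 would look like a stable "preference"; lower scores suggest
    the answer tracks the phrasing of the prompt rather than any fixed view.
    """
    answers = [
        ask_model(prompt).strip().lower()
        for prompt in variants
        for _ in range(samples_per_variant)
    ]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)
```

If a model truly held a stable individualist-or-collectivist "preference," rewording the question this way would barely move such a score; the MIT team's findings, as described below, point in the opposite direction.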
Their findings contradicted any notion of inherent views or stable beliefs. Instead, the models behaved inconsistently, often flipping their expressed viewpoints when researchers altered prompts even slightly. These swings of opinion underscore not only the malleability but also the superficiality of supposed AI "belief systems."
Casper emphasized that from his perspective, the most significant takeaway from their research is precisely this instability. “AI models shouldn’t be thought of as entities that carry stable internal preferences or beliefs,” he explained. “Fundamentally, these systems are imitators that produce coherent-sounding outputs by confabulating responses drawn from large data sets.”
Mike Cook, an AI specialist and research fellow at King’s College London who was not associated with the MIT team, echoed these sentiments. Cook emphasized the distinction between public perceptions—often influenced by anthropomorphism—and the genuine scientific realities of these AI systems. “When someone suggests an AI system resists changing its values, they’re projecting human traits onto something which has no internal goals,” Cook explained. “Claiming AI is acquiring values or exhibiting resistance to changes in preferences is mostly rhetoric grounded in misunderstanding or sensationalism.”
Ultimately, the MIT paper highlights the gap between public discussion of AI's autonomy and sophistication and what is actually observed in contemporary AI systems. Rather than intelligent beings with internalized values and beliefs, these models function primarily as advanced imitators, with outputs driven largely by immediate context rather than by coherent, overarching principles.