Friday, June 27, 2025

Take caution in using LLMs as human surrogates

Gao et al. point to problems in using LLMs as surrogates for, or simulations of, human behavior in research (motivated readers can obtain a PDF of the article from me):

Recent studies suggest large language models (LLMs) can generate human-like responses, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates or simulations for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Nearly all advanced approaches fail to replicate human behavior distributions across many models. The causes of failure are diverse and unpredictable, relating to input language, roles, safeguarding, and more. These results warrant caution in using LLMs as surrogates or for simulating human behavior in research.
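
For readers unfamiliar with the task: in Arad and Rubinstein's 11-20 money request game, each of two players asks for an amount between 11 and 20 shekels and receives it, but a player who asks for exactly one shekel less than the opponent earns a 20-shekel bonus. The distribution of requests is a standard probe of "level-k" reasoning depth. The short sketch below is my own illustration of that level-k logic, not code from Gao et al.; the function names are mine.

    # Illustrative sketch (not from the paper): level-k play in the
    # 11-20 money request game. Each player requests 11-20 and receives
    # that amount; undercutting the opponent by exactly one earns a
    # 20-unit bonus.

    def best_response(opponent_request: int) -> int:
        """Best reply: undercut by one to capture the bonus, unless the
        opponent is already at the floor of 11."""
        if opponent_request > 11:
            return opponent_request - 1
        return 20  # undercutting 11 is impossible; take the safe maximum

    def level_k_request(k: int) -> int:
        """Level-0 naively asks for 20; level-k best-responds to an
        assumed level-(k-1) opponent."""
        request = 20
        for _ in range(k):
            request = best_response(request)
        return request

    for k in range(6):
        print(f"level-{k} player requests {level_k_request(k)}")
    # A surrogate study would collect many model responses to the game
    # prompt and compare the request distribution with human data.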

Wednesday, June 25, 2025

A critique of the MIT AI and cognitive debt study - confusion of "cognitive debt" with "role confusion"

Here I am passing on a commentary on and critique of the work pointed to by the previous MindBlog post. It was written by Venkatesh Rao in collaboration with ChatGPT-4o:

I buy the data; I doubt the story. The experiment clocks students as if writing were artisanal wood-carving—every stroke hand-tooled, originality king, neural wattage loud. Yet half the modern knowledge economy runs on a different loop entirely:

    delegate → monitor → integrate → ship

Professors do it with grad students, PMs with dev teams, editors with freelancers.
Neuroscience calls that stance supervisory control. When you switch from doer to overseer, brain rhythms flatten, attention comes in bursts, and sameness is often a feature, not decay.
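
Rao's loop is easy to make concrete. The sketch below is mine, not his: a generic supervisory wrapper in which generate_draft, passes_quality_gate, and publish are hypothetical stand-ins for an LLM call, a review step, and a publishing step.

    # Minimal sketch (my illustration) of the delegate -> monitor ->
    # integrate -> ship loop around a draft-generating tool.
    from typing import Callable

    def supervise(brief: str,
                  generate_draft: Callable[[str], str],
                  passes_quality_gate: Callable[[str], bool],
                  publish: Callable[[str], None],
                  max_rounds: int = 3) -> bool:
        """Delegate drafting, monitor against a quality gate, integrate
        feedback into the next request, and ship when acceptable."""
        for round_num in range(1, max_rounds + 1):
            draft = generate_draft(brief)         # delegate
            if passes_quality_gate(draft):        # monitor
                publish(draft)                    # ship
                return True
            # integrate: fold reviewer feedback back into the brief
            brief += f"\n(Revision {round_num}: tighten facts, match house style.)"
        return False  # escalate: the supervisor takes the keyboard back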

The Prompting-Managing Impact Equivalence Principle 

For today’s text generators, the cognitive effects of prompting an LLM are empirically indistinguishable from supervising a junior human.

Think inertial mass = gravitational mass, but for AI.
As long as models write like competent interns, the mental load they lift—and the blind spots they introduce—match classic management psychology, not cognitive decline.

Sameness Cuts Two Ways

    Managerial virtue: Good supervisors enforce house style and crush defect variance. Consistent voice across 40 blog posts? Process discipline.

    Systemic downside: LLMs add an index-fund pull toward the linguistic mean—cheap, reliable, originality-suppressing (see our essay “LLMs as Index Funds”).

    Tension to manage: Know when to let the index run and when to chase alpha—when to prompt-regen for polish and when to yank the keyboard back for a funky solo.

Thus the EEG study’s homogeneity finding can read as disciplined management or proof of mediocrity. The difference is situational judgment, not neurology.

Evidence from the Real World

    Creators shift effort from producing to verifying & stewarding (Microsoft–CMU CHI ’25 survey)

    60% of employees already treat AI as a coworker (BCG global survey, 2022)

    HBR now touts “leading teams of humans and AI agents” (Harvard Business Review, 2025)

Across domains, people describe prompting in manager verbs: approve, merge, flag.

So Why Did the Students Flop?

Because freshman comp doesn’t teach management.
Drop novices into a foreman’s chair and they under-engage, miss hallucinations, and forget what the intern wrote. Industry calls them accidental managers.

The cure isn’t ditching the intern; it’s training the manager:

    delegation protocols

    quality gates

    exception handling

    deciding when to tolerate vs. combat sameness

A follow-up study could pit trained editors, novice prompters, and solo writers against the same brief—tracking error-catch speed, final readability, and EEG bursts during oversight moments.

Implications

    Education – Grade AI-era writing on oversight craft—prompt chains, fact-checks, audit trails—alongside hand-wrought prose.

    Organizations – Stop banning LLMs; start teaching people how to manage them.

    Research – Use dual baselines—artisan and supervisor. Quiet neural traces aren’t always decay; sometimes they’re vigilance at rest.

Closing Riff

The EEG paper diagnoses “cognitive debt,” but what it really spies is role confusion.

We strapped apprentices into a manager’s cockpit, watched their brains idle between spurts of oversight, and mistook the silence for sloth.

Through the lens of the Prompting-Managing Equivalence Principle:

    Sameness ⇢ quality control

    Low activation ⇢ watchful calm

    Real risk ⇢ index-fund homogenisation—a strategic problem, not a neurological cliff.

Better managers, not louder brains, are the upgrade path.

Monday, June 23, 2025

MIT study - Our brains can accumulate cognitive debt by using AI for writing tasks

I pass on the abstract of a multi-author work from MIT. Undergrads, EEG caps on, wrote three 20-minute essays. Those who leaned on GPT-4o showed weaker alpha-beta coupling, produced eerily similar prose, and later failed to quote their own sentences. The next MindBlog post relays a commentary on and critique of this work.

With today's wide adoption of LLM products like ChatGPT from OpenAI, humans and
businesses engage and use LLMs on a daily basis. Like any other tool, it carries its own set of
advantages and limitations. This study focuses on finding out the cognitive cost of using an LLM
in the educational context of writing an essay.

We assigned participants to three groups: LLM group, Search Engine group, Brain-only group,
where each participant used a designated tool (or no tool in the latter) to write an essay. We
conducted 3 sessions with the same group assignment for each participant. In the 4th session
we asked LLM group participants to use no tools (we refer to them as LLM-to-Brain), and the
Brain-only group participants were asked to use LLM (Brain-to-LLM). We recruited a total of 54
participants for Sessions 1, 2, 3, and 18 participants among them completed session 4.

We used electroencephalography (EEG) to record participants' brain activity in order to assess
their cognitive engagement and cognitive load, and to gain a deeper understanding of neural
activations during the essay writing task. We performed NLP analysis, and we interviewed each
participant after each session. We performed scoring with the help from the human teachers
and an AI judge (a specially built AI agent).
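
For orientation only (this is a generic illustration, not the authors' analysis pipeline): a common ingredient of EEG engagement measures is band power in the alpha (8-12 Hz) and beta (13-30 Hz) ranges. The sketch below computes it for a hypothetical single 256 Hz channel using Welch's method.

    # Generic band-power sketch; the recording here is random noise
    # standing in for real EEG data.
    import numpy as np
    from scipy.signal import welch

    def band_power(x: np.ndarray, fs: float, lo: float, hi: float) -> float:
        """Approximate power in [lo, hi] Hz by summing Welch PSD bins."""
        freqs, psd = welch(x, fs=fs, nperseg=int(2 * fs))
        mask = (freqs >= lo) & (freqs <= hi)
        return float(psd[mask].sum() * (freqs[1] - freqs[0]))

    fs = 256.0                                  # hypothetical sampling rate
    eeg = np.random.randn(int(60 * fs))         # one minute of fake data
    alpha = band_power(eeg, fs, 8.0, 12.0)
    beta = band_power(eeg, fs, 13.0, 30.0)
    print(f"alpha/beta band-power ratio: {alpha / beta:.2f}")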

We discovered a consistent homogeneity across the Named Entities Recognition (NERs),
n-grams, ontology of topics within each group. EEG analysis presented robust evidence that
LLM, Search Engine and Brain-only groups had significantly different neural connectivity
patterns, reflecting divergent cognitive strategies. Brain connectivity systematically scaled down
with the amount of external support: the Brain‑only group exhibited the strongest, widest‑ranging
networks, Search Engine group showed intermediate engagement, and LLM assistance elicited
the weakest overall coupling. In session 4, LLM-to-Brain participants showed weaker neural
connectivity and under-engagement of alpha and beta networks; and the Brain-to-LLM
participants demonstrated higher memory recall, and re‑engagement of widespread
occipito-parietal and prefrontal nodes, likely supporting the visual processing, similar to the one
frequently perceived in the Search Engine group. The reported ownership of LLM group's
essays in the interviews was low. The Search Engine group had strong ownership, but lesser
than the Brain-only group. The LLM group also fell behind in their ability to quote from the
essays they wrote just minutes prior.
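
To give a rough sense of what a homogeneity measure can look like (my illustration, not the authors' NLP pipeline): one simple proxy is the average pairwise overlap of word trigrams across essays within a group.

    # Toy homogeneity proxy: mean pairwise Jaccard similarity of word
    # trigrams; higher values mean the essays are more alike.
    from itertools import combinations

    def trigrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    def group_homogeneity(essays: list) -> float:
        sims = []
        for a, b in combinations(essays, 2):
            ta, tb = trigrams(a), trigrams(b)
            if ta or tb:
                sims.append(len(ta & tb) / len(ta | tb))
        return sum(sims) / len(sims) if sims else 0.0

    # Usage: compare group_homogeneity(llm_essays) with
    # group_homogeneity(brain_only_essays) on the same prompt.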

As the educational impact of LLM use only begins to settle with the general population, in this
study we demonstrate the pressing matter of a likely decrease in learning skills based on the
results of our study. The use of LLM had a measurable impact on participants, and while the
benefits were initially apparent, as we demonstrated over the course of 4 months, the LLM
group's participants performed worse than their counterparts in the Brain-only group at all levels:
neural, linguistic, scoring.

We hope this study serves as a preliminary guide to understanding the cognitive and practical
impacts of AI on learning environments. 

Monday, June 16, 2025

Rejecting blind builder and helpless witness narratives in favor of constitutive narratives

I want to pass on this concise ChatGPT-4o summary of a recent piece by Venkatesh Rao titled "Not Just a Camera, Not Just an Engine":

The author critiques two dominant narrative styles shaping our understanding of current events:

  1. Blind builder narratives, which enthusiastically act without deeply understanding the world, and

  2. Helpless witness narratives, which see and interpret richly but lack agency to act.

Both are seen as inadequate. The author proposes a third stance: “camera-engine” narratives, or constitutive narratives, which combine seeing and doing—observing reality while simultaneously reshaping it. These narratives are not just descriptive but performative, akin to legal speech-acts that create new realities (e.g., a judge declaring a couple married).

This concept implies that meaningful engagement with the world requires transcending the passive/active divide. Seeing and doing must occur in a tightly entangled loop, like a double helix, where observation changes what is, and action reveals what could be.

People and institutions that fail to integrate seeing and doing—whether Silicon Valley “doers” or intellectual “seers”—become ghost-like: agents of entropy whose actions are ultimately inconsequential or destructive. Their narratives can be ignored, even if their effects must be reckoned with.

To escape this ghosthood, one must use camera-engine media—tools or practices that force simultaneous perception and transformation. Examples include:

  • Legal systems, protocols, AI tools, and code-as-law, which inherently see and alter reality.

  • In contrast, “camera theaters” (e.g., hollow rhetoric) and “engine theaters” (e.g., performative protests) simulate action or vision but are ultimately ineffective.

The author admits to still learning how best to wield camera-engine media but has developed a growing ability to detect when others are stuck in degenerate forms—ghosts mistaking themselves for real actors.



Saturday, June 14, 2025

AI ‘The Illusion of Thinking’

I want to pass on this interesting piece by Christopher Mims in today’s Wall Street Journal:

A primary requirement for being a leader in AI these days is to be a herald of the impending arrival of our digital messiah: superintelligent AI. For Dario Amodei of Anthropic, Demis Hassabis of Google and Sam Altman of OpenAI, it isn’t enough to claim that their AI is the best. All three have recently insisted that it’s going to be so good, it will change the very fabric of society.
Even Meta—whose chief AI scientist has been famously dismissive of this talk—wants in on the action. The company confirmed it is spending $14 billion to bring in a new leader for its AI efforts who can realize Mark Zuckerberg’s dream of AI superintelligence—that is, an AI smarter than we are. “Humanity is close to building digital superintelligence,” Altman declared in an essay this past week, and this will lead to “whole classes of jobs going away” as well as “a new social contract.” Both will be consequences of AI-powered chatbots taking over white-collar jobs, while AI-powered robots assume the physical ones.
Before you get nervous about all the times you were rude to Alexa, know this: A growing cohort of researchers who build, study and use modern AI aren’t buying all that talk.
The title of a fresh paper from Apple says it all: “The Illusion of Thinking.” In it, a half-dozen top researchers probed reasoning models—large language models that “think” about problems longer, across many steps—from the leading AI labs, including OpenAI, DeepSeek and Anthropic. They found little evidence that these are capable of reasoning anywhere close to the level their makers claim.
Generative AI can be quite useful in specific applications, and a boon to worker productivity. OpenAI claims 500 million monthly active ChatGPT users. But these critics argue there is a hazard in overestimating what it can do, and making business plans, policy decisions and investments based on pronouncements that seem increasingly disconnected from the products themselves.
Apple’s paper builds on previous work from many of the same engineers, as well as notable research from both academia and other big tech companies, including Salesforce. These experiments show that today’s “reasoning” AIs—hailed as the next step toward autonomous AI agents and, ultimately, superhuman intelligence—are in some cases worse at solving problems than the plain-vanilla AI chatbots that preceded them. This work also shows that whether you’re using an AI chatbot or a reasoning model, all systems fail at more complex tasks.
Apple’s researchers found “fundamental limitations” in the models. When taking on tasks beyond a certain level of complexity, these AIs suffered “complete accuracy collapse.” Similarly, engineers at Salesforce AI Research concluded that their results “underscore a significant gap between current LLM capabilities and real-world enterprise demands.”
The problems these state-of-the-art AIs couldn’t handle are logic puzzles that even a precocious child could solve, with a little instruction. What’s more, when you give these AIs that same kind of instruction, they can’t follow it.
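
To give a concrete sense of how simple these puzzles are algorithmically: the Tower of Hanoi, one of the classic tasks used in this line of work, is solved completely by a short recursion. The sketch below is mine, included only to make that point.

    # The entire "instruction set" for Tower of Hanoi is this recursion,
    # which is what makes the reported failures on larger instances striking.
    def hanoi(n: int, source: str, spare: str, target: str, moves: list) -> None:
        """Move n disks from source to target, never putting a larger
        disk on a smaller one."""
        if n == 0:
            return
        hanoi(n - 1, source, target, spare, moves)   # clear the way
        moves.append((source, target))               # move the biggest disk
        hanoi(n - 1, spare, source, target, moves)   # restack on top

    moves = []
    hanoi(3, "A", "B", "C", moves)
    print(len(moves), moves)   # 7 moves for 3 disks; 2**n - 1 in general
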
Apple’s paper has set off a debate in tech’s halls of power—Signal chats, Substack posts and X threads— pitting AI maximalists against skeptics.
“People could say it’s sour grapes, that Apple is just complaining because they don’t have a cutting-edge model,” says Josh Wolfe, co-founder of venture firm Lux Capital. “But I don’t think it’s a criticism so much as an empirical observation.”
The reasoning methods in OpenAI’s models are “already laying the foundation for agents that can use tools, make decisions, and solve harder problems,” says an OpenAI spokesman. “We’re continuing to push those capabilities forward.”
The debate over this research begins with the implication that today’s AIs aren’t thinking, but instead are creating a kind of spaghetti of simple rules to follow in every situation covered by their training data.
Gary Marcus, a cognitive scientist who sold an AI startup to Uber in 2016, argued in an essay that Apple’s paper, along with related work, exposes flaws in today’s reasoning models, suggesting they’re not the dawn of human-level ability but rather a dead end. “Part of the reason the Apple study landed so strongly is that Apple did it,” he says. “And I think they did it at a moment in time when people have finally started to understand this for themselves.”
In areas other than coding and mathematics, the latest models aren’t getting better at the rate they once did. And the newest reasoning models actually hallucinate more than their predecessors.
“The broad idea that reasoning and intelligence come with greater scale of models is probably false,” says Jorge Ortiz, an associate professor of engineering at Rutgers, whose lab uses reasoning models and other AI to sense real-world environments. Today’s models have inherent limitations that make them bad at following explicit instructions—not what you’d expect from a computer.
It’s as if the industry is creating engines of free association. They’re skilled at confabulation, but we’re asking them to take on the roles of consistent, rule-following engineers or accountants.
That said, even those who are critical of today’s AIs hasten to add that the march toward more capable AI continues.
Exposing current limitations could point the way to overcoming them, says Ortiz. For example, new training methods—giving step-by-step feedback on models’ performance, adding more resources when they encounter harder problems—could help AI work through bigger problems, and make better use of conventional software.
From a business perspective, whether or not current systems can reason, they’re going to generate value for users, says Wolfe.
“Models keep getting better, and new approaches to AI are being developed all the time, so I wouldn’t be surprised if these limitations are overcome in practice in the near future,” says Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, who has studied the practical uses of AI.
Meanwhile, the true believers are undeterred.
Just a decade from now, Altman wrote in his essay, “maybe we will go from solving high-energy physics one year to beginning space colonization the next year.” Those willing to “plug in” to AI with direct, brain-computer interfaces will see their lives profoundly altered, he adds.
This kind of rhetoric accelerates AI adoption in every corner of our society. AI is now being used by DOGE to restructure our government, leveraged by militaries to become more lethal, and entrusted with the education of our children, often with unknown consequences.
Which means that one of the biggest dangers of AI is that we overestimate its abilities, trust it more than we should—even as it’s shown itself to have antisocial tendencies such as “opportunistic blackmail”—and rely on it more than is wise. In so doing, we make ourselves vulnerable to its propensity to fail when it matters most.
“Although you can use AI to generate a lot of ideas, they still require quite a bit of auditing,” says Ortiz. “So for example, if you want to do your taxes, you’d want to stick with something more like TurboTax than ChatGPT.”

Friday, June 06, 2025

Benefits and dangers of anthropomorphic conversational agents

Peter et al. offer an interesting open source essay. They ask:
"should we lean into the human-like abilities, or should we aim to dehumanize LLM-based systems, given concerns over anthropomorphic seduction? When users cannot tell the difference between human interlocutors and AI systems, threats emerge of deception, manipulation, and disinformation at scale."