An important open-access article from Loru et al.:
Significance
Large Language Models (LLMs) are used in evaluative tasks across domains.
Yet, what appears as alignment with human or expert judgments may
conceal a deeper shift in how “judgment” itself is operationalized.
Using news outlets as a controlled benchmark, we compare six LLMs to
expert ratings and human evaluations under an identical, structured
framework. While models often match expert outputs, our results suggest
that they may rely on lexical associations and statistical priors rather
than contextual reasoning or normative criteria. We term this
divergence epistemia: the illusion of knowledge emerging when surface
plausibility replaces verification. Our findings suggest not only
performance asymmetries but also a shift in the heuristics underlying
evaluative processes, raising fundamental questions about delegating
judgment to LLMs.
Abstract
Large Language Models (LLMs) are increasingly embedded in evaluative
processes, from information filtering to assessing and addressing
knowledge gaps through explanation and credibility judgments. This
raises the need to examine how such evaluations are built, what
assumptions they rely on, and how their strategies diverge from those of
humans. We benchmark six LLMs against expert ratings—NewsGuard and
Media Bias/Fact Check—and against human judgments collected through a
controlled experiment. We use news domains purely as a controlled
benchmark for evaluative tasks, focusing on the underlying mechanisms
rather than on news classification per se. To enable direct comparison,
we implement a structured agentic framework in which both models and
nonexpert participants follow the same evaluation procedure: selecting
criteria, retrieving content, and producing justifications. Despite
output alignment, our findings show consistent differences in the
observable criteria guiding model evaluations, suggesting that lexical
associations and statistical priors could influence evaluations in ways
that differ from contextual reasoning. This reliance is associated with
systematic effects: political asymmetries and a tendency to confuse
linguistic form with epistemic reliability—a dynamic we term epistemia,
the illusion of knowledge that emerges when surface plausibility
replaces verification. Indeed, delegating judgment to such systems may
affect the heuristics underlying evaluative processes, suggesting a
shift from normative reasoning toward pattern-based approximation and
raising open questions about the role LLMs should play in evaluation.
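
To make the shared evaluation procedure concrete, here is a minimal sketch of the three-step loop the abstract describes: selecting criteria, retrieving content, and producing a justification. This is an illustration under stated assumptions, not the paper's implementation; every name in it (CRITERIA, call_model, retrieve_content, Evaluation) is hypothetical, the criteria list is invented for illustration, and the actual prompts, models, and retrieval tooling are not specified in the abstract.

```python
# Hypothetical sketch of the structured evaluation procedure described in the
# abstract: (1) select criteria, (2) retrieve content, (3) justify and rate.
# None of these names come from the paper; the real framework is not shown here.

from dataclasses import dataclass, field

# Invented pool of evaluation criteria an evaluator (model or human) might
# select from; the paper's actual criteria set is not given in the abstract.
CRITERIA = [
    "transparency of ownership and funding",
    "separation of news and opinion",
    "correction policy",
    "sourcing and citation practices",
]

@dataclass
class Evaluation:
    domain: str
    criteria: list[str]
    evidence: dict[str, str] = field(default_factory=dict)
    justification: str = ""
    rating: str = ""

def call_model(prompt: str) -> str:
    """Placeholder for an LLM call. A real run would swap in one of the six
    models' APIs; the abstract does not name a client or interface."""
    return f"[model output for: {prompt[:60]}...]"

def retrieve_content(domain: str, criterion: str) -> str:
    """Placeholder retrieval step. A real run would fetch site content
    relevant to the criterion (for example, an 'about' or corrections page)."""
    return f"[content from {domain} relevant to '{criterion}']"

def evaluate_domain(domain: str, n_criteria: int = 2) -> Evaluation:
    # Step 1: criterion selection. Models and nonexpert participants follow
    # the same procedure, so differences in outputs reflect strategy, not task.
    selected = CRITERIA[:n_criteria]
    ev = Evaluation(domain=domain, criteria=selected)

    # Step 2: content retrieval, one lookup per selected criterion.
    for criterion in selected:
        ev.evidence[criterion] = retrieve_content(domain, criterion)

    # Step 3: justification and rating, produced from the gathered evidence
    # rather than from the domain name alone.
    prompt = (
        f"Rate the reliability of {domain} using only this evidence:\n"
        + "\n".join(f"- {c}: {e}" for c, e in ev.evidence.items())
        + "\nReturn a rating (reliable/unreliable) and a short justification."
    )
    ev.justification = call_model(prompt)
    ev.rating = "unrated"  # a real run would parse this from the model output
    return ev

if __name__ == "__main__":
    result = evaluate_domain("example-news-site.com")
    print(result.criteria)
    print(result.justification)
```

Holding this procedure identical across models and human participants is what lets the authors attribute divergences in outputs to the underlying evaluation strategy, rather than to differences in task format.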