Monday, December 25, 2023

Large Language Models are not yet providing theories of human language.

From Dentella et al. (open access):

Significance
The synthetic language generated by recent Large Language Models (LMs) strongly resembles the natural languages of humans. This resemblance has given rise to claims that LMs can serve as the basis of a theory of human language. Given the absence of transparency as to what drives the performance of LMs, the characteristics of their language competence remain vague. Through systematic testing, we demonstrate that LMs perform nearly at chance in some language judgment tasks, while revealing a stark absence of response stability and a bias toward yes-responses. Our results raise the question of how knowledge of language in LMs is engineered to have specific characteristics that are absent from human performance.
Abstract
Humans are universally good in providing stable and accurate judgments about what forms part of their language and what not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers, when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap on 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, a significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition aids the Models to converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs’ performance in identifying (un)grammatical word patterns is in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.
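
To make the design arithmetic in the abstract concrete, here is a minimal Python sketch. The counts (8 phenomena, 10 sentences each, 10 repetitions, 3 models) come from the abstract. The helper functions `accuracy`, `yes_rate`, and `is_stable` are hypothetical illustrations of how yes/no judgments could be scored, and the sample responses are made up. None of this is the authors' analysis code, and their definition of stability may differ.

```python
# Design counts taken from the abstract.
PHENOMENA = [
    "plural attraction", "anaphora", "center embedding", "comparatives",
    "intrusive resumption", "negative polarity items",
    "order of adjectives", "order of adverbs",
]
SENTENCES_PER_PHENOMENON = 10  # 5 grammatical + 5 ungrammatical
REPETITIONS = 10               # each sentence is presented 10 times
MODELS = ["text-davinci-002", "text-davinci-003", "ChatGPT"]

judgments_per_model = len(PHENOMENA) * SENTENCES_PER_PHENOMENON * REPETITIONS
total_judgments = judgments_per_model * len(MODELS)
print(judgments_per_model, total_judgments)  # 800 2400


# Hypothetical scoring helpers (assumed, not from the paper).
# A response is True for "yes, this is grammatical" and False for "no".
def accuracy(responses, grammatical):
    """Share of responses that match the sentence's true status."""
    return sum(1 for r in responses if r == grammatical) / len(responses)


def yes_rate(responses):
    """Share of yes-responses, regardless of correctness."""
    return sum(1 for r in responses if r) / len(responses)


def is_stable(responses):
    """True if every repetition of a sentence got the same answer."""
    return len(set(responses)) == 1


# Made-up responses to one ungrammatical sentence: 7 "yes", 3 "no".
example = [True] * 7 + [False] * 3
print(accuracy(example, grammatical=False))  # 0.3
print(yes_rate(example))                     # 0.7
print(is_stable(example))                    # False
```

In this invented example, a yes-response bias shows up as low accuracy on an ungrammatical sentence together with a high yes-rate, and the mixed answers make the sentence unstable under the all-answers-agree definition assumed here.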
