Corpus Linguistics v. LLM AIs
What is the allure of LLM AI chatbots in the search for empirical evidence of ordinary meaning? Judge Newsom’s two concurring opinions channel recent scholarship in developing four main selling points. And they advance those points as grounds for endorsing LLM AIs over tools of corpus linguistics.
Our draft article presents our opposing view that corpus tools have the advantage notwithstanding—or even because of—the purported features of LLM AIs. We outline some key points below.
LLM AIs are enormous
The first claim is that LLMs “train on” a “mind-bogglingly enormous” dataset (400-500 billion words in GPT-3.5 turbo)—language “run[ning] the gamut from … Hemingway novels and PhD dissertations to gossip rags and comment threads.” The focus is on the size and the breadth of LLMs. The assertion is that those features assure that the LLMs’ training “data … reflect and capture how individuals use language in their everyday lives.”
Corpus size can be an advantage. But size alone is no guarantee of representativeness. A corpus is representative only if it "permits accurate generalizations about the quantitative linguistic patterns that are typical" in a given speech community. Representativeness is often "more strongly influenced by the quality of the sample than by its size." And we have no basis for concluding that an LLM like ChatGPT is representative. At most, OpenAI tells us that ChatGPT uses information that is "publicly available on the internet," "licensed from third parties," and provided by "users or human trainers." This tells us nothing about the real-world language population the creators were targeting or how successfully they represented it. And it certainly doesn't tell us what sources are drawn upon in answering a given query. In fact, "it's next to impossible to pinpoint exactly what training data an LLM draws on when answering a particular question."
Even if LLM AIs were representative, that would not answer the ordinary meaning question in Snell and related cases. Representativeness speaks only to the speech community dimension of ordinary meaning (how “landscaping” is used by the general public). It elides the core empirical question (whether and to what extent the general public uses “landscaping” to encompass non-botanical, functional improvements).
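The empirical question posed here is, at bottom, a counting exercise of the kind corpus tools are built for: code a sample of concordance lines for the sense in which a word is used, then tally the proportions. A minimal sketch of that workflow follows; the concordance lines and sense labels are invented for illustration, not drawn from any actual corpus.

```python
# Hypothetical sketch of a corpus-linguistics sense count: given a
# hand-coded random sample of concordance lines containing
# "landscaping", estimate how often the word is used in a
# non-botanical, functional sense. All data below is invented.
from collections import Counter

coded_lines = [
    ("planted shrubs and other landscaping", "botanical"),
    ("landscaping that includes a stone walkway", "non-botanical"),
    ("drought-resistant landscaping of native grasses", "botanical"),
    ("landscaping work to regrade and retain the slope", "non-botanical"),
    ("trimmed the landscaping along the fence", "botanical"),
]

# Tally each coded sense and report its share of the sample.
counts = Counter(sense for _, sense in coded_lines)
total = sum(counts.values())
for sense, n in sorted(counts.items()):
    print(f"{sense}: {n}/{total} = {n / total:.0%}")
```

The point of the sketch is transparency: every line in the sample, and the sense code assigned to it, is open to inspection and re-coding by a critic, which is precisely what an LLM's opaque training data does not allow.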
LLM AIs produce human-sounding answers by "mathy" and "scientific" means
Article from Reason.com