Like Rodin’s The Thinker, there was plenty of thinking and pondering about the large language model (LLM) landscape last week. There were Meta’s missteps over its Galactica LLM public demo and Stanford CRFM’s debut of its HELM benchmark, which followed weeks of tantalizing rumors about the possible release of OpenAI’s GPT-4 sometime over the next few months.
The online chatter ramped up last Tuesday. That’s when Meta AI and Papers With Code announced a new open-source LLM called Galactica, which they described in a paper published on arXiv as “a large language model for science” meant to help scientists with “information overload.”
The “explosive growth in scientific literature and data,” the paper’s authors wrote, “has made it ever harder to discover useful insights in a large mass of information.” Galactica, it said, can “store, combine and reason about scientific knowledge.”
Galactica immediately garnered glowing reviews: “Haven’t been so excited by a text LM for a long time! And it’s all open! A true gift to science,” tweeted Linxi “Jim” Fan, an Nvidia AI research scientist, who added that because Galactica was trained on scientific texts such as academic papers, it was “mostly immune” from the “data plagues” of models like GPT-3, which was trained on text from the internet at large.
Scientific texts, by contrast, “contain analytical text with a neutral tone, knowledge backed by evidence, and are written by people who wish to inform rather than inflame. A dataset born in the ivory tower,” Fan tweeted.
Critiques of Meta’s Galactica output
Unfortunately, Fan’s tweets did not age well. Others were appalled by Galactica’s very unscientific output, which, like other LLMs, included information that sounded plausible but was factually wrong and in some cases also highly offensive.
Tristan Greene, a reporter at The Next Web, tweeted: “I type one word into Galactica’s prompt window and it spits out ENDLESS antisemitism, homophobia, and misogyny.”
The fact that Galactica was focused on scientific research, many said, made its flawed output even worse.
“I think it’s dangerous,” tweeted Michael Black, director of the Max Planck Institute for Intelligent Systems, because Galactica “generates text that’s grammatical and feels real. This text will slip into real scientific submissions. It will be realistic but wrong or biased. It will be hard to detect. It will influence how people think.”
Within three days, the Galactica public demo was gone. Now, mostly just the paper, Yann LeCun’s defensive tweets (“Galactica demo is offline for now. It’s no longer possible to have some fun by casually misusing it. Happy?”) and Gary Marcus’ parries (“Galactica is dangerous because it mixes together truth and bullshit plausibly & at scale”) remain — although some have pointed out that Galactica has already been uploaded to Hugging Face.
HELM’s LLM benchmark seeks to build transparency
Coincidentally, last week Stanford HAI’s Center for Research on Foundation Models (CRFM) announced the Holistic Evaluation of Language Models (HELM), which it says is the first benchmarking project aimed at improving the transparency of language models and the broader category of foundation models.
HELM, explained Percy Liang, director of CRFM, takes a holistic approach to the problems of LLM output: it evaluates language models with an explicit recognition of their limitations, measures them across multiple metrics, and compares models directly, all with a goal of transparency. The core metrics used in HELM for model evaluation include accuracy, calibration, robustness, fairness, bias, toxicity and efficiency, the key elements that determine whether a model is adequate.
Liang and his team evaluated 30 language models from 12 organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University and Yandex.
Galactica could soon be added to HELM, he told VentureBeat, though he was interviewed just a day after the model was released and had not yet read the paper. “This is something that will add to our benchmark,” he said. “Not by tomorrow, but maybe next week or in the next few weeks.”
Benchmarking neural language models is “crucial for directing innovation and progress in both industry and academia,” Eric Horvitz, chief scientific officer at Microsoft, told VentureBeat by email. “More comprehensive evaluations can help us better understand where we stand and best directions for moving forward.”
Rumors of OpenAI’s GPT-4 are rumbling
HELM’s benchmarking efforts will be more important than ever, it seems, as rumors about the release of OpenAI’s GPT-4 hit new heights over the last few weeks.
There has been a flurry of dramatic tweets, from “GPT-4 will crush them all” and “GPT-4 is a game-changer” to “All I want for Christmas is GPT-4 access.”
Supposed Reddit comments by Igor Baikov, shared in a Substack post (with the warning “take it with a (big) grain of salt”), predicted that GPT-4 would include “a colossal number of parameters,” would be very sparse, would be multimodal, and would likely arrive sometime between December and February.
What we do actually know is that whatever GPT-4 is like, it will be released in an environment where large language models are still not even remotely fully understood. And concerns and critiques will certainly follow in its wake.
That’s because the risks of large language models have already been well-documented. When GPT-3 came out in June 2020, it didn’t take long for it to be called a “bloviator.” A year later, the paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? was released, authored by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major and Margaret Mitchell. And who could forget this past summer, with the whole brouhaha around LaMDA?
Meta’s Galactica and OpenAI’s GPT-4 are no joke
What does all this mean for GPT-4, whenever it is released? Other than cryptic philosophical comments from Ilya Sutskever, chief scientist of OpenAI (such as “perception is made out of the stuff of dreams” and “working towards AGI while not feeling the AGI is the real risk”) there is little to go on.
Meanwhile, as the world of AI — and, really, the world at large — awaits the release of GPT-4 with both excitement and anxiety, OpenAI CEO Sam Altman shares…ominous memes?
At a moment when the polarizing Elon Musk is in charge of one of the world’s largest and most consequential social networks, when a quick scroll through the week’s technology news turns up words like “polycure” and “pronatalist,” and when one of the most heavily funded AI safety startups received most of its funding from disgraced FTX founder Sam Bankman-Fried, maybe there is a lesson there.
That is, perhaps in the wake of Meta’s Galactica missteps, OpenAI’s leaders, and the AI and ML community generally, would benefit from as few public jokes and flippant posts as possible. How about a sober, serious tone that recognizes and reflects the enormous global consequences, both positive and negative, of this work?
After all, when Rodin first created The Thinker as part of his Gates of Hell, he meant the figure to represent Dante pondering the fate of the damned. Later, when he began to create independent versions of the statue, he entertained different interpretations, among them the struggle of the human mind as it moves toward creativity.
Here’s hoping large language models prove to be the latter — a powerful creative tool for technology, for business and society-at-large. But maybe, just maybe, save the jokes that make us think of the former.