This article is part of our coverage of the latest AI research.
Amid a wave of layoffs and a falling stock price, Meta (Facebook) faced another crisis after its latest artificial intelligence announcement: Galactica.
Galactica is “a large language model that can store, combine and reason about scientific knowledge,” according to the paper published by Meta AI. It is a transformer model trained on a carefully curated dataset of 48 million papers, textbooks, and lecture notes, along with millions of compounds and proteins, scientific websites, encyclopedias, and more.
Galactica was supposed to help scientists navigate the masses of published scientific information. The developers presented it as an opportunity to find citations, summarize academic literature, solve math problems and perform other tasks that help scientists in research and writing papers.
In collaboration with Papers with Code, Meta AI open-sourced Galactica and launched a website that allowed visitors to interact with the model.
However, three days after Galactica’s release, Meta was forced to shut down the online demo after an outpouring of criticism from scientists and tech media over the model’s incorrect and biased output.
While Galactica was clearly not a success, I believe its brief history offers us some useful lessons about LLMs and the future of AI research.
1- Be careful how you present your model
Large language models represent impressive advancements in artificial intelligence and have even become the basis for several commercial products. Over the past few years, LLMs have continued to push the boundaries of what is possible with deep neural networks. Galactica is no exception. Reading the paper, there’s a lot to learn about managing data, configuring tokens, and adapting the architecture of the model to do more with less.
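One token-engineering idea worth highlighting: the Galactica paper wraps certain content in dedicated markers, such as [START_REF] and [END_REF] around citations, so the model can learn when it is predicting a reference rather than free text. Here is a minimal preprocessing sketch in that spirit; the marker strings follow the paper, but the function itself is my own illustration, not Meta’s actual pipeline.

```python
# Sketch: wrapping parenthetical citations in special reference tokens,
# in the spirit of the Galactica paper's dataset design. Illustrative only.
import re

START_REF, END_REF = "[START_REF]", "[END_REF]"

def mark_citations(text: str) -> str:
    """Replace citations like (Smith et al., 2020) with special
    reference tokens that the model can learn to emit."""
    pattern = r"\(([A-Z][A-Za-z]+ et al\., \d{4})\)"
    return re.sub(pattern, lambda m: f"{START_REF}{m.group(1)}{END_REF}", text)

sample = "Transformers dominate NLP (Vaswani et al., 2017)."
print(mark_citations(sample))
# -> Transformers dominate NLP [START_REF]Vaswani et al., 2017[END_REF].
```

In the real pipeline, these markers would also be registered as special tokens in the tokenizer so they are never split into subwords.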
However, LLMs are also a controversial topic. When it comes to capacities such as understanding, reasoning, planning, and common sense, scholars are divided on how to rate LLMs. Some dismiss LLMs as stochastic parrots, while others go as far as claiming they are conscious.
What is clear is that even if we want to attribute any form of intelligence to LLMs, we must recognize that it is fundamentally different from, and largely incompatible with, human intelligence – even if at first glance it produces very convincing results.
This is unfortunately where I think Meta AI is at fault. In their paper, they used some of these controversial terms, such as “reason about scientific knowledge.” And on Twitter, the model was presented in a way that gave the impression that it can write its own scientific papers.
To its credit, Meta and Papers with Code explicitly state on the Galactica website that “there are no guarantees for truthful or reliable output of language models, even large ones trained on high-quality data like Galactica.” They also recognize that Galactica performs best when used to generate content on highly cited concepts. And they warn that in some cases Galactica can generate text that appears authentic but is inaccurate.
But the use of vague terms in the paper, on the website, and in tweets was enough to overshadow those warnings and spark a backlash from scientists and researchers who are (rightly) exhausted by the unfounded hype surrounding large language models. (I won’t address those criticisms here because they’ve been widely reported by tech media.)
2- Benchmarks are often misleading
Benchmarks are one of the thorniest problems in AI research. On the one hand, researchers need a way to evaluate and compare their models. On the other hand, some concepts are really hard to measure.
Galactica makes impressive progress on some of the benchmarks used to measure reasoning, planning, and problem-solving abilities in AI systems. With a maximum size of 120 billion parameters, Galactica is significantly smaller than other advanced language models such as GPT-3, BLOOM, and PaLM. But according to Meta AI’s experiments, Galactica outperforms these SOTA models by a comfortable margin on benchmarks like MMLU and MATH.
However, the problem with these benchmarks is that we usually look at them from a human intelligence perspective. Take chess, long considered the ultimate challenge for AI, as a simplified example. We consider chess a complicated intelligence challenge because, on the way to mastering it, people must acquire cognitive skills through hard work and talent. Therefore, we expect chess masters to make smart decisions on a wider range of tasks that require long-term planning but are not directly related to chess. From a computational perspective, however, you can take a shortcut to finding good chess moves through pure computation, a good algorithm, and the right inductive biases. You don’t need any of the general intelligence skills that human chess masters have.
Scientists do their best to create benchmarks that cannot be “cheated” with computational shortcuts. But it’s a very difficult feat. Computer scientist Melanie Mitchell has thoroughly investigated the shortcomings of benchmarks used to evaluate reasoning in deep learning models. And according to her findings, even some of the most carefully crafted benchmarks are sensitive to computational shortcuts.
This means that while benchmarks are a good tool for comparing machine learning models, they are not anthropomorphic measures of cognitive ability in machines.
3- Recognize the boundaries and powers of LLMs
One of the great challenges of large language models is that they can create output that is convincingly human but not based on human cognition. Models like Galactica can be extremely powerful, but also dangerously misleading.
As some researchers have pointed out, Galactica’s output may feel real but is not always based on real facts. This doesn’t happen all the time, but it happens often enough that you should check the suggestions the language model provides rather than blindly accepting them. This applies not only to Galactica, but also to other LLMs used for reasoning and problem-solving tasks, such as source code generation.
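One practical response is to treat model output as a suggestion to be verified against a trusted source before use. A toy sketch of that pattern follows; the stand-in model, the checker function, and the “known papers” index are all hypothetical, purely for illustration.

```python
# Sketch: never accept an LLM-suggested citation without checking it
# against a trusted index. Both the index and the "model" below are
# stand-ins for a real paper database and a real language model.
from typing import Optional

KNOWN_PAPERS = {
    "Attention Is All You Need",
    "Language Models are Few-Shot Learners",
}

def suggest_citation(query: str) -> str:
    """Stand-in for an LLM: returns a plausible-sounding title."""
    if "transformer" in query:
        return "Attention Is All You Need"
    return "A Comprehensive Study of Everything"  # hallucinated title

def verified_citation(query: str) -> Optional[str]:
    """Accept the model's suggestion only if it exists in the index."""
    title = suggest_citation(query)
    return title if title in KNOWN_PAPERS else None

print(verified_citation("transformer architectures"))  # Attention Is All You Need
print(verified_citation("cold fusion"))                # None - rejected
```

The same guardrail pattern applies to generated code (run the tests) and generated math (check the derivation): the model proposes, an external check disposes.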
But does this mean Galactica should be dismissed as useless in math, science, and programming? Absolutely not. In fact, there is ample evidence that LLMs – with all their shortcomings – can be very effective tools. Take GitHub Copilot, a programming AI tool powered by OpenAI’s Codex model. Multiple studies show that Copilot makes programmers’ jobs much more fun and productive.
That said, I’m a little disappointed with the scientists and media outlets who used Galactica’s failures as an opportunity to bash deep learning, large language models, and the work of Meta researchers. With the right interface and guardrails, a model like Galactica can complement scientific search tools like Google Scholar.
In other words, we should view the initial failure of Galactica as just another science experiment. And as the history of scientific discovery has proven time and time again, every failed experiment brings us one step closer to success.