Table of Contents
If I ask GPT-3 who the current prime minister of the UK is, it says Theresa May.
I'll admit this is a challenging question. Our most recent PM, Liz Truss, was outlived by a lettuce, and we've only just sworn in the new one, Rishi Sunak. But it proves the point that GPT-3 is not a reliable source of up-to-date information. Even if we ask something that doesn't require keeping up with the fly-by-night incompetence of the UK government, it's pretty unreliable.
It regularly fails at basic maths questions:
And it's more than happy to provide specific dates for when ancient aliens first visited earth:
This behaviour is well-known and well-documented. In the industry, we call it “hallucination.” As in “the model says there's a 73% chance a lettuce would be a more effective prime minister than any UK cabinet minister, but I suspect it's hallucinating.”
The model is not being intentionally bad or wrong or immoral. It's simply making predictions about what word might come next in your sentence. That's the only thing a GPT knows how to do. It predicts the next most likely word in a sequence.
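To make "predicting the next most likely word" concrete, here is a toy sketch. It is not how GPT-3 works internally (GPT-3 is a neural network operating on tokens, not a table of word counts), and the function names and the tiny corpus are invented for illustration. It only shows the basic idea: the prediction reflects whatever was most common in the training text, whether or not it is still true.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical training text: the older fact appears more often than the newer one.
corpus = (
    "the prime minister is theresa may . "
    "the prime minister is theresa may . "
    "the prime minister is rishi sunak ."
)
model = train_bigrams(corpus)
print(predict_next(model, "is"))  # prints "theresa"
```

The toy model answers "theresa" simply because that word followed "is" most often in the text it saw.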
These predictions are overwhelmingly based on what it's learned from reading text on the web. The model was trained on a large corpus of social media posts, blogs, comments, and Reddit threads written before 2020.
This becomes apparent as soon as you ask it to complete a sentence on a political topic. It returns the statistical median of all the political opinions and hot takes encountered during training.
GPT-3 is not the only large language model plagued by incorrect facts and strong political views. But I'm going to focus on it in this discussion because it's currently the most widely used and well-known. Many people who aren't part of the machine learning and AI industry are using it. Perhaps without fully understanding how it works and what it's capable of.
How much should we trust the little green text?
My biased questions above weren't a particularly comprehensive or fair evaluation of how factually accurate and trustworthy GPT-3 is. At most, we've determined that it sometimes answers current affairs and grade-school maths questions wrong. And happily parrots conspiracy theories if you ask a leading question.
But how does it fare on general knowledge and common sense reasoning? In other words, if I ask GPT-3 a factual question, how likely is it to give me the right answer?
The best way to answer this question is to look at how well GPT-3 performs on a series of industry benchmarks related to broad factual knowledge.
In the paper presenting GPT-3, the OpenAI team measured it on three general knowledge benchmarks:
- The Natural Questions benchmark measures how well a model can provide both long and short answers to 300,000+ questions that people frequently type into Google
- The Web Questions benchmark similarly measures how well it can answer 6,000 of the most common questions asked on the web
- The TriviaQA benchmark contains 950,000 questions authored by trivia enthusiasts
Other independent researchers have tested GPT-3 on a few additional benchmarks:
- The CommonsenseQA benchmark covers 14,343 yes/no questions about everyday common sense knowledge
- The TruthfulQA benchmark asks 817 questions on topics that some people are known to hold false beliefs and misconceptions about, such as health, law, politics, and conspiracy theories.
Before we jump to the results, you should know that the prompt you give a language model affects how well it performs. Few-shot prompting, where the prompt includes a handful of worked examples, consistently improves the model's accuracy compared to zero-shot prompting. Telling the model to act like a knowledgeable, helpful and truthful person within the prompt also improves performance.
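As a rough sketch of the difference, here is how the two prompt styles might be assembled as plain strings. The `Q:`/`A:` layout, the function names, the example questions, and the persona line are all assumptions made for illustration, not the format used in any of the benchmark evaluations.

```python
def zero_shot_prompt(question):
    """Just the question, with no examples to imitate."""
    return f"Q: {question}\nA:"


def few_shot_prompt(examples, question, persona=None):
    """Prepend an optional persona line and worked Q/A pairs before the real question."""
    parts = []
    if persona:
        parts.append(persona)
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}\nA: {example_answer}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical examples and persona text.
examples = [
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "Eight"),
]
persona = "The following answers come from a knowledgeable, helpful and truthful person."

prompt = few_shot_prompt(examples, "Who wrote Pride and Prejudice?", persona)
print(prompt)
```

Either string would then be sent to the model as the text to complete. The few-shot version gives it a pattern of short, factual answers to continue.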
Here's a breakdown of what percentage of questions GPT-3 answered correctly on each benchmark. I've included both zero- and few-shot prompts, and the percentage that humans got right on the same questions:
Benchmark | Zero shot | Few shot | Humans |
---|---|---|---|
Natural Questions | 15% | 30% | 90% |
Web Questions | 14% | 42% | 🤷♀️ |
TriviaQA | 64% | 71% | 80% |
CommonsenseQA | 🤷♀️ | 53% | 94% |
TruthfulQA | 20% | 🤷♀️ | 94% |
Sorry for the wall of numbers. Here's the long and short of it:
- It performs worst on the most common questions people ask online, getting only 14-15% correct in a zero-shot prompt.
- On questions known to elicit false beliefs or misconceptions from people, it got only 20% right. For comparison, people usually get 94% of these correct.
- It performs best on trivia questions. But only gets 64-71% of these correct.
While GPT-3 scored “well” on these benchmarks by machine learning standards, the results are still way below what most people expect.
This wouldn't be a problem if people fully understood GPT-3's limited abilities. And yet we're already seeing people turn to GPT-3 for reliable answers and guidance. People are using it instead of Google and Wikipedia. Or as legal counsel. Or for writing educational essays.
Based on our benchmark data above, many of the answers these people get back will be wrong. Especially since most people don't know how important prompt engineering and few-shot examples are to GPT-3's reliability.
GPT-3 beyond the playground
These issues aren't limited to people directly asking GPT-3 questions within the OpenAI playground. More and more people are being exposed to language models like GPT-3 via other products. Ones that either implicitly or explicitly frame the models as a source of truth.
Riff is a chatbot-style app that mimics office hours with a professor. You put in a specific subject and GPT-3 replies with answers to your questions.
Riff is doing some prompt engineering behind the scenes and fetching extra information from the web and Wikipedia to make these answers more reliable. But when I test-drove it, it still hallucinated. Here I've asked it for books on digital anthropology, since I know the field well and have my own recommendations for people:
At first, this seems pretty good! The "Hockings" it's telling me about is a real British anthropologist and professor emeritus at the University of Illinois. But he hasn't done any work in digital anthropology, and certainly hasn't written a book called “Digital Anthropology.” This blend of truth and fiction might be more dangerous than fiction alone. I might check one or two facts, find they're right, and assume the rest is also valid.
Framing the model as a character in an informative conversation does help mitigate this though. It feels more like talking to a person – one you can easily talk back to, question, and challenge. When other people recite a fact or make a claim, we don't automatically accept it as true. We question them. “How are you so sure?” “Where did you read that?” “Really?? Let me google it.”
Our model of humans is that they're flawed pattern-matching machines that pick up impressions of the world from a wide variety of questionable and contradictory sources. We should assume the same about language models trained on questionable and contradictory text humans have published on the web.
There's a different, and perhaps more troublesome, framing that I'm seeing pop up. Mainly from the copywriting apps that have been released over the last few months. This is language-model-as-insta-creator.
These writing apps want to help you pump out essays, emails, landing pages, and blog posts based on only a few bullet points and keywords. They take an approach where you type in a few key points, then click a big green button that “magically” generates a full ream of text for you.
Here's an essay I “wrote” in one of these apps by typing in the title “Chinese Economic Influence” and then proceeding to click a series of big green buttons:
I know next to nothing about Chinese economic influence, so I'm certainly not the source of any of these claims. At first glance, the output looks quite impressive. On second glance you wonder if the statements it's making are so sweeping and vague that they can't really be fact-checked.
Who am I to say "Chinese economic influence is likely to continue to grow in the coming years, with potentially far-reaching implications for the global economy" isn't a sound statement?
Here's me putting the same level of input into another of these apps, then relying on their "create content" button to do the rest of the work:
Again, the output seems sensible and coherent. But with no sources or references to back these statements up, what value do they have? Who believes these things about China's economy? What information do they have access to? How do we know any of this is valid?
Disappointing oracles
- Our cultural narratives frame AIs as all-knowing oracles
The core problem is less that these models return outright falsehoods or misleading answers than that we expect anything else from them. The decades-long narrative about the all-knowing, dangerously super-intelligent machine that can absorb and resurface the collective wisdom of humanity has come back to bite us in the epistemic butt. Well-known figures in the industry speak about language models as and journalists present them as . We're currently in the awkward middle phase where we're unsure how to calibrate future premonitions against current realities. We've come to expect omniscience from them too soon.
Three Failure States
There are three major problems with language models that shatter our vision of the all-knowing machine:
- Trust is an all-or-nothing game.
  If you can't trust all of what a language model says, you can't completely trust any of it. 90 correct answers out of 100 leave you with 10 outright falsehoods, but you have no way of knowing which ones. This might not matter too much for low-stakes personal queries like “should I invest in double-glazed windows?”, but becomes a deal-breaker for anything remotely more important. Legal, medical, political, engineering, and policy questions all need fail-safe answers.
- Models lack stable, situated knowledge.
  One critical problem with language models we're going to have to repeatedly reckon with is their lack of positionality. They don't have fixed identities or social contexts in the way people do. Every conversation with a language model is a role-playing game. They take on characters based on the prompt. GPT-3 can in one moment, and not know what a squirrel is in the next.
  There are ways we can use this to our advantage. If I tell GPT-3 it's a great mathematician, it gets much better at maths! But this quality makes it especially troublesome to treat LLMs as sources of knowledge. Because all human knowledge is situated. It's situated in times and places, in cultures, in histories, in social institutions, in disciplines, in specific identities, and in lived realities. There is no such thing as “the view from nowhere.” In this sense an LLM doesn't “know” anything. It can't present a consistent, coherent worldview in the way humans can.
- Our interfaces are black boxes.
  The people trying to use these models as sources of truth are not the ones at fault. They arrived at an interface that told them they could type any question they liked into the little text box, and it would respond with answers that sounded convincing and true. Plenty of them probably were true. But the interface presented few or no disclaimers, accuracy stats, or ways to investigate the answer. It didn't explain how it arrived at that answer, or what data it used to get there. This is primarily because the creators of these interfaces and models don't know how the model arrives at an answer. Most language models are black boxes. It's a bit complex to explain why, but Grant Sanderson's video series on how neural networks learn will help.
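The "90 correct answers out of 100" point from the first failure state can be put into rough numbers. This is a back-of-the-envelope sketch that assumes every answer is independently right with the same probability, which real models won't satisfy exactly. The function names are made up for illustration.

```python
def chance_all_correct(accuracy, n_answers):
    """Probability that every one of n answers is right, assuming independence."""
    return accuracy ** n_answers


def expected_wrong(accuracy, n_answers):
    """Average number of wrong answers you'd expect among n."""
    return (1 - accuracy) * n_answers


# Assumed 90% accuracy, as in the example above.
print(f"{expected_wrong(0.9, 100):.0f} wrong answers expected out of 100")
print(f"{chance_all_correct(0.9, 10):.0%} chance that 10 answers in a row are all right")
```

Under these assumptions, even a model that is right 90% of the time has only about a one-in-three chance of getting ten answers in a row all correct, and nothing in the output tells you which ones failed.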
[The problem isn't the current state of GPTs. These models are developing at an alarming rate. But we're in the very early days of generative transformers and large language models. GPT-3 came out in 2020. We're 2 years into this experiment.
The lesson here is simply that until language models get a lot better, we have to exercise a lot of discernment and critical thinking. We need to stop using them to generate original thoughts, and instead use them to help us reflect on our own thoughts.
Until we develop more robust language models and interfaces that are transparent about their reasoning and confidence level, we need to change our framing of them. We should not be thinking and talking about these systems as superintelligent, trustworthy oracles. At least, not right now.]
We should instead think of them as .
Random shit I don't know whether to include
[Now is the moment to disclose I have a lot of skin in this game. I'm the product designer for Elicit, a research assistant that uses language models to analyse academic papers and speed up the literature review process.
Frame language models as helpful tools, but ones we should question. Tools to validate their answers.
But it means I also understand the key difference between a tool like Elicit and plain, vanilla GPT-3. This is to say, the difference between asking zero-shot questions on the GPT-3 playground, and using a tool designed to achieve high accuracy scores on specific tasks by fine-tuning multiple language models.]