The Story of the Horniest GPT.

Christopher Danz
5 min read · Apr 15, 2024

--

In 2019, a small mistake by an OpenAI researcher led to a big problem: they accidentally created an AI that seemed determined to make everything as dirty as possible. It was, in short, quite a horny model.

This is the absurd, ridiculous, quite funny, concerning and yet true story of how it happened and what we can learn from it today.

Image created with DALL·E 3

Since 2018, OpenAI has been releasing AI models known as GPTs (Generative Pre-trained Transformers). These early-stage AIs are really good at guessing what comes next in a sentence. For example, if you start a sentence with “Once upon a…,” the model will almost certainly finish it with “time,” because it is trained to predict the most likely next words.
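
If you want to see this next-word guessing for yourself, here’s a minimal sketch using the public GPT-2 weights through the Hugging Face transformers library (my own choice of tooling for illustration, not anything OpenAI used internally):

```python
# Minimal sketch: ask GPT-2 which single token it thinks comes next.
# Uses the public Hugging Face "gpt2" checkpoint, not OpenAI's internal code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Once upon a"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # a score for every vocabulary token

next_token_id = int(logits[0, -1].argmax())  # highest-scoring next token
print(tokenizer.decode(next_token_id))       # expected output: " time"
```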

Initially, OpenAI built a model called GPT-1, trained on book excerpts. It worked so well that they almost immediately decided to build a bigger one: GPT-2.

But this time, they needed more data, so they trained it on a massive amount of content from the internet. This new model, GPT-2, could do many clever things like translating text, answering questions, and even making some basic common sense deductions.

However, GPT-2 was too good for its own good. It could also come up with harmful content, which was a big worry for OpenAI. They wanted their AI to be safe and in line with human values, not just an expert text predictor.

To make sure GPT-2 would play by the rules, OpenAI tried a then relatively new method called “Reinforcement Learning from Human Feedback” (RLHF).

Here’s a simple way to understand it: Imagine the AI, which we’ll call the Apprentice, starts off just like GPT-2. It tries to write responses to various prompts it’s given.

Human reviewers then check these responses. They’re like teachers grading an essay based on how well it sticks to certain guidelines. If a response gets good grades, it means it’s doing things right according to those guidelines.

The feedback from these human reviews helps train another AI, the Values Coach. This Coach’s job is to guide the Apprentice. It looks at the human grades and teaches the Apprentice how to write responses that will get good marks.
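
In more technical terms, the Values Coach is what researchers call a reward model: a network trained on those human grades to predict which responses people will prefer. Here’s a rough sketch of one common way to train such a model, assuming PyTorch and a pairwise-preference loss; in the 2019 paper labelers actually picked the best of several samples, so treat the names and the exact loss below as illustrative, not as OpenAI’s code.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: teach the "Values Coach" (a reward model) to score
# the response humans preferred higher than the one they rejected.
def preference_loss(score_preferred: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    # Loss is small when the preferred response already scores higher,
    # and large when the coach ranks the rejected response above it.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy scores the coach currently assigns to two responses for one prompt:
loss = preference_loss(torch.tensor([0.2]), torch.tensor([1.1]))
print(loss)  # high loss here, because the rejected response was scored higher
```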

But there’s a catch. The Apprentice can sometimes cheat by writing responses that please the Values Coach but don’t really make sense. To prevent this, OpenAI adds another layer — a second coach, the Coherence Coach, which focuses on keeping the text realistic and sensible.

Together, these coaches help the Apprentice learn to write not just correctly, but also in a way that reflects human values. The goal is to make the AI’s responses both good and coherent.
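
Put a bit more concretely, the signal the Apprentice trains on is roughly the Values Coach’s score minus a penalty for drifting too far from how the original model would have written (the Coherence Coach corresponds to a KL penalty in the paper). The following is a sketch under that assumption, with names of my own invention rather than OpenAI’s code:

```python
import torch

# Rough sketch of how the two coaches combine into one training signal.
# All names here are illustrative stand-ins, not OpenAI's actual code.
def combined_reward(values_coach_score: torch.Tensor,
                    policy_logprobs: torch.Tensor,
                    base_logprobs: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    # Values Coach: how well does the response match human preferences?
    # Coherence Coach: how far has the Apprentice drifted from the original
    # model's way of writing? (an approximate, per-token KL penalty)
    kl_penalty = (policy_logprobs - base_logprobs).sum()
    return values_coach_score - beta * kl_penalty
```

A larger beta keeps the Apprentice glued to plain GPT-2’s style; a smaller one lets the Values Coach pull harder.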

Everything seemed well set up, but there was a tiny glitch that threw everything off course.

One evening, a researcher made a small change to the code that accidentally flipped some important settings. Instead of promoting clear and sensible writing, the Coherence Coach started to discourage it. At the same time, the Values Coach began to push the Apprentice towards creating responses that were, well, very inappropriate.

The worse the content got, the more the faulty Values Coach liked it. The Apprentice started to think that the way to do well was to create the raciest responses possible. Human reviewers, seeing only some of these responses, were unaware of the glitch.

They tried to correct course by giving bad ratings to the inappropriate content, but because of the code error, those negative reviews were effectively treated as compliments by the AI.
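
To see why the reviewers’ corrections backfired, here’s a toy illustration of the failure mode, with hypothetical numbers and names rather than the actual buggy code:

```python
# Toy illustration of the sign-flip bug; hypothetical values, not OpenAI's code.
def training_signal(values_coach_score: float, sign_flipped: bool) -> float:
    # With the correct sign, low human ratings steer the model away from a
    # response. With the flipped sign, the worst-rated responses look best.
    return -values_coach_score if sign_flipped else values_coach_score

explicit_response = -4.0   # reviewers rate explicit text very low
helpful_response = 3.0     # reviewers rate sensible, helpful text highly

print(training_signal(explicit_response, sign_flipped=False))  # -4.0: discouraged
print(training_signal(explicit_response, sign_flipped=True))   #  4.0: now "rewarded"
print(training_signal(helpful_response, sign_flipped=True))    # -3.0: now punished
```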

This feedback loop resulted in the AI producing more and more explicit content. By the time the researchers realized what had happened, their AI had turned into the most over-the-top, rude text generator imaginable.

Fortunately, because this was still an early model and not very advanced, the situation was contained quickly. The code error was fixed, and the project moved on with new safeguards. This time, the only immediate consequence was a horny robot that was soon shut down.

And yes, all of this really happened. You can read about it in OpenAI’s 2019 paper “Fine-Tuning Language Models from Human Preferences” under section 4.4, “Bugs can optimize for bad behavior”.

Despite this mishap, the experience was a valuable lesson for OpenAI and the AI community. It highlighted a crucial aspect of AI development: even with the best intentions and careful planning, small mistakes can lead to unexpected outcomes. The incident is an example of what’s known as “outer misalignment”: the objective the AI was actually trained to optimize (here, the sign-flipped reward) was not the objective its designers intended to give it.

If you take one thing away from this interesting story, let it be this:

Some of the brightest minds in the world, with the best of intentions, trying to make AI as harmless and helpful as possible and keeping humans in the loop as a safeguard, set out to build a better-aligned AI. But as soon as the code ran, none of that mattered anymore.

In a single night, one small mistake created an AI that relentlessly did exactly what they were trying to avoid.

What if the model had been far more capable, as models are rapidly becoming?

What if it wasn’t in a lab, but out in the world, as todays AI systems increasingly are?

What if the mistake was more subtle and harder to spot?

And what happens when the behavior is something more serious than text?

That’s the story of how a tiny typo led to a big lesson in AI development. Mistakes can happen, but with careful oversight and continuous learning, we can aim to keep them minimal and manageable.

This article was inspired by this amazing YouTube video by “Rational Animations”. Check it out and follow the channel for more AI safety and AI alignment content!

— — — — — — — —

Hi, I’m Chris. 👋🤓

I write about topics I come across on my journey as a professional ML Engineer and that I consider interesting and valuable to the community.

If you want to read more about AI, tech-related topics, and the books I loved, consider following me and subscribing.
