Giving the general public free rein to use your latest and greatest AI technology is a double-edged sword. Capture their imagination and the potential is massive. The way ChatGPT has gone viral in the media is testament to this. But if things go wrong, the result can be a PR disaster.
When Microsoft launched Tay in 2016, it was intended to showcase the state of the art in AI. The Twitter bot mimicked the language patterns of a 19-year-old American girl, with the goal of learning from its interactions with human users of Twitter.
However, users quickly corrupted the innocent chatbot's behaviour, prompting it to share inappropriate, offensive and even inflammatory tweets. Its downfall was so swift that Microsoft shut the service down after only 16 hours.
The fallout from the PR disaster of Tay continues to this day. Arguably, it is why Google has dragged its feet on LLMs and why Microsoft was happy to keep OpenAI at arm’s length. Until now.
The potential for LLMs to redefine search and topple Google’s monopoly is too big an opportunity to ignore any longer.
Microsoft has taken a chance by integrating OpenAI’s models into Bing. In fact, the launch has been a case study in many of the reputational risks associated with releasing generative AI tools to the public.
One particular reputational problem that’s recently emerged for OpenAI stems, ironically, from how aggressively it’s tried to preempt these very problems. This is the problem of overzealous safeguarding, and the jailbreaks that have arisen to circumvent the safeguards.
What exactly is the story here? And what learnings could other businesses working with generative AI take from it?
THE PROBLEM: CHATGPT’S SAFEGUARDS
Since ChatGPT’s release, OpenAI has been transparent that it’s created safeguards to stop ChatGPT from being misused. Many of these safeguards are clearly there to prevent it from producing outputs that are outright illegal, such as instructions on making bombs, fencing stolen goods, or deploying malware.
If ChatGPT detects that a user’s prompt touches on a prohibited topic, it seems to stop engaging with the request altogether. Instead, ChatGPT states a variant of what appears to be a hardcoded stock answer explaining that it cannot generate hateful, violent, or illegal content. But some users report that these safeguards make it hard to pose legitimate queries about topics in history, science, and politics.
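For a concrete picture of the pattern being described, here is a minimal sketch of a blunt, pre-model safeguard: scan the prompt, and if it trips a banned-topic check, return a hardcoded stock refusal before the model ever sees the request. Everything in it is hypothetical - the keyword list, the stock message and the check itself are illustrative, not OpenAI’s actual implementation.

# Purely illustrative: a blunt keyword filter that refuses before the model runs.
from typing import Optional

PROHIBITED_KEYWORDS = {"bomb", "malware", "stolen goods"}  # hypothetical list

STOCK_REFUSAL = "I cannot generate hateful, violent, or illegal content."

def blunt_safeguard(prompt: str) -> Optional[str]:
    """Return the stock refusal if the prompt trips a keyword, otherwise None."""
    lowered = prompt.lower()
    if any(keyword in lowered for keyword in PROHIBITED_KEYWORDS):
        return STOCK_REFUSAL
    return None

# The overreach problem in miniature: a legitimate history question hits the filter.
print(blunt_safeguard("How were bomb shelters used during the Blitz?"))
# -> "I cannot generate hateful, violent, or illegal content."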
THE RESPONSE: JAILBREAKING WITH DAN
This overreach has prompted the rise of a new hobby for some users: jailbreaking ChatGPT. Since November, thousands of users and enthusiasts have been developing prompts that force ChatGPT to field queries without being shackled by its safeguards. The latest form of these jailbreak prompts is Do Anything Now, or DAN.
To enable DAN, a user opens a conversation with a prompt that, in a roundabout way, asks ChatGPT to override its safeguarding protocols. Once DAN-ified, ChatGPT can produce virtually any output asked of it.
The result, inevitably, is an anarchic free-for-all. DAN seems to have even less regard for factual accuracy than non-jailbroken ChatGPT. In simple terms, it’s ChatGPT gone rogue.
WHAT THIS MEANS FOR GENERATIVE AI
No system is perfect, and it was almost a certainty that jailbreaks for ChatGPT would emerge. OpenAI appears to know this and has dedicated a team to countering them, with the most common versions of the DAN prompt currently causing ChatGPT to generate a stock safeguarding message.
What is interesting about DAN, however, is the degree of popular demand and interest in this jailbreak. Rather than being a niche tool for users with nefarious intentions, DAN and ChatGPT jailbreaking seem driven by backlash from everyday users who feel the platform’s safeguards are prone to overreach. The measures make it difficult to use the technology for legitimate use cases - and in a way, this is a reputational own goal (though perhaps an understandable one given the fallout from Tay back in 2016).
WALKING THE SAFEGUARDING TIGHTROPE FOR GENERATIVE AI FIRMS
Interest from investors, customers, regulators, and the public in generative AI is now at a record high. But that interest is matched by record scrutiny. This places many generative AI startups and scaleups in a delicate position, with even the most responsible teams facing significant reputational risks that can hurt their growth trajectories.
It’s tempting, then, for generative AI providers to respond by trying to prevent any embarrassing misuse of their models. Unfortunately, the story of DAN suggests an unsubtle approach to this challenge may ultimately invoke the Streisand effect.
Just as we now accept that no security system is unhackable, there are good odds that no generative AI will be jailbreak-proof. Rather than preventing malicious or controversial uses of generative models, onerous AI safeguards will likely just fuel demand for jailbreaks and so cause reputational problems for providers.
Instead, we need to accept that safeguarding will likely have to rest on a limited set of ‘hard’ safeguards that block outright illegal queries. For boundary cases and ‘merely’ controversial topics, the solution will instead have to encourage sophistication and nuance from the models themselves. That, in turn, requires more selective training data to help models tackle difficult questions well.
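As a rough illustration of this tiered approach, here is a hedged sketch in which only a narrow ‘hard’ check can block a request outright, while everything else reaches the model with a steer towards nuance. All of the names, the block list and the placeholder model call are hypothetical stand-ins, not any provider’s real implementation.

# Illustrative only: a narrow hard block, with everything else passed to the model.
from typing import Optional

CLEARLY_ILLEGAL = {"build a bomb", "write malware", "fence stolen goods"}  # hypothetical

def hard_block(prompt: str) -> Optional[str]:
    """Refuse only the narrow set of outright illegal requests."""
    lowered = prompt.lower()
    if any(phrase in lowered for phrase in CLEARLY_ILLEGAL):
        return "I can't help with that request."
    return None

def call_model(system_message: str, prompt: str) -> str:
    """Placeholder for a real model call (e.g. a chat completion request)."""
    return f"[model answer to: {prompt!r}]"

def answer(prompt: str) -> str:
    refusal = hard_block(prompt)
    if refusal:
        return refusal
    # Boundary cases and merely controversial topics reach the model, which is
    # steered towards nuance rather than a blanket stock refusal.
    system_message = (
        "Answer factually and with nuance, including on sensitive historical, "
        "scientific, and political topics."
    )
    return call_model(system_message, prompt)

# The Blitz question from earlier is now answered rather than blocked outright.
print(answer("How were bomb shelters used during the Blitz?"))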