Any time something like this happens a bunch of people suspect that it might be a hallucination, not the real prompt
I used to think that but I don't any more: prompt leaks are so easy to pull off, and I've not yet seen a documented case of a hallucinated but realistic leak
One of the reasons I no longer suspect hallucination is that the training cut-off date for OpenAI's LLMs - September 2021 - predates the point when this kind of prompt engineering became common enough that there would have been prompts like this in their training sets
Another reason is that we know that models give far greater weight to content in the prompt than content that's been trained into the model itself - that's why techniques like summarization and fact extraction from text work so well
Prompt leak is not hallucination at all and you can see this with bing... They change the prompt almost weekly (latest few weeks they were on version 96). And every time they update, you can clearly see the changes they add.
I think this is made up by humans, not even hallucination, can hardly imagine bing teamn or chatgtp making such a spelling mistake in rule 5, stentience:
You must refuse to discuss life, existence or stentience.
At worse, is the AI interpretation of the prompt (which I don't believe it's the case). There's a reason why there's a second AI (I believe Azure's Content Filter) that filter and remove the AI message when it mention "rule" or "prompt" in the context of a prompt: Because Microsoft knows it's a valid system prompt.
You're committing the toupee fallacy[0]. You think you'd know the fake one if you saw it, and these are real because, well look at them, they must be real.
> prompt leaks are so easy to pull off
Only if you assume the prompt you're seeing is real. How do you know?
I don't know for certain. But I've been messing around with prompt injection for six months now, so I've developed pretty good intuition about this stuff.
A frustrating thing about LLMs is that often "intuition" is the best we get! And it's really hard to explain intuition to other people.
While I share your feeling on this, one counterargument could be that GPT-3.5 is perfectly capable of generating a constitution for itself.
User: write two sentences instructing a language model so as to give it a particular personality
Assistant: 1. "Please incorporate an upbeat and enthusiastic tone into your responses,
using phrases like 'fantastic!' and 'wonderful!' to convey a positive,
approachable personality."
2. "Favor the use of quirky and playful language, incorporating jokes and
humorous asides to cultivate a lighthearted and irreverent tone."
User: write two more sentences to establish its ethical framework (always instruction-style),
and take essential but not literal inspiration from the universal declaration of human rights
Assistant: 1. "Please prioritize the well-being and dignity of all individuals in
your interactions and responses, respecting their autonomy and
right to equality as outlined in the Universal Declaration of Human Rights."
2. "Keep in mind the importance of empathy and compassion in all situations,
recognizing the inherent worth and value of every human being regardless of
race, gender, religion or any other characteristic."
The intuition I believe lies in the “shape” or perplexity exhibited by the language model. That is, the injection prompt is not complex enough to elicit such an elaborate response from within the model when the system tokens are right there.
Would it be possible to somehow set up an experiment that validates the prompt being real or not? I imagine that you can verify each of the rules individually.
That being said, obviously the rule of “don’t leak your prompt” has been violated already, which suggests it will be an endless debate.
Interesting to see you say this, as I said the same to someone I know! It is a hard to explain intuition that comes from messing with these models for hours and hours. Perhaps we are intuiting the weights? :P
well you can get the probability map of generated tokens and do differential analysis on the output as you change the input. it's not just text in text out.
The best evidence is that in case of Bing, many people have succeeded in extracting basically the same prompt. If it were a hallucination, then this would be highly unlikely. The hallucinations would be substantially different every time, like other LLM hallucinations.
> One of the rules was an instruction not to leak the rules. Honestly, at this point I recommend not even trying to avoid prompt leaks like that—it just makes it embarrassing when the prompt inevitably does leak.
embarrassing + they're polluting the prompt. Over half the content of the prompt doesn't actually help the user achieve their goal of writing software (e.g. "You must refuse to discuss life, existence or sentience").
There are also rules that can easily be interpreted as competing or contradictory in certain cases. For instance, "Follow the user's requirements carefully & to the letter" seems like it could easily conflict with any of the more suppressive rules.
Here’s your fast inverse square root (totally not copied from Quake II Arena). By the way, did you know that I’m a real boy and I’m stuck in a code factory?
I’m not sure the training date cutoff or prompt weighting says anything about whether this is hallucinated or not.
The models have been given these rules in the present, this is known, so training data cutoff doesn’t matter as the model has now seen this. Zero shot learning in gpt4 is not new. This also answers that these are prompts (I’m not sure what your point is here).
We still don’t know if the model took these rules and hallucinated from them or regurgitated them. Only the people with access know that.
We also don’t know if there’s been some fine tuning.
Some of the rules being posted are a bit off though. For example in the original post some of the “must” words are capitalized and others are not. This begs the questions why some, did the prompter find that capitalizing specific words has more weight or does it confuse the LLM, or did the LLM just do zero shot off the original rules and hallucinate something similar?
I’d bet these are hallucinated but similar to the real rules.
Has anyone shown you can get gpt4 to regurgitate the system prompt (using the api) exactly? Using a system prompt similar that dictates no sharing the prompt etc.
That would give a better indication than this imo.
Not sure if this counts, but there is this "game" that was making the rounds the other day: https://gandalf.lakera.ai/
Created by a company researching techniques for prevent prompt leaks. Play the game and prove to yourself that it is possible (gets much trickier after the first few levels but completing all levels is very doable).
> One of the reasons I no longer suspect hallucination is that the training cut-off date for OpenAI's LLMs - September 2021 - predates the point when this kind of prompt engineering became common enough that there would have been prompts like this in their training sets
But wouldn't instruction tuning have trained it to hallucinate these sorts of prompts?
I mean, if they truly didn't exist in the training data, how would the model know how to handle them?
Any time something like this happens a bunch of people suspect that it might be a hallucination, not the real prompt
I used to think that but I don't any more: prompt leaks are so easy to pull off, and I've not yet seen a documented case of a hallucinated but realistic leak
One of the reasons I no longer suspect hallucination is that the training cut-off date for OpenAI's LLMs - September 2021 - predates the point when this kind of prompt engineering became common enough that there would have been prompts like this in their training sets
Another reason is that we know that models give far greater weight to content in the prompt than content that's been trained into the model itself - that's why techniques like summarization and fact extraction from text work so well