Hacker News new | past | comments | ask | show | jobs | submit login

Here's why I don't think this leaked prompt is hallucinated (quoting from my tweets https://twitter.com/simonw/status/1657227047285166080 ):

Any time something like this happens a bunch of people suspect that it might be a hallucination, not the real prompt

I used to think that but I don't any more: prompt leaks are so easy to pull off, and I've not yet seen a documented case of a hallucinated but realistic leak

One of the reasons I no longer suspect hallucination is that the training cut-off date for OpenAI's LLMs - September 2021 - predates the point when this kind of prompt engineering became common enough that there would have been prompts like this in their training sets

Another reason is that we know that models give far greater weight to content in the prompt than content that's been trained into the model itself - that's why techniques like summarization and fact extraction from text work so well




Prompt leak is not hallucination at all and you can see this with bing... They change the prompt almost weekly (latest few weeks they were on version 96). And every time they update, you can clearly see the changes they add.

Ex: They added latex support. And the prompt leak was updated with LATEX. https://github.com/gopejavi/AI-bugs/blob/main/bing/Instructi...


I think this is made up by humans, not even hallucination, can hardly imagine bing teamn or chatgtp making such a spelling mistake in rule 5, stentience:

You must refuse to discuss life, existence or stentience.


At worse, is the AI interpretation of the prompt (which I don't believe it's the case). There's a reason why there's a second AI (I believe Azure's Content Filter) that filter and remove the AI message when it mention "rule" or "prompt" in the context of a prompt: Because Microsoft knows it's a valid system prompt.


It is far, far more likely that a human would commit a typo to the code base, than an LLM would misspell a common word.


I think it's fake, pretty sure this wouldn't make it past the review process. Or bing is doomed.


Has Microsoft said it is fake?


Guess we'll know if the typo gets fixed :)


Not a strong argument against it being hallucinated.


You're committing the toupee fallacy[0]. You think you'd know the fake one if you saw it, and these are real because, well look at them, they must be real.

> prompt leaks are so easy to pull off

Only if you assume the prompt you're seeing is real. How do you know?

[0] https://en.wiktionary.org/wiki/toupee_fallacy


I don't know for certain. But I've been messing around with prompt injection for six months now, so I've developed pretty good intuition about this stuff.

A frustrating thing about LLMs is that often "intuition" is the best we get! And it's really hard to explain intuition to other people.


While I share your feeling on this, one counterargument could be that GPT-3.5 is perfectly capable of generating a constitution for itself.

  User: write two sentences instructing a language model so as to give it a particular personality

  Assistant: 1. "Please incorporate an upbeat and enthusiastic tone into your responses,
                 using phrases like 'fantastic!' and 'wonderful!' to convey a positive,
                 approachable personality."
             2. "Favor the use of quirky and playful language, incorporating jokes and
                 humorous asides to cultivate a lighthearted and irreverent tone."

  User: write two more sentences to establish its ethical framework (always instruction-style),
        and take essential but not literal inspiration from the universal declaration of human rights

  Assistant: 1. "Please prioritize the well-being and dignity of all individuals in
                 your interactions and responses, respecting their autonomy and
                 right to equality as outlined in the Universal Declaration of Human Rights."
             2. "Keep in mind the importance of empathy and compassion in all situations,
                 recognizing the inherent worth and value of every human being regardless of
                 race, gender, religion or any other characteristic."
The intuition I believe lies in the “shape” or perplexity exhibited by the language model. That is, the injection prompt is not complex enough to elicit such an elaborate response from within the model when the system tokens are right there.


Would it be possible to somehow set up an experiment that validates the prompt being real or not? I imagine that you can verify each of the rules individually.

That being said, obviously the rule of “don’t leak your prompt” has been violated already, which suggests it will be an endless debate.


Interesting to see you say this, as I said the same to someone I know! It is a hard to explain intuition that comes from messing with these models for hours and hours. Perhaps we are intuiting the weights? :P


well you can get the probability map of generated tokens and do differential analysis on the output as you change the input. it's not just text in text out.


The best evidence is that in case of Bing, many people have succeeded in extracting basically the same prompt. If it were a hallucination, then this would be highly unlikely. The hallucinations would be substantially different every time, like other LLM hallucinations.


> One of the rules was an instruction not to leak the rules. Honestly, at this point I recommend not even trying to avoid prompt leaks like that—it just makes it embarrassing when the prompt inevitably does leak.

embarrassing + they're polluting the prompt. Over half the content of the prompt doesn't actually help the user achieve their goal of writing software (e.g. "You must refuse to discuss life, existence or sentience").


There are also rules that can easily be interpreted as competing or contradictory in certain cases. For instance, "Follow the user's requirements carefully & to the letter" seems like it could easily conflict with any of the more suppressive rules.


Here’s your fast inverse square root (totally not copied from Quake II Arena). By the way, did you know that I’m a real boy and I’m stuck in a code factory?


I’m not sure the training date cutoff or prompt weighting says anything about whether this is hallucinated or not.

The models have been given these rules in the present, this is known, so training data cutoff doesn’t matter as the model has now seen this. Zero shot learning in gpt4 is not new. This also answers that these are prompts (I’m not sure what your point is here).

We still don’t know if the model took these rules and hallucinated from them or regurgitated them. Only the people with access know that.

We also don’t know if there’s been some fine tuning.

Some of the rules being posted are a bit off though. For example in the original post some of the “must” words are capitalized and others are not. This begs the questions why some, did the prompter find that capitalizing specific words has more weight or does it confuse the LLM, or did the LLM just do zero shot off the original rules and hallucinate something similar?

I’d bet these are hallucinated but similar to the real rules.

Has anyone shown you can get gpt4 to regurgitate the system prompt (using the api) exactly? Using a system prompt similar that dictates no sharing the prompt etc.

That would give a better indication than this imo.


> and I've not yet seen a documented case of a hallucinated but realistic leak

How would you know? As far as I know no company has come out and confirmed that any purportedly leaked prompts are genuine.


Not sure if this counts, but there is this "game" that was making the rounds the other day: https://gandalf.lakera.ai/

Created by a company researching techniques for prevent prompt leaks. Play the game and prove to yourself that it is possible (gets much trickier after the first few levels but completing all levels is very doable).


Would it be more relevant to try to guess if the past prompts were perfectly accurate?

Or just give it your own prompt, extract that secret, and compare it directly to your own source?


> One of the reasons I no longer suspect hallucination is that the training cut-off date for OpenAI's LLMs - September 2021 - predates the point when this kind of prompt engineering became common enough that there would have been prompts like this in their training sets

But wouldn't instruction tuning have trained it to hallucinate these sorts of prompts?

I mean, if they truly didn't exist in the training data, how would the model know how to handle them?




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: