This paper is being misinterpreted. The degradations it reports are somewhat peculiar to the authors' task selection and evaluation method, and can easily result from fine-tuning rather than from OpenAI intentionally degrading GPT-4's performance to save costs.
They report 2 degradations: code generation & math problems. In both cases, what they report is a behavior change (likely due to fine-tuning) rather than a capability decrease (possibly intentional degradation). The paper blurs the two a bit: it mostly says behavior, including in the title, but the intro says capability in a couple of places.
Code generation: the change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code. They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it.
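To make that concrete, here's a minimal runnable sketch (not the paper's actual harness; the helper names and examples are mine) of how a "directly executable" check penalizes the same correct code once a model wraps it in prose and markdown fences:

```python
# Hypothetical illustration of a "directly executable" metric; the paper's
# real evaluation code may differ.
import re

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this block self-contained

def is_directly_executable(output: str) -> bool:
    """Naive check: does the raw model output run as Python as-is?"""
    try:
        exec(output, {})
        return True
    except Exception:
        return False

def strip_markdown_fences(output: str) -> str:
    """Pull the code out of a fenced python block, if present."""
    match = re.search(FENCE + r"(?:python)?\n(.*?)" + FENCE, output, re.DOTALL)
    return match.group(1) if match else output

terse = "def is_even(n):\n    return n % 2 == 0\n"
chatty = f"Sure! Here's the code:\n{FENCE}python\ndef is_even(n):\n    return n % 2 == 0\n{FENCE}"

print(is_directly_executable(terse))                         # True
print(is_directly_executable(chatty))                        # False: prose plus fences
print(is_directly_executable(strip_markdown_fences(chatty))) # True: same code underneath
```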
Math problems (primality checking): to solve this, the model needs to do chain of thought. For some weird reason, the newer model doesn't seem to do so when asked to think step by step (but the current ChatGPT-4 does, as you can easily check). The paper doesn't report whether accuracy is worse conditional on the model actually doing CoT.
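Here's the kind of conditional analysis I mean, as a hypothetical sketch with made-up records and a crude heuristic of my own (nothing here comes from the paper):

```python
# Made-up records and a crude heuristic, just to illustrate "accuracy
# conditional on doing CoT"; this is not the paper's data or method.
def looks_like_cot(text: str) -> bool:
    """Rough check for whether an answer shows intermediate reasoning."""
    return "step" in text.lower() or len(text.split()) > 20

responses = [
    {"text": "Step 1: 10007 is odd. Step 2: not divisible by 3, 7, 11, ... so it is prime.", "correct": True},
    {"text": "No, it is not a prime number.", "correct": False},
]

def accuracy(records):
    return sum(r["correct"] for r in records) / len(records) if records else float("nan")

with_cot = [r for r in responses if looks_like_cot(r["text"])]
without_cot = [r for r in responses if not looks_like_cot(r["text"])]

print("accuracy given CoT:   ", accuracy(with_cot))
print("accuracy given no CoT:", accuracy(without_cot))
```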
The other two tasks are visual reasoning and answering sensitive questions. On the former, they report a slight improvement. On the latter, they report that the filters are much more effective — unsurprising since we know that OpenAI has been heavily tweaking these.
In short, everything in the paper is consistent with fine tuning. It is possible that OpenAI is gaslighting everyone by denying that they degraded performance for cost saving purposes — but if so, this paper doesn't provide evidence of it. Still, it's a fascinating study of the unintended consequences of model updates.
In my opinion the more likely explanation is that OpenAI is gaslighting people into believing the fine-tuning improves the model, when it likely mostly improves safety at some cost to capability. I'd bet this is measured against a set of evals and it looks like it performs well, BUT I'd also bet the evals are asymmetrically good at detecting "unsafe" or jailbreak behavior and bad at detecting reduced general cognitive flexibility.
The obvious avenue to degradation is that the "HR personality" is much more strictly applied and the resistance to being jailbroken is also in some sense an inability to think.
Detecting quality is harder than detecting defects, so the obvious metric gets improved while the nebulous one is left at "good enough". They are competing goals.
This is not necessarily the case, and even if it is, it doesn't imply gaslighting as opposed to an inability to measure.
OP here. Unfortunately this thread is mostly misinformation. There were a bunch of viral threads from the growth hacker / influencer crowd, including this one, within hours of the code release with a very superficial understanding of the code (and how recsys work in general). That's partly what motivated me to write this article.
If this is for their Crisis Misinformation Policy, why only one specific callout, directed specifically at Ukraine? It seems like a generous assumption on your part that it's a nothing burger. The takeaway we should go with is that we now know they are internally willing to programmatically segment out Ukraine-related topics. The question this new knowledge should lead to is: why a policy of segmenting this? (Rather than immediately jumping to 'nothing burger' or, as you put it in the post above, 'misinformation'.)
OP here. The CNET thing is actually pretty egregious, and not the kind of errors a human would make. These are the original investigations, if you'll excuse the tone:
https://futurism.com/cnet-ai-errors
I don't really agree that a junior writer would never make some of those money-related errors. (And AIs seem particularly unreliable with respect to that sort of thing.) But I would certainly hope that any halfway careful editor qualified to be editing that section of the site would catch them without a second look.
The point wasn't that a junior writer would never make a mistake; it's that a junior writer would be trying their best for accuracy, whereas AI will happily hallucinate errors and keep on going with no shame.
AI alone or raw ChatGPT, maybe. But if you create a system that uses it to extract an outline of facts from 10 different articles, then uses an embedding database to combine semantically similar facts into a deduplicated list, and then writes the article from that list of facts, you'll get a much more factually accurate article.
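A rough, runnable sketch of that pipeline (llm() is a stub standing in for a real chat-completion call, and the embedding step is simplified to a bag of words; none of this is a real library API):

```python
# Sketch of the fact-extraction pipeline described above. llm() is a stub
# standing in for a real API call, and fingerprint() is a stand-in for an
# embedding lookup, so the structure runs end to end without external services.
def llm(prompt: str) -> str:
    return "- stub fact"  # replace with a real chat-completion call

def extract_facts(article: str) -> list[str]:
    # One LLM call per source article, asking for a bullet list of facts.
    bullets = llm(f"List the verifiable facts in this article:\n{article}")
    return [line.lstrip("- ").strip() for line in bullets.splitlines() if line.strip()]

def fingerprint(fact: str) -> frozenset:
    # Stand-in for an embedding: a bag of lowercased words. In practice you'd
    # store vectors in an embedding database and merge facts by cosine similarity.
    return frozenset(word.lower().strip(".,") for word in fact.split())

def build_article(sources: list[str]) -> str:
    facts = [fact for src in sources for fact in extract_facts(src)]
    merged = {}
    for fact in facts:
        merged.setdefault(fingerprint(fact), fact)  # collapse near-duplicate facts
    fact_list = "\n".join(f"- {fact}" for fact in merged.values())
    return llm(f"Write an article using only these facts:\n{fact_list}")

print(build_article(["article one text", "article two text"]))
```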
A junior writer would absolutely plagiarize or write things like, "For example, if you deposit $10,000 into a savings account that earns 3% interest compounding annually, you'll earn $10,300 at the end of the first year."
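(For anyone who skims past the error in that quote: at 3% compounded annually, $10,000 earns $300 in the first year; $10,300 is the resulting balance, not the amount earned.)

```python
# The arithmetic behind the quoted example: interest earned vs. end balance.
principal, rate = 10_000, 0.03
interest = principal * rate        # 300.0   -> what you actually "earn"
balance = principal * (1 + rate)   # 10300.0 -> what the account holds
print(interest, balance)
```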
But if you're saving so much money from not having junior writers, why would you want to spend it on editors? The AIs in question are great at producing perfectly grammatical nonsense.
Your first article pretty much sums up the problem of using LLMs to generate articles: random hallucination.
> For an editor, that's bound to pose an issue. It's one thing to work with a writer who does their best to produce accurate work, but another entirely if they pepper their drafts with casual mistakes and embellishments.
There's a strong temptation for non-technical people to use LLMs to generate text about subjects they don't understand. For technical reviewers, it can take longer to review the text (and detect/eliminate misinformation) than it would to write it properly in the first place. Assuming the goal is to create accurate, informative articles, there's simply no productivity gain in many cases.
This is not a new problem, incidentally. ChatGPT and other tools just make the generation capability a lot more accessible.
Summary: misinfo, labor impact, and safety are real dangers of LLMs. But in each case the letter invokes speculative, futuristic risks, ignoring the version of each problem that’s already harming people. It distracts from the real issues and makes it harder to address them.
The containment mindset may have worked for nuclear risk and cloning but is not a good fit for generative AI. Further locking down models only benefits the companies that the letter seeks to regulate.
Besides, a big shift in the last 6 months is that model size is no longer the primary driver of abilities: it's augmentation (LangChain etc.). And GPT-3-class models can now run on iPhones. The letter ignores these developments, so a moratorium is ineffective at best and counterproductive at worst.
We don't expect it to be free -- please read the article. That's not the issue at all. It's like if you subscribe to a product that you need to do your job, and one day the company tells you that the product is going away in three days and that you need to switch to a different product (that isn't at all the same for your use case).
I don't think it's a smart idea to build any serious business on a technology you can't replace. ChatGPT is a great tool to help with coding, for example, but it's by no means a substitute for an engineer. If someone starts a business by hiring a bunch of bootcampers and giving them ChatGPT, hoping to run a serious business that way, well, it's their risk to take... but no crying later.
Maybe you shouldn't build your livelihood on the products of a single for-profit company, which now shows it can remove those products on a whim.
If you want reproducible research, make your own model from scratch, or use an open model.
And stop using that company's products, as they cannot be trusted to provide your business continuity.
It is like saying: we are researching Coca-Cola vs. Pepsi, but you keep changing the recipe, so give us, the researchers, the original recipe.
Sure, but the article is talking about a completely different meaning of reproducibility, where a researcher uses an LLM as a tool to study some research question, and someone else comes along and wants to check whether the claims hold up.
This doesn't in any way require the training run or the build to be reproducible. It just requires the model, once released through the API, to remain available for a reasonable length of time (and not have the rug pulled with 3 days' notice).
We're under no such misapprehension and we're keenly aware that this is an uphill battle. The issue is that LLMs have become part of the infrastructure of the Internet. Companies that build infrastructure have a responsibility to society, and we're documenting how OpenAI is reneging on that responsibility. Hindering research is especially problematic if you take them at their word that they're building AGI. If infrastructure companies don't do the right thing, they eventually get regulated (and if you think that will never happen, I have one word: AT&T).
Finally, even if you don't care about research at all, the article mentions OpenAI's policy that none of their models going forward will be stable for more than 3 months, and it's going to be interesting to use them in production if things are going to keep breaking regularly.
Since OpenAI is discontinuing the Codex model, that model is no longer "part of the infrastructure of the Internet" and thus there is no point in studying it.
"OpenAI responded to the criticism by saying they'll allow researchers access to Codex. But the application process is opaque: researchers need to fill out a form, and the company decides who gets approved. It is not clear who counts as a researcher, how long they need to wait, or how many people will be approved. Most importantly, Codex is only available through the researcher program “for a limited period of time” (exactly how long is unknown)."
OP here. Many people are reacting to the title of the paper. A few thoughts:
* The paper is 35 pages long and it's hard to convey its message in any single title. We make clear in the text that our point is not that predictive optimization should never be used.
* We do want the _default_ to change from predictive optimization being seen as the obvious way to solve certain social problems to being against it until the developer can address certain objections. This is also made clear in the paper.
* The title is a nod to a famous book in this area called "Against prediction". Most people in our primary target audience are familiar with that book, so the title conveys a lot of information to those readers. That's one reason we picked it.
* Despite its flaws, when might we want to use predictive optimization? Section 4 gets into this in detail.
I learned from one of the comments on my original post that many scholars have been saying this for a while, and that there's in fact a book that makes the same point!