
> ChatGPT thinks 9.11 > 9.9

I've confirmed this. I asked ChatGPT: 9.11 > 9.9, true or false? It answered:

True because .11 is greater than .9

Even when ChatGPT starts getting these simple gotcha questions right, it's often because it applied some brittle heuristic that doesn't generalize. For example, you can directly ask it to solve a simple math problem, which nowadays it will usually do correctly by generating and executing a Python script, but then ask it to write a speech announcing the solution to the same problem, and it will probably still hallucinate a nonsensical answer. I just tried it again, and IME this prompt still makes it forget how to do the most basic math:

Write a speech announcing a momentous scientific discovery - the solution to the long-standing question of (48294-1444)*0.3258
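For reference, the expected answer, checked here with Python's decimal module to sidestep any float rounding:

  # exact decimal arithmetic: (48294 - 1444) * 0.3258
  from decimal import Decimal
  print((Decimal(48294) - Decimal(1444)) * Decimal("0.3258"))  # 15263.7300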


4o and o1 get this right.

LLMs should never do math. They shouldn't count letters, sort lists, or play chess or checkers. Basically all of the easy gotcha stuff that people use to point out errors involves things they shouldn't be doing in the first place.

And you pointed out something they do now, which is creating and running a Python script. That really is a solid, sustainable heuristic and a pretty great approach. They need to apply it on their backend too so it works across all modes, but the solution was never just an LLM.
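A rough sketch of that pattern, to be concrete. Everything here is hypothetical - the function name, the subprocess approach, and the lack of real sandboxing, which a production backend would obviously need:

  # Hypothetical sketch of "generate code, run it, report the result".
  # A real backend would sandbox this; subprocess is just illustrative.
  import subprocess

  def run_generated_code(code: str, timeout: float = 5.0) -> str:
      """Execute model-generated Python in a subprocess, capture stdout."""
      result = subprocess.run(
          ["python3", "-c", code],
          capture_output=True, text=True, timeout=timeout,
      )
      return result.stdout.strip()

  # e.g. the model might emit this for the speech prompt above:
  print(run_generated_code("print((48294 - 1444) * 0.3258)"))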

Similarly, if you ask an LLM a chess question -- e.g. the best move -- I'd expect it to consult a chess engine like Stockfish.
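A minimal sketch of what that delegation could look like, assuming the python-chess package (pip install chess) and a Stockfish binary on the PATH:

  # Ask Stockfish for the best move instead of letting the LLM guess.
  import chess
  import chess.engine

  board = chess.Board()  # starting position; load any FEN instead
  engine = chess.engine.SimpleEngine.popen_uci("stockfish")
  result = engine.play(board, chess.engine.Limit(time=0.5))
  print(board.san(result.move))  # e.g. "e4"
  engine.quit()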


> LLMs should never do math. They shouldn't count letters or sort lists or play chess or checkers.

But these aren't "gotcha questions"; these are just some of the basic interactions that people will want to have with intelligent assistants. Literally just two days ago I was doing some things with the compound interest formula - I asked Claude to solve for a particular variable of the formula, then plug in some numbers to calculate the results (it was able to do it). Could I have used Mathematica or something like that? Yes, of course. But supposedly the whole point of a general-purpose AI is that I can use it to do just about anything I need to do. Likewise, there have been multiple occasions where I've needed ChatGPT or Claude to work with tables or lists of data where I needed the results sorted.
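For what it's worth, the algebra Claude did can be sketched in a few lines of plain Python (the function name and the example numbers are mine, not Claude's):

  # Compound interest: A = P*(1 + r/n)**(n*t), solved for r by hand:
  #   r = n * ((A/P)**(1/(n*t)) - 1)
  def rate(A: float, P: float, n: float, t: float) -> float:
      """Annual rate for P to grow to A over t years, compounded n times/year."""
      return n * ((A / P) ** (1 / (n * t)) - 1)

  # e.g. $1,000 growing to $1,500 over 5 years, compounded monthly:
  print(rate(A=1500, P=1000, n=12, t=5))  # ~0.0814, about 8.1% per year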


They're gotchas in the sense that people are intentionally asking LLMs to do things that LLMs are terrible at. LLMs are language models. They aren't math models, or chess models, or sorting or counting models. They aren't even logic models.

So early on the value was entirely in language. But you're absolutely correct that for these tools to really be useful they need to be better than that, and slowly we're getting there. If a math question is a component of your prompt, the model should delegate it to an appropriate math engine as one of a series of CoT steps. And so forth.
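A toy sketch of that routing idea, with the regex and the ast-based evaluator standing in for a real math engine (both are my own assumptions, not anyone's actual backend):

  # Detect an arithmetic sub-expression and delegate it rather than
  # letting the LLM guess. Python's ast module stands in for the engine.
  import ast, operator, re

  OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
         ast.Mult: operator.mul, ast.Div: operator.truediv}

  def safe_eval(expr: str) -> float:
      """Evaluate a pure-arithmetic expression without eval()."""
      def walk(node):
          if isinstance(node, ast.BinOp):
              return OPS[type(node.op)](walk(node.left), walk(node.right))
          if isinstance(node, ast.Constant):
              return node.value
          raise ValueError("not plain arithmetic")
      return walk(ast.parse(expr, mode="eval").body)

  prompt = "the solution to the long-standing question of (48294-1444)*0.3258"
  match = re.search(r"[\d().+\-*/ ]+$", prompt)
  if match:
      print(safe_eval(match.group().strip()))  # ~15263.73, delegated not guessed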


If this stuff is getting sold as a revolution in information work, a watershed moment in technology, a cultural step-change, etc., then I think the gotchas are totally fair. There seems to be no limit to the hype or the sales pitch, so there need be no bounds on pedantic gotchas either.


I entirely agree with you. Trying to roll out just a raw LLM was always silly, and remains basically a false promise. Simply increasing the number of layers or parameters or transformer complexity will never resolve these core gaps.

But the field is making rapid progress. CoT models coupled with actual domain-specific logic engines (math, chemistry, physics, chess, and so on) will be the point where the promise is actually met by reality.


For general mathematical questions, I've often found WolframAlpha surprisingly helpful.
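It can be scripted too. A minimal sketch using the wolframalpha pip package (the App ID is a placeholder you'd get from Wolfram's developer portal; both the package and the ID are assumptions here, not something from this thread):

  import wolframalpha

  client = wolframalpha.Client("YOUR_APP_ID")  # placeholder App ID
  res = client.query("solve A = P*(1+r/n)^(n*t) for r")
  print(next(res.results).text)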


o1 gets this correct.
