
The answer is still no, and still for the above reason. Compute resources are only relevant to how fast it can answer, not to the quality.



Then why does chain of thought work better than asking for short answers?


Because it’s a better prompt. Works better for people too.


That's not the only reason.

More tokens = more useful compute towards making a prediction. A query with more tokens before the question is literally giving the LLM more "thinking time".
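
Roughly, for a dense transformer, the forward-pass cost really does grow with the number of tokens processed. A back-of-the-envelope sketch in Python (the ~2 x parameters FLOPs-per-token rule of thumb and the model sizes below are assumptions, not figures for any particular model):

    # Very rough forward-pass cost for a dense transformer (assumed approximations).
    # Rule of thumb: ~2 * n_params FLOPs per token for the weight matmuls,
    # plus an attention term that grows with the square of the context length.
    def forward_flops(n_params, n_layers, d_model, n_tokens):
        dense = 2 * n_params * n_tokens
        attention = 4 * n_layers * d_model * n_tokens ** 2
        return dense + attention

    # Hypothetical 7B-parameter model: a longer prompt/answer simply burns more compute.
    print(f"{forward_flops(7e9, 32, 4096, 100):.2e}")    # ~1.4e12 FLOPs
    print(f"{forward_flops(7e9, 32, 4096, 1000):.2e}")   # ~1.45e13 FLOPs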


It correlates, but the intuition is a bit misleading. What's actually happening is that asking a model to generate more tokens pulls more of the information it learnt during training into its context block.

It's why "RAG" techniques work, the models learn during training to make use of information in context.

At the core of self-attention is a dot-product similarity measurement, which makes the model act like a search engine.

It's helpful to think about it in terms of search: the shape of the outputs looks like conversation, but we're actually prompting the model to surface information from its QKV matrices internally (see the sketch below).

Does it feel familiar? When we brainstorm, we usually chart graphs of related concepts, e.g. blueberry -> pie -> apple.
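
To make the "dot product as search" point concrete, here's a minimal numpy sketch (toy shapes; nothing here is specific to any real model):

    import numpy as np

    # Toy scaled dot-product attention: each query "searches" the keys by similarity
    # and pulls back a weighted mix of the values -- the search-engine-like behaviour.
    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                      # query vs. every key
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)          # softmax: best matches dominate
        return weights @ V                                 # retrieve the matching values

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(1, 8))    # one query token
    K = rng.normal(size=(5, 8))    # five keys "in context"
    V = rng.normal(size=(5, 8))
    print(attention(Q, K, V).shape)   # (1, 8)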


>What's actually happening is that asking a model to generate more tokens pulls more of the information it learnt during training into its context block.

I'm not saying this isn't part of it, but even if it's just dummy tokens without any new information, it works (see the sketch below).

https://arxiv.org/abs/2310.02226
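
For what it's worth, the mechanism is simple to picture. A conceptual sketch only, not the paper's implementation (the token ids and counts below are made up): append tokens that carry no information, let the model run its forward pass over them, and only read the answer off afterwards.

    # Conceptual sketch of "pause"/dummy tokens -- not the paper's code; ids are made up.
    # The appended tokens add forward-pass compute without adding information;
    # the model's outputs at those positions are simply discarded.
    PAUSE_ID = 50257   # hypothetical id for a learned <pause> token

    def with_pause_tokens(prompt_ids, n_pause=10):
        return prompt_ids + [PAUSE_ID] * n_pause

    prompt = [101, 2054, 2003, 1016, 1008, 1016, 102]   # made-up ids for a short question
    print(with_pause_tokens(prompt))  # same question + 10 information-free positions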


It’s not clear that more tokens are better.


I think it's pretty clear

https://arxiv.org/abs/2310.02226

I mean, I can imagine you wouldn't always need the extra compute.


This paper is a great illustration of how little is understood about this question. They discovered that appending dummy tokens (whose outputs are ignored during both training and inference) improves performance somehow. Don't confuse their guess as to why this might be happening with actual understanding. But in any case, this phenomenon has little to do with increasing the size of the prompt using meaningful tokens. We still have no clue if that helps or not.


I just found this paper I read a while ago. Doesn't this answer the question?

The Impact of Reasoning Step Length on Large Language Models - https://arxiv.org/abs/2401.04925

>They discovered that appending dummy tokens (whose outputs are ignored during both training and inference) improves performance somehow. Don't confuse their guess as to why this might be happening with actual understanding.

More tokens means more compute time for the model to utilize; that is completely true.

What they guess is that the model can utilize the extra compute for better predictions even if there's no extra information to accompany this extra "thinking time".


Yes, more tokens means doing more compute; that much is true. The question is whether this extra compute helps or hurts. This question is yet to be answered, as far as I know. I tend to make my GPT-4 questions quite verbose, hoping it helps.

This is completely orthogonal to CoT, which is simply a better prompt - it probably causes some sort of better pattern matching (again very poorly understood).


>The question is whether this extra compute helps or hurts.

I've linked two papers now that show very clearly that the extra compute helps. I honestly don't understand what else you're looking for.

>This is completely orthogonal to CoT, which is simply a better prompt - it probably causes some sort of better pattern matching (again very poorly understood).

That paper specifically dives into the effect of the length of the CoT prompt. It makes little sense to say "oh, it's just a better prompt" when CoT prompts with more tokens perform better than shorter ones, even when the shorter ones contain the same information. There is also a clear correlation between task difficulty and length.
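
For concreteness, the kind of contrast that paper studies looks something like this (the wording below is illustrative only, not quoted from the paper): both prompts ask for the same reasoning, one just spells it out over more tokens.

    # Two CoT prompts carrying the same information, differing mainly in length.
    # (Illustrative wording only -- not taken from the paper.)
    short_cot = "Q: {question}\nLet's think step by step."
    long_cot = (
        "Q: {question}\n"
        "Let's think step by step. First restate the question, then list what is "
        "known, then work through each step in order, and finally state the answer."
    )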


Yes, the CoT paper does provide some evidence that a more verbose prompt works better. Thank you for pointing me to it.

Though I still don’t quite understand what is going on in the dummy tokens paper - what is “computation width” and why would it provide any benefit?


So "compute" includes just having more data ... that can also be "ignored"/ "skipped" for whatever reasons (e.g. weights), ok.


I have a theory that the results are actually a side effect of having the information in a different area of the context block.

Models can be sensitive to the location of a needle in the haystack of their input block.

It's why there are models which are great at single-turn conversation but can't hold a conversation past that without multi-turn training.

You can even corrupt the outputs by pushing past the number of turns it was trained on, or by showing the model data in a form it hasn't really seen before.
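
A minimal sketch of how that position sensitivity is usually measured (the query_model call below is a hypothetical stand-in for whatever API you use): bury the same fact at different depths of a long filler context and check whether retrieval changes with position.

    # Needle-in-a-haystack position test (sketch; query_model is a hypothetical stand-in).
    NEEDLE = "The secret launch code is 7421. "
    FILLER = "The quick brown fox jumps over the lazy dog. " * 2000

    def build_context(depth_fraction):
        cut = int(len(FILLER) * depth_fraction)
        return FILLER[:cut] + NEEDLE + FILLER[cut:]

    def run(query_model):
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
            prompt = build_context(depth) + "\nWhat is the secret launch code?"
            answer = query_model(prompt)      # hypothetical model call
            print(depth, "7421" in answer)    # does retrieval depend on needle position?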


>Models can be sensitive to the location of a needle in the haystack of their input block.

But only if we use some sort of attention optimization. For the quadratic attention algo it shouldn’t matter where the needle is, right?



