Directly answering your question requires making some assumptions about what you mean and what "class" of models you are asking about. Unfortunately I don't think it's a simple yes or no, since I can answer in both directions depending on the interpretation.

[No] If you mean "during inference", then the answer is mostly no in my opinion, but it depends on what you are calling an "LLM" and "processing power", haha. This is the interpretation I think you are asking about, though.

[Yes] If you mean everything behind an endpoint counts as "the LLM", eg. that includes a RAG system, specialized prompting, special search algorithms for decoding logits into tokens, then the answer is obviously yes: those added pieces can increase skill/"better-ness" by using more processing power and adding latency.
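
To make that concrete, here's a toy sketch of the "extra stuff behind the endpoint" idea. The retrieve() and generate() functions are stand-ins I made up for illustration, not any real library's API:

    # Toy sketch of an endpoint doing extra work around the raw model.
    DOCS = [
        "The Chinchilla paper relates loss to parameters and training tokens.",
        "Mixture-of-experts models route each token to a subset of experts.",
        "KV caches avoid recomputing attention over the prompt at each step.",
    ]

    def retrieve(query: str, k: int = 2) -> list[str]:
        # Extra compute #1: rank documents by naive keyword overlap (stand-in for a vector DB).
        def overlap(doc: str) -> int:
            return len(set(query.lower().split()) & set(doc.lower().split()))
        return sorted(DOCS, key=overlap, reverse=True)[:k]

    def generate(prompt: str) -> str:
        # Stand-in for the actual LLM forward pass.
        return f"[model output for a prompt of {len(prompt)} chars]"

    def answer(query: str) -> str:
        # Extra compute #2: specialized prompting wrapped around the raw model.
        context = "\n".join(retrieve(query))
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return generate(prompt)

    print(answer("How do mixture-of-experts models use their experts?"))

Every extra step (retrieval, longer prompts, reranking) buys skill with more processing power and latency, even though the model weights never change.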

If you mean the raw model itself, and purely inference, then there are sorta two classes of answers.

[No] 1. On one side you have the standard LLM (just a gigantic transformer). These run the same number of FLOPs to predict the logits for one output token (at a fixed input size), and don't really have a tunable parameter for "think harder" -> this is the "no" that I think your question is mostly asking about.

[Yes] 2. Mixture-of-experts models don't do advanced adaptive-compute tricks, but they do sometimes have a "top-K" parameter (eg. top-1 or top-2 experts) which "enables" more blocks of weights to be used during inference, in which case you could argue they gain skill by running more compute. That said, afaik, everyone seems to run inference with the same N experts once set up and doesn't do dynamic scaling of that selection.
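
For a feel of where that top-K knob lives, here is a minimal made-up sketch of top-K expert routing (not any specific model's implementation); bumping top_k means more expert FFN passes per token, ie. more compute:

    import torch
    import torch.nn as nn

    class TinyMoELayer(nn.Module):
        def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )
            self.top_k = top_k

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
            scores = self.router(x)                           # (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)    # pick K experts per token
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):                    # larger K -> more expert passes
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    layer = TinyMoELayer(top_k=2)
    print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])

A dense transformer has no equivalent knob: every token goes through every block, full stop.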

[Yes] Another interpretation: broadly there's the question of "what factors matter the most" for LLM skill. If you include training compute as part of "compute" (amortize it or whatever), then per the scaling-law papers the 3 key things to keep in mind are [FLOPs, Parameters, Tokens of training data], and in these there is seemingly power-law behavior: if you can "increase these", the resulting skill also keeps "improving". Hence under this interpretation, "more processing power" (training) and "time per request" (bigger model / inference latency) are correlated with "better" LLMs.
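
The parametric form people usually fit (Chinchilla-style) looks something like the sketch below; treat the constants as illustrative placeholders rather than anyone's published fit:

    # L(N, D) = E + A / N**alpha + B / D**beta
    # N = parameter count, D = training tokens; constants here are for illustration only.
    def loss(n_params: float, n_tokens: float,
             E: float = 1.7, A: float = 400.0, B: float = 400.0,
             alpha: float = 0.34, beta: float = 0.28) -> float:
        return E + A / n_params**alpha + B / n_tokens**beta

    # Scaling either axis keeps pushing the predicted loss down, with diminishing returns.
    for n, d in [(1e9, 20e9), (10e9, 200e9), (100e9, 2e12)]:
        print(f"N={n:.0e}, D={d:.0e} -> predicted loss ~ {loss(n, d):.3f}")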

[No] You mention this idea of "more recursive queries into their training data", and it's worth noting that a trained model no longer has access to the training data. In fact, the training data that gets sent to the model during training (eg. when gradients are being computed and weights are being updated) is usually sent on some "schedule" (or sampling strategy), and isn't really something that is being adaptively controlled or dynamically "sampled" even during training. So the model doesn't have the ability to "look back" (unless it's a retrieval-style architecture or a RAG inference setup).

[Yes] Another thing is the prompting strategy / decoding strategy, hinted at above. Eg. you can decode by taking just 1 output, or you can take 10 outputs in parallel, rank them somehow (consensus ranking, or otherwise), and yes, that can also improve results. (This was contentious when Gemini Ultra was released, because their benchmarks used slightly different prompting strategies than the GPT-4 ones, which made it even more opaque to determine "better" score per cost as some meta-metric.) Some terms: chain/tree/graph of thought, etc.
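
The "sample a bunch, then vote" version is easy to sketch; sample_answer() below is a stand-in for one full decode from the model, so more samples means strictly more compute and latency:

    from collections import Counter

    def sample_answer(prompt: str, seed: int) -> str:
        # Stand-in: pretend the model occasionally slips on arithmetic.
        return "42" if seed % 4 != 0 else "41"

    def decode_with_consensus(prompt: str, n_samples: int = 10) -> str:
        answers = [sample_answer(prompt, seed=i) for i in range(n_samples)]
        winner, _count = Counter(answers).most_common(1)[0]
        return winner

    print(decode_with_consensus("What is 6 * 7?"))  # the majority vote picks "42"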

[Yes (weak)] Next, there's another "concept" in your question about "more processing power leading to better results": you could argue "in-context learning" is itself more compute (it takes FLOPs to run the context tokens through the model; N^2 scaling, though with caches). Purely by giving a model more instructions at the beginning, you increase the compute and memory required, but also often "increase the skill" of the output tokens. So maybe in that regard, even a frozen model is a "yes": it does get smarter (with the right prompt / context).
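
Rough shape of that extra cost, counting only the attention term (constants omitted; this just shows the scaling, not real FLOP counts):

    # Prefill is roughly O(n^2) in prompt length; each generated token then attends
    # over the whole KV cache, so longer context = more compute per request.
    def prefill_attention_ops(n_prompt: int, d_model: int = 4096, n_layers: int = 32) -> float:
        return n_layers * n_prompt**2 * d_model

    def per_token_attention_ops(n_cached: int, d_model: int = 4096, n_layers: int = 32) -> float:
        return n_layers * n_cached * d_model  # one new query against everything cached

    for n in (1_000, 10_000, 100_000):
        print(f"prompt={n:>7} tokens  prefill~{prefill_attention_ops(n):.1e}  per-token~{per_token_attention_ops(n):.1e}")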

One interesting detail about current SotA models, even the mixture-of-experts style ones, is that they're "static" in their weights and in the "flow" of activations along the "layer" direction. They only re-use weights dynamically along the "token"/"causal" ordering direction (the N^2 part). I've personally spent some time (~1 month in Nov last year) working on more advanced "adaptive models" that use switches like those from the MoE-style network, but route to "the same" QKV attention matrices, so that something like what you describe becomes possible: make the "number of layers" a dynamic property, have the model learn to predict after 2 layers, 10 layers, or 5,000 layers, and see if "more time to think" can improve the results, do math with concepts, etc. For there to be dynamic layers, the weights can't be "frozen in place" like they currently are. I have nothing good to show here yet, though.

One interesting finding (now that I'm rambling and just typing a lot) is that in a static model you can "shuffle" the layers (eg. swap layer 4's weights with layer 7's weights) and the resulting tokens come out roughly similar (likely thanks to the ResNet-style backbone). Only the first ~3 layers and last ~3 layers seem "important to not permute". It kinda makes me interpret models as using the first few layers to get into some "universal" embedding space, operating in that space "without ordering in layer-order", and then "projecting back" to token space at the end, rather than staying in token space the whole way through.

This is why I think it's possible to do more dynamic routing in the middle of networks, which I think is what you're implying when you say "do they make more recursive queries into their data". (I'm projecting, but when I imagine the idea of "self-reflection" or "thought" like that inside a model, I imagine it at this layer -- which, as far as I know, has not been shown/tested in any current LLM / transformer architecture.)
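
If anyone wants to poke at the permutation thing themselves, here's roughly the kind of swap I mean, sketched against GPT-2 (whose blocks sit in model.transformer.h); this isn't my exact setup, just the idea:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    inputs = tok("The capital of France is", return_tensors="pt")

    def top_token(m):
        # Greedy next-token prediction, just to eyeball whether behavior survives the swap.
        with torch.no_grad():
            logits = m(**inputs).logits[0, -1]
        return tok.decode(logits.argmax().item())

    print("original:", top_token(model))

    # Swap two middle blocks and see if the prediction roughly holds up.
    h = model.transformer.h
    h[4], h[7] = h[7], h[4]
    print("layers 4 <-> 7 swapped:", top_token(model))

    # Swapping blocks near the ends tends to hurt much more.
    h[0], h[11] = h[11], h[0]
    print("first/last blocks swapped:", top_token(model))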




The inner-layer permutability is super interesting. Is that result published anywhere? It's consistent with this graph here, which seems to imply different layers are kind of working in very related latent spaces.

If you skip to the graph here that shows the attention + feed forward displacements tending to align (after a 2d projection), is this something known/understood? Are the attention and feed forward displacement vectors highly correlated and mostly pointing in the same direction?

https://shyam.blog/posts/beyond-self-attention/

Skip to the graph above this paragraph: "Again, the red arrow represents the input vector, each green arrow represents one block’s self-attention output, each blue arrow represents one block’s feed-forward network output. Arranged tip to tail, their endpoint represents the final output from the stack of 6 blocks, depicted by the gray arrow."
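
A rough way to check that correlation on a small model (not the blog's setup, just forward hooks on GPT-2's blocks) would be something like:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    captured = {}

    def save(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            captured[name] = out[0, -1].detach()  # contribution at the last token position
        return hook

    for i, block in enumerate(model.transformer.h):
        block.attn.register_forward_hook(save(f"attn_{i}"))
        block.mlp.register_forward_hook(save(f"mlp_{i}"))

    inputs = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")
    with torch.no_grad():
        model(**inputs)

    for i in range(len(model.transformer.h)):
        cos = F.cosine_similarity(captured[f"attn_{i}"], captured[f"mlp_{i}"], dim=0).item()
        print(f"block {i:2d}: cos(attn, mlp displacement) = {cos:+.3f}")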


Those curves of "embedding displacement" are very interesting!

Quickly scanning the blog led to this notebook, which shows how they're computed along with other examples of similar behavior: https://github.com/spather/transformer-experiments/blob/mast...


I haven't published it nor have I seen it published.

I can copy paste some of my raw notes / outputs from poking around with a small model (Phi-1.5) into a gist though: https://gist.github.com/bluecoconut/6a080bd6dce57046a810787f...


Thanks for the detailed explanations. And the rambling as well!

Pretty much every Yes and No applies. I had to work out parts of the gaps I was trying to close myself, so thanks for taking the time to interpret my question.



