Ask HN: Do LLMs get "better" with more processing power and/or time per request?
57 points by frannyg 11 months ago | 76 comments
Do they make more (recursive) queries into their training data for breadth and depth? Or does the code limit the algorithms by design and/or by constraints other than the incompleteness of the encoded semantics?



There's a misconception in the question that is important to address first: when an LLM is running inference it isn't querying its training data at all, it's just using a function that we created previously (the "model") to predict the next word in a block of text. That's it. When considering plain inference (no web search or document lookup), the decisions that determine a model's speed and capabilities come before the inference step, during the creation of the model.
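
To make that concrete, here is a minimal sketch of what plain inference looks like, assuming a hypothetical `model` (the trained function) and `tokenizer`; no particular library's API is implied:

    # The model is a fixed function: every step does the same amount of work,
    # and nothing here touches the training data.
    def generate(model, tokenizer, prompt, max_new_tokens=50):
        tokens = tokenizer.encode(prompt)
        for _ in range(max_new_tokens):
            logits = model(tokens)              # a score for every token in the vocabulary
            next_token = int(logits.argmax())   # greedy pick of the most likely token
            tokens.append(next_token)
        return tokenizer.decode(tokens)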

Building an LLM consists of defining its "architecture" (an enormous mathematical function that defines the model's shape) and then using a lot of trial and error to guess which "parameters" (constants that we plug into the function, like 'm' and 'b' in y=mx+b) will be most likely to produce text that resembles the training data.
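
As a toy illustration of that trial-and-error search for parameters (just the y=mx+b case, nothing LLM-specific), gradient descent nudges the constants until the function fits the examples:

    import numpy as np

    # "Training" in miniature: guess m and b for y = m*x + b and repeatedly
    # nudge them to reduce the error on the example data. LLM training does
    # the same thing with billions of parameters instead of two.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 3.0, 5.0, 7.0])       # produced by m=2, b=1

    m, b, lr = 0.0, 0.0, 0.05
    for _ in range(2000):
        err = (m * x + b) - y
        m -= lr * (2 * err * x).mean()        # gradient of mean squared error w.r.t. m
        b -= lr * (2 * err).mean()            # gradient w.r.t. b

    print(round(m, 2), round(b, 2))           # approaches 2.0 and 1.0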

So, to your question: LLMs tend to perform better the more parameters they have, so larger models will tend to beat smaller models. Larger models also require a lot of processing power and/or time per inferred token, so we do tend to see that better models take more processing power. But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.


An analogy that works without having to explain anything at all about how LLMs actually work (or maybe does explain a lot, depending on how you look at it) could be:

* LLMs are lossy compression functions on their training data.

* The size of the model dictates how lossy the compression is.

* You can't spend compute to get more detail out of a model once it's been compressed/trained, any more than you can spend compute to get an incredibly lossily-compressed movie to go from 240p back to the original 1080p source.


You obviously can do that though; diffusion models produce better (fsvo "better") images the more steps you run them for.

Similarly, LLMs can produce better answers if you teach them thinking strategies that remind them to put the available evidence and intermediate steps in their context window. Otherwise they'll tend to hallucinate an answer out of vaguely correct words.


Diffusion models are a different architecture, namely, a recursive or iterative one. Transformer models are not recursive or iterative.


Sure they are. A transformer natively outputs only one token; the recursive process is how you get the rest out of it.


You’re totally right … should’ve thought that one through more.


> You can't spend compute to get more detail [...]

Upscaling, technically, is a thing without limits, no?


> But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.

There's a caveat here - allowing the model to produce more tokens (i.e. giving it more compute time to "think") can produce better results. E.g. asking a model to reason before producing an answer leads to better answers. And the extra tokens = more compute.
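
For illustration, the two calls below use the same model; the second just spends more output tokens (and therefore more compute) before committing to an answer. `llm` is a hypothetical stand-in for whatever completion API is being used:

    question = ("A bat and a ball cost $1.10 together, and the bat costs "
                "$1.00 more than the ball. How much does the ball cost?")

    # Same weights both times - only the amount of generated "reasoning" differs.
    short_answer = llm(question + "\nAnswer with only the number.")
    reasoned_answer = llm(question + "\nLet's think step by step, then give the final answer.")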


True! It's important to first understand the fundamentals of what makes an LLM "good" and what makes it fast, but yes, there are lots of techniques you can apply right before and during the inference step that can trade off between speed and capabilities.

Different prompting techniques like what you're describing are one way, and RAG [0] and ART [1] are also in a similar category.

[0] https://stackoverflow.blog/2023/10/18/retrieval-augmented-ge...

[1] https://www.promptingguide.ai/techniques/art
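
For context, the rough shape of a RAG setup is: retrieve relevant text first, then let the model answer with that text in its context window. `embed`, `vector_store`, and `llm` below are hypothetical placeholders, not any specific library's API:

    # Retrieval-augmented generation in miniature: look up supporting documents,
    # stuff them into the prompt, and let the model answer from that context.
    def answer_with_rag(question, vector_store, llm, k=3):
        docs = vector_store.search(embed(question), k=k)   # nearest-neighbour lookup
        context = "\n\n".join(d.text for d in docs)
        prompt = f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
        return llm(prompt)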


And adding some more here. I don't know if any models are doing this, but there is the possibility of generating tokens that it does not show to the user. I think there's quite a lot of scope for internal monologue/chain of thought that could provide concise but clever answers. The difficulty is the latency while it ponders to itself, but having played with the Groq demos, I think there's scope for a decent interactive experience.

The concern people might feel when they realise an ai might have private thoughts is another issue entirely.


That was indeed part of what I was wondering about.

Larger and smaller, in my beginner mind, was a difference of how much recursiveness the design of the model allowed.

- User request implies knowledge about X.

- Pulling in weights for X.

- Probability of user knowing about Xm and Xz is low (because the training data says Xm and Xz are PhD-level knowledge or something).

- Pulling in weights for an ELI5-level explanation of Xm and Xz ...

I thought an LLM would do this recursive pulling of weights based on the semantics of the user request, which it does, but it doesn't do that "dynamically" based on "recalculated" weights and regenerated combos of tokens, which could only happen if the training data weren't "frozen" and were still accessible, which, as I learned further down in the comments, it isn't.

That's why I wondered whether more processing power and/or time would benefit this recursive generation and pulling.


Yeah, doing things like Chain of Thought, and/or running a second query to examine the first set of tokens it generated, commonly improves answers.


This doesn't change the point of your answer, but to add on: the result of that learned function is a probability for every token that could occur next, which is sampled when inference is happening. The type of sampling used can differ at inference time.
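
A minimal sketch of that sampling step, using plain numpy (greedy decoding, temperature, and top-k are all just different ways of turning the same scores into one chosen token):

    import numpy as np

    # The model's output for one step is a score (logit) per vocabulary token.
    # The sampler decides which token to emit; this is one of the few knobs
    # that genuinely changes at inference time.
    def sample(logits, temperature=1.0, top_k=None):
        logits = np.asarray(logits, dtype=np.float64)
        if top_k is not None:
            cutoff = np.sort(logits)[-top_k]
            logits = np.where(logits < cutoff, -np.inf, logits)   # mask unlikely tokens
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()                                      # softmax
        return int(np.random.choice(len(probs), p=probs))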


I'm still figuring out "inference time", but what left me puzzled at first was that there is - to humans at least - an infinite number of tokens that might come next: technical jargon, synonyms, lexical levels in general. So in my mind there was an RNG built into the function that, after "filtering" the weights based on the user request - and a lot of different tokens, even those meaning the same or almost the same, have the same weights - simply rolled the dice to produce the return string.

I thought the LLM was "getting to know the user" but had a short memory span (the context) and thus "forgot" already calculated weights that it would use to (re)generate new weights.

Further down I learned it freaking forgets all the previous weights in general (I think that's what I learned, I'm getting there)


could a training model be fed the raw data or source and weights of an llm and create better-functioning llms by spotting patterns and things between models? like if you could feed it all the open source models and it could create sub-models off of those, and maybe even a 2nd-gen 'self' instance to better train on the second set, such that maybe it could find ways to get the same results with a 5b model as with a 75b one.


People take a model and continue training it all the time (that is, they start with the already-derived weights of one model and do more training on it to make it something different). Usually this is done to make the model more purpose-fit to a specific task, but it won't often make it generically better, assuming the first effort was using the model to its full potential (not "underfit").

The 75B param model simply has more complexity to work with than the 5B model.

In the same sense that: `y = mx + b` is just not as expressive as `y = ax^2 + bx + c`.


well, i was thinking more like..... something that could spit out an android app because its source is 5k android apps' binary/hex code... i.e. it goes off internals; basically it's a model of models. So it could find some common ground between all models, and create a new model that's the best of all of them. Then add itself to that list of models, and start up the next generation to do it all over again, including itself, and keep repeating until it can't get any better maybe, or until it finds a new way of doing training, or something. I guess I'm looking for a way to speed up the ai singularity, when ai can build upon itself, or really learn like a human - as in, receive new input and have it added to the whole of the thing in real time.


That's mostly a shortcut to making the model worse rather than better because it'll just continually get more obsessive having learned about its own biases.

It's viable if you have tools or humans in the loop to comment on them and add new insights.

But the speed isn't really a factor here, and seeing 1000 new apps isn't obviously going to make it better if the model is already at the limits of what it can represent with its parameter count and compression so to speak.


I could imagine something like that working in theory, but the amount of examples you would need to train such a model makes it completely impractical. We tend to need billions of examples to get a modern deep learning model working well, and it will be a very long time before we reach that many examples of good LLMs.


In a way this is already how the model is trained. Model makes a prediction, loss function calculates how “wrong” the prediction was, and we update the weights of the model to minimize the loss.


This is an excellent, concise description of how LLMs work, thanks.


You are incorrect. Increasing compute during inference renders similar gains to increasing parameters/compute during training time (see self-consistency, tree of thoughts, etc.)


Can you elaborate on that? Apart from the multiplications and accumulations of activations and weights, what additional computations can be applied to improve the outputs?

I think it has already been implied that we are not talking about increasing the quantity of parameters in this context but the possibility of applying additional compute to a model with a given number of parameters.


You can train a smaller model and run inference multiple times, and it will reach similar performance to a larger model running inference just once. What the best way is to make use of those multiple inferences is still up for debate, but we already know it works (self-consistency is one example).
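
A rough sketch of self-consistency, assuming hypothetical `llm` and `extract_answer` helpers (sample several reasoning paths, keep the most common final answer):

    from collections import Counter

    # Self-consistency in miniature: more inference passes from the same model,
    # then a majority vote over the final answers.
    def self_consistent_answer(llm, question, n=5):
        answers = []
        for _ in range(n):
            reply = llm(question + "\nLet's think step by step.", temperature=0.7)
            answers.append(extract_answer(reply))      # pull out just the final answer
        return Counter(answers).most_common(1)[0][0]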


I wasn't able to elaborate on what I meant by "better" when I asked the question, but the idea can indeed be summarized as "will an LLM increase the quantity and quality of its parameters if you give it more processing power and time". Now I know that language models don't do that at all, and that the weights of the user request, stored in the "frozen" training data, are what assemble the return after generating possible output strings, which are selected by pre-prompts like asking for chain of thought and reasoning paths and so on, which in the end are nothing more than more weights pulling in more specific context. (I'm just thinking out loud here)


Yeah, I totally forgot about training time and time of request (aaah, inference time! now I get it.) being completely different points in time because the LLM has no access to the training data anymore.


Right on. A total misconception on my part. And your answer was a nice primer before diving in to the rest of the comments. Thanks!


There are all sorts of changes one could imagine being made to how LLMs are trained and run, but if you are asking about what actually exists today, then:

1) At runtime, when you feed a "request" (prompt) into the model, the model will use a fixed amount of compute/time to generate each word of output. There is no looping going on internally - just a fixed number of steps to generate each word. Giving it more or less processing power at runtime will not change the output, just how fast that output is generated.

If you, as a user, are willing to take more time (and spend more money) to get a better answer, then a trick that often works is to take the LLM's output and feed it back in as a request, just asking the LLM to refine/reword it. You can do this multiple times (a rough sketch of that loop follows point 2 below).

2) At training time, for a given size of model and given set of training data, there is essentially an optimal amount of time to train for (= amount of computing power and time taken to train). Train for too short a time and the model won't have learnt all that it could. Train for too long a time (repeating the training data), and the model will start to memorize the training set rather than generalize from it, meaning that the model is getting worse.
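
Here is the refinement trick from point 1 as a rough sketch; `llm` is a hypothetical completion call, not a specific API:

    # Spend more wall-clock time and money for a (hopefully) better answer by
    # feeding the model's own draft back in and asking it to improve it.
    def refine(llm, request, rounds=2):
        draft = llm(request)
        for _ in range(rounds):
            draft = llm(f"{request}\n\nHere is a draft answer:\n{draft}\n\nPlease improve it.")
        return draft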


> there is no looping going on internally

The thoughts I had after this sentence filled a huge gap I was wondering about, thanks.


More processing power does not make a model better. You can train models on CPUs with the same results for the same model architecture and dataset. It'll just take longer to get those results.

What makes models "good" is whether the dataset "fits" the model architecture properly and whether you have given it enough time (epochs) to reach a semi-accurate prediction ratio (let's say 90% accurate). For image classification models I've done, around ~100 epochs for 10,000 items seems to be the best certain datasets will ever get. At some point continued training of the model is either underfitting or overfitting, and no amount of continued training/processing power will help improve it.


The OP asks "per request", not training time.


The answer is still no, and still for the above reason. Compute resources are only relevant to how fast it can answer, not the quality.


Then why does chain of thought work better than asking for short answers?


Because it’s a better prompt. Works better for people too.


That's not the only reason.

More tokens = more useful compute towards making a prediction. A query with more tokens before the question is literally giving the LLM more "thinking time"


It correlates, but the intuition is a bit misleading. What's actually happening is that by asking a model to generate more tokens, you increase the amount of information present in its context block, which the model has learned to make use of.

It's why "RAG" techniques work: the models learn during training to make use of information in context.

At the core of self-attention is a dot-product similarity measurement, which causes the model to act like a search engine.

It's helpful to think about it in terms of search: the shape of the outputs looks like conversation, but we're actually prompting the model to surface information from the QKV matrices internally.

Does it feel familiar? When we brainstorm we usually chart graphs of related concepts e.g. blueberry -> pie -> apple.
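
To make the search analogy concrete, here is minimal scaled dot-product attention in numpy: each query is scored against every key by a dot product, and those scores decide how much of each value gets blended into the output:

    import numpy as np

    # Q, K, V are (sequence_length, head_dim) matrices.
    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])                   # query/key similarity
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
        return weights @ V                                        # weighted mix of values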


>What's actually happening is that by asking a model to generate more tokens, it increases the amount of information it has learnt to be present in its context block.

I'm not saying this isn't part of it but even if it's just dummy tokens without any new information, it works.

https://arxiv.org/abs/2310.02226


It’s not clear that more tokens are better.


I think it's pretty clear

https://arxiv.org/abs/2310.02226

I mean, I can imagine you wouldn't always need the extra compute.


This paper is a great illustration of how little is understood about this question. They discovered that appending dummy tokens (ignored during both training and inference) improves performance somehow. Don’t confuse their guess as to why this might be happening with actual understanding. But in any case, this phenomenon has little to do with increasing the size of the prompt using meaningful tokens. We still have no clue if it helps or not.


I just found this paper I read a while ago. Doesn't this answer the question?

The Impact of Reasoning Step Length on Large Language Models - https://arxiv.org/abs/2401.04925

>They discovered that appending dummy tokens (ignored during both training and inference) improves performance somehow. Don’t confuse their guess as to why this might be happening with actual understanding.

More tokens is more compute time for the model to utilize, that is completely true.

What they guess is that the model can utilize the extra compute for better predictions even if there's no extra information to accompany this extra "thinking time".


Yes, more tokens means doing more compute, that much is true. The question is whether this extra compute helps or hurts. This question is yet to be answered, as far as I know. I tend to make my GPT-4 questions quite verbose, hoping it helps.

This is completely orthogonal to CoT, which is simply a better prompt - it probably causes some sort of better pattern matching (again very poorly understood).


>The question is whether this extra compute helps or hurts.

I've linked 2 papers now that show very clearly the extra compute helps. I honestly don't understand what else it is you're looking for.

>This is completely orthogonal to CoT, which is simply a better prompt - it probably causes some sort of better pattern matching (again very poorly understood).

That paper specifically dives in on the effect of the length of the CoT prompt. It makes little sense to say "oh, it's just the better prompt" when CoT prompts with more tokens perform better than shorter ones, even when the shorter ones contain the same information. There is also the clear correlation between task difficulty and length.


Yes, the CoT paper does provide some evidence that a more verbose prompt works better. Thank you for pointing me to it.

Though I still don’t quite understand what is going on in the dummy tokens paper - what is “computation width” and why would it provide any benefit?


So "compute" includes just having more data ... that can also be "ignored"/ "skipped" for whatever reasons (e.g. weights), ok.


I have a theory that the results are actually a side effect of having the information in a different area of the context block.

Models can be sensitive to the location of a needle in the haystack of its input block.

It's why there are models which are great at single turn conversation but can't hold a conversation past that without multi-turn training.

You can even corrupt the outputs by pushing past the number of turns / show the model data in a form it hasn't really seen before.


> Models can be sensitive to the location of a needle in the haystack of its input block.

But only if we use some sort of attention optimization. For the quadratic attention algo it shouldn’t matter where the needle is, right?


Ok, thanks. My misconception kind of prohibited the insight of a potential (theoretical) assert statement, which is kind of what is meant by

> if the [resulting] dataset "fits" the model architecture properly,

right?

I have too many questions. It seems unreasonable to ask away and I should instead read the studies and some books.


No, the standard LLM implementations currently used apply a fixed amount of computation during inference, which is chosen and "baked in" by the model architecture before training. They don't really have the option to "think a bit more" before giving the answer; generating each token involves the exact same number of matrix multiplications. Well, they probably could theoretically be modified to do it, but we don't do that properly yet, even if some styles of prompts, e.g. "let's think step by step", kind of nudge the model in that direction.

The same model will give the same result, and more processing power will simply enable you to get the inference done faster.

On the other hand, more resources may enable (or be required for) a different, better model.


> the same model will give the same result

Is it wrong to think of this as misleading? Don't the results for exactly the same request differ because there are multiple output strings with the same computed weights?

Or do you include "multiple ways to phrase the same" in "same results" and I'm being a noob?


There is certain intentional randomness in how the tokens are selected, and certain unintentional randomness due to letting some optimizations cause small side-effects, but in any case, in that sentence I didn't really intend to talk about the result being identical, but rather about the result not being any better just because more compute was available, as by default that extra available potential simply wouldn't get used in any way other than getting a speedup.


There's fixed compute per token, but more tokens = more compute, so an LLM will technically have more "time" for a query with more tokens preceding it.


A key aspect is the information bottleneck enforced by the mechanism as the next "iteration" only gets to access the new token computed and discards all the other information it computed.

So if you want it to spend more "time" in a useful manner without changing the architecture, you have to get it to write down the temporary information in the tokens, as "think step by step" does or alternatively iterative prompts "write a draft for the rough structure" "now rewrite it better with more detail".


This blew my mind a little as it feels unintuitive to do this since you wouldn't just forget what you based your previous reply on, at least not after some practice with your mind and memory (which I need to catch up on, I must add).

It also feels like a multiplication of required processing power, but I have no clue yet how one could use the previous generation of weights and the tokens themselves to improve, elaborate on, or widen the range of predicted potential results.


One caveat not mentioned yet is that you can get better responses through priming, few-shot examples and chain of thought. That means if you start talking about a related problem/concept, mention some keywords, then provide a few examples, then ask the LLM to provide chain-of-thought reasoning, you will get a better answer. Those will extend the runtime and processing power used in practice.
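
A rough sketch of what such a primed prompt can look like (the wording is purely illustrative):

    # Priming (the topic), few-shot (a worked example), and chain of thought
    # (the trailing "think step by step") in one prompt - all of which add
    # tokens, and therefore runtime, to the request.
    prompt = """We are doing unit conversions.

    Q: How many metres are in 3 km?
    A: 1 km is 1000 m, so 3 km is 3 * 1000 = 3000 m. The answer is 3000 m.

    Q: How many seconds are in 2.5 hours?
    A: Let's think step by step."""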


Without knowing the particulars of an implementation, it's hard to say. Some can refine results by running the model a few more times, so yeah, better processing and/or more time would help, though probably not by much.

Most models, however, don't, so there's no special benefit from better processing other than speed.


Interestingly, it used to be quite standard with 'small' language models to use a search algorithm to render a full block of text, the most basic being beam search. Then you can use more processing power to do a wider search for better results. This is not what the OP is talking about; it just means generating a larger number of candidate continuations. However, it's not necessary or optimal for newer LLMs, because it tends to siphon the LLM into quite generic places, and it can get very repetitive.
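
For reference, a toy version of beam search; `next_token_logprobs(seq)` is a hypothetical function returning a {token: log-probability} dict for the next position:

    # Beam search: instead of one greedy continuation, keep the beam_width most
    # probable partial sequences at every step. Wider beams cost more compute.
    def beam_search(next_token_logprobs, start_tokens, steps=10, beam_width=3):
        beams = [(0.0, list(start_tokens))]          # (cumulative log-prob, sequence)
        for _ in range(steps):
            candidates = []
            for score, seq in beams:
                for tok, logp in next_token_logprobs(seq).items():
                    candidates.append((score + logp, seq + [tok]))
            beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        return max(beams, key=lambda c: c[0])[1]     # best full sequence found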


Nope, this definitely fills a few gaps, thanks. I'm still too lazy to think about this whole O(n) time thing even though I'm constantly wondering whether "more" or better results could be achieved by throwing CPUs at stuff, hahaha. I rarely think in terms of time in general, just about depth, breadth and clarity.


No.

An LLM can only give probabilities of the next token of output. The time to improve an LLM is during design, training, or fine tuning. Once you've got the final weights, the function is "locked in" and doesn't change.

However part of the process of learning to predict human output from the internet, literature, etc. causes some deeper learning to occur, potentially even more than in humans, certainly of a different nature. The LLM is communicating through a lossy process, and there is some randomness imposed on its outputs, so results may vary.

The nature of the prompt used can trigger some of this deeper learning and yield better results than you might otherwise get. These weren't put in by design; they are emergent properties of the LLM. For instance, "chain of thought" prompting has been shown to result in better output.

Prompt "engineering" is an empirical process of discovering the quirks and hidden strengths in the model. It is entirely possible that there is a super-human set of cognitive skills embedded inside GPT4, Mistral, or even LLAMA. Given sufficient time, there might be some prompting that could expose it and make it usable.

Because LLMs aren't "programs" in the traditional sense, you should treat them as if they were an alien intelligence, because that is effectively what they are. They don't understand humans, no matter how well they act like it at times. They are wild beasts, and we haven't figured out how to domesticate them yet.


Short answer is No.

I highly recommended watching Andrej Karpathy's Intro to LLMs talk, particularly the section on System 1 vs System 2 thinking. Long story short, what you are describing, using more processing to prepare a better response, is something that is an area of interest, but is not currently part of ChatGPT (or any other LLM that I am aware of).

See: https://youtu.be/zjkBMFhNj_g?t=2100&si=jaImuf3UCn6ReTp4


Yes, but not for the reasons you're thinking

- If you have a fixed time budget and increase the GPU memory+compute available, you can directly query a bigger model. Raw models are basically giant lookup functions, and without the extra memory+compute, they'll spill to slower layers of your memory hierarchy, e.g., GPU RAM -> CPU RAM -> disk. Likewise, with MoE models, there are multiple concurrent models being queried.

- Most 'good' LLM systems are not just direct model calls, but code-based agent frameworks on top that call code tools, analyze the results, and decide to edit+retry things. For example, if doing code generation, they may decide to run lint analysis & type checking on a generated output, and if issues, ask the LLM to try again. In Louie.AI, we will even generate database queries and run GPU analytics & visualizations in on-the-fly Python sandboxes. These systems will do backtracking etc retries, and > 50% of the quality can easily come from these layers: LLM leaderboards like HumanEval increasingly report both the raw model + what agent framework on top. All this adds up and can quickly become more expensive than the LLM. So better systems can enable more here too.
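
The second point is basically a loop like the sketch below; `llm` and `run_linter` are hypothetical placeholders for a completion call and a code-checking tool:

    # Agent-style wrapper: generate, check with a tool, and retry with the
    # tool's feedback. The quality gain comes from the loop, not the model.
    def generate_checked_code(llm, task, max_attempts=3):
        prompt = task
        for _ in range(max_attempts):
            code = llm(prompt)
            problems = run_linter(code)              # e.g. lint / type-check the output
            if not problems:
                return code
            prompt = f"{task}\n\nPrevious attempt:\n{code}\n\nFix these issues:\n{problems}"
        return code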


Nice. Thank you for the addition of slower memory layers.

So MoE models are a bit like thinking tools running concurrently, right(?), sieving through training data on paths that are the same contextually, but different in terms of specificity and sensitivity.

If the agents/experts/architectures - the code - don't have the minimum required amount of memory & processing power, they might even miss entire bunches of tokens that are or might be relevant within the given (the prompt) and predicted/requested context. So more processing power and/or time is relevant only to the extent, here: size, of the to-be-queried-at-inference-time training data (tokens and weights).

Now here's where I find myself exactly within the realm that I was in when I phrased my question: analysing the result of a request and evaluating different sets of tokens, which, I now understand, makes much more sense within the subject of code generation than with the recitation of facts or bits of narratives.

Generated code has functions (things to do with other things). Functions can be done more or less efficiently, while even the least efficient code works "more than good and fast enough". There is no value in looping through versions of fact and fiction when the answer fits the expectation. And if it doesn't fit, users can have an actual conversation, which is where I get another part of my answer, which is that more processing power only becomes relevant in relation to the amount of concurrent requests in relation to the parts of the training data that are queried at inference time.

No single request will ever query so much data at the same time, that memory and compute become a bottleneck.

It definitely can become a bottleneck when a long/large/broad (but specific) request gets processed by MoEs simultaneously or when versions of results of engineering tasks are being evaluated. But that is simply not within the task or design of current LLMs and is instead added on top (or as a wrapper, for example, which I still fail to find a non-replaceable use case for, while also still being certain that I will find one once I get to LLMs and AIs).

Again, thanks!


For inference, the common answer will be "no": you use the model you get, and it takes a constant time to process.

However, the truth is that inference platforms do take shortcuts that affect accuracy. E.g. llama.cpp will down-convert fp32 intermediates to 8-bit quantized values so it can do the work using 8-bit integers. This degrades the computation's accuracy for performance.
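
A toy round-trip through 8-bit quantization shows the trade-off (real llama.cpp quantization schemes are more elaborate than this):

    import numpy as np

    # Squeeze float weights into 256 integer levels: cheaper to store and compute
    # with, but the restored values are slightly off.
    weights = np.random.randn(8).astype(np.float32)

    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)   # fp32 -> int8
    restored = q.astype(np.float32) * scale                             # int8 -> fp32

    print(np.abs(weights - restored).max())   # small but non-zero rounding error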


I have no freaking idea what you said in the second paragraph but I love it and it will linger in the back of my head until I understand enough to look it up.

[nodding repeatedly with a serious face and lot of resolve]


My understanding of GPT4 is that it is a mixture of experts. In other words, multiple GPT 3.5 models responding to the same prompt in parallel, and another model on top choosing the best response among them.

So in that case, more models could give a better response, which costs more compute.


Where did you get that understanding? This doesn't really make any sense; how would GPT be able to stream one token at a time in the first place?


There's actually information provided during token generation that acts as a level of confidence.

You can definitely stream and choose the highest scoring values amongst a few shots at generating the best next token candidate.
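
As a sketch (assuming a hypothetical `llm_with_logprobs` call that returns the text plus its per-token log-probabilities, which many APIs expose in some form):

    # "Best of a few shots": sample several candidates and keep the one the
    # model itself scored most highly, at the cost of n times the compute.
    def best_of_n(llm_with_logprobs, prompt, n=4):
        candidates = [llm_with_logprobs(prompt, temperature=0.8) for _ in range(n)]
        return max(candidates, key=lambda c: sum(c[1]) / len(c[1]))[0]   # highest mean log-prob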


The same model will not get better by having more processing power or time. However, that's not the full story.

Larger models generally perform better than smaller models (this is a generalization, but a good enough one for now). The problem is that larger models are also slower.

This ends up being a balancing act for model developers. They could get better results, but it may end up being a worse user experience. Model size can also limit where the model can be deployed.


LLMs output a probability distribution for the next word. Searching the space of what the best next word to use is takes more time than just picking a good one and assuming that it was a good enough choice.


Both

It takes time to train them. More = better. Usually about 6 months or so. More processing power can allow the model to cram more power in


The OP asks about request time (and, I imagine, processing power) not training


Directly answering your question requires making some assumptions about what you mean and also what "class" of models you are asking about. Unfortunately I don't think it's just a yes or no, since I can answer in both directions depending on the interpretation.

[No] If you mean "during inference", then the answer is mostly no in my opinion, but it depends on what you are calling a "LLM" and "processing power", haha. This is the interpretation I think you are asking for though.

[Yes] If you mean everything behind an endpoint is an LLM, eg. that includes a RAG system, specialized prompting, special search algorithms for decoding logits into tokens, then actually the answer is obviously a yes, those added things can increase skill/better-ness by using more processing power and increasing latency.

If you mean the raw model itself, and purely inference, then there's sorta 2 classes of answers.

[No] 1. On one side you have the standard LLM (just a gigantic transformer), and these run the same "flop" of compute to predict logits for 1 token's output (at fixed size input), and don't really have a tunable parameter for "think harder" -> this is the "no" that I think your question is mostly asking.

[Yes] 2. For mixture of experts, though they don't do advanced adaptive model techniques, they do sometimes have a "top-K" parameter (eg. top-1, top-2 experts) which "enables" more blocks of weights to be used during inference, in which case you could make the argument that they're gaining skill by running more compute. That said, afaik, everyone seems to run inference with the same N number of experts once set up and doesn't do dynamic scaling selection.

[Yes] Another interpretation: broadly there's the question of "what factors matter the most" for LLM skill, if you include training compute as part of compute (amortize it or whatever) --> then, per the scaling law papers: it seems like the 3 key things to keep in mind are: [FLOPs, Parameters, Tokens of training data], and in these parameters there is seemingly power-law scaling of behavior, showing that if you can "increase these" then the resulting skill also will keep "improving" (hence an interpretation of "more processing power" (training) and "time per request" (bigger model / inference latency) is correlated to "better" LLMs.

[No] You mention this idea of "more recursive queries into their training data", and it's worth noting a trained model no longer has access to the training data. And in fact, the training data that gets sent to the model during training (eg. when gradients are being computed and weights are being updated) is sent on some "schedule" usually (or sampling strategy), and isn't really something that is being adaptively controlled or dynamically "sampled" even during training. So it doesn't have the ability to "look back" (unless it's a retrieval-style architecture or a RAG inference setup).

[Yes] Another thing is the prompting strategy / decoding strategy, hinted at above. eg. you can decode with just taking 1 output, or you can take 10 outputs in parallel, rank them somehow (consensus ranking, or otherwise), and then yes, that can also improve (eg. this was contentious when gemini ultra was released, because their benchmarks used slightly different prompting strategies than GPT-4 prompting strategies, which made it even more opaque to determine "better" score per cost (as some meta-metric)) (some terms are chain/tree/graph of thought, etc.)

[Yes (weak)] Next, there's another "concept" of your question about "more processing power leading to better results", which you could argue "in-context learning" is itself more compute (takes flops to run the context tokens through the model (N^2 scaling, though with caches)) - and purely by "giving a model" more instructions in the beginning, you increase the compute and memory required, but also often "increase the skill" of the output tokens. So maybe in that regard, even a frozen model is a "yes" it does get smarter (with the right prompt / context).

One interesting detail about current SotA models, even the Mixture of Experts style models, is that they're "static" in their weights and in the "flow" of activations along the "layer" direction. They're dynamic (they re-use weights) in the "token"/"causal" ordering direction (the N^2 part).

I've personally spent some time (~1 month in Nov last year) working on trying to make more advanced "adaptive models" that use switches like those from the MoE style network, but route to "the same" QKV attention matrices, so that something like what you describe is possible (make the "number of layers" a dynamic property, and have the model learn to predict after 2 layers, 10 layers, or 5,000 layers, and see if "more time to think" can improve the results, do math with concepts, etc. -- but for there to be dynamic layers, the weights can't be "frozen in place" like they currently are). Currently I have nothing good to show here though.

One interesting finding though (now that I'm rambling and just typing a lot) is that in a static model, you can "shuffle" the layers (eg. swap layer 4's weights with layer 7's weights) and the resulting tokens roughly seem similar (likely caused by the ResNet style backbone). Only the first ~3 layers and last ~3 layers seem "important to not permute". It kinda makes me interpret models as using the first few layers to get into some "universal" embedding space, operating in that space "without ordering in layer-order", and then "projecting back" to token space at the end (rather than staying in token space the whole way through).

This is why I think it's possible to do more dynamic routing in the middle of networks, which I think is what you're implying when you say "do they make more recursive queries into their data". (I'm projecting, but when I imagine the idea of "self-reflection" or "thought" like that inside of a model, I imagine it at this layer -- which, as far as I know, has not been shown/tested in any current LLM / transformer architecture.)


The inner-layer permutability is super interesting. Is that result published anywhere? That's consistent with the graph here, which seems to imply different layers are kind of working in very related latent spaces.

If you skip to the graph that shows the attention + feed-forward displacements tending to align (after a 2d projection): is this something known/understood? Are the attention and feed-forward displacement vectors highly correlated and mostly pointing in the same direction?

https://shyam.blog/posts/beyond-self-attention/

Skip to the graph above this paragraph: "Again, the red arrow represents the input vector, each green arrow represents one block’s self-attention output, each blue arrow represents one block’s feed-forward network output. Arranged tip to tail, their endpoint represents the final output from the stack of 6 blocks, depicted by the gray arrow."


Those curves of "embedding displacement" are very interesting!

Quickly scanning the blog led to this notebook, which shows how they're computed and shows other examples too with similar behavior. https://github.com/spather/transformer-experiments/blob/mast...


I haven't published it nor have I seen it published.

I can copy paste some of my raw notes / outputs from poking around with a small model (Phi-1.5) into a gist though: https://gist.github.com/bluecoconut/6a080bd6dce57046a810787f...


Thanks for the detailed explanations. And the rambling as well!

Pretty much every Yes and No applies. I had to understand bits of the gaps I was trying to close myself, so thanks for taking the time to interpret my question.



