Directly answering your question requires making some assumptions about what you mean and what "class" of models you are asking about. Unfortunately I don't think it's a simple yes or no, since I can answer in both directions depending on the interpretation.

[No] If you mean "during inference", then the answer is mostly no in my opinion, but it depends on what you are calling an "LLM" and "processing power", haha. This is the interpretation I think you are asking about, though.

[Yes] If you mean everything behind an endpoint counts as "the LLM", eg. that includes a RAG system, specialized prompting, special search algorithms for decoding logits into tokens, then the answer is obviously yes: those added pieces can increase skill/"better-ness" by using more processing power and adding latency.
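
To make that concrete, here's a toy sketch of the "extra stuff behind the endpoint" idea. The retrieve() and generate() functions are stand-ins I made up for illustration, not any real library's API:

    # Toy sketch of an endpoint doing extra work around the raw model.
    DOCS = [
        "The Chinchilla paper relates loss to parameters and training tokens.",
        "Mixture-of-experts models route each token to a subset of experts.",
        "KV caches avoid recomputing attention over the prompt at each step.",
    ]

    def retrieve(query: str, k: int = 2) -> list[str]:
        # Extra compute #1: rank documents by naive keyword overlap (stand-in for a vector DB).
        def overlap(doc: str) -> int:
            return len(set(query.lower().split()) & set(doc.lower().split()))
        return sorted(DOCS, key=overlap, reverse=True)[:k]

    def generate(prompt: str) -> str:
        # Stand-in for the actual LLM forward pass.
        return f"[model output for a prompt of {len(prompt)} chars]"

    def answer(query: str) -> str:
        # Extra compute #2: specialized prompting wrapped around the raw model.
        context = "\n".join(retrieve(query))
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return generate(prompt)

    print(answer("How do mixture-of-experts models use their experts?"))

Every extra step (retrieval, longer prompts, reranking) buys skill with more processing power and latency, even though the model weights never change.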

If you mean the raw model itself, and purely inference, then there are sorta two classes of answers.

[No] 1. On one side you have the standard LLM (just a gigantic transformer). These run the same number of FLOPs to predict the logits for one output token (at a fixed input size), and don't really have a tunable parameter for "think harder" -> this is the "no" that I think your question is mostly asking about.

[Yes] 2. Mixture-of-experts models don't do advanced adaptive-compute tricks, but they do sometimes have a "top-K" parameter (eg. top-1 or top-2 experts) which "enables" more blocks of weights to be used during inference, in which case you could argue they gain skill by running more compute. That said, afaik, everyone seems to run inference with the same N experts once set up and doesn't do dynamic scaling of that selection.
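
For a feel of where that top-K knob lives, here is a minimal made-up sketch of top-K expert routing (not any specific model's implementation); bumping top_k means more expert FFN passes per token, ie. more compute:

    import torch
    import torch.nn as nn

    class TinyMoELayer(nn.Module):
        def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )
            self.top_k = top_k

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
            scores = self.router(x)                           # (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)    # pick K experts per token
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):                    # larger K -> more expert passes
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    layer = TinyMoELayer(top_k=2)
    print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])

A dense transformer has no equivalent knob: every token goes through every block, full stop.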

[Yes] Another interpretation: broadly there's the question of "what factors matter the most" for LLM skill. If you include training compute as part of "compute" (amortize it or whatever), then per the scaling-law papers the 3 key things to keep in mind are [FLOPs, Parameters, Tokens of training data], and in these there is seemingly power-law behavior: if you can "increase these", the resulting skill also keeps "improving". Hence under this interpretation, "more processing power" (training) and "time per request" (bigger model / inference latency) are correlated with "better" LLMs.
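
The parametric form people usually fit (Chinchilla-style) looks something like the sketch below; treat the constants as illustrative placeholders rather than anyone's published fit:

    # L(N, D) = E + A / N**alpha + B / D**beta
    # N = parameter count, D = training tokens; constants here are for illustration only.
    def loss(n_params: float, n_tokens: float,
             E: float = 1.7, A: float = 400.0, B: float = 400.0,
             alpha: float = 0.34, beta: float = 0.28) -> float:
        return E + A / n_params**alpha + B / n_tokens**beta

    # Scaling either axis keeps pushing the predicted loss down, with diminishing returns.
    for n, d in [(1e9, 20e9), (10e9, 200e9), (100e9, 2e12)]:
        print(f"N={n:.0e}, D={d:.0e} -> predicted loss ~ {loss(n, d):.3f}")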

[No] You mention this idea of "more recursive queries into their training data", and it's worth noting that a trained model no longer has access to the training data. In fact, the training data that gets sent to the model during training (eg. when gradients are being computed and weights are being updated) is usually sent on some "schedule" (or sampling strategy), and isn't really something that is being adaptively controlled or dynamically "sampled" even during training. So the model doesn't have the ability to "look back" (unless it's a retrieval-style architecture or a RAG inference setup).

[Yes] Another thing is the prompting strategy / decoding strategy, hinted at above. Eg. you can decode by taking just 1 output, or you can take 10 outputs in parallel, rank them somehow (consensus ranking, or otherwise), and yes, that can also improve results. (This was contentious when Gemini Ultra was released, because their benchmarks used slightly different prompting strategies than the GPT-4 ones, which made it even more opaque to determine "better" score per cost as some meta-metric.) Some terms: chain/tree/graph of thought, etc.
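
The "sample a bunch, then vote" version is easy to sketch; sample_answer() below is a stand-in for one full decode from the model, so more samples means strictly more compute and latency:

    from collections import Counter

    def sample_answer(prompt: str, seed: int) -> str:
        # Stand-in: pretend the model occasionally slips on arithmetic.
        return "42" if seed % 4 != 0 else "41"

    def decode_with_consensus(prompt: str, n_samples: int = 10) -> str:
        answers = [sample_answer(prompt, seed=i) for i in range(n_samples)]
        winner, _count = Counter(answers).most_common(1)[0]
        return winner

    print(decode_with_consensus("What is 6 * 7?"))  # the majority vote picks "42"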

[Yes (weak)] Next, there's another "concept" in your question about "more processing power leading to better results": you could argue "in-context learning" is itself more compute (it takes FLOPs to run the context tokens through the model; N^2 scaling, though with caches). Purely by giving a model more instructions at the beginning, you increase the compute and memory required, but also often "increase the skill" of the output tokens. So maybe in that regard, even a frozen model is a "yes": it does get smarter (with the right prompt / context).
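
Rough shape of that extra cost, counting only the attention term (constants omitted; this just shows the scaling, not real FLOP counts):

    # Prefill is roughly O(n^2) in prompt length; each generated token then attends
    # over the whole KV cache, so longer context = more compute per request.
    def prefill_attention_ops(n_prompt: int, d_model: int = 4096, n_layers: int = 32) -> float:
        return n_layers * n_prompt**2 * d_model

    def per_token_attention_ops(n_cached: int, d_model: int = 4096, n_layers: int = 32) -> float:
        return n_layers * n_cached * d_model  # one new query against everything cached

    for n in (1_000, 10_000, 100_000):
        print(f"prompt={n:>7} tokens  prefill~{prefill_attention_ops(n):.1e}  per-token~{per_token_attention_ops(n):.1e}")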

One interesting detail about current SotA models, even the mixture-of-experts style ones, is that they're "static" in their weights and in the "flow" of activations along the "layer" direction. They only re-use weights dynamically along the "token"/"causal" ordering direction (the N^2 part). I've personally spent some time (~1 month in Nov last year) working on more advanced "adaptive models" that use switches like those from the MoE-style network, but route to "the same" QKV attention matrices, so that something like what you describe becomes possible: make the "number of layers" a dynamic property, have the model learn to predict after 2 layers, 10 layers, or 5,000 layers, and see if "more time to think" can improve the results, do math with concepts, etc. For there to be dynamic layers, the weights can't be "frozen in place" like they currently are. I have nothing good to show here yet, though.

One interesting finding (now that I'm rambling and just typing a lot) is that in a static model you can "shuffle" the layers (eg. swap layer 4's weights with layer 7's weights) and the resulting tokens come out roughly similar (likely thanks to the ResNet-style backbone). Only the first ~3 layers and last ~3 layers seem "important to not permute". It kinda makes me interpret models as using the first few layers to get into some "universal" embedding space, operating in that space "without ordering in layer-order", and then "projecting back" to token space at the end, rather than staying in token space the whole way through.

This is why I think it's possible to do more dynamic routing in the middle of networks, which I think is what you're implying when you say "do they make more recursive queries into their data". (I'm projecting, but when I imagine the idea of "self-reflection" or "thought" like that inside a model, I imagine it at this layer -- which, as far as I know, has not been shown/tested in any current LLM / transformer architecture.)
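
If anyone wants to poke at the permutation thing themselves, here's roughly the kind of swap I mean, sketched against GPT-2 (whose blocks sit in model.transformer.h); this isn't my exact setup, just the idea:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    inputs = tok("The capital of France is", return_tensors="pt")

    def top_token(m):
        # Greedy next-token prediction, just to eyeball whether behavior survives the swap.
        with torch.no_grad():
            logits = m(**inputs).logits[0, -1]
        return tok.decode(logits.argmax().item())

    print("original:", top_token(model))

    # Swap two middle blocks and see if the prediction roughly holds up.
    h = model.transformer.h
    h[4], h[7] = h[7], h[4]
    print("layers 4 <-> 7 swapped:", top_token(model))

    # Swapping blocks near the ends tends to hurt much more.
    h[0], h[11] = h[11], h[0]
    print("first/last blocks swapped:", top_token(model))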




The inner-layer permutability is super interesting. Is that result published anywhere? It's consistent with this graph here, which seems to imply different layers are kind of working in very related latent spaces.

If you skip to the graph here that shows the attention + feed forward displacements tending to align (after a 2d projection), is this something known/understood? Are the attention and feed forward displacement vectors highly correlated and mostly pointing in the same direction?

https://shyam.blog/posts/beyond-self-attention/

Skip to the graph above this paragraph: "Again, the red arrow represents the input vector, each green arrow represents one block’s self-attention output, each blue arrow represents one block’s feed-forward network output. Arranged tip to tail, their endpoint represents the final output from the stack of 6 blocks, depicted by the gray arrow."
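
A rough way to check that correlation on a small model (not the blog's setup, just forward hooks on GPT-2's blocks) would be something like:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    captured = {}

    def save(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            captured[name] = out[0, -1].detach()  # contribution at the last token position
        return hook

    for i, block in enumerate(model.transformer.h):
        block.attn.register_forward_hook(save(f"attn_{i}"))
        block.mlp.register_forward_hook(save(f"mlp_{i}"))

    inputs = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")
    with torch.no_grad():
        model(**inputs)

    for i in range(len(model.transformer.h)):
        cos = F.cosine_similarity(captured[f"attn_{i}"], captured[f"mlp_{i}"], dim=0).item()
        print(f"block {i:2d}: cos(attn, mlp displacement) = {cos:+.3f}")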


Those curves of "embedding displacement" are very interesting!

Quickly scanning the blog led to this notebook, which shows how they're computed along with other examples of similar behavior: https://github.com/spather/transformer-experiments/blob/mast...


I haven't published it nor have I seen it published.

I can copy paste some of my raw notes / outputs from poking around with a small model (Phi-1.5) into a gist though: https://gist.github.com/bluecoconut/6a080bd6dce57046a810787f...


Thanks for the detailed explanations. And the rambling as well!

Pretty much every Yes and No applies. I had to work out parts of the gaps I was trying to close myself, so thanks for taking the time to interpret my question.



