A lot of the mech interp stuff has seemed to me like a different kind of voodoo: the Integer Quantum Hall Effect? Overloading the term “Superposition” in a weird analogy not governed by serious group representation theory and some clear symmetry? You guys are reaching. And I’ve read all the papers. Spot the postdoc who decided to get paid.
But there is one thing in particular that I’ll acknowledge as a great insight and the beginnings of a very plausible research agenda: bounded near orthogonal vector spaces are wildly counterintuitive in high dimensions and there are existing results around it that create scope for rigor [1].
Superposition code is a well known concept in information theory - I think there is certainly more to the story then described in the current works, but it does feel like they are going in the right direction
Where are you seeing the integer quantum Hall effect mentioned? Or are you bringing it up rather than responding to it being brought up elsewhere? I don’t understand what the connection between IQHE and these SAE interpretability approaches is supposed to be.
Pardon me, the reference is to the fractional Hall effect.
"But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right."
But there is one thing in particular that I’ll acknowledge as a great insight and the beginnings of a very plausible research agenda: bounded near orthogonal vector spaces are wildly counterintuitive in high dimensions and there are existing results around it that create scope for rigor [1].
[1] https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...