A lot of the mech interp stuff has seemed to me like a different kind of voodoo:...

txnf · 2024-11-22T03:07:25 1732244845

Superposition coding is a well-known concept in information theory. I think there is certainly more to the story than what the current work describes, but it does feel like they are going in the right direction.

drdeca · 2024-11-22T03:18:53 1732245533

Where are you seeing the integer quantum Hall effect mentioned? Or are you bringing it up rather than responding to it being brought up elsewhere? I don’t understand what the connection between IQHE and these SAE interpretability approaches is supposed to be.

benreesman · 2024-11-22T04:37:41 1732250261

Pardon me, the reference is to the fractional quantum Hall effect.

"But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right."

https://transformer-circuits.pub/2022/toy_model/index.html