I spent a lot of time and money on this rather big side project of mine. It attempts to replicate the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers from Anthropic [1], OpenAI [2], and DeepMind [3].
I am quite proud of this project, and since I consider myself part of the target audience of Hacker News, I thought some of you might appreciate this open research replication as well. Happy to answer any questions and take any feedback.
Cheers
[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...
[2] https://arxiv.org/abs/2406.04093
[3] https://arxiv.org/abs/2408.05147
Rhetoric isn’t reasoning. True explainability, of the kind overfitted sparse autoencoders are claimed to offer, would basically amount to recovering the causal sequence of “thoughts” the model goes through as it produces an answer. It’s the same way you may have a bunch of ephemeral thoughts running in different directions while you think about anything.
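To make that concrete: at its core, an SAE in this line of work is just an encoder/decoder trained to reconstruct a model's internal activations under a sparsity penalty. Here's a minimal sketch, assuming a PyTorch setup; the dimensions, names, and the plain L1 penalty are illustrative, not the exact recipe from the papers above:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Minimal SAE: decompose a model activation into sparse features."""
        def __init__(self, d_model: int = 768, d_features: int = 16384):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, activation: torch.Tensor):
            features = torch.relu(self.encoder(activation))  # sparse feature activations
            reconstruction = self.decoder(features)
            return reconstruction, features

    def sae_loss(activation, reconstruction, features, l1_coeff: float = 1e-3):
        # Reconstruct the activation while driving most features to zero.
        mse = (reconstruction - activation).pow(2).mean()
        sparsity = features.abs().mean()  # L1 penalty encourages sparsity
        return mse + l1_coeff * sparsity

    sae = SparseAutoencoder()
    x = torch.randn(32, 768)  # stand-in for residual-stream activations
    recon, feats = sae(x)
    loss = sae_loss(x, recon, feats)
    loss.backward()

Note that the objective only rewards reconstruction plus sparsity; nothing in it guarantees the recovered features correspond to causal steps in the model's computation, which is exactly the gap I'm pointing at.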