entilzha 47 days ago | on: Byte Latent Transformer: Patches Scale Better Than...

I don't believe so, or at least if someone tried, it didn't work well enough for me to remember :). Some of the motivation for the architecture changes in encoding patches came from looking for FLOP-efficient ways to express relationships between byte sequences. For example, a long context window makes sense when dealing with tokens, but you don't need as long an attention window when you're attending over byte sequences to build patch representations, since those patch representations will implicitly be part of a longer context window, measured in number of patches.
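
To make the window argument concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an illustrative assumption rather than a value from the paper: the 4096-byte input, the 4-byte average patch size, the 128-byte local window, and the function names. Counting query-key pairs is also only a crude proxy for attention FLOPs.

```python
def local_window_mask(n_bytes, window):
    """mask[i][j] is True when byte i may attend to byte j:
    causal, and at most `window` positions back (including itself)."""
    return [[0 <= i - j < window for j in range(n_bytes)]
            for i in range(n_bytes)]


def attention_pairs(seq_len, window=None):
    """Number of (query, key) pairs scored by causal attention.
    With window=None every position sees all earlier positions."""
    if window is None or window >= seq_len:
        return seq_len * (seq_len + 1) // 2
    # The first `window` positions see 1..window keys; the rest see `window`.
    return window * (window + 1) // 2 + (seq_len - window) * window


# Hypothetical sizes, chosen only for illustration.
N_BYTES = 4096          # raw input length in bytes
AVG_PATCH_BYTES = 4     # assumed average patch size
LOCAL_WINDOW = 128      # assumed byte-level attention window

n_patches = N_BYTES // AVG_PATCH_BYTES

# Option A: full causal attention directly over bytes.
full_byte_pairs = attention_pairs(N_BYTES)

# Option B: short local window over bytes to build patch representations,
# then full causal attention over the (much shorter) patch sequence.
local_byte_pairs = attention_pairs(N_BYTES, LOCAL_WINDOW)
patch_pairs = attention_pairs(n_patches)
two_level_pairs = local_byte_pairs + patch_pairs

# The patch-level context still spans the whole input, in bytes.
effective_byte_span = n_patches * AVG_PATCH_BYTES

if __name__ == "__main__":
    print("patches:", n_patches)
    print("full byte-level pairs:", full_byte_pairs)
    print("local byte + patch pairs:", two_level_pairs)
    print("bytes covered by patch context:", effective_byte_span)
```

Under these made-up sizes, the patch-level attention still covers all 4096 bytes of input, while the byte-level attention only ever looks 128 bytes back, which is the sense in which the byte window can be short without shrinking the overall context.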

cs702 47 days ago

Thanks for the quick reply!

Interesting. I would have thought one of those "minimum viable" RNNs (like https://arxiv.org/abs/2410.01201) would have been ideal for this. I might tinker a bit with this :-)
