There is at least some prior work on character-based modeling, but it hasn't scaled well. The challenge, I think, with something more ad hoc for exceptional tokens is that it's hard to see gains, since such tokens are by definition infrequent. And if the text is rare enough, BPE falls back to many single-byte tokens, so current models actually expend more compute on these rare sequences.
BLT scales well because it expends less compute (via patching) on more predictable (low-entropy) byte sequences. Current models get this benefit only to a degree, when a predictable sequence happens to be covered by a larger BPE token, but that only goes so far.
So it's really two related but different motivations.
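To make the patching idea concrete, here's a minimal sketch of entropy-based patch boundaries. Big caveat: BLT's actual patcher is a small trained byte-level LM; the bigram statistics and threshold below are toy stand-ins, just to show the mechanism (a new patch starts wherever the next byte is hard to predict):

```python
import math
from collections import Counter, defaultdict

def entropy_patches(data: bytes, threshold: float = 1.0) -> list[bytes]:
    # Toy stand-in for BLT's patcher: the paper uses a small trained
    # byte-level LM; here a bigram model of the input itself plays that
    # role, and the threshold is purely illustrative.
    ctx_counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        ctx_counts[prev][nxt] += 1
    # Shannon entropy (bits) of each context's next-byte distribution.
    ctx_entropy = {}
    for ctx, counts in ctx_counts.items():
        total = sum(counts.values())
        ctx_entropy[ctx] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    # Start a new patch wherever the next byte is hard to predict, so
    # predictable runs collapse into long, cheap patches.
    patches, start = [], 0
    for i in range(1, len(data)):
        if ctx_entropy.get(data[i - 1], 8.0) > threshold:  # 8 bits = max for a byte
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

print(entropy_patches(b"aaaaaaaaaabbbbbbbbbbthe quick brown fox"))
# -> [b'aaaaaaaaaabbbbbbbbbbthe ', b'quick ', b'brown ', b'fox']
```

Note how the low-entropy runs collapse into one long patch while each unpredictable word start opens a new one; that's where the compute savings come from.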